Update node-not-ready-basic-troubleshooting.md #1614

Open
wants to merge 1 commit into
base: main
@@ -1,7 +1,7 @@
---
title: Take basic troubleshooting steps to avoid Node Not Ready issues
description: Learn about basic troubleshooting steps to avoid Node Not Ready issues in Azure Kubernetes Service (AKS) cluster nodes.
ms.date: 04/15/2022
ms.date: 09/20/2024
ms.reviewer: rissing, chiragpa, momajed, v-leedennis
ms.service: azure-kubernetes-service
#Customer intent: As an Azure Kubernetes user, I want to take basic troubleshooting steps so that I can avoid Node Not Ready issues in Azure Kubernetes Service (AKS) cluster nodes.
@@ -21,29 +21,28 @@ Read the [official guide for troubleshooting Kubernetes clusters](https://kubern

## Basic troubleshooting

AKS continuously monitors the health state of worker nodes, and automatically repairs the nodes if they become unhealthy. The Azure Virtual Machine (VM) platform [maintains VMs](/azure/virtual-machines/maintenance-and-updates) that experience issues. AKS and Azure VMs work together to reduce service disruptions for clusters.
AKS continuously monitors the health state of worker nodes, and [automatically repairs](/azure/aks/node-auto-repair) the nodes if they become unhealthy. The Azure Virtual Machine (VM) platform [maintains VMs](/azure/virtual-machines/maintenance-and-updates) that experience issues. AKS and Azure VMs work together to reduce service disruptions for clusters.

For nodes, there are two forms of heartbeats:

- Updates to the *.status* file of a `Node` object.
- Updates to the *.status* of a `Node` object.

- [Lease](https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/lease-v1/) objects within the [kube-node-lease](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) namespace. Each `Node` has an associated `Lease` object.

Compared to updates to the *.status* file of a `Node`, a `Lease` is a lightweight resource. Using `Lease` objects for heartbeats reduces the performance impact of these updates for large clusters.
Compared to updates to the *.status* of a `Node`, a `Lease` is a lightweight resource. Using `Lease` objects for heartbeats reduces the performance impact of these updates for large clusters.

The [kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/) is responsible for creating and updating the *.status* file for `Node` objects. It's also responsible for updating the `Lease` objects that are related to the `Node` objects.
The [kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/) is responsible for creating and updating the *.status* for `Node` objects. It's also responsible for updating the `Lease` objects that are related to the `Node` objects.

The kubelet updates the `Node` *.status* file if one of the following conditions is true:
- The kubelet updates the node's `.status` either when there is a change in status or if there has been no update for a configured interval. The default interval for `.status` updates to a `Node` is 5 minutes, which is much longer than the 40-second default timeout for unreachable nodes.
- The kubelet creates and then updates its `Lease` object every 10 seconds (the default update interval). `Lease` updates occur independently from updates to the node's `.status`. If the `Lease` update fails, the kubelet retries, using an exponential backoff that starts at 200 milliseconds and is capped at 7 seconds.

- A change in status occurs.

- No update occurs after a configured interval of time.
You can't schedule a `Pod` on a `Node` that has a status of `NotReady` or `Unknown`. You can schedule a `Pod` only on nodes that are in the `Ready` state.

The default interval for status updates to a `Node` is five minutes. This interval is much longer than the 40-second default time-out for unreachable nodes. The kubelet creates and then updates its `Lease` object one time every ten seconds (the default update interval). Updates to `Lease` occur independently from updates to the `Node` status. If the `Lease` update fails, the kubelet retries, using an exponential backoff that starts at 200 milliseconds and is capped at a maximum of seven seconds.
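The retry timing described above can be illustrated with a short sketch. It isn't the kubelet's implementation: the function name `lease_retry_delays_ms` is hypothetical, and the doubling factor is an assumption, because the text states only the 200-millisecond starting delay and the seven-second cap.

```python
# Values taken from the text above.
LEASE_UPDATE_INTERVAL_S = 10      # default Lease update interval
UNREACHABLE_TIMEOUT_S = 40        # default timeout for unreachable nodes
STATUS_UPDATE_INTERVAL_S = 5 * 60 # default Node .status update interval


def lease_retry_delays_ms(attempts, initial_ms=200, cap_ms=7000, factor=2):
    """Return the wait before each retry, in milliseconds.

    initial_ms and cap_ms come from the text; factor=2 is an assumption.
    """
    delays = []
    delay = initial_ms
    for _ in range(attempts):
        delays.append(min(delay, cap_ms))
        # Grow the delay, but never beyond the cap.
        delay = min(delay * factor, cap_ms)
    return delays


if __name__ == "__main__":
    print(lease_retry_delays_ms(8))
    # How many Lease renewals fit into one unreachable-node timeout window.
    print(UNREACHABLE_TIMEOUT_S // LEASE_UPDATE_INTERVAL_S)
```

Under these assumptions, the delay reaches the cap after six retries. The comparison of the constants also shows why `Lease` renewals, rather than the five-minute `.status` updates, are the heartbeat that can arrive within the 40-second timeout.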
If your node is in the `MemoryPressure`, `DiskPressure`, or `PIDPressure` state, you must manage your resources in order to schedule extra pods on the node. If your node is in `NetworkUnavailable` mode, you must configure the network on the node correctly.
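As a rough illustration of how these conditions map to actions, the following sketch reads a list shaped like the `.status.conditions` array of a `Node` object, where each entry has a `type` and a `status` of `"True"`, `"False"`, or `"Unknown"`. The function name `summarize_node` and the wording of the suggested actions are hypothetical. The sketch considers only node conditions and ignores other factors that affect scheduling, such as taints or cordoning.

```python
PRESSURE_TYPES = ("MemoryPressure", "DiskPressure", "PIDPressure")


def summarize_node(conditions):
    """Summarize node conditions into a state and a list of follow-up actions."""
    by_type = {c["type"]: c["status"] for c in conditions}

    # A missing or unrecognized Ready condition is treated as Unknown.
    ready = by_type.get("Ready", "Unknown")
    if ready == "True":
        state = "Ready"
    elif ready == "False":
        state = "NotReady"
    else:
        state = "Unknown"

    actions = []
    for pressure in PRESSURE_TYPES:
        if by_type.get(pressure) == "True":
            actions.append(f"{pressure}: manage resources on the node")
    if by_type.get("NetworkUnavailable") == "True":
        actions.append("NetworkUnavailable: configure the node network correctly")

    return {"state": state, "schedulable": state == "Ready", "actions": actions}


if __name__ == "__main__":
    sample = [
        {"type": "Ready", "status": "True"},
        {"type": "MemoryPressure", "status": "True"},
        {"type": "DiskPressure", "status": "False"},
    ]
    print(summarize_node(sample))
```

The conditions of a real node could be obtained, for example, from the JSON output of `kubectl get node <node-name> -o json`, where `<node-name>` is a placeholder.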

You can't schedule a `Pod` on a `Node` that has a status of `NotReady` or `Unknown`. You can schedule a `Pod` only on nodes that are in the `Ready` state.
AKS manages the lifecycle and operations of agent nodes on your behalf. Modifying the IaaS resources that are associated with the agent nodes isn't supported. Examples of unsupported operations include customizing a node through direct SSH connections, updating packages, and modifying the network configuration on the node. For more information, see [AKS support coverage for agent nodes](/azure/aks/support-policies#user-customization-of-agent-nodes).

If your node is in the `MemoryPressure`, `DiskPressure`, or `PIDPressure` state, you must manage your resources in order to schedule extra pods on the node. If your node is in `NetworkUnavailable` mode, you must configure the network on the node correctly. Make sure that the following conditions are met:
Make sure that the following conditions are met:

- Your cluster is in **Succeeded (Running)** state. To check the cluster status on the [Azure portal](https://portal.azure.com), search for and select **Kubernetes services**, and select the name of your AKS cluster. Then, on the cluster's **Overview** page, look in **Essentials** to find the **Status**. Or, enter the [az aks show](/cli/azure/aks#az-aks-show) command in Azure CLI.

@@ -61,4 +60,8 @@ If your node is in the `MemoryPressure`, `DiskPressure`, or `PIDPressure` state,

- Your cluster is running an [AKS-supported version of Kubernetes](/azure/aks/supported-kubernetes-versions).
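The cluster status check in the first condition above can also be scripted. The following is a minimal sketch that assumes the JSON output of `az aks show` exposes the two parts of the **Succeeded (Running)** status as `provisioningState` and `powerState.code`. Verify these field names against the output of your own cluster. The function name `cluster_is_succeeded_running` is hypothetical.

```python
import json


def cluster_is_succeeded_running(show_output):
    """Return True if the az aks show JSON reports Succeeded (Running).

    The field names provisioningState and powerState.code are assumptions
    about the shape of the command output.
    """
    cluster = json.loads(show_output)
    provisioning = cluster.get("provisioningState")
    power = (cluster.get("powerState") or {}).get("code")
    return provisioning == "Succeeded" and power == "Running"


if __name__ == "__main__":
    sample = '{"provisioningState": "Succeeded", "powerState": {"code": "Running"}}'
    print(cluster_is_succeeded_running(sample))
```

The input could come from a command such as `az aks show --resource-group <resource-group> --name <cluster-name> --output json`, where the resource group and cluster name are placeholders.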

## More information

- For troubleshooting steps for Node Not Ready, see [Troubleshoot a change in a healthy node to Not Ready status](node-not-ready-after-being-healthy.md).

[!INCLUDE [Third-party disclaimer](../../../includes/third-party-contact-disclaimer.md)]