A Kubernetes worker node named k8s-worker2 is in state NotReady. Investigate why this is the case, and perform any appropriate steps to bring the node to a Ready state, ensuring that any changes are made permanent.
$ ssh k8s-worker2
You can assume elevated privileges on the node with the following command:
$ sudo -i
root@k8s-master:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master Ready control-plane 128m v1.24.1
k8s-worker1 Ready 126m v1.24.1
k8s-worker2 Ready 126m v1.24.1
root@k8s-master:~# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-deployment-6595874d85-555wq 1/1 Running 0 14m 192.168.194.90 k8s-worker1
nginx-deployment-6595874d85-76k9v 1/1 Running 0 25m 192.168.194.85 k8s-worker1
nginx-deployment-6595874d85-bw8rx 1/1 Running 0 14m 192.168.194.92 k8s-worker1
nginx-deployment-6595874d85-j8jtw 1/1 Running 0 25m 192.168.126.22 k8s-worker2
nginx-deployment-6595874d85-jqvhh 1/1 Running 0 14m 192.168.126.29 k8s-worker2
nginx-deployment-6595874d85-m4ql9 1/1 Running 0 14m 192.168.126.28 k8s-worker2
nginx-deployment-6595874d85-nm984 1/1 Running 0 14m 192.168.194.91 k8s-worker1
nginx-deployment-6595874d85-vm66r 1/1 Running 0 14m 192.168.126.27 k8s-worker2
nginx-deployment-6595874d85-vszn8 1/1 Running 0 14m 192.168.194.93 k8s-worker1
nginx-deployment-6595874d85-wc8wr 1/1 Running 0 25m 192.168.126.23 k8s-worker2
Switch to k8s-worker2 via ssh and check the kubelet service:
root@k8s-worker2:~# systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: active (running) since Mon 2023-03-20 03:46:55 CST; 2h 8min ago
       Docs: https://kubernetes.io/docs/home/
   Main PID: 8303 (kubelet)
      Tasks: 14 (limit: 4219)
     Memory: 41.1M
        CPU: 1min 15.726s
     CGroup: /system.slice/kubelet.service
             └─8303 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-r>

Mar 20 05:32:20 k8s-worker2 kubelet[8303]: I0320 05:32:20.655052 8303 scope.go:110] "RemoveContainer" containerID="1571afb042a8db01ad1e25de3ff3cac5a6851b07e4e5edf4a31c96abc31a4420"
Mar 20 05:32:20 k8s-worker2 kubelet[8303]: E0320 05:32:20.660918 8303 remote_runtime.go:604] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find con>
Mar 20 05:32:20 k8s-worker2 kubelet[8303]: I0320 05:32:20.660955 8303 pod_container_deletor.go:52] "DeleteContainer returned error" containerID={Type:containerd ID:1571afb042a8db01ad1e25de3ff3cac5a6851b07e4e5edf4a31c>
Mar 20 05:32:21 k8s-worker2 kubelet[8303]: I0320 05:32:21.849902 8303 kubelet_volumes.go:160] "Cleaned up orphaned pod volumes dir" podUID=807337b4-1c21-4355-8fa6-d2310b0c9a32 path="/var/lib/kubelet/pods/807337b4-1c2>
Mar 20 05:39:00 k8s-worker2 kubelet[8303]: I0320 05:39:00.454363 8303 topology_manager.go:200] "Topology Admit Handler"
Mar 20 05:39:00 k8s-worker2 kubelet[8303]: I0320 05:39:00.467982 8303 reconciler.go:270] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-g4d59\" (UniqueName: \"kubernetes.io/pro>
Mar 20 05:39:00 k8s-worker2 kubelet[8303]: I0320 05:39:00.475057 8303 topology_manager.go:200] "Topology Admit Handler"
Mar 20 05:39:00 k8s-worker2 kubelet[8303]: I0320 05:39:00.491783 8303 topology_manager.go:200] "Topology Admit Handler"
Mar 20 05:39:00 k8s-worker2 kubelet[8303]: I0320 05:39:00.568687 8303 reconciler.go:270] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-2x9bq\" (UniqueName: \"kubernetes.io/pro>
Mar 20 05:39:00 k8s-worker2 kubelet[8303]: I0320 05:39:00.568762 8303 reconciler.go:270] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-n94sr\" (UniqueName: \"kubernetes.io/pro>
The node is healthy at this point, so stop the kubelet to reproduce the fault:
root@k8s-worker2:~# systemctl stop kubelet.service
root@k8s-worker2:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master Ready control-plane 134m v1.24.1
k8s-worker1 Ready 132m v1.24.1
k8s-worker2 NotReady 132m v1.24.1
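With more than a handful of nodes it helps to filter the listing rather than read it by eye. The sketch below is one way to do that; the function name not_ready_nodes is made up for illustration, and it assumes the STATUS value is the second column, as in the output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of kubectl): print the names of nodes whose
# STATUS column does not start with "Ready".
# Input: the output of `kubectl get nodes --no-headers` on stdin.
not_ready_nodes() {
  # $1 is NAME, $2 is STATUS. "Ready,SchedulingDisabled" still counts as Ready.
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage on a live cluster:
#   kubectl get nodes --no-headers | not_ready_nodes
```

Matching on the start of the STATUS value means a cordoned but healthy node is not reported as broken.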
root@k8s-worker2:~# kubectl describe nodes k8s-worker2
Name:               k8s-worker2
Roles:
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=k8s-worker2
                    kubernetes.io/os=linux
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 172.18.108.68/20
                    projectcalico.org/IPv4IPIPTunnelAddr: 192.168.126.0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 20 Mar 2023 03:47:08 +0800
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  k8s-worker2
  AcquireTime:
  RenewTime:       Mon, 20 Mar 2023 05:58:23 +0800
Conditions:
  Type                 Status   LastHeartbeatTime                 LastTransitionTime                Reason             Message
  ----                 ------   -----------------                 ------------------                ------             -------
  NetworkUnavailable   False    Mon, 20 Mar 2023 03:48:09 +0800   Mon, 20 Mar 2023 03:48:09 +0800   CalicoIsUp         Calico is running on this node
  MemoryPressure       Unknown  Mon, 20 Mar 2023 05:58:26 +0800   Mon, 20 Mar 2023 05:59:13 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  DiskPressure         Unknown  Mon, 20 Mar 2023 05:58:26 +0800   Mon, 20 Mar 2023 05:59:13 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  PIDPressure          Unknown  Mon, 20 Mar 2023 05:58:26 +0800   Mon, 20 Mar 2023 05:59:13 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  Ready                Unknown  Mon, 20 Mar 2023 05:58:26 +0800   Mon, 20 Mar 2023 05:59:13 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
Addresses:
  InternalIP:  172.18.108.68
  Hostname:    k8s-worker2
Capacity:
  cpu:                2
  ephemeral-storage:  20261208Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3674908Ki
  pods:               110
Allocatable:
  cpu:                2
  ephemeral-storage:  18672729262
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3572508Ki
  pods:               110
System Info:
  Machine ID:                 96616f9e696f480c970862c129963a34
  System UUID:                23ededd5-582b-491a-852f-f2caef69680c
  Boot ID:                    869910d0-a51f-4e7d-8a6b-17fec5740011
  Kernel Version:             5.15.0-58-generic
  OS Image:                   Ubuntu 22.04.1 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.18
  Kubelet Version:            v1.24.1
  Kube-Proxy Version:         v1.24.1
Non-terminated Pods:          (8 in total)
  Namespace    Name                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------    ----                               ------------  ----------  ---------------  -------------  ---
  default      nginx-deployment-6595874d85-j8jtw  0 (0%)        0 (0%)      0 (0%)           0 (0%)         33m
  default      nginx-deployment-6595874d85-jqvhh  0 (0%)        0 (0%)      0 (0%)           0 (0%)         22m
  default      nginx-deployment-6595874d85-m4ql9  0 (0%)        0 (0%)      0 (0%)           0 (0%)         22m
  default      nginx-deployment-6595874d85-vm66r  0 (0%)        0 (0%)      0 (0%)           0 (0%)         22m
  default      nginx-deployment-6595874d85-wc8wr  0 (0%)        0 (0%)      0 (0%)           0 (0%)         33m
  kube-system  calico-node-cwbqh                  250m (12%)    0 (0%)      0 (0%)           0 (0%)         134m
  kube-system  coredns-74586cf9b6-htqlm           100m (5%)     0 (0%)      70Mi (2%)        170Mi (4%)     121m
  kube-system  kube-proxy-dgfgl                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         134m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                350m (17%)  0 (0%)
  memory             70Mi (2%)   170Mi (4%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:
  Type    Reason        Age    From             Message
  ----    ------        ----   ----             -------
  Normal  NodeNotReady  2m18s  node-controller  Node k8s-worker2 status is now: NodeNotReady
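Output: in the Conditions section, the Message column states plainly that the kubelet stopped posting node status. The Ready condition is Unknown, which is why the node is reported as NotReady.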
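If you only want the Ready condition rather than the full describe output, a JSONPath query can pull out a single field. This is a minimal sketch: node_ready_field is a hypothetical wrapper name, and it assumes kubectl is configured on the machine where it runs.

```shell
#!/bin/sh
# Hypothetical wrapper: print one field (status, reason, message, ...) of the
# node's "Ready" condition using kubectl's JSONPath output.
node_ready_field() {
  node=$1
  field=$2
  kubectl get node "$node" \
    -o jsonpath="{.status.conditions[?(@.type==\"Ready\")].$field}"
}

# Usage on a live cluster:
#   node_ready_field k8s-worker2 status
#   node_ready_field k8s-worker2 message
```

For the node shown above, the status field would correspond to the Unknown value in the Conditions table and the message field to the "Kubelet stopped posting node status." text.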
Check the status of kubelet.service on k8s-worker2:
root@k8s-worker2:~# systemctl status kubelet.service
○ kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: inactive (dead) since Mon 2023-03-20 05:58:27 CST; 8min ago
       Docs: https://kubernetes.io/docs/home/
    Process: 8303 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/contain>
   Main PID: 8303 (code=exited, status=0/SUCCESS)
        CPU: 1min 17.733s

Mar 20 05:39:00 k8s-worker2 kubelet[8303]: I0320 05:39:00.467982 8303 reconciler.go:270] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-g4d59\" (UniqueName: \"kubernetes.io/pro>
Mar 20 05:39:00 k8s-worker2 kubelet[8303]: I0320 05:39:00.475057 8303 topology_manager.go:200] "Topology Admit Handler"
Mar 20 05:39:00 k8s-worker2 kubelet[8303]: I0320 05:39:00.491783 8303 topology_manager.go:200] "Topology Admit Handler"
Mar 20 05:39:00 k8s-worker2 kubelet[8303]: I0320 05:39:00.568687 8303 reconciler.go:270] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-2x9bq\" (UniqueName: \"kubernetes.io/pro>
Mar 20 05:39:00 k8s-worker2 kubelet[8303]: I0320 05:39:00.568762 8303 reconciler.go:270] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-n94sr\" (UniqueName: \"kubernetes.io/pro>
Mar 20 05:58:27 k8s-worker2 systemd[1]: Stopping kubelet: The Kubernetes Node Agent...
Mar 20 05:58:27 k8s-worker2 kubelet[8303]: I0320 05:58:27.504326 8303 dynamic_cafile_content.go:171] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt"
Mar 20 05:58:27 k8s-worker2 systemd[1]: kubelet.service: Deactivated successfully.
Mar 20 05:58:27 k8s-worker2 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Mar 20 05:58:27 k8s-worker2 systemd[1]: kubelet.service: Consumed 1min 17.733s CPU time.
Output: the line Active: inactive (dead) makes it clear that the service is down and needs to be started again.
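The status output mixes two separate questions: is the unit running now, and will it start at boot. A small sketch that reports both on one line follows; unit_summary is a made-up name, while is-active and is-enabled are standard systemctl verbs.

```shell
#!/bin/sh
# Hypothetical helper: report the runtime state and the boot-time state of a unit.
unit_summary() {
  unit=$1
  # Both verbs print the state and exit non-zero when the answer is "no",
  # so the exit status is ignored here and only the printed word is kept.
  active=$(systemctl is-active "$unit" 2>/dev/null) || true
  enabled=$(systemctl is-enabled "$unit" 2>/dev/null) || true
  printf '%s: active=%s enabled=%s\n' "$unit" "${active:-unknown}" "${enabled:-unknown}"
}

# Usage on the node:
#   unit_summary kubelet.service
```

In the situation above the unit is enabled but inactive, so the missing piece is a start, not a change to the boot configuration.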
You can also look at the pods:
root@k8s-worker2:~# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-deployment-6595874d85-555wq 1/1 Running 0 30m 192.168.194.90 k8s-worker1
nginx-deployment-6595874d85-6dvdc 1/1 Running 0 5m38s 192.168.194.94 k8s-worker1
nginx-deployment-6595874d85-76k9v 1/1 Running 0 41m 192.168.194.85 k8s-worker1
nginx-deployment-6595874d85-bw8rx 1/1 Running 0 30m 192.168.194.92 k8s-worker1
nginx-deployment-6595874d85-j8jtw 1/1 Terminating 0 41m 192.168.126.22 k8s-worker2
nginx-deployment-6595874d85-jqvhh 1/1 Terminating 0 30m 192.168.126.29 k8s-worker2
nginx-deployment-6595874d85-m4ql9 1/1 Terminating 0 30m 192.168.126.28 k8s-worker2
nginx-deployment-6595874d85-nm984 1/1 Running 0 30m 192.168.194.91 k8s-worker1
nginx-deployment-6595874d85-q5bdz 1/1 Running 0 5m38s 192.168.194.98 k8s-worker1
nginx-deployment-6595874d85-rvlqh 1/1 Running 0 5m38s 192.168.194.96 k8s-worker1
nginx-deployment-6595874d85-v4txt 1/1 Running 0 5m38s 192.168.194.97 k8s-worker1
nginx-deployment-6595874d85-v6j2x 1/1 Running 0 5m38s 192.168.194.95 k8s-worker1
nginx-deployment-6595874d85-vm66r 1/1 Terminating 0 30m 192.168.126.27 k8s-worker2
nginx-deployment-6595874d85-vszn8 1/1 Running 0 30m 192.168.194.93 k8s-worker1
nginx-deployment-6595874d85-wc8wr 1/1 Terminating 0 41m 192.168.126.23 k8s-worker2
root@k8s-worker2:~# kubectl get pods -o wide | grep worker2
nginx-deployment-6595874d85-j8jtw 1/1 Terminating 0 42m 192.168.126.22 k8s-worker2
nginx-deployment-6595874d85-jqvhh 1/1 Terminating 0 31m 192.168.126.29 k8s-worker2
nginx-deployment-6595874d85-m4ql9 1/1 Terminating 0 31m 192.168.126.28 k8s-worker2
nginx-deployment-6595874d85-vm66r 1/1 Terminating 0 31m 192.168.126.27 k8s-worker2
nginx-deployment-6595874d85-wc8wr 1/1 Terminating 0 42m 192.168.126.23 k8s-worker2
Output: all pods that were running on k8s-worker2 are now in the Terminating state, and replacement pods have been scheduled on k8s-worker1.
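As an alternative to piping through grep, the API server can do the filtering with a field selector on spec.nodeName. A minimal sketch, where pods_on_node is a hypothetical wrapper name and the current namespace is assumed:

```shell
#!/bin/sh
# Hypothetical wrapper: list the pods scheduled on one node by filtering
# server-side on spec.nodeName instead of grepping the wide output.
pods_on_node() {
  node=$1
  kubectl get pods -o wide --field-selector "spec.nodeName=$node"
}

# Usage on a live cluster:
#   pods_on_node k8s-worker2
```

Unlike grep, this cannot accidentally match a pod whose name happens to contain the node name.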
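Now that we know the kubelet service has stopped, starting it again is all that is needed. Using enable together with --now also ensures that it starts at boot, which makes the fix permanent: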
systemctl enable --now kubelet.service
The help output explains what enable and --now do:
root@k8s-worker2:~# systemctl enable --help
systemctl [OPTIONS...] COMMAND ...

Query or send control commands to the system manager.
...
  status [PATTERN...|PID...]     Show runtime status of one or more units
...
  start UNIT...                  Start (activate) one or more units
  stop UNIT...                   Stop (deactivate) one or more units
  reload UNIT...                 Reload one or more units
  restart UNIT...                Start or restart one or more units
...
  enable [UNIT...|PATH...]       Enable one or more unit files
  disable UNIT...                Disable one or more unit files
...
Options:
  -h --help              Show this help
     --version           Show package version
     --system            Connect to system manager
     --user              Connect to user service manager
...
  -s --signal=SIGNAL     Which signal to send
     --what=RESOURCES    Which types of resources to remove
     --now               Start or stop unit after enabling or disabling it
     --dry-run           Only print what would be done
                         Currently supported by verbs: halt, poweroff, reboot,
...
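As the help text says, --now starts the unit right after enabling it, so enable --now behaves like an enable followed by a start. The sketch below spells out the two steps; enable_and_start is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: the two-step equivalent of `systemctl enable --now UNIT`.
enable_and_start() {
  unit=$1
  # Step 1: make the unit start automatically at boot (the permanent part).
  systemctl enable "$unit" || return 1
  # Step 2: start it in the current boot (the immediate part).
  systemctl start "$unit" || return 1
}

# Usage on the node, as root:
#   enable_and_start kubelet.service
```

In this walkthrough the unit was already enabled (the Loaded line shows "enabled"), so the enable step changes nothing and only the start has an effect. Running both is still a reasonable habit when the task asks for a permanent fix.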
root@k8s-worker2:~# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-master Ready control-plane 157m v1.24.1 172.18.108.66 Ubuntu 22.04.1 LTS 5.15.0-58-generic containerd://1.6.18
k8s-worker1 Ready 155m v1.24.1 172.18.108.67 Ubuntu 22.04.1 LTS 5.15.0-58-generic containerd://1.6.18
k8s-worker2 Ready 155m v1.24.1 172.18.108.68 Ubuntu 22.04.1 LTS 5.15.0-58-generic containerd://1.6.18
root@k8s-worker2:~# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-deployment-6595874d85-555wq 1/1 Running 0 43m 192.168.194.90 k8s-worker1
nginx-deployment-6595874d85-6dvdc 1/1 Running 0 18m 192.168.194.94 k8s-worker1
nginx-deployment-6595874d85-76k9v 1/1 Running 0 54m 192.168.194.85 k8s-worker1
nginx-deployment-6595874d85-bw8rx 1/1 Running 0 43m 192.168.194.92 k8s-worker1
nginx-deployment-6595874d85-nm984 1/1 Running 0 43m 192.168.194.91 k8s-worker1
nginx-deployment-6595874d85-q5bdz 1/1 Running 0 18m 192.168.194.98 k8s-worker1
nginx-deployment-6595874d85-rvlqh 1/1 Running 0 18m 192.168.194.96 k8s-worker1
nginx-deployment-6595874d85-v4txt 1/1 Running 0 18m 192.168.194.97 k8s-worker1
nginx-deployment-6595874d85-v6j2x 1/1 Running 0 18m 192.168.194.95 k8s-worker1
nginx-deployment-6595874d85-vszn8 1/1 Running 0 43m 192.168.194.93 k8s-worker1
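The node is Ready again. You can reboot it to confirm that it returns to Ready on its own after a restart.
Checking kubelet.service once more shows that the service is in the active state.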
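After a reboot it takes a little while for the kubelet to report in, so polling by hand is tedious. One option is kubectl wait, which blocks until the node's Ready condition becomes true or a timeout expires. A minimal sketch, where wait_node_ready is a hypothetical wrapper and the 120s default is an arbitrary choice:

```shell
#!/bin/sh
# Hypothetical wrapper: block until the node reports condition Ready=True,
# or fail once the timeout (default 120s, chosen arbitrarily) has passed.
wait_node_ready() {
  node=$1
  timeout=${2:-120s}
  kubectl wait --for=condition=Ready "node/$node" --timeout="$timeout"
}

# Usage on a live cluster, after rebooting the node:
#   wait_node_ready k8s-worker2
#   wait_node_ready k8s-worker2 5m
```

A non-zero exit status means the node did not become Ready in time, which would suggest the fix did not survive the reboot.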
Whenever a node has a problem, start by looking at its detailed information:
kubectl describe nodes k8s-worker2