How to fix a MicroK8s restart loop

If you have enabled the cis-hardening addon in your MicroK8s cluster, you might experience instability, especially after a node restart. In this article, I will explain the cause of this issue and how to fix MicroK8s in this scenario.

Symptoms

Sometimes it is not obvious that there is an issue with the MicroK8s configuration. The symptoms are subtle, and it can take some time to figure out the root cause. Typical symptoms are:

  • Pods failing to connect to other services
  • Pods not resolving service names
  • kubectl port-forward disconnecting after a few seconds
  • kubectl exec -ti disconnecting after a few seconds
  • kubectl get nodes showing some nodes constantly switching between Ready and NotReady
  • kubectl unable to reach the kube-apiserver

If you are experiencing these issues, there is a high chance that your MicroK8s node is stuck in a restart loop. You can verify this by running sudo microk8s status a few times: in a restart loop, the output alternates between running and not running.

marcol@k8s-master:~$ sudo microk8s status
microk8s is running
[...]

marcol@k8s-master:~$ sudo microk8s status
microk8s is not running. Use microk8s inspect for a deeper inspection.
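
Running the command by hand works, but the flapping is easier to spot with a small loop. The following is a minimal sketch, not something shipped with MicroK8s: the count_not_running function and the POLL_INTERVAL variable are names I made up for illustration. It runs any status command a number of times and counts how often the output contains "is not running".

```shell
# Hypothetical helper: run a status command N times and print how many
# of those runs reported "is not running".
# POLL_INTERVAL (seconds between runs) is an assumed knob, defaulting to 5.
count_not_running() {
  attempts="$1"
  shift
  not_running=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Capture stdout and stderr; a failing status command must not abort the loop.
    output="$("$@" 2>&1)" || true
    case "$output" in
      *"is not running"*) not_running=$((not_running + 1)) ;;
    esac
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-5}"
  done
  echo "$not_running"
}

# Example: count_not_running 10 sudo microk8s status
```

If the printed count is neither 0 nor the number of attempts, the node was up for some checks and down for others, which is what a restart loop looks like from the outside.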

The --protect-kernel-defaults flag

If you have enabled the cis-hardening addon in your cluster, there is a high chance that the instability is caused by the --protect-kernel-defaults flag, especially if the issues started after a node restart.

There are many ways to confirm this. The easiest is to run sudo microk8s inspect, which packages an inspection report under /var/snap/microk8s/current/inspection-report.

We can then read the journal of snap.microk8s.daemon-kubelite to figure out if the root cause is our kernel configuration.

root@k8s-master:/var/snap/microk8s/current# grep -e kernel/panic -e vm/overcommit_memory -e kernel/panic_on_oops inspection-report/snap.microk8s.daemon-kubelite/journal.log

Jan 30 04:56:28 k8s-master microk8s.daemon-kubelite[35998]: E0130 04:56:28.616813 35998 kubelet.go:1511] "Failed to start ContainerManager" err="[invalid kernel flag: kernel/panic_on_oops, expected value: 1, actual value: 0, invalid kernel flag: vm/overcommit_memory, expected value: 1, actual value: 0, invalid kernel flag: kernel/panic, expected value: 10, actual value: 0]"

The issue is now clear. The kubelet expects kernel/panic_on_oops=1, vm/overcommit_memory=1, and kernel/panic=10, but on this host all three are set to 0. Since the --protect-kernel-defaults flag set by the cis-hardening addon prevents the kubelite daemon from changing the kernel defaults, the kubelet fails to start and the service keeps restarting.
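
The error line is long and repeats on every restart, so it can help to reduce it to one row per mismatch. Here is a rough sketch that does this with grep and sed. The function name list_kernel_flag_mismatches is my own, and the pattern assumes the message format shown in the log line above.

```shell
# Hypothetical helper: read journal lines on stdin and print one
# "flag expected actual" row per distinct kernel flag mismatch.
list_kernel_flag_mismatches() {
  grep -o 'invalid kernel flag: [^,]*, expected value: [0-9-]*, actual value: [0-9-]*' |
    sed -E 's/invalid kernel flag: ([^,]*), expected value: ([0-9-]*), actual value: ([0-9-]*)/\1 \2 \3/' |
    LC_ALL=C sort -u
}

# Example:
# list_kernel_flag_mismatches < inspection-report/snap.microk8s.daemon-kubelite/journal.log
```

Each row shows the kernel flag, the value the kubelet expects, and the value currently set on the host.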

Solution

You have two options.

You can set the --protect-kernel-defaults flag to false in /var/snap/microk8s/current/args/kubelet.

root@k8s-master:/var/snap/microk8s/current# cat args/kubelet | grep protect-kernel
--protect-kernel-defaults=false
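
If you prefer not to edit the file by hand, the change can be scripted. The snippet below is only a sketch: set_protect_kernel_defaults is a name I invented, and it assumes the args file holds one flag per line, as in the output above. It removes any existing --protect-kernel-defaults line and appends the new value.

```shell
# Hypothetical helper: set --protect-kernel-defaults to the given value
# in a kubelet args file (assumed format: one flag per line).
set_protect_kernel_defaults() {
  args_file="$1"
  value="$2"
  tmp_file="$(mktemp)" || return 1
  # Keep every line except an existing occurrence of the flag.
  grep -v -e '^--protect-kernel-defaults' "$args_file" > "$tmp_file" || true
  # Append the flag with the requested value.
  printf '%s\n' "--protect-kernel-defaults=${value}" >> "$tmp_file"
  # Write back in place so ownership and permissions of the file are kept.
  cat "$tmp_file" > "$args_file"
  rm -f "$tmp_file"
}

# Example (as root):
# set_protect_kernel_defaults /var/snap/microk8s/current/args/kubelet false
```

The kubelet only reads its arguments at startup, so restart MicroK8s afterwards, for example with sudo microk8s stop followed by sudo microk8s start.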

Or you can change the kernel configuration in /etc/sysctl.conf by adding the following:

vm.overcommit_memory = 1
kernel.panic = 10
kernel.panic_on_oops = 1
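
Entries in /etc/sysctl.conf are applied at boot; to load them right away, run sudo sysctl -p. To double check that the running kernel now has the values from the error message, something like the following sketch can be used. The function name check_kubelet_kernel_flags and the SYSCTL_ROOT variable are hypothetical, and the expected values are simply the ones reported in the log above.

```shell
# Hypothetical helper: compare the live kernel settings with the values
# the kubelet reported as expected. Returns non-zero on any mismatch.
# SYSCTL_ROOT is an assumed override (defaults to /proc/sys), handy for testing.
check_kubelet_kernel_flags() {
  root="${SYSCTL_ROOT:-/proc/sys}"
  status=0
  for entry in vm/overcommit_memory=1 kernel/panic=10 kernel/panic_on_oops=1; do
    flag="${entry%%=*}"      # part before "=", e.g. vm/overcommit_memory
    expected="${entry##*=}"  # part after "=", e.g. 1
    actual="$(cat "$root/$flag" 2>/dev/null)" || actual=""
    if [ "$actual" = "$expected" ]; then
      echo "ok   $flag = $actual"
    else
      echo "FAIL $flag = ${actual:-unreadable} (expected $expected)"
      status=1
    fi
  done
  return "$status"
}

# Example: check_kubelet_kernel_flags && echo "kernel settings match"
```

Reading from /proc/sys shows what the kernel is using right now, which is what the kubelet compares against, rather than what is written in the configuration file.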

Both options are valid. I would use the first option if you are unfamiliar with the kernel and unsure what those settings do. Otherwise, you can stay compliant with the cis-hardening recommendations by changing the kernel settings instead. Note that the cis-hardening recommendation does not specify a value for those settings; it only prevents the daemon from adopting a configuration different from the one defined on the host.

Hope this helps!
