Restoring a Kubernetes Cluster

So you are running containerized applications in a Kubernetes (K8s) cluster and suddenly the K8s control plane is acting weird. This might be because the control plane’s database is inaccessible and/or corrupted. Although losing the control plane does not affect the currently running workload, you can neither monitor nor manage that workload until the database is back up and running.

This article describes how to bring a K8s cluster back to a healthy state by restoring the etcd database from a backup. Etcd is by far the most widely used database technology in K8s, and “kubeadm”, one of the most commonly used tools to install and configure K8s clusters, automatically integrates etcd into the control plane nodes, as illustrated by the picture below from the official Kubernetes documentation :

By default, creating a K8s cluster with “kubeadm” installs the control plane components as static pods on each control plane node. This includes the etcd database, which the K8s API server uses to maintain information about all K8s objects in the cluster.

By the way … you can also configure your K8s cluster to use an external etcd database.

Backup of the etcd database

To do any type of restore you obviously need a backup first. The backup is made with the “etcdctl snapshot save” command, so you need access to etcdctl (the etcd client), which is also needed for the restore. You can either use the client that is pre-installed in the etcd pod or install it on a separate system and point it at the etcd server (for example, “sudo apt install etcd-client”, depending on your client OS).

The snapshot can be done at any time (no database “quiescing” of any kind needed) by running the command :

ETCDCTL_API=3 etcdctl --endpoints <etcd server IP address>:2379 --cacert <ca cert> --cert <server cert> --key <server key> snapshot save etcdbackup.db

NOTE : You have to specify ETCDCTL_API=3 to signal to etcdctl that you need API version 3 to be able to use the snapshot command. Alternatively you can perform a one time “export ETCDCTL_API=3” command and use the plain “etcdctl” command.

root@myk8s-control-plane:~/etcd-backup# ETCDCTL_API=3 etcdctl snapshot save etcd-backup.db --endpoints --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key
2024-01-10 14:26:05.034497 I | clientv3: opened snapshot stream; downloading
2024-01-10 14:26:05.208164 I | clientv3: completed snapshot read; closing
Snapshot saved at etcd-backup.db
root@myk8s-control-plane:~/etcd-backup# ls -al
total 2896
drwxr-xr-x 2 root root    4096 Jan 10 14:26 .
drwx------ 1 root root    4096 Jan 10 14:23 ..
-rw-r--r-- 1 root root 2953248 Jan 10 14:26 etcd-backup.db
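If you take snapshots regularly, giving each file a timestamp keeps older backups from being overwritten. The following is only a minimal sketch of that idea: the function name `snapshot_cmd`, the file naming scheme and the example endpoint are assumptions of mine, and the certificate paths are the kubeadm defaults used above. It prints the command rather than running it, so you can review it first.

```shell
#!/usr/bin/env bash
# Print an "etcdctl snapshot save" command that writes to a timestamped file.
# Arguments: endpoint, output directory, optional timestamp (defaults to now).
# snapshot_cmd is a hypothetical helper, not part of etcdctl.
snapshot_cmd() {
  local endpoint="$1" outdir="$2" stamp="${3:-$(date +%Y%m%d-%H%M%S)}"
  local pki=/etc/kubernetes/pki/etcd   # kubeadm default certificate location
  printf 'ETCDCTL_API=3 etcdctl --endpoints %s --cacert %s/ca.crt --cert %s/server.crt --key %s/server.key snapshot save %s/etcd-backup-%s.db\n' \
    "$endpoint" "$pki" "$pki" "$pki" "$outdir" "$stamp"
}

# Example: print the command for a local etcd member (assumed endpoint).
snapshot_cmd 127.0.0.1:2379 /root/etcd-backup
```

Piping the printed line to a shell (or pasting it) runs the actual backup.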

Test/Dev Cluster

Let’s first look at restoring a simple K8s cluster with a single control plane and a single worker node. This would be a test/dev cluster, as a single control plane offers no High Availability.

In order to restore the snapshot we created previously we first need to stop the etcd database as well as its main consumer – the kube-apiserver. In a K8s cluster created with kubeadm we can achieve this by temporarily moving the .yaml files from the static pod folder (by default this is /etc/kubernetes/manifests) to a different location. We can check that these services no longer run by checking the container runtime (most commonly these days containerd) for example using “crictl ps”.

root@myk8s-control-plane:~/etcd-backup# mv /etc/kubernetes/manifests/*.yaml .
root@myk8s-control-plane:~/etcd-backup# ls -al
total 2912
drwxr-xr-x 2 root root    4096 Jan 10 14:39 .
drwx------ 1 root root    4096 Jan 10 14:23 ..
-rw-r--r-- 1 root root 2953248 Jan 10 14:26 etcd-backup.db
-rw------- 1 root root    2407 Jan 10 11:42 etcd.yaml
-rw------- 1 root root    3896 Jan 10 11:42 kube-apiserver.yaml
-rw------- 1 root root    3429 Jan 10 11:42 kube-controller-manager.yaml
-rw------- 1 root root    1463 Jan 10 11:42 kube-scheduler.yaml
root@myk8s-control-plane:~/etcd-backup# crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                     ATTEMPT             POD ID              POD
3ef0d65ea9ab4       9d5429f6d7697       48 minutes ago      Running             kube-proxy               0                   3b60c02b94d78       kube-proxy-rdrdb
2b626c164cc50       697605b359357       About an hour ago   Running             speaker                  0                   1a3183dc5af73       speaker-vl678
4dc4989025874       ce18e076e9d4b       3 hours ago         Running             local-path-provisioner   0                   81fbc4589ee9d       local-path-provisioner-6bc4bddd6b-9b5g4
f7a42f910c10d       ead0a4a53df89       3 hours ago         Running             coredns                  0                   c4d800f65c5c6       coredns-5d78c9869d-m8b57
d3e268a712ab0       ead0a4a53df89       3 hours ago         Running             coredns                  0                   58bdc61343208       coredns-5d78c9869d-vcj8j
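Instead of eyeballing the “crictl ps” listing, you could filter it for the control plane container names. This is a rough sketch that assumes the column layout shown above; `control_plane_still_running` is a name I made up, and it simply prints any control plane container that still shows up.

```shell
#!/usr/bin/env bash
# Read "crictl ps" output on stdin and print the names of control plane
# containers that are still listed. Empty output means they have all stopped.
# The header line is skipped; every field is checked because the CREATED
# column ("3 hours ago", "About an hour ago") has a varying number of words.
control_plane_still_running() {
  awk 'NR > 1 {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^(etcd|kube-apiserver|kube-controller-manager|kube-scheduler)$/)
        print $i
  }'
}

# Usage on a control plane node:
#   crictl ps | control_plane_still_running
```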

Now we can move the existing database to a safe location (just to be sure) and replace it with a restore of the etcd snapshot :

root@myk8s-control-plane:~/etcd-backup# mv /var/lib/etcd .
root@myk8s-control-plane:~/etcd-backup# ETCDCTL_API=3 etcdctl snapshot restore etcd-backup.db --data-dir /var/lib/etcd
2024-01-10 15:22:22.596939 I | mvcc: restore compact to 13014
2024-01-10 15:22:22.603321 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32

And finally we can restart the control plane components by moving the static pod definitions back :

root@myk8s-control-plane:~/etcd-backup# mv *.yaml /etc/kubernetes/manifests/
root@myk8s-control-plane:~/etcd-backup# crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                      ATTEMPT             POD ID              POD
391ac13e3700d       86b6af7dd652c       3 seconds ago       Running             etcd                      0                   b20f741c9bf0b       etcd-myk8s-control-plane
0e2115845491f       9f8f3a9f3e8a9       3 seconds ago       Running             kube-controller-manager   0                   8ac9daf2a8bbf       kube-controller-manager-myk8s-control-plane
33557b92ad114       c604ff157f0cf       3 seconds ago       Running             kube-apiserver            0                   431fd31b57b47       kube-apiserver-myk8s-control-plane
e270a0049ee0c       205a4d549b94d       4 seconds ago       Running             kube-scheduler            0                   4cdbbdf6f9338       kube-scheduler-myk8s-control-plane
3ef0d65ea9ab4       9d5429f6d7697       2 hours ago         Running             kube-proxy                0                   3b60c02b94d78       kube-proxy-rdrdb
<rest of output hidden ...>

Your K8s cluster is now back in the state it was in when the snapshot was taken. You might need to restart the kubelet on each node; doing that anyway is probably a good idea, as it resets the connections to the API server.
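The whole single-node procedure can be summarized as an ordered checklist. The sketch below only prints the steps so they can be reviewed and run by hand; `restore_plan` is a hypothetical helper name, and the manifest and data directories are the kubeadm defaults used in this article.

```shell
#!/usr/bin/env bash
# Print the single-node restore steps in order, as a checklist to run by hand.
# Arguments: snapshot file, working directory used to park manifests and old data.
restore_plan() {
  local snapshot="$1" workdir="$2"
  local manifests=/etc/kubernetes/manifests   # kubeadm default static pod folder
  local datadir=/var/lib/etcd                 # kubeadm default etcd data directory
  printf '%s\n' \
    "mv $manifests/*.yaml $workdir/" \
    "crictl ps    # repeat until etcd and kube-apiserver are gone" \
    "mv $datadir $workdir/" \
    "ETCDCTL_API=3 etcdctl snapshot restore $snapshot --data-dir $datadir" \
    "mv $workdir/*.yaml $manifests/"
}

restore_plan etcd-backup.db /root/etcd-backup
```

Printing instead of executing is deliberate: the wait for the static pods to stop is a manual check, and running the restore while etcd is still up is something you want to avoid.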

Note that this procedure only saves and restores the etcd database. This is fine as long as you are running stateless applications. However when you also have stateful applications storing data on the worker nodes or (preferably) on external systems using components like persistent volumes it makes more sense to use actual backup solutions like “Velero” or “Kasten” to name a few. More on that in a future post.

HA Production Cluster

When you run a production K8s cluster you probably have at least 3 control plane nodes, each running a copy of the etcd database. When running in this mode etcd is distributed (the “d” in etcd) across all the control plane nodes. To restore the entire etcd cluster we need to restore the same etcd snapshot/backup on each node. The restore procedure described before is basically repeated on each control plane node (with all nodes down during the restore process). However, there is one major difference, which has to do with the actual etcdctl restore command. This command is slightly different on each node: it sets the node's own local IP address and hostname and lists the IP addresses of all members of the etcd database cluster.
Assuming we have a cluster with 3 control plane nodes with hostnames cp1, cp2 and cp3, each with its own IP address, the command to restore the database on cp1 for example would be :

root@cp1:~/etcd-backup# ETCDCTL_API=3 etcdctl snapshot restore backup.db --data-dir /var/lib/ --name cp1 --initial-cluster cp1=,cp2=,cp3= --initial-cluster-token new-etcd-cluster --initial-advertise-peer-urls
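Because only the node name and the advertised peer URL change per node, the commands can be generated from a single member list. The sketch below is an illustration under several assumptions: the IP addresses are placeholders from the documentation range, the peer URLs use https on port 2380 (the usual kubeadm setup), the data directory is /var/lib/etcd, and the function names are mine.

```shell
#!/usr/bin/env bash
# Hypothetical member list as "name=peer-ip" pairs; replace with your own nodes.
MEMBERS="cp1=192.0.2.11 cp2=192.0.2.12 cp3=192.0.2.13"

# Build the --initial-cluster value, which is identical on every node.
initial_cluster() {
  local out="" m
  for m in $MEMBERS; do
    out="${out:+$out,}${m%%=*}=https://${m#*=}:2380"
  done
  printf '%s\n' "$out"
}

# Print the restore command for one node. Only --name and
# --initial-advertise-peer-urls differ between nodes.
restore_cmd_for() {
  local node="$1" m ip=""
  for m in $MEMBERS; do
    [ "${m%%=*}" = "$node" ] && ip="${m#*=}"
  done
  [ -n "$ip" ] || { echo "unknown node: $node" >&2; return 1; }
  printf 'ETCDCTL_API=3 etcdctl snapshot restore backup.db --data-dir /var/lib/etcd --name %s --initial-cluster %s --initial-cluster-token new-etcd-cluster --initial-advertise-peer-urls https://%s:2380\n' \
    "$node" "$(initial_cluster)" "$ip"
}

restore_cmd_for cp1
```

Running `restore_cmd_for cp2` and `restore_cmd_for cp3` gives the matching commands for the other two nodes, which makes it harder to mistype one of the member URLs.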

In an HA etcd cluster one of the nodes is the leader, which applies updates to the database first and uses a replicated log to make sure the other member nodes can synchronize their local copy of the database. It is always a good idea when shutting down an etcd cluster (or for example replacing all nodes with new ones) to stop the leader last. To find out which node is the leader you can use the etcdctl endpoint status command :

root@ha-k8s-control-plane:/# ETCDCTL_API=3 etcdctl --endpoints,, --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key  endpoint status -w table
| | eadff6a055b2649b |   3.5.9 |  7.0 MB |      true |         2 |     487938 |
| | 701bb008d1e08bea |   3.5.9 |  7.1 MB |     false |         2 |     487938 |
| | 926c2d51a0207869 |   3.5.9 |  7.0 MB |     false |         2 |     487938 |

… and check out the column “IS LEADER”.
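If you want to pick out the leader in a script, the table output can be filtered on that column. This is a sketch only: `etcd_leader` is a name I made up, and it locates the “IS LEADER” column from the header line rather than by position, because the set of columns differs between etcdctl versions.

```shell
#!/usr/bin/env bash
# Read "etcdctl endpoint status -w table" output on stdin and print the
# endpoint of every row whose IS LEADER column is "true".
etcd_leader() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    # Header row: remember which field holds the IS LEADER column.
    /IS LEADER/ {
      for (i = 1; i <= NF; i++) if (trim($i) == "IS LEADER") col = i
      next
    }
    # Data rows: field 2 is the endpoint (field 1 is empty, before the first "|").
    col && trim($col) == "true" { print trim($2) }
  '
}

# Usage (add the --endpoints and certificate options shown above):
#   ETCDCTL_API=3 etcdctl ... endpoint status -w table | etcd_leader
```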

vSphere 7.0 – VM fails to boot from iso

In the last two weeks I refreshed my vSphere and vSAN lab environment, which included upgrading to vSphere 7.0. I decided to do a fresh install of both vCenter and my ESXi hosts, which went pretty smoothly.

After upgrading I decided to also create a fresh Windows and Ubuntu desktop as part of my Horizon lab (upgraded to 7.12 in the meantime as this was the minimum release supported with vCenter 7.0).

Normally installing a clean Windows or Ubuntu desktop is not a problem (kind of next/next/finish), but this time I just didn’t manage to get the VMs to boot from the .iso. I made sure the virtual DVD device was connected, but once the VM was booting it still tried to boot from the network and the DVD device was showing as disconnected.
Since I had already seen some problems with replicating VMs with the latest virtual hardware (version 17), I initially retried the installation with a new VM with 6.7 compatibility (vHW 14), as this had worked in the previous environment without any problems and with the .iso file located on the same NFS datastore. This also failed, so the problem was not related to the virtual hardware.

Then I checked the vmware.log file of the VM that failed to boot from .iso and I noticed the following :

2020-05-25T08:04:58.619Z| vcpu-0| I125: CDROM: Connecting sata0:0 to ‘/vmfs/volumes/b8b05642-89df68b7/ubuntu-18.04.3-desktop-amd64.iso’. type=2 remote=0
2020-05-25T08:04:58.620Z| vcpu-0| I125: FILE:open error on /vmfs/volumes/b8b05642-89df68b7/ubuntu-18.04.3-desktop-amd64.iso: Read-only file system
2020-05-25T08:04:58.620Z| vcpu-0| I125: AIOGNRC: Failed to open ‘/vmfs/volumes/b8b05642-89df68b7/ubuntu-18.04.3-desktop-amd64.iso’ : Read-only file system (1e0002) (0x21).
2020-05-25T08:04:58.620Z| vcpu-0| I125: CDROM-IMG: image open for ‘/vmfs/volumes/b8b05642-89df68b7/ubuntu-18.04.3-desktop-amd64.iso’ failed: Read-only file system (1e0002).
2020-05-25T08:04:58.620Z| vcpu-0| I125: CDROM-IMG: Failed to connect ‘/vmfs/volumes/b8b05642-89df68b7/ubuntu-18.04.3-desktop-amd64.iso’.
2020-05-25T08:04:58.620Z| vcpu-0| I125: CDROM: Failed to connect CDROM device ‘/vmfs/volumes/b8b05642-89df68b7/ubuntu-18.04.3-desktop-amd64.iso’.
2020-05-25T08:04:58.620Z| vcpu-0| I125: Msg_Post: Warning
2020-05-25T08:04:58.620Z| vcpu-0| I125: [msg.cdromImage.cantOpen] Cannot connect file “/vmfs/volumes/b8b05642-89df68b7/ubuntu-18.04.3-desktop-amd64.iso” as a CD-ROM image: Read-only file system
2020-05-25T08:04:58.620Z| vcpu-0| I125: [msg.device.startdisconnected] Virtual device ‘sata0:0’ will start disconnected.

The file share provided to ESXi as an NFS datastore was presented with Read Only permissions (and always had been), but seeing these messages in the log made me change the permission to Read/Write and voila …. I was able to successfully boot the VMs from the .iso file on the NFS datastore and continue my guest OS installation.

So apparently vSphere 7.0 requires R/W access for booting from an .iso file.
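To spot this quickly the next time, you can filter vmware.log for the CD-ROM image open failure and print just the reason. This is a small sketch based on the message format in the log excerpt above; `cdrom_failures` is a hypothetical helper name, and other releases may word the message differently.

```shell
#!/usr/bin/env bash
# Print the reason a CD-ROM image failed to open, taken from a vmware.log file.
# Matches lines like: "CDROM-IMG: image open for '<path>' failed: <reason>"
cdrom_failures() {
  grep -E 'CDROM-IMG: image open for .* failed' "$1" | sed -E 's/.*failed: *//'
}

# Usage, from the VM's folder on the datastore:
#   cdrom_failures vmware.log
```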

VMware Re-certification requirements have changed (for the better …)

As of February 4th the requirement to re-certify every 2 years when you hold an active VCP certification no longer exists (unless you are a VMware partner or are in some other program that requires you to have a more current certification).

This means you will now have more time to study towards a more recent certification, plus the upgrade path is shorter in many cases by simply taking the latest version of a specific VCP exam.
This shorter path can be taken as long as you are no more than 3 versions behind the most current VCP version for any particular track.

This announcement also means that some VCP certifications that were previously de-activated now have become active again !

For all details (like understanding that this is only for the VCP level certification, that there are different requirements when you upgrade to a different track, etc.) please see the official VMware blog post on this announcement.

Happy re-certifying !

2019 VCP Certifications

As mentioned in an earlier blog post naming of the VCP certifications is changing to reflect the year in which the certification was achieved rather than the version of the product that the certification applies to.

As of this week (January 16th to be precise) the new VCP certification naming is effective. Currently the following certifications are available :

For more details please read the VMware education blog around this topic.

VMware Certification Naming Changes

Last week VMware Education announced a change in naming the various certifications, where the year in which the certification is achieved is reflected in the name of the certification.

Until now the name of the certification reflected the version of the product that it was related to (for example VCP6-DCV referred to the vSphere 6.0 release). This may cause confusion about the currency of a specific certification, since the pace at which product releases are made available is not very regular, which is also reflected in the certifications and their exams. For example my VCP4-DCV certification was 15 months older than my VCP5-DCV certification, but the latter was over 3 years older than my VCP6-DCV certification.

Also both my “DCV” and “DTM” certifications are valid but one is called VCP6 and the other is called VCP7 (as they relate to vSphere 6.0 and Horizon 7.0 respectively).

So changing the name to reflect the year in which the certification was achieved does make sense and will result in certifications like VCP-DTM 2019 and VCAP-DCV Deploy 2020.

It is important to understand that the change is only with regard to the naming of the certification. This means that there are no changes in requirements to achieve a certification or for re-certification (so a certification is still valid for 2 years and can be renewed by taking a newer exam in the same track or taking an exam in a different track). Also the name of the certification exam will still reflect the product version that the exam questions are based on.

More detailed information about this announcement can be found in the FAQ document on the VMware certification website.

vSAN 6.7 Encryption

In vSphere 6.5 VMware introduced the possibility to encrypt Virtual Machine data on a per VM basis. This is achieved by using VAIO filtering and a specific policy is used to indicate whether a VM needs to be encrypted or not.

With vSAN 6.6 another way of encryption was introduced which means that the entire vSAN datastore is encrypted and as a result every VM that is stored on the vSAN datastore gets encrypted (and hence no specific policy is required).

For both encryption methodologies a KMS server (or cluster of KMS servers for production environments) that supports the KMIP protocol needs to be installed and configured in vCenter. Although both vSphere and vSAN encryption can use the same configured KMS server/cluster there is a small but important difference in the way the keys that are required for encrypting the data are communicated to the ESXi hosts.

In the case of vSphere (VM) encryption, ESXi needs to be able to communicate to vCenter to get the specific Key Encryption Key (KEK) for a VM when this VM needs to start (or is created). So when vCenter is not available, such actions possibly cannot be initiated.

For vSAN encryption however, an ESXi host only needs to communicate with vCenter when vSAN encryption is enabled. At that moment the KEK ID’s required to store the Data Encryption Keys (DEK) that are used to encrypt the disks are sent from vCenter to the ESXi hosts. Using these KEK ID’s the host will communicate directly with the KMS server to get the actual KEK.

To show this mechanism I have created a little demo video. For my own educational purpose I have used the vSphere (and vSAN) 6.7 version which allows me to use the new vSphere (HTML5) client functionality.

Upgrading my vSAN Cluster

Some time ago I decided to upgrade my home lab environment running vSphere (from 6.0 U3 to 6.5 U1) and vSAN (from 6.2 to 6.6.1).

I started with upgrading the vCenter appliance which is quite a smooth upgrade process. The only problem I had is that initially the upgrade wizard did not give me a choice to select “Tiny” as the size for the new appliance. This appeared to be an issue with the disk usage of the existing appliance. After deleting a bunch of old log files and dump files from the old vCenter appliance I retried the upgrade wizard and this time the “Tiny” option was available – which is a better fit for my “tiny” lab 🙂 – and the upgrade process went just fine.

Next up was the ESXi upgrade (I have three hosts). First try was doing an in-place upgrade using Update Manager.

VMware vSAN Specialist exam experience

Recently VMware Education announced the availability of the “vSAN Specialist” exam which entitles those who pass it to receive the “vSAN Specialist 2017” badge. The badge holder is a “technical professional who understands the vSAN 6.6 architecture and its complete feature set, knows how to conduct a vSAN design and deployment exercise, can implement a live vSAN hyper-converged infrastructure environment based on certified hardware/software components and best practices, and can administer/operate a vSAN cluster properly“.

As I consider myself to be a vSAN specialist I thought this one should be rather easy to achieve, so after I read about it last week, I immediately scheduled my exam at Pearson VUE and took it today.

VMware VVOL’s with Nimble Storage

VMware Virtual Volumes (aka VVOL) was introduced in vSphere 6.0 to allow vSphere administrators to be able to manage external storage resources (and especially the storage requirements for individual VM’s) through a policy-based mechanism (called Storage Policy Based Management – or SPBM).

VVOL in itself is not a product, but more of a framework that VMware has defined. Each storage vendor can use this framework to enable SPBM for vSphere administrators by implementing the underlying components, like VASA providers, Containers with their Storage Capabilities and Protocol Endpoints, in their own way (a good background on VVOLs can be found in this KB article). This makes it easy for each storage vendor to get started with introducing VVOL support, but also means that it is not easy to compare different vendors with regard to this feature (“YES we support VVOL’s …” does not really say much about the way an individual vendor has implemented this feature in their storage array and how they compare to other vendors).

In this blog I want to show the way Nimble Storage (now part of HPE) has implemented VVOL support. For now I will focus on the initial integration part. In a future blog I will show how this integration can be used to address the Nimble Storage capabilities for individual VM’s through the use of storage policies.

Creating a new vSAN 6.6 cluster

Last month VMware released vSAN version 6.6 as a patch release of vSphere (6.5.0d). New features included Data-at-Rest encryption, enhanced stretched clusters with local protection, change of vSAN communication from multicast to unicast and many more.
A perhaps less impressive but still very useful change is the (simple) way a new vSAN cluster is configured. To illustrate this I have recorded a short demo of the configuration of a new vSAN 6.6 cluster.