Lost Kubernetes cluster

Metric    Target
RPO       0 (configuration stored in git)
RTO       4 hours

Note

This applies to all Kubernetes clusters managed by crate/kubernetes-gitops.

If a cluster has been completely destroyed (accidentally or not), it must be re-created first, and everything that ran inside it must then be re-created separately.

Re-creating the clusters

Creation scripts for most of the clusters are held in the salt repo: https://github.com/crate/infrastructure/tree/master/scripts/k8s

Quick note on the objectives here:

  • Typically we do not run the latest Kubernetes version; we stay at the lower end of what the managed Kubernetes services support. The k8s manifests must be aligned with the k8s version on the cluster!

  • We set a pod limit of 100 to leave some room to grow. Be aware that this consumes a lot of IP addresses as soon as a node starts.

  • At the moment we run production on Ds16_v5 instances.

  • Make sure all three availability zones are configured for each nodepool.

  • Availability of the Kubernetes API is set to H/A.

  • Although the API is public, access is restricted to an allow-list of IP addresses. Administrators and users need an active Wireguard VPN connection. Our dynamic Jenkins runners pick an IP address from this list, as they also need access to Kubernetes for deployments and maintenance jobs. The allow-listed addresses are:

    167.235.233.190/32,49.12.45.112/32,167.235.233.134/32,167.235.245.180/32,23.88.34.31/32,116.202.23.229/32,116.203.179.85/32,167.235.235.254/32,116.203.101.248/32,116.203.120.187/32,195.201.130.63/32,167.235.30.244/32

  • Kubernetes-admin Diagnostic logging is configured to log to the infra Storage Account in the region.

  • The Kubernetes autoscaler is enabled for all nodepools. On Azure it is deployed by default with the AKS cluster. On EKS we deploy it with Flux from the kubernetes-gitops repository.

  • A dedicated system nodepool.

  • A worker nodepool.

  • A worker nodepool for shared instances, which has a dedicated taint (cratedb=shared:NoSchedule) and label (cratedb:shared).
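
When access to the Kubernetes API fails after a cluster has been re-created, a quick first check is whether the client's egress IP is covered by the allow-list. The following is a minimal sketch of that check using only the Python standard library. The CIDR entries are copied from the list above; the function name `is_allowed` and the sample addresses are illustrative and not part of any existing tooling.

```python
import ipaddress

# CIDR entries copied from the API allow-list in this document.
ALLOWED_CIDRS = [
    "167.235.233.190/32",
    "49.12.45.112/32",
    "167.235.233.134/32",
    "167.235.245.180/32",
    "23.88.34.31/32",
    "116.202.23.229/32",
    "116.203.179.85/32",
    "167.235.235.254/32",
    "116.203.101.248/32",
    "116.203.120.187/32",
    "195.201.130.63/32",
    "167.235.30.244/32",
]


def is_allowed(ip, cidrs=ALLOWED_CIDRS):
    """Return True if `ip` falls inside any of the allow-listed networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


if __name__ == "__main__":
    # Hypothetical inputs: one address taken from the list, one from the
    # 203.0.113.0/24 documentation range that is not on the list.
    for candidate in ("49.12.45.112", "203.0.113.10"):
        print(candidate, is_allowed(candidate))
```

The same list can be passed unchanged to whatever mechanism configures the authorized IP ranges on the new cluster, so keeping it in one place avoids drift between the check and the cluster configuration.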

Bootstrapping

Refer to the instructions in crate/kubernetes-gitops for the particular cluster (see branches).

Note

For infrastructure bootstrapping we rely on Flux v2, which has a Mozilla SOPS integration. The few secrets we need for bootstrapping the Kubernetes cluster are encrypted with SOPS using Azure KMS and age.
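
Before bootstrapping, it can help to confirm that a secret manifest really is SOPS-encrypted and not plain text. A SOPS-encrypted YAML file carries a top-level `sops:` metadata block, and its protected values are wrapped as `ENC[...]`. The sketch below checks for both markers with the standard library only. The function name `looks_sops_encrypted` and the sample manifests are hypothetical, and the check is a heuristic, not a replacement for decrypting with the real keys.

```python
import re

# A SOPS-encrypted YAML document has a top-level "sops:" metadata mapping...
_SOPS_KEY = re.compile(r"^sops:\s*$", re.MULTILINE)
# ...and its encrypted values are wrapped in ENC[...].
_ENC_VALUE = re.compile(r"ENC\[[^\]]+\]")


def looks_sops_encrypted(yaml_text):
    """Heuristic: True if the text has SOPS metadata and an encrypted value."""
    return bool(_SOPS_KEY.search(yaml_text)) and bool(_ENC_VALUE.search(yaml_text))


# Hypothetical sample manifests, shortened for illustration.
ENCRYPTED_SAMPLE = """\
apiVersion: v1
kind: Secret
stringData:
    token: ENC[AES256_GCM,data:abc,iv:def,tag:ghi,type:str]
sops:
    age: []
"""

PLAIN_SAMPLE = """\
apiVersion: v1
kind: Secret
stringData:
    token: not-encrypted
"""

if __name__ == "__main__":
    print(looks_sops_encrypted(ENCRYPTED_SAMPLE))
    print(looks_sops_encrypted(PLAIN_SAMPLE))
```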

Re-creating customer clusters

Follow the instructions in Deleted Clusters to recover customer clusters.