Lost Kubernetes cluster
=======================

+--------+---------------------------------+
| Metric | Target                          |
+========+=================================+
| RPO    | 0 (configuration stored in git) |
+--------+---------------------------------+
| RTO    | 4 hours                         |
+--------+---------------------------------+

.. note::

   This applies to all Kubernetes clusters managed by
   `crate/kubernetes-gitops`_.

If a cluster has been completely destroyed (accidentally or not), it needs to
be re-created, and everything that ran inside it re-created separately.

Re-creating the clusters
------------------------

Creation scripts for most of the clusters are held in the salt repo:
https://github.com/crate/infrastructure/tree/master/scripts/k8s

Quick notes on the objectives here:

- Typically we do not run the latest Kubernetes version, but rather stay at
  the lower end of what the managed Kubernetes services support. The k8s
  manifests need to be aligned with the k8s version on the cluster!
- We set a pod limit of ``100`` to have some space to grow. Be aware that
  this eats up a lot of IP addresses just by starting a node, because each
  node reserves addresses for its maximum pod count.
- At the moment we run production on ``Ds16_v5`` instances.
- Make sure all three availability zones are configured per nodepool.
- Availability of the Kubernetes API is set to ``H/A``.
- Although the API is public, access is restricted to an allow-list of IP
  addresses. As an administrator/user you need to have the WireGuard VPN
  active. Our dynamic Jenkins runners pick an IP address from this list, as
  they also need access to Kubernetes for deployments or to run maintenance
  jobs (a sketch of applying this list is shown at the end of this page):

  ``167.235.233.190/32,49.12.45.112/32,167.235.233.134/32,167.235.245.180/32,23.88.34.31/32,116.202.23.229/32,116.203.179.85/32,167.235.235.254/32,116.203.101.248/32,116.203.120.187/32,195.201.130.63/32,167.235.30.244/32``

- Kubernetes-admin diagnostic logging is configured to log to the infra
  storage account in the region.
- The Kubernetes cluster autoscaler is enabled for all nodepools. On Azure it
  is deployed by default with the AKS cluster; on EKS we deploy it with Flux
  from the ``kubernetes-gitops`` repository.
- A dedicated system nodepool.
- A worker nodepool.
- A worker nodepool for shared instances, which has a special set of
  ``taints:`` ``cratedb=shared:NoSchedule`` and ``labels:`` ``cratedb:shared``
  (see the nodepool sketch at the end of this page).

Bootstrapping
-------------

Refer to the instructions in `crate/kubernetes-gitops`_ for the particular
cluster (see branches). A hedged bootstrap sketch is included at the end of
this page.

.. note::

   For infrastructure bootstrapping we rely on Flux v2, which has a Mozilla
   SOPS integration. The secrets we need for bootstrapping the Kubernetes
   cluster (just a few) are SOPS'd with Azure Key Vault and ``age``.

Re-creating customer clusters
-----------------------------

Follow the instructions in :doc:`recovery-clusters` to recover customer
clusters.

.. _crate/kubernetes-gitops: https://github.com/crate/kubernetes-gitops
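
Example: shared worker nodepool (sketch)
------------------------------------------

A minimal, untested sketch of creating the shared worker nodepool described
above on AKS. The resource group, cluster name, nodepool name, and autoscaler
bounds are placeholders, and ``Standard_D16s_v5`` is assumed to be the SKU
behind the ``Ds16_v5`` note; the creation scripts in the salt repo are
authoritative.

.. code-block:: bash

   # Sketch only: three zones, the 100-pod limit, the cluster autoscaler,
   # and the taint/label pair for shared instances from the list above.
   # <resource-group>, <cluster-name>, and the min/max counts are placeholders.
   az aks nodepool add \
       --resource-group <resource-group> \
       --cluster-name <cluster-name> \
       --name sharedworkers \
       --mode User \
       --node-vm-size Standard_D16s_v5 \
       --zones 1 2 3 \
       --max-pods 100 \
       --enable-cluster-autoscaler --min-count 3 --max-count 9 \
       --node-taints "cratedb=shared:NoSchedule" \
       --labels cratedb=shared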
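
Example: API server allow-list (sketch)
-----------------------------------------

A sketch of applying the allow-list from the objectives above to the public
API endpoint of an AKS cluster; the resource group and cluster name are
placeholders.

.. code-block:: bash

   # The allow-list from the objectives above (Jenkins runners / VPN egress).
   RANGES="167.235.233.190/32,49.12.45.112/32,167.235.233.134/32,167.235.245.180/32,23.88.34.31/32,116.202.23.229/32,116.203.179.85/32,167.235.235.254/32,116.203.101.248/32,116.203.120.187/32,195.201.130.63/32,167.235.30.244/32"

   # Restrict the public Kubernetes API to the allow-listed ranges.
   az aks update \
       --resource-group <resource-group> \
       --name <cluster-name> \
       --api-server-authorized-ip-ranges "$RANGES"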
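
Example: Flux v2 bootstrap (sketch)
-------------------------------------

A sketch of bootstrapping Flux v2 against a cluster branch of
`crate/kubernetes-gitops`_. The branch, path, and the use of an ``age`` key
secret are assumptions here; the per-cluster instructions in the repository
are authoritative.

.. code-block:: bash

   # Requires a GitHub token with repo access in GITHUB_TOKEN.
   # <cluster-branch> and <cluster-path> are placeholders.
   flux bootstrap github \
       --owner=crate \
       --repository=kubernetes-gitops \
       --branch=<cluster-branch> \
       --path=<cluster-path>

   # If SOPS decryption uses an age key (the standard Flux pattern for age),
   # the key is provided to Flux as a secret in the flux-system namespace:
   kubectl create secret generic sops-age \
       --namespace=flux-system \
       --from-file=age.agekey=age.agekey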