Basic Procedure

Step 1: Do Not Panic

Disasters happen.

Step 2: Determine Severity

Severity levels are defined in our SLA document. In a nutshell:

S1: Production Customer Cluster(s) are non-functional in some way.

S2: The user of the Services is experiencing a severe loss of service in a production environment. While operations can continue in a restricted fashion, important features are unavailable with no acceptable workaround, and the user's business is severely impacted. This includes, e.g., the CrateDB Cloud management console being down.

S3: The user of the Services in a production or test/development environment is experiencing a minor loss of service. The impact is an inconvenience, which may require a workaround to restore functionality.

S4: Anything else (not generally relevant for DR purposes).

Step 3: Inform Team

Inform the #team-cratedb-cloud channel (with an @ecosystem tag) that something has happened.

S1 & S2: Out of hours? Call the team lead. If you don't have their number, use the "Add Responder" feature in OpsGenie on the alert that fired; this will trigger a notification to that person's phone.

Step 4: Coordinate in Zoom

S1 & S2: Join the ecosystemA Zoom channel and let others know that you are waiting there.

S3: You can do the same; however, communication can happen over Slack if the team deems it sufficient.

Step 5: Assign Roles

Recovery Lead: the person in charge of recovery operations. They lead the recovery effort and distribute work to the other people helping. This should be a senior sysadmin.

Communications Lead: the person handling communications about the disaster recovery effort. They do not have to participate in fixing anything themselves, but should always be in the loop about what is going on, including progress and ETAs.

These two roles should not be held by the same person.

For transparency, the assignees of both roles should be clearly communicated in the #team-cratedb-cloud channel.

Step 6: Inform CE

Warning

This section is work-in-progress and has not been agreed.

Does the disaster affect CrateDB Cloud customer clusters, i.e., are they unavailable (unreachable, not running, or otherwise compromised), or has there been data loss?

Customer Clusters Affected

Inform Customer Engineering (CE) team in the #customer-engineering slack channel ASAP.

Out of hours, when CE is not available, the Communications Lead should also inform the customer about the outage in their cluster. The procedure for that is outlined in the ServiceInfo Wiki Page.

This customer list contains contact details for all of our customers. If you cannot find the right customer in Confluence (e.g., they are a cloud customer without a support plan), look up the Organization's support email in brain and notify them directly.

Note

Email template:

[Being prepared by CE team]

Customer Clusters Not Affected

Proceed to next step.

Step 7: Determine Cause and Create Plan

The team collectively discusses the ongoing disaster in the Zoom channel and comes up with a recovery plan. The Recovery Lead leads this effort. Refer to the possible scenarios and plans in this documentation portal.

As another reference, the support repository in GitHub has many post-mortems with steps on how certain issues were investigated and resolved.

Step 8: Communicate Progress

Recovery progress must be communicated periodically in the #team-cratedb-cloud channel by the Communications Lead, at least once an hour. If there are no updates, communicate that as well. An ETA for recovery should be provided ASAP.

Where possible, communications should be kept in a thread, unless there are significant developments to be announced.

Step 9: Post Mortem

If the issue has not been encountered before, a post-mortem report should be written up in the support repository on GitHub.