# Basic Procedure

## Step 1: Do Not Panic
Disasters happen.
## Step 2: Determine Severity

Severity levels are defined in our SLA document. In a nutshell:

| Severity | Description |
|---|---|
| S1 | Production customer cluster(s) are non-functional in some way. |
| S2 | The user of the services is experiencing a severe loss of service in a production environment. While operations can continue in a restricted fashion, important features are unavailable with no acceptable workaround, and the user's business is severely impacted. This includes, e.g., the CrateDB Cloud management console being down. |
| S3 | The user of the services in a production or test/development environment is experiencing a minor loss of service. The impact is an inconvenience, which may require a workaround to restore functionality. |
| S4 | Anything else (not generally used for DR purposes). |
## Step 3: Inform Team

Inform the #team-cratedb-cloud channel (with an @ecosystem tag) that something has happened.

S1 & S2: Out of hours? Call the team lead. Don't know their number? Use the "Add Responder" feature in OpsGenie on the alert that fired; this will trigger a notification to that person's phone.
## Step 4: Coordinate in Zoom

S1 & S2: Join the ecosystemA Zoom channel and inform others that you are waiting there.

S3: You can do the same; however, communication can happen over Slack if the team deems it sufficient.
## Step 5: Assign Roles

Recovery Lead: the person taking charge of recovery operations. This person leads the recovery effort and distributes work to the other people helping. This should be a senior sysadmin.

Communications Lead: the person handling communications about the disaster recovery effort. This person does not have to participate in fixing anything themselves but should always be in the loop about what is going on, including progress and ETAs.

These roles should not be held by the same person. For transparency, the assignees of these roles should be clearly communicated in the #team-cratedb-cloud channel.
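As a sketch, the role announcement could be drafted like this. The function name and message wording are hypothetical; only the rule that the two roles must differ comes from this procedure:

```python
def role_announcement(recovery_lead: str, comms_lead: str) -> str:
    # The two roles must not be held by the same person.
    if recovery_lead == comms_lead:
        raise ValueError("Recovery Lead and Communications Lead must differ")
    return (
        "DR roles assigned:\n"
        f"Recovery Lead: {recovery_lead}\n"
        f"Communications Lead: {comms_lead}"
    )

print(role_announcement("alice", "bob"))
```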
## Step 6: Inform CE

Warning

This section is work-in-progress and has not been agreed upon.

Does the disaster affect CrateDB Cloud customer clusters? That is, are clusters unavailable (unreachable, not running, or otherwise compromised), or has there been data loss?
### Customer Clusters Affected

Inform the Customer Engineering (CE) team in the #customer-engineering Slack channel ASAP.

Out of hours, when CE is not available, the Communications Lead should also inform the customer about the outage in their cluster. The procedure for that is outlined in the ServiceInfo wiki page.

This customer list contains the contact details of all of our customers. If you cannot find the right customer in Confluence (e.g. they are a Cloud customer without a support plan), look up the organization's support email in brain and notify them directly.
Note
Email template:
[Being prepared by CE team]
### Customer Clusters Not Affected

Proceed to the next step.
## Step 7: Determine Cause and Create Plan

The team collectively discusses the ongoing disaster in the Zoom channel and comes up with a recovery plan. The Recovery Lead leads this effort. Refer to possible scenarios and plans in this documentation portal.

As another reference, the support repository on GitHub contains many post-mortems with steps describing how certain issues were investigated and resolved.
## Step 8: Communicate Progress

Recovery progress must be communicated periodically in the #team-cratedb-cloud channel by the Communications Lead, with updates at least once an hour.

If there is no new progress, that should also be communicated explicitly.

An ETA for recovery should be provided ASAP.

Where possible, communications should be kept in a thread, unless there are significant developments to be announced.
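The hourly update cadence can be sketched as a simple check. This is a hypothetical helper, not existing tooling:

```python
from datetime import datetime, timedelta

def update_overdue(last_update: datetime, now: datetime) -> bool:
    """True if more than an hour has passed since the last posted update."""
    return now - last_update > timedelta(hours=1)

last = datetime(2024, 1, 1, 12, 0)
print(update_overdue(last, datetime(2024, 1, 1, 12, 45)))  # False
print(update_overdue(last, datetime(2024, 1, 1, 13, 30)))  # True
```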
## Step 9: Post Mortem

If the issue has not been encountered before, a post-mortem report should be written up in the support repository on GitHub.