Continuous Delivery & Deployment

Continuous Delivery

Continuous delivery (CD or CDE) is a software engineering approach in which teams produce software in short cycles, ensuring that the software can be reliably released at any time and, when releasing the software, doing so manually.

Wikipedia

As a service provider for CrateDB Cloud, Crate.io needs to be at the top of its game when it comes to deploying new CrateDB clusters for customers. To facilitate a good user experience, it is crucial that our software works as expected.

A good way to test that an application behaves as intended is to deploy it to a dedicated testing environment that provides the same, or very similar, behavioral characteristics as the production environment. This is known as dev/prod parity.

At Crate.io, for the CrateDB Cloud services, our CI/CD tool (Jenkins) builds a deployment artifact for every change pushed to a service’s master branch. Following a successful build, that artifact is automatically deployed to all development environments. Within minutes, the corresponding change is available for further testing in the development environments.

We accomplish that behavior by tagging and pushing each Docker image we build with the corresponding git commit short-hash (first 7 characters). We then trigger the deployment, using the same git commit short-hash as a version.
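As a rough illustration of the tagging scheme described above, the following sketch derives the short hash from a full commit hash and builds an image reference from it. The function names and the registry path are hypothetical and are not taken from generic_docker_build_publish.groovy.

```python
import subprocess

SHORT_HASH_LENGTH = 7  # the text above uses the first 7 characters


def short_hash(commit_sha: str) -> str:
    """Return the short hash used as image tag and deployment version."""
    if len(commit_sha) < SHORT_HASH_LENGTH:
        raise ValueError(f"commit hash too short: {commit_sha!r}")
    return commit_sha[:SHORT_HASH_LENGTH]


def image_reference(repository: str, commit_sha: str) -> str:
    """Build '<repository>:<short hash>' (the repository name is hypothetical)."""
    return f"{repository}:{short_hash(commit_sha)}"


def current_commit() -> str:
    """Ask git for the full hash of the checked-out commit."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Because the same short hash is used for the image tag and for the deployment version, a deployed service can always be traced back to exactly one commit.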

The implementation of this workflow is in the generic_docker_build_publish.groovy script in the jenkins-dsl repository.

Furthermore, we build Docker images as outlined above when a git “version” tag or release branch is pushed. More precisely, any git revision that has the format (\d+\.)+\d+ will result in a build.
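To make the pattern above concrete, here is a minimal sketch of how a ref name could be checked against it. Matching the whole name (rather than a substring) is an assumption, and the function name is hypothetical.

```python
import re

# Pattern quoted in the text above. Requiring the whole ref name to match
# (fullmatch) is an assumption, not something the source states.
VERSION_PATTERN = re.compile(r"(\d+\.)+\d+")


def is_version_ref(ref_name: str) -> bool:
    """Return True if the tag or branch name looks like a version number."""
    return VERSION_PATTERN.fullmatch(ref_name) is not None
```

Under this reading, names such as 1.2 or 1.2.3 qualify, while master or v1.2.3 do not.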

Continuous Deployment

Continuous deployment (CD) is a software engineering approach in which software functionalities are delivered frequently through automated deployments. CD contrasts with continuous delivery, a similar approach in which software functionalities are also frequently delivered and deemed to be potentially capable of being deployed but are actually not deployed.

Wikipedia

Attention

Continuous deployment has not been fully implemented. This section describes ideas to reach that goal.

While we continuously deploy new changes of a CrateDB Cloud service’s master branch to a development environment, deployments to production still require a couple of extra, manual steps.

As outlined in the documentation about our continuous delivery setup above, these steps should be automated as much as possible.

Proposal I

A first step toward continuous deployment to production could be to deploy to production automatically when a git “version” tag (\d+\.\d+\.\d+) is pushed, but only after the deployments to the development environments have completed successfully.

The problem with this is that we currently have no way to find out whether a deployment was actually successful (see Problems & Improvements for additional information). Deploying a git “version” tag to the development environments might work from the kubectl perspective, but if the application then fails to start, Jenkins does not detect it. Thus, an effectively failed development deployment may still result in a deployment to production even though it shouldn’t.

On the other hand, when the regular continuous deployments to the development environments succeed, we should have the confidence that the “version releases” can be deployed to production without much hassle.
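The gating rule of Proposal I could be sketched as follows. The function name and the shape of dev_results (environment name mapped to a success flag) are hypothetical. As noted above, the result is only as trustworthy as the success signal of the development deployments.

```python
from __future__ import annotations

import re

# Three-part version tag as written in Proposal I.
VERSION_TAG = re.compile(r"\d+\.\d+\.\d+")


def should_deploy_to_production(tag: str, dev_results: dict[str, bool]) -> bool:
    """Allow a production deployment only for a version tag whose
    development deployments all reported success."""
    if VERSION_TAG.fullmatch(tag) is None:
        return False
    # An empty result set means nothing was verified, so refuse.
    return bool(dev_results) and all(dev_results.values())
```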

Proposal II

This proposal assumes, requires, and enforces that only “forward” deployments happen. That means one can only re-deploy the currently deployed commit or a commit that is “newer” than the last deployed commit. “Newer” means, in a linear history, a commit that has the last deployed commit as its parent or as a more distant ancestor.
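The forward-only rule amounts to an ancestry check. The sketch below works on an in-memory map from each commit to its parents; the function name and the data layout are hypothetical and only illustrate the idea.

```python
from typing import Dict, List


def is_forward_deployment(
    candidate: str, deployed: str, parents: Dict[str, List[str]]
) -> bool:
    """Return True if `candidate` is the deployed commit itself or has it
    somewhere among its ancestors."""
    stack = [candidate]
    seen = set()
    while stack:
        commit = stack.pop()
        if commit == deployed:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        # Walk back through the history, one parent link at a time.
        stack.extend(parents.get(commit, []))
    return False
```

Against a real repository, git merge-base --is-ancestor <deployed> <candidate> performs the same ancestry check and reports the result through its exit status.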

Forward-only deployments are usually more straightforward. One doesn’t need to think about the consequences that reverting code changes has on, for example, a database. This becomes clear when one looks at the following example:

Let’s say there is some code that uses a database table. That code, together with the database table, has been deprecated and should now be removed. This can be done in two steps:

  1. Remove the code in question

  2. Remove the database table

If a bug occurs after the database table has been removed, a direct rollback to the “old” version of the code is not possible, because the old version still uses the table that was just removed. A rollback would first require recreating the database table, either from scratch or by restoring it from a backup. Only afterwards can the code be reverted to the old version.

When one only performs forward deployments, the changes that introduced the bug can be reverted and released as a new version, without requiring the table to be restored. 1

With that in mind, this proposal is about being able to deploy each and every commit (or at least pushes to the master branch) to production. It sits on top of the existing continuous delivery to the development environments. But additionally, upon manual approval, one can deploy a commit to production. That has multiple advantages:

  • Changes to production can be small. Small changes are easier to reason about and easier to validate. If problems occur, they can quickly be narrowed down to the commit or pull request that introduced them.

  • By leveraging feature toggles, we can ship code for new features or behavior to development and even to production without that code having an impact on customers. Instead, the product team, for example, can turn features on for themselves, test them, and release them to the world once they’re happy with how they work.

  • Expanding on the idea of feature toggles, there is an opportunity for A/B testing. But that’s for a later stage.
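To illustrate the feature toggle idea from the list above, here is a minimal sketch. The class, its fields, and the user names are hypothetical; a real setup would typically load toggle state from configuration or a dedicated service.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FeatureToggle:
    """A feature that ships disabled and can be enabled per user."""

    name: str
    enabled_for_everyone: bool = False
    enabled_users: Set[str] = field(default_factory=set)

    def is_enabled(self, user: str) -> bool:
        """Return True if this user should see the feature."""
        return self.enabled_for_everyone or user in self.enabled_users


# The code is deployed, but only the product team sees the feature.
new_dashboard = FeatureToggle("new-dashboard", enabled_users={"product-team"})
```

Releasing the feature to all customers then means flipping enabled_for_everyone, with no new deployment required.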

The implementation of this proposal would require a move from the Jenkins multiJob we currently use to a Jenkins Pipeline, where an Input Step can be used to ask for the manual approval.

1

There are definitely exceptions to this rule, but this is the usual and common case.

Problems & Improvements

  • Currently, deployments don’t verify that the changed service has actually been deployed successfully. They are more of a “fire & forget” style.

    This is due to how Kubernetes works. When deploying a new version of a service, we generate a Kubernetes manifest for the service in question and apply that manifest using kubectl apply -f manifest.yaml. This command, however, doesn’t wait for Kubernetes to carry out the change successfully; it only triggers the required operations and then completes.

    Ideally, we would monitor the successful start of new Kubernetes pods, etc. and only then mark a deployment job successful within Jenkins.
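One possible way to close this gap, sketched below, is to follow kubectl apply with kubectl rollout status, which blocks until the rollout finishes or the timeout expires and exits with a non-zero status on failure. The function names, the deployment and namespace values, and the default timeout are hypothetical. This sketch assumes the service is a Kubernetes Deployment.

```python
import subprocess
from typing import Callable, List


def rollout_status_command(
    deployment: str, namespace: str, timeout_seconds: int
) -> List[str]:
    """Build the kubectl command that waits for a rollout to finish."""
    return [
        "kubectl", "rollout", "status", f"deployment/{deployment}",
        "--namespace", namespace,
        f"--timeout={timeout_seconds}s",
    ]


def wait_for_rollout(
    deployment: str,
    namespace: str,
    timeout_seconds: int = 300,
    run: Callable = subprocess.run,
) -> bool:
    """Return True only if the rollout completed within the timeout.

    `run` is injectable so the logic can be exercised without a cluster.
    """
    command = rollout_status_command(deployment, namespace, timeout_seconds)
    result = run(command, check=False)
    return result.returncode == 0
```

A Jenkins job could call such a check after applying the manifest and mark the deployment as failed when it returns False.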