Spinnaker @ GIPHY

August 28, 2020 by Bryant Rockoff

Like many companies, GIPHY Engineering has been using Kubernetes for the past several years to help our teams quickly build, package into containers, and deploy applications to our AWS servers.


One of the problems with any Kubernetes distribution is: well, the distribution. There is an amalgamation of tools out there vying for your attention (and, in many cases, your $$$) to help get your code out the door in Kubernetes. There is also a plethora of tools to help template your Kubernetes manifests, including Helm, around which GIPHY Engineering has been developing internal tooling specific to our deployment mechanisms.

Why Spinnaker?


This all leads to a question for both Kubernetes and CI/CD newcomers: which tools are best to get you up and running?


At GIPHY Engineering, prior to Spinnaker, we had largely been handling deployments through a combination of Jenkins and manual kubectl, applying static manifests from a GitHub repo maintained by the GIPHY Engineering team.


However, for the past year the Site Reliability team has been working to transition our teams to Spinnaker, in conjunction with Jenkins and Helm, to help deploy our applications across our multiple K8s clusters and environments in an automated fashion. While security and validation/auditing played a part in this decision, our primary reasons for choosing Spinnaker were:


  1. It was easy for us to integrate Spinnaker into our already developed Jenkins pipelines and tooling. (More on this to come.)
  2. Spinnaker allowed us to distribute our code across multiple Kubernetes clusters quickly and efficiently, and to onboard and stand up new clusters in the tool with minimal work.
  3. GIPHY Engineering had been looking for a way to handle automated Canary testing and releases, both of which Spinnaker supported out of the box via its Kayenta microservice.


There are many ways a team can trigger pipelines in Spinnaker, including GitHub webhooks, Pub/Sub, and GCS artifacts. We chose a combination of Jenkins and general webhooks for our use case because they let us quickly iterate and fold the triggers into our already developed pipelines.
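
Both of those mechanisms end up as triggers in a Spinnaker pipeline's configuration. As a rough illustration (the master, job, and constraint names here are placeholders, not our real ones), the two trigger types look something like this in pipeline JSON:

{
  "triggers": [
    {
      "enabled": true,
      "type": "jenkins",
      "master": "example-jenkins",
      "job": "example-service-build",
      "propertyFile": "spinnaker-trigger.json"
    },
    {
      "enabled": true,
      "type": "webhook",
      "source": "example-service-staging",
      "payloadConstraints": {
        "environment": "staging"
      }
    }
  ]
}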

Running the Spinnaker

Jenkins and the Jenkins Shared Library

In order to get Spinnaker out the door quickly and easily with our current infrastructure, we utilized a proprietary Jenkins Shared Library. This library helped facilitate the transition to deployment via our new Spinnaker tooling by allowing us to make simple tweaks to the already developed Jenkinsfiles in our repositories, which looked similar to the following:


@Library("giphy-jenkins-library@master") _
...
...
stages {
    stage('Define Spinnaker variables') {
        steps {
            script {
                variablesSpinnaker(environment: spinEnv,
                    helmDir: "helm/${env.SERVICE_NAME}",
                    imageName: "${env.JENKINS_JOB_NAME}",
                    imageTag: "${BUILD_VERSION}",
                    primaryBranch: "develop"
                )
            }
        }
    }

    stage('Prepare helm package') {
        steps {
            kubernetesHelmPush(helmDir: "helm/${env.SERVICE_NAME}",
                helmPackageName: env.jslHelmPackageName,
                helmPackageVersion: env.jslHelmPackageVersion,
                helmPackageVersionSha: env.jslHelmPackageVersionSha,
                imageTag: env.jslImageTag
            )
        }
    }

    stage('Trigger Spinnaker') {
        steps {
            spinnakerTriggerPrep(spinAppName: "${env.SERVICE_NAME}")
        }
    }
}

While this is a condensed version of our live Jenkinsfiles, the shared library in its current state handles the following logic:


  1. Generating a list of predefined values we send to Spinnaker (variablesSpinnaker)
  2. Determining whether an application's Helm chart has been updated, and pushing it to our internal Helm repository if necessary
  3. Triggering a Spinnaker deployment (either via a Jenkins artifact or webhook JSON, depending on the needs of the app)


Here’s an example of an artifact the Jenkins Shared Library generates (which is archived at the end of the pipelines so Spinnaker can retrieve it):

{
    "helmPackageVersion": "0.1.13",
    "helmPackage": "bartender",
    "helmRepository": "stable",
    "branchName": "develop",
    "environment": "staging",
    "imageTag": "develop-dc67395-f38b4f1",
    "longCommit": "f38b4f17726839a02896846d884fcc9e28bcce0e"
}


By using a Jenkins Shared Library, the work our developers had put into their pipelines did not need any rewriting, and adopting Spinnaker took only about 30-40 additional lines of Groovy to implement.


Our Pipeline


Once our Jenkins Shared Library was in a state that allowed for our deployment mechanisms, we began working on additional tooling to help get our applications out the door. This included tooling to implement secret distribution via Vault, as well as the introduction of Helm into our ecosystem (including base Helm templates our developers could work off of when migrating their applications).


We settled largely on three types (I sense a theme here) of Spinnaker pipelines.


Basic Pipelines


This is our simplest pipeline: it is triggered by the completion of a Jenkins pipeline, grabs that pipeline's generated/archived artifact, installs our secrets from Vault, "Bakes" (Spinnaker speak for "templating") our Helm chart, and deploys the baked chart to the cluster of our choice. At GIPHY Engineering, we primarily use these pipelines for deploying single-use or long-term environments, such as staging, which don't usually involve additional testing suites at launch.
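
In pipeline JSON terms, that flow is just a short chain of stages. Here's a simplified sketch; the bake and deploy stage types are standard Spinnaker ones, while our actual secrets stage is internal tooling, so the runJobManifest type below is only a stand-in:

{
  "stages": [
    {
      "name": "Create Secrets",
      "refId": "1",
      "requisiteStageRefIds": [],
      "type": "runJobManifest"
    },
    {
      "name": "Bake Helm Chart",
      "refId": "2",
      "requisiteStageRefIds": ["1"],
      "type": "bakeManifest"
    },
    {
      "name": "Deploy Manifest",
      "refId": "3",
      "requisiteStageRefIds": ["2"],
      "type": "deployManifest"
    }
  ]
}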


With regard to the "baking" of our chart, Spinnaker uses its pipeline expression language to help pass information from your Spinnaker triggers and Jenkins jobs to the bake process, as well as between stages within a pipeline.


In practice, this looks something like the following in Spinnaker’s pipeline JSON for Jenkins Artifact pipelines (for our Bake Helm Chart step):
{
  "expectedArtifacts": [
    {
      "defaultArtifact": {
        "kind": "default.s3",
        "type": "s3/object"
      },
      "displayName": "helm-artifact",
      "id": "helm-artifact",
      "matchArtifact": {
        "kind": "base64",
        "name": "helm-artifact",
        "type": "embedded/base64"
      },
      "useDefaultArtifact": false
    }
  ],
  "inputArtifacts": [
    {
      "account": "giphy-aws-mgmt",
      "id": "initial-artifact-id"
    }
  ],
  "name": "Bake Helm Chart",
  "namespace": "${ trigger[\"properties\"][\"namespace\"]}",
  "outputName": "${ trigger[\"properties\"][\"helmPackage\"]}-${trigger[\"properties\"][\"branchName\"] }",
  "overrides": {
    "deployment.labels.github\\.com/revision": "${ trigger[\"properties\"][\"longCommit\"] ?: '' }",
    "env.secretRefs.primary": "${ #stage(\"Create Secrets\")[\"context\"][\"generatedPrimarySecretName\"] ?: '' }",
    "environment": "${ trigger[\"properties\"][\"environment\"] }",
    "fullnameOverride": "${ trigger[\"properties\"][\"helmPackage\"] }",
    "image.tag": "${ trigger[\"properties\"][\"imageTag\"] }"
  },
  "templateRenderer": "HELM2",
  "type": "bakeManifest"
}


The key point here is that we are passing a fair amount of information to our Helm chart from our trigger properties sent from Jenkins, which were generated by our Jenkins Shared Library. These properties are rendered and set as Helm Overrides during the bake process.
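
To round out the picture, the baked manifest (matched by the "helm-artifact" expected artifact above) is then consumed by a deploy stage. A trimmed sketch of what that stage can look like follows; the Kubernetes account name is a placeholder, and the app name is borrowed from the artifact example earlier:

{
  "name": "Deploy Baked Chart",
  "type": "deployManifest",
  "cloudProvider": "kubernetes",
  "account": "example-k8s-cluster",
  "source": "artifact",
  "manifestArtifactId": "helm-artifact",
  "moniker": {
    "app": "bartender"
  },
  "namespaceOverride": "${ trigger[\"properties\"][\"namespace\"] }"
}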


Environment Pipelines



Our second type of pipeline here at GIPHY Engineering is the "Environment Pipeline." This is a webhook-type pipeline. Why? Because we send Spinnaker some additional information beyond what's listed above, then use that information to generate unique environments.


Unlike our basic pipeline, which has a minimal amount of “overrides” for Helm, this pipeline type can have many overrides assigned to an application to help facilitate the creation of an entirely new environment.
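
To make that concrete, a webhook payload for one of these pipelines might carry the same fields as the Jenkins artifact shown earlier, plus a block of per-environment Helm overrides. The example below is purely illustrative; the override keys depend entirely on the application's chart:

{
  "helmPackage": "bartender",
  "helmPackageVersion": "0.1.13",
  "branchName": "feature-xyz",
  "environment": "feature-xyz",
  "imageTag": "feature-xyz-ab12cd3",
  "overrides": {
    "fullnameOverride": "bartender-feature-xyz",
    "ingress.hostname": "bartender-feature-xyz.example.com",
    "replicaCount": "1"
  }
}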


Currently, these environments are unique to a specific application, but work is ongoing to allow our entire stack, as well as specific applications, to be deployed in tandem.


Canary Pipelines



Finally, the last type of pipeline we have is the Canary Pipeline. At GIPHY Engineering, this pipeline has a substantial amount of customization, including several custom jobs we created to copy the current Deployment objects for use as "baseline" versions for canary analysis, as well as the actual Canary Analysis stage (which pulls in specific metrics from New Relic and Datadog).

The general pathway for canary pipelines

As you can see, the Canary pipeline duplicates the currently running production image (v1.0) as a “baseline” for our analysis, and creates a new “canary” deployment to compare the metrics from both.


Once the analysis completes, Spinnaker determines whether the deployed changes are positive or negative based on a score composed of several metrics. If the score is above a certain threshold, the canary passes and the new version is deployed. An example from our "web" pipeline can be seen below:

In a successful canary test, the baseline is indistinguishable from the canary.



How Spinnaker handles a typical release with a successful canary.


Of course, not every canary goes so smoothly. For example, we recently had a canary test for our web application that saw a large spike in its error rate upon deployment. Spinnaker was able to determine this was problematic based on the generated report, and halted the deployment of that code. Our developers were able to work on the bug after the canary deployment ended, and ship the change two days later:


 A failed canary that Spinnaker was able to automatically halt before releasing to 100% production.


 How Spinnaker handled a failed canary.


In the above screenshots, you can see the report Spinnaker generated for our "failed" canary (which saw a large spike in the application's error rate), and below it, how our pipeline reacted to the failed report. When the canary completed, rather than shipping the code out fully to production as in our first example, it noted the failure within the pipeline. It then deleted the baseline and canary deployments in production, keeping the currently deployed production code active and intact.
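
For reference, the pass/fail decision described above comes from score thresholds configured on the canary analysis stage itself. A trimmed, hypothetical sketch of that stage's configuration is below; the field names follow Spinnaker's Kayenta canary stage, but the thresholds, durations, and config ID are illustrative rather than our production settings:

{
  "name": "Canary Analysis",
  "type": "kayentaCanary",
  "canaryConfig": {
    "canaryConfigId": "example-canary-config-id",
    "lifetimeDuration": "PT1H",
    "canaryAnalysisIntervalMins": "15",
    "scoreThresholds": {
      "marginal": "50",
      "pass": "75"
    }
  }
}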

Spinnaker at GIPHY in 2020


Moving forward, our team is working on completing the onboarding of all of our applications into Spinnaker. This includes work on even simpler pipelines to help our developers migrate applications that haven't been ported to our new Helm-based bake systems. These pipelines use the githubFile artifact type in Spinnaker to pull in updated files from GitHub, based on push events for those files, and deploy them out via Spinnaker.


In this case, our compiled.yaml file is generated by Jenkins and is a compiled version of all of our static Kubernetes objects (deployments, services, ingresses, etc.).
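
For those curious, wiring this up in Spinnaker amounts to a git trigger plus a github/file expected artifact that matches the file's path in the repo. A minimal sketch is below; the org, repo, and path are placeholders:

{
  "triggers": [
    {
      "enabled": true,
      "type": "git",
      "source": "github",
      "project": "example-org",
      "slug": "example-repo",
      "branch": "master"
    }
  ],
  "expectedArtifacts": [
    {
      "displayName": "compiled-manifests",
      "matchArtifact": {
        "type": "github/file",
        "name": "k8s/compiled.yaml"
      },
      "useDefaultArtifact": false,
      "usePriorArtifact": false
    }
  ]
}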


We’re also working on internal POCs to help onboard our developers even faster, with improved processes for onboarding new applications into Spinnaker. More on this to come in a future GIPHY Engineering blog post. 😉

Is this for me?

TL;DR — Spinnaker will be a major time investment for your team. Like any Kubernetes tooling, or tooling in general, make sure it fits your team’s needs before going down this path!


Our SRE team at GIPHY Engineering is relatively small, so getting Spinnaker out the door, with the tooling we envisioned and needed, took about a year in total. This is a substantial time investment in a tool, and one that many startups and small organizations may simply not have the time to make.


Spinnaker is a beast. Plain and simple. In fact, many large organizations have entire teams dedicated to just maintaining Spinnaker in their infrastructure. It’s something our small team has had to grapple with, and certainly one of the pitfalls of the tool.


That being said, when Spinnaker works, it just works; and it works well. While our initial launch was slower than expected, we’ve since had a lot of success with the tool, including catching bugs before they hit production via our Canary pipelines, as well as reducing the human error present in our previous deployment systems. It has also substantially automated our deployment processes.


Additionally, because Spinnaker is so extensible, the amount of customization you and your team can pour into it is nearly limitless. For example, we’ve developed full integration test suites surrounding our pipelines, which can be triggered based on commit messages sent with our Jenkins artifact payloads.

More Information

You can find all of the information you need on Spinnaker at its OSS website, https://spinnaker.io


For teams looking for enterprise level solutions and support, we highly recommend Armory, who we partnered with throughout our development process.


Finally, if you want to learn more about how our Spinnaker infrastructure works, including a live demo, you can find a presentation I gave on the topic last fall at a Meetup event in New York City on YouTube.


Stay well, and healthy

<3 SRE@GIPHY

— Bryant Rockoff, Site Reliability Engineer
