GitHub Engineering

An open source parser for GitHub Actions

2019-02-07T00:00:00+00:00

Since the beta release of GitHub Actions last October, thousands of users have added workflow files to their repositories. But until now, those files only work with the tools GitHub provided: the Actions editor, the Actions execution platform, and the syntax highlighting built into pull requests. To expand that universe, we need to release the parser and the specification for the Actions workflow language as open source. Today, we’re doing that.

We believe that tools beyond GitHub should be able to run workflows. We believe there should be programs to check, format, compose, and visualize workflow files. We believe that text editors can provide syntax highlighting and autocompletion for Actions workflows. And we believe all that can only happen if the Actions community is empowered to build these tools along with us. That can happen better and faster if there is a single language specification and a free parser implementation.

The first project to use the open source parser will be act, which is @nektos’s tool for running Actions workflows in a local development environment.

The parser and language specification are both in actions/workflow-parser, which we’re sharing under an MIT license. As of today, there is a Go implementation, which is the same code that powers both the Actions UI and the Actions execution platform. The repository also contains a JavaScript parser in development, along with syntax-highlighting configurations for Atom and Vim.

GitHub Actions

GitHub Actions is a platform for automating software development workflows, from idea to production. Developers add a simple text file to their repository, .github/main.workflow, to describe automation. The workflow file describes how events like pushing code or opening and closing issues map to automation actions, implemented in any Docker container. Those automation actions have whatever powers you grant them: pushing commits to the repository, cutting a new release, building it through continuous integration, deploying it to staging in the cloud, testing the deployment, flipping it to production, and announcing it to the world – and any others you can build.

Every workflow begins with an event and runs through a set of actions to reach some target or goal. Those events and actions are described in a main.workflow file, which you can create and edit with the visual editor or any text editor you like. Here is a simple example:

workflow "when I push" {
  on = "push"
  resolves = "ci"
}

action "ci" {
  uses = "docker://golang:latest"
  runs = "./script/cibuild"
}

Whenever I push to a branch that contains that file, the when I push workflow executes. It resolves the target action ci, which runs ./script/cibuild in a golang:latest Docker container. I can add more workflow blocks to harness more events, and I can add more action blocks to run after the ci action or in parallel with it.

The Actions workflow language

All main.workflow files are written in the Actions workflow language, which is a subset of HashiCorp’s HCL. In fact, our parser builds on top of the open source hashicorp/hcl parser.

All Actions workflow files are valid HCL, but not all HCL files are valid workflows. The Actions workflow parser is stricter, allowing only a specific set of keywords and prohibiting nested objects, among other restrictions. The reason for that is a long-standing goal of Actions: making the Actions editor and the text representation of workflows equivalent and interchangeable. Any file you write in the graphical editor can be expressed in a main.workflow file, of course, but also: any main.workflow can be fully displayed and edited in the graphical editor. There is one exception to this: the graphical editor does not display comments. But it preserves them: changes you make in the graphical editor do not disturb comments you have added to your main.workflow file.
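To make the "stricter than HCL" idea concrete, here is a minimal sketch of the kind of check such a parser performs after the HCL has been parsed. It is not the code in actions/workflow-parser: the block shape ({type, attributes}), the validateBlock name, and the ALLOWED_KEYS table are assumptions for illustration, and the table lists only the attributes that appear in the example above rather than the full set defined by the specification.

```javascript
// Hypothetical, illustrative subset of allowed attributes per block type.
// Only attributes shown in this post's example are listed; the real
// language specification defines the complete set.
const ALLOWED_KEYS = {
  workflow: new Set(['on', 'resolves']),
  action: new Set(['uses', 'runs'])
}

function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value)
}

// Returns a list of human-readable errors; an empty list means the block
// passed these two checks (known keyword, no nested object).
function validateBlock(block) {
  const allowed = ALLOWED_KEYS[block.type]
  if (!allowed) {
    return [`unknown block type "${block.type}"`]
  }
  const errors = []
  for (const [key, value] of Object.entries(block.attributes)) {
    if (!allowed.has(key)) {
      errors.push(`"${key}" is not allowed in a ${block.type} block`)
    } else if (isPlainObject(value)) {
      errors.push(`"${key}" must not be a nested object`)
    }
  }
  return errors
}

// Usage: the "ci" action from the example above, as an already-parsed block.
const ciErrors = validateBlock({
  type: 'action',
  attributes: {uses: 'docker://golang:latest', runs: './script/cibuild'}
})
console.log(ciErrors.length === 0 ? 'ci is valid' : ciErrors.join('\n'))
```

Rejecting anything outside a small, flat vocabulary is what lets a graphical editor display every valid file without losing information.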

Contributing

We have two reasons for releasing the parser. First, we want to encourage the Actions community to build tools that generate and manipulate workflow files. The language specification should help developers understand the language, and the parser should save developers the trouble of writing their own.

Second, we welcome your contributions. If you find bugs, please open an issue or send a pull request. If you want to add a syntax highlighter for a new editor or implement the parser in another language, we welcome that. The only real limitation is on features that go beyond the parser to the rest of Actions. For broader questions and suggestions about the rest of Actions, reach out through support; we’re listening.

Upgrading GitHub from Rails 3.2 to 5.2

2018-09-28T00:00:00+00:00

On August 15th GitHub celebrated a major milestone: our main application is now running on the latest version of Rails: 5.2.1! 🎉

In total the project took a year and a half to upgrade from Rails 3.2 to Rails 5.2. Along the way we took time to clean up technical debt and improve the overall codebase while doing the upgrade. Below we’ll talk about how we upgraded Rails, lessons we learned and whether we’d do it again.

How did we do it?

Upgrading Rails on an application as large and as trafficked as GitHub is no small task. It takes careful planning, good organization, and patience. The upgrade started out as kind of a hobby; engineers would work on it when they had free time. There was no dedicated team. As we made progress and gained traction it became not only something we hoped we could do, but a priority.

Since GitHub is so important to our community, we can’t stop feature development or bug fixes in order to upgrade Rails.

Instead of using a long-running branch to upgrade Rails we added the ability to dual boot the application in multiple versions of Rails. We created two lockfiles: Gemfile.lock for the current version and Gemfile_next.lock for the future version. Dual booting allows us to regularly deploy changes for the next version to GitHub without requiring long-running branches or altering how production works. We do this by conditionally loading the code.

if GitHub.rails_3_2?
  # 3.2 code (i.e. production a year and a half ago)
elsif GitHub.rails_4_2?
  # 4.2 code
else
  # all 5.0+ future code, ensuring we never accidentally
  # fall back into an old version going forward
end

Each time we got a minor version of Rails green we’d make the CI job required for all pushes to the GitHub application and start work on the next version. While we worked on Rails 4.1 a CI job would run on every push for 3.2 and 4.0. When 4.1 was green we’d swap out 4.0 for 4.1 and get to work on 4.2. This allowed us to prevent regressions once a version of Rails was green, and gave engineers time to get used to writing code that worked in multiple versions of Rails.

The two versions that we deployed were 4.2 and 5.2. We deployed 4.2 because it was a huge milestone and was the first version of Rails that hadn’t been EOL’d yet (as an aside: we’d been backporting security fixes to 3.2 but not to 4.0+ so we couldn’t deploy 4.0 or 4.1. Rest assured your security is our top priority).

To roll out the Rails upgrade we created a careful and iterative process. We’d first deploy to our testing environment and request volunteers from each team to click test their area of the codebase to find any regressions the test suite missed.

After fixing those regressions, we deployed in off-hours to a percentage of production servers. During each deploy we’d collect data about exceptions and performance of the site. With that information we’d fix bugs that came up and repeat those steps until the error rate was low enough to be considered equal to the previous version. We merged the upgrade once we could deploy to full production for 30 minutes at peak traffic with no visible impact.

This process allowed us to deploy 4.2 and 5.2 with minimal customer impact and no down time.

Lessons Learned

The Rails upgrade took a year and a half. This was for a few reasons. First, Rails upgrades weren’t always smooth and some versions had major breaking changes. Rails improved the upgrade process for the 5 series so this meant that while 3.2 to 4.2 took 1 year, 4.2 to 5.2 only took 5 months.

Another reason is the GitHub codebase is 10 years old. Over the years technical debt builds and there’s bound to be gremlins lurking in the codebase. If you’re on an old version of Rails, your engineers will need to add more monkey patches or implement features that exist upstream.

Lastly, when we started it wasn’t clear what resources were needed to support the upgrade and since most of us had never done a Rails upgrade before, we were learning as we went. The project originally began with 1 full-time engineer and a small army of volunteers. We grew that team to 4 full-time engineers plus the volunteers. Each version bump meant we learned more and the next version went even faster.

Through this work we learned some important lessons that we hope make your next upgrade easier:

Upgrade early and upgrade often. The closer you are to a new version of Rails, the easier upgrades will be. This encourages your team to fix bugs in Rails instead of monkey-patching the application or reinventing features that exist upstream.
Keep upgrade infrastructure in place. There will always be a new version to upgrade to, so once you’re on a modern version of Rails add a build to run against the master branch. This will catch bugs in Rails and your application early, make upgrades easier, and increase your upstream contributions.
Upstream your tooling instead of rolling your own. The more you push upstream to gems or Rails, the less logic you need in your application. Save your application code for what truly makes your company special (e.g. Pull Requests), instead of tools to make your application run smoothly (e.g. concurrent testing libraries).
Avoid using private APIs in your frameworks. Rails has a lot of code that’s not private but isn’t documented on purpose. That code is subject to change without notice, so code that relies on it can easily break in an upgrade.
Address technical debt often. It’s easy to think “this is working, why mess with it”, but if no one knows how that code works, it can quickly become a bottleneck for upgrades. Try to prevent coupling your application logic too closely to your framework. Ensure that the line where your application ends and your framework begins is clear. You can do this by addressing technical debt before it becomes difficult to remove.
Do incremental upgrades. Each minor version of Rails provides the deprecation warnings for the next version. By upgrading from 3.2 to 4.0, 4.0 to 4.1, etc we were able to identify problems in the next version early and define clear milestones.
Keep up the momentum. Rails upgrades can seem daunting. Create ways in which your team can have quick wins to keep momentum going. Share the responsibility across teams so that everyone is familiar with the new version of the framework and prevent burnout. Once you’re on the newest version add a build to your app that periodically runs your suite against edge Rails so you can catch bugs in your code or your framework early.
Expect things to break. Upgrades are hard and in an application as large as GitHub things are bound to break. While we didn’t take the site down during the upgrade we had issues with CI, local development, slow queries, and other problems that didn’t show up in our CI builds or click testing.

Was it worth it?

Absolutely.

Upgrading Rails has allowed us to address technical debt in our application. Our test suite is now closer to vanilla Rails, we were able to remove StateMachine in favor of Active Record enums, and start replacing our job runner with Active Job. And that’s just the beginning.

Rails upgrades are a lot of hard work and can be time-consuming, but they also open up a ton of possibilities. We can push more of our tooling upstream to Rails, address areas of technical debt, and be one of the largest codebases running on the most recent version of Rails. Not only does this benefit us at GitHub, it benefits our customers and the open source community.

Towards Natural Language Semantic Code Search

2018-09-18T00:00:00+00:00

This blog post complements a live demonstration on our recently announced site: experiments.github.com

Motivation

Searching code on GitHub is currently limited to keyword search. This assumes either the user knows the syntax, or can anticipate what keywords might be in comments surrounding the code they are looking for. Our machine learning scientists have been researching ways to enable the semantic search of code.

To fully grasp the concept of semantic search, consider the below search query, “ping REST api and return results”:

Note that the demonstrated semantic search returns reasonable results even though there are no keywords in common between the search query and the text (the code & comments found do not contain the words “Ping”, “REST” or “api”)! The implications of augmenting keyword search with semantic search are profound. For example, such a capability would expedite the process of on-boarding new software engineers onto projects and bolster the discoverability of code in general.

In this post, we want to share how we are leveraging deep learning to make progress towards this goal. We also share an open source example with code and data that you can use to reproduce these results!

Introduction

One of the key areas of machine learning research underway at GitHub is representation learning of entities, such as repos, code, issues, profiles and users. We have made significant progress towards enabling semantic search by learning representations of code that share a common vector space as text. For example, consider the below diagram:

In the above example, Text 2 (blue) is a reasonable description of the code, whereas Text 1 (red) is not related to the code at all. Our goal is to learn representations where (text, code) pairs that describe the same concept are close neighbors, whereas unrelated (text, code) pairs are further apart. By representing text and code in the same vector space, we can vectorize a user’s search query and look up the nearest neighbor that represents code. Below is a four-part description of the approach we are currently using to accomplish this task:

1. Learning Representations of Code

In order to learn a representation of code, we train a sequence-to-sequence model that learns to summarize code. A way to accomplish this for Python is to supply (code, docstring) pairs where the docstring is the target variable the model is trying to predict. One active area of research for us is incorporating domain-specific optimizations like tree-based LSTMs, gated-graph networks and syntax-aware tokenization. Below is a screenshot that showcases the code summarizer model at work. In this example, there are two Python functions supplied as input, and in both cases the model produces a reasonable summary of the code as output:

It should be noted that in the above examples, the model produces the summary by using the entire code blob, not merely the function name.

Building a code summarizer is a very exciting project on its own; however, we can also utilize the encoder from this model as a general purpose feature extractor for code. After extracting the encoder from this model, we can fine-tune it for the task of mapping code to the vector space of natural language.

We can evaluate this model objectively using the BLEU score. So far we have achieved a BLEU score of 13.5 on a holdout set of Python code, using the fairseq-py library for sequence-to-sequence models.

2. Learning Representations of Text Phrases

In addition to learning a representation for code, we needed to find a suitable representation for short phrases (like sentences found in Python docstrings). Initially, we experimented with the Universal Sentence Encoder, a pre-trained encoder for text that is available on TensorFlow Hub. While its embeddings worked reasonably well, we found that it was advantageous to learn embeddings that were specific to the vocabulary and semantics of software development. One area of ongoing research involves evaluating different domain-specific corpora for training our own model, ranging from GitHub issues to third-party datasets.

To learn this representation of phrases, we trained a neural language model by leveraging the fast.ai library. This library gave us easy access to state of the art architectures such as AWD LSTMs, and to techniques such as cyclical learning rates with random restarts. We extracted representations of phrases from this model by summarizing the hidden states using the concat pooling approach found in this paper.
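Concat pooling, as it is commonly defined, turns a variable-length sequence of hidden states into one fixed-size vector by concatenating the last hidden state with the element-wise max and the element-wise mean over all time steps. The sketch below shows only that pooling step, assuming hidden states arrive as plain arrays of numbers; the concatPool name is hypothetical and this is not the fast.ai implementation.

```javascript
// hiddenStates: array of T vectors (one per token), each of length D.
// Returns a vector of length 3 * D: [last state, element-wise max, element-wise mean].
function concatPool(hiddenStates) {
  if (hiddenStates.length === 0) {
    throw new Error('need at least one hidden state')
  }
  const dim = hiddenStates[0].length
  const last = hiddenStates[hiddenStates.length - 1]
  const max = new Array(dim).fill(-Infinity)
  const mean = new Array(dim).fill(0)

  for (const state of hiddenStates) {
    for (let i = 0; i < dim; i++) {
      if (state[i] > max[i]) max[i] = state[i]
      mean[i] += state[i] / hiddenStates.length
    }
  }
  return [...last, ...max, ...mean]
}

// Usage: two time steps with two-dimensional hidden states.
const phraseVector = concatPool([[1, 4], [3, 2]])
console.log(phraseVector.length) // 6, i.e. 3 * D
```

Because the output size depends only on D, phrases of any length map to vectors that can be compared directly.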

One of the most challenging aspects of this exercise was to evaluate the quality of these embeddings. We are currently building a variety of downstream supervised tasks similar to those outlined here that will aid us in evaluating the quality of these embeddings objectively. In the meantime, we sanity check our embeddings by manually examining the similarity between similar phrases. The below screenshot illustrates examples where we search the vectorized docstrings for similarity against user-supplied phrases:

3. Mapping Code Representations To The Same Vector-Space as Text

Next, we map the code representations we learned from the code summarization model (part 1) to the vector space of text. We accomplish this by fine-tuning the encoder of this model. The inputs to this model are still code blobs; however, the target variable of the model is now the vectorized version of docstrings. These docstrings are vectorized using the approach discussed in the previous section.

Concretely, we perform multi-dimensional regression with cosine proximity loss to bring the hidden state of the encoder into the same vector-space as text.
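As a small illustration of the objective, cosine proximity loss is usually taken to be the negative cosine similarity between the predicted vector and the target vector, so minimizing it pulls the two vectors toward the same direction. The sketch below computes that quantity for a single (code vector, docstring vector) pair; the function names are hypothetical, and a real training setup would compute this over batches inside a deep learning framework.

```javascript
function dot(a, b) {
  let sum = 0
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i]
  return sum
}

function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('vectors must have the same length')
  }
  const denominator = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b))
  if (denominator === 0) {
    throw new Error('cosine similarity is undefined for a zero vector')
  }
  return dot(a, b) / denominator
}

// Loss for one example: -1 when the encoder output points in exactly the
// same direction as the docstring vector, +1 when it points the opposite way.
function cosineProximityLoss(predicted, target) {
  return -cosineSimilarity(predicted, target)
}

// Usage with made-up two-dimensional vectors.
console.log(cosineProximityLoss([2, 0], [4, 0]))
```

Note that the loss ignores vector magnitude: only the angle between the encoder output and the docstring embedding matters.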

We are actively researching alternate approaches that directly learn a joint vector space of code and natural language, borrowing from some ideas outlined here.

4. Creating a Semantic Search System

Finally, after successfully creating a model that can vectorize code into the same vector-space as text, we can create a semantic search mechanism. In its most simple form, we can store the vectorized version of all code in a database, and perform nearest neighbor lookups to a vectorized search query.
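A minimal sketch of that simplest form follows, assuming a small in-memory array stands in for the database and a brute-force scan stands in for a real nearest-neighbor index. The ids and vectors are made up for illustration; real vectors would come from the models described in the previous sections.

```javascript
function cosineSimilarity(a, b) {
  let dotProduct = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB))
}

// index: array of {id, vector} entries, one per vectorized code blob.
// Returns the k entries whose vectors are most similar to the query vector.
function nearestNeighbors(queryVector, index, k = 3) {
  return index
    .map(entry => ({id: entry.id, score: cosineSimilarity(queryVector, entry.vector)}))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
}

// Hypothetical index of three code blobs in a two-dimensional space.
const codeIndex = [
  {id: 'http_get', vector: [1, 0]},
  {id: 'sort_list', vector: [0, 1]},
  {id: 'fetch_json', vector: [0.9, 0.1]}
]

// The query vector would come from the text encoder.
const topMatches = nearestNeighbors([1, 0], codeIndex, 2)
console.log(topMatches.map(match => match.id))
```

A linear scan like this costs one similarity computation per stored vector, so a production system would likely swap it for an approximate nearest-neighbor index.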

Another active area of our research is determining the best way to augment existing keyword search with semantic results and how to incorporate additional information such as context and relevance. Furthermore, we are actively exploring ways to evaluate the quality of search results that will allow us to iterate quickly on this problem. We leave these topics for discussion in a future blog post.

Summary

The below diagram summarizes all the steps in our current semantic-search workflow:

We are exploring ways to improve almost every component of this approach, including data preparation, model architecture, evaluation procedures, and overall system design. What is described in this blog post is only a minimal example that scratches the surface.

Open Source Examples

Our open-source end-to-end tutorial contains a detailed walkthrough of the approach outlined in this blog, along with code and data you can use to reproduce the results.

This open source example (with some modifications) is also used as a tutorial for the kubeflow project, which is implemented here.

Limitations and Intended Use Case(s)

We believe that semantic code search will be most useful for targeted searches of code within specific entities such as repos, organizations or users as opposed to general purpose “how to” queries. The live demonstration of semantic code search hosted on our recently announced Experiments site does not allow users to perform targeted searches of repos. Instead, this demonstration is designed to share a taste of what might be possible and searches only a limited, static set of Python code.

Furthermore, like all machine learning techniques, the efficacy of this approach is limited by the training data used. For example, the data used to train these models are (code, docstring) pairs. Therefore, search queries that closely resemble a docstring have the greatest chance of success. On the other hand, queries that do not resemble a docstring or contain concepts for which there is little data may not yield sensible results. Therefore, it is not difficult to challenge our live demonstration and discover the limitations of this approach. Nevertheless, our initial results indicate that this is an extremely fruitful area of research that we are excited to share with you.

There are many more use cases for semantic code search. For example, we could extend the ideas presented here to allow users to search for code using the language of their choice (French, Mandarin, Arabic, etc.) across many different programming languages simultaneously.

Get In Touch

This is an exciting time for the machine learning research team at GitHub and we are looking to expand. If our work interests you, please get in touch!

Removing jQuery from GitHub.com frontend

2018-09-06T00:00:00+00:00

We have recently completed a milestone where we were able to drop jQuery as a dependency of the frontend code for GitHub.com. This marks the end of a gradual, years-long transition of increasingly decoupling from jQuery until we were able to completely remove the library. In this post, we will explain a bit of history of how we started depending on jQuery in the first place, how we realized when it was no longer needed, and point out that—instead of replacing it with another library or framework—we were able to achieve everything that we needed using standard browser APIs.

Why jQuery made sense early on

GitHub.com pulled in jQuery 1.2.1 as a dependency in late 2007. For a bit of context, that was a year before Google released the first version of their Chrome browser. There was no standard way to query DOM elements by a CSS selector, no standard way to animate visual styles of an element, and the XMLHttpRequest interface pioneered by Internet Explorer was, like many other APIs, inconsistent between browsers.

jQuery made it simple to manipulate the DOM, define animations, and make “AJAX” requests— basically, it enabled web developers to create more modern, dynamic experiences that stood out from the rest. Most importantly of all, the JavaScript features built in one browser with jQuery would generally work in other browsers, too. In those early days of GitHub when most of its features were still getting fleshed out, this allowed the small development team to prototype rapidly and get new features out the door without having to adjust code specifically for each web browser.

The simple interface of jQuery also served as a blueprint to craft extension libraries that would later serve as building blocks for the rest of GitHub.com frontend: pjax and facebox.

We will always be thankful to John Resig and the jQuery contributors for creating and maintaining such a useful and, for the time, essential library.

Web standards in the later years

Over the years, GitHub grew into a company with hundreds of engineers and a dedicated team gradually formed to take responsibility for the size and quality of JavaScript code that we serve to web browsers. One of the things that we’re constantly on the lookout for is technical debt, and sometimes technical debt grows around dependencies that once provided value, but whose value dropped over time.

When it came to jQuery, we compared it against the rapid evolution of web standards supported in modern browsers and realized:

The $(selector) pattern can easily be replaced with querySelectorAll();
CSS classname switching can now be achieved using Element.classList;
CSS now supports defining visual animations in stylesheets rather than in JavaScript;
$.ajax requests can be performed using the Fetch Standard;
The addEventListener() interface is stable enough for cross-platform use;
We could easily encapsulate the event delegation pattern with a lightweight library;
Some syntactic sugar that jQuery provides has become redundant with the evolution of the JavaScript language.
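To illustrate the event delegation item in the list above, here is a minimal sketch built only on the standard addEventListener() and Element.closest() APIs. The delegate function and its signature are assumptions for illustration, not the API of any particular library: a single listener is attached to a root node and the handler runs only when the event originated inside an element matching the selector.

```javascript
// Attach one listener to `root` and invoke `handler` for events that
// originate from (or inside) an element matching `selector`.
// Returns a function that removes the listener again.
function delegate(root, eventName, selector, handler) {
  const listener = event => {
    const target = event.target
    const match = target && typeof target.closest === 'function'
      ? target.closest(selector)
      : null
    if (match) {
      // Mirror the familiar convention of `this` being the matched element.
      handler.call(match, event)
    }
  }
  root.addEventListener(eventName, listener)
  return () => root.removeEventListener(eventName, listener)
}

// Usage in a browser (guarded so the snippet also loads outside one):
if (typeof document !== 'undefined') {
  delegate(document, 'click', '.js-widget', function () {
    this.classList.add('is-loading')
  })
}
```

Because the listener lives on the root, elements added to the page later are handled without any extra wiring.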

Furthermore, the chaining syntax didn’t satisfy how we wanted to write code going forward. For example:

$('.js-widget')
  .addClass('is-loading')
  .show()

This syntax is simple to write, but to our standards, doesn’t communicate intent well. Did the author expect one or more js-widget elements on this page? Also, if we update our page markup and accidentally leave out the js-widget classname, will an exception in the browser inform us that something went wrong? By default, jQuery silently skips the whole expression when nothing matched the initial selector; but to us, such behavior was a bug rather than a feature.
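One way to make that intent explicit with standard APIs is sketched below. The queryOne and showLoading helpers are hypothetical names, not functions from GitHub's codebase, and the sketch assumes the widget is hidden through the standard hidden attribute rather than through inline styles as jQuery's show() handles.

```javascript
// Return the single element matching `selector`, or throw if the page
// contains zero or several matches, so markup mistakes surface immediately.
function queryOne(root, selector) {
  const matches = root.querySelectorAll(selector)
  if (matches.length !== 1) {
    throw new Error(
      `expected exactly one element matching "${selector}", found ${matches.length}`
    )
  }
  return matches[0]
}

// Standards-based counterpart of the chained jQuery call above.
function showLoading(root) {
  const widget = queryOne(root, '.js-widget')
  widget.classList.add('is-loading')
  widget.hidden = false
  return widget
}

// Usage in a browser (guarded so the snippet also loads outside one):
if (typeof document !== 'undefined' && document.querySelector('.js-widget')) {
  showLoading(document)
}
```

The return type is also a single element rather than a wrapper collection, which gives a static type checker something concrete to work with.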

Finally, we wanted to start annotating types with Flow to perform static type checking at build time, and we concluded that the chaining syntax doesn’t lend itself well to static analysis, since almost every result of a jQuery method call is of the same type. We chose Flow over alternatives because, at the time, features such as @flow weak mode allowed us to progressively and efficiently start applying types to a codebase which was largely untyped.

All in all, decoupling from jQuery would mean that we could rely on web standards more, have MDN web docs be de-facto default documentation for our frontend developers, maintain more resilient code in the future, and eventually drop a 30 kB dependency from our packaged bundles, speeding up page load times and JavaScript execution times.

Incremental decoupling

Even with an end goal in sight, we knew that it wouldn’t be feasible to just allocate all resources we had to rewriting everything from jQuery to vanilla JS. If anything, such a rushed endeavor would likely lead to many regressions in site functionality that we would later have to weed out. Instead, we:

Set up metrics that tracked the ratio of jQuery calls used per overall line of code and monitored that graph over time to make sure that it’s either staying constant or going down, not up.
We discouraged importing jQuery in any new code. To facilitate that using automation, we created eslint-plugin-jquery which would fail CI checks if anyone tried to use jQuery features, for example $.ajax.
There were now plenty of violations of eslint rules in old code, all of which we’ve annotated with specific eslint-disable rules in code comments. To the reader of that code, those comments would serve as a clear signal that this code doesn’t represent our current coding practices.
We created a pull request bot that would leave a review comment on a pull request pinging our team whenever somebody tried to add a new eslint-disable rule. This way we would get involved in code review early and suggest alternatives.
A lot of old code had explicit coupling to external interfaces of pjax and facebox jQuery plugins, so we’ve kept their interfaces relatively the same while we’ve internally replaced their implementation with vanilla JS. Having static type checking helped us have greater confidence around those refactorings.
Plenty of old code interfaced with rails-behaviors, our adapter for the Ruby on Rails approach to “unobtrusive” JS, in a way that they would attach an AJAX lifecycle handler to certain forms:
```
  // LEGACY APPROACH
  $(document).on('ajaxSuccess', 'form.js-widget', function(event, xhr, settings, data) {
    // insert response data somewhere into the DOM
  })
```
Instead of having to rewrite all of those call sites at once to the new approach, we’ve opted to trigger fake ajax* lifecycle events and keep these forms submitting their contents asynchronously as before; only this time fetch() was used internally.
We maintained a custom build of jQuery and whenever we’ve identified that we’re not using a certain module of jQuery anymore, we would remove it from the custom build and ship a slimmer version. For instance, after we have removed the final usage of jQuery-specific CSS pseudo-selectors such as :visible or :checkbox, we were able to remove the Sizzle module; and when the last $.ajax call was replaced with fetch(), we were able to remove the AJAX module. This served a dual purpose: speeding up JavaScript execution times while at the same time ensuring that no new code is created that would try using the removed functionality.
We kept dropping support for old Internet Explorer versions as soon as it would be feasible to, as informed by our site analytics. Whenever use of a certain IE version dropped below a certain threshold, we would stop serving JavaScript to it and focus on testing against and supporting more modern browsers. Dropping support for IE 8–9 early on allowed us to adopt many native browser features that would otherwise be hard to polyfill.
As part of our refined approach to building frontend features on GitHub.com, we focused on getting away with regular HTML foundation as much as we could, and only adding JavaScript behaviors as progressive enhancement. As a result, even those web forms and other UI elements that were enhanced using JS would usually also work with JavaScript disabled in the browser. In some cases, we were able to delete certain legacy behaviors altogether instead of having to rewrite them in vanilla JS.
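The first item in the list above, tracking the ratio of jQuery calls per line of code, could be approximated with a few lines of script. This is only a rough sketch under stated assumptions: the jqueryCallRatio name is hypothetical, the regular expression is a crude heuristic that counts `$(`, `$.`, `jQuery(` and `jQuery.` occurrences, and the post does not say how the real metric was computed.

```javascript
// Ratio of (heuristically detected) jQuery calls to non-blank lines of code.
function jqueryCallRatio(source) {
  const lines = source.split('\n').filter(line => line.trim() !== '')
  if (lines.length === 0) return 0
  const calls = source.match(/(?:\$|\bjQuery)\s*[.(]/g) || []
  return calls.length / lines.length
}

// Usage on a small, made-up source file.
const sampleSource = [
  "$('.js-widget').show()",
  'const count = 1',
  "$.ajax('/status')",
  "element.classList.add('is-loading')"
].join('\n')

console.log(jqueryCallRatio(sampleSource))
```

Recording this number for every build and plotting it over time gives the kind of graph described above, where the only acceptable directions are flat or down.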

With these and similar efforts combined over the years, we were able to gradually reduce our dependence on jQuery until there was not a single line of code referencing it anymore.

Custom Elements

One technology that has been making waves in the recent years is Custom Elements: a component library native to the browser, which means that there are no additional bytes of a framework for the user to download, parse and compile.

We had created a few Custom Elements based on the v0 specification since 2014. However, as standards were still in flux back then, we did not invest as much. It was not until 2017 when the Web Components v1 spec was released and implemented in both Chrome and Safari that we began to adopt Custom Elements on a wider scale.

During the jQuery migration, we looked for patterns that would be suitable for extraction as custom elements. For example, we converted our facebox usage for displaying modal dialogs to the <details-dialog> element.

Our general philosophy of striving for progressive enhancement extends to custom elements as well. This means that we keep as much of the content in markup as possible and only add behaviors on top of that. For example, <local-time> shows the raw timestamp by default and gets upgraded to translate the time to the local timezone, while <details-dialog>, when nested in the <details> element, is interactive even without JavaScript, but gets upgraded with accessibility enhancements.

Here is an example of how a <local-time> custom element could be implemented:

// The local-time element displays time in the user's current timezone
// and locale.
//
// Example:
//   <local-time datetime="2018-09-06T08:22:49Z">Sep 6, 2018</local-time>
//
class LocalTimeElement extends HTMLElement {
  static get observedAttributes() {
    return ['datetime']
  }

  attributeChangedCallback(attrName, oldValue, newValue) {
    if (attrName === 'datetime') {
      const date = new Date(newValue)
      this.textContent = date.toLocaleString()
    }
  }
}

if (!window.customElements.get('local-time')) {
  window.LocalTimeElement = LocalTimeElement
  window.customElements.define('local-time', LocalTimeElement)
}

One aspect of Web Components that we’re looking forward to adopting is Shadow DOM. The powerful nature of Shadow DOM has the potential to unlock a lot of possibilities for the web, but that also makes it harder to polyfill. Because polyfilling it today incurs a performance penalty even for code that manipulates parts of the DOM unrelated to web components, it is unfeasible for us to start using it in production.

Polyfills

These are the polyfills that helped us transition to using standard browser features. We try to serve most of these polyfills only when absolutely necessary, i.e. to outdated browsers as part of a separate “compatibility” JavaScript bundle.

GLB: GitHub’s open source load balancer

2018-08-08T00:00:00+00:00

At GitHub, we serve tens of thousands of requests every second out of our network edge, operating on GitHub’s metal cloud. We’ve previously introduced GLB, our scalable load balancing solution for bare metal datacenters, which powers the majority of GitHub’s public web and git traffic, as well as fronting some of our most critical internal systems such as highly available MySQL clusters. Today we’re excited to share more details about our load balancer’s design, as well as release the GLB Director as open source.

GLB Director is a Layer 4 load balancer which scales a single IP address across a large number of physical machines while attempting to minimise connection disruption during any change in servers. GLB Director does not replace services like haproxy and nginx, but rather is a layer in front of these services (or any TCP service) that allows them to scale across multiple physical machines without requiring each machine to have unique IP addresses.

Scaling an IP using ECMP

The basic property of a Layer 4 load balancer is the ability to take a single IP address and spread inbound connections across multiple servers. To scale a single IP address to handle more traffic than any single machine can process, we need not only to split traffic amongst backend servers, but also to scale out the servers that handle the load balancing themselves. This is essentially another layer of load balancing.

Typically we think of an IP address as referencing a single physical machine, and routers as moving a packet to the next closest router to that machine. In the simplest case where there’s always a single best next hop, routers pick that hop and forward all packets there until the destination is reached.

In reality, most networks are far more complicated. There is often more than a single path available between two machines, for example where multiple ISPs are available or even when two routers are joined together with more than one physical cable to increase capacity and provide redundancy. This is where Equal-Cost Multi-Path (ECMP) routing comes into play: rather than picking a single best next hop, routers that have multiple hops with the same cost (usually defined as the number of ASes to the destination) instead hash traffic so that connections are balanced across all available paths of equal cost.

ECMP is implemented by hashing each packet to determine a relatively consistent selection of one of the available paths. The hash function used here varies by device, but typically it’s a consistent hash based on the source and destination IP address as well as the source and destination port for TCP traffic. This means that multiple packets for the same ongoing TCP connection will typically traverse the same path, meaning that packets will arrive in the same order even when paths have different latencies. Notably in this case, the paths can change without any disruption to connections because they will always end up at the same destination server, and at that point the path it took is mostly irrelevant.
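As a rough illustration of that per-flow hashing, the sketch below picks a path from the four fields named above. The `hash32` helper, the path names and the packet shape are assumptions made for this example; real routers use their own device-specific hash functions.

```javascript
// 32-bit string hash: FNV-1a followed by the MurmurHash3 finalizer.
// A stand-in only, not what any particular router implements.
function hash32(str) {
  let h = 0x811c9dc5
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i)
    h = Math.imul(h, 0x01000193)
  }
  h ^= h >>> 16
  h = Math.imul(h, 0x85ebca6b)
  h ^= h >>> 13
  h = Math.imul(h, 0xc2b2ae35)
  h ^= h >>> 16
  return h >>> 0
}

// Pick one of the equal-cost paths using only the flow's addresses and ports.
function pickPath(packet, paths) {
  const key = [packet.srcIp, packet.srcPort, packet.dstIp, packet.dstPort].join('|')
  return paths[hash32(key) % paths.length]
}

const paths = ['path-a', 'path-b', 'path-c']
const syn = {srcIp: '198.51.100.7', srcPort: 51234, dstIp: '192.0.2.10', dstPort: 443, flags: 'SYN'}
const ack = {...syn, flags: 'ACK'}

// Both packets belong to the same TCP connection, so they take the same path.
console.log(pickPath(syn, paths) === pickPath(ack, paths)) // true
```

Because nothing but the flow's addresses and ports feeds the hash, the router needs no per-connection memory to keep a connection on one path.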

An alternative use of ECMP can come into play when we want to shard traffic across multiple servers rather than to the same server over multiple paths. Each server can announce the same IP address with BGP or another similar network protocol, causing connections to be sharded across those servers, with the routers blissfully unaware that the connections are being handled in different places, not all ending on the same machine as would traditionally be the case.

While this shards traffic as we had hoped, it has one huge drawback: when the set of servers that are announcing the same IP changes (or any path or router along the way changes), connections must rebalance to maintain an equal balance of connections on each server. Routers are typically stateless devices, simply making the best decision for each packet without consideration to the connection it is a part of, which means some connections will break in this scenario.

In the above example on the left, we can imagine that each colour represents an active connection. A new proxy server is added to announce the same IP. The router diligently adjusts the consistent hash to move 1/3 of connections to the new server while keeping 2/3 of connections where they were. Unfortunately for the 1/3 of connections that were already in progress, the packets are now arriving on a server that doesn’t know about the connection, and so they fail.
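The effect can be reproduced in a small simulation. This is only a sketch: the router is modelled as a stateless "highest hash wins" choice, and `hash32`, the proxy names and the flow keys are all invented for the example.

```javascript
// 32-bit string hash: FNV-1a followed by the MurmurHash3 finalizer (a stand-in).
function hash32(str) {
  let h = 0x811c9dc5
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i)
    h = Math.imul(h, 0x01000193)
  }
  h ^= h >>> 16
  h = Math.imul(h, 0x85ebca6b)
  h ^= h >>> 13
  h = Math.imul(h, 0xc2b2ae35)
  h ^= h >>> 16
  return h >>> 0
}

// Stateless server choice for a flow: the server with the highest
// hash(server, flow) wins. No per-connection state is kept anywhere.
function pickServer(flowKey, servers) {
  let best = null
  let bestScore = -1
  for (const server of servers) {
    const score = hash32(server + '|' + flowKey)
    if (score > bestScore) {
      best = server
      bestScore = score
    }
  }
  return best
}

const flows = []
for (let i = 0; i < 3000; i++) flows.push('flow-' + i)

const before = ['proxy1', 'proxy2']
const after = ['proxy1', 'proxy2', 'proxy3']

// Established flows whose packets now arrive at a different server.
const moved = flows.filter(f => pickServer(f, before) !== pickServer(f, after))
console.log(moved.length / flows.length)
console.log(moved.every(f => pickServer(f, after) === 'proxy3')) // true
```

With a well-mixed hash the printed fraction should be close to one third. Every moved flow lands on the new proxy, which has no TCP state for it.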

Split director/proxy load balancer design

The issue with the previous ECMP-only solution is that it isn’t aware of the full context for a given packet, nor is it able to store data for each packet/connection. As it turns out, there are commonly used patterns to help out with this situation by implementing some stateful tracking in software, typically using a tool like Linux Virtual Server (LVS). We create a new tier of “director” servers that take packets from the router via ECMP, but rather than relying on the router’s ECMP hashing to choose the backend proxy server, we instead control the hashing and store state (which backend was chosen) for all in-progress connections. When we change the set of proxy tier servers, the director tier hopefully hasn’t changed, and our connection will continue.

Although this works well in many cases, it does have some drawbacks. In the above example, we add both an LVS director and backend proxy server at the same time. The new director receives some set of packets, but doesn’t have any state yet (or has delayed state), so it hashes each packet as a new connection and may get it wrong (and cause the connection to fail). A typical workaround with LVS is to use multicast connection syncing to keep the connection state shared amongst all LVS director servers. This still requires connection state to propagate, and also still requires duplicate state - not only does each proxy need state for each connection in the Linux kernel network stack, but every LVS director also needs to store a mapping of connection to backend proxy server.

Removing all state from the director tier

When we were designing GLB, we decided we wanted to improve on this situation and not duplicate state at all. GLB takes a different approach to that described above, by using the flow state already stored in the proxy servers as part of maintaining established Linux TCP connections from clients.

For each incoming connection, we pick a primary and secondary server that could handle that connection. When a packet arrives on the primary server and isn’t valid, it is forwarded to the secondary server. The hashing to choose the primary/secondary server is done once, up front, and is stored in a lookup table, and so doesn’t need to be recalculated on a per-flow or per-packet basis. When a new proxy server is added, for 1/N connections it becomes the new primary, and the old primary becomes the secondary. This allows existing flows to complete, because the proxy server can make the decisions with its local state, the single source of truth. Essentially this gives packets a “second chance” at arriving at the expected server that holds their state.

Even though the director will still send connections to the wrong server, that server will then know how to forward on the packet to the correct server. The GLB director tier is completely stateless in terms of TCP flows: director servers can come and go at any time, and will always pick the same primary/secondary server providing their forwarding tables match (but they rarely change). To change proxies, some care needs to be taken, which we describe below.

Maintaining invariants: rendezvous hashing

The core of the GLB Director design comes down to picking that primary and secondary server consistently, and to allow the proxy tier servers to drain and fill as needed. We consider each proxy server to have a state, and carefully adjust the state as a way of adding and removing servers.

We create a static binary forwarding table, which is generated identically on each director server, to map incoming flows to a given primary and secondary server. Rather than having complex logic to pick from all available servers at packet processing time, we instead use some indirection by creating a table (65k rows), with each row containing a primary and secondary server IP address. This is stored in memory as a flat array of binary data, taking about 512 KB per table. When a packet arrives, we consistently hash it (based on packet data alone) to the same row in that table (using the hash as an index into the array), which provides a consistent primary and secondary server pair.
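The size follows from the row count if each row holds two 4-byte IPv4 addresses. The sketch below assumes exactly that layout (the real table format may carry more than this) and shows the lookup as plain array indexing; the function names are made up for the example.

```javascript
const ROWS = 65536
const ROW_BYTES = 8 // 4-byte primary IPv4 + 4-byte secondary IPv4 (assumed layout)

// One flat block of binary data for the whole table.
const table = new Uint8Array(ROWS * ROW_BYTES)
console.log(table.length / 1024) // 512

// Write a primary/secondary pair into a row.
function setRow(row, primary, secondary) {
  table.set(primary.split('.').map(Number), row * ROW_BYTES)
  table.set(secondary.split('.').map(Number), row * ROW_BYTES + 4)
}

// Use the low 16 bits of the flow hash as the row index.
function lookup(flowHash) {
  const offset = (flowHash & 0xffff) * ROW_BYTES
  const ip = start => Array.from(table.subarray(start, start + 4)).join('.')
  return {primary: ip(offset), secondary: ip(offset + 4)}
}

setRow(0x1234, '10.0.0.1', '10.0.0.2')
console.log(lookup(0xabcd1234)) // { primary: '10.0.0.1', secondary: '10.0.0.2' }
```

No per-flow state or per-packet server ranking is involved: the packet hash selects a row, and the row already holds the answer.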

We want each server to appear approximately equally in both the primary and secondary fields, and to never appear in both in the same row. When we add a new server, we desire some rows to have their primary become secondary, and the new server become primary. Similarly, we desire the new server to become secondary in some rows. When we remove a server, in any rows where it was primary, we want the secondary to become primary, and another server to pick up secondary.

This sounds complex, but can be summarised succinctly with a few invariants:

As we change the set of servers, the relative order of existing servers should be maintained.
The order of servers should be computable without any state other than the list of servers (and maybe some predefined seeds).
Each server should appear at most once in each row.
Each server should appear approximately an equal number of times in each column.

Reading the problem that way, Rendezvous hashing is an ideal choice, since it can trivially satisfy these invariants. Each server (in our case, the IP) is hashed along with the row number, the servers are sorted by that hash (which is just a number), and we get a unique order for servers for that given row. We take the first two as the primary and secondary respectively.

Relative order will be maintained because the hash for each server will be the same regardless of which other servers are included. The only information required to generate the table is the IPs of the servers. Since we’re just sorting a set of servers, the servers only appear once. Finally, if we use a good hash function that is pseudo-random, the ordering will be pseudo-random, and so the distribution will be even as we expect.
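A minimal sketch of that table generation follows. The `hash32` function is a stand-in chosen for the example (not the hash GLB itself uses), the table is shortened to 4096 rows, and the IP addresses are made up.

```javascript
// 32-bit string hash: FNV-1a followed by the MurmurHash3 finalizer (a stand-in).
function hash32(str) {
  let h = 0x811c9dc5
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i)
    h = Math.imul(h, 0x01000193)
  }
  h ^= h >>> 16
  h = Math.imul(h, 0x85ebca6b)
  h ^= h >>> 13
  h = Math.imul(h, 0xc2b2ae35)
  h ^= h >>> 16
  return h >>> 0
}

// For every row, rank the servers by hash(server, row) and keep the
// top two as primary and secondary.
function buildTable(servers, rows) {
  const table = []
  for (let row = 0; row < rows; row++) {
    const ranked = servers
      .map(ip => ({ip, score: hash32(ip + '|' + row)}))
      .sort((a, b) => b.score - a.score || (a.ip < b.ip ? -1 : 1))
    table.push({primary: ranked[0].ip, secondary: ranked[1].ip})
  }
  return table
}

const ROWS = 4096 // shortened for the example
const three = buildTable(['10.0.0.1', '10.0.0.2', '10.0.0.3'], ROWS)
const four = buildTable(['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.4'], ROWS)

// Wherever the primary changed, the new server took over and the old
// primary became the secondary.
const orderKept = four.every((row, i) =>
  row.primary === three[i].primary ||
  (row.primary === '10.0.0.4' && row.secondary === three[i].primary)
)
console.log(orderKept) // true
```

The check at the end is the first invariant in action: adding `10.0.0.4` never reorders the existing servers, so a displaced primary always remains reachable as the secondary.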

Draining, filling, adding and removing proxies

Adding or removing proxy servers requires some care in our design. This is because a forwarding table entry only defines a primary/secondary proxy, so the draining/failover only works with at most a single proxy host in draining. We define the following valid states and state transitions for a proxy server:

When a proxy server is active, draining or filling, it is included in the forwarding table entries. In a stable state, all proxy servers are active, and the rendezvous hashing described above will have an approximately even and random distribution of each proxy server in both the primary and secondary columns.

As a proxy server transitions to draining, we adjust the entries in the forwarding table by swapping the primary and secondary entries we would have otherwise included:

This has the effect of sending packets to the server that was previously secondary first. Since it receives the packets first, it will accept SYN packets and therefore take any new connections. For any packet it doesn’t understand as relating to a local flow, it forwards it to the other server (the previous primary), which allows existing connections to complete.
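The swap itself is small. Here is a sketch over an in-memory table of `{primary, secondary}` rows; the row shape, the `applyDraining` name and the addresses are assumptions for illustration.

```javascript
// Return a copy of the forwarding table with primary and secondary swapped
// in every row where the draining server would have been primary.
function applyDraining(table, drainingIp) {
  return table.map(row =>
    row.primary === drainingIp
      ? {primary: row.secondary, secondary: row.primary}
      : row
  )
}

const table = [
  {primary: '10.0.0.1', secondary: '10.0.0.2'},
  {primary: '10.0.0.2', secondary: '10.0.0.3'},
  {primary: '10.0.0.3', secondary: '10.0.0.1'}
]

const drained = applyDraining(table, '10.0.0.2')
console.log(drained[1]) // { primary: '10.0.0.3', secondary: '10.0.0.2' }
```

Rows where the draining server is already secondary are left untouched, so after the swap it is never the first server to see a packet.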

This has the effect of draining the desired server of connections gracefully, after which point it can be removed completely, and proxies can shuffle in to fill the empty secondary slots:

A node in filling looks just like active, since the table inherently allows a second chance:

This implementation requires that no more than one proxy server at a time is in any state other than active, which in practice has worked well at GitHub. The state changes to proxy servers can happen as quickly as the longest connection duration that needs to be maintained. We’re working on extensions to the design that support more than just a primary and secondary, and some components (like the header listed below) already include initial support for arbitrary server lists.

Encapsulation within the datacenter

We now have an algorithm to consistently pick backend proxy servers and operate on them, but how do we actually move packets around the datacenter? How do we encode the secondary server inside the packet so the primary can forward a packet it doesn’t understand?

Traditionally in the LVS setup, an IP over IP (IPIP) tunnel is used. The client IP packet is encapsulated inside an internal datacenter IP packet and forwarded on to the proxy server, which decapsulates it. We found that it was difficult to encode the additional server metadata inside IPIP packets, as the only standard space available was the IP Options, and our datacenter routers passed packets with unknown IP options to software for processing (which they called “Layer 2 slow path”), taking speeds from millions to thousands of packets per second.

To avoid this, we needed to hide the data inside a different packet format that the router wouldn’t try to understand. We initially adopted raw Foo-over-UDP (FOU) with a custom Generic Routing Encapsulation (GRE) payload, essentially encapsulating everything inside a UDP packet. We recently transitioned to Generic UDP Encapsulation (GUE), which is a layer on top of FOU which provides a standard for encapsulating IP protocols inside a UDP packet. We place our secondary server’s IP inside the private data of the GUE header. From a router’s perspective, these packets are all internal datacenter UDP packets between two normal servers.

 0                   1                   2                   3  
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+\
|          Source port          |        Destination port       | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ UDP
|             Length            |            Checksum           | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+/
| 0 |C|   Hlen  |  Proto/ctype  |             Flags             | GUE
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Private data type (0)     |  Next hop idx |   Hop count   |\
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
|                             Hop 0                             | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ GLB
|                              ...                              | private
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ data
|                             Hop N                             | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+/
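To make the layout concrete, the sketch below packs the GUE header and GLB private data from the diagram into bytes (the UDP header is left out). Several details are assumptions rather than facts from this post: that Hlen counts the 32-bit words following the first four bytes, that no GUE flags are set, that hops are IPv4 addresses, and the function names themselves.

```javascript
// Convert dotted-quad IPv4 text into 4 bytes.
function ipv4Bytes(ip) {
  return ip.split('.').map(Number)
}

// Encode the GUE header plus the GLB private data: one metadata word
// followed by one 32-bit word per hop.
function encodeGueHeader(innerProto, hops, nextHopIdx = 0) {
  const privateWords = 1 + hops.length
  const buf = new Uint8Array(4 + privateWords * 4)
  const view = new DataView(buf.buffer)
  buf[0] = privateWords & 0x1f // version 0, C bit 0, Hlen in 32-bit words (assumed)
  buf[1] = innerProto          // IP protocol number of the inner packet, 4 for IPv4
  view.setUint16(2, 0)         // flags: none set (assumed)
  view.setUint16(4, 0)         // private data type (0)
  buf[6] = nextHopIdx          // next hop idx
  buf[7] = hops.length         // hop count
  hops.forEach((ip, i) => buf.set(ipv4Bytes(ip), 8 + i * 4))
  return buf
}

// A header that carries a single hop: the secondary server.
const header = encodeGueHeader(4, ['10.0.0.2'])
console.log(header.length) // 12
```

Because the hop list is a counted array, the same layout extends from a single secondary to a longer list of servers.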

Another benefit to using UDP is that the source port can be filled in with a per-connection hash so that connections flow within the datacenter over different paths (where ECMP is used within the datacenter), and are received on different RX queues on the proxy server’s NIC (which similarly use a hash of TCP/IP header fields). This is not possible with IPIP because most commodity datacenter NICs are only able to understand plain IP, TCP/IP and UDP/IP (and a few others). Notably, the NICs we use cannot look inside IP/IP packets.

When the proxy server wants to send a packet back to the client, it doesn’t need to be encapsulated or travel back through our director tier; it can be sent directly to the client (often called “Direct Server Return”). This is typical of this sort of load balancer design and is especially useful for content providers where the majority of traffic flows outbound with a relatively small amount of traffic inbound.

This leaves us with a packet flow that looks like the following:

DPDK for 10G+ line rate packet processing

Since we first publicly discussed our initial design, we’ve completely rewritten glb-director to use DPDK, an open source project that allows very fast packet processing from userland by bypassing the Linux kernel. This has allowed us to achieve NIC line rate processing on commodity NICs with commodity CPUs, and allows us to trivially scale our director tier to handle as much inbound traffic as our public connectivity requires. This is particularly important during DDoS attacks, where we do not want our load balancer to be a bottleneck.

One of our initial goals with GLB was that our load balancer could run on commodity datacenter hardware without any server-specific physical configuration. Both GLB director and proxy servers are provisioned like normal servers in our datacenter. Each server has a bonded pair of network interfaces, and those interfaces are shared between DPDK and Linux on GLB director servers.

Modern NICs support SR-IOV, a technology that enables a single NIC to act like multiple NICs from the perspective of the operating system. This is typically used by virtual machine hypervisors to ask the real NIC (“Physical Function”) to create multiple pretend NICs for each VM (“Virtual Functions”). To enable DPDK and the Linux kernel to share NICs, we use flow bifurcation, which sends specific traffic (destined to GLB-run IP addresses) to our DPDK process on a Virtual Function while leaving the rest of the packets with the Linux kernel’s networking stack on the Physical Function.

We’ve found that the packet processing rates of DPDK on a Virtual Function are acceptable for our requirements. GLB Director uses a DPDK Packet Distributor pattern to spread the work of encapsulating packets across any number of CPU cores on the machine, and since it is stateless this can be highly parallelised.

GLB Director supports matching and forwarding inbound IPv4 and IPv6 packets containing TCP payloads, as well as inbound ICMP Fragmentation Required messages used as part of Path MTU Discovery, by peeking into the inner layers of the packet during matching.

Bringing test suites to DPDK with Scapy

One problem that typically arises in creating (or using) technologies that operate at high speeds due to using low-level primitives (like communicating with the NIC directly) is that they become significantly more difficult to test. As part of creating the GLB Director, we also created a test environment that supports simple end-to-end packet flow testing of our DPDK application, by leveraging the way DPDK provides an Environment Abstraction Layer (EAL) that allows a physical NIC and a libpcap-based local interface to appear the same from the view of the application.

This allowed us to write tests in Scapy, a wonderfully simple Python library for reading, manipulating and writing packet data. By creating a Linux Virtual Ethernet Device, with Scapy on one side and DPDK on the other, we were able to pass in custom crafted packets and validate what our software would provide on the other side, being a fully GUE-encapsulated packet directed to the expected backend proxy server.

This allows us to test more complex behaviours such as traversing layers of ICMPv4/ICMPv6 headers to retrieve the original IPs and TCP ports for correct forwarding of ICMP messages from external routers.

Healthchecking of proxies for auto-failover

Part of the design of GLB is to handle server failure gracefully. The current design of having a designated primary/secondary for a given forwarding table entry / client means that we can work around single-server failure by running health checks from the perspective of each director. We run a service called glb-healthcheck which continually validates each backend server’s GUE tunnel and arbitrary HTTP port.

When a server fails, we swap the primary/secondary entries anywhere that server is primary. This performs a “soft drain” of the server, which provides the best chance for connections to gracefully fail over. If the healthcheck failure is a false positive, connections won’t be disrupted; they will just traverse a slightly different path.

Second chance on proxies with iptables

The final component that makes up GLB is a Netfilter module and iptables target that runs on every proxy server and allows the “second chance” design to function.

This module performs a simple task: it decides whether the inner TCP/IP packet inside every GUE packet is valid locally according to the Linux kernel TCP stack, and if it isn’t, forwards it to the next proxy server (the secondary) rather than decapsulating it locally.

In the case where a packet is a SYN (new connection) or is valid locally for an established connection, it simply accepts it locally. We then use the Linux kernel 4.x GUE support provided as part of the fou module to receive the GUE packet and process it locally.
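The decision logic can be sketched in a few lines. This is not the Netfilter module's code: `localFlows` is a hypothetical stand-in for the kernel's knowledge of established connections, and accepting a packet when no further hop remains is an assumption of the sketch.

```javascript
// Decide what a proxy does with the inner TCP packet of a GUE packet.
function secondChance(packet, localFlows, nextHop) {
  const key = `${packet.srcIp}:${packet.srcPort}->${packet.dstIp}:${packet.dstPort}`
  // New connections and flows this host already knows about stay here.
  if (packet.syn || localFlows.has(key)) {
    return {action: 'accept'}
  }
  // Assumption: with no hop left, hand the packet to the local TCP stack.
  if (!nextHop) {
    return {action: 'accept'}
  }
  // Unknown flow: give the next proxy its chance.
  return {action: 'forward', to: nextHop}
}

const established = new Set(['198.51.100.7:51234->192.0.2.10:443'])
const known = {srcIp: '198.51.100.7', srcPort: 51234, dstIp: '192.0.2.10', dstPort: 443, syn: false}
const unknown = {srcIp: '198.51.100.9', srcPort: 40000, dstIp: '192.0.2.10', dstPort: 443, syn: false}

console.log(secondChance(known, established, '10.0.0.2').action) // accept
console.log(secondChance(unknown, established, '10.0.0.2')) // { action: 'forward', to: '10.0.0.2' }
```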

Available today as open source

When we started down the path of writing a better datacenter load balancer, we decided that we wanted to release it open source so that others could benefit from and share in our work. We’re excited to be releasing all the components discussed here as open source at github/glb-director. We hope this will allow others to reuse our work and contribute to a common standard software load balancing solution that runs on commodity hardware in physical datacenter environments.

Also, we’re hiring!

GLB and the GLB Director have been an ongoing project designed, authored, reviewed and supported by various members of GitHub’s Production Engineering organisation, including @joewilliams, @nautalice, @ross, @theojulienne and many others. If you’re interested in joining us in building great infrastructure projects like GLB, our Data Center team is hiring production engineers specialising in Traffic Systems, Network and Facilities.

MySQL High Availability at GitHub

2018-06-20T00:00:00+00:00

GitHub uses MySQL as its main datastore for all things non-git, and its availability is critical to GitHub’s operation. The site itself, GitHub’s API, authentication and more, all require database access. We run multiple MySQL clusters serving our different services and tasks. Our clusters use a classic master-replicas setup, where a single node in a cluster (the master) is able to accept writes. The rest of the cluster nodes (the replicas) asynchronously replay changes from the master and serve our read traffic.

The availability of master nodes is particularly critical. With no master, a cluster cannot accept writes: any writes that need to be persisted cannot be persisted. Any incoming changes such as commits, issues, user creation, reviews, new repositories, etc., would fail.

To support writes we clearly need to have an available writer node, a master of a cluster. But just as important, we need to be able to identify, or discover, that node.

On a failure, say a master box crash scenario, we must ensure the existence of a new master, as well as be able to quickly advertise its identity. The time it takes to detect a failure, run the failover and advertise the new master’s identity makes up the total outage time.

This post illustrates GitHub’s MySQL high availability and master service discovery solution, which allows us to reliably run a cross-data-center operation, be tolerant of data center isolation, and achieve short outage times on a failure.

High availability objectives

The solution described in this post iterates on, and improves, previous high availability (HA) solutions implemented at GitHub. As we scale, our MySQL HA strategy must adapt to changes. We wish to have similar HA strategies for our MySQL and for other services within GitHub.

When considering high availability and service discovery, some questions can guide your path into an appropriate solution. An incomplete list may include:

How much outage time can you tolerate?
How reliable is crash detection? Can you tolerate false positives (premature failovers)?
How reliable is failover? Where can it fail?
How well does the solution work cross-data-center? On low and high latency networks?
Will the solution tolerate a complete data center (DC) failure or network isolation?
What mechanism, if any, prevents or mitigates split-brain scenarios (two servers claiming to be the master of a given cluster, both independently and unknowingly to each other accepting writes)?
Can you afford data loss? To what extent?

To illustrate some of the above, let’s first consider our previous HA iteration, and why we changed it.

Moving away from VIP and DNS based discovery

In our previous iteration, we used:

orchestrator for detection and failover, and
VIP and DNS for master discovery.

In that iteration, clients discovered the writer node by using a name, e.g. mysql-writer-1.github.net. The name resolved to a Virtual IP address (VIP) which the master host would acquire.

Thus, on a normal day, clients would just resolve the name, connect to the resolved IP, and find the master listening on the other side.

Consider this replication topology, spanning three different data centers:

In the event of a master failure, a new server, one of the replicas, must be promoted in its place.

orchestrator will detect a failure, promote a new master, and then act to reassign the name/VIP. Clients don’t actually know the identity of the master: all they have is a name, and that name must now resolve to the new master. However, consider:

VIPs are cooperative: they are claimed and owned by the database servers themselves. To acquire or release a VIP, a server must send an ARP request. The server owning the VIP must first release it before the newly promoted master acquires it. This has some undesired effects:

An orderly failover operation will first contact the dead master and request that it release the VIP, and then contact the newly promoted master and request that it grab the VIP. What if the old master cannot be reached or refuses to release the VIP? Given that there’s a failure scenario on that server in the first place, it is not unlikely that it would fail to respond in a timely manner, or indeed respond at all.
- We can end up with a split-brain: two hosts claiming to have the same VIP. Different clients may connect to either of those servers, depending on the shortest network path.
- The source of truth here depends on the cooperation of two independent servers, and this setup is unreliable.
Even if the old master does cooperate, the workflow wastes precious time: the switch to the new master waits while we contact the old master.
And even as the VIP changes, existing client connections are not guaranteed to disconnect from the old server, and we may still experience a split-brain.

In parts of our setup VIPs are bound by physical location. They are owned by a switch or a router. Thus, we can only reassign the VIPs onto co-located servers. In particular, in some cases we cannot assign the VIP to a server promoted in a different data center, and must make a DNS change.

DNS changes take longer to propagate. Clients cache DNS names for a preconfigured time. A cross-DC failover implies more outage time: it will take more time to make all clients aware of the identity of the new master.

These limitations alone were enough to push us in search of a new solution, but there were further considerations:

Masters were self-injecting heartbeats via the pt-heartbeat service, for the purpose of lag measurement and throttling control. The service had to be kicked off on the newly promoted master. If possible, the service would be shut down on the old master.
Likewise, Pseudo-GTID injection was self-managed by the masters. It would need to kick off on the new master, and preferably stop on the old master.
The new master was set as writable. The old master was to be set as read_only, if possible.

These extra steps were a contributing factor to the total outage time and introduced their own failures and friction.

The solution worked, and GitHub has had successful MySQL failovers that went well under the radar, but we wanted our HA to improve on the following:

Be data center agnostic.
Be tolerant of data center failure.
Remove unreliable cooperative workflows.
Reduce total outage time.
As much as possible, have lossless failovers.

GitHub’s HA solution: orchestrator, Consul, GLB

Our new strategy, along with collateral improvements, solves or mitigates many of the concerns above. In today’s HA setup, we have:

orchestrator to run detection and failovers. We use a cross-DC orchestrator/raft setup as depicted below.
Hashicorp’s Consul for service discovery.
GLB/HAProxy as a proxy layer between clients and writer nodes. Our GLB director is open sourced.
anycast for network routing.

The new setup removes VIP and DNS changes altogether. And while we introduce more components, we are able to decouple the components and simplify the task, as well as be able to utilize solid and stable solutions. A breakdown follows.

A normal flow

On a normal day the apps connect to the write nodes through GLB/HAProxy.

The apps are never aware of the master’s identity. As before, they use a name. For example, the master for cluster1 would be mysql-writer-1.github.net. In our current setup, however, this name gets resolved to an anycast IP.

With anycast, the name resolves to the same IP everywhere, but traffic is routed differently based on a client’s location. In particular, in each of our data centers we have GLB, our highly available load balancer, deployed on multiple boxes. Traffic to mysql-writer-1.github.net always routes to the local data center’s GLB cluster. Thus, all clients are served by local proxies.

We run GLB on top of HAProxy. Our HAProxy has writer pools: one pool per MySQL cluster, where each pool has exactly one backend server: the cluster’s master. All GLB/HAProxy boxes in all DCs have the exact same pools, and they all indicate the exact same backend servers in these pools. Thus, if an app wishes to write to mysql-writer-1.github.net, it matters not which GLB server it connects to. It will always get routed to the actual cluster1 master node.

As far as the apps are concerned, discovery ends at GLB, and there is never a need for re-discovery. It’s all on GLB to route the traffic to the correct destination.

How does GLB know which servers to list as backends, and how do we propagate changes to GLB?

Discovery via Consul

Consul is well known as a service discovery solution, and also offers DNS services. In our solution, however, we utilize it as a highly available key-value (KV) store.

Within Consul’s KV store we write the identities of cluster masters. For each cluster, there’s a set of KV entries indicating the cluster’s master fqdn, port, ipv4, ipv6.

Each GLB/HAProxy node runs consul-template: a service that listens on changes to Consul data (in our case: changes to clusters masters data). consul-template produces a valid config file and is able to reload HAProxy upon changes to the config.

Thus, a change in Consul to a master’s identity is observed by each GLB/HAProxy box, which then reconfigures itself, sets the new master as the single entity in a cluster’s backend pool, and reloads to reflect those changes.
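The shape of that regeneration step can be sketched as a pure function from KV entries to a backend stanza. This is not consul-template syntax, and the key paths, backend naming and host values below are hypothetical; only the idea of one pool with a single server derived from the master's KV entries comes from the text.

```javascript
// Render a writer pool with exactly one backend server: the cluster's master.
function renderWriterPool(cluster, kv) {
  const prefix = `mysql/master/${cluster}/` // hypothetical key layout
  const fqdn = kv[prefix + 'fqdn']
  const ipv4 = kv[prefix + 'ipv4']
  const port = kv[prefix + 'port']
  return [
    `backend mysql_${cluster}_writer`,
    `    # current master: ${fqdn}`,
    `    server master ${ipv4}:${port} check`
  ].join('\n')
}

const kv = {
  'mysql/master/cluster1/fqdn': 'db-1a.example.net',
  'mysql/master/cluster1/ipv4': '10.0.0.11',
  'mysql/master/cluster1/port': '3306'
}
console.log(renderWriterPool('cluster1', kv))

// After a failover the KV entries change, and so does the rendered pool.
const kvAfter = {
  ...kv,
  'mysql/master/cluster1/fqdn': 'db-1b.example.net',
  'mysql/master/cluster1/ipv4': '10.0.0.12'
}
console.log(renderWriterPool('cluster1', kvAfter))
```

Since the output depends only on the KV contents, every proxy that reads the same values renders the same pool.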

At GitHub we have a Consul setup in each data center, and each setup is highly available. However, these setups are independent of each other. They do not replicate between each other and do not share any data.

How does Consul get told of changes, and how is the information distributed cross-DC?

orchestrator/raft

We run an orchestrator/raft setup: orchestrator nodes communicate to each other via raft consensus. We have one or two orchestrator nodes per data center.

orchestrator is charged with failure detection, with MySQL failover, and with communicating the change of master to Consul. Failover is operated by the single orchestrator/raft leader node, but the change, the news that a cluster now has a new master, is propagated to all orchestrator nodes through the raft mechanism.

As orchestrator nodes receive the news of a master change, they each communicate to their local Consul setups: they each invoke a KV write. DCs with more than one orchestrator representative will have multiple (identical) writes to Consul.

Putting the flow together

In a master crash scenario:

The orchestrator nodes detect failures.
The orchestrator/raft leader kicks off a recovery. A new master gets promoted.
orchestrator/raft advertises the master change to all raft cluster nodes.
Each orchestrator/raft member receives a master change notification. They each update the local Consul’s KV store with the identity of the new master.
Each GLB/HAProxy has consul-template running, which observes the change in Consul’s KV store, and reconfigures and reloads HAProxy.
Client traffic gets redirected to the new master.

There is a clear ownership of responsibilities for each component, and the entire design is both decoupled and simplified. orchestrator doesn’t know about the load balancers. Consul doesn’t need to know where the information came from. Proxies only care about Consul. Clients only care about the proxy.

Furthermore:

There are no DNS changes to propagate.
There is no TTL.
The flow does not need the dead master’s cooperation. It is largely ignored.

Additional details

To further secure the flow, we also have the following:

HAProxy is configured with a very short hard-stop-after. When it reloads with a new backend server in a writer-pool, it automatically terminates any existing connections to the old master.
- With hard-stop-after we don’t even require cooperation from the clients, and this mitigates a split-brain scenario. It’s noteworthy that this isn’t hermetic, and some time passes before we kill old connections. But there’s then a point in time after which we’re comfortable and expect no nasty surprises.
We do not strictly require Consul to be available at all times. In fact, we only need it to be available at failover time. If Consul happens to be down, GLB continues to operate with the last known values and makes no drastic moves.
GLB is set to validate the identity of the newly promoted master. Similarly to our context-aware MySQL pools, a check is made on the backend server, to confirm it is indeed a writer node. If we happen to delete the master’s identity in Consul, no problem; the empty entry is ignored. If we mistakenly write the name of a non-master server in Consul, no problem; GLB will refuse to update it and keep running with last known state.
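The "keep the last known state" behavior described above can be sketched in a few lines of Ruby. This is an illustration under assumptions, not GLB's actual code: `Backend`, `next_backend`, and the `writer_check` callable are hypothetical names, and the writer check stands in for whatever probe confirms that a server accepts writes.

```ruby
Backend = Struct.new(:host, :port)

# kv_entry:     hash read from the KV store, or nil when Consul is unreachable.
# writer_check: callable returning true only if the server accepts writes.
def next_backend(current, kv_entry, writer_check)
  return current if kv_entry.nil?                     # Consul down: keep last known value
  return current if kv_entry[:host].to_s.empty?       # identity deleted: ignore the empty entry

  candidate = Backend.new(kv_entry[:host], kv_entry[:port])
  return current unless writer_check.call(candidate)  # not a writer: refuse the update

  candidate
end

current   = Backend.new("db-old", 3306)
is_writer = ->(backend) { backend.host == "db-new" }  # stand-in for a real health check

p next_backend(current, nil, is_writer).host
p next_backend(current, { host: "db-new", port: 3306 }, is_writer).host
```

Every rejection path returns the current backend unchanged, so a bad or missing KV value never leaves a pool empty.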

We further tackle concerns and pursue HA objectives in the following sections.

orchestrator/raft failure detection

orchestrator uses a holistic approach to detecting failure, and as such it is very reliable. We do not observe false positives: we do not have premature failovers, and thus do not suffer unnecessary outage time.

orchestrator/raft further tackles the case for a complete DC network isolation (aka DC fencing). A DC network isolation can cause confusion: servers within that DC can talk to each other. Is it they that are network isolated from other DCs, or is it other DCs that are being network isolated?

In an orchestrator/raft setup, the raft leader node is the one to run the failovers. A leader is a node that gets the support of the majority of the group (quorum). Our orchestrator node deployment is such that no single data center makes a majority, and any n-1 DCs do.

In the event of a complete DC network isolation, the orchestrator nodes in that DC get disconnected from their peers in other DCs. As a result, the orchestrator nodes in the isolated DC cannot be the leaders of the raft cluster. If any such node did happen to be the leader, it steps down. A new leader will be assigned from any of the other DCs. That leader will have the support of all the other DCs, which are capable of communicating between themselves.

Thus, the orchestrator node that calls the shots will be one that is outside the network isolated data center. Should there be a master in an isolated DC, orchestrator will initiate the failover to replace it with a server in one of the available DCs. We mitigate DC isolation by delegating the decision making to the quorum in the non-isolated DCs.
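The deployment rule above (no single data center holds a majority, while any n-1 data centers do) is easy to check mechanically. The sketch below is illustrative only; the `isolation_safe?` helper and the DC names are hypothetical, and the node counts are examples rather than our actual layout.

```ruby
# nodes_per_dc maps a data center name to its number of orchestrator nodes.
def isolation_safe?(nodes_per_dc)
  counts = nodes_per_dc.values
  total  = counts.sum
  quorum = total / 2 + 1   # smallest strict majority

  no_single_dc_majority  = counts.all? { |n| n < quorum }
  any_n_minus_1_majority = counts.all? { |n| total - n >= quorum }

  no_single_dc_majority && any_n_minus_1_majority
end

p isolation_safe?({ "dc1" => 1, "dc2" => 1, "dc3" => 1 })
p isolation_safe?({ "dc1" => 2, "dc2" => 2, "dc3" => 1 })
p isolation_safe?({ "dc1" => 2, "dc2" => 1 })
```

The last layout fails the check: with two of three nodes in one data center, that data center alone forms a quorum, and losing it leaves the remaining node unable to elect a leader.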

Quicker advertisement

Total outage time can further be reduced by advertising the master change sooner. How can that be achieved?

When orchestrator begins a failover, it observes the fleet of servers available to be promoted. Understanding replication rules and abiding by hints and limitations, it is able to make an educated decision on the best course of action.

It may recognize that a server available for promotion is also an ideal candidate, such that:

There is nothing to prevent the promotion of the server (and potentially the user has hinted that such server is preferred for promotion), and
The server is expected to be able to take all of its siblings as replicas.

In such a case orchestrator proceeds to first set the server as writable, and immediately advertises the promotion of the server (writes to Consul KV in our case), even while asynchronously beginning to fix the replication tree, an operation that will typically take a few more seconds.

It is likely that by the time our GLB servers have been fully reloaded, the replication tree is already intact, but it is not strictly required. The server is good to receive writes!

Semi-synchronous replication

In MySQL’s semi-synchronous replication a master does not acknowledge a transaction commit until the change is known to have shipped to one or more replicas. It provides a way to achieve lossless failovers: any change applied on the master is either applied or waiting to be applied on one of the replicas.

Consistency comes with a cost: a risk to availability. Should no replica acknowledge receipt of changes, the master will block and writes will stall. Fortunately, there is a timeout configuration, after which the master can revert back to asynchronous replication mode, making writes available again.

We have set our timeout at a reasonably low value: 500ms. It is more than enough to ship changes from the master to local DC replicas, and typically also to remote DCs. With this timeout we observe perfect semi-sync behavior (no fallback to asynchronous replication), as well as feel comfortable with a very short blocking period in case of acknowledgement failure.
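To make the trade-off concrete, here is a toy Ruby model of a single commit under semi-sync with a timeout. It is a simplification under assumptions, not MySQL's implementation: `commit` is a hypothetical helper, and a `nil` latency stands for "no replica acknowledged". In MySQL itself the timeout is controlled by the `rpl_semi_sync_master_timeout` variable, expressed in milliseconds.

```ruby
SEMI_SYNC_TIMEOUT_MS = 500   # the value quoted in the text

# ack_latency_ms: time until the first replica acknowledges, or nil if none does.
# Returns how the commit completed and how long the writer was blocked.
def commit(ack_latency_ms, timeout_ms: SEMI_SYNC_TIMEOUT_MS)
  if ack_latency_ms && ack_latency_ms <= timeout_ms
    { mode: :semi_sync, blocked_ms: ack_latency_ms }
  else
    # No acknowledgement in time: the master gives up waiting and
    # falls back to asynchronous replication.
    { mode: :async_fallback, blocked_ms: timeout_ms }
  end
end

p commit(2)     # local DC replica acknowledges quickly
p commit(80)    # remote DC replica, still well within the timeout
p commit(nil)   # no replica acknowledges: blocked for the full timeout
```

The worst case for a writer is bounded by the timeout, which is why a low value keeps the blocking period short when acknowledgements fail.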

We enable semi-sync on local DC replicas, and in the event of a master’s death, we expect (though do not strictly enforce) a lossless failover. Lossless failover on a complete DC failure is costly and we do not expect it.

While experimenting with semi-sync timeout, we also observed a behavior that plays to our advantage: we are able to influence the identity of the ideal candidate in the event of a master failure. By enabling semi-sync on designated servers, and by marking them as candidates, we are able to reduce total outage time by affecting the outcome of a failure. In our experiments we observe that we typically end up with the ideal candidates, and hence run quick advertisements.

Heartbeat injection

Instead of managing the startup/shutdown of the pt-heartbeat service on promoted/demoted masters, we opted to run it everywhere at all times. This required some patching so as to make pt-heartbeat comfortable with servers either changing their read_only state back and forth or completely crashing.

In our current setup pt-heartbeat services run on masters and on replicas. On masters, they generate the heartbeat events. On replicas, they identify that the servers are read-only and routinely recheck their status. As soon as a server is promoted as master, pt-heartbeat on that server identifies the server as writable and begins injecting heartbeat events.
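The always-on behavior can be pictured as a loop whose every iteration rechecks the server's state. The Ruby sketch below is not the patched pt-heartbeat (which is a Perl tool); `FakeServer` and `heartbeat_tick` are hypothetical stand-ins that only illustrate the decision made on each iteration.

```ruby
# Stand-in for a MySQL server; a real check would query the read_only variable.
class FakeServer
  attr_accessor :read_only
  attr_reader :heartbeats

  def initialize(read_only:)
    @read_only  = read_only
    @heartbeats = []
  end

  def write_heartbeat(timestamp)
    raise "server is read-only" if read_only
    @heartbeats << timestamp
  end
end

# One iteration of an always-on heartbeat loop.
def heartbeat_tick(server, now)
  return :waiting if server.read_only   # replica: do nothing, recheck on the next tick
  server.write_heartbeat(now)           # master: inject a heartbeat event
  :injected
rescue StandardError
  :unavailable                          # crashed or unreachable: keep looping, never exit
end

server = FakeServer.new(read_only: true)
p heartbeat_tick(server, Time.now)   # replica: nothing is written
server.read_only = false             # the server gets promoted
p heartbeat_tick(server, Time.now)   # the same loop now injects heartbeats
```

Because the loop never exits on a read-only or unreachable server, nothing needs to start or stop the service during a failover.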

orchestrator ownership delegation

We further delegated to orchestrator:

Pseudo-GTID injection,
Setting the promoted master as writable, clearing its replication state, and
Setting the old master as read_only, if possible.

On all things new-master, this reduces friction. A master that is just being promoted is clearly expected to be alive and accessible, or else we would not promote it. It makes sense, then, to let orchestrator apply changes directly to the promoted master.

Limitations and drawbacks

The proxy layer makes the apps unaware of the master’s identity, but it also masks the apps’ identities from the master. All the master sees are connections coming from the proxy layer, and we lose information about the actual source of the connection.

As distributed systems go, we are still left with unhandled scenarios.

Notably, on a data center isolation scenario, and assuming a master is in the isolated DC, apps in that DC are still able to write to the master. This may result in state inconsistency once the network is brought back up. We are working to mitigate this split-brain by implementing a reliable STONITH from within the very isolated DC. As before, some time will pass before bringing down the master, and there could be a short period of split-brain. The operational cost of avoiding split-brains altogether is very high.

More scenarios exist: the outage of Consul at the time of the failover; partial DC isolation; others. We understand that with distributed systems of this nature it is impossible to close all of the loopholes, so we focus on the most important cases.

The results

Our orchestrator/GLB/Consul setup provides us with:

Reliable failure detection,
Data center agnostic failovers,
Typically lossless failovers,
Data center network isolation support,
Split-brain mitigation (more in the works),
No cooperation dependency,
Between 10 and 13 seconds of total outage time in most cases.
- We see up to 20 seconds of total outage time in less frequent cases, and up to 25 seconds in extreme cases.

Conclusion

The orchestration/proxy/service-discovery paradigm uses well known and trusted components in a decoupled architecture, which makes it easier to deploy, operate and observe, and where each component can independently scale up or down. We continue to seek improvements as we continuously test our setup.

Performance Impact of Removing OOBGC

2018-05-18T00:00:00+00:00

Until last week, GitHub used an Out of Band Garbage Collector (OOBGC) in production. Since removing it, we decreased CPU time across our production machines by 10%. Let’s talk about what an OOBGC is, when to use it, and when not to use it. Then we’ll follow up with some statistics about the impact of removing it from GitHub’s stack.

What is an Out of Band Garbage Collector?

An OOBGC is not really a Garbage Collector, but more of a technique to use when deciding when to collect garbage in your program. Instead of allowing the GC to run normally, the GC is stopped before processing a web request, then restarted after the response has been sent to the client. This means that garbage collection occurs “out of band” of request and response processing.

When to use an Out of Band Garbage Collector

Ruby’s GC is a “stop the world, mark and sweep” collector, which means that when the GC runs, your program pauses, and when the GC finishes, your program resumes. The time your program is paused is called “pause time”, and while your program is paused it can’t do anything. Historically, Ruby’s GC would pause the program for long periods of time. We would rather clients not wait around for the GC to run, so only executing GC after each request made sense.
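For readers who have not seen the technique, here is a minimal sketch of it as a Rack-style middleware. It is not the OOBGC GitHub ran in production: the `OutOfBandGC` class is hypothetical, and a real implementation would hook in after the response has actually been written to the client rather than when the app returns. `GC.disable`, `GC.enable`, and `GC.start` are standard Ruby methods.

```ruby
# Rack-style middleware: no garbage collection while a request is handled,
# then a forced collection once the app has produced its response.
class OutOfBandGC
  def initialize(app)
    @app = app
  end

  def call(env)
    GC.disable        # keep the collector from pausing the request
    @app.call(env)
  ensure
    GC.enable         # turn the collector back on...
    GC.start          # ...and collect "out of band", between requests
  end
end

hello = ->(_env) { [200, { "content-type" => "text/plain" }, ["ok"]] }
status, _headers, body = OutOfBandGC.new(hello).call({})
puts "#{status} #{body.join}"
```

Note that the worker is busy collecting after every request, whether or not that request produced enough garbage to justify it.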

When not to use an Out of Band Garbage Collector

In the past years, Ruby’s Garbage Collector has undergone many performance improvements. These changes include: becoming a generational collector, incremental marking, and lazy sweeping. A generational collector reduces the overall amount of work the GC needs to do. Incremental marking and lazy sweeping mean that the GC can do its work in small steps interleaved with your program, rather than in one long pause. What these techniques add up to is less time spent in GC, and higher throughput of your program.

Since the OOBGC runs the GC after the response is finished, it can cause the web worker to take longer in order to be ready to process the next incoming request. This means that clients can suffer from latency due to queuing wait times.

If a particular request doesn’t allocate enough garbage to warrant a GC execution under normal conditions, then the OOBGC could cause the process to do more work than it would have without the OOBGC.

Finally, the OOBGC can cause full collections (examining old and new objects) which defeats the generational GC optimizations.
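One way to observe this on your own application is to compare minor and major collection counts from `GC.stat` around a piece of code. The sketch below assumes a Ruby version that exposes the `:minor_gc_count` and `:major_gc_count` keys (2.2 or later); the `collections_during` helper is hypothetical.

```ruby
# Counts the minor (young objects only) and major (full) collections
# that happen while the given block runs.
def collections_during
  before = GC.stat
  yield
  after = GC.stat
  {
    minor: after[:minor_gc_count] - before[:minor_gc_count],
    major: after[:major_gc_count] - before[:major_gc_count]
  }
end

# A forced GC.start performs a full mark by default, so it shows up as a major GC.
p collections_during { GC.start }

# Ordinary allocation is left to the generational collector.
p collections_during { 100_000.times { "short-lived string".dup } }
```

Running the same measurement around real request handling, with and without a forced collection, gives a quick sense of how much extra full-collection work an OOBGC adds.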

GitHub has been using Ruby in production for a long time, and at the time adding an OOBGC made sense and worked well. However, it is always good to question assumptions, especially after technological advancements such as the improvements made in Ruby’s GC. We wanted to see if running an OOBGC was still necessary for our application after upgrading to Ruby 2.4, so we decided to remove it and observe the impact.

Impact of removing the OOBGC

After removing the OOBGC, we saw a 10% drop in Kubernetes cluster CPU utilization:

This graph compares cluster CPU utilization from the current day, previous day, and previous week:

The blue line is CPU utilization for the day the patch went out. You can see a great drop around 15:20.

This graph shows the difference in core utilization before and after OOBGC removal. In other words “number of cores used yesterday” minus “number of cores used today”:

We saw a savings of between 400 and around 1000 cores depending on usage at that point in the day.

Finally, removing OOBGC reduced average response times by about 25% (the gray line is with OOBGC, the blue line is without):

Of course, removing OOBGC was not an all around win. Incremental marking and lazy sweeping amortize the cost of memory collection over time. This means that memory usage will increase on average, and that is what we observed in production:

Conclusion

For our application, the CPU savings far outweighed the price we had to pay in average memory usage. Removing the OOBGC resulted in great savings for our systems. Taking measurements, acting on data, and questioning assumptions are some of the most difficult and fun parts of being an engineer. This time it paid off for us, and hopefully this post can help you too!

Improving your OSS dependency workflow with Licensed

2018-03-07T00:00:00+00:00

GitHub recently open sourced Licensed in the hopes that it is as helpful to the OSS community as it has been to us.

<disclaimer>
1 of 1 consulted lawyers agree, Licensed is not a replacement for the legal advice of a human.
</disclaimer>

Glossary

Before we go any further, let’s review a few terms that will be repeated throughout this article:

Dependency: An external software package used in an application
- i.e. packages that are required or imported like Octokit, ActiveRecord, React
Dependency source: A class that can enumerate dependencies for an application
- i.e. by invoking a package management tool such as bundler, npm, bower, or cabal.

What is Licensed?

Licensed helps GitHub engineers make efficient use of OSS by surfacing potential problems with a dependency’s license early in our development cycle, ensuring we maintain dependency license documentation throughout our development cycle.

In practice, enumerating dependencies can be difficult. In the easiest scenario a package manager provides a full listing of project dependencies in a parseable file. More difficult scenarios require detailed knowledge of CLI tools, such as using go list for a general purpose Golang solution or ghc-pkg for Haskell package managers.

How Licensed works

Licensed works in any Git repository to find, cache and check license metadata for dependencies. It can detect dependencies from multiple language types and package managers across multiple projects in a single repository. This flexibility allows Licensed to work equally well for a monolith repository as it would for a repository containing a single project.

Licensed uses a configuration file to determine how and where to enumerate dependencies for a repository. Configuration files specify one or more Licensed applications, where an application describes a location to enumerate dependencies and a directory to store metadata. For more information on configuration files and Licensed applications, see the Licensed documentation.

Finding license metadata

Licensed enumerates dependencies for each application’s source path found in the configuration. For each dependency found, Licensed finds the dependency source location in the local environment and extracts its basic metadata (e.g. name, version, homepage and summary).

Licensed uses Licensee to determine each dependency’s license(s) and find its license text (e.g. LICENSE) from the local dependency source location.

Caching license metadata

Once Licensed has the dependency’s metadata, it caches the metadata and license information for the project at the cache path(s) specified in the Licensed configuration file.

Storing the dependency data in a source control repository enables checking dependency data as part of the development workflow. Requiring updates to license data whenever dependencies change forces the license data to stay up to date and relevant.

Keeping the cached data in a source control repository also means you automatically get a history of every dependency change in a single location. Tracking down when a specific dependency changed becomes easier when there is a common location and fewer commits to look through.

Many dependencies’ licenses require distributing a copy of the licenses when used in downstream projects. Licensed makes it easy to automate the build and distribution of these licenses, which together form an open source bill of materials for your project, along with the project source.

Checking license metadata

Lastly, Licensed is used to report any dependencies needing review. When checking dependency licenses, Licensed performs the following verifications:

Verify cached license metadata exists for the dependency
Verify the cached metadata is for the correct dependency version
Verify the cached metadata has license text
Verify the cached metadata uses an allowed license, or the dependency has been reviewed and accepted
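The four verifications can be sketched as a single function. This is an illustration, not Licensed's actual implementation: the hash shapes, the `ALLOWED_LICENSES` list, and the `check_dependency` helper are all hypothetical.

```ruby
ALLOWED_LICENSES = %w[mit apache-2.0 bsd-3-clause].freeze   # example allow list

# dep:      { name:, version: } as enumerated from the project.
# cached:   { version:, license:, text: } read from the cache, or nil if absent.
# reviewed: names of dependencies that were manually reviewed and accepted.
# Returns a list of problems; an empty list means the dependency passes.
def check_dependency(dep, cached, allowed: ALLOWED_LICENSES, reviewed: [])
  return ["cached license metadata is missing"] if cached.nil?

  errors = []
  errors << "cached metadata is for a different version" unless cached[:version] == dep[:version]
  errors << "cached metadata has no license text" if cached[:text].to_s.strip.empty?
  unless allowed.include?(cached[:license]) || reviewed.include?(dep[:name])
    errors << "license needs review"
  end
  errors
end

dep = { name: "example-gem", version: "1.2.0" }
p check_dependency(dep, { version: "1.2.0", license: "mit", text: "MIT License..." })
p check_dependency(dep, { version: "1.1.0", license: "gpl-3.0", text: "" })
```

A CI job built on a check like this only needs to fail when the returned list is non-empty, which matches the "little impact unless something changed" experience described below.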

Licensed as part of the developer workflow at GitHub

GitHub engineers have a shared responsibility to ensure that their projects stay compliant with our OSS license requirements.

As the first line of defense in ensuring that dependencies meet our OSS license requirements, each repository has a CI job that checks dependency licenses. This process generally has little impact on developers, and only requires additional effort when a change might not meet our requirements.

When a license needs to be updated, it’s easy to do:

A developer opens a pull request that includes changes to the project dependencies
The repository CI job shows dependency license(s) need review, providing feedback on next steps for resolving the errors
The developer caches license data for the updated dependencies, including the metadata files in the pull request
The repository CODEOWNERS file requests a review from subject matter experts
The subject matter expert reviews the changes and provides guidance to resolve any remaining questions.

This process works very well at GitHub. Involving subject matter experts early in the process reduces friction on the developer and prevents the developer from adding dependencies into the product under license terms that don’t meet our requirements.

Extending Licensed for new dependency sources

Whenever a new project is started, we always try to use the best tool for the job. In many cases this means a new language or framework that isn’t supported by Licensed. To handle these cases, we’ve made adding a new dependency source to Licensed as easy as possible.

Creating new dependency sources in Licensed is easy. Here is a simple example:

module MyProject
  class MySource

    # Required.  I need a configuration for basic functionality
    def initialize(config)
      @config = config
    end

    # Required.  Tell the world the name of the dependency source
    def type
      "my source"
    end

    # Required.  Give the world the dependencies found for `@config`
    def dependencies
      # Will this parse a package manager file?
      # Will this use CLI tools to find dependencies?
      # Nope!  I'm a hardcoded list!
      [
        Dependency.new(
          @config.source_path, # location used to find license text (e.g. LICENSE)
          name: "licensed",
          type: type,
          homepage: "https://github.com/github/licensed",
          version: "0.13.0",
          summary: "Extract and validate the licenses of dependencies."
        )
      ]
    end
  end
end

Next steps

Future development for Licensed will focus on

Reducing friction when using Licensed in developer workflows
Reducing friction when adding new dependency sources
Adding new dependency sources :smile:

Licensed isn’t just about open source, it is open source itself. Interested in adapting the tool to your team’s workflow or adding support for your favorite package manager? We’d love your help.

February 28th DDoS Incident Report

2018-03-01T00:00:00+00:00

On Wednesday, February 28, 2018 GitHub.com was unavailable from 17:21 to 17:26 UTC and intermittently unavailable from 17:26 to 17:30 UTC due to a distributed denial-of-service (DDoS) attack. We understand how much you rely on GitHub and we know the availability of our service is of critical importance to our users. To note, at no point was the confidentiality or integrity of your data at risk. We are sorry for the impact of this incident and would like to describe the event, the efforts we’ve taken to drive availability, and how we aim to improve response and mitigation moving forward.

Background

Cloudflare described an amplification vector using memcached over UDP in their blog post this week, “Memcrashed - Major amplification attacks from UDP port 11211”. The attack works by abusing memcached instances that are inadvertently accessible on the public internet with UDP support enabled. Spoofing of IP addresses allows memcached’s responses to be targeted against another address, like ones used to serve GitHub.com, and send more data toward the target than needs to be sent by the unspoofed source. The vulnerability via misconfiguration described in the post is somewhat unique amongst that class of attacks because the amplification factor is up to 51,000, meaning that for each byte sent by the attacker, up to 51KB is sent toward the target.

Over the past year we have deployed additional transit to our facilities. We’ve more than doubled our transit capacity during that time, which has allowed us to withstand certain volumetric attacks without impact to users. We’re continuing to deploy additional transit capacity and develop robust peering relationships across a diverse set of exchanges. Even still, attacks like this sometimes require the help of partners with larger transit networks to provide blocking and filtering.

The incident

Between 17:21 and 17:30 UTC on February 28th we identified and mitigated a significant volumetric DDoS attack. The attack originated from over a thousand different autonomous systems (ASNs) across tens of thousands of unique endpoints. It was an amplification attack using the memcached-based approach described above that peaked at 1.35Tbps via 126.9 million packets per second.
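A little arithmetic on the published figures gives a sense of scale. The Ruby sketch below uses only the numbers quoted in this post (1.35Tbps, 126.9 million packets per second, and an amplification factor of up to 51,000); the helper names are hypothetical. The attack did not necessarily reach the maximum amplification factor, so the second figure is only a lower bound on the spoofed request traffic.

```ruby
PEAK_BITS_PER_SEC    = 1.35e12    # 1.35Tbps
PEAK_PACKETS_PER_SEC = 126.9e6    # 126.9 million packets per second
MAX_AMPLIFICATION    = 51_000     # upper bound quoted for memcached over UDP

# Average size of a packet arriving at the target, in bytes.
def average_packet_bytes(bits_per_sec, packets_per_sec)
  bits_per_sec / 8 / packets_per_sec
end

# Request traffic (bits per second) needed to produce a given amount of
# response traffic at a given amplification factor.
def request_bits_per_sec(response_bits_per_sec, amplification)
  response_bits_per_sec / amplification
end

avg  = average_packet_bytes(PEAK_BITS_PER_SEC, PEAK_PACKETS_PER_SEC)
mbps = request_bits_per_sec(PEAK_BITS_PER_SEC, MAX_AMPLIFICATION) / 1e6

puts "average packet size: about #{avg.round} bytes"
puts "request traffic at maximum amplification: about #{mbps.round(1)} Mbps"
```

The average packet comes out at roughly 1,330 bytes, and at the maximum amplification factor the spoofed requests behind a 1.35Tbps flood would total only about 26.5Mbps.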

At 17:21 UTC our network monitoring system detected an anomaly in the ratio of ingress to egress traffic and notified the on-call engineer and others in our chat system. This graph shows inbound versus outbound throughput over transit links:

Given the increase in inbound transit bandwidth to over 100Gbps in one of our facilities, the decision was made to move traffic to Akamai, who could help provide additional edge network capacity. At 17:26 UTC the command was initiated via our ChatOps tooling to withdraw BGP announcements over transit providers and announce AS36459 exclusively over our links to Akamai. Routes reconverged in the next few minutes and access control lists mitigated the attack at their border. Monitoring of transit bandwidth levels and load balancer response codes indicated a full recovery at 17:30 UTC. At 17:34 UTC routes to internet exchanges were withdrawn as a follow-up to shift an additional 40Gbps away from our edge.

The first portion of the attack peaked at 1.35Tbps and there was a second 400Gbps spike a little after 18:00 UTC. This graph provided by Akamai shows inbound traffic in bits per second that reached their edge:

Next steps

Making GitHub’s edge infrastructure more resilient to current and future conditions of the internet and less dependent upon human involvement requires better automated intervention. We’re investigating the use of our monitoring infrastructure to automate enabling DDoS mitigation providers and will continue to measure our response times to incidents like this with a goal of reducing mean time to recovery (MTTR).

We’re going to continue to expand our edge network and strive to identify and mitigate new attack vectors before they affect your workflow on GitHub.com.

We know how much you rely on GitHub for your projects and businesses to succeed. We will continue to analyze this and other events that impact our availability, build better detection systems, and streamline response.

Weak cryptographic standards removal notice

2018-02-01T00:00:00+00:00

Last year we announced the deprecation of several weak cryptographic standards. Then we provided a status update toward the end of last year outlining some changes we’d made to make the transition easier for clients. We quickly approached the February 1, 2018 cutoff date we mentioned in previous posts and, as a result, pushed back our schedule by one week. On February 8, 2018 we’ll start disabling the following:

TLSv1/TLSv1.1: This applies to all HTTPS connections, including web, API, and git connections to https://github.com and https://api.github.com.
diffie-hellman-group1-sha1: This applies to all SSH connections to github.com
diffie-hellman-group14-sha1: This applies to all SSH connections to github.com

We’ll disable the algorithms in two stages:

February 8, 2018 19:00 UTC (11:00 am PST): Disable deprecated algorithms for one hour
February 22, 2018 19:00 UTC (11:00 am PST): Permanently disable deprecated algorithms

While only a small fraction of traffic currently makes use of the deprecated algorithms, and many clients will automatically transition and start using the new algorithms, there is invariably going to be a small fraction of clients that will be impacted. We expect most of these are older systems that are no longer maintained, but continue to access Git/the GitHub API using the deprecated algorithms. To help mitigate this, we will temporarily disable support for the deprecated algorithms for one hour on February 8, 2018 19:00 UTC. By disabling support for the deprecated algorithms for a small window, these systems will temporarily fail to connect to GitHub. We will then restore support for the deprecated algorithms and provide a two week grace period for these systems to upgrade their libraries before we disable support for the deprecated algorithms permanently on February 22, 2018.

Known incompatible clients

As noted above, the vast majority of traffic should be unaffected by this change. However, there are a few remaining clients that we anticipate will be affected. Fortunately, the majority of clients can be updated to work with TLSv1.2.
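If you are unsure whether a given environment can negotiate TLSv1.2, one option is to make a request that refuses anything older. The Ruby sketch below is one way to do that, assuming Ruby 2.5 or later (where `Net::HTTP#min_version=` is available); the helper names are hypothetical, and `tls12_reachable?` performs a real network request when called.

```ruby
require "net/http"
require "openssl"

# Builds an HTTPS client that refuses TLSv1 and TLSv1.1.
def tls12_only_client(host)
  http = Net::HTTP.new(host, 443)
  http.use_ssl = true
  http.min_version = OpenSSL::SSL::TLS1_2_VERSION
  http
end

# Returns true if a TLSv1.2 (or newer) session can be established with the host.
# Raises OpenSSL::SSL::SSLError if the handshake fails.
def tls12_reachable?(host)
  tls12_only_client(host).start { true }
end

# Example (performs a network request):
#   tls12_reachable?("api.github.com")
```

A client stack that passes this check against a TLSv1.2-capable server should not be affected by the removal of TLSv1 and TLSv1.1.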

Git-Credential-Manager-for-Windows < v1.14.0

Git-Credential-Manager-for-Windows < v1.14.0 does not support TLSv1.2. This can be addressed by updating to v1.14.0.

Git on Red Hat 5, < 6.8, and < 7.2

Red Hat 5, 6, and 7 shipped with Git clients that did not support TLSv1.2. This can be addressed by updating to versions 6.8 and 7.2 (or greater) respectively. Unfortunately, Red Hat 5 does not have a point release that supports TLSv1.2. We advise that users of Red Hat 5 upgrade to a newer version of the operating system.

Java releases < JDK 8

As noted in this blog post by Oracle, TLSv1 was used by default for JDK releases prior to JDK 8. JDK 8 changed this behavior and defaults to TLSv1.2. Any client that runs on older versions of the JDK is affected (JGit is one popular example). This can be addressed by updating to JDK >= 8 or explicitly opting in to TLSv1.2 in JDK 7 (look at the https.protocols JSSE tuning parameter). Unfortunately, versions of the JDK <= 6 do not support TLSv1.2. We advise users of JDK <= 6 to upgrade to a newer version of the JDK.

Visual Studio

Visual Studio ships with specific versions of Git for Windows and the Git Credential Manager for Windows (GCM). Microsoft has updated the latest versions of Visual Studio 2017 to work with TLSv1.2 Git servers. We advise users of Visual Studio to upgrade to the latest release by clicking on the in-product notification flag or by checking for an update directly from the IDE. Microsoft has provided additional guidance on the Visual Studio developer community support forum.

Conclusion

As always, if you have any questions or concerns related to this announcement, please don’t hesitate to contact us.