Geo: Enable Geo for Disaster Recovery on GitLab.com
> *The best way to understand the pain that users experience with our product is to experience it for yourself.* ~ The GitLab Product Handbook

Engineering Owner: @dbalexandre

### Current Status

**Staging**: The Chef roles to configure the node have been created. Next we need to create the VM and instantiate an instance of GitLab on it to behave as the secondary node. We are hoping to have the backfill started by Dec-20th.

- [x] Terraform changes for staging
- [x] Chef role created for the secondary node
- [x] Terraform and Chef role applied to create the secondary node
- [x] Enable database replication
- [x] Configure Geo through GitLab's admin interface
- [x] Start backfill operation
- [x] Backfill operation completed

Follow up:

- [x] Add secondary node to update scripts for staging
- [x] Add secondary node to monitoring
- [x] Close out documentation

**Ops Instance**: We're following the progress of the Infrastructure team and are ready to support them if needed to [install Geo on the Ops Instance](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6675).

**Production**: Rollout is blocked on the [replication issue](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7293). We are continuing to write [notes on how Geo will support the disaster recovery strategy](https://docs.google.com/document/d/1LwLPcSHTCO-Ti6CbX2K92WTWY5hUWeMOA9X1HGp2mFM/edit).

### Summary

GitLab.com is the SaaS offering hosted by GitLab. GitLab.com is by far the largest GitLab instance and is used by GitLab to [dogfood GitLab itself](https://about.gitlab.com/handbook/engineering/index.html#dogfooding).

Currently, GitLab.com does not use GitLab Geo for disaster recovery purposes. This has many disadvantages, and the Geo Team is working with Infrastructure to enable Geo on GitLab.com.

Infrastructure has also compiled a long [library entry](https://about.gitlab.com/handbook/engineering/infrastructure/library/disaster-recovery/).

### Problem to solve

<!-- What problem are we solving for them? Tight problem description that everyone can rally around. -->

As part of GitLab's Business Continuity Strategy, and in order to meet a 99.95% availability SLA, GitLab.com needs to utilise a Disaster Recovery solution. Currently, GitLab.com does not use GitLab Geo for disaster recovery. We would like to enable GitLab Geo for replication of mission critical data to a different location. This includes replicating:

* Repositories
* Wikis
* LFS Objects
* Attachments
* CI Artifacts
* Database
* Pages (currently not supported by Geo)
* Container Registry

### Higher intent

We [dogfood everything](https://about.gitlab.com/handbook/product/index.html#dogfood-everything), and not using Geo on GitLab.com is directly opposed to this value. In addition, it reduces customer trust in GitLab Geo.

GitLab requires an acceptable DR solution for BCP reasons.

### Proposal

#### Geo to synchronize mission critical data

In this iteration, mission critical GitLab.com data will be synchronised via Geo to a secondary location, e.g. US Central. The secondary location (`https://dr.gitlab.com`) will not serve any user traffic and does not initially require the full GitLab.com infrastructure setup. Instead, a minimal deployment will be created that allows only for syncing data. In case of a disaster, the infrastructure team will still be required to deploy additional applications and servers.

This will be delivered in three phases:

1. [Defining the Geo architecture and detailed rollout plans](https://gitlab.com/groups/gitlab-org/-/epics/1907)
1. [Rolling out Geo on staging.gitlab.com](https://gitlab.com/groups/gitlab-org/-/epics/1908)
1. [Rolling out Geo to dr.gitlab.com](https://gitlab.com/groups/gitlab-org/-/epics/1909)

**Benefits**

- Dogfooding enabled
- Mission critical data is synchronised continuously
- Lower cost of maintaining the infrastructure because it does not require the same scale as the main deployment (no user traffic)
- Advantages of using Geo itself (verification etc.)

**Downsides**

- Geo [current limitations](https://docs.gitlab.com/ee/administration/geo/replication/#current-limitations) apply
- DR process is still highly manual, with no automatic failover
- Regular testing of the DR solution is difficult or not possible
- DR solution is not a full "Hot Region"

### Intended users

<!-- Who's the target user? Target user description. -->

* [Systems administrators](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#sidney-systems-administrator)

### What does success look like, and how can we measure that?

* After defining RTO and RPO and with sufficient automation, a trial run could be conducted to assess whether we stay within these limits
* A live dr.gitlab.com solution

### Links / references

* https://about.gitlab.com/handbook/engineering/infrastructure/library/disaster-recovery/

Related Epics:

* https://gitlab.com/groups/gitlab-org/-/epics/75 - Hashed Storage
* https://gitlab.com/groups/gitlab-org/-/epics/526 - Security Review for Geo
* https://gitlab.com/groups/gitlab-org/-/epics/576 - Selective Sync
* https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/12 - Infrastructure's Epic tracking this deployment
* https://gitlab.com/groups/gitlab-org/-/epics/1906 - ARCHIVE epic