I2I: Nightly Releases for AMP #25616

@danielrozenberg

Background

The AMP runtime library is released and updated on a weekly schedule, with a 5 to 6 day canarying period (approximately 1% of AMP documents) to give the team an opportunity to analyze logs for errors, warnings, and performance regressions and for the community to report issues found in the wild. This process has worked well in the past, but as the amount of new code checked into the GitHub repository has grown, so has the number of new bugs. The project has had 17 P0 issues in 2019 (last checked: November 15), mandating a canary cherry pick, a production cherry pick, or both. The cherry pick process takes a considerable amount of release engineers’ time and erodes community trust.

This I2I proposes introducing automated nightly builds coupled with a new release process. Both aim to reduce the number of cherry picks, the number of users affected by new issues, and the time spent by release engineers, and to increase community trust in releases.

Non-goals

  • Directly reducing the number of P0 issues: there are multiple other ongoing projects with this aim. This I2I is concerned with reducing the number of users potentially affected by any future P0s, and with streamlining the recovery process once a P0 is discovered
  • Changes to any other type of release channel

Prerequisites and related features

See also

Overview

The current release schedule and process are described in release-schedule.md. We propose the following changes:

  • Monday – Friday
    • [new] [automated] Create a new nightly release every day, Monday through Friday, by choosing the last commit merged into the master branch before midnight PT that has passed all automated tests (a “green” commit). Midnight here means the start of the day: Monday’s cutoff is Monday 00:00, immediately after Sunday ends
      • If no green commit exists, skip that day. This is an unlikely scenario over a 24-hour period but has happened occasionally in the past, usually due to infrastructure issues, such as a failed third-party integration.
    • [new] [automated] The release will be automatically promoted to the nightly channel (both percentage-traffic and opt-in) at 9 AM PT (most release engineers begin their work day at ~9 AM PT, but some are in different timezones), subject to the same code-freeze policy (i.e., no promotions to nightly channel during major US holidays)
      • The release on-duty should be attentive to this process
  • Tuesday
    • [changed] On Tuesday at 11 AM PT the on-duty release engineer will choose a nightly build from the previous week (see the description below of how to choose which nightly to promote) and promote it to the canary and rc opt-in channels (not to the percentage-traffic channels)
      • Note that this is a change from the existing schedule, in which the last “green” commit before 11 AM PT is the one that gets promoted to these opt-in channels. This means that on Tuesday, the code being pushed to these channels is at the latest from Thursday 11:59:59 PM PT
    • [unchanged] The previous rc is promoted to prod/control as before
    • [changed] The previous rc is also promoted to the new nightly-control channel, which will be served to the same share of traffic as nightly. This allows fair comparisons between nightly and prod when looking for new error reports
  • Wednesday
    • [unchanged] This week’s opt-in canary/rc are promoted to the percentage-traffic channels
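The nightly selection rule in the Monday – Friday bullets can be sketched as follows. This is only an illustration of the rule as written: the `Commit` type, its fields, and the `pick_nightly_commit` function are hypothetical names, and the 24-hour look-back window is an assumption drawn from the "skip that day" note rather than a stated implementation detail.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class Commit:
    sha: str             # hypothetical commit identifier
    merged_at: datetime  # time the commit was merged into master
    green: bool          # True if the commit passed all automated tests


def pick_nightly_commit(
    commits: Sequence[Commit],
    cutoff: datetime,
    window: timedelta = timedelta(hours=24),  # assumed look-back period
) -> Optional[Commit]:
    """Return the last green commit merged in [cutoff - window, cutoff).

    Returns None when no green commit exists, meaning the day is skipped.
    """
    start = cutoff - window
    candidates = [c for c in commits if c.green and start <= c.merged_at < cutoff]
    return max(candidates, key=lambda c: c.merged_at, default=None)


# Example: the cutoff is Monday 00:00, immediately after Sunday ends.
cutoff = datetime(2019, 11, 18, 0, 0)
commits = [
    Commit("aaa111", datetime(2019, 11, 17, 20, 0), green=True),
    Commit("bbb222", datetime(2019, 11, 17, 23, 30), green=False),  # failed tests
    Commit("ccc333", datetime(2019, 11, 18, 0, 5), green=True),     # after cutoff
]
chosen = pick_nightly_commit(commits, cutoff)
```

Here `chosen` is the 20:00 commit: the later Sunday commit is not green, and the third commit landed after the cutoff, so it would be a candidate for the next day's release instead.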

Promotion strategies from nightly to canary/rc

The release on-duty should choose to promote to canary/rc the most recent nightly build from the previous week that:

  1. Does not introduce a performance regression (see I2I: Move CSI collection to GCP Monitoring Metrics #25228) and
  2. Does not have any discovered P0 issues

That is, if Friday’s release has no performance regression and no P0 issues, promote that.

| Week # | | Monday | Tuesday | Wednesday | Thursday | Friday |
|---|---|---|---|---|---|---|
| n | After midnight | release 1 is created | release 2 is created | release 3 is created | release 4 is created | release 5 is created |
| n | Working hours | release 1 is promoted to nightly | release 2 is promoted to nightly | release 3 is promoted to nightly | release 4 is promoted to nightly | release 5 is promoted to nightly |
| n+1 | After midnight | release 6 is created | | | | |
| n+1 | Working hours | | release 5 is promoted to canary | | | |

Ideal scenario: no performance regressions or P0 issues were found on Friday’s nightly release, so we promote that one to canary.

Otherwise, look back through Thursday’s, Wednesday’s, and earlier releases until a nightly without those problems is found.

| Week # | | Monday | Tuesday | Wednesday | Thursday | Friday |
|---|---|---|---|---|---|---|
| n | After midnight | release 1 is created | release 2 is created | release 3 is created | release 4 is created | release 5 is created |
| n | Working hours | release 1 is promoted to nightly | release 2 is promoted to nightly | release 3 is promoted to nightly | release 4 is promoted to nightly | release 5 is promoted to nightly |
| n+1 | After midnight | release 6 is created | | | | |
| n+1 | Working hours | | release 3 is promoted to canary | | | |

Scenario: a performance regression or P0 issue is found on both Friday’s and Thursday’s nightly releases, but not on Wednesday’s. We promote Wednesday’s release to canary.

If every nightly release from the previous week, back to Monday’s, has those problems, skip promoting a canary release this week.

| Week # | | Monday | Tuesday | Wednesday | Thursday | Friday |
|---|---|---|---|---|---|---|
| n | After midnight | release 1 is created | release 2 is created | release 3 is created | release 4 is created | release 5 is created |
| n | Working hours | release 1 is promoted to nightly | release 2 is promoted to nightly | release 3 is promoted to nightly | release 4 is promoted to nightly | release 5 is promoted to nightly |
| n+1 | After midnight | release 6 is created | | | | |
| n+1 | Working hours | | nothing is promoted to canary | | | |

Scenario: a performance regression or P0 issue is found on all of week n’s nightly releases. We do not promote any release to canary.
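The three scenarios above follow one rule: walk backwards from the most recent nightly and promote the first one with no performance regression and no P0. A minimal sketch, assuming a hypothetical `Nightly` record whose two flags are filled in by the release on-duty:

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Nightly:
    name: str                          # e.g. "release 5" (illustrative label)
    has_perf_regression: bool = False  # assumed to be set by the release on-duty
    has_p0: bool = False               # assumed to be set by the release on-duty


def choose_canary_candidate(nightlies: Sequence[Nightly]) -> Optional[Nightly]:
    """Pick the most recent clean nightly from last week.

    `nightlies` is ordered oldest first (Monday's release first).
    Returns None when every nightly has a problem, in which case no
    canary release is promoted this week.
    """
    for nightly in reversed(nightlies):
        if not nightly.has_perf_regression and not nightly.has_p0:
            return nightly
    return None


# Second scenario: Friday's and Thursday's releases are bad, Wednesday's is clean.
week_n = [
    Nightly("release 1"),
    Nightly("release 2"),
    Nightly("release 3"),
    Nightly("release 4", has_p0=True),
    Nightly("release 5", has_perf_regression=True),
]
candidate = choose_canary_candidate(week_n)  # the "release 3" record
```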

Incident resolution

In nightly

Performance regressions or P0 issues in nightly builds should be fixed as soon as possible, or the offending code should be rolled back until a fix can be produced. We should not roll back or cherry pick the nightly version itself, because it receives little traffic and because the issue will be resolved automatically by the next nightly release (either through a fix or through a rollback). Exceptions to this rule should be made when the issue is a P0 with potential privacy, security, or data loss impact.

Any nightly release that contains the offending code but not the fix is discarded from future consideration for promotion to canary/rc and prod.
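One way to read the discard rule is as a check on commit positions. The sketch below assumes master has a linear history in which each commit has an increasing integer position, and that the fix (or rollback) lands after the offending commit. The function name and parameters are hypothetical.

```python
from typing import Optional


def is_discarded(
    release_pos: int,
    offending_pos: int,
    fix_pos: Optional[int] = None,
) -> bool:
    """Return True if a nightly must be excluded from canary/rc and prod.

    release_pos:   position on master of the commit the nightly was cut from
    offending_pos: position of the commit that introduced the issue
    fix_pos:       position of the fix or rollback, or None if not landed yet
    """
    contains_offending = release_pos >= offending_pos
    contains_fix = fix_pos is not None and release_pos >= fix_pos
    return contains_offending and not contains_fix


# The issue lands at position 8 and the fix at position 11.
cut_before_issue = is_discarded(7, 8, 11)      # False: predates the issue
cut_between = is_discarded(10, 8, 11)          # True: has the issue, not the fix
cut_after_fix = is_discarded(12, 8, 11)        # False: includes the fix
```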

In canary/rc

The release on-duty will step back from the promoted nightly (e.g., Thursday’s) through the previous week’s releases until one without the problem is found (e.g., Tuesday’s), then roll back canary/rc to that version.

| Week # | | Monday | Tuesday | Wednesday | Thursday | Friday |
|---|---|---|---|---|---|---|
| n | After midnight | release 1 is created | release 2 is created | release 3 is created | release 4 is created | release 5 is created |
| n | Working hours | release 1 is promoted to nightly | release 2 is promoted to nightly | release 3 is promoted to nightly | release 4 is promoted to nightly | release 5 is promoted to nightly |
| n+1 | After midnight | release 6 is created | | | | |
| n+1 | Working hours | | release 5 is promoted to canary | Release 2 is promoted to canary | | |

Scenario: no performance regressions or P0 issues were found at first on Friday’s nightly release, so we promote that one to canary. The next day we find a P0 in canary and discover it is also present in Thursday’s and Wednesday’s releases, but not in Tuesday’s. We roll back canary to Tuesday’s nightly release.

If all of the nightly releases from last week have that issue, a cherry pick will be performed per the existing cherry pick instructions.

| Week # | | Monday | Tuesday | Wednesday | Thursday | Friday |
|---|---|---|---|---|---|---|
| n | After midnight | release 1 is created | release 2 is created | release 3 is created | release 4 is created | release 5 is created |
| n | Working hours | release 1 is promoted to nightly | release 2 is promoted to nightly | release 3 is promoted to nightly | release 4 is promoted to nightly | release 5 is promoted to nightly |
| n+1 | After midnight | release 6 is created | | | | |
| n+1 | Working hours | | release 5 is promoted to canary | Issue is fixed in commit a on master. Release 5+CP(a) is created and promoted to canary | | |

Scenario: no performance regressions or P0 issues were found at first on Friday’s nightly release, so we promote that one to canary. The next day we find a P0 in canary and discover it is also present in all of week n’s nightly releases. We fix the issue on master and perform a cherry pick.
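The two canary/rc scenarios above amount to a backwards search with a cherry-pick fallback. A rough sketch, where `resolve_canary_incident` and the `has_issue` callback are hypothetical and stand in for the release on-duty's manual investigation:

```python
from typing import Callable, Sequence, Tuple


def resolve_canary_incident(
    nightlies: Sequence[str],
    promoted: str,
    has_issue: Callable[[str], bool],
) -> Tuple[str, str]:
    """Decide how to recover from an issue found in canary/rc.

    `nightlies` lists last week's releases, oldest first, and `promoted`
    is the one currently in canary/rc. Returns ("rollback", release) for
    the most recent earlier release without the issue, or
    ("cherry-pick", promoted) when every earlier release has it too.
    """
    idx = nightlies.index(promoted)
    for release in reversed(nightlies[:idx]):
        if not has_issue(release):
            return ("rollback", release)
    return ("cherry-pick", promoted)


week_n = ["release 1", "release 2", "release 3", "release 4", "release 5"]

# First scenario: the P0 is in releases 3 to 5 but not in release 2.
affected = {"release 3", "release 4", "release 5"}
first = resolve_canary_incident(week_n, "release 5", lambda r: r in affected)

# Second scenario: every nightly from week n has the P0.
second = resolve_canary_incident(week_n, "release 5", lambda r: True)
```

In the first case the result is a rollback to release 2 (Tuesday's nightly). In the second it is a cherry pick onto release 5, matching release 5+CP(a) in the table.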

In prod

A cherry pick will be performed per the existing cherry pick instructions.

Implementation details

Channel changes

AMP runtime channels*

Traffic diverted by experiments platform

| Name | Percentage | Notes |
|---|---|---|
| prod | [remainder] | |
| control | 0.5% | Same binaries as prod** |
| rc | 0.5% | |
| canary | 0.5% | |
| experiment{A,B,C}*** | 0.5% each | |
| nightly | 0.05% | Add this |
| nightly-control | 0.05% | Add this |

Traffic diverted by opt-in cookie

| Name | Notes |
|---|---|
| rc | |
| canary | |
| nightly | Add this |
| rtv-specific | Proposed in #25205 |

* only including web channels, i.e., excludes amp4ads, amp4email, etc.

** except for minor changes in the AMP_CONFIG, such as the type and v fields

*** amp4ads experiments have variable percentages, and therefore require their own equivalent control group to compare metric data; this is handled by the experiments platform used
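As a quick check on the "[remainder]" entry, the prod share can be worked out from the listed percentages. This assumes the percentage-traffic channels are non-overlapping slices of the same web traffic population and ignores opt-in cookie traffic. The dictionary below simply restates the table and is not an actual configuration format.

```python
from decimal import Decimal

# Percentage of web traffic diverted to each non-prod channel, per the table.
DIVERTED_PERCENT = {
    "control": Decimal("0.5"),
    "rc": Decimal("0.5"),
    "canary": Decimal("0.5"),
    "experimentA": Decimal("0.5"),
    "experimentB": Decimal("0.5"),
    "experimentC": Decimal("0.5"),
    "nightly": Decimal("0.05"),          # new channel
    "nightly-control": Decimal("0.05"),  # new channel
}

diverted_total = sum(DIVERTED_PERCENT.values())  # 3.10
prod_share = Decimal("100") - diverted_total     # 96.90

print(f"diverted: {diverted_total}%, prod: {prod_share}%")
# prints "diverted: 3.10%, prod: 96.90%"
```

Under these assumptions the two new channels together take 0.1% of traffic away from prod, leaving it at 96.9%.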

Security and privacy considerations

This change is not expected to introduce any new security or privacy issues.

Alternatives considered/to consider

Do nothing

One option we can consider is to not introduce nightly releases at all.

Auto-scaling traffic

In a previous design review we discussed the option of having traffic for nightly builds scale up automatically until they reach prod status. This idea was rejected for two main reasons:

  1. The traffic diversion platform used at Google does not support this feature out of the box, and implementing this would involve considerable work that might not be worth the effort
  2. The release on-duty’s role as a human observer for sanity checks was preferred over automation
