Background
The AMP runtime library is released and updated on a weekly schedule, with a 5 to 6 day canarying period (approximately 1% of AMP documents) to give the team an opportunity to analyze logs for errors, warnings, and performance regressions, and for the community to report issues found in the wild. This process has worked well in the past, but as the amount of new code checked into the GitHub repository has grown, so has the number of new bugs. The project has had 17 P0 issues in 2019 (last checked: November 15), each mandating a canary cherry pick, a production cherry pick, or both. The cherry pick process takes a considerable amount of time for release engineers and erodes community trust.
This I2I proposes introducing automated nightly builds coupled with a new release process, both aimed at reducing the number of cherry picks, the number of users affected by new issues, and the time spent by release engineers, and at increasing community trust in releases.
Non-goals
- Directly reduce the number of P0 issues: there are multiple other ongoing projects with this aim. This I2I is concerned with reducing the number of users potentially affected by any future P0s, and with streamlining the recovery process once a P0 is discovered
- Changes to any other type of release channel
Prerequisites and related features
- [prerequisite] I2I: Move CSI collection to GCP Monitoring Metrics #25228
  - The current performance metrics take days to get calculated, preventing us from being able to promote releases from nightly to canary until we have them
- [optional prerequisite] I2I: Opt-in to Specific AMP RTV #25205
  - Having this feature can help release engineers quickly determine whether they should revert a release with a discovered issue and which versions have introduced the regression
- [related feature] I2I: Rename AMP's release flavors #25560
  - The proposal discussed here is currently using release channel names before any changes proposed in I2I: Rename AMP's release flavors #25560. Once that issue is finalized, this proposal will be updated to reflect those changes.
- [related feature] I2I: Create stable release channel with slower release cadence #25578
  - Both this current I2I and I2I: Create stable release channel with slower release cadence #25578 propose process changes to releases, with different end goals and intents
- [related feature] Automate creation of AMP GitHub releases and reporting changes on Slack
  - This is proposed work by @estherkim, no GitHub issue yet
See also
- https://github.com/ampproject/wg-infra/issues/18: Document the long-term strategy for disentangling the AMP runtime from the Google AMP Cache
Overview
The current release schedule and process are described in release-schedule.md. We propose the following changes:
- Monday – Friday
  - [new] [automated] Create a new `nightly` release every day by choosing the last commit merged into the `master` branch that has passed all automated tests (“green” commit) before midnight PST, Sunday through Thursday (that is, Monday 00:00, immediately after Sunday ends)
    - If no green commit exists, skip that day. This is an unlikely scenario over a 24 hour period but has happened occasionally in the past, usually due to infrastructure issues, such as a failed 3rd party integration.
  - [new] [automated] The release will be automatically promoted to the `nightly` channel (both percentage-traffic and opt-in) at 9 AM PT (most release engineers begin their work day at ~9 AM PT, but some are in different timezones), subject to the same code-freeze policy (i.e., no promotions to the `nightly` channel during major US holidays)
    - The release on-duty should be attentive to this process
- Tuesday
  - [changed] On Tuesday 11 AM PT the on-duty release engineer will choose a `nightly` build from the previous week (see description below of how the release engineer should choose which nightly to promote) to promote to the `canary` and `rc` opt-in channels (not to the percentage-traffic channels)
    - Note that this is a change from the existing schedule, in which the last “green” commit before 11 AM PT is the one that gets promoted to these opt-in channels. This means that on Tuesday, the code being pushed to these channels is at the latest from Thursday 11:59:59 PM PT
  - [unchanged] The previous `rc` is promoted to `prod`/`control` as before
  - [changed] The previous `rc` is also promoted to the new `nightly-control` channel, which will be served to an equal population volume as `nightly`. This is to allow fair comparisons between `nightly` and `prod` when it comes to finding new error reports
- Wednesday
  - [unchanged] This week’s opt-in `canary`/`rc` are promoted to the percentage-traffic channels
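
The nightly cut described above can be sketched roughly as follows. This is a minimal illustration, not the actual release tooling: the `Commit` type, the `pick_nightly_commit` function, and the `since` parameter (the start of the window being considered, for example the previous cut) are all hypothetical names introduced here, and timestamps are assumed to already be in one consistent timezone.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class Commit:
    """Hypothetical record of a commit merged into master."""
    sha: str
    merged_at: datetime  # when the commit landed on master
    is_green: bool       # True if all automated tests passed


def pick_nightly_commit(
    commits: Sequence[Commit], since: datetime, cutoff: datetime
) -> Optional[Commit]:
    """Return the last green commit merged in [since, cutoff).

    Returns None when no green commit exists in the window, which
    corresponds to skipping the nightly release for that day.
    """
    candidates = [
        c for c in commits
        if c.is_green and since <= c.merged_at < cutoff
    ]
    # max() with default=None handles the "no green commit" case.
    return max(candidates, key=lambda c: c.merged_at, default=None)
```

Returning `None` rather than falling back to an older commit mirrors the "skip that day" rule; a real implementation might prefer a different window than the one assumed here.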
Promotion strategies from nightly to canary/rc
The release on-duty should choose to promote to canary/rc the most recent nightly build from the previous week that:
- Does not introduce a performance regression (see I2I: Move CSI collection to GCP Monitoring Metrics #25228) and
- Does not have any discovered P0 issues
That is, if Friday’s release has no performance regression and no P0 issues, promote that.
| Week # | Time of day | Monday | Tuesday | Wednesday | Thursday | Friday |
| --- | --- | --- | --- | --- | --- | --- |
| n | After midnight | release 1 is created | release 2 is created | release 3 is created | release 4 is created | release 5 is created |
| n | Working hours | release 1 is promoted to nightly | release 2 is promoted to nightly | release 3 is promoted to nightly | release 4 is promoted to nightly | release 5 is promoted to nightly |
| n+1 | After midnight | release 6 is created | … | … | … | … |
| n+1 | Working hours | release 5 is promoted to canary | | | | |

Ideal scenario: no performance regressions or P0 issues were found on Friday’s nightly release, so we promote that one to canary.
Otherwise, look at Thursday’s release, then Wednesday’s, and so on, until a nightly without those problems is found.
| Week # | Time of day | Monday | Tuesday | Wednesday | Thursday | Friday |
| --- | --- | --- | --- | --- | --- | --- |
| n | After midnight | release 1 is created | release 2 is created | release 3 is created | release 4 is created | release 5 is created |
| n | Working hours | release 1 is promoted to nightly | release 2 is promoted to nightly | release 3 is promoted to nightly | release 4 is promoted to nightly | release 5 is promoted to nightly |
| n+1 | After midnight | release 6 is created | … | … | … | … |
| n+1 | Working hours | release 3 is promoted to canary | | | | |

Scenario: a performance regression or P0 issue is found on both Friday’s and Thursday’s nightly releases, but not on Wednesday’s. We promote Wednesday’s release to canary.
If Monday’s nightly release also has those problems, skip promoting a canary release this week.
| Week # | Time of day | Monday | Tuesday | Wednesday | Thursday | Friday |
| --- | --- | --- | --- | --- | --- | --- |
| n | After midnight | release 1 is created | release 2 is created | release 3 is created | release 4 is created | release 5 is created |
| n | Working hours | release 1 is promoted to nightly | release 2 is promoted to nightly | release 3 is promoted to nightly | release 4 is promoted to nightly | release 5 is promoted to nightly |
| n+1 | After midnight | release 6 is created | … | … | … | … |
| n+1 | Working hours | nothing is promoted to canary | | | | |

Scenario: a performance regression or P0 issue is found on all of week n’s nightly releases. We do not promote any release to canary.
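
The three promotion scenarios above amount to a backward scan over last week's nightlies. The sketch below is only an illustration of that rule; the `Nightly` type, its field names, and `choose_nightly_to_promote` are hypothetical and do not refer to any existing AMP tooling.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Nightly:
    """Hypothetical summary of one nightly release from the previous week."""
    name: str
    has_perf_regression: bool
    has_p0: bool


def choose_nightly_to_promote(last_week: Sequence[Nightly]) -> Optional[Nightly]:
    """Pick the most recent nightly with no performance regression and no P0.

    `last_week` is assumed to be ordered oldest (Monday) to newest (Friday).
    Returns None when every nightly is affected, meaning no canary/rc
    promotion happens this week.
    """
    for nightly in reversed(last_week):
        if not nightly.has_perf_regression and not nightly.has_p0:
            return nightly
    return None
```

For example, if releases 4 and 5 each carry a P0 or a regression and release 3 does not, the function returns release 3, matching the second scenario.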
Incident resolution
In nightly
Performance regressions or P0 issues in nightly builds should be fixed as soon as possible, or the offending code should be rolled back until a fix can be produced. We should not bother with rolling back or cherry picking the entire version, because of the small amount of traffic it receives and because the issue will be fixed automatically with the next nightly release (either through a fix or through a rollback). Exceptions to this rule should be made when the issue is a P0 that has potential for privacy, security, or data loss issues.
Any nightly release that contains the offending code but not the fix is discarded from future consideration for promotion to canary/rc and prod.
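
One way to picture the discard rule is to treat `master` as a linear history and compare positions on it. The sketch below assumes, purely for illustration, that each commit can be identified by an integer position on `master` and that a nightly contains every commit up to and including its cut position; `is_discarded` and its parameters are hypothetical names, and a rollback commit counts as the fix.

```python
from typing import Optional


def is_discarded(nightly_cut: int, offending: int, fix: Optional[int]) -> bool:
    """Return True if a nightly contains the offending commit but not the fix.

    All arguments are positions on master (larger means later). `fix` is
    None while no fix or rollback has landed yet.
    """
    contains_offender = nightly_cut >= offending
    contains_fix = fix is not None and nightly_cut >= fix
    return contains_offender and not contains_fix
```

With the offending commit at position 120 and the fix at 150, a nightly cut at 130 is discarded, while nightlies cut at 100 or at 150 remain eligible for promotion.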
In canary/rc
The release on-duty will backstep from the promoted nightly (e.g., Thursday) until a release without the problem is found in the previous week (e.g., Tuesday), then roll back canary/rc to that version.
| Week # | Time of day | Monday | Tuesday | Wednesday | Thursday | Friday |
| --- | --- | --- | --- | --- | --- | --- |
| n | After midnight | release 1 is created | release 2 is created | release 3 is created | release 4 is created | release 5 is created |
| n | Working hours | release 1 is promoted to nightly | release 2 is promoted to nightly | release 3 is promoted to nightly | release 4 is promoted to nightly | release 5 is promoted to nightly |
| n+1 | After midnight | release 6 is created | … | … | … | … |
| n+1 | Working hours | release 5 is promoted to canary | Release 2 is promoted to canary | | | |

Scenario: no performance regressions or P0 issues were found at first on Friday’s nightly release, so we promote that one to canary. The next day we find a P0 in canary and discover it is also present in Thursday’s and Wednesday’s releases, but not in Tuesday’s. We roll back canary to Tuesday’s nightly release.
If all of the nightly releases from last week have that issue, a cherry pick will be performed per the existing cherry pick instructions.
| Week # | Time of day | Monday | Tuesday | Wednesday | Thursday | Friday |
| --- | --- | --- | --- | --- | --- | --- |
| n | After midnight | release 1 is created | release 2 is created | release 3 is created | release 4 is created | release 5 is created |
| n | Working hours | release 1 is promoted to nightly | release 2 is promoted to nightly | release 3 is promoted to nightly | release 4 is promoted to nightly | release 5 is promoted to nightly |
| n+1 | After midnight | release 6 is created | … | … | … | … |
| n+1 | Working hours | release 5 is promoted to canary | Issue is fixed in commit a on master. Release 5+ | | | |

Scenario: no performance regressions or P0 issues were found at first on Friday’s nightly release, so we promote that one to canary. The next day we find a P0 in canary and discover it is also present in all of week n’s nightly releases. We fix the issue on master and perform a cherry pick.
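
The canary/rc recovery path in the two scenarios above can be sketched as a backward search that falls back to a cherry pick. This is a hedged illustration only: `choose_rollback_target` and its parameters are hypothetical, and release names stand in for whatever identifier the real process would use.

```python
from typing import Collection, Optional, Sequence


def choose_rollback_target(
    last_week: Sequence[str], promoted: str, affected: Collection[str]
) -> Optional[str]:
    """Backstep from the promoted nightly to the newest unaffected release.

    `last_week` is assumed to be ordered oldest to newest and to contain
    `promoted`. Returns None when every earlier nightly is also affected,
    in which case the existing cherry pick process applies instead.
    """
    promoted_index = last_week.index(promoted)
    # Only releases older than the promoted one are rollback candidates.
    for name in reversed(last_week[:promoted_index]):
        if name not in affected:
            return name
    return None
```

If release 5 was promoted and the P0 is also present in releases 4 and 3, the function returns release 2, which is the rollback shown in the first scenario.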
In prod
A cherry pick will be performed per the existing cherry pick instructions.
Implementation details
Channel changes
AMP runtime channels*

| Traffic diverted by | Name | Percentage | Notes |
| --- | --- | --- | --- |
| Experiments platform | prod | [remainder] | |
| Experiments platform | control | 0.5% | Same binaries as prod** |
| Experiments platform | rc | 0.5% | |
| Experiments platform | canary | 0.5% | |
| Experiments platform | experiment{A,B,C}*** | 0.5% each | |
| Experiments platform | nightly | 0.05% | Add this |
| Experiments platform | nightly-control | 0.05% | Add this |
| Opt-in cookie | rc | | |
| Opt-in cookie | canary | | |
| Opt-in cookie | nightly | | Add this |
| Opt-in cookie | rtv-specific | | Proposed in #25205 |

* only including web channels, i.e., excludes amp4ads, amp4email, etc.
** except for minor changes in the
***
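
As a quick sanity check on the percentages above, the traffic left for `prod` can be computed from the other channels. The sketch assumes that `experiment{A,B,C}` expands to three channels at 0.5% each; the expanded names `experimentA`, `experimentB`, and `experimentC`, the `CHANNEL_PERCENT` mapping, and `prod_remainder` are introduced here only for illustration.

```python
from decimal import Decimal
from typing import Mapping

# Percentage-traffic channels other than prod, taken from the table above.
# The three experiment entries are an assumed expansion of experiment{A,B,C}.
CHANNEL_PERCENT = {
    "control": Decimal("0.5"),
    "rc": Decimal("0.5"),
    "canary": Decimal("0.5"),
    "experimentA": Decimal("0.5"),
    "experimentB": Decimal("0.5"),
    "experimentC": Decimal("0.5"),
    "nightly": Decimal("0.05"),
    "nightly-control": Decimal("0.05"),
}


def prod_remainder(channels: Mapping[str, Decimal]) -> Decimal:
    """Return the percentage of traffic left for prod."""
    # Decimal avoids binary floating-point rounding on values such as 0.05.
    return Decimal("100") - sum(channels.values(), Decimal("0"))
```

Under these assumptions the non-prod channels add up to 3.1%, leaving about 96.9% for `prod`; adding `nightly` and `nightly-control` accounts for 0.1% of that total.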
Security and privacy considerations
This change is not expected to introduce any new security or privacy issues.
Alternatives considered/to consider
Do nothing
One option we can consider is to not introduce nightly releases at all.
Auto-scaling traffic
In a previous design review we discussed the option of having traffic for nightly builds scale up automatically until they reach prod status. This idea was rejected for two main reasons:
- The traffic diversion platform used at Google does not support this feature out of the box, and implementing this would involve considerable work that might not be worth the effort
- The release on-duty’s role as a human observer for sanity checks was preferred over automation