Background
The AMP runtime library is released and updated on a weekly schedule, with a 5 to 6 day canarying period (approximately 1% of AMP documents). This gives the team an opportunity to analyze logs for errors, warnings, and performance regressions, and gives the community time to report issues found in the wild. This process has worked well in the past, but as the amount of new code checked into the GitHub repository has grown, so has the number of new bugs. The project has had 17 P0 issues in 2019 (last checked: November 15), each mandating a canary cherry pick, a production cherry pick, or both. The cherry pick process takes a considerable amount of time for release engineers and erodes community trust.
This I2I proposes introducing automated nightly builds coupled with a new release process, with the aim of reducing the number of cherry picks, the number of users affected by new issues, and the time spent by release engineers, and of increasing community trust in releases.
Non-goals
Directly reduce the number of P0 issues: there are multiple other ongoing projects with this aim. This I2I is concerned with reducing the number of users potentially affected by any future P0s, and with streamlining the recovery process once a P0 is discovered.
Prerequisites and related features
The current performance metrics take days to calculate, which prevents us from promoting releases from nightly to canary until they are available.
Having this feature can help release engineers quickly determine whether they should revert a release with a discovered issue and which versions have introduced the regression.
See also
The proposal discussed here currently uses the release channel names that predate any changes proposed in I2I: Rename AMP's release flavors #25560. Once that issue is finalized, this proposal will be updated to reflect those changes.
Overview
The current release schedule and process are described in release-schedule.md. We propose the following changes:
Monday – Friday
[new] [automated] Create a new nightly release every day by choosing the last commit merged into the master branch that has passed all automated tests (“green” commit) before midnight PST, Monday through Thursday (that is, Monday 00:00, immediately after Sunday ends)
If no green commit exists, skip that day. This is an unlikely scenario over a 24 hour period but has happened occasionally in the past, usually due to infrastructure issues, such as a failed 3rd party integration.
[new] [automated] The release will be automatically promoted to the nightly channel (both percentage-traffic and opt-in) at 9 AM PT (most release engineers begin their work day at ~9 AM PT, but some are in different timezones), subject to the same code-freeze policy (i.e., no promotions to nightly channel during major US holidays)
The release on-duty should be attentive to this process
Tuesday
[changed] On Tuesday at 11 AM PT, the on-duty release engineer will choose a nightly build from the previous week and promote it to the canary and rc opt-in channels (not to the percentage-traffic channels). See the description below of how the release engineer should choose which nightly to promote
Note that this is a change from the existing schedule, in which the last “green” commit before 11 AM PT is the one that gets promoted to these opt-in channels. This means that on Tuesday, the code being pushed to these channels is at the latest from Thursday 11:59:59 PM PT
[unchanged] The previous rc is promoted to prod/control as before
[changed] The previous rc is also promoted to the new nightly-control channel, which will be served to a population of the same size as nightly. This allows fair comparisons between nightly and prod when looking for new error reports
Wednesday
[unchanged] This week’s opt-in canary/rc are promoted to the percentage-traffic channels
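The nightly selection rule in the Monday – Friday step can be sketched as follows. This is only an illustration under stated assumptions: the `Commit` record, its fields, and `pick_nightly_commit` are hypothetical names, timestamps are assumed to already be in Pacific time, and the 24-hour lookback window is our reading of the "skip that day" rule rather than a confirmed implementation detail.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class Commit:
    """Hypothetical record of a commit merged into master."""
    sha: str
    merged_at: datetime  # assumed to be expressed in Pacific time
    green: bool          # True if the commit passed all automated tests


def pick_nightly_commit(commits: Sequence[Commit],
                        cutoff: datetime) -> Optional[Commit]:
    """Return the last green commit merged in the 24 hours before `cutoff`.

    Returns None when no green commit exists, meaning the nightly
    release for that day is skipped.
    """
    window_start = cutoff - timedelta(hours=24)  # assumed lookback window
    candidates = [
        c for c in commits
        if c.green and window_start <= c.merged_at < cutoff
    ]
    return max(candidates, key=lambda c: c.merged_at, default=None)
```

For example, with a midnight cutoff, a green commit merged at 10 AM the previous day is chosen over a later commit that failed its tests, and a commit merged five minutes after midnight waits for the next nightly.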
Promotion strategies from nightly to canary/rc
The release on-duty should promote to canary/rc the most recent nightly build from the previous week that has no performance regressions and no P0 issues.
That is, if Friday’s release has no performance regression and no P0 issues, promote that.
| Week # | | Monday | Tuesday | Wednesday | Thursday | Friday |
|---|---|---|---|---|---|---|
| n | After midnight | release 1 is created | release 2 is created | release 3 is created | release 4 is created | release 5 is created |
| | Working hours | release 1 is promoted to nightly | release 2 is promoted to nightly | release 3 is promoted to nightly | release 4 is promoted to nightly | release 5 is promoted to nightly |
| n+1 | After midnight | release 6 is created | … | … | … | … |
| | Working hours | | release 5 is promoted to canary | | | |
Ideal scenario: no performance regressions or P0 issues were found on Friday’s nightly release, so we promote that one to canary.
Otherwise, look back through Thursday’s, Wednesday’s, and earlier releases until a nightly without those problems is found.
| Week # | | Monday | Tuesday | Wednesday | Thursday | Friday |
|---|---|---|---|---|---|---|
| n | After midnight | release 1 is created | release 2 is created | release 3 is created | release 4 is created | release 5 is created |
| | Working hours | release 1 is promoted to nightly | release 2 is promoted to nightly | release 3 is promoted to nightly | release 4 is promoted to nightly | release 5 is promoted to nightly |
| n+1 | After midnight | release 6 is created | … | … | … | … |
| | Working hours | | release 3 is promoted to canary | | | |
Scenario: a performance regression or P0 issue is found on both Friday’s and Thursday’s nightly releases, but not on Wednesday’s. We promote Wednesday’s release to canary.
If Monday’s nightly release also has those problems, skip promoting a canary release this week.
| Week # | | Monday | Tuesday | Wednesday | Thursday | Friday |
|---|---|---|---|---|---|---|
| n | After midnight | release 1 is created | release 2 is created | release 3 is created | release 4 is created | release 5 is created |
| | Working hours | release 1 is promoted to nightly | release 2 is promoted to nightly | release 3 is promoted to nightly | release 4 is promoted to nightly | release 5 is promoted to nightly |
| n+1 | After midnight | release 6 is created | … | … | … | … |
| | Working hours | | nothing is promoted to canary | | | |
Scenario: a performance regression or P0 issue is found on all of week n’s nightly releases. We do not promote any release to canary.
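The three scenarios above follow one rule, which can be sketched in a few lines. This is an illustrative sketch only: the `Nightly` record, its flags, and `choose_canary_candidate` are hypothetical names, and in practice the release on-duty makes this call manually from performance metrics and issue reports.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Nightly:
    """Hypothetical summary of one nightly release from the previous week."""
    name: str
    has_perf_regression: bool = False
    has_p0: bool = False


def choose_canary_candidate(last_week: Sequence[Nightly]) -> Optional[Nightly]:
    """Pick the most recent nightly with no performance regression and no P0.

    `last_week` is ordered oldest (Monday) to newest (Friday).
    Returns None when every nightly is affected, in which case no
    release is promoted to canary/rc this week.
    """
    for nightly in reversed(last_week):
        if not (nightly.has_perf_regression or nightly.has_p0):
            return nightly
    return None
```

Walking the list from Friday backwards mirrors the tables: a clean Friday release is promoted as-is, problems in Thursday's and Friday's releases lead to Wednesday's, and a week with no clean release promotes nothing.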
Incident resolution
In nightly
Performance regressions or P0 issues in nightly builds should be fixed as soon as possible, or the offending code should be rolled back until a fix can be produced. We should not bother rolling back the entire version or cherry picking that version, both because of the small amount of traffic it receives and because the issue will be fixed automatically by the next nightly release (either through a fix or through a rollback). Exceptions to this rule should be made when the issue is a P0 with potential privacy, security, or data loss impact.
Any nightly release that contains the offending code but not the fix is discarded from future consideration for promotion to canary/rc and prod.
In canary/rc
The release on-duty will step back from the promoted nightly (e.g., Thursday) through the previous week's releases until one without the problem is found (e.g., Tuesday), then roll back canary/rc to that version.
| Week # | | Monday | Tuesday | Wednesday | Thursday | Friday |
|---|---|---|---|---|---|---|
| n | After midnight | release 1 is created | release 2 is created | release 3 is created | release 4 is created | release 5 is created |
| | Working hours | release 1 is promoted to nightly | release 2 is promoted to nightly | release 3 is promoted to nightly | release 4 is promoted to nightly | release 5 is promoted to nightly |
| n+1 | After midnight | release 6 is created | … | … | … | … |
| | Working hours | | release 5 is promoted to canary | release 2 is promoted to canary | | |
Scenario: no performance regressions or P0 issues were found at first on Friday’s nightly release, so we promote that one to canary. The next day we find a P0 in canary and discover it is also present in Thursday’s and Wednesday’s releases, but not in Tuesday’s. We roll back canary to Tuesday’s nightly release.
If all of the nightly releases from last week have that issue, a cherry pick will be performed per the existing cherry pick instructions.
| Week # | | Monday | Tuesday | Wednesday | Thursday | Friday |
|---|---|---|---|---|---|---|
| n | After midnight | release 1 is created | release 2 is created | release 3 is created | release 4 is created | release 5 is created |
| | Working hours | release 1 is promoted to nightly | release 2 is promoted to nightly | release 3 is promoted to nightly | release 4 is promoted to nightly | release 5 is promoted to nightly |
| n+1 | After midnight | release 6 is created | … | … | … | … |
| | Working hours | | release 5 is promoted to canary | Issue is fixed in commit a on master. Release 5+CP(a) is created and promoted to canary | | |
Scenario: no performance regressions or P0 issues were found at first on Friday’s nightly release, so we promote that one to canary. The next day we find a P0 in canary and discover it is also present in all of week n’s nightly releases. We fix the issue on master and perform a cherry pick.
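The canary/rc recovery path described above, roll back when possible and cherry pick otherwise, can be sketched as a single decision function. This is a sketch under assumptions, not the release tooling: `resolve_canary_incident` and the `has_issue` callback are hypothetical, and release names are plain strings for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple


def resolve_canary_incident(
    last_week: Sequence[str],
    promoted: str,
    has_issue: Callable[[str], bool],
) -> Tuple[str, Optional[str]]:
    """Decide how to recover from an issue found in canary/rc.

    `last_week` lists the previous week's nightly releases, oldest first.
    `promoted` is the nightly currently serving canary/rc.
    `has_issue` reports whether a given release contains the problem.

    Returns ("rollback", release) for the most recent earlier nightly
    without the problem, or ("cherry-pick", None) when every earlier
    nightly from that week is also affected.
    """
    promoted_index = last_week.index(promoted)
    # Step back from the promoted nightly towards Monday's release.
    for release in reversed(last_week[:promoted_index]):
        if not has_issue(release):
            return ("rollback", release)
    return ("cherry-pick", None)
```

In the first scenario, releases 3 through 5 are affected, so the function returns a rollback to release 2. In the second, all of week n's releases are affected, so it falls through to the cherry pick.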
In prod
A cherry pick will be performed per the existing cherry pick instructions.
Implementation details
Channel changes
| Channel | Traffic* | |
|---|---|---|
| prod | | |
| control | 0.5% | prod** |
| rc | 0.5% | |
| canary | 0.5% | |
| experiment{A,B,C}*** | 0.5% each | |
| nightly | 0.05% | |
| nightly-control | 0.05% | |
| rc | | |
| canary | | |
| nightly | | |
| rtv-specific | | |

\* only including web channels, i.e., excludes amp4ads, amp4email, etc.
\*\* except for minor changes in the AMP_CONFIG, such as the type and v fields
\*\*\* amp4ads experiments have variable percentages, and therefore require their own equivalent control group to compare metric data; this is handled by the experiments platform used
Security and privacy considerations
This change is not expected to introduce any new security or privacy issues.
Alternatives considered/to consider
Do nothing
One option we can consider is to not introduce nightly releases at all.
Auto-scaling traffic
In a previous design review we discussed the option of having traffic for nightly builds scale up automatically until they reach prod status. This idea was rejected for two main reasons:
The traffic diversion platform used at Google does not support this feature out of the box, and implementing this would involve considerable work that might not be worth the effort
The release on-duty’s role as a human observer for sanity checks was preferred over automation