Plan
CI Observability
Foster ownership and accountability for CI build & test performance in the monorepo, in order to improve the reliability and speed of CI.
Problem
sg/sg is a complex product. Test and build times are key to controlling quality, yet teams have very little visibility into where they stand or how their code affects other teams in CI, which fosters learned helplessness.
Buildkite’s “Test Analytics” feature covers much of the same ground, but it lacks detail on things such as the critical path, offers limited graph customization, and provides no aggregations besides P50. It also does not unlock future capabilities such as introspecting what causes a cache miss.
Cost measurement has so far been a mostly manual process: extracting data from the Buildkite API and attempting to correlate it with cost in Google spreadsheets. Besides being manual, this misses the actual cost of CI, which is the underlying infrastructure.
Success criteria
- 70% of tests in sg/sg have a clear owner
- CI reliability >93%
- Average build times <23min
Proposal
Ownership:
Provide a set of (manually updated) Bazel variables to be used as tags in tests to denote ownership (as a stretch goal, we can explore keeping this set automatically updated, depending on the reliability of the sources of truth we have available to us). CI will enforce that the percentage of tests with a tagged owner remains above 70% (possible through simple and fast bazel query commands), so that new tests added have clear ownership defined.
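The coverage check described above could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `owner_` tag prefix, the function names, and the input shape (a mapping from Bazel label to its tags, for example parsed from the output of a `bazel query` over `tests(//...)`) are all assumptions made for the example. It also treats exactly 70% as passing, which is a judgment call.

```python
# Hypothetical convention: an owned test carries a tag such as "owner_search".
OWNER_TAG_PREFIX = "owner_"
COVERAGE_TARGET = 70.0  # percent, taken from the success criteria


def owner_of(tags):
    """Return the team encoded in a target's tags, or None if unowned."""
    for tag in tags:
        if tag.startswith(OWNER_TAG_PREFIX):
            return tag[len(OWNER_TAG_PREFIX):]
    return None


def ownership_coverage(test_tags):
    """Percentage of test targets that carry an owner tag.

    test_tags maps a Bazel label to its list of tags.
    """
    if not test_tags:
        return 100.0  # no tests at all means nothing is unowned
    owned = sum(1 for tags in test_tags.values() if owner_of(tags) is not None)
    return 100.0 * owned / len(test_tags)


def check_coverage(test_tags, target=COVERAGE_TARGET):
    """Return (ok, coverage, unowned_labels) for use as a CI gate."""
    coverage = ownership_coverage(test_tags)
    unowned = sorted(
        label for label, tags in test_tags.items() if owner_of(tags) is None
    )
    return coverage >= target, coverage, unowned
```

A CI step could call `check_coverage`, print the unowned labels, and exit non-zero when `ok` is false, so that the author of a new test sees which targets still need an owner.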
Observability:
Investigate the different sources of Bazel execution data (build event protocol, compact execution log & profile data) to see which combination we need in order to extract the required information from Bazel. The resulting files will be stored as Buildkite artifacts, where they can be queried by the finalization task.
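As an illustration of what querying the profile data might look like, here is a small sketch that sums event durations per category. It assumes the profile is the JSON trace profile in Chrome Trace Event Format (complete events with `ph == "X"` and `dur` in microseconds) and that it has already been decompressed and parsed; the function name and the category names in the test data are made up for the example, and the real fields should be confirmed during the discovery phase.

```python
import json
from collections import defaultdict


def duration_by_category(profile):
    """Sum the duration, in seconds, of complete events per category.

    Assumes Chrome Trace Event Format: complete events have ph == "X",
    a "dur" field in microseconds, and a "cat" category string.
    """
    totals = defaultdict(float)
    for event in profile.get("traceEvents", []):
        if event.get("ph") != "X":
            continue  # skip metadata, counter, and instant events
        totals[event.get("cat", "unknown")] += event.get("dur", 0) / 1_000_000
    return dict(totals)


def load_profile(path):
    """Load an already-decompressed JSON profile from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Whether this level of detail is sufficient, or whether the execution log is needed as well, is exactly the open question the investigation is meant to answer.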
Augment our GCP agent images to log a datapoint when they boot & shut down, including relevant data from the metadata API server.
Combined with data from Buildkite pipeline executions, test tagging & GCP details, this should let us answer the following questions:
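A boot/shutdown datapoint could be assembled along these lines. This is a sketch under assumptions: the set of metadata fields, the row layout, and the function names are hypothetical, and writing the row to its destination is left out. The metadata server base URL and the `Metadata-Flavor: Google` header are the standard way to read instance metadata on GCE.

```python
import urllib.request
from datetime import datetime, timezone

METADATA_BASE = "http://metadata.google.internal/computeMetadata/v1/instance/"
# Hypothetical selection of fields; the plan does not say which to record.
FIELDS = ("id", "name", "machine-type", "zone")


def fetch_metadata(key):
    """Read one value from the GCE metadata server."""
    request = urllib.request.Request(
        METADATA_BASE + key, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=2) as response:
        return response.read().decode("utf-8")


def build_row(event, fetch=fetch_metadata, now=None):
    """Build one datapoint for a lifecycle event ("boot" or "shutdown")."""
    if event not in ("boot", "shutdown"):
        raise ValueError(f"unknown event: {event!r}")
    timestamp = now or datetime.now(timezone.utc)
    row = {"event": event, "timestamp": timestamp.isoformat()}
    for key in FIELDS:
        # Use underscores so the keys are valid column names.
        row[key.replace("-", "_")] = fetch(key)
    return row
```

Passing `fetch` as a parameter keeps the row-building logic testable off-instance, where the metadata server is not reachable.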
- what % of total wall time was spent in tests, grouped by team
- what % of the critical path (the lower bound on CI time) constitutes tests, grouped by team
- what % of a team’s CI time is taken up by other teams’ tests
- how many step retries are attributable to flakiness in tests, grouped by team (reliability of a team’s tests)
- looking at a graph, can I show that my changes to tests/builds have had a measurable improvement
- when and what contributed to higher than expected CI costs over the last period
Together, these answers will form the basis for both global and team-specific reports:
- global reports for engineering leaders/managers to help with prioritizing or delegating to teams to look into their impact on CI times.
- team-specific reports with figures for the questions listed above, with links to dashboards that provide more insight and the ability to dig deeper into the data in order to work on the right things.
Milestones
- 70% of Bazel test targets have a defined owner (estimated 1 week)
- Most of the time will go to coordinating with all the teams to get the ownership details; this can be done async alongside further work.
- Value: ownership of flaky tests can be directly traced for accountability.
- Data emitted by Bazel is exported to BigQuery (estimated 1 week)
- This will involve a discovery phase to determine whether the profile export is enough or whether we also want the execution log, plus some buffer time to gain familiarity with the (often badly documented) data formats.
- Value: what immediate value do we get at this point? Can query the data?
- Weekly reports are sent to Slack, providing insight into the effect of a team’s tests on CI (estimated 3 days)
- This will be generated by AlfredBot and/or Looker, depending on the capabilities of Looker in generating reports in a format that suits us.
- This can first be posted to a single channel in order to test reliability, quality and usefulness of the reports.
- Refinement of the reports will be a continuous process to keep them relevant & useful.
- Value: teams will start to see & feel accountability for the impact of their test quality on CI times.
- Dashboards (likely on Looker) will be made available for both global and team-specific overviews, allowing management to see everyone’s impact on CI
- Buildkite agents run a systemd one-shot on boot & shutdown to log a row in BigQuery containing instance metadata from the metadata API server (estimated 2 days)
- Value: dev-infra can start attributing CI cost to developer activity in areas of the code resulting in longer than expected CI times.
- CI checks whether new tests have an annotated owner/whether the 70% target is maintained (estimated 1 day)
- Value: ownership remains consistently high instead of developers omitting this information.
Risks
- Teams are not motivated to improve test reliability/performance (time/headcount constraints, lack of urgency, other reasons)
- The data does not provide a reliable set of numbers and ends up being ignored (e.g. tests unexpectedly not cached by Bazel)
Tracked issues
@unassigned
Completed
@Strum355
Completed
@jamesmcnamara
Completed