Description
Looks like the current CI build is timing out again. Even with previous countermeasures in place, these mitigations no longer appear sufficient to keep the overall build time under the maximum one hour limit when building from scratch, or when caching is intentionally busted due to dependency and system package upgrades. For example:
The release_build job linked above demonstrates a context deadline exceeded error while attempting to finish building the final workspace package. When caching is used, including ccache and colcon-cache, building the nav2 workspace normally has no such issue. However, when caching is deliberately busted to ensure clean, hygienic builds upon upstream dependency updates, it seems the current codebase exceeds the deadline limit for CircleCI's free tier jobs.
For reference, resource utilization for the medium docker resource class is almost fully maxed out in terms of core count; exceptions include the bottlenecks in building the first few and last packages in the workspace, such as nav2_utils (?) and nav2_system_tests.
Possible Solutions
Upgrade resource class for CI workers
By default, the resource class currently used for CI docker workers is medium, with 2 vCPUs and 4GB of RAM. Aside from the occasional spike once or twice a month when the caches are busted, this hasn't been much of an issue since caching was added:
Given the current issue, one solution would be to simply upgrade the resource class for workers to something more powerful such as large, with 4 vCPUs and 8GB of RAM:
https://circleci.com/docs/configuration-reference#resourceclass
See here for such an example:
However, in practice, switching to a larger resource class is not enough, given the bottlenecks described above. Even allowing for diminishing returns, I am surprised by how unaffected the build time is; perhaps the windows of time where make jobs exceed the former limit of 2 are countered by the greater degree of context switching between make processes? Note: at the 47min mark, the step in the CPU plot remains the same as before with the medium resource class.
In addition, unsetting all makeflags and config options limiting the parallel job count still results in RAM exhaustion.
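To make the CPU versus RAM trade-off concrete, here is a minimal sketch of how a job cap could be derived for each resource class. The vCPU and RAM figures are the ones quoted in this issue; the per-job memory figure (`GB_PER_JOB`) and the helper names are hypothetical and not measured on nav2.

```python
def safe_job_count(vcpus: int, ram_gb: float, gb_per_job: float) -> int:
    """Largest parallel job count that fits in both CPU and RAM (at least 1)."""
    by_memory = int(ram_gb // gb_per_job)
    return max(1, min(vcpus, by_memory))


def makeflags(jobs: int) -> str:
    """Render GNU make flags: -j caps parallel jobs, -l caps the load average."""
    return f"-j{jobs} -l{jobs}"


# Resource classes quoted in this issue: (vCPUs, RAM in GB).
RESOURCE_CLASSES = {"medium": (2, 4), "large": (4, 8)}

# ASSUMPTION: hypothetical peak memory per compile job, not measured on nav2.
GB_PER_JOB = 3.0

for name, (vcpus, ram_gb) in RESOURCE_CLASSES.items():
    jobs = safe_job_count(vcpus, ram_gb, GB_PER_JOB)
    print(f"{name}: MAKEFLAGS='{makeflags(jobs)}'")
```

If the real per-job memory footprint is anywhere near this assumed value, the extra cores of a larger class cannot all be used safely, which would be consistent with the exhaustion seen once the limits are removed.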
Optimize codebase to improve build time
Alternatively, we may want to consider optimizing the nav2 codebase to improve build time. One example could be to split larger packages, such as nav2_system_tests, into multiple smaller ones. While this finer granularity could improve cache retention, for jobs building from scratch it could also, if the resulting packages can be built in parallel, help avoid the largest bottleneck in the build and test pipeline.
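As a rough illustration of why splitting could matter more than adding cores, the sketch below computes a simple lower bound on wall time. It assumes each package builds on a single worker and ignores dependencies between packages and parallelism inside a package; the durations are hypothetical, not nav2 measurements.

```python
def build_time_lower_bound(durations_min: list[float], workers: int) -> float:
    """Lower bound on wall time: total work spread over workers, or the longest package."""
    return max(sum(durations_min) / workers, max(durations_min))


# ASSUMPTION: hypothetical per-package build times in minutes.
monolithic = [5, 5, 5, 25]       # one large package dominates
split = [5, 5, 5, 9, 8, 8]       # same total work, large package split in three

for workers in (2, 4):
    before = build_time_lower_bound(monolithic, workers)
    after = build_time_lower_bound(split, workers)
    print(f"{workers} workers: monolithic >= {before} min, split >= {after} min")
```

Under these assumptions the monolithic layout is bounded by its longest package regardless of worker count, while the split layout can actually benefit from more workers.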
Perhaps we could also ask the MoveIt2 maintainers, who manage packages of similar scale and complexity, how they are keeping their own build times in check. For example:




