CI build times out

Looks like the current CI build is timing out again. Even with the previous countermeasures in place, the build no longer finishes within the one-hour limit when building from scratch, or when caches are intentionally busted for dependency and system package upgrades. For example:

https://app.circleci.com/pipelines/github/ros-planning/navigation2/7903/workflows/d5454092-e664-417b-ab70-b1c05d5c42e7/jobs/27157

The `release_build` job linked above fails with `context deadline exceeded` while attempting to finish building the final workspace package. When caching is used, including ccache and colcon-cache, building the nav2 workspace normally has no such issue. However, when caching is deliberately busted to ensure clean, hygienic builds upon upstream dependency updates, the current codebase exceeds the deadline for CircleCI's free tier jobs.

For reference, resource utilization for the medium docker resource class is almost fully maxed out in terms of core count; exceptions include the bottlenecks in building the first few and last packages in the workspace, such as `nav2_utils` (?) and `nav2_system_tests`.

![image](https://user-images.githubusercontent.com/2293573/190198700-9d289b8c-1f46-41f9-8293-aff94641a66e.png)
 
## Possible Solutions

### Upgrade resource class for CI workers

By default, the resource class currently used for CI docker workers is `medium`, with 2 vCPUs and 4GB of RAM. Aside from the occasional spike once or twice a month when the caches are busted, this hasn't been much of an issue since caching was added:

![image](https://user-images.githubusercontent.com/2293573/190194787-2893bbee-69c2-4621-8533-a7f14b93072d.png)

Given the current issue, one solution would be to simply upgrade the resource class for workers to something more powerful such as `large`, with 4 vCPUs and 8GB of RAM:

https://circleci.com/docs/configuration-reference#resourceclass

![image](https://user-images.githubusercontent.com/2293573/190191333-88f2cb1a-662e-4af9-ab14-b59588158222.png)

See here for such an example:

- https://github.com/ros-planning/navigation2/pull/3188

However, in practice, switching to the `large` resource class is not enough, given the bottlenecks described above. Even allowing for diminishing returns, I am surprised by how unaffected the build time is; perhaps the windows of time where make jobs exceed the former limit of 2 are countered by the greater degree of context switching between make processes? Note: at the 47min mark, the step in the CPU plot remains the same as before with the `medium` resource class.

https://app.circleci.com/pipelines/github/ros-planning/navigation2/7900/workflows/2812b4a8-00d6-43e3-ba91-86fd378a5a21/jobs/27151

![image](https://user-images.githubusercontent.com/2293573/190223757-1ae68a8a-9e2f-47db-b8ad-72a0b297f6f0.png)
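As a rough illustration of why doubling the vCPU count may yield less than expected, here is a minimal sketch using Amdahl's law. The serial fraction is a hypothetical placeholder, not a measured value from these builds; it stands in for the share of build time spent in bottleneck packages that cannot use additional cores. This would only partly account for the result above, since it ignores memory pressure and context switching.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Ideal speedup over one worker when only the parallel part scales."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


# ASSUMPTION: hypothetical share of the build that runs serially
# (e.g. bottleneck packages); not measured from the CI jobs above.
SERIAL_FRACTION = 0.4

medium = amdahl_speedup(SERIAL_FRACTION, workers=2)  # 2 vCPUs
large = amdahl_speedup(SERIAL_FRACTION, workers=4)   # 4 vCPUs

# Relative gain from moving medium -> large under this assumption
print(f"medium -> large gain: {large / medium:.2f}x")  # prints 1.27x
```

Under this assumed serial fraction, twice the cores would buy only about a quarter more throughput, so the serial bottlenecks would still dominate the build.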

In addition, unsetting all makeflags and config options that limit the parallel job count still results in memory exhaustion:

https://app.circleci.com/pipelines/github/ros-planning/navigation2/7901/workflows/a7d01a5e-804c-4250-9069-cd637afe2cfd/jobs/27168

![image](https://user-images.githubusercontent.com/2293573/190232533-3908a942-45f9-4565-8390-4ce523f2a19a.png)
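One way to reason about the limit is to cap the parallel job count by memory as well as by cores. The sketch below is only an illustration: `GB_PER_JOB` is an assumed peak memory footprint per compiler process, not a figure measured from the nav2 build, and `safe_job_count` is a hypothetical helper.

```python
def safe_job_count(cores: int, ram_gb: float, gb_per_job: float) -> int:
    """Largest job count that fits both the core and the memory budget."""
    by_memory = int(ram_gb // gb_per_job)
    # Always allow at least one job so the build can make progress.
    return max(1, min(cores, by_memory))


# ASSUMPTION: hypothetical peak memory per compile job, in GB.
GB_PER_JOB = 2.5

# Resource classes as described in this issue: (name, vCPUs, RAM in GB)
for name, cores, ram_gb in [("medium", 2, 4), ("large", 4, 8)]:
    jobs = safe_job_count(cores, ram_gb, GB_PER_JOB)
    print(f"{name}: {jobs} parallel job(s)")
```

If the real per-job footprint is anywhere near this assumption, memory rather than core count would be the binding constraint on both resource classes.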


### Optimize codebase to improve build time

Alternatively, we may want to consider optimizing the nav2 codebase to improve build time. One example could be to split larger packages, such as `nav2_system_tests`, into multiple smaller ones. This finer granularity could improve cache retention and, for jobs building from scratch, could also help avoid the largest bottleneck in the build and test pipeline, provided the resulting packages can be built in parallel.
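To make the potential benefit of splitting concrete, here is a minimal sketch that compares the wall-clock time of one monolithic package against the same work split into independent packages. All durations are hypothetical, and the model assumes that each package occupies a single worker and that the split packages have no dependencies on one another, which may not hold for the real workspace.

```python
import heapq


def makespan(durations: list[float], workers: int) -> float:
    """Greedy schedule: longest tasks first, each onto the least-loaded worker."""
    loads = [0.0] * workers
    heapq.heapify(loads)
    for d in sorted(durations, reverse=True):
        # Assign the task to whichever worker currently finishes earliest.
        heapq.heappush(loads, heapq.heappop(loads) + d)
    return max(loads)


# ASSUMPTION: hypothetical build times in minutes, not measured values.
monolithic = [20.0]            # one large package
split = [5.0, 5.0, 5.0, 5.0]   # same work as four independent packages

print(makespan(monolithic, workers=2))  # 20.0
print(makespan(split, workers=2))       # 10.0
```

In this idealized model, the split version halves the tail of the build on two workers, while the monolithic package leaves one worker idle.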

Perhaps we could also ask the MoveIt2 maintainers, who manage packages of similar scale and complexity, how they are keeping their own build times in check. For example:

- https://github.com/ros-planning/moveit2/pull/1333

cc @SteveMacenski @tylerjw 
