docker-worker and generic-worker both support a retry configuration in task definitions:
- https://github.com/taskcluster/docker-worker/blob/e2eb847d404d7fa776a394c5845c5e98040cc13b/schemas/v1/payload.json#L190-L197
- https://github.com/taskcluster/generic-worker/blob/e31923d3322d3d31af563ae6a1f3e328b9223d5f/multiuser_windows.yml#L222-L237
It takes a list of exit codes. When the task’s command fails with one of these, the worker resolves the task with the queue as “exception” rather than “failed”. The queue then automatically re-schedules that task, up to a configurable number of retries (defaults to 5).
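As a rough sketch of what this could look like when building a task definition in Python: the helper name `with_intermittent_retry` is hypothetical, the exit code `1` is a placeholder, and I'm assuming the field is spelled `payload.onExitStatus.retry` (as in the schemas linked above) and that the retry count is the task's top-level `retries` field.

```python
import copy


def with_intermittent_retry(task, exit_codes, max_retries=3):
    """Return a copy of `task` that asks the worker to retry on `exit_codes`.

    Assumptions (not verified against this repository's decision task):
    - the worker reads the list from payload.onExitStatus.retry
    - the queue reads the retry limit from the top-level `retries` field
    """
    new_task = copy.deepcopy(task)
    payload = new_task.setdefault("payload", {})
    on_exit = payload.setdefault("onExitStatus", {})
    # De-duplicate and sort so the generated definition is stable.
    on_exit["retry"] = sorted(set(exit_codes))
    new_task["retries"] = max_retries
    return new_task


# Hypothetical WPT chunk task; the command and exit code are placeholders.
wpt_chunk = {"payload": {"command": ["./mach", "test-wpt"]}}
retrying_chunk = with_intermittent_retry(wpt_chunk, exit_codes=[1], max_retries=3)
```

The original dict is left untouched, so the same base definition can be reused for tasks that should not be retried.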
In this repository, PRs often fail to land because of some intermittent WPT failure. We mark some test filenames as known intermittents whose failures we ignore, but I suspect that at least some of the non-determinism is weakly correlated with the filename, or not at all, as PRs still regularly fail at first but then land after a retry (or a few).
Retrying an entire PR through homu is costly in terms of overall cycle time. #23383 can help, but not when another PR was merged in the meantime, which is common when homu’s queue is non-empty: homu starts on the next PR after a failure faster than a human can type the retry command.
The Taskcluster queue’s retry mechanism is more fine-grained: task level instead of PR level. When the second try runs fewer tests (one out of 6 WPT chunks, for example), we’re less likely to hit another random/intermittent failure.
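To make that intuition concrete, here is a back-of-the-envelope sketch. It assumes each chunk fails intermittently with the same probability, independently of the other chunks and of previous runs, which is surely a simplification. The 10% figure is made up for illustration, not measured.

```python
def attempt_success_probability(p_chunk_fail, chunks, task_retries):
    """Probability that every chunk passes within one PR-level attempt.

    Each chunk gets 1 + task_retries runs and passes if any run passes.
    Assumes independent failures with the same probability per run.
    """
    p_chunk_never_passes = p_chunk_fail ** (1 + task_retries)
    return (1 - p_chunk_never_passes) ** chunks


# Illustrative numbers only: 6 WPT chunks, 10% intermittent rate per chunk run.
no_task_retry = attempt_success_probability(0.10, chunks=6, task_retries=0)
two_task_retries = attempt_success_probability(0.10, chunks=6, task_retries=2)
print(f"without task-level retry: {no_task_retry:.3f}")
print(f"with 2 task-level retries: {two_task_retries:.3f}")
```

Under these assumptions, even a small per-task retry count removes most of the PR-level failures caused by intermittents, because only the failing chunk is rerun.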
The downside, however, is that it multiplies the time it takes to report deterministic test failures when a PR actually breaks something. This could be limited by setting a low retry count, like 2 or 3, for WPT tasks. (#24768 greatly reduced this time loss.)
@jdm, what do you think?