Split WPT macOS testing into many more chunks by SimonSapin · Pull Request #24768 · servo/servo

SimonSapin · 2019-11-18T10:06:11Z

## Before this

Before this PR, we had roughly as many chunks as available workers. Because the number of test files in a chunk is a poor estimate of the time needed to run it, completion times vary significantly between chunks when testing a given PR.

servo/taskcluster-config#9 adds a tool to collect this data. Here are two full runs of `test_wpt` before this PR:

https://community-tc.services.mozilla.com/tasks/groups/DBt9ki9gTdWmwAk-VDorzw

count 1, total 0:00:32, max: 0:00:32    docker  0:00:32
count 1, total 0:59:14, max: 0:59:14    macos-disabled-mac1     0:59:14
count 6, total 4:12:16, max: 1:01:14    macos-disabled-mac1 WPT 0:40:29 0:18:55 0:46:50 0:44:38 1:01:14 0:40:10
count 1, total 0:55:19, max: 0:55:19    macos-disabled-mac9     0:55:19
count 6, total 4:25:09, max: 1:01:40    macos-disabled-mac9 WPT 0:37:58 0:37:24 0:27:18 1:01:40 0:46:17 0:54:31

Times for a given chunk vary between 19 minutes and 61 minutes. Assuming no `try` testing, with Homu’s serial scheduling of `r+` testing this means that the worker with the shortest chunk sits idle for 42 minutes while the longest chunk finishes, and our limited CPU resources are under-utilized.
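The idle time can be checked directly from the log. This is a minimal sketch that assumes one chunk per worker and that all chunks start at the same moment; `parse_duration` is a helper written for this illustration, not part of the data collection tool.

```python
from datetime import timedelta


def parse_duration(text):
    """Parse an H:MM:SS string, as printed in the log, into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Chunk durations of the first run above (macos-disabled-mac1 WPT).
chunks = [
    parse_duration(t)
    for t in "0:40:29 0:18:55 0:46:50 0:44:38 1:01:14 0:40:10".split()
]

longest = max(chunks)
# Assumption: one chunk per worker, so every worker waits for the slowest chunk.
idle_per_worker = [longest - chunk for chunk in chunks]
total_idle = sum(idle_per_worker, timedelta())

print("longest chunk:    ", longest)
print("largest idle time:", max(idle_per_worker))
print("total idle time:  ", total_idle)
```

Under those assumptions the largest gap is 0:42:19, which is the 42 minutes quoted above, and the six workers together idle for 1:55:08 during that run.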

When there *are* `try` PRs being tested however, they compete with each other and any `r+` PR for the same workers. If we get unlucky, a 61 minute task could only *start* after some other tasks have finished, increasing the overall time-to-merge a lot.

## This

This PR changes the number of chunks to be significantly more than the number of available workers. When one of them finishes, that worker can pick up another one instead of sitting idle.

Now the ratio of the number of tasks to the number of workers doesn’t matter: the differences in run time between tasks become somewhat of an advantage, and the distribution of work across workers evens out on average.
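The effect can be sketched with a small simulation in which each worker takes the next queued task as soon as it becomes free. This is only an illustration: the durations are the 31 chunk times from the run reported further down in this description, while the queue order, the worker count of 6, and the absence of competing tasks are assumptions. Taskcluster's real scheduling is not modelled.

```python
import heapq


def makespan(durations, workers):
    """Wall-clock time if each idle worker takes the next queued task."""
    # Each heap entry is the time at which one worker becomes free.
    free_at = [0] * workers
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + duration)
    return max(free_at)


def seconds(text):
    """Convert an H:MM:SS string to a number of seconds."""
    h, m, s = (int(part) for part in text.split(":"))
    return h * 3600 + m * 60 + s


# Chunk times of the 31-chunk run (macos-disabled-mac9 WPT).
durations = [seconds(t) for t in """
    0:07:26 0:08:39 0:04:21 0:07:13 0:12:47 0:10:11 0:04:01 0:03:36
    0:10:43 0:12:57 0:04:47 0:04:06 0:10:09 0:12:00 0:12:42 0:04:40
    0:04:24 0:12:20 0:12:15 0:03:03 0:07:35 0:11:35 0:07:01 0:04:16
    0:09:40 0:05:08 0:05:01 0:06:29 0:15:29 0:02:28 0:06:27
""".split()]

WORKERS = 6  # assumption: same worker count as in the earlier runs

ideal = sum(durations) / WORKERS  # perfectly even split, no idle time
actual = makespan(durations, WORKERS)
print(f"ideal {ideal / 60:.1f} min, simulated {actual / 60:.1f} min")
```

With this kind of greedy assignment the simulated wall-clock time can exceed the perfectly even split by at most the duration of one task, so shorter tasks tighten the bound.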

The number 30 is a bit arbitrary. A higher number reduces resource under-utilization, but increases the effect of per-task overhead. The git cache added in #24753 reduced that overhead, though.
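The trade-off can be made concrete with a toy cost model. The figures below are hypothetical and chosen only for illustration: neither the per-task overhead nor the amount of pure testing work is measured in this PR.

```python
def machine_time(work_seconds, chunks, overhead_seconds):
    """Total machine time when every task pays a fixed setup cost."""
    return work_seconds + chunks * overhead_seconds


WORK = 4 * 3600  # hypothetical: about four hours of actual test execution
OVERHEAD = 60    # hypothetical per-task setup cost, in seconds

for chunks in (6, 30, 120):
    total = machine_time(WORK, chunks, OVERHEAD)
    share = chunks * OVERHEAD / total
    print(f"{chunks:>3} chunks: {total / 3600:.2f} h machine time, "
          f"{share:.1%} of it overhead")
```

In this model the overhead grows linearly with the chunk count, which is why lowering the per-task cost makes a larger number of chunks affordable.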

Another worry I had was whether this would worsen the similar problem of unequal scheduling between processes within a task, where some CPU cores sit idle while the remaining processes finish their assigned work.

This turned out not to be enough of a problem to negatively affect the total machine time:

https://community-tc.services.mozilla.com/tasks/groups/VnDac92HQU6QmrpzWPCR2w

count 1, total 0:00:48, max: 0:00:48    docker  0:00:48
count 1, total 0:39:04, max: 0:39:04    macos-disabled-mac9     0:39:04
count 31, total 4:03:29, max: 0:15:29   macos-disabled-mac9 WPT
        0:07:26 0:08:39 0:04:21 0:07:13 0:12:47 0:10:11 0:04:01 0:03:36
        0:10:43 0:12:57 0:04:47 0:04:06 0:10:09 0:12:00 0:12:42 0:04:40
        0:04:24 0:12:20 0:12:15 0:03:03 0:07:35 0:11:35 0:07:01 0:04:16
        0:09:40 0:05:08 0:05:01 0:06:29 0:15:29 0:02:28 0:06:27

(4h03min is even lower than above, but seems within variation.)

## After this

#23655 proposes automatically restarting failed WPT tasks, in case the failure is intermittent. With the test suite split into more chunks we have fewer tests per chunk, and therefore lower probability that a given one fails. Restarting one of them also causes less repeated work.
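As a rough illustration of the probability argument, assume that every test fails intermittently with the same small probability, independently of the others. Both that model and the numbers below are hypothetical; the real failure rates are not given here.

```python
def chunk_failure_probability(tests_in_chunk, p_test):
    """Probability that at least one test in the chunk fails spuriously."""
    return 1 - (1 - p_test) ** tests_in_chunk


TOTAL_TESTS = 30_000  # hypothetical size of the test suite
P_TEST = 1e-5         # hypothetical per-test intermittent failure rate

for chunks in (6, 30):
    per_chunk = TOTAL_TESTS // chunks
    p = chunk_failure_probability(per_chunk, P_TEST)
    print(f"{chunks:>2} chunks of {per_chunk} tests: "
          f"P(a given chunk fails) = {p:.2%}")
```

Under this model the chance that some chunk fails during a full run does not change with the number of chunks. What changes is that each individual chunk is less likely to fail, and a retry reruns a smaller share of the suite.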

… since other time-sensitive tasks depend on them. Note: we need to be careful with task priorities, especially in worker pools with limited capacity, since they are absolute and can cause starvation: https://docs.taskcluster.net/docs/manual/tasks/priority

… since they block everything else.

It already was, since the key given to `find_or_create()` did not include `args`.

SimonSapin · 2019-11-18T10:06:25Z

@bors-servo try=wpt-mac

bors-servo · 2019-11-18T10:06:30Z

⌛ Trying commit d58cfee with merge 100aba3...

SimonSapin · 2019-11-18T10:10:39Z

@bors-servo try=wpt-mac

servo/taskcluster-config#10

SimonSapin · 2019-11-18T10:12:22Z

@bors-servo try=wpt-mac

bors-servo · 2019-11-18T10:12:27Z

⌛ Trying commit e335bcd with merge 8d2337c...

bors-servo · 2019-11-18T11:44:23Z

💔 Test failed - status-taskcluster

SimonSapin · 2019-11-18T11:47:53Z

(One WPT test has unexpected result)

This is 92 minutes from scheduling to results. After the release build task is finished, this makes better use of all available workers (which are currently 8, not just 6).

SimonSapin · 2019-11-18T12:05:49Z

@bors-servo r=nox

bors-servo · 2019-11-18T12:05:50Z

📌 Commit 67b0b97 has been approved by nox

bors-servo · 2019-11-18T12:05:55Z

⌛ Testing commit 67b0b97 with merge 91d1d5f...

SimonSapin · 2019-11-18T12:45:31Z

@bors-servo retry #24611

bors-servo · 2019-11-18T12:45:36Z

⌛ Testing commit 67b0b97 with merge f890d2e...

bors-servo · 2019-11-18T14:22:54Z

☀️ Test successful - linux-rel-css, linux-rel-wpt, status-taskcluster
Approved by: nox
Pushing f890d2e to master...

SimonSapin added 6 commits November 18, 2019 09:47

Move "extra" WPT testing to its own task (chunk "zero")

1ca9c5b

Raise the priority of decision tasks

26ca284

… since they block everything else.

Don’t pretend that update_wpt() doesn’t use debug assertions

1762cba

It already was, since the key given to `find_or_create()` did not include `args`.

Only use high prority for macOS when testing a PR for merging.

e5f6333

SimonSapin mentioned this pull request Nov 18, 2019

Allow high-priority tasks servo/taskcluster-config#10

Merged

SimonSapin force-pushed the wpt0 branch from d58cfee to e335bcd Compare November 18, 2019 10:12

highfive added the S-awaiting-review There is new code that needs to be reviewed. label Nov 18, 2019

Zero-pad the chunk number in WPT task names

67b0b97

highfive added the S-tests-failed The changes caused existing tests to fail. label Nov 18, 2019

SimonSapin force-pushed the wpt0 branch from e335bcd to 67b0b97 Compare November 18, 2019 11:49

highfive removed the S-tests-failed The changes caused existing tests to fail. label Nov 18, 2019

highfive assigned nox Nov 18, 2019

highfive added S-awaiting-merge The PR is in the process of compiling and running tests on the automated CI. and removed S-awaiting-review There is new code that needs to be reviewed. labels Nov 18, 2019

bors-servo merged commit 67b0b97 into master Nov 18, 2019

bors-servo deleted the wpt0 branch November 18, 2019 14:23

highfive removed the S-awaiting-merge The PR is in the process of compiling and running tests on the automated CI. label Nov 18, 2019

This was referenced Nov 18, 2019

Automatically retry WPT tasks with few failures? #23655

Closed

Dealing with stray processes remaining after a task ends taskcluster/generic-worker#260

Open

SimonSapin mentioned this pull request Nov 27, 2019

Reduce the number of expected timeouts in WPT #24880

Open

jdm mentioned this pull request Dec 17, 2019

Intermittent reftest failures on linux (blank rendering) #24726

Closed
