We're seeing all of the Linux_android android_device_tests_shard_* master tests timing out on most PRs during approximately regular business hours this week. Looking at the logs, the job is usually stuck waiting to run the first FTL test, and when I click through to the dashboard link in the logs, FTL shows the test as "pending". It does eventually run, at least in the couple of cases I monitored, but only after the LUCI job has already timed out (90 minutes).
I first saw it last Friday, March 13th, but when I re-ran a couple of failing PRs over the weekend they were fine, so I thought it was a temporary hiccup on the FTL side. However, all this week I've been seeing it happen very frequently. The tests occasionally run in time, but rarely, except when I trigger the re-run early in the morning (Eastern) or late in the evening, when it seems much more likely to work. (I only have anecdotes on this, not hard data; I would expect the FTL team could get general data on usage/backlog for these devices, though.)
In the past when this has happened, it's usually been because the device we are using fell out of the high-capacity pool and we needed to update. @gmackall verified last Friday that these devices are still listed as DEVICE_CAPACITY_HIGH, but that doesn't seem to be true in practice at the moment (unless there's some kind of demand spike for these devices).
We likely need to either try switching to a different, newer high-capacity device, or talk with FTL folks about what might be going on.