Mitigate Dr.CI stale comments on PyTorch PRs #5963

Merged
huydhn merged 6 commits into main from address-drci-refresh-issue on Nov 22, 2024

huydhn commented Nov 22, 2024

This change fixes a couple of issues with the workflow that refreshes Dr.CI results for all open PRs. The key takeaway is that the runtime of this API call scales with the number of open pull requests in a repo, and on PyTorch it now takes longer than 120 seconds to finish. When that limit is reached, the Vercel function (AWS Lambda) terminates the execution and every PR still in the queue is dropped, so the Dr.CI comments on those PRs become stale.
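To make the failure mode concrete, here is a minimal Python sketch of a refresh loop that tracks a time budget and reports which PRs it could not reach, instead of having them dropped silently when the platform kills the function. This is an illustration only, not the code in this PR: `refresh_within_budget`, `update_pr`, and the injectable `clock` are hypothetical names, and the real Dr.CI handler is not written in Python.

```python
import time
from typing import Callable, Iterable, List, Tuple


def refresh_within_budget(
    pr_numbers: Iterable[int],
    update_pr: Callable[[int], None],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[int], List[int]]:
    """Update PRs in order until the time budget is used up.

    Returns (updated, skipped) so the caller can log or re-queue the
    PRs that were not reached.
    """
    deadline = clock() + budget_seconds
    updated: List[int] = []
    skipped: List[int] = []
    for pr in pr_numbers:
        # The budget is only checked before starting an update, so a
        # single slow update can still run past the deadline.
        if clock() >= deadline:
            skipped.append(pr)
            continue
        update_pr(pr)
        updated.append(pr)
    return updated, skipped
```

A caller would pick a budget somewhat below the platform limit so there is time left to report the skipped PRs before the function is terminated.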

Here is an example of the failure: https://github.com/pytorch/test-infra/actions/runs/11943802339/job/33293533522. The error is FUNCTION_INVOCATION_TIMEOUT (https://github.com/pytorch/test-infra/actions/runs/11964503897/job/33356932041#step:3:136), and the run stops at exactly 2 minutes. The limit is defined at https://vercel.com/fbopensource/torchci/settings/functions.

A final note: while debugging, I saw the new failure below show up intermittently. I'll look into it in another PR because it doesn't happen frequently, although it also causes the Dr.CI comment on the affected PR to go stale temporarily.

Failed to update PR 139760 Error: Client network socket disconnected before secure TLS connection was established
    at TLSSocket.onConnectEnd (node:_tls_wrap:1732:19)
    at TLSSocket.emit (node:events:525:35)
    at endReadableNT (node:internal/streams/readable:1696:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:90:21) {
  code: 'ECONNRESET',
  path: null,
  host: 'hyt81izu0c.us-east-1.aws.clickhouse.cloud',
  port: 8443,
  localAddress: undefined
}
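
One common way to absorb a transient connection reset like the one above is to retry the call with exponential backoff. The Python sketch below shows that pattern only; it is not the fix planned for the follow-up PR, and `call_with_retry`, its parameters, and the injectable `sleep` are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on connection errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            # ConnectionResetError is a subclass of ConnectionError.
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Only connection errors are retried here, so any other exception still fails on the first attempt.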

Testing

time curl --request POST \
  --url 'https://torchci-git-address-drci-refresh-issue-fbopensource.vercel.app/api/drci/drci' \
  --header 'Authorization: REDACT' \
  --data 'repo=pytorch' \
  --silent --output /dev/null --show-error --fail

The request now returns 200 OK even when the runtime is 3+ minutes (3:12.56 total); it returned 504 before.

@huydhn huydhn requested review from a team, clee2000, malfet and yangw-dev November 22, 2024 01:52

vercel bot commented Nov 22, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| torchci | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Nov 22, 2024 2:40am |

facebook-github-bot added the CLA Signed label Nov 22, 2024
huydhn changed the title from "Fix Dr.CI stale comments on PyTorch PRs" to "Mitigate Dr.CI stale comments on PyTorch PRs" Nov 22, 2024

huydhn commented Nov 22, 2024

cc @clee2000 I think we should profile the Dr.CI call to see whether the overall runtime can be improved, or whether the increase is due only to the higher number of open PRs on PyTorch. Running the workflow more often (#5956) helps a bit when the list of pull requests is shuffled, but there is no guarantee.
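
A rough way to see why shuffling plus more frequent runs helps without giving a guarantee: assuming each run shuffles the list uniformly and independently and reaches only a fraction `f` of the open PRs before timing out, a given PR is missed by `k` runs in a row with probability `(1 - f) ** k`. That shrinks quickly but never reaches zero unless `f` is 1. The Python sketch below only encodes this arithmetic; `miss_probability` is a hypothetical helper, and no coverage fraction has actually been measured here.

```python
def miss_probability(covered_fraction: float, runs: int) -> float:
    """Chance that one PR is missed by every one of `runs` refreshes.

    Assumes each run independently covers a uniformly random
    `covered_fraction` of the open PRs (an idealized model).
    """
    if not 0.0 <= covered_fraction <= 1.0:
        raise ValueError("covered_fraction must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return (1.0 - covered_fraction) ** runs
```

Under this model, reducing the runtime so that each run covers more PRs improves the odds for every run, while running more often only adds more attempts.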
