
[CI] Use sccache-0.8.2 for CUDA builds #140614

Closed
malfet wants to merge 3 commits into main from malfet-patch-28

Conversation

@malfet
Contributor

@malfet malfet commented Nov 13, 2024

Instead of an ancient prebuilt binary

This is a followup from #121323
For some reason, newer sccache does not work when gcc is invoked with the -E option, so the /opt/ccache/bin/gcc wrapper has to special-case -E. To work with nvcc, the wrapper must check whether -E is passed not only as the first or second argument but also as the third (to be followed up by a generic check in #142813), i.e. generate the following wrapper:

#!/bin/sh

if [ "$1" = "-E" ] || [ "$2" = "-E" ] || [ "$3" = "-E" ]; then
  exec /usr/bin/gcc "$@"
elif [ $(env -u LD_PRELOAD ps -p $PPID -o comm=) != sccache ]; then
  exec sccache /usr/bin/gcc "$@"
else
  exec /usr/bin/gcc "$@"
fi

Without it, `sccache nvcc hello.cu` fails with a non-descriptive error:

    sccache: error: failed to execute compile
    sccache: caused by: Compiler not supported: ""

@pytorch-bot

pytorch-bot bot commented Nov 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140614

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit 8f911c4 with merge base e52a534 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Nov 13, 2024
@malfet malfet changed the title Use sccache-0.8.2 everywhere [CI] Use sccache-0.8.2 for CUDA builds Nov 13, 2024
@malfet malfet marked this pull request as ready for review November 18, 2024 21:38
@malfet malfet requested a review from jeffdaily as a code owner November 18, 2024 21:38
@malfet
Contributor Author

malfet commented Nov 18, 2024

@pytorchbot merge -r

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 18, 2024
@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased malfet-patch-28 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout malfet-patch-28 && git pull --rebase)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / libtorch-linux-focal-cuda12.4-py3.7-gcc9-debug / build

Details for Dev Infra team Raised by workflow job

@malfet malfet added the no-runner-experiments Bypass Meta/LF runner determinator label Nov 19, 2024
@malfet
Contributor Author

malfet commented Nov 20, 2024

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased malfet-patch-28 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout malfet-patch-28 && git pull --rebase)

wdvr added a commit that referenced this pull request Dec 6, 2024
pytorchmergebot pushed a commit that referenced this pull request Dec 7, 2024
Removes sccache from bazel builds. Will move bazel builds to periodic if the build succeeds

CUDA bazel test succeeded, moving to periodic

Pull Request resolved: #142241
Approved by: https://github.com/malfet
@wdvr
Contributor

wdvr commented Dec 7, 2024

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased malfet-patch-28 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout malfet-patch-28 && git pull --rebase)

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased malfet-patch-28 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout malfet-patch-28 && git pull --rebase)

@malfet malfet changed the title [CI] Use sccache-0.8.2 for CUDA builds [CI] Use sccache-0.9.0 for CPU and CUDA builds Dec 10, 2024
@malfet malfet changed the title [CI] Use sccache-0.9.0 for CPU and CUDA builds [CI] Use sccache-0.8.2 for CUDA builds Dec 10, 2024
@malfet
Contributor Author

malfet commented Dec 11, 2024

@pytorchbot merge -f "All builds are green"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot pushed a commit that referenced this pull request Dec 11, 2024
Follow up of TODO in #140614

It was found experimentally that, for one GPU architecture, `sccache` passes `-E` as the 1st, 2nd or 3rd argument, but it is much better to bypass `sccache` whenever `-E` is passed as any argument
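
A minimal sketch of the "any argument" check described above, assuming a POSIX shell. The function names `has_preprocess_flag` and `choose_command` are hypothetical, and the sketch only prints the command the wrapper would run instead of calling `exec`, so it is an illustration and not the wrapper that #142813 actually generates.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): succeeds if any argument is exactly -E.
has_preprocess_flag() {
  for arg in "$@"; do
    if [ "$arg" = "-E" ]; then
      return 0
    fi
  done
  return 1
}

# Hypothetical dry run of the wrapper's decision.
# $1 is the parent process name; the remaining arguments are the gcc arguments.
choose_command() {
  parent="$1"
  shift
  if has_preprocess_flag "$@"; then
    echo "/usr/bin/gcc"            # preprocessing requested: bypass sccache
  elif [ "$parent" != "sccache" ]; then
    echo "sccache /usr/bin/gcc"    # normal compile: go through the cache
  else
    echo "/usr/bin/gcc"            # parent is sccache: call gcc directly
  fi
}

choose_command sh -O2 -x c -c -E hello.c
choose_command sh -c hello.c
choose_command sccache -c hello.c
```

Scanning the whole argument list removes the dependence on where the caller happens to place `-E`, which is the position sensitivity the three-way test had.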

No need to worry about exit or elif chains, as `exec` replaces the shell process, so nothing after it in the script runs
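
To illustrate the point about `exec`, here is a small sketch, assuming a POSIX `sh` and an `echo` binary are on the PATH. The function name `exec_demo` is hypothetical. The inner shell is replaced by `echo`, so the command written after `exec` is never reached.

```shell
#!/bin/sh
# Hypothetical demo (not from the PR): `exec` replaces the inner shell
# process with `echo`, so the second echo never runs.
exec_demo() {
  sh -c 'exec echo "ran via exec"; echo "never reached"'
}

exec_demo
```

This is why a wrapper can place an unconditional `exec` after an early `exec` inside an `if` without needing `exit` or an `else` branch.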

Pull Request resolved: #142813
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <nshulga@meta.com>
@ZainRizvi
Contributor

ZainRizvi commented Dec 11, 2024

Does this mean that for gcc -E commands we do not use sccache at all, @malfet? What kind of perf hit are we expecting when building the jobs?

@ZainRizvi
Contributor

@pytorchmergebot revert -c nosignal -m "Some build jobs are timing out, I'm guessing because they aren't using the cache. GH job link HUD commit link"

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

Reverting PR 140614 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit b94a20641459b295573617b97916a09b284dccfd returned non-zero exit code 1

Auto-merging .ci/docker/common/install_cache.sh
CONFLICT (content): Merge conflict in .ci/docker/common/install_cache.sh
error: could not revert b94a2064145... [CI] Use sccache-0.8.2 for CUDA builds (#140614)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Details for Dev Infra team Raised by workflow job

@wdvr
Contributor

wdvr commented Dec 11, 2024

@ZainRizvi yes, but it was already like that; sccache doesn't play nicely with the preprocessor.

@wdvr wdvr added ciflow/trunk Trigger trunk jobs on your pull request and removed ciflow/trunk Trigger trunk jobs on your pull request labels Dec 11, 2024
@malfet
Contributor Author

malfet commented Dec 11, 2024

@pytorchmergebot revert -c nosignal -m "Some build jobs are timing out, I'm guessing because they aren't using the cache. GH job link HUD commit link"

Those timeouts have resolved themselves.

@malfet malfet deleted the malfet-patch-28 branch December 12, 2024 22:29

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged no-runner-experiments Bypass Meta/LF runner determinator topic: not user facing topic category
