Opening this issue to describe and track tasks related to implementing nvcc support in sccache-dist.
tl;dr: sccache should add cicc and ptxas as first-class compilers, complete with support for hashing their inputs, caching their outputs, and distributing their work.
Background
sccache-dist mode relies on a compiler's ability to compile preprocessed input. The source file is preprocessed on the client, looked up in the cache, and on cache miss the toolchain + preprocessed file is sent to the sccache-dist scheduler for compilation.
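The flow described above (preprocess on the client, look up the result, compile remotely only on a miss) can be sketched roughly as follows. This is a minimal illustration, not sccache's actual hashing scheme or API: `cache_key`, `compile_cached`, `strip_comments`, and `fake_remote` are hypothetical names, the cache is a plain dict, and the "preprocessor" is a stand-in that only drops `//` comment lines.

```python
import hashlib


def cache_key(compiler_id, args, preprocessed):
    """Hash everything that can affect the output (illustrative scheme only)."""
    h = hashlib.sha256()
    for part in (compiler_id.encode(), "\0".join(args).encode(), preprocessed):
        h.update(len(part).to_bytes(8, "little"))  # length-prefix each field
        h.update(part)
    return h.hexdigest()


def compile_cached(cache, compiler_id, args, source, preprocess, compile_remote):
    """Return (object_bytes, was_cache_hit)."""
    preprocessed = preprocess(source)  # always runs on the client
    key = cache_key(compiler_id, args, preprocessed)
    if key in cache:
        return cache[key], True
    # Cache miss: only now is the preprocessed input sent off for compilation.
    obj = compile_remote(compiler_id, args, preprocessed)
    cache[key] = obj
    return obj, False


def strip_comments(source):
    # Stand-in for a real preprocessor: drops '//' comment lines.
    lines = [l for l in source.splitlines() if not l.lstrip().startswith(b"//")]
    return b"\n".join(lines)


remote_calls = []


def fake_remote(compiler_id, args, preprocessed):
    # Stand-in for a remote worker; records that it was invoked.
    remote_calls.append(compiler_id)
    return b"object:" + hashlib.sha256(preprocessed).digest()


cache = {}
_, first_hit = compile_cached(
    cache, "gcc-13", ["-O2"], b"int f();\n", strip_comments, fake_remote
)
_, second_hit = compile_cached(
    cache, "gcc-13", ["-O2"], b"// note\nint f();\n", strip_comments, fake_remote
)
```

In this sketch the second call differs only by a comment, so its preprocessed form and key match the first call and the remote compile is skipped.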
This model is not supported by NVIDIA's CUDA compiler nvcc, because nvcc lacks support for compiling preprocessed input. This is not a deficiency in nvcc; rather, the feature does not map onto what nvcc actually does under the hood.
A CUDA C++ file contains standard C++ (CPU-executed) code and CUDA device code side-by-side. Internally nvcc runs a number of preprocessor steps to separate this code into host and device code that are each compiled by different compilers. nvcc can also be instructed to compile the same CUDA device code for different architectures and bundle them into a "fat binary".
The preprocessor output for each device architecture is potentially different, thus there is no single preprocessed input file nvcc can produce that could be fed back into the compiler later. (A rough analogy would be gcc compiling and assembling a single object for both x86 and ARM that could be executed on either platform.)
Rather than attempt to trick nvcc into compiling preprocessed input, sccache can decompose and distribute nvcc's constituent sub-compiler invocations.
Proposal
sccache should add cicc and ptxas as first-class compilers, complete with support for hashing their inputs, caching their outputs, and distributing their work. sccache should change its nvcc compiler to run the underlying host and device compiler steps individually, with each step re-entering the standard hash + cache + (distributed) compilation pipeline.
sccache can do this by utilizing the nvcc --dryrun flag, which outputs the underlying calls executed by nvcc:
This output represents a sequence of preprocessing steps that must run on the sccache client, followed by compilation steps on the preprocessed result that can be distributed to sccache-dist workers.
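As a rough sketch of how the dry-run output could be consumed: nvcc prefixes each dry-run line with `#$ `, and emits both environment assignments and commands. The parser below separates the two and picks out the cicc and ptxas invocations. The sample text is hypothetical and heavily abbreviated (real output has many more flags and temporary file names), and `parse_dryrun`, `tool_name`, and `DEVICE_TOOLS` are names made up for this illustration, not sccache code.

```python
import re
import shlex
from pathlib import PurePath

# A line of the form NAME=value is treated as an environment assignment.
ENV_LINE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)=(.*)$")

# The two tools the proposal wants to treat as first-class compilers.
DEVICE_TOOLS = {"cicc", "ptxas"}


def parse_dryrun(text):
    """Split dry-run text into (env assignments, list of argv lists)."""
    env = {}
    commands = []
    for line in text.splitlines():
        if not line.startswith("#$ "):
            continue  # ignore anything that is not a dry-run line
        body = line[3:].strip()
        if not body:
            continue
        match = ENV_LINE.match(body)
        if match:
            env[match.group(1)] = match.group(2)
        else:
            commands.append(shlex.split(body))
    return env, commands


def tool_name(argv):
    """Executable name without its directory, e.g. /usr/bin/gcc -> gcc."""
    return PurePath(argv[0]).name


# Hypothetical, simplified stand-in for real `nvcc --dryrun` output.
SAMPLE = """\
#$ _NVVM_BRANCH_=nvvm
#$ gcc -E -x c++ -D__CUDA_ARCH__=600 x.cu -o x.cpp1.ii
#$ cicc -arch compute_60 x.cpp1.ii -o x.ptx
#$ ptxas -arch=sm_60 x.ptx -o x.sm_60.cubin
#$ fatbinary --create=x.fatbin
#$ gcc -c x.cudafe1.cpp -o x.o
"""

env, commands = parse_dryrun(SAMPLE)
device_steps = [argv for argv in commands if tool_name(argv) in DEVICE_TOOLS]
```

Keeping each command as an argv list (rather than a shell string) makes it straightforward to hand individual steps back to a hash + cache + distribute pipeline.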
Explanation
Here's a rough breakdown of the command stages above:
These two lines run the host preprocessor to resolve host-side macros and inline #includes, then run the CUDA front-end to separate the source into host and device source files. The sccache client should run both these steps before requesting any compilation jobs.
In this phase, nvcc:
1. Preprocesses the source file (x.cu)
2. Runs cicc on the output of step 1 to generate a .ptx file
3. Runs ptxas on the output of step 2 to assemble the PTX into a .cubin
All these steps must run sequentially. Step 1 must run on the sccache client, but 2 and 3 can be executed by sccache-dist workers.
This is similar to the prior commands, except for a different GPU arch sm_70. These commands must still run sequentially with respect to each other, but they can run in parallel to the commands from the prior stage.
In this stage, the outputs from the prior two stages are assembled into a .fatbin via the fatbinary invocation, then the original preprocessed host code is combined with the .fatbin and assembled into the final .o by the host compiler. These stages must run sequentially, but can be executed by sccache-dist workers (the final host compiler call can use the existing sccache-dist logic for preprocessing + distributing the work).
Additional Benefits
In addition to supporting sccache-dist in nvcc, this new behavior also benefits sccache clients that aren't configured to use distributed compilation, because sccache can now avoid recompiling the underlying .ptx and .cubin device compilation artifacts assembled into the final .o.
For example, a CI job could compile code for all supported device architectures:
The above produces an object file with a certain hash, let's call it hash_all.
A developer may want to compile the same code with the same options, but for a smaller subset of architectures that match the GPU on their machine:
Since the above produces an object file with a different hash (hash_subset), today sccache yields a cache miss on this .o file and re-runs nvcc (which itself runs cicc and ptxas) because the arguments + input don't match hash_all produced in CI.
However, with the proposed changes, while sccache would still yield a cache miss for the .o produced by the nvcc command, it would yield a cache hit on the underlying .ptx and .cubin files produced by cicc and ptxas respectively, skipping the lion's share of the actual compilation done by nvcc.
Tasks
Work is ongoing in this branch.
- cicc and ptxas as first-class compilers supported by sccache
- cicc and ptxas toolchains from client's CUDA toolkit
- nvcc compiler to call nvcc --dryrun, run each sub-command through sccache as appropriate
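To illustrate what "run each sub-command through sccache" could buy, here is a small simulation of the CI-then-developer scenario from the Additional Benefits section. Everything in it is an assumption made for the sketch: `StepCache`, `fake_tool`, and `build` are hypothetical names, the argument lists are placeholders rather than real cicc or ptxas flags, and no real tool is invoked (outputs are just digests of the inputs).

```python
import hashlib


def step_key(tool, args, data):
    """Cache key for one sub-command: tool name, arguments, and input bytes."""
    h = hashlib.sha256()
    for part in [tool.encode(), *[a.encode() for a in args], data]:
        h.update(len(part).to_bytes(8, "little"))
        h.update(part)
    return h.hexdigest()


def fake_tool(tool, args, data):
    # Stand-in for actually running cicc, ptxas, or the host compiler.
    return hashlib.sha256(step_key(tool, args, data).encode()).digest()


class StepCache:
    """Dict-backed cache that records which steps hit and which missed."""

    def __init__(self):
        self.store = {}
        self.hits = []
        self.misses = []

    def run(self, tool, args, data):
        key = step_key(tool, args, data)
        if key in self.store:
            self.hits.append(tool)
        else:
            self.misses.append(tool)
            self.store[key] = fake_tool(tool, args, data)
        return self.store[key]


def build(cache, source, archs):
    """Simulate one nvcc invocation decomposed into cached sub-steps."""
    cubins = []
    for arch in archs:
        # Stand-in for the per-architecture preprocessing done on the client.
        preprocessed = f"arch={arch};".encode() + source
        ptx = cache.run("cicc", ["placeholder-arch", str(arch)], preprocessed)
        cubins.append(cache.run("ptxas", ["placeholder-arch", str(arch)], ptx))
    fatbin = b"".join(cubins)  # stand-in for the fatbinary step
    return cache.run("host-cc", ["-c"], source + fatbin)


cache = StepCache()
source = b"__global__ void k() {}"

ci_object = build(cache, source, [60, 70, 80])  # CI: all architectures
misses_after_ci = len(cache.misses)

dev_object = build(cache, source, [70])  # developer: a subset
```

In this simulation the developer build still misses on the final object, because the bundled device code differs, but the cicc and ptxas steps for the shared architecture are served from the cache.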