Always use the ThinLTO pipeline for pre-link optimizations#153446
Always use the ThinLTO pipeline for pre-link optimizations#153446rust-bors[bot] merged 1 commit intorust-lang:mainfrom
Conversation
|
r? @cuviper rustbot has assigned @cuviper. Use Why was this reviewer chosen?The reviewer was selected based on:
|
|
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Always use the ThinLTO pipeline for pre-link optimizations
| NeedThinLTOBufferPasses = false; | ||
| break; | ||
| case LLVMRustOptStage::PreLinkFatLTO: | ||
| MPM = PB.buildLTOPreLinkDefaultPipeline(OptLevel); |
There was a problem hiding this comment.
Isn't this doing the opposite of what your PR description says this is doing?
There was a problem hiding this comment.
Oops. I had to redo these changes to split it out of my exploration of UnifiedLTO and messed up there.
This comment has been minimized.
This comment has been minimized.
When using cargo this was already effectively done for all dependencies as cargo passes -Clinker-plugin-lto without -Clto=fat/thin. -Clinker-plugin-lto assumes that ThinLTO will be used. The ThinLTO pre-link pipeline is faster than the fat LTO one. And according to the benchmarks in [1] there is barely any runtime performance difference between executables that used fat LTO with the fat vs ThinLTO pre-link pipeline. [1]: https://discourse.llvm.org/t/rfc-a-unified-lto-bitcode-frontend/61774
434027f to
71a31b3
Compare
|
Finished benchmarking commit (35cd5ab): comparison URL. Overall result: ❌✅ regressions and improvements - please read the text belowBenchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please do so in sufficient writing along with @bors rollup=never Instruction countOur most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage)Results (primary 0.2%, secondary -0.8%)A less reliable metric. May be of interest, but not used to determine the overall result above.
CyclesResults (primary 2.1%, secondary 1.7%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary sizeResults (primary -0.5%, secondary -1.1%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Bootstrap: 480.787s -> 492.506s (2.44%) |
|
The above perf run was accidentally using the fat LTO pre-link pipeline unconditionally. @bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Always use the ThinLTO pipeline for pre-link optimizations
This comment has been minimized.
This comment has been minimized.
|
Finished benchmarking commit (8df093d): comparison URL. Overall result: ❌✅ regressions and improvements - please read the text belowBenchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please do so in sufficient writing along with @bors rollup=never Instruction countOur most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage)Results (primary -0.1%, secondary 1.4%)A less reliable metric. May be of interest, but not used to determine the overall result above.
CyclesResults (primary -2.6%, secondary 3.7%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary sizeResults (primary -0.4%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Bootstrap: 479.735s -> 483.316s (0.75%) |
|
I think it is safe to say that always using the fat LTO pre-link pipeline is a terrible idea. Using the thin LTO pre-link pipeline unconditionally may have an ever so slight compile time regression of up to 0.2% aside from a large outlier on large-workspace (but that is a pretty noisy benchmark anyway). And for exa there is a 2.4% improvement. The runtime benchmarks don't show any noticable difference, but I don't know if any of them use fat LTO. Maybe someone has an application that uses fat LTO where performance is important to test this with. Ditto with a very space constrained application like for embedded devices. |
|
Seems reasonable to try, and easy to back out if we find problems. @bors r+ rollup |
…viper Always use the ThinLTO pipeline for pre-link optimizations When using cargo this was already effectively done for all dependencies as cargo passes -Clinker-plugin-lto without -Clto=fat/thin. -Clinker-plugin-lto assumes that ThinLTO will be used. The ThinLTO pre-link pipeline is faster than the fat LTO one. And according to the benchmarks in [^1] there is barely any runtime performance difference between executables that used fat LTO with the fat vs ThinLTO pre-link pipeline. This also helps avoid having yet another code path if we want to support Unified LTO (that is a single bitcode file that supports being used for both fat LTO and ThinLTO when using linker plugin LTO, we already support it when rustc does LTO as ThinLTO bitcode is enough of a superset of fat LTO bitcode that it happens to work by accident if you don't explicitly have a check preventing mixing of them for the current set of LTO features that rustc exposes.) I'm currently still investigating if rustc would benefit from Unified LTO and how exactly to integrate it. [^1]: https://discourse.llvm.org/t/rfc-a-unified-lto-bitcode-frontend/61774
Rollup of 4 pull requests Successful merges: - #153202 ([win] Fix truncated unwinds for Arm64 Windows) - #153437 (coretest in miri: fix using unstable libtest features) - #153446 (Always use the ThinLTO pipeline for pre-link optimizations) - #153548 (add test for closure precedence in `TokenStream`s)
Rollup merge of #153446 - bjorn3:llvm_pre_link_thinlto, r=cuviper Always use the ThinLTO pipeline for pre-link optimizations When using cargo this was already effectively done for all dependencies as cargo passes -Clinker-plugin-lto without -Clto=fat/thin. -Clinker-plugin-lto assumes that ThinLTO will be used. The ThinLTO pre-link pipeline is faster than the fat LTO one. And according to the benchmarks in [^1] there is barely any runtime performance difference between executables that used fat LTO with the fat vs ThinLTO pre-link pipeline. This also helps avoid having yet another code path if we want to support Unified LTO (that is a single bitcode file that supports being used for both fat LTO and ThinLTO when using linker plugin LTO, we already support it when rustc does LTO as ThinLTO bitcode is enough of a superset of fat LTO bitcode that it happens to work by accident if you don't explicitly have a check preventing mixing of them for the current set of LTO features that rustc exposes.) I'm currently still investigating if rustc would benefit from Unified LTO and how exactly to integrate it. [^1]: https://discourse.llvm.org/t/rfc-a-unified-lto-bitcode-frontend/61774

When using cargo this was already effectively done for all dependencies as cargo passes -Clinker-plugin-lto without -Clto=fat/thin. -Clinker-plugin-lto assumes that ThinLTO will be used. The ThinLTO pre-link pipeline is faster than the fat LTO one. And according to the benchmarks in 1 there is barely any runtime performance difference between executables that used fat LTO with the fat vs ThinLTO pre-link pipeline.
This also helps avoid having yet another code path if we want to support Unified LTO (that is a single bitcode file that supports being used for both fat LTO and ThinLTO when using linker plugin LTO, we already support it when rustc does LTO as ThinLTO bitcode is enough of a superset of fat LTO bitcode that it happens to work by accident if you don't explicitly have a check preventing mixing of them for the current set of LTO features that rustc exposes.) I'm currently still investigating if rustc would benefit from Unified LTO and how exactly to integrate it.
Footnotes
https://discourse.llvm.org/t/rfc-a-unified-lto-bitcode-frontend/61774 ↩