Skip to content

use Core.CachedGenerator if available#56

Merged
aviatesk merged 1 commit intomasterfrom
avi/use-cached-generator
Jun 28, 2024
Merged

use Core.CachedGenerator if available#56
aviatesk merged 1 commit intomasterfrom
avi/use-cached-generator

Conversation

@aviatesk
Copy link
Copy Markdown
Member

No description provided.

aviatesk added a commit to JuliaLang/julia that referenced this pull request Jun 24, 2024
In Cassette-like systems, where inference has to infer many calls of
`@generated` function and the generated function involves complex code
transformations, the overhead from code generation itself can become
significant. This is because the results of code generation are not
cached, leading to duplicated code generation in the following contexts:
- `method_for_inference_heuristics` for regular inference on cached
  `@generated` function calls (since `method_for_inference_limit_heuristics`
  isn't stored in cached optimized sources, but is attached to
  generated unoptimized sources).
- `retrieval_code_info` for constant propagation on cached `@generated`
  function calls.

Having said that, caching unoptimized sources generated by `@generated`
functions is not a good tradeoff in general cases, considering the
memory space consumed (and the image bloat). The code generation for
generators like `GeneratedFunctionStub` produced by the front end is
generally very simple, and the first duplicated code generation
mentioned above does not occur for `GeneratedFunctionStub`.

So this unoptimized source caching should be enabled in an opt-in manner.

Based on this idea, this commit defines the trait
`abstract type Core.CachedGenerator` as an interface for the external
systems to opt-in. If the generator is a subtype of this trait,
inference caches the generated unoptimized code, sacrificing memory
space to improve the performance of subsequent inferences.

In JuliaDebug/CassetteOverlay.jl#56, the following benchmark results
showed that approximately 1.5~3x inference speedup is achieved by opting
into this feature:

\## Setup
```julia
using CassetteOverlay, BaseBenchmarks, BenchmarkTools

@MethodTable table;
pass = @overlaypass table;
BaseBenchmarks.load!("inference");
benchfunc1() = sin(42)
benchfunc2(xs, x) = findall(>(x), abs.(xs))
interp = BaseBenchmarks.InferenceBenchmarks.InferenceBenchmarker()

\# benchmark inference on entire call graphs from scratch
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)

\# benchmark inference on the call graphs with most of them cached
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
```

\## Benchmark inference on entire call graphs from scratch
> on master
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
BenchmarkTools.Trial: 61 samples with 1 evaluation.
 Range (min … max):  78.574 ms … 87.653 ms  ┊ GC (min … max): 0.00% … 8.81%
 Time  (median):     83.149 ms              ┊ GC (median):    4.85%
 Time  (mean ± σ):   82.138 ms ±  2.366 ms  ┊ GC (mean ± σ):  3.36% ± 2.65%

  ▂ ▂▂ █     ▂                     █ ▅     ▅
  █▅██▅█▅▁█▁▁█▁▁▁▁▅▁▁▁▁▁▁▁▁▅▁▁▅██▅▅█████████▁█▁▅▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  78.6 ms         Histogram: frequency by time        86.8 ms <

 Memory estimate: 52.32 MiB, allocs estimate: 1201192.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.345 s …  1.369 s  ┊ GC (min … max): 2.45% … 3.39%
 Time  (median):     1.355 s             ┊ GC (median):    2.98%
 Time  (mean ± σ):   1.356 s ± 9.847 ms  ┊ GC (mean ± σ):  2.96% ± 0.41%

  █                   █     █                            █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.35 s        Histogram: frequency by time        1.37 s <

 Memory estimate: 637.96 MiB, allocs estimate: 15159639.
```
> with this PR
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
BenchmarkTools.Trial: 230 samples with 1 evaluation.
 Range (min … max):  19.339 ms … 82.521 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.938 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.665 ms ±  4.666 ms  ┊ GC (mean ± σ):  6.72% ± 8.80%

  ▃▇█▇▄                     ▂▂▃▃▄
  █████▇█▇▆▅▅▆▅▅▁▅▁▁▁▁▁▁▁▁▁██████▆▁█▁▅▇▆▁▅▁▁▅▁▅▁▁▁▁▁▁▅▁▁▁▁▁▁▅ ▆
  19.3 ms      Histogram: log(frequency) by time      29.4 ms <

 Memory estimate: 28.67 MiB, allocs estimate: 590138.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  354.585 ms … 390.400 ms  ┊ GC (min … max): 0.00% … 7.01%
 Time  (median):     368.778 ms               ┊ GC (median):    3.74%
 Time  (mean ± σ):   368.824 ms ±   8.853 ms  ┊ GC (mean ± σ):  3.70% ± 1.89%

             ▃            █
  ▇▁▁▁▁▁▁▁▁▁▁█▁▇▇▁▁▁▁▇▁▁▁▁█▁▁▁▁▇▁▁▇▁▁▇▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  355 ms           Histogram: frequency by time          390 ms <

 Memory estimate: 227.86 MiB, allocs estimate: 4689830.
```

\## Benchmark inference on the call graphs with most of them cached
> on master
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  45.166 μs …  9.799 ms  ┊ GC (min … max): 0.00% … 98.96%
 Time  (median):     46.792 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   48.339 μs ± 97.539 μs  ┊ GC (mean ± σ):  2.01% ±  0.99%

    ▁▂▄▆▆▇███▇▆▅▄▃▄▄▂▂▂▁▁▁   ▁▁▂▂▁ ▁ ▂▁ ▁                     ▃
  ▃▇██████████████████████▇████████████████▇█▆▇▇▆▆▆▅▆▆▆▇▆▅▅▅▆ █
  45.2 μs      Histogram: log(frequency) by time        55 μs <

 Memory estimate: 25.27 KiB, allocs estimate: 614.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  303.375 μs …  16.582 ms  ┊ GC (min … max): 0.00% … 97.38%
 Time  (median):     317.625 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   338.772 μs ± 274.164 μs  ┊ GC (mean ± σ):  5.44% ±  7.56%

       ▃▆██▇▅▂▁
  ▂▂▄▅██████████▇▆▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▂▁▁▂▁▂▂ ▃
  303 μs           Histogram: frequency by time          394 μs <

 Memory estimate: 412.80 KiB, allocs estimate: 6224.
```
> with this PR
```
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.444 μs …  1.808 ms  ┊ GC (min … max): 0.00% … 99.01%
 Time  (median):     5.694 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.228 μs ± 25.393 μs  ┊ GC (mean ± σ):  5.73% ±  1.40%

      ▄█▇▄
  ▁▂▄█████▇▄▃▃▃▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  5.44 μs        Histogram: frequency by time        7.47 μs <

 Memory estimate: 8.72 KiB, allocs estimate: 196.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  211.000 μs …  36.187 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     223.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   280.025 μs ± 750.097 μs  ┊ GC (mean ± σ):  6.86% ± 7.16%

  █▆▄▂▁                                                         ▁
  ███████▇▇▇▆▆▆▅▆▅▅▅▅▅▄▅▄▄▄▅▅▁▄▅▃▄▄▄▃▄▄▃▅▄▁▁▃▄▁▃▁▁▁▃▄▃▁▃▁▁▁▃▃▁▃ █
  211 μs        Histogram: log(frequency) by time       1.46 ms <

 Memory estimate: 374.17 KiB, allocs estimate: 5269.
```
aviatesk added a commit to JuliaLang/julia that referenced this pull request Jun 24, 2024
In Cassette-like systems, where inference has to infer many calls of
`@generated` function and the generated function involves complex code
transformations, the overhead from code generation itself can become
significant. This is because the results of code generation are not
cached, leading to duplicated code generation in the following contexts:
- `method_for_inference_heuristics` for regular inference on cached
  `@generated` function calls (since `method_for_inference_limit_heuristics`
  isn't stored in cached optimized sources, but is attached to
  generated unoptimized sources).
- `retrieval_code_info` for constant propagation on cached `@generated`
  function calls.

Having said that, caching unoptimized sources generated by `@generated`
functions is not a good tradeoff in general cases, considering the
memory space consumed (and the image bloat). The code generation for
generators like `GeneratedFunctionStub` produced by the front end is
generally very simple, and the first duplicated code generation
mentioned above does not occur for `GeneratedFunctionStub`.

So this unoptimized source caching should be enabled in an opt-in manner.

Based on this idea, this commit defines the trait
`abstract type Core.CachedGenerator` as an interface for the external
systems to opt-in. If the generator is a subtype of this trait,
inference caches the generated unoptimized code, sacrificing memory
space to improve the performance of subsequent inferences.
Specifically, the mechanism for caching the unoptimized source uses the
infrastructure already implemented in #54362. Thanks to
#54362, the cache for generated functions is now
partitioned by world age, so even if the unoptimized source is cached,
the existing invalidation system will invalidate it as expected.

In JuliaDebug/CassetteOverlay.jl#56, the following benchmark results
showed that approximately 1.5~3x inference speedup is achieved by opting
into this feature:

\## Setup
```julia
using CassetteOverlay, BaseBenchmarks, BenchmarkTools

@MethodTable table;
pass = @overlaypass table;
BaseBenchmarks.load!("inference");
benchfunc1() = sin(42)
benchfunc2(xs, x) = findall(>(x), abs.(xs))
interp = BaseBenchmarks.InferenceBenchmarks.InferenceBenchmarker()

\# benchmark inference on entire call graphs from scratch
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)

\# benchmark inference on the call graphs with most of them cached
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
```

\## Benchmark inference on entire call graphs from scratch
> on master
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
BenchmarkTools.Trial: 61 samples with 1 evaluation.
 Range (min … max):  78.574 ms … 87.653 ms  ┊ GC (min … max): 0.00% … 8.81%
 Time  (median):     83.149 ms              ┊ GC (median):    4.85%
 Time  (mean ± σ):   82.138 ms ±  2.366 ms  ┊ GC (mean ± σ):  3.36% ± 2.65%

  ▂ ▂▂ █     ▂                     █ ▅     ▅
  █▅██▅█▅▁█▁▁█▁▁▁▁▅▁▁▁▁▁▁▁▁▅▁▁▅██▅▅█████████▁█▁▅▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  78.6 ms         Histogram: frequency by time        86.8 ms <

 Memory estimate: 52.32 MiB, allocs estimate: 1201192.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.345 s …  1.369 s  ┊ GC (min … max): 2.45% … 3.39%
 Time  (median):     1.355 s             ┊ GC (median):    2.98%
 Time  (mean ± σ):   1.356 s ± 9.847 ms  ┊ GC (mean ± σ):  2.96% ± 0.41%

  █                   █     █                            █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.35 s        Histogram: frequency by time        1.37 s <

 Memory estimate: 637.96 MiB, allocs estimate: 15159639.
```
> with this PR
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
BenchmarkTools.Trial: 230 samples with 1 evaluation.
 Range (min … max):  19.339 ms … 82.521 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.938 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.665 ms ±  4.666 ms  ┊ GC (mean ± σ):  6.72% ± 8.80%

  ▃▇█▇▄                     ▂▂▃▃▄
  █████▇█▇▆▅▅▆▅▅▁▅▁▁▁▁▁▁▁▁▁██████▆▁█▁▅▇▆▁▅▁▁▅▁▅▁▁▁▁▁▁▅▁▁▁▁▁▁▅ ▆
  19.3 ms      Histogram: log(frequency) by time      29.4 ms <

 Memory estimate: 28.67 MiB, allocs estimate: 590138.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  354.585 ms … 390.400 ms  ┊ GC (min … max): 0.00% … 7.01%
 Time  (median):     368.778 ms               ┊ GC (median):    3.74%
 Time  (mean ± σ):   368.824 ms ±   8.853 ms  ┊ GC (mean ± σ):  3.70% ± 1.89%

             ▃            █
  ▇▁▁▁▁▁▁▁▁▁▁█▁▇▇▁▁▁▁▇▁▁▁▁█▁▁▁▁▇▁▁▇▁▁▇▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  355 ms           Histogram: frequency by time          390 ms <

 Memory estimate: 227.86 MiB, allocs estimate: 4689830.
```

\## Benchmark inference on the call graphs with most of them cached
> on master
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  45.166 μs …  9.799 ms  ┊ GC (min … max): 0.00% … 98.96%
 Time  (median):     46.792 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   48.339 μs ± 97.539 μs  ┊ GC (mean ± σ):  2.01% ±  0.99%

    ▁▂▄▆▆▇███▇▆▅▄▃▄▄▂▂▂▁▁▁   ▁▁▂▂▁ ▁ ▂▁ ▁                     ▃
  ▃▇██████████████████████▇████████████████▇█▆▇▇▆▆▆▅▆▆▆▇▆▅▅▅▆ █
  45.2 μs      Histogram: log(frequency) by time        55 μs <

 Memory estimate: 25.27 KiB, allocs estimate: 614.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  303.375 μs …  16.582 ms  ┊ GC (min … max): 0.00% … 97.38%
 Time  (median):     317.625 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   338.772 μs ± 274.164 μs  ┊ GC (mean ± σ):  5.44% ±  7.56%

       ▃▆██▇▅▂▁
  ▂▂▄▅██████████▇▆▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▂▁▁▂▁▂▂ ▃
  303 μs           Histogram: frequency by time          394 μs <

 Memory estimate: 412.80 KiB, allocs estimate: 6224.
```
> with this PR
```
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.444 μs …  1.808 ms  ┊ GC (min … max): 0.00% … 99.01%
 Time  (median):     5.694 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.228 μs ± 25.393 μs  ┊ GC (mean ± σ):  5.73% ±  1.40%

      ▄█▇▄
  ▁▂▄█████▇▄▃▃▃▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  5.44 μs        Histogram: frequency by time        7.47 μs <

 Memory estimate: 8.72 KiB, allocs estimate: 196.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  211.000 μs …  36.187 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     223.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   280.025 μs ± 750.097 μs  ┊ GC (mean ± σ):  6.86% ± 7.16%

  █▆▄▂▁                                                         ▁
  ███████▇▇▇▆▆▆▅▆▅▅▅▅▅▄▅▄▄▄▅▅▁▄▅▃▄▄▄▃▄▄▃▅▄▁▁▃▄▁▃▁▁▁▃▄▃▁▃▁▁▁▃▃▁▃ █
  211 μs        Histogram: log(frequency) by time       1.46 ms <

 Memory estimate: 374.17 KiB, allocs estimate: 5269.
```
aviatesk added a commit to JuliaLang/julia that referenced this pull request Jun 24, 2024
In Cassette-like systems, where inference has to infer many calls of
`@generated` function and the generated function involves complex code
transformations, the overhead from code generation itself can become
significant. This is because the results of code generation are not
cached, leading to duplicated code generation in the following contexts:
- `method_for_inference_heuristics` for regular inference on cached
  `@generated` function calls (since `method_for_inference_limit_heuristics`
  isn't stored in cached optimized sources, but is attached to
  generated unoptimized sources).
- `retrieval_code_info` for constant propagation on cached `@generated`
  function calls.

Having said that, caching unoptimized sources generated by `@generated`
functions is not a good tradeoff in general cases, considering the
memory space consumed (and the image bloat). The code generation for
generators like `GeneratedFunctionStub` produced by the front end is
generally very simple, and the first duplicated code generation
mentioned above does not occur for `GeneratedFunctionStub`.

So this unoptimized source caching should be enabled in an opt-in manner.

Based on this idea, this commit defines the trait
`abstract type Core.CachedGenerator` as an interface for the external
systems to opt-in. If the generator is a subtype of this trait,
inference caches the generated unoptimized code, sacrificing memory
space to improve the performance of subsequent inferences.
Specifically, the mechanism for caching the unoptimized source uses the
infrastructure already implemented in #54362. Thanks to
#54362, the cache for generated functions is now
partitioned by world age, so even if the unoptimized source is cached,
the existing invalidation system will invalidate it as expected.

In JuliaDebug/CassetteOverlay.jl#56, the following benchmark results
showed that approximately 1.5~3x inference speedup is achieved by opting
into this feature:

\## Setup
```julia
using CassetteOverlay, BaseBenchmarks, BenchmarkTools

@MethodTable table;
pass = @overlaypass table;
BaseBenchmarks.load!("inference");
benchfunc1() = sin(42)
benchfunc2(xs, x) = findall(>(x), abs.(xs))
interp = BaseBenchmarks.InferenceBenchmarks.InferenceBenchmarker()

\# benchmark inference on entire call graphs from scratch
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)

\# benchmark inference on the call graphs with most of them cached
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
```

\## Benchmark inference on entire call graphs from scratch
> on master
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
BenchmarkTools.Trial: 61 samples with 1 evaluation.
 Range (min … max):  78.574 ms … 87.653 ms  ┊ GC (min … max): 0.00% … 8.81%
 Time  (median):     83.149 ms              ┊ GC (median):    4.85%
 Time  (mean ± σ):   82.138 ms ±  2.366 ms  ┊ GC (mean ± σ):  3.36% ± 2.65%

  ▂ ▂▂ █     ▂                     █ ▅     ▅
  █▅██▅█▅▁█▁▁█▁▁▁▁▅▁▁▁▁▁▁▁▁▅▁▁▅██▅▅█████████▁█▁▅▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  78.6 ms         Histogram: frequency by time        86.8 ms <

 Memory estimate: 52.32 MiB, allocs estimate: 1201192.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.345 s …  1.369 s  ┊ GC (min … max): 2.45% … 3.39%
 Time  (median):     1.355 s             ┊ GC (median):    2.98%
 Time  (mean ± σ):   1.356 s ± 9.847 ms  ┊ GC (mean ± σ):  2.96% ± 0.41%

  █                   █     █                            █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.35 s        Histogram: frequency by time        1.37 s <

 Memory estimate: 637.96 MiB, allocs estimate: 15159639.
```
> with this PR
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
BenchmarkTools.Trial: 230 samples with 1 evaluation.
 Range (min … max):  19.339 ms … 82.521 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.938 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.665 ms ±  4.666 ms  ┊ GC (mean ± σ):  6.72% ± 8.80%

  ▃▇█▇▄                     ▂▂▃▃▄
  █████▇█▇▆▅▅▆▅▅▁▅▁▁▁▁▁▁▁▁▁██████▆▁█▁▅▇▆▁▅▁▁▅▁▅▁▁▁▁▁▁▅▁▁▁▁▁▁▅ ▆
  19.3 ms      Histogram: log(frequency) by time      29.4 ms <

 Memory estimate: 28.67 MiB, allocs estimate: 590138.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  354.585 ms … 390.400 ms  ┊ GC (min … max): 0.00% … 7.01%
 Time  (median):     368.778 ms               ┊ GC (median):    3.74%
 Time  (mean ± σ):   368.824 ms ±   8.853 ms  ┊ GC (mean ± σ):  3.70% ± 1.89%

             ▃            █
  ▇▁▁▁▁▁▁▁▁▁▁█▁▇▇▁▁▁▁▇▁▁▁▁█▁▁▁▁▇▁▁▇▁▁▇▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  355 ms           Histogram: frequency by time          390 ms <

 Memory estimate: 227.86 MiB, allocs estimate: 4689830.
```

\## Benchmark inference on the call graphs with most of them cached
> on master
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  45.166 μs …  9.799 ms  ┊ GC (min … max): 0.00% … 98.96%
 Time  (median):     46.792 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   48.339 μs ± 97.539 μs  ┊ GC (mean ± σ):  2.01% ±  0.99%

    ▁▂▄▆▆▇███▇▆▅▄▃▄▄▂▂▂▁▁▁   ▁▁▂▂▁ ▁ ▂▁ ▁                     ▃
  ▃▇██████████████████████▇████████████████▇█▆▇▇▆▆▆▅▆▆▆▇▆▅▅▅▆ █
  45.2 μs      Histogram: log(frequency) by time        55 μs <

 Memory estimate: 25.27 KiB, allocs estimate: 614.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  303.375 μs …  16.582 ms  ┊ GC (min … max): 0.00% … 97.38%
 Time  (median):     317.625 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   338.772 μs ± 274.164 μs  ┊ GC (mean ± σ):  5.44% ±  7.56%

       ▃▆██▇▅▂▁
  ▂▂▄▅██████████▇▆▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▂▁▁▂▁▂▂ ▃
  303 μs           Histogram: frequency by time          394 μs <

 Memory estimate: 412.80 KiB, allocs estimate: 6224.
```
> with this PR
```
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.444 μs …  1.808 ms  ┊ GC (min … max): 0.00% … 99.01%
 Time  (median):     5.694 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.228 μs ± 25.393 μs  ┊ GC (mean ± σ):  5.73% ±  1.40%

      ▄█▇▄
  ▁▂▄█████▇▄▃▃▃▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  5.44 μs        Histogram: frequency by time        7.47 μs <

 Memory estimate: 8.72 KiB, allocs estimate: 196.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  211.000 μs …  36.187 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     223.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   280.025 μs ± 750.097 μs  ┊ GC (mean ± σ):  6.86% ± 7.16%

  █▆▄▂▁                                                         ▁
  ███████▇▇▇▆▆▆▅▆▅▅▅▅▅▄▅▄▄▄▅▅▁▄▅▃▄▄▄▃▄▄▃▅▄▁▁▃▄▁▃▁▁▁▃▄▃▁▃▁▁▁▃▃▁▃ █
  211 μs        Histogram: log(frequency) by time       1.46 ms <

 Memory estimate: 374.17 KiB, allocs estimate: 5269.
```
@aviatesk aviatesk force-pushed the avi/use-cached-generator branch from 2ea19d7 to 5a4d24d Compare June 25, 2024 04:12
aviatesk added a commit to JuliaLang/julia that referenced this pull request Jun 25, 2024
In Cassette-like systems, where inference has to infer many calls of
`@generated` function and the generated function involves complex code
transformations, the overhead from code generation itself can become
significant. This is because the results of code generation are not
cached, leading to duplicated code generation in the following contexts:
- `method_for_inference_heuristics` for regular inference on cached
  `@generated` function calls (since `method_for_inference_limit_heuristics`
  isn't stored in cached optimized sources, but is attached to
  generated unoptimized sources).
- `retrieval_code_info` for constant propagation on cached `@generated`
  function calls.

Having said that, caching unoptimized sources generated by `@generated`
functions is not a good tradeoff in general cases, considering the
memory space consumed (and the image bloat). The code generation for
generators like `GeneratedFunctionStub` produced by the front end is
generally very simple, and the first duplicated code generation
mentioned above does not occur for `GeneratedFunctionStub`.

So this unoptimized source caching should be enabled in an opt-in manner.

Based on this idea, this commit defines the trait
`abstract type Core.CachedGenerator` as an interface for the external
systems to opt-in. If the generator is a subtype of this trait,
inference caches the generated unoptimized code, sacrificing memory
space to improve the performance of subsequent inferences.
Specifically, the mechanism for caching the unoptimized source uses the
infrastructure already implemented in #54362. Thanks to
#54362, the cache for generated functions is now
partitioned by world age, so even if the unoptimized source is cached,
the existing invalidation system will invalidate it as expected.

In JuliaDebug/CassetteOverlay.jl#56, the following benchmark results
showed that approximately 1.5~3x inference speedup is achieved by opting
into this feature:

\## Setup
```julia
using CassetteOverlay, BaseBenchmarks, BenchmarkTools

@MethodTable table;
pass = @overlaypass table;
BaseBenchmarks.load!("inference");
benchfunc1() = sin(42)
benchfunc2(xs, x) = findall(>(x), abs.(xs))
interp = BaseBenchmarks.InferenceBenchmarks.InferenceBenchmarker()

\# benchmark inference on entire call graphs from scratch
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)

\# benchmark inference on the call graphs with most of them cached
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
```

\## Benchmark inference on entire call graphs from scratch
> on master
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
BenchmarkTools.Trial: 61 samples with 1 evaluation.
 Range (min … max):  78.574 ms … 87.653 ms  ┊ GC (min … max): 0.00% … 8.81%
 Time  (median):     83.149 ms              ┊ GC (median):    4.85%
 Time  (mean ± σ):   82.138 ms ±  2.366 ms  ┊ GC (mean ± σ):  3.36% ± 2.65%

  ▂ ▂▂ █     ▂                     █ ▅     ▅
  █▅██▅█▅▁█▁▁█▁▁▁▁▅▁▁▁▁▁▁▁▁▅▁▁▅██▅▅█████████▁█▁▅▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  78.6 ms         Histogram: frequency by time        86.8 ms <

 Memory estimate: 52.32 MiB, allocs estimate: 1201192.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.345 s …  1.369 s  ┊ GC (min … max): 2.45% … 3.39%
 Time  (median):     1.355 s             ┊ GC (median):    2.98%
 Time  (mean ± σ):   1.356 s ± 9.847 ms  ┊ GC (mean ± σ):  2.96% ± 0.41%

  █                   █     █                            █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.35 s        Histogram: frequency by time        1.37 s <

 Memory estimate: 637.96 MiB, allocs estimate: 15159639.
```
> with this PR
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
BenchmarkTools.Trial: 230 samples with 1 evaluation.
 Range (min … max):  19.339 ms … 82.521 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.938 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.665 ms ±  4.666 ms  ┊ GC (mean ± σ):  6.72% ± 8.80%

  ▃▇█▇▄                     ▂▂▃▃▄
  █████▇█▇▆▅▅▆▅▅▁▅▁▁▁▁▁▁▁▁▁██████▆▁█▁▅▇▆▁▅▁▁▅▁▅▁▁▁▁▁▁▅▁▁▁▁▁▁▅ ▆
  19.3 ms      Histogram: log(frequency) by time      29.4 ms <

 Memory estimate: 28.67 MiB, allocs estimate: 590138.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  354.585 ms … 390.400 ms  ┊ GC (min … max): 0.00% … 7.01%
 Time  (median):     368.778 ms               ┊ GC (median):    3.74%
 Time  (mean ± σ):   368.824 ms ±   8.853 ms  ┊ GC (mean ± σ):  3.70% ± 1.89%

             ▃            █
  ▇▁▁▁▁▁▁▁▁▁▁█▁▇▇▁▁▁▁▇▁▁▁▁█▁▁▁▁▇▁▁▇▁▁▇▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  355 ms           Histogram: frequency by time          390 ms <

 Memory estimate: 227.86 MiB, allocs estimate: 4689830.
```

\## Benchmark inference on the call graphs with most of them cached
> on master
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  45.166 μs …  9.799 ms  ┊ GC (min … max): 0.00% … 98.96%
 Time  (median):     46.792 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   48.339 μs ± 97.539 μs  ┊ GC (mean ± σ):  2.01% ±  0.99%

    ▁▂▄▆▆▇███▇▆▅▄▃▄▄▂▂▂▁▁▁   ▁▁▂▂▁ ▁ ▂▁ ▁                     ▃
  ▃▇██████████████████████▇████████████████▇█▆▇▇▆▆▆▅▆▆▆▇▆▅▅▅▆ █
  45.2 μs      Histogram: log(frequency) by time        55 μs <

 Memory estimate: 25.27 KiB, allocs estimate: 614.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  303.375 μs …  16.582 ms  ┊ GC (min … max): 0.00% … 97.38%
 Time  (median):     317.625 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   338.772 μs ± 274.164 μs  ┊ GC (mean ± σ):  5.44% ±  7.56%

       ▃▆██▇▅▂▁
  ▂▂▄▅██████████▇▆▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▂▁▁▂▁▂▂ ▃
  303 μs           Histogram: frequency by time          394 μs <

 Memory estimate: 412.80 KiB, allocs estimate: 6224.
```
> with this PR
```
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.444 μs …  1.808 ms  ┊ GC (min … max): 0.00% … 99.01%
 Time  (median):     5.694 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.228 μs ± 25.393 μs  ┊ GC (mean ± σ):  5.73% ±  1.40%

      ▄█▇▄
  ▁▂▄█████▇▄▃▃▃▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  5.44 μs        Histogram: frequency by time        7.47 μs <

 Memory estimate: 8.72 KiB, allocs estimate: 196.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  211.000 μs …  36.187 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     223.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   280.025 μs ± 750.097 μs  ┊ GC (mean ± σ):  6.86% ± 7.16%

  █▆▄▂▁                                                         ▁
  ███████▇▇▇▆▆▆▅▆▅▅▅▅▅▄▅▄▄▄▅▅▁▄▅▃▄▄▄▃▄▄▃▅▄▁▁▃▄▁▃▁▁▁▃▄▃▁▃▁▁▁▃▃▁▃ █
  211 μs        Histogram: log(frequency) by time       1.46 ms <

 Memory estimate: 374.17 KiB, allocs estimate: 5269.
```
@aviatesk aviatesk force-pushed the avi/use-cached-generator branch from 5a4d24d to 0752fbe Compare June 25, 2024 08:26
aviatesk added a commit to JuliaLang/julia that referenced this pull request Jun 25, 2024
In Cassette-like systems, where inference has to infer many calls of
`@generated` function and the generated function involves complex code
transformations, the overhead from code generation itself can become
significant. This is because the results of code generation are not
cached, leading to duplicated code generation in the following contexts:
- `method_for_inference_heuristics` for regular inference on cached
  `@generated` function calls (since `method_for_inference_limit_heuristics`
  isn't stored in cached optimized sources, but is attached to
  generated unoptimized sources).
- `retrieval_code_info` for constant propagation on cached `@generated`
  function calls.

Having said that, caching unoptimized sources generated by `@generated`
functions is not a good tradeoff in general cases, considering the
memory space consumed (and the image bloat). The code generation for
generators like `GeneratedFunctionStub` produced by the front end is
generally very simple, and the first duplicated code generation
mentioned above does not occur for `GeneratedFunctionStub`.

So this unoptimized source caching should be enabled in an opt-in manner.

Based on this idea, this commit defines the trait
`abstract type Core.CachedGenerator` as an interface for the external
systems to opt-in. If the generator is a subtype of this trait,
inference caches the generated unoptimized code, sacrificing memory
space to improve the performance of subsequent inferences.
Specifically, the mechanism for caching the unoptimized source uses the
infrastructure already implemented in #54362. Thanks to
#54362, the cache for generated functions is now
partitioned by world age, so even if the unoptimized source is cached,
the existing invalidation system will invalidate it as expected.

In JuliaDebug/CassetteOverlay.jl#56, the following benchmark results
showed that approximately 1.5~3x inference speedup is achieved by opting
into this feature:

\## Setup
```julia
using CassetteOverlay, BaseBenchmarks, BenchmarkTools

@MethodTable table;
pass = @overlaypass table;
BaseBenchmarks.load!("inference");
benchfunc1() = sin(42)
benchfunc2(xs, x) = findall(>(x), abs.(xs))
interp = BaseBenchmarks.InferenceBenchmarks.InferenceBenchmarker()

\# benchmark inference on entire call graphs from scratch
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)

\# benchmark inference on the call graphs with most of them cached
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
```

\## Benchmark inference on entire call graphs from scratch
> on master
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
BenchmarkTools.Trial: 61 samples with 1 evaluation.
 Range (min … max):  78.574 ms … 87.653 ms  ┊ GC (min … max): 0.00% … 8.81%
 Time  (median):     83.149 ms              ┊ GC (median):    4.85%
 Time  (mean ± σ):   82.138 ms ±  2.366 ms  ┊ GC (mean ± σ):  3.36% ± 2.65%

  ▂ ▂▂ █     ▂                     █ ▅     ▅
  █▅██▅█▅▁█▁▁█▁▁▁▁▅▁▁▁▁▁▁▁▁▅▁▁▅██▅▅█████████▁█▁▅▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  78.6 ms         Histogram: frequency by time        86.8 ms <

 Memory estimate: 52.32 MiB, allocs estimate: 1201192.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.345 s …  1.369 s  ┊ GC (min … max): 2.45% … 3.39%
 Time  (median):     1.355 s             ┊ GC (median):    2.98%
 Time  (mean ± σ):   1.356 s ± 9.847 ms  ┊ GC (mean ± σ):  2.96% ± 0.41%

  █                   █     █                            █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.35 s        Histogram: frequency by time        1.37 s <

 Memory estimate: 637.96 MiB, allocs estimate: 15159639.
```
> with this PR
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
BenchmarkTools.Trial: 230 samples with 1 evaluation.
 Range (min … max):  19.339 ms … 82.521 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.938 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.665 ms ±  4.666 ms  ┊ GC (mean ± σ):  6.72% ± 8.80%

  ▃▇█▇▄                     ▂▂▃▃▄
  █████▇█▇▆▅▅▆▅▅▁▅▁▁▁▁▁▁▁▁▁██████▆▁█▁▅▇▆▁▅▁▁▅▁▅▁▁▁▁▁▁▅▁▁▁▁▁▁▅ ▆
  19.3 ms      Histogram: log(frequency) by time      29.4 ms <

 Memory estimate: 28.67 MiB, allocs estimate: 590138.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  354.585 ms … 390.400 ms  ┊ GC (min … max): 0.00% … 7.01%
 Time  (median):     368.778 ms               ┊ GC (median):    3.74%
 Time  (mean ± σ):   368.824 ms ±   8.853 ms  ┊ GC (mean ± σ):  3.70% ± 1.89%

             ▃            █
  ▇▁▁▁▁▁▁▁▁▁▁█▁▇▇▁▁▁▁▇▁▁▁▁█▁▁▁▁▇▁▁▇▁▁▇▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  355 ms           Histogram: frequency by time          390 ms <

 Memory estimate: 227.86 MiB, allocs estimate: 4689830.
```

\## Benchmark inference on the call graphs with most of them cached
> on master
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  45.166 μs …  9.799 ms  ┊ GC (min … max): 0.00% … 98.96%
 Time  (median):     46.792 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   48.339 μs ± 97.539 μs  ┊ GC (mean ± σ):  2.01% ±  0.99%

    ▁▂▄▆▆▇███▇▆▅▄▃▄▄▂▂▂▁▁▁   ▁▁▂▂▁ ▁ ▂▁ ▁                     ▃
  ▃▇██████████████████████▇████████████████▇█▆▇▇▆▆▆▅▆▆▆▇▆▅▅▅▆ █
  45.2 μs      Histogram: log(frequency) by time        55 μs <

 Memory estimate: 25.27 KiB, allocs estimate: 614.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  303.375 μs …  16.582 ms  ┊ GC (min … max): 0.00% … 97.38%
 Time  (median):     317.625 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   338.772 μs ± 274.164 μs  ┊ GC (mean ± σ):  5.44% ±  7.56%

       ▃▆██▇▅▂▁
  ▂▂▄▅██████████▇▆▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▂▁▁▂▁▂▂ ▃
  303 μs           Histogram: frequency by time          394 μs <

 Memory estimate: 412.80 KiB, allocs estimate: 6224.
```
> with this PR
```
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.444 μs …  1.808 ms  ┊ GC (min … max): 0.00% … 99.01%
 Time  (median):     5.694 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.228 μs ± 25.393 μs  ┊ GC (mean ± σ):  5.73% ±  1.40%

      ▄█▇▄
  ▁▂▄█████▇▄▃▃▃▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  5.44 μs        Histogram: frequency by time        7.47 μs <

 Memory estimate: 8.72 KiB, allocs estimate: 196.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  211.000 μs …  36.187 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     223.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   280.025 μs ± 750.097 μs  ┊ GC (mean ± σ):  6.86% ± 7.16%

  █▆▄▂▁                                                         ▁
  ███████▇▇▇▆▆▆▅▆▅▅▅▅▅▄▅▄▄▄▅▅▁▄▅▃▄▄▄▃▄▄▃▅▄▁▁▃▄▁▃▁▁▁▃▄▃▁▃▁▁▁▃▃▁▃ █
  211 μs        Histogram: log(frequency) by time       1.46 ms <

 Memory estimate: 374.17 KiB, allocs estimate: 5269.
```
aviatesk added a commit to JuliaLang/julia that referenced this pull request Jun 26, 2024
In Cassette-like systems, where inference has to infer many calls of
`@generated` function and the generated function involves complex code
transformations, the overhead from code generation itself can become
significant. This is because the results of code generation are not
cached, leading to duplicated code generation in the following contexts:
- `method_for_inference_heuristics` for regular inference on cached
  `@generated` function calls (since `method_for_inference_limit_heuristics`
  isn't stored in cached optimized sources, but is attached to
  generated unoptimized sources).
- `retrieval_code_info` for constant propagation on cached `@generated`
  function calls.

Having said that, caching unoptimized sources generated by `@generated`
functions is not a good tradeoff in general cases, considering the
memory space consumed (and the image bloat). The code generation for
generators like `GeneratedFunctionStub` produced by the front end is
generally very simple, and the first duplicated code generation
mentioned above does not occur for `GeneratedFunctionStub`.

So this unoptimized source caching should be enabled in an opt-in manner.

Based on this idea, this commit defines the trait
`abstract type Core.CachedGenerator` as an interface for the external
systems to opt-in. If the generator is a subtype of this trait,
inference caches the generated unoptimized code, sacrificing memory
space to improve the performance of subsequent inferences.
Specifically, the mechanism for caching the unoptimized source uses the
infrastructure already implemented in #54362. Thanks to
#54362, the cache for generated functions is now
partitioned by world age, so even if the unoptimized source is cached,
the existing invalidation system will invalidate it as expected.

In JuliaDebug/CassetteOverlay.jl#56, the following benchmark results
showed that approximately 1.5~3x inference speedup is achieved by opting
into this feature:

\## Setup
```julia
using CassetteOverlay, BaseBenchmarks, BenchmarkTools

@MethodTable table;
pass = @overlaypass table;
BaseBenchmarks.load!("inference");
benchfunc1() = sin(42)
benchfunc2(xs, x) = findall(>(x), abs.(xs))
interp = BaseBenchmarks.InferenceBenchmarks.InferenceBenchmarker()

\# benchmark inference on entire call graphs from scratch
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)

\# benchmark inference on the call graphs with most of them cached
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
```

\## Benchmark inference on entire call graphs from scratch
> on master
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
BenchmarkTools.Trial: 61 samples with 1 evaluation.
 Range (min … max):  78.574 ms … 87.653 ms  ┊ GC (min … max): 0.00% … 8.81%
 Time  (median):     83.149 ms              ┊ GC (median):    4.85%
 Time  (mean ± σ):   82.138 ms ±  2.366 ms  ┊ GC (mean ± σ):  3.36% ± 2.65%

  ▂ ▂▂ █     ▂                     █ ▅     ▅
  █▅██▅█▅▁█▁▁█▁▁▁▁▅▁▁▁▁▁▁▁▁▅▁▁▅██▅▅█████████▁█▁▅▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  78.6 ms         Histogram: frequency by time        86.8 ms <

 Memory estimate: 52.32 MiB, allocs estimate: 1201192.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.345 s …  1.369 s  ┊ GC (min … max): 2.45% … 3.39%
 Time  (median):     1.355 s             ┊ GC (median):    2.98%
 Time  (mean ± σ):   1.356 s ± 9.847 ms  ┊ GC (mean ± σ):  2.96% ± 0.41%

  █                   █     █                            █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.35 s        Histogram: frequency by time        1.37 s <

 Memory estimate: 637.96 MiB, allocs estimate: 15159639.
```
> with this PR
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
BenchmarkTools.Trial: 230 samples with 1 evaluation.
 Range (min … max):  19.339 ms … 82.521 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.938 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.665 ms ±  4.666 ms  ┊ GC (mean ± σ):  6.72% ± 8.80%

  ▃▇█▇▄                     ▂▂▃▃▄
  █████▇█▇▆▅▅▆▅▅▁▅▁▁▁▁▁▁▁▁▁██████▆▁█▁▅▇▆▁▅▁▁▅▁▅▁▁▁▁▁▁▅▁▁▁▁▁▁▅ ▆
  19.3 ms      Histogram: log(frequency) by time      29.4 ms <

 Memory estimate: 28.67 MiB, allocs estimate: 590138.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  354.585 ms … 390.400 ms  ┊ GC (min … max): 0.00% … 7.01%
 Time  (median):     368.778 ms               ┊ GC (median):    3.74%
 Time  (mean ± σ):   368.824 ms ±   8.853 ms  ┊ GC (mean ± σ):  3.70% ± 1.89%

             ▃            █
  ▇▁▁▁▁▁▁▁▁▁▁█▁▇▇▁▁▁▁▇▁▁▁▁█▁▁▁▁▇▁▁▇▁▁▇▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  355 ms           Histogram: frequency by time          390 ms <

 Memory estimate: 227.86 MiB, allocs estimate: 4689830.
```

\## Benchmark inference on the call graphs with most of them cached
> on master
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  45.166 μs …  9.799 ms  ┊ GC (min … max): 0.00% … 98.96%
 Time  (median):     46.792 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   48.339 μs ± 97.539 μs  ┊ GC (mean ± σ):  2.01% ±  0.99%

    ▁▂▄▆▆▇███▇▆▅▄▃▄▄▂▂▂▁▁▁   ▁▁▂▂▁ ▁ ▂▁ ▁                     ▃
  ▃▇██████████████████████▇████████████████▇█▆▇▇▆▆▆▅▆▆▆▇▆▅▅▅▆ █
  45.2 μs      Histogram: log(frequency) by time        55 μs <

 Memory estimate: 25.27 KiB, allocs estimate: 614.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  303.375 μs …  16.582 ms  ┊ GC (min … max): 0.00% … 97.38%
 Time  (median):     317.625 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   338.772 μs ± 274.164 μs  ┊ GC (mean ± σ):  5.44% ±  7.56%

       ▃▆██▇▅▂▁
  ▂▂▄▅██████████▇▆▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▂▁▁▂▁▂▂ ▃
  303 μs           Histogram: frequency by time          394 μs <

 Memory estimate: 412.80 KiB, allocs estimate: 6224.
```
> with this PR
```
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.444 μs …  1.808 ms  ┊ GC (min … max): 0.00% … 99.01%
 Time  (median):     5.694 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.228 μs ± 25.393 μs  ┊ GC (mean ± σ):  5.73% ±  1.40%

      ▄█▇▄
  ▁▂▄█████▇▄▃▃▃▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  5.44 μs        Histogram: frequency by time        7.47 μs <

 Memory estimate: 8.72 KiB, allocs estimate: 196.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  211.000 μs …  36.187 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     223.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   280.025 μs ± 750.097 μs  ┊ GC (mean ± σ):  6.86% ± 7.16%

  █▆▄▂▁                                                         ▁
  ███████▇▇▇▆▆▆▅▆▅▅▅▅▅▄▅▄▄▄▅▅▁▄▅▃▄▄▄▃▄▄▃▅▄▁▁▃▄▁▃▁▁▁▃▄▃▁▃▁▁▁▃▃▁▃ █
  211 μs        Histogram: log(frequency) by time       1.46 ms <

 Memory estimate: 374.17 KiB, allocs estimate: 5269.
```
aviatesk added a commit to JuliaLang/julia that referenced this pull request Jun 27, 2024
In Cassette-like systems, where inference has to infer many calls of
`@generated` function and the generated function involves complex code
transformations, the overhead from code generation itself can become
significant. This is because the results of code generation are not
cached, leading to duplicated code generation in the following contexts:
- `method_for_inference_heuristics` for regular inference on cached
  `@generated` function calls (since `method_for_inference_limit_heuristics`
  isn't stored in cached optimized sources, but is attached to
  generated unoptimized sources).
- `retrieval_code_info` for constant propagation on cached `@generated`
  function calls.

Having said that, caching unoptimized sources generated by `@generated`
functions is not a good tradeoff in general cases, considering the
memory space consumed (and the image bloat). The code generation for
generators like `GeneratedFunctionStub` produced by the front end is
generally very simple, and the first duplicated code generation
mentioned above does not occur for `GeneratedFunctionStub`.

So this unoptimized source caching should be enabled in an opt-in manner.

Based on this idea, this commit defines the trait
`abstract type Core.CachedGenerator` as an interface for the external
systems to opt-in. If the generator is a subtype of this trait,
inference caches the generated unoptimized code, sacrificing memory
space to improve the performance of subsequent inferences.
Specifically, the mechanism for caching the unoptimized source uses the
infrastructure already implemented in #54362. Thanks to
#54362, the cache for generated functions is now
partitioned by world age, so even if the unoptimized source is cached,
the existing invalidation system will invalidate it as expected.

In JuliaDebug/CassetteOverlay.jl#56, the following benchmark results
showed that approximately 1.5~3x inference speedup is achieved by opting
into this feature:

\## Setup
```julia
using CassetteOverlay, BaseBenchmarks, BenchmarkTools

@MethodTable table;
pass = @overlaypass table;
BaseBenchmarks.load!("inference");
benchfunc1() = sin(42)
benchfunc2(xs, x) = findall(>(x), abs.(xs))
interp = BaseBenchmarks.InferenceBenchmarks.InferenceBenchmarker()

\# benchmark inference on entire call graphs from scratch
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)

\# benchmark inference on the call graphs with most of them cached
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
```

\## Benchmark inference on entire call graphs from scratch
> on master
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
BenchmarkTools.Trial: 61 samples with 1 evaluation.
 Range (min … max):  78.574 ms … 87.653 ms  ┊ GC (min … max): 0.00% … 8.81%
 Time  (median):     83.149 ms              ┊ GC (median):    4.85%
 Time  (mean ± σ):   82.138 ms ±  2.366 ms  ┊ GC (mean ± σ):  3.36% ± 2.65%

  ▂ ▂▂ █     ▂                     █ ▅     ▅
  █▅██▅█▅▁█▁▁█▁▁▁▁▅▁▁▁▁▁▁▁▁▅▁▁▅██▅▅█████████▁█▁▅▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  78.6 ms         Histogram: frequency by time        86.8 ms <

 Memory estimate: 52.32 MiB, allocs estimate: 1201192.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.345 s …  1.369 s  ┊ GC (min … max): 2.45% … 3.39%
 Time  (median):     1.355 s             ┊ GC (median):    2.98%
 Time  (mean ± σ):   1.356 s ± 9.847 ms  ┊ GC (mean ± σ):  2.96% ± 0.41%

  █                   █     █                            █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.35 s        Histogram: frequency by time        1.37 s <

 Memory estimate: 637.96 MiB, allocs estimate: 15159639.
```
> with this PR
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
BenchmarkTools.Trial: 230 samples with 1 evaluation.
 Range (min … max):  19.339 ms … 82.521 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.938 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.665 ms ±  4.666 ms  ┊ GC (mean ± σ):  6.72% ± 8.80%

  ▃▇█▇▄                     ▂▂▃▃▄
  █████▇█▇▆▅▅▆▅▅▁▅▁▁▁▁▁▁▁▁▁██████▆▁█▁▅▇▆▁▅▁▁▅▁▅▁▁▁▁▁▁▅▁▁▁▁▁▁▅ ▆
  19.3 ms      Histogram: log(frequency) by time      29.4 ms <

 Memory estimate: 28.67 MiB, allocs estimate: 590138.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  354.585 ms … 390.400 ms  ┊ GC (min … max): 0.00% … 7.01%
 Time  (median):     368.778 ms               ┊ GC (median):    3.74%
 Time  (mean ± σ):   368.824 ms ±   8.853 ms  ┊ GC (mean ± σ):  3.70% ± 1.89%

             ▃            █
  ▇▁▁▁▁▁▁▁▁▁▁█▁▇▇▁▁▁▁▇▁▁▁▁█▁▁▁▁▇▁▁▇▁▁▇▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  355 ms           Histogram: frequency by time          390 ms <

 Memory estimate: 227.86 MiB, allocs estimate: 4689830.
```

\## Benchmark inference on the call graphs with most of them cached
> on master
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  45.166 μs …  9.799 ms  ┊ GC (min … max): 0.00% … 98.96%
 Time  (median):     46.792 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   48.339 μs ± 97.539 μs  ┊ GC (mean ± σ):  2.01% ±  0.99%

    ▁▂▄▆▆▇███▇▆▅▄▃▄▄▂▂▂▁▁▁   ▁▁▂▂▁ ▁ ▂▁ ▁                     ▃
  ▃▇██████████████████████▇████████████████▇█▆▇▇▆▆▆▅▆▆▆▇▆▅▅▅▆ █
  45.2 μs      Histogram: log(frequency) by time        55 μs <

 Memory estimate: 25.27 KiB, allocs estimate: 614.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  303.375 μs …  16.582 ms  ┊ GC (min … max): 0.00% … 97.38%
 Time  (median):     317.625 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   338.772 μs ± 274.164 μs  ┊ GC (mean ± σ):  5.44% ±  7.56%

       ▃▆██▇▅▂▁
  ▂▂▄▅██████████▇▆▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▂▁▁▂▁▂▂ ▃
  303 μs           Histogram: frequency by time          394 μs <

 Memory estimate: 412.80 KiB, allocs estimate: 6224.
```
> with this PR
```
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.444 μs …  1.808 ms  ┊ GC (min … max): 0.00% … 99.01%
 Time  (median):     5.694 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.228 μs ± 25.393 μs  ┊ GC (mean ± σ):  5.73% ±  1.40%

      ▄█▇▄
  ▁▂▄█████▇▄▃▃▃▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  5.44 μs        Histogram: frequency by time        7.47 μs <

 Memory estimate: 8.72 KiB, allocs estimate: 196.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  211.000 μs …  36.187 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     223.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   280.025 μs ± 750.097 μs  ┊ GC (mean ± σ):  6.86% ± 7.16%

  █▆▄▂▁                                                         ▁
  ███████▇▇▇▆▆▆▅▆▅▅▅▅▅▄▅▄▄▄▅▅▁▄▅▃▄▄▄▃▄▄▃▅▄▁▁▃▄▁▃▁▁▁▃▄▃▁▃▁▁▁▃▃▁▃ █
  211 μs        Histogram: log(frequency) by time       1.46 ms <

 Memory estimate: 374.17 KiB, allocs estimate: 5269.
```
aviatesk added a commit to JuliaLang/julia that referenced this pull request Jun 28, 2024
…54916)

In Cassette-like systems, where inference has to infer many calls of
`@generated` function and the generated function involves complex code
transformations, the overhead from code generation itself can become
significant. This is because the results of code generation are not
cached, leading to duplicated code generation in the following contexts:
- `method_for_inference_heuristics` for regular inference on cached
`@generated` function calls (since
`method_for_inference_limit_heuristics` isn't stored in cached optimized
sources, but is attached to generated unoptimized sources).
- `retrieval_code_info` for constant propagation on cached `@generated`
function calls.

Having said that, caching unoptimized sources generated by `@generated`
functions is not a good tradeoff in general cases, considering the
memory space consumed (and the image bloat). The code generation for
generators like `GeneratedFunctionStub` produced by the front end is
generally very simple, and the first duplicated code generation
mentioned above does not occur for `GeneratedFunctionStub`.

So this unoptimized source caching should be enabled in an opt-in
manner.

Based on this idea, this commit defines the trait `abstract type
Core.CachedGenerator` as an interface for the external systems to
opt-in. If the generator is a subtype of this trait, inference caches
the generated unoptimized code, sacrificing memory space to improve the
performance of subsequent inferences. Specifically, the mechanism for
caching the unoptimized source uses the infrastructure already
implemented in #54362. Thanks to #54362,
the cache for generated functions is now partitioned by world age, so
even if the unoptimized source is cached, the existing invalidation
system will invalidate it as expected.

In JuliaDebug/CassetteOverlay.jl#56, the following benchmark results
showed that approximately 1.5~3x inference speedup is achieved by opting
into this feature:

## Setup
```julia
using CassetteOverlay, BaseBenchmarks, BenchmarkTools

@MethodTable table;
pass = @overlaypass table;
BaseBenchmarks.load!("inference");
benchfunc1() = sin(42)
benchfunc2(xs, x) = findall(>(x), abs.(xs))
interp = BaseBenchmarks.InferenceBenchmarks.InferenceBenchmarker()

# benchmark inference on entire call graphs from scratch
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)

# benchmark inference on the call graphs with most of them cached
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
```

## Benchmark inference on entire call graphs from scratch
> on master
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
BenchmarkTools.Trial: 61 samples with 1 evaluation.
 Range (min … max):  78.574 ms … 87.653 ms  ┊ GC (min … max): 0.00% … 8.81%
 Time  (median):     83.149 ms              ┊ GC (median):    4.85%
 Time  (mean ± σ):   82.138 ms ±  2.366 ms  ┊ GC (mean ± σ):  3.36% ± 2.65%

  ▂ ▂▂ █     ▂                     █ ▅     ▅
  █▅██▅█▅▁█▁▁█▁▁▁▁▅▁▁▁▁▁▁▁▁▅▁▁▅██▅▅█████████▁█▁▅▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  78.6 ms         Histogram: frequency by time        86.8 ms <

 Memory estimate: 52.32 MiB, allocs estimate: 1201192.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.345 s …  1.369 s  ┊ GC (min … max): 2.45% … 3.39%
 Time  (median):     1.355 s             ┊ GC (median):    2.98%
 Time  (mean ± σ):   1.356 s ± 9.847 ms  ┊ GC (mean ± σ):  2.96% ± 0.41%

  █                   █     █                            █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.35 s        Histogram: frequency by time        1.37 s <

 Memory estimate: 637.96 MiB, allocs estimate: 15159639.
```
> with this PR
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc1)
BenchmarkTools.Trial: 230 samples with 1 evaluation.
 Range (min … max):  19.339 ms … 82.521 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.938 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.665 ms ±  4.666 ms  ┊ GC (mean ± σ):  6.72% ± 8.80%

  ▃▇█▇▄                     ▂▂▃▃▄
  █████▇█▇▆▅▅▆▅▅▁▅▁▁▁▁▁▁▁▁▁██████▆▁█▁▅▇▆▁▅▁▁▅▁▅▁▁▁▁▁▁▅▁▁▁▁▁▁▅ ▆
  19.3 ms      Histogram: log(frequency) by time      29.4 ms <

 Memory estimate: 28.67 MiB, allocs estimate: 590138.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  354.585 ms … 390.400 ms  ┊ GC (min … max): 0.00% … 7.01%
 Time  (median):     368.778 ms               ┊ GC (median):    3.74%
 Time  (mean ± σ):   368.824 ms ±   8.853 ms  ┊ GC (mean ± σ):  3.70% ± 1.89%

             ▃            █
  ▇▁▁▁▁▁▁▁▁▁▁█▁▇▇▁▁▁▁▇▁▁▁▁█▁▁▁▁▇▁▁▇▁▁▇▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  355 ms           Histogram: frequency by time          390 ms <

 Memory estimate: 227.86 MiB, allocs estimate: 4689830.
```

## Benchmark inference on the call graphs with most of them cached
> on master
```
julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  45.166 μs …  9.799 ms  ┊ GC (min … max): 0.00% … 98.96%
 Time  (median):     46.792 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   48.339 μs ± 97.539 μs  ┊ GC (mean ± σ):  2.01% ±  0.99%

    ▁▂▄▆▆▇███▇▆▅▄▃▄▄▂▂▂▁▁▁   ▁▁▂▂▁ ▁ ▂▁ ▁                     ▃
  ▃▇██████████████████████▇████████████████▇█▆▇▇▆▆▆▅▆▆▆▇▆▅▅▅▆ █
  45.2 μs      Histogram: log(frequency) by time        55 μs <

 Memory estimate: 25.27 KiB, allocs estimate: 614.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  303.375 μs …  16.582 ms  ┊ GC (min … max): 0.00% … 97.38%
 Time  (median):     317.625 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   338.772 μs ± 274.164 μs  ┊ GC (mean ± σ):  5.44% ±  7.56%

       ▃▆██▇▅▂▁
  ▂▂▄▅██████████▇▆▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▂▁▁▂▁▂▂ ▃
  303 μs           Histogram: frequency by time          394 μs <

 Memory estimate: 412.80 KiB, allocs estimate: 6224.
```
> with this PR
```
@benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc1)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.444 μs …  1.808 ms  ┊ GC (min … max): 0.00% … 99.01%
 Time  (median):     5.694 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.228 μs ± 25.393 μs  ┊ GC (mean ± σ):  5.73% ±  1.40%

      ▄█▇▄
  ▁▂▄█████▇▄▃▃▃▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  5.44 μs        Histogram: frequency by time        7.47 μs <

 Memory estimate: 8.72 KiB, allocs estimate: 196.

julia> @benchmark BaseBenchmarks.InferenceBenchmarks.@inf_call interp=interp pass(benchfunc2, rand(10), 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  211.000 μs …  36.187 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     223.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   280.025 μs ± 750.097 μs  ┊ GC (mean ± σ):  6.86% ± 7.16%

  █▆▄▂▁                                                         ▁
  ███████▇▇▇▆▆▆▅▆▅▅▅▅▅▄▅▄▄▄▅▅▁▄▅▃▄▄▄▃▄▄▃▅▄▁▁▃▄▁▃▁▁▁▃▄▃▁▃▁▁▁▃▃▁▃ █
  211 μs        Histogram: log(frequency) by time       1.46 ms <

 Memory estimate: 374.17 KiB, allocs estimate: 5269.
```
@aviatesk aviatesk merged commit 3a9575d into master Jun 28, 2024
@aviatesk aviatesk deleted the avi/use-cached-generator branch June 28, 2024 01:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant