Clarify semantics around do_not_cache and skip_cache_lookup by EricBurnett · Pull Request #55 · bazelbuild/remote-apis

EricBurnett · 2019-02-06T20:54:17Z

Clarify semantics for do_not_cache and skip_cache_lookup fields to cover the semantics of when in-flight actions may be "merged", and to clarify when results are and aren't eligible for caching.

The use-cases of interest I'm considering are:

Minimizing total number of executions for idempotent actions. (The default: neither field is set; only one total execution is required)
Bypassing and overwriting the Action Cache when its results are stale or corrupt. (skip_cache_lookup should be set and the result should still be cached).
Actions with side-effects that should always be re-executed (do_not_cache should be set, multiple in-flight requests should not be merged)
Actions intentionally being (re)run, e.g. for performance analysis across multiple attempts. (do_not_cache should be set; multiple in-flight requests should not be merged).

I took a look at Bazel and I believe its usage to be consistent with these semantics:

--runs_per_test implies uncacheable: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/analysis/test/TestRunnerAction.java#L373
Outputs found to be missing after an action cache entry is found result in re-execution with skip_cache_lookup set so the action gets re-executed: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnRunner.java#L209

Any sanity-checks that the comments here are clear and understandable would be great, and/or validation that this is consistent with the semantics expected by other client tools.

Clarify semantics for do_not_cache and skip_cache_lookup fields to cover the semantics of when in-flight actions may be "merged", and to clarify when results are and aren't eligible for caching. The use-cases of interest I'm considering are: - Minimizing total number of executions for idempotent actions. (The default: neither field is set; only one total execution is required) - Bypassing and overwriting the Action Cache when its results are stale or corrupt. (skip_cache_lookup should be set and the result should still be cached). - Actions with side-effects that should always be re-executed (do_not_cache should be set, multiple in-flight requests should not be merged) - Actions intentionally being (re)run, e.g. for performance analysis across multiple attempts. (do_not_cache should be set; multiple in-flight requests should not be merged). I took a look at Bazel and I believe its usage to be consistent with these semantics: - --runs_per_test implies uncacheable: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/analysis/test/TestRunnerAction.java#L373 - Outputs found to be missing after an action cache entry is found result in re-execution with skip_cache_lookup set so the action gets re-executed: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnRunner.java#L209

EricBurnett · 2019-02-06T20:55:59Z

@EdSchouten @juergbi FYI.

build/bazel/remote/execution/v2/remote_execution.proto

EdSchouten · 2019-02-06T21:08:02Z

build/bazel/remote/execution/v2/remote_execution.proto

+  // cache entries that reference outputs no longer available or that are
+  // poisoned in any way.
+  // If false, the result may be served from the action cache.
  bool skip_cache_lookup = 3;


It may be noteworthy to mention that Buildbarn currently ignores this flag altogether; it behaves as if it's always set to true. We currently assume that if a client actually calls into Execute() it already checked the AC beforehand. Bazel seems to do that at least.

Is it really worth the effort to have this flag?

I think it's a mistake for Bazel to do two calls. This just adds another round-trip for no benefit. I think at some point Bazel will move to doing a single call to the remote executor, and expect to be able to get a cache hit back.

+1 to Ulf, I'm pretty sure Bazel can already switch to doing the single Execute call and get cache hits back (from RBE), since it already has code in place to do all the uploads if a missing input status is returned. That wasn't the case in the beginning, when we wanted to enforce the order of cache check -> get missing -> upload missing -> execute. But note that while doing that may be a bit simpler, I'm not sure that it will save a round trip often. The saving will only happen when the action is (a) not in the cache, and (b) can be re-executed without any additional uploads. This seems like an unlikely case.
Nevertheless, I think it's important to have the flag to that servers may return cached results, or merge actions in flight, as an optimization.

Ulf: I think it's equivalent right now (as Ola says) - either:
GetActionResult (NOT_FOUND) -> upload action & blobs -> Execute, or
Execute(NOT_FOUND) -> upload actions & blobs -> Execute

You could pre-upload actions & blobs before calling Execute, but you'd only want to do so on incremental compilation paths where you know the action is going to be a miss anyways (at which point whether or not Execute checks the action cache is moot); on clean executions where the action is most likely a cache hit calling any upload APIs before calling Execute would be net worse than calling GetActionResult or Execute and uploading only on miss.

So bazel could call one fewer distinct API if it wanted to (Execute*2 instead of Get...Execute), but still needs to call GetActionResult for cache-only flows. So not really better? On the service owner side, Ed's point holds: it's simpler if a service doesn't have to check the action cache on Execute, and doesn't have to be able to quickly reject actions where blobs are missing.

On the other side, one feature I'd like to eventually add to the API is inline uploads in the Execute call itself. (Probably gated on something in the Capabilities API, so services with separation between CAS and Execute don't have to implement it). At that point build tools can make an optimization when executions are probably incremental or are likely to be critical-path bound of using a single-RPC fast-path:
Execute(inc. inline upload of changed files; merkle tree nodes; action)
You could still maybe argue that skip_cache_lookup is still unnecessary (just set it to true always), as in Ed's case though then the failure mode of build tools calling the "fast" path when there is actually a cache entry is a lot worse.

We already have code in Blaze to upload locally modified files eagerly (we detect locally modified files very early in the build), and we've found that to be a significant performance improvement for highly incremental builds at the time, so it seems likely that that's going to happen here as well. (Caveat: it's possible that Bazel's internals have changed sufficiently that eager uploading is no longer a win.)

+1 for eagerly uploading files - I think it'd still be a boon - but note that merkle trees and action protos also need upload, and can only be constructed at around the time the action is ready for execution anyways.

illicitonion · 2019-02-07T08:39:52Z

FWIW in Pants we don't use this mechanism at all because it's not flexible enough for us. We use a unique value, which we call a seed (that gets set as an environment variable, because there's nowhere else convenient to put it) - rather than using the caching flags; we change the seed if we want to bypass the cache.

The reason we do this is because we want to be able to do performance benchmarks, and we want to include caching effects there. One of our benchmark runs is:
Do a build
Change one file
Flush any in-memory caches (so we hit the remote cache)
Do a build again

We do this with a unique cache-seed for that benchmark run, because it means we know that the cache contains exactly what we expect. If we just used booleans, we would have no convenient way of isolating the cache per-benchmark-run; we only have the flexibility to either entirely disable the cache (where we don't get to test that cache fetches are as fast as we expect), or entirely enable it (where we need to trust that the change we made is globally unique).

In particular, at Twitter, we use this mechanism to do things like "reply the last n commits' CI runs" when validating new Pants releases to detect regressions, where we know the cache has otherwise been populated, and which we wouldn't be able to do with just booleans.

EricBurnett · 2019-02-07T15:27:15Z

@illicitonion thanks for the comment, and the idea of using it to verify new tool builds! That does sound like a useful pattern.

We use "seeds" in some contexts as well (as an additional key+value in the Platform properties rather than an env var - might work for you too?), and I think you're right that it could entirely replace do_not_cache if so desired. (@edbaunton , more food for thought). I don't think it could replace skip_cache_lookup though, as that flag supports use-cases where the same action cache entry needs to be replaced for some reason, but caching is still otherwise desired for this and future builds.

ulfjack · 2019-02-07T17:59:59Z

We've internally been using a flag (outside of the action proto which is used for canonicalization) to indicate that we're not willing to accept a cached response, but are ok with the rerun being cached. We've used this to overwrite bad cache entries, as well as to make sure tests are rerun (e.g., to detect flakiness; note that --runs_per_test also adds a key to the action proto).

I don't think we've ever added a truly random number to an action request proto. That prevents any caching, but can also fill up the cache with non-reusable entries.

EricBurnett · 2019-02-12T20:22:14Z

Since there are no objections that I can see (just discussions that don't seem blocking), anyone want to approve and merge this change? Or is there anything else we're waiting on before that's appropriate.

Thanks!

…ld#55) Clarify semantics for do_not_cache and skip_cache_lookup fields to cover the semantics of when in-flight actions may be "merged", and to clarify when results are and aren't eligible for caching. The use-cases of interest I'm considering are: - Minimizing total number of executions for idempotent actions. (The default: neither field is set; only one total execution is required) - Bypassing and overwriting the Action Cache when its results are stale or corrupt. (skip_cache_lookup should be set and the result should still be cached). - Actions with side-effects that should always be re-executed (do_not_cache should be set, multiple in-flight requests should not be merged) - Actions intentionally being (re)run, e.g. for performance analysis across multiple attempts. (do_not_cache should be set; multiple in-flight requests should not be merged). I took a look at Bazel and I believe its usage to be consistent with these semantics: - --runs_per_test implies uncacheable: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/analysis/test/TestRunnerAction.java#L373 - Outputs found to be missing after an action cache entry is found result in re-execution with skip_cache_lookup set so the action gets re-executed: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnRunner.java#L209

Noteworthy changes include: * From remote_execution.proto: - New message field in ExecuteResponse [1]. - Output file and dir. parent folders clarifications [2]. - Better doc. for do_not_cache/skip_cache_lookup [3]. - Documention updates. * From semver.proto: - Documentation updates. [1] bazelbuild/remote-apis#36 [2] bazelbuild/remote-apis#42 [3] bazelbuild/remote-apis#55

…. (bazelbuild#56) * Handling non-normalized paths in tree construction. Fixes bazelbuild#55. * Addressing offline comments

EricBurnett requested review from agoulti, bergsieker, buchgr, ola-rozenfeld and ulfjack as code owners February 6, 2019 20:54

googlebot added the cla: yes Pull requests whose authors are covered by a CLA with Google. label Feb 6, 2019

EdSchouten reviewed Feb 6, 2019

View reviewed changes

agoulti approved these changes Feb 12, 2019

View reviewed changes

ola-rozenfeld approved these changes Feb 12, 2019

View reviewed changes

ola-rozenfeld merged commit cfe8e54 into bazelbuild:master Feb 12, 2019

EricBurnett deleted the patch-1 branch February 12, 2019 23:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify semantics around do_not_cache and skip_cache_lookup#55

Clarify semantics around do_not_cache and skip_cache_lookup#55
ola-rozenfeld merged 1 commit intobazelbuild:masterfrom
EricBurnett:patch-1

EricBurnett commented Feb 6, 2019

Uh oh!

EricBurnett commented Feb 6, 2019

Uh oh!

Uh oh!

EdSchouten Feb 6, 2019

Uh oh!

ulfjack Feb 6, 2019

Uh oh!

ola-rozenfeld Feb 6, 2019

Uh oh!

EricBurnett Feb 6, 2019 •

edited

Loading

Uh oh!

ulfjack Feb 6, 2019

Uh oh!

EricBurnett Feb 7, 2019

Uh oh!

illicitonion commented Feb 7, 2019

Uh oh!

EricBurnett commented Feb 7, 2019 •

edited

Loading

Uh oh!

ulfjack commented Feb 7, 2019

Uh oh!

EricBurnett commented Feb 12, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

Conversation

EricBurnett commented Feb 6, 2019

Uh oh!

EricBurnett commented Feb 6, 2019

Uh oh!

Uh oh!

EdSchouten Feb 6, 2019

Choose a reason for hiding this comment

Uh oh!

ulfjack Feb 6, 2019

Choose a reason for hiding this comment

Uh oh!

ola-rozenfeld Feb 6, 2019

Choose a reason for hiding this comment

Uh oh!

EricBurnett Feb 6, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ulfjack Feb 6, 2019

Choose a reason for hiding this comment

Uh oh!

EricBurnett Feb 7, 2019

Choose a reason for hiding this comment

Uh oh!

illicitonion commented Feb 7, 2019

Uh oh!

EricBurnett commented Feb 7, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ulfjack commented Feb 7, 2019

Uh oh!

EricBurnett commented Feb 12, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

EricBurnett Feb 6, 2019 •

edited

Loading

EricBurnett commented Feb 7, 2019 •

edited

Loading