Skip to content

Clarify semantics around do_not_cache and skip_cache_lookup#55

Merged
ola-rozenfeld merged 1 commit intobazelbuild:masterfrom
EricBurnett:patch-1
Feb 12, 2019
Merged

Clarify semantics around do_not_cache and skip_cache_lookup#55
ola-rozenfeld merged 1 commit intobazelbuild:masterfrom
EricBurnett:patch-1

Conversation

@EricBurnett
Copy link
Copy Markdown
Collaborator

Clarify semantics for do_not_cache and skip_cache_lookup fields to cover the semantics of when in-flight actions may be "merged", and to clarify when results are and aren't eligible for caching.

The use-cases of interest I'm considering are:

  • Minimizing total number of executions for idempotent actions. (The default: neither field is set; only one total execution is required)
  • Bypassing and overwriting the Action Cache when its results are stale or corrupt. (skip_cache_lookup should be set and the result should still be cached).
  • Actions with side-effects that should always be re-executed (do_not_cache should be set, multiple in-flight requests should not be merged)
  • Actions intentionally being (re)run, e.g. for performance analysis across multiple attempts. (do_not_cache should be set; multiple in-flight requests should not be merged).

I took a look at Bazel and I believe its usage to be consistent with these semantics:

Any sanity-checks that the comments here are clear and understandable would be great, and/or validation that this is consistent with the semantics expected by other client tools.

Clarify semantics for do_not_cache and skip_cache_lookup fields to cover the semantics of when in-flight actions may be "merged", and to clarify when results are and aren't eligible for caching.

The use-cases of interest I'm considering are:
  - Minimizing total number of executions for idempotent actions. (The default: neither field is set; only one total execution is required)
 - Bypassing and overwriting the Action Cache when its results are stale or corrupt. (skip_cache_lookup should be set and the result should still be cached).
 - Actions with side-effects that should always be re-executed (do_not_cache should be set, multiple in-flight requests should not be merged)
 - Actions intentionally being (re)run, e.g. for performance analysis across multiple attempts. (do_not_cache should be set; multiple in-flight requests should not be merged).

I took a look at Bazel and I believe its usage to be consistent with these semantics:
 - --runs_per_test implies uncacheable: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/analysis/test/TestRunnerAction.java#L373
 - Outputs found to be missing after an action cache entry is found result in re-execution with skip_cache_lookup set so the action gets re-executed: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnRunner.java#L209
@googlebot googlebot added the cla: yes Pull requests whose authors are covered by a CLA with Google. label Feb 6, 2019
@EricBurnett
Copy link
Copy Markdown
Collaborator Author

@EdSchouten @juergbi FYI.

// cache entries that reference outputs no longer available or that are
// poisoned in any way.
// If false, the result may be served from the action cache.
bool skip_cache_lookup = 3;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be noteworthy to mention that Buildbarn currently ignores this flag altogether; it behaves as if it's always set to true. We currently assume that if a client actually calls into Execute() it already checked the AC beforehand. Bazel seems to do that at least.

Is it really worth the effort to have this flag?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a mistake for Bazel to do two calls. This just adds another round-trip for no benefit. I think at some point Bazel will move to doing a single call to the remote executor, and expect to be able to get a cache hit back.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to Ulf, I'm pretty sure Bazel can already switch to doing the single Execute call and get cache hits back (from RBE), since it already has code in place to do all the uploads if a missing input status is returned. That wasn't the case in the beginning, when we wanted to enforce the order of cache check -> get missing -> upload missing -> execute. But note that while doing that may be a bit simpler, I'm not sure that it will save a round trip often. The saving will only happen when the action is (a) not in the cache, and (b) can be re-executed without any additional uploads. This seems like an unlikely case.
Nevertheless, I think it's important to have the flag to that servers may return cached results, or merge actions in flight, as an optimization.

Copy link
Copy Markdown
Collaborator Author

@EricBurnett EricBurnett Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ulf: I think it's equivalent right now (as Ola says) - either:
GetActionResult (NOT_FOUND) -> upload action & blobs -> Execute, or
Execute(NOT_FOUND) -> upload actions & blobs -> Execute

You could pre-upload actions & blobs before calling Execute, but you'd only want to do so on incremental compilation paths where you know the action is going to be a miss anyways (at which point whether or not Execute checks the action cache is moot); on clean executions where the action is most likely a cache hit calling any upload APIs before calling Execute would be net worse than calling GetActionResult or Execute and uploading only on miss.

So bazel could call one fewer distinct API if it wanted to (Execute*2 instead of Get...Execute), but still needs to call GetActionResult for cache-only flows. So not really better? On the service owner side, Ed's point holds: it's simpler if a service doesn't have to check the action cache on Execute, and doesn't have to be able to quickly reject actions where blobs are missing.

On the other side, one feature I'd like to eventually add to the API is inline uploads in the Execute call itself. (Probably gated on something in the Capabilities API, so services with separation between CAS and Execute don't have to implement it). At that point build tools can make an optimization when executions are probably incremental or are likely to be critical-path bound of using a single-RPC fast-path:
Execute(inc. inline upload of changed files; merkle tree nodes; action)
You could still maybe argue that skip_cache_lookup is still unnecessary (just set it to true always), as in Ed's case though then the failure mode of build tools calling the "fast" path when there is actually a cache entry is a lot worse.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have code in Blaze to upload locally modified files eagerly (we detect locally modified files very early in the build), and we've found that to be a significant performance improvement for highly incremental builds at the time, so it seems likely that that's going to happen here as well. (Caveat: it's possible that Bazel's internals have changed sufficiently that eager uploading is no longer a win.)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for eagerly uploading files - I think it'd still be a boon - but note that merkle trees and action protos also need upload, and can only be constructed at around the time the action is ready for execution anyways.

@illicitonion
Copy link
Copy Markdown
Contributor

FWIW in Pants we don't use this mechanism at all because it's not flexible enough for us. We use a unique value, which we call a seed (that gets set as an environment variable, because there's nowhere else convenient to put it) - rather than using the caching flags; we change the seed if we want to bypass the cache.

The reason we do this is because we want to be able to do performance benchmarks, and we want to include caching effects there. One of our benchmark runs is:
Do a build
Change one file
Flush any in-memory caches (so we hit the remote cache)
Do a build again

We do this with a unique cache-seed for that benchmark run, because it means we know that the cache contains exactly what we expect. If we just used booleans, we would have no convenient way of isolating the cache per-benchmark-run; we only have the flexibility to either entirely disable the cache (where we don't get to test that cache fetches are as fast as we expect), or entirely enable it (where we need to trust that the change we made is globally unique).

In particular, at Twitter, we use this mechanism to do things like "reply the last n commits' CI runs" when validating new Pants releases to detect regressions, where we know the cache has otherwise been populated, and which we wouldn't be able to do with just booleans.

@EricBurnett
Copy link
Copy Markdown
Collaborator Author

EricBurnett commented Feb 7, 2019

@illicitonion thanks for the comment, and the idea of using it to verify new tool builds! That does sound like a useful pattern.

We use "seeds" in some contexts as well (as an additional key+value in the Platform properties rather than an env var - might work for you too?), and I think you're right that it could entirely replace do_not_cache if so desired. (@edbaunton , more food for thought). I don't think it could replace skip_cache_lookup though, as that flag supports use-cases where the same action cache entry needs to be replaced for some reason, but caching is still otherwise desired for this and future builds.

@ulfjack
Copy link
Copy Markdown
Collaborator

ulfjack commented Feb 7, 2019

We've internally been using a flag (outside of the action proto which is used for canonicalization) to indicate that we're not willing to accept a cached response, but are ok with the rerun being cached. We've used this to overwrite bad cache entries, as well as to make sure tests are rerun (e.g., to detect flakiness; note that --runs_per_test also adds a key to the action proto).

I don't think we've ever added a truly random number to an action request proto. That prevents any caching, but can also fill up the cache with non-reusable entries.

@EricBurnett
Copy link
Copy Markdown
Collaborator Author

Since there are no objections that I can see (just discussions that don't seem blocking), anyone want to approve and merge this change? Or is there anything else we're waiting on before that's appropriate.

Thanks!

@ola-rozenfeld ola-rozenfeld merged commit cfe8e54 into bazelbuild:master Feb 12, 2019
@EricBurnett EricBurnett deleted the patch-1 branch February 12, 2019 23:32
bergsieker pushed a commit to bergsieker/remote-apis that referenced this pull request Apr 11, 2019
…ld#55)

Clarify semantics for do_not_cache and skip_cache_lookup fields to cover the semantics of when in-flight actions may be "merged", and to clarify when results are and aren't eligible for caching.

The use-cases of interest I'm considering are:
  - Minimizing total number of executions for idempotent actions. (The default: neither field is set; only one total execution is required)
 - Bypassing and overwriting the Action Cache when its results are stale or corrupt. (skip_cache_lookup should be set and the result should still be cached).
 - Actions with side-effects that should always be re-executed (do_not_cache should be set, multiple in-flight requests should not be merged)
 - Actions intentionally being (re)run, e.g. for performance analysis across multiple attempts. (do_not_cache should be set; multiple in-flight requests should not be merged).

I took a look at Bazel and I believe its usage to be consistent with these semantics:
 - --runs_per_test implies uncacheable: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/analysis/test/TestRunnerAction.java#L373
 - Outputs found to be missing after an action cache entry is found result in re-execution with skip_cache_lookup set so the action gets re-executed: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnRunner.java#L209
henryaj pushed a commit to henryaj/buildgrid that referenced this pull request Nov 20, 2019
Noteworthy changes include:
* From remote_execution.proto:
  - New message field in ExecuteResponse [1].
  - Output file and dir. parent folders clarifications [2].
  - Better doc. for do_not_cache/skip_cache_lookup [3].
  - Documention updates.
* From semver.proto:
  - Documentation updates.

[1] bazelbuild/remote-apis#36
[2] bazelbuild/remote-apis#42
[3] bazelbuild/remote-apis#55
santigl pushed a commit to santigl/remote-apis that referenced this pull request Aug 26, 2020
…. (bazelbuild#56)

* Handling non-normalized paths in tree construction. Fixes bazelbuild#55.

* Addressing offline comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla: yes Pull requests whose authors are covered by a CLA with Google.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants