Move to a simpler form of checksum->solution caching in the remote workspace layer.#62872
CyrusNajmabadi wants to merge 39 commits into dotnet:main
Conversation
| await SetLastRequestedSolutionAsync(solutionChecksum, fromPrimaryBranch, refCountedLazySolution, cancellationToken).ConfigureAwait(false); | ||
| await _anyBranchSolutionCache.SetLastRequestedSolutionAsync(solutionChecksum, refCountedLazySolution, cancellationToken).ConfigureAwait(false); | ||
| if (updatePrimaryBranch) | ||
| await _primaryBranchSolutionCache.SetLastRequestedSolutionAsync(solutionChecksum, refCountedLazySolution, cancellationToken).ConfigureAwait(false); |
just inlined this method as i didn't see much benefit in the helper.
| // If this was a notification about the primary solution, then attempt to promote any solution we found to | ||
| // be the solution for this workspace. | ||
| if (fromPrimaryBranch) | ||
| (newSolution, _) = await TryUpdateWorkspaceCurrentSolutionAsync(workspaceVersion, newSolution, cancellationToken).ConfigureAwait(false); |
this logic is now part of GetOrCreateSolutionAsync
| { | ||
| var currentSolution = this.CurrentSolution; | ||
| // We were asked to update the primary-branch solution. So take the any-branch solution and promote it to | ||
| // the primary-branch-level. |
this is the clever bit. we effectively ensure that whatever solution we made for this checksum gets promoted to be in the _primarySolutionCache.
What's critical to understand here is that each cache may have the same checksum point to a different solution. The _anyBranch cache can point from that checksum to a 'forked solution' that it generated (which then has some random branch id).
The _primaryBranch cache will then instead map from the checksum to an actual primary-branch-id solution if possible (which may be a different instance than the forked solution).
Later lookups always look in the _primaryBranch cache first, so this better solution will always be found.
This was much simpler to code up and verify than trying to maintain a single checksum->solution mapping that might then have to mutate in order to update what that checksum pointed at.
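The two-cache lookup order described above can be sketched roughly as follows. This is a hypothetical Python model, not the real Roslyn code (which is C# and stores ref-counted lazy solutions); the class and function names are illustrative only.

```python
# Hypothetical sketch of the two-cache design: the primary-branch cache
# is consulted first, because for the same checksum it may hold a
# promoted primary-branch-id solution rather than an any-branch fork.

class LastRequestedSolutionCache:
    """Remembers the last checksum -> solution pair it was told about."""

    def __init__(self):
        self._checksum = None
        self._solution = None

    def try_get(self, checksum):
        return self._solution if checksum == self._checksum else None

    def set_last_requested(self, checksum, solution):
        self._checksum = checksum
        self._solution = solution


def lookup(primary_branch_cache, any_branch_cache, checksum):
    # Primary-branch cache wins; fall back to the any-branch fork.
    found = primary_branch_cache.try_get(checksum)
    return found if found is not None else any_branch_cache.try_get(checksum)
```

Because later lookups consult the primary cache first, promoting a solution for a checksum simply means setting it in the primary cache; nothing in the any-branch cache needs to mutate.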
| _gate = gate; | ||
| } | ||
|
|
||
| public async ValueTask<ReferenceCountedDisposable<LazySolution>?> TryFastGetSolutionAsync( |
these methods are basically the same named methods from before, just simplified now that this cache object only stores the last accessed item.
| } | ||
|
|
||
| public async ValueTask<ReferenceCountedDisposable<LazySolution>> SlowGetOrCreateSolutionAsync( | ||
| Checksum solutionChecksum, Func<CancellationToken, Task<Solution>> getSolutionAsync, CancellationToken cancellationToken) |
this is the same as before, except that getSolutionAsync can be passed in since there is different behavior for the anyBranch cache versus the primaryBranchCache.
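That shape can be sketched as below: the cache owns the lookup and store, and the caller supplies the computation as a callback. Python sketch with hypothetical names; the real C# method is async and returns a ref-counted lazy solution.

```python
# Hypothetical sketch: get-or-create where the computation is passed in,
# so the anyBranch and primaryBranch caches can each supply their own.

def slow_get_or_create_solution(cache: dict, checksum, get_solution):
    cached = cache.get(checksum)
    if cached is not None:
        return cached
    solution = get_solution()   # anyBranch: build from assets;
                                # primaryBranch: promote the anyBranch result
    cache[checksum] = solution
    return solution
```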
| await _anyBranchSolutionCache.SlowGetOrCreateSolutionAsync( | ||
| solutionChecksum, | ||
| cancellationToken => ComputeSolutionAsync(assetProvider, solutionChecksum, cancellationToken), | ||
| cancellationToken).ConfigureAwait(false); |
note: the lambda we pass in here is the same code that used to be in SlowGetSolutionAsync at this line:
| var anyBranchSolution = await anyBranchRefCountedSolution.Target.Task.WithCancellation(cancellationToken).ConfigureAwait(false); | ||
| var (primaryBranchSolution, _) = await this.TryUpdateWorkspaceCurrentSolutionAsync(workspaceVersion, anyBranchSolution, cancellationToken).ConfigureAwait(false); | ||
| return primaryBranchSolution; | ||
| }, |
however, this lambda is different. for the primaryBranch cache we first get the anyBranch solution, then try to promote it to be the primary branch solution for this workspace. We then cache that new solution for this checksum instead.
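As a rough sketch (Python, hypothetical helper names), the primary-branch callback just composes the any-branch fetch with a promotion step:

```python
# Hypothetical sketch of the primary-branch callback described above:
# first obtain the any-branch solution, then try to promote it. The
# promotion may return a different instance (a real primary-branch-id
# solution) than the forked any-branch one.

def make_primary_branch_compute(get_any_branch_solution, try_promote_to_primary):
    def compute():
        any_branch_solution = get_any_branch_solution()
        return try_promote_to_primary(any_branch_solution)
    return compute
```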
| } | ||
|
|
||
| public async Task SetLastRequestedSolutionAsync(Checksum solutionChecksum, ReferenceCountedDisposable<LazySolution> solution, CancellationToken cancellationToken) | ||
| { |
same as before, just simpler as there's only one 'last requested' solution to update.
|
|
||
| var solutionChecksum = await solution.State.GetChecksumAsync(CancellationToken.None); | ||
| var synched = await remoteWorkspace.GetTestAccessor().GetSolutionAsync(assetProvider, solutionChecksum, fromPrimaryBranch: false, workspaceVersion: -1, CancellationToken.None); | ||
| var synched = await remoteWorkspace.GetTestAccessor().GetSolutionAsync(assetProvider, solutionChecksum, updatePrimaryBranch: false, workspaceVersion: -1, CancellationToken.None); |
renamed this to make clear what the flag conveys: it is an actual request to the RemoteWorkspace to not only get the solution but also update its primary solution to point at it.
| /// <summary> | ||
| /// Wrapper around asynchronously produced solution. The computation for producing the solution will be | ||
| /// canceled when the number of in-flight operations using it goes down to 0. | ||
| /// </summary> |
i moved away from ReferenceCountedDisposable. It was getting very hard to track what was going on, and the indirections it forced added a lot of complexity around locking and updating internal state here. By using a private type for the cache, and the item being cached, we can safely and uniformly lock around everything, which enables us to perform all operations, synchronously, atomically and sanely.
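That "one lock around everything" shape can be sketched like this. Python model with `threading.Lock` standing in for the C# gate; the class and callback names are hypothetical, and the real code's teardown cancels an async lazy computation rather than calling a plain function.

```python
import threading

class SolutionEntry:
    """Sketch: a cached solution whose computation is torn down once the
    in-flight count hits zero. All bookkeeping happens under one lock,
    so every count transition is atomic."""

    def __init__(self, lock, on_released):
        self._lock = lock
        self._on_released = on_released  # e.g. cancel the lazy computation
        self._in_flight_count = 1        # creator holds the first reference

    def increment_in_flight_count(self):
        with self._lock:
            # Reviving a fully-released entry would hand out a solution
            # whose computation was already canceled.
            assert self._in_flight_count >= 1
            self._in_flight_count += 1

    def decrement_in_flight_count(self):
        with self._lock:
            assert self._in_flight_count >= 1
            self._in_flight_count -= 1
            if self._in_flight_count == 0:
                self._on_released()
```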
| public void IncrementInFlightCount_WhileAlreadyHoldingLock() | ||
| { | ||
| Contract.ThrowIfFalse(_cache._gate.CurrentCount == 0); | ||
| Contract.ThrowIfTrue(_inFlightCount < 1); |
you'll see a ton of contract checks around the bookkeeping. I absolutely do not want to get this wrong as that could cause:
- an incoming request to try to use an async-solution in the canceled state (which would cause cancellation throws on a different token).
- a solution to be cached forever if we don't properly clean up after ourselves.
|
|
||
| public void IncrementInFlightCount() | ||
| { | ||
| using (_cache._gate.DisposableWait(CancellationToken.None)) |
taking this gate is never cancellable in this type. We always want the increment/decrement code (and any cleanup it causes) to run so we're left in a consistent state.
| using (await _gate.DisposableWaitAsync(cancellationToken).ConfigureAwait(false)) | ||
| { | ||
| // From this point on we are mutating state. Ensure we absolutely do not cancel accidentally. | ||
| cancellationToken = default; |
once we acquire these locks, i always null-out the cancellation token so we don't even write an errant line that causes us to cancel after we've done some mutation.
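The "neuter the token once mutation starts" pattern looks roughly like this. Python sketch: `CancellationToken` here is a minimal hypothetical stand-in for the .NET type, and the function name is illustrative.

```python
import threading

class CancellationToken:
    """Minimal stand-in for a .NET CancellationToken (hypothetical)."""

    def __init__(self, cancelled=False):
        self.cancelled = cancelled

    def throw_if_cancellation_requested(self):
        if self.cancelled:
            raise RuntimeError("operation cancelled")

NONE_TOKEN = CancellationToken()  # analogue of CancellationToken.None

def set_last_requested(gate, cache, checksum, solution, cancellation_token):
    with gate:
        # From here on we are mutating state: rebind the token to the
        # non-cancellable one so no errant check can abort mid-mutation.
        cancellation_token = NONE_TOKEN
        cache["checksum"] = checksum
        cancellation_token.throw_if_cancellation_requested()  # now a no-op
        cache["solution"] = solution
```

Rebinding the local makes it impossible for a later line in the method to accidentally observe the caller's cancellation after some, but not all, of the mutations have run.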
any reason to not use CancellationToken.None?
| @@ -3,18 +3,14 @@ | |||
| // See the LICENSE file in the project root for more information. | |||
this file is effectively rewritten. I recommend looking at the after side.
| @@ -0,0 +1,96 @@ | |||
| // Licensed to the .NET Foundation under one or more agreements. | |||
The following two types/files (ChecksumToSolutionCache and SolutionAndInFlightCount) are utility types that the main bulk of the change benefits from. I tried to keep them as clear as i could given that they're implementing a ref-counting system.
| bool updatePrimaryBranch, | ||
| Func<Solution, ValueTask<T>> implementation, | ||
| CancellationToken cancellationToken) | ||
| { |
this is the meat of the change and is very relevant.
|
@dibarbet ptal.
| /// intervening updates, we can cache and keep the solution around instead of having to recompute it. | ||
| /// </summary> | ||
| private Checksum? _lastRequestedChecksum; | ||
| private SolutionAndInFlightCount? _lastRequestedSolution; |
this is what we used to use the primary branch solution for, correct?
| using (await _gate.DisposableWaitAsync(cancellationToken).ConfigureAwait(false)) | ||
| { | ||
| // From this point on we are mutating state. Ensure we absolutely do not cancel accidentally. | ||
| cancellationToken = default; |
any reason to not use CancellationToken.None?
| // solution we just computed, even if we have returned. This also ensures that if we promoted a | ||
| // non-primary-solution to a primary-solution that it will now take precedence in all our caches for this | ||
| // particular checksum. | ||
| await _anyBranchSolutionCache.SetLastRequestedSolutionAsync(solutionChecksum, solution, cancellationToken).ConfigureAwait(false); |
I am a little suspicious of this, in that the retrieval of the last solution and the update of the last solution do not share the cache's lock.
That would allow multiple requests to race between getting the solution and updating the last solution (e.g. an 'older' solution could overwrite the real last-requested solution). Maybe it doesn't matter, but it seems like the get and the update to the cache should happen under a single acquisition of the lock.
so the last solution just becomes an impl detail of the cache
i like that a lot. let me try to make that change.
ok. change made.
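The suggested shape could be sketched like this (Python, hypothetical names): the read of the last-requested slot and its update happen under a single acquisition of the lock, so a stale request cannot interleave between them.

```python
import threading

class ChecksumToSolutionCache:
    """Sketch: the 'last requested' slot is an implementation detail of
    the cache, read and written under one lock acquisition."""

    def __init__(self):
        self._gate = threading.Lock()
        self._last_requested_checksum = None
        self._last_requested_solution = None

    def get_or_create(self, checksum, compute_solution):
        with self._gate:
            if checksum == self._last_requested_checksum:
                return self._last_requested_solution
            # Simplification: the real code computes the solution outside
            # the lock; we hold it here only to show the get and the
            # last-requested update as one atomic step.
            solution = compute_solution()
            self._last_requested_checksum = checksum
            self._last_requested_solution = solution
            return solution
```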
Nope. I'll switch to that. They're basically interchangeable to me since we write
|
closing out in favor of #63028
The main thing this improves is that currently we have a very weird model in RemoteWorkspace where a single checksum can map first to a forked solution, and then potentially to the primary workspace solution later on (once an 'UpdatePrimaryWorkspace' call comes in). The existing code did not handle this well, which led to some serious confusion about which solution to use.
The new code cleanly splits the concepts of the cache for the primary workspace solutions and the cache for non-primary ones, making it easier to reason about.