Batch up and flatten diagnostic updates, second approach. #23409
CyrusNajmabadi wants to merge 32 commits into dotnet:master from
Conversation
…lutions from different workspaces.
Tagging @sharwell @heejaechang @dotnet/roslyn-ide
Definitely. Diagnostics were always different because we couldn't do the standard "hear event, then go retrieve data" approach. If that's changed, then we should def just make the diagnostic tagger work that way.
@CyrusNajmabadi There is still a slight difference in semantics. Our tagger framework is designed so that the tagger is the one that drives tag calculation: when the event source is raised, it calculates new data in the background, and when it's done calculating, it lets the tagger know it's time to refresh with the newly calculated data. In the new approach I suggested above, the tagger is still not the one driving the calculation of new data; the solution crawler still does that. So when ProduceNewTag is called, it doesn't actually ask IDiagnosticService to recalculate the data, it just asks it to return the latest data it currently has, which may already be stale. But even in the pure tagger design, since calculation is done in the background, there's no guarantee that the data returned at the end is completely correct; it only guarantees that the data will eventually be current, which is the same guarantee the diagnostic service provides.
From talking to you a long time ago :) The original design was that some diagnostic systems wouldn't actually cache things, and thus we needed events. I'm super happy to make it so we can simplify all of this if we actually can just dynamically retrieve the set of diagnostics when necessary. The diagnostic tagger was always the unfortunate odd tagger out (and I guess rename tracking as well). If we can change this that will be fantastic.
I wouldn't say that that's the case. The design is that the tagger initiates computation of tags, not that it has to do all the computation itself. It is free to call out to get the data from anywhere it needs (including another system that has already computed and cached it). The only core invariant that the tagging system needs is an ordering between events and production. The assumption is that once the TaggerEventSource fires, it will be possible to ProduceTags immediately afterwards. If you can supply an implementation that follows that contract, then you can be a tagger.
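A minimal sketch of the ordering contract described above: the event source fires only once new data is available, so tag production can immediately fetch cached results rather than recompute. All of the names below (ITaggerEventSource, DiagnosticTagger, the delegate) are illustrative, not Roslyn's actual API.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical event source: Changed is raised only after new data is ready.
interface ITaggerEventSource
{
    event EventHandler Changed;
}

class DiagnosticTagger
{
    private readonly Func<CancellationToken, Task<string[]>> _getCachedDiagnostics;

    public DiagnosticTagger(
        ITaggerEventSource eventSource,
        Func<CancellationToken, Task<string[]>> getCachedDiagnostics)
    {
        _getCachedDiagnostics = getCachedDiagnostics;

        // The core invariant: by the time Changed fires, ProduceTagsAsync
        // can retrieve the data without triggering recomputation.
        eventSource.Changed += async (s, e) =>
            await ProduceTagsAsync(CancellationToken.None);
    }

    private async Task ProduceTagsAsync(CancellationToken cancellationToken)
    {
        // This read returns whatever is cached; it may be slightly stale,
        // as discussed above, but it will eventually be current.
        var diagnostics = await _getCachedDiagnostics(cancellationToken);
        // ...convert diagnostics to tags and notify the editor...
    }
}
```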
"The original design was that some diagnostic systems wouldn't actually cache things" — ah, yes. Things like EnC probably weren't cached before; only diagnostics from diagnostic analyzers were cached. But now they are, so it should work fine now.
Another question: When you respond to ProduceTagsAsync, will you have to go retrieve all the tags for all the DiagnosticUpdateArgs? Will you get a DiagnosticUpdateArgs for each Id for that document (even if there are no Diagnostics), or do you only have to query for diagnostics that actually exist? I imagine in practice it won't matter much. This will be on the BG, so recomputing tags won't be bad. And since they'll get diffed, the editor should still always get the minimal diff that it needs to update.
Yup. That sounds familiar. I'm glad to see we've changed how diagnostics work. Having an oracle that can always provide them on demand fits so much more nicely into the rest of the Roslyn system. Thanks for doing this! -- Note: I'm also happy to make the tagging change if you want. Since you made diagnostics better, it's only fair that someone helps you out on the tagging part :)
@CyrusNajmabadi The event source for diagnostic-changed events needs some way to tell the tagger that this update is for a particular object id. I haven't thought through how to do that yet, but once you have the object id, you can get the diagnostics for that id in this document.
Gotcha. Yeah, right now there's no way to thread that information through. One possible option would be to carry it in a subclass of TaggerEventArgs, and then pass that data all the way along to ProduceTagsAsync. Or, you could just enumerate all ids for that document and grab all the diagnostics for them. I imagine that most of them will be empty most of the time. If we're returning a cached empty enumerator, then it's likely that we can do this without allocations. Aside: we should probably change IDiagnosticService to not return an IEnumerable, but instead take in an array builder. We could then avoid all collection allocations when retrieving data.
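To illustrate the aside above, here is a sketch of a push-based retrieval API that fills a caller-supplied buffer instead of returning an IEnumerable, so the caller can pool and reuse the buffer across calls. The interface and member names are hypothetical, not the actual IDiagnosticService shape.

```csharp
using System.Collections.Generic;

// Hypothetical push-based variant of the retrieval API. The caller owns
// the buffer, so no collection is allocated per query; an id with no
// diagnostics simply appends nothing.
interface IDiagnosticSource
{
    void AppendDiagnostics(object id, List<string> buffer);
}

static class Usage
{
    public static void QueryAll(IDiagnosticSource source, IEnumerable<object> ids)
    {
        // One reusable buffer for all ids in the document.
        var buffer = new List<string>();
        foreach (var id in ids)
        {
            source.AppendDiagnostics(id, buffer);
        }
        // ...convert buffer contents to tags...
    }
}
```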
Basically, if you got any other event, you'd have to query for all diagnostics. But if you got a diagnostic event, you'd then be able to just retrieve those. But, as before, it's possible that that level of specificity isn't necessary.
"Basically, if you got any other event, you'd have to query for all diagnostics. But if you got a diagnostic event, you'd then be able to just retrieve those" — yep, exactly. Other events are really rare, so retrieving everything and refreshing should be fine. Diagnostic-changed events are frequent (you can get many per character change), so querying for one specific id would be cheaper, especially since there will be many, many cancellations.
Need ask mode template.
@sharwell Can you provide?
@CyrusNajmabadi Yes, I'll get it added later tonight.
Awesome. Thanks!
Customer scenario
Through an unknown sequence of events, the IDE could enter a state where the UI thread failed to process work items in the

Bugs this fixes
Fixes DevDiv 512757

Workarounds, if any
None.

Risk
This is inherently a moderate-risk change due to an algorithm change in multi-threaded queuing/caching logic. After reviewing several proposed solutions, this approach appears to minimize the overall risk by focusing specifically on addressing the primary state problem revealed by MemoryWatson.

Performance impact
For most users, this is not likely to have a performance impact. However, for a subset of users it will result in a substantial performance improvement. The change is expected to be most easily observed on relatively under-powered machines, where processing the additional work items took more time.

Is this a regression from a previous update?
No.

Root cause analysis
The situation is not known to be reproducible in house. It was revealed by MemoryWatson analysis and triaged as the next leading cause of memory pressure in systems reporting low virtual address space ("at risk" state). Sampling of data from impacted systems consistently showed that the tagger was responsible for the accumulating work items, and despite limitations in the analysis tooling, the pigeonhole principle showed that the accumulated work items were often duplicates and/or superseded other queued work items that were not yet processed.

How was the bug found?
MemoryWatson.

Test documentation updated?
N/A
@Pilchie Ask mode template is ready.
I'm a bit concerned about this one - have we done any manual testing of it for noticeable impacts to the experience? @richaverma1 What do you think?
I would definitely recommend smoke testing this. The main problem is that our previous approach was racy, and the current approach is racy, but I have no idea whether the new approach's raciness is worse for the user experience than before. And for testing, I would highly recommend some sort of large project where we'd be more likely to see negative issues.
jasonmalinowski left a comment
Between this one and #23377 I like the background filtering of this one, but am very concerned about the raciness of the approach, or the raciness of this code in general now that I'm aware of it. 😄
Given it seems we're able to fetch initial diagnostics during the normal case, I don't see why we wouldn't reuse that same code path for the DocumentID changing case. Imagine that the constructor just called ResetCurrentDocumentIdIfNecessary() after setting up the workspace registration...and that was it. Using a single cancellation token to represent "all analysis on this document, which is cancelled only when we Dispose or when we switch IDs" means you can pass that around and re-check it on the UI thread, a pattern we do (or did?) in inline rename. By rechecking the captured token on the UI thread, it gives you an easy way to see if your processing was stale and you already missed it.
Given we already have a token for the initial population, it seems the token could just be reused accordingly.
@CyrusNajmabadi and @sharwell does that make sense? I'm happy to write up a PR off of Cyrus's that at least illustrates the code movement I'm thinking of, even if I can't in any way guarantee its correctness.
```csharp
// Use a chain of tasks to make sure that we process each diagnostic event serially.
// This also ensures that the first diagnostic event we hear about will be processed
// after the initial background work to get the first group of diagnostics.
```
Comment that this is protected by _gate? Consider a cref too so renames get updated.
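A minimal sketch of the "chain of tasks" pattern the quoted comment describes: each new event continues off the previous task under a lock, so events are processed serially, and the first diagnostic event runs only after the initial population task completes. All names here are illustrative, not the PR's actual types.

```csharp
using System;
using System.Threading.Tasks;

class SerialEventProcessor
{
    private readonly object _gate = new object();
    private Task _latestTask;

    // Seed the chain with the initial background population, so the first
    // event we hear about is processed after the initial diagnostics arrive.
    public SerialEventProcessor(Task initialPopulation)
        => _latestTask = initialPopulation;

    public void Enqueue(Action processEvent)
    {
        lock (_gate)
        {
            // Continue off the last task in the chain; because each
            // continuation starts only after its antecedent finishes,
            // no two work items ever run concurrently.
            _latestTask = _latestTask.ContinueWith(
                _ => processEvent(), TaskScheduler.Default);
        }
    }
}
```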
```csharp
// Disconnect from the old workspace notifications and hook up to the new ones.
lock (_gate)
{
    if (_workspace != null)
```
I think there's some theoretical insanity where, while we have queued this notification, another thread goes off and changes it under you again. In any case, the pattern here is what I would recommend.
```csharp
var document = _subjectBuffer.AsTextContainer().GetOpenDocumentInCurrentContext();
```
```csharp
// Safe to read _currentDocumentId here, we are the fg thread.
```
The comment on _gate doesn't state that unlocked foreground thread reads are OK...
```csharp
_workspace.DocumentActiveContextChanged += OnDocumentActiveContextChanged;
}
```
```csharp
// Kick off a background task to collect the initial set of diagnostics.
```
I don't understand why we wouldn't rerun this process any time the document ID that we're watching changes, which would seem like it would address the race concerns being brought up by the background filtering. Put another way, "we were created and started looking at a DocumentID" and "context changed, you're watching a different DocumentID" really should be the same codepaths.
One pattern you might be able to use then is whenever the document ID we're tracking changes, create a new CancellationTokenSource a la _initialDiagnosicsCancellationSource and cancel the old one. That's captured by the task that does the initial population, but also any tasks that did background filtering and the resulting posts to the UI thread. The UI thread checks the local copy of that token, so if the document is changed, any in-flight queued work sees the token is cancelled and knows not to update anything.
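The token-swap pattern suggested above can be sketched as follows: when the tracked document ID changes, cancel the old source and create a new one; any in-flight work captured the old token, so when its results reach the UI thread it observes the cancellation and drops them. The class and member names are illustrative, not the PR's actual code.

```csharp
using System.Threading;

class DocumentTracker
{
    private CancellationTokenSource _cancellationSource = new CancellationTokenSource();

    // Called (on the UI thread) whenever the document ID we track changes,
    // including the initial setup in the constructor.
    public CancellationToken ResetForNewDocument()
    {
        // Invalidate all outstanding work tied to the previous document:
        // initial population, background filtering, and queued UI posts.
        _cancellationSource.Cancel();
        _cancellationSource = new CancellationTokenSource();

        // The repopulation task and its continuations capture this token.
        return _cancellationSource.Token;
    }

    // Called on the UI thread with the token that was captured when the
    // background work was started.
    public void ApplyResults(CancellationToken capturedToken)
    {
        if (capturedToken.IsCancellationRequested)
            return; // stale work from an earlier document; drop it

        // ...apply the computed results...
    }
}
```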
@jasonmalinowski I think you should instead code review this (#23448)
@heejaechang OK, I reviewed that one too.
Closing in favor of: #23448 |
This is a followup to #23377. The primary difference between this PR and that one is that we also do filtering of diagnostic events on the BG (acknowledging that this is racy, but just as racy as before).