
Batch up and flatten diagnostic updates, second approach. #23409

Closed
CyrusNajmabadi wants to merge 32 commits into dotnet:master from CyrusNajmabadi:bgDiagnostics

Conversation

@CyrusNajmabadi
Contributor

This is a followup to #23377. The primary difference between this PR and that one is that we also do filtering of diagnostic events on the BG (acknowledging that this is racy, but just as racy as before).

@CyrusNajmabadi
Contributor Author

Tagging @sharwell @heejaechang @dotnet/roslyn-ide

@CyrusNajmabadi
Contributor Author

I think this will make code really simple.

Definitely. Diagnostics were always different because we couldn't do the standard "hear event, then go retrieve data" approach. If that's changed, then we should definitely just make the diagnostic tagger work that way.

@heejaechang
Contributor

heejaechang commented Nov 28, 2017

@CyrusNajmabadi there is still a slight difference in semantics. Our tagger framework is designed for the tagger to be the one that drives tag calculation. When the event source is raised, it calculates new data in the BG, and when it is done calculating, it lets the tagger know it is time to refresh with the newly calculated data.

In the new approach I suggested above, the tagger is still not the one that drives calculating new data; it is still the solution crawler that does it. So when ProduceNewTag is called, it doesn't actually ask IDiagnosticService to recalculate the data; it asks it to just return the latest data it currently has, which can already be stale.

But even in the pure tagger design, since calculation is done in the BG, it doesn't guarantee that at the end it returns completely correct data; it just guarantees that at the end the data will be current, which is the same for the diagnostic service.

@CyrusNajmabadi
Contributor Author

@CyrusNajmabadi I am not sure where you got that info.

From talking to you a long time ago :)

The original design was that some diagnostic systems wouldn't actually cache things, and thus we needed events. I'm super happy to make it so we can simplify all of this if we actually can just dynamically retrieve the set of diagnostics when necessary.

The diagnostic tagger was always the unfortunate odd tagger out (and i guess rename tracking as well). If we can change this that will be fantastic.

@CyrusNajmabadi
Contributor Author

there is still a slight difference in semantics. Our tagger framework is designed for the tagger to be the one that drives tag calculation. When the event source is raised, it calculates new data in the BG, and when it is done calculating,

I wouldn't say that that's the case. The design is that the tagger initiates computation of tags, not that it has to do all the computation itself. It is free to call out to get the data from anywhere it needs (including another system that has already computed and cached it).

The only core invariant that the tagging system needs is an ordering between events and production. The assumption is that once the TaggerEventSource fires that it will be possible to ProduceTags immediately afterwards. If you can supply an implementation that follows that contract, then you can be a tagger.
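That contract could be sketched roughly as follows. This is a minimal, hypothetical sketch (the interface and member names are illustrative, not the actual Roslyn tagging infrastructure); the only invariant it captures is the ordering between the event and tag production.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical names for illustration only. The sole ordering invariant:
// once Changed fires, ProduceTagsAsync may be called and must be able to
// return the data, even if that data was computed and cached by some other
// system (e.g. the diagnostic service) rather than by the tagger itself.
interface ITaggerEventSource
{
    // Raised whenever the underlying data may have changed.
    event EventHandler Changed;
}

interface ITagProducer<TTag>
{
    // Called after Changed fires; free to fetch already-computed results
    // from anywhere rather than recomputing them itself.
    Task<IReadOnlyList<TTag>> ProduceTagsAsync(CancellationToken cancellationToken);
}
```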

@heejaechang
Contributor

@CyrusNajmabadi

"The original design was that some diagnostic systems wouldn't actually cache things, "

Ah, yes. Things like EnC probably weren't cached before; only diagnostics from diagnostic analyzers were cached. But now they are, so it should work fine now.

@CyrusNajmabadi
Contributor Author

CyrusNajmabadi commented Nov 28, 2017

Another question:

When you respond to ProduceTagsAsync, will you have to go retrieve all the tags for all the DiagnosticUpdateArgs? Will you get a DiagnosticUpdateArgs for each Id for that document (even if there are no Diagnostics), or do you only have to query for diagnostics that actually exist?

I imagine in practice it won't matter much. This will be on the BG, so recomputing tags won't be bad. And since they'll get diffed, the editor should still always get the minimal diff that it needs to update.

@CyrusNajmabadi
Contributor Author

Ah, yes. Things like EnC probably weren't cached before; only diagnostics from diagnostic analyzers were cached. But now they are, so it should work fine now.

Yup. That sounds familiar. I'm glad to see we've changed how diagnostics work. Having an oracle that can always provide them on demand fits so much more nicely into the rest of the roslyn system. Thanks for doing this!

--

Note: I'm also happy to make the tagging change if you want. Since you made diagnostics better, it's only fair that someone helps you out on the tagging part :)

@heejaechang
Contributor

@CyrusNajmabadi the event source for diagnostic changed events needs to somehow tell the tagger that this event source update is for this object id. I haven't thought through how to do that yet, but once you have the object id, you can get the diagnostics for that id in this document.

@CyrusNajmabadi
Contributor Author

CyrusNajmabadi commented Nov 28, 2017

I haven't thought through how to do that yet, but once you have the object id, you can get the diagnostics for that id in this document.

Gotcha. Yeah, right now there's no way to thread that information through. One possible option would be to carry it in a subclass of TaggerEventArgs, and then pass that data all the way along to ProduceTagsAsync.

Or, you could just enumerate all ids for that document and grab all the diagnostics for them. I imagine that most of them will be empty most of the time. If we're returning a cached empty enumerator, then it's likely that we can do this without allocations.

Aside: we should probably change IDiagnosticService to not return an IEnumerable, but instead take in an array builder. We could then avoid all collection allocations when retrieving data.
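One hedged sketch of what that aside could look like. The interface and method names below are hypothetical, and `List<T>` stands in for Roslyn's pooled `ArrayBuilder<T>` so the sketch stays self-contained; the point is just that a caller-supplied builder avoids allocating a collection per query.

```csharp
using System.Collections.Generic;

// Illustrative only: DiagnosticData and DocumentId are stand-ins for the
// Roslyn-internal types, and this is a sketch of the proposed shape, not
// the real IDiagnosticService API.
interface IDiagnosticServiceSketch
{
    // Current shape: a fresh collection may be allocated per call.
    IEnumerable<DiagnosticData> GetDiagnostics(DocumentId documentId, object id);

    // Proposed shape: the caller passes in a (typically pooled) builder, so
    // nothing is allocated when an id has no diagnostics for this document.
    void AppendDiagnostics(DocumentId documentId, object id, List<DiagnosticData> builder);
}

// Minimal stand-ins so the sketch compiles on its own.
sealed class DiagnosticData { }
sealed class DocumentId { }
```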

@CyrusNajmabadi
Contributor Author

Gotcha. Yeah, right now there's no way to thread that information through. One possible option would be to carry it in a subclass of TaggerEventArgs, and then pass that data all the way along to ProduceTagsAsync.

Basically, if you got any other event, you'd have to query for all diagnostics. But if you got a diagnostic event, you'd then be able to just retrieve those. But, as before, it's possible that that level of specificity isn't necessary.

@heejaechang
Contributor

"Basically, if you got any other event, you'd have to query for all diagnostics. But if you got a diagnostic event, you'd then be able to just retrieve those"

Yep, exactly.

Other events are really rare, so retrieving everything and refreshing should be fine. Diagnostic changed events are frequent (you can get many per character change), so querying for one specific id would be cheaper, especially since there will be many, many cancellations.

@Pilchie
Member

Pilchie commented Nov 28, 2017

Need ask mode template.

@CyrusNajmabadi
Contributor Author

Need ask mode template.

@sharwell Can you provide?

@sharwell
Contributor

@CyrusNajmabadi Yes, I'll get it added later tonight.

@CyrusNajmabadi
Contributor Author

Awesome. Thanks!

@sharwell
Contributor

Customer scenario

Through an unknown sequence of events, the IDE could enter a state where the UI thread failed to process work items in the IForegroundNotificationService as fast as they were added. As the work items accumulated, the system became strained for memory and the processing progressively slowed. Eventually the application would fail due to an out of memory condition.

Bugs this fixes

Fixes DevDiv 512757

Workarounds, if any

None.

Risk

This is inherently a moderate risk change due to an algorithm change in multi-threaded queuing/caching logic. After reviewing several proposed solutions, this approach appears to minimize the overall risk by focusing specifically on addressing the primary state problem revealed by MemoryWatson.

Performance impact

For most users, this is not likely to have a performance impact. However, for a subset of users it will result in a substantial performance improvement. The change is expected to be most easily observed on relatively under-powered machines, where processing the additional work items took more time.

Is this a regression from a previous update?

No.

Root cause analysis

The situation is not known to be reproducible in house. It was revealed by MemoryWatson analysis and triaged as the next leading cause of memory pressure in systems reporting low virtual address space ("at risk" state). Sampling of data from impacted systems consistently showed that the tagger was responsible for the accumulating work items, and despite limitations in the analysis tooling, the pigeonhole principle showed that the accumulated work items were often duplicates and/or superseded other queued work items that had not yet been processed.

How was the bug found?

MemoryWatson.

Test documentation updated?

N/A

@sharwell
Contributor

@Pilchie ask mode template is ready

@Pilchie
Member

Pilchie commented Nov 29, 2017

I'm a bit concerned about this one - have we done any manual testing of it for noticeable impacts to the experience? @richaverma1 What do you think?

@CyrusNajmabadi
Contributor Author

I would definitely recommend smoke testing this. The main problem is that our previous approach was racy, and the current approach is racy, but I have no idea if the new approach's raciness is worse for the user experience than before.

And for testing, I would highly recommend some sort of large project where we'd be more likely to see negative issues.

@CyrusNajmabadi
Contributor Author

@Pilchie See #23377 for a less risky approach. It keeps things as they are today with respect to filtering, but batches things up so we only make one UI request every 50 ms, instead of N UI requests for N diagnostic notifications.
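The batching idea can be sketched like this. This is a self-contained, illustrative sketch (the class and member names are hypothetical, and how the batch is posted to the UI thread is left to a caller-supplied delegate); it shows the shape of "one UI request per 50 ms window", not the actual #23377 implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Sketch of event batching: instead of one UI-thread request per diagnostic
// notification, accumulate notifications and flush at most once per 50 ms.
sealed class BatchingNotifier<TArgs>
{
    private readonly object _gate = new object();
    private readonly List<TArgs> _pending = new List<TArgs>();
    private readonly Action<IReadOnlyList<TArgs>> _postBatchToUIThread;
    private bool _flushScheduled;

    public BatchingNotifier(Action<IReadOnlyList<TArgs>> postBatchToUIThread)
        => _postBatchToUIThread = postBatchToUIThread;

    public void OnNotification(TArgs args)
    {
        lock (_gate)
        {
            _pending.Add(args);
            if (_flushScheduled)
                return; // a flush is already queued for this window
            _flushScheduled = true;
        }

        // One flush per 50 ms window, no matter how many events arrive.
        Task.Delay(TimeSpan.FromMilliseconds(50)).ContinueWith(_ => Flush());
    }

    private void Flush()
    {
        List<TArgs> batch;
        lock (_gate)
        {
            batch = new List<TArgs>(_pending);
            _pending.Clear();
            _flushScheduled = false;
        }
        // In the real code this would be a single post to the UI thread.
        _postBatchToUIThread(batch);
    }
}
```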

Member

@jasonmalinowski left a comment


Between this one and #23377 I like the background filtering of this one, but am very concerned about the raciness of the approach, or the raciness of this code in general now that I'm aware of it. 😄

Given it seems we're able to fetch initial diagnostics during the normal case, I don't see why we wouldn't reuse that same code path for the DocumentID changing case. Imagine that the constructor just called ResetCurrentDocumentIdIfNecessary() after setting up the workspace registration...and that was it. Using a single cancellation token to represent "all analysis on this document, which is cancelled only when we Dispose or when we switch IDs" means you can pass that around and re-check it on the UI thread, a pattern we do (or did?) in inline rename. By rechecking the captured token on the UI thread, it gives you an easy way to see if your processing was stale and you already missed it.

Given we already have a token for the initial population, it seems the token could just be reused accordingly.

@CyrusNajmabadi and @sharwell does that make sense? I'm happy to write up a PR off of Cyrus's that at least illustrates the code movement I'm thinking of, even if I can't in any way guarantee its correctness.


// Use a chain of tasks to make sure that we process each diagnostic event serially.
// This also ensures that the first diagnostic event we hear about will be processed
// after the initial background work to get the first group of diagnostics.
Member


Comment that this is protected by _gate? Consider a cref too so renames get updated.

// Disconnect from the old workspace notifications and hook up to the new ones.
lock (_gate)
{
if (_workspace != null)
Member


I think there's some theoretical insanity where, while we have queued this notification, another thread goes off and changes it under you again. In any case, the pattern here is what I would recommend.


var document = _subjectBuffer.AsTextContainer().GetOpenDocumentInCurrentContext();

// Safe to read _currentDocumentId here, we are the fg thread.
Member


The comment on _gate doesn't state that unlocked foreground thread reads are OK...

_workspace.DocumentActiveContextChanged += OnDocumentActiveContextChanged;
}

// Kick off a background task to collect the initial set of diagnostics.
Member


I don't understand why we wouldn't rerun this process any time the document ID that we're watching changes, which would seem like it would address the race concerns being brought up by the background filtering. Put another way, "we were created and started looking at a DocumentID" and "context changed, you're watching a different DocumentID" really should be the same codepaths.

One pattern you might be able to use then is: whenever the document ID we're tracking changes, create a new CancellationTokenSource a la _initialDiagnosicsCancellationSource and cancel the old one. That's captured by the task that does the initial population, but also by any tasks that did background filtering and the resulting posts to the UI thread. The UI thread checks its local copy of that token, so if the document is changed, any in-flight queued work sees the token is cancelled and knows not to update anything.
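A minimal sketch of that token-per-tracked-document pattern, under stated assumptions: all names here are hypothetical, a string stands in for DocumentId, and the real population work is a placeholder. It only illustrates the "cancel everything tied to the old id, re-check the captured token on the UI thread" shape being described.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch: each time the tracked document id changes, cancel all in-flight
// work tied to the old id (initial population, background filtering, queued
// UI posts) and start over with a fresh token.
sealed class TrackedDocumentWorkSketch
{
    private CancellationTokenSource _cancellationSource = new CancellationTokenSource();

    public void OnTrackedDocumentIdChanged(string newDocumentId)
    {
        // Invalidate everything still running against the previous id.
        _cancellationSource.Cancel();
        _cancellationSource = new CancellationTokenSource();
        var token = _cancellationSource.Token;

        // Re-run the same initial-population codepath used at construction.
        Task.Run(() => PopulateInitialDiagnosticsAsync(newDocumentId, token), token);
    }

    // On the UI thread, re-check the captured token: if the id changed while
    // this update sat in the queue, the work is stale and must be dropped.
    public void ApplyOnUIThread(CancellationToken capturedToken, Action update)
    {
        if (capturedToken.IsCancellationRequested)
            return;
        update();
    }

    private Task PopulateInitialDiagnosticsAsync(string documentId, CancellationToken token)
        => Task.CompletedTask; // placeholder for the real population work
}
```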

@heejaechang
Contributor

@jasonmalinowski I think you should instead code review this one (#23448).

@jasonmalinowski
Member

@heejaechang OK, I reviewed that one too.

@CyrusNajmabadi
Contributor Author

Closing in favor of: #23448

@CyrusNajmabadi CyrusNajmabadi deleted the bgDiagnostics branch January 16, 2018 20:54

Labels

Area-IDE
Community (The pull request was submitted by a contributor who is not a Microsoft employee.)
Tenet-Reliability (Customer telemetry indicates that the product is failing in a crash/hang/dataloss manner.)



5 participants