
Batch up and flatten diagnostic updates, second approach. #23409

Closed
CyrusNajmabadi wants to merge 32 commits into dotnet:master from CyrusNajmabadi:bgDiagnostics

Conversation

@CyrusNajmabadi
Contributor

This is a followup to #23377. The primary difference between this PR and that one is that we also do filtering of diagnostic events on the BG (acknowledging that this is racy, but just as racy as before).

@CyrusNajmabadi
Contributor Author

Tagging @sharwell @heejaechang @dotnet/roslyn-ide

@CyrusNajmabadi
Contributor Author

I think this will make code really simple.

Definitely. Diagnostics were always different because we couldn't do the standard "hear event, then go retrieve data" approach. If that's changed, then we should definitely just make the diagnostic tagger work that way.

@heejaechang
Contributor

heejaechang commented Nov 28, 2017

@CyrusNajmabadi there is still a slight difference in semantics. Our tagger framework is designed for the tagger to be the one that drives tag calculation. When the event source is raised, it calculates new data in the BG, and when it is done calculating, it lets the tagger know it is time to refresh with the newly calculated data.

In the new approach I suggested above, the tagger is still not the one that drives calculating new data; it is still the solution crawler that does it. So when ProduceNewTag is called, it doesn't actually ask IDiagnosticService to recalculate the data; it asks it to just return the latest data it currently has, which can already be stale.

But even in the pure tagger design, since calculation is done in the BG, it doesn't guarantee that at the end it returns completely correct data; it just guarantees that at the end the data will be current, which is the same for the diagnostic service.

@CyrusNajmabadi
Contributor Author

@CyrusNajmabadi I am not sure where you got that info.

From talking to you a long time ago :)

The original design was that some diagnostic systems wouldn't actually cache things, and thus we needed events. I'm super happy to make it so we can simplify all of this if we actually can just dynamically retrieve the set of diagnostics when necessary.

The diagnostic tagger was always the unfortunate odd tagger out (and i guess rename tracking as well). If we can change this that will be fantastic.

@CyrusNajmabadi
Contributor Author

there is still a slight difference in semantics. Our tagger framework is designed for the tagger to be the one that drives tag calculation. When the event source is raised, it calculates new data in the BG, and when it is done calculating,

I wouldn't say that that's the case. The design is that the tagger initiates computation of tags, not that it has to do all the computation itself. It is free to call out to get the data from anywhere it needs (including another system that has already computed and cached it).

The only core invariant that the tagging system needs is an ordering between events and production. The assumption is that once the TaggerEventSource fires that it will be possible to ProduceTags immediately afterwards. If you can supply an implementation that follows that contract, then you can be a tagger.
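That contract could be sketched roughly as follows. This is a minimal, hypothetical sketch (the interface and member names are illustrative, not the actual Roslyn tagging infrastructure); the only invariant it captures is the ordering between the event and tag production.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical names for illustration only. The sole ordering invariant:
// once Changed fires, ProduceTagsAsync may be called and must be able to
// return the data, even if that data was computed and cached by some other
// system (e.g. the diagnostic service) rather than by the tagger itself.
interface ITaggerEventSource
{
    // Raised whenever the underlying data may have changed.
    event EventHandler Changed;
}

interface ITagProducer<TTag>
{
    // Called after Changed fires; free to fetch already-computed results
    // from anywhere rather than recomputing them itself.
    Task<IReadOnlyList<TTag>> ProduceTagsAsync(CancellationToken cancellationToken);
}
```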

@heejaechang
Contributor

@CyrusNajmabadi

"The original design was that some diagnostic systems wouldn't actually cache things, "

Ah, yes. Things like EnC probably weren't cached before; only diagnostics from diagnostic analyzers were cached. But now they are, so it should work fine now.

@CyrusNajmabadi
Contributor Author

CyrusNajmabadi commented Nov 28, 2017

Another question:

When you respond to ProduceTagsAsync, will you have to go retrieve all the tags for all the DiagnosticUpdateArgs? Will you get a DiagnosticUpdateArgs for each Id for that document (even if there are no Diagnostics), or do you only have to query for diagnostics that actually exist?

I imagine in practice it won't matter much. This will be on the BG, so recomputing tags won't be bad. And since they'll get diffed, the editor should still always get the minimal diff that it needs to update.

@CyrusNajmabadi
Contributor Author

Ah, yes. Things like EnC probably weren't cached before; only diagnostics from diagnostic analyzers were cached. But now they are, so it should work fine now.

Yup. That sounds familiar. I'm glad to see we've changed how diagnostics work. Having an oracle that can always provide them on demand fits so much more nicely into the rest of the roslyn system. Thanks for doing this!

--

Note: I'm also happy to make the tagging change if you want. Since you made diagnostics better, it's only fair that someone helps you out on the tagging part :)

@heejaechang
Contributor

@CyrusNajmabadi the event source for diagnostic changed events needs to somehow tell the tagger that this event source update is for this object id. I haven't thought through how to do that yet, but once you have the object id, you can get the diagnostics for that id in this document.

@CyrusNajmabadi
Contributor Author

CyrusNajmabadi commented Nov 28, 2017

I haven't thought through how to do that yet, but once you have the object id, you can get the diagnostics for that id in this document.

Gotcha. Yeah, right now there's no way to thread that information through. One possible option would be to carry it in a subclass of TaggerEventArgs, and then pass that data all the way along to ProduceTagsAsync.

Or, you could just enumerate all ids for that document and grab all the diagnostics for them. I imagine that most of them will be empty most of the time. If we're returning a cached empty enumerator, then it's likely that we can do this without allocations.

Aside: we should probably change IDiagnosticService to not return an IEnumerable, but instead take in an array builder. We could then avoid all collection allocations when retrieving data.
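One hedged sketch of what that aside could look like. The interface and method names below are hypothetical, and `List<T>` stands in for Roslyn's pooled `ArrayBuilder<T>` so the sketch stays self-contained; the point is just that a caller-supplied builder avoids allocating a collection per query.

```csharp
using System.Collections.Generic;

// Illustrative only: DiagnosticData and DocumentId are stand-ins for the
// Roslyn-internal types, and this is a sketch of the proposed shape, not
// the real IDiagnosticService API.
interface IDiagnosticServiceSketch
{
    // Current shape: a fresh collection may be allocated per call.
    IEnumerable<DiagnosticData> GetDiagnostics(DocumentId documentId, object id);

    // Proposed shape: the caller passes in a (typically pooled) builder, so
    // nothing is allocated when an id has no diagnostics for this document.
    void AppendDiagnostics(DocumentId documentId, object id, List<DiagnosticData> builder);
}

// Minimal stand-ins so the sketch compiles on its own.
sealed class DiagnosticData { }
sealed class DocumentId { }
```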

@CyrusNajmabadi
Contributor Author

Gotcha. Yeah, right now there's no way to thread that information through. One possible option would be to carry it in a subclass of TaggerEventArgs, and then pass that data all the way along to ProduceTagsAsync.

Basically, if you got any other event, you'd have to query for all diagnostics. But if you got a diagnostic event, you'd then be able to just retrieve those. But, as before, it's possible that that level of specificity isn't necessary.

@heejaechang
Contributor

"Basically, if you got any other event, you'd have to query for all diagnostics. But if you got a diagnostic event, you'd then be able to just retrieve those"

Yep, exactly.

Other events are really rare, so retrieving everything and refreshing should be fine. Diagnostic changed events are frequent (you can get many per character change), so querying for one specific id would be cheaper, especially since there will be many, many cancellations.

@Pilchie
Member

Pilchie commented Nov 28, 2017

Need ask mode template.

@CyrusNajmabadi
Contributor Author

Need ask mode template.

@sharwell Can you provide?

@sharwell
Contributor

@CyrusNajmabadi Yes, I'll get it added later tonight.

@CyrusNajmabadi
Contributor Author

Awesome. Thanks!

@sharwell
Contributor

Customer scenario

Through an unknown sequence of events, the IDE could enter a state where the UI thread failed to process work items in the IForegroundNotificationService as fast as they were added. As the work items accumulated, the system became strained for memory and the processing progressively slowed. Eventually the application would fail due to an out of memory condition.

Bugs this fixes

Fixes DevDiv 512757

Workarounds, if any

None.

Risk

This is inherently a moderate risk change due to an algorithm change in multi-threaded queuing/caching logic. After reviewing several proposed solutions, this approach appears to minimize the overall risk by focusing specifically on addressing the primary state problem revealed by MemoryWatson.

Performance impact

For most users, this is not likely to have a performance impact. However, for a subset of users it will result in a substantial performance improvement. The change is expected to be most easily observed on relatively under-powered machines, where processing the additional work items took more time.

Is this a regression from a previous update?

No.

Root cause analysis

The situation is not known to be reproducible in house. It was revealed by MemoryWatson analysis and triaged as the next leading cause of memory pressure in systems reporting low virtual address space ("at risk" state). Sampling of data from impacted systems consistently showed that the tagger was responsible for the accumulating work items, and despite limitations in the analysis tooling, the pigeonhole principle showed that the accumulated work items were often duplicates and/or superseded other queued work items that had not yet been processed.

How was the bug found?

MemoryWatson.

Test documentation updated?

N/A

@sharwell
Contributor

@Pilchie ask mode template is ready

@Pilchie
Member

Pilchie commented Nov 29, 2017

I'm a bit concerned about this one - have we done any manual testing of it for noticeable impacts to the experience? @richaverma1 What do you think?

@CyrusNajmabadi
Contributor Author

I would definitely recommend smoke testing this. The main problem is that our previous approach was racy, and the current approach is racy, but I have no idea if the new approach's raciness is worse for the user experience than before.

And for testing, I would highly recommend some sort of large project where we'd be more likely to see negative issues.

@CyrusNajmabadi
Contributor Author

@Pilchie See #23377 for a less risky approach. It keeps things as they are today with respect to filtering, but batches things up so we only make one UI request every 50 ms, instead of N UI requests for N diagnostic notifications.
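The batching idea can be sketched like this. This is a self-contained, illustrative sketch (the class and member names are hypothetical, and how the batch is posted to the UI thread is left to a caller-supplied delegate); it shows the shape of "one UI request per 50 ms window", not the actual #23377 implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Sketch of event batching: instead of one UI-thread request per diagnostic
// notification, accumulate notifications and flush at most once per 50 ms.
sealed class BatchingNotifier<TArgs>
{
    private readonly object _gate = new object();
    private readonly List<TArgs> _pending = new List<TArgs>();
    private readonly Action<IReadOnlyList<TArgs>> _postBatchToUIThread;
    private bool _flushScheduled;

    public BatchingNotifier(Action<IReadOnlyList<TArgs>> postBatchToUIThread)
        => _postBatchToUIThread = postBatchToUIThread;

    public void OnNotification(TArgs args)
    {
        lock (_gate)
        {
            _pending.Add(args);
            if (_flushScheduled)
                return; // a flush is already queued for this window
            _flushScheduled = true;
        }

        // One flush per 50 ms window, no matter how many events arrive.
        Task.Delay(TimeSpan.FromMilliseconds(50)).ContinueWith(_ => Flush());
    }

    private void Flush()
    {
        List<TArgs> batch;
        lock (_gate)
        {
            batch = new List<TArgs>(_pending);
            _pending.Clear();
            _flushScheduled = false;
        }
        // In the real code this would be a single post to the UI thread.
        _postBatchToUIThread(batch);
    }
}
```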

Member

@jasonmalinowski left a comment


Between this one and #23377 I like the background filtering of this one, but am very concerned about the raciness of the approach, or the raciness of this code in general now that I'm aware of it. 😄

Given it seems we're able to fetch initial diagnostics during the normal case, I don't see why we wouldn't reuse that same code path for the DocumentID changing case. Imagine that the constructor just called ResetCurrentDocumentIdIfNecessary() after setting up the workspace registration...and that was it. Using a single cancellation token to represent "all analysis on this document, which is cancelled only when we Dispose or when we switch IDs" means you can pass that around and re-check it on the UI thread, a pattern we do (or did?) in inline rename. By rechecking the captured token on the UI thread, it gives you an easy way to see if your processing was stale and you already missed it.

Given we already have a token for the initial population, it seems the token could just be reused accordingly.

@CyrusNajmabadi and @sharwell does that make sense? I'm happy to write up a PR off of Cyrus's that at least illustrates the code movement I'm thinking of, even if I can't in any way guarantee its correctness.


// Use a chain of tasks to make sure that we process each diagnostic event serially.
// This also ensures that the first diagnostic event we hear about will be processed
// after the initial background work to get the first group of diagnostics.
Member


Comment that this is protected by _gate? Consider a cref too so renames get updated.

// Disconnect from the old workspace notifications and hook up to the new ones.
lock (_gate)
{
if (_workspace != null)
Member


I think there's some theoretical insanity where, while we have queued this notification, another thread goes off and changes it under you again. In any case, the pattern here is what I would recommend.


var document = _subjectBuffer.AsTextContainer().GetOpenDocumentInCurrentContext();

// Safe to read _currentDocumentId here, we are the fg thread.
Member


The comment on _gate doesn't state that unlocked foreground thread reads are OK...

_workspace.DocumentActiveContextChanged += OnDocumentActiveContextChanged;
}

// Kick off a background task to collect the initial set of diagnostics.
Member


I don't understand why we wouldn't rerun this process any time the document ID that we're watching changes, which would seem like it would address the race concerns being brought up by the background filtering. Put another way, "we were created and started looking at a DocumentID" and "context changed, you're watching a different DocumentID" really should be the same codepaths.

One pattern you might be able to use then is: whenever the document ID we're tracking changes, create a new CancellationTokenSource a la _initialDiagnosicsCancellationSource and cancel the old one. That's captured by the task that does the initial population, but also by any tasks that did background filtering and the resulting posts to the UI thread. The UI thread checks its local copy of that token, so if the document is changed, any in-flight queued work sees the token is cancelled and knows not to update anything.
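A minimal sketch of that token-per-tracked-document pattern, under stated assumptions: all names here are hypothetical, a string stands in for DocumentId, and the real population work is a placeholder. It only illustrates the "cancel everything tied to the old id, re-check the captured token on the UI thread" shape being described.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch: each time the tracked document id changes, cancel all in-flight
// work tied to the old id (initial population, background filtering, queued
// UI posts) and start over with a fresh token.
sealed class TrackedDocumentWorkSketch
{
    private CancellationTokenSource _cancellationSource = new CancellationTokenSource();

    public void OnTrackedDocumentIdChanged(string newDocumentId)
    {
        // Invalidate everything still running against the previous id.
        _cancellationSource.Cancel();
        _cancellationSource = new CancellationTokenSource();
        var token = _cancellationSource.Token;

        // Re-run the same initial-population codepath used at construction.
        Task.Run(() => PopulateInitialDiagnosticsAsync(newDocumentId, token), token);
    }

    // On the UI thread, re-check the captured token: if the id changed while
    // this update sat in the queue, the work is stale and must be dropped.
    public void ApplyOnUIThread(CancellationToken capturedToken, Action update)
    {
        if (capturedToken.IsCancellationRequested)
            return;
        update();
    }

    private Task PopulateInitialDiagnosticsAsync(string documentId, CancellationToken token)
        => Task.CompletedTask; // placeholder for the real population work
}
```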

@heejaechang
Contributor

@jasonmalinowski I think you should instead code review this one (#23448).

@jasonmalinowski
Member

@heejaechang OK, I reviewed that one too.

@CyrusNajmabadi
Contributor Author

Closing in favor of: #23448

@CyrusNajmabadi CyrusNajmabadi deleted the bgDiagnostics branch January 16, 2018 20:54

Labels

Area-IDE
Community (The pull request was submitted by a contributor who is not a Microsoft employee.)
Tenet-Reliability (Customer telemetry indicates that the product is failing in a crash/hang/dataloss manner.)



5 participants