reduce diagnostic tagger's usages of UI thread for synchronization #22920
heejaechang wants to merge 3 commits into dotnet:master from
Conversation
register is an async event and unregister is a synchronous event, which can cause an ordering issue (unregister fires first and then the register event gets fired), so I moved all events to async
force-pushed from 3336511 to 87d7bb7
|
retest this please |
|
@dotnet/roslyn-analysis @dotnet/roslyn-ide can you take a look? |
|
It's very unclear to me what the actual issue is here that is being addressed. How did we get 2.1 million pending UI requests in the first place, given how things are batched? |
Yes. How are we ensuring that we are not introducing race conditions here like we had before. The move to the FG thread as an appropriate sync context was very intentionally done because of how many race conditions existed prior to that change.
I don't know what this means: "it uses the same amount of UI threads". There should only be one UI thread. Why would the preview tagger have a particular problem here?
Again, this was due to race conditions.
Again, this was due to race conditions. It became very intentional in the code at which points things could truly be canceled, vs. getting to a point of no return wherein state changes had happened and thus needed to be fully persisted in the appropriate places.
This seems like an issue that sounds safe to fix and could be fixed independently. |
|
@CyrusNajmabadi you can take a look at how many FG requests the diagnostic tagger puts in for a simple Console app. It easily reaches 10,000 requests just by cycling through the preview window and closing the LB. |
|
@CyrusNajmabadi also, you didn't specify exactly what the race was. Since you didn't specify, I have no idea whether there was another way to remove the race instead of doing everything on the UI thread, or whether this will re-introduce the race you are referring to. |
|
"I don't know what this means. "it uses the same amount of UI threads". There should only be one UI thread. Why would the preview tagger have a particular problem here?" I meant the amount of UI thread requests. Before, the simple preview workspace tagger didn't need any of this complex state management, since each analyzer would return diagnostics only once and never change them, so there was no race at all and no reason to use UI thread requests to manipulate state. |
|
@CyrusNajmabadi "Again, this was due to race conditions." so, exactly what race? |
|
"Again, this was due to race conditions. It became very intentional in the code at which points things could truly be canceled, vs getting to a point of no return wherin state changes had happened, and thus needed to fully be persisted in the appropriate places." well, the view is going away and the tagger is disposed. I have no idea what race the tagger can have once it goes away. |
|
"This seems like an issue that sounds safe to fix and could be fixed independently." could be, but right now the diagnostic tagger is the only one (2.1 million requests, almost all of them from the diagnostic tagger) that doesn't cancel work when the work is already invalidated. So this change doesn't do anything without the tagger fix, since all work from the diagnostic tagger doesn't pass the right cancellation token. |
How is that happening? what are those 10k requests? If it's just a console app, how do we have that much going on? I think we need to figure out what's going on in the system instead of trying to fix a symptom here. Even if we make things do more on the BG, we're still doing 10k things on the BG, despite there not seeming to be a good reason for that at all. |
The race conditions were the issues that caused us to have things like squiggles that never went away. Because we were not vigilant about keeping our data store safe, and we were haphazardly accessing and mutating it from several threads, we would end up with incorrect data and improper presentation of that data. We moved to a lockstep approach for the diagnostic tagging system where we would carefully serialize out mutations, and we would then ensure that once mutations occurred the tagging system was updated to reflect the delta changes that we had recorded.
Note: as mentioned in another bug today, I think it's very unfortunate that any of this code exists in the tagging layer at all. The core problem is that the tagging system has had to backward-infer the contract about how diagnostics work in an attempt to actually understand and report what the state of the system is at any point in time for any buffer. As we've seen in the past, there have been numerous events fired that it didn't know it was supposed to listen to. And there have been many complexities about diagnostics that it was unaware of, which caused it to have a different belief about what diagnostics were around versus what the actual diagnostics subsystem thought there should be.
IMO, due to how buggy this entire space has been (for 5+ years now), it really should be rewritten with a far more appropriate architecture. Specifically, instead of having the tagger try to talk to low-level storage and eventing systems, it should talk to a higher-level service that it registers and unregisters documents of interest with. This service, exposed at the diagnostics subsystem layer, would properly aggregate diagnostics for that document and push normal events when they changed. The diagnostic tagger would then be a super simple tagger (like the rest of our taggers). This would isolate the complexity, instead of having it woven into the tagging infrastructure.
If you wanted to make it all happen on the BG for that new service, you could totally do that. Then all this mess of BG/FG interactions in the tagger would not be a problem. |
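The aggregation-service architecture proposed above could look roughly like the following sketch (in Python for brevity; every name here, such as `DiagnosticAggregationService`, is a hypothetical illustration, not a Roslyn type). The service owns per-document diagnostic state and raises plain "changed" events; the tagger just registers interest and re-queries on each event:

```python
class DiagnosticAggregationService:
    """Hypothetical service that owns per-document diagnostic state."""

    def __init__(self):
        self._diagnostics = {}   # document_id -> current list of diagnostics
        self._listeners = {}     # document_id -> callbacks to notify on change

    def register(self, document_id, on_changed):
        # The tagger declares interest in a document here...
        self._listeners.setdefault(document_id, []).append(on_changed)

    def unregister(self, document_id, on_changed):
        # ...and withdraws it when the view goes away.
        self._listeners.get(document_id, []).remove(on_changed)

    def push_update(self, document_id, diagnostics):
        # The diagnostics subsystem pushes raw updates; the service aggregates
        # them and raises a plain "something changed" event to listeners.
        self._diagnostics[document_id] = list(diagnostics)
        for on_changed in self._listeners.get(document_id, []):
            on_changed(document_id)

    def get_diagnostics(self, document_id):
        # Listeners can always pull a consistent snapshot on demand.
        return list(self._diagnostics.get(document_id, []))

# The tagger side then collapses to: on event, re-query and re-tag.
tags = {}
service = DiagnosticAggregationService()
service.register("doc1", lambda d: tags.__setitem__(d, service.get_diagnostics(d)))
service.push_update("doc1", ["CS0103: name does not exist"])
```

With this shape, the tagger never touches low-level storage or eventing; all aggregation complexity lives in one reviewable component.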
Why are there 2.1 million requests from the diagnostic tagger? What is causing that to happen? |
Which cancellation tokens are you talking about, and to which systems is it not passing the token correctly? There are very specific cancellation tokens in the tagger for very specific purposes, including different cancellation tokens to deal with needing to get the first set of diagnostic tags and whatnot. There are ordering concerns between these different parts, and getting it wrong can produce races. For example, it's possible, if you're not careful, to process a "diagnostics-are-changing" event before processing the "these-are-the-initial-diagnostics" event. So we have to be careful here.
My ideal would be to separate these out entirely. For one, it would help majorly simplify the code. For another, it would mean that each system has clear responsibility, i.e. the diagnostics tagger code would not have to think about the threading concerns, and the diagnostics aggregation code would not have to think about tagging concerns. Right now, throwing more complexity into an already complex system is not a viable path forward. Even if we're lucky enough to get it correct (which I'm not convinced of based on my reading of the PR), it will be a very costly system to maintain moving forward. The most likely outcome is just having another set of racey diagnostics bugs appear, and having to revisit this problem 6 months down the line.
The rest of the taggers are effectively a solved space. They've all been working reliably and properly for 5ish years+ at this point. It would be nice to get diagnostic tagging into the same space. |
not passing this token is not good.
it is a field. It doesn't need to be passed in; the code can use the field to check.
what is this code for?
so that the 'unnecessary' tagger doesn't enqueue tasks for diagnostics that are not related to it.
is this racey? what happens if our buffer now maps to a different document? will we suddenly ignore events about the diagnostics for the buffer before the document changed?
no. it won't. if it does, then that is an issue in Workspace.Events it is listening to.
if there is an issue in Workspace events, and we are trying to fix that by moving everything onto the UI thread and doing the filtering there, that is exactly fixing the symptom, not the root cause.
the comment repeats what the code says.
you're on the UI thread. if you throw an exception, what happens? what is higher in the stack to catch this?
this runs in the foreground notification service; it will be caught and ignored.
That is...not obvious. Can we do something else here? or at least comment that the expectation is the foreground notification service should be the caller here?
(I think I've already caught bugs masked by that foreground notification service behavior. Would prefer exceptions just took down process, cancellation included, if exceptions leak that far.)
so, should I throw or just return if cancelled?
same concern. should our UI thread checks actually just be returning instead of throwing.
same here. sure, we can change it to return.
who does this though? Is there a benefit to not just making the entire thing lazy?
I am not sure I am following you. This is how the whole Roslyn Workspace eventing system works.
so, how did the existing code ensure it? |
|
"Diagnostics are (afaict) the only system we have that doesn't follow the example that we have for the rest of Roslyn" I really don't understand what you mean by this. give me the example. |
cycle through previews. |
Maybe if we could have something not host-wide that would be good :)
Sure. But having multiple bad things going on here doesn't mean we shouldn't be pushing for lots of improvements. I don't think DiagnosticService is entirely at fault here. But I would also point out that we haven't had nearly as many issues (especially complex racey issues with enormous surrounding complexity) with WorkspaceChanged. If we were having the same sorts of issues with it, then I would definitely be recommending that we do something to improve it. |
Er.. literally none of the other taggers or presentation systems we have need to do this sort of thing. Squiggle tagging literally is "the example" :) |
well, using WorkspaceChanged events without mixing in other events, sure. Using DiagnosticService events by itself shouldn't give any trouble either. The problem is when someone tries to mix WorkspaceChanged events with DiagnosticService events. Both are async events, and there is no central coordinator that coordinates between the two async event streams, so mixing these two is the issue. I agree; I just don't understand why it is the diagnostic service's responsibility to do that? It is a completely different thing, or the ones that want to mix the two are in charge, in this case, the tagger. |
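The coordination problem described above, two independent async event sources with no central coordinator, can be illustrated with a minimal sketch (Python for brevity; all names are illustrative, not Roslyn APIs). Funneling both sources through one serial queue gives a single consumer a well-defined observation order:

```python
import queue

events = queue.Queue()   # one serial queue shared by both event sources

def on_workspace_changed(payload):
    # Source 1: workspace events are funneled into the shared queue.
    events.put(("workspace", payload))

def on_diagnostics_updated(payload):
    # Source 2: diagnostic events go through the very same queue.
    events.put(("diagnostics", payload))

seen = []
def drain():
    # A single consumer drains the queue, so events are observed in exactly
    # the order they were enqueued, regardless of which source raised them.
    while not events.empty():
        seen.append(events.get())

on_workspace_changed("doc1 opened")
on_diagnostics_updated("doc1: 2 squiggles")
drain()
```

Without such a shared serialization point, the consumer can observe the two streams interleaved in either order, which is exactly the race being discussed.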
For one thing, it definitely did not try to work across workspace registration changes, so it avoided races there entirely. The assumption being that you never trust that stuff: you always start from the buffer and determine which document you are in when you actually run your code. |
I have never asked for this. Please stop saying so. Indeed, I just spent time above to reiterate that I'm not asking for this. :) |
which is a bug? who ensures the tagger is refreshed? it subscribes to the workspace-change event source for the tagger, but it never refreshes tags. so it still shows old tags that belong to the wrong workspace. |
Yes. I just said that above:
I don't think there's any actual disagreement here between us. It sounds like we're saying the same thing. What I'm disagreeing with, then, is how we execute on this problem. Specifically, the execution here in this PR seems to add much too much complexity, and far too much possibility for races to enter. Given that we both seem to recognize and acknowledge at least one of the root problems here, all I'm saying is that I think that root problem should have a dedicated clean, simple, reliable, and well-tested solution provided. Given how deeply coupled this will be to the behavior of IDiagnosticService, and given how IDiagnosticService itself already has deep knowledge and behavioral understanding of WorkspaceEvents, my suggestion would be that someone with deep knowledge of both work on this. |
The point is that that shouldn't be happening directly by using this. Code that cares about this is supposed to use something like TaggerEventSources.OnWorkspaceRegistrationChanged. In other words, "who ensures the tagger is refreshed?" is answered by the tagger architecture: it is the event sources that cause things to be refreshed. That's the pattern all other taggers use. You'll notice none of the rest of the taggers need the 'AggregatingTagger' concept at all. They can all work with event feeds and data sources that can drive them to a fixed point with respect to presenting their data to the tagging system. Diagnostics are different here, and I think it's because diagnostics do not provide a way to consistently query to determine what diagnostic data there is. Instead, diagnostics expose only a live streaming feed of "here is a change", and the expectation is that the receiver must combine that with other system events to try to get a consistent view of what the state of the system is. AFAICT, diagnostics is the only part of our system that behaves in this manner. |
|
I know it uses WorkspaceChangedEventSource, but it doesn't actually ask for new data from the diagnostic service for the new workspace. So it will show old data from another workspace until there is a change for the new workspace that causes new diagnostic events and, at the end, updates the data from the UI thread. So it is currently a bug in the current implementation. The existing tagger design assumes the tagger itself calculates its tag data, rather than there being another service that calculates it for it. The ones where another service calculates the data and the tagger just passes it along usually don't use the existing tagger system (e.g. the rename tagger). For the diagnostics tagger, we decided to use the existing system but use the diagnostic service to get the info. That is a unique choice for the diagnostic tagger, so in my opinion it must be the tagger that deals with it. |
I don't see it as a "choice" :) . Were there other alternatives the diagnostic tagger could have taken? If not, there wasn't really a 'choice', but instead the tagger had to be written that way because of 'necessity'. Again, we decided to expose diagnostics from Roslyn in a way that is different from virtually every other data source in the system. We've seen the consequences of this (5 years+ of ongoing bugs in this area). IMO, not exposing a system for diagnostics that behaves like all our other data source APIs is the root problem here. |
That's not actually the design. The design requirement of the tagger architecture is only that after receiving an event, it should be possible to ask for the data necessary to produce all the tags. The tagger infrastructure then takes it from there. Given the above, it can always achieve a fixed point as events come in. It can also efficiently process, diff, and cache the new data against the last successful stored snapshot of data it processed. This works well in Roslyn as we've designed so many of our systems to be usable in such a manner, namely "hey, something changed! (possibly with the changed data)" and "hey, give me the data you have so I can decide what to do". This pattern works extremely well and has provided virtually bug-free subsystems (at least with respect to eventing/asynchrony/concurrency/data-management) for years. The only outstanding bug (which is really a feature request) we have in this area is to expose a little more flexibility around the timings the tagger system uses when doing this processing. |
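The "event, then pull, then diff against the cached snapshot" loop described above can be sketched as follows (a hypothetical Python illustration; `get_current_tags` and `SnapshotTagger` stand in for whatever data source and tagger types are actually involved):

```python
def compute_tag_delta(old_tags, new_tags):
    # Diff the freshly pulled data against the last cached snapshot.
    old, new = set(old_tags), set(new_tags)
    return sorted(new - old), sorted(old - new)   # (added, removed)

class SnapshotTagger:
    def __init__(self, get_current_tags):
        self._get_current_tags = get_current_tags
        self._cached = []   # last successfully processed snapshot

    def on_changed(self):
        # Any "something changed" event triggers a full re-pull; diffing
        # against the cache means the tagger always converges to a fixed
        # point no matter how events interleave.
        latest = self._get_current_tags()
        added, removed = compute_tag_delta(self._cached, latest)
        self._cached = latest
        return added, removed

data = ["squiggle@3"]
tagger = SnapshotTagger(lambda: list(data))
first = tagger.on_changed()    # initial pull
data[:] = ["squiggle@7"]       # the data source changes
second = tagger.on_changed()   # event arrives; tagger re-pulls and diffs
```

Because each event only triggers "re-pull and diff", a lost or duplicated event never leaves stale tags behind; the next event converges to the correct state.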
|
retest windows_release_vs-integration_prtest please |
|
@dotnet/roslyn-analysis @dotnet/roslyn-ide can someone take a look? |
|
@CyrusNajmabadi well, we could have written it as its own tagger rather than implementing it on top of the existing tagger system we built. "we decided to expose diagnostics from Roslyn in a way that is different from virtually every other data source in the system." And why is that a problem? It is a problem because it is different and we try to jam it into a complex system that is not designed for it. But if it is different and we had built its own system for it, why would that be an issue? The question of why it is different in the first place comes down to necessity.
We had different requirements, so it works differently. Not sure why that makes it wrong. Like I said, we could just have made it its own tagger not using the existing system, like the rename tagger or the syntax colorizer tagger and all those other ones that don't fit into the existing tagger system. |
| // workspace change events are synchronous events, and this doesn't use the workspace events queue since | ||
| // WorkspaceRegistration is not associated with 1 specific workspace. the workspace events queue is specific to 1 workspace; | ||
| // using multiple queues that belong to multiple workspaces would cause events to be out of order | ||
| registration.RaiseEvents(oldWorkspace, newWorkspace); |
I have 2 choices here: creating WorkspaceRegistration's own event queue to make these async events, or making these just synchronous events like I did here.
The existing code was an issue not only because one of them was async and the other was not, but because it used the event queue of 1 workspace, which can't guarantee the event ordering of WorkspaceRegistration.WorkspaceChanged events.
|
@dotnet/roslyn-infrastructure what would be the way to investigate this issue? https://ci.dot.net/job/dotnet_roslyn/job/master/job/windows_release_vs-integration_prtest/8712/testReport/junit/Roslyn.VisualStudio.IntegrationTests.Workspace/WorkspacesNetCore/RenamingOpenFiles/ looks like it's due to how I changed the WorkspaceRegistration.WorkspaceChanged event. |
|
@dotnet/roslyn-infrastructure ping? any help on how to run integration test locally? |
|
I think someone else (@jasonmalinowski and @sharwell) should take a look. I have not been able to convince myself of the correctness of this approach (even after pulling down the code to look at locally). I think this needs a couple (or more) people sitting together to walk through all the data sharing, asynchrony, and BG computation to try to document what's going on, and to get confidence that this code is handling it properly. i.e. similar to how Jason and I have handled changes to AsyncLazy in the past. I would do this, but clearly working SxS isn't something feasible right now :)
Note: because of the enormous complexity here, I would reiterate that my preferred approach would be to extract all that complexity into a separate component (it can live at the tagging layer if that's what you want). That component would then be much easier to review and understand. The tagger could then integrate with that component separately and would then manage anything UI related. This separation between the aggregation+filtering of events and the processing and UI coordination would make things much easier to understand and maintain IMO. |
|
📝 (Note to self) Essentially all of the pending work when this bug occurs comes from the same place: The containing type of that method is generic. Three different instantiations each contribute:
|
If that's the case, then addressing this with a simpler solution is definitely possible. We just need to take the approach where we see if there is a pending foreground notification already in flight, and not bother to make a new one if so. And instead of passing the argument along in the lambda, it just gets buffered into a list of work to do. When the foreground notification actually activates, it pulls off all the work, marks that any incoming work should cause a new foreground notification, and then processes it all. This way, there is only one outstanding pending foreground request at a time. Or, in this case, 3 (since there are three types). This sort of pattern is simple to do, and we've done it in many different places. I'd be happy to contribute such a fix later today or tomorrow. |
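The coalescing pattern described above, at most one pending foreground notification with incoming work buffered until it fires, can be sketched like this (a hypothetical Python illustration; `ForegroundWorkCoalescer` is not a real Roslyn type, and `schedule_notification` stands in for posting a callback to the UI thread):

```python
import threading

class ForegroundWorkCoalescer:
    """Hypothetical sketch: keep at most one pending foreground notification."""

    def __init__(self, schedule_notification, process_item):
        self._schedule = schedule_notification   # posts a callback to the "UI thread"
        self._process = process_item
        self._lock = threading.Lock()
        self._buffered = []                      # work waiting for the notification
        self._notification_in_flight = False

    def enqueue(self, item):
        with self._lock:
            self._buffered.append(item)
            if self._notification_in_flight:
                return                           # the existing notification will pick this up
            self._notification_in_flight = True
        self._schedule(self._drain)              # first item: schedule exactly one callback

    def _drain(self):
        # Runs on the "foreground": grab the whole batch, and mark that any
        # later work must schedule a fresh notification.
        with self._lock:
            batch, self._buffered = self._buffered, []
            self._notification_in_flight = False
        for item in batch:
            self._process(item)

# Usage with a deferred scheduler: many enqueues collapse into one notification.
scheduled, processed = [], []
coalescer = ForegroundWorkCoalescer(scheduled.append, processed.append)
for i in range(5):
    coalescer.enqueue(i)
scheduled[0]()   # the single pending notification fires and drains all 5 items
```

The key property is that the number of outstanding foreground requests is bounded by the number of coalescer instances, not by the number of work items.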
| } | ||
| } | ||
|
|
||
| public IDisposable Subscribe(Workspace workspace, DocumentId documentId, Action<DiagnosticsUpdatedArgs> action) |
|
|
||
| _taskChain = task; | ||
| // make sure we belong to same workspace | ||
| Contract.ThrowIfFalse(document == null || _registration.Workspace == document.Project.Solution.Workspace); |
My first reaction was "that could never happen" and this could be deleted. But then again I guess that's why to leave it... :-)
|
closing in favor of #23448 |
Customer scenario
Customer is working in VS, doing code fixes and watching previews, applying Fix All, etc., and suddenly VS crashes due to an OOM.
Bugs this fixes:
https://devdiv.visualstudio.com/DevDiv/_workitems/edit/512757
Workarounds, if any
there is no easy workaround.
Risk
I don't see a risk of crash, but since it is a highly visible area (the tagger), we probably want more dogfooding time for this.
Performance impact
it moves more work to the background thread and reduces the need for the UI thread, in turn improving responsiveness and memory usage.
Is this a regression from a previous update?
Yes.
Root cause analysis:
this one is hard to pinpoint to 1 issue; it was caused by several different fixes.
first, we made all state management happen on the UI thread to remove any potential race, and that caused us to use more FG than before.
second, we merged the preview-only tagger with the regular diagnostic tagger, making the preview tagger as expensive as the normal diagnostic tagger, meaning it uses the same amount of UI thread requests since it now has the same state tracking as the normal one.
third, along with that, some filtering on the BG got removed which was there to reduce work on the FG, since all of that was changed to be done on the FG.
fourth, the diagnostic tagger didn't pass the cancellation token all the way through, leaving already-cancelled work in the UI work queue.
fifth, the foreground work queue cleaned up cancelled work too lazily and let cancelled work pile up.
this reduces demand for the UI thread, but because the UI thread can be blocked (due to an exclusive operation running, such as Fix All, or a modal dialog box, a Wait dialog box, and more), we can't completely get rid of pending requests for the UI thread. but with this change, we should have much less chance of hitting 2.1 million pending UI requests.
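The fourth and fifth points in the root-cause list amount to checking cancellation both when work is enqueued and when the queue drains, so cancelled work is dropped instead of piling up. A minimal sketch of that idea (Python for illustration; `CancellableWorkQueue` is a hypothetical name, and `threading.Event` stands in for a cancellation token):

```python
import threading

class CancellableWorkQueue:
    """Hypothetical sketch; threading.Event stands in for a cancellation token."""

    def __init__(self):
        self._items = []   # (cancel_event, callback) pairs

    def enqueue(self, cancel_event, callback):
        if cancel_event.is_set():
            return         # already cancelled: never enters the queue at all
        self._items.append((cancel_event, callback))

    def drain(self):
        items, self._items = self._items, []
        for cancel_event, callback in items:
            if cancel_event.is_set():
                continue   # cancelled while queued: dropped eagerly, not run
            callback()

ran = []
q = CancellableWorkQueue()
alive, cancelled = threading.Event(), threading.Event()
q.enqueue(alive, lambda: ran.append("alive"))
q.enqueue(cancelled, lambda: ran.append("cancelled"))
cancelled.set()            # the work is invalidated before the queue drains
q.drain()
```

The point of checking the token in both places is that the queue's size stays proportional to live work, not to the total number of requests ever issued.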
How was the bug found?
MemoryWatson