Skip to content

Avoid retrying work after the queue is cancelled#32306

Merged
sharwell merged 1 commit intodotnet:masterfrom
sharwell:cancel-retry
Jan 14, 2019
Merged

Avoid retrying work after the queue is cancelled#32306
sharwell merged 1 commit intodotnet:masterfrom
sharwell:cancel-retry

Conversation

@sharwell
Copy link
Copy Markdown
Contributor

@sharwell sharwell commented Jan 9, 2019

Fixes #28062

Ask Mode

Customer scenario

No direct reproducer is known.

Bugs this fixes

#28062

Workarounds, if any

None.

Risk

Very low. This change avoids scheduling work items in three specific queues after global shutdown is requested.

Performance impact

This change may improve test performance, but is unlikely to have an observable impact elsewhere.

Is this a regression from a previous update?

No.

Root cause analysis

If project or document processing was cancelled because the workspace was shut down, items could be rescheduled for execution and never actually execute.

How was the bug found?

Caught by automated testing.

Test documentation updated?

N/A

@sharwell sharwell requested a review from a team as a code owner January 9, 2019 23:21
@CyrusNajmabadi
Copy link
Copy Markdown
Contributor

Change looks good. But it would be useful to know how this fixes the linked issue.

@sharwell
Copy link
Copy Markdown
Contributor Author

sharwell commented Jan 9, 2019

Change looks good. But it would be useful to know how this fixes the linked issue.

We have a check that runs at the end of every test verifying that all asynchronous operations scheduled in the context of a workspace have completed (successfully or cancelled) before moving from one test to the next. The Workspace.Dispose() operation is supposed to trigger this cleanup sequence.

This test was failing occasionally because the asynchronous operation started here never completed:

_workItemQueue.AddOrReplace(workItem.Retry(this.Listener.BeginAsyncOperation("ReenqueueWorkItem")));

In the past, this was caused by work processing queues having work added after the queue finished processing events.

@heejaechang
Copy link
Copy Markdown
Contributor

that CancellationToken is shutdown cancellation token. so no harm checking it before re-enqueue. but there might still be a race.

if we know this work item async token is only token left not disposed, I think it might be good to have explicit Test only API such as DropAllPendingWorkItems which just loop through workitem in the queue and call dispose on the AsyncToken

@sharwell
Copy link
Copy Markdown
Contributor Author

sharwell commented Jan 9, 2019

that CancellationToken is shutdown cancellation token. so no harm checking it before re-enqueue. but there might still be a race.

👍

There may be a race, but one less is good. 😄

if we know this work item async token is only token left not disposed, I think it might be good to have explicit Test only API such as DropAllPendingWorkItems which just loop through workitem in the queue and call dispose on the AsyncToken

The purpose of the test is to assert that this is not needed. The async tokens represent work which may execute in the future. The goal is to ensure operations started by test A do not impact the state of future test B, and disposing of the async tokens manually would defeat this isolation.

@heejaechang
Copy link
Copy Markdown
Contributor

heejaechang commented Jan 9, 2019

but the queue is per test. so I am not sure how it can affect other test? or the issue is due to we moved all tests to share same MEF export which causing multiple tests to share same queue?

then, that seems the root cause of the bug?

Copy link
Copy Markdown
Member

@jasonmalinowski jasonmalinowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move the explanation into the commit message.

@sharwell
Copy link
Copy Markdown
Contributor Author

sharwell commented Jan 9, 2019

@heejaechang The isolation guarantee allows us to state that tests do not impact each other through asynchronous operations continuing after the test ends. We do not need to look at the operations themselves to see if they share state because there are no operations. It's a strong guarantee but only works when applied as a blanket policy. 😄

@heejaechang
Copy link
Copy Markdown
Contributor

but I am still not sure. since solution crawler should have each workspace has its dedicated queues. it should never share queue with other workspaces.

each test should have its own isolation. either that got broken at some point, or if it still has isolation, then having test only drop all pending workitem should be fine since the queue is being shutdown.

we don't want drop all pending workitem thing in production code, since we don't want to run any clean up code on shutdown. that's just unnecessary work causing shutdown to take longer.

@heejaechang
Copy link
Copy Markdown
Contributor

I still don't get. each test should have its own Asynchronous operation queue. if it is shared between multiple tests due to MEF stuff, then that is a bug from the beginning.

if its test has its own asynchronous operation queue, then I don't get why it is a bug when there is pending operation when test are done. it is only a bug if that pending async token is related to the test itself.

@sharwell
Copy link
Copy Markdown
Contributor Author

sharwell commented Jan 10, 2019

@heejaechang The check is much wider than the asynchronous operation queues. It is a global assertion (to the degree possible) that work scheduled by a test will not start or continue executing after the end of the test. The benefits include, but are not limited, ensuring exceptions thrown by asynchronous work are correctly associated in time with the test that originally caused the scheduling of that operation.

Work scheduled after queues are shutdown will never complete, which will
hang the cleanup operations that run between unit tests.

Fixes dotnet#28062
@heejaechang
Copy link
Copy Markdown
Contributor

alright, don't want to keep argue. I looked through the code and looks like what you did is enough.

basically issue is, solution crawler has shutdown and it set shutdown cancellation token and cancel all running work. and then those running work got cancelled blindly re-enqueue itself to the queue since it didn't finish its work.

in VS, since process is going away when it got shutdown so doesn't matter. but in the unit test, due to the your check, someone must either go through those items and explicitly drop or make sure not to enqueue the work after solution is shutdown. and you did the later.

@vatsalyaagrawal vatsalyaagrawal added this to the 16.0.P2 milestone Jan 10, 2019
@sharwell
Copy link
Copy Markdown
Contributor Author

@jinujoseph for approval

@jinujoseph
Copy link
Copy Markdown
Contributor

jinujoseph commented Jan 11, 2019

Does this needs to be in preview2 ?
Should we merge this in master instead

@sharwell
Copy link
Copy Markdown
Contributor Author

I am waiting to hear back from @jaredpar on whether the current flakiness level means this should be fixed in preview 2.

@jaredpar
Copy link
Copy Markdown
Member

jaredpar commented Jan 11, 2019

whether the current flakiness level means this should be fixed in preview 2.

I don't think the flakiness level should change where this bug is fixed. The test can be disabled / enabled independently of this fix. I'm fine with master here.

@sharwell sharwell changed the base branch from dev16.0-preview2 to master January 11, 2019 23:12
@sharwell
Copy link
Copy Markdown
Contributor Author

@jinujoseph This is now targeting master

@sharwell sharwell modified the milestones: 16.0.P2, 16.0.P3 Jan 14, 2019
@sharwell sharwell merged commit 2e5772e into dotnet:master Jan 14, 2019
@sharwell sharwell deleted the cancel-retry branch January 14, 2019 15:15
xoofx pushed a commit to stark-lang/stark-roslyn that referenced this pull request Apr 16, 2019
Avoid retrying work after the queue is cancelled
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Microsoft.CodeAnalysis.Editor.UnitTests.Diagnostics.DiagnosticsSquiggleTaggerProviderTests.Test_TagSourceDiffer is flaky

8 participants