Avoid retrying work after the queue is cancelled by sharwell · Pull Request #32306 · dotnet/roslyn

sharwell · 2019-01-09T23:21:17Z

Fixes #28062

Ask Mode

Customer scenario

No direct reproducer is known.

Bugs this fixes

#28062

Workarounds, if any

None.

Risk

Very low. This change avoids scheduling work items in three specific queues after global shutdown is requested.

Performance impact

This change may improve test performance, but is unlikely to have an observable impact elsewhere.

Is this a regression from a previous update?

No.

Root cause analysis

If project or document processing was cancelled because the workspace was shut down, items could be rescheduled for execution and never actually execute.

How was the bug found?

Caught by automated testing.

Test documentation updated?

N/A

CyrusNajmabadi · 2019-01-09T23:28:38Z

Change looks good. But it would be useful to know how this fixes the linked issue.

sharwell · 2019-01-09T23:32:11Z

> Change looks good. But it would be useful to know how this fixes the linked issue.

We have a check that runs at the end of every test verifying that all asynchronous operations scheduled in the context of a workspace have completed (successfully or cancelled) before moving from one test to the next. The Workspace.Dispose() operation is supposed to trigger this cleanup sequence.

This test was failing occasionally because the asynchronous operation started here never completed:

roslyn/src/Features/Core/Portable/SolutionCrawler/WorkCoordinator.NormalPriorityProcessor.cs

Line 381 in 4ca8c46

    
           _workItemQueue.AddOrReplace(workItem.Retry(this.Listener.BeginAsyncOperation("ReenqueueWorkItem")));

In the past, this was caused by work processing queues having work added after the queue finished processing events.
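The end-of-test check described above can be pictured with a minimal sketch. `OperationTracker` below is a hypothetical stand-in written for illustration, not Roslyn's actual listener type; it assumes only that `BeginAsyncOperation` returns a token that must be disposed when the operation completes or is cancelled. A work item re-enqueued after shutdown is never dequeued, so its token is never disposed and the pending count never returns to zero.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for an asynchronous operation listener (not Roslyn's real type).
sealed class OperationTracker
{
    private int _pending;

    // Number of operations that were started but whose token was not yet disposed.
    public int PendingCount => Volatile.Read(ref _pending);

    // Returns a token that must be disposed when the operation finishes or is cancelled.
    public IDisposable BeginAsyncOperation(string name)
    {
        Interlocked.Increment(ref _pending);
        return new Token(this);
    }

    private sealed class Token : IDisposable
    {
        private OperationTracker _owner;

        public Token(OperationTracker owner) => _owner = owner;

        public void Dispose()
        {
            // Decrement exactly once, even if Dispose is called repeatedly.
            var owner = Interlocked.Exchange(ref _owner, null);
            if (owner != null)
            {
                Interlocked.Decrement(ref owner._pending);
            }
        }
    }
}

static class Demo
{
    static void Main()
    {
        var tracker = new OperationTracker();

        // An operation that completes disposes its token.
        using (tracker.BeginAsyncOperation("Completed")) { }

        // A work item re-enqueued after shutdown: nothing will ever dequeue it,
        // so nothing ever disposes this token.
        var leaked = tracker.BeginAsyncOperation("ReenqueueWorkItem");

        // An end-of-test check that waits for zero pending operations would hang here.
        Console.WriteLine(tracker.PendingCount); // prints 1
    }
}
```

In this model, the only ways to get back to zero are to run the re-enqueued item or to never create its token in the first place.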

heejaechang · 2019-01-09T23:41:15Z

that CancellationToken is shutdown cancellation token. so no harm checking it before re-enqueue. but there might still be a race.

if we know this work item async token is only token left not disposed, I think it might be good to have explicit Test only API such as DropAllPendingWorkItems which just loop through workitem in the queue and call dispose on the AsyncToken

sharwell · 2019-01-09T23:42:57Z

> that CancellationToken is shutdown cancellation token. so no harm checking it before re-enqueue. but there might still be a race.

👍

There may be a race, but one less is good. 😄

> if we know this work item async token is only token left not disposed, I think it might be good to have explicit Test only API such as DropAllPendingWorkItems which just loop through workitem in the queue and call dispose on the AsyncToken

The purpose of the test is to assert that this is not needed. The async tokens represent work which may execute in the future. The goal is to ensure operations started by test A do not impact the state of future test B, and disposing of the async tokens manually would defeat this isolation.

heejaechang · 2019-01-09T23:46:15Z

but the queue is per test, so I am not sure how it can affect other tests. or is the issue that we moved all tests to share the same MEF export, which causes multiple tests to share the same queue?

then that seems to be the root cause of the bug?

jasonmalinowski

Please move the explanation into the commit message.

sharwell · 2019-01-09T23:50:34Z

@heejaechang The isolation guarantee allows us to state that tests do not impact each other through asynchronous operations continuing after the test ends. We do not need to look at the operations themselves to see if they share state because there are no operations. It's a strong guarantee but only works when applied as a blanket policy. 😄

heejaechang · 2019-01-09T23:51:17Z

but I am still not sure. in solution crawler, each workspace should have its own dedicated queues. it should never share a queue with other workspaces.

each test should have its own isolation. either that got broken at some point, or if it still has isolation, then having a test-only API that drops all pending work items should be fine since the queue is being shut down.

we don't want the drop-all-pending-work-items logic in production code, since we don't want to run any cleanup code on shutdown. that's just unnecessary work that makes shutdown take longer.

heejaechang · 2019-01-09T23:54:15Z

I still don't get it. each test should have its own asynchronous operation queue. if it is shared between multiple tests due to MEF stuff, then that is a bug from the beginning.

if each test has its own asynchronous operation queue, then I don't get why it is a bug when there is a pending operation when the test is done. it is only a bug if that pending async token is related to the test itself.

sharwell · 2019-01-10T00:02:42Z

@heejaechang The check is much wider than the asynchronous operation queues. It is a global assertion (to the degree possible) that work scheduled by a test will not start or continue executing after the end of the test. The benefits include, but are not limited to, ensuring exceptions thrown by asynchronous work are correctly associated in time with the test that originally caused the scheduling of that operation.

Work scheduled after queues are shutdown will never complete, which will hang the cleanup operations that run between unit tests. Fixes dotnet#28062

heejaechang · 2019-01-10T00:23:14Z

alright, I don't want to keep arguing. I looked through the code, and it looks like what you did is enough.

basically the issue is: solution crawler shuts down, sets the shutdown cancellation token, and cancels all running work. then the running work that got cancelled blindly re-enqueues itself to the queue since it didn't finish its work.

in VS, the process is going away on shutdown, so it doesn't matter. but in unit tests, due to your check, someone must either go through those items and explicitly drop them, or make sure not to enqueue work after the solution is shut down. and you did the latter.
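The "make sure not to enqueue" option can be sketched as a guard on the shutdown token before retrying. This is a simplified model under stated assumptions, not the code from this pull request: `Processor`, its string work items, and the `Func<CancellationToken, bool>` callback are hypothetical names invented for the example. Only `CancellationToken.IsCancellationRequested` is a real API.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical, simplified processor; names do not match Roslyn's types.
sealed class Processor
{
    private readonly Queue<string> _queue = new Queue<string>();
    private readonly CancellationToken _shutdownToken;

    public Processor(CancellationToken shutdownToken) => _shutdownToken = shutdownToken;

    public int QueuedCount => _queue.Count;

    // Runs one work item. The callback returns true when it processed everything.
    public void Process(string workItem, Func<CancellationToken, bool> work)
    {
        bool processedEverything = work(_shutdownToken);

        // Retry unfinished work only while the queue is still alive. After shutdown
        // nothing would dequeue the item, so re-enqueueing would leave it pending forever.
        if (!processedEverything && !_shutdownToken.IsCancellationRequested)
        {
            _queue.Enqueue(workItem);
        }
    }
}

static class Demo
{
    static void Main()
    {
        var shutdown = new CancellationTokenSource();
        var processor = new Processor(shutdown.Token);

        // Unfinished while the queue is running: the item is retried.
        processor.Process("doc1", _ => false);
        Console.WriteLine(processor.QueuedCount); // prints 1

        // Unfinished because shutdown was requested: the item is dropped.
        shutdown.Cancel();
        processor.Process("doc2", token => !token.IsCancellationRequested);
        Console.WriteLine(processor.QueuedCount); // prints 1
    }
}
```

As noted earlier in the thread, a check like this narrows the window but may not remove every race, since shutdown can still be requested between the check and the enqueue.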

sharwell · 2019-01-10T18:30:59Z

@jinujoseph for approval

jinujoseph · 2019-01-11T09:00:16Z

Does this need to be in preview2?
Should we merge this into master instead?

sharwell · 2019-01-11T12:44:34Z

I am waiting to hear back from @jaredpar on whether the current flakiness level means this should be fixed in preview 2.

jaredpar · 2019-01-11T21:26:08Z

> whether the current flakiness level means this should be fixed in preview 2.

I don't think the flakiness level should change where this bug is fixed. The test can be disabled / enabled independently of this fix. I'm fine with master here.

sharwell · 2019-01-11T23:13:02Z

@jinujoseph This is now targeting master

Avoid retrying work after the queue is cancelled

sharwell requested a review from a team as a code owner January 9, 2019 23:21

mavasani approved these changes Jan 9, 2019

sharwell mentioned this pull request Jan 9, 2019

Microsoft.CodeAnalysis.Editor.UnitTests.Diagnostics.DiagnosticsSquiggleTaggerProviderTests.Test_TagSourceDiffer is flaky #28062

Closed

mavasani requested a review from heejaechang January 9, 2019 23:23

CyrusNajmabadi reviewed Jan 9, 2019

Comment thread src/Features/Core/Portable/SolutionCrawler/WorkCoordinator.NormalPriorityProcessor.cs Outdated

heejaechang approved these changes Jan 9, 2019

sharwell force-pushed the cancel-retry branch from 99218ca to b164d9c January 9, 2019 23:40

jasonmalinowski requested changes Jan 9, 2019

Avoid retrying work after the queue is cancelled

b8caab1

Work scheduled after queues are shutdown will never complete, which will hang the cleanup operations that run between unit tests. Fixes dotnet#28062

sharwell force-pushed the cancel-retry branch from b164d9c to b8caab1 January 10, 2019 00:12

jasonmalinowski approved these changes Jan 10, 2019

vatsalyaagrawal added the Area-IDE label Jan 10, 2019

vatsalyaagrawal added this to the 16.0.P2 milestone Jan 10, 2019

mavasani mentioned this pull request Jan 10, 2019

Skip flaky test Test_TagSourceDiffer #32294

Closed

sharwell changed the base branch from dev16.0-preview2 to master January 11, 2019 23:12

jinujoseph added the Approved to merge label Jan 12, 2019

sharwell modified the milestones: 16.0.P2, 16.0.P3 Jan 14, 2019

sharwell merged commit 2e5772e into dotnet:master Jan 14, 2019

sharwell deleted the cancel-retry branch January 14, 2019 15:15

xoofx pushed a commit to stark-lang/stark-roslyn that referenced this pull request Apr 16, 2019

Merge pull request dotnet#32306 from sharwell/cancel-retry

bf615ae

Avoid retrying work after the queue is cancelled

8 participants
