Skip to content

fixed flaky test.#25949

Merged
heejaechang merged 6 commits intodotnet:masterfrom
heejaechang:flakytest
Apr 12, 2018
Merged

fixed flaky test.#25949
heejaechang merged 6 commits intodotnet:masterfrom
heejaechang:flakytest

Conversation

@heejaechang
Copy link
Copy Markdown
Contributor

the flaky-ness came from the fact that solution crawler has multiple async event queue to delay processing work (such as semantic prapagation event queue).

having multiple event queue is fine if test just wants to see all works queued have been processed since we can use waiter to wait for everything to finish.

but it is problem if we want to see order of work that got processed since event queue such as semantic prapagation queue can change order of things processed due to cancellation. (work A is being processed and semantic propagation want to cancel work A and mark it as semantic changed and re-enqueue the work)

so we want to wait for semantic event queue to finish processing but current big hammer approach of waiter (wait for everything in solution crawler to finish) doesn't let one to do some part of it to be waited.

added ability to the waiter to wait for parts of work to be done.

and now test waits for semantic event queue to finish before running analyzers.

...

Customer scenario

This is test only fix. doesn't affect product.

Bugs this fixes

N/A

Workarounds, if any

N/A

Risk

N/A

Performance impact

N/A

Is this a regression from a previous update?

No

Root cause analysis

fixed flaky-ness in the test

How was the bug found?

Jenkins

@heejaechang heejaechang requested a review from a team as a code owner April 5, 2018 06:57
@heejaechang heejaechang force-pushed the flakytest branch 2 times, most recently from d0805ff to 46206f3 Compare April 5, 2018 07:19
@heejaechang
Copy link
Copy Markdown
Contributor Author

nuget failed.

@heejaechang
Copy link
Copy Markdown
Contributor Author

retest this please

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ Why not move GlobalOperationNotificationServiceFactory to Workspaces where the interface is defined?

❗️ If this is included, it needs [PartNotDiscoverable] as well (attribute indicates that it's only included via an explicit WithPart).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean don't have default service but all uses ServiceLayer.Host? feels like we are removing all nice layer separation we used to have..

by the way, what is wrong is the way it is doing now? I am not testing global operation service in my test.

...

not sure what [PartNotDiscoverable] is. if it is something I need to use, let me know where I need to use.

Copy link
Copy Markdown
Contributor

@sharwell sharwell Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean don't have default service but all uses ServiceLayer.Host? feels like we are removing all nice layer separation we used to have..

The Workspaces project currently includes an implementation of IGlobalOperationNotificationService which fails to adhere to the pre- and post-conditions of the methods it contains. The implementation in ServicesVisualStudio has no dependencies outside of Workspaces; moving it down to be the new default means two things:

  1. The default implementation is no longer broken
  2. The implementation in test code is redundant, per (1)

not sure what [PartNotDiscoverable] is. if it is something I need to use, let me know where I need to use.

It's an attribute applied to the type, like [Shared] is. It prevents the export from being discovered when probing the assembly, and instead requires that it be added to a part catalog explicitly. Most test-only exports will fall into this category.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default implementation intentionally doesn't do anything. we have bunch of services which default implementation is don't do anything. since we don't need such behavior in console and etc.

...

sure, I will add PartNotDiscoverable.

Copy link
Copy Markdown
Contributor

@sharwell sharwell Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I submitted a separate pull request #25960 to simplify this. The original limitation that led to the separation of these types no longer applies, but it can still be a separate refactoring.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 This code formatting suggests that assemblies and typeof(TestGlobalOperationService) are two parameters to the same method. This is not the case here; a better layout would be this:

return ExportProviderCache
  .GetOrCreateExportProviderFactory(assemblies, ExportProviderCache.CreateResolver())
  .WithPart(typeof(TestGlobalOperationService));

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure 100% I understand. but sure, let me change it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetOrCreateExportProviderFactory accepts ComposableCatalog and GetOrCreateAssemblyCatalog is the one that creates one. not sure how I do this.

return ExportProviderCache
.GetOrCreateExportProviderFactory(assemblies, ExportProviderCache.CreateResolver())
.WithPart(typeof(TestGlobalOperationService));

can you give me some more detail?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I accidentally left out the call to GetOrCreateAssemblyCatalog. It should have been this:

return ExportProviderCache
  .GetOrCreateExportProviderFactory(
    ExportProviderCache.GetOrCreateAssemblyCatalog(assemblies, ExportProviderCache.CreateResolver()))
  .WithPart(typeof(TestGlobalOperationService));

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, you want to change format of the code?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 The semantics of this CreateWaitTask overload are not obvious when reading the code. A better approach (which covers what we need here but may be expanded further in the future) would be the following:

/// <summary>
/// Asynchronously waits for all operations matching the specified predicate to complete.
/// </summary>
public async Task CreateWaitTask(Predicate<DiagnosticAsyncToken> selector)

💭 Testing for token.Name seems likely to break in the future. It is generally understood that changing the name of a listener is a sensitive operation, but changing one of the token properties is not going to cause problems. This is especially problematic because tests may or may not reveal the problem if the name is changed. If you are going to keep this approach, consider defining a static class PredefinedAsynchronousOperationNames which has a documented field EnqueueWorkItemAsync, and reference this field in all locations where it has special meaning in tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure changing Func<...,bool> to predicate is any better. not sure we ever used predicate instead of func<..,bool>. but selector seems better name and doc comment.

about Token.Name. I think it is fine as it is. test already uses internal implementation to test the code. and creating predefined... just for this test is overkill.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure changing Func<...,bool> to predicate is any better. not sure we ever used predicate instead of func<..,bool>.

This is unrelated to the intent of my comment. The confusion is caused by the use of Any and IsEmpty. Rather than wait for this condition to be true or false, CreateWaitTask should wait for one or more operations to complete. The set of operations it waits for would be identified by a predicate.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I am following. so you want me to change name to "WaitUntilConditionIsMet"?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I'm looking to make the code look like this instead:

// first, wait for first work to be queued.
await crawlerListener.CreateWaitTask(token => token.Name == "EnqueueWorkItemAsync");

// and then wait them to be processed
await crawlerListener.CreateWaitTask(token => token.Tag == workspace);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and creating predefined... just for this test is overkill.

The alternative is creating a comment everywhere this name is used. With the separate constants class, you can write the comment in one location knowing that it's out of the way and won't get out of sync.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add comment. but I like the one I have better. more flexible. and I dont need one that you suggested above.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ How does this let analyzers proceed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

solution crawler will pause processing analyzer if there is pending global operation.

@heejaechang
Copy link
Copy Markdown
Contributor Author

ping? can I get code review? @sharwell anything else you want me to change?

@heejaechang
Copy link
Copy Markdown
Contributor Author

@jinujoseph @Pilchie this is change to fix flakytest, does it still need to go through escrew?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all change here is just clean up. no actual functional change.

@Pilchie
Copy link
Copy Markdown
Member

Pilchie commented Apr 6, 2018

this is change to fix flakytest, does it still need to go through escrew?

If it affects shipping product binaries, then yes. If it only affects test code then no.

@heejaechang
Copy link
Copy Markdown
Contributor Author

@Pilchie I see. so I guess I better revert change that clean up product code then?

@Pilchie
Copy link
Copy Markdown
Member

Pilchie commented Apr 6, 2018

@heejaechang we should be making the safest possible changes to avoid any risk of regression.

@heejaechang
Copy link
Copy Markdown
Contributor Author

heejaechang commented Apr 6, 2018

I think changes in this PR are really safe ones.

@sharwell
Copy link
Copy Markdown
Contributor

sharwell commented Apr 7, 2018

@jasonmalinowski I'm uncomfortable with the maintainability risk of magic string literals in this pull request. Prior to this change, changes to the name passed to asynchronous operation listeners to create a token could be changed freely without breaking tests. After this change, some of the strings are dynamically bound to the tests such that it will silently remove a synchronization guard from a test, leading to race condition flakiness that cannot be easily detected in advance or through git bisect. The suggestion to add a class just to manage/document these names was deemed too complex for the amount of value it adds to the code base. I'm requesting your input on the matter as a tie-breaker.

Your input on the CreateWaitTask/WaitUntilConditionIsMet discussion, especially #25949 (comment), would be appreciated as well.

@heejaechang If the tie-breaker vote favors the existing code I will dismiss my review.

@sharwell
Copy link
Copy Markdown
Contributor

sharwell commented Apr 7, 2018

@Pilchie This test is a regression test for a performance improvement made in 15.7. It is currently disabled, and cannot be tested reliably without some changes to production code. We can either leave the change untested in 15.7 and merge it into 15.8+, or we can look to merge a safe/minimal change and have it tested starting in the branch where the fix was originally made.

Note that the PR is not yet ready, we're already in escrow, and the fix was separately verified already, so this may already be decided.

@Pilchie
Copy link
Copy Markdown
Member

Pilchie commented Apr 9, 2018

@heejaechang @sharwell regarding this:

We can either leave the change untested in 15.7 and merge it into 15.8+, or we can look to merge a safe/minimal change and have it tested starting in the branch where the fix was originally made.

I think it comes down to what we think our odds of regressing are during the remaining escrow changes for 15.7. If this is a subtle area that we think is likely to regress, we should take the product change to enable testing. If we think the odds of a regression are low, and we can watch for them in code reviews, then we can just take the change for 15.8. I don't have enough context to say which is the case here though.

@heejaechang
Copy link
Copy Markdown
Contributor Author

@Pilchie if we are okay the test time to time failing. I am fine going into master.

@heejaechang
Copy link
Copy Markdown
Contributor Author

@sharwell most of test here has assumption on how things working. even waiter await require things to happen in specific order. so not sure why depends on implementation detail on test is suddenly an issue. this test is supposed to test whether implementation is working in order. so it is waiting tag or name specified in the listener. if that is an issue, why the listener even has name or tag property?

@sharwell
Copy link
Copy Markdown
Contributor

sharwell commented Apr 9, 2018

@Pilchie if we are okay the test time to time failing. I am fine going into master.

Right now the test is disabled in all branches (it had a failure rate of over 65%): #25956

@sharwell
Copy link
Copy Markdown
Contributor

sharwell commented Apr 9, 2018

why the listener even has name or tag property?

I've been using it to track down work that didn't complete immediately (cancelled) when the workspace is disposed, which included at least one case of changing the name to improve visibility in this area. #25896

@heejaechang
Copy link
Copy Markdown
Contributor Author

@sharwell ah, didn't know that you disabled the test. I made it disabled in other PR and I thought you told me don't disable it. so I backed it out.

@heejaechang
Copy link
Copy Markdown
Contributor Author

@sharwell so, using the name is problem? then fine if I use tag?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is subtle. Why can't we use constants here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:) alright. I will add constant.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unfamiliar with this: once the condition goes to true, it won't go to false until we release the global operation? i.e. there isn't a race where where it can start and stop right away?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will not go away until global operation is released. basically global operation is blocking the task from being processed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should really be one of those "why?" comments, not "what?" comments. 😄

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. I can add why we are doing this.

@heejaechang heejaechang changed the base branch from dev15.7.x to master April 10, 2018 18:42
…async event queue to delay processing work (such as semantic prapagation event queue).

having multiple event queue is fine if test just wants to see all works queued have been processed since we can use waiter to wait for everything to finish.

but it is problem if we want to see order of work that got processed since event queue such as semantic prapagation queue can change order of things processed due to cancellation. (work A is being processed and semantic propagation want to cancel work A and mark it as semantic changed and re-enqueue the work)

so we want to wait for semantic event queue to finish processing but current big hammer approach of waiter (wait for everything in solution crawler to finish) doesn't let one to do some part of it to be waited.

added ability to the waiter to wait for parts of work to be done.

and now test waits for semantic event queue to finish before running analyzers.
Copy link
Copy Markdown
Contributor

@sharwell sharwell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the indicated items are corrected, 👍

#endregion
}

[ExportWorkspaceService(typeof(IGlobalOperationNotificationService), SolutionCrawler), Shared, PartNotDiscoverable]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❕ This can be removed now. The default implementation of IGlobalOperationNotificationService in the master branch already does this.


// we have a hint. check whether we can take advantage of it
if (await TryEnqueueFromHint(data.Document, data.ChangedMember).ConfigureAwait(continueOnCapturedContext: false))
try
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡

using (data.AsyncToken)
{
  ...
}

var data = Dequeue();
using (data.AsyncToken)

try
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡

using (data.AsyncToken)
{
  ...
}

{
internal sealed partial class SolutionCrawlerRegistrationService
{
public const string EnqueueItem = nameof(EnqueueItem);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❕ Needs a comment explaining what it's used for (specifically which actions use it as the tag)

@heejaechang heejaechang merged commit 99b937b into dotnet:master Apr 12, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants