Skip to content

work around slice_test compiler bug (for windows bazel opt build)#20555

Merged
jtattermusch merged 4 commits intogrpc:masterfrom
jtattermusch:workaround_slice_test_compiler_bug
Oct 16, 2019
Merged

work around slice_test compiler bug (for windows bazel opt build)#20555
jtattermusch merged 4 commits intogrpc:masterfrom
jtattermusch:workaround_slice_test_compiler_bug

Conversation

@jtattermusch
Copy link
Copy Markdown
Contributor

Fixes #20519.

The fix may look a bit random but I actually spent a fair amount of time debugging the problem
and I'm now pretty sure that the failure is due to a compiler bug.

See https://github.com/grpc/grpc/pull/20554/files that demonstrates and explains the problem.

grpc_slice src2 = grpc_slice_from_copied_string("hello123456789123456789");

// Do not remove the log line. It actually supresses a compiler bug in windows
// bazel opt build. See https://github.com/grpc/grpc/issues/20519
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to loudly mark the test as "skipped", rather than using this workaround? Very interesting finds though.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could just skip the test and file an issue to followup. I'm still trying to find the actual MSVC bug listing to make 100% sure this is actually a compiler bug, but it's very time-consuming as I don't have much to go off of.

So far:

  • I've been able to reproduce under bazel with VS2015, VS2017 and VS2019
  • when I build with cmake (Release build), I'm unable to reproduce - which is pretty strange, but I haven't yet compared the exact compiler flags being used (but I checked they both run with /O2)

@jtattermusch
Copy link
Copy Markdown
Contributor Author

@apolcyn ok I think I start to understand what's going on here.

  1. the slices src1 and src2 are actually inline because their length is 23 bytes (I was assuming they are actually refcounted before, which would make situation much wierder).
  2. I THINK MSVC is overly clever and realizes we are actually initializing both src1 and src2 to the exactly same value (all fields of the struct are identical). Because they are inlined and thus values types, the compiler actually gets rid of one of the copies (src2).
  3. When we look at GRPC_SLICE_START_PTR(src1), we are basically getting the &src1 + some constant offset, but the compiler doesn't realize it's basically the same as if we were looking comparing &src1 and &src2 directly. Because src1 and src2 were reduced to just one local variable, the code thinks they are actually the same instance and assertion fails.
  4. if we actually used the address of src1 or src2 somewhere in the code (we don't because everything is just passed by value), the compiler realizes that src1 and src2 cannot be reduced into just one variable and the tests suddenly starts passing.

The above explains the situation and is consistent with all my experiments.

What's unclear is whether relying on src1 and src2 being different variables (and thus assuming they have a different address) is legitimate C/C++ or not - we could ask C++ experts. Depending on the answer a) either our test is wrong b) there's a compiler bug (including VS2019) where the optimizer fails to hide that we it has reduced two local variables into one and we can observe it through our code.

@jtattermusch
Copy link
Copy Markdown
Contributor Author

I also figured out why I wasn't able to reproduce with cmake build before:

  • the default build with cmake on windows is 32-bit, which makes the max inlined size smaller, so in 32-bit slice_test both src1 and src2 are refcounted, which makes the problem disappear.
  • we do run 64bit cmake windows tests in the portability windows suite on master, but we only run it in the Debug configuration (to save time) and the problem only appears with Release build (with the optimizer "cleverness" turned on).

I was able to reproduce with cmake 64bit Release build on windows locally.

@jtattermusch
Copy link
Copy Markdown
Contributor Author

I updated the PR with what I think is a better workaround. I think I now have a good understanding of the situation so I feel it should be safe to merge and unbreak the windows opt test finally.

// distinct (even on windows opt 64bit build).
// See https://github.com/grpc/grpc/issues/20519
GPR_ASSERT(src1.refcount);
GPR_ASSERT(GRPC_SLICE_START_PTR(src1) != GRPC_SLICE_START_PTR(src2));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively we could add (&src1 != &src2) condition to the assert which also unbreaks the test.

Copy link
Copy Markdown
Contributor

@apolcyn apolcyn Oct 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I like this idea of adding GPR_ASSERT(&src1 != &src2), I think it makes the theory explained here more explicit.

@hcaseyal hcaseyal requested a review from dklempner October 15, 2019 15:36
@vjpai vjpai requested a review from arjunroy October 15, 2019 19:14
@vjpai
Copy link
Copy Markdown
Contributor

vjpai commented Oct 15, 2019

@arjunroy needs to see this imo because of recent slice changes

@dklempner
Copy link
Copy Markdown
Contributor

I will say that I don't see anything obviously wrong here, though I may be missing something.

It is certainly valid to assume that two different objects have different addresses, and it is valid to assume objects within each of two distinct objects have different addresses.

If you never ask about these addresses the compiler is free to collapse the two objects into one, under "as-if" rules.

This whole thing smells like a bug along exactly those lines -- the compiler is failing to realize it can't do that optimization because we're taking the address of a sub-object.

Copy link
Copy Markdown
Contributor

@apolcyn apolcyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM to fix the test. Though as mentioned this could still probably use more eyes

@jtattermusch
Copy link
Copy Markdown
Contributor Author

Merging to fix the test. I still plan to follow up and I was actually able to come up with an isolated and simplified repro
https://github.com/jtattermusch/msvc-compiler-bug-repro1

@jtattermusch
Copy link
Copy Markdown
Contributor Author

Known failures: #18404

@jtattermusch jtattermusch merged commit b0c87fd into grpc:master Oct 16, 2019
@jtattermusch
Copy link
Copy Markdown
Contributor Author

Link to compiler issue I filed: https://developercommunity.visualstudio.com/content/problem/779761/optimizer-bug-in-c-leads-to-incorrect-behavior-at.html

@lock lock bot locked as resolved and limited conversation to collaborators Jan 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

lang/c++ release notes: no Indicates if PR should not be in release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

FLAKE: //test/core/slice/slice_test.exe is flaky on windows RBE opt

4 participants