util/must: add runtime assertion API#106508
Conversation
tbg
left a comment
There was a problem hiding this comment.
Just looked at the high level design and UX and everything checks out as far as I'm concerned. I would be happy to see this reviewed by TE (or if TE are too swamped I can review on their behalf, just let me know) and get merged soon.
pkg/kv/kvserver/replica_raft.go
Outdated
| if ctx.Done() != nil { | ||
| return handleRaftReadyStats{}, errors.AssertionFailedf( | ||
| "handleRaftReadyRaftMuLocked cannot be called with a cancellable context") | ||
| if err := must.Zero(ctx, ctx.Done(), |
There was a problem hiding this comment.
Does must.Nil work here, to match the previous ctx.Done() != nil check?
There was a problem hiding this comment.
Unfortunately no. must.Nil only works with pointers, since it's generic over *T. I haven't found a way to also include reference types (slices, maps, channels, funcs, etc) using generics, nor interfaces. We could use reflection, but that would be too expensive here.
However, for reference types and interfaces the zero value is nil, so must.Zero() here is exactly equivalent to == nil.
There was a problem hiding this comment.
Looks like we need a "nilable" constraint in the upstream package :)
There was a problem hiding this comment.
Maybe we could enumerate these things manually, to have a constraint that matches all the nilable things. Haven't researched whether the syntax allows for that.
There was a problem hiding this comment.
I went down that path when implementing Len(), trying to get it to work both with slices, maps, and strings. I sort of got there, but it confused type inference, so we'd have to manually instantiate the generic function at all call sites which gets really annoying.
Might still be possible, but I didn't immediately find a way to do it.
There was a problem hiding this comment.
I do think must.Zero is a surprising developer UX to test nil interface references.
Would it be possible to introduce an alias must.NilInterface or NilReference and recommend its use for that case?
There was a problem hiding this comment.
Yes, but we'd need one for each reference type, so about 6 in total (with interfaces), plus the NotNil variants, so 12. I've been hoping to find a way to finagle generics into doing it instead.
There was a problem hiding this comment.
about 6 in total (with interfaces)
I was thinking about focusing on just interfaces, which is the most common.
There was a problem hiding this comment.
Interfaces are also tricky, because we'd have to handle typed nils -- I'm not sure we can without reflection. But there's always the must.True(ctx, foo == nil) escape hatch too. I agree though, we should try to do something better here.
f302de7 to
bf16d46
Compare
mgartner
left a comment
There was a problem hiding this comment.
Great idea! Overall LGTM - I left some context into types of assertions we use in the optimizer in case that's helpful in steering the API, but I'm not sure if the patterns are widespread, so feel free to ignore.
Reviewed 4 of 4 files at r1, 1 of 1 files at r2, 1 of 1 files at r3.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @pavelkalinnikov, @renatolabs, @smg260, and @tbg)
pkg/util/log/logcrash/crash_reporting.go line 441 at r8 (raw file):
} must.MaybeSendReport = func(ctx context.Context, err error) { maybeSendCrashReport(ctx, err, ReportTypeAssertionFailure)
maybeSendCrashReport always using global settings, which is discouraged when we have access to non-global settings:
cockroach/pkg/util/log/logcrash/crash_reporting.go
Lines 110 to 111 in 2675c7c
Should the package provide versions of each function that have settings arguments?
pkg/util/must/must.go line 44 at r8 (raw file):
// different node (thus stack traces can be insufficient by themselves). // // Some assertions may be too expensive to always run. For example, we may want
We have a set of assertions in the optimizer that run for all test builds, but not in release builds. Their cost is somewhere in between "cheap enough to run in release builds" and "too expensive to run for every test". I'm not sure if this type of assertion is used elsewhere, but if it is, it might be worth providing a pattern for.
Another pattern used in the optimizer (and a few other places IIRC) is to panic to propagate an assertion failure upward until it gets turned into an internal error by a panic-catcher.
Translating one of these panics to use must would look something like:
-if col == nil {
- panic(errors.AssertionFailedf("expected to find column %d in scope", colID))
-}
+if err := must.NotNil(ctx, col, "expected to find column %d in scope", colID); err != nil {
+ panic(err)
+}They seem ergonomically similar to me, but the latter has added complexity of either fatalling in must.NotNil or returning and error to panic. Should we provide versions of these assertions that always panic? Again, I'm not convinced this is a use case that must should cater to, but thought I'd point it out.
erikgrinaker
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @pavelkalinnikov, @renatolabs, @smg260, and @tbg)
pkg/util/log/logcrash/crash_reporting.go line 441 at r8 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
maybeSendCrashReportalways using global settings, which is discouraged when we have access to non-global settings:cockroach/pkg/util/log/logcrash/crash_reporting.go
Lines 110 to 111 in 2675c7c
Should the package provide versions of each function that have settings arguments?
I generally agree, but I think it's important for an API like this to reduce friction as much as possible, and having to plumb through settings is exactly the kind of friction that might discourage people from using it.
I'm feeling kind of okay about this because log.Fatalf does exactly the same thing, for exactly the same reason. I think the main argument for plumbing through settings here is testability, and I think that's an ok tradeoff to make? Does plumbing through settings give us any other benefit?
pkg/util/must/must.go line 44 at r8 (raw file):
Their cost is somewhere in between "cheap enough to run in release builds" and "too expensive to run for every test". I'm not sure if this type of assertion is used elsewhere, but if it is, it might be worth providing a pattern for.
How about something like must.TestOnly(func() { ... }), which is similar to must.Expensive() but gated on CrdbTestBuild?
Another pattern used in the optimizer (and a few other places IIRC) is to panic to propagate an assertion failure upward until it gets turned into an internal error by a panic-catcher.
Right, we could add a helper for this, something like:
must.PanicIf(must.NotNil(ctx, col, "expected to find column %d in scope", colID))I believe this is marginally better, because in non-release builds the log.Fatalf is guaranteed to not be caught by any panic handlers anywhere, so it'll always fail loudly.
mgartner
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @pavelkalinnikov, @renatolabs, @smg260, and @tbg)
pkg/util/log/logcrash/crash_reporting.go line 441 at r8 (raw file):
Does plumbing through settings give us any other benefit?
I'm trying to figure that out, but now confused. It looks like only the cli package sets global settings. If that was the case, then crash reports would never be sent. I must be missing something.
Did you test that Sentry reports are created?
pkg/util/must/must.go line 44 at r8 (raw file):
How about something like
must.TestOnly(func() { ... }), which is similar tomust.Expensive()but gated onCrdbTestBuild?
SGTM!
I believe this is marginally better, because in non-release builds the
log.Fatalfis guaranteed to not be caught by any panic handlers anywhere, so it'll always fail loudly.
👍 We actually rely a bit on the distinction between an internal error that is caught (looks like ERROR: internal error ...) versus a node-crashing panic. The latter causes much more problems for customers so we urgently prioritize fixing those. But we can figure out a workaround for this as we translate our assertions.
renatolabs
left a comment
There was a problem hiding this comment.
This looks great, thanks for setting this up, Erik!
I'd be happy for us to adopt an API like this throughout the codebase. Should we also mark CrdbTestBuild as deprecated in the code? It'd be nice if we eventually removed that build tag and replaced its current uses with calls to must and must.Expensive.
renatolabs
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @mgartner, @pavelkalinnikov, @smg260, and @tbg)
pkg/util/must/must.go line 44 at r8 (raw file):
How about something like
must.TestOnly(func() { ... }), which is similar tomust.Expensive()but gated onCrdbTestBuild?
Maybe I misunderstand, but how is TestOnly different from Expensive? AFAICT, both CrdbTestBuild and Invariants are set to true based on the same crdb_test condition.
Looks like |
|
Ah right, I misread the build tags, thank you. I'm still wondering to what extent we can unify |
srosenberg
left a comment
There was a problem hiding this comment.
Looks good on a first pass... Several things I am not a fan of,
- allowing assertion failure to be non-fatal is a can of worms (see my earlier comment [1])
Expensiveis an unnecessary addition--compiling assertions into a release build is by design an expensive choice- syntax sugar like
Equal,NotEqual,Less, etc. detracts from the parsimonious design; whatever value it adds, it negates it by making the API more error-prone, imo
In my mind, an ideal design would expose nothing more than must.True and must.False under the semantics that if you compiled with crdb_test, and the invoked predicate evaluates to false, then you crash. The reason for such (extreme) minimalism is I shouldn't have to think how an assertion is evaluated, much less which predicate to use to assert it. That often leads to errors, in my experience. An assertion mechanism should be as simple as the meaning of the logical predicate, P(x).
Reviewed 1 of 5 files at r12.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @mgartner, @pavelkalinnikov, @smg260, and @tbg)
pkg/util/must/must.go line 61 at r13 (raw file):
// } // // Example: double-stopping a component. In release builds, we can simply ignore
Fundamentally, it seems wrong to ignore an invariant violation in release builds. Remember, software testing doesn't guarantee an absence of errors. We've seen bugs that are hard to nearly impossible to reproduce in our test environments.
erikgrinaker
left a comment
There was a problem hiding this comment.
I'm still wondering to what extent we can unify
TestOnly(if we decide to add it) andExpensive.
Yeah, I sort of feel the same way. The idea would be to set up an additional nightly run of expensive/invariant tests, which would run the same set of tests that are covered by CrdbTestBuild, so it isn't clear to me what value we get from gating them on CrdbTestBuild instead of Expensive. They'll still exercise exactly the same tests, with the same frequency. But I don't feel very strongly about this, if SQL have a strong preference for continuing to use CrdbTestBuild then that's ok by me.
allowing assertion failure to be non-fatal is a can of worms
We regularly see fatal assertions cause hour-long outages, which may require shipping a custom binary to disable the assertion (which again takes hours). Fatal assertions in production are too big of a risk, and makes people scared of using assertions. They are also largely unnecessary, because most failures are scoped to a single request/transaction/range rather than an entire node, so killing the node causes a ton of collateral damage. We do want to know when they happen in the wild though, because they indicate uncaught bugs and gaps in test coverage.
Expensiveis an unnecessary addition--compiling assertions into a release build is by design an expensive choice
There is a big difference between a boolean conditional and something like MVCC stats assertions or Pebble's invariant checks -- the former is negligible, the latter will reduce end-to-end performance by at least one order of magnitude, and is entirely unsuitable to enable in non-release builds, but provides valuable coverage.
syntax sugar like
Equal,NotEqual,Less, etc. detracts from the parsimonious design; whatever value it adds, it negates it by making the API more error-prone, imo
The same could be said for Testify -- it's equivalent to t.Fatal, and yet people clearly prefer it. If it makes people write more tests, then I'm all for it.
I think one of the main values for these helpers is in always including argument values in assertion failures, so that people don't have to do this by hand over and over -- or worse yet, forget it and we get a useless failure report. But one could also imagine other conveniences, like printing diffs when reporting struct inequality, or less trivial assertion logic -- I've kept it pretty bare-bones initially.
None of this is essential, but the point of a convenience API is to be, well, convenient, which encourages wider use.
In my mind, an ideal design would expose nothing more than
must.Trueandmust.Falseunder the semantics that if you compiled withcrdb_test, and the invoked predicate evaluates tofalse, then you crash.
What would we do in a release build when encountering the condition that would have caused the compiled-out assertion to fire? Ignore it? With most existing assertions this would currently result in correctness violations. I think there's a valuable middle ground between "always fatal" and "always ignore it". Much of the motivation here is precisely to get coverage in real production environments, with minimal disruption, because we continue to hit issues that aren't caught by our tests.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @pavelkalinnikov, @renatolabs, @smg260, @srosenberg, and @tbg)
pkg/util/log/logcrash/crash_reporting.go line 441 at r8 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Does plumbing through settings give us any other benefit?
I'm trying to figure that out, but now confused. It looks like only the
clipackage sets global settings. If that was the case, then crash reports would never be sent. I must be missing something.Did you test that Sentry reports are created?
The cli package is the entry point of the cockroach binary, so it's hooked up in the binary, which is what we care about here.
pkg/util/must/must.go line 61 at r13 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Fundamentally, it seems wrong to ignore an invariant violation in release builds. Remember, software testing doesn't guarantee an absence of errors. We've seen bugs that are hard to nearly impossible to reproduce in our test environments.
We can ignore it when it's safe to do so. The caller is responsible for judging what the appropriate action is in release builds, just like they're responsible for judging what to do with any other error condition.
erikgrinaker
left a comment
There was a problem hiding this comment.
The Raft assertions will need some more scrutiny, so I've pulled them out of this PR. Instead, I added a handful of simpler but representative examples.
I'll write up a bunch of follow-up issues for the remaining work tomorrow.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @knz, @mgartner, @pavelkalinnikov, @renatolabs, @smg260, @srosenberg, and @tbg)
5ca7f29 to
3d2f8e8
Compare
|
I'm going to merge this, for any further concerns I suggest opening an issue and we can discuss there. bors r+ |
Needed to pull in `constraints.Ordered`. Epic: none Release note: None
Epic: none Release note: None
This patch adds a convenient and canonical API for runtime assertions, inspired by the Testify package used for Go test assertions. It is intended to encourage liberal use of runtime assertions throughout the code base, by making it as easy as possible to write assertions that follow best practices. It does not attempt to reinvent the wheel, but instead builds on existing infrastructure. Assertion failures are fatal in all non-release builds, including roachprod clusters and roachtests, to ensure they will be noticed. In release builds, they instead log the failure and report it to Sentry (if enabled), and return an assertion error to the caller for propagation. This avoids excessive disruption in production environments, where an assertion failure is often scoped to an individual RPC request, transaction, or range, and crashing the node can turn a minor problem into a full-blown outage. It is still possible to kill the node when appropriate via `log.Fatalf`, but this should be the exception rather than the norm. It also supports expensive assertions that must be compiled out of normal dev/test/release builds for performance reasons. These are instead enabled in special test builds. This is intended to be used instead of other existing assertion mechanisms, which have various shortcomings: * `log.Fatalf`: kills the node even in release builds, which can cause severe disruption over often minor issues. * `errors.AssertionFailedf`: only suitable when we have an error return path, does not fatal in non-release builds, and are not always notified in release builds. * `logcrash.ReportOrPanic`: panics rather than fatals, which can leave the node limping along. Requires the caller to implement separate assertion handling in release builds, which is easy to forget. Also requires propagating cluster settings, which aren't always available. * `buildutil.CrdbTestBuild`: only enabled in Go tests, not roachtests, roachprod clusters, or production clusters. * `util.RaceEnabled`: only enabled in race builds. Expensive assertions should be possible to run without the additional overhead of the race detector. For more details and examples, see the `must` package documentation. Epic: none Release note: None
Epic: none Release note: None
|
Canceled. |
|
bors r+ |
|
This PR was included in a batch that was canceled, it will be automatically retried |
|
Build succeeded: |
108272: util/must: revert API r=erikgrinaker a=erikgrinaker This patch reverts #106508, since `@RaduBerinde` [pointed out](#107790 (review)) a performance flaw where it will often incur an allocation on the happy path due to interface boxing of the format args. Other options are considered in #108169. We'll revisit runtime assertions with a different API that avoids this cost on the happy path. Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
For details and usage examples, see the package documentation.
This patch adds a convenient and canonical API for runtime assertions, inspired by the Testify package used for Go test assertions. It is intended to encourage liberal use of runtime assertions throughout the code base, by making it as easy as possible to write assertions that follow best practices. It does not attempt to reinvent the wheel, but instead builds on existing infrastructure.
Assertion failures are fatal in all non-release builds, including roachprod clusters and roachtests, to ensure they will be noticed. In release builds, they instead log the failure and report it to Sentry (if enabled), and return an assertion error to the caller for propagation. This avoids excessive disruption in production environments, where an assertion failure is often scoped to an individual RPC request, transaction, or range, and crashing the node can turn a minor problem into a full-blown outage. It is still possible to kill the node when appropriate via
log.Fatalf, but this should be the exception rather than the norm.It also supports expensive assertions that must be compiled out of normal dev/test/release builds for performance reasons. These are instead enabled in special test builds.
This is intended to be used instead of other existing assertion mechanisms, which have various shortcomings:
log.Fatalf: kills the node even in release builds, which can cause severe disruption over often minor issues.errors.AssertionFailedf: only suitable when we have an error return path, does not fatal in non-release builds, and are not always notified in release builds.logcrash.ReportOrPanic: panics rather than fatals, which can leave the node limping along. Requires the caller to implement separate assertion handling in release builds, which is easy to forget. Also requires propagating cluster settings, which aren't always available.buildutil.CrdbTestBuild: only enabled in Go tests, not roachtests, roachprod clusters, or production clusters.util.RaceEnabled: only enabled in race builds. Expensive assertions should be possible to run without the additional overhead of the race detector.For more details and examples, see the
mustpackage documentation.Resolves #94986.
Epic: none
Release note: None