util/must: add runtime assertion API by erikgrinaker · Pull Request #106508 · cockroachdb/cockroach

erikgrinaker · 2023-07-10T12:07:37Z

For details and usage examples, see the package documentation.

This patch adds a convenient and canonical API for runtime assertions, inspired by the Testify package used for Go test assertions. It is intended to encourage liberal use of runtime assertions throughout the code base, by making it as easy as possible to write assertions that follow best practices. It does not attempt to reinvent the wheel, but instead builds on existing infrastructure.

Assertion failures are fatal in all non-release builds, including roachprod clusters and roachtests, to ensure they will be noticed. In release builds, they instead log the failure and report it to Sentry (if enabled), and return an assertion error to the caller for propagation. This avoids excessive disruption in production environments, where an assertion failure is often scoped to an individual RPC request, transaction, or range, and crashing the node can turn a minor problem into a full-blown outage. It is still possible to kill the node when appropriate via log.Fatalf, but this should be the exception rather than the norm.

It also supports expensive assertions that must be compiled out of normal dev/test/release builds for performance reasons. These are instead enabled in special test builds.

This is intended to be used instead of other existing assertion mechanisms, which have various shortcomings:

log.Fatalf: kills the node even in release builds, which can cause severe disruption over often minor issues.
errors.AssertionFailedf: only suitable when we have an error return path, does not fatal in non-release builds, and are not always notified in release builds.
logcrash.ReportOrPanic: panics rather than fatals, which can leave the node limping along. Requires the caller to implement separate assertion handling in release builds, which is easy to forget. Also requires propagating cluster settings, which aren't always available.
buildutil.CrdbTestBuild: only enabled in Go tests, not roachtests, roachprod clusters, or production clusters.
util.RaceEnabled: only enabled in race builds. Expensive assertions should be possible to run without the additional overhead of the race detector.

For more details and examples, see the must package documentation.

Resolves #94986.
Epic: none
Release note: None

cockroach-teamcity · 2023-07-10T12:08:00Z

This change is

tbg

Just looked at the high level design and UX and everything checks out as far as I'm concerned. I would be happy to see this reviewed by TE (or if TE are too swamped I can review on their behalf, just let me know) and get merged soon.

pkg/util/must/must.go

pav-kv · 2023-07-10T13:09:03Z

pkg/kv/kvserver/replica_raft.go

-	if ctx.Done() != nil {
-		return handleRaftReadyStats{}, errors.AssertionFailedf(
-			"handleRaftReadyRaftMuLocked cannot be called with a cancellable context")
+	if err := must.Zero(ctx, ctx.Done(),


Does must.Nil work here, to match the previous ctx.Done() != nil check?

Unfortunately no. must.Nil only works with pointers, since it's generic over *T. I haven't found a way to also include reference types (slices, maps, channels, funcs, etc) using generics, nor interfaces. We could use reflection, but that would be too expensive here.

However, for reference types and interfaces the zero value is nil, so must.Zero() here is exactly equivalent to == nil.

Looks like we need a "nilable" constraint in the upstream package :)

Maybe we could enumerate these things manually, to have a constraint that matches all the nilable things. Haven't researched whether the syntax allows for that.

I went down that path when implementing Len(), trying to get it to work both with slices, maps, and strings. I sort of got there, but it confused type inference, so we'd have to manually instantiate the generic function at all call sites which gets really annoying.

Might still be possible, but I didn't immediately find a way to do it.

I do think must.Zero is a surprising developer UX to test nil interface references.

Would it be possible to introduce an alias must.NilInterface or NilReference and recommend its use for that case?

Yes, but we'd need one for each reference type, so about 6 in total (with interfaces), plus the NotNil variants, so 12. I've been hoping to find a way to finagle generics into doing it instead.

about 6 in total (with interfaces)

I was thinking about focusing on just interfaces, which is the most common.

Interfaces are also tricky, because we'd have to handle typed nils -- I'm not sure we can without reflection. But there's always the must.True(ctx, foo == nil) escape hatch too. I agree though, we should try to do something better here.

pkg/kv/kvserver/replica_raft.go

mgartner

Great idea! Overall LGTM - I left some context into types of assertions we use in the optimizer in case that's helpful in steering the API, but I'm not sure if the patterns are widespread, so feel free to ignore.

Reviewed 4 of 4 files at r1, 1 of 1 files at r2, 1 of 1 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @pavelkalinnikov, @renatolabs, @smg260, and @tbg)

pkg/util/log/logcrash/crash_reporting.go line 441 at r8 (raw file):

	}
	must.MaybeSendReport = func(ctx context.Context, err error) {
		maybeSendCrashReport(ctx, err, ReportTypeAssertionFailure)

maybeSendCrashReport always using global settings, which is discouraged when we have access to non-global settings:

cockroach/pkg/util/log/logcrash/crash_reporting.go

Lines 110 to 111 in 2675c7c

    
           // Should be used only when strictly necessary; use ReportPanic whenever we have 
        
           // access to the settings.

Should the package provide versions of each function that have settings arguments?

pkg/util/must/must.go line 44 at r8 (raw file):

// different node (thus stack traces can be insufficient by themselves).
//
// Some assertions may be too expensive to always run. For example, we may want

We have a set of assertions in the optimizer that run for all test builds, but not in release builds. Their cost is somewhere in between "cheap enough to run in release builds" and "too expensive to run for every test". I'm not sure if this type of assertion is used elsewhere, but if it is, it might be worth providing a pattern for.

Another pattern used in the optimizer (and a few other places IIRC) is to panic to propagate an assertion failure upward until it gets turned into an internal error by a panic-catcher.

Translating one of these panics to use must would look something like:

-if col == nil {
-    panic(errors.AssertionFailedf("expected to find column %d in scope", colID))
-}
+if err := must.NotNil(ctx, col, "expected to find column %d in scope", colID); err != nil {
+    panic(err)
+}

They seem ergonomically similar to me, but the latter has added complexity of either fatalling in must.NotNil or returning and error to panic. Should we provide versions of these assertions that always panic? Again, I'm not convinced this is a use case that must should cater to, but thought I'd point it out.

erikgrinaker

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @pavelkalinnikov, @renatolabs, @smg260, and @tbg)

pkg/util/log/logcrash/crash_reporting.go line 441 at r8 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

maybeSendCrashReport always using global settings, which is discouraged when we have access to non-global settings:

cockroach/pkg/util/log/logcrash/crash_reporting.go

Lines 110 to 111 in 2675c7c

// Should be used only when strictly necessary; use ReportPanic whenever we have

// access to the settings.

Should the package provide versions of each function that have settings arguments?

I generally agree, but I think it's important for an API like this to reduce friction as much as possible, and having to plumb through settings is exactly the kind of friction that might discourage people from using it.

I'm feeling kind of okay about this because log.Fatalf does exactly the same thing, for exactly the same reason. I think the main argument for plumbing through settings here is testability, and I think that's an ok tradeoff to make? Does plumbing through settings give us any other benefit?

pkg/util/must/must.go line 44 at r8 (raw file):

Their cost is somewhere in between "cheap enough to run in release builds" and "too expensive to run for every test". I'm not sure if this type of assertion is used elsewhere, but if it is, it might be worth providing a pattern for.

How about something like must.TestOnly(func() { ... }), which is similar to must.Expensive() but gated on CrdbTestBuild?

Another pattern used in the optimizer (and a few other places IIRC) is to panic to propagate an assertion failure upward until it gets turned into an internal error by a panic-catcher.

Right, we could add a helper for this, something like:

must.PanicIf(must.NotNil(ctx, col, "expected to find column %d in scope", colID))

I believe this is marginally better, because in non-release builds the log.Fatalf is guaranteed to not be caught by any panic handlers anywhere, so it'll always fail loudly.

mgartner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @pavelkalinnikov, @renatolabs, @smg260, and @tbg)

pkg/util/log/logcrash/crash_reporting.go line 441 at r8 (raw file):

Does plumbing through settings give us any other benefit?

I'm trying to figure that out, but now confused. It looks like only the cli package sets global settings. If that was the case, then crash reports would never be sent. I must be missing something.

Did you test that Sentry reports are created?

pkg/util/must/must.go line 44 at r8 (raw file):

How about something like must.TestOnly(func() { ... }), which is similar to must.Expensive() but gated on CrdbTestBuild?

SGTM!

I believe this is marginally better, because in non-release builds the log.Fatalf is guaranteed to not be caught by any panic handlers anywhere, so it'll always fail loudly.

👍 We actually rely a bit on the distinction between an internal error that is caught (looks like ERROR: internal error ...) versus a node-crashing panic. The latter causes much more problems for customers so we urgently prioritize fixing those. But we can figure out a workaround for this as we translate our assertions.

renatolabs

This looks great, thanks for setting this up, Erik!

I'd be happy for us to adopt an API like this throughout the codebase. Should we also mark CrdbTestBuild as deprecated in the code? It'd be nice if we eventually removed that build tag and replaced its current uses with calls to must and must.Expensive.

pkg/util/must/must.go

pkg/kv/kvserver/replica_raft.go

renatolabs

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @mgartner, @pavelkalinnikov, @smg260, and @tbg)

pkg/util/must/must.go line 44 at r8 (raw file):

How about something like must.TestOnly(func() { ... }), which is similar to must.Expensive() but gated on CrdbTestBuild?

Maybe I misunderstand, but how is TestOnly different from Expensive? AFAICT, both CrdbTestBuild and Invariants are set to true based on the same crdb_test condition.

rickystewart · 2023-07-10T15:52:54Z

Maybe I misunderstand, but how is TestOnly different from Expensive? AFAICT, both CrdbTestBuild and Invariants are set to true based on the same crdb_test condition.

Looks like Invariants is set on the build tags invariants and race, NOT crdb_test.

renatolabs · 2023-07-10T16:19:49Z

Ah right, I misread the build tags, thank you.

I'm still wondering to what extent we can unify TestOnly (if we decide to add it) and Expensive. The former would have the same limitation mentioned in the PR description ("only enabled in Go tests, not roachtests, roachprod clusters, or production clusters"), and might provide a false sense of coverage since Test is ambiguous (e.g., it's common for people to think that their assertions are running in roachtests).

srosenberg

Looks good on a first pass... Several things I am not a fan of,

allowing assertion failure to be non-fatal is a can of worms (see my earlier comment [1])
Expensive is an unnecessary addition--compiling assertions into a release build is by design an expensive choice
syntax sugar like Equal, NotEqual, Less, etc. detracts from the parsimonious design; whatever value it adds, it negates it by making the API more error-prone, imo

In my mind, an ideal design would expose nothing more than must.True and must.False under the semantics that if you compiled with crdb_test, and the invoked predicate evaluates to false, then you crash. The reason for such (extreme) minimalism is I shouldn't have to think how an assertion is evaluated, much less which predicate to use to assert it. That often leads to errors, in my experience. An assertion mechanism should be as simple as the meaning of the logical predicate, P(x).

[1] #102041 (comment)

Reviewed 1 of 5 files at r12.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @mgartner, @pavelkalinnikov, @smg260, and @tbg)

pkg/util/must/must.go line 61 at r13 (raw file):

//	}
//
// Example: double-stopping a component. In release builds, we can simply ignore

Fundamentally, it seems wrong to ignore an invariant violation in release builds. Remember, software testing doesn't guarantee an absence of errors. We've seen bugs that are hard to nearly impossible to reproduce in our test environments.

erikgrinaker

I'm still wondering to what extent we can unify TestOnly (if we decide to add it) and Expensive.

Yeah, I sort of feel the same way. The idea would be to set up an additional nightly run of expensive/invariant tests, which would run the same set of tests that are covered by CrdbTestBuild, so it isn't clear to me what value we get from gating them on CrdbTestBuild instead of Expensive. They'll still exercise exactly the same tests, with the same frequency. But I don't feel very strongly about this, if SQL have a strong preference for continuing to use CrdbTestBuild then that's ok by me.

allowing assertion failure to be non-fatal is a can of worms

We regularly see fatal assertions cause hour-long outages, which may require shipping a custom binary to disable the assertion (which again takes hours). Fatal assertions in production are too big of a risk, and makes people scared of using assertions. They are also largely unnecessary, because most failures are scoped to a single request/transaction/range rather than an entire node, so killing the node causes a ton of collateral damage. We do want to know when they happen in the wild though, because they indicate uncaught bugs and gaps in test coverage.

Expensive is an unnecessary addition--compiling assertions into a release build is by design an expensive choice

There is a big difference between a boolean conditional and something like MVCC stats assertions or Pebble's invariant checks -- the former is negligible, the latter will reduce end-to-end performance by at least one order of magnitude, and is entirely unsuitable to enable in non-release builds, but provides valuable coverage.

syntax sugar like Equal, NotEqual, Less, etc. detracts from the parsimonious design; whatever value it adds, it negates it by making the API more error-prone, imo

The same could be said for Testify -- it's equivalent to t.Fatal, and yet people clearly prefer it. If it makes people write more tests, then I'm all for it.

I think one of the main values for these helpers is in always including argument values in assertion failures, so that people don't have to do this by hand over and over -- or worse yet, forget it and we get a useless failure report. But one could also imagine other conveniences, like printing diffs when reporting struct inequality, or less trivial assertion logic -- I've kept it pretty bare-bones initially.

None of this is essential, but the point of a convenience API is to be, well, convenient, which encourages wider use.

In my mind, an ideal design would expose nothing more than must.True and must.False under the semantics that if you compiled with crdb_test, and the invoked predicate evaluates to false, then you crash.

What would we do in a release build when encountering the condition that would have caused the compiled-out assertion to fire? Ignore it? With most existing assertions this would currently result in correctness violations. I think there's a valuable middle ground between "always fatal" and "always ignore it". Much of the motivation here is precisely to get coverage in real production environments, with minimal disruption, because we continue to hit issues that aren't caught by our tests.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @pavelkalinnikov, @renatolabs, @smg260, @srosenberg, and @tbg)

pkg/util/log/logcrash/crash_reporting.go line 441 at r8 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Does plumbing through settings give us any other benefit?

I'm trying to figure that out, but now confused. It looks like only the cli package sets global settings. If that was the case, then crash reports would never be sent. I must be missing something.

Did you test that Sentry reports are created?

The cli package is the entry point of the cockroach binary, so it's hooked up in the binary, which is what we care about here.

pkg/util/must/must.go line 61 at r13 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Fundamentally, it seems wrong to ignore an invariant violation in release builds. Remember, software testing doesn't guarantee an absence of errors. We've seen bugs that are hard to nearly impossible to reproduce in our test environments.

We can ignore it when it's safe to do so. The caller is responsible for judging what the appropriate action is in release builds, just like they're responsible for judging what to do with any other error condition.

erikgrinaker

The Raft assertions will need some more scrutiny, so I've pulled them out of this PR. Instead, I added a handful of simpler but representative examples.

I'll write up a bunch of follow-up issues for the remaining work tomorrow.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @knz, @mgartner, @pavelkalinnikov, @renatolabs, @smg260, @srosenberg, and @tbg)

erikgrinaker · 2023-07-23T17:36:41Z

Issues for follow-up work, please weigh in if you have opinions:

erikgrinaker · 2023-07-28T12:37:37Z

I'm going to merge this, for any further concerns I suggest opening an issue and we can discuss there.

bors r+

Needed to pull in `constraints.Ordered`. Epic: none Release note: None

Epic: none Release note: None

This patch adds a convenient and canonical API for runtime assertions, inspired by the Testify package used for Go test assertions. It is intended to encourage liberal use of runtime assertions throughout the code base, by making it as easy as possible to write assertions that follow best practices. It does not attempt to reinvent the wheel, but instead builds on existing infrastructure. Assertion failures are fatal in all non-release builds, including roachprod clusters and roachtests, to ensure they will be noticed. In release builds, they instead log the failure and report it to Sentry (if enabled), and return an assertion error to the caller for propagation. This avoids excessive disruption in production environments, where an assertion failure is often scoped to an individual RPC request, transaction, or range, and crashing the node can turn a minor problem into a full-blown outage. It is still possible to kill the node when appropriate via `log.Fatalf`, but this should be the exception rather than the norm. It also supports expensive assertions that must be compiled out of normal dev/test/release builds for performance reasons. These are instead enabled in special test builds. This is intended to be used instead of other existing assertion mechanisms, which have various shortcomings: * `log.Fatalf`: kills the node even in release builds, which can cause severe disruption over often minor issues. * `errors.AssertionFailedf`: only suitable when we have an error return path, does not fatal in non-release builds, and are not always notified in release builds. * `logcrash.ReportOrPanic`: panics rather than fatals, which can leave the node limping along. Requires the caller to implement separate assertion handling in release builds, which is easy to forget. Also requires propagating cluster settings, which aren't always available. * `buildutil.CrdbTestBuild`: only enabled in Go tests, not roachtests, roachprod clusters, or production clusters. * `util.RaceEnabled`: only enabled in race builds. Expensive assertions should be possible to run without the additional overhead of the race detector. For more details and examples, see the `must` package documentation. Epic: none Release note: None

Epic: none Release note: None

craig · 2023-07-28T14:37:15Z

Canceled.

erikgrinaker · 2023-07-28T14:39:32Z

bors r+

craig · 2023-07-28T17:30:59Z

This PR was included in a batch that was canceled, it will be automatically retried

craig · 2023-07-28T18:26:05Z

Build succeeded:

Bazel Essential CI (Cockroach)

108272: util/must: revert API r=erikgrinaker a=erikgrinaker This patch reverts #106508, since `@RaduBerinde` [pointed out](#107790 (review)) a performance flaw where it will often incur an allocation on the happy path due to interface boxing of the format args. Other options are considered in #108169. We'll revisit runtime assertions with a different API that avoids this cost on the happy path. Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>

erikgrinaker requested a review from a team July 10, 2023 12:07

erikgrinaker self-assigned this Jul 10, 2023

erikgrinaker requested a review from a team July 10, 2023 12:07

erikgrinaker requested review from a team as code owners July 10, 2023 12:07

erikgrinaker requested review from nkodali and removed request for a team July 10, 2023 12:07

erikgrinaker mentioned this pull request Jul 10, 2023

testing: runtime assertions #94986

Open

tbg requested review from a team, renatolabs and smg260 and removed request for a team and nkodali July 10, 2023 12:30

tbg reviewed Jul 10, 2023

View reviewed changes

pkg/util/must/must.go Outdated Show resolved Hide resolved

erikgrinaker force-pushed the must branch from bafceaa to 7ed69dd Compare July 10, 2023 12:55

pav-kv reviewed Jul 10, 2023

View reviewed changes

erikgrinaker force-pushed the must branch 2 times, most recently from f302de7 to bf16d46 Compare July 10, 2023 13:55

mgartner approved these changes Jul 10, 2023

View reviewed changes

erikgrinaker commented Jul 10, 2023

View reviewed changes

mgartner reviewed Jul 10, 2023

View reviewed changes

renatolabs reviewed Jul 10, 2023

View reviewed changes

pkg/util/must/must.go Outdated Show resolved Hide resolved

pkg/kv/kvserver/replica_raft.go Outdated Show resolved Hide resolved

renatolabs reviewed Jul 10, 2023

View reviewed changes

srosenberg reviewed Jul 11, 2023

View reviewed changes

erikgrinaker commented Jul 11, 2023

View reviewed changes

erikgrinaker mentioned this pull request Jul 22, 2023

util/must: convenience wrapper for making many assertions #107418

Closed

erikgrinaker force-pushed the must branch from 20bef10 to 5b7dbcc Compare July 22, 2023 20:57

erikgrinaker commented Jul 22, 2023

View reviewed changes

erikgrinaker mentioned this pull request Jul 23, 2023

util/must: improve Expensive build tag gating #107425

Closed

erikgrinaker force-pushed the must branch from 5b7dbcc to 37ed03d Compare July 23, 2023 17:36

erikgrinaker added the db-cy-23 label Jul 24, 2023

erikgrinaker force-pushed the must branch 3 times, most recently from 5ca7f29 to 3d2f8e8 Compare July 28, 2023 11:28

erikgrinaker mentioned this pull request Jul 28, 2023

util/must: add Handle for convenient failure handling #107790

Closed

erikgrinaker added 4 commits July 28, 2023 14:24

DEPS: bump golang.org/x/exp

5f384b3

Needed to pull in `constraints.Ordered`. Epic: none Release note: None

logcrash: add ReportTypeAssertionFailure

4ffc54c

Epic: none Release note: None

*: add a few example usages of must

0a1699d

Epic: none Release note: None

erikgrinaker force-pushed the must branch from 3d2f8e8 to 7b6be04 Compare July 28, 2023 14:37

erikgrinaker force-pushed the must branch from 7b6be04 to 0a1699d Compare July 28, 2023 14:39

craig bot merged commit 04c91a5 into cockroachdb:master Jul 28, 2023

erikgrinaker mentioned this pull request Aug 7, 2023

util/must: revert API #108272

Merged

maryliag mentioned this pull request Aug 15, 2023

release-23.1: DEPS: bump golang.org/x/exp #108792

Merged

erikgrinaker mentioned this pull request Oct 7, 2023

util/assertion: add runtime assertion API #111991

Closed

erikgrinaker deleted the must branch November 14, 2023 10:38

	// Should be used only when strictly necessary; use ReportPanic whenever we have
	// access to the settings.

Conversation

erikgrinaker commented Jul 10, 2023

Uh oh!

cockroach-teamcity commented Jul 10, 2023

Uh oh!

tbg left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pav-kv Jul 10, 2023

Choose a reason for hiding this comment

Uh oh!

erikgrinaker Jul 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pav-kv Jul 10, 2023

Choose a reason for hiding this comment

Uh oh!

pav-kv Jul 10, 2023

Choose a reason for hiding this comment

Uh oh!

erikgrinaker Jul 10, 2023

Choose a reason for hiding this comment

Uh oh!

knz Jul 11, 2023

Choose a reason for hiding this comment

Uh oh!

erikgrinaker Jul 11, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

knz Jul 11, 2023

Choose a reason for hiding this comment

Uh oh!

erikgrinaker Jul 11, 2023

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mgartner left a comment

Choose a reason for hiding this comment

Uh oh!

erikgrinaker left a comment

Choose a reason for hiding this comment

Uh oh!

mgartner left a comment

Choose a reason for hiding this comment

Uh oh!

renatolabs left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

renatolabs left a comment

Choose a reason for hiding this comment

Uh oh!

rickystewart commented Jul 10, 2023

Uh oh!

renatolabs commented Jul 10, 2023

Uh oh!

srosenberg left a comment

Choose a reason for hiding this comment

Uh oh!

erikgrinaker left a comment

Choose a reason for hiding this comment

Uh oh!

erikgrinaker left a comment

Choose a reason for hiding this comment

Uh oh!

erikgrinaker commented Jul 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

erikgrinaker commented Jul 28, 2023

Uh oh!

craig bot commented Jul 28, 2023

Uh oh!

erikgrinaker commented Jul 28, 2023

Uh oh!

craig bot commented Jul 28, 2023

Uh oh!

erikgrinaker Jul 10, 2023 •

edited

Loading

erikgrinaker Jul 11, 2023 •

edited

Loading

erikgrinaker commented Jul 23, 2023 •

edited

Loading