Skip to content

roachtest: take storage checkpoints on failed consistency checks#117235

Merged
craig[bot] merged 5 commits intocockroachdb:masterfrom
erikgrinaker:check-consistency-checkpoint
Jan 11, 2024
Merged

roachtest: take storage checkpoints on failed consistency checks#117235
craig[bot] merged 5 commits intocockroachdb:masterfrom
erikgrinaker:check-consistency-checkpoint

Conversation

@erikgrinaker
Copy link
Copy Markdown
Contributor

@erikgrinaker erikgrinaker commented Jan 2, 2024

roachprod: add store index parameter for {store-dir} expansion

On multi-store nodes, this allow using e.g. {store-dir:2} to reference the second store on the node (NB: not the store ID, but its local index).

roachprod: expand parameters in get source

This allows using e.g. {store-dir} in the source path.

sql: add COCKROACH_INTERNAL_CHECK_CONSISTENCY_FATAL

This environment variable makes crdb_internal.check_consistency() take storage checkpoints and terminate nodes on any inconsistencies when run with stats_only: false. This is the same as the background consistency checker, and is done by changing the mode CHECK_FULL to CHECK_VIA_QUEUE.

roachtest: take storage checkpoints on failed consistency checks

This allows us to debug range inconsistencies.

roachtest: collect Pebble checkpoint artifacts

Resolves #115770.
Epic: none
Release note: None

@erikgrinaker erikgrinaker self-assigned this Jan 2, 2024
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jan 2, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@erikgrinaker
Copy link
Copy Markdown
Contributor Author

@renatolabs @pav-kv Submitting this for an initial look. Will need a few tests and tweaks.

@erikgrinaker erikgrinaker changed the title Check consistency checkpoint roachtest: take storage checkpoints on failed consistency checks Jan 2, 2024
Copy link
Copy Markdown
Collaborator

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks good

@erikgrinaker
Copy link
Copy Markdown
Contributor Author

@renatolabs Do you have any thoughts on this before I wrap it up?

Copy link
Copy Markdown

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Left minor comments for consideration. The direction here make sense to me.

@erikgrinaker
Copy link
Copy Markdown
Contributor Author

erikgrinaker commented Jan 8, 2024

On second thought, I wonder if we should just do something simpler here, which won't require any API changes: add an envvar (e.g. COCKROACH_INTERNAL_CHECK_CONSISTENCY_VIA_QUEUE) which will change crdb_internal.check_consistency() to use CHECK_VIA_QUEUE, causing it to take checkpoints and terminate nodes if it finds any inconsistencies.

Wdyt @pav-kv?

@erikgrinaker erikgrinaker force-pushed the check-consistency-checkpoint branch 3 times, most recently from b9ba8f1 to 0d31775 Compare January 8, 2024 11:45
@pav-kv
Copy link
Copy Markdown
Collaborator

pav-kv commented Jan 9, 2024

@erikgrinaker
The current implementation takes checkpoints and doesn’t kill the node, right? If we flip to VIA_QUEUE, it’ll kill the node. Is it ok? If not, COCKROACH_INTERNAL_CHECK_CONSISTENCY_VIA_QUEUE would need to be honoured both to set VIA_QUEUE, and to modify its interpretation. This happens on 2 different nodes (the sender and the receiver below raft).

Upd: ah no, we can just not set the Terminate field, which is all on the sender.

@erikgrinaker
Copy link
Copy Markdown
Contributor Author

I figured terminating the node seemed fine, since we were going to fail the test anyway.

I guess that prevents us from collecting debug.zip data from the node, so maybe we do want to keep it running -- but if the inconsistency happened to be caught by the normal consistency queue, then we'd be in this situation anyway.

It just seems like a much more light-weight change -- I'm not sure I like modifying the public APIs just for this.

@pav-kv
Copy link
Copy Markdown
Collaborator

pav-kv commented Jan 9, 2024

If terminating is fine, then perhaps an env var is fine. Who would set the var?

@erikgrinaker
Copy link
Copy Markdown
Contributor Author

We'd have to set it for all roachtests.

@pav-kv
Copy link
Copy Markdown
Collaborator

pav-kv commented Jan 9, 2024

Can we use some of the static constants instead of env var? Like the buildutil.CrdbTestBuild. So that all tests/roachtests automatically change the meaning of a full check.

Another benefit of a static test-only check is that it can actually be done across nodes. So we can terminate or not terminate nodes as we like.

@erikgrinaker
Copy link
Copy Markdown
Contributor Author

erikgrinaker commented Jan 9, 2024

Roachtests use arbitrary binaries, in principle we don't control whether they're built with CrdbTestBuild or otherwise (consider e.g. running roachtest locally).

all tests/roachtests automatically change the meaning of a full check

I don't think we want to do this. For example, it will break all tests of crdb_internal.check_consistency(), since they'll unconditionally fatal.

@pav-kv
Copy link
Copy Markdown
Collaborator

pav-kv commented Jan 9, 2024

I don't think we want to do this. For example, it will break all tests of crdb_internal.check_consistency(), since they'll unconditionally fatal.

We may choose to never fatal nodes in tests. Similarly to how we mock out the fatal behaviour in unit tests.

Also, it doesn't have to be a build flag. Can be any semi-static indicator that we're in a [specific] roachtest, similar to testing knobs for the unit tests.

@erikgrinaker
Copy link
Copy Markdown
Contributor Author

Also, it doesn't have to be a build flag. Can be any semi-static indicator that we're in a [specific] roachtest, similar to testing knobs for the unit tests.

How can we do this? Roachtests use an arbitrary CRDB binary, they can only control runtime flags like envvars, not build-time flags.

@pav-kv
Copy link
Copy Markdown
Collaborator

pav-kv commented Jan 9, 2024

Right, then env var is the answer. I was thinking of some "code level" env vars basically, but it appears we don't have them.

@erikgrinaker erikgrinaker force-pushed the check-consistency-checkpoint branch from 0d31775 to 2d3dd23 Compare January 9, 2024 15:59
@erikgrinaker
Copy link
Copy Markdown
Contributor Author

Ok, I've added COCKROACH_INTERNAL_CHECK_CONSISTENCY_FATAL instead, and adapted the code. I need to add a test for the checkpoint artifacts collection and have a final look over everything, but otherwise I think this should be good now.

On multi-store nodes, this allow using e.g. `{store-dir:2}` to reference
the second store on the node (NB: not the store ID, but its local
index).

Epic: none
Release note: None
This allows using e.g. `{store-dir}` in the source path.

Epic: none
Release note: None
This environment variable makes `crdb_internal.check_consistency()` take
storage checkpoints and terminate nodes on any inconsistencies when run
with `stats_only: false`. This is the same as the background consistency
checker, and is done by changing the mode CHECK_FULL to CHECK_VIA_QUEUE.

Epic: none
Release note: None
This allows us to debug range inconsistencies.

Epic: none
Release note: None
Epic: none
Release note: None
@erikgrinaker erikgrinaker force-pushed the check-consistency-checkpoint branch from 2d3dd23 to 583c105 Compare January 10, 2024 09:49
@erikgrinaker
Copy link
Copy Markdown
Contributor Author

This should be ready for final review now. Became a fair bit simpler.

I need to add a test for the checkpoint artifacts collection

Doesn't look like we have a test harness in place for roachtest artifacts collection, and I don't want to start building one now, so I dropped this. Verified manually with the inconsistency test that this works as expected.

@erikgrinaker erikgrinaker marked this pull request as ready for review January 10, 2024 09:51
@erikgrinaker erikgrinaker requested review from a team as code owners January 10, 2024 09:51
@erikgrinaker erikgrinaker requested review from DarrylWong and removed request for a team January 10, 2024 09:51
@erikgrinaker erikgrinaker added the backport-23.2.x PAST MAINTENANCE SUPPORT: 23.2 patch releases via ER request only label Jan 10, 2024
Copy link
Copy Markdown
Collaborator

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thank you!

@erikgrinaker
Copy link
Copy Markdown
Contributor Author

@renatolabs I don't think there should be anything controversial here, so I'll merge this, but let me know if you have further comments and I'll address them separately.

TFTRs!

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 11, 2024

Build succeeded:

@craig craig bot merged commit d9c7307 into cockroachdb:master Jan 11, 2024
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jan 11, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from f9ebbde to blathers/backport-release-23.2-117235: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-23.2.x PAST MAINTENANCE SUPPORT: 23.2 patch releases via ER request only

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest: assert replica data consistency checks after tests

4 participants