storage: take an engine checkpoint during failing consistency checks #36867

Merged: craig[bot] merged 2 commits into cockroachdb:master from tbg:fix/rocks-checkpoints on Apr 17, 2019

Conversation

@tbg (Member) commented Apr 16, 2019

This takes a checkpoint on the nodes with replicas of a failing range,
before the failure leads to nodes shutting down. The checkpoint will, for
the replicas of the affected range, be taken at the same Raft log position.

Release note: None

@tbg tbg requested a review from a team April 16, 2019 10:43
@cockroach-teamcity (Member)

This change is Reviewable

@tbg tbg requested a review from nvb April 16, 2019 10:43
@tbg (Member, Author) commented Apr 16, 2019

A failed consistency check will effectively leak the checkpoint, which I think is fine for now -- customers running into one are always supposed to (and likely to) contact support.

If we're paranoid about someone running out of disk space after not realizing that they have RocksDB checkpoints created after inconsistencies, we need to figure out a better UX, but for now I'd say we add an env var that's set by default when using roachprod.

@tbg tbg force-pushed the fix/rocks-checkpoints branch from 1d38985 to d44e47f Compare April 16, 2019 13:46
@andreimatei (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


c-deps/libroach/db.cc, line 269 at r2 (raw file):

    return ToDBStatus(status);
  }
  status = cp_ptr->CreateCheckpoint(cp_dir, 0 /* log_size_for_flush */);

consider saying something about what that 0 does, if it's interesting. I think a lot of people (like myself right now) will stop reading here and not dig into CreateCheckpoint() to see.


c-deps/libroach/db.cc, line 274 at r2 (raw file):

    return ToDBStatus(status);
  }
  return kSuccess;

can't we always return ToDBStatus(status); ?


c-deps/libroach/include/libroach.h, line 104 at r2 (raw file):

// Creates a RocksDB checkpoint in the specified directory (which must not exist).
DBStatus DBCreateCheckpoint(DBEngine* db, DBSlice dir);

pls put some words about what a checkpoint is - something about hardlinks to current SSTs


pkg/roachpb/api.proto, line 1144 at r2 (raw file):

  // If set, a checkpoint (i.e. cheap backup) of the engine will be taken. This
  // is expected to be set only if we already know that there is an
  // inconsistency and we want to preserve as much state as possible.

pls say where the checkpoint will live (how to find it)


pkg/storage/consistency_queue_test.go, line 196 at r2 (raw file):

	defer cache.Release()

	for i := 0; i < numStores; i++ {

say something about not using in-memory engines because we want to take snapshots. But actually, why can't we use in-mem engines (and still take snapshots)?


pkg/storage/replica_consistency.go, line 92 at r2 (raw file):

		// queue is asking us to report a diff, so it has run a previous check
		// which failed.
		Checkpoint: args.WithDiff && args.Mode == roachpb.ChecksumMode_CHECK_VIA_QUEUE,

this already confusing dance is becoming even more complex. How about just plumbing a checkpoint bool through args?


pkg/storage/replica_proposal.go, line 173 at r2 (raw file):

	if cc.Checkpoint {
		checkpointBase := filepath.Join(r.store.engine.GetAuxiliaryDir(), "checkpoints")
		_ = os.MkdirAll(checkpointBase, 0700)

nit: 0700 /* owner rwx */


pkg/storage/replica_proposal.go, line 178 at r2 (raw file):

			log.Warningf(ctx, "unable to create checkpoint %s: %s", checkpointDir, err)
		} else {
			log.Infof(ctx, "created checkpoing %s", checkpointDir)

poing poing


pkg/storage/engine/engine_test.go, line 804 at r2 (raw file):

	defer cleanup()

	// TODO(tbg): use the method below, but only for on-disk stores.

sorry, what's going on here?


pkg/storage/engine/engine_test.go, line 821 at r2 (raw file):

	assert.NoError(t, err)
	assert.NoError(t, db.CreateCheckpoint(dir))

can you open and read the checkpoint you've just created? (as a RocksDB engine)

@andreimatei (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/storage/consistency_queue_test.go, line 196 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

say something about not using in-memory engines because we want to take snapshots. But actually, why can't we use in-mem engines (and still take snapshots)?

what does engine.GetAuxiliaryDir() return for in-mem engines?


pkg/storage/replica_proposal.go, line 172 at r2 (raw file):

	snap := r.store.engine.NewSnapshot()
	if cc.Checkpoint {
		checkpointBase := filepath.Join(r.store.engine.GetAuxiliaryDir(), "checkpoints")

should we protect against in-mem engines here somehow? What's going to happen for those?


pkg/storage/replica_proposal.go, line 173 at r2 (raw file):

	if cc.Checkpoint {
		checkpointBase := filepath.Join(r.store.engine.GetAuxiliaryDir(), "checkpoints")
		_ = os.MkdirAll(checkpointBase, 0700)

why MkdirAll and not Mkdir?

We've repeatedly wanted this to preserve state when finding replica
inconsistencies.

See, for example:

cockroachdb#36861

Release note: None
@tbg tbg force-pushed the fix/rocks-checkpoints branch from d44e47f to 4e8d768 Compare April 17, 2019 12:22
@tbg (Member, Author) left a comment

:lgtm:

@petermattis could you give the C++ changes a glance to make sure I'm not doing something silly with memory mgmt?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @nvanbenschoten, and @tbg)


c-deps/libroach/db.cc, line 269 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

consider saying something about what that 0 does, if it's interesting. I think a lot of people (like myself right now) will stop reading here and not dig into CreateCheckpoint() to see.

Done.


c-deps/libroach/db.cc, line 274 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

can't we always return ToDBStatus(status); ?

Done.


c-deps/libroach/include/libroach.h, line 104 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

pls put some words about what a checkpoint is - something about hardlinks to current SSTs

Done.


pkg/roachpb/api.proto, line 1144 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

pls say where the checkpoint will live (how to find it)

Done.


pkg/storage/consistency_queue_test.go, line 196 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

what does engine.GetAuxiliaryDir() return for in-mem engines?

RocksDB abstracts away the file system through an Env. An in-memory engine just has an in-memory env. So in-memory engines work if and only if you write everything through the RocksDB env, which basically means you have to slap a full POSIX API into the Engine interface. We did some of that (had to) for encryption at rest, but it's not enough. This isn't a yak worth shaving.

In-mem engines actually return a temp dir as the auxiliary dir, mostly to make things "mostly work", but that's not behavior that makes a lot of sense or that I want to rely on in this test.


pkg/storage/replica_consistency.go, line 92 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this already confusing dance is becoming even more complex. How about just plumbing a checkpoint bool through args?

Done. Agreed that this is better.


pkg/storage/engine/engine_test.go, line 804 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

sorry, what's going on here?

Removed. The comment was saying "I want to use runWithAllEngines so that pebble automatically gets this test as well, but we need a version of runWithAllEngines that uses non-mem engines". Not worth it right now. The method is in the interface, so pebble will have to learn it anyway.


pkg/storage/engine/engine_test.go, line 821 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

can you open and read the checkpoint you've just created? (as a RocksDB engine)

Done.

@tbg (Member, Author) left a comment

PS not sure why I LGTM'ed myself. The button looked too inviting, I guess.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @nvanbenschoten, and @tbg)

This takes a checkpoint on the nodes with replicas of a failing range,
before the failure leads to nodes shutting down. The checkpoint will,
for the replicas of the affected range, be taken at the same Raft log
position.

Release note: None
@tbg tbg force-pushed the fix/rocks-checkpoints branch from 4e8d768 to 8b1fc16 Compare April 17, 2019 17:53
@andreimatei (Contributor)

LGTM

@tbg (Member, Author) commented Apr 17, 2019

TFTR!

bors r=andreimatei

@craig (craig bot, Contributor) commented Apr 17, 2019

👎 Rejected by code reviews

@tbg (Member, Author) commented Apr 17, 2019

bors r=andreimatei

@andreimatei (Contributor) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @nvanbenschoten, and @tbg)

craig bot pushed a commit that referenced this pull request Apr 17, 2019
36867: storage: take an engine checkpoint during failing consistency checks r=andreimatei a=tbg

This takes a checkpoint on the nodes with replicas of a failing range,
before the failure leads to nodes shutting down. The checkpoint will, for
the replicas of the affected range, be taken at the same Raft log position.

Release note: None

36915: changefeedccl: remove deprecated changefeed.min_high_water metric r=nvanbenschoten a=danhhz

It was replaced by max_behind_nanos in 19.1.

Release note (backwards-incompatible change): Removed the deprecated
`changefeed.min_high_water` metric; use `changefeed.max_behind_nanos`
instead.

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
@craig (craig bot, Contributor) commented Apr 17, 2019

Build succeeded
