[DNM]: gc garbage generator and KVDebug service #18661

Closed

tbg wants to merge 1 commit into cockroachdb:master from
Conversation
tbg (Member) commented:
This PR shows the tooling I used to [stress test the GC queue]. In short, I needed a way to put
a large number of intents on a single range; I didn't particularly care to do this on a multi-node
cluster, but I needed to do it efficiently for quick turnaround (and also to prevent the GC queue
from cleaning up my garbage faster than I could insert it).
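The last point is just a rate race: the generator only accumulates garbage if it outpaces the GC queue. A toy stdlib-only model (all names and numbers invented for illustration; this is not the actual generator):

```go
package main

import "fmt"

// garbageAfter models a writer laying down garbage at one rate while a GC
// pass collects at another, over a number of ticks. Garbage accumulates only
// when the write rate exceeds the GC rate.
func garbageAfter(ticks, writePerTick, gcPerTick int) int {
	garbage := 0
	for i := 0; i < ticks; i++ {
		garbage += writePerTick
		if garbage < gcPerTick {
			garbage = 0 // GC caught up entirely this tick
		} else {
			garbage -= gcPerTick
		}
	}
	return garbage
}

func main() {
	fmt.Println(garbageAfter(100, 10, 3)) // generator outpaces GC: garbage grows
	fmt.Println(garbageAfter(100, 3, 10)) // GC outpaces generator: stays at zero
}
```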
This was also a good opportunity to investigate "better" debugging tools and to revisit the
`ExternalServer` interface, which historically has been the KV store we once wanted to expose to
clients. It has since become internal and is technically slated for removal, but at the same time it
has seen continued use. The reasons for keeping (something like it) are:
1. debugging running clusters that are potentially wedged due to invalid KV data: being able to read
transaction entries and raw KV data that the SQL layer does not expect.
2. creating, in our testing, problematic conditions that are unattainable through the public
interfaces (artificial GC pressure being one example).
I also think there's a case for adding functionality such as being able to force a
Range to run garbage collection, etc., though that's out of scope here.
In this PR, I've sketched out a TxnCoordSender-level entry point that is tied to a bidirectional
streaming connection. This has the advantage that a context is available whose lifetime
is tied to the connection, which means that `TxnCoordSender` can base its transaction heartbeats
on it (this is not to suggest that we should be running serious transactions through this
interface, but it establishes parity and, assuming that `client.NewSender` went through this
endpoint instead, `TxnCoordSender` could be simplified to always use the incoming context). There is
more subtlety in this topic since we want to [merge] `TxnCoordSender` and `client.{DB,Txn}`, though,
so don't take this as a concrete suggestion.
What's been more immediately useful is a fairly low-level endpoint that allows evaluating a
`BatchRequest` on any given `Replica` (bypassing the command queue, etc.) and seeing the results.
More controversially, and important for `gcpressurizer`, is the ability to *execute* these batches,
something that's quite dangerous in the wrong hands due to the potential for creating inconsistency
and its insufficient synchronization with splits, etc. I think that's the part worth exploring,
since it's a universally useful last resort when things go wrong and visibility into on-disk state
is desired without shutting down the node.
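The evaluate/execute split can be illustrated with a toy store (every type and name below is invented; a real `BatchRequest` against a `Replica` is far richer): evaluation runs the batch against a copy of the state so results are visible but nothing changes, while execution mutates the state directly.

```go
package main

import "fmt"

// kv stands in for a replica's state; req for a single request in a batch.
type kv map[string]string
type req struct{ op, key, val string } // op: "put" or "get"

func apply(store kv, batch []req) []string {
	var out []string
	for _, r := range batch {
		switch r.op {
		case "put":
			store[r.key] = r.val
			out = append(out, "ok")
		case "get":
			out = append(out, store[r.key])
		}
	}
	return out
}

// evalBatch runs the batch against a copy: the caller sees the results, but
// the store is untouched -- the safe, read-only debugging mode.
func evalBatch(store kv, batch []req) []string {
	tmp := kv{}
	for k, v := range store {
		tmp[k] = v
	}
	return apply(tmp, batch)
}

// execBatch applies the batch to the store itself -- the dangerous mode, as
// nothing here coordinates with concurrent activity.
func execBatch(store kv, batch []req) []string { return apply(store, batch) }

func main() {
	store := kv{"a": "1"}
	fmt.Println(evalBatch(store, []req{{"put", "a", "2"}, {"get", "a", ""}}), store["a"])
	fmt.Println(execBatch(store, []req{{"put", "a", "2"}, {"get", "a", ""}}), store["a"])
}
```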
Long story short, I have this code and it's definitely not something to check in, but to discuss.
It'd be nice to programmatically test the GC queue in that way, and perhaps randomly "pollute"
some of our test clusters in ever-escalating ways, to improve their resilience.
[stress test the GC queue]: cockroachdb#9540
[merge]: cockroachdb#16000
tbg added a commit to tbg/cockroach that referenced this pull request on Apr 16, 2018:
A simple data generator that makes a single large range. The created dataset can then be used for various tests, for example to exercise issues such as the ones ultimately leading to cockroachdb#20589, or to make sure [large snapshots] work (once implemented).

This is a work in progress because I haven't reached clarity on the best way to hook things up in these tests. Do we want to create the datasets and upload them somewhere? That has been fragile in the past, as the upload process usually gets seldom exercised and thus rots. The alternative (which I'm leaning towards) is to bundle this binary with the test code (either explicitly or via use as a library) and create fresh test data every time (these tests would run as nightlies, so dataset generation speed isn't the top concern).

In making these decisions, we should also take into account more involved datasets that can't as easily be generated from a running cluster, such as [gcpressurizer]. For those, my current take is that we'll just generate an initialized data dir, open the resulting RocksDB instance manually again, and write straight into it (via some facility that updates stats correctly, i.e. presumably `MVCCPut` and friends).

Release note: None

[large snapshots]: cockroachdb#16954
[gcpressurizer]: cockroachdb#18661