kvserver: test *Replica like *raft.RawNode #105177
Description
There are lots of "situations of interest" that we attempt to exercise by setting
up a TestCluster and an often complicated web of interceptors, all to
enact a desired "interesting" state of the system. This often involves a
delicate dance between wanting to get into this state quickly (not adding a
slow test) but also not having it flake due to excessively low timeout (or
because some unexpected moving part interferes with the test). There is a time
and place for such a test, but we are currently forced into them because we
often lack a direct way to test at more intermediate levels. This is a problem.
We've seen the benefits of architecting for testability in etcd-io/raft
(applied to *RawNode). A *RawNode has no concurrency, i.e. all access to it
is serial. A set of *RawNodes forms a raft group. The test harness controls
in which order messages get delivered and processed. Verbose logs that result
from handling messages become part of the datadriven output. Everything is
deterministic. The test harness (introduced by us!) combined with the
concurrency-free, deterministic architecture has significantly improved the
ease and quality of testing in etcd-io/raft.
It stands to reason that the same approach could yield similar benefits if
applied to at least the lower slices in the Replica.Send stack.
Squinting a little (lot), a *Replica that is only accessed in a
single-threaded way is just a *RawNode: it can receive various kinds of
messages (e.g. BatchRequest, raftpb.Message, etc), which it handles
primarily via the raft handling loop; a set of Replicas (for the same range) is
like a Raft group.
Not all of the surface of Replica lends itself to this approach (after all,
some bugs will only surface with concurrency on the Replica, some requests
today trigger work on a goroutine, etc), but my intuition and experience
is that each slice of the stack we make accessible to the above
strategy will confer a significant improvement in quality.
This broad suggestion leaves much to the imagination and perhaps doesn't make
it clear that in its entirety, this is a significant undertaking. However, small
slices of the stack can be tackled in a bottom-up order and in fact, this is
already happening and has been well-received by all participants:
- as part of kvserver: separate raft log #16624, we moved much of the log handling out of the main raft ready loop, into library methods that we could now unit test.
- progress was made towards the same goal for log application, and more recently
- kvserver: add TestCreateManyUnappliedProbes #104401 made progress for evaluating commands and adding them into the log.
If we could unit test log application, conceivably it would be much easier to
do projects such as #97779 because
we'd be able to directly explore the edge cases in this code, whereas currently
we have to rely on contrived, hard-to-maintain (and non-exhaustive)
tests, plus randomized tests (which come with their own challenges).
A first "slice" to tackle could be improving the test in
#104401: it creates an
"interesting" raft log programmatically (without even a *Replica), but
currently jumps through some hoops to do so. We could fill these gaps by
successive small refactors, until we have the ability to create arbitrary Raft
logs.
If we then tackled #75729,
i.e. gain the ability to also "apply" these logs in a setting with few moving
parts, we are well underway to being able to comprehensively test the apply
pipeline, which is just not possible today.
We could then work on our ability to instantiate a *Replica "without moving
parts" (i.e. no queues, Raft scheduler, etc) and start writing deterministic
tests.
Jira issue: CRDB-28912