cli: add debug send-kv-batch command to run arbitrary BatchRequest#72732
cli: add debug send-kv-batch command to run arbitrary BatchRequest#72732craig[bot] merged 2 commits intocockroachdb:masterfrom
debug send-kv-batch command to run arbitrary BatchRequest#72732Conversation
4078316 to
f31b9fe
Compare
debug send-kv-batch command to run BatchRequestsdebug send-kv-batch command to run arbitrary BatchRequest
7ca1320 to
66e0052
Compare
tbg
left a comment
There was a problem hiding this comment.
Nice, LGTM and I'll defer to @knz about making sure this is documented and set up in the right way.
I did poke around a bit and ended up rewriting the test a bit, pushed this as a separate commit. Take it or leave it, fine either way.
Reviewed 15 of 15 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @dt, @erikgrinaker, @knz, and @tbg)
docs/generated/eventlog.md, line 168 at r1 (raw file):
| `Timestamp` | The timestamp of the event. Expressed as nanoseconds since the Unix epoch. | no | | `EventType` | The type of the event. | no | | `NodeID` | The node ID where the event was originated. | no |
where the event originated
pkg/cli/debug.go, line 1505 at r1 (raw file):
} var debugSendKVBatchCmd = &cobra.Command{
mind putting this and the impl + tests in a new files debug_send_kv_batch{,_test}.go?
pkg/cli/debug.go, line 1511 at r1 (raw file):
RunE: runSendKVBatch, Long: ` Submits a JSON-formatted roachpb.BatchRequest, given as a file or via stdin,
The spaces at the beginning of each line shouldn't be there IMO
pkg/cli/debug_test.go, line 522 at r1 (raw file):
require.Contains(t, output, "ERROR: invalid JSON") // Unknown JSON field should error
.
pkg/server/admin.go, line 2627 at r1 (raw file):
} nodeID := int32(s.server.NodeID()) event := &eventpb.DebugSendKvBatch{
customname DebugSendKVBatch?
pkg/server/admin.go, line 2644 at r1 (raw file):
}) if err != nil { log.Warningf(ctx, "failed to log SendKVBatch to event log, emitting below: %v", err)
I think structured events are unlikely to hit the same log as a Warning, so the "below" may not make sense. Maybe "emitting as structured event"?
pkg/util/log/eventpb/debug_events.proto, line 32 at r1 (raw file):
// CommonDebugEventDetails contains the fields common to all debugging events. message CommonDebugEventDetails { // The node ID where the event was originated.
-was
erikgrinaker
left a comment
There was a problem hiding this comment.
Heh, your test hit what I expect will be a common mistake: Protobuf structs cannot be (un)marshalled using the standard Golang JSON marshaller. You must use the JSON marshaller that's shipped with the Protobuf library, e.g. https://pkg.go.dev/github.com/gogo/protobuf/jsonpb.
I'll update the test to use this, with a big fat warning to anyone following its example.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt, @knz, and @tbg)
pkg/cli/debug.go, line 1505 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
mind putting this and the impl + tests in a new files
debug_send_kv_batch{,_test}.go?
Yup, done.
pkg/cli/debug.go, line 1511 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
The spaces at the beginning of each line shouldn't be there IMO
Ah, my bad -- thought I'd seen other commands indent it.
pkg/server/admin.go, line 2627 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
customname
DebugSendKVBatch?
I don't think customname can be applied to messages, only fields. Unfortunately, we need the Kv here to get the snake-cased event type generated correctly.
pkg/server/admin.go, line 2644 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I think structured events are unlikely to hit the same log as a Warning, so the "below" may not make sense. Maybe "emitting as structured event"?
Right, good catch.
f7d2859 to
9add527
Compare
98d0452 to
7769e60
Compare
knz
left a comment
There was a problem hiding this comment.
Reviewed 14 of 15 files at r1, 9 of 9 files at r3, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @dt, @erikgrinaker, and @tbg)
pkg/server/admin.go, line 2637 at r1 (raw file):
BatchRequest: string(baJSON), } err = contextutil.RunWithTimeout(ctx, "log-event", 10*time.Second, func(ctx context.Context) error {
Can you explain to me the wisdom of using system.eventlog here, while we're generally trying to get away from using this table (also, observing that if one finds themselves in need to reach out to this debug facility, chances are this write may be impossible).
pkg/server/admin.go, line 2645 at r1 (raw file):
if err != nil { log.Warningf(ctx, "failed to log SendKVBatch to event log, emitting below: %v", err) log.StructuredEvent(ctx, event)
The structured event should be logged in any case.
I think you're relying on the idea that InsertEventRecord will also call log.StructuredEvent above? But then that needs to be clarified.
(I'd argue all this could be simplified by only calling log.StructuredEvent here.)
Also, the structured logging needs a test.
erikgrinaker
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt, @erikgrinaker, @knz, and @tbg)
pkg/server/admin.go, line 2637 at r1 (raw file):
Previously, knz (kena) wrote…
Can you explain to me the wisdom of using system.eventlog here, while we're generally trying to get away from using this table (also, observing that if one finds themselves in need to reach out to this debug facility, chances are this write may be impossible).
I was under the impression that system.eventlog is part of our event/audit logging facilities. You seem to be saying that it's unnecessary or deprecated here, so I'll remove it.
7769e60 to
bbf5b56
Compare
erikgrinaker
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt, @knz, and @tbg)
pkg/server/admin.go, line 2645 at r1 (raw file):
Previously, knz (kena) wrote…
The structured event should be logged in any case.
I think you're relying on the idea that InsertEventRecord will also call log.StructuredEvent above? But then that needs to be clarified.
(I'd argue all this could be simplified by only calling log.StructuredEvent here.)Also, the structured logging needs a test.
Done.
bbf5b56 to
d5c8af7
Compare
d5c8af7 to
4b62fc6
Compare
|
@knz Any further thoughts on this? |
knz
left a comment
There was a problem hiding this comment.
Sorry for the tardiness!
Reviewed 18 of 18 files at r4, 17 of 17 files at r5, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @dt)
`jsonpb.Marshal` is used for JSON-marshaling of Protobuf structs in accordance with the Protobuf spec. Release note: None
This patch adds a CLI command `debug send-kv-batch` that takes a JSON `BatchRequest` as input, proxies it into KV via the connected node's admin server, and outputs a JSON `BatchResponse`. An admin user is required, typically the root user, and the request is logged to the system event log for auditability. This is a useful backdoor to access arbitrary KV functionality for debugging and surgery, e.g. during support escalations. Release note: None
4b62fc6 to
ee1624a
Compare
|
TFTRs! bors r=knz,tbg |
|
Btw, demo about this on next weekly? This is cool stuff.
(sent from mobile, please excuse my brevity)
…On Thu, Nov 25, 2021, 20:24 Erik Grinaker ***@***.***> wrote:
TFTRs!
bors r=knz,tbg
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#72732 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABGXPZG6AUBLEN7OVJ5HLS3UN2EP7ANCNFSM5H77LVIQ>
.
|
Sure thing. It's pretty trivial tbh, but extremely useful. |
|
Build succeeded: |
This commit introduces a new KV request type `ProbeRequest`. This request performs a mutationless round-trip through the replication layer. The proximate reason to add this request is cockroachdb#33007, where we want to fail-fast requests on a Replica until a `ProbeRequest` succeeds. The alternative to introducing a new request type is using a a `Put` on `keys.RangeProbeKey` (a key introduced for [kvprober]). However, when the Replica circuit breakers perform a probe, they have to make sure that this probe itself doesn't get caught up in the circuit breakers. In particular, the circuit breaker's request needs special casing with respect to lease requests (which would otherwise also bounce on the circuit breaker). But even with that added (not too large a change), the lease would be required, so we wouldn't be able to apply the circuit breakers anywhere but the leaseholder (i.e. we would have to heal the breaker when the lease is "not obtainable". What we are after is really a per-Replica concept that is disjoint from the lease, where we check that the replica can manage to get a command into the log and observe it coming out of the replication layer. `ProbeRequest` bypasses lease checks and can thus be processed by any Replica. Upon application, they are guaranteed to result in a forced error, and we suppress this error above the replication layer (i.e. at the waiting request). This means that receiving a nil error from the probe implies that the probe successfully made it through latching (where it declares a read at its timestamp on the probed key), evaluation, and application. While no concrete plans exist for adoption of `ProbeRequest` elsewhere, this might be of interest for [kvprober] and as a general building block for quickly assessing overall KV health; for example we may extend cockroachdb#72732 to facilitate sending a probe to the leaseholders of all Ranges in the system. `ProbeRequest` is a point request. A ranged version is possible but since `DistSender` will not return partial responses, it is not introduced at this point. This can be done when a concrete need arises. `ProbeRequest` by default will be routed to the leaseholder, but it can also be used with the `NEAREST` routing policy. Some details helpful for reviewers follow. - to enable testing of this new request type, `TestingApplyFilter` was augmented with a `TestingApplyForcedErrFilter` which gets to inspect and mutate forced errors below raft. The semantics of `TestingApplyFitler` remain unchainged; the newly added field `ForcedErr` is always nil when handed to such a filter. - `ProbeRequest` is the first request type that skips the lease check and is not itself a lease check. This prompted a small refactor to `GetPrevLeaseForLeaseRequest`, which can now return false when the underlying request is a `Probe` and not `{Transfer,Request}Lease`. [kvprober]: https://github.com/cockroachdb/cockroach/tree/2ad2bee257e78970ce2c457ddd6996099ed6727a/pkg/kv/kvprober [routing policy]: https://github.com/cockroachdb/cockroach/blob/e1c7d6a55ed147aeb20d6d5c564d3d9bf73d2624/pkg/roachpb/api.proto#L48-L60 Release note: None
This commit introduces a new KV request type `ProbeRequest`. This request performs a mutationless round-trip through the replication layer. The proximate reason to add this request is cockroachdb#33007, where we want to fail-fast requests on a Replica until a `ProbeRequest` succeeds. The alternative to introducing a new request type is using a `Put` on `keys.RangeProbeKey` (a key introduced for [kvprober]). However, when the Replica circuit breakers perform a probe, they have to make sure that this probe itself doesn't get caught up in the circuit breakers. In particular, the circuit breaker's request needs special casing with respect to lease requests (which would otherwise also bounce on the circuit breaker). But even with that added (not too large a change), the lease would be required, so we wouldn't be able to apply the circuit breakers anywhere but the leaseholder (i.e. we would have to heal the breaker when the lease is "not obtainable". What we are after is really a per-Replica concept that is disjoint from the lease, where we check that the replica can manage to get a command into the log and observe it coming out of the replication layer. `ProbeRequest` bypasses lease checks and can thus be processed by any Replica. Upon application, they are guaranteed to result in a forced error, and we suppress this error above the replication layer (i.e. at the waiting request). This means that receiving a nil error from the probe implies that the probe successfully made it through latching (where it declares a read at its timestamp on the probed key), evaluation, and application. While no concrete plans exist for adoption of `ProbeRequest` elsewhere, this might be of interest for [kvprober] and as a general building block for quickly assessing overall KV health; for example we may extend cockroachdb#72732 to facilitate sending a probe to the leaseholders of all Ranges in the system. `ProbeRequest` is a point request. A ranged version is possible but since `DistSender` will not return partial responses, it is not introduced at this point. This can be done when a concrete need arises. `ProbeRequest` by default will be routed to the leaseholder, but it can also be used with the `NEAREST` routing policy. Some details helpful for reviewers follow. - to enable testing of this new request type, `TestingApplyFilter` was augmented with a `TestingApplyForcedErrFilter` which gets to inspect and mutate forced errors below raft. The semantics of `TestingApplyFilter` remain unchainged; the newly added field `ForcedErr` is always nil when handed to such a filter. - `ProbeRequest` is the first request type that skips the lease check and is not itself a lease check. This prompted a small refactor to `GetPrevLeaseForLeaseRequest`, which can now return false when the underlying request is a `Probe` and not `{Transfer,Request}Lease`. [kvprober]: https://github.com/cockroachdb/cockroach/tree/2ad2bee257e78970ce2c457ddd6996099ed6727a/pkg/kv/kvprober [routing policy]: https://github.com/cockroachdb/cockroach/blob/e1c7d6a55ed147aeb20d6d5c564d3d9bf73d2624/pkg/roachpb/api.proto#L48-L60 Release note: None
This commit introduces a new KV request type `ProbeRequest`. This request performs a mutationless round-trip through the replication layer. The proximate reason to add this request is cockroachdb#33007, where we want to fail-fast requests on a Replica until a `ProbeRequest` succeeds. The alternative to introducing a new request type is using a `Put` on `keys.RangeProbeKey` (a key introduced for [kvprober]). However, when the Replica circuit breakers perform a probe, they have to make sure that this probe itself doesn't get caught up in the circuit breakers. In particular, the circuit breaker's request needs special casing with respect to lease requests (which would otherwise also bounce on the circuit breaker). But even with that added (not too large a change), the lease would be required, so we wouldn't be able to apply the circuit breakers anywhere but the leaseholder (i.e. we would have to heal the breaker when the lease is "not obtainable". What we are after is really a per-Replica concept that is disjoint from the lease, where we check that the replica can manage to get a command into the log and observe it coming out of the replication layer. `ProbeRequest` bypasses lease checks and can thus be processed by any Replica. Upon application, they are guaranteed to result in a forced error, and we suppress this error above the replication layer (i.e. at the waiting request). This means that receiving a nil error from the probe implies that the probe successfully made it through latching (where it doesn't declare any key, i.e. should not conflict with long- evaluating requests), evaluation, and application. While no concrete plans exist for adoption of `ProbeRequest` elsewhere, this might be of interest for [kvprober] and as a general building block for quickly assessing overall KV health; for example we may extend cockroachdb#72732 to facilitate sending a probe to the leaseholders of all Ranges in the system. `ProbeRequest` is a point request. A ranged version is possible but since `DistSender` will not return partial responses, it is not introduced at this point. This can be done when a concrete need arises. `ProbeRequest` by default will be routed to the leaseholder, but it can also be used with the `NEAREST` routing policy. Some details helpful for reviewers follow. - to enable testing of this new request type, `TestingApplyFilter` was augmented with a `TestingApplyForcedErrFilter` which gets to inspect and mutate forced errors below raft. The semantics of `TestingApplyFilter` remain unchainged; the newly added field `ForcedErr` is always nil when handed to such a filter. - `ProbeRequest` is the first request type that skips the lease check and is not itself a lease check. This prompted a small refactor to `GetPrevLeaseForLeaseRequest`, which can now return false when the underlying request is a `Probe` and not `{Transfer,Request}Lease`. [kvprober]: https://github.com/cockroachdb/cockroach/tree/2ad2bee257e78970ce2c457ddd6996099ed6727a/pkg/kv/kvprober [routing policy]: https://github.com/cockroachdb/cockroach/blob/e1c7d6a55ed147aeb20d6d5c564d3d9bf73d2624/pkg/roachpb/api.proto#L48-L60 Release note: None
72972: kvserver: introduce ProbeRequest r=erikgrinaker a=tbg This commit introduces a new KV request type `ProbeRequest`. This request performs a mutationless round-trip through the replication layer. The proximate reason to add this request is #33007, where we want to fail-fast requests on a Replica until a `ProbeRequest` succeeds. The alternative to introducing a new request type is using a a `Put` on `keys.RangeProbeKey` (a key introduced for [kvprober]). However, when the Replica circuit breakers perform a probe, they have to make sure that this probe doesn't get caught up in the circuit breakers themselves. In particular, the circuit breaker's request needs special casing with respect to lease requests (which would otherwise also bounce on the circuit breaker), and in particular we would also like the probe to be proposed (via the local raft instance) when the local replica is not the leaseholder, as this allows the circuit breakers to cover more ground and ascertain that the Replica has access to the replication layer regardless of its status. `ProbeRequest` bypasses lease checks and can thus be processed by any Replica. Upon application, they are guaranteed to result in a forced error, and we suppress this error above the replication layer (i.e. at the waiting request). This means that receiving a nil error from the probe implies that the probe successfully made it through latching, evaluation, and application. While no concrete plans exist for adoption of `ProbeRequest` elsewhere, this might be of interest for [kvprober] and as a general building block for quickly assessing overall KV health; for example we may extend #72732 to facilitate sending a probe to the leaseholders of all Ranges in the system. `ProbeRequest` is a point request. A ranged version is possible but since `DistSender` will not return partial responses, it is not introduced at this point. This can be done when a concrete need arises. `ProbeRequest` by default will be routed to the leaseholder, but it can also be used with the `NEAREST` routing policy. [kvprober]: https://github.com/cockroachdb/cockroach/tree/2ad2bee257e78970ce2c457ddd6996099ed6727a/pkg/kv/kvprober [routing policy]: https://github.com/cockroachdb/cockroach/blob/e1c7d6a55ed147aeb20d6d5c564d3d9bf73d2624/pkg/roachpb/api.proto#L48-L60 Release note: None Co-authored-by: Tobias Grieger <tobias.schottdorf@gmail.com> Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
This commit introduces a new KV request type `ProbeRequest`. This request performs a mutationless round-trip through the replication layer. The proximate reason to add this request is cockroachdb#33007, where we want to fail-fast requests on a Replica until a `ProbeRequest` succeeds. The alternative to introducing a new request type is using a `Put` on `keys.RangeProbeKey` (a key introduced for [kvprober]). However, when the Replica circuit breakers perform a probe, they have to make sure that this probe itself doesn't get caught up in the circuit breakers. In particular, the circuit breaker's request needs special casing with respect to lease requests (which would otherwise also bounce on the circuit breaker). But even with that added (not too large a change), the lease would be required, so we wouldn't be able to apply the circuit breakers anywhere but the leaseholder (i.e. we would have to heal the breaker when the lease is "not obtainable". What we are after is really a per-Replica concept that is disjoint from the lease, where we check that the replica can manage to get a command into the log and observe it coming out of the replication layer. `ProbeRequest` bypasses lease checks and can thus be processed by any Replica. Upon application, they are guaranteed to result in a forced error, and we suppress this error above the replication layer (i.e. at the waiting request). This means that receiving a nil error from the probe implies that the probe successfully made it through latching (where it doesn't declare any key, i.e. should not conflict with long- evaluating requests), evaluation, and application. While no concrete plans exist for adoption of `ProbeRequest` elsewhere, this might be of interest for [kvprober] and as a general building block for quickly assessing overall KV health; for example we may extend cockroachdb#72732 to facilitate sending a probe to the leaseholders of all Ranges in the system. `ProbeRequest` is a point request. A ranged version is possible but since `DistSender` will not return partial responses, it is not introduced at this point. This can be done when a concrete need arises. `ProbeRequest` by default will be routed to the leaseholder, but it can also be used with the `NEAREST` routing policy. Some details helpful for reviewers follow. - to enable testing of this new request type, `TestingApplyFilter` was augmented with a `TestingApplyForcedErrFilter` which gets to inspect and mutate forced errors below raft. The semantics of `TestingApplyFilter` remain unchainged; the newly added field `ForcedErr` is always nil when handed to such a filter. - `ProbeRequest` is the first request type that skips the lease check and is not itself a lease check. This prompted a small refactor to `GetPrevLeaseForLeaseRequest`, which can now return false when the underlying request is a `Probe` and not `{Transfer,Request}Lease`. [kvprober]: https://github.com/cockroachdb/cockroach/tree/2ad2bee257e78970ce2c457ddd6996099ed6727a/pkg/kv/kvprober [routing policy]: https://github.com/cockroachdb/cockroach/blob/e1c7d6a55ed147aeb20d6d5c564d3d9bf73d2624/pkg/roachpb/api.proto#L48-L60 Release note: None
This patch adds a CLI command
debug send-kv-batchthat takes a JSONBatchRequestas input, proxies it into KV via the connected node'sadmin server, and outputs a JSON
BatchResponse. An admin user isrequired, typically the root user, and the request is logged to the
system event log for auditability.
This is a useful backdoor to access arbitrary KV functionality for
debugging and surgery, e.g. during support escalations.
Resolves #66829.
Resolves #67001.
Release note: None
@knz It does not seem like admin commands generally allow specifying a user (e.g. via
--user), and implicitly require the root user, so I followed that convention here as well. Let me know if we need to extend this.