kvserver,admission: kvserver changes to plumb write bytes to admission control #83937
craig[bot] merged 1 commit into cockroachdb:master
Conversation
tbg
left a comment
my one concern is the addition of a heap allocation on the write path. I'm not sure if this is a true concern, maybe a q for @nvanbenschoten who has better intuition about these things.
If need be, we could amortize the allocation, or use a sync.Pool.
Reviewed 9 of 9 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif and @sumeerbhola)
pkg/kv/kvserver/replica_raft.go line 166 at r1 (raw file):
```go
proposal.command.WriteBatch.Size(),
)
writeBytes := &StoreWriteBytes{}
```
This allocation can probably be amortized with one already happening.
pkg/kv/kvserver/replica_raft.go line 176 at r1 (raw file):
```go
// return a proposal result immediately on the proposal's done channel.
// The channel's capacity will be large enough to accommodate this.
if ba.AsyncConsensus {
```
If we hit this path, there is behavior that could be described as odd, which is that we'll tell admission control about some write that may not even have hit the engine. I think it's ok but maybe should be mentioned somewhere in the documentation of this work.
pkg/util/admission/work_queue.go line 1643 at r1 (raw file):
```go
type StoreWorkDoneInfo struct {
	// The size of the batch, for normal writes. Zero if there were no normal
```
nit: pebble write batch
nit: "Zero if the write batch was empty (i.e. SST ingestion)"
Force-pushed 3d7b1e4 to f094192
sumeerbhola
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @tbg)
pkg/kv/kvserver/replica_raft.go line 166 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This allocation can probably be amortized with one already happening.
I added a sync.Pool, since I couldn't find an obvious way to amortize.
pkg/kv/kvserver/replica_raft.go line 176 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
If we hit this path, there is behavior that could be described as odd, which is that we'll tell admission control about some write that may not even have hit the engine. I think it's ok but maybe should be mentioned somewhere in the documentation of this work.
Added a comment
pkg/util/admission/work_queue.go line 1643 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: pebble write batch
nit: "Zero if the write batch was empty (i.e. SST ingestion)"
Done
irfansharif
left a comment
and sstable ingestion may consume less tokens [..] we could have provided the sstable
information earlier (at admission time)
Are we more likely to over admit as a result? Or are the timescales where post-hoc corrections happen small enough to make this irrelevant?
tbg
left a comment
Reviewed 2 of 2 files at r2, 6 of 6 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola)
pkg/kv/kvserver/replica_raft.go line 85 at r3 (raw file):
```go
// ReleaseStoreWriteBytes returns the *StoreWriteBytes to the pool.
func ReleaseStoreWriteBytes(wb *StoreWriteBytes) {
```
Could we have `(*StoreWriteBytes).Release` instead of this ad-hoc method?
I think that means switching the def to

```go
type StoreWriteBytes TheActualType // i.e. no `=`
```

or, if we have any methods on StoreWriteBytes today, `struct{TheActualType}`.
Force-pushed bfb8412 to 6c857b8
sumeerbhola
left a comment
Are we more likely to over admit as a result? Or are the timescales where post-hoc corrections happen small enough to make this irrelevant?
We may over admit or under admit, since at admission time we will still use estimates, which could be under or over. We currently only use estimates (never any corrections), so we should not regress.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)
pkg/kv/kvserver/replica_raft.go line 85 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Could we have `(*StoreWriteBytes).Release` instead of this ad-hoc method? I think that means switching the def to `type StoreWriteBytes TheActualType` (i.e. no `=`) or, if we have any methods on StoreWriteBytes today, `struct{TheActualType}`.
Done
kvserver,admission: kvserver changes to plumb write bytes to admission control

We introduced changes to StoreWorkQueue for kvserver to provide actual byte size values for sstable ingestion and how many bytes were ingested into L0, but we did not make the corresponding changes in kvserver to provide these.

Since then, we've noticed tiny writes (like TruncateLog) consume a disproportionate number of write tokens (in cockroachdb#82536). The current StoreWorkQueue interface does not have the ability to do anything reasonable for such non-ingest requests. Additionally, providing the information of how many bytes were ingested into L0 requires plumbing through raft replication and state machine application (which we wish to avoid).

In the context of the WIP PR for disk bandwidth as a bottleneck resource, an alternative scheme suggested itself. This PR has the kvserver plumbing changes for this alternative scheme, so that its palatability can be confirmed before making the changes in admission control.

This scheme can be summarized as:
- No byte information is provided at admission time. Admission control uses estimates based on past behavior. So tiny writes that are not the norm (like TruncateLog) may consume more tokens than needed, and sstable ingestion may consume fewer tokens.
- When the admitted work is done, the size of the batch (for regular writes) and sstable size (for ingest) is provided to admission control to fix what was consumed earlier. Note that we could have provided the sstable information earlier (at admission time), but this keeps everything in one place.
- How much was ingested into L0 is never specified on a per-admission basis, since we avoid plumbing through replication and application. Instead, estimates based on the past will be consistently used for this.

Release note: None
Force-pushed 6c857b8 to eca2bf4
sumeerbhola
left a comment
TFTRs!
I've squashed the commits.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)
bors r=tbg,irfansharif

Build failed (retrying...)

Build failed (retrying...)

Build failed (retrying...)

Build succeeded
The existing scheme for byte token estimation simply looked at the total bytes added to L0 and divided it among the number of requests. This was because (a) the parameters to provide better size information for the request were not populated by kvserver, and (b) the basic estimation approach was flawed, since it assumed that regular writes would be roughly equal sized and that ingests would tell what fraction went into L0.

The kvserver-side plumbing for improving (a) was done in a preceding PR (cockroachdb#83937). This one completes that plumbing to pass on admission.StoreWorkDoneInfo to the admission control package. In this scheme the {WriteBytes,IngestedBytes} are provided post-proposal evaluation, and the IngestedBytes is for the whole LSM.

This PR makes changes to the plumbing in the admission package: specifically, the post-work-done token adjustments are performed via the granterWithStoreWriteDone interface and the addition to granterWithIOTokens. The former also returns the token adjustments to StoreWorkQueue so that the per-tenant fairness accounting in WorkQueue can be updated.

The main changes in this PR are in the byte token estimation logic in the admission package, where the estimation now uses a linear model y = a.x + b, where x is the bytes provided in admission.StoreWorkDoneInfo, and y is the bytes added to L0 via write or ingestion. If we consider regular writes, one can expect that even with many different sized workloads concurrently active on a node, we should be able to fit a model where a is roughly 2 and b is tiny -- this is because x is the bytes written to the raft log and does not include the subsequent state machine application. Similarly, one can expect the a term to be in the interval [0,1] for ingested work. The linear model is meant to fix flaw (b) mentioned earlier.

The current linear model fitting in store_token_estimation.go is very simple and can be independently improved in the future -- there are code comments outlining this. Additionally, all the byte token estimation logic in granter.go has been removed, which is better from a code readability perspective.

This change was evaluated with a single node that first saw a kv0 workload that writes 64KB blocks, then additionally a kv0 workload that writes 4KB blocks, and finally a third workload that starts doing an index backfill due to creating an index on the v column in the kv table.

Here are snippets from a sequence of log statements when only the first workload (64KB writes) was running:

```
write-model 1.46x+1 B (smoothed 1.50x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 78 KiB
write-model 1.37x+1 B (smoothed 1.36x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 80 KiB
write-model 1.50x+1 B (smoothed 1.43x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 79 KiB
write-model 1.39x+1 B (smoothed 1.30x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 77 KiB
```

Note that the parameter a, in a.x, does fluctuate. The additive value b stays at the minimum of 1 byte, which is desirable. There is no change to the starting ingest model since there are no ingestions.

After both the 4KB and 64KB writes are active, the log statements look like:

```
write-model 1.85x+1 B (smoothed 1.78x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 59 KiB
write-model 1.23x+1 B (smoothed 1.51x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 47 KiB
write-model 1.21x+1 B (smoothed 1.36x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 40 KiB
```

Note that the b value stays at 1 byte. The tokens consumed at admission time are evenly divided among requests, so the value has dropped.

When the index backfill is also running, the sstables are ingested into L5 and L6, so the x value in the ingested model is high, but what is ingested into L0 is low, which means a becomes very small for the ingested-model -- see the smoothed 0.00x+1 B below. There is choppiness in this experiment wrt the write model and the at-admission-tokens, which is caused by a high number of write stalls. This was not planned for, and is a side-effect of huge Pebble manifests caused by 64KB keys. So ignore those values in the following log statements.

```
write-model 1.93x+1 B (smoothed 1.56x+2 B) + ingested-model 0.00x+1 B (smoothed 0.00x+1 B) + at-admission-tokens 120 KiB
write-model 2.34x+1 B (smoothed 1.95x+1 B) + ingested-model 0.00x+1 B (smoothed 0.00x+1 B) + at-admission-tokens 157 KiB
```

Fixes cockroachdb#79092
Informs cockroachdb#82536

Release note: None
84761: schematelemetry,eventpb: add schema telemetry r=postamar a=postamar

This commit adds:
- the event definitions and logic for generating them,
- the scheduling and jobs boilerplate to periodically log them.

Care is taken to redact all strings present in descriptors which might unintentionally be leaking PII. The event generation logic is tested on the schema of a bootstrapped test cluster: the test checks that the events match expectations.

Fixes #84284.

Release note (general change): CRDB will now collect schema info if phoning home is enabled. This schema info is added to the telemetry log by a built-in scheduled job which runs on a weekly basis by default. This recurrence can be changed via the sql.schema.telemetry.recurrence cluster setting. The schedule can also be paused via PAUSE SCHEDULE followed by its ID, which can be retrieved by querying SELECT * FROM [SHOW SCHEDULES] WHERE label = 'sql-schema-telemetry'.

85059: admission,kvserver: improved byte token estimation for writes r=irfansharif,tbg a=sumeerbhola

Fixes #79092
Informs #82536

Release note: None

Co-authored-by: Marius Posta <marius@cockroachlabs.com>
Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
```diff
 // any !Always EndTxnIntent can't be cleaned up until after the
 // command succeeds.
-return nil, nil, "", roachpb.NewErrorf("cannot perform consensus asynchronously for "+
+return nil, nil, "", writeBytes, roachpb.NewErrorf("cannot perform consensus asynchronously for "+
```
I assume returning writeBytes in this error path was unintentional, since it's just getting discarded at the caller. I'll remove it in some later PR.
Yes, I believe it was unintentional.