kvserver,admission: kvserver changes to plumb write bytes to admission control #83937
craig[bot] merged 1 commit into cockroachdb:master
Conversation
tbg
left a comment
my one concern is the addition of a heap allocation on the write path. I'm not sure if this is a true concern, maybe a q for @nvanbenschoten who has better intuition about these things.
If need be, we could amortize the allocation, or use a sync.Pool.
Reviewed 9 of 9 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif and @sumeerbhola)
pkg/kv/kvserver/replica_raft.go line 166 at r1 (raw file):
```go
proposal.command.WriteBatch.Size(),
)
writeBytes := &StoreWriteBytes{}
```
This allocation can probably be amortized with one already happening.
pkg/kv/kvserver/replica_raft.go line 176 at r1 (raw file):
```go
// return a proposal result immediately on the proposal's done channel.
// The channel's capacity will be large enough to accommodate this.
if ba.AsyncConsensus {
```
If we hit this path, there is behavior that could be described as odd, which is that we'll tell admission control about some write that may not even have hit the engine. I think it's ok but maybe should be mentioned somewhere in the documentation of this work.
pkg/util/admission/work_queue.go line 1643 at r1 (raw file):
```go
type StoreWorkDoneInfo struct {
	// The size of the batch, for normal writes. Zero if there were no normal
```
nit: pebble write batch
nit: "Zero if the write batch was empty (i.e. SST ingestion)"
Force-pushed 3d7b1e4 to f094192
sumeerbhola
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @tbg)
pkg/kv/kvserver/replica_raft.go line 166 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This allocation can probably be amortized with one already happening.
I added a sync.Pool, since I couldn't find an obvious way to amortize.
pkg/kv/kvserver/replica_raft.go line 176 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
If we hit this path, there is behavior that could be described as odd, which is that we'll tell admission control about some write that may not even have hit the engine. I think it's ok but maybe should be mentioned somewhere in the documentation of this work.
Added a comment
pkg/util/admission/work_queue.go line 1643 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: pebble write batch
nit: "Zero if the write batch was empty (i.e. SST ingestion)"
Done
irfansharif
left a comment
and sstable ingestion may consume less tokens [..] we could have provided the sstable
information earlier (at admission time)
Are we more likely to over admit as a result? Or are the timescales where post-hoc corrections happen small enough to make this irrelevant?
tbg
left a comment
Reviewed 2 of 2 files at r2, 6 of 6 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola)
pkg/kv/kvserver/replica_raft.go line 85 at r3 (raw file):
```go
// ReleaseStoreWriteBytes returns the *StoreWriteBytes to the pool.
func ReleaseStoreWriteBytes(wb *StoreWriteBytes) {
```
Could we have `(*StoreWriteBytes).Release` instead of this ad-hoc method?
I think that means switching the def to

```go
type StoreWriteBytes TheActualType // i.e. no `=`
```

or, if we have any methods on StoreWriteBytes today, `struct{TheActualType}`.
Force-pushed bfb8412 to 6c857b8
sumeerbhola
left a comment
Are we more likely to over admit as a result? Or are the timescales where post-hoc corrections happen small enough to make this irrelevant?
We may over admit or under admit, since at admission time we will still use estimates, which could be under or over. We currently only use estimates (never any corrections), so we should not regress.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)
pkg/kv/kvserver/replica_raft.go line 85 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Could we have `(*StoreWriteBytes).Release` instead of this ad-hoc method? I think that means switching the def to `type StoreWriteBytes TheActualType` (i.e. no `=`) or, if we have any methods on StoreWriteBytes today, `struct{TheActualType}`.
Done
kvserver,admission: kvserver changes to plumb write bytes to admission control

We introduced changes to StoreWorkQueue for kvserver to provide actual byte size values for sstable ingestion and how many bytes were ingested into L0, but we did not make the corresponding changes in kvserver to provide these.

Since then, we've noticed tiny writes (like TruncateLog) consume a disproportionate number of write tokens (in cockroachdb#82536). The current StoreWorkQueue interface does not have the ability to do anything reasonable for such non-ingest requests. Additionally, providing the information of how many bytes were ingested into L0 requires plumbing through raft replication and state machine application (which we wish to avoid).

In the context of the WIP PR for disk bandwidth as a bottleneck resource, an alternative scheme suggested itself. This PR has the kvserver plumbing changes for this alternative scheme, so that its palatability can be confirmed before making the changes in admission control.

This scheme can be summarized as:
- No byte information is provided at admission time. Admission control uses estimates based on past behavior. So tiny writes that are not the norm (like TruncateLog) may consume more tokens than needed, and sstable ingestion may consume fewer tokens.
- When the admitted work is done, the size of the batch (for regular writes) and sstable size (for ingest) is provided to admission control to fix what was consumed earlier. Note that we could have provided the sstable information earlier (at admission time), but this keeps everything in one place.
- How much was ingested into L0 is never specified on a per-admission basis, since we avoid plumbing through replication and application. Instead, estimates based on the past will be consistently used for this.

Release note: None
Force-pushed 6c857b8 to eca2bf4
sumeerbhola
left a comment
TFTRs!
I've squashed the commits.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)
bors r=tbg,irfansharif

Build failed (retrying...)

Build failed (retrying...)

Build failed (retrying...)

Build succeeded
The existing scheme for byte token estimation simply looked at the total bytes added to L0 and divided it among the number of requests. This was because (a) the parameters to provide better size information for the request were not populated by kvserver, and (b) the basic estimation approach was flawed, since it assumed that regular writes would be roughly equal sized and that ingests would tell what fraction went into L0.

The kvserver-side plumbing for improving (a) was done in a preceding PR (cockroachdb#83937). This one completes that plumbing to pass on admission.StoreWorkDoneInfo to the admission control package. In this scheme the {WriteBytes,IngestedBytes} are provided post-proposal evaluation, and the IngestedBytes is for the whole LSM.

This PR makes changes to the plumbing in the admission package: specifically, the post-work-done token adjustments are performed via the granterWithStoreWriteDone interface and the addition to granterWithIOTokens. The former also returns the token adjustments to StoreWorkQueue so that the per-tenant fairness accounting in WorkQueue can be updated.

The main changes in this PR are in the byte token estimation logic in the admission package, where the estimation now uses a linear model y = a.x + b, where x is the bytes provided in admission.StoreWorkDoneInfo, and y is the bytes added to L0 via write or ingestion. If we consider regular writes, one can expect that even with many different sized workloads concurrently active on a node, we should be able to fit a model where a is roughly 2 and b is tiny -- this is because x is the bytes written to the raft log and does not include the subsequent state machine application. Similarly, one can expect the a term to be in the interval [0,1] for ingested work. The linear model is meant to fix flaw (b) mentioned earlier.

The current linear model fitting in store_token_estimation.go is very simple and can be independently improved in the future -- there are code comments outlining this. Additionally, all the byte token estimation logic in granter.go has been removed, which is better from a code readability perspective.

This change was evaluated with a single node that first saw a kv0 workload that writes 64KB blocks, then additionally a kv0 workload that writes 4KB blocks, and finally a third workload that starts doing an index backfill due to creating an index on the v column in the kv table.

Here are snippets from a sequence of log statements when only the first workload (64KB writes) was running:

```
write-model 1.46x+1 B (smoothed 1.50x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 78 KiB
write-model 1.37x+1 B (smoothed 1.36x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 80 KiB
write-model 1.50x+1 B (smoothed 1.43x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 79 KiB
write-model 1.39x+1 B (smoothed 1.30x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 77 KiB
```

Note that the parameter a, in a.x, does fluctuate. The additive value b stays at the minimum of 1 byte, which is desirable. There is no change to the starting ingest model since there are no ingestions.

After both the 4KB and 64KB writes are active, the log statements look like:

```
write-model 1.85x+1 B (smoothed 1.78x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 59 KiB
write-model 1.23x+1 B (smoothed 1.51x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 47 KiB
write-model 1.21x+1 B (smoothed 1.36x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 40 KiB
```

Note that the b value stays at 1 byte. The tokens consumed at admission time are evenly divided among requests, so the value has dropped.

When the index backfill is also running, the sstables are ingested into L5 and L6, so the x value in the ingested model is high, but what is ingested into L0 is low, which means a becomes very small for the ingested-model -- see the smoothed 0.00x+1 B below. There is choppiness in this experiment wrt the write model and the at-admission-tokens, which is caused by a high number of write stalls. This was not planned for, and is a side-effect of huge Pebble manifests caused by 64KB keys. So ignore those values in the following log statements.

```
write-model 1.93x+1 B (smoothed 1.56x+2 B) + ingested-model 0.00x+1 B (smoothed 0.00x+1 B) + at-admission-tokens 120 KiB
write-model 2.34x+1 B (smoothed 1.95x+1 B) + ingested-model 0.00x+1 B (smoothed 0.00x+1 B) + at-admission-tokens 157 KiB
```

Fixes cockroachdb#79092
Informs cockroachdb#82536

Release note: None
84761: schematelemetry,eventpb: add schema telemetry r=postamar a=postamar

This commit adds:
- the event definitions and logic for generating them,
- the scheduling and jobs boilerplate to periodically log them.

Care is taken to redact all strings present in descriptors which might unintentionally be leaking PII. The event generation logic is tested on the schema of a bootstrapped test cluster: the test checks that the events match expectations.

Fixes #84284.

Release note (general change): CRDB will now collect schema info if phoning home is enabled. This schema info is added to the telemetry log by a built-in scheduled job which runs on a weekly basis by default. This recurrence can be changed via the sql.schema.telemetry.recurrence cluster setting. The schedule can also be paused via PAUSE SCHEDULE followed by its ID, which can be retrieved by querying SELECT * FROM [SHOW SCHEDULES] WHERE label = 'sql-schema-telemetry'.

85059: admission,kvserver: improved byte token estimation for writes r=irfansharif,tbg a=sumeerbhola

Fixes #79092
Informs #82536

Release note: None

Co-authored-by: Marius Posta <marius@cockroachlabs.com>
Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
```diff
 // any !Always EndTxnIntent can't be cleaned up until after the
 // command succeeds.
-return nil, nil, "", roachpb.NewErrorf("cannot perform consensus asynchronously for "+
+return nil, nil, "", writeBytes, roachpb.NewErrorf("cannot perform consensus asynchronously for "+
```
I assume returning writeBytes in this error path was unintentional, since it's just getting discarded at the caller. I'll remove it in some later PR.
Yes, I believe it was unintentional.