admission,kvserver: track follower writes and range snapshot ingests #85127
craig[bot] merged 1 commit into cockroachdb:master
Conversation
This PR lacks any new tests for the admission package. I'll add those once the KV folks are comfortable with the plumbing changes made in kvserver.
- The first commit is from #85059
- The follower write accounting uses a suggestion from @nvanbenschoten in https://cockroachlabs.slack.com/archives/C038JEXC5AT/p1658763170402819?thread_ts=1657630075.576439&cid=C038JEXC5AT. See that thread for discussion on how it is hard to understand queueing behavior if the token estimation and token consumption do not account for these.
- The snapshot ingestion accounting, where we ignore those bytes, is a simplification of what was prototyped in #80607. What is done here is hopefully a reasonable intermediate step.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @nvanbenschoten, and @tbg)
Added unit tests. Experimented with this in a 3-node cluster running kv0 with 1024-byte blocks at very high concurrency. Note that the bypassed count and bytes are about 2/3rds of the write volume seen by this node (since the leaseholders were evenly distributed), and that both the write model of approx 2.0x+1B and the at-admission-tokens model of approx 2.1KiB are reasonable since we now have visibility into these follower writes.
irfansharif
left a comment
Those results look sound. Most of my comments are for my own understanding, as usual.
It would be helpful to spell out in the commit message the reason for ignoring snapshot-driven L0 bytes ingested in some of these components: if we didn't, we'd see a large increase in token estimations for regular writes, and it would take time for those estimations to return to an accurate value for just regular writes. During that time we're likely to be under-admitting work because of the over-estimates. (Is this the right way to think about it?) How much does this over-estimation matter, given that regular writes are able to return unused tokens post-work? I can see why we'd want to avoid over-estimating at all using the scheme in this PR; I just want to make sure I understand the effects of not having it.
I've also now read #80914, so have a coarse understanding of what we're building towards. I haven't read #85722 yet so don't fully see how these tokens map to underlying bandwidth use and how such bandwidth is proportioned in the face of low-volume snapshots and a large volume of snapshots with the scheme outlined in #80914.
I'm not the most familiar with the command application codepaths, but reading through that slack thread and taking a cursory look, I'm wondering if this is the best plumbing point. In its current form, if the command batch fails partway through and we don't end up applying everything, our admission stats and subsequent estimates will be off. I also don't yet know whether iterating through the commands once at the outset has measurable overhead (mentioned in a comment below). Here's my guess at a more appropriate integration point, without knowing what code changes are necessary.
- Over here we should be able to get an accurate view of follower {write,ingested} bytes + # of commands that only considers commands applied successfully (see the error handling that comes after):
cockroach/pkg/kv/kvserver/replica_raft.go
Line 1048 in 7bf6804
- To collect these stats, could we do them at the same place we're currently maintaining these stats (under batch.Stage):
cockroach/pkg/kv/kvserver/replica_application_state_machine.go
Lines 433 to 436 in a345d46
Probably the current layout is fine, especially if it keeps the code simpler.
Reviewed 2 of 17 files at r1, 19 of 20 files at r2, 1 of 1 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten, @sumeerbhola, and @tbg)
-- commits line 108 at r3:
[nit] Mention that we're doing so without seeking admission/permission (even if obvious since we're consuming tokens after having done the work).
pkg/kv/kvserver/replica_raft.go line 860 at r3 (raw file):
	return stats, getNonDeterministicFailureExplanation(err), err
}
followerWriteBytes := appTask.GetFollowerStoreWriteBytesForCommittedEntries(ctx)
[nit] Pull this into the if block below to avoid doing it if disabled.
pkg/kv/kvserver/replica_raft.go line 860 at r3 (raw file):
	return stats, getNonDeterministicFailureExplanation(err), err
}
followerWriteBytes := appTask.GetFollowerStoreWriteBytesForCommittedEntries(ctx)
Do we have microbenchmarks in the command application path that can tell us what the overhead of this is? It looks cheap enough, but there are two lock acquisitions underneath FollowerStoreWriteBytes to maintain the stats correctly.
pkg/kv/kvserver/replica_raftstorage.go line 945 at r3 (raw file):
	}
}
var ingestStats pebble.IngestOperationStats
Ditto, fold this into the nil check below. I was looking through non-test callers of db.IngestExternalFiles and found these two:
cockroach/pkg/kv/kvserver/replica_proposal.go
Line 498 in 58fac19
cockroach/pkg/kv/kvserver/replica_proposal.go
Lines 532 to 534 in 58fac19
Do we need to eventually plumb statistics from those ingestion points into AC? (I've not closely read #80914 so my question is likely naive. EDIT: Just read that PR, still have the question). Leave TODOs if so. Ideally we'd get rid of IngestExternalFiles altogether and replace it with just the one variant that returns statistics, ignoring results when unnecessary.
pkg/kv/kvserver/store.go line 3926 at r3 (raw file):
}

// SnapshotIngested implements the KVAdmissionController itnerface.
Typo: itnerface.
pkg/util/admission/store_token_estimation.go line 310 at r1 (raw file):
// accurate. Additionally, we will scale down the bytes for
// at-admission-tokens by the fraction
// bytes-that-sought-permission/(bytes-that-sought-permission +
I'm not seeing this scale-down math for work that sought admission vs. bypassed it. Can you point me to it? If it's missing and we still think it's a good idea, should we add a TODO somewhere?
pkg/util/admission/testdata/format_adjust_tokens_stats.txt line 6 at r3 (raw file):
compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 0 B (write 0 B ingest 0 B ignored 0 B): requests 0 (0 bypassed) with 0 B acc-write (0 B bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.00x+0 B (smoothed 0.00x+0 B) + ingested-model 0.00x+0 B (smoothed 0.00x+0 B) + at-admission-tokens 0 B, compacted 0 B [≈0 B], flushed 0 B [≈0 B]; admitting all
real-numbers: compaction score 2.700[L0-overload] (195 ssts, 27 sub-levels), L0 growth 577 MiB (write 0 B ingest 0 B ignored 0 B): requests 0 (0 bypassed) with 0 B acc-write (0 B bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.00x+0 B (smoothed 0.00x+0 B) + ingested-model 0.00x+0 B (smoothed 0.00x+0 B) + at-admission-tokens 0 B, compacted 77 MiB [≈62 MiB], flushed 0 B [≈0 B]; admitting 116 MiB (rate 7.7 MiB/s) due to L0 growth (used 0 B)
Update test to include non-zero values for bypassed writes.
sumeerbhola
left a comment
TFTR!
It would be helpful to spell out in the commit message the reason for ignoring snapshot-driven L0 bytes ingested in some of these components: if we didn't, we'd see a large increase in token estimations for regular writes, and it would take time for those estimations to return to an accurate value for just regular writes. During that time we're likely to be under-admitting work because of the over-estimates. (Is this the right way to think about it?)
Yes, that's the right way to think about it. Updated the commit message.
How much of this over-estimation matters given regular writes are able to return unused tokens back in post-work?
It will matter, since the token consumption model for locally proposed writes and ingests will be wrong, so they will overconsume to compensate. And the model will be wrong in the sense of having a large constant, which means even a mix of locally proposed workloads of different sizes will not be well modeled (which we should otherwise be able to do, with a constant factor close to 1 and most of the bytes accounted for via the multiplier).
I've also now read #80914, so have a coarse understanding of what we're building towards
We may or may not build towards that, but this PR is roughly a step in the right direction.
I haven't read #85722 yet so don't fully see how these tokens map to underlying bandwidth use and how such bandwidth is proportioned in the face of low-volume snapshots and a large volume of snapshots with the scheme outlined in #80914.
We are going to use these same bypassed writes and ingests when fitting the model for the full incoming bytes into the LSM. So from a modeling of token consumption perspective, this PR gives us a clean answer for disk bandwidth too. However, we won't be consuming any elastic tokens for these follower writes, since they are not something we can actually stop -- the elastic tokens are only for locally proposed work.
I'm not the most familiar with the command application codepaths, but reading through that slack thread and taking a cursory look, I'm wondering if this is the best plumbing point.
I'm not familiar with it either, but I took a look at your suggestion and it is likely better. I had looked for a way to not introduce additional methods in the apply package, but had missed that we have a specific implementation of apply.StateMachine that we have created and given to apply.Task, and that we are already tracking stats in there. I'll give it a try and ping this PR when ready.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @nvanbenschoten, and @tbg)
Previously, irfansharif (irfan sharif) wrote…
[nit] Mention that we're doing so without seeking admission/permission (even if obvious since we're consuming tokens after having done the work).
Done
pkg/kv/kvserver/replica_raft.go line 860 at r3 (raw file):
Previously, irfansharif (irfan sharif) wrote…
[nit] Pull this into the if block below to avoid doing it if disabled.
Done.
btw, we still have this non-nil-gated logic for admission control, which is a leftover from the initial integration. There shouldn't be any production code where r.store.cfg.KVAdmissionController is nil.
pkg/kv/kvserver/replica_raft.go line 860 at r3 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Do we have microbenchmarks in the command application path that can tell us what the overhead of this is? It looks cheap enough, but there are two lock acquisitions underneath FollowerStoreWriteBytes to maintain the stats correctly.
There aren't microbenchmarks for admission control code. Those are probably worth adding in the future. In the past I have looked at contention profiles in roachtests and they have not looked significant.
pkg/kv/kvserver/replica_raftstorage.go line 945 at r3 (raw file):
Ditto, fold this into the nil check below.
We need to call IngestExternalFiles* anyway.
Do we need to eventually plumb statistics from those ingestion points into AC?
We could, but the current plan is not to bother unless we really have a need since that is below raft code, and hopefully our linear model does a decent job of telling us what fraction will be ingested into L0. If we can't model it well, we could look into it.
pkg/kv/kvserver/store.go line 3926 at r3 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Typo: itnerface.
Done
pkg/util/admission/store_token_estimation.go line 310 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
I'm not seeing this scale-down math for work that sought admission vs. bypassed it. Can you point me to it? If it's missing and we still think it's a good idea, should we add a TODO somewhere?
I had decided not to do this, since we are compensating for things at done time anyway. But thinking again about this now, it may have been muddled thinking, and I should indeed do this.
I'll make the change and ping.
pkg/util/admission/testdata/format_adjust_tokens_stats.txt line 6 at r3 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Update test to include non-zero values for bypassed writes.
This will be subsumed by the TODO I added based on your comment on the previous PR. I prefer to do it later.
tbg
left a comment
Great to see this come together! I didn't scrutinize the test cases in detail but this looks good.
I am curious if we're perhaps regressing, in some way, on snapshot overload with this change. Let's say we have a node where the I/O overload is driven 50% by foreground traffic and 50% by snapshots (which all go to L0 in this example for reasons).
Before this PR, the L0 growth in snapshots would penalize the foreground traffic. The snapshots would continue, but foreground traffic would basically stop - the I/O load drops to 50% of what it was before admission control kicked in.
After this PR, we're basically "ignoring" the snapshots. Admission control pretends the L0 growth caused by them isn't actually happening. Say workload and snaps are contributing 100mb/s into L0, each. We're still overloaded, so it is doing something to slow down foreground traffic. We notice that L0 is growing at ~100mb/s (it's really 200mb/s but we're ignoring half). We're compacting out of L0 at 100mb/s as well. So we're targeting 50mb/s (half of L0->LX compaction rate), in effect slowing the foreground workload down to 50mb/s. But the aggregate 150mb/s will probably still overwhelm the disk, at least if it's one of the 125mb/s AWS gp3 volumes.
Of course this is all a bit contrived, since on a 125mb/s volume you better not set your snapshot rate limits to >= 100mb/s! And we could always lower it. Additionally, the allocator would not place new replicas on an overloaded store. And, additionally, other leaseholders would pause this follower, i.e. not send raft snapshots either (this is assuming the cluster setting for this is on). So there are multiple levels of defense here already.
And if they all fail - can still lower the snap rates. I assume that seeing I/O overload, we would check the aggregate disk throughput, as well as the snapshot rcvd bytes rate. We'd see the latter reflect a substantial fraction of the aggregate throughput and go to lower the snapshot rate (or to temporarily stop snapshots by recommissioning, etc). That seems reasonable enough. The IO overload log message would also indicate a large byte count for the "ignored" fields, which is a more direct (though only to experts) clue.
If it isn't too much work, could you share whatever you use to run your overload experiments? I am currently working on a few follow-ups to #83851 and during stability plan to revisit the corresponding roachtest #81516, and hope to add more powerful coverage of various overload scenarios including some that you have concocted, to additionally bolster our confidence in the protections we've been adding.
Reviewed 6 of 17 files at r1, 17 of 20 files at r2, 1 of 1 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @nvanbenschoten, and @sumeerbhola)
pkg/kv/kvserver/replica_raft.go line 862 at r4 (raw file):
if r.store.cfg.KVAdmissionController != nil {
	followerWriteBytes := appTask.GetFollowerStoreWriteBytesForCommittedEntries(ctx)
	r.store.cfg.KVAdmissionController.FollowerStoreWriteBytes(
Could you add a comment about the placement of this call? At this point, we have the entries in the log & know they are committed, but we have not applied them yet. We're realistically about to do so momentarily so it doesn't really matter, but conceptually, do we want to do this after or before application? Or even after the append, before we even know the entry is committed? In other words, do we have any intuition about the ideal world semantics we want here.
pkg/util/admission/testdata/io_load_listener line 284 at r4 (raw file):
set-state l0-bytes=1000 l0-added-write=191000 l0-added-ingested=30000 l0-files=21 l0-sublevels=21 print-only-first-tick=true
----
compaction score 1.050[L0-overload] (21 ssts, 21 sub-levels), L0 growth 20 KiB (write 20 KiB ingest 0 B ignored 0 B): requests 10 (0 bypassed) with 20 KiB acc-write (0 B bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 1.00x+1 B (smoothed 1.26x+1 B) + ingested-model 0.00x+0 B (smoothed 1.12x+1 B) + at-admission-tokens 5.9 KiB, compacted 20 KiB [≈59 KiB], flushed 0 B [≈0 B]; admitting 27 KiB (rate 1.8 KiB/s) due to L0 growth (used 0 B)
drive-by comment: would it make sense to highlight, in that message, a constant term >> 1 in the model, since it indicates that the fit isn't working well?
Something like write-model 2.99x+1025[!] whenever b > 10, or something like that?
btw, the token count computation will still notice the 200MB/s L0 growth. That doesn't substantively change this example, since the token count will be computed based on the compaction rate out of L0, and half of that, so 50MB/s. One thing to note is that we are not trying to protect disk bandwidth here -- we are not aware that it is a bottleneck. The disk bandwidth PR will be the first thing that sets the stage for protecting disk bandwidth, though there will be more to do there in the future. But even that is not important in your example -- even if disk bandwidth were plentiful, L0 would still keep growing, and that was the point you were making, which is indeed correct.
I am indeed relying on this better allocator behavior. Not as much on the leaseholder pausing behavior, since it is disabled by default.
I ran a 3 node cluster using [...] but the sizes here are not important since these are just inherited from some other experiments where I wanted a specific cpu size and disk bandwidth. I didn't experiment with incoming snapshots to a node since I didn't know a good way of forcing that. So if you have suggestions there, and/or can fold that into the roachtest you are planning to write, that would be great.
sumeerbhola
left a comment
TFTRs!
@irfansharif I have made the plumbing change to collect the stats in replicaAppBatch.Stage and to transfer them to replicaStateMachine in recordStatsOnCommit.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_raft.go line 862 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Could you add a comment about the placement of this call? At this point, we have the entries in the log & know they are committed, but we have not applied them yet. We're realistically about to do so momentarily so it doesn't really matter, but conceptually, do we want to do this after or before application? Or even after the append, before we even know the entry is committed? In other words, do we have any intuition about the ideal world semantics we want here.
I think ideally we want them after committing. I have changed this based on @irfansharif's earlier suggestion, so it now happens after commit.
pkg/util/admission/testdata/io_load_listener line 284 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
drive-by comment, would it make sense to highlight, in that message, a constant term >> 1 in the model, since it indicates that the fit isn't working well?
Something like write-model 2.99x+1025[!] whenever b > 10, or something like that?
This log entry requires enough expertise to understand as it is. I somewhat doubt that adding an exclamation mark will help.
sumeerbhola
left a comment
Also rebased on master since the first commit from the other PR has merged.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @nvanbenschoten, and @tbg)
sumeerbhola
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @nvanbenschoten, and @tbg)
pkg/util/admission/store_token_estimation.go line 310 at r1 (raw file):
Previously, sumeerbhola wrote…
I had decided not to do this, since we are compensating for things at done time anyway. But thinking again about this now, it may have been muddled thinking, and I should indeed do this.
I'll make the change and ping.
Sorry, I was not thinking straight last night. This is handled: we take the total bytes added to L0 and divide by the total work count (including work that bypassed admission). It is just that the logic is not the same as what was written in the TODO.
nvb
left a comment
I haven't read the updates to util/admission and am getting pulled away for meetings, but the plumbing changes to kvserver to pass the write bytes from follower entry application and snapshot ingestion into StoreWorkQueue all look good.
Reviewed 3 of 20 files at r2, 15 of 18 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @sumeerbhola, and @tbg)
pkg/kv/kvserver/apply/cmd.go line 44 at r5 (raw file):
// once per Command.
AckErrAndFinish(context.Context, error) error
// GetStoreWriteByteSizes returns the size of the writes to the store:
Does this still need to be on the interface? We no longer use it from this package.
irfansharif
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @sumeerbhola, and @tbg)
pkg/util/admission/store_token_estimation.go line 310 at r1 (raw file):
Previously, sumeerbhola wrote…
Sorry, I was not thinking straight last night. This is handled in that we are taking the total bytes added to L0 and dividing it by the total work count (including those that bypassed). It is just that the logic is not the same as what was written in the TODO.
That approach makes sense, thanks.
The admission control write token consumption and write size estimation logic gets degraded due to the lack of tracking of writes done at followers, and of range snapshot ingestion.

For follower writes, replicaAppBatch now maintains stats for follower writes in a followerStoreWriteBytes struct, which is added to for every replicatedCmd that is not local. This is plumbed into replicaStateMachine and from there into the raft handling loop, which passes it on to admission control so that the latter knows about the work count and byte sizes that bypassed admission control. These work units and their bytes are used in the write size estimation logic in the same manner as writes that don't bypass admission control. These bytes are also used to consume tokens.

For range snapshot ingestion, we do not want to consume tokens, since snapshots can be huge and add at most 1 sublevel to L0. So instead of accounting for these bytes as normal ingestion, we subtract the bytes that were ingested into L0. If we did not do such subtraction, we'd incorrectly attribute the bytes ingested into L0 to work that was seeking admission, which would inflate the byte token consumption model. This would cause subsequent time intervals (of 15s duration, since that is our re-estimation granularity) to under-admit work until the model was eventually corrected (which can take some time due to exponential smoothing).

This change was experimented with in a 3 node cluster with kv0 running with 1024 byte blocks with a very high concurrency. Note the bypassed count and bytes below, which are about 2/3rd of the write volume seen by this node (since the leaseholders were evenly distributed), and that both the write model of approx 2.0x+1B and the at-admission-tokens model of approx 2.1KiB are reasonable since we now have visibility into these follower writes.

```
I220727 19:24:44.856581 26 util/admission/granter.go:2099 ⋮ [s2] 287 IO overload: compaction score 1.050[L0-overload] (707 ssts, 21 sub-levels), L0 growth 596 MiB (write 596 MiB ingest 0 B ignored 0 B): requests 287972 (184860 bypassed) with 294 MiB acc-write (189 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 2.02x+1 B (smoothed 2.07x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 2.2 KiB, compacted 135 MiB [≈270 MiB], flushed 741 MiB [≈809 MiB]; admitting 269 MiB (rate 18 MiB/s) due to L0 growth (used 626 MiB)
I220727 19:25:29.857072 26 util/admission/granter.go:2099 ⋮ [s2] 295 IO overload: compaction score 1.050[L0-overload] (725 ssts, 21 sub-levels), L0 growth 594 MiB (write 594 MiB ingest 0 B ignored 0 B): requests 296751 (191171 bypassed) with 303 MiB acc-write (196 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 1.96x+1 B (smoothed 2.03x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 2.1 KiB, compacted 359 MiB [≈416 MiB], flushed 803 MiB [≈813 MiB]; admitting 340 MiB (rate 23 MiB/s) due to L0 growth (used 634 MiB)
I220727 19:26:14.857530 26 util/admission/granter.go:2099 ⋮ [s2] 301 IO overload: compaction score 1.200[L0-overload] (685 ssts, 24 sub-levels), L0 growth 482 MiB (write 482 MiB ingest 0 B ignored 0 B): requests 229841 (154240 bypassed) with 217 MiB acc-write (141 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 2.22x+1 B (smoothed 2.11x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 2.1 KiB, compacted 492 MiB [≈475 MiB], flushed 772 MiB [≈796 MiB]; admitting 338 MiB (rate 22 MiB/s) due to L0 growth (used 435 MiB)
I220727 19:27:44.857163 26 util/admission/granter.go:2099 ⋮ [s2] 311 IO overload: compaction score 1.200[L0-overload] (538 ssts, 24 sub-levels), L0 growth 654 MiB (write 654 MiB ingest 0 B ignored 0 B): requests 300321 (193366 bypassed) with 307 MiB acc-write (198 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 2.13x+1 B (smoothed 2.05x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 2.1 KiB, compacted 578 MiB [≈543 MiB], flushed 753 MiB [≈775 MiB]; admitting 388 MiB (rate 26 MiB/s) due to L0 growth (used 610 MiB)
```

Informs cockroachdb#80607
Fixes cockroachdb#82536
Release note: None
sumeerbhola
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/kv/kvserver/apply/cmd.go line 44 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Does this still need to be on the interface? We no longer use it from this package.
Good point. Done.
bors r=irfansharif,tbg
Build failed (retrying...)
Build succeeded