Skip to content

admission: improve granter logs#80859

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
tbg:granter-logs
May 17, 2022
Merged

admission: improve granter logs#80859
craig[bot] merged 1 commit intocockroachdb:masterfrom
tbg:granter-logs

Conversation

@tbg
Copy link
Copy Markdown
Member

@tbg tbg commented May 2, 2022

To the uninitiated, they were previously difficult to understand.
Now it's clearer what each number means just from looking at the
output.

While rebasing this PR on top of #80480, I realized how hard it
was (at least for me) to keep track of the many quantities involved. The
way the updates to the state were computed (were the old struct is
updated incrementally in-place) additionally made me worried
that bugs could easily be introduced. Some quantities are cumulative,
some are for the current interval but contain other quantities, etc - it
all melted my brain with ease.

I decided to at least try to make it easier for the next newcomer to
this codebase to get their bearings by constructing all of the updates
first, and then overwriting the old state. I additionally renamed all
of the variables according to a fixed pattern so that it is now clear
which ones are for the current interval or cumulative, and added
commentary to (hopefully) clarify which quantities contain each other.
Additionally, all quantities are now stored for posterity, so in
principle it's now straightforward to write a test case on any of the
intermediate quantities for any desired inputs.

More could be done here but I feel that this leaves things in a good
place. There's a chance that I have written equally intricate code but
happen to like it more because I wrote it. I trust my reviewers to keep
me honest.

Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@tbg tbg requested a review from sumeerbhola May 2, 2022 11:04
@tbg tbg force-pushed the granter-logs branch from 500a1f7 to 4f14bbf Compare May 2, 2022 11:17
@tbg tbg marked this pull request as ready for review May 2, 2022 11:17
@tbg tbg force-pushed the granter-logs branch from 4f14bbf to f0bd62f Compare May 2, 2022 11:23
Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


pkg/util/admission/granter.go line 1727 at r1 (raw file):

		}
		if doLog {
			avgAdded := bytesAdded / int64(admitted)

thanks for doing this. Is this something you wanted to backport to 22.1 and 21.2?
I ask since the log statement is changing with the byte tokens PR

log.Infof(ctx,

It still lacks the humanized formatting, so something that will need to be fixed there too.

@tbg tbg force-pushed the granter-logs branch from f0bd62f to 560fd41 Compare May 3, 2022 19:07
@tbg tbg requested a review from sumeerbhola May 3, 2022 19:13
Copy link
Copy Markdown
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big update, PTAL. I realize this may make it harder to backport later changes (if we're not backporting this one as well). If you're concerned about that let's discuss dropping this PR.

Please review carefully, I am still worried I messed something up here though the test diffs seem to indicate that I haven't.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/util/admission/granter.go line 1727 at r1 (raw file):

Previously, sumeerbhola wrote…

thanks for doing this. Is this something you wanted to backport to 22.1 and 21.2?
I ask since the log statement is changing with the byte tokens PR

log.Infof(ctx,

It still lacks the humanized formatting, so something that will need to be fixed there too.

I'll defer to you in terms of backporting. I'm not sure how good the test coverage here is. I don't seem to have introduced any changes (please carefully review) but there is now a large diff. Then again, not backporting may make it impossible to backport later changes. If you think that it is too risky to make a refactor of the size I'm doing here, I'm happy to hold off, though there isn't typically a better time in the cycle to do so and so it might mean the improvements here won't happen.

I think if this bakes for some time we could probably backport it to 22.1, at least if we're also backporting the bytes change. (Not sure what the plan there was).

@tbg
Copy link
Copy Markdown
Member Author

tbg commented May 11, 2022

Rebased, PTAL after Infest.

Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 7 files at r2, 4 of 4 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


pkg/util/admission/granter.go line 1626 at r3 (raw file):

	// represents what has been given out. The units is bytes.
	totalTokens     int64
	tokensAllocated int64

all of these are moved into ioLoadListenerState, yes? I like the comments and better names.


pkg/util/admission/granter.go line 1750 at r3 (raw file):

	intL0CompactedBytes         int64
	smoothedIntL0CompactedBytes float64

is this smoothed value the same as in ioLoadListenerState? Same question for the two other Smoothed* members in this struct.


pkg/util/admission/granter.go line 1812 at r3 (raw file):

	// intIngestedBytes are the bytes ingested since the last token adjustment. Bytes
	// tracked here are also tracked in intAdmittedBytes.
	intIngestedBytes := cumAdmissionStats.ingestedBytes - prev.cumAdmissionStats.ingestedBytes

s/intIngestedBytes/intIngestedAccountedBytes/


pkg/util/admission/granter.go line 1860 at r3 (raw file):

		// Keep the old smoothed value around, i.e. don't update it with a zero and
		// don't zero it out.
		// TODO(during review): why?

we've not had work flow through admission control, so we don't have any reasonable interval data to update our previous estimates.


pkg/util/admission/granter.go line 1864 at r3 (raw file):

	}

	// TODO(during review): aren't we double-penalizing here? Let's say 1GB was

I think what you are suggesting is potentially a higher fidelity model, but I maybe misunderstanding what you are saying.
We are currently taking all the unaccounted bytes that landed into L0 (ingests or flush) and spreading them evenly over what was admitted. This will then be fed into storeRequestEstimates.workByteAddition which is used by WorkQueue for both regular writes and ingests.

I don't understand the "penalize ingestions" question. The l0BytesIngestedFraction is solely based on what was reported by the WorkQueue. That is, intIngestedBytes is not from the Pebble metrics. This fraction calculation is only to minimize the error in the interval between Admit and AdmittedWorkDone: if there a 1MB ingest, we want to make a reasonable guess on how many tokens to deduct based on past observations, and then we will fix any error in AdmittedWorkDone. So this is not saying anything about what silently bypassed admission control.
Renaming intIngestedBytes to intIngestedAccountedBytes would make this clearer.


pkg/util/admission/granter.go line 1907 at r3 (raw file):

		}
	} else {
		// Under the threshold. Maintain a smoothedNumTokens based on what was

Maintain a smoothedTotalNumByteTokens ...


pkg/util/admission/work_queue.go line 1650 at r3 (raw file):

			q.mu.stats.admittedBytes += uint64(h.writeBytes)
			if h.ingestRequest {
				q.mu.stats.ingestedBytes += uint64(h.writeBytes)

ingestedAccountedBytes?
Though by definition, all the WorkQueue stats are accounted.

To the uninitiated, they were previously difficult to understand.
Now it's clearer what each number means just from looking at the
output.

While rebasing this PR on top of cockroachdb#80480, I realized how hard it
was (at least for me) to keep track of the many quantities involved. The
way the updates to the state were computed (were the old struct is
updated incrementally in-place) additionally made me worried
that bugs could easily be introduced. Some quantities are cumulative,
some are for the current interval but contain other quantities, etc - it
all melted my brain with ease.

I decided to at least try to make it easier for the next newcomer to
this codebase to get their bearings by constructing all of the updates
first, and then overwriting the old state. I additionally renamed all
of the variables according to a fixed pattern so that it is now clear
which ones are for the current interval or cumulative, and added
commentary to (hopefully) clarify which quantities contain each other.
Additionally, all quantities are now stored for posterity, so in
principle it's now straightforward to write a test case on any of the
intermediate quantities for any desired inputs.

More could be done here but I feel that this leaves things in a good
place. There's a chance that I have written equally intricate code but
happen to like it more because I wrote it. I trust my reviewers to keep
me honest.

Release note: None
@tbg tbg requested a review from sumeerbhola May 17, 2022 07:36
Copy link
Copy Markdown
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dismissed @sumeerbhola from 4 discussions.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/util/admission/granter.go line 1626 at r3 (raw file):

Previously, sumeerbhola wrote…

all of these are moved into ioLoadListenerState, yes? I like the comments and better names.

Yep


pkg/util/admission/granter.go line 1750 at r3 (raw file):

Previously, sumeerbhola wrote…

is this smoothed value the same as in ioLoadListenerState? Same question for the two other Smoothed* members in this struct.

Good catch. I accidentally duplicated some of these fields. Cleaned this


pkg/util/admission/granter.go line 1812 at r3 (raw file):

Previously, sumeerbhola wrote…

s/intIngestedBytes/intIngestedAccountedBytes/

Done.


pkg/util/admission/granter.go line 1860 at r3 (raw file):

Previously, sumeerbhola wrote…

we've not had work flow through admission control, so we don't have any reasonable interval data to update our previous estimates.

Updated with this comment.
Still, is this the right thing to do? You'd expect us to "always measure/update" but just not throttle when admission control is off. Is this just reflecting the fact that when admission control was introduced, we wanted the "off" switch to completely disable most code related to it? As is, it seems that once admission control kicks in again after a long period of being below threshold, we'll initially start out with a mostly bogus smoothed value here. I think this is fine, just trying to deeply understand the rationale here.


pkg/util/admission/granter.go line 1864 at r3 (raw file):

Previously, sumeerbhola wrote…

I think what you are suggesting is potentially a higher fidelity model, but I maybe misunderstanding what you are saying.
We are currently taking all the unaccounted bytes that landed into L0 (ingests or flush) and spreading them evenly over what was admitted. This will then be fed into storeRequestEstimates.workByteAddition which is used by WorkQueue for both regular writes and ingests.

I don't understand the "penalize ingestions" question. The l0BytesIngestedFraction is solely based on what was reported by the WorkQueue. That is, intIngestedBytes is not from the Pebble metrics. This fraction calculation is only to minimize the error in the interval between Admit and AdmittedWorkDone: if there a 1MB ingest, we want to make a reasonable guess on how many tokens to deduct based on past observations, and then we will fix any error in AdmittedWorkDone. So this is not saying anything about what silently bypassed admission control.
Renaming intIngestedBytes to intIngestedAccountedBytes would make this clearer.

I hadn't understood this properly, with your explanation it makes sense.


pkg/util/admission/granter.go line 1907 at r3 (raw file):

Previously, sumeerbhola wrote…

Maintain a smoothedTotalNumByteTokens ...

Done.


pkg/util/admission/work_queue.go line 1650 at r3 (raw file):

Previously, sumeerbhola wrote…

ingestedAccountedBytes?
Though by definition, all the WorkQueue stats are accounted.

If we're going to do the accounted infix (and it looks like we are), then it's better to have it everywhere. Made sure it's present in all of the storeAdmissionStats byte counters.

@tbg
Copy link
Copy Markdown
Member Author

tbg commented May 17, 2022

On bazel, TestDataDriven/multiregion failed. On regular CI, TestLogic/5node/distsql_stats

But neither are this PR's fault.

bors r=sumeerbhola

TFTR!

Will send a follow-up if you have further comments.

@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 17, 2022

Build succeeded:

@craig craig bot merged commit 4216e6e into cockroachdb:master May 17, 2022
tbg added a commit to tbg/cockroach that referenced this pull request May 23, 2022
tbg added a commit to tbg/cockroach that referenced this pull request May 23, 2022
craig bot pushed a commit that referenced this pull request May 23, 2022
81242: storage: use `SetOptions()` when reusing Pebble iterators r=jbowens a=erikgrinaker

Iterator reuse now relies on `Pebble.SetOptions()` to configure the
reused Pebble iterator. This allows a wider range of iterators to be
reused, since previously only the bounds could be changed on existing
iterators.

Release note: None

81661: admission: avoid spurious logging when not in IO overload r=sumeerbhola a=tbg

I broke this in #80859.

Release note: None


Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
andrewbaptist pushed a commit to andrewbaptist/cockroach that referenced this pull request May 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants