Skip to content

changefeedccl: add backfill progress metric#76995

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
samiskin:changefeed-backfill-observability
Mar 4, 2022
Merged

changefeedccl: add backfill progress metric#76995
craig[bot] merged 1 commit intocockroachdb:masterfrom
samiskin:changefeed-backfill-observability

Conversation

@samiskin
Copy link
Copy Markdown
Contributor

@samiskin samiskin commented Feb 24, 2022

Previously there was no way of knowing how far along a backfill was in
its task. This made it hard to debug issues.

This patch adds a "changefeed.backfill_pending_ranges" SLI metric to allow
approximating backfill progress using the number of ranges left to
export. This value is cleared if the changefeed is cancelled.

Release note (ops change): The "changefeed.backfill_pending_ranges"
prometheus metric was added to track ongoing backfill progress of a
changefeed.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@samiskin samiskin force-pushed the changefeed-backfill-observability branch 4 times, most recently from 90dbce0 to d289f1b Compare February 28, 2022 18:30
@samiskin samiskin changed the title changefeedccl: emit backfill pending ranges to prometheus endpoint changefeedccl: add backfill progress metric Feb 28, 2022
@samiskin samiskin force-pushed the changefeed-backfill-observability branch from d289f1b to 343daaa Compare February 28, 2022 18:33
@samiskin samiskin marked this pull request as ready for review February 28, 2022 18:33
@samiskin samiskin requested a review from a team as a code owner February 28, 2022 18:33
@samiskin samiskin requested review from miretskiy and stevendanna and removed request for a team February 28, 2022 18:33
@samiskin samiskin force-pushed the changefeed-backfill-observability branch 2 times, most recently from 5b02ede to 3beec62 Compare February 28, 2022 19:25
Copy link
Copy Markdown
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @samiskin and @stevendanna)


-- commits, line 11 at r2:
should the category be enterprise change?


pkg/ccl/changefeedccl/metrics.go, line 162 at r2 (raw file):

		var mu syncutil.Mutex
		m.BackfillPendingRanges.Inc(initial)
		remaining := initial

I think the use of atomics is justified here...


pkg/ccl/changefeedccl/metrics.go, line 167 at r2 (raw file):

			mu.Lock()
			defer mu.Unlock()
			if remaining > 0 {

wouldn't this be a bug?
Anyways, I think you should just decrement BackfillPendingRanges, and also atomic decrement remaining.


pkg/ccl/changefeedccl/metrics.go, line 177 at r2 (raw file):

			defer mu.Unlock()
			m.BackfillPendingRanges.Inc(-remaining)
			remaining = 0

drop remaining = 0?


pkg/ccl/changefeedccl/kvfeed/scanner.go, line 104 at r2 (raw file):

			if err == nil {
				backfillDec()
			}

I think you should just defer backfillDec() above -- this backfill is done, whether it's done with an error or not is not important.

Code quote:

			if err == nil {
				backfillDec()
			}

pkg/ccl/changefeedccl/changefeed_test.go, line 809 at r2 (raw file):

	t.Run("enterprise", enterpriseTest(testFn, feedTestNoTenants))
	t.Run("cloudstorage", cloudStorageTest(testFn, feedTestNoTenants))
	t.Run("kafka", kafkaTest(testFn, feedTestNoTenants))

Probably doesn't need to run against many sinks since this is a metrics specific test;
Also, please make sure this test is not flaky (I don't think it is -- just want to make sure it passes
w/ e.g. test.count=250).

Copy link
Copy Markdown
Contributor Author

@samiskin samiskin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @samiskin, and @stevendanna)


-- commits, line 11 at r2:

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

should the category be enterprise change?

👍 I wasn't sure if metrics-only counted as operations. I guess anything changefeed-related is an enterprise change?


pkg/ccl/changefeedccl/metrics.go, line 162 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I think the use of atomics is justified here...

The reason I just used a mutex was that I ended up having compare and swap loops:

dec = func() {
	var oldRem int64
	for { // Consume one of the remaining units
		oldRem = atomic.LoadInt64(&remaining)
		if oldRem <= 0 || atomic.CompareAndSwapInt64(&remaining, oldRem, oldRem-1) {
			break
		}
	}
	if oldRem > 0 {
		m.BackfillPendingRanges.Dec(1)
	}
}

clear = func() {
	for { // Consume all of the remaining units
		oldRem := atomic.LoadInt64(&remaining)
		if atomic.CompareAndSwapInt64(&remaining, oldRem, 0) {
			m.BackfillPendingRanges.Dec(remaining)
			break
		}
	}
}

Since I was worried that you'd have one thread loading in a non-0 value, the next clears the metric to 0, then the first thread decreases the metric now to -1.

If its still better to have the loops then I can swap it back.

Technically we should never have a clear() during an ongoing dec() since I only call clear() once the backfill completes, but I liked the idea of covering that case regardless to not implicitly couple this metrics code with how it happens to be called right now.

I could also eliminate this but add a comment of "Ensure that clear() is not called while dec() is still able to be called" so that the dependency is at least explicit.


pkg/ccl/changefeedccl/metrics.go, line 167 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

wouldn't this be a bug?
Anyways, I think you should just decrement BackfillPendingRanges, and also atomic decrement remaining.

With the current code that would be a bug, but similar to my earlier point, this is covering the "dec() after clear()" case just to avoid .


pkg/ccl/changefeedccl/metrics.go, line 177 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

drop remaining = 0?

This is also covering the "dec() after clear()"


pkg/ccl/changefeedccl/kvfeed/scanner.go, line 104 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I think you should just defer backfillDec() above -- this backfill is done, whether it's done with an error or not is not important.

I initially added this check because it made the unit test easier, but I also liked the idea that if there were errors on certain ranges during the backfill you'd be able to get an idea of how many failed based on how sharply the count dropped at the end.

Copy link
Copy Markdown
Contributor Author

@samiskin samiskin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @samiskin, and @stevendanna)


pkg/ccl/changefeedccl/metrics.go, line 162 at r2 (raw file):

Previously, samiskin (Shiranka Miskin) wrote…

The reason I just used a mutex was that I ended up having compare and swap loops:

dec = func() {
	var oldRem int64
	for { // Consume one of the remaining units
		oldRem = atomic.LoadInt64(&remaining)
		if oldRem <= 0 || atomic.CompareAndSwapInt64(&remaining, oldRem, oldRem-1) {
			break
		}
	}
	if oldRem > 0 {
		m.BackfillPendingRanges.Dec(1)
	}
}

clear = func() {
	for { // Consume all of the remaining units
		oldRem := atomic.LoadInt64(&remaining)
		if atomic.CompareAndSwapInt64(&remaining, oldRem, 0) {
			m.BackfillPendingRanges.Dec(remaining)
			break
		}
	}
}

Since I was worried that you'd have one thread loading in a non-0 value, the next clears the metric to 0, then the first thread decreases the metric now to -1.

If its still better to have the loops then I can swap it back.

Technically we should never have a clear() during an ongoing dec() since I only call clear() once the backfill completes, but I liked the idea of covering that case regardless to not implicitly couple this metrics code with how it happens to be called right now.

I could also eliminate this but add a comment of "Ensure that clear() is not called while dec() is still able to be called" so that the dependency is at least explicit.

This could also happen in the concurrent dec() case but it would be due to a bug that resulted in more calls to dec() than the initial amount. It could be nice to have this even in that case since then at least the bug wouldn't permanently skew the metric.

@miretskiy
Copy link
Copy Markdown
Contributor

Since I was worried that you'd have one thread loading in a non-0 value, the next clears the metric to 0, then the first thread decreases the metric now to -1.

I think where you're going wrong is that you're trying to Load anything. All you need to do is to atomic.AddInt64(&remaining, -1) in the decrement function, and m.BackfillPendingRanges.Dec(remaining) in the clear function.
(You might need to Dec(LoadInt64) instead, but I doubt it because when clear backfill is called, you know that all of the go routines performing backfill have completed; so you only have 1 thread -- and thus, no need to synchronize anything). You can check by running test with -race).

@miretskiy
Copy link
Copy Markdown
Contributor

This could also happen in the concurrent dec() case but it would be due to a bug that resulted in more calls to dec() than the initial amount. It could be nice to have this even in that case since then at least the bug wouldn't permanently skew the metric.

Bugs are, of course, possible. But in that case we would just produce negative/incorrect values for this gauge; and.. you already have a test; so, maybe let's err on the side of fast and simple code?

@miretskiy
Copy link
Copy Markdown
Contributor

I initially added this check because it made the unit test easier, but I also liked the idea that if there were errors on certain ranges during the backfill you'd be able to get an idea of how many failed based on how sharply the count dropped at the end.

If you are concerned about this and want to track this (though, I doubt we'd need it tbh), a better way would be to have range callback take an error. And then have 2 metrics -- one for errors.

@samiskin samiskin force-pushed the changefeed-backfill-observability branch from 3beec62 to a0a3606 Compare February 28, 2022 20:46
@samiskin samiskin requested a review from miretskiy February 28, 2022 20:55
}
clear = func() {
atomic.AddInt64(&remaining, -remaining)
m.BackfillPendingRanges.Dec(remaining)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't look right. I don't think you want to update remaining prior to Dec, do you?

Copy link
Copy Markdown
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving; but please fix the bug in clear() function.

@samiskin samiskin force-pushed the changefeed-backfill-observability branch 6 times, most recently from 77d70ce to 18b3f08 Compare March 2, 2022 20:41
Previously there was no way of knowing how far along a backfill was in
its task.  This made it hard to debug issues.

This patch adds a "changefeed.backfill_pending_ranges" SLI metric to allow
approximating backfill progress using the number of ranges left to
export.  This value is cleared if the changefeed is cancelled.

Release note (enterprise change): The
"changefeed.backfill_pending_ranges" prometheus metric was added to
track ongoing backfill progress of a changefeed.

Release justification: Low risk metrics changes
@samiskin
Copy link
Copy Markdown
Contributor Author

samiskin commented Mar 3, 2022

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 4, 2022

Build succeeded:

@craig craig bot merged commit 41ae82f into cockroachdb:master Mar 4, 2022
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Mar 4, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating backport branch refs/heads/blathers/backport-release-21.2-76995: POST https://api.github.com/repos/cockroachlabs/cockroach/git/refs: 403 Resource not accessible by integration []

Backport to branch 21.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants