changefeedccl: add backfill progress metric#76995
changefeedccl: add backfill progress metric#76995craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
90dbce0 to
d289f1b
Compare
d289f1b to
343daaa
Compare
5b02ede to
3beec62
Compare
miretskiy
left a comment
There was a problem hiding this comment.
Reviewed 1 of 4 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @samiskin and @stevendanna)
-- commits, line 11 at r2:
should the category be enterprise change?
pkg/ccl/changefeedccl/metrics.go, line 162 at r2 (raw file):
var mu syncutil.Mutex m.BackfillPendingRanges.Inc(initial) remaining := initial
I think the use of atomics is justified here...
pkg/ccl/changefeedccl/metrics.go, line 167 at r2 (raw file):
mu.Lock() defer mu.Unlock() if remaining > 0 {
wouldn't this be a bug?
Anyways, I think you should just decrement BackfillPendingRanges, and also atomic decrement remaining.
pkg/ccl/changefeedccl/metrics.go, line 177 at r2 (raw file):
defer mu.Unlock() m.BackfillPendingRanges.Inc(-remaining) remaining = 0
drop remaining = 0?
pkg/ccl/changefeedccl/kvfeed/scanner.go, line 104 at r2 (raw file):
if err == nil { backfillDec() }
I think you should just defer backfillDec() above -- this backfill is done, whether it's done with an error or not is not important.
Code quote:
if err == nil {
backfillDec()
}pkg/ccl/changefeedccl/changefeed_test.go, line 809 at r2 (raw file):
t.Run("enterprise", enterpriseTest(testFn, feedTestNoTenants)) t.Run("cloudstorage", cloudStorageTest(testFn, feedTestNoTenants)) t.Run("kafka", kafkaTest(testFn, feedTestNoTenants))
Probably doesn't need to run against many sinks since this is a metrics specific test;
Also, please make sure this test is not flaky (I don't think it is -- just want to make sure it passes
w/ e.g. test.count=250).
samiskin
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @samiskin, and @stevendanna)
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
should the category be
enterprise change?
👍 I wasn't sure if metrics-only counted as operations. I guess anything changefeed-related is an enterprise change?
pkg/ccl/changefeedccl/metrics.go, line 162 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I think the use of atomics is justified here...
The reason I just used a mutex was that I ended up having compare and swap loops:
dec = func() {
var oldRem int64
for { // Consume one of the remaining units
oldRem = atomic.LoadInt64(&remaining)
if oldRem <= 0 || atomic.CompareAndSwapInt64(&remaining, oldRem, oldRem-1) {
break
}
}
if oldRem > 0 {
m.BackfillPendingRanges.Dec(1)
}
}
clear = func() {
for { // Consume all of the remaining units
oldRem := atomic.LoadInt64(&remaining)
if atomic.CompareAndSwapInt64(&remaining, oldRem, 0) {
m.BackfillPendingRanges.Dec(remaining)
break
}
}
}Since I was worried that you'd have one thread loading in a non-0 value, the next clears the metric to 0, then the first thread decreases the metric now to -1.
If its still better to have the loops then I can swap it back.
Technically we should never have a clear() during an ongoing dec() since I only call clear() once the backfill completes, but I liked the idea of covering that case regardless to not implicitly couple this metrics code with how it happens to be called right now.
I could also eliminate this but add a comment of "Ensure that clear() is not called while dec() is still able to be called" so that the dependency is at least explicit.
pkg/ccl/changefeedccl/metrics.go, line 167 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
wouldn't this be a bug?
Anyways, I think you should just decrement BackfillPendingRanges, and also atomic decrement remaining.
With the current code that would be a bug, but similar to my earlier point, this is covering the "dec() after clear()" case just to avoid .
pkg/ccl/changefeedccl/metrics.go, line 177 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
drop remaining = 0?
This is also covering the "dec() after clear()"
pkg/ccl/changefeedccl/kvfeed/scanner.go, line 104 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I think you should just
defer backfillDec()above -- this backfill is done, whether it's done with an error or not is not important.
I initially added this check because it made the unit test easier, but I also liked the idea that if there were errors on certain ranges during the backfill you'd be able to get an idea of how many failed based on how sharply the count dropped at the end.
samiskin
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @samiskin, and @stevendanna)
pkg/ccl/changefeedccl/metrics.go, line 162 at r2 (raw file):
Previously, samiskin (Shiranka Miskin) wrote…
The reason I just used a mutex was that I ended up having compare and swap loops:
dec = func() { var oldRem int64 for { // Consume one of the remaining units oldRem = atomic.LoadInt64(&remaining) if oldRem <= 0 || atomic.CompareAndSwapInt64(&remaining, oldRem, oldRem-1) { break } } if oldRem > 0 { m.BackfillPendingRanges.Dec(1) } } clear = func() { for { // Consume all of the remaining units oldRem := atomic.LoadInt64(&remaining) if atomic.CompareAndSwapInt64(&remaining, oldRem, 0) { m.BackfillPendingRanges.Dec(remaining) break } } }Since I was worried that you'd have one thread loading in a non-0 value, the next clears the metric to 0, then the first thread decreases the metric now to -1.
If its still better to have the loops then I can swap it back.
Technically we should never have a clear() during an ongoing dec() since I only call clear() once the backfill completes, but I liked the idea of covering that case regardless to not implicitly couple this metrics code with how it happens to be called right now.
I could also eliminate this but add a comment of "Ensure that clear() is not called while dec() is still able to be called" so that the dependency is at least explicit.
This could also happen in the concurrent dec() case but it would be due to a bug that resulted in more calls to dec() than the initial amount. It could be nice to have this even in that case since then at least the bug wouldn't permanently skew the metric.
I think where you're going wrong is that you're trying to Load anything. All you need to do is to |
Bugs are, of course, possible. But in that case we would just produce negative/incorrect values for this gauge; and.. you already have a test; so, maybe let's err on the side of fast and simple code? |
If you are concerned about this and want to track this (though, I doubt we'd need it tbh), a better way would be to have range callback take an error. And then have 2 metrics -- one for errors. |
3beec62 to
a0a3606
Compare
| } | ||
| clear = func() { | ||
| atomic.AddInt64(&remaining, -remaining) | ||
| m.BackfillPendingRanges.Dec(remaining) |
There was a problem hiding this comment.
That doesn't look right. I don't think you want to update remaining prior to Dec, do you?
miretskiy
left a comment
There was a problem hiding this comment.
Approving; but please fix the bug in clear() function.
77d70ce to
18b3f08
Compare
Previously there was no way of knowing how far along a backfill was in its task. This made it hard to debug issues. This patch adds a "changefeed.backfill_pending_ranges" SLI metric to allow approximating backfill progress using the number of ranges left to export. This value is cleared if the changefeed is cancelled. Release note (enterprise change): The "changefeed.backfill_pending_ranges" prometheus metric was added to track ongoing backfill progress of a changefeed. Release justification: Low risk metrics changes
18b3f08 to
b57812b
Compare
|
bors r+ |
|
Build succeeded: |
|
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating backport branch refs/heads/blathers/backport-release-21.2-76995: POST https://api.github.com/repos/cockroachlabs/cockroach/git/refs: 403 Resource not accessible by integration [] Backport to branch 21.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
Previously there was no way of knowing how far along a backfill was in
its task. This made it hard to debug issues.
This patch adds a "changefeed.backfill_pending_ranges" SLI metric to allow
approximating backfill progress using the number of ranges left to
export. This value is cleared if the changefeed is cancelled.
Release note (ops change): The "changefeed.backfill_pending_ranges"
prometheus metric was added to track ongoing backfill progress of a
changefeed.