changefeedccl: disable drain watcher by default#123370
craig[bot] merged 1 commit into cockroachdb:master
Conversation
🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf. It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?
rharding6373
left a comment
Thank you for finding this and getting a mitigation out so quickly!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @wenyihu6)
-- commits line 2 at r1:
minor nit/educational comment: do we colloquially refer to this feature as "drain watcher", or can we call it something like "shutdown checkpointing" to reflect the cluster setting that enables it?
-- commits line 4 at r1:
nit: "introduced a mechanism in 23.2" instead of "recently ...".
-- commits line 9 at r1:
Fixes: #123371
-- commits line 11 at r1:
reminder to fill this in with the known details so far and update the PR description in github
pkg/server/drain.go line 88 at r1 (raw file):
"the maximum amount of time a server waits for all currently executing jobs "+ "to notice drain request and to perform orderly shutdown", 0,
Why is this change required too?
Force-pushed from 5820752 to 1dadf50
wenyihu6
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @rharding6373)
Previously, rharding6373 (Rachael Harding) wrote…
minor nit/educational comment: do we colloquially refer to this feature as "drain watcher", or can we call it something like "shutdown checkpointing" to reflect the cluster setting that enables it?
I'm not sure what this feature was called. I think the major work was the drain watcher on each aggregator. Aggregators forward an updated checkpointing information before being drained. I believe the same code path is hit for both shutting down aggregators and node draining.
Previously, rharding6373 (Rachael Harding) wrote…
nit: "introduced a mechanism in 23.2" instead of "recently ...".
Done.
Previously, rharding6373 (Rachael Harding) wrote…
Fixes: #123371
Let me know if you want me to close the other issue. I was thinking we should have an issue open to track this work, but I'm happy to close it as well.
Previously, rharding6373 (Rachael Harding) wrote…
reminder to fill this in with the known details so far and update the PR description in github
Updated.
pkg/server/drain.go line 88 at r1 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
Why is this change required too?
iiuc, the original PR introduced two cluster settings to guard this feature. I'm disabling them both since it doesn't seem harmful to do so. But if I understood correctly, the two cluster settings are guarding the same thing.
rharding6373
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @wenyihu6)
Previously, wenyihu6 (Wenyi Hu) wrote…
Let me know if you want me to close the other issue. I was thinking we should have an issue open to track this work, but I'm happy to close it as well.
Ah, I see what you mean. Ok, we can leave the other one as a tracking issue for follow up work.
Previously, wenyihu6 (Wenyi Hu) wrote…
Updated.
s/"its bad performance"/"unexpected behavior"
pkg/server/drain.go line 88 at r1 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
iiuc, the original PR introduced two cluster settings to guard this feature. I'm disabling them both since it doesn't seem harmful to do so. But if I understood correctly, the two cluster settings are guarding the same thing.
I'm concerned that other jobs may rely on this now -- I'll dig to see if this is actually the case. If the other setting is sufficient to disable this feature, let's revert this one to its original default of 10s.
rharding6373
left a comment
pending the question about the necessity of including server.shutdown.jobs_wait. Also asked about its blast radius here: https://cockroachlabs.slack.com/archives/C02DSDS9TM1/p1714744237807489
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @andyyang890 and @wenyihu6)
wenyihu6
left a comment
ack, I reverted the change for server.shutdown.jobs_wait.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andyyang890 and @rharding6373)
cockroachdb#102717 introduced a mechanism in 23.2 allowing for an orderly shutdown of changefeed aggregators with up-to-date frontier information, to reduce duplicates during a node restart. However, we've recently identified a bug that could lead to unexpected behavior. It's unclear where the bug is yet. This patch disables the feature by default.

Fixes: cockroachdb#123371

Release note: Disabled a changefeed optimization for reducing duplicates during aggregator restarts due to its bad performance.
All required checks passed. bors r+
In #102717, we introduced a mechanism to reduce message duplicates by re-loading job progress upon resuming. However, that PR forgot to account for initial scans. During initial scans, we rely on an important prerequisite: the resolved timestamps of all spans must be empty, which is ensured by keeping the initial highwater at zero. The PR forwarded spans with resolved timestamps to their most up-to-date status in reconcileJobStateWithLocalState. This resulted in dropped events, since we now sometimes build a new frontier with all spans forwarded to initialResolved (pkg/ccl/changefeedccl/changefeed_processors.go, line 535 at df19639).
This patch disables this feature by default.
Fixes: #123371
Release note: Disabled a changefeed optimization for reducing duplicates during aggregator restarts due to its bad performance.