changefeedccl: disable drain watcher by default#123370
craig[bot] merged 1 commit into cockroachdb:master
Conversation
🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf. It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?
rharding6373
left a comment
Thank you for finding this and getting a mitigation out so quickly!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @wenyihu6)
-- commits line 2 at r1:
minor nit/educational comment: do we colloquially refer to this feature as "drain watcher", or can we call it something like "shutdown checkpointing" to reflect the cluster setting that enables it?
-- commits line 4 at r1:
nit: "introduced a mechanism in 23.2" instead of "recently ...".
-- commits line 9 at r1:
Fixes: #123371
-- commits line 11 at r1:
reminder to fill this in with the known details so far and update the PR description in github
pkg/server/drain.go line 88 at r1 (raw file):
"the maximum amount of time a server waits for all currently executing jobs "+ "to notice drain request and to perform orderly shutdown", 0,
Why is this change required too?
Force-pushed from 5820752 to 1dadf50
wenyihu6
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @rharding6373)
Previously, rharding6373 (Rachael Harding) wrote…
minor nit/educational comment: do we colloquially refer to this feature as "drain watcher", or can we call it something like "shutdown checkpointing" to reflect the cluster setting that enables it?
I'm not sure what this feature was called. I think the major work was the drain watcher on each aggregator. Aggregators forward an updated checkpointing information before being drained. I believe the same code path is hit for both shutting down aggregators and node draining.
Previously, rharding6373 (Rachael Harding) wrote…
nit: "introduced a mechanism in 23.2" instead of "recently ...".
Done.
Previously, rharding6373 (Rachael Harding) wrote…
Fixes: #123371
Let me know if you want me to close the other issue. I was thinking we should have an issue open to track this work, but I'm happy to close it as well.
Previously, rharding6373 (Rachael Harding) wrote…
reminder to fill this in with the known details so far and update the PR description in github
Updated.
pkg/server/drain.go line 88 at r1 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
Why is this change required too?
iiuc, the original PR introduced two cluster settings to guard this feature. I'm disabling them both since it doesn't seem harmful to do so. But if I understood correctly, the two cluster settings are guarding the same thing.
rharding6373
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @wenyihu6)
Previously, wenyihu6 (Wenyi Hu) wrote…
Let me know if you want me to close the other issue. I was thinking we should have an issue open to track this work, but I'm happy to close it as well.
Ah, I see what you mean. Ok, we can leave the other one as a tracking issue for follow up work.
Previously, wenyihu6 (Wenyi Hu) wrote…
Updated.
s/"its bad performance"/"unexpected behavior"
pkg/server/drain.go line 88 at r1 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
iiuc, the original PR introduced two cluster settings to guard this feature. I'm disabling them both since it doesn't seem harmful to do so. But if I understood correctly, the two cluster settings are guarding the same thing.
I'm concerned that other jobs may rely on this now -- I'll dig to see if this is actually the case. If the other setting is sufficient to disable this feature, let's revert this one to its original default of 10s.
rharding6373
left a comment
pending the question about the necessity of including server.shutdown.jobs_wait. Also asked about its blast radius here: https://cockroachlabs.slack.com/archives/C02DSDS9TM1/p1714744237807489
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @andyyang890 and @wenyihu6)
wenyihu6
left a comment
ack, I reverted the change for server.shutdown.jobs_wait.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andyyang890 and @rharding6373)
cockroachdb#102717 introduced a mechanism in 23.2 allowing for an orderly shutdown of changefeed aggregators with up-to-date frontier information, to reduce duplicates during a node restart. However, we've recently identified a bug that could lead to unexpected behavior. It's unclear where the bug is yet. This patch disables the feature by default.

Fixes: cockroachdb#123371

Release note: Disabled a changefeed optimization for reducing duplicates during aggregator restarts due to its bad performance.
All required checks passed. bors r+
In #102717, we introduced a mechanism to reduce message duplicates by re-loading job progress upon resuming. However, that PR forgot to account for initial scans. During initial scans, we rely on an important prerequisite: the resolved timestamps of all spans must be empty, which is ensured by keeping the initial highwater at zero. The PR forwarded spans with resolved timestamps to their most up-to-date status in reconcileJobStateWithLocalState. This resulted in dropped events, since we now sometimes build a new frontier with all spans forwarded to initialResolved (pkg/ccl/changefeedccl/changefeed_processors.go, line 535 at df19639).
This patch disables this feature by default.
Fixes: #123371
Release note: Disabled a changefeed optimization for reducing duplicates during aggregator restarts due to its bad performance.