
changefeedccl: disable drain watcher by default#123370

Merged
craig[bot] merged 1 commit into cockroachdb:master from wenyihu6:hotfix
May 3, 2024

Conversation

@wenyihu6
Contributor

@wenyihu6 wenyihu6 commented May 1, 2024

In #102717, we introduced a mechanism to reduce message duplicates by re-loading job progress upon resuming. However, that PR failed to account for initial scans. During initial scans, we rely on an important prerequisite: the resolved timestamps of all spans must be empty, which is ensured by keeping the initial highwater at zero. The PR forwarded spans' resolved timestamps to their most up-to-date state in reconcileJobStateWithLocalState. This resulted in dropped events, since we now sometimes build a new frontier with all spans forwarded to initialResolved:

ca.frontier, err = makeSchemaChangeFrontier(initialHighWater, spans...)

We should add more test coverage and rethink the logic here, though.
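The broken invariant can be shown with a minimal, hypothetical Go sketch. The types and names below are illustrative only, not the actual changefeedccl code: the point is that duplicate suppression skips any event at or below a span's resolved timestamp, so forwarding resolved timestamps before an initial scan completes silently drops the scan's events.

```go
package main

import "fmt"

// span is a toy stand-in for a changefeed watched span; resolved == 0 means
// "initial scan not yet complete", which is the prerequisite described above.
type span struct {
	key      string
	resolved int64
}

// shouldEmit models duplicate suppression: events at or below the span's
// resolved timestamp are assumed to have been emitted before the restart.
func shouldEmit(s span, eventTS int64) bool {
	return eventTS > s.resolved
}

func main() {
	// Correct initial-scan state: resolved stays zero, so every event is emitted.
	fmt.Println(shouldEmit(span{key: "a", resolved: 0}, 5)) // true

	// Buggy state: reconciliation forwarded the span to resolved=10, so the
	// initial-scan event at timestamp 5 is dropped as a "duplicate".
	fmt.Println(shouldEmit(span{key: "a", resolved: 10}, 5)) // false
}
```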

This patch disables this feature by default.

Fixes: #123371
Release note: Disabled a changefeed optimization for reducing duplicates during
aggregator restarts due to its bad performance.

@wenyihu6 wenyihu6 requested review from a team as code owners May 1, 2024 13:49
@wenyihu6 wenyihu6 requested review from andyyang890 and removed request for a team May 1, 2024 13:49
@blathers-crl

blathers-crl bot commented May 1, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@wenyihu6 wenyihu6 requested review from rharding6373 and removed request for a team May 1, 2024 13:54
@rharding6373 rharding6373 added the backport-23.2.x (PAST MAINTENANCE SUPPORT: 23.2 patch releases via ER request only) and backport-23.2.5-rc labels May 1, 2024
Collaborator

@rharding6373 rharding6373 left a comment


Thank you for finding this and getting a mitigation out so quickly!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @wenyihu6)


-- commits line 2 at r1:
minor nit/educational comment: do we colloquially refer to this feature as "drain watcher", or can we call it something like "shutdown checkpointing" to reflect the cluster setting that enables it?


-- commits line 4 at r1:
nit: "introduced a mechanism in 23.2" instead of "recently ...".


-- commits line 9 at r1:
Fixes: #123371


-- commits line 11 at r1:
reminder to fill this in with the known details so far and update the PR description in github


pkg/server/drain.go line 88 at r1 (raw file):

		"the maximum amount of time a server waits for all currently executing jobs "+
			"to notice drain request and to perform orderly shutdown",
		0,

Why is this change required too?

@wenyihu6 wenyihu6 force-pushed the hotfix branch 2 times, most recently from 5820752 to 1dadf50 May 2, 2024 20:05
Contributor Author

@wenyihu6 wenyihu6 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @rharding6373)


-- commits line 2 at r1:

Previously, rharding6373 (Rachael Harding) wrote…

minor nit/educational comment: do we colloquially refer to this feature as "drain watcher", or can we call it something like "shutdown checkpointing" to reflect the cluster setting that enables it?

I'm not sure what this feature was called. I think the major work was the drain watcher on each aggregator: aggregators forward updated checkpoint information before being drained. I believe the same code path is hit both when aggregators shut down and when a node drains.


-- commits line 4 at r1:

Previously, rharding6373 (Rachael Harding) wrote…

nit: "introduced a mechanism in 23.2" instead of "recently ...".

Done.


-- commits line 9 at r1:

Previously, rharding6373 (Rachael Harding) wrote…

Fixes: #123371

Let me know if you want me to close the other issue. I was thinking we should keep an issue open to track this work, but I'm happy to close it as well.


-- commits line 11 at r1:

Previously, rharding6373 (Rachael Harding) wrote…

reminder to fill this in with the known details so far and update the PR description in github

Updated.


pkg/server/drain.go line 88 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Why is this change required too?

IIUC, the original PR introduced two cluster settings to guard this feature. I'm disabling them both since that doesn't seem harmful, and if I understand correctly, the two cluster settings guard the same thing.

@wenyihu6 wenyihu6 requested a review from rharding6373 May 2, 2024 20:12
@wenyihu6 wenyihu6 added the backport-24.1.x (Flags PRs that need to be backported to 24.1.) and backport-24.1.0-rc labels May 2, 2024
Collaborator

@rharding6373 rharding6373 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @wenyihu6)


-- commits line 9 at r1:

Previously, wenyihu6 (Wenyi Hu) wrote…

Let me if you want me to close the other issue. I was thinking we should have an issue open to track this work, but I'm happy to close it as well.

Ah, I see what you mean. Ok, we can leave the other one as a tracking issue for follow up work.


-- commits line 11 at r1:

Previously, wenyihu6 (Wenyi Hu) wrote…

Updated.

s/"its bad performance"/"unexpected behavior"


pkg/server/drain.go line 88 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

iiuc, the original pr introduced two cluster settings to guard this feature. I'm disabling them both since it doesn't seem harmful to disable them both. But if I understood correctly, the two cluster settings are guarding the same thing.

I'm concerned that other jobs may rely on this now -- I'll dig to see if this is actually the case. If the other setting is sufficient to disable this feature, let's revert this one to its original default of 10s.

Collaborator

@rharding6373 rharding6373 left a comment


:lgtm: pending the question about the necessity of including server.shutdown.jobs_wait. Also asked about its blast radius here: https://cockroachlabs.slack.com/archives/C02DSDS9TM1/p1714744237807489

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andyyang890 and @wenyihu6)

Contributor Author

@wenyihu6 wenyihu6 left a comment


Ack, I reverted the change for server.shutdown.jobs_wait.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andyyang890 and @rharding6373)

cockroachdb#102717 introduced a mechanism in 23.2 allowing for an orderly shutdown of
changefeed aggregators and up-to-date frontier information to reduce duplicates
during a node restart. However, we've recently identified a bug that could
lead to unexpected behavior. It's unclear where the bug is yet. This patch disables
this feature by default.

Fixes: cockroachdb#123371
Release note: Disabled a changefeed optimization for reducing duplicates during
aggregator restarts due to its bad performance.
@rharding6373
Collaborator

All required checks passed

bors r+

@craig
Contributor

craig bot commented May 3, 2024

@craig craig bot merged commit fc9f953 into cockroachdb:master May 3, 2024

Labels

backport-23.2.x PAST MAINTENANCE SUPPORT: 23.2 patch releases via ER request only
backport-24.1.x Flags PRs that need to be backported to 24.1.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cdc: potentially dropping events during initial scan

3 participants