Skip to content

kvpb: mark BarrierRequest as isAlone#117787

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
erikgrinaker:barrier-is-alone
Jan 15, 2024
Merged

kvpb: mark BarrierRequest as isAlone#117787
craig[bot] merged 1 commit intocockroachdb:masterfrom
erikgrinaker:barrier-is-alone

Conversation

@erikgrinaker
Copy link
Copy Markdown
Contributor

@erikgrinaker erikgrinaker commented Jan 15, 2024

Otherwise, BatchRequest.RequiresConsensus() may return false and not submit the barrier through Raft. Similarly, shouldWaitOnLatchesWithoutAcquiring will return false so it will contend with later writes.

Barriers are not used in recent releases, so this does not have any mixed-version concerns.

Epic: none
Release note: None

@erikgrinaker erikgrinaker requested a review from nvb January 15, 2024 16:33
@erikgrinaker erikgrinaker self-assigned this Jan 15, 2024
@erikgrinaker erikgrinaker requested a review from a team as a code owner January 15, 2024 16:33
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jan 15, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@erikgrinaker erikgrinaker changed the title kvpb: mark Barrier as isAlone kvpb: mark BarrierRequest as isAlone Jan 15, 2024
Otherwise, `BatchRequest.RequiresConsensus()` may return `false` and not
submit the barrier through Raft. Similarly,
`shouldWaitOnLatchesWithoutAcquiring` will return `false` so it will
contend with later writes.

Barriers are not used in recent releases, so this does not have any
mixed-version concerns.

Epic: none
Release note: None
@erikgrinaker
Copy link
Copy Markdown
Contributor Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 15, 2024

Build succeeded:

@craig craig bot merged commit 75f19be into cockroachdb:master Jan 15, 2024
craig bot pushed a commit that referenced this pull request Jan 23, 2024
117612: rangefeed: fix premature checkpoint due to intent resolution race r=erikgrinaker a=erikgrinaker

This PR depends on the following changes, which should be backported together:

* #117787
* #117859
* #117967
* #117968
* #117969

---

It was possible for rangefeeds to emit a premature checkpoint, before all writes below its timestamp had been emitted. This in turn would cause changefeeds to not emit these write events at all.

This could happen because `PushTxn` may return a false `ABORTED` status for a transaction that has in fact been committed, if the transaction record has been GCed (after resolving all intents). The timestamp cache does not retain sufficient information to disambiguate a committed transaction from an aborted one in this case, so it pessimistically
assumes an abort (see `Replica.CanCreateTxnRecord` and `batcheval.SynthesizeTxnFromMeta`).

However, the rangefeed txn pusher trusted this `ABORTED` status, ignoring the pending txn intents and allowing the resolved timestamp to advance past them before emitting the committed intents. This can lead to the following scenario:

- A rangefeed is running on a lagging follower.
- A txn writes an intent, which is replicated to the follower.
- The closed timestamp advances past the intent.
- The txn commits and resolves the intent at the original write timestamp, then GCs its txn record. This is not yet applied on the follower.
- The rangefeed pushes the txn to advance its resolved timestamp.
- The txn is GCed, so the push returns ABORTED (it can't know whether the txn was committed or aborted after its record is GCed).
- The rangefeed disregards the "aborted" txn and advances the resolved timestamp, emitting a checkpoint.
- The follower applies the resolved intent and emits an event below the checkpoint, violating the checkpoint guarantee.
- The changefeed sees an event below its frontier and drops it, never emitting these events at all.

This patch fixes the bug by submitting a barrier command to the leaseholder which waits for all past and ongoing writes (including intent resolution) to complete and apply, and then waits for the local replica to apply the barrier as well. This ensures any committed intent resolution will be applied and emitted before the transaction is removed from resolved timestamp tracking.

Resolves #104309.
Epic: none
Release note (bug fix): fixed a bug where a changefeed could omit events in rare cases, logging the error "cdc ux violation: detected timestamp ... that is less or equal to the local frontier". This can happen if a rangefeed runs on a follower replica that lags significantly behind the leaseholder, a transaction commits and removes its transaction record before its intent resolution is applied on the follower, the follower's closed timestamp has advanced past the transaction commit timestamp, and the rangefeed attempts to push the transaction to a new timestamp (at least 10 seconds after the transaction began). This may cause the rangefeed to prematurely emit a checkpoint before emitting writes at lower timestamps, which in turn may cause the changefeed to drop these events entirely, never emitting them.


Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants