Skip to content

synchro: fix snapshot having an outdated synchro state#11858

Merged
Gerold103 merged 11 commits intomasterfrom
gerold103/gh-11754-qsync-outdated-snapshot
Oct 31, 2025
Merged

synchro: fix snapshot having an outdated synchro state#11858
Gerold103 merged 11 commits intomasterfrom
gerold103/gh-11754-qsync-outdated-snapshot

Conversation

@Gerold103
Copy link
Collaborator

See #11754.

@Gerold103 Gerold103 self-assigned this Sep 25, 2025
@Gerold103 Gerold103 force-pushed the gerold103/gh-11754-qsync-outdated-snapshot branch from 9118286 to 39eafce Compare September 25, 2025 22:06
@coveralls
Copy link

coveralls commented Sep 25, 2025

Coverage Status

coverage: 87.645% (+0.002%) from 87.643%
when pulling c0f722e on gerold103/gh-11754-qsync-outdated-snapshot
into 24cdd89
on master
.

@Gerold103 Gerold103 marked this pull request as ready for review September 25, 2025 22:36
@Gerold103 Gerold103 requested a review from a team as a code owner September 25, 2025 22:36
Copy link
Contributor

@Serpentian Serpentian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've taken a glance and will properly review the patch tomorrow, these're first comments I have

@Gerold103 Gerold103 force-pushed the gerold103/gh-11754-qsync-outdated-snapshot branch from 39eafce to af8acae Compare October 20, 2025 20:57
@Serpentian
Copy link
Contributor

I've checked out the commits and see no flaws, everything is clean and understandable. Let's just fix the EE and we're good to go. Thank you a lot for working on this, I really liked, how commits are splitted

@Gerold103 Gerold103 force-pushed the gerold103/gh-11754-qsync-outdated-snapshot branch from af8acae to 0459dcb Compare October 22, 2025 20:46
@Gerold103 Gerold103 requested a review from Serpentian October 22, 2025 21:04
Copy link
Contributor

@Serpentian Serpentian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the patches! Clean and well splitted. I'm really glad, that the protocol related stuff is finally back in the relay's code!

Should be merged alongside the https://github.com/tarantool/tarantool-ee/pull/1510

@sergepetrenko sergepetrenko linked an issue Oct 28, 2025 that may be closed by this pull request
@sergepetrenko sergepetrenko added backport/3.2 Automatically create a 3.2 backport PR backport/3.3 Automatically create a 3.3 backport PR backport/3.4 Automatically create a 3.4 backport PR backport/3.5 Automatically create a 3.5 backport PR labels Oct 31, 2025
Copy link
Collaborator

@sergepetrenko sergepetrenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gerold103, thanks for the patch!

I have no objections, LGTM.

Please, rebase on top of current master and we'll be good to go.

The test had a hardcoded value of ETIMEDOUT in one of the checked
error messages. The value didn't match the one on used on one of
the MacOS versions.

Lets make this value determined at runtime, so the test is
platform-agnostic.

NO_DOC=test
NO_CHANGELOG=test
It was previously only available directly in WAL API. This forced
the checkpoint build code to rely on WAL, breaking the journal's
API encapsulation.

While now it is not a big deal, it is going to become one in the
next commits, where the checkpoint build code is moved into
another place not having WAL as a dependency. Lets not drag WAL
there and generalize the journal checkpoint API.

Needed for #11754

NO_DOC=refactoring
NO_CHANGELOG=refactoring
NO_TEST=refactoring
Those files were about atomic build of the "global checkpoint" of
all transaction-related control states with no observable
after-effects (nothing changed on disk and in any of those control
states).

But this code conveniently contains a lot of intricate steps which
could be reused for an on-disk checkpoint creation (snapshot).

That would turn it into something bigger than just "txn"
checkpoint. Lets prepare to that by firstly renaming the files.

The next commit will change the contents. It is separated from the
file rename in order not to confuse git into thinking that the
old txn_checkpoint files are rewritten from scratch.

Needed for #11754

NO_DOC=refactoring
NO_CHANGELOG=refactoring
NO_TEST=refactoring
txn_checkpoint was about atomic build of the "global checkpoint"
of all transaction-related control states with no observable
after-effects (nothing changed on disk and in any of those control
states).

But this code conveniently contains a lot of intricate steps which
could be reused for an on-disk checkpoint creation (snapshot).

That would turn it into something bigger than just "txn"
checkpoint. Lets prepare to that by renaming the struct and
related functions.

Needed for #11754

NO_DOC=refactoring
NO_CHANGELOG=refactoring
NO_TEST=refactoring
box/checkpoint.h+c already contains the in-memory control states
checkpoint creation code.

Lets also move here the creation of the on-disk checkpoint. This
will make it easier to reuse the quite nontrivial common code
between these two kinds of checkpoints.

This commit though only moves the code and doesn't intend to make
any functional changes.

Part of #11754

NO_DOC=refactoring
NO_CHANGELOG=refactoring
NO_TEST=refactoring
It contains the data that needs to be known by the specific
engines to create their checkpoints.

Previously it was only a flag, but now it also contains the
box_checkpoint object. This one will steal some work from the
memtx checkpoint in the next commits, but its data is still going
to be written into "memtx checkpoint" aka snapshot.

Part of #11754

NO_DOC=refactoring
NO_CHANGELOG=refactoring
NO_TEST=refactoring
The struct name already has "checkpoint" suffix in it. Lets remove
it from the member names. It was too verbose and long.

In scope of #11754

NO_DOC=refactoring
NO_CHANGELOG=refactoring
NO_TEST=refactoring
Previously the collection of it was done in the memtx code. But it
isn't done right and doesn't belong to just one engine.

Lets move this code into an engine-agnostic place, into
checkpoint.c. It is logically more natural and will allow to reuse
the existing code for in-memory checkpoint build.

The commit intends no functional changes and only moves the code,
so it would be easier to actually fix it in the next commit.

Part of #11754

NO_DOC=refactoring
NO_CHANGELOG=refactoring
NO_TEST=refactoring
It is going to be needed higher in the file in the next commits.
Lets move it, so the next commits would only change a small part
in it instead of having a 100% diff it due to the move.

In scope of #11754

NO_DOC=refactoring
NO_CHANGELOG=refactoring
NO_TEST=refactoring
There was a bug that the snapshot creation code would do the
synchronous transaction control state collection too early. It
was fully done even before the engines would begin their
checkpoints.

That led to a problem that the limbo and Raft checkpoints could
have outdated confirmation LSN, terms, and anything else that is
supposed to be written into the snapshot.

The easiest way to reproduce it was to just start a snapshot when
there is at least one pending synchronous transaction. Then the
snap file would get the old confirmation LSN, but would also have
this transaction's data.

That seemed not to lead to any noticeable and visible bugs except
when one explicitly reads the snapshot file and observes the old
LSN or a term. But that still didn't look safe.

The solution is to just collect these control states the same way
as the in-memory checkpoint is being collected.

Closes #11754

NO_DOC=bugfix
Previously the join/fetch-snapshot procedure would build
checkpoints of the global engine-agnostic states (like limbo,
raft, journal) right inside memtx_engine code.

That didn't look right for memtx to take care of such broad things
which really cover not only memtx. It also forced the relay to
pass into memtx some additional context via struct
engine_join_ctx.

Lets move those things out. It should be fine to do them right in
the relay code. That makes memtx join code more specific, and even
no longer forces memtx join code to be executed first (probably
it still should be, but for other reasons).

Follow up for #11754

NO_DOC=refactoring
NO_CHANGELOG=refactoring
NO_TEST=refactoring
@Gerold103 Gerold103 force-pushed the gerold103/gh-11754-qsync-outdated-snapshot branch from 0459dcb to c0f722e Compare October 31, 2025 18:10
@Gerold103 Gerold103 added the full-ci Enables all tests for a pull request label Oct 31, 2025
@Gerold103 Gerold103 merged commit 3fcf9ab into master Oct 31, 2025
58 of 59 checks passed
@Gerold103 Gerold103 deleted the gerold103/gh-11754-qsync-outdated-snapshot branch October 31, 2025 22:17
@TarantoolBot
Copy link
Collaborator

Backport failed for release/3.2, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin release/3.2
git worktree add -d .worktree/backport/release/3.2/11858 origin/release/3.2
cd .worktree/backport/release/3.2/11858
git switch --create backport/release/3.2/11858
git cherry-pick -x 0d329244c74693a0ae6ed475b3ac7cca5fa9bafb 07770c7033aec3eab83f2ea604df614fdf9f5008 0773c30b93b2c9a674fdbf0e4f9484bc99f90860 a1dc2019e656abdc2fe3c99b8f5502eaf90b83c5 65882120dce9b100a589b7d0aeda3ca549df47e3 a09e4710a36bc4f53fe8a746bee2a419b757b0b2 7a587df06042bbad4d5a1350abb6984b1521c912 0a4aeb3b2ec3872a63adbbde2de37a93abaaa1ff ca7ad72765a191bca8a23c7903694e626e3f52b0 c9acf8a2ad173e831660bb3486a5d337679d5268 3fcf9ab43ea1fe4c62f650fdd2b727e271b77d71

@TarantoolBot
Copy link
Collaborator

Backport failed for release/3.3, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin release/3.3
git worktree add -d .worktree/backport/release/3.3/11858 origin/release/3.3
cd .worktree/backport/release/3.3/11858
git switch --create backport/release/3.3/11858
git cherry-pick -x 0d329244c74693a0ae6ed475b3ac7cca5fa9bafb 07770c7033aec3eab83f2ea604df614fdf9f5008 0773c30b93b2c9a674fdbf0e4f9484bc99f90860 a1dc2019e656abdc2fe3c99b8f5502eaf90b83c5 65882120dce9b100a589b7d0aeda3ca549df47e3 a09e4710a36bc4f53fe8a746bee2a419b757b0b2 7a587df06042bbad4d5a1350abb6984b1521c912 0a4aeb3b2ec3872a63adbbde2de37a93abaaa1ff ca7ad72765a191bca8a23c7903694e626e3f52b0 c9acf8a2ad173e831660bb3486a5d337679d5268 3fcf9ab43ea1fe4c62f650fdd2b727e271b77d71

@TarantoolBot
Copy link
Collaborator

Backport failed for release/3.4, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin release/3.4
git worktree add -d .worktree/backport/release/3.4/11858 origin/release/3.4
cd .worktree/backport/release/3.4/11858
git switch --create backport/release/3.4/11858
git cherry-pick -x 0d329244c74693a0ae6ed475b3ac7cca5fa9bafb 07770c7033aec3eab83f2ea604df614fdf9f5008 0773c30b93b2c9a674fdbf0e4f9484bc99f90860 a1dc2019e656abdc2fe3c99b8f5502eaf90b83c5 65882120dce9b100a589b7d0aeda3ca549df47e3 a09e4710a36bc4f53fe8a746bee2a419b757b0b2 7a587df06042bbad4d5a1350abb6984b1521c912 0a4aeb3b2ec3872a63adbbde2de37a93abaaa1ff ca7ad72765a191bca8a23c7903694e626e3f52b0 c9acf8a2ad173e831660bb3486a5d337679d5268 3fcf9ab43ea1fe4c62f650fdd2b727e271b77d71

@TarantoolBot
Copy link
Collaborator

Backport failed for release/3.5, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin release/3.5
git worktree add -d .worktree/backport/release/3.5/11858 origin/release/3.5
cd .worktree/backport/release/3.5/11858
git switch --create backport/release/3.5/11858
git cherry-pick -x 0d329244c74693a0ae6ed475b3ac7cca5fa9bafb 07770c7033aec3eab83f2ea604df614fdf9f5008 0773c30b93b2c9a674fdbf0e4f9484bc99f90860 a1dc2019e656abdc2fe3c99b8f5502eaf90b83c5 65882120dce9b100a589b7d0aeda3ca549df47e3 a09e4710a36bc4f53fe8a746bee2a419b757b0b2 7a587df06042bbad4d5a1350abb6984b1521c912 0a4aeb3b2ec3872a63adbbde2de37a93abaaa1ff ca7ad72765a191bca8a23c7903694e626e3f52b0 c9acf8a2ad173e831660bb3486a5d337679d5268 3fcf9ab43ea1fe4c62f650fdd2b727e271b77d71

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport/3.2 Automatically create a 3.2 backport PR backport/3.3 Automatically create a 3.3 backport PR backport/3.4 Automatically create a 3.4 backport PR backport/3.5 Automatically create a 3.5 backport PR full-ci Enables all tests for a pull request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Snapshot might contain outdated limbo+raft checkpoints

6 participants