
mvcc: ensure serializability by fixing dirty reads and secondary index duplicates#11662

Merged
sergepetrenko merged 6 commits into tarantool:master from Astronomax:gh-11660-rollback-may-cause-duplicates
Oct 27, 2025

Conversation

Contributor

@Astronomax Astronomax commented Jul 13, 2025

The patchset enforces serializable isolation by addressing the following MVCC bugs:

  • Rollback may cause duplicates in secondary indexes.
    To guarantee the absence of duplicates in secondary indexes, MVCC maintains the following invariant for all in-progress transactions:

    If an in-progress story `x` conflicts with a story `y` in some secondary index
    (i.e., they have the same key in that index), then `x` must also conflict with `y` in the primary key.
    

    x and y may belong to the same transaction or different ones. The case where x and y belong to the same transaction is trivial. If x and y belong to different transactions, then y must be the last prepared story in the chain corresponding to that index. This implies that the invariant may break for some transactions when the last prepared story in a chain changes (either when another story becomes last or when the last story's del_psn becomes 0). All such cases must be handled - any transactions that violate the invariant (i.e., start duplicating prepared tuples) must be aborted.

    Rollback often leads to changes in the last prepared story within chains. However, this case was previously overlooked, which could result in duplicates after rollback. This patch adds the missing handling for rollback scenarios.

    Closes #11660 (mvcc: rollback may cause duplicates in secondary indexes)

  • Insert-after-delete in single transaction may cause duplicates in secondary indexes.
    The issue was related to MVCC assuming that if it inserts a tuple x={key, ...} that doesn't conflict with any other tuple on the primary key key (because it previously executed delete(key) removing some tuple y={key, ...} with the same primary key), then x couldn't possibly be a duplicate in any secondary index. However, this is obviously false. To make such a conclusion for a specific secondary index, the deleted tuple y must have the same key value in that index as tuple x.

    The problem stemmed from the is_own_change flag in the transactional statement: a single statement-level flag cannot say anything about individual secondary indexes.

    Similar flags were introduced separately for each index. The statement-level flag remains, but it now has a different name (is_own_delete) and semantics. This flag is only used for DELETE statements; for INSERT/REPLACE, it is always false. delete_stmt->is_own_delete means the statement will either delete some tuple from the same transaction or won't delete anything because the same transaction previously deleted this key.

    For INSERT/REPLACE statements, stmt->is_own_change has been replaced by stmt->add_story->link[0].is_own_change, which is equivalent to the stmt->is_own_change that existed before this commit.

    Closes #11686 (mvcc: insert-after-delete in single transaction may cause duplicates in secondary indexes)

  • Get-after-replace in single transaction may cause dirty read in secondary index.
    The issue was related to MVCC assuming that if it performed a replace with a new tuple x={key, ...} that also deleted some tuple y={key, key2} with the same primary key, then a subsequent index.sk:get(key2) in the same transaction wouldn't return anything. This would be true if the transaction had deleted y using delete rather than replace, because delete guarantees it will definitely remove exactly what was returned to the user (it tracks reads if y was inserted by another transaction; if y was inserted by the same transaction, this guarantee arises automatically).

    This issue was automatically fixed by the same is_own_change flag-related fix described above. Now when MVCC performs get on a secondary key, it checks the is_own_change flag for that secondary index and determines whether it can automatically guarantee it won't read anything, or whether it needs to create a gap tracker to enforce this guarantee.

    Closes #11687 (mvcc: get-after-replace in single transaction may cause dirty read in secondary index)

  • Rollback may cause dirty gap read in secondary index.
    The essence of the problem: One REPLACE transaction (prepared but not yet committed) creates a temporary situation where another concurrent transaction cannot see a specific key in a secondary index (a "read gap"). When the first transaction then rolls back, the previous (replaced) tuple becomes visible again, and the "read gap" becomes irrelevant (the key becomes visible once more). In this case, the second transaction, which read the gap, should be aborted. However, it successfully commits, leading to a non-serializable schedule.

    This patch fixes the issue by adding the necessary handling during rollback. Now, all such irrelevant gaps are aborted.

    Closes #11802 (mvcc: rollback may cause dirty gap read)
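
As an illustration of the get-after-replace case above, here is a hypothetical session sketch (a space with a primary key on the first field and a unique secondary index `sk` on the second; names and values are invented, not taken from the test suite):

```
space:replace{1, 10}        -- committed tuple `{1, 10}`
TXN1 begin()
TXN1 space:replace{1, 20}   -- deletes `{1, 10}` by primary key
TXN1 space.index.sk:get{10} -- -> nil; this empty read must be tracked
TXN2 begin()
TXN2 space:insert{2, 10}    -- reintroduces sk value `10`
TXN2 commit()
TXN1 commit()               -- without a gap tracker for `10`, TXN1
                            -- would commit a non-serializable read
```

With the per-index flags, MVCC sees that the replace in `TXN1` does not guarantee an empty result for `sk` value `10` and creates a gap tracker, so `TXN1` is aborted when `TXN2` commits.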

@Astronomax Astronomax self-assigned this Jul 13, 2025
@Astronomax Astronomax force-pushed the gh-11660-rollback-may-cause-duplicates branch 4 times, most recently from 7679026 to bb463b1 Compare July 17, 2025 10:47

coveralls commented Jul 17, 2025

Coverage Status: 87.641% (+0.002%) from 87.639% when pulling bfed7e0 on
Astronomax:gh-11660-rollback-may-cause-duplicates into 9fd673f on
tarantool:master.

@Astronomax Astronomax force-pushed the gh-11660-rollback-may-cause-duplicates branch from bb463b1 to 173e092 Compare July 17, 2025 11:17
@Astronomax Astronomax requested a review from a team as a code owner July 17, 2025 11:17
@Astronomax Astronomax force-pushed the gh-11660-rollback-may-cause-duplicates branch 3 times, most recently from 8c205f9 to 90c529c Compare July 17, 2025 11:31
@Astronomax Astronomax requested a review from drewdzzz July 17, 2025 12:23
@Astronomax Astronomax assigned drewdzzz and unassigned Astronomax Jul 17, 2025
Contributor Author

Astronomax commented Jul 17, 2025

Note

There is one possible optimization: during rollback, we could track all chains affected by rolled-back transactions. Then, after all necessary rollbacks are complete, we could:

  • Iterate through all tracked chains and call memtx_tx_handle_dups_in_secondary_indexes for their last prepared stories
  • Similarly handle gap trackers - in chains where the gap disappears, abort all transactions that saw it

This would reduce false positives.

Currently:

  • Transactions tracking a gap get aborted as soon as the gap disappears for the first time during rollback (even if it might reappear later after all required transactions are rolled back). The same approach is applied to duplicates in secondary indexes in this patch.
  • Such aborts may therefore be unnecessary.

The proposed approach would:

  • Defer gap tracker (and duplicates in secondary indexes) processing until rollback completion
  • Only abort transactions if the gap truly disappears (or duplicate truly appears) permanently

I'm uncertain how critical it is for us to reduce these false-positive aborts, but it's worth noting this optimization opportunity exists.
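
The deferred scheme could look roughly like the following pseudocode sketch (all names except `memtx_tx_handle_dups_in_secondary_indexes` are hypothetical):

```
-- pseudocode, not actual memtx code
on_story_rollback(story):
    touched_chains:add(chain_of(story)) -- only remember the chain

on_rollback_complete():
    for chain in touched_chains:
        -- re-check duplicates against the final last prepared story
        memtx_tx_handle_dups_in_secondary_indexes(last_prepared(chain))
        for gap in gap_trackers(chain):
            if key_visible(chain, gap.key):
                -- the gap truly disappeared for good: only now
                -- abort the transactions that read it
                abort_readers(gap)
```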

@Astronomax Astronomax force-pushed the gh-11660-rollback-may-cause-duplicates branch 5 times, most recently from 63e7234 to 9af9a1a Compare July 20, 2025 20:26
@Astronomax Astronomax changed the title from "mvcc: abort duplicates after rollback" to "mvcc: ensure serializability by fixing dirty reads and secondary index duplicates" Jul 21, 2025
@Astronomax Astronomax force-pushed the gh-11660-rollback-may-cause-duplicates branch 4 times, most recently from c3a1487 to 327939d Compare July 21, 2025 10:11
@CuriousGeorgiy CuriousGeorgiy self-requested a review August 6, 2025 14:57
@CuriousGeorgiy
Member

@Astronomax

I'm uncertain how critical it is for us to reduce these false-positive aborts, but it's worth noting this optimization opportunity exists.

Please open a ticket for it.

@Astronomax Astronomax force-pushed the gh-11660-rollback-may-cause-duplicates branch 5 times, most recently from 69f0a9a to 8a53894 Compare October 14, 2025 12:23
@drewdzzz
Contributor

@Astronomax Please, rebase to the newest master. It seems the workflow fails because of it.

@drewdzzz drewdzzz left a comment

Thank you for your fixes. Overall, it looks good to me - only a few small things left!

@drewdzzz drewdzzz left a comment

Thank you for your tremendous work on this patch! I believe this PR is one of the major milestones in memtx MVCC development.

@drewdzzz
Contributor

I suspect that Doc Team review is outdated (AFAIR, they approved this PR before the last two commits). If so, please, re-request their review.

Introduces the `abort_on_rollback` flag for inplace gap items to control
whether gap-reader transactions are aborted on rollback. This change is
a prerequisite for future fixes related to per-index `is_own_change`
flags; it will help prevent new false-positive aborts from occurring.

After you look at one of the subsequent commits ("memtx: introduce
`is_own_change` flag for each secondary index", which fixes issue
tarantool#11686), it becomes clear that some gap-reader transactions
must be aborted only while preparing a replace/insert statement, but
not during a rollback.

```
space:replace{21, 30} -- committed tuple `{21, 30}`
TXN1 begin()
TXN1 space:replace{21, 31}
TXN1 space.index.sk:select{30} -> nil
TXN2 begin()
TXN2 space:delete{21}
TXN2 commit() -> rollback due to WAL IO error
```

- The operation `space.index.sk:select{30}` in `TXN1` will now correctly
track a gap, because `space:replace{21, 31}` in `TXN1` does not
guarantee that some tuple `{x, 30}`, `x != 21`, from another concurrent
transaction won't be inserted and committed before
`space.index.sk:select{30}` from `TXN1` (see tarantool#11686,
tarantool#11687)
- Transaction `TXN2` executes `space:delete{21}`
- Transaction `TXN2` is rolled back. Since this `space:delete{21}`
deleted the tuple `{21, 30}`, it will reappear after the rollback

It turns out that some gap-reader transactions for key `30` in the
secondary index should not be rolled back. Specifically, the gap-reader
`TXN1` can be skipped. All inplace gap items corresponding to such
transactions can be marked with the `abort_on_rollback = false` flag at
the moment of their creation.

Part of tarantool#11686, tarantool#11687

NO_DOC=bugfix
NO_TEST=will be added later
NO_CHANGELOG=will be added later

Removed the `is_own_change` output parameter from `check_dup()` in
`memtx_tx.c` - now it directly modifies `txn_stmt::is_own_change`.
This change prepares for further modifications where the ownership
check logic will be extended (to be implemented in the next commit).

Part of tarantool#11686, tarantool#11687

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

This commit fixes two Memtx-MVCC related issues:

- A bug when a transaction performing insert-after-delete with the same
  primary key (e.g., delete{4} followed by insert{4, 3}) could create
  secondary key duplicates.

- A bug when a transaction performing get-after-replace could read an
  empty result from a secondary index without the read being tracked
  (a dirty read).

Both problems were caused by the `is_own_change` flag in the
transactional statement: a single statement-level flag cannot say
anything about individual secondary indexes.

Similar flags were introduced separately for each index.
The statement-level flag remains, but it now has a different name
(`is_own_delete`) and semantics. This flag is only used for DELETE
statements; for INSERT/REPLACE, it is always `false`.
`delete_stmt->is_own_delete` means the statement will either delete some
tuple from the same transaction or won't delete anything because the same
transaction previously deleted this key.

For INSERT/REPLACE statements, `stmt->is_own_change` has been replaced by
`stmt->add_story->link[0].is_own_change`, which is equivalent to the
`stmt->is_own_change` that existed before this commit.

Closes tarantool#11686, tarantool#11687

NO_DOC=bugfix

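The insert-after-delete case from tarantool#11686 can be sketched as a
hypothetical session (primary key on the first field, unique secondary
index `sk` on the second; values invented for illustration):

```
space:replace{7, 3}       -- committed tuple `{7, 3}`
space:replace{4, 5}       -- committed tuple `{4, 5}`
TXN1 begin()
TXN1 space:delete{4}      -- own delete of primary key `4`
TXN1 space:insert{4, 3}   -- no conflict on the primary key, but `3`
                          -- duplicates `{7, 3}` in `sk`
TXN1 commit()             -- must fail; with the single statement-level
                          -- flag it could succeed, leaving duplicates
```

The old `stmt->is_own_change` treated the insert as safe in every index
because the preceding delete removed a tuple with the same primary key;
the per-index flags make that conclusion valid only for indexes where
the deleted tuple `{4, 5}` and the inserted tuple `{4, 3}` share the key.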
Duplicates need to be processed (aborted) not only during prepare,
but also during rollback (see the next commit). Factor this processing
out into a separate function.

Part of tarantool#11660

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

This commit fixes a Memtx-MVCC related bug that could lead to duplicates
in secondary indexes after rollback.

To guarantee the absence of duplicates in secondary indexes, MVCC
maintains the following invariant for all in-progress transactions:
```
If an in-progress story `x` conflicts with a story `y` in some secondary
index (i.e., they have the same key in that index), then `x` must also
conflict with `y` in the primary key.
```
`x` and `y` may belong to the same transaction or different ones. The case
where `x` and `y` belong to the same transaction is trivial. If `x` and `y`
belong to different transactions, then `y` must be the last prepared story
in the chain corresponding to that index. This implies that the invariant
may break for some transactions when the last prepared story in a chain
changes (either when another story becomes last or when the last story's
`del_psn` becomes 0). All such cases must be handled - any transactions
that violate the invariant (i.e., start duplicating prepared tuples) must
be aborted.

Rollback often leads to changes in the last prepared story within chains.
However, this case was previously overlooked, which could result in
duplicates after rollback. This commit adds the missing handling for
rollback scenarios.

Closes tarantool#11660

NO_DOC=bugfix

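The rollback scenario can be sketched as a hypothetical session
(primary key on the first field, unique secondary index `sk` on the
second; values invented for illustration):

```
space:replace{1, 10}      -- committed tuple `{1, 10}`
TXN1 begin()
TXN1 space:replace{1, 20} -- sk value `10` is no longer visible
TXN1 commit()             -- prepared, waiting for WAL
TXN2 begin()
TXN2 space:insert{2, 10}  -- allowed: no visible duplicate of `10`
TXN1 -> rollback due to WAL IO error -- `{1, 10}` reappears
TXN2 commit()             -- previously could succeed, producing two
                          -- tuples with sk value `10`; with this fix
                          -- TXN2 is aborted during TXN1's rollback
```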
This commit fixes a Memtx-MVCC related bug that could lead to dirty gap
read in secondary indexes after rollback.

One REPLACE transaction (prepared but not yet committed) creates a
temporary situation where another concurrent transaction cannot see a
specific key in a secondary index (a "read gap"). When the first
transaction then rolls back, the previous (replaced) tuple becomes visible
again, and the "read gap" becomes irrelevant (the key becomes visible once
more). In this case, the second transaction, which read the gap, should be
aborted. However, it successfully commits, leading to a non-serializable
schedule.

This commit fixes the issue by adding the necessary handling during
rollback. Now, all such irrelevant gaps are aborted.

Closes tarantool#11802

NO_DOC=bugfix
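
The dirty gap read can be sketched as a hypothetical session (primary
key on the first field, secondary index `sk` on the second; values
invented for illustration):

```
space:replace{21, 30}           -- committed tuple `{21, 30}`
TXN1 begin()
TXN1 space:replace{21, 31}      -- sk value `30` becomes invisible
TXN1 commit()                   -- prepared, waiting for WAL
TXN2 begin()
TXN2 space.index.sk:select{30}  -- -> empty: TXN2 reads a gap
TXN1 -> rollback due to WAL IO error -- `{21, 30}` reappears
TXN2 commit()                   -- previously succeeded despite the
                                -- dirty gap read; with this fix TXN2
                                -- is aborted
```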
@TarantoolBot
Collaborator

Backport failed for release/3.2, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

```
git fetch origin release/3.2
git worktree add -d .worktree/backport/release/3.2/11662 origin/release/3.2
cd .worktree/backport/release/3.2/11662
git switch --create backport/release/3.2/11662
git cherry-pick -x 9a000e50dfb2693f817bf899b1b02473b76c7a36 78fc900ec841ec6cfe14b167ab0884b5a542609d 76ba7ee9eacb0815126b9cbc61fb92b3b543d2f8 d4f9f9b9570201b06a649d87b3ff23d4053c7d44 bd6e12b3ca90bc9b698ad15a0e41583baae821b9 2a9a463e5c303dca9c877f6e0e89d4d6665898de
```

@TarantoolBot
Collaborator

Backport failed for release/3.3, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

```
git fetch origin release/3.3
git worktree add -d .worktree/backport/release/3.3/11662 origin/release/3.3
cd .worktree/backport/release/3.3/11662
git switch --create backport/release/3.3/11662
git cherry-pick -x 9a000e50dfb2693f817bf899b1b02473b76c7a36 78fc900ec841ec6cfe14b167ab0884b5a542609d 76ba7ee9eacb0815126b9cbc61fb92b3b543d2f8 d4f9f9b9570201b06a649d87b3ff23d4053c7d44 bd6e12b3ca90bc9b698ad15a0e41583baae821b9 2a9a463e5c303dca9c877f6e0e89d4d6665898de
```

@sergepetrenko
Collaborator

@Astronomax, please, prepare backports for 3.2 and 3.3


Labels

  • backport/3.2: Automatically create a 3.2 backport PR
  • backport/3.3: Automatically create a 3.3 backport PR
  • backport/3.4: Automatically create a 3.4 backport PR
  • backport/3.5: Automatically create a 3.5 backport PR

7 participants