
mvcc: ensure serializability by fixing dirty reads and secondary index duplicates#11662

Merged
sergepetrenko merged 6 commits into tarantool:master from Astronomax:gh-11660-rollback-may-cause-duplicates
Oct 27, 2025

Conversation

Contributor

@Astronomax Astronomax commented Jul 13, 2025

The patchset enforces serializable isolation by addressing the following MVCC bugs:

  • Rollback may cause duplicates in secondary indexes.
    To guarantee the absence of duplicates in secondary indexes, MVCC maintains the following invariant for all in-progress transactions:

    If an in-progress story `x` conflicts with a story `y` in some secondary index
    (i.e., they have the same key in that index), then `x` must also conflict with `y` in the primary key.
    

    x and y may belong to the same transaction or different ones. The case where x and y belong to the same transaction is trivial. If x and y belong to different transactions, then y must be the last prepared story in the chain corresponding to that index. This implies that the invariant may break for some transactions when the last prepared story in a chain changes (either when another story becomes last or when the last story's del_psn becomes 0). All such cases must be handled - any transactions that violate the invariant (i.e., start duplicating prepared tuples) must be aborted.

    Rollback often leads to changes in the last prepared story within chains. However, this case was previously overlooked, which could result in duplicates after rollback. This patch adds the missing handling for rollback scenarios.

    Closes #11660 (mvcc: rollback may cause duplicates in secondary indexes)

  • Insert-after-delete in single transaction may cause duplicates in secondary indexes.
    The issue was related to MVCC assuming that if it inserts a tuple x={key, ...} that doesn't conflict with any other tuple on the primary key key (because it previously executed delete(key) removing some tuple y={key, ...} with the same primary key), then x couldn't possibly be a duplicate in any secondary index. However, this is obviously false. To make such a conclusion for a specific secondary index, the deleted tuple y must have the same key value in that index as tuple x.

    The problem stemmed from the is_own_change flag in the transactional statement: a single statement-level flag cannot say anything about individual secondary indexes.

    Similar flags were introduced separately for each index. The statement-level flag remains, but it now has a different name (is_own_delete) and semantics. This flag is only used for DELETE statements; for INSERT/REPLACE, it is always false. delete_stmt->is_own_delete means the statement will either delete some tuple from the same transaction or won't delete anything because the same transaction previously deleted this key.

    For INSERT/REPLACE statements, stmt->is_own_change has been replaced by stmt->add_story->link[0].is_own_change, which is equivalent to the stmt->is_own_change that existed before this commit.

    Closes #11686 (mvcc: insert-after-delete in single transaction may cause duplicates in secondary indexes)

  • Get-after-replace in single transaction may cause dirty read in secondary index.
    The issue was related to MVCC assuming that if it performed a replace with a new tuple x={key, ...} that also deleted some tuple y={key, key2} with the same primary key, then a subsequent index.sk:get(key2) in the same transaction wouldn't return anything. This would be true if the transaction had deleted y using delete rather than replace, because delete guarantees it will definitely remove exactly what was returned to the user (it tracks reads if y was inserted by another transaction; if y was inserted by the same transaction, this guarantee arises automatically).

    This issue was automatically fixed by the same is_own_change flag-related fix described above. Now when MVCC performs get on a secondary key, it checks the is_own_change flag for that secondary index and determines whether it can automatically guarantee it won't read anything, or whether it needs to create a gap tracker to enforce this guarantee.

    Closes #11687 (mvcc: get-after-replace in single transaction may cause dirty read in secondary index)

  • Rollback may cause dirty gap read in secondary index.
    The essence of the problem: One REPLACE transaction (prepared but not yet committed) creates a temporary situation where another concurrent transaction cannot see a specific key in a secondary index (a "read gap"). When the first transaction then rolls back, the previous (replaced) tuple becomes visible again, and the "read gap" becomes irrelevant (the key becomes visible once more). In this case, the second transaction, which read the gap, should be aborted. However, it successfully commits, leading to a non-serializable schedule.

    This patch fixes the issue by adding the necessary handling during rollback. Now, all such irrelevant gaps are aborted.

    Closes #11802 (mvcc: rollback may cause dirty gap read)
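
As an illustration of the get-after-replace case above, here is a hypothetical session sketch (a space with a primary key on the first field and a unique secondary index `sk` on the second; names and values are invented, not taken from the test suite):

```
space:replace{1, 10}        -- committed tuple `{1, 10}`
TXN1 begin()
TXN1 space:replace{1, 20}   -- deletes `{1, 10}` by primary key
TXN1 space.index.sk:get{10} -- -> nil; this empty read must be tracked
TXN2 begin()
TXN2 space:insert{2, 10}    -- reintroduces sk value `10`
TXN2 commit()
TXN1 commit()               -- without a gap tracker for `10`, TXN1
                            -- would commit a non-serializable read
```

With the per-index flags, MVCC sees that the replace in `TXN1` does not guarantee an empty result for `sk` value `10` and creates a gap tracker, so `TXN1` is aborted when `TXN2` commits.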

@Astronomax Astronomax self-assigned this Jul 13, 2025
@Astronomax Astronomax force-pushed the gh-11660-rollback-may-cause-duplicates branch 4 times, most recently from 7679026 to bb463b1 Compare July 17, 2025 10:47

coveralls commented Jul 17, 2025

Coverage Status: 87.641% (+0.002%) from 87.639% when pulling bfed7e0 on
Astronomax:gh-11660-rollback-may-cause-duplicates into 9fd673f on
tarantool:master.

@Astronomax Astronomax force-pushed the gh-11660-rollback-may-cause-duplicates branch from bb463b1 to 173e092 Compare July 17, 2025 11:17
@Astronomax Astronomax requested a review from a team as a code owner July 17, 2025 11:17
@Astronomax Astronomax force-pushed the gh-11660-rollback-may-cause-duplicates branch 3 times, most recently from 8c205f9 to 90c529c Compare July 17, 2025 11:31
@Astronomax Astronomax requested a review from drewdzzz July 17, 2025 12:23
@Astronomax Astronomax assigned drewdzzz and unassigned Astronomax Jul 17, 2025
Contributor Author

Astronomax commented Jul 17, 2025

Note

There is one possible optimization: during rollback, we could track all chains affected by rolled-back transactions. Then, after all necessary rollbacks are complete, we could:

  • Iterate through all tracked chains and call memtx_tx_handle_dups_in_secondary_indexes for their last prepared stories
  • Similarly handle gap trackers - in chains where the gap disappears, abort all transactions that saw it

This would reduce false positives.

Currently:

  • Transactions tracking a gap get aborted as soon as the gap disappears for the first time during rollback (even if it might reappear later after all required transactions are rolled back). The same approach is applied to duplicates in secondary indexes in this patch.
  • Such aborts may therefore be unnecessary.

The proposed approach would:

  • Defer gap tracker (and duplicates in secondary indexes) processing until rollback completion
  • Only abort transactions if the gap truly disappears (or duplicate truly appears) permanently

I'm uncertain how critical it is for us to reduce these false-positive aborts, but it's worth noting this optimization opportunity exists.
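
The deferred scheme could look roughly like the following pseudocode sketch (all names except `memtx_tx_handle_dups_in_secondary_indexes` are hypothetical):

```
-- pseudocode, not actual memtx code
on_story_rollback(story):
    touched_chains:add(chain_of(story)) -- only remember the chain

on_rollback_complete():
    for chain in touched_chains:
        -- re-check duplicates against the final last prepared story
        memtx_tx_handle_dups_in_secondary_indexes(last_prepared(chain))
        for gap in gap_trackers(chain):
            if key_visible(chain, gap.key):
                -- the gap truly disappeared for good: only now
                -- abort the transactions that read it
                abort_readers(gap)
```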

@Astronomax Astronomax force-pushed the gh-11660-rollback-may-cause-duplicates branch 5 times, most recently from 63e7234 to 9af9a1a Compare July 20, 2025 20:26
@Astronomax Astronomax changed the title from "mvcc: abort duplicates after rollback" to "mvcc: ensure serializability by fixing dirty reads and secondary index duplicates" Jul 21, 2025
@Astronomax Astronomax force-pushed the gh-11660-rollback-may-cause-duplicates branch 4 times, most recently from c3a1487 to 327939d Compare July 21, 2025 10:11
@CuriousGeorgiy CuriousGeorgiy self-requested a review August 6, 2025 14:57
@CuriousGeorgiy
Member

@Astronomax

I'm uncertain how critical it is for us to reduce these false-positive aborts, but it's worth noting this optimization opportunity exists.

Please open a ticket for it.

@Astronomax Astronomax force-pushed the gh-11660-rollback-may-cause-duplicates branch 5 times, most recently from 69f0a9a to 8a53894 Compare October 14, 2025 12:23
@drewdzzz
Contributor

@Astronomax Please, rebase to the newest master. It seems the workflow fails because of it.

@drewdzzz drewdzzz left a comment

Thank you for your fixes. Overall, it looks good to me - only a few small things left!

@drewdzzz drewdzzz left a comment

Thank you for your tremendous work on this patch! I believe this PR is one of the major milestones in memtx MVCC development.

@drewdzzz
Contributor

I suspect that Doc Team review is outdated (AFAIR, they approved this PR before the last two commits). If so, please, re-request their review.

Introduces the `abort_on_rollback` flag for inplace gap items to control
whether gap-reader transactions are aborted on rollback. This change is
a prerequisite for future fixes related to per-index `is_own_change`
flags; it will help prevent new false-positive aborts from occurring.

After you look at one of the subsequent commits ("memtx: introduce
`is_own_change` flag for each secondary index", which fixes issue
tarantool#11686), it becomes clear that some gap-reader transactions
must be aborted only while preparing a replace/insert statement, but
not during a rollback.

```
space:replace{21, 30} -- committed tuple `{21, 30}`
TXN1 begin()
TXN1 space:replace{21, 31}
TXN1 space.index.sk:select{30} -> nil
TXN2 begin()
TXN2 space:delete{21}
TXN2 commit() -> rollback due to WAL IO error
```

- The operation `space.index.sk:select{30}` in `TXN1` will now correctly
track a gap, because `space:replace{21, 31}` in `TXN1` does not
guarantee that some tuple `{x, 30}`, `x != 21`, from another concurrent
transaction won't be inserted and committed before
`space.index.sk:select{30}` from `TXN1` (see tarantool#11686,
tarantool#11687)
- Transaction `TXN2` executes `space:delete{21}`
- Transaction `TXN2` is rolled back. Since this `space:delete{21}`
deleted the tuple `{21, 30}`, it will reappear after the rollback

It turns out that some gap-reader transactions for key `30` in the
secondary index should not be rolled back. Specifically, the gap-reader
`TXN1` can be skipped. All inplace gap items corresponding to such
transactions can be marked with the `abort_on_rollback = false` flag at
the moment of their creation.

Part of tarantool#11686, tarantool#11687

NO_DOC=bugfix
NO_TEST=will be added later
NO_CHANGELOG=will be added later

Removed the `is_own_change` output parameter from `check_dup()` in
`memtx_tx.c` - now it directly modifies `txn_stmt::is_own_change`.
This change prepares for further modifications where the ownership
check logic will be extended (to be implemented in the next commit).

Part of tarantool#11686, tarantool#11687

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

This commit fixes two Memtx-MVCC related issues:

- A bug when a transaction performing insert-after-delete with the same
  primary key (e.g., delete{4} followed by insert{4, 3}) could create
  secondary key duplicates.

- A bug when a transaction performing get-after-replace could read an
  empty result from a secondary index without the read being tracked
  (a dirty read).

Both problems were caused by the `is_own_change` flag in the
transactional statement: a single statement-level flag cannot say
anything about individual secondary indexes.

Similar flags were introduced separately for each index.
The statement-level flag remains, but it now has a different name
(`is_own_delete`) and semantics. This flag is only used for DELETE
statements; for INSERT/REPLACE, it is always `false`.
`delete_stmt->is_own_delete` means the statement will either delete some
tuple from the same transaction or won't delete anything because the same
transaction previously deleted this key.

For INSERT/REPLACE statements, `stmt->is_own_change` has been replaced by
`stmt->add_story->link[0].is_own_change`, which is equivalent to the
`stmt->is_own_change` that existed before this commit.

Closes tarantool#11686, tarantool#11687

NO_DOC=bugfix

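The insert-after-delete case from tarantool#11686 can be sketched as a
hypothetical session (primary key on the first field, unique secondary
index `sk` on the second; values invented for illustration):

```
space:replace{7, 3}       -- committed tuple `{7, 3}`
space:replace{4, 5}       -- committed tuple `{4, 5}`
TXN1 begin()
TXN1 space:delete{4}      -- own delete of primary key `4`
TXN1 space:insert{4, 3}   -- no conflict on the primary key, but `3`
                          -- duplicates `{7, 3}` in `sk`
TXN1 commit()             -- must fail; with the single statement-level
                          -- flag it could succeed, leaving duplicates
```

The old `stmt->is_own_change` treated the insert as safe in every index
because the preceding delete removed a tuple with the same primary key;
the per-index flags make that conclusion valid only for indexes where
the deleted tuple `{4, 5}` and the inserted tuple `{4, 3}` share the key.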
Duplicates need to be processed (aborted) not only during prepare,
but also during rollback (see the next commit). Factor this processing
out into a separate function.

Part of tarantool#11660

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

This commit fixes a Memtx-MVCC related bug that could lead to duplicates
in secondary indexes after rollback.

To guarantee the absence of duplicates in secondary indexes, MVCC
maintains the following invariant for all in-progress transactions:
```
If an in-progress story `x` conflicts with a story `y` in some secondary
index (i.e., they have the same key in that index), then `x` must also
conflict with `y` in the primary key.
```
`x` and `y` may belong to the same transaction or different ones. The case
where `x` and `y` belong to the same transaction is trivial. If `x` and `y`
belong to different transactions, then `y` must be the last prepared story
in the chain corresponding to that index. This implies that the invariant
may break for some transactions when the last prepared story in a chain
changes (either when another story becomes last or when the last story's
`del_psn` becomes 0). All such cases must be handled - any transactions
that violate the invariant (i.e., start duplicating prepared tuples) must
be aborted.

Rollback often leads to changes in the last prepared story within chains.
However, this case was previously overlooked, which could result in
duplicates after rollback. This commit adds the missing handling for
rollback scenarios.

Closes tarantool#11660

NO_DOC=bugfix

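The rollback scenario can be sketched as a hypothetical session
(primary key on the first field, unique secondary index `sk` on the
second; values invented for illustration):

```
space:replace{1, 10}      -- committed tuple `{1, 10}`
TXN1 begin()
TXN1 space:replace{1, 20} -- sk value `10` is no longer visible
TXN1 commit()             -- prepared, waiting for WAL
TXN2 begin()
TXN2 space:insert{2, 10}  -- allowed: no visible duplicate of `10`
TXN1 -> rollback due to WAL IO error -- `{1, 10}` reappears
TXN2 commit()             -- previously could succeed, producing two
                          -- tuples with sk value `10`; with this fix
                          -- TXN2 is aborted during TXN1's rollback
```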
This commit fixes a Memtx-MVCC related bug that could lead to dirty gap
read in secondary indexes after rollback.

One REPLACE transaction (prepared but not yet committed) creates a
temporary situation where another concurrent transaction cannot see a
specific key in a secondary index (a "read gap"). When the first
transaction then rolls back, the previous (replaced) tuple becomes visible
again, and the "read gap" becomes irrelevant (the key becomes visible once
more). In this case, the second transaction, which read the gap, should be
aborted. However, it successfully commits, leading to a non-serializable
schedule.

This commit fixes the issue by adding the necessary handling during
rollback. Now, all such irrelevant gaps are aborted.

Closes tarantool#11802

NO_DOC=bugfix
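
The dirty gap read can be sketched as a hypothetical session (primary
key on the first field, secondary index `sk` on the second; values
invented for illustration):

```
space:replace{21, 30}           -- committed tuple `{21, 30}`
TXN1 begin()
TXN1 space:replace{21, 31}      -- sk value `30` becomes invisible
TXN1 commit()                   -- prepared, waiting for WAL
TXN2 begin()
TXN2 space.index.sk:select{30}  -- -> empty: TXN2 reads a gap
TXN1 -> rollback due to WAL IO error -- `{21, 30}` reappears
TXN2 commit()                   -- previously succeeded despite the
                                -- dirty gap read; with this fix TXN2
                                -- is aborted
```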
@TarantoolBot
Collaborator

Backport failed for release/3.2, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

```
git fetch origin release/3.2
git worktree add -d .worktree/backport/release/3.2/11662 origin/release/3.2
cd .worktree/backport/release/3.2/11662
git switch --create backport/release/3.2/11662
git cherry-pick -x 9a000e50dfb2693f817bf899b1b02473b76c7a36 78fc900ec841ec6cfe14b167ab0884b5a542609d 76ba7ee9eacb0815126b9cbc61fb92b3b543d2f8 d4f9f9b9570201b06a649d87b3ff23d4053c7d44 bd6e12b3ca90bc9b698ad15a0e41583baae821b9 2a9a463e5c303dca9c877f6e0e89d4d6665898de
```

@TarantoolBot
Collaborator

Backport failed for release/3.3, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

```
git fetch origin release/3.3
git worktree add -d .worktree/backport/release/3.3/11662 origin/release/3.3
cd .worktree/backport/release/3.3/11662
git switch --create backport/release/3.3/11662
git cherry-pick -x 9a000e50dfb2693f817bf899b1b02473b76c7a36 78fc900ec841ec6cfe14b167ab0884b5a542609d 76ba7ee9eacb0815126b9cbc61fb92b3b543d2f8 d4f9f9b9570201b06a649d87b3ff23d4053c7d44 bd6e12b3ca90bc9b698ad15a0e41583baae821b9 2a9a463e5c303dca9c877f6e0e89d4d6665898de
```

@sergepetrenko
Collaborator

@Astronomax, please, prepare backports for 3.2 and 3.3


Labels

  • backport/3.2: Automatically create a 3.2 backport PR
  • backport/3.3: Automatically create a 3.3 backport PR
  • backport/3.4: Automatically create a 3.4 backport PR
  • backport/3.5: Automatically create a 3.5 backport PR

7 participants