raftstore: properly release snapshot precheck resource after snapshot reception #17903

Merged
ti-chi-bot[bot] merged 4 commits into tikv:master from hbisheng:snap-recv-fix on Dec 18, 2024

@hbisheng hbisheng commented Nov 28, 2024

What is changed and how it works?

Issue Number: Close #17881

This PR fixes a case where snapshot precheck may succeed but the receiver would reject the snapshot due to incorrect ordering of resource release and recving_count updates.

Previous ordering:

  1. Reserve precheck resource (snap_mgr.recv_snap_precheck)
  2. recving_count++
  3. Release precheck resource (snap_mgr.recv_snap_complete)
  4. recving_count--

The issue lies between steps 3 and 4. After releasing the precheck resource (step 3), a new precheck can succeed. However, the receiving_busy check on the receiver would fail because recving_count hasn't been decremented. This PR ensures that recving_count is decremented before releasing the precheck resource.
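
The window between steps 3 and 4 can be reproduced deterministically with a small model. The sketch below is not TiKV code: `Receiver`, `precheck`, `release_precheck`, `start`, and `finish_and_probe` are hypothetical names, and it assumes a single limit shared by the precheck reservation and the `receiving_busy` check. It finishes one reception using either ordering and probes with a simulated sender between the two cleanup steps.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for the receiver-side state. The field names echo
/// the PR description; the types and the shared limit are assumptions.
struct Receiver {
    precheck_slots: AtomicUsize, // precheck resources currently reserved
    recving_count: AtomicUsize,  // snapshots currently being received
    limit: usize,                // assumed cap used by both checks
}

impl Receiver {
    fn new(limit: usize) -> Self {
        Receiver {
            precheck_slots: AtomicUsize::new(0),
            recving_count: AtomicUsize::new(0),
            limit,
        }
    }

    /// Models step 1: reserve a precheck slot if one is free.
    fn precheck(&self) -> bool {
        self.precheck_slots
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| {
                if n < self.limit { Some(n + 1) } else { None }
            })
            .is_ok()
    }

    /// Models step 3: give the precheck slot back.
    fn release_precheck(&self) {
        self.precheck_slots.fetch_sub(1, Ordering::SeqCst);
    }

    /// Models the receiver's busy check against recving_count.
    fn receiving_busy(&self) -> bool {
        self.recving_count.load(Ordering::SeqCst) >= self.limit
    }
}

/// Models steps 1 and 2: reserve the slot, then count the reception.
fn start(r: &Receiver) {
    assert!(r.precheck());
    r.recving_count.fetch_add(1, Ordering::SeqCst);
}

/// Finishes one reception. Returns true if a sender arriving between the two
/// cleanup steps would pass precheck and then be rejected as busy.
fn finish_and_probe(r: &Receiver, decrement_first: bool) -> bool {
    if decrement_first {
        r.recving_count.fetch_sub(1, Ordering::SeqCst);
    } else {
        r.release_precheck();
    }

    // A simulated sender arrives exactly here.
    let precheck_ok = r.precheck();
    let rejected = precheck_ok && r.receiving_busy();
    if precheck_ok {
        r.release_precheck(); // undo the probe's own reservation
    }

    if decrement_first {
        r.release_precheck();
    } else {
        r.recving_count.fetch_sub(1, Ordering::SeqCst);
    }
    rejected
}

fn main() {
    let r = Receiver::new(1);
    start(&r);
    println!("previous ordering, rejected after precheck: {}", finish_and_probe(&r, false));
    start(&r);
    println!("fixed ordering, rejected after precheck: {}", finish_and_probe(&r, true));
}
```

With the previous ordering the probe passes precheck and is then rejected. With the fixed ordering the probe fails precheck while the slot is still held, so the sender never reaches a receiver that would reject it.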

In addition, this PR fixes another potential issue where the precheck resource is not released when snapshot reception encounters a network error.
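
One common way to make sure a reservation is released on every exit path, including an early return on a network error, is to tie the release to scope exit. The review snippet further down shows a `defer!(cleanup_after_recv(` call, which suggests the PR takes a similar approach; the sketch below uses a plain `Drop` guard from the standard library instead. `PrecheckGuard` and `recv_snapshot` are hypothetical names, and the `Cell` counter is a single-threaded stand-in for the real resource.

```rust
use std::cell::Cell;

/// Hypothetical guard: holds one precheck reservation and releases it when
/// dropped, no matter how the enclosing scope is left.
struct PrecheckGuard<'a> {
    reserved: &'a Cell<usize>,
}

impl<'a> PrecheckGuard<'a> {
    fn reserve(reserved: &'a Cell<usize>) -> Self {
        reserved.set(reserved.get() + 1);
        PrecheckGuard { reserved }
    }
}

impl Drop for PrecheckGuard<'_> {
    fn drop(&mut self) {
        // Runs on normal return and on early return alike.
        self.reserved.set(self.reserved.get() - 1);
    }
}

/// Simulated reception. `network_ok = false` models the stream failing
/// partway through (an assumption made for illustration).
fn recv_snapshot(reserved: &Cell<usize>, network_ok: bool) -> Result<(), String> {
    let _guard = PrecheckGuard::reserve(reserved);
    if !network_ok {
        // Early return: the guard is still dropped here.
        return Err("connection reset".to_string());
    }
    Ok(())
}

fn main() {
    let reserved = Cell::new(0);
    let failed = recv_snapshot(&reserved, false);
    println!("error path: {:?}, reserved afterwards: {}", failed, reserved.get());
    let ok = recv_snapshot(&reserved, true);
    println!("success path: {:?}, reserved afterwards: {}", ok, reserved.get());
}
```

Because the release lives in `drop`, adding a new error path later cannot leak the reservation by accident.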

What's Changed:

Ensures `recving_count` is decremented before releasing the snapshot 
precheck resource. This prevents a race condition where a new precheck 
succeeds, but the receiver rejects the snapshot because it fails the 
`receiving_busy` check.

Related changes

  • PR to update pingcap/docs / pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note

None

@ti-chi-bot ti-chi-bot bot added the do-not-merge/work-in-progress, dco-signoff: yes, and do-not-merge/release-note-label-needed labels Nov 28, 2024
@hbisheng hbisheng marked this pull request as draft November 28, 2024 04:14
@ti-chi-bot ti-chi-bot bot added the size/L label Nov 28, 2024
…apshot reception

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
@ti-chi-bot ti-chi-bot bot added the release-note-none label and removed the do-not-merge/release-note-label-needed label Nov 28, 2024
@hbisheng hbisheng changed the title from "[WIP] raftstore: properly release snapshot precheck resource after snapshot reception" to "raftstore: properly release snapshot precheck resource after snapshot reception" Nov 28, 2024
@hbisheng hbisheng marked this pull request as ready for review November 28, 2024 06:58
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress label Nov 28, 2024
@hbisheng
Member Author

cc @hhwyt

Member

@Connor1996 Connor1996 left a comment

LGTM

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm and approved labels Nov 28, 2024
context.finish(raft_router)
};
async move {
defer!(cleanup_after_recv(
Contributor

Why not clean up before responding to the sink?

Member Author

That might be a good idea. Even if responding to the sink is slow, we don't have to let it block the success of the next snapshot precheck. Do you see any downside with that? @Connor1996

Member Author

Discussed with @Connor1996 offline. I believe cleaning up before responding to the sink is still an option but it probably doesn't make a significant difference since responding to the sink should be quick. I think I’ll keep it as it is to maintain consistency with the current behavior, where we decrement recving_count after responding to the sink.

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
@ti-chi-bot
Contributor

ti-chi-bot bot commented Dec 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Connor1996, LykxSassinator

Details Needs approval from an approver in each of these files:
  • OWNERS [Connor1996,LykxSassinator]

@ti-chi-bot ti-chi-bot bot added the lgtm label and removed the needs-1-more-lgtm label Dec 18, 2024
@ti-chi-bot
Contributor

ti-chi-bot bot commented Dec 18, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-11-28 08:08:14.375590497 +0000 UTC m=+710281.995245005: ☑️ agreed by Connor1996.
  • 2024-12-18 03:01:22.201046287 +0000 UTC m=+1012272.289848826: ☑️ agreed by LykxSassinator.

@ti-chi-bot ti-chi-bot bot merged commit dd1edd0 into tikv:master Dec 18, 2024
@ti-chi-bot ti-chi-bot bot added this to the Pool milestone Dec 18, 2024
@hbisheng hbisheng deleted the snap-recv-fix branch July 14, 2025 07:07

Labels

  • approved
  • dco-signoff: yes (Indicates the PR's author has signed the dco.)
  • lgtm
  • release-note-none (Denotes a PR that doesn't merit a release note.)
  • size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)

Development

Successfully merging this pull request may close these issues.

"too many recving snapshot tasks" logs after the introduction of snapshot precheck process

4 participants