Skip to content

Conversation

@ShooterIT
Copy link
Collaborator

No description provided.

@codecov-commenter
Copy link

codecov-commenter commented Oct 19, 2025

Codecov Report

❌ Patch coverage is 75.00000% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (cluster-asm@cd4f7dc). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/t_hash.c 0.00% 1 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##             cluster-asm     #110   +/-   ##
==============================================
  Coverage               ?   75.89%           
==============================================
  Files                  ?      131           
  Lines                  ?    76332           
  Branches               ?        0           
==============================================
  Hits                   ?    57935           
  Misses                 ?    18397           
  Partials               ?        0           
Files with missing lines Coverage Δ
src/cluster_asm.c 91.02% <100.00%> (ø)
src/t_hash.c 95.78% <0.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ShooterIT
Copy link
Collaborator Author

@ShooterIT ShooterIT requested a review from tezc October 19, 2025 16:28
} else {
/* Cancel pending trim jobs if becoming a replica. */
asmCancelPendingTrimJobs();
}
Copy link
Owner

@tezc tezc Oct 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a reminder, master and replica handle trim separately in the current design. So, I feel like we should not cancel on replica? e.g. we cancel pending job on replica, replica has unowned keys now. Master may not send TRIMSLOTS for these unowned slots if it has already completed the trim. Next time replica becomes master, it will have unowned keys. Do I miss something?

Copy link
Owner

@tezc tezc Oct 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, we should deal with unowned keys as quick as possible, otherwise it causes duplication for modules as they replicate their queries and merge all the results. So, unrelated to the change here, I also feel like reverting asmTrimSlotsIfNotOwned() to trim all the unowned slots, just in case. We'll be sure once we become a master, we won't have unowned keys.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a reminder, master and replica handle trim separately in the current design.

I see, i cancel the pending trim jobs instead of active jobs. the pending trim jobs are delayed to trigger only on master, all the pending trim jobs will be added to active_trim_jobs once when triggering, besides we don't propagate the pending trim jobs to replica before triggering.

without this change, if we perform cluster failover and there is a slot migration task, the old master will cancel the task and add a trim job into pending list, then trigger a trim job when becoming a replica. the new master may also propagate a trimslot command again since this task is failed from POV of new master

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, your change will prevent double trim? Replica may get TRIMSLOTS for the empty slots. Even if that happens, it is harmless. So, we don't have a bug around here right?

Now, consider the other scenario:

  1. There is an import but it failed. Master has not replicated any key to replica yet.
  2. Master created trim job and put it into pending list.
  3. Failover happens and old master discard pending job.
    4-a. The new master does not propagate TRIMSLOTS as it is unaware of the failed migration. The old master stays with unowned keys.
    4-b. (Another scenario alternative to 4-a) Another failover happens immediately and this old master becomes master again with unowned keys.

I feel like we should let replica trim its slots. (+ perhaps asmTrimSlotsIfNotOwned() should check all the slots when the node becomes master)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for 4-a, a full sync will happen, right?
for 4-b, it seems a bad cases

perhaps asmTrimSlotsIfNotOwned() should check all the slots when the node becomes master

this change may break legacy slot migration and cause data loss

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit odd but otherwise how do we clean up these slots in case of 4-b? Any idea?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't have an idea. so you accept double trim, right?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks harmless to me? Replica will write TRIMSLOTS to AOF on its own. Does that bring a problem?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure, just feel uncomfortable

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert

@ShooterIT ShooterIT merged commit a598c81 into tezc:cluster-asm Oct 20, 2025
17 checks passed
@ShooterIT ShooterIT deleted the asm-ci-fix branch October 20, 2025 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants