Commit 8f27887

Fix race condition in message store when GC deletes file
Handle case where garbage collector asynchronously deletes a file from FileSummaryEts while a write operation is trying to reference it.

This race condition can occur during high load (e.g., maintenance drain with heavy queue leadership transfers) when:

1. Message exists in index with ref_count=0, file=N
2. File N is marked for deletion (valid_total_size=0)
3. GC asynchronously deletes file N from FileSummaryEts
4. Write request arrives for the message
5. ets:lookup(FileSummaryEts, N) returns []
6. Code crashes with {case_clause, {false, []}}

The fix adds two new case clauses to handle this scenario:

- {false, []} - Delete stale index entry and write fresh copy
- {false_if_increment, []} - Ignore write (client dying)

This follows the same pattern as existing clauses that handle files being deleted or locked. Since ref_count=0, the old copy is orphaned and safe to discard.

Observed in production during maintenance window causing 969 queue process crashes. See ticket V2090892319.
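
The crash in step 6 can be reproduced in isolation. The following is a minimal sketch, not the code from rabbit_msg_store.erl: the module name, the simplified #file_summary{} record, and the helpers lookup_old/3 and lookup_new/3 are all hypothetical stand-ins, and the real clauses perform index updates that are omitted here.

```erlang
-module(case_clause_sketch).
-export([demo/0]).

%% Hypothetical, simplified stand-in for the real #file_summary{} record.
-record(file_summary, {file, valid_total_size = 0, locked = false}).

%% Pre-fix shape: every clause expects exactly one summary row,
%% so an empty lookup result matches nothing.
lookup_old(Mask, Tab, File) ->
    case {Mask, ets:lookup(Tab, File)} of
        {_Mask, [#file_summary{}]} -> confirm
    end.

%% Post-fix shape: the two new clauses cover the empty lookup result.
lookup_new(Mask, Tab, File) ->
    case {Mask, ets:lookup(Tab, File)} of
        {_Mask, [#file_summary{}]} -> confirm;
        {false, []}                -> write;
        {false_if_increment, []}   -> ignore
    end.

demo() ->
    Tab = ets:new(file_summary_sketch,
                  [set, {keypos, #file_summary.file}]),
    true = ets:insert(Tab, #file_summary{file = 7}),
    %% While the row exists, the old code path works.
    confirm = lookup_old(false, Tab, 7),
    %% Simulate the GC removing the row before the writer looks it up.
    true = ets:delete(Tab, 7),
    Crash = try lookup_old(false, Tab, 7)
            catch error:Reason -> Reason
            end,
    Result = {Crash,
              lookup_new(false, Tab, 7),
              lookup_new(false_if_increment, Tab, 7)},
    true = ets:delete(Tab),
    Result.
```

Calling case_clause_sketch:demo() returns {{case_clause, {false, []}}, write, ignore}: the first element is the error reason from the commit message, and the other two show which branch each new clause selects once the row is gone.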
1 parent 189c729 commit 8f27887

1 file changed

Lines changed: 14 additions & 1 deletion

deps/rabbit/src/rabbit_msg_store.erl

@@ -1122,7 +1122,20 @@ write_action({Mask, #msg_location { ref_count = 0, file = File,
         {_Mask, [#file_summary {}]} ->
             ok = index_update_ref_counter(IndexEts, MsgId, +1), %% Effectively set to 1.
             State1 = adjust_valid_total_size(File, TotalSize, State),
-            {confirm, File, State1}
+            {confirm, File, State1};
+        {false, []} ->
+            %% Handle case where GC has deleted the file from the summary table
+            %% before we could look it up. This can occur during high load when
+            %% the GC asynchronously deletes files while write operations are
+            %% in progress. Since ref_count=0, the old copy is orphaned and
+            %% safe to discard. Delete stale index entry and write fresh copy
+            %% to current file.
+            ok = index_delete(IndexEts, MsgId),
+            {write, State};
+        {false_if_increment, []} ->
+            %% File deleted by GC, but client is dying - ignore the write
+            %% since the message will be deleted when client death is processed.
+            {ignore, File, State}
     end;
 write_action({_Mask, #msg_location { file = File }},
              MsgId, State = #msstate{ index_ets = IndexEts }) ->
