Commit 8f27887
Fix race condition in message store when GC deletes file
Handle the case where the garbage collector asynchronously deletes a
file from FileSummaryEts while a write operation is trying to reference
it. This race can occur under high load (e.g., a maintenance drain with
heavy queue leadership transfers) when:
1. Message exists in index with ref_count=0, file=N
2. File N is marked for deletion (valid_total_size=0)
3. GC asynchronously deletes file N from FileSummaryEts
4. Write request arrives for the message
5. ets:lookup(FileSummaryEts, N) returns []
6. Code crashes with {case_clause, {false, []}}
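The failure in steps 5 and 6 can be reproduced in isolation. The sketch below is a simplified stand-in, not the message store code: the module name, the `decide/3` helper, the `{File, locked | unlocked}` row shape, and the returned atoms are all assumptions made for illustration. It only shows that a `case` over `{Mask, ets:lookup(...)}` with no clause for an empty lookup result raises `{case_clause, {false, []}}`.

```erlang
-module(race_sketch).
-export([demo/0]).

%% Hypothetical pre-fix decision logic: every clause assumes the
%% file summary row still exists, so an empty lookup matches nothing.
decide(Mask, Tab, File) ->
    case {Mask, ets:lookup(Tab, File)} of
        {false, [{File, locked}]} -> write;
        {_, [{File, _}]}          -> confirm
    end.

demo() ->
    %% Stand-in for FileSummaryEts with a single row for file 7.
    Tab = ets:new(file_summary_sketch, [set]),
    true = ets:insert(Tab, {7, unlocked}),
    confirm = decide(false, Tab, 7),
    %% Simulate the GC deleting the row before the write arrives.
    true = ets:delete(Tab, 7),
    Result = try decide(false, Tab, 7)
             catch error:{case_clause, {false, []}} -> crashed
             end,
    true = ets:delete(Tab),
    Result.
```

Calling `race_sketch:demo()` exercises the lookup once while the row exists and once after it has been removed.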
The fix adds two new case clauses to handle this scenario:
- {false, []} - Delete stale index entry and write fresh copy
- {false_if_increment, []} - Ignore write (client dying)
This follows the same pattern as existing clauses that handle files
being deleted or locked. Since ref_count=0, the old copy is orphaned
and safe to discard.
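A minimal sketch of the clause pattern described above, under the same simplifying assumptions: the module name, `decide/3`, the row shape, and the result terms (`{write, delete_stale_index}`, `ignore`, `confirm`) are hypothetical and do not reproduce the real return values or state handling of the message store. It only illustrates how adding clauses for an empty lookup result keeps the `case` total for both mask values.

```erlang
-module(race_fix_sketch).
-export([demo/0]).

%% Hypothetical post-fix decision logic with explicit clauses for
%% the case where the file summary row has already been deleted.
decide(Mask, Tab, File) ->
    case {Mask, ets:lookup(Tab, File)} of
        {false, [{File, locked}]}              -> {write, delete_stale_index};
        {false_if_increment, [{File, locked}]} -> confirm;
        %% New: row is gone, so drop the stale index entry and rewrite.
        {false, []}                            -> {write, delete_stale_index};
        %% New: row is gone and the client is dying, so ignore the write.
        {false_if_increment, []}               -> ignore;
        {_, [{File, _}]}                       -> confirm
    end.

demo() ->
    %% Empty table models the state after the GC removed file 7.
    Tab = ets:new(file_summary_sketch, [set]),
    Results = [decide(false, Tab, 7),
               decide(false_if_increment, Tab, 7)],
    true = ets:delete(Tab),
    Results.
```

With the table empty, neither call raises; each falls into one of the two new clauses.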
Observed in production during a maintenance window, where it caused
969 queue process crashes. See ticket V2090892319.1.
Parent: 189c729
1 file changed
Lines changed: 14 additions & 1 deletion
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 1122-1124 | 1122-1124 | (unchanged context) |
| 1125 | | - (removed) |
| | 1125-1138 | + (added) |
| 1126-1128 | 1139-1141 | (unchanged context) |