fix: reduce lock contention and races in purger#27146
davidby-influx merged 8 commits into master-1.x from
Conversation
Pull request overview
This pull request refactors the TSM file purger to address lock contention and race conditions that could prevent compactions from completing. The changes migrate from a regular map with RWMutex to a sync.Map wrapper, significantly reducing the time that locks are held during file deletion operations.
Changes:
- Replaced the standard map with `gensyncmap.Map[string, TSMFile]` to eliminate memory retention issues and enable lock-free concurrent access
- Minimized mutex-protected sections by no longer holding the lock during file `Close()` and `Remove()` operations, which have unbounded runtime
- Removed the unused `fileStore` field from the purger struct
- Added comprehensive test coverage, including stress tests for concurrent operations and race conditions
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tsdb/engine/tsm1/file_store.go | Refactored purger implementation to use sync.Map and minimize lock contention; removed lock holding during file operations |
| tsdb/engine/tsm1/purger_test.go | Added comprehensive test suite with mock TSMFile implementation and stress tests for concurrent purger operations |
tsdb/engine/tsm1/file_store.go
Outdated
```go
p.mu.Lock()
defer p.mu.Unlock()

if p.files.Len() == 0 {
```
The Len() method iterates through the entire sync.Map using Range(), which can be expensive if there are many files. This is called on every iteration of the purge loop (line 1654). Consider tracking the count separately using an atomic counter that's incremented on Store() and decremented on Delete(), or cache the length check result to avoid repeated full map scans.
Suggested change, replacing `if p.files.Len() == 0 {`:

```go
// Avoid using Len(), which iterates the entire underlying map.
// Instead, check for emptiness by stopping Range() after the first entry.
has := false
p.files.Range(func(k string, v TSMFile) bool {
	has = true
	return false // stop iteration after finding the first file
})
if !has {
```
oh yeah; one of the big downsides of sync.Map: Len() is no longer cheap. The separate atomic counter is also painful - both the map and the counter need to be updated.
I don't recall if files is big or not; I'll keep reading and see.
And perhaps the performance penalty of Len() is actually fine here.
```go
delete(p.files, k)
purgeCount++
// Remove the file regardless of success or failure.
// Do not retry files which could not be closed or removed.
```
Are you thinking we shouldn't retry because we're just likely to fail again (and therefore forever)? The purger is already the second attempt to remove a file (if I recall correctly), so not retrying makes sense to me.
I'm curious about the why behind the change: why don't you want to retry anymore? The comment says what it does but not why.
Failing to Close() or Remove() generally means we have an operating system error unmapping or deleting the file, so we don't want to fill the log with "Already Closed" or "Path Not Found" errors. The situation will not improve, so retrying is pointless.
Cool; thought so, but wanted to check. I've forgotten some of what is happening here since I made the ticket.
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
```go
// Do not retry files which could not be closed or removed.
// This is because they have already been closed, or there
// is an operating system problem that is unlikely to
// resolve by itself.
```
```go
}

// purger manages asynchronous deletion of TSM files that have been
// replaced by compaction, but are temporarily held open by queries
```
```go
p.running = true
p.mu.Unlock()

go func() {
```
The goroutine captures only logger and p (the purger), I think?
philjb left a comment
Really nice. At first the locking looks complicated, but the lock for p.running isn't held over the goroutine that does the purging/removing from disk. The sync.Map simplifies a lot of the concurrency, so that worked out well, I think, and it significantly addresses the concern in the ticket: that purging TSM files could freeze up compaction.
I don't think that is possible anymore. Purging is best effort, and this code makes that clear.
Test coverage is awesome. I liked the RW lock as a latch to start the goroutines roughly at the same time.
Use a sync.Map and minimize mutex sections
Fixes #26110