
fix: reduce lock contention and races in purger #27146

Merged
davidby-influx merged 8 commits into master-1.x from DSB/purger_lock on Jan 30, 2026
Conversation

@davidby-influx
Contributor

Use a sync.Map and minimize mutex sections

Fixes #26110

@davidby-influx davidby-influx marked this pull request as ready for review January 29, 2026 02:08

Copilot AI left a comment


Pull request overview

This pull request refactors the TSM file purger to address lock contention and race conditions that could prevent compactions from completing. The changes migrate from a regular map with RWMutex to a sync.Map wrapper, significantly reducing the time that locks are held during file deletion operations.

Changes:

  • Replaced standard map with gensyncmap.Map[string, TSMFile] to eliminate memory retention issues and enable lock-free concurrent access
  • Minimized mutex-protected sections by no longer holding the lock during file Close() and Remove() operations, which have unbounded runtime
  • Removed unused fileStore field from purger struct
  • Added comprehensive test coverage with stress tests for concurrent operations and race conditions
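
The overall pattern the changes describe can be sketched as follows. This is a simplified illustration, not the PR's actual code: the real implementation uses a generic wrapper (gensyncmap.Map[string, TSMFile]), and the tsmFile/fakeFile/purger types and field names below are stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// tsmFile stands in for the real TSMFile interface; Close and Remove
// are the slow, unbounded-runtime operations the PR moves outside
// the mutex.
type tsmFile interface {
	Close() error
	Remove() error
	Path() string
}

type fakeFile struct {
	path            string
	closed, removed bool
}

func (f *fakeFile) Close() error  { f.closed = true; return nil }
func (f *fakeFile) Remove() error { f.removed = true; return nil }
func (f *fakeFile) Path() string  { return f.path }

// purger holds its mutex only around the short-lived running flag;
// the files live in a sync.Map so adds are lock-free, and file I/O
// never happens under the lock.
type purger struct {
	mu      sync.Mutex
	running bool
	files   sync.Map // keys: file path, values: tsmFile
	wg      sync.WaitGroup
}

func (p *purger) add(files []tsmFile) {
	for _, f := range files {
		p.files.Store(f.Path(), f)
	}
	p.purge()
}

func (p *purger) purge() {
	p.mu.Lock()
	if p.running {
		p.mu.Unlock()
		return
	}
	p.running = true
	p.wg.Add(1)
	p.mu.Unlock()

	go func() {
		defer p.wg.Done()
		// Close/Remove run with no locks held; failures are not
		// retried. A real implementation also loops to catch files
		// added during the sweep; this sketch does a single pass.
		p.files.Range(func(k, v any) bool {
			f := v.(tsmFile)
			_ = f.Close()
			_ = f.Remove()
			p.files.Delete(k)
			return true
		})
		p.mu.Lock()
		p.running = false
		p.mu.Unlock()
	}()
}

func main() {
	p := &purger{}
	p.add([]tsmFile{&fakeFile{path: "000001.tsm"}})
	p.wg.Wait()
	fmt.Println("purge complete")
}
```

The key property is that callers of add() never block behind a slow Close() or Remove(): the mutex protects only the flag check, and the sync.Map handles concurrent insertion.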

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
tsdb/engine/tsm1/file_store.go Refactored purger implementation to use sync.Map and minimize lock contention; removed lock holding during file operations
tsdb/engine/tsm1/purger_test.go Added comprehensive test suite with mock TSMFile implementation and stress tests for concurrent purger operations


p.mu.Lock()
defer p.mu.Unlock()

if p.files.Len() == 0 {

Copilot AI Jan 29, 2026


The Len() method iterates through the entire sync.Map using Range(), which can be expensive if there are many files. This is called on every iteration of the purge loop (line 1654). Consider tracking the count separately using an atomic counter that's incremented on Store() and decremented on Delete(), or cache the length check result to avoid repeated full map scans.

Suggested change
if p.files.Len() == 0 {
// Avoid using Len(), which iterates the entire underlying map.
// Instead, check for emptiness by stopping Range() after the first entry.
has := false
p.files.Range(func(k string, v TSMFile) bool {
has = true
return false // stop iteration after finding the first file
})
if !has {

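The atomic-counter alternative mentioned above could look like the following sketch. The "painful" part the review notes is keeping the map and counter consistent: Store must only count genuinely new keys (sync.Map.Swap reports whether the key existed) and Delete must only decrement when something was actually removed (LoadAndDelete reports that). countedMap is an illustrative name, not code from this PR.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countedMap pairs a sync.Map with an atomic length so emptiness
// checks are O(1) instead of a full Range scan.
type countedMap struct {
	m sync.Map
	n atomic.Int64
}

func (c *countedMap) Store(k string, v any) {
	// Swap overwrites like Store, but tells us if the key was new.
	if _, loaded := c.m.Swap(k, v); !loaded {
		c.n.Add(1) // count only keys that were not already present
	}
}

func (c *countedMap) Delete(k string) {
	if _, loaded := c.m.LoadAndDelete(k); loaded {
		c.n.Add(-1) // decrement only if the key actually existed
	}
}

func (c *countedMap) Len() int64 { return c.n.Load() }

func main() {
	var c countedMap
	c.Store("a.tsm", 1)
	c.Store("a.tsm", 2) // re-store: length must not change
	fmt.Println(c.Len()) // 1
	c.Delete("a.tsm")
	c.Delete("a.tsm") // double delete: must not go negative
	fmt.Println(c.Len()) // 0
}
```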
Contributor


oh yeah; one of the big downsides of syncmap. Len() is now no longer cheap. the separate atomic counter is also painful - both the map and counter need to be updated.

I don't recall if files is big or not; I'll keep reading and see.

and perhaps the performance penalty of Len() is actually fine here.

Contributor Author


Fixed

delete(p.files, k)
purgeCount++
// Remove the file regardless of success or failure.
// Do not retry files which could not be closed or removed.
Contributor


Are you thinking we shouldn't retry because we're just likely to fail again (and therefore forever)? The purger is already the second attempt to remove a file (if I recall correctly), so not retrying makes sense to me.

I'm curious the why behind the change: why don't you want to retry anymore? The comment says what it does but not why.

Contributor Author


Failing to Close() or Remove() generally means we have an operating system error unmapping or deleting the file, so we don't want to fill the log with "Already Closed" or "Path Not Found" errors. The situation will not improve, so retrying is pointless.

Contributor


cool; thought so but wanted to check; i've forgotten some of what is happening here since i made the ticket.


Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.



// Do not retry files which could not be closed or removed.
// This is because they have already been closed, or there
// is an operating system problem that is unlikely to
// resolve by itself.
Contributor


thanks

}

// purger manages asynchronous deletion of TSM files that have been
// replaced by compaction, but are temporarily held open by queries
Contributor


this is helpful context

p.running = true
p.mu.Unlock()

go func() {
Contributor


the goroutine captures only logger and p, the purger, i think?


@philjb philjb left a comment


Really nice; at first the locking looks complicated, but the lock for p.running isn't held over the goroutine that does the purging/removing from disk. The sync.Map simplifies a lot of the concurrency, so that worked out well I think, and it significantly helps with the concern in the ticket: that purging TSM files could freeze up compaction.

I don't think that is possible anymore. And purging is best effort; this code makes that clear.

test coverage is awesome. I liked the RW lock as a latch to start the goroutines roughly at the same time.

@davidby-influx davidby-influx merged commit 304e1ae into master-1.x Jan 30, 2026
9 checks passed
@davidby-influx davidby-influx deleted the DSB/purger_lock branch January 30, 2026 17:51
davidby-influx added a commit that referenced this pull request Jan 30, 2026
Use a sync.Map and minimize mutex sections to avoid
blocking in calls to the purger

Fixes #26110

(cherry picked from commit 304e1ae)

Fixes #27175
davidby-influx added a commit that referenced this pull request Jan 30, 2026
Use a sync.Map and minimize mutex sections to avoid
blocking in calls to the purger

Fixes #26110

(cherry picked from commit 304e1ae)

Fixes #27175
devanbenz pushed a commit that referenced this pull request Feb 5, 2026
Use a sync.Map and minimize mutex sections to avoid
blocking in calls to the purger

Fixes #26110

(cherry picked from commit 304e1ae)


Development

Successfully merging this pull request may close these issues.

[influxdb1.x/2.x]: purger code may prevent compactions from finishing

3 participants