fix: reduce lock contention and races in purger#27146
davidby-influx merged 8 commits into master-1.x from
Conversation
Pull request overview
This pull request refactors the TSM file purger to address lock contention and race conditions that could prevent compactions from completing. The changes migrate from a regular map with RWMutex to a sync.Map wrapper, significantly reducing the time that locks are held during file deletion operations.
Changes:
- Replaced the standard map with `gensyncmap.Map[string, TSMFile]` to eliminate memory retention issues and enable lock-free concurrent access
- Minimized mutex-protected sections by no longer holding the lock during file `Close()` and `Remove()` operations, which have unbounded runtime
- Removed the unused `fileStore` field from the purger struct
- Added comprehensive test coverage, including stress tests for concurrent operations and race conditions
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tsdb/engine/tsm1/file_store.go | Refactored purger implementation to use sync.Map and minimize lock contention; removed lock holding during file operations |
| tsdb/engine/tsm1/purger_test.go | Added comprehensive test suite with mock TSMFile implementation and stress tests for concurrent purger operations |
tsdb/engine/tsm1/file_store.go
Outdated
```go
p.mu.Lock()
defer p.mu.Unlock()

if p.files.Len() == 0 {
```
The Len() method iterates through the entire sync.Map using Range(), which can be expensive if there are many files. This is called on every iteration of the purge loop (line 1654). Consider tracking the count separately using an atomic counter that's incremented on Store() and decremented on Delete(), or cache the length check result to avoid repeated full map scans.
Suggested change, replacing `if p.files.Len() == 0 {`:

```go
// Avoid using Len(), which iterates the entire underlying map.
// Instead, check for emptiness by stopping Range() after the first entry.
has := false
p.files.Range(func(k string, v TSMFile) bool {
	has = true
	return false // stop iteration after finding the first file
})
if !has {
```
oh yeah; one of the big downsides of sync.Map: Len() is no longer cheap. The separate atomic counter is also painful - both the map and the counter need to be updated.
I don't recall if files is big or not; I'll keep reading and see.
And perhaps the performance penalty of Len() is actually fine here.
```go
delete(p.files, k)
purgeCount++
// Remove the file regardless of success or failure.
// Do not retry files which could not be closed or removed.
```
Are you thinking we shouldn't retry because we're just likely to fail again (and therefore forever)? The purger is already the second attempt to remove a file (if I recall correctly), so not retrying makes sense to me.
I'm curious about the why behind the change: why don't you want to retry anymore? The comment says what it does but not why.
Failing to Close() or Remove() generally means we have an operating system error unmapping or deleting the file, so we don't want to fill the log with "Already Closed" or "Path Not Found" errors. The situation will not improve, so retrying is pointless.
Cool; thought so, but wanted to check. I've forgotten some of what is happening here since I made the ticket.
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
```go
// Do not retry files which could not be closed or removed.
// This is because they have already been closed, or there
// is an operating system problem that is unlikely to
// resolve by itself.
```
```go
}

// purger manages asynchronous deletion of TSM files that have been
// replaced by compaction, but are temporarily held open by queries
```
```go
p.running = true
p.mu.Unlock()

go func() {
```
The goroutine captures only logger and p (the purger), I think?
philjb left a comment
Really nice. At first the locking looks complicated, but the lock for p.running isn't held over the goroutine that does the purging/removing from disk. The sync.Map simplifies a lot of the concurrency, so that worked out well, I think, and it significantly addresses the concern in the ticket: that purging TSM files could freeze up compaction.
I don't think that is possible anymore. Purging is best effort, and this code makes that clear.
Test coverage is awesome. I liked the RW lock as a latch to start the goroutines roughly at the same time.
Use a sync.Map and minimize mutex sections
Fixes #26110