
feat: TagValueIterator holds RLock recursively #26369

Merged
devanbenz merged 1 commit into main-2.x from db/6059/logfile-deadlock on May 8, 2025

Conversation

@devanbenz

It looks like we are holding a reader lock (and deferring its release) while at the same time acquiring that same reader lock again inside the call to `return tk.TagValueIterator()`.

According to Go's RWMutex documentation this is not a good idea: https://pkg.go.dev/sync#RWMutex

> If any goroutine calls `RWMutex.Lock` while the lock is already held by one or more readers, concurrent calls to `RWMutex.RLock` will block until the writer has acquired (and released) the lock, to ensure that the lock eventually becomes available to the writer. Note that this prohibits recursive read-locking.

I believe this is what is causing the deadlock reported in a few different tickets. I ran a reproducer that read from and wrote to a bucket with the retention policy set to 1 minute and the shard rollover set to 30 seconds, and could reproduce the deadlock consistently this way. With some logging added around the mutexes, I see the following during every reproduction:

```
TagValueIterator is RUnlock()
TagValueIterator is about to RLock()
TagValueIterator is RLock()
TagValueIterator tk about to RLock()
TagValueIterator tk is RLock()
TagValueIterator tk about to RUnlock()
TagValueIterator tk is RUnlock()
TagValueIterator is RUnlock()
TagValueIterator is about to RLock()
TagValueIterator is RLock()
TagValueIterator tk about to RLock()
TagValueIterator tk is RLock()
TagValueIterator tk about to RUnlock()
TagValueIterator tk is RUnlock()
TagValueIterator is RUnlock()
TagValueIterator is about to RLock()
TagValueIterator is RLock()
Lock() is about to be called in AddSeriesList
TagValueIterator tk about to RLock()
```

The `TagValueIterator tk` lock is taken inside `return tk.TagValueIterator()`, at `tk.f.mu.RLock()`:

https://github.com/influxdata/influxdb/blob/eebe655e4c596b8f78bae30e5729092b6732d1a7/tsdb/index/tsi1/log_file.go#L1388

The `AddSeriesList` `Lock()` is found here:

https://github.com/influxdata/influxdb/blob/eebe655e4c596b8f78bae30e5729092b6732d1a7/tsdb/index/tsi1/log_file.go#L545

So it appears that `AddSeriesList` and `TagValueIterator` each take a different lock type on the same `RWMutex`, and the interleaving above deadlocks.

After modifying the code, I'm no longer seeing the problem. I have a feeling the recursive call to `RLock()` is the underlying issue.

@devanbenz devanbenz changed the title feat: TagValueIterator holds RLock for too long feat: TagValueIterator holds RLock recursively May 8, 2025
Contributor

@davidby-influx davidby-influx left a comment


Current code seems fine, but one suggestion for future maintainability.

@devanbenz devanbenz requested a review from davidby-influx May 8, 2025 18:11
@devanbenz devanbenz force-pushed the db/6059/logfile-deadlock branch from 47f48b4 to 6c1aee8 Compare May 8, 2025 18:15
Contributor

@davidby-influx davidby-influx left a comment


LGTM

@devanbenz devanbenz merged commit 2e842ff into main-2.x May 8, 2025
26 of 27 checks passed
@devanbenz devanbenz deleted the db/6059/logfile-deadlock branch May 8, 2025 18:50
devanbenz added a commit that referenced this pull request May 8, 2025
(cherry picked from commit 2e842ff)
devanbenz added a commit that referenced this pull request May 13, 2025
(cherry picked from commit 2e842ff)
devanbenz added a commit that referenced this pull request Sep 26, 2025