Skip to content

fix: Resolves RLock() leakage in Store.DeleteSeries()#26647

Merged
devanbenz merged 4 commits intomaster-1.xfrom
db/6262/hanging-queries
Aug 1, 2025
Merged

fix: Resolves RLock() leakage in Store.DeleteSeries()#26647
devanbenz merged 4 commits intomaster-1.xfrom
db/6262/hanging-queries

Conversation

@devanbenz
Copy link
Copy Markdown

Taking a look at some of our new DeleteSeries changes I'm seeing two paths in which we can escape the function without releasing the lock

We hold the lock here influxdata/influxdb@40ec5b0/tsdb/store.go#L1652

And then within this code block we can return errors without releasing the lock:

influxdata/influxdb@40ec5b0/tsdb/store.go#L1665-L1682

We do not log out the errors on the server side and bubble up by returning the error via the http handler to the client

influxdata/influxdb@40ec5b0/services/httpd/handler.go#L1060-L1064

s.mu.RUnlock()
err, shards, epochs, sfile := getEpochsAndShards()
if err != nil && err != ErrNothingToDelete {
return err
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to follow up with a commit to add error logging on the server as well here, since, we do not expose these errors except to the client handler.

Copy link
Copy Markdown
Contributor

@davidby-influx davidby-influx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions for error checks and test robustness

tsdb/store.go Outdated
epochs := s.epochsForShards(shards)
s.mu.RUnlock()
err, shards, epochs, sfile := getEpochsAndShards()
if err != nil && err != ErrNothingToDelete {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error.Is instead of equality test.

tsdb/store.go Outdated
err, shards, epochs, sfile := getEpochsAndShards()
if err != nil && err != ErrNothingToDelete {
return err
} else if err == ErrNothingToDelete {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errors.Is


if err != nil {
results <- fmt.Sprintf("CreateShard[%d]: ERROR after %v: %v", id, duration, err)
} else if duration > time.Millisecond*100 {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry about this arbitrary time behaving differently on different machines. Things like this often end up as flaky tests in a few years as processors speed up.

if shard == nil {
results <- fmt.Sprintf("Shard[%d]: ERROR after %v: shard not found", id, duration)
} else if duration > time.Millisecond*100 {
results <- fmt.Sprintf("Shard[%d]: SLOW (%v) - possible lock contention", id, duration)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, this can lead to flaky tests because of arbitrary timeouts.

if testShard == nil {
t.Error("System appears to be in unrecoverable state - shard lookup failed")
} else if recoveryTime > time.Millisecond*500 {
t.Logf("SLOW RECOVERY: %v - system performance degraded", recoveryTime)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be a pattern for flaky tests.

}
}

t.Run("inmem", func(t *testing.T) { test("inmem") })
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only inmem instead of all registered indices? tsi is much more common in large clusters.

Adjust the tests to simplify them for store_test.go and remove
arbitrary slow-ness checks
@devanbenz devanbenz requested a review from davidby-influx July 31, 2025 15:09
@devanbenz devanbenz marked this pull request as ready for review July 31, 2025 15:09
Copy link
Copy Markdown
Contributor

@davidby-influx davidby-influx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@devanbenz devanbenz merged commit c67028a into master-1.x Aug 1, 2025
9 checks passed
@devanbenz devanbenz deleted the db/6262/hanging-queries branch August 1, 2025 18:21
devanbenz added a commit that referenced this pull request Sep 26, 2025
Taking a look at some of our new DeleteSeries changes I'm seeing two paths in which we can escape the function without releasing the lock

We hold the lock here 40ec5b0/tsdb/store.go#L1652

And then within this code block we can return errors without releasing the lock:

40ec5b0/tsdb/store.go#L1665-L1682

We do not log out the errors on the server side and bubble up by returning the error via the http handler to the client

40ec5b0/services/httpd/handler.go#L1060-L1064

(cherry picked from commit c67028a)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants