fix: Resolves RLock() leakage in Store.DeleteSeries()#26647
fix: Resolves RLock() leakage in Store.DeleteSeries()#26647devanbenz merged 4 commits intomaster-1.xfrom
Conversation
| s.mu.RUnlock() | ||
| err, shards, epochs, sfile := getEpochsAndShards() | ||
| if err != nil && err != ErrNothingToDelete { | ||
| return err |
There was a problem hiding this comment.
I'm going to follow up with a commit to add error logging on the server as well here, since, we do not expose these errors except to the client handler.
davidby-influx
left a comment
There was a problem hiding this comment.
Some suggestions for error checks and test robustness
tsdb/store.go
Outdated
| epochs := s.epochsForShards(shards) | ||
| s.mu.RUnlock() | ||
| err, shards, epochs, sfile := getEpochsAndShards() | ||
| if err != nil && err != ErrNothingToDelete { |
There was a problem hiding this comment.
error.Is instead of equality test.
tsdb/store.go
Outdated
| err, shards, epochs, sfile := getEpochsAndShards() | ||
| if err != nil && err != ErrNothingToDelete { | ||
| return err | ||
| } else if err == ErrNothingToDelete { |
tsdb/store_test.go
Outdated
|
|
||
| if err != nil { | ||
| results <- fmt.Sprintf("CreateShard[%d]: ERROR after %v: %v", id, duration, err) | ||
| } else if duration > time.Millisecond*100 { |
There was a problem hiding this comment.
I worry about this arbitrary time behaving differently on different machines. Things like this often end up as flaky tests in a few years as processors speed up.
tsdb/store_test.go
Outdated
| if shard == nil { | ||
| results <- fmt.Sprintf("Shard[%d]: ERROR after %v: shard not found", id, duration) | ||
| } else if duration > time.Millisecond*100 { | ||
| results <- fmt.Sprintf("Shard[%d]: SLOW (%v) - possible lock contention", id, duration) |
There was a problem hiding this comment.
As above, this can lead to flaky tests because of arbitrary timeouts.
tsdb/store_test.go
Outdated
| if testShard == nil { | ||
| t.Error("System appears to be in unrecoverable state - shard lookup failed") | ||
| } else if recoveryTime > time.Millisecond*500 { | ||
| t.Logf("SLOW RECOVERY: %v - system performance degraded", recoveryTime) |
There was a problem hiding this comment.
Can be a pattern for flaky tests.
tsdb/store_test.go
Outdated
| } | ||
| } | ||
|
|
||
| t.Run("inmem", func(t *testing.T) { test("inmem") }) |
There was a problem hiding this comment.
Why only inmem instead of all registered indices? tsi is much more common in large clusters.
Adjust the tests to simplify them for store_test.go and remove arbitrary slow-ness checks
Taking a look at some of our new DeleteSeries changes I'm seeing two paths in which we can escape the function without releasing the lock We hold the lock here 40ec5b0/tsdb/store.go#L1652 And then within this code block we can return errors without releasing the lock: 40ec5b0/tsdb/store.go#L1665-L1682 We do not log out the errors on the server side and bubble up by returning the error via the http handler to the client 40ec5b0/services/httpd/handler.go#L1060-L1064 (cherry picked from commit c67028a)
Taking a look at some of our new DeleteSeries changes I'm seeing two paths in which we can escape the function without releasing the lock
We hold the lock here influxdata/influxdb@40ec5b0/tsdb/store.go#L1652
And then within this code block we can return errors without releasing the lock:
influxdata/influxdb@40ec5b0/tsdb/store.go#L1665-L1682
We do not log out the errors on the server side and bubble up by returning the error via the http handler to the client
influxdata/influxdb@40ec5b0/services/httpd/handler.go#L1060-L1064