feat: Defer cleanup for log/index compactions, add debug log #26511
devanbenz merged 13 commits into `master-1.x`
Conversation
I believe that there is something happening which causes CurrentCompactionN() to always be greater than 0, making Partition.Wait() hang forever.
davidby-influx
left a comment
More ticker/timer fun.
tsdb/index/tsi1/partition.go
Outdated
```go
}
p.logger.Debug("Partition.Wait() timed out waiting for compactions to complete",
	zap.Int32("stuck_compactions", p.CurrentCompactionN()), zap.Duration("timeout", timeoutDuration),
	zap.Strings("files", files))
```
Did we decide earlier that we should not exit on timeout?
If we return on timeout, then the timeout should be a timer not a ticker.
If we don't return on a timeout, then the timeout code could be in the ticker with a check if the timeoutDuration has passed since the wait was entered.
I don't think we should return on timeout, but I do think we should log. I was actually thinking about changing this to a warn, since it will only occur every 24 hours if a log compaction is stuck. This is mostly to ensure the issue is not still happening after my goroutine cleanup code is in place. I will add a comment about it.
davidby-influx
left a comment
time.After changes.
tsdb/index/tsi1/partition.go
Outdated
```go
}
<-ticker.C
select {
case <-time.After(timeoutDuration):
```
time.After creates a new channel on every call, so you need to move it outside the loop. See here: https://stackoverflow.com/a/39212626
That only prints the warning once, though. If you want to print the warning repeatedly (every timeoutDuration), then you should use an if inside the ticker case and check the elapsed time.
davidby-influx
left a comment
Looking better, but one more simplification
```go
<-ticker.C
select {
case <-ticker.C:
	elapsed := time.Since(startTime)
```
It seems like lastWarningTime and startTime could be collapsed into a single variable for simplicity.
```go
lastTime := time.Now()
for {
	if p.CurrentCompactionN() == 0 {
		return
	}
	select {
	case <-ticker.C:
		elapsed := time.Since(lastTime)
		if elapsed >= timeoutDuration {
			lastTime = time.Now()
			files := make([]string, 0)
			for _, v := range p.fileSet.Files() {
				files = append(files, v.Path())
			}
			// log the stuck compactions here
		}
	}
}
```
I believe that there is something happening which causes CurrentCompactionN() to always be greater than 0. Thus making Partition.Wait() hang forever.
Taking a look at some profiles where this issue occurs. I'm seeing a consistent one where we're stuck on Partition.Wait()
```
-----------+-------------------------------------------------------
1 runtime.gopark
runtime.chanrecv
runtime.chanrecv1
github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Wait
github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Close
github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).close
github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).Close
github.com/influxdata/influxdb/tsdb.(*Shard).closeNoLock
github.com/influxdata/influxdb/tsdb.(*Shard).Close
github.com/influxdata/influxdb/tsdb.(*Store).DeleteShard
github.com/influxdata/influxdb/services/retention.(*Service).DeletionCheck.func3
github.com/influxdata/influxdb/services/retention.(*Service).DeletionCheck
github.com/influxdata/influxdb/services/retention.(*Service).run
github.com/influxdata/influxdb/services/retention.(*Service).Open.func1
-----------+-------------------------------------------------------
```
Deferring compaction count cleanup inside goroutines should help with any hanging current compaction counts.
Modify currentCompactionN to be a sync/atomic value.
Adding a debug level log within Compaction.Wait() should aid in debugging.
(cherry picked from commit 149fb47)