
feat: Defer cleanup for log/index compactions, add debug log #26511

Merged

devanbenz merged 13 commits into master-1.x from db/shards_persisting on Jun 20, 2025
Conversation

@devanbenz devanbenz commented Jun 10, 2025

I believe that something is causing CurrentCompactionN() to always be greater than 0, which makes Partition.Wait() hang forever.

Looking at profiles from nodes where this issue occurs, I consistently see goroutines stuck in Partition.Wait():

```
-----------+-------------------------------------------------------
         1   runtime.gopark
             runtime.chanrecv
             runtime.chanrecv1
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Wait
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).Close
             github.com/influxdata/influxdb/tsdb.(*Shard).closeNoLock
             github.com/influxdata/influxdb/tsdb.(*Shard).Close
             github.com/influxdata/influxdb/tsdb.(*Store).DeleteShard
             github.com/influxdata/influxdb/services/retention.(*Service).DeletionCheck.func3
             github.com/influxdata/influxdb/services/retention.(*Service).DeletionCheck
             github.com/influxdata/influxdb/services/retention.(*Service).run
             github.com/influxdata/influxdb/services/retention.(*Service).Open.func1
-----------+-------------------------------------------------------
```

Deferring the compaction-count cleanup inside the goroutines should prevent a current compaction count from hanging around after the work finishes.

currentCompactionN is changed to a sync/atomic value.

Adding a debug-level log within Partition.Wait() should aid in debugging.
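
To make the intent concrete, here is a minimal sketch of the deferred-cleanup pattern described above. The type is trimmed down and startCompaction is a hypothetical name for illustration; this is not the exact tsi1 code.

```go
package tsi1sketch

import "sync/atomic"

// Partition is a trimmed-down stand-in for tsi1.Partition; only the
// compaction counter touched by this PR is shown.
type Partition struct {
	currentCompactionN atomic.Int32 // previously a plain counter
}

// startCompaction (hypothetical name) launches a compaction goroutine.
// Deferring the decrement guarantees the count is released even if the
// compaction returns early or panics, so Wait() cannot hang on a stale count.
func (p *Partition) startCompaction(run func()) {
	p.currentCompactionN.Add(1)
	go func() {
		defer p.currentCompactionN.Add(-1)
		run()
	}()
}

// CurrentCompactionN reports the number of in-flight compactions.
func (p *Partition) CurrentCompactionN() int32 {
	return p.currentCompactionN.Load()
}
```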

I believe that there is something happening which causes
CurrentCompactionN() to always be greater than 0. Thus making
Partition.Wait() hang forever.
@devanbenz devanbenz changed the title from "feat: Add a debug level log for Parititon.Wait()" to "feat: Add a debug level log for Partition.Wait()" on Jun 10, 2025
@devanbenz devanbenz changed the title from "feat: Add a debug level log for Partition.Wait()" to "feat: Defer cleanup for log/index compactions, add debug log" on Jun 11, 2025
@devanbenz devanbenz marked this pull request as ready for review June 13, 2025 18:21
@devanbenz devanbenz requested a review from davidby-influx June 13, 2025 18:21
@devanbenz devanbenz self-assigned this Jun 13, 2025
@devanbenz devanbenz requested a review from gwossum June 13, 2025 18:21
Contributor

@davidby-influx davidby-influx left a comment


More ticker/timer fun.

```go
}
p.logger.Debug("Partition.Wait() timed out waiting for compactions to complete",
	zap.Int32("stuck_compactions", p.CurrentCompactionN()), zap.Duration("timeout", timeoutDuration),
	zap.Strings("files", files))
```
Contributor


Did we decide earlier that we should not exit on timeout?

If we return on timeout, then the timeout should be a timer not a ticker.

If we don't return on timeout, then the timeout code could live in the ticker case, with a check whether timeoutDuration has passed since the wait was entered.
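
As an illustration of the two options, a standalone sketch with placeholder parameters (countFn, pollEvery, and timeout stand in for p.CurrentCompactionN and the PR's constants; this is not the Partition code itself):

```go
package main

import (
	"log"
	"time"
)

// waitSketch contrasts the two approaches discussed above.
func waitSketch(countFn func() int, pollEvery, timeout time.Duration, returnOnTimeout bool) {
	ticker := time.NewTicker(pollEvery)
	defer ticker.Stop()

	if returnOnTimeout {
		// Returning on timeout only ever happens once, so a one-shot timer
		// is the right tool rather than a ticker.
		timer := time.NewTimer(timeout)
		defer timer.Stop()
		for countFn() > 0 {
			select {
			case <-ticker.C: // re-check the count on the next iteration
			case <-timer.C:
				return // give up waiting
			}
		}
		return
	}

	// Not returning on timeout: keep only the ticker and compare elapsed
	// time inside its case.
	start := time.Now()
	for countFn() > 0 {
		<-ticker.C
		if time.Since(start) >= timeout {
			log.Println("still waiting for compactions")
			start = time.Now() // reset so the message repeats every timeout period
		}
	}
}
```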

Author


I don't think we should return on timeout. I do think we should log; I was actually thinking about changing this to a warn, since it will only occur every 24 hours if a log compaction is stuck. This is mostly to ensure the issue is not happening after my goroutine cleanup code is in place. I will add a comment about it.

@devanbenz devanbenz requested a review from davidby-influx June 16, 2025 18:17
Contributor

@davidby-influx davidby-influx left a comment


time.After changes.

```go
}
<-ticker.C
select {
case <-time.After(timeoutDuration):
```
Contributor


time.After creates a new channel every call. You need to move it outside the loop. See here: https://stackoverflow.com/a/39212626

This only prints the warning once, though. If you want to print the warning repeatedly (every timeoutDuration), then you should use an if in the ticker case and check the elapsed time.
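
A small, self-contained demonstration of the difference (toy durations and placeholders, not the Partition code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	// Anti-pattern: `case <-time.After(d):` inside the loop allocates a new
	// timer and channel on every iteration, and because the ticker usually
	// fires first, that timeout branch may never run at all.

	// Hoisting the call outside the loop creates the channel once, so the
	// timeout actually elapses and fires (at most once).
	deadline := time.After(350 * time.Millisecond)
	for i := 0; i < 10; i++ {
		select {
		case <-ticker.C:
			fmt.Println("tick")
		case <-deadline:
			fmt.Println("timed out")
			deadline = nil // a nil channel blocks forever, so this fires only once
		}
	}
}
```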

Contributor

@davidby-influx davidby-influx left a comment


Looking better, but one more simplification

```go
<-ticker.C
select {
case <-ticker.C:
	elapsed := time.Since(startTime)
```
Contributor


It seems like lastWarningTime and startTime could be collapsed to a single variable for simplicity.

```go
lastTime := time.Now()
for {
	if p.CurrentCompactionN() == 0 {
		return
	}
	select {
	case <-ticker.C:
		elapsed := time.Since(lastTime)
		if elapsed >= timeoutDuration {
			lastTime = time.Now()
			files := make([]string, 0)
			for _, v := range p.fileSet.Files() {
				files = append(files, v.Path())
```
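
Pieced together under the assumption that ticker, timeoutDuration, p.fileSet, and p.logger are the same fields seen in the earlier snippets, the full loop might look roughly like this; it is a sketch, not the exact code that was merged:

```go
lastTime := time.Now()
for {
	if p.CurrentCompactionN() == 0 {
		return
	}
	select {
	case <-ticker.C:
		// Warn at most once per timeoutDuration; a single lastTime variable
		// serves as both the start time and the last-warning time.
		if elapsed := time.Since(lastTime); elapsed >= timeoutDuration {
			lastTime = time.Now()
			files := make([]string, 0)
			for _, v := range p.fileSet.Files() {
				files = append(files, v.Path())
			}
			p.logger.Debug("Partition.Wait() timed out waiting for compactions to complete",
				zap.Int32("stuck_compactions", p.CurrentCompactionN()),
				zap.Duration("timeout", timeoutDuration),
				zap.Strings("files", files))
		}
	}
}
```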

Contributor

@davidby-influx davidby-influx left a comment


LGTM

@devanbenz devanbenz merged commit 149fb47 into master-1.x Jun 20, 2025
9 checks passed
@devanbenz devanbenz deleted the db/shards_persisting branch June 20, 2025 18:18
devanbenz added a commit that referenced this pull request Jun 23, 2025
(cherry picked from commit 149fb47)
devanbenz added a commit that referenced this pull request Jul 25, 2025
(cherry picked from commit 149fb47)
devanbenz added a commit that referenced this pull request Jul 31, 2025
devanbenz added a commit that referenced this pull request Sep 30, 2025
(cherry picked from commit 149fb47)
devanbenz added a commit that referenced this pull request Jan 30, 2026
(cherry picked from commit 149fb47)
