wlog: Optimized and refactored watcher code. #16182
Conversation
Force-pushed from d8d616f to 2ee34a1
Beginning is hours ago, possibly tens of hours. See #8809 for another idea.
bboreham left a comment
Scary to change this code, but the shadowing is a great catch.
I haven't done a full review, just a quick initial scan.
Thanks for the quick review @bboreham - I was fighting with the super confusing behaviour of LiveReader (the
Force-pushed from 47cf03b to efd6289
Force-pushed from eca1e09 to c76a5e3
The `:=` causes new variables to be created, which means the outer slice stays at nil, and new memory is allocated every time round the loop. Extracted from prometheus#16182. Credit to @bwplotka.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
I'm excited about the pessimisation you found - extracted #16197 in the hope that we can merge it faster than this 900-line PR.
Signed-off-by: bwplotka <bwplotka@gmail.com>

# Conflicts:
#	tsdb/wlog/watcher_test.go
#	tsdb/wlog/watcher.go
Force-pushed from b96b8f8 to e325164
Should be ready for review @bboreham, PTAL.
As mentioned in prometheus#16182, the BenchmarkStartup test for the queue manager covers an old API and uses settings that will not occur in production.

Signed-off-by: Adam Bernot <bernot@google.com>
Signed-off-by: perebaj <perebaj@gmail.com>
Hello from the bug-scrub! I will take another look.
This needs a rebase; will do soon.
I'm looking at this PR; I've done the rebase locally, if you are brave enough to let me push it.
bboreham left a comment
Generally fine, though I would like to resolve the TODOs before merging.
I didn't follow all the test changes; perhaps describe them more fully in the PR?
```go
// We do this here rather than in the constructor because of the ordering of
// creating Queue Managers's, stopping them, and then starting new ones in
// storage/remote/storage.go ApplyConfig.
func (w *Watcher) initMetrics() {
```
startMetrics in comment vs initMetrics in code
```go
// readSegment reads all known records into w.writer from a segment.
// It returns the EOF error if the segment is corrupted or partially written.
func (w *Watcher) readSegment(r *LiveReader, startT int64, segmentNum int) (err error) {
```
This might be clearer if named more like readSegmentAllData, since the counterpart is readSegmentSeries.
```go
// One table per WAL segment means it won't grow indefinitely.
dec = record.NewDecoder(labels.NewSymbolTable())
```

```go
// TODO(bwplotka): Consider zeropools.
```
Pool should be worse, since we are already reusing memory and there is only ever zero or one slice of each type in use.
```go
if h.T > startT {
	if !w.replayDone {
		w.replayDone = true
		w.logger.Info("Done replaying WAL", "duration", time.Since(timestamp.Time(startT)))
```
Could this be a function, to avoid repeating it three times?
```diff
 	break
 }
-metadata, err = dec.Metadata(rec, metadata[:0])
+meta, err := dec.Metadata(rec, metadata[:0])
```
This is an accident, I think, reverting part of the optimisation from #16197.
```go
// TODO(bwplotka): Checking every 100ms feels too frequent. It might be enough
// to check on notify AND with emergency 15s read only.
```
I agree that scanning the whole WAL directory every 100ms is too much.
However, this is the only way that watchSegment returns (except for errors), and hence the only way we move to the next segment. We cannot wait 15s for that.
Possibly we could extend Notify to include the information that there is a new segment?
Maybe we can observe that we got notified and no new data was in the file, so we should look at that point for a new segment?
This cleans up the watcher structure and tests. No functionality should be changed.

Changes:
- `WriteTo` is compatible (it allows pooling).

I noticed the need for this in #16046 (profiling suggests watcher decoding is a significant overhead regardless of sample type).