WIP: Remote write checkpointing #8918

Closed
cstyan wants to merge 14 commits into prometheus:main from cstyan:remote-write-checkpointing

Conversation

@cstyan (Member) commented Jun 10, 2021

A still really rough, but working, implementation of #8809.

cc: @rfratto

njingco and others added 12 commits May 31, 2021 21:40
…te the segment record to a json file

Signed-off-by: njingco <jingco.nicole@gmail.com>
Signed-off-by: njingco <jingco.nicole@gmail.com>
…e, updated the file as well

Signed-off-by: njingco <jingco.nicole@gmail.com>
…segment records from multiple endpoints

Signed-off-by: njingco <jingco.nicole@gmail.com>
Signed-off-by: njingco <jingco.nicole@gmail.com>
Signed-off-by: njingco <jingco.nicole@gmail.com>
Signed-off-by: njingco <jingco.nicole@gmail.com>
Signed-off-by: njingco <jingco.nicole@gmail.com>
…tly update the json file every set time, only made queue_manager have an update function for the CheckpointRecord

Signed-off-by: njingco <jingco.nicole@gmail.com>
Signed-off-by: njingco <jingco.nicole@gmail.com>
…reating the checkpoint file, added the context the timed checkpointing go thread

Signed-off-by: njingco <jingco.nicole@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
@rfratto (Contributor) left a comment

(FYI, you probably didn't mean to check in the tsdb/tsdb or documentation/examples/remote_storage/example_write_adapter/server files)

checkpointDir := filepath.Join(t.tsdbDir, "remote", t.client().Name())
err := os.MkdirAll(checkpointDir, 0777)
if err != nil {
level.Warn(t.logger).Log("msg", "error creating temporary remote write checkpoint directory", "error", err)
@rfratto (Contributor):

Should this return on error? Ditto with WriteFile/Replace failing.

@cstyan (Member, Author):

yep

tsdb/wal/wal.go Outdated
var (
castagnoliTable = crc32.MakeTable(crc32.Castagnoli)

MaxSegmentNumber = 99999999
@rfratto (Contributor):

Is math.MaxInt32/math.MaxInt64 more appropriate here?

@cstyan (Member, Author):

yep, but these aren't being used anymore so I will remove them

func (w *Watcher) Run() error {
// or an error case is hit. Use -1 for startSegment if the watcher should not
// start tailing samples until after startTime.
func (w *Watcher) Run(startSegment int) error {
@rfratto (Contributor):

Don't segment files start at 1? You might be able to use 0 instead of -1 here.

@cstyan (Member, Author):

Nope, they start at 0

ls data/wal
00000000

@rfratto (Contributor):

👍 my bad :)

Comment on lines +647 to +651
for _, s := range segments {
if s == segmentNum {
return true, nil
}
}
@rfratto (Contributor):

Since the segments are sorted you may be able to do an early return here if s > segmentNum.

}
for _, s := range samples {
w.samplesRead++
if s.T > w.startTimestamp {
@rfratto (Contributor):

Hmm, doesn't this have the same problem? Even if we start at a newer segment, we're still not going to queue any new samples until after time.Now() has passed.

@cstyan (Member, Author):

It's smelly but at the moment we're only setting the start time if we didn't find/set the starting segment.

if startSegment == -1 {
	w.SetStartTime(time.Now())
}

Comment on lines +72 to +79
done chan struct{}
// Soft shutdown context will prevent new enqueues and deadlocks.
softShutdown chan struct{}

// Hard shutdown context is used to terminate outgoing HTTP connections
// after giving them a chance to terminate.
hardShutdown context.CancelFunc
droppedOnHardShutdown atomic.Uint32
@rfratto (Contributor):

Are these being used anywhere? I see (at least) done being closed, but never read. Am I missing something?

@cstyan (Member, Author):

No, none of these were being used. I don't think done was being closed either.

test.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
@csmarchbanks (Member) left a comment

I don't love that the checkpoint logic, and things such as the tsdbDir and path, are spread across multiple packages. Would it be possible to keep most of the logic contained within watcher or queue_manager?

My one other thought is to make sure the checkpoint file format is expandable in the future for adding offsets. Do we want more structured data in that file?

metrics *WatcherMetrics
readerMetrics *LiveReaderMetrics

samplesSent int
@csmarchbanks (Member):

Am I missing something, or are these unused other than to increment them?

// For testing, stop when we hit this segment.
MaxSegment int

mutex sync.RWMutex
@csmarchbanks (Member):

Is this mutex used?

startSegment := -1

// look for a checkpoint file
fname := filepath.Join(w.tsdbDir, "remote", w.name, "checkpoint")
@csmarchbanks (Member):

Putting remote into a wal specific file doesn't seem ideal. Could it just be watcher instead? Or specified when creating a Watcher?

Signed-off-by: Callum Styan <callumstyan@gmail.com>
@stale stale bot added the stale label Aug 28, 2021
gouthamve added a commit to gouthamve/agent that referenced this pull request Nov 24, 2021
Agent / Remote Write code currently doesn't send the samples from
the WAL on startup at all. We only send the new samples so replaying
the WAL doesn't give us much.

It does let us load series references from the WAL but we save much more
by not reading the WAL at all.

See: prometheus/prometheus#9848 and prometheus/prometheus#8918

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
rfratto added a commit to rfratto/prometheus that referenced this pull request Nov 24, 2021
storage/remote to cache these segment offsets to disk.

Deprecates prometheus#7710.
Deprecates prometheus#8918.
Closes prometheus#8809.

Signed-off-by: Robert Fratto <robertfratto@gmail.com>
rfratto added a commit to rfratto/prometheus that referenced this pull request Feb 17, 2022
storage/remote to cache these segment offsets to disk.

Deprecates prometheus#7710.
Deprecates prometheus#8918.
Closes prometheus#8809.

Signed-off-by: Robert Fratto <robertfratto@gmail.com>
@cstyan cstyan closed this by deleting the head repository Aug 14, 2023