Skip to content

Remote Write Checkpointing#7710

Closed
njingco wants to merge 14 commits intoprometheus:mainfrom
njingco:remote-write-segment
Closed

Remote Write Checkpointing#7710
njingco wants to merge 14 commits intoprometheus:mainfrom
njingco:remote-write-segment

Conversation

@njingco
Copy link
Contributor

@njingco njingco commented Jul 31, 2020

Currently remote write sends data via Prometheus WAL, but the pointer for each remote write queue in the WAL does not persist across restarts, as per #6333 . This PR will attempt to solve this issue by recording an identifier to indicate which segment file to re transmit on Prometheus restart. Re transmitting the whole segment file will potentially have zero data loss. Please refer to my proposal document for more information: Proposal Doc

This PR is currently in progress, but will include the base of the proposal. It will include recording the identifier during shut down, and at a set time, identifying the segment number for each endpoint, and error/corruption checks for the record json file.

It currently has the ability to record the identifier into a json. Sample of the record file is in SegmentRecord.json. It also does corruption checks of the json file at the start of the queue_manager.go.

Whats left to do in this PR add the timed recording from start of queue_manager, and implementing the time customization in the configuration file.

This PR will not have the implementation of setting the segment file for re transmission. This will be in a future PR. This PR will only manage the recording of the identifiers.

njingco added 8 commits July 28, 2020 03:43
…te the segment record to a json file

Signed-off-by: njingco <jingco.nicole@gmail.com>
Signed-off-by: njingco <jingco.nicole@gmail.com>
…e, updated the file as well

Signed-off-by: njingco <jingco.nicole@gmail.com>
…segment records from multiple endpoints

Signed-off-by: njingco <jingco.nicole@gmail.com>
Signed-off-by: njingco <jingco.nicole@gmail.com>
Signed-off-by: njingco <jingco.nicole@gmail.com>
Signed-off-by: njingco <jingco.nicole@gmail.com>
Copy link
Member

@cstyan cstyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good progress, made some comments. Let me know if anything is unclear.

}

// Sets the current Segment files name
w.SegmentFile = SegmentName("", currentSegment)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible for the watcher to have progressed to a new segment, without sending all samples from the previous segment? I think if you had a large enough buffer capacity in the queues that it could happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is possible

Signed-off-by: njingco <jingco.nicole@gmail.com>
…tly update the json file every set time, only made queue_manager have an update function for the CheckpointRecord

Signed-off-by: njingco <jingco.nicole@gmail.com>
Signed-off-by: njingco <jingco.nicole@gmail.com>
Signed-off-by: njingco <jingco.nicole@gmail.com>
Copy link
Member

@cstyan cstyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made some comments, good progress

w.recordsReadMetric.WithLabelValues(recordType(dec.Type(rec))).Inc()

// Sets the current Segment files it's reading from
w.CurrSegmentFile = SegmentName("", segmentNum)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be set in the same location as the metric is being set

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this not go in the readSegment() since this is where the current segment is being read

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's cleaner to set it where the control of what currentSegment is set to is. The Watchers Run function controls the local currentSegment, both passing the value to watch (and subsequently readSegment) and updating the value to the next segment number when watch returns. It's also easier for the next person who comes along to read the code if these two related variables are set in the same location.

…reating the checkpoint file, added the context the timed checkpointing go thread

Signed-off-by: njingco <jingco.nicole@gmail.com>

// UpdateEndpointRecord gets the current segment number read from the watcher and stores it to
func (t *QueueManager) UpdateEndpointRecord() {
t.endpointRecord.Segment = t.watcher.CurrSegmentFile
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

watcher should maybe have a CurrentSegment() function rather than exposing one of it's fields

w.recordsReadMetric.WithLabelValues(recordType(dec.Type(rec))).Inc()

// Sets the current Segment files it's reading from
w.CurrSegmentFile = SegmentName("", segmentNum)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's cleaner to set it where the control of what currentSegment is set to is. The Watchers Run function controls the local currentSegment, both passing the value to watch (and subsequently readSegment) and updating the value to the next segment number when watch returns. It's also easier for the next person who comes along to read the code if these two related variables are set in the same location.


MaxSegment: -1,
MaxSegment: -1,
CurrSegmentFile: "",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty string is the default value for a string, no need to set it here

// before.
var castagnoliTable = crc32.MakeTable(crc32.Castagnoli)
var (
castagnoliTable = crc32.MakeTable(crc32.Castagnoli)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you using this for something, maybe forgot to add a file to a commit?


// EndpointRecord Structure holds the segment the endpoint url
type EndpointRecord struct {
Segment string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the fields on this and checkpoint record don't need to be exported if you don't have a need to access those fields directly outside of the storage/remote package.

}

// Segmentnumber out of bounds
if segmentNumber < wal.MinSegmentNumber || segmentNumber > wal.MaxSegmentNumber {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should maybe have the watcher have a ValidSegment function that uses wal.Segments to check if segmentNumber exists in the current WAL directory. Thoughts @csmarchbanks ?

@stale stale bot added the stale label Nov 28, 2020
Base automatically changed from master to main February 23, 2021 19:36
@stale stale bot removed the stale label Feb 23, 2021
@stale stale bot added the stale label Apr 24, 2021
rfratto added a commit to rfratto/prometheus that referenced this pull request Nov 24, 2021
storage/remote to cache these segment offsets to disk.

Deprecates prometheus#7710.
Deprecates prometheus#8918.
Closes prometheus#8809.

Signed-off-by: Robert Fratto <robertfratto@gmail.com>
rfratto added a commit to rfratto/prometheus that referenced this pull request Feb 17, 2022
storage/remote to cache these segment offsets to disk.

Deprecates prometheus#7710.
Deprecates prometheus#8918.
Closes prometheus#8809.

Signed-off-by: Robert Fratto <robertfratto@gmail.com>
@roidelapluie
Copy link
Member

roidelapluie commented Sep 26, 2023

We have looked at this pull request during our bug scrub.

@csmarchbanks @cstyan, we don't have much context on our side. What is the status of this pull request? is this still valid/needed?

Thank you for your contribution anyway @njingco

@cstyan
Copy link
Member

cstyan commented Oct 18, 2023

We can close this implementation. We'd still like this feature but no one has worked on this implementation path in quite some time.

@cstyan cstyan closed this Oct 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants