insights: add base for data retention worker by leonore · Pull Request #46082 · sourcegraph/sourcegraph-public-snapshot

leonore · 2023-01-03T16:57:37Z

Handle() implementation will be covered by https://github.com/sourcegraph/sourcegraph/issues/45745

Test plan

Added the data retention job to the list of worker jobs and observed it started up correctly.
Jobs get queued correctly from the work handler.
Changed the completion time to 1 minute and observed the jobs get cleaned up correctly.

sourcegraph-bot · 2023-01-03T16:59:10Z

Codenotify: Notifying subscribers in CODENOTIFY files for diff 82431e2...9aaa42a.

Notify	File(s)
@efritz	enterprise/cmd/worker/internal/insights/data_retention_job.go enterprise/cmd/worker/internal/insights/query_runner_job.go
@sourcegraph/code-insights-backend	enterprise/internal/insights/background/background.go enterprise/internal/insights/background/queryrunner/cleaner.go enterprise/internal/insights/background/queryrunner/work_handler.go enterprise/internal/insights/background/queryrunner/worker.go enterprise/internal/insights/background/retention/cleaner.go enterprise/internal/insights/background/retention/job.go enterprise/internal/insights/background/retention/worker.go

chwarwick · 2023-01-03T17:28:09Z

 	}

+	// enqueue this insight series for data retention in parallel
+	_, err = retention.EnqueueJob(ctx, ss, &retention.DataRetentionJob{SeriesID: series.ID})


It feels unusual to queue the retention job from within the snapshot process. Is there a reason to not run this enqueue from its own periodic goroutine?

leonore · 2023-01-04

the reason I am doing this from the query runner rather than in a separate routine is because

1) the query runner will run for all active series
2) it will run every time we want a new sample

this means with 1) we enqueue a retention job for all relevant series and with 2) we enqueue it every time we get a new sample point, i.e. one that we might need to move to the retention table. this feels like an efficient way to enqueue the job we need when we need it.

chwarwick · 2023-01-04

My concern here is that it starts mixing responsibilities, and it's not an intuitive place to look for that functionality. It's easily reversible now, so I didn't want to hold things up. To keep that same logic on when to run, I think it would be more appropriate to lift it up to the insightEnqueuer here, since this logic is around identifying active series and generating points for them. You could even limit it to only run when creating a new datapoint instead of a snapshot.

I'm not sure what the logic will be to determine which points are archived, but there is potentially still a race condition on which job runs first if the current number of points is part of the criteria.

leonore · 2023-01-04

the idea was to put all the work of figuring out whether a series needs to be truncated and what data to archive in the new queue (insights-data-retention-worker) such that it is idempotent and there is no risk if we work on the same series twice.

if you meant that we should just move the logic to enqueue all data series that need a new recording to the insight enqueuer routine the only problem I see is that the retention worker might pick up work for the series before the recording is finished (writing this out I think that this is also the case now but we could move that enqueue to after we save the recording. we can also only enqueue for recordings).

if instead what you meant is to also do a series selection based on whether or not the series need truncation I think that adds too much responsibility on the insight enqueuer.
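The idempotency argument above can be illustrated with a minimal sketch. It assumes the retention rule is "keep at most N live samples per series and archive the rest", which is only suggested by the branch name and not confirmed in this thread. The `store` type, `applyRetention`, and the in-memory maps are hypothetical stand-ins for the real tables and handler.

```go
package main

import (
	"fmt"
	"sort"
)

// store is a hypothetical in-memory stand-in for the live and archive tables.
type store struct {
	live    map[int][]int64 // series ID -> recording times (unix seconds)
	archive map[int][]int64 // series ID -> archived recording times
}

// applyRetention moves the oldest recordings of a series into the archive
// until at most maxSamples remain live. It returns how many points it moved.
// Assumption: the retention rule is a simple cap on the number of samples.
func (s *store) applyRetention(seriesID, maxSamples int) int {
	if maxSamples < 0 {
		maxSamples = 0
	}
	points := s.live[seriesID]
	excess := len(points) - maxSamples
	if excess <= 0 {
		return 0 // nothing to archive, so a repeated job is a no-op
	}
	// Sort oldest first so the excess points are at the front.
	sort.Slice(points, func(i, j int) bool { return points[i] < points[j] })
	s.archive[seriesID] = append(s.archive[seriesID], points[:excess]...)
	s.live[seriesID] = append([]int64(nil), points[excess:]...)
	return excess
}

func main() {
	s := &store{
		live:    map[int][]int64{1: {50, 10, 40, 20, 30}},
		archive: map[int][]int64{},
	}
	fmt.Println("first run archived:", s.applyRetention(1, 3))
	fmt.Println("second run archived:", s.applyRetention(1, 3))
	fmt.Println("live:", s.live[1], "archive:", s.archive[1])
}
```

Because the handler recomputes the excess from the current state each time, working on the same series twice does no harm, whichever job happens to run first.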

chwarwick · 2023-01-04

> if you meant that we should just move the logic to enqueue all data series that need a new recording to the insight enqueuer routine the only problem I see is that the retention worker might pick up work for the series before the recording is finished (writing this out I think that this is also the case now but we could move that enqueue to after we save the recording. we can also only enqueue for recordings).

That is what I meant, and that is also the race condition I was worried about, but I think moving it here is preferable in that it's clearer what is going on and less prone to changes/errors in the work handler. My preference is still to have a small background task, similar to the insight enqueuer, that runs once a day and queues up any series that need retention/archive work done.

leonore · 2023-01-04

moving it there then creates this race condition though? and if we error in the work handler then we don't need to have a look at the series again since it won't have any new data

I think adding another routine just adds more complexity and boilerplate when we already have all the data we need in existing routines

chwarwick · 2023-01-05

My thought is that typically I would expect something like data retention to run on a fixed schedule, along the lines of a cron job. The closest parallel we currently have are periodic goroutines, which don't come with much boilerplate but do suffer from the problem that they run in each instance of the worker (for example the insight enqueuer).

What I'm trying to avoid is coupling separate processes that relate to the same data but remain distinct and unrelated to each other. If I were to edit the handler that runs searches, that doesn't imply that I should be changing retention. If I change retention, it doesn't imply that I should also change something about how searches are run. Can we continue to make insights more testable with this enqueue in the search handler? As is, it is a hidden side effect, and without a deep knowledge of the system it's difficult to know that you should look for it. To improve the testability it could be exposed as a dependency, but then the question is: why is it a dependency? What should I expect?

From a user's perspective, if I change a retention policy it shouldn't be necessary to wait for every series to generate new data before that is applied. That could mean it takes anywhere from 1 hour to years (assuming in the handler it only gets queued after recordings) for the retention to update on a series.

I opened a draft to show an example of the approach I mentioned. I don't think this is a show stopping issue by any means but I do want to ensure we keep moving in a direction of isolating behavior and improving the ability to automate testing so that we can move faster.
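As a rough illustration of the separate periodic routine described above (not the code from the draft PR), the sketch below keeps the enqueue step behind explicit, injectable dependencies so it can be tested without the search handler. All names here (`retentionEnqueuer`, `listSeries`, `enqueue`, `newFakeEnqueuer`) are hypothetical, and the ticker loop is a plain standard-library stand-in for whatever periodic goroutine helper the worker actually uses.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"
)

// retentionEnqueuer queues retention jobs on its own schedule.
// Both dependencies are explicit so a test can replace them with fakes.
type retentionEnqueuer struct {
	listSeries func(ctx context.Context) ([]int, error)
	enqueue    func(ctx context.Context, seriesID int) error
}

// runOnce enqueues one retention job per listed series and reports how many
// jobs were queued before any error.
func (e *retentionEnqueuer) runOnce(ctx context.Context) (int, error) {
	ids, err := e.listSeries(ctx)
	if err != nil {
		return 0, err
	}
	queued := 0
	for _, id := range ids {
		if err := e.enqueue(ctx, id); err != nil {
			return queued, err
		}
		queued++
	}
	return queued, nil
}

// runPeriodically calls runOnce on every tick until ctx is cancelled.
func (e *retentionEnqueuer) runPeriodically(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if _, err := e.runOnce(ctx); err != nil {
				log.Printf("retention enqueue failed: %v", err)
			}
		}
	}
}

// newFakeEnqueuer wires the enqueuer to in-memory fakes, as a test might.
func newFakeEnqueuer(ids []int, queued *[]int) *retentionEnqueuer {
	return &retentionEnqueuer{
		listSeries: func(context.Context) ([]int, error) { return ids, nil },
		enqueue: func(_ context.Context, id int) error {
			*queued = append(*queued, id)
			return nil
		},
	}
}

func main() {
	var queued []int
	e := newFakeEnqueuer([]int{1, 2, 3}, &queued)

	// A short interval and timeout keep the demo quick; the thread suggests
	// something closer to once a day in practice.
	ctx, cancel := context.WithTimeout(context.Background(), 35*time.Millisecond)
	defer cancel()
	e.runPeriodically(ctx, 10*time.Millisecond)
	fmt.Println("jobs queued:", len(queued))
}
```

With this shape, changing how searches run does not touch retention, and a retention policy change takes effect on the next tick instead of waiting for each series to record new data.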

leonore · 2023-01-06

fine with separating concerns. the only point that needs refining with this approach is the metric to use to fetch newly recorded series.

1) we use NextRecordingBefore: time.Now() to get some kind of freshness but there is a risk the retention job runs before a new record is added, wasting a cycle
2) we just add all series every time and exit early in the retention worker
3) we use some other kind of metric, like last recorded at in the past 24 hours or else

what do you think?

I've added some commits to your branch in the meantime, hope that's ok: https://github.com/sourcegraph/sourcegraph/pull/46170/files

chwarwick · 2023-01-06

Yeah, totally good to do whatever you want with that branch; I skipped things to get it out quicker for an example. I don't think adding every series every time is the worst thing, since a noop archive isn't expensive to run. Currently, while the retention rules are basic and easy to express in a query, you could query the recording times table for all series IDs that have data to be archived and then just queue those.
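The last suggestion, queueing only the series that actually have data to archive, could look roughly like the sketch below. It works on an in-memory slice instead of the real recording times table, and it again assumes the rule is a cap on samples per series. The `recording` type and `seriesToArchive` function are hypothetical names used for illustration only.

```go
package main

import (
	"fmt"
	"sort"
)

// recording is a hypothetical row of the recording times table.
type recording struct {
	SeriesID   int
	RecordedAt int64 // unix seconds
}

// seriesToArchive returns, in ascending order, the IDs of series that hold
// more than maxSamples recordings and therefore have data to archive.
func seriesToArchive(rows []recording, maxSamples int) []int {
	counts := make(map[int]int)
	for _, r := range rows {
		counts[r.SeriesID]++
	}
	var ids []int
	for id, n := range counts {
		if n > maxSamples {
			ids = append(ids, id)
		}
	}
	sort.Ints(ids) // map iteration order is random, so sort for stable output
	return ids
}

func main() {
	rows := []recording{
		{SeriesID: 1, RecordedAt: 100},
		{SeriesID: 1, RecordedAt: 200},
		{SeriesID: 1, RecordedAt: 300},
		{SeriesID: 2, RecordedAt: 100},
		{SeriesID: 3, RecordedAt: 100},
		{SeriesID: 3, RecordedAt: 200},
	}
	for _, id := range seriesToArchive(rows, 1) {
		fmt.Println("enqueue retention job for series", id)
	}
}
```

In a real deployment this grouping would more likely be a single aggregate query; the trade-off against simply enqueueing every series is one extra query in exchange for fewer no-op jobs.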

leo added 17 commits December 20, 2022 17:24

clarify comments

894cbcc

skeleton files

2cf55a9

more skeleton

b6d6991

Merge branch 'main' into insights/truncate-series-over-N

c3d638f

add new data pruning jobs table

fddff6c

add job boilerplate

c809aab

enqueue data pruning job in work handler

8f84434

add handler boilerplate

73fc9d9

wip

6fdb81e

boilerplate initialisers

879e4d1

rename pruning -> retention not to cause confusion with data pruner job

ebde42e

connect routines to worker

a416eaa

initialise in worker for testing/debugging

38c669c

run generate

dc0cb7d

fix context cancelled error on workers

fa324f9

some fixes

49d5fea

add cleaner

57ae3cb

leonore requested a review from a team January 3, 2023 16:57

cla-bot Bot added the cla-signed label Jan 3, 2023

it's called data retention not pruning

9aaa42a

chwarwick approved these changes Jan 3, 2023

leonore merged commit 131a7ae into main Jan 4, 2023

leonore deleted the insights/truncate-series-over-N branch January 4, 2023 14:07

This was referenced Jan 5, 2023

insights: archive points for a series beyond a set sample size #46164

Merged

insights: enqueue retention work from a separate goroutine #46206

Merged

leonore merged 18 commits into main from insights/truncate-series-over-N
