Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

insights: add base for data retention worker#46082

Merged
leonore merged 18 commits into
mainfrom
insights/truncate-series-over-N
Jan 4, 2023
Merged

insights: add base for data retention worker#46082
leonore merged 18 commits into
mainfrom
insights/truncate-series-over-N

Conversation

@leonore

@leonore leonore commented Jan 3, 2023

Copy link
Copy Markdown
Contributor

closes #46073

Handle() implementation will be covered by https://github.com/sourcegraph/sourcegraph/issues/45745

Test plan

Added the data retention job to the list of worker jobs and observed it started up correctly.
Jobs get queued correctly from the work handler.
Changed the completion time to 1 minute and observed the jobs get cleaned up correctly.

@leonore leonore requested a review from a team January 3, 2023 16:57
@cla-bot cla-bot Bot added the cla-signed label Jan 3, 2023
@sourcegraph-bot

sourcegraph-bot commented Jan 3, 2023

Copy link
Copy Markdown
Contributor

Codenotify: Notifying subscribers in CODENOTIFY files for diff 82431e2...9aaa42a.

Notify File(s)
@efritz enterprise/cmd/worker/internal/insights/data_retention_job.go
enterprise/cmd/worker/internal/insights/query_runner_job.go
@sourcegraph/code-insights-backend enterprise/internal/insights/background/background.go
enterprise/internal/insights/background/queryrunner/cleaner.go
enterprise/internal/insights/background/queryrunner/work_handler.go
enterprise/internal/insights/background/queryrunner/worker.go
enterprise/internal/insights/background/retention/cleaner.go
enterprise/internal/insights/background/retention/job.go
enterprise/internal/insights/background/retention/worker.go

}

// enqueue this insight series for data retention in parallel
_, err = retention.EnqueueJob(ctx, ss, &retention.DataRetentionJob{SeriesID: series.ID})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels unusual to queue the retention job from within the snapshot process. Is there a reason to not run this enqueue from its own periodic goroutine?

@leonore leonore Jan 4, 2023

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason I am doing this from the query runner rather than in a separate routine is because

  1. the query runner will run for all active series
  2. it will run every time we want a new sample

this means with 1) we enqueue a retention job for all relevant series and with 2) we enqueue it every time we get a new sample point i.e. that we might need to move to the retention table. this feels like an efficient way to enqueue the job we need when we need it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern here is that it starts mixing responsibilities and it's not an intuitive place to look for that functionality. It's easily reversible now so I didn't want to hold things up. To keep that same logic on when to run I think it would be more appropriate lift it up to the insightEnqueuer here since this logic is around identifying active series and generating points for them. You could even limit it to only run when creating a new datapoint instead of snapshot.

I'm not sure what the logic will be to determine which points are archived, but there is potentially still a race condition on which job runs first if the current number of points is part of the criteria.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea was to put all the work of figuring out whether a series needs truncated and what data to archive in the new queue (insights-data-retention-worker) such that it is idempotent and there is no risk if we work on the same series twice.

if you meant that we should just move the logic to enqueue all data series that need a new recording to the insight enqueuer routine the only problem I see is that the retention worker might pick up work for the series before the recording is finished (writing this out I think that this is also the case now but we could move that enqueue to after we save the recording. we can also only enqueue for recordings).

if instead what you meant is to also do a series selection based on whether or not the series need truncation I think that adds too much responsibility on the insight enqueuer.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you meant that we should just move the logic to enqueue all data series that need a new recording to the insight enqueuer routine the only problem I see is that the retention worker might pick up work for the series before the recording is finished (writing this out I think that this is also the case now but we could move that enqueue to after we save the recording. we can also only enqueue for recordings).

That is what I meant and that is also the race condition I was worried about, but I think moving it here is preferable in that it's a clearer what is going on and less prone to changes/errors in the work handler. My preference is still to have a small background task similar to the insight enquerer that runs once a day and queues up any series that need retention/archive work done.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moving it there then creates this race condition though? and if we error in the work handler then we don't need to have a look at the series again since it won't have any new data

I think adding another routine just adds more complexity and boilerplate when we already have all the data we need in existing routines

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought is that typically I would expect something like data retention to run on a fixed schedule along the lines of a cron job. The closest parallel we currently have are periodic go routines which don't come with much boilerplate but do suffer from the problem that they run in each instance of the worker (for example the insight enqueuer).

What I'm trying to avoid is coupling of separate process that relate to the same data but remain distinct and unrelated to each other. If I were to edit the hander that runs searches that doesn't imply that I should be changing retention. If I change retention it doesn't imply that I should also change something about how searches are run. Can we continue to make insights more testable with this enqueue in the search handler as is, it is a hidden side effect and without a deep knowledge of the system, it's difficult to know that you should look for it. To improve the testability it could be exposed as a dependency but then the question is why is it a dependency? What should I expect?

From a users perspective if I change a retention policy it shouldn't be necessary to wait for every series to generate new data before that is applied. That could mean it takes anywhere from 1 hour to years (assuming in the handler it only gets queued after recordings) for the retention to update on a series.

I opened a draft to show an example of the approach I mentioned. I don't think this is a show stopping issue by any means but I do want to ensure we keep moving in a direction of isolating behavior and improving the ability to automate testing so that we can move faster.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine with separating concerns. the only point that needs refining with this approach is the metric to use to fetch newly recorded series.

  1. we use NextRecordingBefore: time.Now() to get some kind of freshness but there is a risk the retention job runs before a new record is added, wasting a cycle
  2. we just add all series every time and exit early in the retention worker
  3. we use some other kind of metric, like last recorded at in the past 24 hours or else

what do you think?

I've added some commits to your branch in the meantime, hope that's ok: https://github.com/sourcegraph/sourcegraph/pull/46170/files

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah totally good to do whatever you want with that branch I skipped things to get it out quicker for an example. I don't think adding every series every time is the worst thing a noop archive isn't expensive to run. Currently while the retention rules are basic and easy to express in a query you could query the recording times table for all series ids that have data to be archived and then just queue those.

Comment thread enterprise/cmd/worker/internal/insights/query_runner_job.go
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

insights: add data retention worker

3 participants