chore(search_jobs): add janitor job#64186
Conversation
| // 🚨 SECURITY: only someone with access to the job may upload the logs | ||
| if err := s.store.UserHasAccess(ctx, id); err != nil { | ||
| return 0, err | ||
| } | ||
|
|
||
| return s.uploadStore.Upload(ctx, getLogKey(id), r) |
Check notice
Code scanning / Semgrep OSS
Semgrep Finding: security-semgrep-rules.semgrep-rules.generic.comment-tagging-rule
| // 🚨 SECURITY: only someone with access to the job may download the logs | ||
| if err := s.store.UserHasAccess(ctx, id); err != nil { | ||
| return nil, err | ||
| } | ||
|
|
||
| return s.uploadStore.Get(ctx, getLogKey(id)) |
Check notice
Code scanning / Semgrep OSS
Semgrep Finding: security-semgrep-rules.semgrep-rules.generic.comment-tagging-rule
| // 🚨 SECURITY: only someone with access to the job may delete the logs | ||
| if err := s.store.UserHasAccess(ctx, id); err != nil { | ||
| return err | ||
| } | ||
|
|
||
| return s.uploadStore.Delete(ctx, getLogKey(id)) |
Check notice
Code scanning / Semgrep OSS
Semgrep Finding: security-semgrep-rules.semgrep-rules.generic.comment-tagging-rule
|
💡 Learn more about each section: PR description tips, Test Plan and Changelog. |
| return func(w http.ResponseWriter, r *http.Request) { | ||
| jobIDStr := mux.Vars(r)["id"] | ||
| jobID, err := strconv.Atoi(jobIDStr) | ||
| jobID, err := strconv.ParseInt(jobIDStr, 10, 64) |
There was a problem hiding this comment.
Parsing into int64 to avoid a lot of int64 casts later on.
| @@ -0,0 +1,166 @@ | |||
| package search | |||
There was a problem hiding this comment.
This file is the core of the change. The idea is to calculate the aggregate state (which we already do to report the job status on the Search Job page) and set the status of the top-level search job to that status.
Once the aggregate state is set and the logs are uploaded to the blobstore, we can remove all db entries except for the top-level search job.
Progress reporting in UI will keep working as-is and the logs are now served from the blobstore.
There was a problem hiding this comment.
Thanks for the explanation! Could you also explain why we chose to use the blobstore vs. the DB for the aggregated state? Maybe because it's quite a lot of info to be storing in the DB (failure messages + states)?
There was a problem hiding this comment.
The aggregate state (IE "completed", "failed", ...) is persisted in the db, the log is uploaded to the blobstore.
The ultimate goal of this PR is to avoid accumulating millions of entries in the db for each search job. While a search job is running, we need those entries to keep track of the scope and for snapshotting (restart a job if Sourcegraph is restarted). After a search job is finished, we only care about the aggregate state of the entire job and the failure messages.
We already use the blobstore to store the search results (which might be GB worth of data), so storing the logs right next to them makes sense to me. I guess we could store the logs in the db, but this seems wasteful considering we only serve them as download.
There was a problem hiding this comment.
I was just asking because it would simplify the "logs serving" logic if everything was in the DB (so we could do it all in one transaction). But this trade-off makes sense to me.
| @@ -0,0 +1,106 @@ | |||
| package storetest | |||
There was a problem hiding this comment.
This contains test helpers which both frontend and worker can use. Mostly copy&paste.
| writeCSV(logger.With(log.Int64("jobID", jobID)), w, filename, csvWriterTo) | ||
| } | ||
|
|
||
| func serveLogFromBlobstore(ctx context.Context, logger log.Logger, svc *service.Service, filenameNoQuotes string, jobID int64, w http.ResponseWriter) { |
There was a problem hiding this comment.
serveLogFromBlobstore is the only thing which is truly new in export.go. Instead of assembling the logs on the fly by calling out to the db, we serve the logs the new janitor job has uploaded to the blobstore.
This adds a background job to Search Jobs that periodically scans for finished jobs to aggregate the status, upload logs, and clean up the tables. This drastically reduces the size of the tables and improves performance of the API. For example, with this change a finished search job on .com only has 1 entry in the database, whereas before it could have several milions. Test plan: - new unit tests - manual testing I ran a couple of search jobs locally and checked that - logs are uploaded to `blobstore-go/buckets/search-jobs` - repo jobs are deleted from `exhaustive_repo_jobs` - logs are served from the blobstore after the janitor ran
dfeff33 to
ae3371e
Compare
| csvWriterTo, err := svc.GetSearchJobLogsWriterTo(r.Context(), int64(jobID)) | ||
| filename := filenamePrefix(jobID) + ".log.csv" | ||
|
|
||
| // Jobs in a terminal state are aggregated. As part of the aggregation, the logs |
There was a problem hiding this comment.
It's always nice to not have known races to keep the mental model simple (even if users are unlikely to see them). What if we always tried the blobstore (regardless of isAggregated status), and fall back to the DB if it's not found?
There was a problem hiding this comment.
The store abstracts over several different stores (aws, gcs, blobstore) and I believe we don't have a consistent signal of "blob not found". Checking for isAggregated makes this a bit more robust.
There was a problem hiding this comment.
I see, too bad we don't have "blob not found". In that case, maybe we could add a single retry here (if the job is not aggregated, and we fail to serve the log from the DB, then retrying should succeed, since the DB transaction to aggregate has completed). Then we don't expect any failures in practice, and can treat any error as something we need to investigate!
| @@ -0,0 +1,166 @@ | |||
| package search | |||
There was a problem hiding this comment.
Thanks for the explanation! Could you also explain why we chose to use the blobstore vs. the DB for the aggregated state? Maybe because it's quite a lot of info to be storing in the DB (failure messages + states)?
jtibshirani
left a comment
There was a problem hiding this comment.
This looks good to me. I am totally new to this code though, so my review should be taken with a grain of salt :)
| csvWriterTo, err := svc.GetSearchJobLogsWriterTo(r.Context(), int64(jobID)) | ||
| filename := filenamePrefix(jobID) + ".log.csv" | ||
|
|
||
| // Jobs in a terminal state are aggregated. As part of the aggregation, the logs |
There was a problem hiding this comment.
I see, too bad we don't have "blob not found". In that case, maybe we could add a single retry here (if the job is not aggregated, and we fail to serve the log from the DB, then retrying should succeed, since the DB transaction to aggregate has completed). Then we don't expect any failures in practice, and can treat any error as something we need to investigate!
Relates to #64186 With this PR we only show `83 out of 120 tasks` if the search job is currently processing. In all other states, we don't show this stats. This is a consequence of the janitor job. After aggregation, this data is not available anymore. I remove an unncessary restriction on the download of logs and results. Test plan: I ran a search job locally and confirmed that the progress message is only visible while the job is processing and that logs and downloads are always available.
Relates to #64186 With this PR we only show `83 out of 120 tasks` if the search job is currently processing. In all other states, we don't show this stat. This is a consequence of the janitor job I recently added, because after aggregation, this data is not available anymore. User's can still inspect the logs and download results to get a detailed view of which revisions were searched. I also remove an unnecessary dependency of the download links on the job state. ## Test plan: I ran a search job locally and confirmed that the progress message is only visible while the job is processing and that logs and downloads are always available. ## Changelog - Show detailed progress only while job is in status "processing" - Remove dependency of download links on job state
Fixes SPLF-119
This adds a background job to Search Jobs that periodically scans for finished jobs to aggregate the status, upload logs, and clean up the tables. This drastically reduces the size of the tables and improves the performance of the Search Jobs GQL API.
For example, with this change, a finished search job on .com only has 1 entry in the database, whereas before it could have several millions if we searched each repository.
Notes:
Test plan:
I ran a couple of search jobs locally (with the janitor job interval set to 1 min) and checked that
blobstore-go/buckets/search-jobsexhaustive_repo_jobsChangelog
The new background job drastically reduces the size of the
exhaustive_*tables and improves performance of the Search Jobs GQL API.