Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

chore(search_jobs): add janitor job#64186

Merged
stefanhengl merged 3 commits into
mainfrom
sh/search-jobs/janitor
Aug 1, 2024
Merged

chore(search_jobs): add janitor job#64186
stefanhengl merged 3 commits into
mainfrom
sh/search-jobs/janitor

Conversation

@stefanhengl

@stefanhengl stefanhengl commented Jul 31, 2024

Copy link
Copy Markdown
Member

Fixes SPLF-119

This adds a background job to Search Jobs that periodically scans for finished jobs to aggregate the status, upload logs, and clean up the tables. This drastically reduces the size of the tables and improves the performance of the Search Jobs GQL API.

For example, with this change, a finished search job on .com only has 1 entry in the database, whereas before it could have several millions if we searched each repository.

Notes:

  • the diff seems larger than it actually is. I left a couple of comments to help the reviewers.

Test plan:

  • new unit tests
  • manual testing:

I ran a couple of search jobs locally (with the janitor job interval set to 1 min) and checked that

  • logs are uploaded to blobstore-go/buckets/search-jobs
  • repo jobs are deleted from exhaustive_repo_jobs
  • logs are served from the blobstore after the janitor ran
  • downloading logs while the job is running still works

Changelog

The new background job drastically reduces the size of the exhaustive_* tables and improves performance of the Search Jobs GQL API.

Comment on lines +241 to +246
// 🚨 SECURITY: only someone with access to the job may upload the logs
if err := s.store.UserHasAccess(ctx, id); err != nil {
return 0, err
}

return s.uploadStore.Upload(ctx, getLogKey(id), r)

Check notice

Code scanning / Semgrep OSS

Semgrep Finding: security-semgrep-rules.semgrep-rules.generic.comment-tagging-rule

Code that highlight SECURITY in comment has changed. Please review the code for changes. The changes might be sensitive.
Comment on lines +250 to +255
// 🚨 SECURITY: only someone with access to the job may download the logs
if err := s.store.UserHasAccess(ctx, id); err != nil {
return nil, err
}

return s.uploadStore.Get(ctx, getLogKey(id))

Check notice

Code scanning / Semgrep OSS

Semgrep Finding: security-semgrep-rules.semgrep-rules.generic.comment-tagging-rule

Code that highlight SECURITY in comment has changed. Please review the code for changes. The changes might be sensitive.
Comment on lines +259 to +264
// 🚨 SECURITY: only someone with access to the job may delete the logs
if err := s.store.UserHasAccess(ctx, id); err != nil {
return err
}

return s.uploadStore.Delete(ctx, getLogKey(id))

Check notice

Code scanning / Semgrep OSS

Semgrep Finding: security-semgrep-rules.semgrep-rules.generic.comment-tagging-rule

Code that highlight SECURITY in comment has changed. Please review the code for changes. The changes might be sensitive.
@cla-bot cla-bot Bot added the cla-signed label Jul 31, 2024
@github-actions

Copy link
Copy Markdown
Contributor

💡 Learn more about each section: PR description tips, Test Plan and Changelog.

@github-actions github-actions Bot added team/product-platform team/search-platform Issues owned by the search platform team labels Jul 31, 2024
return func(w http.ResponseWriter, r *http.Request) {
jobIDStr := mux.Vars(r)["id"]
jobID, err := strconv.Atoi(jobIDStr)
jobID, err := strconv.ParseInt(jobIDStr, 10, 64)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parsing into int64 to avoid a lot of int64 casts later on.

@@ -0,0 +1,166 @@
package search

@stefanhengl stefanhengl Jul 31, 2024

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is the core of the change. The idea is to calculate the aggregate state (which we already do to report the job status on the Search Job page) and set the status of the top-level search job to that status.

Once the aggregate state is set and the logs are uploaded to the blobstore, we can remove all db entries except for the top-level search job.

Progress reporting in UI will keep working as-is and the logs are now served from the blobstore.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! Could you also explain why we chose to use the blobstore vs. the DB for the aggregated state? Maybe because it's quite a lot of info to be storing in the DB (failure messages + states)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The aggregate state (IE "completed", "failed", ...) is persisted in the db, the log is uploaded to the blobstore.

The ultimate goal of this PR is to avoid accumulating millions of entries in the db for each search job. While a search job is running, we need those entries to keep track of the scope and for snapshotting (restart a job if Sourcegraph is restarted). After a search job is finished, we only care about the aggregate state of the entire job and the failure messages.

We already use the blobstore to store the search results (which might be GB worth of data), so storing the logs right next to them makes sense to me. I guess we could store the logs in the db, but this seems wasteful considering we only serve them as download.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just asking because it would simplify the "logs serving" logic if everything was in the DB (so we could do it all in one transaction). But this trade-off makes sense to me.

@@ -0,0 +1,106 @@
package storetest

@stefanhengl stefanhengl Jul 31, 2024

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This contains test helpers which both frontend and worker can use. Mostly copy&paste.

writeCSV(logger.With(log.Int64("jobID", jobID)), w, filename, csvWriterTo)
}

func serveLogFromBlobstore(ctx context.Context, logger log.Logger, svc *service.Service, filenameNoQuotes string, jobID int64, w http.ResponseWriter) {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

serveLogFromBlobstore is the only thing which is truly new in export.go. Instead of assembling the logs on the fly by calling out to the db, we serve the logs the new janitor job has uploaded to the blobstore.

This adds a background job to Search Jobs that periodically scans for
finished jobs to aggregate the status, upload logs, and clean up the
tables. This drastically reduces the size of the tables and improves
performance of the API.

For example, with this change a finished search job on .com only has 1
entry in the database, whereas before it could have several milions.

Test plan:
- new unit tests
- manual testing

 I ran a couple of search jobs locally and checked that
 - logs are uploaded to `blobstore-go/buckets/search-jobs`
 - repo jobs are deleted from `exhaustive_repo_jobs`
 - logs are served from the blobstore after the janitor ran
@stefanhengl stefanhengl force-pushed the sh/search-jobs/janitor branch from dfeff33 to ae3371e Compare July 31, 2024 13:00
@stefanhengl stefanhengl requested a review from a team July 31, 2024 13:37
@stefanhengl stefanhengl marked this pull request as ready for review July 31, 2024 13:37
csvWriterTo, err := svc.GetSearchJobLogsWriterTo(r.Context(), int64(jobID))
filename := filenamePrefix(jobID) + ".log.csv"

// Jobs in a terminal state are aggregated. As part of the aggregation, the logs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's always nice to not have known races to keep the mental model simple (even if users are unlikely to see them). What if we always tried the blobstore (regardless of isAggregated status), and fall back to the DB if it's not found?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The store abstracts over several different stores (aws, gcs, blobstore) and I believe we don't have a consistent signal of "blob not found". Checking for isAggregated makes this a bit more robust.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, too bad we don't have "blob not found". In that case, maybe we could add a single retry here (if the job is not aggregated, and we fail to serve the log from the DB, then retrying should succeed, since the DB transaction to aggregate has completed). Then we don't expect any failures in practice, and can treat any error as something we need to investigate!

@@ -0,0 +1,166 @@
package search

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! Could you also explain why we chose to use the blobstore vs. the DB for the aggregated state? Maybe because it's quite a lot of info to be storing in the DB (failure messages + states)?

@jtibshirani jtibshirani left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I am totally new to this code though, so my review should be taken with a grain of salt :)

csvWriterTo, err := svc.GetSearchJobLogsWriterTo(r.Context(), int64(jobID))
filename := filenamePrefix(jobID) + ".log.csv"

// Jobs in a terminal state are aggregated. As part of the aggregation, the logs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, too bad we don't have "blob not found". In that case, maybe we could add a single retry here (if the job is not aggregated, and we fail to serve the log from the DB, then retrying should succeed, since the DB transaction to aggregate has completed). Then we don't expect any failures in practice, and can treat any error as something we need to investigate!

@stefanhengl stefanhengl merged commit cd38adb into main Aug 1, 2024
@stefanhengl stefanhengl deleted the sh/search-jobs/janitor branch August 1, 2024 13:29
stefanhengl added a commit that referenced this pull request Aug 5, 2024
Relates to #64186

With this PR we only show `83 out of 120 tasks` if the search job is
currently processing. In all other states, we don't show this stats.
This is a consequence of the janitor job. After aggregation, this data
is not available anymore.

I remove an unncessary restriction on the download of logs and results.

Test plan:
I ran a search job locally and confirmed that the progress message is
only visible while the job is processing and that logs and downloads are
always available.
stefanhengl added a commit that referenced this pull request Aug 6, 2024
Relates to #64186

With this PR we only show `83 out of 120 tasks` if the search job is
currently processing. In all other states, we don't show this stat. This
is a consequence of the janitor job I recently added, because after
aggregation, this data is not available anymore. User's can still
inspect the logs and download results to get a detailed view of which
revisions were searched.

I also remove an unnecessary dependency of the download links on the job
state.

## Test plan:

I ran a search job locally and confirmed that the progress message is
only visible while the job is processing and that logs and downloads are
always available.

## Changelog
- Show detailed progress only while job is in status "processing"
- Remove dependency of download links on job state
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

cla-signed team/product-platform team/search-platform Issues owned by the search platform team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants