chore(search_jobs): add janitor job by stefanhengl · Pull Request #64186 · sourcegraph/sourcegraph-public-snapshot

stefanhengl · 2024-07-31T11:47:25Z

Fixes SPLF-119

This adds a background job to Search Jobs that periodically scans for finished jobs to aggregate the status, upload logs, and clean up the tables. This drastically reduces the size of the tables and improves the performance of the Search Jobs GQL API.

For example, with this change, a finished search job on .com only has 1 entry in the database, whereas before it could have several millions if we searched each repository.

Notes:

The diff looks larger than it actually is. I left a couple of comments to help the reviewers.

Test plan:

- new unit tests
- manual testing:

I ran a couple of search jobs locally (with the janitor job interval set to 1 min) and checked that

- logs are uploaded to `blobstore-go/buckets/search-jobs`
- repo jobs are deleted from `exhaustive_repo_jobs`
- logs are served from the blobstore after the janitor ran
- downloading logs while the job is running still works

Changelog

The new background job drastically reduces the size of the exhaustive_* tables and improves performance of the Search Jobs GQL API.

+	// 🚨 SECURITY: only someone with access to the job may upload the logs
+	if err := s.store.UserHasAccess(ctx, id); err != nil {
+		return 0, err
+	}
+
+	return s.uploadStore.Upload(ctx, getLogKey(id), r)


+	// 🚨 SECURITY: only someone with access to the job may download the logs
+	if err := s.store.UserHasAccess(ctx, id); err != nil {
+		return nil, err
+	}
+
+	return s.uploadStore.Get(ctx, getLogKey(id))


+	// 🚨 SECURITY: only someone with access to the job may delete the logs
+	if err := s.store.UserHasAccess(ctx, id); err != nil {
+		return err
+	}
+
+	return s.uploadStore.Delete(ctx, getLogKey(id))


stefanhengl · 2024-07-31T11:49:07Z

 	return func(w http.ResponseWriter, r *http.Request) {
 		jobIDStr := mux.Vars(r)["id"]
-		jobID, err := strconv.Atoi(jobIDStr)
+		jobID, err := strconv.ParseInt(jobIDStr, 10, 64)


Parsing into int64 to avoid a lot of int64 casts later on.
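For illustration, a minimal standalone sketch of the difference. `strconv.Atoi` returns a platform-sized `int`, so every call that expects an `int64` ID needs a cast, while `strconv.ParseInt(s, 10, 64)` yields an `int64` directly. The helper name `parseJobID` is hypothetical and not part of the PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseJobID is a hypothetical helper: it parses a base-10 job ID
// straight into an int64, so callers need no int -> int64 casts.
func parseJobID(s string) (int64, error) {
	return strconv.ParseInt(s, 10, 64)
}

func main() {
	id, err := parseJobID("42")
	if err != nil {
		fmt.Println("invalid job ID:", err)
		return
	}
	fmt.Println("job ID:", id)

	// Non-numeric input is rejected with an error.
	if _, err := parseJobID("abc"); err != nil {
		fmt.Println("invalid job ID:", err)
	}
}
```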

stefanhengl · 2024-07-31T11:57:17Z

@@ -0,0 +1,166 @@
+package search


This file is the core of the change. The idea is to calculate the aggregate state (which we already do to report the job status on the Search Job page) and set the status of the top-level search job to that status.

Once the aggregate state is set and the logs are uploaded to the blobstore, we can remove all db entries except for the top-level search job.

Progress reporting in UI will keep working as-is and the logs are now served from the blobstore.

Thanks for the explanation! Could you also explain why we chose to use the blobstore vs. the DB for the aggregated state? Maybe because it's quite a lot of info to be storing in the DB (failure messages + states)?

The aggregate state (e.g. "completed", "failed", ...) is persisted in the DB; the log is uploaded to the blobstore.

The ultimate goal of this PR is to avoid accumulating millions of entries in the DB for each search job. While a search job is running, we need those entries to keep track of the scope and for snapshotting (resuming a job if Sourcegraph is restarted). After a search job is finished, we only care about the aggregate state of the entire job and the failure messages.

We already use the blobstore to store the search results (which might be GB worth of data), so storing the logs right next to them makes sense to me. I guess we could store the logs in the DB, but this seems wasteful considering we only serve them as a download.

I was just asking because it would simplify the "logs serving" logic if everything was in the DB (so we could do it all in one transaction). But this trade-off makes sense to me.

stefanhengl · 2024-07-31T12:02:19Z

@@ -0,0 +1,106 @@
+package storetest


This contains test helpers which both frontend and worker can use. Mostly copy and paste.

stefanhengl · 2024-07-31T12:06:29Z

+	writeCSV(logger.With(log.Int64("jobID", jobID)), w, filename, csvWriterTo)
+}
+
+func serveLogFromBlobstore(ctx context.Context, logger log.Logger, svc *service.Service, filenameNoQuotes string, jobID int64, w http.ResponseWriter) {


serveLogFromBlobstore is the only thing which is truly new in export.go. Instead of assembling the logs on the fly by calling out to the db, we serve the logs the new janitor job has uploaded to the blobstore.

jtibshirani · 2024-08-01T07:22:37Z

-		csvWriterTo, err := svc.GetSearchJobLogsWriterTo(r.Context(), int64(jobID))
+		filename := filenamePrefix(jobID) + ".log.csv"
+
+		// Jobs in a terminal state are aggregated. As part of the aggregation, the logs


It's always nice to not have known races to keep the mental model simple (even if users are unlikely to see them). What if we always tried the blobstore (regardless of isAggregated status), and fell back to the DB if it's not found?

The store abstracts over several different stores (aws, gcs, blobstore) and I believe we don't have a consistent signal of "blob not found". Checking for isAggregated makes this a bit more robust.

I see, too bad we don't have "blob not found". In that case, maybe we could add a single retry here (if the job is not aggregated, and we fail to serve the log from the DB, then retrying should succeed, since the DB transaction to aggregate has completed). Then we don't expect any failures in practice, and can treat any error as something we need to investigate!

jtibshirani · 2024-08-01T07:32:41Z

@@ -0,0 +1,166 @@
+package search


Thanks for the explanation! Could you also explain why we chose to use the blobstore vs. the DB for the aggregated state? Maybe because it's quite a lot of info to be storing in the DB (failure messages + states)?

jtibshirani

This looks good to me. I am totally new to this code though, so my review should be taken with a grain of salt :)

jtibshirani · 2024-08-01T10:00:12Z

-		csvWriterTo, err := svc.GetSearchJobLogsWriterTo(r.Context(), int64(jobID))
+		filename := filenamePrefix(jobID) + ".log.csv"
+
+		// Jobs in a terminal state are aggregated. As part of the aggregation, the logs


I see, too bad we don't have "blob not found". In that case, maybe we could add a single retry here (if the job is not aggregated, and we fail to serve the log from the DB, then retrying should succeed, since the DB transaction to aggregate has completed). Then we don't expect any failures in practice, and can treat any error as something we need to investigate!

Relates to #64186

With this PR we only show `83 out of 120 tasks` if the search job is currently processing. In all other states, we don't show this stat. This is a consequence of the janitor job I recently added, because after aggregation, this data is not available anymore. Users can still inspect the logs and download results to get a detailed view of which revisions were searched.

I also remove an unnecessary dependency of the download links on the job state.

## Test plan:

I ran a search job locally and confirmed that the progress message is only visible while the job is processing and that logs and downloads are always available.

## Changelog

- Show detailed progress only while job is in status "processing"
- Remove dependency of download links on job state

github-advanced-security AI found potential problems Jul 31, 2024

cla-bot Bot added the cla-signed label Jul 31, 2024

github-actions Bot added team/product-platform team/search-platform Issues owned by the search platform team labels Jul 31, 2024

stefanhengl force-pushed the sh/search-jobs/janitor branch from dfeff33 to ae3371e Compare July 31, 2024 13:00

sg bazel configure

5586da1

stefanhengl requested a review from a team July 31, 2024 13:37

stefanhengl marked this pull request as ready for review July 31, 2024 13:37

jtibshirani reviewed Aug 1, 2024

jtibshirani approved these changes Aug 1, 2024

delete logs

a48093e

stefanhengl merged commit cd38adb into main Aug 1, 2024

stefanhengl deleted the sh/search-jobs/janitor branch August 1, 2024 13:29

stefanhengl mentioned this pull request Aug 6, 2024

fix(search_jobs): progress reporting #64287

Merged
