Schedule embeddings jobs will not roll back transaction on read errors by gl-srgr · Pull Request #54701 · sourcegraph/sourcegraph-public-snapshot

gl-srgr · 2023-07-07T05:42:52Z

When a set of repositories are to be scheduled for embeddings (via graphql or a policy) an error with a single repository can cause the entire set of embeddings jobs to be rolled back. The specific error that occurred for a recent customer issue (slack discussion) was due to one of the repos being empty and causing all 500 repos matched by a policy to not be enqueued for embeddings jobs. Not only do all repos have their embeddings jobs rolled back but additionally the logging will only report one problematic repo before exiting the scheduling logic which makes it so that the log does not capture all of the problematic repos in one attempt.

We should instead allow read errors (e.g. reading sql or fetching information from gitserver) to be handled but not prevent the entire batch of jobs to be rolled back. This will allow a policy to be partially impactful across the repositories it is concerned with. Especially given that an empty repo won't have any need to be enqueued for an embeddings job.

This PR changes the error handling so that repos that cannot complete an embeddings job will still be enqueued with a failed state so that site admin pages can still display which individual repos are problematic. When a repo is enqueued with a failed state we do not roll back the transaction so that successful embeddings jobs are still created with a queued state.

Test plan

will add test case to either scheduling logic tests or job execution logic tests (depends on implementation)

gl-srgr · 2023-07-07T05:53:28Z

@stefanhengl @camdencheek I'm still unsure about how we want to bubble up errors we encounter. The first iteration of my PR just removes reporting the "read" errors but still reports errors for creating embeddings jobs (that is, there are still certain errors that will fail out of the loop early and therefore cancel all jobs from being written to the table).

Some ideas I had in mind:

return all of the "read" errors (e.g. the empty repo error) across all repos as a multi error called readErrs; we won't pass this error to Done() and therefore doesn't roll back the full set of job writes.
Instead of returning "read" errors such as the empty repo error we just log it as part of the failed jobs failure_message field. We could add a new function like CreateFailedRepoEmbeddingJob() which takes a failure_message value (instead of modifying the existing CreateRepoEmbeddingJob())
Pass a logger to the schedule jobs function. This would be nice to report each repo's error independently. For graphql resolvers I believer there is already a logger member of the Resolver. However, I'm not sure if the worker background routine leans into this approach well or if it makes sense to pass a logger argument.

gl-srgr · 2023-07-07T05:54:38Z

 	Done(err error) error

-	CreateRepoEmbeddingJob(ctx context.Context, repoID api.RepoID, revision api.CommitID) (int, error)
+	CreateRepoEmbeddingJob(ctx context.Context, repoID api.RepoID, revision api.CommitID, state string) (int, error)


We could also leave this as is and add a new function

gl-srgr · 2023-07-07T05:56:39Z

 			r, err := repoStore.GetByName(ctx, repoName)
 			if err != nil {
-				return err
+				return nil


The "read" errors I mentioned in the PR comment is essentially just:

some error when reading repo store

some error when fetching branch from gitserver

no error when fetching branch from gitserver but empty repo has an empty refName

We should probably report these somewhere but I'm between the message logs and the jobs table. Does the failure_message get displayed anywhere for an embeddings job?

Does the failure_message get displayed anywhere for an embeddings job?

Yes, this gets propagated to the UI when we hit an error

stefanhengl · 2023-07-07T11:31:24Z

Taking a step back: We definitely have to surface problematic repos to the admin. I am just not sure that the list of embeddings jobs is the right way. Scheduling errors belong much more to the policy IE it is the policy that couldn't do it's job, not the embedding service.

Also, scheduling a failed job because we encountered an error during scheduling just to report an error seems like a misuse of the jobs page and the worker framework.

However, it becomes tricky because we can have overlapping policies and it is not a good experience for an admin to click through the policies and see which repo fails where.

Suggestion

We should take inspiration from code-intel's dashboard page and create a new page that displays all repositories with errors (from policies and jobs).

Once you click on a repository you could see

all related jobs with errors
all related policies with errors

As a stop-gap for the next patch, we could log the errors during scheduling (but continue with the other repos) and display an error icon next to policy with a tooltip "some repos might not have been scheduled, check the logs for ".

gl-srgr · 2023-07-07T13:57:24Z

Agreed that this solution is surfacing the problematic repos via a page that was not intended for this. Also the failed jobs can bloat / clutter the jobs page with a bunch of jobs that never actually ran / actually enqueued as new/ready to be processed.

As a stop-gap for the next patch, we could log the errors during scheduling (but continue with the other repos) and display an error icon next to policy with a tooltip "some repos might not have been scheduled, check the logs for ".

I'll think this over as an option for the immediate fix

camdencheek · 2023-07-07T13:59:58Z

scheduling a failed job because we encountered an error during scheduling just to report an error seems like a misuse of the jobs page and the worker framework

So, I would agree, but we could also just move resolution of the default branch into the job execution itself, right? It's somewhat arbitrary that this failure occurred during scheduling rather than during the execution of the scheduled job. I mostly thought that would be the simplest way to surface the issue in a way that's not confusing to admins

stefanhengl · 2023-07-07T14:05:17Z

It's somewhat arbitrary that this failure occurred during scheduling rather than during the execution of the scheduled job.

I agree. Moving the resolution of the default branch to the job execution seems like a good option. Then the job can just fail naturally and report what happened. I think what bothered me the most was scheduling a failed job from the scheduling logic.

gl-srgr · 2023-07-10T03:53:56Z

but we could also just move resolution of the default branch into the job execution itself, right?

I'm okay with this approach. I'm implementing now and the remaining decision is how to handle the force reschedule parameter; right now we act on this parameter depending on repo id and latestRevision. If we only want to resolve the repo (i.e. read repo store or fetch data from gitserver) and handle errors in the job execution, not job scheduling, then we can update the repo embedding job schema (here and here) with a force column. We'd no longer prevent scheduling dupes of the same repo+revision jobs in this case but we can still check whether a job can be skipped if it's a dupe during job execution.

Alternatively we can call the gitserver GetDefaultBranch in both scheduling and job execution, where in scheduling we just call it when we have a non empty revision that we can use for look up of last scheduled job for revision and determine we can skip a job creation; if we can't confirm that we can skip then we'd enqueue a new job and let the job execution handle marking as failed for empty repo or error cases.

Not sure which feels more out of place: a force column for the job schema or fetching twice with the gitserver client for default branch. I'm leaning towards and am currently implementing the former but let me know if you have another idea.

cc @camdencheek @stefanhengl

gl-srgr · 2023-07-10T06:01:29Z

Pushed some changes but still need to update existing tests and add new tests. Wanted to get the changes out for review in the meantime.

sourcegraph-bot · 2023-07-10T06:03:02Z

Codenotify: Notifying subscribers in CODENOTIFY files for diff 4307c7b...264f3aa.

Notify	File(s)
@efritz	enterprise/cmd/worker/internal/embeddings/repo/handler.go enterprise/cmd/worker/internal/embeddings/repo/handler_test.go

stefanhengl · 2023-07-10T14:19:49Z

We'd no longer prevent scheduling dupes of the same repo+revision jobs in this case but we can still check whether a job can be skipped if it's a dupe during job execution.

Good point. The issue is we check for embeddable repos every minute or so IE scheduling duplicates will let the repo_embeddings_jobs table explode if I am not mistaken. We should try not to schedule duplicates for that reason.

gl-srgr · 2023-07-10T15:37:53Z

I suppose we could back off for a bit and not enqueue the job right away to reduce duplicates. That is, we could check if there was a recent failed job in the last X minutes for a repo+revision tuple, where revision is also empty string. We'd be queueing jobs despite knowing that we had a bad response from gitserver, albeit less frequently, and still have job execution mark job as failed.

More thorough step by step:

Scheduler successfully get repo by name from store.
Scheduler fetches branch from gitserver and return is either (a) an error or (b) nil error but also empty values.
With (2.) confirmed we check if there is a duplicate job with repo and empty revision enqueued. If yes do nothing.
If (3.) does not exit early, we also check if there is a failed job for the repo for an empty revision in the last X minutes, and only enqueue a job with repo & empty revision if FinishedAt for that previous job was before now - X minutes.

We still add error handling to job execution but this would let us not modify the interface of repo job embeddings store and also not modify the schema (b/c we'd check forceReschedule in scheduler still).

This is sort of overlapping with my alternate solution that I commented on yesterday, that leaves a gitserver request in scheduler:

Alternatively we can call the gitserver GetDefaultBranch in both scheduling and job execution, where in scheduling we just call it when we have a non empty revision that we can use for look up of last scheduled job for revision and determine we can skip a job creation; if we can't confirm that we can skip then we'd enqueue a new job and let the job execution handle marking as failed for empty repo or error cases.

sourcegraph-bot · 2023-07-11T04:47:43Z

📖 Storybook live preview

… with failed status and allow transaction to proceed.

…revision at time of scheduling.

gl-srgr · 2023-07-11T07:44:55Z

Spoke with @camdencheek on a call this morning and we agreed that we should still try to prevent duplicate jobs. The trade off is that we will attempt default branch resolution in job scheduling and if there is an issue with the response we will still enqueue a job with an empty revision. This lets job execution return an error which will lead to marking the job as failed (or errored if retries take place).

Regarding failed jobs and retries: I've also added a check to prevent re-scheduling a failed job if that job had an empty revision. Once we can fetch latest revision for a repo we'll proceed with scheduling a new job. This prevents us from adding redundant dupes for a job that should never succeed because of empty revision being invalid in handler.

camdencheek

LGTM. Have you done a manual E2E test with an empty repo?

camdencheek · 2023-07-11T14:46:58Z

+		gitserver: h.gitserverClient,
+	}
+
+	err = fetcher.validateRevision(ctx)


Can you add a comment here for future readers? I could see it being surprising to need to re-validate a revision

Good point! Will add a comment

gl-srgr · 2023-07-11T18:45:36Z

TLDR; manual e2e testing looks good so I'm merging.

LGTM. Have you done a manual E2E test with an empty repo?
For sure, I ran the same type of E2E testing for the repo which is mostly with graphql scheduleRepositoriesForEmbedding with differently sized sets of repo names and combinations of empty and non-empty repos. Also trying with policy as well.

I suppose the only caveat is that with our erroring silently on repo not found a graphql query with a bad repo name doesn't show that error (and we can't have a job in the table showing this error at this time). Policies shouldn't have this issue since I believe we resolve repos as part of generating the set of repos to submit for scheduling of embedding jobs. I believe this dashboard @stefanhengl described can help with showing more fine grained errors/warnings in the future:

We should take inspiration from code-intel's dashboard page and create a new page that displays all repositories with errors (from policies and jobs).

gl-srgr · 2023-07-11T19:35:08Z

One last caveat although I don't think it's blocker, just something for us to consider later:

Now that we submit an empty repo for an embeddings job it will be completed with state failed. I thought that maybe GetEmbeddableRepos would reconsider scheduling a job for this repo once the repo is non-empty but the sql query will ignore it, so users will need to manually schedule that repo once it's non-empty.

Again, not a blocker but just something to call out. I'll work on summarizing the follow up needs or nice to haves related to our recent job scheduling and job execution discussions.

#54701) When a set of repositories are to be scheduled for embeddings (via graphql or a policy) an error with a single repository can cause the entire set of embeddings jobs to be rolled back. The specific error that occurred for a recent customer issue ([slack discussion](https://sourcegraph.slack.com/archives/C053L1AQ0BC/p1688671109829779)) was due to one of the [repos being empty](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/embeddings/schedule.go?L36-42) and causing all 500 repos matched by a policy to not be enqueued for embeddings jobs. Not only do all repos have their embeddings jobs rolled back but additionally the logging will only report one problematic repo before exiting the scheduling logic which makes it so that the log does not capture all of the problematic repos in one attempt. We should instead allow read errors (e.g. reading sql or fetching information from gitserver) to be handled but not prevent the entire batch of jobs to be rolled back. This will allow a policy to be partially impactful across the repositories it is concerned with. Especially given that an empty repo won't have any need to be enqueued for an embeddings job. This PR changes the error handling so that repos that cannot complete an embeddings job will still be enqueued with a `failed` state so that site admin pages can still display which individual repos are problematic. When a repo is enqueued with a `failed` state we do not roll back the transaction so that successful embeddings jobs are still created with a `queued` state. ## Test plan will add test case to either scheduling logic tests or job execution logic tests (depends on implementation)  (cherry picked from commit 57146ce)

…n on read errors (#54804) When a set of repositories are to be scheduled for embeddings (via graphql or a policy) an error with a single repository can cause the entire set of embeddings jobs to be rolled back. The specific error that occurred for a recent customer issue ([slack discussion](https://sourcegraph.slack.com/archives/C053L1AQ0BC/p1688671109829779)) was due to one of the [repos being empty](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/embeddings/schedule.go?L36-42) and causing all 500 repos matched by a policy to not be enqueued for embeddings jobs. Not only do all repos have their embeddings jobs rolled back but additionally the logging will only report one problematic repo before exiting the scheduling logic which makes it so that the log does not capture all of the problematic repos in one attempt. We should instead allow read errors (e.g. reading sql or fetching information from gitserver) to be handled but not prevent the entire batch of jobs to be rolled back. This will allow a policy to be partially impactful across the repositories it is concerned with. Especially given that an empty repo won't have any need to be enqueued for an embeddings job. This PR changes the error handling so that repos that cannot complete an embeddings job will still be enqueued with a `failed` state so that site admin pages can still display which individual repos are problematic. When a repo is enqueued with a `failed` state we do not roll back the transaction so that successful embeddings jobs are still created with a `queued` state. ## Test plan will add test case to either scheduling logic tests or job execution logic tests (depends on implementation)  <br> Backport 57146ce from #54701 Co-authored-by: Gary Lee <105310278+gl-srgr@users.noreply.github.com>

cla-bot Bot added the cla-signed label Jul 7, 2023

gl-srgr requested review from camdencheek and stefanhengl July 7, 2023 05:53

gl-srgr commented Jul 7, 2023

View reviewed changes

For errors when reading from repo store or gitserver we enqueue a job…

9d082b5

… with failed status and allow transaction to proceed.

gl-srgr force-pushed the garyl/enqueue_failed_embedjob branch from 7cf914e to 9d082b5 Compare July 11, 2023 06:02

gl-srgr added 5 commits July 10, 2023 23:14

Revert change to CreateRepoEmbeddingJob Func. Enqueue job with empty …

1ba2076

…revision at time of scheduling.

build bazel

ee5dd0b

fix

cf72dea

fix import after goland resolves refs

d4c62bb

fix code analysis ineffassign

c851220

camdencheek approved these changes Jul 11, 2023

View reviewed changes

describe validateRevision use case and requirement

0bb4049

changelog unreleased

264f3aa

gl-srgr added the backport 5.1 label Jul 11, 2023

gl-srgr merged commit 57146ce into main Jul 11, 2023

gl-srgr deleted the garyl/enqueue_failed_embedjob branch July 11, 2023 19:35

github-actions Bot mentioned this pull request Jul 11, 2023

[Backport 5.1] Schedule embeddings jobs will not roll back transaction on read errors #54804

Merged

gl-srgr mentioned this pull request Aug 7, 2023

Scheduling embed jobs will return errors for skipped repos #55625

Closed

Conversation

gl-srgr commented Jul 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test plan

Uh oh!

gl-srgr commented Jul 7, 2023

Uh oh!

gl-srgr Jul 7, 2023

Choose a reason for hiding this comment

Uh oh!

gl-srgr Jul 7, 2023

Choose a reason for hiding this comment

Uh oh!

camdencheek Jul 11, 2023

Choose a reason for hiding this comment

Uh oh!

stefanhengl commented Jul 7, 2023

Uh oh!

gl-srgr commented Jul 7, 2023

Uh oh!

camdencheek commented Jul 7, 2023

Uh oh!

stefanhengl commented Jul 7, 2023

Uh oh!

gl-srgr commented Jul 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gl-srgr commented Jul 10, 2023

Uh oh!

sourcegraph-bot commented Jul 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stefanhengl commented Jul 10, 2023

Uh oh!

gl-srgr commented Jul 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sourcegraph-bot commented Jul 11, 2023

Uh oh!

gl-srgr commented Jul 11, 2023

Uh oh!

camdencheek left a comment

Choose a reason for hiding this comment

Uh oh!

camdencheek Jul 11, 2023

Choose a reason for hiding this comment

Uh oh!

gl-srgr Jul 11, 2023

Choose a reason for hiding this comment

Uh oh!

gl-srgr commented Jul 11, 2023

Uh oh!

gl-srgr commented Jul 11, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

gl-srgr commented Jul 7, 2023 •

edited

Loading

gl-srgr commented Jul 10, 2023 •

edited

Loading

sourcegraph-bot commented Jul 10, 2023 •

edited

Loading

gl-srgr commented Jul 10, 2023 •

edited

Loading