Schedule embeddings jobs will not roll back transaction on read errors#54701
Conversation
|
@stefanhengl @camdencheek I'm still unsure about how we want to bubble up errors we encounter. The first iteration of my PR just removes reporting the "read" errors but still reports errors for creating embeddings jobs (that is, there are still certain errors that will fail out of the loop early and therefore cancel all jobs from being written to the table). Some ideas I had in mind:
|
| Done(err error) error | ||
|
|
||
| CreateRepoEmbeddingJob(ctx context.Context, repoID api.RepoID, revision api.CommitID) (int, error) | ||
| CreateRepoEmbeddingJob(ctx context.Context, repoID api.RepoID, revision api.CommitID, state string) (int, error) |
There was a problem hiding this comment.
We could also leave this as is and add a new function
| r, err := repoStore.GetByName(ctx, repoName) | ||
| if err != nil { | ||
| return err | ||
| return nil |
There was a problem hiding this comment.
The "read" errors I mentioned in the PR comment is essentially just:
- some error when reading repo store
- some error when fetching branch from gitserver
- no error when fetching branch from gitserver but empty repo has an empty
refName
We should probably report these somewhere but I'm between the message logs and the jobs table. Does the failure_message get displayed anywhere for an embeddings job?
There was a problem hiding this comment.
Does the failure_message get displayed anywhere for an embeddings job?
Yes, this gets propagated to the UI when we hit an error
|
Taking a step back: We definitely have to surface problematic repos to the admin. I am just not sure that the list of embeddings jobs is the right way. Scheduling errors belong much more to the policy IE it is the policy that couldn't do it's job, not the embedding service. Also, scheduling a failed job because we encountered an error during scheduling just to report an error seems like a misuse of the jobs page and the worker framework. However, it becomes tricky because we can have overlapping policies and it is not a good experience for an admin to click through the policies and see which repo fails where. Suggestion We should take inspiration from code-intel's dashboard page and create a new page that displays all repositories with errors (from policies and jobs). Once you click on a repository you could see
As a stop-gap for the next patch, we could log the errors during scheduling (but continue with the other repos) and display an error icon next to policy with a tooltip "some repos might not have been scheduled, check the logs for ". |
|
Agreed that this solution is surfacing the problematic repos via a page that was not intended for this. Also the failed jobs can bloat / clutter the jobs page with a bunch of jobs that never actually ran / actually enqueued as new/ready to be processed.
I'll think this over as an option for the immediate fix |
So, I would agree, but we could also just move resolution of the default branch into the job execution itself, right? It's somewhat arbitrary that this failure occurred during scheduling rather than during the execution of the scheduled job. I mostly thought that would be the simplest way to surface the issue in a way that's not confusing to admins |
I agree. Moving the resolution of the default branch to the job execution seems like a good option. Then the job can just fail naturally and report what happened. I think what bothered me the most was scheduling a failed job from the scheduling logic. |
I'm okay with this approach. I'm implementing now and the remaining decision is how to handle the force reschedule parameter; right now we act on this parameter depending on repo id and latestRevision. If we only want to resolve the repo (i.e. read repo store or fetch data from gitserver) and handle errors in the job execution, not job scheduling, then we can update the repo embedding job schema (here and here) with a Alternatively we can call the gitserver Not sure which feels more out of place: a force column for the job schema or fetching twice with the gitserver client for default branch. I'm leaning towards and am currently implementing the former but let me know if you have another idea. |
|
Pushed some changes but still need to update existing tests and add new tests. Wanted to get the changes out for review in the meantime. |
|
Codenotify: Notifying subscribers in CODENOTIFY files for diff 4307c7b...264f3aa.
|
Good point. The issue is we check for embeddable repos every minute or so IE scheduling duplicates will let the repo_embeddings_jobs table explode if I am not mistaken. We should try not to schedule duplicates for that reason. |
|
I suppose we could back off for a bit and not enqueue the job right away to reduce duplicates. That is, we could check if there was a recent failed job in the last X minutes for a repo+revision tuple, where revision is also empty string. We'd be queueing jobs despite knowing that we had a bad response from gitserver, albeit less frequently, and still have job execution mark job as More thorough step by step:
We still add error handling to job execution but this would let us not modify the interface of repo job embeddings store and also not modify the schema (b/c we'd check This is sort of overlapping with my alternate solution that I commented on yesterday, that leaves a gitserver request in scheduler:
|
… with failed status and allow transaction to proceed.
7cf914e to
9d082b5
Compare
|
Spoke with @camdencheek on a call this morning and we agreed that we should still try to prevent duplicate jobs. The trade off is that we will attempt default branch resolution in job scheduling and if there is an issue with the response we will still enqueue a job with an empty revision. This lets job execution return an error which will lead to marking the job as Regarding failed jobs and retries: I've also added a check to prevent re-scheduling a failed job if that job had an empty revision. Once we can fetch latest revision for a repo we'll proceed with scheduling a new job. This prevents us from adding redundant dupes for a job that should never succeed because of empty revision being invalid in handler. |
camdencheek
left a comment
There was a problem hiding this comment.
LGTM. Have you done a manual E2E test with an empty repo?
| gitserver: h.gitserverClient, | ||
| } | ||
|
|
||
| err = fetcher.validateRevision(ctx) |
There was a problem hiding this comment.
Can you add a comment here for future readers? I could see it being surprising to need to re-validate a revision
There was a problem hiding this comment.
Good point! Will add a comment
|
TLDR; manual e2e testing looks good so I'm merging.
I suppose the only caveat is that with our erroring silently on repo not found a graphql query with a bad repo name doesn't show that error (and we can't have a job in the table showing this error at this time). Policies shouldn't have this issue since I believe we resolve repos as part of generating the set of repos to submit for scheduling of embedding jobs. I believe this dashboard @stefanhengl described can help with showing more fine grained errors/warnings in the future:
|
|
One last caveat although I don't think it's blocker, just something for us to consider later: Now that we submit an empty repo for an embeddings job it will be completed with state Again, not a blocker but just something to call out. I'll work on summarizing the follow up needs or nice to haves related to our recent job scheduling and job execution discussions. |
#54701) When a set of repositories are to be scheduled for embeddings (via graphql or a policy) an error with a single repository can cause the entire set of embeddings jobs to be rolled back. The specific error that occurred for a recent customer issue ([slack discussion](https://sourcegraph.slack.com/archives/C053L1AQ0BC/p1688671109829779)) was due to one of the [repos being empty](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/embeddings/schedule.go?L36-42) and causing all 500 repos matched by a policy to not be enqueued for embeddings jobs. Not only do all repos have their embeddings jobs rolled back but additionally the logging will only report one problematic repo before exiting the scheduling logic which makes it so that the log does not capture all of the problematic repos in one attempt. We should instead allow read errors (e.g. reading sql or fetching information from gitserver) to be handled but not prevent the entire batch of jobs to be rolled back. This will allow a policy to be partially impactful across the repositories it is concerned with. Especially given that an empty repo won't have any need to be enqueued for an embeddings job. This PR changes the error handling so that repos that cannot complete an embeddings job will still be enqueued with a `failed` state so that site admin pages can still display which individual repos are problematic. When a repo is enqueued with a `failed` state we do not roll back the transaction so that successful embeddings jobs are still created with a `queued` state. ## Test plan will add test case to either scheduling logic tests or job execution logic tests (depends on implementation) <!-- All pull requests REQUIRE a test plan: https://docs.sourcegraph.com/dev/background-information/testing_principles --> (cherry picked from commit 57146ce)
…n on read errors (#54804) When a set of repositories are to be scheduled for embeddings (via graphql or a policy) an error with a single repository can cause the entire set of embeddings jobs to be rolled back. The specific error that occurred for a recent customer issue ([slack discussion](https://sourcegraph.slack.com/archives/C053L1AQ0BC/p1688671109829779)) was due to one of the [repos being empty](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/embeddings/schedule.go?L36-42) and causing all 500 repos matched by a policy to not be enqueued for embeddings jobs. Not only do all repos have their embeddings jobs rolled back but additionally the logging will only report one problematic repo before exiting the scheduling logic which makes it so that the log does not capture all of the problematic repos in one attempt. We should instead allow read errors (e.g. reading sql or fetching information from gitserver) to be handled but not prevent the entire batch of jobs to be rolled back. This will allow a policy to be partially impactful across the repositories it is concerned with. Especially given that an empty repo won't have any need to be enqueued for an embeddings job. This PR changes the error handling so that repos that cannot complete an embeddings job will still be enqueued with a `failed` state so that site admin pages can still display which individual repos are problematic. When a repo is enqueued with a `failed` state we do not roll back the transaction so that successful embeddings jobs are still created with a `queued` state. ## Test plan will add test case to either scheduling logic tests or job execution logic tests (depends on implementation) <!-- All pull requests REQUIRE a test plan: https://docs.sourcegraph.com/dev/background-information/testing_principles --> <br> Backport 57146ce from #54701 Co-authored-by: Gary Lee <105310278+gl-srgr@users.noreply.github.com>
When a set of repositories are to be scheduled for embeddings (via graphql or a policy) an error with a single repository can cause the entire set of embeddings jobs to be rolled back. The specific error that occurred for a recent customer issue (slack discussion) was due to one of the repos being empty and causing all 500 repos matched by a policy to not be enqueued for embeddings jobs. Not only do all repos have their embeddings jobs rolled back but additionally the logging will only report one problematic repo before exiting the scheduling logic which makes it so that the log does not capture all of the problematic repos in one attempt.
We should instead allow read errors (e.g. reading sql or fetching information from gitserver) to be handled but not prevent the entire batch of jobs to be rolled back. This will allow a policy to be partially impactful across the repositories it is concerned with. Especially given that an empty repo won't have any need to be enqueued for an embeddings job.
This PR changes the error handling so that repos that cannot complete an embeddings job will still be enqueued with a
failedstate so that site admin pages can still display which individual repos are problematic. When a repo is enqueued with afailedstate we do not roll back the transaction so that successful embeddings jobs are still created with aqueuedstate.Test plan
will add test case to either scheduling logic tests or job execution logic tests (depends on implementation)