Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Schedule embeddings jobs will not roll back transaction on read errors#54701

Merged
gl-srgr merged 8 commits into
mainfrom
garyl/enqueue_failed_embedjob
Jul 11, 2023
Merged

Schedule embeddings jobs will not roll back transaction on read errors#54701
gl-srgr merged 8 commits into
mainfrom
garyl/enqueue_failed_embedjob

Conversation

@gl-srgr

@gl-srgr gl-srgr commented Jul 7, 2023

Copy link
Copy Markdown
Contributor

When a set of repositories are to be scheduled for embeddings (via graphql or a policy) an error with a single repository can cause the entire set of embeddings jobs to be rolled back. The specific error that occurred for a recent customer issue (slack discussion) was due to one of the repos being empty and causing all 500 repos matched by a policy to not be enqueued for embeddings jobs. Not only do all repos have their embeddings jobs rolled back but additionally the logging will only report one problematic repo before exiting the scheduling logic which makes it so that the log does not capture all of the problematic repos in one attempt.

We should instead allow read errors (e.g. reading sql or fetching information from gitserver) to be handled but not prevent the entire batch of jobs to be rolled back. This will allow a policy to be partially impactful across the repositories it is concerned with. Especially given that an empty repo won't have any need to be enqueued for an embeddings job.

This PR changes the error handling so that repos that cannot complete an embeddings job will still be enqueued with a failed state so that site admin pages can still display which individual repos are problematic. When a repo is enqueued with a failed state we do not roll back the transaction so that successful embeddings jobs are still created with a queued state.

Test plan

will add test case to either scheduling logic tests or job execution logic tests (depends on implementation)

@cla-bot cla-bot Bot added the cla-signed label Jul 7, 2023
@gl-srgr

gl-srgr commented Jul 7, 2023

Copy link
Copy Markdown
Contributor Author

@stefanhengl @camdencheek I'm still unsure about how we want to bubble up errors we encounter. The first iteration of my PR just removes reporting the "read" errors but still reports errors for creating embeddings jobs (that is, there are still certain errors that will fail out of the loop early and therefore cancel all jobs from being written to the table).

Some ideas I had in mind:

  • return all of the "read" errors (e.g. the empty repo error) across all repos as a multi error called readErrs; we won't pass this error to Done() and therefore doesn't roll back the full set of job writes.
  • Instead of returning "read" errors such as the empty repo error we just log it as part of the failed jobs failure_message field. We could add a new function like CreateFailedRepoEmbeddingJob() which takes a failure_message value (instead of modifying the existing CreateRepoEmbeddingJob())
  • Pass a logger to the schedule jobs function. This would be nice to report each repo's error independently. For graphql resolvers I believer there is already a logger member of the Resolver. However, I'm not sure if the worker background routine leans into this approach well or if it makes sense to pass a logger argument.

Done(err error) error

CreateRepoEmbeddingJob(ctx context.Context, repoID api.RepoID, revision api.CommitID) (int, error)
CreateRepoEmbeddingJob(ctx context.Context, repoID api.RepoID, revision api.CommitID, state string) (int, error)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also leave this as is and add a new function

r, err := repoStore.GetByName(ctx, repoName)
if err != nil {
return err
return nil

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "read" errors I mentioned in the PR comment is essentially just:

  1. some error when reading repo store
  2. some error when fetching branch from gitserver
  3. no error when fetching branch from gitserver but empty repo has an empty refName

We should probably report these somewhere but I'm between the message logs and the jobs table. Does the failure_message get displayed anywhere for an embeddings job?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the failure_message get displayed anywhere for an embeddings job?

Yes, this gets propagated to the UI when we hit an error

@stefanhengl

Copy link
Copy Markdown
Member

Taking a step back: We definitely have to surface problematic repos to the admin. I am just not sure that the list of embeddings jobs is the right way. Scheduling errors belong much more to the policy IE it is the policy that couldn't do it's job, not the embedding service.

Also, scheduling a failed job because we encountered an error during scheduling just to report an error seems like a misuse of the jobs page and the worker framework.

However, it becomes tricky because we can have overlapping policies and it is not a good experience for an admin to click through the policies and see which repo fails where.

Suggestion

We should take inspiration from code-intel's dashboard page and create a new page that displays all repositories with errors (from policies and jobs).

Once you click on a repository you could see

  • all related jobs with errors
  • all related policies with errors

As a stop-gap for the next patch, we could log the errors during scheduling (but continue with the other repos) and display an error icon next to policy with a tooltip "some repos might not have been scheduled, check the logs for ".

@gl-srgr

gl-srgr commented Jul 7, 2023

Copy link
Copy Markdown
Contributor Author

Agreed that this solution is surfacing the problematic repos via a page that was not intended for this. Also the failed jobs can bloat / clutter the jobs page with a bunch of jobs that never actually ran / actually enqueued as new/ready to be processed.

As a stop-gap for the next patch, we could log the errors during scheduling (but continue with the other repos) and display an error icon next to policy with a tooltip "some repos might not have been scheduled, check the logs for ".

I'll think this over as an option for the immediate fix

@camdencheek

Copy link
Copy Markdown
Member

scheduling a failed job because we encountered an error during scheduling just to report an error seems like a misuse of the jobs page and the worker framework

So, I would agree, but we could also just move resolution of the default branch into the job execution itself, right? It's somewhat arbitrary that this failure occurred during scheduling rather than during the execution of the scheduled job. I mostly thought that would be the simplest way to surface the issue in a way that's not confusing to admins

@stefanhengl

Copy link
Copy Markdown
Member

It's somewhat arbitrary that this failure occurred during scheduling rather than during the execution of the scheduled job.

I agree. Moving the resolution of the default branch to the job execution seems like a good option. Then the job can just fail naturally and report what happened. I think what bothered me the most was scheduling a failed job from the scheduling logic.

@gl-srgr

gl-srgr commented Jul 10, 2023

Copy link
Copy Markdown
Contributor Author

but we could also just move resolution of the default branch into the job execution itself, right?

I'm okay with this approach. I'm implementing now and the remaining decision is how to handle the force reschedule parameter; right now we act on this parameter depending on repo id and latestRevision. If we only want to resolve the repo (i.e. read repo store or fetch data from gitserver) and handle errors in the job execution, not job scheduling, then we can update the repo embedding job schema (here and here) with a force column. We'd no longer prevent scheduling dupes of the same repo+revision jobs in this case but we can still check whether a job can be skipped if it's a dupe during job execution.

Alternatively we can call the gitserver GetDefaultBranch in both scheduling and job execution, where in scheduling we just call it when we have a non empty revision that we can use for look up of last scheduled job for revision and determine we can skip a job creation; if we can't confirm that we can skip then we'd enqueue a new job and let the job execution handle marking as failed for empty repo or error cases.

Not sure which feels more out of place: a force column for the job schema or fetching twice with the gitserver client for default branch. I'm leaning towards and am currently implementing the former but let me know if you have another idea.

cc @camdencheek @stefanhengl

@gl-srgr

gl-srgr commented Jul 10, 2023

Copy link
Copy Markdown
Contributor Author

Pushed some changes but still need to update existing tests and add new tests. Wanted to get the changes out for review in the meantime.

@sourcegraph-bot

sourcegraph-bot commented Jul 10, 2023

Copy link
Copy Markdown
Contributor

Codenotify: Notifying subscribers in CODENOTIFY files for diff 4307c7b...264f3aa.

Notify File(s)
@efritz enterprise/cmd/worker/internal/embeddings/repo/handler.go
enterprise/cmd/worker/internal/embeddings/repo/handler_test.go

@stefanhengl

Copy link
Copy Markdown
Member

We'd no longer prevent scheduling dupes of the same repo+revision jobs in this case but we can still check whether a job can be skipped if it's a dupe during job execution.

Good point. The issue is we check for embeddable repos every minute or so IE scheduling duplicates will let the repo_embeddings_jobs table explode if I am not mistaken. We should try not to schedule duplicates for that reason.

@gl-srgr

gl-srgr commented Jul 10, 2023

Copy link
Copy Markdown
Contributor Author

I suppose we could back off for a bit and not enqueue the job right away to reduce duplicates. That is, we could check if there was a recent failed job in the last X minutes for a repo+revision tuple, where revision is also empty string. We'd be queueing jobs despite knowing that we had a bad response from gitserver, albeit less frequently, and still have job execution mark job as failed.

More thorough step by step:

  1. Scheduler successfully get repo by name from store.
  2. Scheduler fetches branch from gitserver and return is either (a) an error or (b) nil error but also empty values.
  3. With (2.) confirmed we check if there is a duplicate job with repo and empty revision enqueued. If yes do nothing.
  4. If (3.) does not exit early, we also check if there is a failed job for the repo for an empty revision in the last X minutes, and only enqueue a job with repo & empty revision if FinishedAt for that previous job was before now - X minutes.

We still add error handling to job execution but this would let us not modify the interface of repo job embeddings store and also not modify the schema (b/c we'd check forceReschedule in scheduler still).

This is sort of overlapping with my alternate solution that I commented on yesterday, that leaves a gitserver request in scheduler:

Alternatively we can call the gitserver GetDefaultBranch in both scheduling and job execution, where in scheduling we just call it when we have a non empty revision that we can use for look up of last scheduled job for revision and determine we can skip a job creation; if we can't confirm that we can skip then we'd enqueue a new job and let the job execution handle marking as failed for empty repo or error cases.

@sourcegraph-bot

Copy link
Copy Markdown
Contributor

📖 Storybook live preview

… with failed status and allow transaction to proceed.
@gl-srgr gl-srgr force-pushed the garyl/enqueue_failed_embedjob branch from 7cf914e to 9d082b5 Compare July 11, 2023 06:02
@gl-srgr

gl-srgr commented Jul 11, 2023

Copy link
Copy Markdown
Contributor Author

Spoke with @camdencheek on a call this morning and we agreed that we should still try to prevent duplicate jobs. The trade off is that we will attempt default branch resolution in job scheduling and if there is an issue with the response we will still enqueue a job with an empty revision. This lets job execution return an error which will lead to marking the job as failed (or errored if retries take place).

Regarding failed jobs and retries: I've also added a check to prevent re-scheduling a failed job if that job had an empty revision. Once we can fetch latest revision for a repo we'll proceed with scheduling a new job. This prevents us from adding redundant dupes for a job that should never succeed because of empty revision being invalid in handler.

@camdencheek camdencheek left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Have you done a manual E2E test with an empty repo?

gitserver: h.gitserverClient,
}

err = fetcher.validateRevision(ctx)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment here for future readers? I could see it being surprising to need to re-validate a revision

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Will add a comment

@gl-srgr

gl-srgr commented Jul 11, 2023

Copy link
Copy Markdown
Contributor Author

TLDR; manual e2e testing looks good so I'm merging.

LGTM. Have you done a manual E2E test with an empty repo?
For sure, I ran the same type of E2E testing for the repo which is mostly with graphql scheduleRepositoriesForEmbedding with differently sized sets of repo names and combinations of empty and non-empty repos. Also trying with policy as well.

I suppose the only caveat is that with our erroring silently on repo not found a graphql query with a bad repo name doesn't show that error (and we can't have a job in the table showing this error at this time). Policies shouldn't have this issue since I believe we resolve repos as part of generating the set of repos to submit for scheduling of embedding jobs. I believe this dashboard @stefanhengl described can help with showing more fine grained errors/warnings in the future:

We should take inspiration from code-intel's dashboard page and create a new page that displays all repositories with errors (from policies and jobs).

@gl-srgr

gl-srgr commented Jul 11, 2023

Copy link
Copy Markdown
Contributor Author

One last caveat although I don't think it's blocker, just something for us to consider later:

Now that we submit an empty repo for an embeddings job it will be completed with state failed. I thought that maybe GetEmbeddableRepos would reconsider scheduling a job for this repo once the repo is non-empty but the sql query will ignore it, so users will need to manually schedule that repo once it's non-empty.

Again, not a blocker but just something to call out. I'll work on summarizing the follow up needs or nice to haves related to our recent job scheduling and job execution discussions.

@gl-srgr gl-srgr merged commit 57146ce into main Jul 11, 2023
@gl-srgr gl-srgr deleted the garyl/enqueue_failed_embedjob branch July 11, 2023 19:35
github-actions Bot pushed a commit that referenced this pull request Jul 11, 2023
#54701)

When a set of repositories are to be scheduled for embeddings (via
graphql or a policy) an error with a single repository can cause the
entire set of embeddings jobs to be rolled back. The specific error that
occurred for a recent customer issue ([slack
discussion](https://sourcegraph.slack.com/archives/C053L1AQ0BC/p1688671109829779))
was due to one of the [repos being
empty](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/embeddings/schedule.go?L36-42)
and causing all 500 repos matched by a policy to not be enqueued for
embeddings jobs. Not only do all repos have their embeddings jobs rolled
back but additionally the logging will only report one problematic repo
before exiting the scheduling logic which makes it so that the log does
not capture all of the problematic repos in one attempt.

We should instead allow read errors (e.g. reading sql or fetching
information from gitserver) to be handled but not prevent the entire
batch of jobs to be rolled back. This will allow a policy to be
partially impactful across the repositories it is concerned with.
Especially given that an empty repo won't have any need to be enqueued
for an embeddings job.

This PR changes the error handling so that repos that cannot complete an
embeddings job will still be enqueued with a `failed` state so that site
admin pages can still display which individual repos are problematic.
When a repo is enqueued with a `failed` state we do not roll back the
transaction so that successful embeddings jobs are still created with a
`queued` state.

## Test plan

will add test case to either scheduling logic tests or job execution
logic tests (depends on implementation)

<!-- All pull requests REQUIRE a test plan:
https://docs.sourcegraph.com/dev/background-information/testing_principles
-->

(cherry picked from commit 57146ce)
coury-clark pushed a commit that referenced this pull request Jul 11, 2023
…n on read errors (#54804)

When a set of repositories are to be scheduled for embeddings (via
graphql or a policy) an error with a single repository can cause the
entire set of embeddings jobs to be rolled back. The specific error that
occurred for a recent customer issue ([slack
discussion](https://sourcegraph.slack.com/archives/C053L1AQ0BC/p1688671109829779))
was due to one of the [repos being
empty](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/embeddings/schedule.go?L36-42)
and causing all 500 repos matched by a policy to not be enqueued for
embeddings jobs. Not only do all repos have their embeddings jobs rolled
back but additionally the logging will only report one problematic repo
before exiting the scheduling logic which makes it so that the log does
not capture all of the problematic repos in one attempt.

We should instead allow read errors (e.g. reading sql or fetching
information from gitserver) to be handled but not prevent the entire
batch of jobs to be rolled back. This will allow a policy to be
partially impactful across the repositories it is concerned with.
Especially given that an empty repo won&#39;t have any need to be
enqueued for an embeddings job.

This PR changes the error handling so that repos that cannot complete an
embeddings job will still be enqueued with a `failed` state so that site
admin pages can still display which individual repos are problematic.
When a repo is enqueued with a `failed` state we do not roll back the
transaction so that successful embeddings jobs are still created with a
`queued` state.


## Test plan

will add test case to either scheduling logic tests or job execution
logic tests (depends on implementation)

&lt;!-- All pull requests REQUIRE a test plan:
https://docs.sourcegraph.com/dev/background-information/testing_principles
--&gt;
 <br> Backport 57146ce from #54701

Co-authored-by: Gary Lee <105310278+gl-srgr@users.noreply.github.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants