This repository was archived by the owner on Sep 30, 2024. It is now read-only.
[Backport 5.1] Schedule embeddings jobs will not roll back transaction on read errors#54804
Merged
Conversation
#54701) When a set of repositories are to be scheduled for embeddings (via graphql or a policy) an error with a single repository can cause the entire set of embeddings jobs to be rolled back. The specific error that occurred for a recent customer issue ([slack discussion](https://sourcegraph.slack.com/archives/C053L1AQ0BC/p1688671109829779)) was due to one of the [repos being empty](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/embeddings/schedule.go?L36-42) and causing all 500 repos matched by a policy to not be enqueued for embeddings jobs. Not only do all repos have their embeddings jobs rolled back but additionally the logging will only report one problematic repo before exiting the scheduling logic which makes it so that the log does not capture all of the problematic repos in one attempt. We should instead allow read errors (e.g. reading sql or fetching information from gitserver) to be handled but not prevent the entire batch of jobs to be rolled back. This will allow a policy to be partially impactful across the repositories it is concerned with. Especially given that an empty repo won't have any need to be enqueued for an embeddings job. This PR changes the error handling so that repos that cannot complete an embeddings job will still be enqueued with a `failed` state so that site admin pages can still display which individual repos are problematic. When a repo is enqueued with a `failed` state we do not roll back the transaction so that successful embeddings jobs are still created with a `queued` state. ## Test plan will add test case to either scheduling logic tests or job execution logic tests (depends on implementation) <!-- All pull requests REQUIRE a test plan: https://docs.sourcegraph.com/dev/background-information/testing_principles --> (cherry picked from commit 57146ce)
Contributor
gl-srgr
approved these changes
Jul 11, 2023
davejrt
referenced
this pull request
Jul 12, 2023
…ty revision string (#54879) This [PR](https://github.com/sourcegraph/sourcegraph/pull/54804) fixed an issue with job scheduling/execution in the backend. The interim fix of this PR relies on allowing for empty string `revision` values written with a repo embedding job when the repo is empty or some other issue occurs when fetching the repo's default branch (e.g. empty repo). Empty revision value causes issues with graphql query repoEmbeddingJobs when [querying for Revision](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/client/web/src/enterprise/site-admin/cody/backend.ts?L32-35) as our site admin jobs panel does. `Panic occurred: runtime error: slice bounds out of range [:7] with length 0` This PR updates the embedding job resolver to consider empty revision string acceptable and not call constructor for git commit resolver. Also site admin jobs list will now show repo name if repo is non-null and revision is null. ## Test plan <!-- All pull requests REQUIRE a test plan: https://docs.sourcegraph.com/dev/background-information/testing_principles --> sg manual test
gl-srgr
referenced
this pull request
Jul 12, 2023
…ty revision string (#54879) This [PR](https://github.com/sourcegraph/sourcegraph/pull/54804) fixed an issue with job scheduling/execution in the backend. The interim fix of this PR relies on allowing for empty string `revision` values written with a repo embedding job when the repo is empty or some other issue occurs when fetching the repo's default branch (e.g. empty repo). Empty revision value causes issues with graphql query repoEmbeddingJobs when [querying for Revision](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/client/web/src/enterprise/site-admin/cody/backend.ts?L32-35) as our site admin jobs panel does. `Panic occurred: runtime error: slice bounds out of range [:7] with length 0` This PR updates the embedding job resolver to consider empty revision string acceptable and not call constructor for git commit resolver. Also site admin jobs list will now show repo name if repo is non-null and revision is null. <!-- All pull requests REQUIRE a test plan: https://docs.sourcegraph.com/dev/background-information/testing_principles --> sg manual test (cherry picked from commit 1dd6db4)
camdencheek
referenced
this pull request
Jul 12, 2023
…ty revision string (#54879) This [PR](https://github.com/sourcegraph/sourcegraph/pull/54804) fixed an issue with job scheduling/execution in the backend. The interim fix of this PR relies on allowing for empty string `revision` values written with a repo embedding job when the repo is empty or some other issue occurs when fetching the repo's default branch (e.g. empty repo). Empty revision value causes issues with graphql query repoEmbeddingJobs when [querying for Revision](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/client/web/src/enterprise/site-admin/cody/backend.ts?L32-35) as our site admin jobs panel does. `Panic occurred: runtime error: slice bounds out of range [:7] with length 0` This PR updates the embedding job resolver to consider empty revision string acceptable and not call constructor for git commit resolver. Also site admin jobs list will now show repo name if repo is non-null and revision is null. <!-- All pull requests REQUIRE a test plan: https://docs.sourcegraph.com/dev/background-information/testing_principles --> sg manual test (cherry picked from commit 1dd6db4)
camdencheek
referenced
this pull request
Jul 12, 2023
…ddings job's empty revision string (#54881) This [PR](https://github.com/sourcegraph/sourcegraph/pull/54804) fixed an issue with job scheduling/execution in the backend. The interim fix of this PR relies on allowing for empty string `revision` values written with a repo embedding job when the repo is empty or some other issue occurs when fetching the repo's default branch (e.g. empty repo). Empty revision value causes issues with graphql query repoEmbeddingJobs when [querying for Revision](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/client/web/src/enterprise/site-admin/cody/backend.ts?L32-35) as our site admin jobs panel does. `Panic occurred: runtime error: slice bounds out of range [:7] with length 0` This PR updates the embedding job resolver to consider empty revision string acceptable and not call constructor for git commit resolver. Also site admin jobs list will now show repo name if repo is non-null and revision is null. Backport of https://github.com/sourcegraph/sourcegraph/pull/54879
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When a set of repositories are to be scheduled for embeddings (via graphql or a policy) an error with a single repository can cause the entire set of embeddings jobs to be rolled back. The specific error that occurred for a recent customer issue (slack discussion) was due to one of the repos being empty and causing all 500 repos matched by a policy to not be enqueued for embeddings jobs. Not only do all repos have their embeddings jobs rolled back but additionally the logging will only report one problematic repo before exiting the scheduling logic which makes it so that the log does not capture all of the problematic repos in one attempt.
We should instead allow read errors (e.g. reading sql or fetching information from gitserver) to be handled but not prevent the entire batch of jobs to be rolled back. This will allow a policy to be partially impactful across the repositories it is concerned with. Especially given that an empty repo won't have any need to be enqueued for an embeddings job.
This PR changes the error handling so that repos that cannot complete an embeddings job will still be enqueued with a
failedstate so that site admin pages can still display which individual repos are problematic. When a repo is enqueued with afailedstate we do not roll back the transaction so that successful embeddings jobs are still created with aqueuedstate.Test plan
will add test case to either scheduling logic tests or job execution logic tests (depends on implementation)
<!-- All pull requests REQUIRE a test plan: https://docs.sourcegraph.com/dev/background-information/testing_principles -->
Backport 57146ce from #54701