Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Embeddings: fail job immediately if rate limited exceeded#58869

Merged
jtibshirani merged 2 commits into
mainfrom
jtibs/embeddings
Dec 11, 2023
Merged

Embeddings: fail job immediately if rate limited exceeded#58869
jtibshirani merged 2 commits into
mainfrom
jtibs/embeddings

Conversation

@jtibshirani

@jtibshirani jtibshirani commented Dec 8, 2023

Copy link
Copy Markdown
Contributor

Usually, during an embeddings job we allow 10% of embedding requests to fail,
simply skipping over failed chunks. If a customer has hit their rate limits,
this means we might continually send a huge number of embedding requests that
we know will immediately fail. With this change, we immediately fail a job if
the rate limit is exceeded.

It also increases the amount of time between attempting to run a job to 15
minutes. This won't make a big difference to user experience, since by default
embeddings jobs aren't allowed to be scheduled within 24h of the last run. But
it helps prevent jobs from continuously being scheduled then failing.

This change is unlikely to have a user-facing impact, but just helps cut down
on noise in logs and excessive requests to Cody Gateway.

Test plan

Added new unit test

Preview 🤩

Preview Link

@cla-bot cla-bot Bot added the cla-signed label Dec 8, 2023
@jtibshirani jtibshirani marked this pull request as ready for review December 8, 2023 22:58
@jtibshirani jtibshirani requested a review from a team December 8, 2023 22:59
Comment on lines +267 to +270
// To avoid failing large jobs on a flaky API, just mark all files
// as failed and continue. This means we may have some missing
// files, but they will be logged as such below and some embeddings
// are better than no embeddings.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we know what errors we would get back in this case? Feels like it would be more robust to allow list the errors we have this behaviour with rather than assuming it is the flaky API. Either way this is an improvement so rather land this PR than directly address this comment. Also fine to say we don't know what the errors are :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Looking at the PR where it was added (https://github.com/sourcegraph/sourcegraph/pull/57224), I don't think we have a precise handle on the causes of flakiness. So I'll leave this as-is for now.

@jtibshirani jtibshirani merged commit 6ea7321 into main Dec 11, 2023
@jtibshirani jtibshirani deleted the jtibs/embeddings branch December 11, 2023 16:20
@sourcegraph-release-bot

Copy link
Copy Markdown
Collaborator

The backport to 5.2 failed at https://github.com/sourcegraph/sourcegraph/actions/runs/7170316193:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-5.2 5.2
# Navigate to the new working tree
cd .worktrees/backport-5.2
# Create a new branch
git switch --create backport-58869-to-5.2
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 6ea73218c13abcaf9938e912d6b4809d02e36ace
# Push it to GitHub
git push --set-upstream origin backport-58869-to-5.2
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-5.2

If you encouter conflict, first resolve the conflict and stage all files, then run the commands below:

git cherry-pick --continue
# Push it to GitHub
git push --set-upstream origin backport-58869-to-5.2
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-5.2
  • Follow above instructions to backport the commit.
  • Create a pull request where the base branch is 5.2 and the compare/head branch is backport-58869-to-5.2., click here to create the pull request.
  • Make sure to tag @sourcegraph/release-guild in the pull request description.
  • Once the backport pull request is created, kindly remove the release-blocker from this pull request.

@sourcegraph-release-bot sourcegraph-release-bot added backports failed-backport-to-5.2 release-blocker Prevents us from releasing: https://about.sourcegraph.com/handbook/engineering/releases labels Dec 11, 2023
jtibshirani added a commit that referenced this pull request Dec 12, 2023
Usually, during an embeddings job we allow 10% of embedding requests to fail,
simply skipping over failed chunks. If a customer has hit their rate limits,
this means we might continually send a huge number of embedding requests that
we know will immediately fail. With this change, we immediately fail a job if
the rate limit is exceeded.

It also increases the amount of time between attempting to run a job to 15
minutes. This won't make a big difference to user experience, since by default
embeddings jobs aren't allowed to be scheduled within 24h of the last run. But
it helps prevent jobs from continuously being scheduled then failing.

This change is unlikely to have a user-facing impact, but just helps cut down
on noise in logs and excessive requests to Cody Gateway.
vovakulikov pushed a commit that referenced this pull request Dec 12, 2023
Usually, during an embeddings job we allow 10% of embedding requests to fail,
simply skipping over failed chunks. If a customer has hit their rate limits,
this means we might continually send a huge number of embedding requests that
we know will immediately fail. With this change, we immediately fail a job if
the rate limit is exceeded.

It also increases the amount of time between attempting to run a job to 15
minutes. This won't make a big difference to user experience, since by default
embeddings jobs aren't allowed to be scheduled within 24h of the last run. But
it helps prevent jobs from continuously being scheduled then failing.

This change is unlikely to have a user-facing impact, but just helps cut down
on noise in logs and excessive requests to Cody Gateway.
jdpleiness pushed a commit that referenced this pull request Dec 13, 2023
…ded (#58939)

Embeddings: fail job immediately if rate limited exceeded (#58869)

Usually, during an embeddings job we allow 10% of embedding requests to fail,
simply skipping over failed chunks. If a customer has hit their rate limits,
this means we might continually send a huge number of embedding requests that
we know will immediately fail. With this change, we immediately fail a job if
the rate limit is exceeded.

It also increases the amount of time between attempting to run a job to 15
minutes. This won't make a big difference to user experience, since by default
embeddings jobs aren't allowed to be scheduled within 24h of the last run. But
it helps prevent jobs from continuously being scheduled then failing.

This change is unlikely to have a user-facing impact, but just helps cut down
on noise in logs and excessive requests to Cody Gateway.
@jtibshirani jtibshirani removed release-blocker Prevents us from releasing: https://about.sourcegraph.com/handbook/engineering/releases failed-backport-to-5.2 labels Jan 10, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants