Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Embeddings: avoid constantly rerunning job if it failed#58980

Merged
jtibshirani merged 2 commits into
mainfrom
jtibs/embeddings
Dec 15, 2023
Merged

Embeddings: avoid constantly rerunning job if it failed#58980
jtibshirani merged 2 commits into
mainfrom
jtibs/embeddings

Conversation

@jtibshirani

@jtibshirani jtibshirani commented Dec 14, 2023

Copy link
Copy Markdown
Contributor

The embeddings policy framework attempts to rerun a repo job even if a previous
run failed at the exact same revision. This means that when a job failed, for
example because of rate limits or a problematic file, it would immediately be
rescheduled and fail again. This can be expensive and noisy.

Now, the policy framework does not rerun failed jobs unless the revision
changes. An admin can always kick off a job manually if they want to rerun a
job at the revision. This reduces noise and feels like a better trade-off.

Test plan

Modified unit tests

@cla-bot cla-bot Bot added the cla-signed label Dec 14, 2023
@jtibshirani jtibshirani marked this pull request as ready for review December 14, 2023 22:02
@jtibshirani jtibshirani requested a review from a team December 14, 2023 22:52

@stefanhengl stefanhengl left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jtibshirani

Copy link
Copy Markdown
Contributor Author

Thanks for the review. I'll backport this as part of a round of embeddings fixes.

@jtibshirani jtibshirani merged commit 550e077 into main Dec 15, 2023
@jtibshirani jtibshirani deleted the jtibs/embeddings branch December 15, 2023 19:42
jtibshirani added a commit that referenced this pull request Dec 18, 2023
The embeddings policy framework attempts to rerun a repo job even if a previous
run failed at the exact same revision. This means that when a job failed, for
example because of rate limits or a problematic file, it would immediately be
rescheduled and fail again. This can be expensive and noisy.

Now, the policy framework does **not** rerun failed jobs unless the revision
changes. An admin can always kick off a job manually if they want to rerun a
job at the revision. This reduces noise and feels like a better trade-off.
camdencheek pushed a commit that referenced this pull request Jan 9, 2024
#59090)

* Embeddings: refactor job scheduling code (#58787)

Refactors the repository scheduling logic into two separate methods, one used
by the GraphQL API, and the other by the policy framework. This makes the code
easier to read and lets us make some improvements:
* For policy scheduling, stop translating back-and-forth between repo IDs and
names
* Make sure to fail entire GraphQL request if there is an error fetching repos,
instead of silently ignoring it

## Test plan

Added new unit test

* Embeddings: avoid constantly rerunning job if it failed (#58980)

The embeddings policy framework attempts to rerun a repo job even if a previous
run failed at the exact same revision. This means that when a job failed, for
example because of rate limits or a problematic file, it would immediately be
rescheduled and fail again. This can be expensive and noisy.

Now, the policy framework does **not** rerun failed jobs unless the revision
changes. An admin can always kick off a job manually if they want to rerun a
job at the revision. This reduces noise and feels like a better trade-off.

* Fix compile error
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants