
[DRAFT] Failed attempt to directly use a Thread for repo fetching#22110

Closed
Wyverald wants to merge 2 commits into `master` from `wyv-failed-repoworker-attempt`

@Wyverald (Member)

Failed attempt at #22100

I managed to reproduce some deadlocks during repo fetching with virtual worker threads. One notable trigger was some _other_ repo failing to fetch, which seems to cause Skyframe to try to interrupt other concurrent repo fetches. This _might_ be the cause for a deadlock where we submit a task to the worker executor service, but the task never starts running before it gets cancelled, which causes us to wait forever for a `DONE` signal that never comes. (The worker task puts a `DONE` signal in the queue in a `finally` block -- but we don't even enter the `try`.)
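The suspected failure mode can be reproduced outside Bazel. The following is a minimal sketch, not the actual repo fetching code: the class, method, and queue contents are hypothetical, and a single-threaded executor stands in for the virtual worker thread executor so that the task is guaranteed to still be queued when it gets cancelled. It shows that a task cancelled before it starts never enters its `try`, so the `finally` that would post `DONE` never runs.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; none of these names come from Bazel.
class CancelledBeforeStart {
  /** Returns true if a DONE signal shows up within the timeout. */
  static boolean doneSignalArrives() {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    BlockingQueue<String> signalQueue = new LinkedBlockingQueue<>();
    CountDownLatch release = new CountDownLatch(1);
    try {
      // Occupy the only thread so that the worker task below stays queued.
      executor.submit(() -> {
        release.await();
        return null;
      });
      Future<?> worker = executor.submit(() -> {
        try {
          // The repo fetch would happen here.
        } finally {
          signalQueue.add("DONE");
        }
      });
      // Cancel the worker while it is still waiting to start.
      worker.cancel(true);
      release.countDown();
      // A host that waited without a timeout would block forever here.
      return signalQueue.poll(500, TimeUnit.MILLISECONDS) != null;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      executor.shutdownNow();
    }
  }
}
```

In this sketch the host only survives because it polls with a timeout; an unbounded `take()` at that point would hang in the way described above.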

I then tried various things to fix this; this PR is an attempt that actually seemed to eliminate the deadlock. Instead of waiting for a `DONE` signal to make sure the worker thread has finished, we now hold on to the executor service, which offers a `close()` method that essentially uninterruptibly waits for any scheduled tasks to terminate, whether or not they have started running. (@justinhorvitz had suggested a similar idea before.) To make sure distinct repo fetches don't interfere with each other, we start a separate worker executor service for each repo fetch instead of making everyone share the same worker executor service. (This is recommended for virtual threads; see https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html#GUID-C0FEE349-D998-4C9D-B032-E01D06BE55F2 for example.)
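A rough sketch of the per-fetch executor idea follows. It is an illustration under stated assumptions, not the code in this PR: `PerFetchExecutor`, `closeAndWait`, and `fetchOneRepo` are hypothetical names, and a plain single-threaded executor stands in for a virtual-thread-per-task executor so that the sketch also compiles on JDKs older than 21. `closeAndWait` spells out roughly what the default `ExecutorService.close()` does on Java 19 and later: shut down, wait for termination, and keep waiting even if the caller is interrupted.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration; none of these names come from Bazel.
class PerFetchExecutor {
  /** Waits for the executor to terminate, even if the caller is interrupted. */
  static void closeAndWait(ExecutorService executor) {
    boolean interrupted = false;
    executor.shutdown();
    while (true) {
      try {
        if (executor.awaitTermination(1, TimeUnit.DAYS)) {
          break;
        }
      } catch (InterruptedException e) {
        if (!interrupted) {
          // Ask running tasks to stop, then go back to waiting.
          executor.shutdownNow();
          interrupted = true;
        }
      }
    }
    if (interrupted) {
      // Restore the interrupt flag for the caller.
      Thread.currentThread().interrupt();
    }
  }

  /** Runs one simulated fetch on its own executor; returns true once the worker has finished. */
  static boolean fetchOneRepo() {
    // One executor per fetch, so closing it cannot affect other fetches.
    ExecutorService executor = Executors.newSingleThreadExecutor();
    AtomicBoolean finished = new AtomicBoolean(false);
    executor.submit(() -> {
      try {
        Thread.sleep(50); // stand-in for the actual fetch work
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        finished.set(true);
      }
    });
    closeAndWait(executor);
    return finished.get();
  }
}
```

Because the wait is on executor termination rather than on a signal posted by the task, it also returns when a task is dropped before it ever starts.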

Related: #22003

@Wyverald left a comment:

This setup has a deadlock that I couldn't work out how to eliminate.

workerThread = null;
if (myWorkerThread != null) {
myWorkerThread.interrupt();
Uninterruptibles.joinUninterruptibly(myWorkerThread);

The host thread deadlocks here (from `StarlarkRepositoryFunction.java:180`).

state.recordedInputValues,
key)));
} catch (Throwable e) {
state.signalQueue.put(new Signal.Failure(e));

The worker thread deadlocks here.

@Wyverald closed this on May 1, 2024
@Wyverald deleted the `wyv-failed-repoworker-attempt` branch on May 1, 2024 at 23:08