Skip to content

Reconnection may not have retry capacity #18764

@werkt

Description

@werkt

Description of the bug:

A build failure occurs when GrpcCacheClient cannot reach the remote endpoint in the Ssl handshake timeout. The local exception is wrapped in a grpc UNAVAILABLE StatusRuntimeException with a silent inner request error handler, bounded by downloadBlobs future aggregation.

(01:48:39) ERROR: /<redacted>/BUILD:41:9: Compiling REDACTED <redacted> failed: (Exit 34): io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
com.google.devtools.build.lib.remote.common.BulkTransferException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
  at com.google.devtools.build.lib.remote.util.Utils.waitForBulkTransfer(Utils.java:594)
  at com.google.devtools.build.lib.remote.RemoteExecutionService.downloadOutputs(RemoteExecutionService.java:1120)
  at com.google.devtools.build.lib.remote.RemoteSpawnRunner.downloadAndFinalizeSpawnResult(RemoteSpawnRunner.java:413)
  at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:218)
  at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:301)
  at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:152)
  at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:112)
  at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
  at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:64)
  at com.google.devtools.build.lib.analysis.actions.SpawnAction.beginExecution(SpawnAction.java:352)
  at com.google.devtools.build.lib.actions.Action.execute(Action.java:133)
  at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$5.execute(SkyframeActionExecutor.java:957)
  at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1124)
  at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1082)
  at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:160)
  at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:93)
  at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:516)
  at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:827)
  at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:323)
  at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:161)
  at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:571)
  at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:382)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  at java.base/java.lang.Thread.run(Unknown Source)
  Suppressed: java.io.IOException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
    at com.google.devtools.build.lib.remote.GrpcCacheClient.lambda$downloadBlob$18(GrpcCacheClient.java:346)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:203)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:190)
    at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:133)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
    at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:104)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
    at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
    at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
    at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
    at com.google.devtools.build.lib.remote.util.RxFutures$2.onError(RxFutures.java:257)
    at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap$SingleFlatMapCallback$FlatMapSingleObserver.onError(SingleFlatMap.java:117)
    at io.reactivex.rxjava3.internal.operators.single.SingleUsing$UsingSingleObserver.onError(SingleUsing.java:180)
    at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.tryOnError(SingleCreate.java:95)
    at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.onError(SingleCreate.java:81)
    at com.google.devtools.build.lib.remote.util.RxFutures$OnceSingleOnSubscribe$1.onFailure(RxFutures.java:172)
    at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1124)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
    at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
    at com.google.devtools.build.lib.remote.GrpcCacheClient$1.onError(GrpcCacheClient.java:422)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
    at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
    at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
    at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
    at com.google.devtools.build.lib.remote.NetworkTimeInterceptor$NetworkTimeCall$1.onClose(NetworkTimeInterceptor.java:81)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
    at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    ... 3 more
  Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
    at io.grpc.Status.asRuntimeException(Status.java:535)
    ... 14 more
  Caused by: io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms
    at io.netty.handler.ssl.SslHandler$7.run(SslHandler.java:2113)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    ... 1 more

Evidence for the lack of retry is circumstantial: Unavailability of the server for >10s * 5 would have been unlikely, and if it was caused by a severed connection on time wait, I would have expected a reconnection attempt after the first. I observed this stack trace with a silent accepting remote endpoint from the Capabilities check (which also failed to retry).

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Very little easy here: a server would have to lose connection after startup and be replaced with a silent accepting one in the time period before a download began.

Minimum spec: grpcs (not grpc plaintext), remote_cache (to inspire downloadBlob), hitting the remote cache for an action and downloading results.

Which operating system are you running Bazel on?

No response

What is the output of bazel info release?

6.2.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2We'll consider working on this in future. (Assignee optional)help wantedSomeone outside the Bazel team could own thisteam-Remote-ExecIssues and PRs for the Execution (Remote) teamtype: bug

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions