-
Notifications
You must be signed in to change notification settings - Fork 4.4k
Reconnection may not have retry capacity #18764
Description
Description of the bug:
A build failure occurs when GrpcCacheClient cannot reach the remote endpoint in the Ssl handshake timeout. The local exception is wrapped in a grpc UNAVAILABLE StatusRuntimeException with a silent inner request error handler, bounded by downloadBlobs future aggregation.
(01:48:39) ERROR: /<redacted>/BUILD:41:9: Compiling REDACTED <redacted> failed: (Exit 34): io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
com.google.devtools.build.lib.remote.common.BulkTransferException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
at com.google.devtools.build.lib.remote.util.Utils.waitForBulkTransfer(Utils.java:594)
at com.google.devtools.build.lib.remote.RemoteExecutionService.downloadOutputs(RemoteExecutionService.java:1120)
at com.google.devtools.build.lib.remote.RemoteSpawnRunner.downloadAndFinalizeSpawnResult(RemoteSpawnRunner.java:413)
at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:218)
at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:301)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:152)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:112)
at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:64)
at com.google.devtools.build.lib.analysis.actions.SpawnAction.beginExecution(SpawnAction.java:352)
at com.google.devtools.build.lib.actions.Action.execute(Action.java:133)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$5.execute(SkyframeActionExecutor.java:957)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1124)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1082)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:160)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:93)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:516)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:827)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:323)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:161)
at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:571)
at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:382)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Suppressed: java.io.IOException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
at com.google.devtools.build.lib.remote.GrpcCacheClient.lambda$downloadBlob$18(GrpcCacheClient.java:346)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:203)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:190)
at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:133)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:104)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
at com.google.devtools.build.lib.remote.util.RxFutures$2.onError(RxFutures.java:257)
at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap$SingleFlatMapCallback$FlatMapSingleObserver.onError(SingleFlatMap.java:117)
at io.reactivex.rxjava3.internal.operators.single.SingleUsing$UsingSingleObserver.onError(SingleUsing.java:180)
at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.tryOnError(SingleCreate.java:95)
at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.onError(SingleCreate.java:81)
at com.google.devtools.build.lib.remote.util.RxFutures$OnceSingleOnSubscribe$1.onFailure(RxFutures.java:172)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1124)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
at com.google.devtools.build.lib.remote.GrpcCacheClient$1.onError(GrpcCacheClient.java:422)
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at com.google.devtools.build.lib.remote.NetworkTimeInterceptor$NetworkTimeCall$1.onClose(NetworkTimeInterceptor.java:81)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
... 3 more
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
at io.grpc.Status.asRuntimeException(Status.java:535)
... 14 more
Caused by: io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms
at io.netty.handler.ssl.SslHandler$7.run(SslHandler.java:2113)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 more
Evidence for the lack of retry is circumstantial: Unavailability of the server for >10s * 5 would have been unlikely, and if it was caused by a severed connection on time wait, I would have expected a reconnection attempt after the first. I observed this stack trace with a silent accepting remote endpoint from the Capabilities check (which also failed to retry).
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Very little easy here: a server would have to lose connection after startup and be replaced with a silent accepting one in the time period before a download began.
Minimum spec: grpcs (not grpc plaintext), remote_cache (to inspire downloadBlob), hitting the remote cache for an action and downloading results.
Which operating system are you running Bazel on?
No response
What is the output of bazel info release?
6.2.0
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response