Fix Transport Stopped Exception#48930
Conversation
When a node shutdowns, `TransportService` moves to stopped state and then closes connections. If a request is done in between, an exception was thrown that was not retried in replication actions. Now throw a wrapped `NodeClosedException` exception instead, which is correctly handled in replication action. Fixed other usages too. Relates elastic#42612
|
Pinging @elastic/es-distributed (:Distributed/CRUD) |
Turns out new exception was unnecessary. The `Transport is stopped` messages were hardcoded in a few more places, fixed those too.
original-brownbear
left a comment
There was a problem hiding this comment.
Looks good, just a few random comments :)
server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java
Outdated
Show resolved
Hide resolved
| final boolean nodeIsClosing = | ||
| cause instanceof NodeClosedException | ||
| || ExceptionsHelper.isTransportStoppedForAction(cause, "internal:cluster/shard/failure"); | ||
| final boolean nodeIsClosing = cause instanceof NodeClosedException; |
There was a problem hiding this comment.
Maybe use the ExceptionsHelper.unwrap(e, NodeClosedException.class) != null pattern here too to be a little more resilient to future changes (and present unexpected paths that get us here ...?)?
There was a problem hiding this comment.
I thought about this too, but ended up keeping this as is, since the unwrap's are different in that unwrapCause only unwraps ElasticsearchWrapperException, whereas unwrap unwraps all causes. It could be important in case we end up unwrapping a deep exception from another host? Do you think we should try to follow your suggestion anyway?
There was a problem hiding this comment.
Nah let's not do that. I def. doesn't look impossible for that exception from another node making it here (since it's not trivial to tell it the suggestion is pointless in the first place :)).
.../elasticsearch/action/support/replication/TransportReplicationActionRetryOnClosedNodeIT.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/indices/cluster/IndicesClusterStateServiceTests.java
Outdated
Show resolved
Hide resolved
.../elasticsearch/action/support/replication/TransportReplicationActionRetryOnClosedNodeIT.java
Outdated
Show resolved
Hide resolved
| final boolean nodeIsClosing = | ||
| cause instanceof NodeClosedException | ||
| || ExceptionsHelper.isTransportStoppedForAction(cause, "internal:cluster/shard/failure"); | ||
| final boolean nodeIsClosing = cause instanceof NodeClosedException; |
There was a problem hiding this comment.
Nah let's not do that. I def. doesn't look impossible for that exception from another node making it here (since it's not trivial to tell it the suggestion is pointless in the first place :)).
ywelsch
left a comment
There was a problem hiding this comment.
I like the simplification of exception handling in this PR a lot. I've left one comment, looking good o.w.
AFAICS, this exception is only bubbled up locally on a node, so we don't need to care about BWC (i.e. be able to detect the old TransportException bubbling up from other nodes).
server/src/main/java/org/elasticsearch/transport/TransportService.java
Outdated
Show resolved
Hide resolved
|
@elasticmachine run elasticsearch-ci/bwc |
|
@elasticmachine run elasticsearch-ci/packaging-sample-matrix |
When a node shuts down, `TransportService` moves to stopped state and then closes connections. If a request is done in between, an exception was thrown that was not retried in replication actions. Now throw a wrapped `NodeClosedException` exception instead, which is correctly handled in replication action. Fixed other usages too. Relates elastic#42612
When a node shuts down, `TransportService` moves to stopped state and then closes connections. If a request is done in between, an exception was thrown that was not retried in replication actions. Now throw a wrapped `NodeClosedException` exception instead, which is correctly handled in replication action. Fixed other usages too. Relates #42612
When a node shuts down,
TransportServicemoves to stopped state andthen closes connections. If a request is done in between, an exception
was thrown that was not retried in replication actions. Now throw a
wrapped
NodeClosedExceptionexception instead, which is correctlyhandled in replication action. Fixed other usages too.
Relates #42612