Close translog writer if exception on write channel#29401
Close translog writer if exception on write channel#29401jasontedor merged 3 commits intoelastic:masterfrom
Conversation
Today we close the translog write tragically if we experience any I/O exception on a write. These tragic closes lead to use closing the translog and failing the engine. Yet, there is one case that is missed which is when we touch the write channel during a read (checking if reading from the writer would put us past what has been flushed). This commit addresses this by closing the writer tragically if we encounter an I/O exception on the write channel while reading. This becomes interesting when we consider that this method is invoked from the engine through the translog as part of getting a document from the translog. This means we have to consider closing the translog here as well which will cascade up into us finally failing the engine. Note that there is no semantic change to, for example, primary/replica resync and recovery. These actions will take a snapshot of the translog which syncs the translog to disk. If an I/O exception occurs during the sync we already close the writer tragically and once we have synced we do not ever read past the position that was synced while taking the snapshot.
|
Pinging @elastic/es-distributed |
|
This addresses the test failure in #29390 exactly because the test is asserting that the translog is closed after a tragic event occurred on the writer but because of the missed handling of an I/O exception on the write channel in the read method, the translog will not be closed by the random readOperation that was added to TranslogThread#run. |
ywelsch
left a comment
There was a problem hiding this comment.
I've left one ask (and one optional ask)
| return current.read(location); | ||
| } catch (final IOException e) { | ||
| closeOnTragicEvent(e); | ||
| throw e; |
There was a problem hiding this comment.
The other calls to closeOnTragicEvent are all of the form
try {
closeOnTragicEvent(ex);
} catch (final Exception inner) {
ex.addSuppressed(inner);
}
throw ex;
which does not make any sense btw when you look at closeOnTragicEvent which already takes care of exceptions that happen during closing. Can you unify the code in this class?
There was a problem hiding this comment.
Okay, I would like to do this in an immediate follow-up to this one. For now I pushed: f9a88eb
| } | ||
| } catch (final IOException e) { | ||
| closeWithTragicEvent(e); | ||
| throw e; |
There was a problem hiding this comment.
Here, I think you should adopt the pattern we use throughout the rest of the class, i.e.,
try {
closeWithTragicEvent(e);
} catch (Exception inner) {
ex.addSuppressed(inner);
}
throw e;
so that we get the original cause.
There was a problem hiding this comment.
Bah! Thrown off by the fact that the method this is in declares throws IOException. I will address.
|
@ywelsch I pushed. I want to refactor the exception handling for |
Today we close the translog write tragically if we experience any I/O exception on a write. These tragic closes lead to use closing the translog and failing the engine. Yet, there is one case that is missed which is when we touch the write channel during a read (checking if reading from the writer would put us past what has been flushed). This commit addresses this by closing the writer tragically if we encounter an I/O exception on the write channel while reading. This becomes interesting when we consider that this method is invoked from the engine through the translog as part of getting a document from the translog. This means we have to consider closing the translog here as well which will cascade up into us finally failing the engine. Note that there is no semantic change to, for example, primary/replica resync and recovery. These actions will take a snapshot of the translog which syncs the translog to disk. If an I/O exception occurs during the sync we already close the writer tragically and once we have synced we do not ever read past the position that was synced while taking the snapshot.
Today we close the translog write tragically if we experience any I/O exception on a write. These tragic closes lead to use closing the translog and failing the engine. Yet, there is one case that is missed which is when we touch the write channel during a read (checking if reading from the writer would put us past what has been flushed). This commit addresses this by closing the writer tragically if we encounter an I/O exception on the write channel while reading. This becomes interesting when we consider that this method is invoked from the engine through the translog as part of getting a document from the translog. This means we have to consider closing the translog here as well which will cascade up into us finally failing the engine.
Note that there is no semantic change to, for example, primary/replica resync and recovery. These actions will take a snapshot of the translog which syncs the translog to disk. If an I/O exception occurs during the sync we already close the writer tragically and once we have synced we do not ever read past the position that was synced while taking the snapshot.
Closes #29390