Add a NoopEngine implementation #31163
dakrone merged 23 commits into elastic:closed-index-replication
Conversation
This adds a new Engine implementation that does... nothing. Any operation throws an `UnsupportedOperationException` when attempted. This engine is intended as a building block for replicated closed indices in subsequent code changes. Relates to elastic#31141
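The core idea - an engine whose operations all fail fast - can be sketched in plain Java. This is a simplified, hypothetical illustration of the pattern, not the actual Elasticsearch `Engine` API; the interface and method names here are invented for the sketch:

```java
// Hypothetical sketch of the no-op engine pattern: the real Elasticsearch
// Engine has a much larger surface, but the essence is that every operation
// throws UnsupportedOperationException instead of doing work.
interface Engine {
    void index(String doc);
    void delete(String id);
}

final class NoopEngine implements Engine {
    @Override
    public void index(String doc) {
        throw new UnsupportedOperationException("indexing is not supported on a no-op engine");
    }

    @Override
    public void delete(String id) {
        throw new UnsupportedOperationException("deletes are not supported on a no-op engine");
    }
}
```

Callers that reach a closed index then get a clear failure rather than silently dropped operations.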
Pinging @elastic/es-distributed
jasontedor left a comment

I left some high-level comments. I have not done a careful review.
```java
/**
 * Returns true if the engine is a noop engine
 */
public abstract boolean isNoopEngine();
```
This feels wrong to me. It looks like it's only used for testing? Is this really needed? We did not feel the need to add something like this for the following engine in CCR.
Sorry, this snuck in from extracting this from my other branch (where it is actually used). I'll remove it for now and re-add it for the next set of work.
```java
}

@Override
public Engine newNoopEngine(EngineConfig config) {
```
Can you take a look at how we handle the following engine in CCR and see if that extension point makes sense instead of adding this here? It feels conceptually wrong that the internal engine factory returns something other than an internal engine.
Do we want to make this a plugin? I was thinking built-in, but I guess we could move it to a module if we wanted to. I wasn't sure whether it's a good idea to do engine implementations inside of modules.
Looking at this again, it appears that EnginePlugin is part of the CCR branch. Are there plans to factor it out into a separate PR that goes into master?
I think it would be great if we could make it a module; I am always happy to see isolation when it makes sense, and here I think it can. Note that CCR has an engine implementation inside a module. 😇

We talked about the possibility of reusing the engine plugin work in the context of closed indices many months ago, so we could bring that work out of the CCR branch and into 6.x/master.
Okay, I've removed the changes to the EngineFactory; I can use the pluggable one for the next PR (since this one doesn't hook the engine in yet).
```java
@Override
public void restoreLocalCheckpointFromTranslog() {
```

```java
Translog translog = null;
```

```java
// The deletion policy for the translog should not keep any translogs around, so the min age/size is set to -1
final TranslogDeletionPolicy translogDeletionPolicy = new TranslogDeletionPolicy(-1, -1);
```
can we move this into the try block please?
```java
private final IndexCommit lastCommit;
private final LocalCheckpointTracker localCheckpointTracker;
private final String historyUUID;
private SegmentInfos lastCommittedSegmentInfos;
```

```java
 * Directory so that the last commit's user data can be read for the historyUUID
 * and last committed segment info.
 */
public class NoopEngine extends Engine {
```
can this class be final and maybe pkg private?
Sure, though I'll un-final it when I work on the second half of this; it can be final for now :)
```java
@Override
public CommitId flush() throws EngineException {
    try {
        translog.rollGeneration();
```
This looks wrong. We don't write anything here, so why do we need to modify the translog? I think this should be read-only.
In order for flushing to clear existing translogs (because we don't want any translog operations to be replayed for peer or store recovery), we want the flush method to remove the translog. This was added so that flushing the new engine ensures we don't have any translog operations around that could cause UnsupportedOperationExceptions during recovery.
I can remove this for now and re-introduce it later when adding the state-transition part, if that makes it better, but we still need to be able to completely remove translog ops before doing recovery, since we have no way to do operation-based recovery.

What do you think?
Yeah, I'd like to remove it for now; I can't see in this change why it's needed.
Okay, I've removed that change from this PR
```java
translogDeletionPolicy.setTranslogGenerationOfLastCommit(lastGen);
translogDeletionPolicy.setMinTranslogGenerationForRecovery(lastGen);
```

```java
localCheckpointTracker = createLocalCheckpointTracker();
```
We don't really need a local checkpoint tracker here. Instead, can we work on master to not expose the localCheckpointTracker out of the engine (similar to how we don't expose the translog), and then we can avoid creating it? We should assert that maxSeqNo == localCheckpoint when opening Lucene, and otherwise fail.
```java
final TranslogDeletionPolicy translogDeletionPolicy = new TranslogDeletionPolicy(-1, -1);
```

```java
lastCommittedSegmentInfos = store.readLastCommittedSegmentsInfo();
translog = openTranslog(engineConfig, translogDeletionPolicy, engineConfig.getGlobalCheckpointSupplier());
```
Do we really need a translog? All we want is to validate that the translog has the right UUID and that it's empty?
We use the translog all over the place in other methods in Engine. We can try to encapsulate all of these into different methods (similar to #31213), but I'm not sure what benefit we'd actually get from that. In the next iteration we'll need the translog because we'll need to sync and flush it on engine opening so that there are no operations to be replayed during peer recovery (which I think will still need a translog to retrieve stats about the number of ops).
> I'm not sure what benefit we'd actually get from that

I looked at these methods and I agree it's on the edge - there are quite a few. I would still prefer we replace them with unsupported operations or no-op returns (e.g. empty snapshots). This will lock things down and prevent things that aren't supposed to happen - I think that's good. An alternative is to implement a NoopTranslog, but that's another rabbit hole.

> we'll need to sync and flush it on engine opening

Why is that? I think it's good to only close indices that have no ongoing indexing (like our plan for frozen indices). Regardless - why can't we do the flush/trim when we close the open engine and convert it to a noop engine?

I can see one thing down the road: we may close an index on recovery where it has broken settings (TBD). In that case I would still prefer to make utility methods like Store#associateIndexWithNewTranslog that work on the folder. Note that this problem isn't solved even if we keep the translog open, as we can't index operations from it into the Lucene index with the NoopEngine, nor ship them to a recovering shard with a NoopEngine in it. We assume those don't exist. I think we should discuss this separately.
> I think it's good to only close indices that have no ongoing indexing (like our plan for frozen index).

Yes, but even with no ongoing indexing, a translog still remains (due to the retention policy).

> Regardless - why can't we do the flush / trim when we close the open engine and convert it to a noop engine?

That would require setting a new retention policy on an existing engine (making part of InternalEngine mutable, which makes me :(). In the future, though, we could set the retention policy, sync, and flush/trim when the NoopEngine is opened, and then immediately close the translog.

In order to do this, though, we'll have to remove the getTranslog method from Engine. Is that something you want me to do as a precursor to this?
```java
@Override
public boolean ensureTranslogSynced(Stream<Translog.Location> locations) {
    return false;
```
I think this can throw an unsupported operation exception too?
```java
}

@Override
public Translog getTranslog() {
```
@jasontedor @bleskes @s1monw I've removed almost all the pieces from this that were deemed unnecessary. Can you take another look please?
bleskes left a comment

Looks great, thanks for the extra iteration. I left some minor asks.
```java
lastCommit = indexCommits.get(indexCommits.size()-1);
historyUUID = lastCommit.getUserData().get(HISTORY_UUID_KEY);
// We don't want any translogs hanging around for recovery, so we need to set these accordingly
final long lastGen = Long.parseLong(lastCommit.getUserData().get(Translog.TRANSLOG_GENERATION_KEY));
```
This is not used?

Also - I think it's there to validate integrity - i.e. open the translog, see that its UUID matches, see that it's empty, and shut it down?
Okay, I pushed a commit removing the unused lastGen and open-closing the translog for validation purposes.
```java
@Override
public SeqNoStats getSeqNoStats(long globalCheckpoint) {
    return new SeqNoStats(0, 0, 0);
```
use the incoming globalCheckpoint?
Wouldn't that violate the assertion that global checkpoint is the minimum of all shards' local checkpoints?
That would, but we don't check this here. Also - I think we should load the stats from the last commit point, not just 0 (both the max seq no and the local checkpoint are stored in the commit).
bleskes left a comment

I did another round. Sorry for not thinking about these in the first run.
```java
final TranslogDeletionPolicy translogDeletionPolicy = new TranslogDeletionPolicy(-1, -1);
```

```java
// The translog is opened and closed to validate that the translog UUID from lucene is the same as the one in the translog
Translog translog = openTranslog(engineConfig, translogDeletionPolicy, engineConfig.getGlobalCheckpointSupplier());
```
```java
}

@Override
public void resetLocalCheckpoint(long localCheckpoint) {
```
can you assert this is equal to getLocalCheckpoint()?
```java
}

@Override
public long getLocalCheckpoint() {
```
can you return the local checkpoint stored in the commit?
```java
public class NoopEngineTests extends EngineTestCase {
    private static final IndexSettings INDEX_SETTINGS = IndexSettingsModule.newIndexSettings("index", Settings.EMPTY);
```
Can you also add a test that creates a normal engine, adds some data, flushes, closes, and then opens a noop engine, and verify that all the stats make sense and that getting a Lucene snapshot returns the expected data (doc counts etc.)?
> getting a lucene snapshot returns the expected data (doc counts etc)?

Do you mean from newTranslogSnapshotFromMinSeqNo? I've been returning an empty snapshot for this with a docCount of 0 (and no operations, because we have no way of returning operations), so it won't match the doc count that a snapshot from a regular engine would have.
I meant this. The standard engine ends up returning a SnapshotIndexCommit, which is why I used the term Lucene snapshot. I hope this is clearer.
Yep, thanks for clarifying; this was added as a test.
@bleskes thanks for taking another look, I think I addressed all your feedback
```java
@@ -98,6 +98,9 @@ public void close() {

// The translog is opened and closed to validate that the translog UUID from lucene is the same as the one in the translog
Translog translog = openTranslog(engineConfig, translogDeletionPolicy, engineConfig.getGlobalCheckpointSupplier());
```
Yes, good catch. I thought about that last night but forgot to do it; I'll update that.
```java
public void testNoopAfterRegularEngine() throws IOException {
    int docs = randomIntBetween(1, 10);
    ReplicationTracker tracker = (ReplicationTracker) engine.config().getGlobalCheckpointSupplier();
```
Why is this needed? We typically just write to the engine?
This is a rabbit hole @jasontedor and I got into last night. In order to completely flush and remove the translog, we need a deletion policy where we can advance the generation needed for recovery. In order to advance that, we have to have the tracker working correctly (which I guess is normally managed at the IndexShard level), so this allows us to advance global checkpoints as needed so we can remove the translog and open a new noop engine (otherwise the noop engine would throw an exception about there being translog operations when being opened).
We can pass the global checkpoint as a LongSupplier to an engine.

```java
final AtomicLong globalCheckpoint = new AtomicLong(SequenceNumbers.NO_OPS_PERFORMED);
try (Store store = createStore();
     InternalEngine engine = createEngine(store, createTempDir(), globalCheckpoint::get)) {
    engine.syncTranslog(); // Make sure that the global checkpoint is persisted to the translog's checkpoint
}
```

```java
List<IndexCommit> indexCommits = DirectoryReader.listCommits(store.directory());
lastCommit = indexCommits.get(indexCommits.size()-1);
historyUUID = lastCommit.getUserData().get(HISTORY_UUID_KEY);
localCheckpoint = Long.parseLong(lastCommit.getUserData().get(SequenceNumbers.LOCAL_CHECKPOINT_KEY));
```
This is not yet relevant to this PR, but these will have to be the same IMO.
jasontedor left a comment

I left some comments/questions.
```java
try {
    lastCommittedSegmentInfos = store.readLastCommittedSegmentsInfo();
    List<IndexCommit> indexCommits = DirectoryReader.listCommits(store.directory());
    lastCommit = indexCommits.get(indexCommits.size()-1);
```
Formatting nit: `indexCommits.size()-1` -> `indexCommits.size() - 1`
```java
// The deletion policy for the translog should not keep any translogs around, so the min age/size is set to -1
final TranslogDeletionPolicy translogDeletionPolicy = new TranslogDeletionPolicy(-1, -1);
```

```java
// The translog is opened and closed to validate that the translog UUID from lucene is the same as the one in the translog
```
Can we have a test case for this?
```java
final NoopEngine engine = new NoopEngine(noopConfig(INDEX_SETTINGS, store, primaryTranslogDir));
expectThrows(UnsupportedOperationException.class, () -> engine.index(null));
expectThrows(UnsupportedOperationException.class, () -> engine.delete(null));
expectThrows(UnsupportedOperationException.class, () -> engine.noOp(null));
```
As a follow-up can we have:

- `Noop` -> `NoOp`
- `noop` -> `no-op` (when written in exception messages to the user, and code comments)
- `noopEngine` -> `noOpEngine`
```java
public void testTwoNoopEngines() throws IOException {
    engine.close();
    // It's so noop you can even open two engines for the same store without tripping anything
```
Is this something that we are going to rely on?
Probably not? I just thought it was a good test that we aren't acquiring locks or locking files accidentally, or anything like that.
Okay, maybe the test name should be clearer, or at least a comment of its purpose?
@jasontedor I addressed your comments (I'll do the rename in a follow-up)
@elasticmachine run sample packaging tests please
In elastic#31163 (comment) Jason requested renaming NoopEngine to NoOpEngine. This commit renames the class and relevant parts. Relates to elastic#31141

This commit uses the NoOp engine introduced in elastic#31163 for closed indices. Instead of being removed from the routing table and essentially "dropped" (but not deleted), closed indices are now replicated the same way open indices are. Relates to elastic#31141