Skip to content

index shard should be able to cancel check index on close.#18839

Merged
bleskes merged 1 commit intoelastic:masterfrom
bleskes:index_shard_cancel_on_close
Jun 20, 2016
Merged

index shard should be able to cancel check index on close.#18839
bleskes merged 1 commit intoelastic:masterfrom
bleskes:index_shard_cancel_on_close

Conversation

@bleskes
Copy link
Copy Markdown
Contributor

@bleskes bleskes commented Jun 13, 2016

If someone sets index.shard.check_on_startup, indexing start up time can be slow (by design, it diligently goes and checks all data). If for some reason the shard is closed in that time, the store ref is kept around and prevents a new shard copy to be allocated to this node via the shard level locks. This is especially tricky if the shard is close due to a cancelled recovery which may re-restart soon.

This PR adds a cancellable threads instance to each IndexShard and perform index checking underneath it, so it can be cancelled on close. This assumes that:

  1. Interrupting a checkIndex is safe.
  2. Interrupting a checkIndex will actually make it stop.

Relates to #12011

PS. We are also discussing not doing a full index check on peer recovery but rather just do a checksum check.

@bleskes
Copy link
Copy Markdown
Contributor Author

bleskes commented Jun 13, 2016

@s1monw wdyt?

@clintongormley clintongormley added >enhancement :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Jun 14, 2016
@s1monw
Copy link
Copy Markdown
Contributor

s1monw commented Jun 20, 2016

LGTM

@bleskes bleskes merged commit 0cb4e57 into elastic:master Jun 20, 2016
@bleskes bleskes deleted the index_shard_cancel_on_close branch June 20, 2016 10:16
bleskes added a commit that referenced this pull request Jun 21, 2016
```
   > Throwable #1: java.lang.RuntimeException: file handle leaks: [SeekableByteChannel(/var/lib/jenkins/workspace/elastic+elasticsearch+master+g1gc/core/build/testrun/integTest/J0/temp/org.elasticsearch.search.suggest.CompletionSuggestSearch2xIT_518545A20D129C8C-001/tempDir-001/data/nodes/1/indices/4sTECv6WSJOJsw9L4CGamg/0/index/segments_1), SeekableByteChannel(/var/lib/jenkins/workspace/elastic+elasticsearch+master+g1gc/core/build/testrun/integTest/J0/temp/org.elasticsearch.search.suggest.CompletionSuggestSearch2xIT_518545A20D129C8C-001/tempDir-001/data/nodes/1/indices/4sTECv6WSJOJsw9L4CGamg/0/index/segments_1)]
   > 	at __randomizedtesting.SeedInfo.seed([518545A20D129C8C]:0)
   > 	at org.apache.lucene.mockfile.LeakFS.onClose(LeakFS.java:63)
   > 	at org.apache.lucene.mockfile.FilterFileSystem.close(FilterFileSystem.java:77)
   > 	at org.apache.lucene.mockfile.FilterFileSystem.close(FilterFileSystem.java:78)
   > 	at java.lang.Thread.run(Thread.java:745)
   > Caused by: java.lang.Exception
   > 	at org.apache.lucene.mockfile.LeakFS.onOpen(LeakFS.java:46)
   > 	at org.apache.lucene.mockfile.HandleTrackingFS.callOpenHook(HandleTrackingFS.java:81)
   > 	at org.apache.lucene.mockfile.HandleTrackingFS.newByteChannel(HandleTrackingFS.java:271)
   > 	at org.apache.lucene.mockfile.FilterFileSystemProvider.newByteChannel(FilterFileSystemProvider.java:212)
   > 	at org.apache.lucene.mockfile.HandleTrackingFS.newByteChannel(HandleTrackingFS.java:240)
   > 	at java.nio.file.Files.newByteChannel(Files.java:361)
   > 	at java.nio.file.Files.newByteChannel(Files.java:407)
   > 	at org.apache.lucene.store.SimpleFSDirectory.openInput(SimpleFSDirectory.java:77)
   > 	at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:94)
   > 	at org.apache.lucene.util.LuceneTestCase.slowFileExists(LuceneTestCase.java:2695)
   > 	at org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:737)
   > 	at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:94)
   > 	at org.elasticsearch.common.lucene.Lucene$1.doBody(Lucene.java:237)
   > 	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:685)
   > 	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:637)
   > 	at org.elasticsearch.common.lucene.Lucene.checkSegmentInfoIntegrity(Lucene.java:242)
   > 	at org.elasticsearch.index.store.Store$MetadataSnapshot.loadMetadata(Store.java:847)
   > 	at org.elasticsearch.index.store.Store$MetadataSnapshot.<init>(Store.java:740)
   > 	at org.elasticsearch.index.store.Store.getMetadata(Store.java:260)
   > 	at org.elasticsearch.index.store.Store.getMetadata(Store.java:240)
   > 	at org.elasticsearch.index.shard.IndexShard.doCheckIndex(IndexShard.java:1310)
   > 	at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:102)
   > 	at org.elasticsearch.index.shard.IndexShard.checkIndex(IndexShard.java:1288)
   > 	at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:921)
   > 	at org.elasticsearch.index.shard.IndexShard.skipTranslogRecovery(IndexShard.java:964)
   > 	at org.elasticsearch.indices.recovery.RecoveryTarget.prepareForTranslogOperations(RecoveryTarget.java:297)
   > 	at
   ```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement resiliency v5.0.0-alpha4

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants