Always use DirectoryReader for realtime get from translog #74722

Merged
ywelsch merged 7 commits into elastic:master from ywelsch:wrap-translog-reader
Jul 1, 2021

Contributor

@ywelsch ywelsch commented Jun 29, 2021

Reading from the translog during a realtime get requires special handling in some higher-level components, e.g. ShardGetService, which uses a number of tricks to extract other stored fields from the source. Another issue with the current approach relates to #74227, which introduces a new "field usage tracking" directory wrapper that is always applied: we want realtime gets from the translog to stay fast, without creating an in-memory index of the document, even when this directory wrapper exists.

This PR introduces a directory reader that contains a single translog indexing operation. It can be used during a realtime get to access documents that haven't been refreshed yet. In the normal case, all information needed to resolve the realtime get is mocked out to provide fast access to _id and _source. When more values are requested (e.g. access to other stored fields), the reader indexes the document into an in-memory Lucene segment that is created on demand.
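The fast-path/fallback split described above can be illustrated with a minimal, dependency-free sketch. This is not the reader added by this PR: the class name `SingleOperationReaderSketch`, the `storedField` method, and the toy `key=value;key=value` source format are all hypothetical, and a plain `Map` stands in for the in-memory Lucene segment.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a reader over a single operation; not the Elasticsearch class. */
final class SingleOperationReaderSketch {
    private final String id;
    private final String source;
    // Stand-in for "index the document into an in-memory Lucene segment".
    private final Function<String, Map<String, String>> indexer;
    private Map<String, String> inMemoryFields; // built lazily, on first slow-path access
    private int buildCount = 0;                 // how often the expensive path was taken

    SingleOperationReaderSketch(String id, String source,
                                Function<String, Map<String, String>> indexer) {
        this.id = id;
        this.source = source;
        this.indexer = indexer;
    }

    /** Fast path for _id and _source; every other field triggers the lazy build. */
    String storedField(String name) {
        if ("_id".equals(name)) {
            return id;
        }
        if ("_source".equals(name)) {
            return source;
        }
        return delegate().get(name);
    }

    private Map<String, String> delegate() {
        if (inMemoryFields == null) {
            inMemoryFields = indexer.apply(source);
            buildCount++;
        }
        return inMemoryFields;
    }

    int buildCount() {
        return buildCount;
    }

    /** Toy parser for a "key=value;key=value" source string (an assumption of this sketch). */
    static Map<String, String> parseToyFields(String source) {
        Map<String, String> fields = new HashMap<>();
        for (String pair : source.split(";")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2) {
                fields.put(kv[0].trim(), kv[1].trim());
            }
        }
        return fields;
    }
}
```

In this sketch, asking only for `_id` or `_source` never invokes the indexer, while the first request for any other field builds the in-memory structure once and reuses it afterwards.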

Relates #64504

@ywelsch ywelsch added the >non-issue and :Distributed/Engine labels Jun 29, 2021
@ywelsch ywelsch requested a review from dnhatn June 29, 2021 21:43
  IndexReader.CacheHelper cacheHelper = reader.getReaderCacheHelper(); // this one takes deletes into account
  if (cacheHelper == null) {
-     throw new IllegalStateException("Reader " + reader + " does not support caching");
+     return computeNumDocs(reader, roleQueryBits);
Contributor Author

This change was necessary because DocumentSubsetReader was a bad citizen: it required the LeafReader it wrapped to have a reader cache helper, yet it did not expose a reader cache helper on its own wrapping LeafReader. That is a double standard.

I don't see a reason for this reader not to work when caching isn't available (the single-document case).
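The behavior argued for here (compute the value directly when no cache key is available, otherwise cache it) can be sketched without Lucene. The names `NumDocsCacheSketch`, `numDocs`, and `cachedEntries` are hypothetical, and a plain `Object` stands in for whatever key a reader cache helper would provide; this is only an illustration of the pattern, not the DocumentSubsetReader code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntSupplier;

/** Hypothetical sketch: cache a per-reader doc count when possible, else compute it each time. */
final class NumDocsCacheSketch {
    private final Map<Object, Integer> cache = new ConcurrentHashMap<>();

    /**
     * @param cacheKey stand-in for a reader cache key; null means caching is unsupported
     * @param compute  the (potentially expensive) computation of the doc count
     */
    int numDocs(Object cacheKey, IntSupplier compute) {
        if (cacheKey == null) {
            // No caching support: fall back to computing directly instead of throwing.
            return compute.getAsInt();
        }
        return cache.computeIfAbsent(cacheKey, k -> compute.getAsInt());
    }

    int cachedEntries() {
        return cache.size();
    }
}
```

For a single-document reader the direct computation is cheap, so skipping the cache costs little in that case.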

@ywelsch ywelsch requested a review from henningandersen June 30, 2021 09:33
@ywelsch ywelsch marked this pull request as ready for review June 30, 2021 11:14
@elasticmachine elasticmachine added the Team:Distributed label Jun 30, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Member

@dnhatn dnhatn left a comment

LGTM.

Contributor

@henningandersen henningandersen left a comment

LGTM.

All comments below are optional at your discretion.

@ywelsch ywelsch merged commit 90e663b into elastic:master Jul 1, 2021
ywelsch added a commit that referenced this pull request Jul 1, 2021
Labels

:Distributed/Engine, >non-issue, Team:Distributed, v8.0.0-alpha1

6 participants