Add profiling and documentation for dfs phase#90536
Conversation
|
Documentation preview: |
|
Pinging @elastic/es-search (Team:Search) |
|
Hi @jdconrad, I've created a changelog YAML for you. |
| As an example, let's first setup an index with multiple shards and index | ||
| a pair of documents with different values on a keyword field. | ||
|
|
||
| [source,console] |
There was a problem hiding this comment.
If you do [source,console,id=profile_dfs] this'll name the yaml test it makes and make an invisible but linkable anchor tag in the docs. I've been naming everything lately because it makes the error messages when the test fails nicer.
| ... | ||
| } | ||
| -------------------------------------------------- | ||
| // NOTCONSOLE |
There was a problem hiding this comment.
You can make this [source,console-result] with a bit of hacking to replace the .... Something like
// TESTRESPONSE[s/: \{\.\.\.\}/: $body.$_path/]
// TESTRESPONSE[s/: (\-)?[0-9]+/: $body.$_path/]
might do it, depending on how much you are willing to fight with the test generator. The advantage of doing this is that it'll fail if the names of the results change so it'll force us to keep at least the json keys up to date.
There was a problem hiding this comment.
@nik9000 Thank you for walking me through and creating the appropriate test responses with me!
| [[profiling-knn]] | ||
| ===== Profiling kNN | ||
|
|
||
| A k-nearest neighbor (kNN) search runs as part of the dfs phase. To |
There was a problem hiding this comment.
Maybe should link to <<approximate-knn>>?
| ordinals (an internal data structure used to speed up search). | ||
| - Profiling statistics are currently not available for suggestions, | ||
| highlighting, `dfs_query_then_fetch`. | ||
| highlighting |
There was a problem hiding this comment.
Could you remove highlighting as well? Profiling is available for highlighting as part of the fetch phase work I did a while back. I just never noticed this line.
rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/search/370_profile.yml
Show resolved
Hide resolved
| - is_true: profile.shards.0.dfs.statistics.breakdown | ||
| - match: { profile.shards.1.dfs.statistics.type: "statistics" } | ||
| - match: { profile.shards.1.dfs.statistics.description: "collect term statistics" } | ||
| - gt: { profile.shards.1.dfs.statistics.time_in_nanos: 0 } |
There was a problem hiding this comment.
I think asserting that both shards have time > 0 relies on the hashing of the _ids landing on different shards, right? Maybe worth a comment, but ok, I guess.
There was a problem hiding this comment.
Could this cause test flakes though, in cases where all docs happen to be hashed to the same shard?
There was a problem hiding this comment.
My understanding is which shards the docs are on shouldn't matter as we still need to hit each shard to collect dfs information, so each shard will have a dfs profile even if no info is ultimately aggregated. (Am I missing some caching or shard skipping in this case?)
I have updated the test to change/add the following:
- changed gt to gte for time_in_nanos
- updated the test to add a no documents case where we still expect to get dfs profile info
- updated the test to add a single document case where we still expect to get dfs profile info
- updated the test to add a three document case where two share the same keyword and we still expect to get dfs profile info
- updated the test to ensure we get no profile info if search type is
query_then_fetch
There was a problem hiding this comment.
👍 this makes sense. For me maybe we don't need both tests (3) and (4) since having the same keyword shouldn't affect shard routing, or whether we use DFS?
There was a problem hiding this comment.
I removed 3 and 4 since they are extraneous as you mentioned.
| opens org.elasticsearch.common.logging to org.apache.logging.log4j.core; | ||
|
|
||
| exports org.elasticsearch.search.profile.dfs; | ||
|
|
There was a problem hiding this comment.
What's this bit for? Should this be in alphabetic order?
There was a problem hiding this comment.
Not sure why spotless didn't pick this up. I moved it to the appropriate location and it mimics the behavior of the rest of the of profile package to allow for other modules to use these classes.
server/src/main/java/org/elasticsearch/search/profile/SearchProfileDfsPhaseResult.java
Show resolved
Hide resolved
| QueryProfileShardResult queryProfileShardResult = profiledVectorQuery | ||
| ? new QueryProfileShardResult(queryProfiler.getTree(), queryProfiler.getRewriteTime(), queryProfiler.getCollector()) | ||
| : null; | ||
| return new SearchProfileDfsPhaseResult(dfsProfileResult, queryProfileShardResult); |
There was a problem hiding this comment.
Is there any way to use the "emptiness" of some object to infer this? Like, if the breakdown map is empty it's all the statistics? I don't know that that's a perfect thing, but it might make this easier for someone to read in nine months.
There was a problem hiding this comment.
I agree on this being a bit confusing. I'll give it some more thought on a better way to work with this existing. It's a bit tricky because of what things are available at what times.
There was a problem hiding this comment.
+1 it'd be nice to avoid these mutable booleans like profiledVectorQuery if possible (but I haven't looked into it deeply how we could do that!)
There was a problem hiding this comment.
I tried to clean this up a bit. If the dfs phase runs, we always attempt to collect term stats, so there's no reason to check anything. For knn, we have to set the collector on the profiler, so I track whether or not we've set the collector in DfsProfiler to determine if we should have a knn profile section.
| - is_true: profile.shards.0.dfs.statistics.breakdown | ||
| - match: { profile.shards.1.dfs.statistics.type: "statistics" } | ||
| - match: { profile.shards.1.dfs.statistics.description: "collect term statistics" } | ||
| - gt: { profile.shards.1.dfs.statistics.time_in_nanos: 0 } |
There was a problem hiding this comment.
Could this cause test flakes though, in cases where all docs happen to be hashed to the same shard?
| QueryProfileShardResult queryProfileShardResult = profiledVectorQuery | ||
| ? new QueryProfileShardResult(queryProfiler.getTree(), queryProfiler.getRewriteTime(), queryProfiler.getCollector()) | ||
| : null; | ||
| return new SearchProfileDfsPhaseResult(dfsProfileResult, queryProfileShardResult); |
There was a problem hiding this comment.
+1 it'd be nice to avoid these mutable booleans like profiledVectorQuery if possible (but I haven't looked into it deeply how we could do that!)
|
@nik9000 @jtibshirani Thank you both for the review. I will work on addressing all of the feedback. |
|
@nik9000 @jtibshirani I think I have addressed all of the given feedback, so this is ready for another round of review. |
jtibshirani
left a comment
There was a problem hiding this comment.
This looks good to me too, I just left some tiny last comments.
| the dfs phase. | ||
|
|
||
| The following is an example of setting `profile` to `true` on a search | ||
| that has a knn section: |
There was a problem hiding this comment.
Tiny comment, we usually style the names of API parameters as literals ("knn section").
| the of timings for <<query-section, query>>, <<rewrite-section, rewrite>>, | ||
| and <<collectors-section, collector>>. Unlike many other queries, kNN | ||
| search does the bulk of the work during the query rewrite. This means | ||
| rewrite_time represents the time spent on kNN search. |
There was a problem hiding this comment.
Same comment here, we should say rewrite_time.
| highlighting, `dfs_query_then_fetch`. | ||
| - Profiling statistics are currently not available for suggestions. | ||
| - Profiling of the reduce phase of aggregation is currently not available. | ||
| - The Profiler is instrumenting internals that can change from version to |
There was a problem hiding this comment.
Just noticed this existing comment and glad we have it -- it will let us tweak the DFS output format if needed!
| - is_true: profile.shards.0.dfs.statistics.breakdown | ||
| - match: { profile.shards.1.dfs.statistics.type: "statistics" } | ||
| - match: { profile.shards.1.dfs.statistics.description: "collect term statistics" } | ||
| - gt: { profile.shards.1.dfs.statistics.time_in_nanos: 0 } |
There was a problem hiding this comment.
👍 this makes sense. For me maybe we don't need both tests (3) and (4) since having the same keyword shouldn't affect shard routing, or whether we use DFS?
|
@elasticmachine retest this please |
* main: (150 commits) Remove ToXContent interface from ChunkedToXContent (elastic#90409) Remove extra SearchService constructor (elastic#90733) Update min version for the diagnosis yaml test (elastic#90731) Use the AggTestConfig object in testCase (elastic#90699) [DOCS] Add links to clear trained model deployment cache API (elastic#90727) Assert wildcards are not expanded as specified by request options (elastic#90641) [TEST] Fix exit snapshot restore exit condition (elastic#90696) [TEST] Change to atomic file contents save (elastic#90695) Update forbiddenapis to 3.4 (elastic#90624) [Tests] Don't use concurrent search in scripted field type tests (elastic#90712) [ML] Move scaling is possible check for starting trained model (elastic#90706) Add new base test case for chunked xcontent types (elastic#90707) Fix testRedNoBlockedIndicesAndRedAllRoleNodes (elastic#90671) Fix nullpointer in docs test setup (elastic#90660) Don't produce build logs artifact when in a composite build Fixing a race condition in EnrichCoordinatorProxyAction that can leave an item stuck in its queue (elastic#90688) docs: update fleet/agent pipeline docs (elastic#90659) [HealthAPI] Use plural consistently in resource types (elastic#90682) [Testing] Enable bwc and fix sorting for 500_date_range (elastic#90681) Add profiling and documentation for dfs phase (elastic#90536) ... # Conflicts: # x-pack/plugin/mapper-aggregate-metric/src/test/java/org/elasticsearch/xpack/aggregatemetric/mapper/AggregateDoubleMetricFieldMapperTests.java
Introduced in: #90536 Profiling for DFS has had its timing numbers looking weird, additionally, it would trigger some assertion failures because `timer.start()` was called without a `stop()` in between. The key issue was around query `weight` creation. `Weight` creation could be called recursively, thus calling `start` on the timer more than once before calling stop.
Introduced in: elastic#90536 Profiling for DFS has had its timing numbers looking weird, additionally, it would trigger some assertion failures because `timer.start()` was called without a `stop()` in between. The key issue was around query `weight` creation. `Weight` creation could be called recursively, thus calling `start` on the timer more than once before calling stop.
Introduced in: #90536 Profiling for DFS has had its timing numbers looking weird, additionally, it would trigger some assertion failures because `timer.start()` was called without a `stop()` in between. The key issue was around query `weight` creation. `Weight` creation could be called recursively, thus calling `start` on the timer more than once before calling stop.
This change adds profiling statistics for the dfs phase that look like the following:
This change also adds documentation for both the above dfs phase profiling and kNN profiling.
Closes #89713