Skip to content

improve aggregate_metric_double parsing time#80685

Merged
csoulios merged 30 commits intoelastic:mainfrom
weizijun:number-support-single-type
Sep 8, 2022
Merged

improve aggregate_metric_double parsing time#80685
csoulios merged 30 commits intoelastic:mainfrom
weizijun:number-support-single-type

Conversation

@weizijun
Copy link
Copy Markdown
Contributor

when there are many aggregate_metric_double type fields, like hundreds, writing index is slow. I perf record the flame graph. and find that, It's about 68% time in LuceneDocument.getField.

Why LuceneDocument.getField cost so many time? Because in the aggregate_metric_double parse logic. It will check if the field is single, it will call LuceneDocument.getField to check multi values.

and the LuceneDocument.getField method is a heavy method, when there are many fields. Because the method use List to find the key, it will foreach the list. So If the fields grow, the LuceneDocument.getField is slow.

I add a flag in NumberFieldMapper, that will add the field to the LuceneDocument.keyedFields, it's a map, so check multi values is O(1) cost, and become cheap.

here is a part of the flame graph:

<g>
<title>org/elasticsearch/index/mapper/LuceneDocument.getField (5,963 samples, 67.53%)</title><rect x="284.1" y="1251.0" width="796.9" height="15" fill="#53e453" rx="2" ry="2"/>
<text x="287.1" y="1262.0">org/elasticsearch/index/mapper/LuceneDocument.getField</text>
</g>

@elasticsearchmachine elasticsearchmachine added v8.1.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Nov 13, 2021
@weizijun
Copy link
Copy Markdown
Contributor Author

hi, @nik9000 @csoulios, can you help to review this PR?

Copy link
Copy Markdown
Contributor

@csoulios csoulios left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weizijun thank you for submitting this improvement.

Indeed, AggregateDoubleMetricFieldMapper naively uses LuceneDocument#getField that iterates over all fields in the document and should definitely leverage the LuceneDocument#keyedFields map to check for duplicates.

However, NumberFieldMapper#indexValue is going to change significantly so that it can support _tsid generation (#80276)

See here: https://github.com/elastic/elasticsearch/pull/80276/files#diff-690f39e25a1fddcfe00433a60cb25365138254e25315726ef7fd195230a3e774

So, I suggest you don't use the LuceneDocument#keyedFields implicitly via the NumberFieldMapper, but access it directly via LuceneDocument#addWithKey and LuceneDocument#getByKey in the AggregateDoubleMetricFieldMapper#parseCreateField method.

In this case you would not have to modify NumberFieldMapper at all.

@csoulios csoulios added :StorageEngine/TSDB You know, for Metrics :Search Foundations/Mapping Index mappings, including merging and defining field types labels Nov 15, 2021
@elasticmachine elasticmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Search Meta label for search team labels Nov 15, 2021
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@weizijun
Copy link
Copy Markdown
Contributor Author

@weizijun thank you for submitting this improvement.

Indeed, AggregateDoubleMetricFieldMapper naively uses LuceneDocument#getField that iterates over all fields in the document and should definitely leverage the LuceneDocument#keyedFields map to check for duplicates.

However, NumberFieldMapper#indexValue is going to change significantly so that it can support _tsid generation (#80276)

See here: https://github.com/elastic/elasticsearch/pull/80276/files#diff-690f39e25a1fddcfe00433a60cb25365138254e25315726ef7fd195230a3e774

So, I suggest you don't use the LuceneDocument#keyedFields implicitly via the NumberFieldMapper, but access it directly via LuceneDocument#addWithKey and LuceneDocument#getByKey in the AggregateDoubleMetricFieldMapper#parseCreateField method.

In this case you would not have to modify NumberFieldMapper at all.

@csoulios I see your change, it affect the dimension field, but normal aggregate_metric_double field don't affect

@csoulios
Copy link
Copy Markdown
Contributor

@csoulios I see your change, it affect the dimension field, but normal aggregate_metric_double field don't affect

You are right. Also NumberFieldMapper does not use the LuceneDocument#keyedFields at all any more.

* upstream/master: (150 commits)
  Fix ComposableIndexTemplate equals when composed_of is null (elastic#80864)
  Optimize DLS bitset building for matchAll query (elastic#81030)
  URL option for BaseRunAsSuperuserCommand (elastic#81025)
  Less Verbose Serialization of Snapshot Failure in SLM Metadata (elastic#80942)
  Fix shadowed vars pt7 (elastic#80996)
  Fail shards early when we can detect a type missmatch (elastic#79869)
  Delegate Ref Counting to ByteBuf in Netty Transport (elastic#81096)
  Clarify `unassigned.reason` docs (elastic#81017)
  Strip blocks from settings for reindex targets (elastic#80887)
  Split off the values supplier for ScriptDocValues (elastic#80635)
  [ML] Switch message and detail for model snapshot deprecations (elastic#81108)
  [DOCS] Update xrefs for snapshot restore docs (elastic#81023)
  [ML] Updates visiblity of validate API (elastic#81061)
  Track histogram of transport handling times (elastic#80581)
  [ML] Fix datafeed preview with remote indices (elastic#81099)
  [ML] Fix acceptable model snapshot versions in ML deprecation checker (elastic#81060)
  [ML] Add logging for failing PyTorch test (elastic#81044)
  Extending the timeout waiting for snapshot to be ready (elastic#81018)
  [ML] Fix incorrect logging of unexpected model size error (elastic#81089)
  [ML] Make inference timeout test more reliable (elastic#81094)
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java
Copy link
Copy Markdown
Contributor

@csoulios csoulios left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this. I am happy to merge this change.

Can you please change the parameter name in the NumberFieldMapper

this.script.precludesParameters(ignoreMalformed, coerce, nullValue);
addScriptValidation(script, indexed, hasDocValues);

this.singleValue = Parameter.boolParam("single_value", false, m -> toType(m).singleValue, false);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you should change the parameter name here to allow_multiple_values:true|false, as discussed in #80825, #58523 and https://github.com/elastic/ecs/blob/main/rfcs/text/0029-enforce-single-value-fields.md

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just saw there is an open PR for adding allow_multiple_values parameter (#80289)

@markharwood do you think we should wait for your PR to be merged?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this parameter is useful, I can wait for #80289 merged, then change the aggregate_metric_double logic

Copy link
Copy Markdown
Contributor

@markharwood markharwood Dec 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My PR changes the main DocumentParser parsing logic to reject documents with arrays where allowMultipleValues is false for a field. Currently only user-defined mappings set this flag.
The PR does not tackle changes to existing single-valued fields like AggregateDoubleMetricFieldMapper that have implemented their own singularity checks (which I understand here have performance issues).

This is perhaps an argument for committing #80289 and creating a follow-up one for fields like AggregateDoubleMetricFieldMapper to strip out their singularity checks and override the new FieldMapper. allowMultipleValues() to return false. This would mean for always-singular fields like this, the central DocumentParser class would quickly reject any arrays at source. One complication that may need addressing is that the existing Lucene document-level singularity-checking code probably handles scenarios where the double field is singular but is contained in an object that might be repeated in an array e.g.[ { "num": 1}, {"num":2} ] rather than "num":[1,2]. The DocumentParser may need extra logic to spot the former.
Thoughts @jpountz ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, It's maybe not equals to single value. your PR reject the array value, but if a field in a object array, like your example: [ { "num": 1}, {"num":2} ], it may be duplicate in lucene. the AggregateDoubleMetricFieldMapper need a single value in lucene, not only array case. e.g:

{
	"field": [{
		"min": 10.0,
		"max": 50.0,
		"value_count": 3
	}, {
		"min": 11.0,
		"max": 51.0,
		"value_count": 3
	}]
}

it's not allow like this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so the DocumentParser would keep tracks of when it enters and exits arrays while parsing. Any field flagged as allow_multiple_values:false would cause an error if the current parse context was inside an array (no matter the nesting depth).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, when your PR merged, I will change the AggregateDoubleMetricFieldMapper code to adapt the parameter.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you can change the mapping parameter from single_value to allow_multiple_values with the following semantics:

Should the field permit arrays of values? Accepts true
(default) or false. When false, documents presenting arrays
of values for this field will be rejected with an error.

So that we are consistent with what the Mark's design

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, I will change it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@csoulios
Copy link
Copy Markdown
Contributor

csoulios commented Dec 3, 2021

@elasticmachine, test this please

@weizijun
Copy link
Copy Markdown
Contributor Author

weizijun commented Dec 3, 2021

@elasticmachine update branch

elasticmachine and others added 3 commits July 28, 2022 12:11
* upstream/main:
  Add 8.5 migration docs (elastic#88923)
  Script: Reindex & UpdateByQuery Metadata (elastic#88665)
  Remove unused plugins dir var from server CLI (elastic#88917)
  Use tracing API in TaskManager (elastic#88885)
  Add source fallback for keyword fields using operation (elastic#88735)
  Prune changelogs after 8.3.3 release
  Bump versions after 8.3.3 release
  Add a test for checking for misspelled "dry_run" parameters for Desired Nodes API (elastic#88898)
  Speedup BalanceUnbalancedClusterTests (elastic#88794)
  Preventing exceptions on node shutdown in integration tests (elastic#88827)
  Do not trigger check part3 for test mute and docs PRs (elastic#88895)
  Add troubleshooting docs about data corruption (elastic#88760)
  Mute RollupActionSingleNodeTests#testRollupDatastream (elastic#88891)
  [DOCS] Domain splitting impacts API keys (elastic#88677)
  Fix SqlSearchIT testAllTypesWithRequestToOldNodes (elastic#88866) (elastic#88883)
  Update synthetic-source.asciidoc (elastic#88880)
  Log more details in TaskAssertions (elastic#88864)
  Make Tuple a record (elastic#88280)
@weizijun
Copy link
Copy Markdown
Contributor Author

@csoulios all tests have passed, and update to the latest code. Can you help to review it?

Copy link
Copy Markdown
Contributor

@csoulios csoulios left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, the change looks ok to me. It definitely improves performance of adding a new aggregate_metric_double field.

I have added some minor comments for improvements.

Also, I would like to get an opinion from the @elastic/es-search team since they are the experts on the field mapper area.

);
}

public void testSingleValuedField() throws IOException {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason why this test is in WholeNumberFieldMapperTests and not in NumberFieldMapperTests?

Comment on lines +613 to +617
if (e.getCause() instanceof IllegalArgumentException) {
throw (IllegalArgumentException) e.getCause();
} else {
throw e;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that you extractIllegalArgumentException from this MapperParsingException because you want to bubble up the underlying error instead of throwing a generic message like this `"failed to parse field [field.value_count] of type [integer] in document with id '1'.".

If that's the case, can you please add a comment here to document the thinking behind this?

containsString(
"Field [field] of type [aggregate_metric_double] "
+ "does not support indexing multiple values for the same field in the same document"
"failed to parse field [field.value_count] of type [integer] in document with id '1'. Preview of field's value: '3'"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we receive a very generic error message. I think it's worth throwing the more informative: Field [field] of type [aggregate_metric_double] does not support indexing multiple values for the same field in the same documet.

Do you think that's possible?

@javanna
Copy link
Copy Markdown
Contributor

javanna commented Aug 4, 2022

I would wait to get this in until we have reached some conclusion on #80825 . I would not like to get in this flag for one of our field mappers without deciding first what we are going to do overall.

@romseygeek
Copy link
Copy Markdown
Contributor

Would it be possible to add the functionality for aggregate_metric_double without exposing it as a public flag? We know that we never want this field to be multi-valued, so we can happily add the internal APIs to handle it but delay making it available on other field types until we've decided on the public API

@csoulios
Copy link
Copy Markdown
Contributor

csoulios commented Aug 4, 2022

I agree with @romseygeek.

We can keep the NumberFieldMapper#allowMultipleValues as an internal parameter that sets this behavior, but not expose it as a mapping parameter. Therefore, the AggregateDoubleMetricFieldType can set it internally.

Later, we are free to change the implementation and perhaps introduce a new mapping parameter without breaking backwards compatibility.

@weizijun
Copy link
Copy Markdown
Contributor Author

weizijun commented Aug 4, 2022

Yeah, I also agree with @romseygeek , I have talked with @csoulios , I will modify the code.

* upstream/main: (265 commits)
  Disable openid connect tests due to missing fixture (elastic#89478)
  Add periodic job for single processor node testing
  Updates to changelog processing after docs redesign (elastic#89463)
  Better support for multi cluster for run task (elastic#89442)
  Mute failing tests (elastic#89465)
  [ML] Performance improvements related to ECS Grok pattern usage (elastic#89424)
  Add source fallback support for date and date_nanos mapped types (elastic#89440)
  Reuse Info in lifecycle step (elastic#89419)
  feature: support metrics for multi value fields (elastic#88818)
  Upgrade OpenTelemetry API and remove workaround (elastic#89438)
  Remove LegacyClusterTaskResultActionListener (elastic#89459)
  Add YAML spec docs about matching errors (elastic#89370)
  Remove redundant cluster upgrade tests for auth tokens (elastic#89417)
  Return 400 error for GetUserPrivileges call with API keys (elastic#89333)
  User Profile - Detailed errors in hasPrivileges response (elastic#89224)
  Rollover min_* conditions docs and highlight (elastic#89434)
  REST tests for percentiles_bucket agg (elastic#88029)
  REST tests for cumulative pipeline aggs (elastic#88966)
  Clean-up file watcher keys. (elastic#89429)
  fix a typo in Security.java (elastic#89248)
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java
@weizijun
Copy link
Copy Markdown
Contributor Author

@elasticmachine update branch

Copy link
Copy Markdown
Contributor

@csoulios csoulios left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@csoulios
Copy link
Copy Markdown
Contributor

csoulios commented Sep 7, 2022

@elasticmachine update branch

@csoulios
Copy link
Copy Markdown
Contributor

csoulios commented Sep 7, 2022

I have discussed this PR with @javanna offline and we have agreed that this implementation does not expose any public APIs (mapping parameters) for setting a field as single-valued.

So, we can move forward with merging this PR.

@csoulios
Copy link
Copy Markdown
Contributor

csoulios commented Sep 7, 2022

@elasticmachine run elasticsearch-ci/bwc

@weizijun
Copy link
Copy Markdown
Contributor Author

weizijun commented Sep 8, 2022

@elasticmachine update branch

Copy link
Copy Markdown
Contributor

@csoulios csoulios left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Thank you for submitting this improvement

fail("mapper [" + mapper.getClass() + "] error, not number field");
}

Exception e = expectThrows(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@csoulios csoulios merged commit 1f1ac13 into elastic:main Sep 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

external-contributor Pull request authored by a developer outside the Elasticsearch team >non-issue :Search Foundations/Mapping Index mappings, including merging and defining field types :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Search Meta label for search team v8.5.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants