Validate whether a data stream timestamp has been specified in a document by martijnvg · Pull Request #58119 · elastic/elasticsearch

martijnvg · 2020-06-15T15:02:38Z

If the document is going to be indexed into a backing index of a data stream, then
check whether a timestamp field has been specified and that exactly one timestamp
value has been specified.

Currently there is no concept of a required field in the mapping code. To me
the best place to add the data stream timestamp validation logic is in: ParseContext#postParse(...) line 481
If there is a better place then I will happily move this new logic elsewhere.

In order for ParseContext to know whether an index is part of a data stream and
what the timestamp field is, a DataStream instance had to be passed down to
this place. This is why the change touches relatively many files compared to the
actual added logic. However, this is needed and I don't see another way to do this.

Specifically looking for feedback from the @elastic/es-search team.

Relates to #53100

henningandersen

Left a couple smaller comments.

I think we should also add a REST test to demonstrate that error handling of a bulk request and of a single index request works as intended when no timestamp is specified.

henningandersen · 2020-06-16T09:26:23Z

server/src/main/java/org/elasticsearch/index/mapper/DocumentParser.java

-    DocumentParser(IndexSettings indexSettings, DocumentMapperParser docMapperParser, DocumentMapper docMapper) {
+    DocumentParser(IndexSettings indexSettings,
+                   DocumentMapperParser docMapperParser,
+                   DataStream dataStream,


Order of args looks strange, move dataStream to end?

henningandersen · 2020-06-16T09:35:47Z

docs/reference/indices/rollover-index.asciidoc

    "mappings": {
      "properties": {
-        "@timestamp": {
+        "date": {


I find the original name better, since it complies with ECS and also date opens up for a bit of confusion between field name and type.

Agreed, but the generated test data set (the huge twitter setup) has its timestamp in the date field. I think this should be changed in a follow-up change?

jtibshirani

This looks like a good start to me. I left an idea on how we could restructure the check to avoid counting the Lucene fields.

Earlier we brainstormed whether the concept of a 'singleton' field would be useful more generally, for example as a mapping option that could apply to any type. This is still on my radar, but I think it's good we're not blocking on that. I agree with your approach of just making a targeted change for this important validation.

server/src/main/java/org/elasticsearch/index/mapper/MapperService.java

jtibshirani · 2020-06-17T22:06:57Z

server/src/main/java/org/elasticsearch/index/mapper/ParseContext.java

+                }
+            }
+
+            if (numStoredFields > 1 || numPointFields > 1 || numDocValuesFields > 1) {


It feels a little fragile to be checking the Lucene fields that the timestamp field produces. Sometimes field mappers decide to produce multiple Lucene fields given a single value in the _source. Or the mapping could have doc values and indexing disabled.

Instead I think we could check that the timestamp is a 'singleton' during document parsing:

We could add a boolean flag to DateFieldMapper like isSingletonTimestamp, based on whether its field name matches the datastream timestamp field.

In DateFieldMapper#parseCreateField we would check + update a flag on ParseContext like alreadyParsedTimestamp. If it's already true, we throw an error.

In ParseContext#postParse, we verify that alreadyParsedTimestamp is true.

I like this because it avoids the fragility of checking Lucene fields, but still correctly handles cases where data is copied into the field, like copy_to and multi-fields.
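The three steps above could be sketched roughly as follows. This is a simplified, standalone model and not the Elasticsearch code: `SketchParseContext` and `SketchDateFieldMapper` are hypothetical stand-ins for `ParseContext` and `DateFieldMapper`, the error messages are illustrative, and only the flag names `isSingletonTimestamp` and `alreadyParsedTimestamp` are taken from the comment.

```java
import java.util.List;

// Hypothetical stand-in for ParseContext: remembers whether the timestamp was seen.
final class SketchParseContext {
    // null when the index is not a backing index of a data stream
    private final String dataStreamTimestampField;
    private boolean alreadyParsedTimestamp;

    SketchParseContext(String dataStreamTimestampField) {
        this.dataStreamTimestampField = dataStreamTimestampField;
    }

    boolean isAlreadyParsedTimestamp() {
        return alreadyParsedTimestamp;
    }

    void markTimestampParsed() {
        alreadyParsedTimestamp = true;
    }

    // Step 3: once the whole document is parsed, the timestamp must have been seen.
    void postParse() {
        if (dataStreamTimestampField != null && alreadyParsedTimestamp == false) {
            throw new IllegalArgumentException(
                "data stream timestamp field [" + dataStreamTimestampField + "] is missing");
        }
    }
}

// Hypothetical stand-in for DateFieldMapper.
final class SketchDateFieldMapper {
    private final String name;
    // Step 1: the flag is derived from the data stream's timestamp field name.
    private final boolean isSingletonTimestamp;

    SketchDateFieldMapper(String name, String dataStreamTimestampField) {
        this.name = name;
        this.isSingletonTimestamp = name.equals(dataStreamTimestampField);
    }

    // Step 2: called once per value; a second value for the timestamp is an error.
    void parseCreateField(SketchParseContext context, String value) {
        if (isSingletonTimestamp) {
            if (context.isAlreadyParsedTimestamp()) {
                throw new IllegalArgumentException(
                    "data stream timestamp field [" + name + "] encountered multiple values");
            }
            context.markTimestampParsed();
        }
        // A real mapper would parse the value and create Lucene fields here.
    }
}

final class SingletonTimestampSketch {
    // Runs the three steps for one document and reports "ok" or the error message.
    static String validate(String timestampField, List<String> values) {
        SketchParseContext context = new SketchParseContext(timestampField);
        SketchDateFieldMapper mapper = new SketchDateFieldMapper("@timestamp", timestampField);
        try {
            for (String value : values) {
                mapper.parseCreateField(context, value);
            }
            context.postParse();
            return "ok";
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(validate("@timestamp", List.of("2020-06-15T15:02:38Z")));
        System.out.println(validate("@timestamp", List.of()));
        System.out.println(validate("@timestamp", List.of("2020-06-15T15:02:38Z", "2020-06-16T09:26:23Z")));
    }
}
```

Because the check runs per parsed value rather than per Lucene field, it does not depend on how many Lucene fields the mapper emits for a single value.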

Thanks for this suggestion @jtibshirani. I also found counting of Lucene fields to be fragile and the isSingletonTimestamp and alreadyParsedTimestamp flags will make this check more robust. I will try and adjust the code.

martijnvg · 2020-06-18T07:37:46Z

Earlier we brainstormed whether the concept of a 'singleton' field would be useful more generally, for example as a mapping option that could apply to any type

This idea also crossed my mind. This pr kind of creates an implicit singleton field based on whether the index is part of a data stream. If/when we change this to be a field mapper attribute we can force this setting when creating a backing index. This new singleton attribute should be immutable like most of the other field mapping attributes.

jtibshirani

The overall structure looks good to me! I left some more detailed comments.

It's too bad we need to pass through an extra parameter in so many places. It's generally unusual that we have mapping information passed through externally -- usually all information affecting the schema/ document parsing can be found in the index metadata or settings. I don't really see a way around this though, I don't think we want to add dataStreamTimestampField to the index metadata, since it's a property of the 'index abstraction' and not the index?

If/when we change this to be a field mapper attribute we can force this setting when creating a backing index. This new singleton attribute should be immutable like most of the other field mapping attributes.

I'll create an issue about this to start a discussion. I think this would help with my concern above, since most of the logic will be moved into an actual mapping attribute like singleton.

jtibshirani · 2020-06-23T17:01:52Z

server/src/main/java/org/elasticsearch/index/mapper/DateFieldMapper.java

    protected void parseCreateField(ParseContext context) throws IOException {
+        if (singletonDataStreamTimestamp) {
+            if (context.isDataStreamTimestampParsed()) {
+                throw new IllegalArgumentException("timestamp field has multiple values, only a single value is allowed");


Small comment, this could mention 'data stream timestamp' for clarity. We also try to include the field name when possible to help with debugging: "Encountered data stream timestamp field [my-timestamp] with multiple values ..."

jtibshirani · 2020-06-23T17:03:21Z

server/src/main/java/org/elasticsearch/index/mapper/MapperService.java

+            idFieldDataEnabled, null);
+    }
+
+    public MapperService(IndexSettings indexSettings, IndexAnalyzers indexAnalyzers, NamedXContentRegistry xContentRegistry,


Could we delete the extra MapperService constructor above, to make it harder to forget to pass the timestamp field? It looks like it's only used in tests and for simulating a merge.

jtibshirani · 2020-06-23T17:06:25Z

server/src/main/java/org/elasticsearch/index/mapper/DocumentMapperParser.java

+        this(indexSettings, mapperService, xContentRegistry, similarityService, mapperRegistry, queryShardContextSupplier, null);
+    }
+
+    public DocumentMapperParser(IndexSettings indexSettings, MapperService mapperService, NamedXContentRegistry xContentRegistry,


Same thought here, it'd be nice to delete the extra constructor above.

jtibshirani · 2020-06-23T17:15:27Z

server/src/main/java/org/elasticsearch/index/mapper/Mapper.java

                MultiFieldParserContext(ParserContext in) {
                    super(in.similarityLookupService(), in.mapperService(), in.typeParsers(),
-                            in.indexVersionCreated(), in.queryShardContextSupplier());
+                            in.indexVersionCreated(), in.queryShardContextSupplier(), null);


I think we should pass through the timestamp field here. I guess the timestamp could happen to be a multi-field.

I will pass down the timestamp field. A small note: the timestamp field can only be a field that is part of the _source.
There is validation that checks whether a field mapping of type date or date_nanos exists when creating the composable index template, and this is also asserted when a backing index of a data stream is created.

the timestamp field can only be a field that is part of the _source.

Interesting! I am generally curious to catch up on timestamp mapping validation, I'll ping the team offline about this.

jtibshirani · 2020-06-23T17:17:55Z

server/src/main/java/org/elasticsearch/index/mapper/ParseContext.java

        }

        void postParse() {
+            if (dataStreamTimestampField != null) {


Small comment, can collapse these two 'if' checks.

Also, same thought as above about including 'data stream' and the field name in the error message.

jtibshirani · 2020-06-23T17:33:07Z

server/src/internalClusterTest/java/org/elasticsearch/indices/DataStreamIT.java

            "support aliases."));
    }

+    public void testNoTimestampInDocument() throws Exception {


We have good integration test coverage, but no unit tests. It would be great to add a unit test at the level of the mapping code that checks the document validation. Perhaps DateFieldMapperTests would be a good place for this?

Ignore this comment, @martijnvg explained the unit tests are missing because this is a 'draft' :)

martijnvg · 2020-06-24T11:54:58Z

I don't think we want to add dataStreamTimestampField to the index metadata, since it's a property of the 'index abstraction' and not the index?

It is a property of DataStream. The Metadata#indicesLookup sorted set with index abstractions is built from the information available in the Metadata class, which contains both the index metadata instances and the data stream instances.

We avoided adding data stream information to index metadata, because then the information that indicates whether an index is part of a data stream would be in two places, and there would be a risk of certain types of bugs if a data stream instance and an index metadata instance went out of sync for some reason. An example would be if a backing index gets shrunk. The new index with fewer shards would need to be added to the data stream and the original index would need to be removed from the data stream. In this case both a data stream instance and an index metadata instance would need to be updated.
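To illustrate the single-source-of-truth argument, here is a minimal sketch, assuming a much simplified model. `SketchDataStream` and its `replaceBackingIndex` method are hypothetical names for illustration and are not the actual `DataStream` class. Membership is stored only in the data stream, so a shrink needs a single update and nothing on the index side can drift.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model: backing index membership lives only in the data stream.
final class SketchDataStream {
    private final String name;
    private final String timestampField;
    private final List<String> backingIndices;

    SketchDataStream(String name, String timestampField, List<String> backingIndices) {
        this.name = name;
        this.timestampField = timestampField;
        this.backingIndices = List.copyOf(backingIndices);
    }

    // Returns a new instance in which `existing` is swapped for `replacement`,
    // for example after a backing index has been shrunk.
    SketchDataStream replaceBackingIndex(String existing, String replacement) {
        int position = backingIndices.indexOf(existing);
        if (position < 0) {
            throw new IllegalArgumentException(
                "index [" + existing + "] is not part of data stream [" + name + "]");
        }
        List<String> updated = new ArrayList<>(backingIndices);
        updated.set(position, replacement);
        return new SketchDataStream(name, timestampField, updated);
    }

    // Whether an index belongs to the data stream is derived, not stored on the index.
    boolean contains(String index) {
        return backingIndices.contains(index);
    }

    List<String> backingIndices() {
        return backingIndices;
    }
}

final class DataStreamMembershipSketch {
    public static void main(String[] args) {
        SketchDataStream logs = new SketchDataStream(
            "logs", "@timestamp", List.of("logs-000001", "logs-000002"));
        SketchDataStream afterShrink = logs.replaceBackingIndex("logs-000001", "shrink-logs-000001");
        System.out.println(afterShrink.backingIndices());
        System.out.println(afterShrink.contains("logs-000001"));
    }
}
```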

elasticmachine · 2020-06-24T12:00:28Z

Pinging @elastic/es-core-features (:Core/Features/Data streams)

martijnvg · 2020-06-24T14:05:28Z

I think this would help with my concern above, since most of the logic will be moved into an actual mapping attribute like singleton.

I think if the singleton feature existed today, then the approach taken in this pr would never have been done. I think with the singleton feature, this change will be much cleaner and less intrusive, because composable index templates with data stream definition would set singleton=true on the appropriate field and there is then no need to pass down the timestamp field all the way down to where it is now in the pr. Maybe we should try to introduce a singleton attribute? I did some exploring and I think it is doable: c830711#diff-d72103d748a7ab089c4a87707755fe3dR449

martijnvg · 2020-06-24T15:41:46Z

I think if the singleton feature existed today, then the approach taken in this pr would never have been done. I think with the singleton feature, this change will be much cleaner and less intrusive, because composable index templates with data stream definition would set singleton=true on the appropriate field and there is then no need to pass down the timestamp field all the way down to where it is now in the pr. Maybe we should try to introduce a singleton attribute? I did some exploring and I think it is doable: c830711#diff-d72103d748a7ab089c4a87707755fe3dR449

I chatted with @jtibshirani via another channel. It is uncertain whether something like a singleton field will be added and, if so, how it should be exposed. So in the meantime, for data streams, the best way forward seems to be moving forward with this PR. When something like a singleton field is added, the migration can be easy, since ES can enable the singleton attribute automatically when creating a new backing index.

jtibshirani · 2020-06-25T03:30:50Z

I opened #58523 to discuss the idea of adding a 'singleton' flag to field mappers. I'll do a final review shortly!

jimczi · 2020-06-25T08:16:46Z

Sorry, I am late in the discussion but I wonder if this could be implemented as a MetadataFieldMapper ?
Currently we do not allow metadata fields to be put in the _source, but this is something that we could revisit for this new field.
The requirements described here are easy to implement in a metadata field mapper, they are unique and we can constrain the values to be present once and only once. The main advantage I see is that mappings would have a consistent and unified view of a timestamp field when enabled.
This could look like this:

"mappings": {
    "_timestamp": {
      "enabled": true 
    }
}

Today the timestamp field name is set when the data stream is created. This is flexible but I wonder how we plan to handle multiple data streams that don't share the same timestamp field name in a search request. In other words, will it be possible to sort documents by timestamp if I target more than one data stream ?
It seems that this flexibility in the naming is only required at ingest time ? If that's true then I wonder if we could use a unique metadata field and also create an alias field that would point to the metadata field when the data stream is created ?

martijnvg · 2020-06-25T11:58:59Z

Sorry, I am late in the discussion but I wonder if this could be implemented as a MetadataFieldMapper ?

I think that could work. The postParse() method can check whether the value has been specified, but how would we enforce that a single value has been specified? In this pr that is the responsibility of the date field mapper and if it sees a value twice, then it fails. How would that work if we have a data stream timestamp meta field? In a previous iteration, ParseContext#postParse() counted the number of Lucene fields, but this logic is a bit fragile as the document could contain a stored field, a doc values field and a points field. So we decided to move away from that.

This is flexible but I wonder how we plan to handle multiple data streams that don't share the same timestamp field name in a search request. In other words, will it be possible to sort documents by timestamp if I target more than one data stream ?

We have not discussed that yet. We focussed on at least ensuring that each document has a timestamp value. Right now, if data streams have different timestamp fields, then sorting behaves as if you were sorting over non-uniform indices.

If that's true then I wonder if we could use a unique metadata field and also create an alias field that would point to the metadata field when the data stream is created ?

I like this idea. But this could also be resolved at query parse time? If we know that we sort over the primary timestamp field of data streams then at query parse time we could resolve to the right field?

martijnvg · 2020-06-25T14:38:23Z

I chatted with @jimczi via another channel and we see benefits in developing the timestamp field validation as a metadata field mapper. The metadata field mapper implementation would indicate what the timestamp field is. In the postParse() method it would check whether there is exactly one points field in the captured Lucene document. We only need to check for point fields, because it doesn't make sense for these to be disabled. Validation is going to be added that will disallow setting the index attribute to no, so the validation logic wouldn't be fragile as it was in the initial commit of this pr. I will work on a draft pr to see whether the implementation with a metadata field mapper will be cleaner than this pr.
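A rough sketch of the point-field check described above, under stated assumptions: `SketchField` and `SketchFieldKind` are hypothetical stand-ins for the fields of the parsed Lucene document, the error messages are illustrative, and this is not the code that ended up in #58582. It relies on the assumption from the comment that indexing cannot be disabled for the timestamp field, so each value yields exactly one point field.

```java
import java.util.List;

// Hypothetical kinds of Lucene field that one timestamp value may produce.
enum SketchFieldKind { POINT, DOC_VALUES, STORED }

// Hypothetical stand-in for a field in the parsed Lucene document.
final class SketchField {
    final String name;
    final SketchFieldKind kind;

    SketchField(String name, SketchFieldKind kind) {
        this.name = name;
        this.kind = kind;
    }
}

final class TimestampMetaFieldSketch {
    // Counts only point fields, so extra doc values or stored fields
    // for the same value do not trigger a false "multiple values" error.
    static void postParse(String timestampField, List<SketchField> document) {
        long pointFields = document.stream()
            .filter(f -> f.name.equals(timestampField) && f.kind == SketchFieldKind.POINT)
            .count();
        if (pointFields == 0) {
            throw new IllegalArgumentException(
                "data stream timestamp field [" + timestampField + "] is missing");
        }
        if (pointFields > 1) {
            throw new IllegalArgumentException(
                "data stream timestamp field [" + timestampField + "] encountered multiple values");
        }
    }

    public static void main(String[] args) {
        List<SketchField> oneValue = List.of(
            new SketchField("@timestamp", SketchFieldKind.POINT),
            new SketchField("@timestamp", SketchFieldKind.DOC_VALUES));
        postParse("@timestamp", oneValue); // one value, two Lucene fields: accepted

        try {
            postParse("@timestamp", List.of());
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```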

jtibshirani · 2020-06-25T19:58:57Z

server/src/main/java/org/elasticsearch/index/mapper/DateFieldMapper.java

+            if (context.isDataStreamTimestampParsed()) {
+                throw new IllegalArgumentException("data stream timestamp field [" + name() + "] encountered multiple values");
+            }
+            context.setDataStreamTimestampParsed(true);


I just noticed that we should probably only set this after the date has been successfully parsed? Just writing this down to not forget, I see that we may be changing the strategy and moving to a metadata field mapper.
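The ordering concern could be sketched like this. It is a standalone illustration, assuming a hypothetical `ParseOrderSketch` class and using `java.time.Instant.parse` in place of the real date parsing in `DateFieldMapper`. The flag is set only after the value has parsed, so a malformed value does not mark the timestamp as seen.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;

// Hypothetical, simplified model of the per-document parse state.
final class ParseOrderSketch {
    private boolean dataStreamTimestampParsed;

    long parseTimestamp(String fieldName, String value) {
        if (dataStreamTimestampParsed) {
            throw new IllegalArgumentException(
                "data stream timestamp field [" + fieldName + "] encountered multiple values");
        }
        // Throws DateTimeParseException on malformed input, before the flag is touched.
        long millis = Instant.parse(value).toEpochMilli();
        // Only reached when parsing succeeded.
        dataStreamTimestampParsed = true;
        return millis;
    }

    boolean isDataStreamTimestampParsed() {
        return dataStreamTimestampParsed;
    }

    public static void main(String[] args) {
        ParseOrderSketch context = new ParseOrderSketch();
        try {
            context.parseTimestamp("@timestamp", "not-a-date");
        } catch (DateTimeParseException e) {
            System.out.println("parsed flag after failure: " + context.isDataStreamTimestampParsed());
        }
        context.parseTimestamp("@timestamp", "2020-06-25T19:58:57Z");
        System.out.println("parsed flag after success: " + context.isDataStreamTimestampParsed());
    }
}
```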

jtibshirani · 2020-06-25T20:18:50Z

chatted with @jimczi via another channel and we see benefits in developing the timestamp field validation as metadata field mapper. The metadata field mapper implementation would indicate what the timestamp field is.

One aspect I like about the metadata field approach is that it consolidates the information into the mapping itself, as opposed to passing it down externally from the datastream definition. This is a good property to maintain -- that all information affecting schema/ document parsing can be found in the index mappings or settings.

martijnvg · 2020-06-26T06:42:50Z

One aspect I like about the metadata field approach is that it consolidates the information into the mapping itself, as opposed to passing it down externally from the datastream definition. This is a good property to maintain -- that all information affecting schema/ document parsing can be found in the index mappings or settings

Yes, I agree and it can make sorting on data streams with different timestamp fields easier (a user would sort by the meta field instead of the actual data stream timestamp field).

martijnvg · 2020-06-29T14:34:35Z

Closing this pr in favour of #58582

martijnvg added the >non-issue, :Search Foundations/Mapping, v8.0.0, :StorageEngine/Data streams, and v7.9.0 labels on Jun 15, 2020

martijnvg requested review from danhermann and henningandersen June 15, 2020 15:02

martijnvg added 11 commits June 15, 2020 18:02

no need to lookup data streams in upgrade service, no documents are being parsed.

1d7d5ac

use the right date field from the huge twitter docs setup

b76ac27

removed unused import

21264e0

Removed other unneeded changes

a158a8a

fixed more rest tests

99f4ba8

fixed ilm integ test

d6b1efa

fixed another test

fd56292

overload constructors to reduce changes in other files.

9ed3369

iter

3d1920a

undo unrelated changes

5cb17c4

undo another unrelated change

2457562

henningandersen reviewed Jun 16, 2020

martijnvg added 3 commits June 17, 2020 11:44

Merge remote-tracking branch 'es/master' into validate_documents_have_timestamp

1def3e4

re-order constructor arguments

22e80a1

added rest yaml test

91b6bbd

jtibshirani self-requested a review June 17, 2020 17:34

jtibshirani reviewed Jun 17, 2020

martijnvg added 4 commits June 18, 2020 12:35

Merge remote-tracking branch 'es/master' into validate_documents_have_timestamp

3064a6b

iter

5883e83

adjusted assertion

6ae8bc4

adjusted unneeded line changes

3759560

martijnvg added 2 commits June 18, 2020 19:48

don't validate timestamp field for tombstone documents, since these documents have an empty source.

f272148

Merge remote-tracking branch 'es/master' into validate_documents_have_timestamp

16ce263

jtibshirani reviewed Jun 23, 2020

martijnvg added 5 commits June 24, 2020 09:59

Merge remote-tracking branch 'es/master' into validate_documents_have_timestamp

0c3cdcb

iter

b01d89a

fixed test

57bd047

added unit tests

3bc6e04

fixed tests

6627f59

martijnvg requested a review from jtibshirani June 24, 2020 11:55

martijnvg marked this pull request as ready for review June 24, 2020 12:00

elasticmachine added the Team:Data Management (obsolete) and Team:Search labels on Jun 24, 2020

jtibshirani mentioned this pull request Jun 25, 2020

Add 'singleton' flag to field mappers? #58523

Closed

jtibshirani reviewed Jun 25, 2020

martijnvg mentioned this pull request Jun 26, 2020

Add data stream timestamp validation via metadata field mapper #58582

Merged

martijnvg closed this Jun 29, 2020

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
