Validate whether a data stream timestamp has been specified in a document#58119
Validate whether a data stream timestamp has been specified in a document#58119martijnvg wants to merge 26 commits intoelastic:masterfrom
Conversation
…ment If the document is going to be index into a backing index of a data stream then check whether a timestamp field has been specified and that exactly one timestamp value has been specified. Currently there is no concept of a required field in the mapping code. To me the best place to add the data stream timestamp validation logic is in: `ParseContext#postParse(...)`. If there is a better place then I happily move this new logic elsewhere. In order to ParseContext to know whether an index is part of a data stream and what the timestamp field is, a `DataStream` instance had to be passed down this this place. This is why the change touches relatively many files compared to the actual added logic. However this is needed and I don't see another way to do this. Relates to elastic#53100
henningandersen
left a comment
There was a problem hiding this comment.
Left a couple smaller comments.
I think we should also add a REST test to demonstrate that error handling of a bulk request and a single index requests works as intended when no timestamp is specified.
| DocumentParser(IndexSettings indexSettings, DocumentMapperParser docMapperParser, DocumentMapper docMapper) { | ||
| DocumentParser(IndexSettings indexSettings, | ||
| DocumentMapperParser docMapperParser, | ||
| DataStream dataStream, |
There was a problem hiding this comment.
Order or args look strange, move dataStream to end?
| "mappings": { | ||
| "properties": { | ||
| "@timestamp": { | ||
| "date": { |
There was a problem hiding this comment.
I find the original name better, since it complies with ECS and also date opens up for a bit of confusion between field name and type.
There was a problem hiding this comment.
Agreed, but in the test data sets that is generated (huge twitter setup), has its timestamp in the date field. I think this should be changed in a followup change?
jtibshirani
left a comment
There was a problem hiding this comment.
This looks like a good start to me. I left a idea on how we could restructure the check to avoid counting the Lucene fields.
Earlier we brainstormed whether the concept of a 'singleton' field would be useful more generally, for example as a mapping option that could apply to any type. This is still on my radar, but I think it's good we're not blocking on that. I agree with your approach of just making a targeted change for this important validation.
| } | ||
| } | ||
|
|
||
| if (numStoredFields > 1 || numPointFields > 1 || numDocValuesFields > 1) { |
There was a problem hiding this comment.
It feels a little fragile to be checking the Lucene fields that the timestamp field produces. Sometimes field mappers decide to produce multiple Lucene fields given a single value in the _source. Or the mapping could have doc values and indexing disabled.
Instead I think we could check that the timestamp is a 'singleton' during document parsing:
- We could add a boolean flag to
DateFieldMapperlikeisSingletonTimestamp, based on whether its field name matches the datastream timestamp field. - In
DateFieldMapper#parseCreateFieldwe would check + update a flag onParseContextlikealreadyParsedTimestamp. If it's already true, we throw an error. - In
ParseContext#postParse, we verify thatalreadyParsedTimestampis true.
I like this because it avoids the fragility of checking Lucene fields, but still correctly handle cases where data is copied into the field like copy_to and multi-fields.
There was a problem hiding this comment.
Thanks for this suggestion @jtibshirani. I also found counting of Lucene fields to be fragile and the isSingletonTimestamp and alreadyParsedTimestamp flags will make this check more robust. I will try and adjust the code.
This idea also crossed my mind. This pr kind of creates an implicit singleton field based on whether the index is part of a data stream. If/when we change this to be a field mapper attribute we can force this setting when creating backing index. This new singleton attribute should be immutable like most of the other field mapping attributes. |
since these documents have an empty source.
jtibshirani
left a comment
There was a problem hiding this comment.
The overall structure looks good to me! I left some more detailed comments.
It's too bad we need to pass through an extra parameter in so many places. It's generally unusual that we have mapping information passed through externally -- usually all information affecting the schema/ document parsing can be found in the index metadata or settings. I don't really see a way around this though, I don't think we want to add dataStreamTimestampField to the index metadata, since it's a property of the 'index abstraction' and not the index?
If/when we change this to be a field mapper attribute we can force this setting when creating backing index. This new singleton attribute should be immutable like most of the other field mapping attributes.
I'll create an issue about this to start a discussion. I think this would help with my concern above, since most of the logic will be moved into an actual mapping attribute like singleton.
| protected void parseCreateField(ParseContext context) throws IOException { | ||
| if (singletonDataStreamTimestamp) { | ||
| if (context.isDataStreamTimestampParsed()) { | ||
| throw new IllegalArgumentException("timestamp field has multiple values, only a single value is allowed"); |
There was a problem hiding this comment.
Small comment, this could mention 'data stream timestamp' for clarity. We also try to include the field name when possible to help with debugging: "Encountered data stream timestamp field [my-timestamp] with multiple values ..."
| idFieldDataEnabled, null); | ||
| } | ||
|
|
||
| public MapperService(IndexSettings indexSettings, IndexAnalyzers indexAnalyzers, NamedXContentRegistry xContentRegistry, |
There was a problem hiding this comment.
Could we delete the extra MapperService constructor above, to make it harder to forget to pass the timestamp field? It looks like it's only used in tests and for simulating a merge.
| this(indexSettings, mapperService, xContentRegistry, similarityService, mapperRegistry, queryShardContextSupplier, null); | ||
| } | ||
|
|
||
| public DocumentMapperParser(IndexSettings indexSettings, MapperService mapperService, NamedXContentRegistry xContentRegistry, |
There was a problem hiding this comment.
Same thought here, it'd be nice to delete the extra constructor above.
| MultiFieldParserContext(ParserContext in) { | ||
| super(in.similarityLookupService(), in.mapperService(), in.typeParsers(), | ||
| in.indexVersionCreated(), in.queryShardContextSupplier()); | ||
| in.indexVersionCreated(), in.queryShardContextSupplier(), null); |
There was a problem hiding this comment.
I think we should pass through the timestamp field here. I guess the timestamp could happen to be a multi-field.
There was a problem hiding this comment.
I will pass down the timestamp field, a small note, the timestamp field can only be a field that is part of the _source.
There is validation that checks whether a field mapping exists of type date or date_nanos when creating the composable index template and this is also asserted when a backing index of a data stream is created.
There was a problem hiding this comment.
the timestamp field can only be a field that is part of the _source.
Interesting! I am generally curious to catch up on timestamp mapping validation, I'll ping the team offline about this.
| } | ||
|
|
||
| void postParse() { | ||
| if (dataStreamTimestampField != null) { |
There was a problem hiding this comment.
Small comment, can collapse these two 'if' checks.
Also, same thought as above about including 'data stream' and the field name in the error message.
| "support aliases.")); | ||
| } | ||
|
|
||
| public void testNoTimestampInDocument() throws Exception { |
There was a problem hiding this comment.
We have good integration test coverage, but no unit tests. It would be great to add a unit test at the level of the mapping code that checks the document validation. Perhaps DateFieldMapperTests would be a good place for this?
There was a problem hiding this comment.
Ignore this comment, @martijnvg explained the unit tests are missing because this is a 'draft' :)
It is a property of We avoided adding data stream information to index metadata, because then the information that indicates whether an index is part of a data stream is in two places and then there is a risk of certain type of bugs if a data stream instance and index metadata instance go for some reason out of sync. An example would be if a backing index gets shrunken. The new index with less shards would need to be added to the data stream and the original index would need to be removed from the data stream. In this case both data stream instance and an index metadata instance would need to be updated. |
|
Pinging @elastic/es-core-features (:Core/Features/Data streams) |
I think if the singleton feature existed today, then the approach taken in this pr would never have been done. I think with the |
I chatted with @jtibshirani via another channel and it is unsure whether something like a singleton field will be added and if so then then it is unsure how this should be exposed. So in the meantime, for data streams, the best way forward seems to be moving forward with this PR. When something like singleton field is added, then the migration can be easy, since the singleton attribute can be enabled automatically when creating a new backing index by ES. This way the migration will be easy. |
|
I opened #58523 to discuss the idea of adding a 'singleton' flag to field mappers. I'll do a final review shortly! |
|
Sorry, I am late in the discussion but I wonder if this could be implemented as a "mappings": {
"_timestamp": {
"enabled": true
}
}Today the timestamp field name is set when the data stream is created. This is flexible but I wonder how we plan to handle multiple data streams that don't share the same timestamp field name in a search request. In other words, will it be possible to sort documents by timestamp if I target more than one data stream ? |
I think that could work. The
We have not discussed that yet. We focussed on at least ensuring that each document has a timestamp value. Right now if a data stream has different timestamp fields then sorting is like if you try to sort over non uniform indices.
I like this idea. But this could also be resolved at query parse time? If we know that we sort over the primary timestamp field of data streams then at query parse time we could resolve to the right field? |
|
I chatted with @jimczi via another channel and we see benefits in developing the timestamp field validation as metadata field mapper. The metadata field mapper implementation would indicate what the timestamp field is. In the |
| if (context.isDataStreamTimestampParsed()) { | ||
| throw new IllegalArgumentException("data stream timestamp field [" + name() + "] encountered multiple values"); | ||
| } | ||
| context.setDataStreamTimestampParsed(true); |
There was a problem hiding this comment.
I just noticed that we should probably only set this after the date has been successfully parsed? Just writing this down to not forget, I see that we may be changing the strategy and moving to a metadata field mapper.
One aspect I like about the metadata field approach is that it consolidates the information into the mapping itself, as opposed to passing it down externally from the datastream definition. This is a good property to maintain -- that all information affecting schema/ document parsing can be found in the index mappings or settings. |
Yes, I agree and it can make sorting on data streams with different timestamp fields easier (a user would sort by the meta field instead of the actual data stream timestamp field). |
|
Closing this pr in favour of #58582 |
If the document is going to be index into a backing index of a data stream then
check whether a timestamp field has been specified and that exactly one timestamp
value has been specified.
Currently there is no concept of a required field in the mapping code. To me
the best place to add the data stream timestamp validation logic is in:
ParseContext#postParse(...)line 481If there is a better place then I happily move this new logic elsewhere.
In order to ParseContext to know whether an index is part of a data stream and
what the timestamp field is, a
DataStreaminstance had to be passed down thisthis place. This is why the change touches relatively many files compared to the
actual added logic. However this is needed and I don't see another way to do this.
Specifically looking for feedback from the @elastic/es-search team.
Relates to #53100