[ML] Improve uniqueness of result document IDs by droberts195 · Pull Request #50644 · elastic/elasticsearch

droberts195 · 2020-01-06T09:56:03Z

Switch from a 32-bit Java hash to a 128-bit Murmur hash for
creating document IDs from by/over/partition field values.
The 32-bit Java hash was not sufficiently unique, and could
produce identical numbers for relatively common combinations
of by/partition field values such as L018/128 and L017/228.

Fixes #50613
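
The collision described above can be reproduced in a few lines of Java. This is a minimal sketch that assumes the old ID combined the three field values with `Objects.hash(by, over, partition)`; the actual ID-building code is not shown in this thread, and the class and method names here are hypothetical.

```java
import java.util.Objects;

public class HashCollisionDemo {
    // Assumption: the old scheme hashed (by, over, partition) with Objects.hash.
    // The over field is null when the job has no over field configured.
    static int legacyHash(String by, String over, String partition) {
        return Objects.hash(by, over, partition);
    }

    public static void main(String[] args) {
        int first = legacyHash("L018", null, "128");
        int second = legacyHash("L017", null, "228");
        // Both combinations hash to the same 32-bit value.
        System.out.println(first == second); // prints "true"
    }
}
```

Under that assumption the by field's hash is multiplied by 31 * 31 = 961. "L018" hashes to one more than "L017", while "128" hashes to exactly 961 less than "228", so the two differences cancel.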

elasticmachine · 2020-01-06T09:56:05Z

Pinging @elastic/ml-core (:ml)

benwtrent · 2020-01-06T12:08:00Z

In the case of a job re-running from a snapshot, we delete results directly, correct? And we do not rely on the IDs being the same between the two runs?

droberts195 · 2020-01-06T17:57:49Z

> In the case of a job re-running from a snapshot, we delete results directly correct?

We do if it's an explicit revert with delete_intervening_results set to true. If not, we try to start from just after the last observed input or result, which generally means we don't create duplicate results - see

elasticsearch/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/datafeed/DatafeedJob.java

Line 95 in 197d5e7

long lastEndTime = Math.max(latestFinalBucketEndTimeMs, latestRecordTimeMs);

This latter scenario will always be the case during a failover from one node to another.

It is actually possible for a model plot document to exist that's more recent than the restart time. This is because model plot documents are written before bucket documents - see https://github.com/elastic/ml-cpp/blob/a9c468cf8b991b8d30f1a9ba2846ff90edaa8bcc/lib/api/CAnomalyJob.cc#L626-L629

So in the worst case we'd persist some model plot documents that get indexed because a bulk request fills up, and then the node would be killed before the corresponding bucket or data counts documents could be indexed. The job would then restart on a different node, redo the same bucket, and we'd get duplicate model plot documents for that one bucket. This would only be a problem when the old node was running a version before this fix and the new node was running a version after this fix. I think it's worth tolerating this unlikely, single-bucket problem to fix the problem of entire partitions never having any model plot.
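
The mixed-version case can be illustrated with a toy model. This is only a sketch: a `HashMap` stands in for the results index, and both ID formats are hypothetical placeholders chosen to differ from each other, not the formats used by either version.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class MixedVersionDemo {
    // Hypothetical ID schemes; only the fact that they differ matters here.
    static String idBeforeFix(long bucketTimeMs, String by, String partition) {
        return "model_plot_" + bucketTimeMs + "_" + Objects.hash(by, null, partition);
    }

    static String idAfterFix(long bucketTimeMs, String by, String partition) {
        // Stand-in for a wider hash: the raw values themselves.
        return "model_plot_" + bucketTimeMs + "_" + by + "/" + partition;
    }

    // Index the same model plot twice and count the documents that remain.
    static int docsAfterReplay(String firstId, String secondId) {
        Map<String, String> index = new HashMap<>(); // toy stand-in for the results index
        index.put(firstId, "model plot");
        index.put(secondId, "model plot"); // overwrites only if the ID matches
        return index.size();
    }

    public static void main(String[] args) {
        String oldId = idBeforeFix(0L, "L018", "128");
        String newId = idAfterFix(0L, "L018", "128");
        System.out.println(docsAfterReplay(newId, newId)); // same scheme on both nodes
        System.out.println(docsAfterReplay(oldId, newId)); // old node, then upgraded node
    }
}
```

When both nodes use the same scheme the replayed document overwrites the first one. When the schemes differ, both copies survive, which is the single-bucket duplicate described above.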

We do explicit deletes in the case of interim results, so those won't be a problem - see

elasticsearch/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/process/autodetect/output/AutodetectResultProcessor.java

Lines 204 to 208 in 5c3dd57

    
    // Delete any existing interim results generated by a Flush command
    // which have not been replaced or superseded by new results.
    LOGGER.trace("[{}] Deleting interim results", jobId);
    persister.deleteInterimResults(jobId);
    deleteInterimRequired = false;

benwtrent

I think it would be nice to have an ID-length worst-case test where every possible value is at its configurable max.

Looking at the maximum value possible for the Murmur hash bytes: "170141183460469231722463931679029329919", which is two Long.MAX_VALUE values side by side. Its length in UTF-8 bytes is 39.

I think we are way under the limit, but it would be nice for such a test to cover that extreme (and unlikely) case.
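
The figure above can be checked with `BigInteger`. This is a small sketch, assuming the 128-bit hash is treated as two 64-bit halves that each hold `Long.MAX_VALUE` and is rendered as a decimal string; the class and method names are hypothetical.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

public class MaxHashLength {
    // Two Long.MAX_VALUE halves placed side by side as one 128-bit number.
    static BigInteger maxHashValue() {
        BigInteger half = BigInteger.valueOf(Long.MAX_VALUE);
        return half.shiftLeft(64).or(half);
    }

    // Number of UTF-8 bytes needed for the decimal representation.
    static int utf8Length(BigInteger value) {
        return value.toString().getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        BigInteger max = maxHashValue();
        System.out.println(max);             // prints 170141183460469231722463931679029329919
        System.out.println(utf8Length(max)); // prints 39
    }
}
```

Decimal digits are single-byte characters in UTF-8, so the byte length equals the digit count. An unsigned 128-bit maximum (2^128 - 1) also has 39 decimal digits, so the bound does not change under that interpretation.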

droberts195 · 2020-01-07T09:37:20Z

> I think we are way under the limit, but it would be nice for such a test to cover that extreme

I added a test and we're under 200 bytes in total in the worst case.

droberts195 added 2 commits January 3, 2020 14:50

Test for result doc ID uniqueness

1ffb83d

droberts195 added >bug :ml Machine learning v8.0.0 v7.6.0 labels Jan 6, 2020

benwtrent reviewed Jan 6, 2020

benwtrent approved these changes Jan 6, 2020

droberts195 added 3 commits January 7, 2020 09:15

Fix more tests

5a3b1d8

Merge branch 'master' into test_id_uniqueness

e54559b

Add a test of maximum length

e4366ae

droberts195 merged commit 1adf4c2 into elastic:master Jan 7, 2020

droberts195 deleted the test_id_uniqueness branch January 7, 2020 10:23

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

droberts195 merged 5 commits into elastic:master from droberts195:test_id_uniqueness
