[ML] Improve uniqueness of result document IDs#50644
[ML] Improve uniqueness of result document IDs#50644droberts195 merged 5 commits intoelastic:masterfrom
Conversation
Switch from a 32 bit Java hash to a 128 bit Murmur hash for creating document IDs from by/over/partition field values. The 32 bit Java hash was not sufficiently unique, and could produce identical numbers for relatively common combinations of by/partition field values such as L018/128 and L017/228. Fixes elastic#50613
|
Pinging @elastic/ml-core (:ml) |
|
In the case of a job re-running from a snapshot, we delete results directly correct? And do not rely on the IDs to be the same between the two runs? |
We do if it's an explicit revert with This latter scenario will always be the case during a failover from one node to another. It is actually possible for a model plot document to exist that's more recent than the restart time. This is because model plot documents are written before bucket documents - see https://github.com/elastic/ml-cpp/blob/a9c468cf8b991b8d30f1a9ba2846ff90edaa8bcc/lib/api/CAnomalyJob.cc#L626-L629 So in the worst case we'd persist some model plot documents that would get indexed due to a bulk request filling up, then the node would be killed before the corresponding bucket or data counts documents could be indexed, then the job would restart on a different node, redo the same bucket and we'd get duplicate model plot documents for one bucket. This would only be a problem in the case where the old node was running a version before this fix and the new node was running a version after this fix. I think it's probably worth tolerating this unlikely/single bucket problem to fix the problem of entire partitions never having any model plot. We do explicit deletes in the case of interim results, so those won't be a problem - see |
benwtrent
left a comment
There was a problem hiding this comment.
I think it would be nice to have a ID length worst case test where every possible value is at its configurable max.
Looking at the maximum value possible for murmur hash bytes: "170141183460469231722463931679029329919" which is two Long.MAX_VALUE bytes. That as a length of UTF_8 bytes is: 39.
I think we are way under the limit, but it would be nice for such a test to cover that extreme (and unlikely) case.
I added a test and we're under 200 bytes in total in the worst case. |
Switch from a 32 bit Java hash to a 128 bit Murmur hash for creating document IDs from by/over/partition field values. The 32 bit Java hash was not sufficiently unique, and could produce identical numbers for relatively common combinations of by/partition field values such as L018/128 and L017/228. Fixes #50613
Switch from a 32 bit Java hash to a 128 bit Murmur hash for creating document IDs from by/over/partition field values. The 32 bit Java hash was not sufficiently unique, and could produce identical numbers for relatively common combinations of by/partition field values such as L018/128 and L017/228. Fixes elastic#50613
Switch from a 32 bit Java hash to a 128 bit Murmur hash for
creating document IDs from by/over/partition field values.
The 32 bit Java hash was not sufficiently unique, and could
produce identical numbers for relatively common combinations
of by/partition field values such as L018/128 and L017/228.
Fixes #50613