Conversation
To give more detailed information about why half floats are not enough, here is a table that gives disk usage for storing 10M random floats between 0 and 1 depending on the mapping:
Of course this is not a good benchmark since this is fake data, but given how points and doc values work, it simulates the worst case; real data could expect even better disk utilization.
Just a question: would it be possible to extend from LongFieldMapper? Would be nice to have some code reuse.
I thought about it when working on this PR but in the end it made things more complicated since this mapper partially needs to behave as a long field and as a double field.
Cool, I can see how this can complicate things; I was just hoping that this code reuse would be low-hanging fruit.
Updated numbers with https://issues.apache.org/jira/browse/LUCENE-7371:
This is an attempt to revive elastic#15939, motivated by elastic/beats#1941. Half-floats are a pretty bad option for storing percentages: they would likely require 2 bytes all the time while percentages don't need more than one byte. So this PR exposes a new `scaled_float` type that requires a `scaling_factor` and internally indexes `value*scaling_factor` in a long field. Compared to the original PR it exposes a lower-level API so that the trade-offs are clearer, and it avoids any reference to fixed precision that might imply that this type is more accurate (it is actually *less* accurate). In addition to being more space-efficient for some use-cases that Beats is interested in, this is also faster than `half_float`, at least until we can improve the efficiency of decoding half-float bits (which is currently done in software) or Java gets first-class support for half-floats.
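The encoding idea can be sketched as follows. This is a minimal, hypothetical illustration of the `value*scaling_factor` round trip, not the actual Elasticsearch mapper code; the `encode`/`decode` names are invented for the example:

```java
public class ScaledFloatSketch {

    // Conceptually, scaled_float indexes round(value * scaling_factor)
    // as a long.
    static long encode(double value, double scalingFactor) {
        return Math.round(value * scalingFactor);
    }

    // Decoding divides back; any precision finer than 1/scalingFactor
    // is lost, which is why this type is *less* accurate than a float.
    static double decode(long encoded, double scalingFactor) {
        return encoded / scalingFactor;
    }

    public static void main(String[] args) {
        double scalingFactor = 100.0; // keeps two decimal digits
        long stored = encode(0.123, scalingFactor);       // 12
        double roundTrip = decode(stored, scalingFactor); // 0.12
        System.out.println(stored + " " + roundTrip);
    }
}
```

The accuracy loss is explicit: with a scaling factor of 100, `0.123` round-trips to `0.12`.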
Elasticsearch added a couple of new numeric datatypes, which means we need to update our type casting list to include them. Kibana should see them as "numbers" so they work properly in searches and aggs. Fixes elastic#7782 Related elastic/elasticsearch#18887 Related elastic/elasticsearch#19264
Elasticsearch has recently added scaled_float as an option for storing floating point numbers. The scaled floats are stored internally as longs, which means they can take advantage of the integer compression in Lucene. See elastic/elasticsearch#19264 for details. The PR moves all percentages to scaled floats. In our `fields.yml` we assume a default scaling factor of 1000, which should work well for our percentages (values between 0 and 1). This scaling factor can also be set to a different value in `fields.yml`.
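To make the space argument concrete, here is a small sketch (hypothetical helper names, not Beats or Elasticsearch code) of why a scaling factor of 1000 works well for percentages stored as values between 0 and 1:

```java
public class PercentageScaling {

    static long encode(double value, double scalingFactor) {
        return Math.round(value * scalingFactor);
    }

    static double decode(long encoded, double scalingFactor) {
        return encoded / scalingFactor;
    }

    public static void main(String[] args) {
        double scalingFactor = 1000.0; // the fields.yml default

        // A percentage in [0, 1] maps to a long in [0, 1000], a very
        // small range that Lucene's integer compression handles well.
        long stored = encode(0.875, scalingFactor); // 875
        System.out.println(stored + " " + decode(stored, scalingFactor));

        // Anything beyond three decimal digits is rounded away:
        System.out.println(decode(encode(1.0 / 3.0, scalingFactor),
                                  scalingFactor)); // approximately 0.333
    }
}
```

Three decimal digits of precision is plenty for a percentage, and the narrow long range is exactly what the integer compression exploits.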