Conversation
To give more detailed information about why half floats are not enough, here is a table that gives disk usage for storing 10M random floats between 0 and 1 depending on the mapping:
Of course this is not a good benchmark since this is fake data, but given how points and doc values work, it simulates the worst case; real data could expect even better disk utilization.
Just a question: would it be possible to extend from LongFieldMapper? Would be nice to have some code reuse.
I thought about it when working on this PR but in the end it made things more complicated since this mapper partially needs to behave as a long field and as a double field.
Cool, I can see how this can complicate things; I was just hoping that this code reuse would be low-hanging fruit.
Updated numbers with https://issues.apache.org/jira/browse/LUCENE-7371:
This is an attempt to revive elastic#15939, motivated by elastic/beats#1941. Half-floats are a pretty bad option for storing percentages: they would likely require 2 bytes all the time while percentages don't need more than one byte. So this PR exposes a new `scaled_float` type that requires a `scaling_factor` and internally indexes `value*scaling_factor` in a long field. Compared to the original PR it exposes a lower-level API so that the trade-offs are clearer, and it avoids any reference to fixed precision that might imply that this type is more accurate (it is actually *less* accurate). In addition to being more space-efficient for some use-cases that Beats is interested in, this is also faster than `half_float`, at least until we can improve the efficiency of decoding half-float bits (which is currently done in software) or Java gets first-class support for half-floats.
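The encoding idea can be sketched as follows. This is a minimal, hypothetical illustration of the `value*scaling_factor` round trip, not the actual Elasticsearch mapper code; the `encode`/`decode` names are invented for the example:

```java
public class ScaledFloatSketch {

    // Conceptually, scaled_float indexes round(value * scaling_factor)
    // as a long.
    static long encode(double value, double scalingFactor) {
        return Math.round(value * scalingFactor);
    }

    // Decoding divides back; any precision finer than 1/scalingFactor
    // is lost, which is why this type is *less* accurate than a float.
    static double decode(long encoded, double scalingFactor) {
        return encoded / scalingFactor;
    }

    public static void main(String[] args) {
        double scalingFactor = 100.0; // keeps two decimal digits
        long stored = encode(0.123, scalingFactor);       // 12
        double roundTrip = decode(stored, scalingFactor); // 0.12
        System.out.println(stored + " " + roundTrip);
    }
}
```

The accuracy loss is explicit: with a scaling factor of 100, `0.123` round-trips to `0.12`.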
Elasticsearch added a couple of new numeric datatypes, which means we need to update our type casting list to include them. Kibana should see them as "numbers" so they work properly in searches and aggs. Fixes elastic#7782 Related elastic/elasticsearch#18887 Related elastic/elasticsearch#19264
Elasticsearch has recently added scaled_float as an option for storing floating point numbers. The scaled floats are stored internally as longs, which means they can take advantage of the integer compression in Lucene. See elastic/elasticsearch#19264 for details. The PR moves all percentages to scaled floats. In our `fields.yml` we assume a default scaling factor of 1000, which should work well for our percentages (values between 0 and 1). This scaling factor can also be set to a different value in `fields.yml`.
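To make the space argument concrete, here is a small sketch (hypothetical helper names, not Beats or Elasticsearch code) of why a scaling factor of 1000 works well for percentages stored as values between 0 and 1:

```java
public class PercentageScaling {

    static long encode(double value, double scalingFactor) {
        return Math.round(value * scalingFactor);
    }

    static double decode(long encoded, double scalingFactor) {
        return encoded / scalingFactor;
    }

    public static void main(String[] args) {
        double scalingFactor = 1000.0; // the fields.yml default

        // A percentage in [0, 1] maps to a long in [0, 1000], a very
        // small range that Lucene's integer compression handles well.
        long stored = encode(0.875, scalingFactor); // 875
        System.out.println(stored + " " + decode(stored, scalingFactor));

        // Anything beyond three decimal digits is rounded away:
        System.out.println(decode(encode(1.0 / 3.0, scalingFactor),
                                  scalingFactor)); // approximately 0.333
    }
}
```

Three decimal digits of precision is plenty for a percentage, and the narrow long range is exactly what the integer compression exploits.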