Skip to content

Incorrect quantiles for floating-point and integer arrays #144

@yurivish

Description

@yurivish

Say we have an array with a large negative number and a smaller positive number:

array = Float16[-9000, 100]

Both of these numbers are represented exactly in Float16:

julia> Float16(-9000) == -9_000
true

julia> Float16(100) == 100
true

Julia will return the wrong answer for quantile queries over this array:

julia> using Statistics

julia> quantile(Float16[-9000, 100], 1.0)
104.0 # should be 100, not 104

julia> quantile(Float16[-9000, 100], 0.999999999)
103.99999089599987 # should be approximately 100, not 104

If we make the large number bigger, then quantile queries over Float32 and Float64 arrays are also incorrect:

julia> quantile(Float32[-1e9, 100], 1)
128.0f0 # should be 100, not 128

julia> quantile(Float32[-1e9, 100], 0.9999999999)
127.89999997615814 # should be approximately 100, not 128

julia> quantile(Float64[-1e20, 100], 1)
0.0 # should be 100, not 0

This can happen when both numbers are small:

julia> quantile(Float16[-255, 0.58], 1)
Float16(0.625) # should be 0.58, not 0.625

And it can happen when the array contains more than two numbers:

julia> quantile(Float32[-1e15, -1e14, -1e13, -1e12, -1e11, -1e10, -1e9, 100], 1)
128.0f0 # should be 100, not 128

For integers there’s the interesting twist – the quantiles can exceed the representable values of the integer type:

julia> quantile(Int8[-68, 60], 0.5)
-132.0 # should be -4, not less than typemin(Int8)

In some cases every quantile other than the 0th percentile is incorrect. Interestingly, the values decrease as we query successively higher percentiles:

julia> for percentile in 0:0.1:1
          arr = Int16[-30000, 30000]
          result = quantile(arr, percentile)
          println("p$(rpad(Int(100*percentile), 3)) = $result")
       end
p0   = -30000.0 # correct!
p10  = -30553.600000000002 # wrong
p20  = -31107.2 # wrong
p30  = -31660.8 # wrong
p40  = -32214.4 # wrong
p50  = -32768.0 # wrong
p60  = -33321.6 # wrong
p70  = -33875.2 # wrong
p80  = -34428.8 # wrong
p90  = -34982.4 # wrong
p100 = -35536.0 # wrong

If the numbers are big enough, comparable results can be found for Int32 and Int64:

julia> quantile(Int32[-1e9, 2e9], 1.0) < typemin(Int32)
true # should be false

julia> quantile(Int[-5e18, -2e18, 9e18], 1.0) < typemin(Int)
true # should be false

Randomized testing suggests that for Int32 this behavior occurs more frequently for short integer arrays. Based on a million samples at each length, approximately

  • 1 in every 4 random arrays of length 2 will produce incorrect quantiles.
  • 1 in every 8 random arrays of length 5 will produce incorrect quantiles.
  • 1 in every 120 random arrays of length 10 will produce incorrect quantiles.
  • 1 in every 45,000 random arrays of length 20 will produce incorrect quantiles.

For Int64 it looks like the error rate may not go down as array size decreases, and approximately

  • 1 in every 2 random arrays of length 2 will produce incorrect quantiles.
  • 1 in every 7 random arrays of length 10 will produce incorrect quantiles.
  • 1 in every 8 random arrays of length 50 will produce incorrect quantiles.
  • 1 in every 8 random arrays of length 500 will produce incorrect quantiles.
  • 1 in every 8 random arrays of length 5,000 will produce incorrect quantiles.
The function I used to compute these estimates
"""
Estimate the probability that incorrect quantiles will
be produced for a random array of a particular size and
element type.

Correctness is determined using a very naïve method where
we only consider a quantile result to be incorrect if it
lies outside the extrema of the array.
"""
function estimate_wrong_int_arrays(T, n, n_samples)
  count = 0
  percentiles = 0:0.01:1
  arr = zeros(T, n)
  for _ in 1:n_samples
    rand!(arr)
    # Since quantiles are returned in Float64 precision,
    # convert the extrema to that type for comparisons
    lo, hi = Float64.(extrema(arr))
    for p in percentiles
      result = quantile(arr, p)
      quantile_in_bounds = lo  result  hi
      if !quantile_in_bounds
        count += 1
        break
      end
    end
  end
  count / n_samples
end

This behavior reproduces on the current LTS release, Julia 1.6, as well as Julia 1.9.1, the current release as of 2023-06-07.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions