Incorrect quantiles for floating-point and integer arrays

Say we have an array with a large negative number and a smaller positive number:

```julia
array = Float16[-9000, 100]
```

Both of these numbers are represented exactly in `Float16`:

```julia
julia> Float16(-9000) == -9_000
true

julia> Float16(100) == 100
true
```

Julia will return the wrong answer for quantile queries over this array:

```julia
julia> using Statistics

julia> quantile(Float16[-9000, 100], 1.0)
104.0 # should be 100, not 104

julia> quantile(Float16[-9000, 100], 0.999999999)
103.99999089599987 # should be approximately 100, not 104
```

If we make the large number bigger, then quantile queries over `Float32` and `Float64` arrays are also incorrect:

```julia
julia> quantile(Float32[-1e9, 100], 1)
128.0f0 # should be 100, not 128

julia> quantile(Float32[-1e9, 100], 0.9999999999)
127.89999997615814 # should be approximately 100, not 128

julia> quantile(Float64[-1e20, 100], 1)
0.0 # should be 100, not 0
```

This can happen when both numbers are small:

```julia
julia> quantile(Float16[-255, 0.58], 1)
Float16(0.625) # should be 0.58, not 0.625
```

And it can happen when the array contains more than two numbers:

```julia
julia> quantile(Float32[-1e15, -1e14, -1e13, -1e12, -1e11, -1e10, -1e9, 100], 1)
128.0f0 # should be 100, not 128
```

For integers there’s the interesting twist – the quantiles can exceed the representable values of the integer type:

```julia
julia> quantile(Int8[-68, 60], 0.5)
-132.0 # should be -4, not less than typemin(Int8)
```

In some cases every quantile other than the 0th percentile is incorrect. Interestingly, the values decrease as we query successively higher percentiles:

```julia
julia> for percentile in 0:0.1:1
          arr = Int16[-30000, 30000]
          result = quantile(arr, percentile)
          println("p$(rpad(Int(100*percentile), 3)) = $result")
       end
p0   = -30000.0 # correct!
p10  = -30553.600000000002 # wrong
p20  = -31107.2 # wrong
p30  = -31660.8 # wrong
p40  = -32214.4 # wrong
p50  = -32768.0 # wrong
p60  = -33321.6 # wrong
p70  = -33875.2 # wrong
p80  = -34428.8 # wrong
p90  = -34982.4 # wrong
p100 = -35536.0 # wrong
```

If the numbers are big enough, comparable results can be found for `Int32` and `Int64`:

```julia
julia> quantile(Int32[-1e9, 2e9], 1.0) < typemin(Int32)
true # should be false

julia> quantile(Int[-5e18, -2e18, 9e18], 1.0) < typemin(Int)
true # should be false
```

Randomized testing suggests that for `Int32` this behavior occurs more frequently for short integer arrays. Based on a million samples at each length, approximately

- 1 in every 4 random arrays of length 2 will produce incorrect quantiles.
- 1 in every 8 random arrays of length 5 will produce incorrect quantiles.
- 1 in every 120 random arrays of length 10 will produce incorrect quantiles.
- 1 in every 45,000 random arrays of length 20 will produce incorrect quantiles.

For `Int64` it looks like the error rate may not go down as array size decreases, and approximately

- 1 in every 2 random arrays of length 2 will produce incorrect quantiles.
- 1 in every 7 random arrays of length 10 will produce incorrect quantiles.
- 1 in every 8 random arrays of length 50 will produce incorrect quantiles.
- 1 in every 8 random arrays of length 500 will produce incorrect quantiles.
- 1 in every 8 random arrays of length 5,000 will produce incorrect quantiles.

<details>
  <summary>The function I used to compute these estimates</summary>

```julia
"""
Estimate the probability that incorrect quantiles will
be produced for a random array of a particular size and
element type.

Correctness is determined using a very naïve method where
we only consider a quantile result to be incorrect if it
lies outside the extrema of the array.
"""
function estimate_wrong_int_arrays(T, n, n_samples)
  count = 0
  percentiles = 0:0.01:1
  arr = zeros(T, n)
  for _ in 1:n_samples
    rand!(arr)
    # Since quantiles are returned in Float64 precision,
    # convert the extrema to that type for comparisons
    lo, hi = Float64.(extrema(arr))
    for p in percentiles
      result = quantile(arr, p)
      quantile_in_bounds = lo ≤ result ≤ hi
      if !quantile_in_bounds
        count += 1
        break
      end
    end
  end
  count / n_samples
end
```
</details>

This behavior reproduces on the current LTS release, Julia 1.6, as well as Julia 1.9.1, the current release as of 2023-06-07.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect quantiles for floating-point and integer arrays #144

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Incorrect quantiles for floating-point and integer arrays #144

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions