Say we have an array with a large negative number and a smaller positive number:
array = Float16[-9000, 100]
Both of these numbers are represented exactly in Float16:
julia> Float16(-9000) == -9_000
true
julia> Float16(100) == 100
true
Julia will return the wrong answer for quantile queries over this array:
julia> using Statistics
julia> quantile(Float16[-9000, 100], 1.0)
104.0 # should be 100, not 104
julia> quantile(Float16[-9000, 100], 0.999999999)
103.99999089599987 # should be approximately 100, not 104
If we make the large number bigger, then quantile queries over Float32 and Float64 arrays are also incorrect:
julia> quantile(Float32[-1e9, 100], 1)
128.0f0 # should be 100, not 128
julia> quantile(Float32[-1e9, 100], 0.9999999999)
127.89999997615814 # should be approximately 100, not 128
julia> quantile(Float64[-1e20, 100], 1)
0.0 # should be 100, not 0
This can happen when both numbers are small:
julia> quantile(Float16[-255, 0.58], 1)
Float16(0.625) # should be 0.58, not 0.625
And it can happen when the array contains more than two numbers:
julia> quantile(Float32[-1e15, -1e14, -1e13, -1e12, -1e11, -1e10, -1e9, 100], 1)
128.0f0 # should be 100, not 128
For integers there’s the interesting twist – the quantiles can exceed the representable values of the integer type:
julia> quantile(Int8[-68, 60], 0.5)
-132.0 # should be -4, not less than typemin(Int8)
In some cases every quantile other than the 0th percentile is incorrect. Interestingly, the values decrease as we query successively higher percentiles:
julia> for percentile in 0:0.1:1
arr = Int16[-30000, 30000]
result = quantile(arr, percentile)
println("p$(rpad(Int(100*percentile), 3)) = $result")
end
p0 = -30000.0 # correct!
p10 = -30553.600000000002 # wrong
p20 = -31107.2 # wrong
p30 = -31660.8 # wrong
p40 = -32214.4 # wrong
p50 = -32768.0 # wrong
p60 = -33321.6 # wrong
p70 = -33875.2 # wrong
p80 = -34428.8 # wrong
p90 = -34982.4 # wrong
p100 = -35536.0 # wrong
If the numbers are big enough, comparable results can be found for Int32 and Int64:
julia> quantile(Int32[-1e9, 2e9], 1.0) < typemin(Int32)
true # should be false
julia> quantile(Int[-5e18, -2e18, 9e18], 1.0) < typemin(Int)
true # should be false
Randomized testing suggests that for Int32 this behavior occurs more frequently for short integer arrays. Based on a million samples at each length, approximately
- 1 in every 4 random arrays of length 2 will produce incorrect quantiles.
- 1 in every 8 random arrays of length 5 will produce incorrect quantiles.
- 1 in every 120 random arrays of length 10 will produce incorrect quantiles.
- 1 in every 45,000 random arrays of length 20 will produce incorrect quantiles.
For Int64 it looks like the error rate may not go down as array size decreases, and approximately
- 1 in every 2 random arrays of length 2 will produce incorrect quantiles.
- 1 in every 7 random arrays of length 10 will produce incorrect quantiles.
- 1 in every 8 random arrays of length 50 will produce incorrect quantiles.
- 1 in every 8 random arrays of length 500 will produce incorrect quantiles.
- 1 in every 8 random arrays of length 5,000 will produce incorrect quantiles.
The function I used to compute these estimates
"""
Estimate the probability that incorrect quantiles will
be produced for a random array of a particular size and
element type.
Correctness is determined using a very naïve method where
we only consider a quantile result to be incorrect if it
lies outside the extrema of the array.
"""
function estimate_wrong_int_arrays(T, n, n_samples)
count = 0
percentiles = 0:0.01:1
arr = zeros(T, n)
for _ in 1:n_samples
rand!(arr)
# Since quantiles are returned in Float64 precision,
# convert the extrema to that type for comparisons
lo, hi = Float64.(extrema(arr))
for p in percentiles
result = quantile(arr, p)
quantile_in_bounds = lo ≤ result ≤ hi
if !quantile_in_bounds
count += 1
break
end
end
end
count / n_samples
end
This behavior reproduces on the current LTS release, Julia 1.6, as well as Julia 1.9.1, the current release as of 2023-06-07.
Say we have an array with a large negative number and a smaller positive number:
Both of these numbers are represented exactly in
Float16:Julia will return the wrong answer for quantile queries over this array:
If we make the large number bigger, then quantile queries over
Float32andFloat64arrays are also incorrect:This can happen when both numbers are small:
And it can happen when the array contains more than two numbers:
For integers there’s the interesting twist – the quantiles can exceed the representable values of the integer type:
In some cases every quantile other than the 0th percentile is incorrect. Interestingly, the values decrease as we query successively higher percentiles:
If the numbers are big enough, comparable results can be found for
Int32andInt64:Randomized testing suggests that for
Int32this behavior occurs more frequently for short integer arrays. Based on a million samples at each length, approximatelyFor
Int64it looks like the error rate may not go down as array size decreases, and approximatelyThe function I used to compute these estimates
This behavior reproduces on the current LTS release, Julia 1.6, as well as Julia 1.9.1, the current release as of 2023-06-07.