I'm working on implementing an HdrHistogram-based Summary metric collector for Prometheus, because the default CKMS-based Summary metric collector is not performant enough for our use cases.
prometheus/client_java#484
We experienced ArrayIndexOutOfBoundsException in ConcurrentDoubleHistogram#recordValue during load testing under a realistic high-throughput, high-concurrency workload (200M+ calls/day, 100+ threads).
I have not yet been able to reproduce the exception under controlled conditions (-ea, unit tests, JMH benchmarks, serial tests, concurrent tests, static data, randomized data, ...).
I could only trigger similar exceptions when a measurement falls outside the widest possible dynamic range, but that is a different scenario and is expected for invalid measurements.
https://github.com/HdrHistogram/HdrHistogram/blob/master/src/main/java/org/HdrHistogram/DoubleHistogram.java#L557
https://github.com/HdrHistogram/HdrHistogram/blob/master/src/main/java/org/HdrHistogram/DoubleHistogram.java#L416
It looks like a rare concurrency bug that occurs when values are recorded while the histogram is auto-resizing or value shifting, but I'm still investigating and reading the HdrHistogram code to understand the details.
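To illustrate the kind of race I suspect (this is a hypothetical toy model, not HdrHistogram's actual code): if a resizer thread publishes a grown counts array while a recorder thread still holds a reference to the old one, an index that is valid only in the new layout lands out of bounds on the stale array, which matches the shape of the stack traces below.

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.atomic.AtomicReference;

public class StaleIndexRace {
    // Hypothetical stand-in for a histogram's counts array. The names and
    // sizes here are illustrative only; they are not HdrHistogram internals.
    static final AtomicReference<AtomicLongArray> counts =
            new AtomicReference<>(new AtomicLongArray(4096));

    static boolean staleIndexFails() {
        AtomicLongArray stale = counts.get();     // recorder caches the array
        counts.set(new AtomicLongArray(8192));    // resizer publishes a grown array
        int index = 4644;                         // index valid in the new layout only
        try {
            stale.incrementAndGet(index);         // stale reference -> AIOOBE
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("stale index throws: " + staleIndexFails());
    }
}
```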
1st occurrence - immediately after restart, no error handling:
java.lang.ArrayIndexOutOfBoundsException: Index 67584 out of bounds for length 65536
at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2011)
at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2008)
at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:159)
at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:156)
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:62)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
at java.base/java.lang.invoke.VarHandleLongs$Array.getAndAdd(VarHandleLongs.java:721)
at java.base/java.lang.invoke.VarHandleGuards.guard_LIJ_J(VarHandleGuards.java:778)
at java.base/java.util.concurrent.atomic.AtomicLongArray.incrementAndGet(AtomicLongArray.java:234)
at org.HdrHistogram.ConcurrentHistogram.recordConvertedDoubleValue(ConcurrentHistogram.java:169)
at org.HdrHistogram.DoubleHistogram.recordSingleValue(DoubleHistogram.java:353)
at org.HdrHistogram.DoubleHistogram.recordValue(DoubleHistogram.java:294)
at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:53)
at io.prometheus.client.Summary$Child.observe(Summary.java:264)
...
2nd occurrence - during normal operation, improved error handling:
java.lang.RuntimeException: Failed to record 0.019 in bucket org.HdrHistogram.ConcurrentDoubleHistogram@47729061
encodeIntoCompressedByteBuffer: DHISTwAAAAMAAAAAAAAABByEkxQAAABBeNqTaZkszMDAoMwAAcxQmhFC2f+3OwBhHZdgecrONJWDpZuF5zozSzMTUz8bVzcL03VmjkZGlnqWRkY2AGoTC78=
at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:65)
at io.prometheus.client.Summary$Child.observe(Summary.java:264)
...
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 4317 out of bounds for length 4096
at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2011)
at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2008)
at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:159)
at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:156)
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:62)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
at java.base/java.lang.invoke.VarHandleLongs$Array.getAndAdd(VarHandleLongs.java:721)
at java.base/java.lang.invoke.VarHandleGuards.guard_LIJ_J(VarHandleGuards.java:778)
at java.base/java.util.concurrent.atomic.AtomicLongArray.incrementAndGet(AtomicLongArray.java:234)
at org.HdrHistogram.ConcurrentHistogram.recordConvertedDoubleValue(ConcurrentHistogram.java:169)
at org.HdrHistogram.DoubleHistogram.recordSingleValue(DoubleHistogram.java:353)
at org.HdrHistogram.DoubleHistogram.recordValue(DoubleHistogram.java:294)
at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:57)
...
3rd occurrence - during normal operation, improved error handling:
java.lang.RuntimeException: Failed to record 0.012 in bucket org.HdrHistogram.ConcurrentDoubleHistogram@ed824791
encodeIntoCompressedByteBuffer: DHISTwAAAAMAAAAAAAAAAhyEkxQAAAAieNqTaZkszMDAwMIAAcxQmhFCyf+32wBhMa0VYAIAUp8EHA==
at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:65)
at io.prometheus.client.Summary$Child.observe(Summary.java:264)
...
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 4644 out of bounds for length 4096
at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2011)
at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2008)
at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:159)
at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:156)
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:62)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
at java.base/java.lang.invoke.VarHandleLongs$Array.getAndAdd(VarHandleLongs.java:721)
at java.base/java.lang.invoke.VarHandleGuards.guard_LIJ_J(VarHandleGuards.java:778)
at java.base/java.util.concurrent.atomic.AtomicLongArray.incrementAndGet(AtomicLongArray.java:234)
at org.HdrHistogram.ConcurrentHistogram.recordConvertedDoubleValue(ConcurrentHistogram.java:169)
at org.HdrHistogram.DoubleHistogram.recordSingleValue(DoubleHistogram.java:353)
at org.HdrHistogram.DoubleHistogram.recordValue(DoubleHistogram.java:294)
at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:57)
...
I improved the error handling by serializing the histogram and logging the offending value:
try {
    bucket.recordValue(value);
} catch (ArrayIndexOutOfBoundsException e) {
    // Serialize the histogram state so the failure can be analyzed offline.
    ByteBuffer byteBuffer = ByteBuffer.allocate(bucket.getNeededByteBufferCapacity());
    bucket.encodeIntoCompressedByteBuffer(byteBuffer, Deflater.BEST_COMPRESSION);
    byteBuffer.flip();
    byte[] byteArray = new byte[byteBuffer.limit()];
    byteBuffer.get(byteArray);
    String base64 = Base64.getEncoder().encodeToString(byteArray);
    throw new RuntimeException(
            "Failed to record " + value + " in bucket " + bucket + "\n"
                    + "encodeIntoCompressedByteBuffer: " + base64, e);
}
I verified that the deserialized histogram is not corrupted and that the value can be recorded:
double value = ...;   // the offending value from the log
String base64 = ...;  // the serialized histogram from the log
byte[] byteArray = Base64.getDecoder().decode(base64);
ByteBuffer byteBuffer = ByteBuffer.wrap(byteArray);
DoubleHistogram bucket = ConcurrentDoubleHistogram.decodeFromCompressedByteBuffer(
        byteBuffer, ConcurrentHistogram.class, 2);
bucket.setAutoResize(true);
bucket.recordValue(value); // succeeds on the deserialized copy
It looks like increasing the initial dynamic range avoids the issue, since it reduces auto-resizing and value shifting, but larger measurements that still trigger a resize or a value shift might still fail.
prometheus/client_java#484 (comment)
We could swallow the exception and drop the measurement, so this is not a showstopper for our use cases, but I'm not sure whether that is acceptable for others, especially for use cases other than application metrics.
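A minimal sketch of that workaround, using a hypothetical `Recorder` interface as a self-contained stand-in for DoubleHistogram#recordValue (the interface and counter names are my own, not HdrHistogram or Prometheus API):

```java
import java.util.concurrent.atomic.LongAdder;

public class DropOnFailure {
    // Stand-in for DoubleHistogram#recordValue, so the sketch is self-contained.
    interface Recorder { void recordValue(double v); }

    static final LongAdder dropped = new LongAdder();

    // Swallow the rare AIOOBE, count the dropped sample, and keep serving
    // traffic instead of failing the observed request.
    static void safeRecord(Recorder bucket, double value) {
        try {
            bucket.recordValue(value);
        } catch (ArrayIndexOutOfBoundsException e) {
            dropped.increment();
        }
    }

    public static void main(String[] args) {
        // Simulate the failure mode seen in production.
        Recorder flaky = v -> {
            throw new ArrayIndexOutOfBoundsException("Index 4644 out of bounds for length 4096");
        };
        safeRecord(flaky, 0.012);
        System.out.println("dropped=" + dropped.sum());
    }
}
```

The trade-off is a tiny, counted loss of samples rather than a failed observation path, which seems tolerable for application metrics but perhaps not for other uses.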