
ArrayIndexOutOfBoundsException on ConcurrentDoubleHistogram#recordValue #156

@ghost

Description

I'm working on implementing an HdrHistogram-based Summary metric collector for Prometheus, since the default CKMS-based Summary metric collector is not performant enough for our use cases.
prometheus/client_java#484

We experienced an ArrayIndexOutOfBoundsException from ConcurrentDoubleHistogram#recordValue during load testing under a realistic high-throughput, high-concurrency workload (200M+ calls/day, 100+ threads).
I have not yet been able to reproduce the exception under controlled conditions (-ea, unit tests, JMH benchmarks, serial tests, concurrent tests, static data, randomized data, ...).
I could only trigger similar exceptions when a measurement falls outside the widest possible dynamic range, but that is a different scenario and is expected for invalid measurements.
https://github.com/HdrHistogram/HdrHistogram/blob/master/src/main/java/org/HdrHistogram/DoubleHistogram.java#L557
https://github.com/HdrHistogram/HdrHistogram/blob/master/src/main/java/org/HdrHistogram/DoubleHistogram.java#L416

It looks like a rare concurrency bug that occurs when values are recorded while the histogram is auto-resizing or value shifting, but I'm still investigating and exploring the HdrHistogram code to understand the details.
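To make the suspected interleaving concrete, here is a deliberately sequential toy model (my own construction, not HdrHistogram's actual code path): a slot index is computed against the counts array after a resize, but the increment lands on a stale pre-resize array, which produces exactly the kind of ArrayIndexOutOfBoundsException seen in the traces. The class and variable names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.atomic.AtomicReference;

public class StaleIndexRace {
    public static void main(String[] args) {
        // Hypothetical stand-in for the histogram's counts storage.
        AtomicReference<AtomicLongArray> countsRef =
                new AtomicReference<>(new AtomicLongArray(4096));

        AtomicLongArray stale = countsRef.get();     // snapshot taken before the resize
        countsRef.set(new AtomicLongArray(8192));    // concurrent auto-resize replaces the array
        int index = countsRef.get().length() - 1;    // 8191: valid only for the new array
        try {
            stale.incrementAndGet(index);            // 8191 >= 4096 on the stale array
        } catch (ArrayIndexOutOfBoundsException e) {
            // e.g. "Index 8191 out of bounds for length 4096"
            System.out.println(e.getMessage());
        }
    }
}
```

In the real histogram the two steps run on different threads, so the window is tiny, which would explain why it only shows up under sustained high concurrency.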

1st occurrence - immediately after restart, no error handling:

java.lang.ArrayIndexOutOfBoundsException: Index 67584 out of bounds for length 65536
  at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2011)
  at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2008)
  at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:159)
  at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:156)
  at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:62)
  at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
  at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
  at java.base/java.lang.invoke.VarHandleLongs$Array.getAndAdd(VarHandleLongs.java:721)
  at java.base/java.lang.invoke.VarHandleGuards.guard_LIJ_J(VarHandleGuards.java:778)
  at java.base/java.util.concurrent.atomic.AtomicLongArray.incrementAndGet(AtomicLongArray.java:234)
  at org.HdrHistogram.ConcurrentHistogram.recordConvertedDoubleValue(ConcurrentHistogram.java:169)
  at org.HdrHistogram.DoubleHistogram.recordSingleValue(DoubleHistogram.java:353)
  at org.HdrHistogram.DoubleHistogram.recordValue(DoubleHistogram.java:294)
  at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:53)
  at io.prometheus.client.Summary$Child.observe(Summary.java:264)
  ...

2nd occurrence - during normal operation, improved error handling:

java.lang.RuntimeException: Failed to record 0.019 in bucket org.HdrHistogram.ConcurrentDoubleHistogram@47729061
encodeIntoCompressedByteBuffer: DHISTwAAAAMAAAAAAAAABByEkxQAAABBeNqTaZkszMDAoMwAAcxQmhFC2f+3OwBhHZdgecrONJWDpZuF5zozSzMTUz8bVzcL03VmjkZGlnqWRkY2AGoTC78=
  at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:65)
  at io.prometheus.client.Summary$Child.observe(Summary.java:264)
  ...
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 4317 out of bounds for length 4096
  at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2011)
  at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2008)
  at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:159)
  at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:156)
  at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:62)
  at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
  at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
  at java.base/java.lang.invoke.VarHandleLongs$Array.getAndAdd(VarHandleLongs.java:721)
  at java.base/java.lang.invoke.VarHandleGuards.guard_LIJ_J(VarHandleGuards.java:778)
  at java.base/java.util.concurrent.atomic.AtomicLongArray.incrementAndGet(AtomicLongArray.java:234)
  at org.HdrHistogram.ConcurrentHistogram.recordConvertedDoubleValue(ConcurrentHistogram.java:169)
  at org.HdrHistogram.DoubleHistogram.recordSingleValue(DoubleHistogram.java:353)
  at org.HdrHistogram.DoubleHistogram.recordValue(DoubleHistogram.java:294)
  at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:57)
  ...

3rd occurrence - during normal operation, improved error handling:

java.lang.RuntimeException: Failed to record 0.012 in bucket org.HdrHistogram.ConcurrentDoubleHistogram@ed824791
encodeIntoCompressedByteBuffer: DHISTwAAAAMAAAAAAAAAAhyEkxQAAAAieNqTaZkszMDAwMIAAcxQmhFCyf+32wBhMa0VYAIAUp8EHA==
  at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:65)
  at io.prometheus.client.Summary$Child.observe(Summary.java:264)
  ...
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 4644 out of bounds for length 4096
  at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2011)
  at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2008)
  at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:159)
  at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:156)
  at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:62)
  at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
  at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
  at java.base/java.lang.invoke.VarHandleLongs$Array.getAndAdd(VarHandleLongs.java:721)
  at java.base/java.lang.invoke.VarHandleGuards.guard_LIJ_J(VarHandleGuards.java:778)
  at java.base/java.util.concurrent.atomic.AtomicLongArray.incrementAndGet(AtomicLongArray.java:234)
  at org.HdrHistogram.ConcurrentHistogram.recordConvertedDoubleValue(ConcurrentHistogram.java:169)
  at org.HdrHistogram.DoubleHistogram.recordSingleValue(DoubleHistogram.java:353)
  at org.HdrHistogram.DoubleHistogram.recordValue(DoubleHistogram.java:294)
  at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:57)
  ...

I improved error handling by serializing the histogram and logging the offending value:

    try {
        bucket.recordValue(value);
    } catch (ArrayIndexOutOfBoundsException e) {
        // Serialize the histogram state so the failure can be replayed offline.
        ByteBuffer byteBuffer = ByteBuffer.allocate(bucket.getNeededByteBufferCapacity());
        bucket.encodeIntoCompressedByteBuffer(byteBuffer, Deflater.BEST_COMPRESSION);
        byteBuffer.flip();
        byte[] byteArray = new byte[byteBuffer.limit()];
        byteBuffer.get(byteArray);
        String base64 = Base64.getEncoder().encodeToString(byteArray);
        throw new RuntimeException(
                "Failed to record " + value + " in bucket " + bucket + "\n"
                + "encodeIntoCompressedByteBuffer: " + base64, e);
    }

I verified that the deserialized histogram is not corrupted and the value could be recorded:

    double value = ...;  // offending value from the log
    String base64 = ...; // serialized histogram from the log

    byte[] byteArray = Base64.getDecoder().decode(base64);
    ByteBuffer byteBuffer = ByteBuffer.wrap(byteArray);
    DoubleHistogram bucket = ConcurrentDoubleHistogram.decodeFromCompressedByteBuffer(byteBuffer, ConcurrentHistogram.class, 2);
    bucket.setAutoResize(true);

    // Recording the same value against the deserialized histogram succeeds.
    bucket.recordValue(value);

Increasing the initial dynamic range appears to avoid the issue, as it reduces auto-resizing and value shifting, but recording might still fail on larger measurements that do trigger auto-resizing or value shifting.
prometheus/client_java#484 (comment)

We could swallow the exception and drop the measurement, so this is not a showstopper for our use cases, but I'm not sure that's acceptable for others, especially for use cases other than application metrics.
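For reference, the "swallow and drop" mitigation can be sketched as a wrapper around any recorder (e.g. `bucket::recordValue` passed as a `DoubleConsumer`), counting drops so the data loss stays visible. This is my own sketch, not part of HdrHistogram or the Prometheus client; the `SafeRecord` name and the always-failing stand-in recorder are hypothetical.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.DoubleConsumer;

public class SafeRecord {
    // Wrap a recorder so a rare ArrayIndexOutOfBoundsException drops the
    // sample instead of propagating, while counting drops for visibility.
    static DoubleConsumer safe(DoubleConsumer recorder, LongAdder dropped) {
        return v -> {
            try {
                recorder.accept(v);
            } catch (ArrayIndexOutOfBoundsException e) {
                dropped.increment(); // lose one sample, keep serving traffic
            }
        };
    }

    public static void main(String[] args) {
        LongAdder dropped = new LongAdder();
        // Stand-in recorder that always fails, simulating the rare race.
        DoubleConsumer failing = v -> { throw new ArrayIndexOutOfBoundsException(); };
        safe(failing, dropped).accept(0.019);
        System.out.println("dropped=" + dropped.sum()); // dropped=1
    }
}
```

A dropped-sample counter (or a log line at a throttled rate) seems preferable to silently discarding, since the failure is load-dependent and would otherwise be invisible.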
