I'm working on implementing an HdrHistogram-based Summary metric collector for Prometheus, because the default CKMS-based Summary metric collector is not performant enough for our use cases.
prometheus/client_java#484
We experienced ArrayIndexOutOfBoundsException in ConcurrentDoubleHistogram#recordValue during load testing under a realistic high-throughput, high-concurrency workload (200M+ calls/day, 100+ threads).
I have not yet been able to reproduce the exception under controlled conditions (-ea, unit tests, JMH benchmarks, serial tests, concurrent tests, static data, randomized data, ...).
I could only trigger similar exceptions when a measurement falls outside the widest possible dynamic range, but that is a different scenario and is expected for invalid measurements.
https://github.com/HdrHistogram/HdrHistogram/blob/master/src/main/java/org/HdrHistogram/DoubleHistogram.java#L557
https://github.com/HdrHistogram/HdrHistogram/blob/master/src/main/java/org/HdrHistogram/DoubleHistogram.java#L416
It looks like a rare concurrency bug that occurs when values are recorded while the histogram is auto-resizing or value shifting, but I'm still investigating and reading the HdrHistogram code to understand the details.
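To illustrate the kind of race I suspect (this is a hypothetical toy model, not HdrHistogram's actual code): if a resizer thread publishes a grown counts array while a recorder thread still holds a reference to the old one, an index that is valid only in the new layout lands out of bounds on the stale array, which matches the shape of the stack traces below.

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.atomic.AtomicReference;

public class StaleIndexRace {
    // Hypothetical stand-in for a histogram's counts array. The names and
    // sizes here are illustrative only; they are not HdrHistogram internals.
    static final AtomicReference<AtomicLongArray> counts =
            new AtomicReference<>(new AtomicLongArray(4096));

    static boolean staleIndexFails() {
        AtomicLongArray stale = counts.get();     // recorder caches the array
        counts.set(new AtomicLongArray(8192));    // resizer publishes a grown array
        int index = 4644;                         // index valid in the new layout only
        try {
            stale.incrementAndGet(index);         // stale reference -> AIOOBE
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("stale index throws: " + staleIndexFails());
    }
}
```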
1st occurrence - immediately after restart, no error handling:
java.lang.ArrayIndexOutOfBoundsException: Index 67584 out of bounds for length 65536
at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2011)
at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2008)
at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:159)
at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:156)
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:62)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
at java.base/java.lang.invoke.VarHandleLongs$Array.getAndAdd(VarHandleLongs.java:721)
at java.base/java.lang.invoke.VarHandleGuards.guard_LIJ_J(VarHandleGuards.java:778)
at java.base/java.util.concurrent.atomic.AtomicLongArray.incrementAndGet(AtomicLongArray.java:234)
at org.HdrHistogram.ConcurrentHistogram.recordConvertedDoubleValue(ConcurrentHistogram.java:169)
at org.HdrHistogram.DoubleHistogram.recordSingleValue(DoubleHistogram.java:353)
at org.HdrHistogram.DoubleHistogram.recordValue(DoubleHistogram.java:294)
at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:53)
at io.prometheus.client.Summary$Child.observe(Summary.java:264)
...
2nd occurrence - during normal operation, improved error handling:
java.lang.RuntimeException: Failed to record 0.019 in bucket org.HdrHistogram.ConcurrentDoubleHistogram@47729061
encodeIntoCompressedByteBuffer: DHISTwAAAAMAAAAAAAAABByEkxQAAABBeNqTaZkszMDAoMwAAcxQmhFC2f+3OwBhHZdgecrONJWDpZuF5zozSzMTUz8bVzcL03VmjkZGlnqWRkY2AGoTC78=
at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:65)
at io.prometheus.client.Summary$Child.observe(Summary.java:264)
...
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 4317 out of bounds for length 4096
at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2011)
at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2008)
at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:159)
at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:156)
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:62)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
at java.base/java.lang.invoke.VarHandleLongs$Array.getAndAdd(VarHandleLongs.java:721)
at java.base/java.lang.invoke.VarHandleGuards.guard_LIJ_J(VarHandleGuards.java:778)
at java.base/java.util.concurrent.atomic.AtomicLongArray.incrementAndGet(AtomicLongArray.java:234)
at org.HdrHistogram.ConcurrentHistogram.recordConvertedDoubleValue(ConcurrentHistogram.java:169)
at org.HdrHistogram.DoubleHistogram.recordSingleValue(DoubleHistogram.java:353)
at org.HdrHistogram.DoubleHistogram.recordValue(DoubleHistogram.java:294)
at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:57)
...
3rd occurrence - during normal operation, improved error handling:
java.lang.RuntimeException: Failed to record 0.012 in bucket org.HdrHistogram.ConcurrentDoubleHistogram@ed824791
encodeIntoCompressedByteBuffer: DHISTwAAAAMAAAAAAAAAAhyEkxQAAAAieNqTaZkszMDAwMIAAcxQmhFCyf+32wBhMa0VYAIAUp8EHA==
at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:65)
at io.prometheus.client.Summary$Child.observe(Summary.java:264)
...
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 4644 out of bounds for length 4096
at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2011)
at java.base/java.lang.invoke.VarHandle$1.apply(VarHandle.java:2008)
at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:159)
at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:156)
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:62)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
at java.base/java.lang.invoke.VarHandleLongs$Array.getAndAdd(VarHandleLongs.java:721)
at java.base/java.lang.invoke.VarHandleGuards.guard_LIJ_J(VarHandleGuards.java:778)
at java.base/java.util.concurrent.atomic.AtomicLongArray.incrementAndGet(AtomicLongArray.java:234)
at org.HdrHistogram.ConcurrentHistogram.recordConvertedDoubleValue(ConcurrentHistogram.java:169)
at org.HdrHistogram.DoubleHistogram.recordSingleValue(DoubleHistogram.java:353)
at org.HdrHistogram.DoubleHistogram.recordValue(DoubleHistogram.java:294)
at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:57)
...
I improved the error handling by serializing the histogram and logging the offending value:
try {
    bucket.recordValue(value);
} catch (ArrayIndexOutOfBoundsException e) {
    // Serialize the histogram state so the failure can be analyzed offline.
    ByteBuffer byteBuffer = ByteBuffer.allocate(bucket.getNeededByteBufferCapacity());
    bucket.encodeIntoCompressedByteBuffer(byteBuffer, Deflater.BEST_COMPRESSION);
    byteBuffer.flip();
    byte[] byteArray = new byte[byteBuffer.limit()];
    byteBuffer.get(byteArray);
    String base64 = Base64.getEncoder().encodeToString(byteArray);
    throw new RuntimeException(
            "Failed to record " + value + " in bucket " + bucket + "\n"
                    + "encodeIntoCompressedByteBuffer: " + base64, e);
}
I verified that the deserialized histogram is not corrupted and that the value can be recorded:
double value = ...;   // the offending value from the log
String base64 = ...;  // the serialized histogram from the log
byte[] byteArray = Base64.getDecoder().decode(base64);
ByteBuffer byteBuffer = ByteBuffer.wrap(byteArray);
DoubleHistogram bucket = ConcurrentDoubleHistogram.decodeFromCompressedByteBuffer(
        byteBuffer, ConcurrentHistogram.class, 2);
bucket.setAutoResize(true);
bucket.recordValue(value); // succeeds on the deserialized copy
It looks like increasing the initial dynamic range avoids the issue, since it reduces auto-resizing and value shifting, but larger measurements that still trigger a resize or a value shift might still fail.
prometheus/client_java#484 (comment)
We could swallow the exception and drop the measurement, so this is not a showstopper for our use cases, but I'm not sure whether that is acceptable for others, especially for use cases other than application metrics.
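A minimal sketch of that workaround, using a hypothetical `Recorder` interface as a self-contained stand-in for DoubleHistogram#recordValue (the interface and counter names are my own, not HdrHistogram or Prometheus API):

```java
import java.util.concurrent.atomic.LongAdder;

public class DropOnFailure {
    // Stand-in for DoubleHistogram#recordValue, so the sketch is self-contained.
    interface Recorder { void recordValue(double v); }

    static final LongAdder dropped = new LongAdder();

    // Swallow the rare AIOOBE, count the dropped sample, and keep serving
    // traffic instead of failing the observed request.
    static void safeRecord(Recorder bucket, double value) {
        try {
            bucket.recordValue(value);
        } catch (ArrayIndexOutOfBoundsException e) {
            dropped.increment();
        }
    }

    public static void main(String[] args) {
        // Simulate the failure mode seen in production.
        Recorder flaky = v -> {
            throw new ArrayIndexOutOfBoundsException("Index 4644 out of bounds for length 4096");
        };
        safeRecord(flaky, 0.012);
        System.out.println("dropped=" + dropped.sum());
    }
}
```

The trade-off is a tiny, counted loss of samples rather than a failed observation path, which seems tolerable for application metrics but perhaps not for other uses.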