SliceUtf8 performance optimizations #192

Merged
dain merged 15 commits into master from user/dain/sliceutf8-perf on Mar 13, 2026
Conversation

@dain (Member) commented Mar 9, 2026

Summary

  • Refactored core implementations to operate on byte[] + offset + length, with Slice overloads delegating.
  • Added/expanded ASCII fast paths across key algorithms.
  • Reduced repeated decode work in loop-heavy code paths.
  • Added new UTF-8/code-point conversion helpers for Trino-style usage.
  • Expanded JMH coverage for existing methods and Trino-representative loops.
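The first bullet can be pictured with a minimal sketch of the "byte[] first" delegation pattern. The method name `countAscii` and the bounds check are illustrative only, not the actual SliceUtf8 code:

```java
// Sketch: a public Slice overload delegates to a byte[]-range internal.
final class DelegationSketch
{
    // Hypothetical Slice entry point (shown as a comment, since Slice is
    // not available in this standalone sketch):
    //   public static int countAscii(Slice slice) {
    //       return countAscii(slice.byteArray(), slice.byteArrayOffset(), slice.length());
    //   }

    // byte[]-range internal: validate the range once at the API boundary,
    // then keep the inner loop lean so the JIT can hoist bounds checks.
    static int countAscii(byte[] utf8, int offset, int length)
    {
        if (offset < 0 || length < 0 || offset + length > utf8.length) {
            throw new IndexOutOfBoundsException("range [" + offset + ", " + (offset + length) + ") out of bounds");
        }
        int count = 0;
        for (int i = offset; i < offset + length; i++) {
            if (utf8[i] >= 0) {  // ASCII bytes are 0x00..0x7F, i.e. non-negative as signed bytes
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args)
    {
        byte[] utf8 = "h\u00e9llo".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(countAscii(utf8, 0, utf8.length));  // 4 of the 6 bytes are ASCII
    }
}
```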

High-level optimization approaches

  • byte[] first internals: better JVM bounds-check hoisting and easier raw-array integration.
  • ASCII specialization: skip full decode work when all bytes are ASCII.
  • SWAR/chunked scanning where applicable (long/int lanes via var handles) to skip equal ASCII regions quickly.
  • Fewer passes over data: APIs/helpers that decode once and reuse derived results.
  • Explicit API boundary validation with inner loops kept lean.
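The ASCII specialization and SWAR-style chunked scan above can be sketched as follows; `AsciiScanSketch` and `isAscii` are illustrative names, and the real implementation may differ in tail handling and var-handle setup:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Sketch: test 8 bytes at a time by reading them as a long and checking
// all eight high bits at once; a chunk is pure ASCII iff none is set.
final class AsciiScanSketch
{
    private static final VarHandle LONG_HANDLE =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    static boolean isAscii(byte[] bytes, int offset, int length)
    {
        int i = offset;
        // SWAR loop over 8-byte lanes.
        for (; i + Long.BYTES <= offset + length; i += Long.BYTES) {
            long chunk = (long) LONG_HANDLE.get(bytes, i);
            if ((chunk & 0x8080808080808080L) != 0) {
                return false;  // some byte has its top bit set => non-ASCII
            }
        }
        // Scalar tail for the remaining 0..7 bytes.
        for (; i < offset + length; i++) {
            if (bytes[i] < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args)
    {
        System.out.println(isAscii("hello world".getBytes(java.nio.charset.StandardCharsets.US_ASCII), 0, 11));
    }
}
```

Plain `get` on a byte-array view var handle permits unaligned access, which is why the chunked loop can start at any offset.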

New APIs

  • toCodePoints(byte[] utf8, int offset, int length)
  • fromCodePoints(int[] codePoints, int offset, int length)
  • codePointByteLengths(byte[] utf8, int offset, int length)
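The PR text gives the signatures but not the return types. As an illustration of what `codePointByteLengths` computes (per-code-point UTF-8 byte widths, 1..4), here is a simplified standalone reimplementation that assumes valid UTF-8 input; it is not the library code:

```java
// Sketch: derive each code point's encoded width from its UTF-8 lead byte.
final class CodePointWidthSketch
{
    static int[] codePointByteLengths(byte[] utf8, int offset, int length)
    {
        int[] widths = new int[length];  // over-allocated; trimmed below
        int count = 0;
        int i = offset;
        while (i < offset + length) {
            int b = utf8[i] & 0xFF;
            int width;
            if (b < 0x80) { width = 1; }        // 0xxxxxxx
            else if (b < 0xE0) { width = 2; }   // 110xxxxx
            else if (b < 0xF0) { width = 3; }   // 1110xxxx
            else { width = 4; }                 // 11110xxx
            widths[count++] = width;
            i += width;
        }
        return java.util.Arrays.copyOf(widths, count);
    }

    public static void main(String[] args)
    {
        // "a", U+00E9, U+20AC encode as 1, 2, and 3 bytes respectively.
        byte[] utf8 = "a\u00e9\u20ac".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(java.util.Arrays.toString(codePointByteLengths(utf8, 0, utf8.length)));
    }
}
```

A caller can sum the widths to re-derive byte positions, which is the "decode once, reuse derived results" pattern the PR describes for padding and loop planning.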

Benchmark highlights (JMH)

Most results below are for length=1000 code points unless noted.

  • benchmarkCompareUtf16BE

    • ASCII: 3.483 -> 0.102 ns/codepoint (~34x)
    • non-ASCII: 8.214 -> 6.395 ns/codepoint (~1.28x)
  • benchmarkToLowerCase

    • ASCII: 3.029 -> 0.501 ns/codepoint (~6.0x)
    • non-ASCII: 7.145 -> 4.183 ns/codepoint (~1.71x)
  • benchmarkToUpperCase

    • ASCII: 3.053 -> 0.601 ns/codepoint (~5.1x)
    • non-ASCII: 7.254 -> 5.019 ns/codepoint (~1.45x)
  • benchmarkTrimCustom

    • ASCII: 2.702 -> 0.474 ns/codepoint (~5.7x)
    • non-ASCII: 5.224 -> 4.329 ns/codepoint (~1.21x)
  • benchmarkLeftTrim

    • ASCII: 1.919 -> 0.344 ns/codepoint (~5.6x)
    • non-ASCII: 3.137 -> 2.201 ns/codepoint (~1.42x)
  • benchmarkRightTrim

    • ASCII: 0.551 -> 0.359 ns/codepoint (~1.53x)
    • non-ASCII: 2.939 -> 2.534 ns/codepoint (~1.16x)
  • benchmarkToCodePointsApi (ns/byte)

    • ASCII: 2.4902 -> 0.2319 (~10.7x vs two-pass baseline)
    • non-ASCII: 1.6643 -> 1.0820 (~1.54x vs two-pass baseline)
  • benchmarkFromCodePointsApi

    • ASCII: 0.500 -> 0.326 ns/codepoint (~1.53x)
    • non-ASCII: 3.230 -> 2.062 ns/codepoint (~1.57x)
  • benchmarkFixInvalidUtf8WithoutReplacement (inputLength=1024, ns/byte)

    • valid non-ASCII: 6.341 -> 3.978 (~1.59x)
    • invalid non-ASCII: 6.242 -> 4.549 (~1.37x)
  • benchmarkReverse

    • ASCII: 0.318 -> 0.067 ns/codepoint (~4.7x)
    • non-ASCII: 3.397 -> 3.406 ns/codepoint (flat/noise)
  • codePointByteLengths helper benchmark (length=128)

    • ASCII: 1.020 -> 0.696 ns/codepoint (~1.47x)
    • non-ASCII: 3.596 -> 2.129 ns/codepoint (~1.69x)

Small-string sanity (tail paths)

A dedicated JMH sanity pass was run at lengths that are not multiples of 8 (7 and 31), with ascii=true,false, for:
compareUtf16BE, toLowerCase, toUpperCase, trimCustom, toCodePointsApi, and fromCodePointsApi.

  • compareUtf16BE:
    • ASCII 7.332 / 10.781 ns/op (len=7 / 31)
    • non-ASCII 41.308 / 193.126 ns/op
  • fromCodePointsApi:
    • ASCII 7.702 / 14.937 ns/op
    • non-ASCII 25.337 / 64.678 ns/op
  • toCodePointsApi:
    • ASCII 5.843 / 12.206 ns/op
    • non-ASCII 29.565 / 123.929 ns/op
  • toLowerCase:
    • ASCII 12.790 / 29.095 ns/op
    • non-ASCII 27.257 / 124.587 ns/op
  • toUpperCase:
    • ASCII 7.894 / 23.449 ns/op
    • non-ASCII 36.579 / 124.978 ns/op
  • trimCustom:
    • ASCII 18.294 / 29.761 ns/op
    • non-ASCII 54.013 / 170.847 ns/op

Conclusion: no obvious small-string regressions; short-input behavior is consistent with expected fixed-overhead effects.

@dain dain requested review from electrum and wendigo March 9, 2026 17:00
@wendigo (Contributor) commented Mar 10, 2026

No regressions in Trino; slight CPU improvement (~1%) on TPCH/TPCDS.

Copilot AI left a comment


Pull request overview

This PR optimizes SliceUtf8’s core UTF-8 routines by shifting internal implementations to operate on byte[] + offset + length, adding ASCII fast paths and chunked scanning, and introducing new UTF-8↔code-point helper APIs to better support loop-heavy “Trino-style” usage patterns.

Changes:

  • Refactors multiple SliceUtf8 operations to use byte[] range overloads (with Slice overloads delegating) and adds new conversion helpers (toCodePoints, fromCodePoints, codePointByteLengths).
  • Adds/expands ASCII-optimized paths and reduces repeated decode work in key algorithms (e.g., compare, trim, case conversion, reverse).
  • Expands verification via new unit tests and significantly broadens JMH benchmarks to cover the new/optimized paths.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

  • src/main/java/io/airlift/slice/SliceUtf8.java: Refactors implementations to byte[]-range internals, adds ASCII/chunk fast paths, and introduces new code-point conversion/length helper APIs.
  • src/test/java/io/airlift/slice/TestSliceUtf8.java: Adds tests validating byte[] overload equivalence and correctness of the new code-point APIs, plus extra range-validation coverage.
  • src/test/java/io/airlift/slice/SliceUtf8Benchmark.java: Extends JMH coverage with Trino-representative loops and byte[]-range benchmark variants to validate the performance improvements.


dain added 14 commits March 13, 2026 12:48
Unwrapping Slice makes it easier to see what is happening in these
algorithms and easier to optimize. Additionally, it makes these
functions usable without having to wrap the data in a Slice.
Benchmark (benchmarkCompareUtf16BE, length=1000):

- ascii=true: 3.483 -> 0.102 ns/codepoint

- ascii=false: 8.214 -> 6.395 ns/codepoint
Benchmark (benchmarkReverse, length=1000):

- ascii=true: 0.318 -> 0.067 ns/codepoint

- ascii=false: 3.397 -> 3.406 ns/codepoint (flat within noise)
Benchmark (benchmarkToUpperCase, length=1000):

- ascii=true: 3.053 -> 0.601 ns/codepoint

- ascii=false: 7.254 -> 5.019 ns/codepoint
Benchmark (benchmarkToLowerCase, length=1000):

- ascii=true: 3.029 -> 0.501 ns/codepoint

- ascii=false: 7.145 -> 4.183 ns/codepoint
Benchmark (benchmarkFixInvalidUtf8WithoutReplacement, inputLength=1024):

- valid_non_ascii: 6.341 -> 3.978 ns/byte

- invalid_non_ascii: 6.242 -> 4.549 ns/byte
Benchmark (benchmarkLeftTrim, length=1000):

- ascii=true: 1.919 -> 0.344 ns/codepoint

- ascii=false: 3.137 -> 2.201 ns/codepoint
Benchmark (benchmarkRightTrim, length=1000):

- ascii=true: 0.551 -> 0.359 ns/codepoint

- ascii=false: 2.939 -> 2.534 ns/codepoint
Benchmark (benchmarkTrimCustom, length=1000):

- ascii=true: 2.702 -> 0.474 ns/codepoint

- ascii=false: 5.224 -> 4.329 ns/codepoint
Benchmark (benchmarkSetCodePointAt, length=1000):

- ascii=true: 0.336 -> 0.332 ns/codepoint

- ascii=false: 2.259 -> 2.334 ns/codepoint

Related benchmark (benchmarkCodePointToUtf8, length=1000):

- ascii=false: 2.404 -> 2.154 ns/codepoint
Useful for Trino VARCHAR->code points casts and similar decode loops.

Benchmark (ns/byte, length=1000):

- toCodePointsApi ascii: 0.2319 (baseline two-pass: 2.4902)

- toCodePointsApi non-ascii: 1.0820 (baseline two-pass: 1.6643)
Adds fromCodePoints to encode code-point arrays directly into UTF-8
Slice output. This is useful for Trino-style loops that currently
pre-size and encode with repeated setCodePointAt calls.

Benchmark (SliceUtf8Benchmark, length=1000 code points):

- ascii=true: fromCodePointsApi 0.326 ns/codepoint vs Trino baseline 0.500 ns/codepoint

- ascii=false: fromCodePointsApi 2.062 ns/codepoint vs Trino baseline 3.230 ns/codepoint
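As an illustration of the single-pass encoding loop such an API performs, here is a simplified standalone sketch. The real fromCodePoints produces Slice output; this version returns a plain `byte[]`, handles only valid code points, and omits surrogate/range validation:

```java
// Sketch: encode code points straight to UTF-8 bytes, widest-first branch order.
final class FromCodePointsSketch
{
    static byte[] fromCodePoints(int[] codePoints, int offset, int length)
    {
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        for (int i = offset; i < offset + length; i++) {
            int cp = codePoints[i];
            if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
                out.write(cp);
            }
            else if (cp < 0x800) {                 // 2 bytes: 110xxxxx 10xxxxxx
                out.write(0xC0 | (cp >>> 6));
                out.write(0x80 | (cp & 0x3F));
            }
            else if (cp < 0x10000) {               // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                out.write(0xE0 | (cp >>> 12));
                out.write(0x80 | ((cp >>> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
            else {                                 // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                out.write(0xF0 | (cp >>> 18));
                out.write(0x80 | ((cp >>> 12) & 0x3F));
                out.write(0x80 | ((cp >>> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args)
    {
        byte[] utf8 = fromCodePoints(new int[] {0x61, 0xE9, 0x20AC}, 0, 3);
        System.out.println(new String(utf8, java.nio.charset.StandardCharsets.UTF_8));
    }
}
```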
Adds codePointByteLengths so callers can decode UTF-8 once and directly
materialize per-code-point byte widths (1..4) for padding/loop planning.

Benchmark (SliceUtf8Benchmark, length=128 code points):

- ascii=true: helper(byte[]) 0.696 ns/codepoint vs Trino byte[] baseline 1.020 ns/codepoint

- ascii=false: helper(byte[]) 2.129 ns/codepoint vs Trino byte[] baseline 3.596 ns/codepoint
@dain dain force-pushed the user/dain/sliceutf8-perf branch from 605b373 to 2a36dd8 on March 13, 2026 at 19:49
@dain dain merged commit aa8a4d1 into master Mar 13, 2026
2 checks passed
