
feat: use codegen instead of build.rs#2

Merged: Boshen merged 1 commit into main from codegen on May 28, 2024

Boshen commented May 28, 2024

Member

No description provided.

Boshen merged commit 620e60a into main May 28, 2024
Boshen deleted the codegen branch May 28, 2024 14:53

Boshen commented May 28, 2024

Member Author

In oxc, compile time was 7.5s plus 1.5s for build.rs (9.0s total) before this change, and 8.2s after.

Boshen added a commit that referenced this pull request May 29, 2026
…pair indices (#690)

## Summary

Applies **byte-plane (stream-split) compression** to the region
`(browser, version)` pair-index blob — the single largest bundled data
blob.

Each region's pair indices are `u16` values, but there are only ~557
distinct pairs, so the high byte is almost always `0`. Instead of
postcard-varint-encoding interleaved `u16`s per region, the codegen now
writes **all the low bytes, then all the high bytes**, then deflates.
The high-byte plane collapses to near nothing, and isolating it from the
high-entropy low byte lets deflate model each stream far better than the
interleaved varint stream did.

The reader splits the decompressed blob at `len / 2` and recombines `lo
| hi << 8` — no postcard deserialization for this blob anymore.
`PAIR_RANGES` switches from byte offsets to element offsets (cumulative
datum counts).
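
The layout described above can be sketched in a few lines of Rust. This is a minimal illustration, not the crate's actual codegen or reader: the function names (`concat_regions`, `to_byte_planes`, `from_byte_planes`, `region_from_planes`) are hypothetical, the real shape of `PAIR_RANGES` is assumed to be a list of element ranges, and the deflate step is left out so the sketch needs only the standard library.

```rust
use std::ops::Range;

/// Concatenate per-region pair indices and record each region as a range of
/// element offsets (cumulative datum counts), not byte offsets.
/// Hypothetical stand-in for how `PAIR_RANGES` might be produced.
fn concat_regions(regions: &[Vec<u16>]) -> (Vec<u16>, Vec<Range<usize>>) {
    let mut all = Vec::new();
    let mut ranges = Vec::with_capacity(regions.len());
    for region in regions {
        let start = all.len();
        all.extend_from_slice(region);
        ranges.push(start..all.len());
    }
    (all, ranges)
}

/// Writer side: all low bytes first, then all high bytes.
/// In the real pipeline this buffer would then be deflated.
fn to_byte_planes(values: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(values.len() * 2);
    out.extend(values.iter().map(|v| (v & 0xff) as u8));
    out.extend(values.iter().map(|v| (v >> 8) as u8));
    out
}

/// Reader side: split the (already inflated) blob at `len / 2` and
/// recombine each element as `lo | hi << 8`.
fn from_byte_planes(blob: &[u8]) -> Vec<u16> {
    let (lo, hi) = blob.split_at(blob.len() / 2);
    lo.iter()
        .zip(hi)
        .map(|(&l, &h)| u16::from(l) | (u16::from(h) << 8))
        .collect()
}

/// Decode a single region using its element range. The same index addresses
/// both planes, which is why element offsets are the natural unit here.
fn region_from_planes(blob: &[u8], range: Range<usize>) -> Vec<u16> {
    let (lo, hi) = blob.split_at(blob.len() / 2);
    range
        .map(|i| u16::from(lo[i]) | (u16::from(hi[i]) << 8))
        .collect()
}
```

With two regions `[1, 556, 2]` and `[300, 0]`, the high-byte plane is `[0, 2, 0, 1, 0]`: mostly zeros, which is the property the compressor benefits from.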

## Results (lossless)

| | Before | After | Δ |
|---|--:|--:|--:|
| pair-index blob | 47,602 | 44,567 | **−3,035** |
| Linux musl example binary | 782,048 | 778,528 | **−3,520** |

The binary shrinks slightly more than the blob because the `u16`
postcard decode path is dropped. macOS file size is unchanged (16 KB
page quantization swallows the sub-page win), but the `.rodata` is 3 KB
smaller — which helps consumers that link this crate into a larger
binary.

## Why this is the remaining clean win

An entropy analysis across ~15 candidate encodings showed:
- The **percentages** blob (#2) already sits at its order-0 entropy
floor — delta/raw/byte-plane/columnar/first-value-split all fail to beat
the current delta-varint.
- The pair-index byte-plane (44,567) is already *below* the order-0
symbol entropy floor (48,944) because deflate exploits cross-region
repetition, so an order-0 arithmetic coder would be worse.
- MTF, delta+zigzag, and columnar transpose all *hurt*.
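
For readers unfamiliar with the term, the "order-0 entropy floor" in the list above is the Shannon entropy of the symbol frequencies alone, ignoring any context or repetition. A minimal sketch of that calculation follows; `order0_entropy_bytes` is a hypothetical helper, not the analysis script that produced the numbers in this commit.

```rust
use std::collections::HashMap;

/// Order-0 Shannon entropy of a symbol stream, expressed as a total size in
/// bytes: sum over distinct symbols of -count * log2(count / n), divided by 8.
/// This is the best any coder can do if it only models symbol frequencies.
fn order0_entropy_bytes(symbols: &[u16]) -> f64 {
    if symbols.is_empty() {
        return 0.0;
    }
    // Count occurrences of each distinct symbol.
    let mut counts: HashMap<u16, usize> = HashMap::new();
    for &s in symbols {
        *counts.entry(s).or_insert(0) += 1;
    }
    let n = symbols.len() as f64;
    let bits: f64 = counts
        .values()
        .map(|&c| {
            let c = c as f64;
            -c * (c / n).log2()
        })
        .sum();
    bits / 8.0
}
```

Because this figure looks only at frequencies, a compressor that also exploits repeated sequences (as deflate does via back-references) can land below it, which is consistent with the 44,567 versus 48,944 comparison above.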

Beating the current state would require an order-1+ model (BWT / range
coder / brotli), which means shipping real decoder code that a few KB of
savings won't pay for.

Verification: all 392 tests + 14 JS-fuzz proptests pass; clippy and fmt
clean; every other generated blob is byte-identical (reproducible
codegen).

🤖 Generated with [Claude Code](https://claude.com/claude-code)