Skip to content

Improve decode trie size#1944

Merged
fb55 merged 6 commits intomainfrom
decode-size
Sep 3, 2025
Merged

Improve decode trie size#1944
fb55 merged 6 commits intomainfrom
decode-size

Conversation

@fb55
Copy link
Owner

@fb55 fb55 commented Sep 2, 2025

Reduced length of the encoded trie by 21% for HTML (from 15,242 words (uint16) to 12075), by:

  • Adding a semicolon flag on value nodes, which means semicolons don't have to be encoded anymore.
  • Compacting dict keys: two keys are now stored in each word, instead of one
  • Compacting runs of 3 or more characters: successive characters are now also stored as two characters per word

Also reduces the size of the encoded table by 32% (after the previous change) by encoding it as base64.

Reduced length of the encoded trie by 21% for HTML (from 15,242 words (uint16) to 12075), by:
- Adding a semicolon flag on value nodes, which means semicolons don't have to be encoded anymore.
- Compacting dict keys: two keys are now stored in each word, instead of one
- Compacting runs of 3 or more characters: successive characters are now also stored as two characters per word

Also reduces the size of the encoded table by 32% (after the previous change) by encoding it as base64.
@fb55 fb55 requested a review from Copilot September 2, 2025 22:13

This comment was marked as outdated.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@fb55 fb55 requested a review from Copilot September 2, 2025 22:19

This comment was marked as outdated.

@fb55 fb55 requested a review from Copilot September 2, 2025 22:44
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR significantly improves the encoding efficiency of the HTML entity decode trie by implementing three main optimization strategies, resulting in a 21% reduction in trie size (from 15,242 to 12,075 words) and a 32% reduction in the encoded table size through base64 encoding.

  • Added semicolon flag handling to eliminate explicit semicolon branches for strict entities
  • Implemented compact dictionary keys storing two keys per word instead of one
  • Added compact runs for sequences of 3+ consecutive single-child nodes

Reviewed Changes

Copilot reviewed 10 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/internal/bin-trie-flags.ts Defines bit flags and masks for the new binary trie encoding format
src/decode.ts Updates entity decoder with new trie format support including compact runs and semicolon flags
scripts/write-decode-map.ts Switches from string-based encoding to base64-encoded Uint16Array for smaller output
scripts/trie/trie.ts Adds semicolon requirement tracking and prevents empty next maps for leaf nodes
scripts/trie/encode-trie.ts Implements compact run encoding and packed dictionary key storage
scripts/trie/encode-trie.spec.ts Updates tests for new packed dictionary format
scripts/trie/decode-trie.ts Adds decode support for compact runs and packed keys
scripts/trie/decode-trie.spec.ts Updates merge logic to handle strict vs legacy entity semantics
scripts/trie/compact-run.spec.ts Adds comprehensive tests for compact run encoding/decoding
scripts/trie/README.md Documents the new encoding format including compact runs and semicolon handling

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@fb55 fb55 merged commit 9e6fa85 into main Sep 3, 2025
15 checks passed
@fb55 fb55 deleted the decode-size branch September 3, 2025 06:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants