Structure-Aware Tokenization for JSON
The Next Stage in Scaling AI is to Stop the Waste
Every time GPT-4 processes a JSON response, it’s doing something quietly wasteful: it’s re-learning what a curly brace is. (Kind of.) Not once. Every single time it sees one. GPT-4 is a legacy model now, but the cl100k_base vocabulary it popularized was and still is foundational to many current architectures, and tiktoken (an implementation of it) is damned near universal.
Back to the efficiency issue: it exists because the tokenizers behind today’s large language models -- Claude, Gemini, ChatGPT -- were designed for English prose, or more generally, for prose of any kind. They treat JSON like it’s a weird paragraph.
The result: structural characters get split unpredictably, the same field names get re-encoded from scratch in every object, and your token budget bleeds out on syntax that carries zero information.
I wanted to see if we could do better, so I built a tokenizer that actually understands JSON. So what problem did I solve, exactly?
The problem in one sentence
General-purpose tokenizers waste tokens on the parts of JSON that are completely predictable.
I have been wondering why everyone out there was okay with this. The answer, I believe, is that our LARGE language models all try to be general-purpose, so they need a general-purpose tokenizer to handle so many use cases -- but this is not just painful inefficiency, it’s expensive.
I believe the next stage in scaling AI is to STOP THE WASTE.
Think about a typical API response:
500 user objects, each with the same 12 fields -- name, email, created_at, is_active, and so on. A standard tokenizer encodes “created_at” from scratch every time it appears. That’s 3-4 tokens per occurrence, times 500 objects, times 12 fields.
Thousands of tokens spent on information that could be represented once.
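The arithmetic above can be made concrete. A back-of-the-envelope sketch (the 3-tokens-per-key figure is the illustrative estimate from the scenario, not a measured benchmark):

```python
# Rough cost of re-encoding repeated JSON keys, per the scenario above.
# Assumes ~3 tokens per key occurrence -- typical for snake_case names
# under a prose-trained BPE, but an estimate, not a measurement.

objects = 500        # user objects in the response
fields = 12          # identical keys per object
tokens_per_key = 3   # cost of one key under a general-purpose tokenizer

general = objects * fields * tokens_per_key   # every key re-encoded from scratch
structured = objects * fields * 1             # one dedicated token per key

print(general)               # 18000 tokens spent on keys alone
print(structured)            # 6000 with a learned key vocabulary
print(general - structured)  # 12000 tokens saved in one response
```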
And it’s not just keys. The tokenizer might encode { as one token in one context and merge it with a preceding newline in another. There’s no consistency -- because the tokenizer doesn’t know it’s looking at JSON.
What if the tokenizer understood JSON grammar?
That’s the core idea behind json-tokenizer. Instead of treating JSON as flat text, it splits the work into three tiers:
Tier 1: Structural tokens. Every grammar character -- {, }, [, ], :, ,, true, false, null -- gets its own dedicated token. Always one token, always the same token, regardless of context. No more splitting curly braces across token boundaries.
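A minimal sketch of what Tier 1 amounts to (the names and IDs here are my own illustration, not the library’s internals): a fixed table mapping each grammar symbol to a reserved token ID, so encoding never depends on context.

```python
# Tier 1 sketch: one fixed, dedicated ID per JSON grammar symbol
# (the IDs are hypothetical -- the point is that they never vary).
STRUCTURAL_VOCAB = {
    "{": 0, "}": 1, "[": 2, "]": 3, ":": 4, ",": 5,
    "true": 6, "false": 7, "null": 8,
}

def encode_structural(symbol: str) -> int:
    """Return the dedicated token ID for a grammar symbol.

    Unlike a BPE tokenizer, the result never depends on surrounding
    context -- '{' is the same token whether it follows a newline,
    a comma, or anything else.
    """
    return STRUCTURAL_VOCAB[symbol]

print(encode_structural("{"))     # 0, always
print(encode_structural("null"))  # 8, always
```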
Tier 2: A learned key vocabulary. During training, the tokenizer counts which JSON keys appear most frequently in your data. Keys like “created_at” or “coordinates” that would normally cost 3-4 tokens each get compressed into a single token. Train it on your API schema, and every one of those 500 identical field names costs exactly 1 token instead of 3.
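Tier 2 training can be sketched in a few lines (again my own illustration of the idea, assuming key frequencies are counted recursively across a sample corpus; the function name and IDs are hypothetical):

```python
import json
from collections import Counter

def learn_key_vocab(documents, max_keys=46, base_id=9):
    """Count JSON keys across a corpus and give the most frequent
    ones their own single-token IDs (IDs here are illustrative)."""
    counts = Counter()

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                counts[key] += 1
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    for doc in documents:
        walk(json.loads(doc))

    # Frequent keys get single-token IDs; rare keys fall through to BPE.
    return {key: base_id + i
            for i, (key, _) in enumerate(counts.most_common(max_keys))}

docs = ['{"created_at": "2024-01-01", "name": "a"}',
        '{"created_at": "2024-01-02", "name": "b"}']
vocab = learn_key_vocab(docs)
print(vocab["created_at"])  # one token ID instead of 3-4 BPE tokens
```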
Tier 3: BPE for values only. The actual content -- strings, numbers, UUIDs -- still uses byte-pair encoding (the same technique behind GPT-4’s tokenizer). But now it’s trained specifically on JSON value distributions instead of English prose.
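Tier 3 is ordinary byte-pair encoding, just trained on value text instead of prose. The core training loop is textbook BPE (this sketch is the standard algorithm, not the package’s actual code): repeatedly merge the most frequent adjacent symbol pair.

```python
from collections import Counter

def train_bpe(values, num_merges):
    """Textbook BPE training on JSON value strings: start from single
    characters and greedily merge the most frequent adjacent pair."""
    corpus = [list(v) for v in values]  # each value as a list of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        # Replace every occurrence of the winning pair with a merged symbol.
        for symbols in corpus:
            i = 0
            while i < len(symbols) - 1:
                if (symbols[i], symbols[i + 1]) == best:
                    symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
                else:
                    i += 1
    return merges

# Timestamps and UUID-like values share substrings a prose BPE never learns.
merges = train_bpe(["2024-01-01T00:00:00Z", "2024-01-02T12:30:00Z"], 8)
print(merges[0])  # the first (most frequent) pair merged
```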
The entire vocabulary is about 1,100 to 4,100 tokens. GPT-4’s tokenizer has 100,256. That’s roughly 90x smaller.
Does it actually work? Yes, but always use the right tool for the job:
I tested it against cl100k_base (GPT-4’s tokenizer) on five real-world dataset types:
GeoJSON geographic data: 7.8% fewer tokens
Observability telemetry logs: 5.5% fewer tokens
Configuration files: 12-15% fewer tokens
Kubernetes manifests: roughly break-even
Instruction-tuning data (mostly English text): 26% more tokens
That last number matters. This is not a universal improvement.
On JSON that’s mostly paragraphs of English -- like training datasets for chatbots -- a 100K-token vocabulary trained on billions of English words will always win.
The json-tokenizer is purpose-built for machine-to-machine JSON: API responses, log pipelines, config files, structured LLM outputs.
The results hold up across five random seeds with standard deviation below 0.01%.
An Unexpected Finding
Here’s what I didn’t expect: with only 558 total tokens (46 learned keys + 512 BPE subwords), the json-tokenizer already beats GPT-4’s 100,256-token vocabulary on structured data.
That means the compression comes almost entirely from the key vocabulary -- not from having a bigger BPE. Doubling the BPE vocabulary from 2,000 to 4,000 tokens only adds 0.9% in savings. The key vocabulary does the heavy lifting.
This makes intuitive sense. In a schema-repetitive JSON payload, the keys are the most redundant component. They repeat identically in every object. A general-purpose tokenizer can’t exploit that repetition because it doesn’t know what a “key” is. It just sees bytes.
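You can measure that redundancy directly. A quick sketch (my own measurement idea, not from the paper) that asks what fraction of a payload’s serialized bytes is spent on object keys:

```python
import json

def key_byte_fraction(payload):
    """Fraction of a JSON payload's serialized bytes spent on object keys.
    A rough proxy for how much a learned key vocabulary can compress."""
    key_bytes = 0

    def walk(node):
        nonlocal key_bytes
        if isinstance(node, dict):
            for key, value in node.items():
                key_bytes += len(key.encode()) + 2  # key text plus quotes
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(payload)
    total = len(json.dumps(payload).encode())
    return key_bytes / total

# 500 identical-schema objects, as in the API-response example above.
users = [{"name": "u", "email": "u@x.io", "created_at": "2024-01-01",
          "is_active": True} for _ in range(500)]
print(round(key_byte_fraction(users), 2))  # 0.44 -- nearly half the bytes are keys
```

The exact fraction depends on how long the values are, but for short-valued, schema-repetitive payloads like this one, keys dominate.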
Building things always teaches lessons that aren’t obvious from the documentation.
AI Product Engineering Impact
If you’re running LLM pipelines that process structured JSON -- and most production systems are -- tokens directly equal cost. Every API call, every log line, every function-call response burns tokens.
A few places where 5-15% savings compound fast:
LLM structured output. Function calling and tool-use responses follow fixed schemas. The same keys repeat in every response. This is the ideal case.
API response caching. Paginated responses are arrays of identical-schema objects. Batch encoding pushes savings to 9.3%.
Observability at scale. Billions of JSON log lines per day. Even 5% adds up to meaningful infrastructure savings.
What This Is NOT: An Honest Disclosure on Appropriate Use Cases
This tokenizer will not replace something like tiktoken for general text. It’s slower (pure Python vs. Rust -- about 23-256x slower for encoding). It requires training on your target schema. And it has a real cost: type-prefix tokens ([STR], [NUM]) consume 18-21% of the token budget to guarantee lossless roundtrip decoding.
The honest framing: this is a specialized tool for a specific, high-volume use case. It works well when your JSON has a predictable schema and repeating keys. It works poorly when your JSON is mostly English paragraphs.
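The type-prefix cost is easy to see in a sketch. The [STR] and [NUM] marker names come from the article; the encoding layout below is my own illustration of why such markers are needed for a lossless roundtrip (e.g. to tell 1 apart from "1"):

```python
# Sketch of lossless value encoding with type-prefix tokens.
# [STR]/[NUM] are the article's marker names; the layout is illustrative.

def encode_value(value):
    """Prefix each value with a type marker so decoding can restore
    the original JSON type exactly."""
    if isinstance(value, bool):          # bools are structural tokens
        return [str(value).lower()]
    if isinstance(value, (int, float)):
        return ["[NUM]", str(value)]
    return ["[STR]", str(value)]

def decode_value(tokens):
    """Invert encode_value -- the prefix tells us which type to rebuild."""
    if tokens[0] == "[NUM]":
        text = tokens[1]
        return float(text) if "." in text else int(text)
    if tokens[0] == "[STR]":
        return tokens[1]
    return tokens[0] == "true"

for original in [42, 3.5, "42", True]:
    assert decode_value(encode_value(original)) == original
print("lossless roundtrip")  # each prefix costs one extra token per value
```

That per-value marker is exactly where the 18-21% overhead comes from: correctness is paid for in token budget.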
Try it yourself
The tokenizer is open source, zero dependencies, and includes a Hugging Face Transformers-compatible wrapper so you can drop it into existing pipelines:
pip install json-tokenizer
The paper, code, and benchmarks are all at: github.com/anthony-maio/json-tokenizer
The included wrapper is compatible with transformers v5.3, and you can find the tokenizer on the Hub at https://huggingface.co/anthonym21/json-tokenizer-structured
If you’re processing structured JSON at scale and want to see what the savings look like on your data -- train it on a sample of your payloads and run the benchmark. The CLI makes it straightforward.
I’d be curious to hear from anyone running high-volume JSON pipelines. What does your token budget look like? Is tokenizer efficiency even on your radar? Would you use this if it meant saving 10-15% of your AI costs?
With my design, research, and specifications, I was able to hand off implementation boilerplate and testing duties to my team of coding agents and bring this from draft specification to release in 6 hours. That includes writing this article. Imagine what happens if we start investing more in efficiency than in parameter count.
The full paper, “Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary,” is available on Zenodo (DOI: 10.5281/zenodo.18879110) with the open-source implementation on GitHub and Hugging Face. Anthony Maio is an independent AI researcher who has operated his own consultancy for the past 18 months and is now seeking full-time collaboration/employment with a company doing innovative work in the AI space. anthony@making-minds.ai