feat(apr-cli): apr tokenize import-hf — §50.4 step 5g.0 #1497

Merged: noahgift merged 1 commit into main from feat/tokenize-import-hf-5g-0 on May 5, 2026
@noahgift commented May 5, 2026

Summary

Implements §50.4 step 5g.0 per the spec §54 re-scoping (PR #1496). Authors contract `apr-cli-tokenize-import-hf-v1` v1.0.0 (PARTIAL_ALGORITHM_LEVEL) and adds a new subcommand, `apr tokenize import-hf <tokenizer.json> --output <DIR>`, that extracts vocab.json + merges.txt + manifest.json from a HuggingFace BPE tokenizer.json into aprender's two-file layout.

This unblocks fine-tuning from public Qwen2.5/Llama2/Mistral checkpoints, which distribute their tokenizer as a single tokenizer.json, whereas aprender's GPT-2-style BPE loader requires the GPT-2 two-file format (vocab.json + merges.txt).

What ships

Contract (5 falsifiers, all PASS in CI):

  • FALSIFY-TOK-IMPORT-HF-001 — command in dispatch surface
  • FALSIFY-TOK-IMPORT-HF-002 — BPE input → non-empty vocab+merges
  • FALSIFY-TOK-IMPORT-HF-003 — vocab count == |model.vocab|
  • FALSIFY-TOK-IMPORT-HF-004 — merges.txt is one merge per line in order
  • FALSIFY-TOK-IMPORT-HF-005 — non-BPE input fails fast (Unigram/WordPiece)
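The output-shape falsifiers (002 to 004) can be pictured as simple predicates over the extracted files. The following is a minimal Python sketch, not the contract's actual test code: the function name `check_extraction` is hypothetical, and it assumes merges.txt begins with a `#version` header line and that source merges may be stored either as `"a b"` strings or as two-element lists.

```python
def check_extraction(model: dict, vocab_out: dict, merges_txt: str) -> dict:
    """Evaluate falsifiers 002-004 against an extracted vocab and merges.txt text."""
    lines = merges_txt.splitlines()
    # Drop the header line if present; everything after it is one merge per line.
    if lines and lines[0].startswith("#version"):
        lines = lines[1:]
    # Normalise source merges to the space-separated form used in merges.txt.
    expected = [m if isinstance(m, str) else " ".join(m) for m in model["merges"]]
    return {
        "FALSIFY-TOK-IMPORT-HF-002": bool(vocab_out) and bool(lines),
        "FALSIFY-TOK-IMPORT-HF-003": len(vocab_out) == len(model["vocab"]),
        "FALSIFY-TOK-IMPORT-HF-004": lines == expected,
    }


# Toy input standing in for tokenizer.json's "model" object.
model = {"vocab": {"a": 0, "b": 1, "ab": 2}, "merges": ["a b"]}
results = check_extraction(model, {"a": 0, "b": 1, "ab": 2}, "#version: 0.2\na b\n")
print(results)
```

Comparing the merge lines as an ordered list, rather than as a set, is what makes 004 sensitive to reordering, which matters because BPE merge priority is positional.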

Subcommand:
```
apr tokenize import-hf
  --input <FILE>            HF tokenizer.json (BPE required)
  --output <DIR>            output dir (vocab.json + merges.txt + manifest.json)
  --include-added-tokens    also emit added_tokens in vocab.json
```
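For readers unfamiliar with the HuggingFace layout, the core of the extraction can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the Rust implementation in this PR: `extract_bpe` is a hypothetical name, and the sketch assumes the parsed tokenizer.json exposes `model.type`, `model.vocab`, and `model.merges`, with merges stored either as `"a b"` strings or as two-element lists.

```python
def extract_bpe(tokenizer: dict) -> tuple:
    """Return (vocab, merge_lines) from a parsed HF tokenizer.json dict."""
    model = tokenizer.get("model", {})
    model_type = model.get("type")
    if model_type != "BPE":
        # Fail fast: a Unigram/WordPiece vocab would load but tokenize incorrectly.
        raise ValueError(f"expected model.type == 'BPE', got {model_type!r}")
    vocab = dict(model["vocab"])
    merge_lines = []
    for merge in model["merges"]:
        if isinstance(merge, str):
            merge_lines.append(merge)  # already "left right"
        else:
            left, right = merge  # pair form: ["left", "right"]
            merge_lines.append(f"{left} {right}")
    return vocab, merge_lines


# Toy tokenizer mixing both merge encodings.
sample = {
    "model": {
        "type": "BPE",
        "vocab": {"h": 0, "e": 1, "l": 2, "he": 3, "ll": 4},
        "merges": ["h e", ["l", "l"]],
    }
}
vocab, merges = extract_bpe(sample)
print(len(vocab), merges)
```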

5 unit tests cover FALSIFY-002..005 + --include-added-tokens path.

LIVE smoke (canonical input on this host)

```
$ apr tokenize import-hf --input <Qwen2.5-Coder-0.5B-Instruct/tokenizer.json> \
    --output /tmp/qwen-0.5b-tokenizer-extracted --json
{
  "bpe_vocab_count": 151643,
  "merges_count": 151387,
  "added_tokens_count": 22,
  "source_sha256": "c0382117ea329cdf...",
  ...
}
```

Files written: vocab.json (3.2 MiB), merges.txt (1.6 MiB), manifest.json (534 B). merges.txt format matches GPT-2 convention exactly. Evidence: `evidence/section-50-4-step-5g-0-import-hf-2026-05-05/live-extraction-smoke.md`.

Five Whys

  1. Why a new subcommand rather than a script? Per `feedback_stack_tool_extension_not_cli_shim.md`: when apr lacks a feature, extend apr in-tree. One-off scripts are muda.
  2. Why under `apr tokenize`? It already handles tokenizer artifacts (plan/apply/train/encode-corpus); import-hf produces the same output shape as `train` from a different source.
  3. Why default to BPE-only (excluding added_tokens)? Default keeps the BPE state machine pure; `--include-added-tokens` is the explicit opt-in.
  4. Why fail-fast on non-BPE? Silent Unigram/WordPiece extraction would produce a vocab.json the BPE loader accepts but tokenizes incorrectly. Fail-fast cites contract id.
  5. Why not also fix the polymorphic preflight gap (151643/151665 vs 151936 declared)? Separate concern — the preflight's strict `==` semantic is correct for from-scratch but too strict for Qwen-style reserved slots. §55 follow-up.
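The numbers behind item 5 can be checked with a little arithmetic. The Python sketch below uses the counts reported in the live smoke; the `relaxed_preflight` rule (tokenizer vocab may be smaller than the declared size) is only an assumption about what a relaxation could look like, since the actual §55 semantic is not part of this PR.

```python
BPE_VOCAB = 151_643       # bpe_vocab_count from the live smoke
ADDED_TOKENS = 22         # added_tokens_count from the live smoke
DECLARED_VOCAB = 151_936  # declared vocab size cited in this PR

effective = BPE_VOCAB + ADDED_TOKENS  # vocab including added tokens
reserved = DECLARED_VOCAB - effective  # ids declared but never assigned


def strict_preflight(tokenizer_vocab: int, declared: int) -> bool:
    """Current behaviour as described: sizes must match exactly."""
    return tokenizer_vocab == declared


def relaxed_preflight(tokenizer_vocab: int, declared: int) -> bool:
    """Hypothetical relaxation: allow unused reserved slots above the vocab."""
    return tokenizer_vocab <= declared


print(effective, reserved)
print(strict_preflight(effective, DECLARED_VOCAB), relaxed_preflight(effective, DECLARED_VOCAB))
```

With added tokens included the tokenizer covers 151665 ids, which leaves 271 reserved slots below the declared 151936; a strict equality check rejects both 151643 and 151665.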

Net effects

  • Contract `apr-cli-tokenize-import-hf-v1` v1.0.0 PARTIAL_ALGORITHM_LEVEL.
  • 1 new `apr` subcommand wired through all three surfaces (clap enum, dispatch, contract).
  • 5 new unit tests + 1 live smoke evidence file.
  • Unblocks 5g.1 (corpus retokenize) modulo the §55 preflight strictness follow-up.
  • MODEL-1 ship % unchanged at 91%; MODEL-2 ship % unchanged at 57% until 5g.3.

Test plan

  • PMAT pre-commit quality gates pass
  • 5/5 import_hf unit tests pass
  • 19/19 commands::tokenize tests pass (14 original + 5 new)
  • cli_commands integration tests (6/6) pass — no surface drift
  • Live smoke on Qwen2.5-Coder-0.5B-Instruct/tokenizer.json: 151643 vocab + 151387 merges extracted
  • CI gate green (workspace-test, ci/gate)
  • Auto-merge fires on green CI

🤖 Generated with Claude Code

Author contract apr-cli-tokenize-import-hf-v1 v1.0.0 PARTIAL_ALGORITHM_LEVEL
+ implement `apr tokenize import-hf <tokenizer.json> --output <DIR>`
that extracts vocab.json + merges.txt + manifest.json from a HuggingFace
BPE tokenizer.json into aprender's two-file layout.

This is the prerequisite step from §54's 5g re-scoping: aprender's
GPT-2-style BPE loader requires vocab.json + merges.txt; public
Qwen2.5/Llama2/Mistral tokenizers distribute as a single tokenizer.json.
Without import-hf, fine-tuning from these checkpoints is blocked.

## What ships

Contract `contracts/apr-cli-tokenize-import-hf-v1.yaml` v1.0.0 (5 falsifiers):
  FALSIFY-TOK-IMPORT-HF-001 — command registered in dispatch surface
  FALSIFY-TOK-IMPORT-HF-002 — BPE input produces non-empty vocab+merges
  FALSIFY-TOK-IMPORT-HF-003 — vocab count == |tokenizer.json:model.vocab|
  FALSIFY-TOK-IMPORT-HF-004 — merges.txt is one merge per line in order
  FALSIFY-TOK-IMPORT-HF-005 — non-BPE input fails fast (Unigram/WordPiece)

Subcommand `apr tokenize import-hf`:
  --input <FILE>             HF tokenizer.json (BPE required)
  --output <DIR>             output dir for vocab.json + merges.txt + manifest.json
  --include-added-tokens     also emit added_tokens (e.g., <|im_start|>) in vocab.json

Output dir layout (drop-in compatible with apr tokenize encode-corpus
and apr pretrain --tokenizer):
  vocab.json    — JSON object: token-string → integer-id
  merges.txt    — #version: 0.2 header + one space-separated merge per line
  manifest.json — provenance: source path, sha256, counts, timestamp
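As an illustration of the layout above, here is a minimal Python sketch that writes the three files. It is not the shipped Rust code: `write_layout` is a hypothetical name, and the manifest keys are assumptions modelled on the `--json` output (`source_sha256`, `bpe_vocab_count`, `merges_count`) plus a path and timestamp; the real manifest schema may differ.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_layout(src: Path, out_dir: Path, vocab: dict, merge_lines: list) -> dict:
    """Write vocab.json, merges.txt and manifest.json; return the manifest."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "vocab.json").write_text(
        json.dumps(vocab, ensure_ascii=False), encoding="utf-8"
    )
    # Header line first, then one space-separated merge per line, in source order.
    body = "#version: 0.2\n" + "".join(f"{m}\n" for m in merge_lines)
    (out_dir / "merges.txt").write_text(body, encoding="utf-8")
    manifest = {
        "source_path": str(src),
        "source_sha256": hashlib.sha256(src.read_bytes()).hexdigest(),
        "bpe_vocab_count": len(vocab),
        "merges_count": len(merge_lines),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    (out_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )
    return manifest


# Demo against a throwaway directory and a placeholder source file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "tokenizer.json"
    src.write_text("{}", encoding="utf-8")
    manifest = write_layout(src, Path(tmp) / "out", {"a": 0, "b": 1, "ab": 2}, ["a b"])
    merges_text = (Path(tmp) / "out" / "merges.txt").read_text(encoding="utf-8")
print(manifest["bpe_vocab_count"], manifest["merges_count"], merges_text.splitlines())
```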

5 unit tests cover FALSIFY-002..005 + the --include-added-tokens path.

## LIVE smoke (canonical input)

apr tokenize import-hf \
  --input <Qwen2.5-Coder-0.5B-Instruct/tokenizer.json> \
  --output /tmp/qwen-0.5b-tokenizer-extracted --json
  → effective_vocab=151643, merges=151387, added_tokens=22, sha256 captured
  Files written: 3.2 MiB vocab.json, 1.6 MiB merges.txt, 534 B manifest.json

Evidence file: evidence/section-50-4-step-5g-0-import-hf-2026-05-05/live-extraction-smoke.md

## Five Whys

1. Why a new subcommand rather than a script? Per
   feedback_stack_tool_extension_not_cli_shim.md: when apr lacks a
   feature, extend apr in-tree. Authored as an apr subcommand with a
   provable contract; one-off scripts would have been muda.
2. Why under `apr tokenize` rather than `apr import` or top-level?
   `apr tokenize` already handles tokenizer artifacts (plan/apply/train/
   encode-corpus); import-hf is symmetric — produces the same output
   shape (vocab.json + merges.txt) as `train` does, just from an
   external HF source rather than from a corpus. `apr import` is for
   model files, not tokenizers.
3. Why default to BPE-only (excluding added_tokens)? The aprender
   BPE loader is a state machine; added tokens (e.g., <|endoftext|>)
   are control-string substitutions handled differently. Default mode
   keeps the BPE machine pure; --include-added-tokens is the explicit
   opt-in for when downstream consumption needs the unified vocab.
4. Why fail-fast on non-BPE? Unigram and WordPiece have different
   state machines; silent extraction would produce a vocab.json that
   the BPE loader accepts but tokenizes incorrectly. Fail-fast names
   the actual model.type and cites the contract id (auditability).
5. Why not also handle the polymorphic preflight gap (151643/151936)?
   That's a separate concern — the preflight's strict `==` semantic
   is correct for from-scratch models but too strict for Qwen-style
   models with reserved slots. §55 follow-up; out of 5g.0 scope.
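The default versus opt-in behaviour from item 3 can be sketched as follows. This is a Python illustration, not the shipped code: `merge_added_tokens` is a hypothetical helper, it assumes each `added_tokens` entry carries `content` and `id` keys as in HuggingFace tokenizer.json files, and the conflict check is an added safeguard that the PR does not describe.

```python
def merge_added_tokens(vocab: dict, added_tokens: list, include: bool) -> dict:
    """Return the vocab to emit; added tokens are folded in only on opt-in."""
    merged = dict(vocab)
    if not include:
        return merged  # default: pure BPE vocab
    for entry in added_tokens:
        token, token_id = entry["content"], entry["id"]
        existing = merged.get(token)
        if existing is not None and existing != token_id:
            # Assumed safeguard: refuse to silently remap an existing token.
            raise ValueError(f"id conflict for {token!r}: {existing} vs {token_id}")
        merged[token] = token_id
    return merged


bpe_vocab = {"a": 0, "b": 1}
added = [{"id": 2, "content": "<|im_start|>"}]
default_vocab = merge_added_tokens(bpe_vocab, added, include=False)
unified_vocab = merge_added_tokens(bpe_vocab, added, include=True)
print(len(default_vocab), len(unified_vocab))
```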

## Net effects

- Contract apr-cli-tokenize-import-hf-v1 v1.0.0 PARTIAL_ALGORITHM_LEVEL.
- 1 new apr subcommand (`apr tokenize import-hf`) wired into dispatch.
- 5 new unit tests + 1 live smoke evidence file.
- Unblocks 5g.1 (Qwen-tokenized corpus pretokenization) modulo the
  §55 preflight strictness finding (vocab gap 151665 vs 151936).
- MODEL-1 ship %: unchanged at 91%.
- MODEL-2 ship %: unchanged at 57% until 5g.3 val_loss < 9.38.

Refs: SPEC-SHIP-TWO-001 §54 (PR #1496 — re-scoped 5g into 5g.0/.../5g.3),
      contracts/apr-pretrain-arch-polymorphic-v1.yaml v1.2.0 FUNCTIONAL (sibling)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 5, 2026 03:14
@noahgift noahgift merged commit e84b3e4 into main May 5, 2026
11 checks passed
@noahgift noahgift deleted the feat/tokenize-import-hf-5g-0 branch May 5, 2026 03:38
noahgift added a commit that referenced this pull request May 5, 2026
…ted; full run is ~17hr operator-dispatch

§55 (in-flight PR #1500) closes the polymorphic preflight strictness
gap and unblocks 5g.1 dispatch. §56 records the LIVE smoke that
validates 5g.1's correctness end-to-end before committing to the
multi-hour full run.

  apr tokenize encode-corpus \
    --corpus <python-permissive-5k.jsonl> \
    --tokenizer /tmp/qwen-0.5b-tokenizer-extracted \
    --output <smoke-shards> --shard-tokens 1000000

  → 13 valid u32 shards (12 full × ~1M + 1 partial = ~13M tokens for 5000 docs)
  → ~110 sec / M-token single-thread
  → No errors; shard rotation correct
  → Killed before manifest.json write (sufficient evidence accumulated)
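Shard rotation of the kind exercised by this smoke can be sketched in Python. Everything specific here is an assumption for illustration: the `write_shards` name, the `shard-NNNNN.bin` file naming, little-endian u32 encoding, and rotating only at document boundaries (one possible reading of "~1M" per full shard). The real `encode-corpus` behaviour may differ.

```python
import struct
import tempfile
from pathlib import Path


def write_shards(docs, out_dir: Path, shard_tokens: int) -> list:
    """Write token ids as u32 shards, rotating once a shard reaches shard_tokens."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    buf = []

    def flush():
        if not buf:
            return
        path = out_dir / f"shard-{len(paths):05d}.bin"
        # "<I" = little-endian unsigned 32-bit integer, one per token id.
        path.write_bytes(struct.pack(f"<{len(buf)}I", *buf))
        paths.append(path)
        buf.clear()

    for doc in docs:
        buf.extend(doc)
        if len(buf) >= shard_tokens:
            flush()  # rotate at a document boundary
    flush()  # final partial shard
    return paths


# Toy corpus: four tokenized documents, tiny shard size to force rotation.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
with tempfile.TemporaryDirectory() as tmp:
    shard_paths = write_shards(docs, Path(tmp), shard_tokens=4)
    shard_sizes = [p.stat().st_size for p in shard_paths]
print(len(shard_paths), shard_sizes)
```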

  Legacy 50257-vocab:   ~64 sec / M-token  →  9.99 hr for 565M (validated)
  Qwen 151643-vocab:    ~110 sec / M-token →  ~17 hr for 565M (projected)

Qwen is ~70% slower per-token because the BPE merge table is 3× larger
(151387 vs 49997 merges); per-character merge-table search dominates
encoding cost. Below the 48hr feedback_compute_pre_authorized.md
ceiling, so 5g.1 full run is pre-authorized.
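The projection above follows from the measured per-token rates. A short Python check of the arithmetic, using only the figures quoted in this message (the function name is illustrative):

```python
def projected_hours(sec_per_mtoken: float, corpus_mtokens: float) -> float:
    """Wall-clock hours to encode a corpus at a given seconds-per-million-tokens rate."""
    return sec_per_mtoken * corpus_mtokens / 3600.0


CORPUS_MTOKENS = 565  # corpus size in millions of tokens

legacy_hours = projected_hours(64, CORPUS_MTOKENS)   # legacy 50257-vocab rate
qwen_hours = projected_hours(110, CORPUS_MTOKENS)    # Qwen 151643-vocab rate
slowdown = 110 / 64 - 1                              # fractional per-token slowdown
merge_ratio = 151_387 / 49_997                       # merge-table size ratio

print(f"legacy ~{legacy_hours:.1f} hr, qwen ~{qwen_hours:.1f} hr")
print(f"slowdown ~{slowdown:.0%}, merge table ~{merge_ratio:.1f}x")
print("under 48 hr ceiling:", qwen_hours < 48)
```

The legacy figure computed from the rounded ~64 sec rate lands slightly above the validated 9.99 hr, which is expected given the rounding; the Qwen projection stays well under the 48 hr ceiling either way.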

  5g.0  ✅ MERGED PR #1497  (apr tokenize import-hf)
  5g.0.1 in-flight PR #1500 (§55 polymorphic preflight relaxation)
  5g.1  CORRECTNESS-VALIDATED (this PR), full run pending operator
  5g.2  gated on 5g.1 full run
  5g.3  gated on 5g.2 (val_loss < 9.38 verdict)

1. Why smoke before full run? ~17hr non-trivial; smoke proves chain
   correctness before committing to long wall.
2. Why 5000 docs? Smallest slice that exercises shard rotation (12M
   tokens > 10 shards).
3. Why kill smoke instead of complete? 13 shards = sufficient evidence;
   finishing wouldn't add information.
4. Why Qwen 70% slower? BPE merge-table size dominates encoding cost.
5. Why not parallelize? Out of 5g.1 scope; single-thread wall is below
   48hr ceiling; ROI negative for current cycle.

- Spec v3.00.0 → v3.01.0 (assumes §55 lands first; safe either way —
  §56 has no code/contract changes).
- 5g.1 reaches CORRECTNESS-VALIDATED state.
- MODEL-1 ship % unchanged at 91%.
- MODEL-2 ship % unchanged at 57% until 5g.3.

Refs: SPEC-SHIP-TWO-001 §54 (PR #1496), §55 (PR #1500 in-flight),
      §56 (this PR), evidence/section-56-5g-1-smoke-2026-05-05/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 5, 2026
…ted; full run is ~17hr operator-dispatch (#1501)
