feat(apr-cli): tokenize encode-corpus accepts parquet input (#1410) by noahgift · Pull Request #1412 · paiml/aprender

noahgift · 2026-05-03T06:31:44Z

Summary

Closes the muda gap that forced uv run --with pyarrow shims when tokenizing Stack v1.2 / codeparrot corpora — apr now ingests parquet natively.
Streaming row-group reader via Apache Arrow keeps peak memory bounded by row-group size, not shard size (Stack v1.2 shards are ~200 MB; reading them all-at-once would OOM at 30 GB).
Manifest now records input_format ∈ {jsonl, parquet} for provenance audits.
New contract apr-cli-tokenize-encode-corpus-parquet-v1.yaml v1.0.0 PROPOSED — 4 FALSIFY-PARQ-INPUT-001..004 at PARTIAL_ALGORITHM_LEVEL.

Why

Per feedback_stack_tool_extension_not_cli_shim.md and SHIP-TWO-001 §26.8: apr is canonical. Stack v1.2 (bigcode/the-stack-dedup) and codeparrot Python parquet shards are blocked from MODEL-2 training without a JSONL conversion step — and that step would be a third-party-Python detour the project rules forbid.

This PR is the apr-extend solution: parquet input becomes a first-class corpus format alongside JSONL, with no behavior change for existing JSONL pipelines.

What changed

crates/apr-cli/src/commands/tokenize_parquet.rs (new): is_parquet, collect_parquet_files, iter_parquet_content — pure-Rust streaming adapter. Handles Utf8 + LargeUtf8 array types, skips null cells, fail-fast on missing column with available-columns diagnostic.
crates/apr-cli/src/commands/tokenize.rs:
- New CorpusFormat enum + collect_corpus_files that detects format by extension and prefers parquet when both are present in a directory.
- New iter_corpus_texts unified iterator that yields (file_display, locator, text) triples regardless of source. Locator is "line N" or "row N" so error messages stay consistent.
- run_encode_corpus refactored to one loop over the unified iterator. JSONL path is byte-equivalent for existing callers.
- Manifest gets a new input_format field.
crates/apr-cli/Cargo.toml: parquet = "57" (default-features = false; arrow,snap,zstd) + arrow-array = "57".
contracts/apr-cli-tokenize-encode-corpus-parquet-v1.yaml (new): PROPOSED, pv validate clean. 4 falsifiers, 3 proof_obligations.
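The extension-based detection with parquet preference described above could look roughly like the sketch below. It is not the PR's code: `select_corpus_files` and `manifest_value` are hypothetical names, the function works on a pre-listed set of paths rather than walking a directory, and exact-case extension matching plus sorted output are assumptions.

```rust
use std::path::{Path, PathBuf};

/// Corpus input formats named in the PR (jsonl, parquet).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CorpusFormat {
    Jsonl,
    Parquet,
}

impl CorpusFormat {
    /// Classify a path by its extension; unknown extensions yield None.
    /// Assumption: matching is exact-case.
    pub fn from_path(path: &Path) -> Option<Self> {
        match path.extension()?.to_str()? {
            "jsonl" => Some(Self::Jsonl),
            "parquet" => Some(Self::Parquet),
            _ => None,
        }
    }

    /// String recorded in the manifest's `input_format` field.
    pub fn manifest_value(self) -> &'static str {
        match self {
            Self::Jsonl => "jsonl",
            Self::Parquet => "parquet",
        }
    }
}

/// Hypothetical helper: pick the files to tokenize from a directory listing.
/// Parquet wins when both formats are present; files are returned sorted
/// (an assumption, chosen so shard order is deterministic).
pub fn select_corpus_files(candidates: &[PathBuf]) -> Option<(CorpusFormat, Vec<PathBuf>)> {
    let pick = |want: CorpusFormat| -> Vec<PathBuf> {
        let mut files: Vec<PathBuf> = candidates
            .iter()
            .filter(|p| CorpusFormat::from_path(p.as_path()) == Some(want))
            .cloned()
            .collect();
        files.sort();
        files
    };
    let parquet = pick(CorpusFormat::Parquet);
    if !parquet.is_empty() {
        return Some((CorpusFormat::Parquet, parquet));
    }
    let jsonl = pick(CorpusFormat::Jsonl);
    if jsonl.is_empty() {
        None
    } else {
        Some((CorpusFormat::Jsonl, jsonl))
    }
}
```

Returning the format together with the file list lets the caller write `manifest_value()` into the manifest without re-deriving it from the paths.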

Test plan

cargo test -p apr-cli --lib commands::tokenize → 14/14 PASS (10 new + 4 existing)
cargo clippy -p apr-cli --lib --no-deps -- -D warnings clean
pv validate contracts/apr-cli-tokenize-encode-corpus-parquet-v1.yaml → 0 errors, 0 warnings
Live smoke on /mnt/nvme-raid0/datasets/the-stack-dedup/data/python/data-00000-of-00144.parquet (200 MB Stack v1.2 shard) with model-2-tokenizer-v1: shard-00000.bin grew to ~75K tokens before kill — dispatch confirmed live.
CI required checks (ci / gate, workspace-test)

Unblocks

MODEL-2 retrain on the 27 GB Stack v1.2 Python corpus, targeting val_loss < 9.38 (the empirical CSN-Python plateau per ship-two-models-spec §25).

🤖 Generated with Claude Code

Stack v1.2 (`bigcode/the-stack-dedup`) and codeparrot Python corpora ship as parquet shards, not JSONL. Without this extension, callers have to shell out to `uv run --with pyarrow` to convert parquet → JSONL before calling `apr tokenize encode-corpus`, exactly the kind of CLI shim that `feedback_stack_tool_extension_not_cli_shim.md` flags as muda. Per APR-MONO §26.8: `apr` is canonical, extend in-tree.

Changes

- New module `commands/tokenize_parquet.rs`: streaming parquet adapter via Apache Arrow's ParquetRecordBatchReaderBuilder. Reads one row group at a time so peak memory is bounded by row-group size, not shard size (Stack v1.2 shards are ~200 MB).
- New `commands/tokenize.rs::collect_corpus_files` and `iter_corpus_texts` helpers: detect format by extension, yield unified `(file_display, locator, text)` triples regardless of source. JSONL behavior is unchanged for back-compat.
- Manifest now records `input_format` ∈ {jsonl, parquet} for downstream provenance audits.
- `parquet = "57"` and `arrow-array = "57"` added to apr-cli deps (default-features = false; only `arrow,snap,zstd` for parquet).

Contract

- `contracts/apr-cli-tokenize-encode-corpus-parquet-v1.yaml` (PROPOSED, v1.0.0): 4 falsifiers FALSIFY-PARQ-INPUT-001..004 at PARTIAL_ALGORITHM_LEVEL via the unit tests below; pv validate clean.

Verification

- `cargo test -p apr-cli --lib commands::tokenize` → 14/14 PASS (10 new tokenize_parquet tests + 4 existing tokenize tests)
- `cargo clippy -p apr-cli --lib --no-deps -- -D warnings` clean
- Live smoke against `/mnt/nvme-raid0/datasets/the-stack-dedup/data/python/data-00000-of-00144.parquet` with `model-2-tokenizer-v1` (vocab 32k): debug build wrote shard-00000.bin (~75K tokens) before kill; dispatch confirmed live.

Unblocks

- MODEL-2 retrain on Stack v1.2 27 GB Python corpus, targeting val_loss < 9.38 (the empirical CSN-Python floor per spec §25).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
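To make the unified `(file_display, locator, text)` iteration concrete, here is a minimal in-memory sketch. It is an illustration under stated assumptions, not the PR's `iter_corpus_texts`: the `Source` enum is a hypothetical stand-in for files already read from disk, JSONL lines are treated as ready-made text (the real path would parse each JSON record), locators are assumed 1-based, and skipped null cells are assumed to still advance the row counter.

```rust
/// (file_display, locator, text), as named in the commit message.
pub type Triple = (String, String, String);

/// Hypothetical in-memory stand-in for one corpus file.
pub enum Source {
    /// One entry per JSONL line (text already extracted).
    Jsonl { file: String, lines: Vec<String> },
    /// One entry per parquet row of the text column; None is a null cell.
    Parquet { file: String, cells: Vec<Option<String>> },
}

/// Yield triples for a single source, lazily.
fn texts_of(src: &Source) -> Box<dyn Iterator<Item = Triple> + '_> {
    match src {
        Source::Jsonl { file, lines } => Box::new(
            lines
                .iter()
                .enumerate()
                .map(move |(i, l)| (file.clone(), format!("line {}", i + 1), l.clone())),
        ),
        Source::Parquet { file, cells } => Box::new(
            cells.iter().enumerate().filter_map(move |(i, c)| {
                // Null cells are skipped; the locator keeps the physical row number.
                c.as_ref()
                    .map(|t| (file.clone(), format!("row {}", i + 1), t.clone()))
            }),
        ),
    }
}

/// One loop for the caller, whatever the underlying format.
pub fn iter_texts(sources: &[Source]) -> impl Iterator<Item = Triple> + '_ {
    sources.iter().flat_map(texts_of)
}
```

Because both arms produce the same triple shape, an encode loop can report errors as `file: line N` or `file: row N` without branching on the format.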

…age tensors (#1413)

Closes the `apr_diff_values_compat` invariant of `apr-cli-trace-save-tensor-v1` at PARTIAL_ALGORITHM_LEVEL via a new `diff_05_aprt_stage.rs` include slot.

When both inputs to `apr diff --values` start with magic bytes `APRT` (the 12-byte header written by `apr trace --save-tensor`), the dispatch now bypasses the RosettaStone whole-model walker and runs an element-wise stage-tensor diff:

- max|diff| with index
- RMS diff
- Cosine similarity (f64-accumulated for numerical stability)
- Top-K divergences sorted by |a - b|

Both JSON and pretty text output are supported. Mismatched dim_product or layer fields fail-fast with a diagnostic error so callers don't silently compare incompatible stages.

## Five Whys (why now, why this scope)

1. **Why is this needed?** `apr trace --save-tensor` (PR-A #1405, PR-B #1406, PR-C-prep #1407) writes per-stage f32 tensors as `APRT`-prefixed files. Without an APRT-aware diff, layer-0 stage-by-stage element-wise bisection per `feedback_model_1_ships_gpu_only.md` is gated on external tooling, exactly the kind of muda the APR-MONO §26.8 rule forbids.
2. **Why extend `apr diff` and not write a new subcommand?** The `apr_diff_values_compat` invariant in `apr-cli-trace-save-tensor-v1` already names `apr diff --values` as the verifier. Extending the existing flag keeps the contract surface stable.
3. **Why an include!() file instead of inlining into diff.rs?** diff.rs already follows that pattern (diff_accumulator, diff_output_json_text, diff_04). Keeping APRT logic in `diff_05_aprt_stage.rs` lets it be audited / removed independently and doesn't grow the parent file.
4. **Why pin via `provenance_pin_pr_d_rev1`?** Future renames of either `is_aprt_stage_file` or the file path break the include!() chain; the pin makes that visible at test-time and forces a contract bump.
5. **Why now?** Tokenization of the 27 GB Stack v1.2 Python corpus is running in the background for MODEL-2 (PR #1412 merged). The SHIP-007 PR-C-real cascade for MODEL-1 needs PR-D infrastructure ready when step 2 (forward_traced threading) lands. PR-D is independent and can merge in parallel with #1408.

## Verification

- `cargo test -p apr-cli --lib commands::diff::aprt` → 11/11 PASS
  - is_aprt_stage_file: detects/rejects/truncated/missing (4 tests)
  - compute_aprt_stage_stats: identical=zero, known max/RMS, top-K sort (3)
  - run_aprt_stage_diff: dim/layer mismatch errors, identical succeeds (3)
  - provenance_pin_pr_d_rev1 (1)
- `cargo clippy -p apr-cli --lib --no-deps -- -D warnings` clean
- `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors

## Contract update

`apr-cli-trace-save-tensor-v1` v1.0.0 → v1.1.0:

- New FALSIFY-APR-TRACE-SAVE-009 binding `apr_diff_values_compat` at PARTIAL_ALGORITHM_LEVEL with 4-line `algorithm_evidence` block citing this PR's unit tests.

## Ship % update

MODEL-1: ~64% → ~66% (PR-D is small but discharges 1 PARTIAL invariant and clears infrastructure blocker for SHIP-007 step E).
MODEL-2: corpus tokenization in progress (~33h ETA).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
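The four stage-diff metrics listed in this commit message (max|diff| with index, RMS diff, f64-accumulated cosine similarity, top-K by |a - b|) can be sketched as follows. This is an illustration, not the contents of `diff_05_aprt_stage.rs`: `looks_like_aprt`, `stage_stats`, and `StageStats` are hypothetical names, the magic check only assumes "at least 12 bytes, starting with `APRT`" because the rest of the header layout is not described here, and the zero-norm cosine convention is an assumption.

```rust
/// Summary of an element-wise comparison of two f32 stage tensors.
#[derive(Debug, Clone, PartialEq)]
pub struct StageStats {
    pub max_abs_diff: f32,
    pub max_index: usize,
    pub rms_diff: f64,
    pub cosine: f64,
    /// (index, |a - b|), largest divergence first.
    pub top_k: Vec<(usize, f32)>,
}

/// Assumption: a stage file is at least 12 bytes and begins with b"APRT".
pub fn looks_like_aprt(bytes: &[u8]) -> bool {
    bytes.len() >= 12 && bytes.starts_with(b"APRT")
}

/// Compare two tensors of equal length; a length mismatch is an error so
/// incompatible stages are never compared silently.
pub fn stage_stats(a: &[f32], b: &[f32], k: usize) -> Result<StageStats, String> {
    if a.len() != b.len() {
        return Err(format!("element count mismatch: {} vs {}", a.len(), b.len()));
    }
    // All sums are accumulated in f64 to limit rounding error on long tensors.
    let (mut sq, mut dot, mut na, mut nb) = (0.0_f64, 0.0_f64, 0.0_f64, 0.0_f64);
    let mut diffs: Vec<(usize, f32)> = Vec::with_capacity(a.len());
    for (i, (&x, &y)) in a.iter().zip(b).enumerate() {
        let (xf, yf) = (f64::from(x), f64::from(y));
        sq += (xf - yf) * (xf - yf);
        dot += xf * yf;
        na += xf * xf;
        nb += yf * yf;
        diffs.push((i, (x - y).abs()));
    }
    // Stable descending sort: ties keep the lower index first.
    diffs.sort_by(|l, r| r.1.total_cmp(&l.1));
    let (max_index, max_abs_diff) = diffs.first().copied().unwrap_or((0, 0.0));
    diffs.truncate(k);
    let n = a.len().max(1) as f64;
    let denom = (na * nb).sqrt();
    // Assumed convention: two all-zero tensors count as identical (1.0);
    // one all-zero tensor against a non-zero one gives 0.0.
    let cosine = if denom > 0.0 {
        dot / denom
    } else if na == nb {
        1.0
    } else {
        0.0
    };
    Ok(StageStats {
        max_abs_diff,
        max_index,
        rms_diff: (sq / n).sqrt(),
        cosine,
        top_k: diffs,
    })
}
```

Sorting the full difference vector is the simplest way to get top-K; for very large stage tensors a bounded heap would avoid the O(n log n) sort, at the cost of more code.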

noahgift enabled auto-merge (squash) May 3, 2026 06:31

Merge branch 'main' into feat/tokenize-encode-corpus-parquet-input

3dfe5e5

noahgift merged commit 54f76f4 into main May 3, 2026
10 checks passed

noahgift deleted the feat/tokenize-encode-corpus-parquet-input branch May 3, 2026 07:05

noahgift merged 2 commits into main from feat/tokenize-encode-corpus-parquet-input
