feat(apr-cli): tokenize encode-corpus accepts parquet input (#1410) #1412
Merged
Stack v1.2 (`bigcode/the-stack-dedup`) and codeparrot Python corpora
ship as parquet shards, not JSONL. Without this extension, callers
have to shell out to `uv run --with pyarrow` to convert parquet → JSONL
before calling `apr tokenize encode-corpus` — exactly the kind of
CLI-shim that `feedback_stack_tool_extension_not_cli_shim.md` flags as
muda. Per APR-MONO §26.8: `apr` is canonical, extend in-tree.
Changes
- New module `commands/tokenize_parquet.rs`: streaming parquet adapter
via Apache Arrow's ParquetRecordBatchReaderBuilder. Reads one row
group at a time so peak memory is bounded by row-group size, not
shard size (Stack v1.2 shards are ~200 MB).
- New `commands/tokenize.rs::collect_corpus_files` and
`iter_corpus_texts` helpers: detect format by extension, yield
unified `(file_display, locator, text)` triples regardless of source.
JSONL behavior is unchanged for back-compat.
- Manifest now records `input_format` ∈ {jsonl, parquet} for downstream
provenance audits.
- `parquet = "57"` and `arrow-array = "57"` added to apr-cli deps
(default-features = false; only `arrow,snap,zstd` for parquet).
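The extension-based format detection and the unified locator described above can be sketched with the standard library alone. This is a minimal illustration, not the code in `commands/tokenize.rs`: `detect_format`, `select_corpus_files`, and `locator` are hypothetical names, the variants of `CorpusFormat` are assumed, and whether `N` in the locator is 0- or 1-based is not stated in the PR, so the index is passed through unchanged.

```rust
use std::path::{Path, PathBuf};

/// Corpus input formats. The enum name follows the PR; the variants are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CorpusFormat {
    Jsonl,
    Parquet,
}

/// Classify a path by its extension (case-insensitive). Hypothetical helper.
pub fn detect_format(path: &Path) -> Option<CorpusFormat> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "jsonl" => Some(CorpusFormat::Jsonl),
        "parquet" => Some(CorpusFormat::Parquet),
        _ => None,
    }
}

/// Choose a single format for a set of candidate files, preferring parquet
/// when both kinds are present, and return the matching files in sorted order.
pub fn select_corpus_files(candidates: &[PathBuf]) -> Option<(CorpusFormat, Vec<PathBuf>)> {
    let has = |f: CorpusFormat| candidates.iter().any(|p| detect_format(p) == Some(f));
    let format = if has(CorpusFormat::Parquet) {
        CorpusFormat::Parquet
    } else if has(CorpusFormat::Jsonl) {
        CorpusFormat::Jsonl
    } else {
        return None;
    };
    let mut files: Vec<PathBuf> = candidates
        .iter()
        .filter(|p| detect_format(p) == Some(format))
        .cloned()
        .collect();
    files.sort();
    Some((format, files))
}

/// Build the locator string used in diagnostics: "line N" for JSONL, "row N" for parquet.
pub fn locator(format: CorpusFormat, index: usize) -> String {
    match format {
        CorpusFormat::Jsonl => format!("line {index}"),
        CorpusFormat::Parquet => format!("row {index}"),
    }
}
```

Choosing one format per directory keeps a single encode loop from mixing sources, and a shared locator string lets error messages read the same way for both inputs.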
Contract
- `contracts/apr-cli-tokenize-encode-corpus-parquet-v1.yaml` (PROPOSED,
v1.0.0) — 4 falsifiers FALSIFY-PARQ-INPUT-001..004 at
PARTIAL_ALGORITHM_LEVEL via the unit tests below; pv validate clean.
Verification
- `cargo test -p apr-cli --lib commands::tokenize` → 14/14 PASS
(10 new tokenize_parquet tests + 4 existing tokenize tests)
- `cargo clippy -p apr-cli --lib --no-deps -- -D warnings` clean
- Live smoke against `/mnt/nvme-raid0/datasets/the-stack-dedup/data/python/data-00000-of-00144.parquet`
with `model-2-tokenizer-v1` (vocab 32k): debug build wrote
shard-00000.bin (~75K tokens) before kill — dispatch confirmed live.
Unblocks
- MODEL-2 retrain on Stack v1.2 27 GB Python corpus, targeting
val_loss < 9.38 (the empirical CSN-Python floor per spec §25).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 3, 2026
…age tensors (#1413)

Closes the `apr_diff_values_compat` invariant of `apr-cli-trace-save-tensor-v1` at PARTIAL_ALGORITHM_LEVEL via a new `diff_05_aprt_stage.rs` include slot.

When both inputs to `apr diff --values` start with magic bytes `APRT` (the 12-byte header written by `apr trace --save-tensor`), the dispatch now bypasses the RosettaStone whole-model walker and runs an element-wise stage-tensor diff:

- max|diff| with index
- RMS diff
- Cosine similarity (f64-accumulated for numerical stability)
- Top-K divergences sorted by |a - b|

Both JSON and pretty text output are supported. Mismatched dim_product or layer fields fail-fast with a diagnostic error so callers don't silently compare incompatible stages.

## Five Whys (why now, why this scope)

1. **Why is this needed?** `apr trace --save-tensor` (PR-A #1405, PR-B #1406, PR-C-prep #1407) writes per-stage f32 tensors as `APRT`-prefixed files. Without an APRT-aware diff, layer-0 stage-by-stage element-wise bisection per `feedback_model_1_ships_gpu_only.md` is gated on external tooling, exactly the kind of muda the APR-MONO §26.8 rule forbids.
2. **Why extend `apr diff` and not write a new subcommand?** The `apr_diff_values_compat` invariant in `apr-cli-trace-save-tensor-v1` already names `apr diff --values` as the verifier. Extending the existing flag keeps the contract surface stable.
3. **Why an include!() file instead of inlining into diff.rs?** diff.rs already follows that pattern (diff_accumulator, diff_output_json_text, diff_04). Keeping APRT logic in `diff_05_aprt_stage.rs` lets it be audited / removed independently and doesn't grow the parent file.
4. **Why pin via `provenance_pin_pr_d_rev1`?** Future renames of either `is_aprt_stage_file` or the file path break the include!() chain; the pin makes that visible at test-time and forces a contract bump.
5. **Why now?** Tokenization of the 27 GB Stack v1.2 Python corpus is running in the background for MODEL-2 (PR #1412 merged). The SHIP-007 PR-C-real cascade for MODEL-1 needs PR-D infrastructure ready when step 2 (forward_traced threading) lands. PR-D is independent and can merge in parallel with #1408.

## Verification

- `cargo test -p apr-cli --lib commands::diff::aprt` → 11/11 PASS
  - is_aprt_stage_file: detects/rejects/truncated/missing (4 tests)
  - compute_aprt_stage_stats: identical=zero, known max/RMS, top-K sort (3)
  - run_aprt_stage_diff: dim/layer mismatch errors, identical succeeds (3)
  - provenance_pin_pr_d_rev1 (1)
- `cargo clippy -p apr-cli --lib --no-deps -- -D warnings` clean
- `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors

## Contract update

`apr-cli-trace-save-tensor-v1` v1.0.0 → v1.1.0:

- New FALSIFY-APR-TRACE-SAVE-009 binding `apr_diff_values_compat` at PARTIAL_ALGORITHM_LEVEL with 4-line `algorithm_evidence` block citing this PR's unit tests.

## Ship % update

MODEL-1: ~64% → ~66% (PR-D is small but discharges 1 PARTIAL invariant and clears infrastructure blocker for SHIP-007 step E).
MODEL-2: corpus tokenization in progress (~33h ETA).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
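The element-wise stage-tensor diff that the referenced commit (#1413) describes can be sketched as follows. This is an illustration under stated assumptions, not the contents of `diff_05_aprt_stage.rs`: `has_aprt_magic`, `StageDiffStats`, and `stage_diff_stats` are hypothetical names, only the `APRT` magic and the 12-byte header length are taken from the commit message (the rest of the header layout is not shown there), and a slice-length check stands in for the dim_product / layer comparison.

```rust
/// True when a buffer is long enough to hold the 12-byte header and starts
/// with the `APRT` magic. The remaining header fields are not modeled here.
pub fn has_aprt_magic(bytes: &[u8]) -> bool {
    bytes.len() >= 12 && bytes.starts_with(b"APRT")
}

/// Summary of an element-wise comparison of two f32 tensors (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct StageDiffStats {
    pub max_abs_diff: f32,
    pub max_abs_index: usize,
    pub rms_diff: f64,
    pub cosine_similarity: f64,
    /// Up to `k` entries of (index, |a - b|), largest first.
    pub top_k: Vec<(usize, f32)>,
}

/// Compute max|diff| with index, RMS diff, cosine similarity, and top-K
/// divergences. Sums are accumulated in f64 for numerical stability.
pub fn stage_diff_stats(a: &[f32], b: &[f32], k: usize) -> Result<StageDiffStats, String> {
    // Fail fast instead of silently comparing incompatible tensors.
    if a.len() != b.len() {
        return Err(format!("element count mismatch: {} vs {}", a.len(), b.len()));
    }
    if a.is_empty() {
        return Err("empty tensors".to_string());
    }
    let (mut sum_sq, mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64, 0.0f64);
    let mut diffs: Vec<(usize, f32)> = Vec::with_capacity(a.len());
    for (i, (&x, &y)) in a.iter().zip(b).enumerate() {
        diffs.push((i, (x - y).abs()));
        let (xd, yd) = (f64::from(x), f64::from(y));
        sum_sq += (xd - yd) * (xd - yd);
        dot += xd * yd;
        norm_a += xd * xd;
        norm_b += yd * yd;
    }
    // Stable descending sort by |a - b|; `total_cmp` orders NaN first so it surfaces.
    diffs.sort_by(|l, r| r.1.total_cmp(&l.1));
    let (max_abs_index, max_abs_diff) = diffs[0];
    diffs.truncate(k);
    let denom = norm_a.sqrt() * norm_b.sqrt();
    Ok(StageDiffStats {
        max_abs_diff,
        max_abs_index,
        rms_diff: (sum_sq / a.len() as f64).sqrt(),
        // Cosine similarity is undefined for a zero vector; report 0.0 in that case.
        cosine_similarity: if denom == 0.0 { 0.0 } else { dot / denom },
        top_k: diffs,
    })
}
```

Sorting the full difference vector is the simplest way to get both the maximum and the top-K list; for very large stage tensors a bounded heap of size K would avoid the O(n log n) sort.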
Summary
- No `uv run --with pyarrow` shims when tokenizing Stack v1.2 / codeparrot corpora: `apr` now ingests parquet natively.
- Manifest records `input_format` ∈ {jsonl, parquet} for provenance audits.
- Contract `apr-cli-tokenize-encode-corpus-parquet-v1.yaml` v1.0.0 PROPOSED: 4 falsifiers FALSIFY-PARQ-INPUT-001..004 at PARTIAL_ALGORITHM_LEVEL.

Why
Per `feedback_stack_tool_extension_not_cli_shim.md` and SHIP-TWO-001 §26.8: `apr` is canonical. Stack v1.2 (`bigcode/the-stack-dedup`) and codeparrot Python parquet shards are blocked from MODEL-2 training without a JSONL conversion step, and that step would be a third-party-Python detour the project rules forbid.

This PR is the apr-extend solution: parquet input becomes a first-class corpus format alongside JSONL, with no behavior change for existing JSONL pipelines.

What changed
- `crates/apr-cli/src/commands/tokenize_parquet.rs` (new): `is_parquet`, `collect_parquet_files`, `iter_parquet_content`, a pure-Rust streaming adapter. Handles Utf8 + LargeUtf8 array types, skips null cells, fail-fast on missing column with available-columns diagnostic.
- `crates/apr-cli/src/commands/tokenize.rs`:
  - `CorpusFormat` enum + `collect_corpus_files` that detects format by extension and prefers parquet when both are present in a directory.
  - `iter_corpus_texts` unified iterator that yields `(file_display, locator, text)` triples regardless of source. Locator is `"line N"` or `"row N"` so error messages stay consistent.
  - `run_encode_corpus` refactored to one loop over the unified iterator. JSONL path is byte-equivalent for existing callers.
  - Manifest records an `input_format` field.
- `crates/apr-cli/Cargo.toml`: `parquet = "57"` (default-features = false; `arrow,snap,zstd`) + `arrow-array = "57"`.
- `contracts/apr-cli-tokenize-encode-corpus-parquet-v1.yaml` (new): PROPOSED, `pv validate` clean. 4 falsifiers, 3 proof_obligations.

Test plan
- `cargo test -p apr-cli --lib commands::tokenize` → 14/14 PASS (10 new + 4 existing)
- `cargo clippy -p apr-cli --lib --no-deps -- -D warnings` clean
- `pv validate contracts/apr-cli-tokenize-encode-corpus-parquet-v1.yaml` → 0 errors, 0 warnings
- Live smoke against `/mnt/nvme-raid0/datasets/the-stack-dedup/data/python/data-00000-of-00144.parquet` (200 MB Stack v1.2 shard) with `model-2-tokenizer-v1`: shard-00000.bin grew to ~75K tokens before kill, dispatch confirmed live.
- CI (`ci / gate`, `workspace-test`)

Unblocks
- MODEL-2 retrain on Stack v1.2 27 GB Python corpus, targeting `val_loss < 9.38` (the empirical CSN-Python plateau per ship-two-models-spec §25).

🤖 Generated with Claude Code