feat(apr-cli): tokenize encode-corpus accepts parquet input (#1410) #1412

Merged: noahgift merged 2 commits into main from feat/tokenize-encode-corpus-parquet-input on May 3, 2026
noahgift (Contributor) commented May 3, 2026

Summary

  • Closes the muda gap that forced uv run --with pyarrow shims when tokenizing Stack v1.2 / codeparrot corpora — apr now ingests parquet natively.
  • Streaming row-group reader via Apache Arrow keeps peak memory bounded by row-group size, not shard size (Stack v1.2 shards are ~200 MB; reading them all-at-once would OOM at 30 GB).
  • Manifest now records input_format ∈ {jsonl, parquet} for provenance audits.
  • New contract apr-cli-tokenize-encode-corpus-parquet-v1.yaml (v1.0.0, PROPOSED) with 4 falsifiers, FALSIFY-PARQ-INPUT-001..004, at PARTIAL_ALGORITHM_LEVEL.

Why

Per feedback_stack_tool_extension_not_cli_shim.md and SHIP-TWO-001 §26.8: apr is canonical. Stack v1.2 (bigcode/the-stack-dedup) and codeparrot Python parquet shards are blocked from MODEL-2 training without a JSONL conversion step — and that step would be a third-party-Python detour the project rules forbid.

This PR is the apr-extend solution: parquet input becomes a first-class corpus format alongside JSONL, with no behavior change for existing JSONL pipelines.

What changed

  • crates/apr-cli/src/commands/tokenize_parquet.rs (new): is_parquet, collect_parquet_files, iter_parquet_content — pure-Rust streaming adapter. Handles Utf8 + LargeUtf8 array types, skips null cells, fail-fast on missing column with available-columns diagnostic.
  • crates/apr-cli/src/commands/tokenize.rs:
    • New CorpusFormat enum + collect_corpus_files that detects format by extension and prefers parquet when both are present in a directory.
    • New iter_corpus_texts unified iterator that yields (file_display, locator, text) triples regardless of source. Locator is "line N" or "row N" so error messages stay consistent.
    • run_encode_corpus refactored to one loop over the unified iterator. JSONL path is byte-equivalent for existing callers.
    • Manifest gets a new input_format field.
  • crates/apr-cli/Cargo.toml: parquet = "57" (default-features = false; features arrow, snap, zstd) + arrow-array = "57".
  • contracts/apr-cli-tokenize-encode-corpus-parquet-v1.yaml (new): PROPOSED, pv validate clean. 4 falsifiers, 3 proof_obligations.
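
The extension-based detection described above (parquet preferred when a directory holds both formats) can be sketched roughly as follows. This is a minimal std-only illustration, not the PR's code: `format_of` and `select_corpus_files` are hypothetical names, the recognized extensions (`.jsonl`, `.parquet`) and the case-insensitive match are assumptions, and the real `collect_corpus_files` may walk directories and report errors differently.

```rust
use std::path::{Path, PathBuf};

/// Corpus input formats named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CorpusFormat {
    Jsonl,
    Parquet,
}

/// Classify one path by extension. Assumption: matching is case-insensitive
/// and anything other than .jsonl / .parquet is ignored.
pub fn format_of(path: &Path) -> Option<CorpusFormat> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "jsonl" => Some(CorpusFormat::Jsonl),
        "parquet" => Some(CorpusFormat::Parquet),
        _ => None,
    }
}

/// Hypothetical helper: choose one format for a set of candidate paths.
/// Parquet wins whenever at least one parquet file is present.
pub fn select_corpus_files(paths: &[PathBuf]) -> Option<(CorpusFormat, Vec<PathBuf>)> {
    let has_parquet = paths
        .iter()
        .any(|p| format_of(p) == Some(CorpusFormat::Parquet));
    let chosen = if has_parquet {
        CorpusFormat::Parquet
    } else {
        CorpusFormat::Jsonl
    };
    // Keep only files of the chosen format, in a deterministic order.
    let mut files: Vec<PathBuf> = paths
        .iter()
        .filter(|p| format_of(p) == Some(chosen))
        .cloned()
        .collect();
    if files.is_empty() {
        return None;
    }
    files.sort();
    Some((chosen, files))
}
```

Returning the chosen format alongside the file list is one way to let the caller record `input_format` in the manifest without re-detecting it.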

Test plan

  • cargo test -p apr-cli --lib commands::tokenize → 14/14 PASS (10 new + 4 existing)
  • cargo clippy -p apr-cli --lib --no-deps -- -D warnings clean
  • pv validate contracts/apr-cli-tokenize-encode-corpus-parquet-v1.yaml → 0 errors, 0 warnings
  • Live smoke on /mnt/nvme-raid0/datasets/the-stack-dedup/data/python/data-00000-of-00144.parquet (200 MB Stack v1.2 shard) with model-2-tokenizer-v1: shard-00000.bin grew to ~75K tokens before kill — dispatch confirmed live.
  • CI required checks (ci / gate, workspace-test)

Unblocks

  • MODEL-2 retrain on the 27 GB Stack v1.2 Python corpus, targeting val_loss < 9.38 (the empirical CSN-Python plateau per ship-two-models-spec §25).

🤖 Generated with Claude Code

Stack v1.2 (`bigcode/the-stack-dedup`) and codeparrot Python corpora
ship as parquet shards, not JSONL. Without this extension, callers
have to shell out to `uv run --with pyarrow` to convert parquet → JSONL
before calling `apr tokenize encode-corpus` — exactly the kind of
CLI-shim that `feedback_stack_tool_extension_not_cli_shim.md` flags as
muda. Per APR-MONO §26.8: `apr` is canonical, extend in-tree.

Changes
- New module `commands/tokenize_parquet.rs`: streaming parquet adapter
  via Apache Arrow's ParquetRecordBatchReaderBuilder. Reads one row
  group at a time so peak memory is bounded by row-group size, not
  shard size (Stack v1.2 shards are ~200 MB).
- New `commands/tokenize.rs::collect_corpus_files` and
  `iter_corpus_texts` helpers: detect format by extension, yield
  unified `(file_display, locator, text)` triples regardless of source.
  JSONL behavior is unchanged for back-compat.
- Manifest now records `input_format` ∈ {jsonl, parquet} for downstream
  provenance audits.
- `parquet = "57"` and `arrow-array = "57"` added to apr-cli deps
  (default-features = false; only `arrow,snap,zstd` for parquet).
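
As a rough illustration of the unified `(file_display, locator, text)` triples, the sketch below labels records as "line N" or "row N" depending on the source. It is std-only and hypothetical: `LocatorKind` and `with_locators` are invented names, records are modeled as `Option<String>` (with `None` standing in for a skipped null cell), and both the 1-based numbering and the choice to keep counting skipped records are assumptions rather than confirmed behavior of `iter_corpus_texts`.

```rust
/// Hypothetical: which unit the locator counts, per source format.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LocatorKind {
    Line, // JSONL input
    Row,  // parquet input
}

impl LocatorKind {
    fn unit(self) -> &'static str {
        match self {
            LocatorKind::Line => "line",
            LocatorKind::Row => "row",
        }
    }
}

/// Turn a stream of optional texts into (file_display, locator, text) triples.
/// `None` records are skipped but still advance the counter, so the locator
/// keeps pointing at the record's position in the source file (an assumption).
pub fn with_locators<'a, I>(
    file_display: &'a str,
    kind: LocatorKind,
    records: I,
) -> impl Iterator<Item = (String, String, String)> + 'a
where
    I: Iterator<Item = Option<String>> + 'a,
{
    records.enumerate().filter_map(move |(i, rec)| {
        rec.map(|text| {
            (
                file_display.to_string(),
                format!("{} {}", kind.unit(), i + 1),
                text,
            )
        })
    })
}
```

With this shape, a single encode loop can format an error as `{file_display}: {locator}: ...` without knowing which reader produced the record.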

Contract
- `contracts/apr-cli-tokenize-encode-corpus-parquet-v1.yaml` (PROPOSED,
  v1.0.0) — 4 falsifiers FALSIFY-PARQ-INPUT-001..004 at
  PARTIAL_ALGORITHM_LEVEL via the unit tests below; pv validate clean.

Verification
- `cargo test -p apr-cli --lib commands::tokenize` → 14/14 PASS
  (10 new tokenize_parquet tests + 4 existing tokenize tests)
- `cargo clippy -p apr-cli --lib --no-deps -- -D warnings` clean
- Live smoke against `/mnt/nvme-raid0/datasets/the-stack-dedup/data/python/data-00000-of-00144.parquet`
  with `model-2-tokenizer-v1` (vocab 32k): debug build wrote
  shard-00000.bin (~75K tokens) before kill — dispatch confirmed live.

Unblocks
- MODEL-2 retrain on Stack v1.2 27 GB Python corpus, targeting
  val_loss < 9.38 (the empirical CSN-Python floor per spec §25).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 3, 2026 06:31
@noahgift noahgift merged commit 54f76f4 into main May 3, 2026
10 checks passed
@noahgift noahgift deleted the feat/tokenize-encode-corpus-parquet-input branch May 3, 2026 07:05
noahgift added a commit that referenced this pull request May 3, 2026
…age tensors (#1413)

Closes the `apr_diff_values_compat` invariant of `apr-cli-trace-save-tensor-v1`
at PARTIAL_ALGORITHM_LEVEL via a new `diff_05_aprt_stage.rs` include slot.

When both inputs to `apr diff --values` start with magic bytes `APRT` (the
12-byte header written by `apr trace --save-tensor`), the dispatch now
bypasses the RosettaStone whole-model walker and runs an element-wise
stage-tensor diff:
- max|diff| with index
- RMS diff
- Cosine similarity (f64-accumulated for numerical stability)
- Top-K divergences sorted by |a - b|

Both JSON and pretty text output are supported. Mismatched dim_product or
layer fields fail-fast with a diagnostic error so callers don't silently
compare incompatible stages.
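
The four statistics listed above can be computed in a single pass plus one sort. The following is a minimal sketch under stated assumptions, not the contents of `diff_05_aprt_stage.rs`: `StageDiffStats` and `stage_diff_stats` are hypothetical names, inputs are taken as already-decoded `f32` slices, a length mismatch returns `None` instead of the diagnostic error the commit describes, and a zero-norm input yields a cosine of 0.0 by convention.

```rust
/// Hypothetical result type for an element-wise stage-tensor diff.
#[derive(Debug, Clone, PartialEq)]
pub struct StageDiffStats {
    pub max_abs_diff: f32,
    pub max_index: usize,
    pub rms_diff: f64,
    pub cosine: f64,
    /// (index, |a - b|) pairs, largest divergence first.
    pub top_k: Vec<(usize, f32)>,
}

pub fn stage_diff_stats(a: &[f32], b: &[f32], k: usize) -> Option<StageDiffStats> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    // Accumulate sums in f64 so long tensors do not lose precision.
    let (mut sq, mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64, 0.0f64);
    let mut diffs: Vec<(usize, f32)> = Vec::with_capacity(a.len());
    for (i, (&x, &y)) in a.iter().zip(b).enumerate() {
        diffs.push((i, (x - y).abs()));
        let (xf, yf) = (x as f64, y as f64);
        sq += (xf - yf) * (xf - yf);
        dot += xf * yf;
        na += xf * xf;
        nb += yf * yf;
    }
    // Sort by |a - b| descending; ties broken by lower index for determinism.
    diffs.sort_by(|l, r| r.1.total_cmp(&l.1).then(l.0.cmp(&r.0)));
    let (max_index, max_abs_diff) = diffs[0];
    diffs.truncate(k);
    let denom = (na * nb).sqrt();
    let cosine = if denom > 0.0 { dot / denom } else { 0.0 };
    Some(StageDiffStats {
        max_abs_diff,
        max_index,
        rms_diff: (sq / a.len() as f64).sqrt(),
        cosine,
        top_k: diffs,
    })
}
```

Sorting the full difference vector is the simplest way to get top-K; for very large stages a bounded heap of size K would avoid the O(n log n) sort, at the cost of more code.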

## Five Whys (why now, why this scope)

1. **Why is this needed?** `apr trace --save-tensor` (PR-A #1405, PR-B #1406,
   PR-C-prep #1407) writes per-stage f32 tensors as `APRT`-prefixed files.
   Without an APRT-aware diff, layer-0 stage-by-stage element-wise
   bisection per `feedback_model_1_ships_gpu_only.md` is gated on external
   tooling — exactly the kind of muda the APR-MONO §26.8 rule forbids.
2. **Why extend `apr diff` and not write a new subcommand?** The
   `apr_diff_values_compat` invariant in `apr-cli-trace-save-tensor-v1`
   already names `apr diff --values` as the verifier. Extending the
   existing flag keeps the contract surface stable.
3. **Why an include!() file instead of inlining into diff.rs?** diff.rs
   already follows that pattern (diff_accumulator, diff_output_json_text,
   diff_04). Keeping APRT logic in `diff_05_aprt_stage.rs` lets it be
   audited / removed independently and doesn't grow the parent file.
4. **Why pin via `provenance_pin_pr_d_rev1`?** Future renames of either
   `is_aprt_stage_file` or the file path break the include!() chain;
   the pin makes that visible at test-time and forces a contract bump.
5. **Why now?** Tokenization of the 27 GB Stack v1.2 Python corpus is
   running in the background for MODEL-2 (PR #1412 merged). The SHIP-007
   PR-C-real cascade for MODEL-1 needs PR-D infrastructure ready when
   step 2 (forward_traced threading) lands. PR-D is independent and can
   merge in parallel with #1408.

## Verification

- `cargo test -p apr-cli --lib commands::diff::aprt` → 11/11 PASS
  - is_aprt_stage_file: detects/rejects/truncated/missing (4 tests)
  - compute_aprt_stage_stats: identical=zero, known max/RMS, top-K sort (3)
  - run_aprt_stage_diff: dim/layer mismatch errors, identical succeeds (3)
  - provenance_pin_pr_d_rev1 (1)
- `cargo clippy -p apr-cli --lib --no-deps -- -D warnings` clean
- `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors
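
For readers unfamiliar with magic-byte dispatch, the detection exercised by the `is_aprt_stage_file` tests might look something like this sketch. Only the `APRT` magic and the 12-byte header length come from the commit message; the helper names are hypothetical, the remaining header fields are not parsed because their layout is not stated here, and treating truncated or missing files as "not APRT" is an assumption about the intended behavior.

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

const APRT_MAGIC: &[u8; 4] = b"APRT";
const APRT_HEADER_LEN: usize = 12;

/// True when `bytes` is long enough to hold a full header and starts with
/// the magic. The 8 bytes after the magic are deliberately not interpreted.
pub fn has_aprt_header(bytes: &[u8]) -> bool {
    bytes.len() >= APRT_HEADER_LEN && bytes.starts_with(APRT_MAGIC)
}

/// Hypothetical file-level check: read exactly one header's worth of bytes.
/// Missing, unreadable, or truncated files are reported as `false`.
pub fn is_aprt_stage_file_sketch(path: &Path) -> bool {
    let mut header = [0u8; APRT_HEADER_LEN];
    match File::open(path) {
        Ok(mut f) => f.read_exact(&mut header).is_ok() && has_aprt_header(&header),
        Err(_) => false,
    }
}
```

A dispatcher would call such a check on both inputs and fall back to the whole-model path unless both return `true`.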

## Contract update

`apr-cli-trace-save-tensor-v1` v1.0.0 → v1.1.0:
- New FALSIFY-APR-TRACE-SAVE-009 binding `apr_diff_values_compat` at
  PARTIAL_ALGORITHM_LEVEL with 4-line `algorithm_evidence` block citing
  this PR's unit tests.

## Ship % update

MODEL-1: ~64% → ~66% (PR-D is small but discharges 1 PARTIAL invariant
and clears infrastructure blocker for SHIP-007 step E).
MODEL-2: corpus tokenization in progress (~33h ETA).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>