feat(apr-publish): LFS batch upload + NDJSON commits + valid model-index YAML (PMAT-690 defect 5)#1772

Merged
noahgift merged 5 commits into main from feat/apr-publish-lfs-batch-defect-5
May 18, 2026

@noahgift

Summary

Fixes three cascading defects in apr publish that made publishing fail silently. They surfaced while publishing paiml/albor-370m-v1 on 2026-05-17:

  • 5a — LFS batch upload missing for 5MB-5GB files. upload_via_lfs skipped the LFS batch API step entirely for files that HF returns with uploadMode: lfs and no inline URL (the band between regular preupload and Xet's 5GiB threshold).
  • 5b — JSON commits return 200 but drop files. Both commit_lfs_pointer and upload_direct used JSON `addOrUpdate` operations. HF replies `success: true` but silently discards the file. Memory rule feedback_hf_commit_ndjson_load_bearing.md (2026-04-18) already mandated NDJSON + lfsFile, but the rule was only enforced in aprender-data.
  • 5c — Empty model-index rejected by HF. `ModelCard::to_huggingface` always emitted `model-index:` with a name, but only emitted `results:` if metrics existed. HF rejects with HTTP 400 `"model-index[0].results" is required`.

After all three fixes, a fresh publish lands the LFS artifacts + auto-generated README:

```
1632 .gitattributes
164 README.md
2166458784 albor-370m-v1-q4k.gguf (lfs=True)
2524492804 albor-370m-v1.apr (lfs=True)
2520702380 albor-370m-v1.safetensors (lfs=True)
```

paiml/albor-370m-v1 is now live on HF Hub with all 3 binary artifacts.

Detail

5a — upload_via_lfs_batch (new)

Implements the standard LFS Batch API flow:

  1. POST `/{repo}.git/info/lfs/objects/batch` with `{operation, transfers, objects[{oid, size}]}`.
  2. Parse `objects[0].actions.upload.href`. If absent, the blob already exists — skip PUT.
  3. PUT the data with optional headers.
  4. Optional verify POST.
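
The first two steps can be sketched as follows. This is a minimal, std-only illustration and not the code in this PR: `UploadAction`, `NextStep`, `batch_request_body` and `next_step` are hypothetical names, the `"upload"` operation and `"basic"` transfer values are assumed from the generic Git LFS Batch API, and JSON parsing of the response is left out.

```rust
/// The `actions.upload` entry of a batch response, already parsed.
/// Hypothetical type: response parsing is omitted to keep the sketch std-only.
#[derive(Debug, Clone, PartialEq)]
pub struct UploadAction {
    pub href: String,
    /// Optional headers the server wants sent along with the PUT.
    pub headers: Vec<(String, String)>,
}

/// What the client does next for a single object.
#[derive(Debug, PartialEq)]
pub enum NextStep {
    /// No upload action came back: the blob already exists, so skip the PUT.
    SkipPut,
    /// PUT the bytes to `href` with the given headers.
    Put { href: String, headers: Vec<(String, String)> },
}

/// Step 1: request body for POST `/{repo}.git/info/lfs/objects/batch`.
/// `oid` is assumed to be a hex SHA-256 digest, so it needs no JSON escaping.
pub fn batch_request_body(oid: &str, size: u64) -> String {
    format!(
        r#"{{"operation":"upload","transfers":["basic"],"objects":[{{"oid":"{oid}","size":{size}}}]}}"#
    )
}

/// Step 2: decide whether a PUT is needed from `objects[0].actions.upload`.
pub fn next_step(upload: Option<UploadAction>) -> NextStep {
    match upload {
        None => NextStep::SkipPut,
        Some(action) => NextStep::Put {
            href: action.href,
            headers: action.headers,
        },
    }
}
```

Modelling the "href absent" case as `Option::None` keeps the skip-PUT branch explicit instead of treating a missing URL as an error.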

5b — NDJSON commit format

Both file-commit paths now send `application/x-ndjson` with two lines:

```json
{"key":"header","value":{"summary":"...","description":""}}
{"key":"lfsFile","value":{"path":"...","algo":"sha256","oid":"...","size":...}}
```

(For small files: `{"key":"file","value":{"path":"...","content":"base64...","encoding":"base64"}}`.)
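
A rough, std-only sketch of how the two-line LFS payload could be assembled is shown here. The function names `json_escape` and `lfs_commit_payload` are hypothetical and this is not the PR's implementation. Only the key names and field layout come from the lines above.

```rust
/// Minimal JSON string escaper: quotes, backslashes and control characters.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds the NDJSON body: one `header` line, then one `lfsFile` line.
/// `oid` is assumed to be a hex SHA-256 digest and is inserted unescaped.
pub fn lfs_commit_payload(summary: &str, path: &str, oid: &str, size: u64) -> String {
    let header = format!(
        r#"{{"key":"header","value":{{"summary":"{}","description":""}}}}"#,
        json_escape(summary)
    );
    let lfs_file = format!(
        r#"{{"key":"lfsFile","value":{{"path":"{}","algo":"sha256","oid":"{}","size":{}}}}}"#,
        json_escape(path),
        oid,
        size
    );
    // Each JSON object sits on its own line; send with
    // `Content-Type: application/x-ndjson`.
    format!("{header}\n{lfs_file}")
}
```

Each line is a complete JSON object on its own, which is what distinguishes NDJSON from a single JSON document containing an array of operations.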

5c — Conditional model-index block

```rust
if !self.metrics.is_empty() {
    output.push_str("model-index:\n");
    // ... only emit results when there are metrics to report
}
```
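
To make the effect of the guard concrete, here is a self-contained sketch with a stand-in `ModelCard`. The struct fields, the `license` key, the method name `to_front_matter` and the exact YAML layout inside `results:` are assumptions for illustration. The real `ModelCard::to_huggingface` in aprender-core is not reproduced here, and a complete result entry on HF likely carries more fields (such as task and dataset) than this sketch emits.

```rust
/// Hypothetical metric entry: a metric type plus its value.
pub struct Metric {
    pub kind: String,
    pub value: f64,
}

/// Stand-in for the real model card type.
pub struct ModelCard {
    pub name: String,
    pub license: String,
    pub metrics: Vec<Metric>,
}

impl ModelCard {
    /// Emits YAML front matter; `model-index:` appears only with metrics.
    pub fn to_front_matter(&self) -> String {
        let mut output = String::from("---\n");
        output.push_str(&format!("license: {}\n", self.license));
        // The guard covers the whole block, so `model-index:` can never be
        // emitted without its `results:` child.
        if !self.metrics.is_empty() {
            output.push_str("model-index:\n");
            output.push_str(&format!("- name: {}\n", self.name));
            output.push_str("  results:\n");
            output.push_str("  - metrics:\n");
            for m in &self.metrics {
                output.push_str(&format!(
                    "    - type: {}\n      value: {}\n",
                    m.kind, m.value
                ));
            }
        }
        output.push_str("---\n");
        output
    }
}
```

With an empty `metrics` vector the front matter contains no `model-index` key at all, which is the case that previously triggered the HTTP 400.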

Known gaps (separate follow-up)

`find_model_files` in apr-cli/src/commands/publish.rs only picks .apr/.safetensors/.gguf — config.json, vocab.json, merges.txt, and user-provided README.md from the staging dir are skipped. The 11.6KB user-authored model card was ignored in favor of the auto-generated 164-byte stub. This is a file-selection defect, not upload-correctness — separate PR.
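
The selection behavior described above can be approximated with an extension filter. This is an illustrative sketch only: `is_model_artifact` and `partition_staging` are hypothetical names, and the actual `find_model_files` may differ in how it walks the directory.

```rust
use std::path::Path;

/// True for the three extensions the text says are picked up today.
pub fn is_model_artifact(name: &str) -> bool {
    matches!(
        Path::new(name).extension().and_then(|ext| ext.to_str()),
        Some("apr" | "safetensors" | "gguf")
    )
}

/// Splits staging-directory entries into (picked, skipped), preserving order.
pub fn partition_staging<'a>(names: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    names.iter().copied().partition(|name| is_model_artifact(name))
}
```

Run against a staging directory like the one described, the skipped list would contain config.json, vocab.json, merges.txt and README.md, which is the gap deferred to the follow-up PR.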

Test plan

  • cargo test -p aprender-core --lib model_card → 25 passed
  • Empirical: full publish of paiml/albor-370m-v1 succeeds with 3 LFS files on disk
  • Unit test pinning model-index YAML behavior with empty + non-empty metrics
  • Integration test (mockito or recorded fixture) for the LFS batch + NDJSON flow

🤖 Generated with Claude Code

…dex YAML (PMAT-690 P3-C-prep defect 5)

Surfaced publishing paiml/albor-370m-v1 on 2026-05-17. apr publish
returned "✓ Published" yet the repo showed only .gitattributes. Three
distinct sub-defects cascaded together:

5a. LFS batch upload missing for 5MB–5GB files
============================================
HF's preupload endpoint returns `uploadMode: "lfs"` with no inline
URLs for files in the 5MB–5GB band, expecting the client to fetch the
presigned S3 URL via the LFS Batch API
(POST `/{repo}.git/info/lfs/objects/batch`). Our upload_via_lfs skipped
this step entirely and went straight to commit_lfs_pointer, landing
orphaned pointers (the Xet branch handles >5GiB; nothing covered the
gap below it).

Fix: new `upload_via_lfs_batch` method ported from aprender-data's
working flow. Calls batch API → parses `objects[0].actions.upload.href`
→ PUTs the blob with optional headers → optional verify POST.
Empirical: paiml/albor-370m-v1 .apr (2.52GB) + .gguf (2.17GB) +
.safetensors (2.52GB) all upload at ~67 MB/s.
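
The routing this fix completes might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Route` enum, the `choose_route` signature, the `"regular"` mode string used in the usage, and the strict greater-than comparison against 5 GiB are not taken from the aprender-core source.

```rust
const GIB: u64 = 1024 * 1024 * 1024;
/// Assumed from the text: the Xet branch handles files above 5 GiB.
const XET_THRESHOLD: u64 = 5 * GIB;

#[derive(Debug, PartialEq)]
pub enum Route {
    /// Small file: content goes inline in the commit.
    Direct,
    /// Preupload said "lfs" and supplied an upload URL.
    LfsInlineUrl,
    /// Preupload said "lfs" with no URL: fetch one via the LFS Batch API.
    LfsBatch,
    /// Above the Xet threshold.
    Xet,
}

/// Picks an upload route from the preupload response and the file size.
pub fn choose_route(upload_mode: &str, has_inline_url: bool, size: u64) -> Route {
    if size > XET_THRESHOLD {
        return Route::Xet;
    }
    match (upload_mode, has_inline_url) {
        ("lfs", false) => Route::LfsBatch,
        ("lfs", true) => Route::LfsInlineUrl,
        _ => Route::Direct,
    }
}
```

Under these assumptions the 2.52GB .apr file takes the `LfsBatch` route, the case that previously fell through to a pointer-only commit.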

5b. JSON `addOrUpdate` commit returns 200 but drops files
=========================================================
The memory rule `feedback_hf_commit_ndjson_load_bearing.md`
(2026-04-18) — "HF commit MUST use application/x-ndjson + lfsFile key"
— was already known, but only enforced in aprender-data. The model
publish path in aprender-core used JSON with `op: "addOrUpdate"` for
BOTH the LFS-pointer commit and the small-file commit. HF returned
HTTP 200 + `success: true` for both, but actually persisted nothing
beyond `.gitattributes`.

Fix:
- commit_lfs_pointer now sends NDJSON `{key: "header"} \n {key: "lfsFile", value: {path, algo: "sha256", oid, size}}` with `Content-Type: application/x-ndjson`.
- upload_direct now sends NDJSON `{key: "header"} \n {key: "file", value: {path, content (base64), encoding: "base64"}}` (matches build_ndjson_upload_payload in aprender-data).
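
For the small-file path, a std-only sketch of the `file` line could look like this. The names `base64_encode` and `small_file_commit_payload` are hypothetical, the base64 encoder is hand-rolled only to avoid a dependency in the example, and the escaping covers only quotes and backslashes.

```rust
const B64: &[u8; 64] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard base64 with `=` padding.
fn base64_encode(data: &[u8]) -> String {
    let mut out = String::with_capacity((data.len() + 2) / 3 * 4);
    for chunk in data.chunks(3) {
        let b1 = *chunk.get(1).unwrap_or(&0);
        let b2 = *chunk.get(2).unwrap_or(&0);
        let n = (u32::from(chunk[0]) << 16) | (u32::from(b1) << 8) | u32::from(b2);
        out.push(B64[(n >> 18) as usize & 63] as char);
        out.push(B64[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { B64[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { B64[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Escapes backslashes and quotes only; control characters are not handled.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Builds the NDJSON body: one `header` line, then one `file` line.
pub fn small_file_commit_payload(summary: &str, path: &str, content: &[u8]) -> String {
    let header = format!(
        r#"{{"key":"header","value":{{"summary":"{}","description":""}}}}"#,
        escape(summary)
    );
    let file = format!(
        r#"{{"key":"file","value":{{"path":"{}","content":"{}","encoding":"base64"}}}}"#,
        escape(path),
        base64_encode(content)
    );
    format!("{header}\n{file}")
}
```

A real implementation would more likely use an established base64 crate and a JSON serializer. The point here is only the shape of the two lines.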

5c. Auto-generated README rejected by HF (HTTP 400)
====================================================
ModelCard::to_huggingface unconditionally emits `model-index:` with a
name but only emits `results:` if metrics are non-empty. HF's metadata
validator requires `model-index[0].results`, so the auto-generated
README was rejected with:

  "model-index[0].results" is required

Fix: skip the entire `model-index:` block when metrics are empty.
Empty model-index is invalid and signal-free anyway.

End-to-end verification
=======================
After all three fixes:
  $ apr publish /tmp/albor-370m-staging paiml/albor-370m-v1 ...
  ✓ Published

  $ curl ".../api/models/paiml/albor-370m-v1/tree/main"
       1632  .gitattributes
        164  README.md
   2166458784  albor-370m-v1-q4k.gguf (lfs=True)
   2524492804  albor-370m-v1.apr      (lfs=True)
   2520702380  albor-370m-v1.safetensors (lfs=True)

All 3 LFS artifacts + auto-generated README on the repo.

Known gaps (filed as follow-up)
================================
- find_model_files (apr-cli/src/commands/publish.rs:496) only picks
  .apr/.safetensors/.gguf — companion files (config.json, vocab.json,
  merges.txt, user-provided README.md from staging dir) are skipped.
- The 11.6KB user-authored model card in /tmp/albor-370m-staging/
  README.md was ignored in favor of the auto-generated 164-byte stub.

These are file-selection defects, not upload-correctness — separate PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) May 17, 2026 21:41
noahgift merged commit 1837321 into main May 18, 2026
10 checks passed
noahgift deleted the feat/apr-publish-lfs-batch-defect-5 branch May 18, 2026 00:38