fix(cli): filter raw HARs from publishable manuscripts #2525

Merged
mergify[bot] merged 3 commits into main from fix/issue-2321-har-manuscripts
May 31, 2026

Conversation

tmchow commented May 31, 2026

Owner

Summary

Publishable manuscripts no longer carry raw browser-sniff HAR captures or huge capture files. Archive, lock promote, and publish package now all copy manuscript trees through the same publishable filter, while derived artifacts such as research notes, proofs, traffic analysis, and unique-path reports remain bundled.

publish package also strips any .manuscripts directory copied from the source CLI before re-adding manuscripts through that filter, with a fallback for embedded manuscript runs when the archive root is unavailable.

Closes #2321

Verification

  • go test ./internal/pipeline -run 'TestArchiveRunArtifactsCopiesDiscovery|TestPromoteWorkingCLI_StagesRunstateManuscripts' -count=1
  • go test ./internal/cli -run 'TestPublishPackageIncludesManuscripts|TestPublishPackageFiltersEmbeddedManuscriptsFallback' -count=1
  • go test ./...
  • go build -o ./cli-printing-press ./cmd/cli-printing-press
  • golangci-lint run ./...

mergify Bot commented May 31, 2026

Contributor

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 require-ready-label-and-ci

Wonderful, this rule succeeded.
  • #changes-requested-reviews-by = 0
  • #review-threads-unresolved = 0
  • check-success = build-and-test
  • check-success = generated-test
  • check-success = go-lint
  • check-success = golden
  • check-success = pr-title
  • check-success = test
  • any of:
    • label = ready-to-merge
    • all of:
      • head = release-please--branches--main
      • title ~= ^chore\(main\): release
  • any of:
    • -files ~= ^(\.github/workflows/|\.github/scripts/|scripts/|\.github/CODEOWNERS$)
    • author = tmchow
    • approved-reviews-by = mvanhorn
    • approved-reviews-by = tmchow
    • author = mvanhorn
  • any of:
    • check-success = Greptile Review
    • label = queued
    • check-neutral = Greptile Review
    • check-skipped = Greptile Review
    • head ~= ^mergify/merge-queue/
    • all of:
      • head = release-please--branches--main
      • title ~= ^chore\(main\): release

greptile-apps Bot commented May 31, 2026

Contributor

Greptile Summary

This PR prevents raw HAR captures and oversized capture files from being bundled into published CLI packages by routing all manuscript copy operations through a new CopyPublishableManuscriptDir filter. The filter is applied consistently across archive, lock/promote, and publish-package flows.

  • New filter (copyDirFiltered): skips files (and symlinks, by name and resolved-target check) whose extension is .har (case-insensitive) or whose size is ≥ 100 MB; CopyDir is unchanged and still passes nil for backward compatibility.
  • Publish package hardening: staged .manuscripts copied from the source CLI is stripped before manuscripts are re-added through the filter; a fallback reads embedded manuscripts from the source CLI directory when the archive root has no matching run.
  • Test coverage: all three affected copy sites gain before/after assertions for .har exclusion, large-file exclusion, and preservation of derived artifacts (traffic analysis JSON, research notes, proofs).
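The skip rule described in the bullets above can be sketched as a small predicate. This is an illustration only: `skipPublishable` and `maxPublishableBytes` are hypothetical names, not the identifiers used in `internal/pipeline/publish.go`, and the sketch assumes "100 MB" means 100 × 1024 × 1024 bytes, which this page does not confirm.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// maxPublishableBytes is an assumed cutoff. The PR says files of 100 MB or
// more are skipped; whether that is MiB or 10^6 bytes is not stated here.
const maxPublishableBytes int64 = 100 * 1024 * 1024

// skipPublishable is a hypothetical predicate: it reports true for entries
// that should be left out of a publishable manuscript copy.
func skipPublishable(name string, size int64) bool {
	// Extension match is case-insensitive, so .har, .HAR, and .Har all match.
	if strings.EqualFold(filepath.Ext(name), ".har") {
		return true
	}
	// The threshold is inclusive: a file of exactly the limit is skipped.
	return size >= maxPublishableBytes
}

func main() {
	fmt.Println(skipPublishable("session.HAR", 10))                         // true
	fmt.Println(skipPublishable("traffic-analysis.json", 2048))             // false
	fmt.Println(skipPublishable("capture.json", maxPublishableBytes))       // true
	fmt.Println(skipPublishable("capture.json", maxPublishableBytes-1))     // false
}
```

Keeping the rule in one predicate is what would let a plain copy pass `nil` and stay unfiltered while the publishable copy passes the predicate.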

Confidence Score: 5/5

Safe to merge. The filter is applied consistently across all three copy sites (archive, promote, publish), the two-pass symlink check correctly covers both symlink names and resolved targets, and every changed code path has a corresponding test that verifies both exclusion and inclusion behavior.

The filter logic in copyDirFiltered is straightforward — regular files are checked directly by name and size, symlinks get a two-pass check (symlink name first, then os.Stat-followed target name and size). CopyDir is unchanged and backward-compatible. The publish-package fallback cleanly strips the staged manuscripts directory before re-adding through the filter, sourcing from the original dir rather than the staged copy. Test coverage spans all three call sites and explicitly exercises the threshold boundary (exactly 100 MB file, symlink-to-.har, symlink-to-large-file).

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| internal/pipeline/publish.go | Core change: refactors CopyDir into copyDirFiltered with an optional skip predicate; adds CopyPublishableManuscriptDir that strips .har files and files ≥ 100 MB. Symlink filtering is two-pass: first by symlink name, then by resolved target name and followed-stat size. ArchiveRunArtifacts now routes through the new filter. |
| internal/cli/publish.go | publish package now strips the staged .manuscripts directory copied from the source CLI before re-adding manuscripts through the publishable filter; adds a fallback that picks up embedded manuscripts from the source CLI's own .manuscripts when the archive root has no matching run. |
| internal/pipeline/lock.go | One-line change: stageRunstateManuscripts now calls CopyPublishableManuscriptDir instead of CopyDir, bringing the lock/promote path in line with archive and publish. |
| internal/pipeline/publish_test.go | New TestCopyPublishableManuscriptDirFiltersSymlinks exercises the four symlink cases: plain symlink (preserved), symlink named .har (filtered), symlink pointing to .har (filtered), and symlink pointing to a 100 MB file (filtered at threshold boundary). |
| internal/cli/publish_test.go | Extends TestPublishPackageIncludesManuscripts to verify .har files are stripped and traffic-analysis.json survives; adds TestPublishPackageFiltersEmbeddedManuscriptsFallback for the embedded manuscript fallback path. |
| internal/pipeline/climanifest_test.go | TestArchiveRunArtifactsCopiesDiscovery extended with a .har file and a 101 MB JSON capture; asserts both are absent from the archived discovery directory. |
| internal/pipeline/lock_test.go | TestPromoteWorkingCLI_StagesRunstateManuscripts extended with a .har capture and a traffic-analysis.json; verifies the former is absent and the latter is present after promote. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Manuscript source] --> B{copyDirFiltered}
    B --> C{Entry type?}
    C -->|Regular file| D{.har ext OR size >= 100 MB?}
    C -->|Symlink| E{Symlink name ends in .har?}
    D -->|Yes| SKIP1[Skip]
    D -->|No| COPY1[Copy file to dst]
    E -->|Yes| SKIP2[Skip]
    E -->|No| F{Target name .har OR target size >= 100 MB?}
    F -->|Yes| SKIP3[Skip]
    F -->|No| COPY2[Create symlink in dst]
    C -->|Directory| MKDIR[MkdirAll in dst]

    subgraph Callers
        AR[ArchiveRunArtifacts] -->|CopyPublishableManuscriptDir| B
        LK[stageRunstateManuscripts] -->|CopyPublishableManuscriptDir| B
        PP[publish package] --> G{Archive run found?}
        G -->|Yes| H[Use archive path]
        G -->|No| I[Fallback: embedded .manuscripts]
        H -->|CopyPublishableManuscriptDir| B
        I -->|CopyPublishableManuscriptDir| B
    end

Reviews (3): Last reviewed commit: "fix(cli): filter manuscript symlink targ..."

Comment thread internal/pipeline/publish.go
tmchow added the ready-to-merge label (Allow Mergify to queue and merge this PR when protections pass) May 31, 2026
mergify Bot added the queued label (PR is in the Mergify merge queue) May 31, 2026
mergify Bot added a commit that referenced this pull request May 31, 2026
mergify Bot added a commit that referenced this pull request May 31, 2026
mergify Bot added a commit that referenced this pull request May 31, 2026
mergify Bot merged commit a5f6e70 into main May 31, 2026
28 checks passed
mergify Bot deleted the fix/issue-2321-har-manuscripts branch May 31, 2026 18:07

mergify Bot commented May 31, 2026

Contributor

Merge Queue Status

This pull request spent 17 minutes 29 seconds in the queue, including 14 minutes 29 seconds running CI.

mergify Bot removed the queued label (PR is in the Mergify merge queue) May 31, 2026
Labels

ready-to-merge Allow Mergify to queue and merge this PR when protections pass

Development

Successfully merging this pull request may close these issues.

generator: raw sniff HARs (with session PII) bundled into publishable manuscripts

1 participant