Skip to content

fix(scripts): harden YAML scanner against shell/regex fragment misclassification#262

Merged
BartWaardenburg merged 1 commit intofallow-rs:mainfrom
fmguerreiro:upstream/yaml-scanner
May 3, 2026
Merged

fix(scripts): harden YAML scanner against shell/regex fragment misclassification#262
BartWaardenburg merged 1 commit intofallow-rs:mainfrom
fmguerreiro:upstream/yaml-scanner

Conversation

@fmguerreiro
Copy link
Copy Markdown
Contributor

Problem

fallow dead-code on a typical CI repo emits 20+ WARN invalid entry pattern lines from tokens extracted out of GitHub Actions workflow run: blocks. Three classes of false positive:

  • GitHub Actions expressions: curl "${{ env.URL }}/api/health" whitespace-splits, producing chunks like }}/api/health" (unclosed alternation in glob syntax).
  • jq array iterators: jq -r '.[]' file.json produces the token '.[]' (empty character class).
  • Perl regex fragments: grep -oP '(?<=Module )\./[^ ]+(?= ...)' produces chunks like )\./[^ (backslash plus unclosed class).

These reach globset::GlobBuilder::new(...).build() which rejects them and emits the warnings. They are not real entry patterns and should never reach globset.

Fix

Two changes in crates/core/src/scripts/:

  1. could_be_file_path (new, mod.rs): negative-only guard rejecting tokens whose syntax precludes a Unix path. Three rules:

    • ${{ ... }} GHA expressions, plus standalone }} not balanced by {{ (so Mustache/Handlebars template paths like templates/{{name}}.hbs still pass).
    • Backslash anywhere in the token.
    • Empty [] or unclosed [^... character classes (first [ only; suffices for in-the-wild fragments).
  2. looks_like_file_path refactor (mod.rs): now delegates to could_be_file_path for the negative checks before applying its strict positive rules (known extension, .//../ prefix, contains / and not a scoped package or URL). Behaviour unchanged for callers; the split exposes a lenient filter for use cases where bare-name paths (deploy.log, Makefile) must pass through.

  3. extract_ci_signals filter (ci.rs): cmd.config_args (the value after --config/-c) bypasses parse_script's existing looks_like_file_path gate, so jq array iterators or regex fragments passed via flags can leak into entry_files. Re-filter through could_be_file_path here. cmd.file_args are already pre-filtered by parse_script and are not re-checked.

Tests

10 unit tests in mod.rs (existing GHA test renamed for consistency, 5 new for jq/regex/Mustache/bare-name/lenient-positive cases) and 2 integration tests in ci.rs using verbatim in-the-wild YAML fragments. Next.js dynamic route segments (app/[id]/page.tsx, pages/[...slug].ts) are explicitly tested as positive cases to guard against false negatives in the bracket-class rule.

Risk

  • The bracket-class rule only checks the first [ in a token. Documented in the code comment. A token with a valid leading class followed by a malformed trailing class would slip through; no such fragment has been observed in the wild.
  • The }} rule allows balanced Mustache/Handlebars templates. If an upstream user has paths containing literal }} without a matching {{, they would be filtered. Such paths are not valid Unix path syntax in any case.

Verified on

cargo test --workspace, cargo clippy --workspace --all-targets -- -D warnings, cargo fmt --all -- --check all clean. Real-world: this fork-pinned binary is currently running in a Turborepo monorepo CI (example output) where it dropped stderr noise from 30+ warnings to 0 across the dead-code, audit, and health commands.

…ssification

The CI YAML scanner extracts shell command tokens from `run:` blocks
into entry-pattern candidates and feeds them to globset. Several token
classes are not valid filesystem paths and produce noisy
`invalid entry pattern` warnings on every run:

- GitHub Actions expressions: `${{ env.URL }}/api/health` whitespace-splits
  into chunks like `}}/api/health` (unclosed alternation).
- jq array iterators: `jq -r '.[]'` produces the token `'.[]'`
  (empty character class in glob syntax).
- Perl regex fragments: `grep -oP '(?<=Module )\\./[^ ]+...'` produces
  chunks like `)\\./[^` (backslash plus unclosed class).

Two changes:

1. Tighten `looks_like_file_path` with negative guards for tokens that
   cannot be Unix paths: `${{`/`}}`, backslashes, and `[` followed
   immediately by `]` or with no closing `]`. Next.js dynamic-route
   segments like `app/[id]/page.tsx` are unaffected because their `[`
   has at least one char before `]`.
2. Filter `extract_ci_signals`'s `cmd.config_args`/`cmd.file_args` through
   the negative-only guard `could_be_file_path` before extending
   `entry_files`, so quoted shell tokens that `parse_script` classifies
   as args never reach globset compilation. Bare-name paths like
   `deploy.log` still pass through.

Tests: 6 new unit tests covering each rejection class plus a Next.js
dynamic-route positive case, 2 integration tests using verbatim
in-the-wild fragments from real-world workflow YAML.
@BartWaardenburg BartWaardenburg merged commit db70a40 into fallow-rs:main May 3, 2026
1 check passed
BartWaardenburg added a commit that referenced this pull request May 3, 2026
PR #262 added `could_be_file_path` in `crates/core/src/scripts/mod.rs` to
reject tokens whose syntax precludes a Unix path (GHA expressions,
backslash escapes, malformed `[...]`) before they reach globset
compilation. The sibling `looks_like_file_path` /
`looks_like_script_file` in `crates/core/src/discover/parse_scripts.rs`
classify the same kinds of tokens for package.json / Dockerfile /
Procfile / fly.toml extraction, but feed `resolve_entry_path_with_tracking`
(real-fs `is_file()` check) instead of globset, so they never produced
the noisy `invalid entry pattern` warning.

No behavioral bug exists today, but maintaining two near-identical
classifiers invites drift. Promote `could_be_file_path` from private to
`pub` and call it as a pre-filter from the discover-side classifiers.
This is a single-source-of-truth refactor with a micro perf win
(skip `Path::join` + `is_file` syscalls on tokens that cannot be paths)
and zero behavioral change for users.

Verified: total_issues on the next.js benchmark is unchanged (24825),
no `invalid entry pattern` warnings emitted (already 0 after #262), all
1917 fallow-core unit tests + 297 integration tests pass, full workspace
test suite green.
@BartWaardenburg
Copy link
Copy Markdown
Collaborator

Released in v2.63.0. Thanks @fmguerreiro.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants