fix(scripts): harden YAML scanner against shell/regex fragment misclassification#262
Merged
BartWaardenburg merged 1 commit intofallow-rs:mainfrom May 3, 2026
Merged
Conversation
…ssification
The CI YAML scanner extracts shell command tokens from `run:` blocks
into entry-pattern candidates and feeds them to globset. Several token
classes are not valid filesystem paths and produce noisy
`invalid entry pattern` warnings on every run:
- GitHub Actions expressions: `${{ env.URL }}/api/health` whitespace-splits
into chunks like `}}/api/health` (unclosed alternation).
- jq array iterators: `jq -r '.[]'` produces the token `'.[]'`
(empty character class in glob syntax).
- Perl regex fragments: `grep -oP '(?<=Module )\\./[^ ]+...'` produces
chunks like `)\\./[^` (backslash plus unclosed class).
Two changes:
1. Tighten `looks_like_file_path` with negative guards for tokens that
cannot be Unix paths: `${{`/`}}`, backslashes, and `[` followed
immediately by `]` or with no closing `]`. Next.js dynamic-route
segments like `app/[id]/page.tsx` are unaffected because their `[`
has at least one char before `]`.
2. Filter `extract_ci_signals`'s `cmd.config_args`/`cmd.file_args` through
the negative-only guard `could_be_file_path` before extending
`entry_files`, so quoted shell tokens that `parse_script` classifies
as args never reach globset compilation. Bare-name paths like
`deploy.log` still pass through.
Tests: 6 new unit tests covering each rejection class plus a Next.js
dynamic-route positive case, 2 integration tests using verbatim
in-the-wild fragments from real-world workflow YAML.
8 tasks
BartWaardenburg
added a commit
that referenced
this pull request
May 3, 2026
PR #262 added `could_be_file_path` in `crates/core/src/scripts/mod.rs` to reject tokens whose syntax precludes a Unix path (GHA expressions, backslash escapes, malformed `[...]`) before they reach globset compilation. The sibling `looks_like_file_path` / `looks_like_script_file` in `crates/core/src/discover/parse_scripts.rs` classify the same kinds of tokens for package.json / Dockerfile / Procfile / fly.toml extraction, but feed `resolve_entry_path_with_tracking` (real-fs `is_file()` check) instead of globset, so they never produced the noisy `invalid entry pattern` warning. No behavioral bug exists today, but maintaining two near-identical classifiers invites drift. Promote `could_be_file_path` from private to `pub` and call it as a pre-filter from the discover-side classifiers. This is a single-source-of-truth refactor with a micro perf win (skip `Path::join` + `is_file` syscalls on tokens that cannot be paths) and zero behavioral change for users. Verified: total_issues on the next.js benchmark is unchanged (24825), no `invalid entry pattern` warnings emitted (already 0 after #262), all 1917 fallow-core unit tests + 297 integration tests pass, full workspace test suite green.
Collaborator
|
Released in v2.63.0. Thanks @fmguerreiro. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
fallow dead-codeon a typical CI repo emits 20+WARN invalid entry patternlines from tokens extracted out of GitHub Actions workflowrun:blocks. Three classes of false positive:curl "${{ env.URL }}/api/health"whitespace-splits, producing chunks like}}/api/health"(unclosed alternation in glob syntax).jq -r '.[]' file.jsonproduces the token'.[]'(empty character class).grep -oP '(?<=Module )\./[^ ]+(?= ...)'produces chunks like)\./[^(backslash plus unclosed class).These reach
globset::GlobBuilder::new(...).build()which rejects them and emits the warnings. They are not real entry patterns and should never reach globset.Fix
Two changes in
crates/core/src/scripts/:could_be_file_path(new,mod.rs): negative-only guard rejecting tokens whose syntax precludes a Unix path. Three rules:${{ ... }}GHA expressions, plus standalone}}not balanced by{{(so Mustache/Handlebars template paths liketemplates/{{name}}.hbsstill pass).[]or unclosed[^...character classes (first[only; suffices for in-the-wild fragments).looks_like_file_pathrefactor (mod.rs): now delegates tocould_be_file_pathfor the negative checks before applying its strict positive rules (known extension,.//../prefix, contains/and not a scoped package or URL). Behaviour unchanged for callers; the split exposes a lenient filter for use cases where bare-name paths (deploy.log,Makefile) must pass through.extract_ci_signalsfilter (ci.rs):cmd.config_args(the value after--config/-c) bypassesparse_script's existinglooks_like_file_pathgate, so jq array iterators or regex fragments passed via flags can leak intoentry_files. Re-filter throughcould_be_file_pathhere.cmd.file_argsare already pre-filtered byparse_scriptand are not re-checked.Tests
10 unit tests in
mod.rs(existing GHA test renamed for consistency, 5 new for jq/regex/Mustache/bare-name/lenient-positive cases) and 2 integration tests inci.rsusing verbatim in-the-wild YAML fragments. Next.js dynamic route segments (app/[id]/page.tsx,pages/[...slug].ts) are explicitly tested as positive cases to guard against false negatives in the bracket-class rule.Risk
[in a token. Documented in the code comment. A token with a valid leading class followed by a malformed trailing class would slip through; no such fragment has been observed in the wild.}}rule allows balanced Mustache/Handlebars templates. If an upstream user has paths containing literal}}without a matching{{, they would be filtered. Such paths are not valid Unix path syntax in any case.Verified on
cargo test --workspace,cargo clippy --workspace --all-targets -- -D warnings,cargo fmt --all -- --checkall clean. Real-world: this fork-pinned binary is currently running in a Turborepo monorepo CI (example output) where it dropped stderr noise from 30+ warnings to 0 across thedead-code,audit, andhealthcommands.