Skip to content

prek 0.4.2 reports false-positive "files modified" for in-place rewrites (clean-worktree fast path) #2127

@GnoX

Description

@GnoX

In prek 0.4.2, a hook that rewrites a file in place - even with byte-identical content - can be reported as having modified files. This affects every mode that uses the "clean worktree" fast path (plain prek run, --last-commit, --from-ref/--to-ref); only --all-files is always safe. prek 0.4.1 handles the same input correctly.

I'm not sure how many hooks are affected by this, this regression seems to be quite specific for formatters that always rewrite the contents (all major formatter hooks are safe) - we're using one that started to fail after release.

Reproducer

mkdir repro && cd repro && git init -q -b main
git config user.email t@t && git config user.name t

cat > .pre-commit-config.yaml <<'EOF'
repos:
  - repo: local
    hooks:
      - id: stat-touch
        name: rewrite identical content (bumps mtime)
        entry: bash -euc 'for f in "$@"; do c=$(cat "$f"); printf "%s" "$c" > "$f"; done' --
        language: system
        files: \.txt$
EOF

printf "old" > a.txt && git add . && git commit -qm initial
git checkout -qb feature
printf "new" > a.txt && git add . && git commit -qm change

md5sum a.txt
uvx --from prek==0.4.2 prek run --from-ref main --to-ref feature --verbose
md5sum a.txt   # identical
22af645d1859cb5ca6da0c484f1f37ea  a.txt
rewrite identical content (bumps mtime)..................................Failed
- hook id: stat-touch
- files were modified by this hook
22af645d1859cb5ca6da0c484f1f37ea  a.txt

0.4.1 reports Passed on the same setup.

prek --from-ref/--to-ref --all-files
0.4.1 Passed Passed
0.4.2 Failed (false positive) Passed

Root cause

PR #2109 ("Optimize diff checks for clean worktrees") added DiffTracker with a Clean baseline that uses has_worktree_diff as the post-hook check:

// crates/prek/src/cli/run/diff.rs
DiffBaseline::Clean => {
    if !git::has_worktree_diff(self.path).await? {
        return Ok(false);
    }
    self.baseline = DiffBaseline::Snapshot(git::get_diff(self.path).await?);
    Ok(true)   // reached on any stat change, content not checked
}

has_worktree_diff runs git diff-files --quiet, which is stat-cache based:

$ echo hello > a && git add a && git commit -qm i
$ git diff-files --quiet; echo $?                          # 0
$ cat a > /tmp/x && cat /tmp/x > a    # rewrite, identical content
$ git diff-files --quiet; echo $?                          # 1  <- false positive
$ git diff --quiet; echo $?                                # 0  (content-aware)

Any in-place rewrite - which is what some formatters do, even on a no-op pass - bumps mtime, trips --quiet, and the Clean arm returns Ok(true) without ever checking content.

The Clean baseline is selected whenever should_stash = !all_files && files.is_empty() && directories.is_empty() -> worktree_cleaned = true -> clean_baseline = true. That is true for plain prek run, --last-commit, and --from-ref/--to-ref alike (confirmed reproducing on --last-commit as well). Only --all-files takes the unknown_baseline path, which content-diffs via get_diff before/after, so stat-only changes wash out there.

PR #2109's comment message calls out the analogous concern for CRLF (diff-files --quiet false positives on CRLF-normalized worktrees), but the same mechanism produces false positives for any plain rewrite.

Suggested fix

In the DiffBaseline::Clean arm, fall through to get_diff() when has_worktree_diff returns true, and only report Ok(true) if the content-aware diff is non-empty. Keeps the fast path for hooks that genuinely don't touch files; only pays the content-diff cost on a stat-change (rare in steady state).

// crates/prek/src/cli/run/diff.rs
DiffBaseline::Clean => {
    if !git::has_worktree_diff(self.path).await? {
        return Ok(false);
    }
    // `git diff-files --quiet` is stat-based and reports dirty for any in-place rewrite
    // (mtime bump) even when content is unchanged 
    let curr_diff = git::get_diff(self.path).await?;
    if curr_diff.is_empty() {
        return Ok(false);
    }
    self.baseline = DiffBaseline::Snapshot(curr_diff);
    Ok(true)
}

This change was tested to work in the same environment.

Environment

  • prek 0.4.2 (PyPI via uvx), Ubuntu 24.04. Reproducible with the snippet above on a fresh checkout.

edit: Added suggested fix section (not confident enough to create a PR as I never worked with Rust, I'll let a professional do it)

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions