Optimize detect_private_key by chunked reading and using aho-corasick #1791

Merged: j178 merged 2 commits into master from optimize-detect-private-key on Mar 15, 2026

j178 (Owner) commented Mar 15, 2026

Optimize detect_private_key scanning.

Copilot AI review requested due to automatic review settings March 15, 2026 07:34
j178 added the performance label Mar 15, 2026
j178 changed the base branch from master to cache-file-tags-filtering March 15, 2026 07:36
chatgpt-codex-connector commented

💡 Codex Review

    let file_tags = filenames
        .par_iter()
        .filter_map(|filename| match tags_from_path(filename) {
            Ok(tags) => Some((*filename, tags)),

P1: Recompute file tags after hooks mutate files

Caching tags_from_path once in for_project makes tag filtering stale for the rest of the project run: run_hooks builds a single FileFilter and reuses it across all priority groups, but earlier hooks can modify/delete files or change executable/shebang metadata. In those cases, later hooks will be filtered with outdated tags, so they can run on files that no longer exist/match (or skip files that now should match), which is a behavior regression from the previous per-hook tag lookup.
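
To make the staleness concern concrete, here is a minimal sketch. It is not prek's code: `tags_of`, `build_tag_cache`, and `filter_by_tag` are hypothetical names, an in-memory map stands in for the file system, and the `script` tag rule is an assumption chosen only for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a tag lookup: derive tags from file content.
fn tags_of(content: &str) -> Vec<&'static str> {
    let mut tags = vec!["file"];
    if content.starts_with("#!") {
        tags.push("script"); // assumed rule: a shebang marks a script
    }
    tags
}

/// Compute tags once for every path (path -> content map as a fake file system).
fn build_tag_cache(files: &HashMap<String, String>) -> HashMap<String, Vec<&'static str>> {
    files
        .iter()
        .map(|(path, content)| (path.clone(), tags_of(content)))
        .collect()
}

/// Return the sorted paths whose cached tags include `tag`.
fn filter_by_tag(cache: &HashMap<String, Vec<&'static str>>, tag: &str) -> Vec<String> {
    let mut matched: Vec<String> = cache
        .iter()
        .filter(|(_, tags)| tags.iter().any(|t| *t == tag))
        .map(|(path, _)| path.clone())
        .collect();
    matched.sort();
    matched
}
```

If a cache is built once, and an earlier hook then rewrites `run.sh` to drop its shebang, `filter_by_tag` on the old cache still reports `run.sh` as a script. Rebuilding the cache (or invalidating the touched paths) between priority groups avoids that, at the cost of some of the savings the cache was meant to provide.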


Copilot AI (Contributor) left a comment

Pull request overview

Optimizes private key detection (a built-in pre-commit hook) by switching from whole-file reads to chunked scanning with an Aho–Corasick matcher, and reduces repeated file-tag identification work during hook filtering.

Changes:

  • Implement chunked streaming scan in detect_private_key using a prebuilt aho_corasick::AhoCorasick matcher with boundary carryover.
  • Cache tags_from_path results per project in FileFilter to avoid recomputing tags during multiple filtering passes.
  • Add aho-corasick as a workspace dependency and enable it for the prek crate.
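
The chunked scan with boundary carryover can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the marker list is made up for the example, and a naive `windows` search stands in for the `aho_corasick::AhoCorasick` matcher so the sketch needs only the standard library.

```rust
use std::io::{self, Read};

/// Illustrative markers only; the hook's real pattern list is not shown in this thread.
const MARKERS: &[&[u8]] = &[
    b"BEGIN RSA PRIVATE KEY",
    b"BEGIN OPENSSH PRIVATE KEY",
    b"BEGIN EC PRIVATE KEY",
];

/// Naive multi-pattern check, standing in for an Aho-Corasick automaton.
fn contains_any(haystack: &[u8], patterns: &[&[u8]]) -> bool {
    patterns
        .iter()
        .any(|p| !p.is_empty() && haystack.windows(p.len()).any(|w| w == *p))
}

/// Scan `reader` in reads of at most `chunk_size` bytes. After each read, keep
/// the last `longest_pattern - 1` bytes so a marker split across two reads is
/// still matched on the next pass.
fn scan_chunked<R: Read>(
    mut reader: R,
    patterns: &[&[u8]],
    chunk_size: usize,
) -> io::Result<bool> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let longest = patterns.iter().map(|p| p.len()).max().unwrap_or(0);
    let carry_len = longest.saturating_sub(1);
    let mut window: Vec<u8> = Vec::with_capacity(carry_len + chunk_size);
    let mut chunk = vec![0u8; chunk_size];

    loop {
        let n = reader.read(&mut chunk)?;
        if n == 0 {
            return Ok(false); // EOF reached without a match
        }
        window.extend_from_slice(&chunk[..n]);
        if contains_any(&window, patterns) {
            return Ok(true);
        }
        // Drop everything except the tail that could begin a marker.
        let keep = carry_len.min(window.len());
        window.drain(..window.len() - keep);
    }
}
```

Memory use stays bounded by `chunk_size` plus the longest marker, regardless of file size, and the scan can return as soon as the first marker is seen. A real automaton replaces the per-pattern loop in `contains_any` with a single pass over the window.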

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| crates/prek/src/hooks/pre_commit_hooks/detect_private_key.rs | Stream-based scanning with Aho–Corasick + small carry buffer to catch boundary-split markers. |
| crates/prek/src/cli/run/filter.rs | Adds a per-project tag cache (file_tags) used by type/tag filtering. |
| crates/prek/Cargo.toml | Adds aho-corasick dependency for the prek crate. |
| Cargo.toml | Adds aho-corasick to workspace dependencies. |
| Cargo.lock | Records the new dependency in the prek package’s dependency list. |

codecov (bot) commented Mar 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.72%. Comparing base (d6f6e43) to head (f441983).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1791   +/-   ##
=======================================
  Coverage   91.72%   91.72%           
=======================================
  Files          98       98           
  Lines       20025    20037   +12     
=======================================
+ Hits        18367    18379   +12     
  Misses       1658     1658           

prek-ci-bot (bot) commented Mar 15, 2026

📦 Cargo Bloat Comparison

Binary size change: +0.00% (24.8 MiB → 24.8 MiB)

Head Branch Results

 File  .text     Size             Crate Name
 1.3%   2.7% 332.0KiB        aws_lc_sys aws_lc_0_38_0_aes_gcm_encrypt_avx512
 1.3%   2.7% 332.0KiB        aws_lc_sys aws_lc_0_38_0_aes_gcm_decrypt_avx512
 0.3%   0.6%  76.3KiB              prek prek::languages::<impl prek::config::Language>::run::{{closure}}::{{closure}}
 0.3%   0.6%  69.8KiB             prek? <prek::cli::Command as clap_builder::derive::Subcommand>::augment_subcommands
 0.3%   0.6%  69.8KiB              prek prek::languages::<impl prek::config::Language>::run::{{closure}}::{{closure}}
 0.2%   0.4%  51.0KiB annotate_snippets annotate_snippets::renderer::render::render
 0.2%   0.4%  50.6KiB              prek prek::languages::<impl prek::config::Language>::install::{{closure}}
 0.2%   0.4%  46.1KiB              prek prek::run::{{closure}}
 0.2%   0.3%  42.0KiB              prek prek::cli::run::run::run::{{closure}}
 0.1%   0.3%  31.8KiB             prek? <prek::cli::RunArgs as clap_builder::derive::Args>::augment_args
 0.1%   0.2%  28.0KiB        aws_lc_sys aws_lc_0_38_0_edwards25519_scalarmuldouble_alt
 0.1%   0.2%  27.8KiB      serde_saphyr saphyr_parser_bw::scanner::Scanner<T>::fetch_more_tokens
 0.1%   0.2%  27.5KiB        aws_lc_sys aws_lc_0_38_0_edwards25519_scalarmuldouble
 0.1%   0.2%  25.8KiB              prek prek::cli::try_repo::try_repo::{{closure}}
 0.1%   0.2%  24.9KiB             prek? <prek::config::_::<impl serde_core::de::Deserialize for prek::config::Config>::deserialize::__Visitor as serde_core::de::Visitor>::visit_map
 0.1%   0.2%  23.4KiB              prek prek::hooks::meta_hooks::MetaHooks::run::{{closure}}
 0.1%   0.2%  22.4KiB      serde_saphyr saphyr_parser_bw::scanner::Scanner<T>::fetch_more_tokens
 0.1%   0.2%  22.3KiB         [Unknown] Lp384_montjscalarmul_alt_p384_montjadd
 0.1%   0.2%  22.0KiB               std core::ptr::drop_in_place<prek::languages::<impl prek::config::Language>::install::{{closure}}>
 0.1%   0.2%  21.6KiB              prek prek::workspace::Project::init_hooks::{{closure}}
41.1%  86.0%  10.2MiB                   And 23274 smaller methods. Use -n N to show more.
47.8% 100.0%  11.8MiB                   .text section size, the file size is 24.8MiB

Base Branch Results

 File  .text     Size             Crate Name
 1.3%   2.7% 332.0KiB        aws_lc_sys aws_lc_0_38_0_aes_gcm_encrypt_avx512
 1.3%   2.7% 332.0KiB        aws_lc_sys aws_lc_0_38_0_aes_gcm_decrypt_avx512
 0.3%   0.6%  76.3KiB              prek prek::languages::<impl prek::config::Language>::run::{{closure}}::{{closure}}
 0.3%   0.6%  69.8KiB              prek prek::languages::<impl prek::config::Language>::run::{{closure}}::{{closure}}
 0.3%   0.6%  68.0KiB             prek? <prek::cli::Command as clap_builder::derive::Subcommand>::augment_subcommands
 0.2%   0.4%  51.0KiB annotate_snippets annotate_snippets::renderer::render::render
 0.2%   0.4%  50.6KiB              prek prek::languages::<impl prek::config::Language>::install::{{closure}}
 0.2%   0.4%  46.9KiB              prek prek::run::{{closure}}
 0.2%   0.3%  42.0KiB              prek prek::cli::run::run::run::{{closure}}
 0.1%   0.3%  32.0KiB             prek? <prek::cli::RunArgs as clap_builder::derive::Args>::augment_args
 0.1%   0.2%  28.0KiB        aws_lc_sys aws_lc_0_38_0_edwards25519_scalarmuldouble_alt
 0.1%   0.2%  27.8KiB      serde_saphyr saphyr_parser_bw::scanner::Scanner<T>::fetch_more_tokens
 0.1%   0.2%  27.5KiB        aws_lc_sys aws_lc_0_38_0_edwards25519_scalarmuldouble
 0.1%   0.2%  25.8KiB              prek prek::cli::try_repo::try_repo::{{closure}}
 0.1%   0.2%  24.9KiB             prek? <prek::config::_::<impl serde_core::de::Deserialize for prek::config::Config>::deserialize::__Visitor as serde_core::de::Visitor>::visit_map
 0.1%   0.2%  23.4KiB              prek prek::hooks::meta_hooks::MetaHooks::run::{{closure}}
 0.1%   0.2%  22.4KiB      serde_saphyr saphyr_parser_bw::scanner::Scanner<T>::fetch_more_tokens
 0.1%   0.2%  22.3KiB         [Unknown] Lp384_montjscalarmul_alt_p384_montjadd
 0.1%   0.2%  22.0KiB               std core::ptr::drop_in_place<prek::languages::<impl prek::config::Language>::install::{{closure}}>
 0.1%   0.2%  21.6KiB              prek prek::workspace::Project::init_hooks::{{closure}}
41.1%  86.0%  10.2MiB                   And 23301 smaller methods. Use -n N to show more.
47.8% 100.0%  11.8MiB                   .text section size, the file size is 24.8MiB

prek-ci-bot (bot) commented Mar 15, 2026

⚡️ Hyperfine Benchmarks

Summary: 0 regressions, 0 improvements above the 10% threshold.

Environment
  • OS: Linux 6.14.0-1017-azure
  • CPU: 4 cores
  • prek version: prek 0.3.5+21 (c11e786 2026-03-15)
  • Rust version: rustc 1.94.0 (4a4ef493e 2026-03-02)
  • Hyperfine version: hyperfine 1.20.0

CLI Commands

Benchmarking basic commands in the main repo:

prek --version

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base --version | 2.4 ± 0.1 | 2.2 | 2.6 | 1.01 ± 0.08 |
| prek-head --version | 2.4 ± 0.2 | 2.2 | 2.8 | 1.00 |

prek list

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base list | 9.1 ± 0.5 | 8.7 | 12.4 | 1.00 |
| prek-head list | 9.3 ± 0.7 | 8.7 | 12.8 | 1.02 ± 0.09 |

prek validate-config .pre-commit-config.yaml

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base validate-config .pre-commit-config.yaml | 3.1 ± 0.1 | 3.0 | 3.2 | 1.01 ± 0.02 |
| prek-head validate-config .pre-commit-config.yaml | 3.1 ± 0.0 | 3.0 | 3.2 | 1.00 |

prek sample-config

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base sample-config | 2.6 ± 0.0 | 2.6 | 2.7 | 1.00 |
| prek-head sample-config | 2.8 ± 0.2 | 2.6 | 3.3 | 1.07 ± 0.08 |

Cold vs Warm Runs

Comparing first run (cold) vs subsequent runs (warm cache):

prek run --all-files (cold - no cache)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base run --all-files | 150.9 ± 4.1 | 147.3 | 161.8 | 1.00 |
| prek-head run --all-files | 152.8 ± 2.1 | 149.2 | 155.3 | 1.01 ± 0.03 |

prek run --all-files (warm - with cache)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base run --all-files | 151.9 ± 4.2 | 147.7 | 161.7 | 1.00 |
| prek-head run --all-files | 153.1 ± 5.0 | 147.4 | 165.7 | 1.01 ± 0.04 |

Full Hook Suite

Running the builtin hook suite on the benchmark workspace:

prek run --all-files (full builtin hook suite)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base run --all-files | 151.7 ± 4.2 | 145.2 | 163.0 | 1.00 |
| prek-head run --all-files | 157.2 ± 22.9 | 147.5 | 312.0 | 1.04 ± 0.15 |

Individual Hook Performance

Benchmarking each hook individually on the test repo:

prek run trailing-whitespace --all-files

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base run trailing-whitespace --all-files | 22.4 ± 0.4 | 21.6 | 23.4 | 1.00 |
| prek-head run trailing-whitespace --all-files | 22.5 ± 1.3 | 21.3 | 27.3 | 1.00 ± 0.06 |

prek run end-of-file-fixer --all-files

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base run end-of-file-fixer --all-files | 28.9 ± 2.4 | 25.3 | 34.5 | 1.06 ± 0.10 |
| prek-head run end-of-file-fixer --all-files | 27.2 ± 1.3 | 25.4 | 30.6 | 1.00 |

prek run check-json --all-files

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base run check-json --all-files | 12.9 ± 0.5 | 12.1 | 13.8 | 1.01 ± 0.09 |
| prek-head run check-json --all-files | 12.8 ± 1.0 | 11.8 | 15.6 | 1.00 |

prek run check-yaml --all-files

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base run check-yaml --all-files | 12.1 ± 0.2 | 11.7 | 12.8 | 1.00 |
| prek-head run check-yaml --all-files | 12.6 ± 1.1 | 11.9 | 16.5 | 1.05 ± 0.09 |

prek run check-toml --all-files

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base run check-toml --all-files | 12.3 ± 0.2 | 11.9 | 12.6 | 1.00 |
| prek-head run check-toml --all-files | 12.7 ± 1.1 | 11.7 | 15.9 | 1.04 ± 0.09 |

prek run check-xml --all-files

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base run check-xml --all-files | 12.4 ± 0.3 | 11.9 | 12.9 | 1.00 |
| prek-head run check-xml --all-files | 12.5 ± 0.4 | 11.9 | 13.3 | 1.01 ± 0.04 |

prek run detect-private-key --all-files

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base run detect-private-key --all-files | 16.1 ± 1.3 | 14.6 | 19.8 | 1.00 |
| prek-head run detect-private-key --all-files | 18.4 ± 1.0 | 16.5 | 21.0 | 1.14 ± 0.11 |

Installation Performance

Benchmarking hook installation (fast path hooks skip Python setup):

prek install-hooks (cold - no cache)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base install-hooks | 4.7 ± 0.0 | 4.7 | 4.8 | 1.00 |
| prek-head install-hooks | 4.7 ± 0.1 | 4.7 | 4.8 | 1.00 ± 0.01 |

prek install-hooks (warm - with cache)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base install-hooks | 4.7 ± 0.0 | 4.7 | 4.8 | 1.00 |
| prek-head install-hooks | 4.9 ± 0.4 | 4.7 | 5.6 | 1.05 ± 0.08 |

File Filtering/Scoping Performance

Testing different file selection modes:

prek run (staged files only)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base run | 18.6 ± 0.2 | 18.3 | 18.9 | 1.00 |
| prek-head run | 18.7 ± 0.2 | 18.4 | 19.1 | 1.00 ± 0.01 |

prek run --files '*.json' (specific file type)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base run --files '*.json' | 7.5 ± 0.1 | 7.4 | 7.7 | 1.00 |
| prek-head run --files '*.json' | 8.6 ± 1.1 | 7.4 | 11.3 | 1.15 ± 0.14 |

Workspace Discovery & Initialization

Benchmarking hook discovery and initialization overhead:

prek run --dry-run --all-files (measures init overhead)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base run --dry-run --all-files | 10.5 ± 0.1 | 10.3 | 11.0 | 1.00 ± 0.02 |
| prek-head run --dry-run --all-files | 10.5 ± 0.1 | 10.3 | 10.7 | 1.00 |

Meta Hooks Performance

Benchmarking meta hooks separately:

prek run check-hooks-apply --all-files

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base run check-hooks-apply --all-files | 14.3 ± 0.1 | 14.2 | 14.6 | 1.00 |
| prek-head run check-hooks-apply --all-files | 15.1 ± 1.2 | 14.1 | 17.4 | 1.05 ± 0.08 |

prek run check-useless-excludes --all-files

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base run check-useless-excludes --all-files | 13.6 ± 0.7 | 12.4 | 14.3 | 1.08 ± 0.06 |
| prek-head run check-useless-excludes --all-files | 12.6 ± 0.2 | 12.4 | 13.0 | 1.00 |

prek run identity --all-files

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| prek-base run identity --all-files | 11.4 ± 0.2 | 11.0 | 11.8 | 1.00 |
| prek-head run identity --all-files | 11.5 ± 0.2 | 11.2 | 11.8 | 1.00 ± 0.03 |

j178 force-pushed the optimize-detect-private-key branch from a8ad45c to f475271 March 15, 2026 09:01
j178 changed the base branch from cache-file-tags-filtering to master March 15, 2026 09:24
j178 added 2 commits March 15, 2026 17:26
  • Benchmark `detect-private-key` in hyperfine script
  • Reduce file operations in fix_end_of_file.
Copilot AI review requested due to automatic review settings March 15, 2026 09:27
j178 force-pushed the optimize-detect-private-key branch from 69e87bb to f441983 March 15, 2026 09:27
j178 merged commit da182c8 into master Mar 15, 2026
50 checks passed
j178 deleted the optimize-detect-private-key branch March 15, 2026 09:27
Copilot AI (Contributor) left a comment

Pull request overview

This PR improves performance of the detect-private-key builtin hook by avoiding full-file reads and using an Aho–Corasick matcher for multi-pattern scanning, and wires the hook into the benchmarking script.

Changes:

  • Switch detect-private-key to chunked scanning with a small carry-over window + aho-corasick matcher.
  • Add aho-corasick as a workspace dependency and to the prek crate.
  • Include detect-private-key in the hyperfine benchmark hook list; minor refactor in fix_end_of_file file scanning.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| scripts/hyperfine-run-benchmarks.sh | Adds detect-private-key to the list of individually benchmarked hooks. |
| crates/prek/src/hooks/pre_commit_hooks/fix_end_of_file.rs | Refactors EOF fixer to reuse file size and use a fixed buffer for scanning. |
| crates/prek/src/hooks/pre_commit_hooks/detect_private_key.rs | Implements chunked scanning with carry-over and Aho–Corasick for faster detection. |
| crates/prek/Cargo.toml | Adds aho-corasick dependency to the prek crate. |
| Cargo.toml | Adds aho-corasick to workspace dependencies. |
| Cargo.lock | Locks the new dependency. |

Comments suppressed due to low confidence (1)

crates/prek/src/hooks/pre_commit_hooks/fix_end_of_file.rs:94

  • find_last_non_ending seeks backwards using SeekFrom::Current(-block_size), but after each read_exact the cursor ends up back at EOF, so the next iteration re-reads the same tail block again. If a file ends with >4KB of line endings, this can incorrectly conclude the whole file is line endings and truncate it to empty. Consider seeking to an absolute offset based on data_len - read_len - block_size (or using SeekFrom::End(-(read_len as i64 + block_size as i64))) so each loop scans progressively earlier blocks.
    while read_len < data_len {
        let block_size = MAX_SCAN_SIZE.min(usize::try_from(data_len - read_len)?);
        // SAFETY: block_size is guaranteed to be less than or equal to MAX_SCAN_SIZE
        reader
            .seek(SeekFrom::Current(-i64::try_from(block_size).unwrap()))
            .await?;
        reader.read_exact(&mut buf[..block_size]).await?;
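
Assuming the comment's reading of the loop is right, the suggested fix (seek to an absolute offset on every iteration) could look like the sketch below. It is a synchronous, standard-library-only illustration rather than the hook's async code; `len_without_trailing_newlines` is a hypothetical name, and `MAX_SCAN_SIZE` is set to a tiny value here only so short inputs span several blocks.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Tiny on purpose so the example crosses block boundaries; the comment above
/// implies the real constant is on the order of 4 KB.
const MAX_SCAN_SIZE: usize = 4;

/// Return the length of the content once trailing `\n` and `\r` bytes are
/// dropped, scanning backwards one block at a time.
fn len_without_trailing_newlines<R: Read + Seek>(reader: &mut R) -> io::Result<u64> {
    let data_len = reader.seek(SeekFrom::End(0))?;
    let mut buf = [0u8; MAX_SCAN_SIZE];
    let mut read_len: u64 = 0;

    while read_len < data_len {
        let remaining = data_len - read_len;
        let block_size = remaining.min(MAX_SCAN_SIZE as u64) as usize;
        // Absolute offset of this block, so each pass moves strictly earlier
        // no matter where the previous read left the cursor.
        let start = data_len - read_len - block_size as u64;
        reader.seek(SeekFrom::Start(start))?;
        reader.read_exact(&mut buf[..block_size])?;

        let last_content = buf[..block_size]
            .iter()
            .rposition(|&b| b != b'\n' && b != b'\r');
        if let Some(pos) = last_content {
            return Ok(start + pos as u64 + 1);
        }
        read_len += block_size as u64;
    }
    Ok(0) // the whole input was line endings (or empty)
}
```

With `"abc"` followed by seven newlines, the first block is all line endings and the second block finds `c`, giving a length of 3. A relative seek that lands on the same tail block each time would never reach `c` and would report 0 instead.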
