Python CLI: Optimize --exclude and .semgrepignore to ignore directories without listing their content #596

Merged
maciejpirog merged 1 commit into main from mpir/cli-optimize-exclude-flag
Feb 20, 2026

@maciejpirog maciejpirog commented Feb 18, 2026

Overview

Relevant issue: #528

Current status:

  • In the Python CLI, excluding a directory still requires listing every file inside that directory.
  • This is already optimized with --experimental; here we try to achieve similar behaviour in the Python CLI.

Tests

Tested with

pytest tests/default/unit/targeting/test_target_manager.py tests/default/e2e/test_target_selection.py -v

which seems sufficiently comprehensive.

Evaluation of performance

The env for testing is set up using the script attached to #528, in particular:

$ ls -a
.             ..            .m2           HelloWorld.kt

$ find .m2 | wc -l
  280004

Comparing the results. All commands are of the form opengrep scan -c $RULES ..., where $RULES points to a local copy of the semgrep rule repo.

| Test | Before fix | After fix |
|---|---|---|
| Baseline (HelloWorld.kt) | 17.912 | 16.534 |
| With flag (--exclude .m2 .) | 1:52.67 | 20.306 |
| With .semgrepignore file (.) | 3:30.92 | 22.228 |

Implementation details

Problem

When excluding large directories (e.g. node_modules, build), the Python frontend was slow and still reported tens/hundreds of thousands of skipped files in the scan summary. The root cause was that
Target.files_from_filesystem() uses path.glob("**/*"), which expands the entire directory tree before any filtering occurs. filter_excludes and FileIgnore.filter_paths then pattern-match against every
single collected file — O(N) in the number of files inside excluded directories.

By contrast, osemgrep (src/targeting/Find_targets.ml) checks each directory against exclusion patterns before descending, making excluded directories O(1) regardless of their size.
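
To make the cost concrete, here is a minimal sketch (not the project's code) of the glob-then-filter approach described above. It builds a small tree in a temporary directory and counts how many paths have to be examined before filtering. The `is_excluded` helper and the file names are illustrative assumptions.

```python
import tempfile
from pathlib import Path


def is_excluded(path: Path, root: Path, excluded_dir: str) -> bool:
    # Hypothetical stand-in for pattern matching: a path is excluded
    # if any of its components below the root equals `excluded_dir`.
    return excluded_dir in path.relative_to(root).parts


def glob_then_filter(root: Path, excluded_dir: str):
    # Expands the whole tree first and filters afterwards, so every
    # file inside the excluded directory is still enumerated and tested.
    examined = 0
    kept = []
    for path in root.glob("**/*"):
        examined += 1
        if path.is_file() and not is_excluded(path, root, excluded_dir):
            kept.append(path)
    return kept, examined


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "HelloWorld.kt").write_text("fun main() {}\n")
    big = root / ".m2"
    big.mkdir()
    for i in range(200):
        (big / f"artifact{i}.jar").write_text("")

    kept, examined = glob_then_filter(root, ".m2")
    kept_names = [p.name for p in kept]
    print(kept_names, examined)
```

Only one file survives the filter, yet the number of examined paths grows with the size of the excluded directory.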

Solution

Add directory-level early pruning to the Python filesystem walker, matching osemgrep's behavior, with identical scan semantics.

Target.files_from_filesystem_with_dir_pruning (new method)

Replaces path.glob("**/*") with os.walk(topdown=True, followlinks=False). Before descending into each subdirectory, two checks are applied in-place on dirnames:

  1. --exclude patterns via wcglob.globfilter: patterns are passed in already preprocessed (e.g. node_modules becomes **/node_modules and **/node_modules/**), so the preprocessing cost is paid once per scan.
    Paths are expressed relative to the scan root (not absolute) so that **/-prefixed wcmatch patterns match correctly.

  2. .semgrepignore patterns via FileIgnore._survives(dir / ".__check__") — checking a virtual sentinel file inside the directory (rather than the directory path itself) correctly matches both dir
    patterns (via the generated dir/** fnmatch pattern) and dir/ folder patterns. _survives is pure pattern-matching with no filesystem access, so this is safe with non-existent paths.
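
The two checks above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from this PR: it uses the standard-library `fnmatch` on directory names instead of wcmatch globstar patterns, and `walk_with_dir_pruning` and `survives_ignore_file` are hypothetical names standing in for the new method and for `FileIgnore._survives`.

```python
import os
import tempfile
from fnmatch import fnmatch
from pathlib import Path
from typing import Callable, Iterator, Optional, Sequence


def walk_with_dir_pruning(
    root: Path,
    excluded_dir_names: Sequence[str],
    survives: Optional[Callable[[Path], bool]] = None,
) -> Iterator[Path]:
    for dirpath_str, dirnames, filenames in os.walk(
        root, topdown=True, followlinks=False
    ):
        dirpath = Path(dirpath_str)
        # Slice assignment mutates the list that os.walk consults before
        # descending, so directories removed here are never listed.
        dirnames[:] = [
            d
            for d in dirnames
            if not any(fnmatch(d, pattern) for pattern in excluded_dir_names)
            # Ask the ignore predicate about a sentinel file inside the
            # directory; no such file needs to exist on disk.
            and (survives is None or survives(dirpath / d / ".__check__"))
        ]
        for filename in filenames:
            yield dirpath / filename


def survives_ignore_file(path: Path) -> bool:
    # Hypothetical stand-in for an ignore-file rule equivalent to `build/`.
    return not fnmatch(path.as_posix(), "*/build/*")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("src/a.py", "node_modules/pkg/index.js", "build/out.o"):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("")

    found = sorted(
        p.relative_to(root).as_posix()
        for p in walk_with_dir_pruning(root, ["node_modules"], survives_ignore_file)
    )
    print(found)
```

In this sketch the directory path `.../build` on its own would not match the `*/build/*` pattern, while the sentinel `.../build/.__check__` does, which is the reason for testing a virtual file rather than the directory itself.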

TargetManager.get_all_files_with_dir_pruning (new method)

Preprocesses exclude patterns once, then iterates over targets mirroring the exact fallback chain of Target.files() — substituting files_from_filesystem_with_dir_pruning wherever files_from_filesystem
would have been called:

| Scenario | Behaviour |
|---|---|
| Explicit file target | Delegates to target.files() unchanged |
| Diff/baseline mode | Tries files_from_git_diff(); on failure falls through |
| ignore_baseline_handler=True | Uses pruned walk directly |
| Git-tracked-only (the default, since respect_git_ignore = not no_git_ignore and no_git_ignore defaults to False) | Tries files_from_git_ls(); on success uses that result; on failure falls through to pruned walk |
| --no-git-ignore (pure filesystem) | Uses pruned walk directly |

The critical fix for the default scan mode: previously when files_from_git_ls() failed (non-git directory or git unavailable), it silently fell back to the original files_from_filesystem() with no pruning.
Now it falls back to files_from_filesystem_with_dir_pruning.
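
A minimal sketch of that fallback, assuming the two traversals are passed in as callables. `collect_files` and its parameters are hypothetical names for illustration; the exception types are the ones caught in the diff under review, and the log message is invented for the example.

```python
import logging
import subprocess
from typing import Callable, FrozenSet

logger = logging.getLogger(__name__)


def collect_files(
    git_ls: Callable[[], FrozenSet[str]],
    pruned_walk: Callable[[], FrozenSet[str]],
    respect_git_ignore: bool = True,
) -> FrozenSet[str]:
    if respect_git_ignore:
        try:
            # Preferred source: the files git already tracks.
            return git_ls()
        except (subprocess.CalledProcessError, FileNotFoundError) as exc:
            # Not a git repository, or git is not installed: log and
            # fall through instead of aborting the scan.
            logger.debug("git listing failed (%s); using pruned walk", exc)
    # Fallback (and the --no-git-ignore path): pruned filesystem walk.
    return pruned_walk()


def failing_git_ls() -> FrozenSet[str]:
    # Simulates a machine where the git executable is missing.
    raise FileNotFoundError("git")


result = collect_files(failing_git_ls, lambda: frozenset({"HelloWorld.kt"}))
print(sorted(result))
```

The point of the sketch is only that the failure branch ends in the pruned walk rather than in an unpruned one.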

Decorated with @lru_cache so the filesystem is walked at most once per unique (exclude_patterns, ignore_baseline_handler, file_ignore) combination.
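
As a rough illustration of the caching behaviour, the sketch below applies `functools.lru_cache` to a plain function (the PR decorates a method, where the instance is also part of the cache key). The function name, arguments, and returned file set are placeholders; a counter stands in for the actual filesystem walk.

```python
from functools import lru_cache
from typing import FrozenSet, Tuple

walk_count = 0


@lru_cache(maxsize=None)
def get_all_files(
    exclude_patterns: Tuple[str, ...], ignore_baseline_handler: bool
) -> FrozenSet[str]:
    # Stand-in for the expensive walk; counts how often it really runs.
    global walk_count
    walk_count += 1
    return frozenset({"HelloWorld.kt"})


first = get_all_files(("**/.m2",), False)
second = get_all_files(("**/.m2",), False)  # same arguments: served from cache
third = get_all_files(("**/build",), False)  # new arguments: walks again
print(walk_count)
```

Because `lru_cache` builds its key by hashing the arguments, they must be hashable; passing the patterns as a list instead of a tuple would raise `TypeError`.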

TargetManager.get_files_for_language (modified)

Replaces self.get_all_files(ignore_baseline_handler) with a call to get_all_files_with_dir_pruning, passing:

  • --exclude patterns for the current product + PATHS_ALWAYS_SKIPPED (.git)
  • The product's FileIgnore object (if respect_semgrepignore is set)

The existing filter_excludes and filter_paths calls remain as safety nets for file-level patterns (e.g. --exclude "*.pyc") and for results from git-based traversals.

@maciejpirog maciejpirog force-pushed the mpir/cli-optimize-exclude-flag branch from 781a5fe to c981e90 Compare February 19, 2026 14:35
@maciejpirog maciejpirog changed the title [WIP] Python CLI: Optimize --exclude to ignore directories without listing their content Python CLI: Optimize --exclude to ignore directories without listing their content Feb 19, 2026
@maciejpirog maciejpirog changed the title Python CLI: Optimize --exclude to ignore directories without listing their content Python CLI: Optimize --exclude and .semgrepignore to ignore directories without listing their content Feb 19, 2026
        result |= target.files_from_git_diff()
        continue
    except (subprocess.CalledProcessError, FileNotFoundError):
        pass  # fall through to git_tracked_only / filesystem

Collaborator

the original code had some logs here...

Contributor Author

I'll add logs

Contributor Author

added a log instead of pass

        result |= target.files_from_git_ls()
        continue
    except (subprocess.CalledProcessError, FileNotFoundError):
        pass  # fall through to pruned filesystem walk

Collaborator

as before, original code has logs, why swallow exceptions?

Contributor Author

We swallow exceptions to use the backup behaviour. If we get no files from git diff, we try git ls-files, and if that fails, we use all files.

Contributor Author

added a log instead of pass

]
for filename in filenames:
    filepath = dirpath / filename
    if self._is_valid_file(filepath):

Collaborator

why does it begin with _ ?

Contributor Author

private method, as with _survives

# patterns (matched via `dir/**`) and `dir/` folder patterns.
and (
    file_ignore is None
    or file_ignore._survives((dirpath / d / ".__check__").absolute())

Collaborator

why does _survives start with _ ?

Contributor Author

In Python, underscore is used for private methods, see, e.g., https://www.datacamp.com/tutorial/python-private-methods-explained

Collaborator

@dimitris-m dimitris-m left a comment

lgtm modulo the logging issue; see comments

@maciejpirog maciejpirog force-pushed the mpir/cli-optimize-exclude-flag branch from c981e90 to a364416 Compare February 19, 2026 17:33
@maciejpirog maciejpirog merged commit 334dfbb into main Feb 20, 2026
43 checks passed
@maciejpirog maciejpirog deleted the mpir/cli-optimize-exclude-flag branch February 20, 2026 11:49
@maciejpirog maciejpirog mentioned this pull request Feb 25, 2026
tmeijn pushed a commit to tmeijn/dotfiles that referenced this pull request Mar 2, 2026
This MR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [opengrep/opengrep](https://github.com/opengrep/opengrep) | patch | `v1.16.1` → `v1.16.2` |

MR created with the help of [el-capitano/tools/renovate-bot](https://gitlab.com/el-capitano/tools/renovate-bot).

**Proposed changes to behavior should be submitted there as MRs.**

---

### Release Notes

<details>
<summary>opengrep/opengrep (opengrep/opengrep)</summary>

### [`v1.16.2`](https://github.com/opengrep/opengrep/releases/tag/v1.16.2): Opengrep 1.16.2

[Compare Source](opengrep/opengrep@v1.16.1...v1.16.2)

#### Improvements

- Python CLI: Optimize `--exclude` and `.semgrepignore` to ignore directories without listing their content by [@maciejpirog](https://github.com/maciejpirog) in [#596](opengrep/opengrep#596)

**Full Changelog**: <opengrep/opengrep@v1.16.1...v1.16.2>

</details>
