Python CLI: Optimize --exclude and .semgrepignore to ignore directories without listing their content#596
Conversation
781a5fe to
c981e90
Compare
--exclude to ignore directories without listing their content--exclude to ignore directories without listing their content
--exclude to ignore directories without listing their content--exclude and .semgrepignore to ignore directories without listing their content
cli/src/semgrep/target_manager.py
Outdated
| result |= target.files_from_git_diff() | ||
| continue | ||
| except (subprocess.CalledProcessError, FileNotFoundError): | ||
| pass # fall through to git_tracked_only / filesystem |
There was a problem hiding this comment.
the original code had some logs here...
There was a problem hiding this comment.
added a log instead of pass
cli/src/semgrep/target_manager.py
Outdated
| result |= target.files_from_git_ls() | ||
| continue | ||
| except (subprocess.CalledProcessError, FileNotFoundError): | ||
| pass # fall through to pruned filesystem walk |
There was a problem hiding this comment.
as before, original code has logs, why swallow exceptions?
There was a problem hiding this comment.
We swallow exceptions to use the backup behviour. If we get no files from git diff, we try git ls-files, and if it fails, we use all files.
There was a problem hiding this comment.
added a log instead of pass
cli/src/semgrep/target_manager.py
Outdated
| ] | ||
| for filename in filenames: | ||
| filepath = dirpath / filename | ||
| if self._is_valid_file(filepath): |
There was a problem hiding this comment.
why does it begin with _ ?
There was a problem hiding this comment.
private method, as with _survives
| # patterns (matched via `dir/**`) and `dir/` folder patterns. | ||
| and ( | ||
| file_ignore is None | ||
| or file_ignore._survives((dirpath / d / ".__check__").absolute()) |
There was a problem hiding this comment.
why does _survive start with _ ?
There was a problem hiding this comment.
In Python, underscore is used for private methods, see, e.g., https://www.datacamp.com/tutorial/python-private-methods-explained
dimitris-m
left a comment
There was a problem hiding this comment.
lgtm modulo the logging issue; see comments
c981e90 to
a364416
Compare
This MR contains the following updates: | Package | Update | Change | |---|---|---| | [opengrep/opengrep](https://github.com/opengrep/opengrep) | patch | `v1.16.1` → `v1.16.2` | MR created with the help of [el-capitano/tools/renovate-bot](https://gitlab.com/el-capitano/tools/renovate-bot). **Proposed changes to behavior should be submitted there as MRs.** --- ### Release Notes <details> <summary>opengrep/opengrep (opengrep/opengrep)</summary> ### [`v1.16.2`](https://github.com/opengrep/opengrep/releases/tag/v1.16.2): Opengrep 1.16.2 [Compare Source](opengrep/opengrep@v1.16.1...v1.16.2) #### Improvements - Python CLI: Optimize `--exclude` and `.semgrepignore` to ignore directories without listing their content by [@​maciejpirog](https://github.com/maciejpirog) in [#​596](opengrep/opengrep#596) **Full Changelog**: <opengrep/opengrep@v1.16.1...v1.16.2> </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Enabled. ♻ **Rebasing**: Whenever MR is behind base branch, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this MR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this MR, check this box --- This MR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My40Ni42IiwidXBkYXRlZEluVmVyIjoiNDMuNDYuNiIsInRhcmdldEJyYW5jaCI6Im1haW4iLCJsYWJlbHMiOlsiUmVub3ZhdGUgQm90IiwiYXV0b21hdGlvbjpib3QtYXV0aG9yZWQiLCJkZXBlbmRlbmN5LXR5cGU6OnBhdGNoIl19-->
Overview
Relevant issue: #528
Current status:
--experimentaland here we try to achieve similar behaviourTests
Tested with
which seem sufficiently comprehensive
Evaluation of performance
The env for testing is set up using the script atttached to #528, in particular:
Comapring the result. All commands are
opengrep scan -c $RULES ..., where$RULESpoint to a local copy of the semgrep rule repo.HelloWorld.kt)--exclude .m2 .).semgrepignorefile (.)Implementation details
Problem
When excluding large directories (e.g.
node_modules,build), the Python frontend was slow and still reported tens/hundreds of thousands of skipped files in the scan summary. The root cause was thatTarget.files_from_filesystem()usespath.glob("**/*"), which expands the entire directory tree before any filtering occurs.filter_excludesandFileIgnore.filter_pathsthen pattern-match against everysingle collected file — O(N) in the number of files inside excluded directories.
By contrast,
osemgrep(src/targeting/Find_targets.ml) checks each directory against exclusion patterns before descending, making excluded directories O(1) regardless of their size.Solution
Add directory-level early pruning to the Python filesystem walker, matching osemgrep's behavior, with identical scan semantics.
Target.files_from_filesystem_with_dir_pruning(new method)Replaces
path.glob("**/*")withos.walk(topdown=True, followlinks=False). Before descending into each subdirectory, two checks are applied in-place ondirnames:--excludepatterns viawcglob.globfilter— patterns are passed in already preprocessed (e.g.node_modules→**/node_modules,**/node_modules/**) so the preprocessing cost is paid once per scan.Paths are expressed relative to the scan root (not absolute) so that
**/-prefixed wcmatch patterns match correctly..semgrepignorepatterns viaFileIgnore._survives(dir / ".__check__")— checking a virtual sentinel file inside the directory (rather than the directory path itself) correctly matches bothdirpatterns (via the generated
dir/**fnmatch pattern) anddir/folder patterns._survivesis pure pattern-matching with no filesystem access, so this is safe with non-existent paths.TargetManager.get_all_files_with_dir_pruning(new method)Preprocesses exclude patterns once, then iterates over targets mirroring the exact fallback chain of
Target.files()— substitutingfiles_from_filesystem_with_dir_pruningwhereverfiles_from_filesystemwould have been called:
target.files()unchangedfiles_from_git_diff(); on failure falls throughignore_baseline_handler=Truerespect_git_ignore = not no_git_ignoreandno_git_ignoredefaults toFalse)files_from_git_ls(); on success uses that result; on failure falls through to--no-git-ignore(pure filesystem)The critical fix for the default scan mode: previously when
files_from_git_ls()failed (non-git directory or git unavailable), it silently fell back to the originalfiles_from_filesystem()with no pruning.Now it falls back to
files_from_filesystem_with_dir_pruning.Decorated with
@lru_cacheso the filesystem is walked at most once per unique(exclude_patterns, ignore_baseline_handler, file_ignore)combination.TargetManager.get_files_for_language(modified)Replaces
self.get_all_files(ignore_baseline_handler)with a call toget_all_files_with_dir_pruning, passing:--excludepatterns for the current product +PATHS_ALWAYS_SKIPPED(.git)FileIgnoreobject (ifrespect_semgrepignoreis set)The existing
filter_excludesandfilter_pathscalls remain as safety nets for file-level patterns (e.g.--exclude "*.pyc") and for results from git-based traversals.