perf: adjust default sqlite PRAGMAs, auto detect network fstype#4152
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a POSIX-only network-filesystem detector and uses it when opening SQLite connections to apply different PRAGMA settings; documentation updated to state automatic detection and list the PRAGMAs used for network vs non-network filesystems. Changes
Sequence Diagram(s)sequenceDiagram
participant Snakemake
participant FSDetector as is_network_filesystem
participant SQLite as SQLiteConnection
Snakemake->>SQLite: open connection (sqlite URL)
SQLite->>FSDetector: resolve DB path & check mounts (psutil)
FSDetector-->>SQLite: is_network_fs? (true|false)
alt network filesystem
SQLite->>SQLite: PRAGMA journal_mode=PERSIST
SQLite->>SQLite: PRAGMA synchronous=OFF
SQLite->>SQLite: PRAGMA temp_store=MEMORY
SQLite->>SQLite: PRAGMA cache_size=-64000
else local filesystem
SQLite->>SQLite: PRAGMA journal_mode=TRUNCATE
SQLite->>SQLite: PRAGMA synchronous=NORMAL
end
SQLite-->>Snakemake: connection ready
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
docs/executing/provenance.rst (1)
23-30: Documentation lists fewer filesystem types than actually detected.The implementation detects additional network filesystems (
glusterfs,cifs,smbfs) that aren't mentioned here. Consider updating the parenthetical to include these, or use phrasing like "common network filesystems" to avoid maintaining an exhaustive list in documentation.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/executing/provenance.rst` around lines 23 - 30, Update the parenthetical in the opening sentence that lists detected network filesystems so it either includes the additional types the implementation checks for (glusterfs, cifs, smbfs) or replace the explicit list with a non-exhaustive phrase such as "common network filesystems (e.g., NFS, CephFS, Lustre, GPFS)" to avoid divergence; modify the text around the string starting "By default, snakemake automatically detects if the database is located on a network filesystem" and keep the PRAGMA examples unchanged.src/snakemake/persistence/db.py (2)
260-271: The network filesystem type list may need extension.As noted in the PR description, the detection list might need extension. Some commonly used parallel/network filesystems not currently included:
beegfs(BeeGFS)afs(Andrew File System)pvfs2/orangefs(Parallel Virtual File System)fuse.sshfs(SSHFS via FUSE)fhgfs(older BeeGFS name)This is flagged for awareness rather than as a blocking issue, since false negatives simply result in more conservative (non-network) PRAGMA defaults.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/snakemake/persistence/db.py` around lines 260 - 271, The network_fs_types set used for detecting network filesystems (the network_fs_types variable and the membership check return fstype.casefold() in network_fs_types) is missing additional common parallel/network FS names; update that set to include entries like "beegfs", "fhgfs", "afs", "pvfs2", "orangefs", and "fuse.sshfs" (and optionally "sshfs") so the fstype.casefold() membership test correctly detects these filesystems.
247-258: Symlinks may cause incorrect filesystem detection.Using
os.path.abspath()doesn't resolve symbolic links. If the persistence path contains a symlink pointing to network storage (e.g.,/home/user/data→/mnt/nfs/data), the detection will match against/homeinstead of/mnt/nfs, returning the wrong result.Consider using
os.path.realpath()to resolve symlinks before matching:♻️ Proposed fix
- path = os.path.abspath(path) + path = os.path.realpath(path)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/snakemake/persistence/db.py` around lines 247 - 258, The code resolves the persistence path with os.path.abspath() before matching mount points, which doesn't follow symlinks and can cause incorrect filesystem detection; change the resolution to use os.path.realpath() when assigning path so the matching loop over psutil.disk_partitions(all=True) correctly compares the real mountpoint, keeping the logic that updates longest_prefix_length and fstype unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@docs/executing/provenance.rst`:
- Around line 23-30: Update the parenthetical in the opening sentence that lists
detected network filesystems so it either includes the additional types the
implementation checks for (glusterfs, cifs, smbfs) or replace the explicit list
with a non-exhaustive phrase such as "common network filesystems (e.g., NFS,
CephFS, Lustre, GPFS)" to avoid divergence; modify the text around the string
starting "By default, snakemake automatically detects if the database is located
on a network filesystem" and keep the PRAGMA examples unchanged.
In `@src/snakemake/persistence/db.py`:
- Around line 260-271: The network_fs_types set used for detecting network
filesystems (the network_fs_types variable and the membership check return
fstype.casefold() in network_fs_types) is missing additional common
parallel/network FS names; update that set to include entries like "beegfs",
"fhgfs", "afs", "pvfs2", "orangefs", and "fuse.sshfs" (and optionally "sshfs")
so the fstype.casefold() membership test correctly detects these filesystems.
- Around line 247-258: The code resolves the persistence path with
os.path.abspath() before matching mount points, which doesn't follow symlinks
and can cause incorrect filesystem detection; change the resolution to use
os.path.realpath() when assigning path so the matching loop over
psutil.disk_partitions(all=True) correctly compares the real mountpoint, keeping
the logic that updates longest_prefix_length and fstype unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 12329397-1fd6-484d-85f4-1e98b5963583
📒 Files selected for processing (2)
docs/executing/provenance.rstsrc/snakemake/persistence/db.py
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/snakemake/persistence/db.py`:
- Around line 255-257: The code uses a raw startswith check on part.mountpoint
(variables path and part.mountpoint) which causes false matches like
"/mnt/data2" matching "/mnt/data"; change the condition to a path-boundary-aware
check such as: ensure path == part.mountpoint or path.startswith(part.mountpoint
+ os.sep) (import os if needed or use os.sep) so only true mountpoint prefixes
are matched, keeping the longest_prefix_length logic intact.
- Line 83: The code currently calls is_network_filesystem(self.path) which
checks the local persistence directory instead of the actual SQLite file
referenced by the backend URL; change this to parse the SQLite file path out of
the configured DB URL (e.g., strip the "sqlite://" or "sqlite:///" scheme and
any query/fragment) and call is_network_filesystem(sqlite_db_path) instead so
the PRAGMA selection uses the real database file location; update the code
around the is_network_filesystem call (referencing self.path and
is_network_filesystem) to derive sqlite_db_path from the persistence backend DB
URL attribute (e.g., self.db_url / self._db_url / persistence-backend-db-url
variable) before performing the filesystem check.
🪄 Autofix (Beta)
✅ Autofix completed
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ce45863d-d7d7-4ce8-b944-9991f051ec7d
📒 Files selected for processing (2)
docs/executing/provenance.rstsrc/snakemake/persistence/db.py
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/executing/provenance.rst
|
Note Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it. Fixes Applied SuccessfullyFixed 1 file(s) based on 2 unresolved review comments. Files modified:
Commit: The changes have been pushed to the Time taken: |
Fixed 1 file(s) based on 2 unresolved review comments. Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
|
Please format your code with |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/snakemake/persistence/db.py (1)
83-95:⚠️ Potential issue | 🟠 MajorUse
engine.url.databaseinstead of manually parsing thedb_urlstring.The current approach of manually stripping
sqlite:///andsqlite://prefixes is brittle and fails for several valid SQLite URL formats. Driver-qualified URLs likesqlite+pysqlite:///..., URLs with query parameters, memory databases, and Windows paths all bypass this parsing logic. This causesis_network_filesystem()to inspect the wrong path, leading to incorrect PRAGMA selection.SQLAlchemy provides the normalized database path via
engine.url.database, which handles all valid SQLite URL formats correctly:
sqlite:///relative.db→"relative.db"sqlite:////abs/path.db→"/abs/path.db"sqlite+pysqlite:////abs/path.db→"/abs/path.db"Suggested change
- # Parse SQLite file path from the DB URL for filesystem checks - sqlite_db_path = db_url - if sqlite_db_path.startswith("sqlite:///"): - sqlite_db_path = sqlite_db_path[len("sqlite:///"):] - elif sqlite_db_path.startswith("sqlite://"): - sqlite_db_path = sqlite_db_path[len("sqlite://"):] - # Remove query parameters and fragments if present - if "?" in sqlite_db_path: - sqlite_db_path = sqlite_db_path.split("?", 1)[0] - if "#" in sqlite_db_path: - sqlite_db_path = sqlite_db_path.split("#", 1)[0] - - is_network_fs = is_network_filesystem(sqlite_db_path) + sqlite_db_path = ( + self.engine.url.database if self.engine.dialect.name == "sqlite" else None + ) + is_network_fs = bool(sqlite_db_path) and is_network_filesystem(sqlite_db_path)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/snakemake/persistence/db.py` around lines 83 - 95, Replace the brittle manual parsing of db_url used to derive sqlite_db_path with SQLAlchemy's normalized value: read the database path from engine.url.database (use engine.url.database instead of slicing db_url), then pass that to is_network_filesystem and subsequent logic; update references to sqlite_db_path, db_url, and is_network_fs accordingly so is_network_filesystem(engine.url.database) is used and handles driver-qualified URLs, memory DBs, query params, and Windows paths correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/snakemake/persistence/db.py`:
- Around line 266-271: The path-to-mountpoint matching loop incorrectly treats
root mountpoints because path.startswith(part.mountpoint + os.sep) becomes
startswith("//") for part.mountpoint == os.sep; update the condition inside the
loop that iterates psutil.disk_partitions to explicitly handle root: use if path
== part.mountpoint or (part.mountpoint != os.sep and
path.startswith(part.mountpoint + os.sep)); keep the existing logic that updates
longest_prefix_length and fstype, referencing the variables path,
part.mountpoint, longest_prefix_length, and fstype so the root "/" mount is
matched correctly and network mounts aren’t misclassified as local.
---
Duplicate comments:
In `@src/snakemake/persistence/db.py`:
- Around line 83-95: Replace the brittle manual parsing of db_url used to derive
sqlite_db_path with SQLAlchemy's normalized value: read the database path from
engine.url.database (use engine.url.database instead of slicing db_url), then
pass that to is_network_filesystem and subsequent logic; update references to
sqlite_db_path, db_url, and is_network_fs accordingly so
is_network_filesystem(engine.url.database) is used and handles driver-qualified
URLs, memory DBs, query params, and Windows paths correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9d3717d2-f1af-4475-9f11-8034f41afe71
📒 Files selected for processing (1)
src/snakemake/persistence/db.py
johanneskoester
left a comment
There was a problem hiding this comment.
Cool! Two things below.
🤖 I have created a release *beep* *boop* --- ## [9.21.0](v9.20.0...v9.21.0) (2026-05-13) ### Features * add a function to help with prepending arguments to filenames; close [#672](#672) ([#4090](#4090)) ([14ccd1d](14ccd1d)) ### Bug Fixes * close plugin handlers after draining QueueListener in LoggerManager.stop() ([#4137](#4137)) ([b2a9e69](b2a9e69)) ### Performance Improvements * adjust default sqlite PRAGMAs, auto detect network fstype ([#4152](#4152)) ([3df2d35](3df2d35)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
This PR adjusts the default PRAGMAs to better suit networked file systems (and tries to auto-detect them, though we may have to extend the list or just always apply the same PRAGMAs, open to discussion).
QC
docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).AI-assistance disclosure
I used AI assistance for:
Summary by CodeRabbit
New Features
Documentation