Skip to content

perf: adjust default sqlite PRAGMAs, auto detect network fstype#4152

Merged
johanneskoester merged 17 commits into
mainfrom
perf/db-backend-sqlite-pragmas
May 13, 2026
Merged

perf: adjust default sqlite PRAGMAs, auto detect network fstype#4152
johanneskoester merged 17 commits into
mainfrom
perf/db-backend-sqlite-pragmas

Conversation

@tedil

@tedil tedil commented Apr 12, 2026

Copy link
Copy Markdown
Contributor

This PR adjusts the default PRAGMAs to better suit networked file systems (and tries to auto-detect them, though we may have to extend the list or just always apply the same PRAGMAs, open to discussion).

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).
  • I, as a human being, have checked each line of code in this pull request

AI-assistance disclosure

I used AI assistance for:

  • Code generation (e.g., when writing an implementation or fixing a bug) (review, coderabbitai)
  • Test/benchmark generation
  • Documentation (including examples)
  • Research and understanding

Summary by CodeRabbit

  • New Features

    • Automatic detection when the persistence database is on a network filesystem and application of different SQLite settings to improve reliability.
  • Documentation

    • Clarified behavior and listed PRAGMA sets used for network FS (journal_mode=PERSIST, synchronous=OFF, temp_store=MEMORY, cache_size=-64000) versus local FS (journal_mode=TRUNCATE, synchronous=NORMAL); reordered caveats and recommendations.

@tedil tedil requested a review from cmeesters April 12, 2026 09:22
@tedil tedil requested a review from johanneskoester as a code owner April 12, 2026 09:22
@coderabbitai

coderabbitai Bot commented Apr 12, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a POSIX-only network-filesystem detector and uses it when opening SQLite connections to apply different PRAGMA settings; documentation updated to state automatic detection and list the PRAGMAs used for network vs non-network filesystems.

Changes

Cohort / File(s) Summary
Documentation
docs/executing/provenance.rst
Replaced prior guidance with statement that Snakemake auto-detects network filesystems and lists PRAGMAs. Documents network-FS PRAGMAs: journal_mode=PERSIST, synchronous=OFF, temp_store=MEMORY, cache_size=-64000; non-network defaults: synchronous=NORMAL, journal_mode=TRUNCATE. Reordered caveats after busy_timeout and added trailing newline.
Persistence Database
src/snakemake/persistence/db.py
Added POSIX-only `is_network_filesystem(path: Path

Sequence Diagram(s)

sequenceDiagram
    participant Snakemake
    participant FSDetector as is_network_filesystem
    participant SQLite as SQLiteConnection

    Snakemake->>SQLite: open connection (sqlite URL)
    SQLite->>FSDetector: resolve DB path & check mounts (psutil)
    FSDetector-->>SQLite: is_network_fs? (true|false)
    alt network filesystem
        SQLite->>SQLite: PRAGMA journal_mode=PERSIST
        SQLite->>SQLite: PRAGMA synchronous=OFF
        SQLite->>SQLite: PRAGMA temp_store=MEMORY
        SQLite->>SQLite: PRAGMA cache_size=-64000
    else local filesystem
        SQLite->>SQLite: PRAGMA journal_mode=TRUNCATE
        SQLite->>SQLite: PRAGMA synchronous=NORMAL
    end
    SQLite-->>Snakemake: connection ready
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: adjusting SQLite PRAGMA defaults and auto-detecting network filesystem types for performance optimization.
Description check ✅ Passed The description covers the main purpose, completes all QC checklist items, and discloses AI assistance with specifics, matching the required template structure.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/db-backend-sqlite-pragmas

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
docs/executing/provenance.rst (1)

23-30: Documentation lists fewer filesystem types than actually detected.

The implementation detects additional network filesystems (glusterfs, cifs, smbfs) that aren't mentioned here. Consider updating the parenthetical to include these, or use phrasing like "common network filesystems" to avoid maintaining an exhaustive list in documentation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/executing/provenance.rst` around lines 23 - 30, Update the parenthetical
in the opening sentence that lists detected network filesystems so it either
includes the additional types the implementation checks for (glusterfs, cifs,
smbfs) or replace the explicit list with a non-exhaustive phrase such as "common
network filesystems (e.g., NFS, CephFS, Lustre, GPFS)" to avoid divergence;
modify the text around the string starting "By default, snakemake automatically
detects if the database is located on a network filesystem" and keep the PRAGMA
examples unchanged.
src/snakemake/persistence/db.py (2)

260-271: The network filesystem type list may need extension.

As noted in the PR description, the detection list might need extension. Some commonly used parallel/network filesystems not currently included:

  • beegfs (BeeGFS)
  • afs (Andrew File System)
  • pvfs2 / orangefs (Parallel Virtual File System)
  • fuse.sshfs (SSHFS via FUSE)
  • fhgfs (older BeeGFS name)

This is flagged for awareness rather than as a blocking issue, since false negatives simply result in more conservative (non-network) PRAGMA defaults.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/snakemake/persistence/db.py` around lines 260 - 271, The network_fs_types
set used for detecting network filesystems (the network_fs_types variable and
the membership check return fstype.casefold() in network_fs_types) is missing
additional common parallel/network FS names; update that set to include entries
like "beegfs", "fhgfs", "afs", "pvfs2", "orangefs", and "fuse.sshfs" (and
optionally "sshfs") so the fstype.casefold() membership test correctly detects
these filesystems.

247-258: Symlinks may cause incorrect filesystem detection.

Using os.path.abspath() doesn't resolve symbolic links. If the persistence path contains a symlink pointing to network storage (e.g., /home/user/data/mnt/nfs/data), the detection will match against /home instead of /mnt/nfs, returning the wrong result.

Consider using os.path.realpath() to resolve symlinks before matching:

♻️ Proposed fix
-    path = os.path.abspath(path)
+    path = os.path.realpath(path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/snakemake/persistence/db.py` around lines 247 - 258, The code resolves
the persistence path with os.path.abspath() before matching mount points, which
doesn't follow symlinks and can cause incorrect filesystem detection; change the
resolution to use os.path.realpath() when assigning path so the matching loop
over psutil.disk_partitions(all=True) correctly compares the real mountpoint,
keeping the logic that updates longest_prefix_length and fstype unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@docs/executing/provenance.rst`:
- Around line 23-30: Update the parenthetical in the opening sentence that lists
detected network filesystems so it either includes the additional types the
implementation checks for (glusterfs, cifs, smbfs) or replace the explicit list
with a non-exhaustive phrase such as "common network filesystems (e.g., NFS,
CephFS, Lustre, GPFS)" to avoid divergence; modify the text around the string
starting "By default, snakemake automatically detects if the database is located
on a network filesystem" and keep the PRAGMA examples unchanged.

In `@src/snakemake/persistence/db.py`:
- Around line 260-271: The network_fs_types set used for detecting network
filesystems (the network_fs_types variable and the membership check return
fstype.casefold() in network_fs_types) is missing additional common
parallel/network FS names; update that set to include entries like "beegfs",
"fhgfs", "afs", "pvfs2", "orangefs", and "fuse.sshfs" (and optionally "sshfs")
so the fstype.casefold() membership test correctly detects these filesystems.
- Around line 247-258: The code resolves the persistence path with
os.path.abspath() before matching mount points, which doesn't follow symlinks
and can cause incorrect filesystem detection; change the resolution to use
os.path.realpath() when assigning path so the matching loop over
psutil.disk_partitions(all=True) correctly compares the real mountpoint, keeping
the logic that updates longest_prefix_length and fstype unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 12329397-1fd6-484d-85f4-1e98b5963583

📥 Commits

Reviewing files that changed from the base of the PR and between 0fc5fd1 and 8438069.

📒 Files selected for processing (2)
  • docs/executing/provenance.rst
  • src/snakemake/persistence/db.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/snakemake/persistence/db.py`:
- Around line 255-257: The code uses a raw startswith check on part.mountpoint
(variables path and part.mountpoint) which causes false matches like
"/mnt/data2" matching "/mnt/data"; change the condition to a path-boundary-aware
check such as: ensure path == part.mountpoint or path.startswith(part.mountpoint
+ os.sep) (import os if needed or use os.sep) so only true mountpoint prefixes
are matched, keeping the longest_prefix_length logic intact.
- Line 83: The code currently calls is_network_filesystem(self.path) which
checks the local persistence directory instead of the actual SQLite file
referenced by the backend URL; change this to parse the SQLite file path out of
the configured DB URL (e.g., strip the "sqlite://" or "sqlite:///" scheme and
any query/fragment) and call is_network_filesystem(sqlite_db_path) instead so
the PRAGMA selection uses the real database file location; update the code
around the is_network_filesystem call (referencing self.path and
is_network_filesystem) to derive sqlite_db_path from the persistence backend DB
URL attribute (e.g., self.db_url / self._db_url / persistence-backend-db-url
variable) before performing the filesystem check.
🪄 Autofix (Beta)

✅ Autofix completed


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ce45863d-d7d7-4ce8-b944-9991f051ec7d

📥 Commits

Reviewing files that changed from the base of the PR and between 8438069 and 3b53692.

📒 Files selected for processing (2)
  • docs/executing/provenance.rst
  • src/snakemake/persistence/db.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/executing/provenance.rst

Comment thread src/snakemake/persistence/db.py Outdated
Comment thread src/snakemake/persistence/db.py Outdated
@coderabbitai

coderabbitai Bot commented Apr 13, 2026

Copy link
Copy Markdown
Contributor

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Fixes Applied Successfully

Fixed 1 file(s) based on 2 unresolved review comments.

Files modified:

  • src/snakemake/persistence/db.py

Commit: 529775523df725c602accf8ce4c82f06ad5477c4

The changes have been pushed to the perf/db-backend-sqlite-pragmas branch.

Time taken: 16m 12s

Fixed 1 file(s) based on 2 unresolved review comments.

Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
@github-actions

Copy link
Copy Markdown
Contributor

Please format your code with pixi run format

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/snakemake/persistence/db.py (1)

83-95: ⚠️ Potential issue | 🟠 Major

Use engine.url.database instead of manually parsing the db_url string.

The current approach of manually stripping sqlite:/// and sqlite:// prefixes is brittle and fails for several valid SQLite URL formats. Driver-qualified URLs like sqlite+pysqlite:///..., URLs with query parameters, memory databases, and Windows paths all bypass this parsing logic. This causes is_network_filesystem() to inspect the wrong path, leading to incorrect PRAGMA selection.

SQLAlchemy provides the normalized database path via engine.url.database, which handles all valid SQLite URL formats correctly:

  • sqlite:///relative.db"relative.db"
  • sqlite:////abs/path.db"/abs/path.db"
  • sqlite+pysqlite:////abs/path.db"/abs/path.db"
Suggested change
-        # Parse SQLite file path from the DB URL for filesystem checks
-        sqlite_db_path = db_url
-        if sqlite_db_path.startswith("sqlite:///"):
-            sqlite_db_path = sqlite_db_path[len("sqlite:///"):]
-        elif sqlite_db_path.startswith("sqlite://"):
-            sqlite_db_path = sqlite_db_path[len("sqlite://"):]
-        # Remove query parameters and fragments if present
-        if "?" in sqlite_db_path:
-            sqlite_db_path = sqlite_db_path.split("?", 1)[0]
-        if "#" in sqlite_db_path:
-            sqlite_db_path = sqlite_db_path.split("#", 1)[0]
-
-        is_network_fs = is_network_filesystem(sqlite_db_path)
+        sqlite_db_path = (
+            self.engine.url.database if self.engine.dialect.name == "sqlite" else None
+        )
+        is_network_fs = bool(sqlite_db_path) and is_network_filesystem(sqlite_db_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/snakemake/persistence/db.py` around lines 83 - 95, Replace the brittle
manual parsing of db_url used to derive sqlite_db_path with SQLAlchemy's
normalized value: read the database path from engine.url.database (use
engine.url.database instead of slicing db_url), then pass that to
is_network_filesystem and subsequent logic; update references to sqlite_db_path,
db_url, and is_network_fs accordingly so
is_network_filesystem(engine.url.database) is used and handles driver-qualified
URLs, memory DBs, query params, and Windows paths correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/snakemake/persistence/db.py`:
- Around line 266-271: The path-to-mountpoint matching loop incorrectly treats
root mountpoints because path.startswith(part.mountpoint + os.sep) becomes
startswith("//") for part.mountpoint == os.sep; update the condition inside the
loop that iterates psutil.disk_partitions to explicitly handle root: use if path
== part.mountpoint or (part.mountpoint != os.sep and
path.startswith(part.mountpoint + os.sep)); keep the existing logic that updates
longest_prefix_length and fstype, referencing the variables path,
part.mountpoint, longest_prefix_length, and fstype so the root "/" mount is
matched correctly and network mounts aren’t misclassified as local.

---

Duplicate comments:
In `@src/snakemake/persistence/db.py`:
- Around line 83-95: Replace the brittle manual parsing of db_url used to derive
sqlite_db_path with SQLAlchemy's normalized value: read the database path from
engine.url.database (use engine.url.database instead of slicing db_url), then
pass that to is_network_filesystem and subsequent logic; update references to
sqlite_db_path, db_url, and is_network_fs accordingly so
is_network_filesystem(engine.url.database) is used and handles driver-qualified
URLs, memory DBs, query params, and Windows paths correctly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9d3717d2-f1af-4475-9f11-8034f41afe71

📥 Commits

Reviewing files that changed from the base of the PR and between 3b53692 and 5297755.

📒 Files selected for processing (1)
  • src/snakemake/persistence/db.py

Comment thread src/snakemake/persistence/db.py Outdated
Comment thread src/snakemake/persistence/db.py Outdated
Comment thread src/snakemake/persistence/db.py Outdated
Comment thread src/snakemake/persistence/db.py Outdated
@cmeesters cmeesters self-requested a review April 15, 2026 12:42

@johanneskoester johanneskoester left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Two things below.

Comment thread src/snakemake/persistence/db.py
Comment thread src/snakemake/persistence/db.py
Comment thread src/snakemake/persistence/db.py
Comment thread src/snakemake/persistence/db.py
@johanneskoester johanneskoester enabled auto-merge (squash) May 6, 2026 07:57
@johanneskoester johanneskoester disabled auto-merge May 6, 2026 08:04
@johanneskoester johanneskoester enabled auto-merge (squash) May 13, 2026 17:49
@johanneskoester johanneskoester merged commit 3df2d35 into main May 13, 2026
56 of 59 checks passed
@johanneskoester johanneskoester deleted the perf/db-backend-sqlite-pragmas branch May 13, 2026 18:05
johanneskoester pushed a commit that referenced this pull request May 14, 2026
🤖 I have created a release *beep* *boop*
---


##
[9.21.0](v9.20.0...v9.21.0)
(2026-05-13)


### Features

* add a function to help with prepending arguments to filenames; close
[#672](#672)
([#4090](#4090))
([14ccd1d](14ccd1d))


### Bug Fixes

* close plugin handlers after draining QueueListener in
LoggerManager.stop()
([#4137](#4137))
([b2a9e69](b2a9e69))


### Performance Improvements

* adjust default sqlite PRAGMAs, auto detect network fstype
([#4152](#4152))
([3df2d35](3df2d35))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants