fix: parser.py incorrectly parsing multiline f-strings (#3701)
johanneskoester merged 6 commits into snakemake:main from
Conversation
Fix multiline f-strings removing lines; fixes snakemake#3700

📝 Walkthrough

Rework of f-string parsing in `parser.py`.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Tokenizer
    participant Parser
    participant SplitLines as split_token_lines
    Tokenizer->>Parser: emit initial f-string TokenInfo
    Parser->>SplitLines: split_token_lines(initial_token)
    SplitLines-->>Parser: map[line_no -> fragment]
    loop remaining f-string tokens
        Tokenizer->>Parser: emit next TokenInfo
        Parser->>SplitLines: split_token_lines(next_token)
        SplitLines-->>Parser: additional fragments
        Parser->>Parser: merge fragments into lines map
    end
    Parser->>Parser: determine overall start/end, trim boundaries
    Parser->>Parser: join sorted lines -> reconstructed f-string
    Parser-->>Tokenizer: return reconstructed fragment
```
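The split-and-merge flow in the diagram can be sketched in plain Python. `Tok` below is a hypothetical stand-in for the fields of `tokenize.TokenInfo` that the parser uses; the real code receives Python 3.12+ f-string tokens.

```python
from collections import namedtuple

# Mock of the TokenInfo fields used here: start/end are (line, col) pairs,
# line is the full source text the token spans (newlines preserved).
Tok = namedtuple("Tok", ["start", "end", "line"])

def split_token_lines(tok):
    """Map each physical line number a token spans to that line's text,
    newlines preserved, mirroring the helper described in the walkthrough."""
    fragments = {}
    lineno = tok.start[0]
    for text in tok.line.splitlines(keepends=True):
        fragments[lineno] = text
        lineno += 1
    return fragments

# A token spanning three physical lines (as an FSTRING_MIDDLE can in 3.12):
tok = Tok(start=(1, 0), end=(3, 5), line='f"""abc\ndef\n{1+1}"""\n')
print(split_token_lines(tok))
# → {1: 'f"""abc\n', 2: 'def\n', 3: '{1+1}"""\n'}
```

The parser then merges such fragment maps across all tokens of the f-string before trimming the boundaries.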
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Assessment against linked issues
Not sure if this is the best way, but it fixes my minimal example (see #3700). Feel free to edit.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
src/snakemake/parser.py (1)
111-169: Fix potential infinite loop and improve line collection logic

The `parse_fstring` method has several issues that need to be addressed:

- Potential infinite loop: Line 132 increments `line` unconditionally even when `token_.end[0]` might be less than `token_.start[0]` (though unlikely in practice).
- Incorrect conditional logic: Line 138 uses `>` when it should be `<` to keep the longer content.
- Line parsing complexity: The nested line iteration (lines 130-142) is overly complex for what should be a simpler operation.
Apply this diff to fix the issues:
```diff
 def parse_fstring(self, token: tokenize.TokenInfo):
     """
     only for python >= 3.12, since then python changed the parsing manner
     of f-string, see [pep-0701](https://peps.python.org/pep-0701)
     Here, we just read where the f-string start and end from tokens.
     Luckily, each token records the content of the line, and we can just
     take what we want there.
     """
     lines_contents = dict()

     # collect all lines until the end of the f-string
     def collect_lines(token_: tokenize.TokenInfo):
         lines_contents[token_.start[0]] = token_.line
         # if the token spans multiple lines, we need to collect all of them
         if token_.end[0] != token_.start[0]:
             # iterate over all lines that the token spans
             token_string = token_.line
             line = token_.start[0]
-            while line <= token_.end[0]:
+            while line <= token_.end[0] and token_string:
                 end_of_line_index = (token_string.index("\n") + 1) if "\n" in token_string \
                     else len(token_string)
                 lines_content = token_string[: end_of_line_index]
                 # don't take the new line content if the one already stored is longer
-                if line not in lines_contents or len(lines_contents[line]) > len(lines_content):
+                if line not in lines_contents or len(lines_contents[line]) < len(lines_content):
                     lines_contents[line] = lines_content
                 token_string = token_string[end_of_line_index:]
                 line += 1
```
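A toy illustration of the corrected comparison (the fragments below are made up): when two tokens report content for the same physical line, the longer fragment is the more complete one, so it should win.

```python
# Hypothetical fragments: a shorter fragment for line 3 was stored first.
stored = {3: "def\n"}
line, content = 3, "def  # the full physical line\n"

# The corrected condition from the review: replace only when the
# newly seen content is LONGER than what is already stored.
if line not in stored or len(stored[line]) < len(content):
    stored[line] = content

print(stored[3])  # the longer, more complete fragment wins
```

With the original `>` comparison, the shorter fragment would have survived, dropping part of the line.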
🧹 Nitpick comments (1)
src/snakemake/parser.py (1)
133-134: Simplify ternary expression using min()

The ternary expression can be simplified for better readability.

```diff
-            end_of_line_index = (token_string.index("\n") + 1) if "\n" in token_string \
-                else len(token_string)
+            end_of_line_index = (token_string.index("\n") + 1) if "\n" in token_string else len(token_string)
```
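For what it's worth, one branch-free alternative (a sketch, not necessarily what the reviewer meant by "using min()") relies on `str.find` returning -1 when the substring is absent:

```python
def end_of_line_index(token_string):
    # find() returns -1 when "\n" is absent; -1 + 1 == 0 is falsy,
    # so `or` falls through to the full string length.
    return token_string.find("\n") + 1 or len(token_string)

print(end_of_line_index("abc\ndef"))  # 4 (index just past the newline)
print(end_of_line_index("abc"))      # 3 (no newline: whole string)
```

Whether this is clearer than the ternary is a matter of taste; the behavior is identical.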
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
src/snakemake/parser.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
⚙️ CodeRabbit Configuration File
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the `self` argument of methods.
Do not suggest type annotation of the `cls` argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a `return` statement.
Files:
src/snakemake/parser.py
🧬 Code Graph Analysis (1)
src/snakemake/parser.py (3)
- src/snakemake/gui.py (1): `index` (84-95)
- src/snakemake/rules.py (1): `snakefile` (142-143)
- src/snakemake/workflow.py (1): `snakefile` (1539-1544)
🔇 Additional comments (2)
src/snakemake/parser.py (2)
156-167: Well-structured f-string boundary handling

The logic for determining and trimming f-string boundaries is clean and correct. The code properly:

- Captures the outermost f-string boundaries
- Trims the first line from the start position
- Trims the last line to the end position
- Joins all lines in the correct order

146-154: Correct f-string nesting handling

The logic for tracking nested f-strings using a counter is well-implemented and correctly handles the Python 3.12+ tokenization behavior.
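A minimal sketch of the counter-based consumption this comment describes, using token-type names as plain strings (a mock of the Python 3.12 token stream, not the parser's actual code): each `FSTRING_START` increments the depth, each `FSTRING_END` decrements it, and the outermost f-string is complete when the counter returns to zero.

```python
def consume_fstring(token_types):
    """Consume token types until the outermost f-string is closed."""
    depth = 0
    consumed = []
    for t in token_types:
        consumed.append(t)
        if t == "FSTRING_START":
            depth += 1
        elif t == "FSTRING_END":
            depth -= 1
            if depth == 0:  # outermost f-string closed
                break
    return consumed

stream = ["FSTRING_START", "FSTRING_MIDDLE", "OP",
          "FSTRING_START", "FSTRING_END",   # nested f-string
          "OP", "FSTRING_END", "NEWLINE"]
print(consume_fstring(stream))  # stops at the second FSTRING_END, before NEWLINE
```

The nested `FSTRING_END` at depth 2 does not terminate the loop, which is exactly the behavior the counter protects.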
Actionable comments posted: 0
🧹 Nitpick comments (3)
src/snakemake/parser.py (3)
121-138: Consider adding edge case handling for malformed f-strings.

While the implementation correctly handles well-formed f-strings, consider adding defensive checks for edge cases where tokens might be malformed or incomplete. For example, you could add validation after line 138:

```diff
                 # remainder of the line after the newline character
                 token_string = token_string[end_of_line_index:]
                 i += 1
+
+        # Ensure we have content for the token's end line
+        if token_.end[0] not in lines_contents:
+            lines_contents[token_.end[0]] = ""
```
153-153: Clarify the comment about the `t1` variable.

The inline comment could be more explicit about what `t1` represents at this point in the code.

```diff
-        f_string_end = t1.end
+        f_string_end = t1.end  # t1 is the FSTRING_END token closing the outermost f-string
```
111-163: Consider adding a test case for the multiline f-string fix.

The PR checklist indicates that no test case was added. Given that this fixes a specific bug (#3700), it would be valuable to add a regression test to ensure the issue doesn't reoccur.

Would you like me to help generate a test case that verifies multiline f-strings with literal lines are correctly preserved? This would prevent regression of issue #3700.
📒 Files selected for processing (1)
src/snakemake/parser.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/snakemake/parser.py (2)

- src/snakemake/rules.py (1): `snakefile` (142-143)
- src/snakemake/workflow.py (1): `snakefile` (1539-1544)
🔇 Additional comments (2)
src/snakemake/parser.py (2)
111-163: LGTM! The fix correctly handles multiline f-strings.

The rewritten `parse_fstring` method properly collects all lines spanned by the f-string tokens and preserves literal lines without expressions. The approach of using a line-by-line dictionary to accumulate content and then trimming only the boundaries is sound and directly addresses the issue reported in #3700.
143-149: Ensure nested f-strings are tested under Python 3.12

I didn't find any existing test coverage for nested f-strings. To be confident that the `isin_fstring` counter handles all the new parsing rules in Python 3.12, please:

- Add dedicated tests (e.g., in `tests/test_parser.py`) covering complex nested f-string scenarios:
  - f-strings inside f-strings with conversions (`!r`, `!s`) and format specifiers
  - Multiple nesting levels and escaped braces
- Verify that `collect_lines` still correctly captures inner expressions and that `isin_fstring` increments/decrements as expected
- Run the full test suite on Python 3.12 to confirm no regressions
johanneskoester
left a comment
Thanks a lot!! Can you add a test case that failed before and works now?
Actionable comments posted: 0
🧹 Nitpick comments (5)
src/snakemake/parser.py (4)
78-91: Helper looks correct; consider minor doc tweak.

`split_token_lines()` correctly maps physical line numbers to their full line text with newlines preserved. No functional concerns. As a tiny nit, consider avoiding hard-coded token-type integers in the example to keep the doc resilient across Python versions.
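To illustrate the nit: token types can be displayed by name via `tokenize.tok_name`, which stays stable even though the underlying integer values shift between Python versions.

```python
import io
import tokenize

# Tokenize a tiny statement and collect symbolic token names instead of
# the raw integer type codes.
names = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO("x = 'hi'\n").readline)
]
print(names)  # ['NAME', 'OP', 'STRING', 'NEWLINE', 'ENDMARKER']
```

Docstring examples written this way keep working even if a new Python release renumbers the token table.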
136-160: Remove dead code: `lines_contents` and `collect_lines()` are unused.

This block builds `lines_contents` but it's never read when constructing `t`. It adds complexity without effect and may confuse future maintainers. Apply this diff to drop the unused logic:

```diff
-        lines_contents = dict()
-
-        def collect_lines(token_: tokenize.TokenInfo):
-            # iterate over all lines that the token spans
-            token_string = token_.line
-            i = token_.start[0]
-            while i <= token_.end[0]:
-                end_of_line_index = (
-                    (token_string.index("\n") + 1)
-                    if "\n" in token_string
-                    else len(token_string)
-                )
-                line_content = token_string[:end_of_line_index]
-
-                # don't take the new line content if the one already stored is longer
-                if i not in lines_contents or len(lines_contents[i]) < len(
-                    line_content
-                ):
-                    lines_contents[i] = line_content
-
-                # remainder of the line after the newline character
-                token_string = token_string[end_of_line_index:]
-                i += 1
-
-        # Collect lines for the first and subsequent tokens that are part of the f-string
-        collect_lines(token)
```
162-162: Remove unused variable `related_lines`.

Static analysis is right: this local is never used. Apply this diff:

```diff
-        related_lines = token.start[0]
```
166-167: Make end-index computation explicit to avoid negative slicing subtlety.

`t1.end[1] - len(t1.line)` relies on negative indexing semantics against `s`, which is correct but non-obvious. Computing the absolute end offset improves clarity and reduces off-by-one risks during future edits. While here, you can safely update the `lines` map on every iteration; values for the same physical line are identical across tokens (TokenInfo.line is the full physical line), so overriding is harmless and simplifies control flow.

Apply this diff:

```diff
-            if t1.end[0] not in lines:
-                lines.update(split_token_lines(t1))
+            # Merge the physical lines covered by this token. Overwrites are safe because
+            # TokenInfo.line holds the entire physical line for every token on that line.
+            lines.update(split_token_lines(t1))
@@
-        s = "".join(lines[i] for i in sorted(lines))
-        t = s[token.start[1] : t1.end[1] - len(t1.line)]
+        s = "".join(lines[i] for i in sorted(lines))
+        # Map the tokenizer's end column back into the concatenated buffer 's'.
+        end_in_s = len(s) - len(t1.line) + t1.end[1]
+        t = s[token.start[1] : end_in_s]
```

Also applies to: 175-179
tests/test_fstring/Snakefile (1)

28-38: Test update captures the regression well; suggest adding the exact minimal reproducer from #3700.

The adjusted expectation verifies that literal lines adjacent to expression braces are preserved. To lock in the specific issue (#3700: dropping a pure-literal middle line), consider adding this assertion as well:

```diff
+# Regression for issue #3700: preserve pure-literal middle line in multi-line f-strings
+assert (
+    f"""abc
+def
+{1+1}"""
+    == """abc
+def
+2"""
+)
```

I can push this as a follow-up commit if you'd like.
📒 Files selected for processing (2)
- src/snakemake/parser.py (2 hunks)
- tests/test_fstring/Snakefile (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-10-06T14:09:54.370Z
Learnt from: johanneskoester
PR: snakemake/snakemake#3117
File: tests/test_wrapper/Snakefile:11-11
Timestamp: 2024-10-06T14:09:54.370Z
Learning: Changes made within test cases, such as in `tests/test_wrapper/Snakefile`, are for testing purposes and do not require updates to the project documentation.
Applied to files:
tests/test_fstring/Snakefile
🪛 Ruff (0.12.2)
src/snakemake/parser.py
162-162: Local variable `related_lines` is assigned to but never used
Remove assignment to unused variable `related_lines`
(F841)
🔇 Additional comments (1)
src/snakemake/parser.py (1)
187-189: Correctness check: placement of `parse_fstring()` call is appropriate.

Calling `parse_fstring()` when encountering `FSTRING_START` and then replacing the last `cmd` tuple ensures `AbstractCmd` consumers receive the full f-string text as a single unit. This is key for preserving multi-line literal segments (the bug fix target). If you want to double-check across a range of f-string shapes (single-line, nested, trailing text), I can provide a small corpus and a script to diff the pre/post parser outputs.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/snakemake/parser.py (1)
139-140: Consider validating the safety comment.

The commented safety check suggests checking all tokens might be necessary. While the current approach works for the reported issue, edge cases with complex nested f-strings might benefit from this validation. Consider uncommenting the safety check or documenting why it's not needed:

```diff
-        # if t1.end[0] not in lines: for safety, check all tokens
-        lines.update(split_token_lines(t1))
+        # Update lines dict with all tokens in the f-string to ensure
+        # we capture all line content, even from overlapping tokens
+        lines.update(split_token_lines(t1))
```
📒 Files selected for processing (2)
- src/snakemake/parser.py (2 hunks)
- tests/test_fstring/Snakefile (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_fstring/Snakefile
🔇 Additional comments (5)
src/snakemake/parser.py (5)
78-91: LGTM! Well-documented helper function for token line splitting.

The new `split_token_lines` function properly handles multiline tokens by returning line number and content pairs. The docstring with the concrete example clearly illustrates the problem being solved.

136-136: Good use of dict for efficient line aggregation.

Converting the split lines to a dictionary ensures unique line numbers are maintained, which is crucial for handling overlapping tokens in f-strings.

126-152: Fix successfully addresses the multiline f-string issue.

The refactored `parse_fstring` method now properly aggregates all lines from f-string tokens, which correctly preserves literal lines that don't contain expressions. This directly addresses issue #3700 where the middle literal line "def" was being omitted. The approach of collecting all unique lines across the f-string token sequence and then reconstructing the content is sound.
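The collect-join-trim approach praised above can be sketched end to end. All line numbers, columns, and contents below are assumed for illustration, not real tokenizer output.

```python
# Collected physical lines of an f-string, keyed by line number
# (as split_token_lines would produce across the token sequence).
lines = {
    2: '    x = f"""abc\n',
    3: "def\n",
    4: '{1+1}"""  # trailing\n',
}
start = (2, 8)   # f-string begins at line 2, column 8 (after the indent and "x = ")
end = (4, 8)     # closing quotes end at line 4, column 8
last_line = lines[end[0]]

# Join lines in order, then trim to the f-string's start column and end column.
s = "".join(lines[i] for i in sorted(lines))
t = s[start[1] : end[1] - len(last_line)]
print(t)  # the full f-string text, middle literal line "def" preserved
```

The key point is that the pure-literal middle line survives because every physical line is in the dictionary, regardless of whether a token expression touched it.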
148-149: Line trimming calculation validated and correct

I've reviewed the Python 3.12 tokenize behavior and our existing multiline f-string tests (e.g. tests/test_fstring/Snakefile covers pure-literal middle lines, embedded expressions, conditional parts, and curly-brace escaping). The slice `t = s[token.start[1] : t1.end[1] - len(t1.line)]` correctly uses the start column of the FSTRING_START token and the length of the FSTRING_END token's `.line` (which includes the closing quotes and trailing newline) to trim indentation and remove only the closing quotes/newline, preserving the entire f-string literal as intended. No changes needed here.
149-149: Confirm slice logic for f-strings under Python 3.12+

`parse_fstring` is only invoked when running on Python ≥ 3.12 (guarded by `is_fstring_start`), so your current tests in an earlier interpreter won't exercise the `FSTRING_START`/`FSTRING_END` paths. Please re-run the token-position script in a Python 3.12+ environment, for example:

```shell
python3.12 - << 'EOF'
import sys, tokenize, io

test_cases = [
    'f"""line1\nline2\n{1+1}"""',
    "f'''abc\ndef\n{1+1}'''",
    'f"simple {x}"',
]

for s in test_cases:
    print(f"\nTest case: {repr(s)}")
    for tok in tokenize.generate_tokens(io.StringIO(s).readline):
        if tok.type in (tokenize.FSTRING_START, tokenize.FSTRING_MIDDLE, tokenize.FSTRING_END):
            print(f"  {tokenize.tok_name[tok.type]} start={tok.start} end={tok.end} line={repr(tok.line)}")
EOF
```

Once you have the true positions, verify that for both single- and multi-line f-strings:

- The slice end `t1.end[1] - len(t1.line)` (a negative offset) correctly trims only the closing quotes, without dropping or including extra characters.
- No off-by-one errors arise from trailing newlines or incomplete `tok.line` contents.

If you find that the negative offset approach is unreliable, you can compute an absolute end index instead:

```diff
-        t = s[token.start[1] : t1.end[1] - len(t1.line)]
+        # prefix_len = length of the first line of the f-string
+        prefix_len = len(token.line)
+        # abs_end = prefix_len + column index of the closing quote
+        abs_end = prefix_len + t1.end[1]
+        t = s[token.start[1] : abs_end]
```
@cademirch I just pushed the commits here, feel free to edit~

Thanks, guys!
🤖 I have created a release *beep* *boop*

---

## [9.11.0](v9.10.1...v9.11.0) (2025-09-05)

### Features

* add landing page to the report containing custom metadata defined with a YTE yaml template ([#3452](#3452)) ([c754d80](c754d80))
* Issue a warning when unsupported characters are used in wildcard names ([#1587](#1587)) ([fa57355](fa57355))

### Bug Fixes

* avoid checking output files in immediate-submit mode ([#3569](#3569)) ([58c42cf](58c42cf))
* clarify --keep-going flag help text to distinguish runtime vs DAG construction errors ([#3705](#3705)) ([a3a8ef4](a3a8ef4))
* enable docstring assignment in `use rule ... with:` block ([#3710](#3710)) ([2191962](2191962))
* fix missing attribute error in greedy scheduler settings when using touch, dryrun or immediate submit. ([6471004](6471004))
* only trigger script with CODE label ([#3707](#3707)) ([2d01f8c](2d01f8c))
* parser.py incorrectly parsing multiline f-strings ([#3701](#3701)) ([06ece76](06ece76))
* parsing ordinary yaml strings ([#3506](#3506)) ([ddf334e](ddf334e))
* remove temp files when using checkpoints ([#3716](#3716)) ([5ff3e20](5ff3e20))
* Restore python rules changes triggering reruns. (GH: [#3213](#3213)) ([#3485](#3485)) ([2f663be](2f663be))
* unit testing ([#3680](#3680)) ([06ba7e6](06ba7e6))
* use queue_input_wait_time when updating queue input jobs ([#3712](#3712)) ([a945a19](a945a19))

### Documentation

* output files within output directories is an error ([#2848](#2848)) ([#2913](#2913)) ([37272e5](37272e5))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Fix multiline F-Strings removing lines; fixes snakemake#3700

### QC

* [x] The PR contains a test case for the changes or the changes are already covered by an existing test case.
* [x] The documentation (`docs/`) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

## Summary by CodeRabbit

- Bug Fixes
  - Correctly parses multi-line f-strings (Python 3.12+), preventing crashes, truncation, and misaligned spans.
  - Preserves pure-literal middle lines in multi-line f-strings and handles complex/nested cases more reliably.
- Refactor
  - Reworked f-string parsing to aggregate content across full token spans for more accurate line handling without changing public APIs.
- Tests
  - Added regression test for preserving literal middle lines in multi-line f-strings (issue snakemake#3700) and updated expectations.

---------

Co-authored-by: hwrn <hwrn.aou@sjtu.edu.cn>
Co-authored-by: Aoran Hu <48747984+Hocnonsense@users.noreply.github.com>