Skip to content

bpo-46091: Correctly calculate indentation levels for whitespace lines with continuation characters#30130

Merged
pablogsal merged 5 commits intopython:mainfrom
pablogsal:whitespace_cont
Jan 25, 2022
Merged

bpo-46091: Correctly calculate indentation levels for whitespace lines with continuation characters#30130
pablogsal merged 5 commits intopython:mainfrom
pablogsal:whitespace_cont

Conversation

@pablogsal
Copy link
Member

@pablogsal pablogsal commented Dec 16, 2021

Copy link
Member

@gvanrossum gvanrossum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully @lysnikolaou can review the fix to the C code? It's deleting an awful lot of code.

Comment on lines +1518 to +1530
def fib(n):
\
'''Print a Fibonacci series up to n.'''
\
a, b = 0, 1
\
while a < n:
\
print(a, end=' ')
\
a, b = b, a+b
\
print()
Copy link
Member

@gvanrossum gvanrossum Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example is both too long and too short.

It doesn't really test anything more than a shorter example, say just the first three lines.

Moreover, it doesn't test that the code is parsed the same as the equivalent version without backslashes.

And it doesn't test the stated property quoted in bpo-46091 from the docs: "Indentation cannot be split over multiple physical lines using backslashes; the whitespace up to the first backslash determines the indentation."

Copy link
Member Author

@pablogsal pablogsal Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moreover, it doesn't test that the code is parsed the same as the equivalent version without backslashes.

We could check the AST, but likely you want to check that is tokenized the same as the code without backslashes, but that's is a bit more complicated to test here as we don't have direct access to the C tokenizer.

And it doesn't test the stated property quoted in bpo-46091 from the docs: "Indentation cannot be split over multiple physical lines using backslashes; the whitespace up to the first backslash determines the indentation."

Hummmm is testing the second part at least as it's checking that the whitespace up the the \ determines the identation of the next line as opposed to something like:

pass
      \
      
pass

where the \ and the \n will just produce an empty line

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't have access to the tokenizer, checking that the ASTs are the same is reasonable. In fact, you can probably just use ast.unparse() and compare the output with the equivalent without backslashes.

But this reminds me. Does tokenize.py have the same behavior?

Copy link
Member Author

@pablogsal pablogsal Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the tokenization of:

def fib(n):
    \
"""Print a Fibonacci series up to n."""
    \
a, b = 0, 1

with \:

0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'def'
1,4-1,7:            NAME           'fib'
1,7-1,8:            OP             '('
1,8-1,9:            NAME           'n'
1,9-1,10:           OP             ')'
1,10-1,11:          OP             ':'
1,11-1,12:          NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
3,0-3,39:           STRING         '"""Print a Fibonacci series up to n."""'
3,39-3,40:          NEWLINE        '\n'
5,0-5,1:            NAME           'a'
5,1-5,2:            OP             ','
5,3-5,4:            NAME           'b'
5,5-5,6:            OP             '='
5,7-5,8:            NUMBER         '0'
5,8-5,9:            OP             ','
5,10-5,11:          NUMBER         '1'
5,11-5,12:          NEWLINE        '\n'
6,0-6,0:            DEDENT         ''
6,0-6,0:            ENDMARKER      ''

and without:

0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'def'
1,4-1,7:            NAME           'fib'
1,7-1,8:            OP             '('
1,8-1,9:            NAME           'n'
1,9-1,10:           OP             ')'
1,10-1,11:          OP             ':'
1,11-1,12:          NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
2,4-2,43:           STRING         '"""Print a Fibonacci series up to n."""'
2,43-2,44:          NEWLINE        '\n'
3,4-3,5:            NAME           'a'
3,5-3,6:            OP             ','
3,7-3,8:            NAME           'b'
3,9-3,10:           OP             '='
3,11-3,12:          NUMBER         '0'
3,12-3,13:          OP             ','
3,14-3,15:          NUMBER         '1'
3,15-3,16:          NEWLINE        '\n'
4,0-4,0:            DEDENT         ''
4,0-4,0:            ENDMARKER      ''

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But unfortunately the tokenization for:

pass
        \
pass

is wrong:

0,0-0,0:            ENCODING       'utf-8'
1,0-1,4:            NAME           'pass'
1,4-1,5:            NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
3,0-3,1:            NEWLINE        '\n'
4,0-4,0:            DEDENT         ''
4,0-4,4:            NAME           'pass'
4,4-4,5:            NEWLINE        '\n'
5,0-5,0:            ENDMARKER      ''

but that should be fixed separately to this PR

@pablogsal
Copy link
Member Author

pablogsal commented Dec 16, 2021

It's deleting an awful lot of code.

Is really moving it into a function and calling that function in a new place (in addition to the old place). The important part is this:

            else if (c == '\\') {
                c = tok_continuation_line(tok);
                if (c == -1) {
                    return ERRORTOKEN;
                }
            }

which is basically skipping over pairs of \ followed by \n if the previous characters in the line are whitespace characters. And handling errors correctly (as when \ is followed by something else than \n).

…hitespace lines with continuation characters
@pablogsal
Copy link
Member Author

I have pushed a bunch of tests testing the C tokenizer (I forgot I actually exposed an API for testing long ago 😅 ) and some extra corrections. Turns out this is very tricky to get right because "the backslash and the newline just need to eliminate themselves" and "the indentation is marked by the first backlash that we see" is complicated to keep in sync.

…s for whitespace lines with continuation characters
@github-actions
Copy link

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Jan 16, 2022
@pablogsal pablogsal merged commit a0efc0c into python:main Jan 25, 2022
@pablogsal pablogsal deleted the whitespace_cont branch January 25, 2022 22:12
@miss-islington
Copy link
Contributor

Thanks @pablogsal for the PR 🌮🎉.. I'm working now to backport this PR to: 3.9, 3.10.
🐍🍒⛏🤖

@miss-islington
Copy link
Contributor

Sorry, @pablogsal, I could not cleanly backport this to 3.10 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker a0efc0c1960e2c49e0092694d98395555270914c 3.10

@miss-islington
Copy link
Contributor

Sorry @pablogsal, I had trouble checking out the 3.9 backport branch.
Please backport using cherry_picker on command line.
cherry_picker a0efc0c1960e2c49e0092694d98395555270914c 3.9

pablogsal added a commit to pablogsal/cpython that referenced this pull request Jan 25, 2022
…ce lines with continuation characters (pythonGH-30130).

(cherry picked from commit a0efc0c)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
@bedevere-bot bedevere-bot removed the needs backport to 3.10 only security fixes label Jan 25, 2022
@bedevere-bot
Copy link

GH-30898 is a backport of this pull request to the 3.10 branch.

pablogsal added a commit that referenced this pull request Jan 25, 2022
…ce lines with continuation characters (GH-30130). (GH-30898)

(cherry picked from commit a0efc0c)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
pablogsal added a commit to pablogsal/cpython that referenced this pull request Jan 25, 2022
…e lines with continuation characters (pythonGH-30130). (pythonGH-30898)

(cherry picked from commit a0efc0c)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>.
(cherry picked from commit 3fc8b74)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stale Stale PR or inactive for long period of time.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants