
formatter: multi char tokens in SimpleTokenizer #5610

Merged
MichaReiser merged 12 commits into astral-sh:main from davidszotten:simple-tokenizer-multi-char-tokens
Jul 10, 2023

Conversation

@davidszotten
Contributor

No description provided.

@davidszotten force-pushed the simple-tokenizer-multi-char-tokens branch from 7af5c9c to 9671e62 on July 8, 2023 09:50
@github-actions
Contributor

github-actions bot commented Jul 8, 2023

PR Check Results

Ecosystem

✅ ecosystem check detected no changes.

Benchmark

Linux

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.00      8.3±0.02ms     4.9 MB/sec    1.01      8.4±0.02ms     4.9 MB/sec
formatter/numpy/ctypeslib.py               1.00   1777.2±2.45µs     9.4 MB/sec    1.00   1781.6±3.53µs     9.3 MB/sec
formatter/numpy/globals.py                 1.00    195.4±0.28µs    15.1 MB/sec    1.01    197.1±1.11µs    15.0 MB/sec
formatter/pydantic/types.py                1.00      4.0±0.01ms     6.3 MB/sec    1.00      4.0±0.01ms     6.3 MB/sec
linter/all-rules/large/dataset.py          1.01     14.0±0.02ms     2.9 MB/sec    1.00     13.9±0.04ms     2.9 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.01      3.5±0.01ms     4.7 MB/sec    1.00      3.5±0.02ms     4.8 MB/sec
linter/all-rules/numpy/globals.py          1.00    361.9±0.70µs     8.2 MB/sec    1.00    362.8±1.07µs     8.1 MB/sec
linter/all-rules/pydantic/types.py         1.00      6.2±0.01ms     4.1 MB/sec    1.00      6.1±0.01ms     4.1 MB/sec
linter/default-rules/large/dataset.py      1.01      7.1±0.02ms     5.7 MB/sec    1.00      7.1±0.01ms     5.8 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.02   1465.9±2.00µs    11.4 MB/sec    1.00   1443.7±2.13µs    11.5 MB/sec
linter/default-rules/numpy/globals.py      1.01    156.3±0.47µs    18.9 MB/sec    1.00    155.1±0.27µs    19.0 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.2±0.01ms     8.0 MB/sec    1.00      3.2±0.01ms     8.1 MB/sec

Windows

group                                      main                                    pr
-----                                      ----                                    --
formatter/large/dataset.py                 1.00     10.6±0.61ms     3.8 MB/sec     1.03     10.9±0.67ms     3.7 MB/sec
formatter/numpy/ctypeslib.py               1.00      2.2±0.11ms     7.5 MB/sec     1.03      2.3±0.13ms     7.3 MB/sec
formatter/numpy/globals.py                 1.00   265.9±18.58µs    11.1 MB/sec     1.01   267.3±19.84µs    11.0 MB/sec
formatter/pydantic/types.py                1.02      5.2±0.49ms     4.9 MB/sec     1.00      5.1±0.31ms     5.0 MB/sec
linter/all-rules/large/dataset.py          1.00     17.6±0.80ms     2.3 MB/sec     1.01     17.9±0.81ms     2.3 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      4.6±0.22ms     3.7 MB/sec     1.03      4.7±0.21ms     3.6 MB/sec
linter/all-rules/numpy/globals.py          1.00   554.9±29.80µs     5.3 MB/sec     1.03   573.9±38.12µs     5.1 MB/sec
linter/all-rules/pydantic/types.py         1.00      8.0±0.45ms     3.2 MB/sec     1.02      8.1±0.42ms     3.1 MB/sec
linter/default-rules/large/dataset.py      1.00      9.1±0.44ms     4.5 MB/sec     1.02      9.2±0.48ms     4.4 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00  1976.4±128.45µs     8.4 MB/sec    1.00  1985.3±130.29µs     8.4 MB/sec
linter/default-rules/numpy/globals.py      1.00   235.5±13.96µs    12.5 MB/sec     1.01   237.0±15.92µs    12.4 MB/sec
linter/default-rules/pydantic/types.py     1.01      4.1±0.23ms     6.2 MB/sec     1.00      4.1±0.17ms     6.3 MB/sec

@davidszotten davidszotten marked this pull request as ready for review July 8, 2023 10:46

c => {
let kind = TokenKind::from_non_trivia_char(c);
let kind = if self.eat_multichar("if", c) {
Member

@MichaReiser MichaReiser Jul 8, 2023


I believe this will incorrectly lex `ify` as an `if` keyword because it doesn't test what follows after. Lexing out keywords is a bit harder than I assumed.

I'm also unsure whether LLVM is able to short-circuit when the expression, for example, starts with a number.

I think the way I would implement this is to implement lexing for identifiers, and then use a match on the lexed identifier to test if it is a keyword. You'll need to add a dependency on unic-ucd-ident (used by RustPython's lexer) to detect the start/end of an identifier.

if is_identifier_start(c) {
	self.cursor.eat_while(is_identifier_continuation);

	// If we want to support all identifiers, then we need to handle the string prefixes `b'`, `rf` etc. but we can ignore those for keywords

	let token_len = self.cursor.token_len();
	let range = TextRange::at(self.offset, token_len);
	let kind = match &self.source[range] { // This requires storing the whole source on the tokenizer. I think that's fine
		"if" => TokenKind::If,
		"else" => TokenKind::Else,
		"match" => TokenKind::Match, // Match is a soft keyword that depends on the context but we can always lex it as a keyword and leave it to the caller (parser) to decide if it should be handled as an identifier or keyword.
		// ...
		_ => TokenKind::Other, // Potentially an identifier, but only if it isn't a string prefix. We can ignore this for now https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
	};
}

fn is_identifier_start(c: char) -> bool {
	c.is_ascii_alphabetic() || c == '_' || is_non_ascii_identifier_start(c) 
}

// Checks if the character c is a valid continuation character as described
// in https://docs.python.org/3/reference/lexical_analysis.html#identifiers
fn is_identifier_continuation(c: char) -> bool {
    if c.is_ascii() {
        matches!(c, 'a'..='z' | 'A'..='Z' | '_' | '0'..='9')
    } else {
        is_xid_continue(c)
    }
}

fn is_non_ascii_identifier_start(c: char) -> bool {
    is_xid_start(c)
}

Something similar should work for the backward parsing, except that it must start with an is_non_ascii_identifier_start and the last (or first, depending on how you look at the problem) must be is_identifier_start. We should be able to move the match expression into some shared function.
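The approach sketched in the review comment above can be condensed into a self-contained, ASCII-only sketch: lex a whole identifier first, then match the lexed text against the keyword set. All names here are hypothetical stand-ins for the PR's actual helpers; the real implementation additionally uses unic-ucd-ident for non-ASCII identifier characters.

```rust
// Minimal sketch of "lex an identifier, then match it against keywords".
// ASCII-only for brevity; a real lexer would also consult XID_Start /
// XID_Continue for non-ASCII characters.

#[derive(Debug, PartialEq)]
enum TokenKind {
    If,
    Else,
    Match,
    Other,
}

fn is_identifier_start(c: char) -> bool {
    c.is_ascii_alphabetic() || c == '_'
}

fn is_identifier_continuation(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Lex one word starting at `offset`; returns the kind and the end offset.
fn lex_word(source: &str, offset: usize) -> (TokenKind, usize) {
    let mut chars = source[offset..].chars();
    match chars.next() {
        Some(c) if is_identifier_start(c) => {
            // Eat the full identifier before deciding whether it is a keyword,
            // so that e.g. `ify` is never mistaken for `if`.
            let mut end = offset + c.len_utf8();
            for c in chars {
                if !is_identifier_continuation(c) {
                    break;
                }
                end += c.len_utf8();
            }
            let kind = match &source[offset..end] {
                "if" => TokenKind::If,
                "else" => TokenKind::Else,
                // Soft keyword; the caller decides identifier vs. keyword.
                "match" => TokenKind::Match,
                _ => TokenKind::Other,
            };
            (kind, end)
        }
        // Not an identifier start: hand back to the single-char tokenizer.
        _ => (TokenKind::Other, offset),
    }
}

fn main() {
    // `ify` is an identifier, not the `if` keyword followed by `y`.
    assert_eq!(lex_word("ify", 0).0, TokenKind::Other);
    assert_eq!(lex_word("if x", 0), (TokenKind::If, 2));
    println!("ok");
}
```

The key design point is that keyword detection happens on the complete lexed word, which is why the first-character fast path alone is insufficient.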

@MichaReiser
Member

Neat. Thank you for working on this. I think there's an edge case that I didn't consider initially, making this a little less simple than I assumed (we may also need to rename the SimpleTokenizer at some point; it's getting close to a full Python lexer).

@davidszotten
Contributor Author

Thanks, will have a go

@davidszotten
Contributor Author

in Unicode fun: some multi-character grapheme clusters (including one from black/simple_cases/tricky_unicode_symbols.py) now result in different tokens when parsed in reverse.

 left:  `[Token { kind: Bogus, range: 0..3 }, Token { kind: Other, range: 3..6 }]`,
 right: `[Token { kind: Other, range: 0..6 }]`

we can maybe fix this by using something like https://crates.io/crates/unicode-segmentation, but i'm not sure of the perf implications, and i guess we don't need this while we only support keyword tokens
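The asymmetry described above can be illustrated with plain std, no segmentation crate needed: a combining character is a valid identifier continuation (XID_Continue) but not a valid identifier start (XID_Start), and a reverse scan encounters it first.

```rust
fn main() {
    // "é" written as 'e' + U+0301 (combining acute accent):
    // one grapheme cluster, but two `char`s.
    let s = "e\u{301}";
    assert_eq!(s.chars().count(), 2);

    // Scanning forward, the first char seen is 'e', a valid identifier start.
    assert_eq!(s.chars().next(), Some('e'));

    // Scanning backward, the first char seen is the combining mark, which is
    // XID_Continue but not XID_Start — so a naive reverse scan that expects
    // to terminate on a start character splits the cluster into two tokens.
    assert_eq!(s.chars().next_back(), Some('\u{301}'));
    println!("ok");
}
```

This is exactly the `Bogus, Other` vs. single `Other` mismatch shown in the assertion failure above.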

TokenKind::Continuation
} else {
let kind = TokenKind::from_non_trivia_char(c);
let kind = if is_identifier_start(c) {
Member

I believe we need to change this to test if it is an id continuation because we are testing the last character and not the first

Contributor Author

ah yes. but i guess we also need to change the strategy a bit, since we must *finish with an identifier_start*? ah you mentioned this below

@davidszotten force-pushed the simple-tokenizer-multi-char-tokens branch from 5407cac to e6d9551 on July 9, 2023 11:12
Member

@MichaReiser MichaReiser left a comment


This is awesome. Well done! I'm sorry that it turned out much more complicated than I thought. Let me know when you're ready to merge.

Comment on lines +1222 to +1223
.next()
.is_none(),
"Didn't expect any other token"
token.kind() == token_kind,
Member

Nit: We can use debug_assert_eq now to compare the kinds.

token.kind() == token_kind,
"expected a `{token_kind:?}` token",
);
debug_assert!(tokens.next().is_none(), "Didn't expect any other token");
Member

Nit: We can use debug_assert_eq(..., None) here. It should give us a better error message when it throws (I believe)
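As a small sketch of the nit above (the variable names are illustrative, not the PR's actual test code), the difference is that `debug_assert_eq!` reports both compared values on failure, whereas `debug_assert!` only prints the stringified condition:

```rust
fn main() {
    let tokens: Vec<u32> = vec![];
    let mut iter = tokens.iter();

    // On failure, `debug_assert!(iter.next().is_none())` would print only the
    // condition text; `debug_assert_eq!` additionally prints the unexpected
    // token on the left-hand side, which is what makes the message better.
    debug_assert_eq!(iter.next(), None, "Didn't expect any other token");
    println!("ok");
}
```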

let token_len = self.cursor.token_len();

let range = TextRange::at(self.offset, token_len);
self.parse_identifier(range)
Member

Nit: The method name indicates to me that this parses an identifier and not a keyword. I haven't been able to come up with a name that I like, but was thinking of match_keyword or to_keyword_or_other

Contributor Author

sure, i don't mind. i also considered making it a method on TokenKind (similar to from_non_trivia_char)

Member

That would work too!

Comment on lines +451 to +455
if self.source[range]
.chars()
.next()
.is_some_and(is_identifier_start)
{
Member

Not really. You could use let mut identifier_start = c and then update the variable in the eat_back_while callback if it is an identifier_continuation but I would prefer what you have right now.

/// `match`
Match,

/// Any other non trivia token. Always has a length of 1
Contributor Author

oops i guess this is no longer true (length 1)

Member

Good find

@davidszotten
Contributor Author

ok i think that's ready from me now

@davidszotten
Contributor Author

thanks for the impl help and the review!

@MichaReiser
Member

Thank you for implementing. This is so cool :) It won't be long before our SimpleTokenizer can replace RustPython's tokenizer :P

@MichaReiser MichaReiser merged commit 1e894f3 into astral-sh:main Jul 10, 2023