Skip to content

Reset cursor at dedent after whitespace#11630

Merged
dhruvmanila merged 1 commit intodhruv/parser-phase-2from
dhruv/dedent-after-whitespace
May 31, 2024
Merged

Reset cursor at dedent after whitespace#11630
dhruvmanila merged 1 commit intodhruv/parser-phase-2from
dhruv/dedent-after-whitespace

Conversation

@dhruvmanila
Copy link
Member

Summary

This PR fixes a bug to reset the cursor before emitting the Dedent token which had preceding whitespaces.

Previously, we would just use TextRange::empty(self.offset()) because the range for each token would be computed at the place where it was emitted. Now, the lexer handles this centrally in next_token which means we require special handling at certain places. This is one such case.

Computing the range centrally in next_token has a benefit that the return type of lexer methods is a simple TokenKind but it does add a small complexity to handle such cases. I think it's worth doing this but if anyone thinks otherwise, feel free to comment.

Test Plan

Add a test case and update the snapshot.

@dhruvmanila dhruvmanila added bug Something isn't working parser Related to the parser labels May 31, 2024
@dhruvmanila dhruvmanila requested a review from MichaReiser as a code owner May 31, 2024 06:04
@codspeed-hq
Copy link

codspeed-hq bot commented May 31, 2024

CodSpeed Performance Report

Merging #11630 will create unknown performance changes

Comparing dhruv/dedent-after-whitespace (b600b42) with dhruv/parser-phase-2 (e8ab801)

Summary

🆕 30 new benchmarks

Benchmarks breakdown

Benchmark dhruv/parser-phase-2 dhruv/dedent-after-whitespace Change
🆕 formatter[large/dataset.py] N/A 9.3 ms N/A
🆕 formatter[numpy/ctypeslib.py] N/A 1.8 ms N/A
🆕 formatter[numpy/globals.py] N/A 240.5 µs N/A
🆕 formatter[pydantic/types.py] N/A 3.5 ms N/A
🆕 formatter[unicode/pypinyin.py] N/A 649.7 µs N/A
🆕 lexer[large/dataset.py] N/A 1.1 ms N/A
🆕 lexer[numpy/ctypeslib.py] N/A 226 µs N/A
🆕 lexer[numpy/globals.py] N/A 29.6 µs N/A
🆕 lexer[pydantic/types.py] N/A 503.2 µs N/A
🆕 lexer[unicode/pypinyin.py] N/A 76.7 µs N/A
🆕 linter/all-rules[large/dataset.py] N/A 16.1 ms N/A
🆕 linter/all-rules[numpy/ctypeslib.py] N/A 4 ms N/A
🆕 linter/all-rules[numpy/globals.py] N/A 692.6 µs N/A
🆕 linter/all-rules[pydantic/types.py] N/A 7.7 ms N/A
🆕 linter/all-rules[unicode/pypinyin.py] N/A 2 ms N/A
🆕 linter/default-rules[large/dataset.py] N/A 3.7 ms N/A
🆕 linter/default-rules[numpy/ctypeslib.py] N/A 941.4 µs N/A
🆕 linter/default-rules[numpy/globals.py] N/A 184.5 µs N/A
🆕 linter/default-rules[pydantic/types.py] N/A 1.9 ms N/A
🆕 linter/default-rules[unicode/pypinyin.py] N/A 358.3 µs N/A
... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

@github-actions
Copy link
Contributor

github-actions bot commented May 31, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

ℹ️ ecosystem check detected linter changes. (+1 -231940 violations, +0 -0 fixes in 31 projects; 19 projects unchanged)

DisnakeDev/disnake (+0 -6809 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview

- disnake/__main__.py:240:1: E117 Over-indented
- disnake/__main__.py:268:1: E113 Unexpected indentation
- disnake/__main__.py:281:1: E117 Over-indented
- disnake/__main__.py:285:1: E113 Unexpected indentation
- disnake/__main__.py:290:1: E117 Over-indented
- disnake/__main__.py:293:1: E117 Over-indented
- disnake/__main__.py:296:1: E117 Over-indented
- disnake/__main__.py:299:1: E117 Over-indented
... 4710 additional changes omitted for rule E117
- disnake/__main__.py:328:1: E113 Unexpected indentation
- disnake/__main__.py:34:1: E113 Unexpected indentation
... 6799 additional changes omitted for project

PostHog/HouseWatch (+0 -115 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview

- housewatch/api/analyze.py:102:1: E113 Unexpected indentation
- housewatch/api/analyze.py:113:1: E113 Unexpected indentation
- housewatch/api/analyze.py:124:1: E113 Unexpected indentation
- housewatch/api/analyze.py:130:1: E113 Unexpected indentation
- housewatch/api/analyze.py:133:1: E113 Unexpected indentation
- housewatch/api/analyze.py:140:1: E113 Unexpected indentation
... 45 additional changes omitted for rule E113
- housewatch/api/analyze.py:148:1: E117 Over-indented
- housewatch/api/analyze.py:250:1: E117 Over-indented
- housewatch/api/analyze.py:299:1: E117 Over-indented
- housewatch/api/analyze.py:313:1: E117 Over-indented
... 105 additional changes omitted for project

RasaHQ/rasa (+1 -6707 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview

- .github/scripts/download_pretrained.py:48:1: E113 Unexpected indentation
- .github/scripts/download_pretrained.py:50:1: E117 Over-indented
- .github/scripts/download_pretrained.py:60:1: E113 Unexpected indentation
- .github/scripts/download_pretrained.py:67:1: E113 Unexpected indentation
- .github/scripts/download_pretrained.py:88:1: E117 Over-indented
- .github/scripts/mr_generate_summary.py:54:1: E113 Unexpected indentation
- .github/scripts/mr_publish_results.py:107:1: E117 Over-indented
- .github/scripts/mr_publish_results.py:110:1: E113 Unexpected indentation
- .github/scripts/mr_publish_results.py:114:1: E117 Over-indented
- .github/scripts/mr_publish_results.py:117:1: E117 Over-indented
... 6698 additional changes omitted for project

alteryx/featuretools (+0 -2208 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview

- docs/notebook_version_standardizer.py:126:1: E117 Over-indented
- docs/notebook_version_standardizer.py:131:1: E117 Over-indented
- docs/notebook_version_standardizer.py:153:1: E117 Over-indented
- docs/notebook_version_standardizer.py:160:1: E117 Over-indented
- docs/notebook_version_standardizer.py:16:1: E117 Over-indented
- docs/notebook_version_standardizer.py:25:1: E113 Unexpected indentation
- docs/notebook_version_standardizer.py:30:1: E117 Over-indented
... 1456 additional changes omitted for rule E117
- docs/notebook_version_standardizer.py:61:1: E113 Unexpected indentation
- docs/notebook_version_standardizer.py:73:1: E113 Unexpected indentation
- featuretools/computational_backends/calculate_feature_matrix.py:172:1: E113 Unexpected indentation
... 2198 additional changes omitted for project

apache/airflow (+0 -42650 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview --select ALL

- airflow/__init__.py:109:1: E117 Over-indented
- airflow/__init__.py:118:1: E113 Unexpected indentation
- airflow/__init__.py:122:1: E117 Over-indented
- airflow/__init__.py:126:1: E113 Unexpected indentation
- airflow/__main__.py:47:1: E113 Unexpected indentation
- airflow/api/__init__.py:37:1: E117 Over-indented
- airflow/api/__init__.py:40:1: E113 Unexpected indentation
- airflow/api/__init__.py:46:1: E117 Over-indented
- airflow/api/auth/backend/kerberos_auth.py:101:1: E117 Over-indented
- airflow/api/auth/backend/kerberos_auth.py:103:1: E117 Over-indented
... 23997 additional changes omitted for rule E117
... 42640 additional changes omitted for project

aws/aws-sam-cli (+0 -10085 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview

- samcli/cli/cli_config_file.py:110:1: E117 Over-indented
- samcli/cli/cli_config_file.py:121:1: E117 Over-indented
- samcli/cli/cli_config_file.py:155:1: E117 Over-indented
- samcli/cli/cli_config_file.py:172:1: E117 Over-indented
- samcli/cli/cli_config_file.py:244:1: E113 Unexpected indentation
- samcli/cli/cli_config_file.py:310:1: E113 Unexpected indentation
- samcli/cli/cli_config_file.py:315:1: E117 Over-indented
- samcli/cli/cli_config_file.py:319:1: E113 Unexpected indentation
- samcli/cli/cli_config_file.py:323:1: E113 Unexpected indentation
- samcli/cli/cli_config_file.py:327:1: E113 Unexpected indentation
... 10075 additional changes omitted for project

bloomberg/pytest-memray (+0 -66 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview

- src/pytest_memray/marks.py:122:1: E113 Unexpected indentation
- src/pytest_memray/marks.py:144:1: E113 Unexpected indentation
- src/pytest_memray/marks.py:164:1: E113 Unexpected indentation
- src/pytest_memray/marks.py:171:1: E113 Unexpected indentation
- src/pytest_memray/marks.py:183:1: E113 Unexpected indentation
- src/pytest_memray/marks.py:214:1: E113 Unexpected indentation
... 33 additional changes omitted for rule E113
- src/pytest_memray/marks.py:217:1: E117 Over-indented
- src/pytest_memray/plugin.py:130:1: E117 Over-indented
- src/pytest_memray/plugin.py:142:1: E117 Over-indented
- src/pytest_memray/plugin.py:159:1: E117 Over-indented
... 56 additional changes omitted for project

bokeh/bokeh (+0 -6215 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview --select ALL

- docs/bokeh/docserver.py:100:1: E113 Unexpected indentation
- docs/bokeh/docserver.py:97:1: E117 Over-indented
- examples/advanced/extensions/parallel_plot/parallel_plot.py:23:1: E117 Over-indented
- examples/advanced/extensions/parallel_plot/parallel_plot.py:29:1: E113 Unexpected indentation
- examples/advanced/extensions/parallel_plot/parallel_plot.py:74:1: E113 Unexpected indentation
- examples/basic/data/server_sent_events_source.py:63:1: E113 Unexpected indentation
- examples/integration/layout/multi_plot_layout.py:13:1: E117 Over-indented
- examples/models/basic_plot.py:45:1: E113 Unexpected indentation
- examples/models/buttons.py:78:1: E113 Unexpected indentation
... 1984 additional changes omitted for rule E113
- examples/models/calendars.py:40:1: E117 Over-indented
... 6205 additional changes omitted for project

freedomofpress/securedrop (+0 -2565 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview

- admin/bootstrap.py:121:1: E117 Over-indented
- admin/bootstrap.py:126:1: E117 Over-indented
- admin/bootstrap.py:135:1: E117 Over-indented
- admin/bootstrap.py:158:1: E117 Over-indented (comment)
- admin/bootstrap.py:159:1: E117 Over-indented (comment)
- admin/bootstrap.py:160:1: E117 Over-indented (comment)
... 1743 additional changes omitted for rule E117
- admin/bootstrap.py:290:1: E113 Unexpected indentation
- admin/bootstrap.py:301:1: E113 Unexpected indentation
- admin/bootstrap.py:61:1: E113 Unexpected indentation
- admin/securedrop_admin/__init__.py:1033:1: E116 Unexpected indentation (comment)
... 2555 additional changes omitted for project

fronzbot/blinkpy (+0 -503 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview

- blinkpy/api.py:423:1: E117 Over-indented
- blinkpy/api.py:425:1: E117 Over-indented
- blinkpy/api.py:452:1: E117 Over-indented
- blinkpy/api.py:454:1: E117 Over-indented
- blinkpy/api.py:512:1: E117 Over-indented
- blinkpy/api.py:514:1: E117 Over-indented
... 332 additional changes omitted for rule E117
- blinkpy/auth.py:148:1: E113 Unexpected indentation
- blinkpy/auth.py:161:1: E113 Unexpected indentation
- blinkpy/auth.py:47:1: E113 Unexpected indentation
- blinkpy/auth.py:61:1: E113 Unexpected indentation
... 493 additional changes omitted for project

... Truncated remaining completed project reports due to GitHub comment length restrictions

Changes by rule (4 rules affected)

code total + violation - violation + fix - fix
E117 152372 0 152372 0 0
E113 73488 0 73488 0 0
E116 6080 0 6080 0 0
RUF100 1 1 0 0 0

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@dhruvmanila dhruvmanila force-pushed the dhruv/parser-phase-2 branch from 2d34e13 to e8ab801 Compare May 31, 2024 06:27
@dhruvmanila dhruvmanila requested a review from AlexWaygood as a code owner May 31, 2024 06:27
@dhruvmanila dhruvmanila force-pushed the dhruv/dedent-after-whitespace branch from 4faed6a to b600b42 Compare May 31, 2024 06:27
@dhruvmanila dhruvmanila removed the request for review from AlexWaygood May 31, 2024 06:28
Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are in total 5 usages of TextRange::empty(offset)) in the existing Lexer code. Would you mind reviewing if we need to make the same change in other places?

//
// Here, the cursor is at `^` and the `indentation` contains the whitespaces before
// the `pass` token.
self.cursor.start_token();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must admit that reading self.cursor.start_token() is hard to understand. Why are we starting a token when we're just about to return it. Your comment certainly helps. That makes me wonder if we can instead add a method on Lexer with a more meaningful name.

self.update_token_start_to_current_offset() // A bi tlong
self.reset_token_start()

but I must admit, I don't find them much better than what we have. But maybe you have an idea for a better name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree. Let me think about this.

@dhruvmanila
Copy link
Member Author

dhruvmanila commented May 31, 2024

There are in total 5 usages of TextRange::empty(offset)) in the existing Lexer code. Would you mind reviewing if we need to make the same change in other places?

I did review them, sorry for not mentioning in the PR description. This isn't needed in all places, it's only require if the lexer skips whitespaces. For posterity, the usages are:

// Return dedent tokens until the current indentation level matches the indentation of the next token.
else if let Some(indentation) = self.pending_indentation.take() {
match self.indentations.current().try_compare(indentation) {
Ok(Ordering::Greater) => {
self.pending_indentation = Some(indentation);
let offset = self.offset();
self.indentations.dedent_one(indentation).map_err(|_| {
LexicalError::new(LexicalErrorType::IndentationError, self.token_range())
})?;
return Ok((Tok::Dedent, TextRange::empty(offset)));
}

⬆️ This emits any pending dedent tokens. There are no whitespaces consumed by the lexer before this, so no need to reset the cursor.

// Next, insert a trailing newline, if required.
if !self.state.is_new_logical_line() {
self.state = State::AfterNewline;
Ok((Tok::Newline, TextRange::empty(self.offset())))
}
// Next, flush the indentation stack to zero.
else if self.indentations.dedent().is_some() {
Ok((Tok::Dedent, TextRange::empty(self.offset())))
} else {
Ok((Tok::EndOfFile, TextRange::empty(self.offset())))
}

⬆️ We already reset the cursor before this function is called ⬇️

// The lexer might've skipped whitespaces, so update the start offset
self.cursor.start_token();
if let Some(c) = self.cursor.bump() {
if c.is_ascii() {
self.consume_ascii_character(c)
} else if is_unicode_identifier_start(c) {
let identifier = self.lex_identifier(c);
self.state = State::Other;
identifier
} else {
self.push_error(LexicalError::new(
LexicalErrorType::UnrecognizedToken { tok: c },
self.token_range(),
))
}
} else {
// Reached the end of the file. Emit a trailing newline token if not at the beginning of a logical line,
// empty the dedent stack, and finally, return the EndOfFile token.
self.consume_end()
}

Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining!

@dhruvmanila dhruvmanila merged commit 8196720 into dhruv/parser-phase-2 May 31, 2024
@dhruvmanila dhruvmanila deleted the dhruv/dedent-after-whitespace branch May 31, 2024 11:04
dhruvmanila added a commit that referenced this pull request Jun 3, 2024
## Summary

This PR fixes a bug to reset the cursor before emitting the `Dedent`
token which had preceding whitespaces.

Previously, we would just use `TextRange::empty(self.offset())` because
the range for each token would be computed at the place where it was
emitted. Now, the lexer handles this centrally in `next_token` which
means we require special handling at certain places. This is one such
case.

Computing the range centrally in `next_token` has a benefit that the
return type of lexer methods is a simple `TokenKind` but it does add a
small complexity to handle such cases. I think it's worth doing this but
if anyone thinks otherwise, feel free to comment.

## Test Plan

Add a test case and update the snapshot.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working parser Related to the parser

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants