Store expression hierarchy in semantic model snapshots by charliermarsh · Pull Request #6345 · astral-sh/ruff

charliermarsh · 2023-08-04T16:23:57Z

Summary

When we iterate over the AST for analysis, we often process nodes in a "deferred" manner. For example, if we're analyzing a function, we push the function body onto a deferred stack, along with a snapshot of the current semantic model state. Later, when we analyze the body, we restore the semantic model state from the snapshot. This ensures that we know the correct scope, hierarchy of statement parents, etc., when we go to analyze the function body.

Historically, we haven't included the expression hierarchy in the model snapshot -- so we track the current expression parents in the visitor, but we never save and restore them when processing deferred nodes. This can lead to subtle bugs, in that methods like expr_parent() aren't guaranteed to be correct, if you're in a deferred visitor.

This PR migrates expression tracking to mirror statement tracking exactly. So we push all expressions onto an IndexVec, and include the current expression on the snapshot. This ensures that expr_parent() and related methods are "always correct" rather than "sometimes correct".

There's a performance cost here, both at runtime and in terms of memory consumption (we now store an additional pointer for every expression). In my hyperfine testing, it's about a 1% performance decrease for all-rules on CPython (up to 533.8ms, from 528.3ms) and a 4% performance decrease for default-rules on CPython (up to 212ms, from 204ms). However... I think this is worth it given the incorrectness of our current approach. In the future, we may want to reconsider how we do these upward traversals (e.g., with something like a red-green tree). (Note: in #6351, the slowdown seems to be entirely removed.)

charliermarsh · 2023-08-04T16:24:39Z

This will also fix #6285, but I'll handle that in a separate PR, I want to update the fixtures and it would be too noisy here.

github-actions · 2023-08-04T16:50:06Z

PR Check Results

Ecosystem

✅ ecosystem check detected no changes.

Benchmark

Linux

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.00      8.2±0.15ms     4.9 MB/sec    1.00      8.2±0.13ms     5.0 MB/sec
formatter/numpy/ctypeslib.py               1.00  1632.6±39.05µs    10.2 MB/sec    1.00  1634.7±39.80µs    10.2 MB/sec
formatter/numpy/globals.py                 1.00    182.8±6.00µs    16.1 MB/sec    1.01    184.1±7.57µs    16.0 MB/sec
formatter/pydantic/types.py                1.01      3.5±0.09ms     7.4 MB/sec    1.00      3.4±0.05ms     7.4 MB/sec
linter/all-rules/large/dataset.py          1.00     11.0±0.07ms     3.7 MB/sec    1.02     11.3±0.16ms     3.6 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      2.8±0.01ms     6.0 MB/sec    1.00      2.8±0.01ms     6.0 MB/sec
linter/all-rules/numpy/globals.py          1.00    384.1±0.57µs     7.7 MB/sec    1.01    387.1±0.66µs     7.6 MB/sec
linter/all-rules/pydantic/types.py         1.00      5.0±0.05ms     5.1 MB/sec    1.02      5.1±0.03ms     5.0 MB/sec
linter/default-rules/large/dataset.py      1.00      5.4±0.03ms     7.6 MB/sec    1.02      5.5±0.03ms     7.4 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00   1139.3±5.59µs    14.6 MB/sec    1.02   1157.0±6.87µs    14.4 MB/sec
linter/default-rules/numpy/globals.py      1.00    127.5±1.16µs    23.1 MB/sec    1.03    131.1±3.65µs    22.5 MB/sec
linter/default-rules/pydantic/types.py     1.00      2.4±0.01ms    10.7 MB/sec    1.02      2.4±0.02ms    10.4 MB/sec

Windows

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.00     13.2±0.70ms     3.1 MB/sec    1.00     13.3±0.73ms     3.1 MB/sec
formatter/numpy/ctypeslib.py               1.06      2.7±0.29ms     6.2 MB/sec    1.00      2.6±0.16ms     6.5 MB/sec
formatter/numpy/globals.py                 1.00   293.5±33.83µs    10.1 MB/sec    1.01   297.9±56.58µs     9.9 MB/sec
formatter/pydantic/types.py                1.00      5.5±0.32ms     4.7 MB/sec    1.01      5.5±0.27ms     4.6 MB/sec
linter/all-rules/large/dataset.py          1.00     18.3±0.78ms     2.2 MB/sec    1.04     19.1±0.86ms     2.1 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      4.7±0.27ms     3.5 MB/sec    1.02      4.9±0.26ms     3.4 MB/sec
linter/all-rules/numpy/globals.py          1.00   582.3±30.81µs     5.1 MB/sec    1.00   579.5±40.79µs     5.1 MB/sec
linter/all-rules/pydantic/types.py         1.03      8.6±0.47ms     3.0 MB/sec    1.00      8.3±0.52ms     3.1 MB/sec
linter/default-rules/large/dataset.py      1.02     10.0±0.63ms     4.1 MB/sec    1.00      9.8±0.45ms     4.2 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00  1922.4±98.52µs     8.7 MB/sec    1.01  1932.4±77.76µs     8.6 MB/sec
linter/default-rules/numpy/globals.py      1.00   216.2±19.84µs    13.6 MB/sec    1.06   228.8±25.06µs    12.9 MB/sec
linter/default-rules/pydantic/types.py     1.00      4.2±0.26ms     6.0 MB/sec    1.06      4.5±0.39ms     5.7 MB/sec

charliermarsh · 2023-08-04T18:07:14Z

Hmm, we don't need back references for expressions (we never need to be able to go from a node to its ID, or from a node to its parent -- we can always go from an ID to its parent), so perhaps we can improve performance here by omitting the hash map from Nodes in that case. I will test it out.

charliermarsh · 2023-08-04T18:36:04Z

Ok, I'm going to resolve the performance degradation in a separate PR because I think it's easier to review this as-is.

MichaReiser · 2023-08-07T07:14:10Z

crates/ruff/src/rules/flake8_pytest_style/rules/assertion.rs

                    // the assertion is part of a larger expression.
                    if checker.semantic().stmt().is_expr_stmt()
                        && checker.semantic().expr_parent().is_none()
-                        && !checker.semantic().scope().kind.is_lambda()


Is this change intentional?

Yes! Not necessary. See the identical change in crates/ruff/src/rules/pandas_vet/rules/inplace_argument.rs that removes the large comment around Case 3. The fact that these tests still pass is a evidence that this change is working as expected.

MichaReiser · 2023-08-07T07:17:49Z

crates/ruff_python_semantic/src/model.rs

+    /// Stack of all visited expressions.
+    exprs: Nodes<'a, Expr>,
+
+    /// The identifier of the current expression.
+    expr_id: Option<NodeId>,


Nit: We now store both a stack of statements and expressions. This gives us almost a full fidelity parent chain, except that it isn't possible to get the parent statement of an expression (except for the current expression).

What do you think of introducing a Vec<AnyNodeRef> that stores all parent nodes (Ultimately means O(n) pointers to be written, where n = len(AST)) instead? You can retrieve the current statement by traversing upwards until you find the first is_statement() node.

The one issue of that is that there's currently no way to go from AnyNodeRef to &Stmt. We would need to introduce a AnyStatementRef enum for that.

I will consider this separately, it makes sense to me, although it may come with the performance regression that I documented here and fixed in #6351.

I was looking into this and I want to understand the last paragraph here. Is the suggestion that methods like SemanticModel#current_statement would now return AnyStatementRef? And so we'd change all consumers of that method to work with AnyStatementRef rather than &Stmt?

If so, we also need AnyExpressionRef, and I'd ultimately like to store the full-fidelity chain here including Parameters, Arguments, etc...

E.g., these kinds of changes: https://github.com/astral-sh/ruff/compare/charlie/any-node-ref?expand=1#diff-c2c8422959405016e1fb54ca23edee9716bd06ad64e610ceea9612a8291122fcR1815. It's feasible but a lot of call sites to work through, so I want to make sure I understand what's being suggested before I think on it further.

I might propose that we do this incrementally by adding an internal type to the semantic crate that's like AnyNodeRef, but less granular:

#[derive(Debug, Clone)] pub enum AnyNodeRef<'a> { Stmt(&'a Stmt), Expr(&'a Expr), } impl<'a> AnyNodeRef<'a> { pub fn as_statement(&self) -> Option<&'a Stmt> { match self { AnyNodeRef::Stmt(stmt) => Some(stmt), AnyNodeRef::Expr(_) => None, } } }

That way, we wouldn't need to change the SemanticModel interface at all for now.

MichaReiser · 2023-08-07T07:19:34Z

crates/ruff_python_semantic/src/model.rs

+            .flat_map(|id| {
+                self.exprs
+                    .ancestor_ids(id)
+                    .skip(1)


Why is it necessary to skip(1)? Shouldn't ancestor_ids return the ancestors only?

Right now, ancestor_ids starts with the current ID IIRC, probably because it's convenient with the successors API. Do you find that confusing?

(This is current behavior on main so I'll look into it separately.)

MichaReiser · 2023-08-07T07:20:32Z

crates/ruff_python_semantic/src/model.rs

+        self.expr_id
+            .iter()
+            .copied()
+            .flat_map(|id| {
+                self.exprs
+                    .ancestor_ids(id)
+                    .skip(1)
+                    .map(|id| &self.exprs[id])
+            })
+            .copied()


Nit: Use successors?

Suggested change

self.expr_id

.iter()

.copied()

.flat_map(|id| {

self.exprs

.ancestor_ids(id)

.skip(1)

.map(|id| &self.exprs[id])

})

.copied()

std::iter::successors(self.expr_id, |id| self.exprs.ancestor_ids(id).skip(1).map(|id| self.exprs[id])))

This doesn't quite do the same thing. The awkwardness is that self.expr_id can be None.

I'm not sure if I understand. The first argument to std::iter::successors is an Option. Both successors and expr_id.flat_mapshould short circuit if expr_id isNone`.

Oh, I think you just had misplaced parentheses in your suggestion and I misunderstood.

MichaReiser · 2023-08-07T07:21:54Z

Could we add some tests for the new introduced APIs? Or how did you test that this works as expected?

charliermarsh · 2023-08-07T13:37:49Z

Could we add some tests for the new introduced APIs? Or how did you test that this works as expected?

I'm going to add some tests in a separate PR, I wanted to make sure that this had a clean snapshot suite.

## Summary This PR fixes the performance degradation introduced in #6345. Instead of using the generic `Nodes` structs, we now use separate `Statement` and `Expression` structs. Importantly, we can avoid tracking a bunch of state for expressions that we need for parents: we don't need to track reference-to-ID pointers (we just have no use-case for this -- I'd actually like to remove this from statements too, but we need it for branch detection right now), we don't need to track depth, etc. In my testing, this entirely removes the regression on all-rules, and gets us down to 2ms slower on the default rules (as a crude hyperfine benchmark, so this is within margin of error IMO). No behavioral changes.

## Summary When we iterate over the AST for analysis, we often process nodes in a "deferred" manner. For example, if we're analyzing a function, we push the function body onto a deferred stack, along with a snapshot of the current semantic model state. Later, when we analyze the body, we restore the semantic model state from the snapshot. This ensures that we know the correct scope, hierarchy of statement parents, etc., when we go to analyze the function body. Historically, we _haven't_ included the _expression_ hierarchy in the model snapshot -- so we track the current expression parents in the visitor, but we never save and restore them when processing deferred nodes. This can lead to subtle bugs, in that methods like `expr_parent()` aren't guaranteed to be correct, if you're in a deferred visitor. This PR migrates expression tracking to mirror statement tracking exactly. So we push all expressions onto an `IndexVec`, and include the current expression on the snapshot. This ensures that `expr_parent()` and related methods are "always correct" rather than "sometimes correct". There's a performance cost here, both at runtime and in terms of memory consumption (we now store an additional pointer for every expression). In my hyperfine testing, it's about a 1% performance decrease for all-rules on CPython (up to 533.8ms, from 528.3ms) and a 4% performance decrease for default-rules on CPython (up to 212ms, from 204ms). However... I think this is worth it given the incorrectness of our current approach. In the future, we may want to reconsider how we do these upward traversals (e.g., with something like a red-green tree). (**Note**: in astral-sh#6351, the slowdown seems to be entirely removed.)

…#6351) ## Summary This PR fixes the performance degradation introduced in astral-sh#6345. Instead of using the generic `Nodes` structs, we now use separate `Statement` and `Expression` structs. Importantly, we can avoid tracking a bunch of state for expressions that we need for parents: we don't need to track reference-to-ID pointers (we just have no use-case for this -- I'd actually like to remove this from statements too, but we need it for branch detection right now), we don't need to track depth, etc. In my testing, this entirely removes the regression on all-rules, and gets us down to 2ms slower on the default rules (as a crude hyperfine benchmark, so this is within margin of error IMO). No behavioral changes.

## Summary This PR is a follow-up to the suggestion in #6345 (comment) to use a single stack to store all statements and expressions, rather than using separate vectors for each, which gives us something closer to a full-fidelity chain. (We can then generalize this concept to include all other AST nodes too.) This is in part made possible by the removal of the hash map from `&Stmt` to `StatementId` (#6694), which makes it much cheaper to store these using a single interface (since doing so no longer introduces the requirement that we hash all expressions). I'll follow-up with some profiling, but a few notes on how the data requirements have changed: - We now store a `BranchId` for every expression, not just every statement, so that's an extra `u32`. - We now store a single `NodeId` on every snapshot, rather than separate `StatementId` and `ExpressionId` IDs, so that's one fewer `u32` for each snapshot. - We're probably doing a few more lookups in general, since any calls to `current_statement()` etc. now have to iterate up the node hierarchy until they identify the first statement. ## Test Plan `cargo test`

Store expressions in an IndexVec

9832d6f

charliermarsh force-pushed the charlie/expressions branch from 8d20f7b to 4a097bc Compare August 4, 2023 16:24

charliermarsh requested a review from MichaReiser August 4, 2023 16:29

charliermarsh force-pushed the charlie/expressions branch 2 times, most recently from 74a9d8f to cb1cc8b Compare August 4, 2023 16:36

Store expressions in snapshot

98c6075

charliermarsh force-pushed the charlie/expressions branch from cb1cc8b to 98c6075 Compare August 4, 2023 16:38

charliermarsh mentioned this pull request Aug 4, 2023

Use separate structs for expression and statement tracking #6351

Merged

charliermarsh added the internal An internal refactor or improvement label Aug 4, 2023

MichaReiser approved these changes Aug 7, 2023

View reviewed changes

charliermarsh merged commit 89e4e03 into main Aug 7, 2023

charliermarsh deleted the charlie/expressions branch August 7, 2023 13:42

charliermarsh mentioned this pull request Aug 7, 2023

Extend nested union detection to handle bitwise or Union expressions #6399

Merged

charliermarsh mentioned this pull request Aug 21, 2023

Use a single node hierarchy to track statements and expressions #6709

Merged

Conversation

charliermarsh commented Aug 4, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

charliermarsh commented Aug 4, 2023

Uh oh!

github-actions bot commented Aug 4, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Check Results

Ecosystem

Benchmark

Linux

Windows

Uh oh!

charliermarsh commented Aug 4, 2023

Uh oh!

charliermarsh commented Aug 4, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MichaReiser commented Aug 7, 2023

Uh oh!

charliermarsh commented Aug 7, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

charliermarsh commented Aug 4, 2023 •

edited

Loading

github-actions bot commented Aug 4, 2023 •

edited

Loading