Store expression hierarchy in semantic model snapshots#6345
Store expression hierarchy in semantic model snapshots#6345charliermarsh merged 2 commits intomainfrom
Conversation
|
This will also fix #6285, but I'll handle that in a separate PR, I want to update the fixtures and it would be too noisy here. |
8d20f7b to
4a097bc
Compare
74a9d8f to
cb1cc8b
Compare
cb1cc8b to
98c6075
Compare
PR Check ResultsEcosystem✅ ecosystem check detected no changes. BenchmarkLinuxWindows |
|
Hmm, we don't need back references for expressions (we never need to be able to go from a node to its ID, or from a node to its parent -- we can always go from an ID to its parent), so perhaps we can improve performance here by omitting the hash map from |
|
Ok, I'm going to resolve the performance degradation in a separate PR because I think it's easier to review this as-is. |
| // the assertion is part of a larger expression. | ||
| if checker.semantic().stmt().is_expr_stmt() | ||
| && checker.semantic().expr_parent().is_none() | ||
| && !checker.semantic().scope().kind.is_lambda() |
There was a problem hiding this comment.
Yes! Not necessary. See the identical change in crates/ruff/src/rules/pandas_vet/rules/inplace_argument.rs that removes the large comment around Case 3. The fact that these tests still pass is a evidence that this change is working as expected.
| /// Stack of all visited expressions. | ||
| exprs: Nodes<'a, Expr>, | ||
|
|
||
| /// The identifier of the current expression. | ||
| expr_id: Option<NodeId>, |
There was a problem hiding this comment.
Nit: We now store both a stack of statements and expressions. This gives us almost a full fidelity parent chain, except that it isn't possible to get the parent statement of an expression (except for the current expression).
What do you think of introducing a Vec<AnyNodeRef> that stores all parent nodes (Ultimately means O(n) pointers to be written, where n = len(AST)) instead? You can retrieve the current statement by traversing upwards until you find the first is_statement() node.
The one issue of that is that there's currently no way to go from AnyNodeRef to &Stmt. We would need to introduce a AnyStatementRef enum for that.
There was a problem hiding this comment.
I will consider this separately, it makes sense to me, although it may come with the performance regression that I documented here and fixed in #6351.
There was a problem hiding this comment.
I was looking into this and I want to understand the last paragraph here. Is the suggestion that methods like SemanticModel#current_statement would now return AnyStatementRef? And so we'd change all consumers of that method to work with AnyStatementRef rather than &Stmt?
There was a problem hiding this comment.
If so, we also need AnyExpressionRef, and I'd ultimately like to store the full-fidelity chain here including Parameters, Arguments, etc...
There was a problem hiding this comment.
E.g., these kinds of changes: https://github.com/astral-sh/ruff/compare/charlie/any-node-ref?expand=1#diff-c2c8422959405016e1fb54ca23edee9716bd06ad64e610ceea9612a8291122fcR1815. It's feasible but a lot of call sites to work through, so I want to make sure I understand what's being suggested before I think on it further.
There was a problem hiding this comment.
I might propose that we do this incrementally by adding an internal type to the semantic crate that's like AnyNodeRef, but less granular:
#[derive(Debug, Clone)]
pub enum AnyNodeRef<'a> {
Stmt(&'a Stmt),
Expr(&'a Expr),
}
impl<'a> AnyNodeRef<'a> {
pub fn as_statement(&self) -> Option<&'a Stmt> {
match self {
AnyNodeRef::Stmt(stmt) => Some(stmt),
AnyNodeRef::Expr(_) => None,
}
}
}That way, we wouldn't need to change the SemanticModel interface at all for now.
| .flat_map(|id| { | ||
| self.exprs | ||
| .ancestor_ids(id) | ||
| .skip(1) |
There was a problem hiding this comment.
Why is it necessary to skip(1)? Shouldn't ancestor_ids return the ancestors only?
There was a problem hiding this comment.
Right now, ancestor_ids starts with the current ID IIRC, probably because it's convenient with the successors API. Do you find that confusing?
There was a problem hiding this comment.
(This is current behavior on main so I'll look into it separately.)
| self.expr_id | ||
| .iter() | ||
| .copied() | ||
| .flat_map(|id| { | ||
| self.exprs | ||
| .ancestor_ids(id) | ||
| .skip(1) | ||
| .map(|id| &self.exprs[id]) | ||
| }) | ||
| .copied() |
There was a problem hiding this comment.
Nit: Use successors?
| self.expr_id | |
| .iter() | |
| .copied() | |
| .flat_map(|id| { | |
| self.exprs | |
| .ancestor_ids(id) | |
| .skip(1) | |
| .map(|id| &self.exprs[id]) | |
| }) | |
| .copied() | |
| std::iter::successors(self.expr_id, |id| self.exprs.ancestor_ids(id).skip(1).map(|id| self.exprs[id]))) |
There was a problem hiding this comment.
This doesn't quite do the same thing. The awkwardness is that self.expr_id can be None.
There was a problem hiding this comment.
I'm not sure if I understand. The first argument to std::iter::successors is an Option. Both successors and expr_id.flat_mapshould short circuit if expr_id isNone`.
There was a problem hiding this comment.
Oh, I think you just had misplaced parentheses in your suggestion and I misunderstood.
|
Could we add some tests for the new introduced APIs? Or how did you test that this works as expected? |
I'm going to add some tests in a separate PR, I wanted to make sure that this had a clean snapshot suite. |
## Summary This PR fixes the performance degradation introduced in #6345. Instead of using the generic `Nodes` structs, we now use separate `Statement` and `Expression` structs. Importantly, we can avoid tracking a bunch of state for expressions that we need for parents: we don't need to track reference-to-ID pointers (we just have no use-case for this -- I'd actually like to remove this from statements too, but we need it for branch detection right now), we don't need to track depth, etc. In my testing, this entirely removes the regression on all-rules, and gets us down to 2ms slower on the default rules (as a crude hyperfine benchmark, so this is within margin of error IMO). No behavioral changes.
## Summary When we iterate over the AST for analysis, we often process nodes in a "deferred" manner. For example, if we're analyzing a function, we push the function body onto a deferred stack, along with a snapshot of the current semantic model state. Later, when we analyze the body, we restore the semantic model state from the snapshot. This ensures that we know the correct scope, hierarchy of statement parents, etc., when we go to analyze the function body. Historically, we _haven't_ included the _expression_ hierarchy in the model snapshot -- so we track the current expression parents in the visitor, but we never save and restore them when processing deferred nodes. This can lead to subtle bugs, in that methods like `expr_parent()` aren't guaranteed to be correct, if you're in a deferred visitor. This PR migrates expression tracking to mirror statement tracking exactly. So we push all expressions onto an `IndexVec`, and include the current expression on the snapshot. This ensures that `expr_parent()` and related methods are "always correct" rather than "sometimes correct". There's a performance cost here, both at runtime and in terms of memory consumption (we now store an additional pointer for every expression). In my hyperfine testing, it's about a 1% performance decrease for all-rules on CPython (up to 533.8ms, from 528.3ms) and a 4% performance decrease for default-rules on CPython (up to 212ms, from 204ms). However... I think this is worth it given the incorrectness of our current approach. In the future, we may want to reconsider how we do these upward traversals (e.g., with something like a red-green tree). (**Note**: in astral-sh#6351, the slowdown seems to be entirely removed.)
…#6351) ## Summary This PR fixes the performance degradation introduced in astral-sh#6345. Instead of using the generic `Nodes` structs, we now use separate `Statement` and `Expression` structs. Importantly, we can avoid tracking a bunch of state for expressions that we need for parents: we don't need to track reference-to-ID pointers (we just have no use-case for this -- I'd actually like to remove this from statements too, but we need it for branch detection right now), we don't need to track depth, etc. In my testing, this entirely removes the regression on all-rules, and gets us down to 2ms slower on the default rules (as a crude hyperfine benchmark, so this is within margin of error IMO). No behavioral changes.
## Summary This PR is a follow-up to the suggestion in #6345 (comment) to use a single stack to store all statements and expressions, rather than using separate vectors for each, which gives us something closer to a full-fidelity chain. (We can then generalize this concept to include all other AST nodes too.) This is in part made possible by the removal of the hash map from `&Stmt` to `StatementId` (#6694), which makes it much cheaper to store these using a single interface (since doing so no longer introduces the requirement that we hash all expressions). I'll follow-up with some profiling, but a few notes on how the data requirements have changed: - We now store a `BranchId` for every expression, not just every statement, so that's an extra `u32`. - We now store a single `NodeId` on every snapshot, rather than separate `StatementId` and `ExpressionId` IDs, so that's one fewer `u32` for each snapshot. - We're probably doing a few more lookups in general, since any calls to `current_statement()` etc. now have to iterate up the node hierarchy until they identify the first statement. ## Test Plan `cargo test`
Summary
When we iterate over the AST for analysis, we often process nodes in a "deferred" manner. For example, if we're analyzing a function, we push the function body onto a deferred stack, along with a snapshot of the current semantic model state. Later, when we analyze the body, we restore the semantic model state from the snapshot. This ensures that we know the correct scope, hierarchy of statement parents, etc., when we go to analyze the function body.
Historically, we haven't included the expression hierarchy in the model snapshot -- so we track the current expression parents in the visitor, but we never save and restore them when processing deferred nodes. This can lead to subtle bugs, in that methods like
expr_parent()aren't guaranteed to be correct, if you're in a deferred visitor.This PR migrates expression tracking to mirror statement tracking exactly. So we push all expressions onto an
IndexVec, and include the current expression on the snapshot. This ensures thatexpr_parent()and related methods are "always correct" rather than "sometimes correct".There's a performance cost here, both at runtime and in terms of memory consumption (we now store an additional pointer for every expression). In my hyperfine testing, it's about a 1% performance decrease for all-rules on CPython (up to 533.8ms, from 528.3ms) and a 4% performance decrease for default-rules on CPython (up to 212ms, from 204ms). However... I think this is worth it given the incorrectness of our current approach. In the future, we may want to reconsider how we do these upward traversals (e.g., with something like a red-green tree). (Note: in #6351, the slowdown seems to be entirely removed.)