Add branch detection to the semantic model#6694
Conversation
0863ad7 to
062aff6
Compare
| } | ||
| } | ||
| false | ||
| } |
There was a problem hiding this comment.
Glad to see this removed.
| #[derive(Debug, Default)] | ||
| pub struct Statements<'a> { | ||
| statements: IndexVec<StatementId, StatementWithParent<'a>>, | ||
| statement_to_id: FxHashMap<StatementKey, StatementId>, |
There was a problem hiding this comment.
Glad to see this go.
062aff6 to
c936bf9
Compare
PR Check ResultsEcosystem✅ ecosystem check detected no changes. BenchmarkLinuxWindows |
c936bf9 to
6dbadca
Compare
|
Small improvement on benchmarks (no decreases in performance, |
|
(I don't think any of the benchmarks even trigger the branch detection though, that's just from not hashing the statement etc. presumedly.) |
|
I replaced one of the benchmarks with a file that includes a lot of redefinitions-within-branches, small improvement: |
6dbadca to
ccf8e7f
Compare
|
Merging as a non-behavior-changing refactor that I have high confidence in and don't want to distract others by. If anyone is eager to review I'll happily address follow-up feedback. |
|
This new data structure has similarities to our data flow analysis. Do you see way how we could merge the two? Out of scope: It would be nice if we only built up these more advanced analyses if a rule consuming the data is enabled. |
| /// | ||
| /// Each of the three arms of the `if`-`elif`-`else` would be considered a branch, and would be | ||
| /// assigned their own unique [`BranchId`]. | ||
| #[newtype_index] |
There was a problem hiding this comment.
i don't like that the macro adds the field to the struct, it makes it hard to follow and hard to IDE/tool-inspect what data that struct actually is
## Summary This PR is a follow-up to the suggestion in #6345 (comment) to use a single stack to store all statements and expressions, rather than using separate vectors for each, which gives us something closer to a full-fidelity chain. (We can then generalize this concept to include all other AST nodes too.) This is in part made possible by the removal of the hash map from `&Stmt` to `StatementId` (#6694), which makes it much cheaper to store these using a single interface (since doing so no longer introduces the requirement that we hash all expressions). I'll follow-up with some profiling, but a few notes on how the data requirements have changed: - We now store a `BranchId` for every expression, not just every statement, so that's an extra `u32`. - We now store a single `NodeId` on every snapshot, rather than separate `StatementId` and `ExpressionId` IDs, so that's one fewer `u32` for each snapshot. - We're probably doing a few more lookups in general, since any calls to `current_statement()` etc. now have to iterate up the node hierarchy until they identify the first statement. ## Test Plan `cargo test`
Summary
We have a few rules that rely on detecting whether two statements are in different branches -- for example, different arms of an
if-else. Historically, the way this was implemented is that, given two statement IDs, we'd find the common parent (by traversing upwards via ourStatementsabstraction); then identify branches "manually" by matching the parents againsttry,if, andmatch, and returning iterators over the arms; then check if there's an arm for which one of the statements is a child, and the other is not.This has a few drawbacks:
First, the code is generally a bit hard to follow (Konsti mentioned this too when working on the
ElifElseClauserefactor).Second, this is the only place in the codebase where we need to go from
&StmttoStatementID-- everywhere else, we only need to go in the other direction. Supporting these lookups means we need to maintain a mapping from&StmttoStatementIDthat includes every&Stmtin the program. (We also end up maintaining adepthlevel for every statement.) I'd like to get rid of these requirements to improve efficiency, reduce complexity, and enable us to treat AST modes more generically in the future. (When I looked at adding the&Exprto our existing statement-tracking infrastructure, maintaining a hash map with all the statements noticeably hurt performance.)The solution implemented here instead makes branches a first-class concept in the semantic model. Like with
Statements, we now have aBranchesabstraction, where each branch points to its optional parent. When we store statements, we store theBranchIDalongside each statement. When we need to detect whether two statements are in the same branch, we just realize each statement's branch path and compare the two. (Assuming that the two statements are in the same scope, then they're on the same branch IFF one branch path is a subset of the other, starting from the top.) We then add some calls to the visitor to push and pop branches in the appropriate places, forif,try, andmatchstatements.Note that a branch is not 1:1 with a statement; instead, each branch is closer to a suite, but not every suite is a branch. For example, each arm in an
if-elif-elseis a branch, but theelsein aforloop is not considered a branch.In addition to being much simpler, this should also be more efficient, since we've shed the entire
&Stmthash map, plus thedepththat we track onStatementWithParentin favor of a singleOption<BranchID>onStatementWithParentplus a single vector for all branches. The lookups should be faster too, since instead of doing a bunch of jumps around with the hash map + repeated recursive calls to find the common parents, we instead just do a few simple lookups in theBranchesvector to realize and compare the branch paths.Test Plan
cargo test-- we have a lot of coverage for this, which we inherited from PyFlakes