[WIP / RFC] Cranelift: Basic support for EGraph roundtripping. #4249
Conversation
This is a work-in-progress, and meant to sketch the direction I've been thinking in for a mid-end framework. A proper BA RFC will come soon.

This PR builds a phase in the optimization pipeline that converts a CLIF CFG into an egraph representing the function body. Each node represents an original instruction or operator. The "skeleton" of side-effecting instructions is retained, but non-side-effecting (pure) operators are allowed to "float": the egraph will naturally deduplicate them during the build, and we determine their proper placement when we convert back to a CFG representation.

The conversion from the egraph back to the CFG is done via a new algorithm I call "scoped elaboration". The basic idea is to do a preorder traversal of the domtree and, at each level, evaluate the values of the eclasses called upon by the side-effect skeleton, memoizing in an eclass-to-SSA-value map. This map is a scoped hashmap, with a scope at each domtree level. In this way, (i) when a value is computed in a location that dominates another instance of that value, the first replaces the second; but (ii) we never produce "partially dead" computations, i.e., we never hoist to a level in the domtree where a node is not "anticipated" (always eventually computed). This exactly matches what GVN does today. With a small tweak, it can also subsume LICM: we need to be loop-nest-aware in our recursive eclass elaboration, and potentially place nodes higher up the domtree (and higher up in the scoped hashmap).

Unlike what I had been thinking in Monday's meeting, this produces CLIF out of the egraph and then allows that to be lowered. It's overall simpler and a better starting point (thanks @abrown for tipping me over the edge on this). The way it produces CLIF now could be made more efficient: it could reuse instructions already in the DFG for nodes that are *not* duplicated (likely most of them) rather than clearing everything and repopulating.
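A minimal sketch of the scoped-elaboration idea, with toy stand-in types (`ScopedMap`, `elaborate`, and all names here are illustrative, not this PR's actual API): a preorder domtree walk pushes a scope per level, memoizes eclass-to-value bindings, and reuses a binding only when it was created in a dominating (i.e., still-open) scope.

```rust
use std::collections::HashMap;

// Toy stand-ins for the real CLIF/egraph types; purely illustrative.
type EclassId = u32;
type Value = u32;
type Block = usize;

/// A scoped hashmap: each domtree level pushes a scope on entry and
/// pops it on exit, forgetting bindings that only dominate the subtree.
struct ScopedMap {
    map: HashMap<EclassId, Value>,
    scopes: Vec<Vec<EclassId>>, // keys inserted per scope, for undo
}

impl ScopedMap {
    fn new() -> Self {
        ScopedMap { map: HashMap::new(), scopes: Vec::new() }
    }
    fn push_scope(&mut self) {
        self.scopes.push(Vec::new());
    }
    fn pop_scope(&mut self) {
        for key in self.scopes.pop().unwrap() {
            self.map.remove(&key);
        }
    }
    fn get(&self, key: EclassId) -> Option<Value> {
        self.map.get(&key).copied()
    }
    fn insert(&mut self, key: EclassId, value: Value) {
        self.map.insert(key, value);
        self.scopes.last_mut().unwrap().push(key);
    }
}

/// Preorder domtree walk: for each block, elaborate the eclasses its
/// side-effect skeleton demands. A map hit means a dominating block
/// already computed the value (this is GVN); a miss materializes it
/// here, so nothing is ever hoisted to a non-dominating position.
fn elaborate(
    block: Block,
    children: &[Vec<Block>],
    demanded: &[Vec<EclassId>],
    map: &mut ScopedMap,
    emitted: &mut Vec<(Block, EclassId)>,
    next_value: &mut Value,
) {
    map.push_scope();
    for &eclass in &demanded[block] {
        if map.get(eclass).is_none() {
            emitted.push((block, eclass));
            map.insert(eclass, *next_value);
            *next_value += 1;
        }
    }
    for &child in &children[block] {
        elaborate(child, children, demanded, map, emitted, next_value);
    }
    map.pop_scope();
}
```

With a domtree 0 → {1, 2} where the sibling blocks 1 and 2 both demand the same eclass, the value is materialized in each sibling rather than hoisted into block 0, matching the "no partially dead computations" property.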
This PR does *not* do anything to actually rewrite in the egraph. That's the next step! I need to work out exactly how to integrate ISLE with some sort of rewrite machinery. I have some ideas about efficient dispatch with an "operand-tree discriminants shape analysis" on the egraph and indexing rules by their matched shape; more to come.
fitzgen left a comment:
Super excited about this and very pleasantly surprised by how small the changes here ended up being (of course, no rules yet, but nonetheless!)
A few nitpicks below.
cranelift/codegen/src/egg.rs (outdated)

    /// subgraph of the egraph that we use for codegen.
    ///
    /// To avoid cycles, we do a cycle-finding DFS as part of
    /// extraction that disqualifies enodes (removes them from
    @@ -0,0 +1,183 @@
    //! Elaboration phase: lowers EGraph back to sequences of operations
    //! in CFG nodes.
I think this would really benefit from an overview of the scoped elaboration algorithm up here at the top.
| } | ||
|
|
||
| for child in domtree.children(block) { | ||
| self.elaborate_block(child, block_params_fn, block_roots_fn, domtree); |
I think this should probably be rewritten to use iteration + an explicit work stack, instead of stack recursion. Just to be more robust in the face of malicious input. I don't think we have any stack recursion anywhere else in Cranelift, afaik.
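The suggested recursion-to-iteration rewrite can be sketched with an explicit stack of visit/leave events, so per-level work (such as pushing and popping a scope) still happens in the right order. Names here are illustrative, not the PR's code:

```rust
/// Work items: enter a block (preorder visit) or leave it (after its
/// dominated subtree is done).
enum Step {
    Visit(usize),
    Leave(usize),
}

/// Iterative preorder over a domtree with explicit enter/leave hooks,
/// avoiding native stack recursion on deep (or malicious) input.
fn preorder_with_scopes(
    root: usize,
    children: &[Vec<usize>],
    mut enter: impl FnMut(usize),
    mut leave: impl FnMut(usize),
) {
    let mut stack = vec![Step::Visit(root)];
    while let Some(step) = stack.pop() {
        match step {
            Step::Visit(block) => {
                enter(block); // e.g., push a scoped-map scope here
                stack.push(Step::Leave(block));
                // Push children reversed so they pop in original order.
                for &child in children[block].iter().rev() {
                    stack.push(Step::Visit(child));
                }
            }
            Step::Leave(block) => leave(block), // e.g., pop the scope
        }
    }
}
```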
Yes for sure, this is the plan!
Incidentally we'll probably want to audit egg itself for this property too...
I know some key parts of egg are structured as a workqueue but I haven't audited it to check if that's true everywhere. In fact there are APIs (such as the Pattern and CostFunction traits) that as I recall have default implementations that are recursive, but can be implemented iteratively if the client code puts in the work.
    if matches!(node, Node::Param { .. }) {
        unreachable!("Param nodes should already be inserted");
    }
This is kinda funky, would be better IMO as
Suggested change:

    assert!(
        !matches!(node, Node::Param { .. }),
        "Param nodes should already be inserted",
    );
which is a little funky because of the `!matches!(...)` double negation, but there could also be a `Node::is_param` method to clean that up.
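A hypothetical `Node::is_param` helper along those lines (the trimmed-down `Node` enum here is purely illustrative):

```rust
// Hypothetical, trimmed-down Node enum purely for illustration.
enum Node {
    Param { index: u32 },
    Pure { op: &'static str },
}

impl Node {
    /// Reads better at call sites than `!matches!(node, Node::Param { .. })`.
    fn is_param(&self) -> bool {
        matches!(self, Node::Param { .. })
    }
}
```

The assertion then becomes `assert!(!node.is_param(), "Param nodes should already be inserted")`.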
    // Is the node a result projection? If so, at this point we
    // have everything we need; no need to allocate a new Value
    // for the result.
    if let Node::Result { value, result, .. } = node {
Nitpick-y for sure, but I'm not a huge fan of the name Result for this type of node. I think Projection would be an improvement and Pick or something along those lines might be even better.
cranelift/codegen/src/egg/extract.rs (outdated)

    let mut best_cost_and_node = None;
    for (i, node) in egraph[id].nodes.iter().enumerate() {
        let this_cost = self.visit_enode(egraph, node);
Similarly, the mutual recursion between visit_enode and visit_eclass should be rewritten with iteration and an explicit stack before we consider merging this.
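One way to avoid the mutual recursion entirely is a fixpoint ("relax until no change") formulation of extraction. This is a sketch over assumed toy types, not the PR's code; a nice side effect is that enodes on a cycle never acquire a finite cost, so cyclic choices are disqualified without a separate DFS:

```rust
/// Toy flattened egraph for illustration: each eclass is a list of
/// enodes; each enode has an intrinsic cost plus child eclasses.
struct ENode {
    cost: u64,
    children: Vec<usize>,
}

/// Fixpoint extraction: iteratively pick the cheapest enode per
/// eclass. No recursion, so no native-stack exhaustion risk.
fn extract(classes: &[Vec<ENode>]) -> Vec<Option<(usize, u64)>> {
    let mut best: Vec<Option<(usize, u64)>> = vec![None; classes.len()];
    let mut changed = true;
    while changed {
        changed = false;
        for (id, nodes) in classes.iter().enumerate() {
            for (i, node) in nodes.iter().enumerate() {
                // An enode is usable once every child eclass has a cost.
                let child_sum: Option<u64> = node
                    .children
                    .iter()
                    .map(|&c| best[c].map(|(_, k)| k))
                    .sum();
                if let Some(sum) = child_sum {
                    let total = node.cost + sum;
                    if best[id].map_or(true, |(_, k)| total < k) {
                        best[id] = Some((i, total));
                        changed = true;
                    }
                }
            }
        }
    }
    best
}
```

Each pass only lowers costs and costs are bounded below, so the loop terminates; a self-referential eclass with no acyclic alternative ends up with no pick at all.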
cranelift/codegen/src/egg/extract.rs (outdated)

    @@ -0,0 +1,90 @@
    //! Extraction phase: pick one enode per eclass, avoiding loops.
This would also benefit from a description of the algorithm here.
| } | ||
|
|
||
| /// Insert a key-value pair if absent, panicking otherwise. | ||
| pub fn insert_if_absent(&mut self, key: K, value: V) { |
Ahhhh okay, so this panics if it is already present. That's a bit confusing, since given the method name I would assume that this silently returns if the entry already exists.
Maybe call this `insert_and_assert_absent` or even just `insert_absent`?
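For illustration, a sketch of the semantics under the proposed rename (the `StrictMap` wrapper and its definition are hypothetical, not the PR's code):

```rust
use std::collections::hash_map::{Entry, HashMap};
use std::fmt::Debug;
use std::hash::Hash;

/// Hypothetical wrapper, only to illustrate the proposed rename:
/// insertion asserts the key is absent; a present entry panics.
struct StrictMap<K, V>(HashMap<K, V>);

impl<K: Hash + Eq + Debug, V> StrictMap<K, V> {
    fn new() -> Self {
        StrictMap(HashMap::new())
    }

    /// The name now says that absence is *asserted*, not that an
    /// existing entry is silently skipped.
    fn insert_and_assert_absent(&mut self, key: K, value: V) {
        match self.0.entry(key) {
            Entry::Vacant(e) => {
                e.insert(value);
            }
            Entry::Occupied(e) => panic!("key already present: {:?}", e.key()),
        }
    }
}
```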
    test optimize
    set opt_level=none
    set use_egraphs=true
Probably worth having a `test egraphs` mode in the long run, since this combination of settings is a bit of a magical incantation.
    ; check: block0(v0: i32, v1: i32):
    ; nextln: v2 = iadd v0, v1
    ; check: block1:
This seems to have had the redundant-phi-elimination pass run on it? Because the egraph isn't doing that yet, right? Another reason to have `test egraph`, so that we aren't pulling in incidental/unrelated changes to these tests.
Yes, I think so. I think we'll actually want to do something equivalent to redundant-phis as a rewrite rule eventually (you're probably thinking along these lines too :-) ); I need to think through how exactly to do this best without introducing cycles in the egraph from the blockparam nodes (or maybe we do, and just ignore during elaboration).
How fast is this code? How much would compile time increase?

I haven't measured yet but I will soon. There are some optimizations I want to do first (e.g., make the egraph-to-CLIF lowering work without rebuilding the whole function body, reusing existing …).
This moves to a strategy based on `hashbrown::raw::RawTable` and "eq-with-context" / "hash-with-context" traits, used to allow nodes to be stored once in a `BumpVec` (which encapsulates a range in a shared `Vec`) and then used in an eclass, with keys in the deduplication hashtable referring to an eclass and enode index in that eclass. This moves back to the enodes-without-IDs strategy used in `egg`, which makes the graph rebuild simpler, but retains the single-storage / no-cloning property.

Along the way, by carrying the eq-with-context / hash-with-context idea through to the `Node` itself, we can remove the `&mut [Id]` and `&mut [Type]` slices and replace them with `BumpVec`s. They could actually be made `NonGrowingBumpVec`s for 8 bytes instead of 12; that is future work.

This should allow algorithmic approaches equivalent to `egg`'s but without the separate allocations; all node allocations are now within the entity-component-system-style large `Vec`s.
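A sketch of the eq-with-context / hash-with-context idea with toy types (trait and type names here are illustrative, not necessarily the commit's exact definitions): because a node only stores a range into a shared pool, equality and hashing must be handed the pool as context.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy node pool: nodes store (start, len) ranges into a shared arg
// vector, so a node alone cannot answer equality or hashing.
struct NodePool {
    args: Vec<u32>,
}

#[derive(Clone, Copy)]
struct Node {
    op: u8,
    args_start: usize,
    args_len: usize,
}

/// Equality that needs a context (here, the pool) to dereference
/// the stored indices.
trait CtxEq<Ctx> {
    fn ctx_eq(&self, other: &Self, ctx: &Ctx) -> bool;
}

/// Hashing with the same context, kept consistent with `CtxEq`.
trait CtxHash<Ctx> {
    fn ctx_hash<H: Hasher>(&self, state: &mut H, ctx: &Ctx);
}

impl CtxEq<NodePool> for Node {
    fn ctx_eq(&self, other: &Self, ctx: &NodePool) -> bool {
        self.op == other.op
            && ctx.args[self.args_start..self.args_start + self.args_len]
                == ctx.args[other.args_start..other.args_start + other.args_len]
    }
}

impl CtxHash<NodePool> for Node {
    fn ctx_hash<H: Hasher>(&self, state: &mut H, ctx: &NodePool) {
        self.op.hash(state);
        ctx.args[self.args_start..self.args_start + self.args_len].hash(state);
    }
}
```

Two nodes pointing at different ranges with identical contents then compare equal and hash identically, which is exactly what a deduplication table keyed on (eclass, enode index) needs.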
…stimated node count
…Slice` instead). This brings the overhead of `wasmtime compile spidermonkey.wasm` with `use_egraphs=true` and `opt_level=none` to 4.3% slower than `opt_level=speed`. (This is an interesting comparison because the egraph build/extract roundtrip subsumes GVN, the main time-sink in today's opt pass.)
…parate ISLE environment.
Lots of TODOs still. The main one is how to deal with *multiple* rewrites arising from a single rule invocation on an eclass. I think what needs to happen is that we have multi-ctors as well as multi-etors; these return a `SmallVec<[T; 8]>` or something of the sort. So then the `simplify` toplevel returns a list of eclass IDs it has built that are equivalent to the original. This means that we then need to have two different ABIs for multi-terms: "eager" (internal multi-ctor) and "lazy" (external multi-etor). But this is workable I think. The egraph implementation needs much more stress-testing as well!
This is necessary to allow the `simplify` rule in the mid-end to return multiple Ids of new e-classes that are equivalent to the original. In general, multiplicity is a property of an extractor or constructor (namely, a single match returns multiple results/tuples), not just an external extractor.
…ops via generation numbers
…ive; working on improving amode lowering.
The ISLE language's lexer previously used a very primitive `i64::from_str_radix` call to parse integer constants, allowing values only in the range -2^63..2^63. Underscores to separate digits (as is allowed in Rust) were not supported, and 128-bit constants were not supported at all. This PR addresses all of the issues above:
- Integer constants are internally stored as 128-bit values.
- Parsing supports either the signed (-2^127..2^127) or the unsigned (0..2^128) range. Negation works independently of that, so one can write `-0xffff..ffff` (128 bits wide, i.e., -(2^128-1)) to get a `1`.
- Underscores are supported to separate groups of digits, so one can write `0xffff_ffff`.
- A minor oversight was fixed: hex constants can start with `0X` (uppercase) as well as `0x`, for consistency with Rust and C.

This PR also adds a new kind of ISLE test that actually runs a driver linked against compiled ISLE code; we previously had no such tests, and they are now quite useful for asserting correct interpretation of constant values.
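The parsing rules above can be sketched as a simplified stand-alone function (not the actual lexer code): strip digit-separating underscores, accept `0x`/`0X` prefixes, store the magnitude in 128 bits, and apply negation by two's-complement wrapping so that negating 2^128 - 1 yields 1.

```rust
/// Simplified sketch of the described constant parsing; returns the
/// 128-bit two's-complement bit pattern of the constant.
fn parse_const(text: &str) -> Option<u128> {
    let (neg, body) = match text.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, text),
    };
    // Underscores are digit separators and carry no value.
    let digits: String = body.chars().filter(|&c| c != '_').collect();
    let magnitude = if let Some(hex) = digits
        .strip_prefix("0x")
        .or_else(|| digits.strip_prefix("0X"))
    {
        u128::from_str_radix(hex, 16).ok()?
    } else {
        digits.parse::<u128>().ok()?
    };
    // Negation wraps mod 2^128, independent of signed/unsigned range.
    Some(if neg { magnitude.wrapping_neg() } else { magnitude })
}
```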
…ms to be down to 1%-ish or so again
…und remaining because some nodes cost zero points
    better optimization, but at the cost of a longer compile time.
    "#,
    false,
    true,
Does this mean it will do egraph optimizations even if opt_level is set to none?
This is a work-in-progress branch and I'm doing some hacking on benchmarking infrastructure; I wanted to name points-in-time by hashes only, without a separate config. We will definitely not enable opts if opts are disabled. Please note the commit comment, and please disregard throwaway commits on my work-in-progress branch; there will be ample opportunity to review when the real PRs come.
I didn't see the commit message, as I was looking at the GitHub diff view for all changes since the last notification.
Superseded by #4953; closing this one!
cc @fitzgen @jameysharp @abrown @avanhatt @mlfbrown