[strings] Add a StringLifting pass #7389
Conversation
I experimented with
tlively left a comment:
What do you think about building string lifting and lowering directly into the parsers (via IRBuilder) and the binary writer in the long term?
const Name WasmStringsModule = "wasm:js-string";
const Name WasmStringConstsModule = "'";
It would be best to make this configurable, just like it is in the standard API.
I guess it could be a pass option, but do you see users wanting anything other than this? Anything else would be less optimal, so I'd lean toward guiding people to this.
The empty string would be a good choice ;)
Looks like it is...
I can add options to control that for the previous pass and this new pass as a followup?
It would be good to additionally test that operations imported from the wrong module, with the wrong name, or with the wrong type are not lifted.
I added tests for wrong module and base names.
For the wrong type, I think it's enough that we'll get a validation error later?
From looking at the string builtins spec, it looks like trying to import a builtin function with the wrong type will work fine at instantiation and fail when it is called. So it is possible to have a valid module that has one of these "incorrect" imports and never traps because it happens to never call the import. If we want to support such "valid" modules (and I think we should because we can), we would need to leave such imports unmodified here, which means checking the type before lifting.
But I think it is more helpful for us to issue a validation error, forcing the user to get it right? Otherwise the user is depending on runtime errors from the VM, which means that if the user's testing happens to not check enough code paths, runtime errors can reach production.
I don't see a benefit to allowing such runtime-trapping modules, so giving the users more static type checking seems good?
In that case, perhaps it would be better to check the type so we can provide a more informative error message. I imagine the validation errors would be somewhat obscure.
Fair enough, better errors might be useful here. Added.
if (!func->imported() || func->module != WasmStringsModule) {
  continue;
}
if (func->base == "fromCharCodeArray") {
Ideally we should be checking for the expected function types as well.
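A minimal, self-contained sketch of the check being suggested (the struct and function names here are hypothetical stand-ins, not the actual pass code; in Binaryen this information would come from the function's `module`, `base`, and type):

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for an imported function. In the real pass this
// information comes from the IR's imported function record.
struct ImportedFunc {
  std::string module; // e.g. "wasm:js-string"
  std::string base;   // e.g. "concat"
  std::string sig;    // simplified signature notation, for illustration only
};

// Lift only if module, base, AND type all match what the js-string
// builtin would have. Anything else is left untouched (or, as discussed
// above, reported with an informative validation error).
inline bool shouldLiftConcat(const ImportedFunc& func) {
  return func.module == "wasm:js-string" && func.base == "concat" &&
         func.sig == "(externref, externref) -> externref";
}
```

With this shape, an import from the wrong module, with the wrong name, or with the wrong type simply fails the predicate and is not lifted.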
// TODO: Add casts. We generate new string.* instructions, and all their
// string inputs should be stringref, not externref, but we have not
// converted all externrefs to stringrefs (since some externrefs might
// be something else). It is not urgent to fix this as the validator
// accepts externrefs there atm, and since toolchains will lower
// strings out at the end anyhow (which would remove such casts). Note
// that if we add a type import for stringref then this problem would
// become a lot simpler (we'd convert that type to stringref).
Would there be any downside of considering the casts to be part of the semantics of the lifted string operations? We would have to update the interpreter to perform the casts, but this better matches the semantics and typing of the lowered string operations. We would type the output of e.g. string.concat as string as an optimization, but its inputs would remain extern references, allowing all valid lowered modules to be represented as lifted modules without having to insert new casts.
Interesting, yeah, we could do things that way too.
It does feel "cleaner" though for string.concat to receive stringrefs... I'd lean toward waiting for type imports and using that to fix this, eventually (as it doesn't seem urgent to fix).
We don't need to solve this now because the fuzzer currently ensures that only strings flow into string operations? But that means we're missing out on fuzzing interesting programs that might send other values to string builtins.
Hmm, good point... while we import only functions and tags, and we only create strings as true externrefs, we could end up with a struct or array that is externalized. So a string instruction could receive such a non-string value.
It looks like the fuzzer does not generate new externalize instructions itself. So the reason we have not seen this yet is that hitting this depends on using one of the few initial contents with an externalize, and happening to put exactly the right thing there.
We would need to fix this before adding more externalize, in other words. Which is worth doing, as you said.
That seems to have downsides:
The upside is not having support for a nonstandard binary and text format, but I think the nonstandard stuff is exactly what we want in terms of serializing our IR, and that our IR with stringref operations is what we want as an IR (e.g. a string concat operation is literally an instruction).
I'm not too worried about adding complexity to IRBuilder since it's so well contained. For the writer, it could just call out automatically to the lowering pass.
I don't see any reason not to perform lifting as early as possible and lowering as late as possible. That will get the best optimizations.
No problem, we can continue parsing and printing the non-standard string operations and types in the text parser and printer.
Not having to support a nonstandard binary format, yes, but as mentioned above, I think we should keep the text format for testing purposes. I agree that "our IR with stringref operations is what we want as an IR," and building lifting into the parser strongly aligns with that.
Debugging could be a reason, for example. Another possible reason is interaction/ordering with other lifting and lowering operations (like say the lowering we have for intrinsics atm). I can't think of a bad interaction between them, though.
I agree removing a nonstandard thing is nice, but it is also a small amount of code, and already written. And the binary format is also useful. E.g. reduction is faster on the binary format, and it may even be needed, if the bug does not reproduce in the text format (which does happen sometimes, as our text format has some differences, like the unreachable type). And, it is nice that the binary and text formats are as parallel as possible (which, again, we've already compromised on, but hopefully we can avoid adding more such divergences).
The main benefit I would want is that defaults matter, so having to manually remember to add these passes would be bad. That benefit could also be realized by adding lifting and lowering passes at the beginning and end of the default optimization pipeline. (But in that case the lowering and lifting in the middle of e.g. two chained wasm-opt invocations would still happen.)
I get your point that it would be nice for this to "just work" for users. But losing some type information between cycles, or even between invocations of wasm-opt, is generally harmful. In more detail: lifting + lowering is not "perfect". That is, the lowering at the end of one cycle and the lifting at the start of another may not perfectly cancel out. In fact, they do not, in general: we lower stringref to externref, and do not lift it back, so we depend on optimizations to re-refine things, which may take multiple cycles. Now, in VMs this does not matter because the user will lower stringref to externref at the very end anyhow, but it could help in the middle, say by proving a more refined type.
But lowering is the lossy path. Lifting actually recovers type information because the string operations get the more refined stringref type.
Yes, but: With automatic lifting&lowering, the lowering at the end of the first line is the problem, as it undoes refinements. We may recoup them on the second line (after it lifts during load) but we may not.
What if we had a flag to allow opting out of the automatic lowering in the binary writer? Such users could use the flag for their intermediate stages, and users who just use a simple, default pass pipeline would get the optimal and standards-compliant behavior without having to be experts.
It sounds like you're suggesting a flag to opt out of the automatic lowering. Is that right - so we do keep around the binary writing/reading logic for strings?
Users would need to be aware of the issue to know if they are getting the optimal pipeline, so I am not sure this helps much. But maybe there are more such users? I don't follow the standards-compliant point? In my proposal, if the user never uses these passes, their output is standard wasm anyhow.
Yes, if we consider it a requirement that users can split their pipeline across multiple invocations without lowering in the middle, then we would need to keep the binary reading and writing around. I'm not quite convinced this is worth it, but I agree it removes the possibility of regressing current behavior, so I'm happy to start here. I do think there are probably many users who just run a single wasm-opt invocation. Never mind the standards-compliant comment; it was indeed self-evident. My main point is that it would be good to do the best thing by default.
I do agree that most users likely just do a single invocation. But I guess the question is what do we want to focus on - making the common case work well, or avoiding subtle issues in rarer cases? I agree you handle the common case well, but a toolchain author missing the opt-out flag could silently hit those subtle issues. And I still feel that adding this magic in the binary reading and writing has more downsides than benefits. For example, how can we avoid adding overhead? Lifting can be very efficient - check for the imports, and if none exist, do nothing - but lowering requires scanning the entire module in general, to see if there is anything we need to lower. I can't think of a good way to avoid that overhead.
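The "lifting can be very efficient" point can be sketched as an early-exit scan over just the import list (the `Import` struct and module names here are simplified stand-ins for illustration; the real pass would iterate the module's imported functions):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for an import record.
struct Import {
  std::string module;
};

// Lifting only needs to look at the (usually short) import list: if no
// import comes from the js-string builtins module or the string-constants
// module, there is nothing to lift and the pass can skip function bodies
// entirely. Lowering has no such shortcut, since stringref could appear
// anywhere in the module.
inline bool hasStringImports(const std::vector<Import>& imports) {
  for (const auto& imp : imports) {
    if (imp.module == "wasm:js-string" || imp.module == "'") {
      return true;
    }
  }
  return false;
}
```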
Code LGTM. Let's discuss automatic lifting and lowering offline. I don't think strings are the only thing we will want this for.
This converts imported string constants into string.const, and imported
string instructions into string.* expressions. After this pass they are
represented using stringref and we can optimize them fully (e.g.
precomputing a string.concat of two constants). Typically a user would
later lower them back down using StringLowering.
This pass allows users to avoid emitting stringref directly, which means
they are emitting standard wasm which can run in VMs, leaving wasm-opt
entirely optional.
Also refactor a few shared constants with StringLowering into a helper file.
Left as TODOs: contents of the strings custom section, and casts (see
comments in source).
Fixes most of #7370