[strings] Add a StringLifting pass #7389

Merged
kripken merged 36 commits into WebAssembly:main from kripken:string.lifting
Mar 25, 2025

Conversation

@kripken
Member

@kripken kripken commented Mar 21, 2025

This converts imported string constants into string.const, and imported
string instructions into string.* expressions. After this pass they are
represented using stringref and we can optimize them fully (e.g.
precomputing a string.concat of two constants). Typically a user would
later lower them back down using StringLowering.

This pass allows users to avoid emitting stringref directly, which means
they are emitting standard wasm which can run in VMs, leaving wasm-opt
entirely optional.
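As a sketch of what the pass does (hedged: the import shapes below follow the JS string builtins convention and the module names used in this PR, but details like the constant's name are illustrative only):

```wat
;; Before lifting: standard wasm importing string builtins.
(import "'" "hello" (global $hello (ref extern)))  ;; imported string constant
(import "wasm:js-string" "concat"
  (func $concat (param externref externref) (result (ref extern))))

;; After lifting, these are represented in stringref IR, e.g.
;;   (global.get $hello)       ~> (string.const "hello")
;;   (call $concat <a> <b>)    ~> (string.concat <a> <b>)
;; which optimizations like precompute can then fold.
```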

Also refactor a few shared constants with StringLowering into a helper file.

Left as TODOs: contents of the strings custom section, and casts (see
comments in source).

Fixes most of #7370

@kripken kripken requested a review from tlively March 21, 2025 18:29
@kripken
Member Author

kripken commented Mar 21, 2025

I experimented with running --string-lowering --string-lifting on j2wasm output before optimizing, that is, lowering the stringref they currently emit (to represent what they could emit instead), lifting it back, and then optimizing. Overall there was no substantial difference in the final result, aside from not being able to infer stringref in as many places, but that would get lowered at the end anyhow. So this seems ok from an optimization perspective.

Member

@tlively tlively left a comment

What do you think about building string lifting and lowering directly into the parsers (via IRBuilder) and the binary writer in the long term?


const Name WasmStringsModule = "wasm:js-string";

const Name WasmStringConstsModule = "'";
Member

It would be best to make this configurable, just like it is in the standard API.

Member Author

I guess it could be a pass option, but do you see users wanting anything other than this? Anything else would be less optimal, so I'd lean toward guiding people to this.

Member

The empty string would be a good choice ;)

Member Author

..is that... valid? 🔍

Member Author

Looks like it is...

I can add options to control that for the previous pass and this new pass as a followup?

Member

Sounds good 👍

Member

It would be good to additionally test that operations imported from the wrong module, with the wrong name, or with the wrong type are not lifted.

Member Author

I added tests for wrong module and base names.

For the wrong type, I think it's enough that we'll get a validation error later?

Member

From looking at the string builtins spec, it looks like trying to import a builtin function with the wrong type will work fine at instantiation and fail when it is called. So it is possible to have a valid module that has one of these "incorrect" imports and never traps because it happens to never call the import. If we want to support such "valid" modules (and I think we should because we can), we would need to leave such imports unmodified here, which means checking the type before lifting.

Member Author

But I think it is more helpful for us to issue a validation error, forcing the user to get it right? Otherwise the user is depending on runtime errors from the VM, which means that if the user's testing happens to not check enough code paths, runtime errors can reach production.

I don't see a benefit to allowing such runtime-trapping modules, so giving the users more static type checking seems good?

Member

In that case, perhaps it would be better to check the type so we can provide a more informative error message. I imagine the validation errors would be somewhat obscure.

Member Author

Fair enough, better errors might be useful here. Added.

if (!func->imported() || func->module != WasmStringsModule) {
continue;
}
if (func->base == "fromCharCodeArray") {
Member

Ideally we should be checking for the expected function types as well.
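A minimal sketch of that guard, under stated assumptions (the `FuncImport` struct, `shouldLift` helper, and the textual signature encoding are hypothetical simplifications, not Binaryen's actual types): lift only when module, base, and expected function type all match.

```cpp
#include <string>

// Hypothetical simplified stand-in for a Binaryen function import.
struct FuncImport {
  std::string module;    // e.g. "wasm:js-string"
  std::string base;      // e.g. "fromCharCodeArray"
  std::string signature; // textual stand-in for the function's type
};

const std::string WasmStringsModule = "wasm:js-string";

// Lift only when module, base, and type all match; anything else is left
// untouched (or reported with a friendlier error than the validator's).
bool shouldLift(const FuncImport& f) {
  if (f.module != WasmStringsModule) {
    return false;
  }
  if (f.base == "fromCharCodeArray") {
    // Expected type, simplified; a mismatch here means "do not lift".
    return f.signature == "(ref null array) i32 i32 -> (ref extern)";
  }
  return false; // unknown base name
}
```

The same shape extends to the other builtins; the point is only that a name match alone is not sufficient.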

Comment on lines +202 to +209
// TODO: Add casts. We generate new string.* instructions, and all their
// string inputs should be stringref, not externref, but we have not
// converted all externrefs to stringrefs (since some externrefs might
// be something else). It is not urgent to fix this as the validator
// accepts externrefs there atm, and since toolchains will lower
// strings out at the end anyhow (which would remove such casts). Note
// that if we add a type import for stringref then this problem would
// become a lot simpler (we'd convert that type to stringref).
Member

Would there be any downside of considering the casts to be part of the semantics of the lifted string operations? We would have to update the interpreter to perform the casts, but this better matches the semantics and typing of the lowered string operations. We would type the output of e.g. string.concat as string as an optimization, but its inputs would remain extern references, allowing all valid lowered modules to be represented as lifted modules without having to insert new casts.

Member Author

Interesting, yeah, we could do things that way too.

It does feel "cleaner" though for string.concat to receive stringrefs... I'd lean toward waiting for type imports and using that to fix this, eventually (as it doesn't seem urgent to fix).

Member

We don't need to solve this now because the fuzzer currently ensures that only strings flow into string operations? But that means we're missing out on fuzzing interesting programs that might send other values to string builtins.

Member Author

@kripken kripken Mar 21, 2025

Hmm, good point... while we import only functions and tags, and we only create strings as true externrefs, we could end up with a struct or array that is externalized. So a string instruction could receive such a non-string value.

It looks like the fuzzer does not generate new externalize instructions itself. So the reason we have not seen this yet is that hitting this depends on using one of the few initial contents with an externalize, and happening to put exactly the right thing there.

We would need to fix this before adding more externalize, in other words. Which is worth doing, as you said.

@kripken
Member Author

kripken commented Mar 21, 2025

What do you think about building string lifting and lowering directly into the parsers (via IRBuilder) and the binary writer in the long term?

That seems to have downsides:

  1. It would add complexity to the parser and writer.
  2. It would mean that users have less control over when this transformation happens (e.g. maybe they want to run some general optimization cycles before any changes).
  3. If the parser and writer did this automatically, we'd have no way to directly represent Binaryen IR strings in text format, so any debugging would end up having to consider that the IR is different from the wasm.

The upside is not having support for a nonstandard binary and text format, but I think the nonstandard stuff is exactly what we want in terms of serializing our IR, and that our IR with stringref operations is what we want as an IR (e.g. a string concat operation is literally an instruction string.concat and not some kind of import).

@tlively
Member

tlively commented Mar 21, 2025

What do you think about building string lifting and lowering directly into the parsers (via IRBuilder) and the binary writer in the long term?

That seems to have downsides:

  1. It would add complexity to the parser and writer.

I'm not too worried about adding complexity to IRBuilder since it's so well contained. For the writer, it could just call out automatically to the lowering pass.

  1. It would mean that users have less control over when this transformation happens (e.g. maybe they want to run some general optimization cycles before any changes).

I don't see any reason not to perform lifting as early as possible and lowering as late as possible. That will get the best optimizations.

  1. If the parser and writer did this automatically, we'd have no way to directly represent Binaryen IR strings in text format, so any debugging would end up having to consider that the IR is different from the wasm.

No problem, we can continue parsing and printing the non-standard string operations and types in the text parser and printer.

The upside is not having support for a nonstandard binary and text format, but I think the nonstandard stuff is exactly what we want in terms of serializing our IR, and that our IR with stringref operations is what we want as an IR (e.g. a string concat operation is literally an instruction string.concat and not some kind of import).

Not having to support a nonstandard binary format, yes, but as mentioned above, I think we should keep the text format for testing purposes. I agree that "our IR with stringref operations is what we want as an IR," and building lifting into the parser strongly aligns with that.

@kripken
Member Author

kripken commented Mar 21, 2025

I don't see any reason not to perform lifting as early as possible and lowering as late as possible. That will get the best optimizations.

Debugging could be a reason, for example.

Another possible reason is interaction/ordering with other lifting and lowering operations (like say the lowering we have for intrinsics atm). I can't think of a bad interaction between them, though.

Not having to support a nonstandard binary format [is the benefit], yes, but as mentioned above, I think we should keep the text format for testing purposes.

I agree removing a nonstandard thing is nice, but it is also a small amount of code, and already written.

And the binary format is also useful. E.g. reduction is faster on the binary format, and it may even be needed, if the bug does not reproduce in the text format (which does happen sometimes, as our text format has some differences, like the unreachable type).

And, it is nice that the binary and text formats are as parallel as possible (which, again, we've already compromised on, but hopefully we can avoid adding more such divergences).

@tlively
Member

tlively commented Mar 21, 2025

The main benefit I would want is that defaults matter, so having to manually remember to add these passes would be bad. That benefit could also be realized by adding lifting and lowering passes at the beginning and end of the default optimization pipeline. (But in that case the lowering and lifting in the middle of e.g. -O3 -O3 would be useless. Maybe we want a concept of early passes and late passes based on the maximum used optimization level or something like that?)

@kripken
Member Author

kripken commented Mar 21, 2025

The main benefit I would want is that defaults matter, so having to manually remember to add these passes would be bad. That benefit could also be realized by adding lifting lowering passes at the beginning and end of the default optimization pipeline.

I get your point that it would be nice for this to "just work" for users. But losing some type information between cycles, or even between invocations of wasm-opt, is generally harmful.

In more detail: lifting + lowering is not "perfect". That is, the lowering at the end of one cycle and the lifting at the start of another may not perfectly cancel out. In fact, they do not, in general: we lower stringref to externref, and do not lift it back, so we depend on optimizations to re-refine things, which may take multiple cycles. Now, in VMs this does not matter because the user will lower stringref to externref at the very end anyhow, but it could help in the middle, say by proving a ref.test of stringref succeeds.

@tlively
Member

tlively commented Mar 22, 2025

But lowering is the lossy path. Lifting actually recovers type information because the string operations get the more refined string type. So lifting as early as possible and lowering as late as possible is always optimal. I agree that doing a lowering in the middle of -O3 -O3 would be suboptimal, so we should find a way to lift by default while avoiding that problem.

@kripken
Member Author

kripken commented Mar 24, 2025

But lowering is the lossy path. Lifting actually recovers type information because the string operations get the more refined string type. So lifting as early as possible and lowering as late as possible is always optimal.

Yes, but:

wasm-opt input.wasm -O3 -O3 --gufa -O3 -O3 -o mid.wasm
wasm-opt mid.wasm -O3 -O3 --type-finalizing -o final.wasm

With automatic lifting and lowering, the lowering at the end of the first line is the problem, as it undoes refinements. We may recoup them on the second line (after it lifts during load), but we may not.

@tlively
Member

tlively commented Mar 24, 2025

What if we had a flag to allow opting out of the automatic lowering in the binary writer? Such users could use the flag for their intermediate stages, and users who just use a simple, default pass pipeline would get the optimal and standards-compliant behavior without having to be experts.

@kripken
Member Author

kripken commented Mar 25, 2025

What if we had a flag to allow opting out of the automatic lowering in the binary writer?

It sounds like you're suggesting

wasm-opt input.wasm [..opts..] -o mid.wasm --opt-out-of-string-lowering
wasm-opt mid.wasm [..opts..] -o final.wasm

Is that right - so we do keep around the binary writing/reading logic for strings?

users who just use a simple, default pass pipeline would get the optimal and standards-compliant behavior without having to be experts

Users would need to be aware of the issue to know if they are getting the optimal pipeline, so I am not sure this helps much. But maybe there are more such users?

I don't follow the standards-compliant point? In my proposal if the user never uses --string-lifting --string-lowering then they are obviously standards-compliant (assuming they emit such code themselves), and if they do use it, they will still emit standards-compliant code (unless they forget the --string-lowering, but in your suggestion they could also leave an --opt-out-of-string-lowering at the end).

@tlively
Member

tlively commented Mar 25, 2025

Yes, if we consider it a requirement that users can split their pipeline across multiple invocations without lowering in the middle, then we would need to keep the binary reading and writing around. I'm not quite convinced this is worth it, but I agree it removes the possibility of regressing current behavior, so I'm happy to start here.

I do think there are probably many users who just run -O3 and call it a day. Certainly we can't expect users who aren't toolchain maintainers to know anything about what the pipeline is doing.

Never mind the standards-compliant comment, it was indeed self-evident. My main point is that it would be good to do the best thing by default.

@kripken
Member Author

kripken commented Mar 25, 2025

I do agree that most users likely just do -O3, so that would work well for them.

But I guess the question is what we want to focus on: making the common case work well, or avoiding subtle issues in rarer cases? I agree that handles the common case well, but a toolchain author missing --opt-out-of-string-lowering is pretty subtle, far harder to notice than not getting string opts at all.

And I still feel that adding this magic in the binary reading and writing has more downsides than benefits. For example, how can we avoid adding overhead? Lifting can be very efficient - check for the imports, and if none exist, do nothing - but lowering requires scanning the entire module in general, to see if there is anything we need to lower. I can't think of a good way to avoid that overhead.
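To illustrate the asymmetry (a hypothetical sketch, not Binaryen code): lifting can bail out after scanning only the import list, while lowering has no equivalent shortcut because stringref can appear anywhere in the module.

```cpp
#include <string>
#include <vector>

// Hypothetical simplified import record.
struct Import {
  std::string module;
};

const std::string WasmStringsModule = "wasm:js-string";
const std::string WasmStringConstsModule = "'";

// Cheap precheck for lifting: O(#imports), never touches function bodies.
// If this returns false, the lifting pass can return immediately.
bool hasStringImports(const std::vector<Import>& imports) {
  for (const auto& imp : imports) {
    if (imp.module == WasmStringsModule ||
        imp.module == WasmStringConstsModule) {
      return true;
    }
  }
  return false;
}
```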

@tlively
Member

tlively commented Mar 25, 2025

Code LGTM. Let's discuss automatic lifting and lowering offline. I don't think strings are the only thing we will want this for.

@kripken kripken merged commit aee292b into WebAssembly:main Mar 25, 2025
14 checks passed
@kripken kripken deleted the string.lifting branch March 25, 2025 19:46