Add optional WASM feature to the native library, allowing it to run wasm-compiled parsers via wasmtime #1864

Merged
maxbrunsfeld merged 37 commits into master from wasm-language on Nov 28, 2023

@maxbrunsfeld
Contributor

@maxbrunsfeld maxbrunsfeld commented Sep 8, 2022

Background

Currently, Tree-sitter parsers can be compiled to WebAssembly (aka 'WASM') and run within a web browser, via the web-tree-sitter JavaScript library, which contains a WASM build of the Tree-sitter library.

In some applications that use the native Tree-sitter library, it would also be very useful to be able to load these same WASM builds of parsers, in order to support adding parsers as 🔌 plugins 🔌 , without requiring users to download platform-specific native binaries or to compile C code on their own machines.

Change

This PR adds a new optional WASM feature to the core library, which can be enabled in Rust via the wasm cargo feature, and in C via the TREE_SITTER_FEATURE_WASM macro.

This feature allows you to build a native TSLanguage object from a WASM buffer. You can then use this language object just like any other (native-compiled) language object: parsing text that lives in native memory, constructing syntax trees on the native heap, sending them between threads, etc. Tree-sitter languages are mostly just plain immutable data, so they're easy to unmarshal from a compiled wasm module.
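To make the "plain immutable data" point concrete, here is a minimal sketch of copying a table of 16-bit values out of a byte slice that stands in for wasm linear memory. The function name, the offsets and the table contents are all hypothetical, and this is not the library's actual unmarshalling code. The one firm fact it relies on is that wasm linear memory is little-endian.

```rust
/// Read `count` little-endian u16 values starting at byte `offset` of a
/// snapshot of wasm linear memory. Returns None if the range is out of bounds.
/// (Hypothetical helper for illustration; not part of Tree-sitter.)
fn read_u16_table(memory: &[u8], offset: usize, count: usize) -> Option<Vec<u16>> {
    let byte_len = count.checked_mul(2)?;
    let end = offset.checked_add(byte_len)?;
    let bytes = memory.get(offset..end)?;
    Some(
        bytes
            .chunks_exact(2)
            .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
            .collect(),
    )
}

fn main() {
    // Fake "linear memory": 4 bytes of padding, then the table [1, 258, 65535].
    let memory = [0u8, 0, 0, 0, 0x01, 0x00, 0x02, 0x01, 0xFF, 0xFF];
    let table = read_u16_table(&memory, 4, 3);
    println!("{:?}", table);
    // A request that runs past the end of memory is rejected, not truncated.
    assert!(read_u16_table(&memory, 8, 4).is_none());
}
```

Once copied like this, the data lives on the native heap and no longer depends on the wasm instance.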

The only difference is that when using a wasm-based language with a TSParser, you must first provide the parser with a TSWasmStore. This wasm store object wraps a wasmtime::Store object, which the Tree-sitter library uses to invoke the language's lexing functions. Those functions are code, not data, so they require a WASM runtime to execute.
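The rule above can be modelled in a few lines. This is a toy model only: `WasmStore`, `LanguageKind` and `Parser` here are hypothetical stand-ins defined in the snippet, not the real Tree-sitter types, and the error string is invented. It just shows the shape of the constraint, that a wasm-based language is accepted only once a store is attached.

```rust
/// Hypothetical stand-in for a wasm runtime store (not the real TSWasmStore).
#[derive(Debug)]
struct WasmStore;

/// A language is either plain native data, or data plus wasm lexing code.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LanguageKind {
    Native,
    Wasm,
}

#[derive(Debug, Default)]
struct Parser {
    wasm_store: Option<WasmStore>,
    language: Option<LanguageKind>,
}

impl Parser {
    fn set_wasm_store(&mut self, store: WasmStore) {
        self.wasm_store = Some(store);
    }

    /// Accept a wasm language only if a store is present to run its lexer.
    fn set_language(&mut self, language: LanguageKind) -> Result<(), &'static str> {
        if language == LanguageKind::Wasm && self.wasm_store.is_none() {
            return Err("wasm language requires a wasm store");
        }
        self.language = Some(language);
        Ok(())
    }

    fn language(&self) -> Option<LanguageKind> {
        self.language
    }
}

fn main() {
    let mut parser = Parser::default();
    // Native languages never need a store.
    assert!(parser.set_language(LanguageKind::Native).is_ok());
    // A wasm language is rejected until a store has been provided.
    assert!(parser.set_language(LanguageKind::Wasm).is_err());
    parser.set_wasm_store(WasmStore);
    assert!(parser.set_language(LanguageKind::Wasm).is_ok());
    println!("active language: {:?}", parser.language());
}
```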

Notes

Wasmtime Dependency - I originally thought that it'd be cool to code against WebAssembly's standard wasm.h C interface, which is supposedly implemented by multiple runtimes (mainly wasmtime and V8). This way, applications using Tree-sitter would have multiple choices of which WASM runtime to link Tree-sitter against.

As I got into the details, I ended up finding that some of the C APIs that I needed were currently unimplemented in wasmtime, but that Wasmtime provides its own custom C interface which is fully implemented, and which is supposedly better-designed from a performance perspective. So I ended up using Wasmtime's own C API. This means you can't use V8 as the wasm runtime when using this Tree-sitter feature, at least for now.

Tasks

  • build
    • Open wasmtime PR for tweaks to the wasmtime-c-api Cargo.toml
  • library
    • Support parsers without external scanners
    • Support parsers whose external scanners don't use libc or libc++
    • Support parsers that use libc and libc++ by bundling a small WASM module exporting all of the same stdlib symbols as web-tree-sitter
    • Generalize logic for copying parse tables out of wasm memory, removing hard-coded stuff
    • Make sure wasm stores work correctly when loading multiple languages
    • Make error handling robust
      • return an informative error from ts_wasm_store_load_language
  • cli
    • Add a --wasm flag to the tree-sitter parse CLI command, which causes any parsers to be compiled to WASM instead of native shared libraries, and loaded via the new logic.
    • Add a --wasm flag to the tree-sitter test command that works the same way
    • Exercise this new behavior in the test suite. Maybe just run the entire test suite in both WASM and non-WASM modes?

@maxbrunsfeld maxbrunsfeld mentioned this pull request Sep 8, 2022
31 tasks
@maxbrunsfeld maxbrunsfeld force-pushed the wasm-language branch 2 times, most recently from 951b5ce to 4bb82ce Compare October 24, 2022 22:36
Replace non-mutating `ts_parser_wasm_store` function with
`ts_parser_take_wasm_store`, which removes and returns the wasm
store, in order to facilitate single ownership.
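As an illustration of the single-ownership idea in this commit message, the sketch below uses Rust's `Option::take` on hypothetical `Parser` and `WasmStore` types defined in the snippet itself. It is an assumption about the shape of the semantics ("remove and return"), not the C implementation.

```rust
/// Hypothetical store type; the `id` field only exists to tell stores apart.
#[derive(Debug, PartialEq)]
struct WasmStore {
    id: u32,
}

#[derive(Default)]
struct Parser {
    wasm_store: Option<WasmStore>,
}

impl Parser {
    fn set_wasm_store(&mut self, store: WasmStore) {
        self.wasm_store = Some(store);
    }

    /// Remove and return the store, leaving the parser without one.
    fn take_wasm_store(&mut self) -> Option<WasmStore> {
        self.wasm_store.take()
    }
}

fn main() {
    let mut first = Parser::default();
    let mut second = Parser::default();
    first.set_wasm_store(WasmStore { id: 1 });

    // Move the store: after `take`, `first` no longer has one.
    if let Some(store) = first.take_wasm_store() {
        println!("moving store {} to the second parser", store.id);
        second.set_wasm_store(store);
    }
    assert!(first.take_wasm_store().is_none());
    assert_eq!(second.take_wasm_store(), Some(WasmStore { id: 1 }));
}
```

Because the accessor hands the store back by value, there is never a moment when two owners can both reach it.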
@RubixDev

> • Open wasmtime PR for tweaks to the wasmtime-c-api Cargo.toml

I don't exactly understand why the Rust dependency on wasmtime-c-api is necessary. The only place it is used is the test() function in lib/binding_rust/wasm_language.rs which appears to be unused, and even that just seems to call wasmtime::Engine::default() internally. So is it really necessary to create a PR for allowing use of that crate from Rust?

@RubixDev

Okay it seems I completely missed the point. Depending on the C API from Rust allows the tree-sitter C lib to be compiled and linked easily without having to manually vendor wasmtime. I created a PR on wasmtime here: bytecodealliance/wasmtime#6765.

@maxbrunsfeld maxbrunsfeld marked this pull request as ready for review November 28, 2023 16:42
@maxbrunsfeld maxbrunsfeld merged commit 034f0d0 into master Nov 28, 2023
@maxbrunsfeld maxbrunsfeld deleted the wasm-language branch November 28, 2023 20:08
@maxbrunsfeld
Contributor Author

Ok, this experimental feature is complete enough for people to try using in downstream applications. There will probably be bugs, but the feature basically works.

@panekj

panekj commented Nov 28, 2023

Thanks a lot for the work! We will be happy to test it in Lapce (hopefully soon)

@clason
Member

clason commented Nov 28, 2023

Huge step towards a truly platform-independent parser ecosystem. Out of interest (and I apologize if it's obvious), how does this relate to #949?

@maxbrunsfeld
Contributor Author

> Out of interest (and I apologize if it's obvious), how does this relate to #949?

Yeah, that's a good question. That same limitation applies when using wasm-compiled parsers in this mode: there is a fixed (small) subset of the C and C++ standard libraries available that is compiled into the library, and any external scanners that rely on symbols that aren't in this subset will not work.

There may be some way to change how parsers are compiled to wasm, such that they can include code that they need from the standard library, but I don't know how to do that with Emscripten (or Clang) thus far. I feel like this design, though somewhat limiting, mostly works in practice.

But as part of stabilizing this feature, it would probably be good to add some tooling around detecting when external scanners use functions that are unavailable in a wasm context, and emitting warnings.
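A minimal sketch of what such a warning pass might look like, assuming the scanner's imported symbol names have already been extracted somehow. The function name is hypothetical, and the "supported" set below is a made-up example, not the actual list of symbols that Tree-sitter bundles.

```rust
use std::collections::BTreeSet;

/// Return the imported symbols that are not in the supported set,
/// sorted and de-duplicated. (Hypothetical helper for illustration.)
fn unsupported_imports<'a>(imports: &[&'a str], supported: &BTreeSet<&str>) -> Vec<&'a str> {
    let mut missing: Vec<&'a str> = imports
        .iter()
        .copied()
        .filter(|name| !supported.contains(name))
        .collect();
    missing.sort_unstable();
    missing.dedup();
    missing
}

fn main() {
    // Made-up example set; the real subset lives in the Tree-sitter sources.
    let supported: BTreeSet<&str> = ["malloc", "free", "memcpy", "iswspace"]
        .iter()
        .copied()
        .collect();
    // Symbols a hypothetical external scanner imports.
    let imports = ["malloc", "strtol", "free", "fopen", "strtol"];

    for name in unsupported_imports(&imports, &supported) {
        eprintln!("warning: scanner imports `{}`, which is not in the supported set", name);
    }
}
```

Running the check at build time would surface the problem before a user hits the code path that calls the missing function.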

@clason
Member

clason commented Nov 28, 2023

Ah, pity. That is a dealbreaker for Neovim (as the Markdown parser is now required infrastructure, and many parsers in nvim-treesitter have a scanner.c).

But still, great feature!

@maxbrunsfeld
Contributor Author

maxbrunsfeld commented Nov 28, 2023

Also, some big limitations to call out:

  • Right now, there's no way to delete a Language, because they previously consisted only of static data. Now, with wasm-based languages, they are allocated dynamically at runtime. I plan to add an API for deleting a Wasm-based language, so that you can dynamically reload a given language after changing its wasm (enabling a hot-reloading workflow for parsers). That'll happen in a future PR.
  • There's not yet any docs for how to use this feature. I'll add those in a subsequent PR.
  • I haven't used this feature in an actual editor yet, so I repeat, there are probably bugs.

@clason
Member

clason commented Nov 28, 2023

> There's not yet any docs for how to use this feature. I'll add those in a subsequent PR.

Yeah, I noticed ;) Lots of stabbing in the dark for me...

@maxbrunsfeld
Contributor Author

maxbrunsfeld commented Nov 28, 2023

> Ah, pity. That is a dealbreaker for Neovim (as the Markdown parser is now required infrastructure, and many parsers in nvim-treesitter have a scanner.c).

Just to be clear, parsers with a scanner.c (or scanner.cc) will still work fine, as long as the scanner only uses certain functions from the C and C++ standard libraries. If there is some critical function that the Markdown scanner uses that I haven't thought to include, it's pretty easy to add it to the core. We need to add it here and here.

@tree-sitter tree-sitter deleted a comment from PyaePhyo8612 Nov 28, 2023
@savetheclocktower
Contributor

> But as part of stabilizing this feature, it would probably be good to add some tooling around detecting when external scanners use functions that are unavailable in a wasm context, and emitting warnings.

This would be fantastic. We've tried to do so in Pulsar by adding a runtime check, but that requires a user to hit the specific code path first, and serves only to present a more understandable error message. If there were (or if there is) a way to analyze a wasm file and know which externals it wants, that'd be a major improvement.
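For what it's worth, a wasm binary does list its externals in its import section (section id 2 in the WebAssembly binary format), so a static check is possible in principle. Below is a rough sketch that walks the sections and collects function imports. It is a simplification: it assumes single-byte reference and value types, ignores newer import kinds, and does no validation. `function_imports` and `sample_module` are hypothetical names, and the sample bytes are hand-built for the demo rather than taken from a real parser.

```rust
/// Decode an unsigned LEB128 integer, advancing `pos`.
fn read_leb(bytes: &[u8], pos: &mut usize) -> Option<u64> {
    let mut result = 0u64;
    let mut shift = 0u32;
    loop {
        let byte = *bytes.get(*pos)?;
        *pos += 1;
        if shift >= 64 {
            return None;
        }
        result |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(result);
        }
        shift += 7;
    }
}

/// Read a length-prefixed UTF-8 name, advancing `pos`.
fn read_name(bytes: &[u8], pos: &mut usize) -> Option<String> {
    let len = read_leb(bytes, pos)? as usize;
    let end = pos.checked_add(len)?;
    let name = std::str::from_utf8(bytes.get(*pos..end)?).ok()?.to_string();
    *pos = end;
    Some(name)
}

/// Skip a limits record: a flags byte, a minimum, and an optional maximum.
fn skip_limits(bytes: &[u8], pos: &mut usize) -> Option<()> {
    let flags = *bytes.get(*pos)?;
    *pos += 1;
    read_leb(bytes, pos)?;
    if flags & 0x01 != 0 {
        read_leb(bytes, pos)?;
    }
    Some(())
}

/// Return (module, name) for every function import in a wasm binary.
fn function_imports(wasm: &[u8]) -> Option<Vec<(String, String)>> {
    if !wasm.starts_with(b"\0asm\x01\0\0\0") {
        return None; // wrong magic number or version
    }
    let mut pos = 8;
    let mut imports = Vec::new();
    while pos < wasm.len() {
        let id = wasm[pos];
        pos += 1;
        let size = read_leb(wasm, &mut pos)? as usize;
        let end = pos.checked_add(size)?;
        if end > wasm.len() {
            return None;
        }
        if id == 2 {
            let mut p = pos;
            let count = read_leb(wasm, &mut p)?;
            for _ in 0..count {
                let module = read_name(wasm, &mut p)?;
                let name = read_name(wasm, &mut p)?;
                let kind = *wasm.get(p)?;
                p += 1;
                match kind {
                    0x00 => {
                        read_leb(wasm, &mut p)?; // type index
                        imports.push((module, name));
                    }
                    0x01 => {
                        p += 1; // table: one-byte reference type, then limits
                        skip_limits(wasm, &mut p)?;
                    }
                    0x02 => skip_limits(wasm, &mut p)?, // memory
                    0x03 => p += 2, // global: value type + mutability
                    _ => return None, // kinds this sketch does not handle
                }
            }
        }
        pos = end; // jump to the next section
    }
    Some(imports)
}

/// Hand-built module: imports function `env.malloc` and memory `env.memory`.
fn sample_module() -> Vec<u8> {
    let mut wasm = b"\0asm\x01\0\0\0".to_vec();
    wasm.extend_from_slice(&[0x01, 0x04, 0x01, 0x60, 0x00, 0x00]); // type section: () -> ()
    wasm.extend_from_slice(&[0x02, 28, 0x02]); // import section: 28 bytes, 2 entries
    wasm.extend_from_slice(b"\x03env\x06malloc\x00\x00"); // function, type index 0
    wasm.extend_from_slice(b"\x03env\x06memory\x02\x00\x01"); // memory, min 1 page
    wasm
}

fn main() {
    for (module, name) in function_imports(&sample_module()).unwrap_or_default() {
        println!("{}.{}", module, name);
    }
}
```

The resulting names could then be compared against whatever set of symbols the host provides, before the module is ever instantiated.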

@clason
Member

clason commented Nov 30, 2023

Another question, just to be sure: If this mode is enabled, will you still be able to use native parsers (i.e., support both .so and .wasm in an editor)?

@maxbrunsfeld
Contributor Author

Yes, you can still use the native parsers.

maxbrunsfeld added a commit to zed-industries/zed that referenced this pull request Jan 3, 2024
This PR adds undocumented functionality for loading custom language
plugins at runtime. I don't intend to expose the functionality to end
users yet, but this will allow the team to test the capability
internally.

### Implementation

There isn't much new code in Zed. Most of the work here is within
Tree-sitter, in PRs tree-sitter/tree-sitter#1864
and tree-sitter/tree-sitter#2840, which allow
Tree-sitter to load languages from WASM blobs. I've tested the
functionality in Tree-sitter's test suite and via its CLI, but having it
wired into Zed allows us to test the functionality more fully.

### Details

Now, on startup, Zed will look for subdirectories inside of
`~/Application Support/plugins`. These subdirectories are expected to
look similar to the per-language subdirectories in
[`crates/zed2/src/languages`](https://github.com/zed-industries/zed/tree/main/crates/zed2/src/languages),
except that they also contain a `.wasm` file for the parser itself.

I'll add more details here as I go.
@joaomoreno

Any plan on shipping this?

@icp1994

icp1994 commented Jan 26, 2024

Can (should?) the latest tree-sitter release be built against a published wasmtime version instead of git rev?

@maxbrunsfeld
Contributor Author

The challenge right now is that wasmtime-c-api is not available on crates.io. I'm trying to get that resolved (bytecodealliance/wasmtime#7837), but if there are blockers, I may need to work around it by temporarily removing the wasm feature from the version that we publish to crates.io, or vendoring the wasmtime-c-api sources somehow.
