
Proposal: Add inter-language type bindings #1274

@PoignardAzur


WebAssembly is currently very good at executing code written in arbitrary languages from a given (usually JS) interpreter, but it lacks several key features when it comes to combining multiple arbitrary languages together.

One of these features is a language-agnostic type system. I would like to propose that one or several such system(s) be added to WebAssembly.

As an aside, in previous feature discussions, some contributors have expressed that language interoperability shouldn't be a design goal of WebAssembly. While I agree that it shouldn't necessarily be a high-priority goal, I think it is a goal worth striving for in the long term. So before I go into design goals, I'm going to lay out the reasons why I think language interoperability is worth the effort.

Why care about language interoperability?

The benefits of lower language-to-language barriers include:

  • More libraries for wasm users: This goes without saying, but improving language interoperability means that users can use existing libraries more often, even if the library is written in a different language than they're using.

  • Easier adoption of small languages: In the current marketplace, it's often difficult for languages without corporate support to get traction. New languages (and even languages like D with years of refinement) have to compete with languages with large ecosystems, and suffer from their own lack of libraries. Language interoperability would allow them to use existing ecosystems like Python's or Java's.

  • Better language-agnostic toolchains: Right now, most languages have their own library loading scheme and package manager (or, in the case of C/C++, several non-official ones). Writing a language-agnostic project builder is hard, because these languages often have subtle dependencies and ABI incompatibilities that require a monolithic project-wide solution to resolve. A robust inter-language type system would make it easier for projects to be split into smaller modules that can be handled by an npm-like solution.

Overall, I think the first point is the most important, by a wide margin. A better type system means better access to other languages, which means more opportunities to reuse code instead of writing it from scratch. I can't overstate how important that is.

Requirements

With that in mind, I want to outline the requirements an inter-language type system would need to meet.

I'm writing under the assumption that the type system would be strictly used to annotate functions passed between modules, and would not check how languages use their own linear or managed memory in any way.

To be truly useful in a wasm setting, such a type system would need:

1 - Safety

  • Type-safe: The callee must only have access to data specified by the caller, object-capabilities-style.
  • Memory should be "forgotten" at the end of a call. A callee shouldn't be able to gain access to a caller's data, return, and then access that data again in any form.

2 - Overhead

  • Developers should be comfortable making inter-module calls regularly, eg, in a render loop.
  • Zero-copy: The type system should be expressive enough to allow interpreters to implement zero-copy strategies if they want to, and expressive enough for these implementers to know when zero-copy is optimal.

3 - Struct graphs

  • The type system should include structures, optional pointers, variable-length arrays, slices, etc.
  • Ideally, the caller should be able to send an object graph scattered in memory while respecting requirements 1 and 2.

4 - Reference types

  • Modules should be able to exchange reference types nested deep within structure graphs.

5 - Bridge between memory layouts

  • This is a very important point. Different categories of languages have different requirements. Languages relying on linear memory would want to pass slices of memory, whereas languages relying on GC would want to pass GC references.
  • An ideal type system should express semantic types, and let languages decide how to interpret them in memory. While passing data between languages with incompatible memory layouts will always incur some overhead, passing data between similar languages should ideally be cheap (eg, embedders should avoid serialization-deserialization steps if a memcpy can do the same job).
  • Additional bindings may also allow for caching and other optimization strategies.
  • The conversion work when passing data between two modules should be transparent to the developer, as long as the semantic types are compatible.

6 - Compile-time error handling

  • Any error related to invalid function call arguments should be detectable and expressible at compile-time, unlike in, eg, JS, where TypeErrors are thrown at runtime when trying to evaluate the argument.
  • Ideally, language compilers themselves should detect type errors when importing wasm modules, and output expressive, idiomatic errors to the user. What form this error-checking should take would need to be detailed in the tool-conventions repository.
  • This means that an IDL with existing converters to other languages would be a plus.

7 - Provide a Schelling point for inter-language interaction

  • This is easier said than done, but I think wasm should send a signal to all compiler writers that the standard way to interoperate between languages is X. For obvious reasons, having multiple competing standards for language interoperability isn't desirable.

Proposed implementation

What I propose is for bindings to the Cap'n Proto IDL by @kentonv to be added to WebAssembly.

They would work in a similar fashion to WebIDL bindings: wasm modules would export functions, and use special instructions to bind them to typed signatures; other modules would import these signatures, and bind them to their own functions.

The following pseudo-syntax is meant to give an idea of what these bindings would look like; it's approximate and heavily inspired by the WebIDL proposal, and focuses more on the technical challenges than on providing exhaustive lists of instructions.

Capnproto binding instructions would all be stored in a new Cap'n'proto bindings section.

Cap'n'proto types

The standard would need an internal representation of capnproto's schema language. As an example, the following Capnproto type:

struct Person {
  name @0 :Text;
  birthdate @3 :Date;

  email @1 :Text;
  phones @2 :List(PhoneNumber);

  struct PhoneNumber {
    number @0 :Text;
    type @1 :Type;

    enum Type {
      mobile @0;
      home @1;
      work @2;
    }
  }
}

struct Date {
  year @0 :Int16;
  month @1 :UInt8;
  day @2 :UInt8;
}

might be represented as

(@capnproto type $Date (struct
    (field "year" Int16)
    (field "month" UInt8)
    (field "day" UInt8)
))
(@capnproto type $Person_PhoneNumber_Type (enum 0 1 2))
(@capnproto type $Person_PhoneNumber (struct
    (field "number" Text)
    (field "type" $Person_PhoneNumber_Type)
))
(@capnproto type $Person (struct
    (field "name" Text)
    (field "email" Text)
    (field "phones" (generic List $Person_PhoneNumber))
    (field "birthdate" $Date)
))
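
To make the shape of that internal representation a little more concrete, here is a minimal Python sketch that models the declarations above as plain data and checks that every type reference resolves. The class names (`Struct`, `Enum`, `ListOf`) and the `unresolved_references` helper are hypothetical, made up for this illustration; nothing here is part of Cap'n Proto or of any wasm proposal.

```python
from dataclasses import dataclass

# Built-in Cap'n Proto types used by the example above.
BUILTINS = {"Text", "Int16", "UInt8"}


@dataclass
class Enum:
    ordinals: list  # e.g. [0, 1, 2]


@dataclass
class ListOf:
    element: str  # a builtin name or a "$"-prefixed declared type


@dataclass
class Struct:
    fields: list  # (field name, type reference) pairs, in ordinal order


def unresolved_references(types):
    """Return, sorted, every referenced name that is neither a builtin nor declared."""
    missing = set()
    for decl in types.values():
        if not isinstance(decl, Struct):
            continue
        for _, ref in decl.fields:
            name = ref.element if isinstance(ref, ListOf) else ref
            if name not in BUILTINS and name not in types:
                missing.add(name)
    return sorted(missing)


# The declarations from the pseudo-syntax, with nested names flattened.
TYPES = {
    "$Date": Struct([("year", "Int16"), ("month", "UInt8"), ("day", "UInt8")]),
    "$Person_PhoneNumber_Type": Enum([0, 1, 2]),
    "$Person_PhoneNumber": Struct(
        [("number", "Text"), ("type", "$Person_PhoneNumber_Type")]
    ),
    "$Person": Struct(
        [
            ("name", "Text"),
            ("email", "Text"),
            ("phones", ListOf("$Person_PhoneNumber")),
            ("birthdate", "$Date"),
        ]
    ),
}
```

An embedder could run this kind of check when it loads the bindings section, so that a dangling type name is rejected before any glue code is generated.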

Serializing from linear memory

Capnproto messages pass two types of data: segments (raw bytes), and capabilities.

These roughly map to WebAssembly's linear memory and tables. As such, the simplest possible way for WebAssembly to create capnproto messages would be to pass an offset and length to linear memory for segments, and an offset and length to a table for capabilities.

(A better approach could be devised for capabilities, to avoid runtime type checks.)

Note that the actual serialization computations would take place in the glue code, if at all (see Generating the glue code).

Binding operators

| Operator | Immediates | Children | Description |
|---|---|---|---|
| segment | off-idx, len-idx | | Takes the off-idx'th and len-idx'th wasm values of the source tuple, which must both be i32s, as the offset and length of a slice of linear memory in which a segment is stored. |
| captable | off-idx, len-idx | | Takes the off-idx'th and len-idx'th wasm values of the source tuple, which must both be i32s, as the offset and length of a slice of table in which the capability table is stored. |
| message | capnproto-type | capability-table, segments | Creates a capnproto message with the format capnproto-type, using the provided capability table and segments. |
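
As a rough illustration of how an embedder might evaluate these three operators, here is a Python sketch. It assumes linear memory is a `bytearray`, the wasm table is a Python list, and the source tuple is a tuple of integers; the function names mirror the operators, but their signatures and the dictionary returned by `message` are assumptions made for this example.

```python
def _slice_bounds(source, off_idx, len_idx, limit):
    """Read (offset, length) from the source tuple and bounds-check them."""
    off, length = source[off_idx], source[len_idx]
    if off < 0 or length < 0 or off + length > limit:
        raise IndexError("slice out of bounds")
    return off, length


def segment(source, memory, off_idx, len_idx):
    """`segment`: a zero-copy view over a slice of linear memory."""
    off, length = _slice_bounds(source, off_idx, len_idx, len(memory))
    return memoryview(memory)[off:off + length]


def captable(source, table, off_idx, len_idx):
    """`captable`: the slice of the table holding the capabilities."""
    off, length = _slice_bounds(source, off_idx, len_idx, len(table))
    return table[off:off + length]


def message(capnproto_type, capability_table, segments):
    """`message`: bundle a type name, a capability table and segments."""
    return {
        "type": capnproto_type,
        "capabilities": capability_table,
        "segments": segments,
    }


# Usage: the exported function returned (offset, length, cap offset, cap length).
memory = bytearray(b"....PERSON-BYTES....")
table = ["cap-a", "cap-b", "cap-c"]
source = (4, 12, 1, 2)

msg = message(
    "$Person",
    captable(source, table, 2, 3),
    [segment(source, memory, 0, 1)],
)
```

Note that no serialization happens here: the operators only describe where the bytes and capabilities live, which is what leaves the glue code free to copy them or not.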

Serializing from managed memory

It's difficult to pin down specific behavior before the GC proposal lands. But the general idea is that capnproto bindings would use a single conversion operator to get capnproto types from GC types.

The conversion rules for low-level types would be fairly straightforward: i8 converts to Int8, UInt8 and bool, i16 converts to Int16, etc. High-level types would convert to their capnproto equivalents: structure and array references convert to pointers, opaque references convert to capabilities.

A more complete proposal would need to define a strategy for enums and unions.
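
The conversion rules described above could be sketched as a lookup table. This is only an illustration: the GC-side labels (`struct-ref`, `array-ref`, `opaque-ref`) are hypothetical placeholders since the GC proposal hasn't landed, and the `UInt16` entry is an assumption made by analogy with the `i8` rule.

```python
# Low-level types: each wasm storage type may convert to several
# Cap'n Proto scalar types of the same width.
SCALAR_TARGETS = {
    "i8": {"Int8", "UInt8", "Bool"},
    "i16": {"Int16", "UInt16"},  # UInt16 is assumed, by analogy with i8
}

# High-level types: references convert to pointers or capabilities.
REFERENCE_TARGETS = {
    "struct-ref": "pointer",
    "array-ref": "pointer",
    "opaque-ref": "capability",
}


def can_convert(gc_type, capnproto_kind):
    """Return True if a value of gc_type may be bound to capnproto_kind."""
    if gc_type in SCALAR_TARGETS:
        return capnproto_kind in SCALAR_TARGETS[gc_type]
    return REFERENCE_TARGETS.get(gc_type) == capnproto_kind
```

For example, `can_convert("i8", "Bool")` holds, while `can_convert("opaque-ref", "pointer")` does not, because opaque references only map to capabilities.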

Binding operators

| Operator | Immediates | Children | Description |
|---|---|---|---|
| as | capnproto-type, idx | | Takes the idx'th wasm value of the source tuple, which must be a reference, and produces a capnproto value of capnproto-type. |

Deserializing to linear memory

Deserializing to linear memory is mostly similar to serializing from it, with one added caveat: the wasm code often doesn't know in advance how much memory the capnproto type will take, and needs to provide the host with some sort of dynamic memory management method.

In the WebIDL bindings proposal, the proposed solution is to pass allocator callbacks to the host function. For capnproto bindings, this method would be insufficient, because dynamic allocations need to happen both on the caller side and the callee side.

Another solution would be to allow incoming binding maps to bind to two incoming binding expressions (and thus two functions): one that allocates the memory for the capnproto data, and one that actually takes the data.
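
Here is a minimal Python sketch of that two-function scheme, assuming the receiving module exposes one function that allocates and one that takes the data. The `LinearModule` class, its bump allocator and the `deliver` helper are hypothetical; they stand in for a wasm module and for the host's glue code.

```python
class LinearModule:
    """Stand-in for a wasm module with its own linear memory."""

    def __init__(self, size):
        self.memory = bytearray(size)
        self._next_free = 0
        self.received = []

    def alloc(self, length):
        """First bound function: reserve `length` bytes, return their offset."""
        if self._next_free + length > len(self.memory):
            raise MemoryError("linear memory exhausted")
        offset = self._next_free
        self._next_free += length
        return offset

    def take(self, offset, length):
        """Second bound function: consume the data stored at the offset."""
        self.received.append(bytes(self.memory[offset:offset + length]))


def deliver(callee, payload):
    """Host-side glue: ask the callee for memory, copy, then hand over."""
    offset = callee.alloc(len(payload))
    callee.memory[offset:offset + len(payload)] = payload
    callee.take(offset, len(payload))
    return offset


module = LinearModule(64)
first = deliver(module, b"hello")
second = deliver(module, b"world!")
```

The same pair of functions works in both directions, which is what the allocator-callback approach lacks: the host never needs to know how the module manages its memory, only which function to call first.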

Deserializing to managed memory

Deserializing to managed memory would use the same kind of conversion operator as the opposite direction.

Generating the glue code

When linking two wasm modules together (whether statically or dynamically), the embedder should list all capnproto types common to both modules, bindings between function types and capnproto types, and generate glue code between every different pair of function types.

The glue code would depend on the types of the bound data. Glue code between linear memory bindings would boil down to memcpy calls. Glue code between managed memory bindings would boil down to passing references. On the other hand, glue code between linear and managed memory would involve more complicated nested conversion operations.

For instance, a Java module could export a function, taking the arguments as GC types, and bind that function to a typed signature; the interpreter should allow a Python module and a C++ module to import that type signature; the C++ binding would pass data from linear memory, whereas the Python binding would pass data from GC memory. The necessary conversions would be transparent to the Java, Python and C++ compilers.
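
The choice of glue code described in this section can be sketched as a small decision function. The `Binding` enum and the strategy labels are hypothetical names for this illustration; a real embedder would emit actual conversion code rather than return a string.

```python
from enum import Enum


class Binding(Enum):
    LINEAR = "linear"    # data lives in linear memory
    MANAGED = "managed"  # data lives in GC memory


def glue_strategy(caller, callee):
    """Pick the kind of glue code needed between two bound functions."""
    if caller is Binding.LINEAR and callee is Binding.LINEAR:
        return "memcpy"
    if caller is Binding.MANAGED and callee is Binding.MANAGED:
        return "pass-reference"
    # Mixed layouts need nested conversion operations.
    return "convert"


# The Java/Python/C++ example above: Java exports with a managed binding.
java_export = Binding.MANAGED
python_glue = glue_strategy(Binding.MANAGED, java_export)
cpp_glue = glue_strategy(Binding.LINEAR, java_export)
```

Under these assumptions the Python import resolves to reference passing while the C++ import resolves to a conversion, and neither compiler has to know which one the embedder picked.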

Alternate solutions

In this section, I'll examine alternate ways to exchange data, and how they rate on the metrics defined in the Requirements section.

Exchange JSON messages

It's the brute-force solution. I'm not going to spend too much time on that one, because its flaws are fairly obvious. It fails to meet requirements 2, 4 and 6.

Send raw bytes encoded in a serialization format

It's a partial solution. Define a way for wasm modules to pass slices of linear memory and tables to other modules, and module writers can then use a serialization format (capnproto, Protobuf or some other) to encode a structured graph into a sequence of bytes, pass the bytes, and use the same format to decode it.

It passes 1 and 3, and it can pass 2 and 4 with some tweaking (eg pass the references as indices to a table). It can pass 6 if the user makes sure to export the serialization type to a type definition in the caller's language.

However, it fails at requirements 5 and 7. It's impractical when binding between two GC implementations; for instance, a Python module calling a Java library through Protobuf would need to serialize a dictionary as linear memory, pass that slice of memory, and then deserialize it as a Java object, instead of making a few hashtable lookups that can be optimized away in a JIT implementation.

And it encourages each library writer to use their own serialization format (JSON, Protobuf, FlatBuffers, Cap'n Proto, SBE), which isn't ideal for interoperability; although that could be alleviated by defining a canonical serialization format in tool-conventions.

However, adding the possibility to pass arbitrary slices of linear memory would be a good first step.

Send GC objects

It would be possible to rely on modules sending each other GC objects.

The solution has some advantages: the GC proposal is already underway; it passes 1, 3, 4 and 7. Garbage-collected data is expensive to allocate, but cheap to pass around.

However, that solution is not ideal for C-like languages. For instance, a D module passing data to a Rust module would need to serialize its data into a GC graph, pass the graph to the Rust function, which would deserialize it into its linear memory. This process allocates GC nodes which are immediately discarded, for a lot of unnecessary overhead.

That aside, the current GC proposal has no built-in support for enums and unions; and error handling would either be at link time or run time instead of compile time, unless the compiler can read and understand wasm GC types.

Use other encodings

Any serialization library that defines a type system could work for wasm.

Capnproto seems most appropriate, because of its emphasis on zero-copy, and its built-in object capabilities which map neatly to reference types.

Remaining work

The following concepts would need to be fleshed out to turn this bare-bones proposal into a document that can be submitted to the Community Group.

  • Binding operators
  • GC type equivalences
  • Object capabilities
  • Bool arrays
  • Arrays
  • Constants
  • Generics
  • Type evolution
  • Add a third "getters and setters" binding type.
  • Possible caching strategies
  • Support for multiple tables and linear memories

In the meantime, any feedback on what I've already written would be welcome. The scope here is pretty vast, so I'd appreciate help narrowing down what questions this proposal needs to answer.
