Let's make surface creation safe!

There has been some discussion in the past over how to make create_surface safe:

https://github.com/gfx-rs/wgpu-rs/issues/78

https://github.com/gfx-rs/wgpu-rs/pull/281#discussion_r415495633

From these comments, my impression was that the only issue was not being able to trust the pointers provided by the type implementing https://docs.rs/raw-window-handle/0.3.3/raw_window_handle/trait.HasRawWindowHandle.html.  However, this is an `unsafe trait` and the requirements it puts on its implementors are actually quite strong:

> Safety guarantees

> **Users can safely assume that non-null/0 fields are valid handles**, and it is up to the implementer of this trait to ensure that condition is upheld. However, *It is entirely valid behavior for fields within each platform-specific RawWindowHandle variant to be null or 0, and appropriate checking should be done before the handle is used.**

> Despite that qualification, implementers should still make a best-effort attempt to fill in all available fields. If an implementation doesn't, and a downstream user needs the field, it should try to derive the field from other fields the implementer does provide via whatever methods the platform provides.

> **The exact handles returned by raw_window_handle must remain consistent between multiple calls to raw_window_handle, and must be valid for at least the lifetime of the HasRawWindowHandle implementer.**

(emphasis mine).

We can instantly see that just by taking `&W` where `W: HasRawWindowHandle`, we do not need to worry about arbitrary pointers or integers--we can rely on the guarantees of `HasRawWindowHandle` alone, which ensures that as long as the underlying `W` instance is alive:

* The handles won't change.
* All handles will either be valid, or null/0 (depending on platform).

Since wgpu already needs to deal with handles in a backend-sensitive way, it makes perfect sense for it to just check all the individual handles.  Problem solved, right?  Unfortunately, it turns out that while this is the only *documented* source of unsafety, it is not the only source of unsafety when using a surface created by `create_surface`.  @cwfitzgerald gave the following code as an example, when performing rendering using a task system off the main thread:

```rust
frame = None;
loop {
    MainEventsCleared => {
        wait_for_frame(frame);

        frame = Some(render());
    }
}
```

When trying to close a window, this event loop would segfault on some Linux systems, but moving `wait_for_frame` below the `frame = _` portion (at the end of the `MainEventsCleaared` arm) would prevent the segfault.  Here, `wait_for_frame` waits for all commands to be submitted from the CPU to the GPU (but does not wait for the GPU).  This suggests that after exiting the event loop (dropping the window), the window surface may be freed on the CPU on some platforms while there are still in-flight commands being sent to the surface from another thread; presumably, someone (the window manager or graphics driver) is able to make sure device resources are not reused until rendering is complete, but the same is not true of host resources.

The implication here is clear: in general, we can't treat a surface as valid unless the underlying window from which it was created is valid.  Here, "window" doesn't necessarily have to be a window (although in practice it usually will be), it just has to be the object that owns whatever resource is referred to by the handle it passes to wgpu.  Note that in a wgpu context, we only care about "real" surfaces with valid handles.

While on some platforms (e.g. Android), it seems that a window is always kept alive as long as the surface is, on others (e.g. X11 on Linux), it seems like the surface may be deallocated together with the Window (we could not inspect the actual implementation of https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/vkCreateXlibSurfaceKHR.html so we could not verify this for certain, but either way the Vulkan documentation doesn't indicate that there's any requirement that the surface hold onto a reference to the window).  So our first conclusion is that if we want to be safe, we should make sure--within Rust--that a window always outlives the surface from which it was created.

We discussed several approaches here, but the most obvious is this one:

1. Surface is parameterized by its underlying window: `Surface<W>`.
2. The safe version of `create_surface` (in wgpu) takes `W: HasRawWindowHandle` *by value* rather than by reference.
3. The safe version of `create_surface` (in wgpu) validates any handles it uses are non-0 / non-null (depending on platform), runs all the checks and resource creation steps it currently does, and then stores `W` *within* the `Surface` structure.
4. `Surface<W>` exposes a few methods to get back the window data: `fn window(&self) -> &W`, and `fn into_window(self) -> W`.  Note that we do *not* provide `fn window_mut(&mut self) -> &mut W`, for reasons explained below.

First, safety arguments for why this makes it okay to use the window handle associated with the surface:

- From the definition of `HasRawWindowHandle`, as long as the underlying `W` is alive, the window handles will not change after repeated calls, so we know our checks from (3) will continue to be valid.
- From the definition of `HasRawWindowHandle`, as long as the underlying `W` is alive, the handle returned from a call will refer to a live window, so we know the window is valid as long as the instance of `W` is; which is at least as long as the `Surface` itself, *provided* that we don't provide any means of partially moving out of the `Window`.
- And, because we don't provide `&mut W` access or other partial window move methods, there is no way to alter the stored window without dropping the whole `Surface` (this is why we can't implement `deref_mut()`).

These may seem like uncomfortably strong restrictions on the window used for gfx-rs.  We address these concerns in two ways:

- Technically: We propose adding several implementations of `HasRawWindowHandle` upstream in the `raw-window-handle` crate (all of which we believe are justified by the requirements on `HasRawWindowHandle`):

```rust

impl<'a, T: HasRawWindowHandle> HasRawWindowHandle for &'a T { /* .. */ }
impl<T: HasRawWindowHandle> HasRawWindowHandle for Rc<T> { /* .. */ }
impl<T: HasRawWindowHandle> HasRawWindowHandle for Arc<T> { /* .. */ }

```

This means that in order to share access to the `Window` with the rest of an application, one can simply pass `Arc<W>`, `Rc<W>`, or `&W` (and possibly other implementations, should they prove useful) to `create_surface`.  Due to the nature of how windows work (all meaningful operations on them must work in the context of a separate device being able to query them through a duplicable handle), we know that the important, shared component of the window (the part that needs to be alive when the surface is) must have interior mutability of some sort if it can be mutated at all; for actually existing window APIs, this interior mutability is made safe dynamically rather than within the type system, so we don't lose anything by preventing `&mut` access to the window while wgpu may have references to it.  In exchange, we get a powerful safety argument that relies solely on the correct implementation of `HasRawWindowHandle` and `create_surface`, rather than requiring detailed investigation of every possible window management API.

We also examine practical, large wgpu projects to see if these restrictons would be onerous.  From my own work on Veloren I know that these restrictions would be no problem for us.  We also looked at Bevy, Dotrix, and Iced, and manually verified that they could move to this API with minimal changes:

- The Bevy API has a global hashmap from window ids to winit windows and could pretty clearly be turned into a map from window ids to `Arc<winit::Window>`.
  It requires `&mut` access only to its own internal `bevy::Window`, which is distinct from the one to which you have a handle, and even this is only needed
  during window creation right before it's inserted into the hash map.  So there would be no issue with switching to `Arc<winit::Window>`.
- For Dotrix, the event loop gets an owned `winit::Window` and it could easily have `Renderer` own it: https://github.com/lowenware/dotrix/blob/fix/wgpu-pipeline-creation/src/application.rs#L67.  `Arc<winit::Window>` would make it even easier, but is not required; the code that needs access to the window can always get at the underlying `Renderer` instance by accessing it through the global services structure, so if the renderer owned the `winit::Window` (indirectly through the surface), it would work fine (`Renderer::new` constructs and stores the surface).
- For iced, it will require a minor change to their compositor API (currently, it basically just propagates up the requirement for `&W` instead of `W`, but as that API is unsafe it will have to be changed regardless).  In the actual examples I could find, the window is easily available by value, however, e.g. https://github.com/hecrj/iced/blob/ff15ebc54778dee2a7262469c8a2bcd5abecd4d1/examples/integration/src/main.rs and https://github.com/hecrj/iced/blob/1f7e8b7f3d1804c39c8e0934b25f3ef178de269c/wgpu/src/window/compositor.rs.

So we can see that our theoretical expectations translate well to practice.  We can also see that in all known big examples, `winit::Window` is what's providing the `RawWindowHandle`, which makes sense since it works to abstract window creation across as many platforms as possible.  As such, verifying that `winit::Window`'s implementation of `HasRawWindowHandle` satisfies the trait requirements, together with correctness of `wgpu`'s implementation of functions on `Surface`, will in practice be sufficient to verify the safety of real world code using this API (which is important because it means that we are not simply hiding an added unsafety burden by moving it somewhere else in the stack).

However, this is *still* not enough!  This is because we still haven't addressed the thing you actually *do* with surfaces, which is use them with swap chains.  Swap chains are currently produced from the combination of a surface and a compatible device.  When `get_next_frame()` is called on a swap chain, we acquire a texture resource on the device; the resource is compatible with the target surface.  If successful, we can then render to the texture resource from our device, and then finally (on dropping the swap frame) "release" the texture, and it is queued for presentation to the surface.  Clearly, the safety of all of this depends crucially on both the device and surface remaining alive for the duration of the swap chain, but this is not guaranteed by the current API!

As discussed on Matrix, the swap chain API is still very much in flux, with no major players having made a firm commitment to any particular API.  However, wgpu itself has been learning from how people use it (together with constraints imposed by cross-platform requirements), and the current thought is that *surfaces* should own the queue currently associated with swap chains; a surface will own one swap chain at a time, and (from our discussion) wgpu has no interest in surfaces that are not part of a swap chain.  This takes care of one safety issue: if a surface replaces the functionality of the swap chain, then clearly the swap chain cannot outlive the surface!

Since the swap chain needs to allocate device memory, this also implies that the swap chain should maintain a strong reference to the backing `Device`.  It is possible that given that `Window` is alive, the existing gfx-rs `Context` mechanism will be sufficient to prevent `Device` host memory from being deallocated, since unlike with `Window` we started out with ownership over the device data; however, it's possible that the current complex dependency graph around swap chains, swap chain frames, devices, surfaces, etc. could be greatly simplified by maintaining a direct `Arc<Device>`.  Either of these mechanisms should be sufficient to ensure safety of *creating* a swap frame while being reasonably flexible.

To make sure that a swapchain *frame* is still alive, the most straightforward method would be to add a lifetime to `SwapChainFrame`; `get_next_frame(&'a self) -> SwapChainFrame<'a>` would be the new signature defined on `Surface`, and `SwapChainFrame` would maintain a reference to the `Surface` that instantiated it.  If this is too inflexible for some purposes (although I suspect most could be made to fit into this pattern), we could define on the method as `get_next_frame<S: Borrow<Self>>(surface: S) -> SwapChainFrame<S>`, which would allow using `Rc<Surface>`, `Arc<Surface>`, `Surface`, `&Surface`, etc. as desired (we woul also want to provide an `into_surface()` and `surface()` method as usual); though we might not be able to get away with this safely for arbitrary `Borrow<Self>`, we could at least hardcode the stuff we know works.  Alternately, we could again rely on reference counting / pointer following in the shared context, and avoid any explicit references (as `Surface` would already take care of them).

---

I believe this plan closes all the potential soundness holes in the current API around windows and surfaces.  There may still be implementation issues, of course, but the safety argument should hold as long as `HasRawWindowHandle` is implemented properly.

---

So, to recap:

- `create_surface` needs to check for null and zero handles as appropriate (in a cross-platform way).
- The `Surface` will own the window `W` and provide `&W` access and `into_window(self)`, but not `&mut W`.
- `Surface` will own the resources needed for the actual swap chain (queue etc.) and those resources in turn should hold onto the Device.
- Swap chain frames will hold references to the `Surface` (and therefore `Device`) that produced them when they are acquired (exact mechanism TBD).
- From manual inspection, all known major users of wgpu can easily accommodate this API change, as long as it comes together with `HasRawWindowHandle` for `Arc<W: AsRawWindowHandle>` and related instances implemented upstream (technically I think all of them can actually own the `Window`, but the `Arc` instance makes it easier).
- Jointly, this all makes creating a surface and acquiring a swap chain (which are the two main operations on surfaces) safe regardless of what safe code does with the window; all the safety requirements rely solely on a correct implementation of unsafe trait `AsRawWindowHandle` (needs to satisfy the requirements it already has).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Let's make surface creation safe! #1463

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Let's make surface creation safe! #1463

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions