Skip to content

Let's make surface creation safe! #1463

@pythonesque

Description

@pythonesque

There has been some discussion in the past over how to make create_surface safe:

gfx-rs/wgpu-rs#78

gfx-rs/wgpu-rs#281 (comment)

From these comments, my impression was that the only issue was not being able to trust the pointers provided by the type implementing https://docs.rs/raw-window-handle/0.3.3/raw_window_handle/trait.HasRawWindowHandle.html. However, this is an unsafe trait and the requirements it puts on its implementors are actually quite strong:

Safety guarantees

Users can safely assume that non-null/0 fields are valid handles, and it is up to the implementer of this trait to ensure that condition is upheld. However, It is entirely valid behavior for fields within each platform-specific RawWindowHandle variant to be null or 0, and appropriate checking should be done before the handle is used.*

Despite that qualification, implementers should still make a best-effort attempt to fill in all available fields. If an implementation doesn't, and a downstream user needs the field, it should try to derive the field from other fields the implementer does provide via whatever methods the platform provides.

The exact handles returned by raw_window_handle must remain consistent between multiple calls to raw_window_handle, and must be valid for at least the lifetime of the HasRawWindowHandle implementer.

(emphasis mine).

We can instantly see that just by taking &W where W: HasRawWindowHandle, we do not need to worry about arbitrary pointers or integers--we can rely on the guarantees of HasRawWindowHandle alone, which ensures that as long as the underlying W instance is alive:

  • The handles won't change.
  • All handles will either be valid, or null/0 (depending on platform).

Since wgpu already needs to deal with handles in a backend-sensitive way, it makes perfect sense for it to just check all the individual handles. Problem solved, right? Unfortunately, it turns out that while this is the only documented source of unsafety, it is not the only source of unsafety when using a surface created by create_surface. @cwfitzgerald gave the following code as an example, when performing rendering using a task system off the main thread:

frame = None;
loop {
    MainEventsCleared => {
        wait_for_frame(frame);

        frame = Some(render());
    }
}

When trying to close a window, this event loop would segfault on some Linux systems, but moving wait_for_frame below the frame = _ portion (at the end of the MainEventsCleaared arm) would prevent the segfault. Here, wait_for_frame waits for all commands to be submitted from the CPU to the GPU (but does not wait for the GPU). This suggests that after exiting the event loop (dropping the window), the window surface may be freed on the CPU on some platforms while there are still in-flight commands being sent to the surface from another thread; presumably, someone (the window manager or graphics driver) is able to make sure device resources are not reused until rendering is complete, but the same is not true of host resources.

The implication here is clear: in general, we can't treat a surface as valid unless the underlying window from which it was created is valid. Here, "window" doesn't necessarily have to be a window (although in practice it usually will be), it just has to be the object that owns whatever resource is referred to by the handle it passes to wgpu. Note that in a wgpu context, we only care about "real" surfaces with valid handles.

While on some platforms (e.g. Android), it seems that a window is always kept alive as long as the surface is, on others (e.g. X11 on Linux), it seems like the surface may be deallocated together with the Window (we could not inspect the actual implementation of https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/vkCreateXlibSurfaceKHR.html so we could not verify this for certain, but either way the Vulkan documentation doesn't indicate that there's any requirement that the surface hold onto a reference to the window). So our first conclusion is that if we want to be safe, we should make sure--within Rust--that a window always outlives the surface from which it was created.

We discussed several approaches here, but the most obvious is this one:

  1. Surface is parameterized by its underlying window: Surface<W>.
  2. The safe version of create_surface (in wgpu) takes W: HasRawWindowHandle by value rather than by reference.
  3. The safe version of create_surface (in wgpu) validates any handles it uses are non-0 / non-null (depending on platform), runs all the checks and resource creation steps it currently does, and then stores W within the Surface structure.
  4. Surface<W> exposes a few methods to get back the window data: fn window(&self) -> &W, and fn into_window(self) -> W. Note that we do not provide fn window_mut(&mut self) -> &mut W, for reasons explained below.

First, safety arguments for why this makes it okay to use the window handle associated with the surface:

  • From the definition of HasRawWindowHandle, as long as the underlying W is alive, the window handles will not change after repeated calls, so we know our checks from (3) will continue to be valid.
  • From the definition of HasRawWindowHandle, as long as the underlying W is alive, the handle returned from a call will refer to a live window, so we know the window is valid as long as the instance of W is; which is at least as long as the Surface itself, provided that we don't provide any means of partially moving out of the Window.
  • And, because we don't provide &mut W access or other partial window move methods, there is no way to alter the stored window without dropping the whole Surface (this is why we can't implement deref_mut()).

These may seem like uncomfortably strong restrictions on the window used for gfx-rs. We address these concerns in two ways:

  • Technically: We propose adding several implementations of HasRawWindowHandle upstream in the raw-window-handle crate (all of which we believe are justified by the requirements on HasRawWindowHandle):
impl<'a, T: HasRawWindowHandle> HasRawWindowHandle for &'a T { /* .. */ }
impl<T: HasRawWindowHandle> HasRawWindowHandle for Rc<T> { /* .. */ }
impl<T: HasRawWindowHandle> HasRawWindowHandle for Arc<T> { /* .. */ }

This means that in order to share access to the Window with the rest of an application, one can simply pass Arc<W>, Rc<W>, or &W (and possibly other implementations, should they prove useful) to create_surface. Due to the nature of how windows work (all meaningful operations on them must work in the context of a separate device being able to query them through a duplicable handle), we know that the important, shared component of the window (the part that needs to be alive when the surface is) must have interior mutability of some sort if it can be mutated at all; for actually existing window APIs, this interior mutability is made safe dynamically rather than within the type system, so we don't lose anything by preventing &mut access to the window while wgpu may have references to it. In exchange, we get a powerful safety argument that relies solely on the correct implementation of HasRawWindowHandle and create_surface, rather than requiring detailed investigation of every possible window management API.

We also examine practical, large wgpu projects to see if these restrictons would be onerous. From my own work on Veloren I know that these restrictions would be no problem for us. We also looked at Bevy, Dotrix, and Iced, and manually verified that they could move to this API with minimal changes:

So we can see that our theoretical expectations translate well to practice. We can also see that in all known big examples, winit::Window is what's providing the RawWindowHandle, which makes sense since it works to abstract window creation across as many platforms as possible. As such, verifying that winit::Window's implementation of HasRawWindowHandle satisfies the trait requirements, together with correctness of wgpu's implementation of functions on Surface, will in practice be sufficient to verify the safety of real world code using this API (which is important because it means that we are not simply hiding an added unsafety burden by moving it somewhere else in the stack).

However, this is still not enough! This is because we still haven't addressed the thing you actually do with surfaces, which is use them with swap chains. Swap chains are currently produced from the combination of a surface and a compatible device. When get_next_frame() is called on a swap chain, we acquire a texture resource on the device; the resource is compatible with the target surface. If successful, we can then render to the texture resource from our device, and then finally (on dropping the swap frame) "release" the texture, and it is queued for presentation to the surface. Clearly, the safety of all of this depends crucially on both the device and surface remaining alive for the duration of the swap chain, but this is not guaranteed by the current API!

As discussed on Matrix, the swap chain API is still very much in flux, with no major players having made a firm commitment to any particular API. However, wgpu itself has been learning from how people use it (together with constraints imposed by cross-platform requirements), and the current thought is that surfaces should own the queue currently associated with swap chains; a surface will own one swap chain at a time, and (from our discussion) wgpu has no interest in surfaces that are not part of a swap chain. This takes care of one safety issue: if a surface replaces the functionality of the swap chain, then clearly the swap chain cannot outlive the surface!

Since the swap chain needs to allocate device memory, this also implies that the swap chain should maintain a strong reference to the backing Device. It is possible that given that Window is alive, the existing gfx-rs Context mechanism will be sufficient to prevent Device host memory from being deallocated, since unlike with Window we started out with ownership over the device data; however, it's possible that the current complex dependency graph around swap chains, swap chain frames, devices, surfaces, etc. could be greatly simplified by maintaining a direct Arc<Device>. Either of these mechanisms should be sufficient to ensure safety of creating a swap frame while being reasonably flexible.

To make sure that a swapchain frame is still alive, the most straightforward method would be to add a lifetime to SwapChainFrame; get_next_frame(&'a self) -> SwapChainFrame<'a> would be the new signature defined on Surface, and SwapChainFrame would maintain a reference to the Surface that instantiated it. If this is too inflexible for some purposes (although I suspect most could be made to fit into this pattern), we could define on the method as get_next_frame<S: Borrow<Self>>(surface: S) -> SwapChainFrame<S>, which would allow using Rc<Surface>, Arc<Surface>, Surface, &Surface, etc. as desired (we woul also want to provide an into_surface() and surface() method as usual); though we might not be able to get away with this safely for arbitrary Borrow<Self>, we could at least hardcode the stuff we know works. Alternately, we could again rely on reference counting / pointer following in the shared context, and avoid any explicit references (as Surface would already take care of them).


I believe this plan closes all the potential soundness holes in the current API around windows and surfaces. There may still be implementation issues, of course, but the safety argument should hold as long as HasRawWindowHandle is implemented properly.


So, to recap:

  • create_surface needs to check for null and zero handles as appropriate (in a cross-platform way).
  • The Surface will own the window W and provide &W access and into_window(self), but not &mut W.
  • Surface will own the resources needed for the actual swap chain (queue etc.) and those resources in turn should hold onto the Device.
  • Swap chain frames will hold references to the Surface (and therefore Device) that produced them when they are acquired (exact mechanism TBD).
  • From manual inspection, all known major users of wgpu can easily accommodate this API change, as long as it comes together with HasRawWindowHandle for Arc<W: AsRawWindowHandle> and related instances implemented upstream (technically I think all of them can actually own the Window, but the Arc instance makes it easier).
  • Jointly, this all makes creating a surface and acquiring a swap chain (which are the two main operations on surfaces) safe regardless of what safe code does with the window; all the safety requirements rely solely on a correct implementation of unsafe trait AsRawWindowHandle (needs to satisfy the requirements it already has).

Metadata

Metadata

Assignees

No one assigned

    Labels

    area: wsiIssues with swapchain management or windowinghelp requiredWe need community help to make this happen.type: enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions