-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Let's make surface creation safe! #1463
Description
There has been some discussion in the past over how to make create_surface safe:
From these comments, my impression was that the only issue was not being able to trust the pointers provided by the type implementing https://docs.rs/raw-window-handle/0.3.3/raw_window_handle/trait.HasRawWindowHandle.html. However, this is an unsafe trait and the requirements it puts on its implementors are actually quite strong:
Safety guarantees
Users can safely assume that non-null/0 fields are valid handles, and it is up to the implementer of this trait to ensure that condition is upheld. However, It is entirely valid behavior for fields within each platform-specific RawWindowHandle variant to be null or 0, and appropriate checking should be done before the handle is used.*
Despite that qualification, implementers should still make a best-effort attempt to fill in all available fields. If an implementation doesn't, and a downstream user needs the field, it should try to derive the field from other fields the implementer does provide via whatever methods the platform provides.
The exact handles returned by raw_window_handle must remain consistent between multiple calls to raw_window_handle, and must be valid for at least the lifetime of the HasRawWindowHandle implementer.
(emphasis mine).
We can instantly see that just by taking &W where W: HasRawWindowHandle, we do not need to worry about arbitrary pointers or integers--we can rely on the guarantees of HasRawWindowHandle alone, which ensures that as long as the underlying W instance is alive:
- The handles won't change.
- All handles will either be valid, or null/0 (depending on platform).
Since wgpu already needs to deal with handles in a backend-sensitive way, it makes perfect sense for it to just check all the individual handles. Problem solved, right? Unfortunately, it turns out that while this is the only documented source of unsafety, it is not the only source of unsafety when using a surface created by create_surface. @cwfitzgerald gave the following code as an example, when performing rendering using a task system off the main thread:
frame = None;
loop {
MainEventsCleared => {
wait_for_frame(frame);
frame = Some(render());
}
}When trying to close a window, this event loop would segfault on some Linux systems, but moving wait_for_frame below the frame = _ portion (at the end of the MainEventsCleaared arm) would prevent the segfault. Here, wait_for_frame waits for all commands to be submitted from the CPU to the GPU (but does not wait for the GPU). This suggests that after exiting the event loop (dropping the window), the window surface may be freed on the CPU on some platforms while there are still in-flight commands being sent to the surface from another thread; presumably, someone (the window manager or graphics driver) is able to make sure device resources are not reused until rendering is complete, but the same is not true of host resources.
The implication here is clear: in general, we can't treat a surface as valid unless the underlying window from which it was created is valid. Here, "window" doesn't necessarily have to be a window (although in practice it usually will be), it just has to be the object that owns whatever resource is referred to by the handle it passes to wgpu. Note that in a wgpu context, we only care about "real" surfaces with valid handles.
While on some platforms (e.g. Android), it seems that a window is always kept alive as long as the surface is, on others (e.g. X11 on Linux), it seems like the surface may be deallocated together with the Window (we could not inspect the actual implementation of https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/vkCreateXlibSurfaceKHR.html so we could not verify this for certain, but either way the Vulkan documentation doesn't indicate that there's any requirement that the surface hold onto a reference to the window). So our first conclusion is that if we want to be safe, we should make sure--within Rust--that a window always outlives the surface from which it was created.
We discussed several approaches here, but the most obvious is this one:
- Surface is parameterized by its underlying window:
Surface<W>. - The safe version of
create_surface(in wgpu) takesW: HasRawWindowHandleby value rather than by reference. - The safe version of
create_surface(in wgpu) validates any handles it uses are non-0 / non-null (depending on platform), runs all the checks and resource creation steps it currently does, and then storesWwithin theSurfacestructure. Surface<W>exposes a few methods to get back the window data:fn window(&self) -> &W, andfn into_window(self) -> W. Note that we do not providefn window_mut(&mut self) -> &mut W, for reasons explained below.
First, safety arguments for why this makes it okay to use the window handle associated with the surface:
- From the definition of
HasRawWindowHandle, as long as the underlyingWis alive, the window handles will not change after repeated calls, so we know our checks from (3) will continue to be valid. - From the definition of
HasRawWindowHandle, as long as the underlyingWis alive, the handle returned from a call will refer to a live window, so we know the window is valid as long as the instance ofWis; which is at least as long as theSurfaceitself, provided that we don't provide any means of partially moving out of theWindow. - And, because we don't provide
&mut Waccess or other partial window move methods, there is no way to alter the stored window without dropping the wholeSurface(this is why we can't implementderef_mut()).
These may seem like uncomfortably strong restrictions on the window used for gfx-rs. We address these concerns in two ways:
- Technically: We propose adding several implementations of
HasRawWindowHandleupstream in theraw-window-handlecrate (all of which we believe are justified by the requirements onHasRawWindowHandle):
impl<'a, T: HasRawWindowHandle> HasRawWindowHandle for &'a T { /* .. */ }
impl<T: HasRawWindowHandle> HasRawWindowHandle for Rc<T> { /* .. */ }
impl<T: HasRawWindowHandle> HasRawWindowHandle for Arc<T> { /* .. */ }This means that in order to share access to the Window with the rest of an application, one can simply pass Arc<W>, Rc<W>, or &W (and possibly other implementations, should they prove useful) to create_surface. Due to the nature of how windows work (all meaningful operations on them must work in the context of a separate device being able to query them through a duplicable handle), we know that the important, shared component of the window (the part that needs to be alive when the surface is) must have interior mutability of some sort if it can be mutated at all; for actually existing window APIs, this interior mutability is made safe dynamically rather than within the type system, so we don't lose anything by preventing &mut access to the window while wgpu may have references to it. In exchange, we get a powerful safety argument that relies solely on the correct implementation of HasRawWindowHandle and create_surface, rather than requiring detailed investigation of every possible window management API.
We also examine practical, large wgpu projects to see if these restrictons would be onerous. From my own work on Veloren I know that these restrictions would be no problem for us. We also looked at Bevy, Dotrix, and Iced, and manually verified that they could move to this API with minimal changes:
- The Bevy API has a global hashmap from window ids to winit windows and could pretty clearly be turned into a map from window ids to
Arc<winit::Window>.
It requires&mutaccess only to its own internalbevy::Window, which is distinct from the one to which you have a handle, and even this is only needed
during window creation right before it's inserted into the hash map. So there would be no issue with switching toArc<winit::Window>. - For Dotrix, the event loop gets an owned
winit::Windowand it could easily haveRendererown it: https://github.com/lowenware/dotrix/blob/fix/wgpu-pipeline-creation/src/application.rs#L67.Arc<winit::Window>would make it even easier, but is not required; the code that needs access to the window can always get at the underlyingRendererinstance by accessing it through the global services structure, so if the renderer owned thewinit::Window(indirectly through the surface), it would work fine (Renderer::newconstructs and stores the surface). - For iced, it will require a minor change to their compositor API (currently, it basically just propagates up the requirement for
&Winstead ofW, but as that API is unsafe it will have to be changed regardless). In the actual examples I could find, the window is easily available by value, however, e.g. https://github.com/hecrj/iced/blob/ff15ebc54778dee2a7262469c8a2bcd5abecd4d1/examples/integration/src/main.rs and https://github.com/hecrj/iced/blob/1f7e8b7f3d1804c39c8e0934b25f3ef178de269c/wgpu/src/window/compositor.rs.
So we can see that our theoretical expectations translate well to practice. We can also see that in all known big examples, winit::Window is what's providing the RawWindowHandle, which makes sense since it works to abstract window creation across as many platforms as possible. As such, verifying that winit::Window's implementation of HasRawWindowHandle satisfies the trait requirements, together with correctness of wgpu's implementation of functions on Surface, will in practice be sufficient to verify the safety of real world code using this API (which is important because it means that we are not simply hiding an added unsafety burden by moving it somewhere else in the stack).
However, this is still not enough! This is because we still haven't addressed the thing you actually do with surfaces, which is use them with swap chains. Swap chains are currently produced from the combination of a surface and a compatible device. When get_next_frame() is called on a swap chain, we acquire a texture resource on the device; the resource is compatible with the target surface. If successful, we can then render to the texture resource from our device, and then finally (on dropping the swap frame) "release" the texture, and it is queued for presentation to the surface. Clearly, the safety of all of this depends crucially on both the device and surface remaining alive for the duration of the swap chain, but this is not guaranteed by the current API!
As discussed on Matrix, the swap chain API is still very much in flux, with no major players having made a firm commitment to any particular API. However, wgpu itself has been learning from how people use it (together with constraints imposed by cross-platform requirements), and the current thought is that surfaces should own the queue currently associated with swap chains; a surface will own one swap chain at a time, and (from our discussion) wgpu has no interest in surfaces that are not part of a swap chain. This takes care of one safety issue: if a surface replaces the functionality of the swap chain, then clearly the swap chain cannot outlive the surface!
Since the swap chain needs to allocate device memory, this also implies that the swap chain should maintain a strong reference to the backing Device. It is possible that given that Window is alive, the existing gfx-rs Context mechanism will be sufficient to prevent Device host memory from being deallocated, since unlike with Window we started out with ownership over the device data; however, it's possible that the current complex dependency graph around swap chains, swap chain frames, devices, surfaces, etc. could be greatly simplified by maintaining a direct Arc<Device>. Either of these mechanisms should be sufficient to ensure safety of creating a swap frame while being reasonably flexible.
To make sure that a swapchain frame is still alive, the most straightforward method would be to add a lifetime to SwapChainFrame; get_next_frame(&'a self) -> SwapChainFrame<'a> would be the new signature defined on Surface, and SwapChainFrame would maintain a reference to the Surface that instantiated it. If this is too inflexible for some purposes (although I suspect most could be made to fit into this pattern), we could define on the method as get_next_frame<S: Borrow<Self>>(surface: S) -> SwapChainFrame<S>, which would allow using Rc<Surface>, Arc<Surface>, Surface, &Surface, etc. as desired (we woul also want to provide an into_surface() and surface() method as usual); though we might not be able to get away with this safely for arbitrary Borrow<Self>, we could at least hardcode the stuff we know works. Alternately, we could again rely on reference counting / pointer following in the shared context, and avoid any explicit references (as Surface would already take care of them).
I believe this plan closes all the potential soundness holes in the current API around windows and surfaces. There may still be implementation issues, of course, but the safety argument should hold as long as HasRawWindowHandle is implemented properly.
So, to recap:
create_surfaceneeds to check for null and zero handles as appropriate (in a cross-platform way).- The
Surfacewill own the windowWand provide&Waccess andinto_window(self), but not&mut W. Surfacewill own the resources needed for the actual swap chain (queue etc.) and those resources in turn should hold onto the Device.- Swap chain frames will hold references to the
Surface(and thereforeDevice) that produced them when they are acquired (exact mechanism TBD). - From manual inspection, all known major users of wgpu can easily accommodate this API change, as long as it comes together with
HasRawWindowHandleforArc<W: AsRawWindowHandle>and related instances implemented upstream (technically I think all of them can actually own theWindow, but theArcinstance makes it easier). - Jointly, this all makes creating a surface and acquiring a swap chain (which are the two main operations on surfaces) safe regardless of what safe code does with the window; all the safety requirements rely solely on a correct implementation of unsafe trait
AsRawWindowHandle(needs to satisfy the requirements it already has).