Replace WinRT clipboard API with Win32 for pasting#15360
Replace WinRT clipboard API with Win32 for pasting#15360microsoft-github-policy-service[bot] merged 6 commits intomainfrom
Conversation
25c42f8 to
ac6abff
Compare
This comment has been minimized.
This comment has been minimized.
ac6abff to
515d3d3
Compare
This comment has been minimized.
This comment has been minimized.
515d3d3 to
34caf03
Compare
metathinker
left a comment
There was a problem hiding this comment.
Hi, former developer-maintainer of the Windows clipboard implementation here - sorry for the hit and run review, but when I see clipboard, I'm always a bit intrigued :)
| // The old Win32 clipboard API as used below is somewhere in the order of 1000x faster than the WinRT one, | ||
| // which has about the same latency as a fibre optics connection across the entire WA state. | ||
| // Don't use the WinRT clipboard API. RPC calls are not funny. |
There was a problem hiding this comment.
Question: I know you said you tested this, but I'm still surprised. Is calling UWP Clipboard.GetContent() or DataPackageView.GetWhateverAsync() really that slow? I'd be a bit interested to see how you tested and measured this.
My memory is rusty but my belief is that UWP DataTransfer for clipboard shouldn't be a cross-process RPC call in all cases - just ones where the source data is hosted in an app that clipped it using the UWP or OLE clipboard APIs, forcing a cross-process COM call (IDataObject::Whatever()). (As opposed to when the source item came from the Win32 clipboard; in that case, UWP clipboard impl. in the destination should "just" do the win32k system call.)
There was a problem hiding this comment.
I tested it using some code that I have already deleted. To make the new comparison as realistic as possible, I've chosen to implement the benchmark inside Windows Terminal itself and modified like so:
fire_and_forget TerminalPage::_PasteFromClipboardHandler(const IInspectable /*sender*/, const PasteFromClipboardEventArgs eventArgs)
try
{
const auto beg = std::chrono::high_resolution_clock::now();
winrt::hstring text;
for (int i = 0; i < 1000; i++)
{
const auto data = Clipboard::GetContent();
if (data.Contains(StandardDataFormats::Text()))
{
text = co_await data.GetTextAsync();
}
// Windows Explorer's "Copy address" menu item stores a StorageItem in the clipboard, and no text.
else if (data.Contains(StandardDataFormats::StorageItems()))
{
auto items = co_await data.GetStorageItemsAsync();
if (items.Size() > 0)
{
auto item = items.GetAt(0);
text = item.Path();
}
}
}
const auto end = std::chrono::high_resolution_clock::now();
const auto dur = std::chrono::duration<float, std::milli>(end - beg).count() / 1000.0f;
wchar_t buffer[64];
swprintf_s(buffer, L"%f\n", dur);
OutputDebugStringW(&buffer[0]);
// ... remaining contents of the function ...
}
CATCH_LOG();Pasting text from X takes Y ms per iteration:
- Edge (Win32 clipboard API): 2.5ms
- Notepad: 3ms
- Explorer: 2.4ms
- The same instance of Windows Terminal (WinRT API): 2.1ms
- A different instance of Windows Terminal: also 2.1ms
- My bodgy clipboard code: 2.7ms
It seems there's no difference in latency for us between whether an application uses the WinRT or Win32 API for setting the clipboard contents. This made me curious whether the air got to my head once again and if I got everything, I said above wrong. So, I've tried to recreate my initial test code:
#define NOMINMAX
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
// NOTE: This code relies on the latest cppwinrt version. To make matters simpler I've simply
// copy-pasted the winrt headers from terminal's "Generated Files" next to the current file.
#include "winrt/Windows.ApplicationModel.DataTransfer.h"
#include "winrt/Windows.Foundation.h"
int main() {
winrt::init_apartment(winrt::apartment_type::single_threaded);
[]() -> winrt::fire_and_forget {
const auto beg = std::chrono::high_resolution_clock::now();
winrt::hstring text;
for (int i = 0; i < 1000; i++) {
const auto data = winrt::Windows::ApplicationModel::DataTransfer::Clipboard::GetContent();
text = co_await data.GetTextAsync();
}
const auto end = std::chrono::high_resolution_clock::now();
const auto dur = std::chrono::duration<float, std::milli>(end - beg).count() / 1000.0f;
wprintf(L"%f\n", dur);
PostQuitMessage(0);
}();
MSG msg;
while (GetMessageW(&msg, nullptr, 0, 0)) {
DispatchMessageW(&msg);
}
return 0;
}Using it I get approximately the same latency as I get with Windows Terminal's code, which makes me believe that it's most likely correct, even if I can't immediately prove it. (If it's incorrect, please let me know!)
I did this because this allows me to run it in NVIDIA Nsight Systems to observe cross process behavior. Here's how it looks like when pasting from Edge into this application:

With that as an orientation, here's how one call (2.7ms) looks like:

You're right! There's no COM call in GetTextAsync when the clipboard content comes from Edge. GetContent however does involve 1 CoCreateInstance and 3 COM calls. I've blindly assumed that these are 1:1 directly related to the WinRT call that I'm doing. I'm genuinely sorry for this.
When the last instance of the CDataPackageView is destroyed it'll drop the CNonAgileOleRefTaker, which will end up calling OleUninitialize. This is probably not a good idea since this will also destroy the cached window it uses to call OpenClipboard. That's why every time you (or rather I, in my example project) call Clipboard::GetContent it'll recreate the window, which is very expensive overall (about half the entire CPU cost, the rest being mostly scheduling latency). Given that, I expect that the cost of the API would be significantly reduced if it wasn't calling OleUninitialize. And indeed, if I add a:
if (OleInitialize(nullptr) != S_OK) {
abort();
}to the start of my test application, the cost of getting the clipboard contents from Edge is consistently reduced down to about 1.8ms. It is however still in stark contrast to the Win32 API which takes roughly 0.0066ms on average on my system. I didn't run a ETW trace where the remaining cost lies. But if there's any more interest in this topic, I'd be happy to trace it.
I do realize that it could be argued that the Win32 API has an unfair advantage since it doesn't need to do any async thread switching, the way the COM code needs to do it. As such I've made the following test based on this branch:
fire_and_forget TerminalPage::_PasteFromClipboardHandler(const IInspectable /*sender*/, const PasteFromClipboardEventArgs eventArgs)
try
{
const auto beg = std::chrono::high_resolution_clock::now();
winrt::hstring text;
for (int i = 0; i < 1000; i++) // <--- added
{
co_await winrt::resume_background(); // <--- added
{
// ... the win32 clipboard code from this PR ...
}
co_await wil::resume_foreground(Dispatcher()); // <--- added
}
const auto end = std::chrono::high_resolution_clock::now();
const auto dur = std::chrono::duration<float, std::milli>(end - beg).count() / 1000.0f;
wchar_t buffer[64];
swprintf_s(buffer, L"%f\n", dur);
OutputDebugStringW(&buffer[0]);
// ... remaining contents of the function from this PR ...
}
CATCH_LOG();This increases the cost to roughly 0.15ms per iteration on my system.
(We technically don't need to resume_foreground however so the actual cost is 0.013ms.)
There was a problem hiding this comment.
Those benchmarks seem complete enough, I'll accept them, and you can close this thread out. :)
I raised this point because I don't like to see people go back to the win32 clipboard API from UWP clipboard API. Win32 clipboard is really much harder to use correctly; even in a simple case like this you had to build a OpenClipboard busy wait loop which isn't even guaranteed to succeed.
But clearly the speed up is significant enough that it's worth making this change.
There was a problem hiding this comment.
Re the other discussion about ClipboardServer.dll - I might be wrong or out of date to say that it doesn't get used for basic clipboard functions. But my guess would be that its usage when you call UWP clipboard on desktop is for 1 of 2 reasons:
- UWP app processes in a AppContainer use a broker to write to clipboard to filter out attempts to hijack higher-privileged processes with corrupt clipboard data. I thought that function wasn't built into ClipboardServer.dll, but maybe it is, or has become so. Within Windows Terminal, the UWP clipboard code in the caller process should be smart enough to see that it's not running from an AppContainer process and make the calls directly, but maybe it's not or there are other constraints.
- On desktop, ClipboardServer is used to enforce clipboard enterprise data watermarking and restriction (Windows Information Protection/Enterprise Data Protection). This protection should apply to all clipboard APIs but I don't know how or when it's enforced for usage of each API.
There was a problem hiding this comment.
I raised this point because I don't like to see people go back to the win32 clipboard API from UWP clipboard API. Win32 clipboard is really much harder to use correctly; even in a simple case like this you had to build a OpenClipboard busy wait loop which isn't even guaranteed to succeed.
Yeah, that's a totally fair point and I wouldn't mind sticking with the old code. In fact, the new API is honestly way nicer and easier to use. It's genuinely just not fast enough and I think this ever so slightly contributes to the growing computational inefficiency affecting Windows.
If there's an API that's a little bit harder to use, but multiple magnitudes more power efficient 1, I think we not just should use it, but almost have a moral obligation to use it. Not just because we engineers are fickler when it comes to performance than product owners might hope 2, but more importantly because the rising layers of abstraction and corresponding computational latency are unsustainable long-term. 3
In other words, I want Windows to be fast and power efficient, where everything you do is as good as it gets to the fullest extent of the underlying hardware and what we MS engineers are capable of delivering and it's what I'm trying to work towards.
Footnotes
-
The code I've ended up with only needs to do a single
resume_background. This ends up with a latency of just 13us per iteration on my system, or roughly 200x more efficient. ↩ -
Everyone immediately switched to Chrome when it came out, because it was much leaner than IE, same for VS Code and Atom. Why would this be any different for Windows and Proton/Wine one day? ↩
-
If the WinRT API had replaced the Win32 one, so that the latter is just a thin shim, I'd totally get it, but this way? When will we get rid of the original code? ↩
|
If the clipboard owner hangs when Terminal is attempting to paste to a session, how much of Terminal's UI will that block? Just the one pane, the entire window, or all windows? Does Windows enforce a timeout? With conhost, it didn't matter that much because there was only one session per window anyway. |
|
I'm inclined to say that this is okay, for one very major reason: terminal/src/interactivity/win32/Clipboard.cpp Lines 53 to 84 in 0ee2c74 Conhost has used the win32 APIs for decades. I am curious about the "what if the source hangs while we're reading it" question - maybe there's a space for us to wrap this up as an AsyncAction that we can cancel if it doesn't return. That's probably not as expensive as all the other COM calls. Honestly though I'm not super familiar with the edge cases with the win32 API so I'm not sure what we have to be afraid of. What would this lose? There's gotta be some reason to use the modern APIs.... right? right? |
I recently changed this PR to yield to the background thread and untagged #15315 as being fixed, so it won't hang the UI anymore. We can't use an
Not even remotely. Even with yielding to a background thread first the total latency is still 20000% faster than the WinRT API (using percentages like that makes it look a bit funnier).
The worst that can happen is that it works as bad as the conhost implementation, but the only major complaint we've ever gotten there is that pasting breaks Unicode, but that has always been our own fault ( |
This comment has been minimized.
This comment has been minimized.
| } | ||
|
|
||
| attempts++; | ||
| Sleep(10 * attempts); |
There was a problem hiding this comment.
I get that we should wait before trying again, but why are we doing 10*attempts here?
There was a problem hiding this comment.
The usual approach is to do exponential backoffs: Every iteration you increase the backoff duration by like 2x for instance. I felt like doing just a linear backoff would work well enough for our primitive use case. After all, conhost uses the same API and doesn't use any retrying at all.
|
@zadjii-msft @DHowett It'd be great if we could still get this into 1.19. We've been testing it for quite a bit already after all. It's missing just 1 more approval. |
zadjii-msft
left a comment
There was a problem hiding this comment.
sure let's give it a shot
|
This one is from before the CI rewrite so it might need to be manually merged... |
| const auto str = static_cast<const wchar_t*>(GlobalLock(data)); | ||
| if (!str) | ||
| { | ||
| const auto isNewLineLambda = [](auto c) { return c == L'\n' || c == L'\r'; }; | ||
| const auto hasNewLine = std::find_if(text.cbegin(), text.cend(), isNewLineLambda) != text.cend(); | ||
| warnMultiLine = hasNewLine; | ||
| co_return; | ||
| } | ||
|
|
||
| constexpr const std::size_t minimumSizeForWarning = 1024 * 5; // 5 KiB | ||
| const auto warnLargeText = text.size() > minimumSizeForWarning && | ||
| _settings.GlobalSettings().WarnAboutLargePaste(); | ||
| const auto dataCleanup = wil::scope_exit([&]() { | ||
| GlobalUnlock(data); | ||
| }); |
There was a problem hiding this comment.
technically there's wil::unique_hglobal_locked, which takes an HGLOBAL and unlocks it later! On its own!
In the spirit of #15360 this implements the copy part. The problem is that we have an issue accessing the clipboard while other applications continue to work just fine. The major difference between us and the others is that we use the WinRT clipboard APIs. So, the idea is that we just use the Win32 APIs instead. The feel-good side-effect is that this is (no joke) 200-1000x faster, but I suspect no one will notice the -3ms difference down to <0.01ms. The objective effect however is that it just works. This may resolve #16982. ## Validation Steps Performed * Cycle through Text/HTML/RTF-only in the Interaction settings * Paste the contents into Word each time * Text is plain and HTML/RTF are colored ✅
In the spirit of #15360 this implements the copy part. The problem is that we have an issue accessing the clipboard while other applications continue to work just fine. The major difference between us and the others is that we use the WinRT clipboard APIs. So, the idea is that we just use the Win32 APIs instead. The feel-good side-effect is that this is (no joke) 200-1000x faster, but I suspect no one will notice the -3ms difference down to <0.01ms. The objective effect however is that it just works. This may resolve #16982. * Cycle through Text/HTML/RTF-only in the Interaction settings * Paste the contents into Word each time * Text is plain and HTML/RTF are colored ✅ (cherry picked from commit 5f3a857) Service-Card-Id: 92308708 Service-Version: 1.20
The Win32 API is significantly faster than the WinRT one, in the order
of around 300-1000x depending on the CPU and CPU load.
This might slightly improve the situation around #15315, but I suspect
that it requires many more fixes. For instance, we don't really have a
single text input "queue" into which we write. Multiple routines that
resume_backgroundjust toWriteFileinto the input pipe are thusracing against each other, contributing to the laggy feeling.
I also fear that the modern Windows text stack might be inherently
RPC based too, producing worse lag with rising CPU load.
This might fix #14323
Validation Steps Performed
paste text into WT ✅