Use SetEvent rather than QueueUserAPC on Windows#1645
Use SetEvent rather than QueueUserAPC on Windows#1645bakpakin merged 4 commits intojanet-lang:masterfrom
SetEvent rather than QueueUserAPC on Windows#1645Conversation
|
This comment related to a version of the PR before it was rewritten.
|
|
Thanks for looking into this. A month or two back I spent a Saturday looking into this and narrowed it down to the ev/deadline code and how thread cancellation was working, but was unable to fully understand the issue and come up with something that always worked. That said, I was able to easily repro by running just the ev/deadline test in a loop on my windows machine. |
|
Huzzah! I'm pretty confident I've found the problem! I have extensively rewritten the OP with more details. In short: it's a timer resolution issue on Windows (the same problem is not present on POSIX because it uses |
|
This looks good to me, but the fact that a timer issue can cause this is a bit disconcerting. Let's merge this but wait to close #1598 |
This PR fixes a problem I introduced via #1575, namely that when built on Windows platforms Janet can hang when
ev/deadlineis called with a function that infinitely loops. The fix is to useSetEventrather thanQueueUserAPCwith an increasing back-off if a timer completes in less time than the user had specified.Background
The Windows implementation of
ev/deadlineuses a combination ofQueueUserAPCandSleepExto allow a worker thread to interrupt the main thread running the Janet VM if a timeout occurs. While the implementations used on POSIX systems seem to work reliably, CI runs of the Janet project on Windows have, over the past six months, repeatedly hung during the testing oftest/suite-ev.janet.I am now relatively confident that the problem is a consequence of the resolution of timers on Windows.
SleepEx(andWaitForSingleObjectwhich this PR uses instead) claim to support millisecond precision. However, this precision is dependent on the length of the 'tick' of the system clock. If the amount of time is less than the tick, the function may return before the amount of time has elapsed. This creates a problem when testingtest/suite-ev.janetbecause the second test ofev/deadlineuses a timeout of0.01seconds (i.e. 10 milliseconds). What happens on a non-deterministic basis is thatSleepEx(orWaitForSingleObject) will return after less than 10 milliseconds. This will cause the Janet VM to be interrupted and the event loop will iterate. However, on the next iteration, it is possible that the system clock is not at 10 milliseconds ahead and so the timeout in the Janet VM's queue will not trigger. The Janet VM will restart the fiber running the infinite loop and the system will hang.An earlier version of this PR explored whether the problem related to the use of APC and even though that appears to not be the case, this PR uses event objects as a preferable signalling approach.
Implementation
This PR replaces the APC-based mechanism with an event object-based mechanism. It also introduces an increasing back-off mechanism in the timer. The key changes are in
src/core/ev.cinjanet_timeout_body:and in
cfun_ev_deadline:With these changes, I was able to get 10 runs on my fork of Janet to work without incident. I was not previously able to get more than 5 runs.
Limitations
The memory needs of the
JanetThreadedTimeoutstruct andJanetTimeoutstruct are increased on Windows because an additional pointer needs to be allocated for the semaphore event.Additionally, the implementation does not send a message for other return values that
WaitForSingleObjectcan return. I think that's all right forWAIT_OBJECT_Obut perhaps some kind of error should be raised ifWAIT_FAILEDis returned.