LUCENE-10531: Disable distribution test (gui test) on windows. #917
|
I would merge this now so that CI doesn't fail in unrelated pull requests; we could re-enable it on the Windows VM in the future if the GitHub Actions runner improves. |
|
I'm not sure what's causing this. It looks strange. I'll play around and report if I find out anything. |
|
Thanks. I muted it for now since I thought we shouldn't experiment in the upstream repository, but I'll also look at it when I have spare time. It looks easy to reproduce, but debugging would need repeated job runs (it fails about once in five runs?)... |
|
Just a quick note: I noticed that some virtual display/monitor seems to be enabled on the Windows Server VM in mocobeta#2. I'm not sure whether the Windows VM on the Policeman Jenkins server has a virtual display, though; I wonder if this could be a difference between GitHub Actions and the Jenkins server. |
|
I've been trying to reproduce this with more debugging info but can't, in spite of several dozen attempts. |
|
Repeating the test on the same machine didn't reproduce the failure for me either. I tried setting up 20 virtual machines and one of them failed. It looks like there is something wrong with the machine setup (?); I have no idea how to debug it yet. |
|
I tested several timeouts.
It looks to me like increasing the timeout effectively solves the problem; I have no idea why :) |
|
It is impossible and insane - something is wrong. I don't believe forking a simple process requires 600 seconds to complete, while running gradle tasks completes orders of magnitude faster. Thank you for testing though. Can you merge the changes from the script-testing-windows branch and rerun your stress test? It emits more logging, so let's see what's happening there. Separately from the above, those Windows VMs do tend to hang occasionally - I've seen this not just on Lucene but also on other projects. The logs are simply truncated then, and no reason or cause can be determined. They're just flaky, occasionally. But in this case I think it stalls too predictably in one place to be caused by mere flakiness. |
Sure - let me merge it later today. Also, I'd set the timeout to 20 to 30 seconds to increase the failure probability (in most cases the test completes within 20 seconds, so, judging from my random trials so far, a timeout slightly above 20 seconds should be suitable for capturing the unusual cases). |
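For reference, the "fork a process and wait with a timeout" pattern being tuned here can be sketched in Java roughly as follows. This is only an illustration, not the code used by Lucene's distribution test: the class name `ForkWithTimeout`, its methods, and the use of `java -version` as the forked command are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.OptionalInt;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper; not part of Lucene. */
public class ForkWithTimeout {

  /**
   * Forks the command and waits up to timeoutSeconds for it to exit.
   * Returns the exit code, or an empty OptionalInt if the process
   * did not finish in time and had to be killed.
   */
  public static OptionalInt runWithTimeout(List<String> command, long timeoutSeconds)
      throws IOException, InterruptedException {
    ProcessBuilder pb = new ProcessBuilder(command);
    // Merge stderr into stdout and discard both, so the child can never
    // block on a full pipe while we are waiting for it.
    pb.redirectErrorStream(true);
    pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);

    Process process = pb.start();
    if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
      // Timed out: kill the child and reap it before reporting the failure.
      process.destroyForcibly();
      process.waitFor();
      return OptionalInt.empty();
    }
    return OptionalInt.of(process.exitValue());
  }

  /** Path to the java launcher of the currently running JVM. */
  public static String javaExecutable() {
    return Path.of(System.getProperty("java.home"), "bin", "java").toString();
  }
}
```

With a helper like this, a stress run would call `runWithTimeout(command, 20)` repeatedly and log every empty result, which is the kind of "small timeout to catch the unusual cases" experiment described above.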
|
@dweiss I merged https://github.com/dweiss/lucene/tree/script-testing-windows into my branch. This repeatedly runs the test in 20 VMs; it's a waste of resources, though. For debugging I set the timeout to 20 seconds. You can see the CI results here, and please feel free to tweak the code and re-run jobs (I think you have write access on this fork). |
|
In a failed run, the thread does not hang but seems to be suspended several times. |
|
I re-ran the workflows several times and tracked the debug messages in the failed runs. To summarize briefly,
|
|
I finally hit failing runs that take more than 120 seconds to complete.
Process forking is not the problem at all; launching Luke involves loading many classes (it prepares all panels at startup... sorry) and that can take a long time on the Windows VM. |
|
Yeah, it looks like it! It is inexplicably slow!... One change I think we could try is to run the forked command with a higher priority (start command has an option for this; cmd doesn't, I believe). |
|
It could be worth trying to load GUI components lazily... starting Luke takes seconds even on a physical machine, and a quicker launch is also good for humans.
Thanks for your suggestion; we could try this workaround, though I feel it'd be better to keep the test disabled and try to solve the slowness of launching the app. It's great to know we can run the gui test with a (virtual) display on GitHub Actions. |
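To make the lazy-loading idea concrete, here is a minimal sketch of deferring panel construction until first use. It does not reflect Luke's actual classes: `LazyPanels`, `Lazy`, and `Panel` are hypothetical names, and `Panel` is a plain stand-in for an expensive Swing component so the example runs without a display.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical sketch; these types do not exist in Luke. */
public class LazyPanels {

  /** Memoizing holder: the factory runs at most once, on the first get(). */
  public static final class Lazy<T> implements Supplier<T> {
    private final Supplier<T> factory;
    private T value;
    private boolean initialized;

    public Lazy(Supplier<T> factory) {
      this.factory = factory;
    }

    @Override
    public synchronized T get() {
      if (!initialized) {
        value = factory.get();
        initialized = true;
      }
      return value;
    }

    public synchronized boolean isInitialized() {
      return initialized;
    }
  }

  /** Stand-in for a panel that is expensive to build. */
  public static final class Panel {
    /** Counts how many panels have actually been constructed. */
    public static final AtomicInteger CONSTRUCTED = new AtomicInteger();

    public final String name;

    public Panel(String name) {
      this.name = name;
      CONSTRUCTED.incrementAndGet();
    }
  }
}
```

An application window would hold one `Lazy<Panel>` per tab and call `get()` only when the tab is first shown, so startup pays for the visible panel alone. Whether this fits Luke's panel wiring is untested here.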

Occasionally the test is too slow (or hangs?) and fails on the Windows VM.
We can't increase the timeout indefinitely, so I would disable it on GitHub Actions for now (we still run it on the Jenkins servers).
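
As an illustration of the condition described above, a "Windows on GitHub Actions" check could look like the sketch below. It is not how this PR implements the change: the class `WindowsCiGuard` and its methods are hypothetical. The only external facts relied on are that Java's `os.name` system property starts with "Windows" on Windows and that GitHub Actions sets the `GITHUB_ACTIONS` environment variable to `true`.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical guard; not the mechanism used in this PR. */
public class WindowsCiGuard {

  /** True if the given os.name value denotes a Windows system. */
  public static boolean isWindows(String osName) {
    return osName != null && osName.toLowerCase(Locale.ROOT).startsWith("windows");
  }

  /** True if the environment looks like a GitHub Actions runner. */
  public static boolean isGitHubActions(Map<String, String> env) {
    return "true".equals(env.get("GITHUB_ACTIONS"));
  }

  /** Skip the GUI test only on Windows runners of GitHub Actions. */
  public static boolean shouldSkipGuiTest(String osName, Map<String, String> env) {
    return isWindows(osName) && isGitHubActions(env);
  }
}
```

A test could call `shouldSkipGuiTest(System.getProperty("os.name"), System.getenv())` and turn a `true` result into a skipped (not failed) test, which keeps the test running on the Jenkins servers.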