
LUCENE-10531: Disable distribution test (gui test) on windows. #917

Merged
mocobeta merged 3 commits into apache:main from mocobeta:disable-actions-gui-test-on-windows-vm
May 21, 2022

Conversation

@mocobeta (Contributor) commented May 21, 2022

Occasionally the test is too slow (or hangs?) and fails on the Windows VM.
We can't increase the timeout indefinitely, so I would disable it on GitHub Actions for now (we still run it on the Jenkins servers).
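For context, the actual change edits the GitHub Actions workflow; as a rough, hypothetical sketch of the guard involved (the class and method names here are illustrative, not the real change), a test could skip itself when it detects Windows plus a CI environment:

```java
// Hypothetical sketch: skip the gui test when running on Windows under CI.
// The real change edits the GitHub Actions workflow; names here are illustrative.
public class SkipOnWindowsCi {
    static boolean shouldSkipGuiTest() {
        boolean isWindows = System.getProperty("os.name", "").toLowerCase().contains("windows");
        boolean isCi = System.getenv("CI") != null; // GitHub Actions sets CI=true
        return isWindows && isCi;
    }

    public static void main(String[] args) {
        System.out.println(shouldSkipGuiTest() ? "gui test skipped" : "gui test runs");
    }
}
```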

@mocobeta (Contributor Author)

I'll merge this now so that the CI doesn't fail in unrelated pull requests; we can re-enable it on the Windows VM in the future if the GitHub Actions runner is improved.

mocobeta merged commit 59b6d41 into apache:main May 21, 2022
mocobeta deleted the disable-actions-gui-test-on-windows-vm branch May 21, 2022 14:46
@dweiss (Contributor) commented May 21, 2022

I'm not sure what's causing this. It looks strange. I'll play around and report if I find out anything.

@mocobeta (Contributor Author)

Thanks. I muted it for now since I thought we shouldn't experiment in the upstream repository, but I'll also look into it when I have spare time.

It looks like it is easily reproducible, but repeated job runs would be needed for debugging (it fails about once in five runs?)...

@mocobeta (Contributor Author)

Just a quick note: I noticed that some virtual display/monitor seems to be enabled on the Windows Server VM in mocobeta#2. I'm not sure whether the Windows VM on the Policeman Jenkins server has a virtual display, though; I wonder if this could be a difference between GitHub Actions and the Jenkins server.

@dweiss (Contributor) commented May 21, 2022

I've been trying to reproduce this with more debugging info but can't, in spite of several dozen attempts.

@dweiss (Contributor) commented May 21, 2022

@mocobeta (Contributor Author) commented May 22, 2022

Repeating the test on the same machine didn't reproduce the failure for me either.

I tried setting up 20 virtual machines and one of them failed.
https://github.com/mocobeta/lucene/runs/6540983796

[Screenshot from 2022-05-22 12-27-53]

It looks like there is something wrong with the machine setup (?); I have no idea how to debug it yet.

@mocobeta (Contributor Author) commented May 22, 2022

I tested several timeouts.

  • 60 seconds: occasionally fails
  • 120 seconds: occasionally fails
  • 600 seconds: didn't see any failure in several dozens of runs

Increasing the timeout seems to effectively solve the problem, though I have no idea why :)
Also, there was no substantial increase in the workflow's total execution time when I raised the timeout to a large value. Could the system clock in the VM sometimes be getting messed up?
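The mechanism being tuned here is a bounded wait on a forked process. A minimal, self-contained illustration of how such a timeout decides pass/fail, using a thread in place of a real process so it runs anywhere (the real test would use something like `Process.waitFor(long, TimeUnit)`):

```java
public class TimeoutWait {
    // Wait up to timeoutMillis for the worker to finish; true = finished in time.
    // This mirrors the shape of Process.waitFor(long, TimeUnit) on a forked process.
    static boolean awaitWithTimeout(Thread worker, long timeoutMillis) {
        try {
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        Thread fast = new Thread(() -> { /* finishes immediately */ });
        fast.start();
        System.out.println(awaitWithTimeout(fast, 5_000));  // true: done well within the deadline

        Thread slow = new Thread(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) {}
        });
        slow.setDaemon(true); // don't keep the JVM alive for the demo
        slow.start();
        System.out.println(awaitWithTimeout(slow, 100));    // false: deadline exceeded
    }
}
```

A run that normally completes in 20 seconds but is occasionally suspended by the host can blow past a 60 or 120 second deadline without consuming extra wall-clock time in the common case, which matches the observation that a 600-second timeout didn't lengthen the workflow.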

@dweiss (Contributor) commented May 22, 2022

It is impossible and insane - something is wrong. I don't believe forking a simple process requires 600 seconds to complete while running gradle tasks completes orders of magnitude faster. Thank you for testing, though. Can you merge the changes from the script-testing-windows branch and rerun your stress test? It emits more logging; let's see what's happening there.

Separately from the above, those Windows VMs do tend to hang occasionally - I've seen this not just on Lucene but also on other projects. The logs are simply truncated then; no reason or cause can be determined. They're just occasionally flaky. But in this case it stalls too predictably in one place to be caused by mere flakiness.

@mocobeta (Contributor Author)

> Can you merge the changes from script-testing-windows branch and rerun your stress test? It emits more logging, let's see what's happening there.

Sure - let me merge it later today. Also, I'd tune the timeout down to 20-30 seconds to increase the failure probability (in most cases the test completes within 20 seconds, so based on my random trials so far, a small timeout just above 20 seconds should be suitable for capturing the unusual cases).

@mocobeta (Contributor Author)

@dweiss I merged https://github.com/dweiss/lucene/tree/script-testing-windows into my branch. This repeatedly runs the test on 20 VMs; it's a waste of resources, though. For debugging I set the timeout to 20 seconds.

You can see the CI results here; please feel free to tweak the code and re-run jobs (I think you have write access to this fork).
mocobeta#2

@mocobeta (Contributor Author) commented May 22, 2022

I re-ran the workflows several times and tracked the debug messages in the failed runs. In short:

  1. it takes about ten seconds at minimum to load the AWT/Swing classes in the first test run on the Windows VM
  2. in the middle of loading classes, the thread is sometimes suspended several times, for a very long time (five seconds or more) each, by the host machine, the scheduler, or something else. In the worst cases, it could take minutes?
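The first point can be observed directly: loading (not instantiating) a top-level Swing class pulls in a large graph of superclasses and static dependencies. A small sketch to measure it; the numbers are machine-dependent, and the ten-second figure above was specific to the first run on the Windows VM:

```java
public class ClassLoadTiming {
    // Measure how long the first load of a class (plus its superclasses,
    // interfaces, and static dependencies) takes, in milliseconds.
    static long timeLoadMillis(String className) {
        long t0 = System.nanoTime();
        try {
            Class.forName(className);
        } catch (ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        // Loading JFrame drags in much of AWT/Swing without opening any window.
        System.out.println("javax.swing.JFrame loaded in "
            + timeLoadMillis("javax.swing.JFrame") + " ms");
    }
}
```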

@mocobeta (Contributor Author)

I finally hit failing runs that took more than 120 seconds to complete.

Process forking is not the problem at all; launching Luke involves loading many classes (it prepares all panels at startup... sorry), and that can take a long time on the Windows VM.

@dweiss (Contributor) commented May 22, 2022

Yeah, it looks like it! It is inexplicably slow!... One change I think we could try is to run the forked command with a higher priority (the start command has an option for this; cmd doesn't, I believe).
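A hedged sketch of that workaround (illustrative only; `start` is a cmd built-in rather than an executable, so it has to go through `cmd /c`, and the exact flags would need verifying on the target VM):

```java
import java.util.ArrayList;
import java.util.List;

public class HighPriorityLaunch {
    // Wrap a Windows command with `cmd /c start /high /wait /b` so the forked
    // process runs at high priority: /high sets the priority class, /wait
    // blocks until the child exits, /b avoids opening a new console window.
    // Illustrative sketch only, not the project's actual launcher.
    static List<String> wrapWithHighPriority(List<String> command) {
        List<String> wrapped = new ArrayList<>(List.of("cmd", "/c", "start", "/high", "/wait", "/b"));
        wrapped.addAll(command);
        return wrapped;
    }

    public static void main(String[] args) {
        // Hypothetical script path, for illustration.
        System.out.println(wrapWithHighPriority(List.of("lucene\\distribution\\run-gui-test.cmd")));
    }
}
```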

@mocobeta (Contributor Author)

It could be worth trying to load the GUI components lazily... starting Luke takes seconds even on a physical machine, and a quicker launch is also good for humans.
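A minimal sketch of the lazy-initialization idea (a hypothetical helper, not Luke's actual code): defer building each panel until its first use and cache the result, so startup only pays for the panels actually shown.

```java
import java.util.function.Supplier;

// Hypothetical lazy holder, not Luke's actual code: defer an expensive
// construction (e.g. a Swing panel) until first access, then cache it.
final class Lazy<T> {
    private Supplier<T> supplier;
    private T value;

    Lazy(Supplier<T> supplier) { this.supplier = supplier; }

    synchronized T get() {
        if (supplier != null) {   // first access: build and cache
            value = supplier.get();
            supplier = null;      // let the supplier (and its captures) be GC'd
        }
        return value;
    }
}

public class LazyPanelDemo {
    static int built = 0;

    public static void main(String[] args) {
        Lazy<String> panel = new Lazy<>(() -> { built++; return "search panel"; });
        System.out.println(built);        // 0: nothing built at "startup"
        System.out.println(panel.get());  // built on first use
        panel.get();                      // cached; not rebuilt
        System.out.println(built);        // 1
    }
}
```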

@mocobeta (Contributor Author)

> One change I think we could try is to run the forked command with a higher priority

Thanks for your suggestion; we could try this workaround, though I feel it'd be better to keep the test disabled and instead address the slowness of launching the app.
https://issues.apache.org/jira/browse/LUCENE-10588

It's great to know we can run the gui test with a (virtual) display on GitHub Actions.
