
LUCENE-10531: Disable distribution test (gui test) on windows. #917

Merged
mocobeta merged 3 commits into apache:main from mocobeta:disable-actions-gui-test-on-windows-vm
May 21, 2022

Conversation

@mocobeta (Contributor) commented May 21, 2022

Occasionally the test is too slow (or hangs?) and fails on the Windows VM.
We can't increase the timeout indefinitely, so I would disable it on GitHub Actions for now (we still run it on the Jenkins servers).
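For context, the actual change edits the GitHub Actions workflow; as a rough, hypothetical sketch of the guard involved (the class and method names here are illustrative, not the real change), a test could skip itself when it detects Windows plus a CI environment:

```java
// Hypothetical sketch: skip the gui test when running on Windows under CI.
// The real change edits the GitHub Actions workflow; names here are illustrative.
public class SkipOnWindowsCi {
    static boolean shouldSkipGuiTest() {
        boolean isWindows = System.getProperty("os.name", "").toLowerCase().contains("windows");
        boolean isCi = System.getenv("CI") != null; // GitHub Actions sets CI=true
        return isWindows && isCi;
    }

    public static void main(String[] args) {
        System.out.println(shouldSkipGuiTest() ? "gui test skipped" : "gui test runs");
    }
}
```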

@mocobeta (Contributor Author)

I'll merge this now so that the CI doesn't fail in unrelated pull requests; we can re-enable it on the Windows VM in the future if the GitHub Actions runner is improved.

mocobeta merged commit 59b6d41 into apache:main May 21, 2022
mocobeta deleted the disable-actions-gui-test-on-windows-vm branch May 21, 2022 14:46
@dweiss (Contributor) commented May 21, 2022

I'm not sure what's causing this. It looks strange. I'll play around and report if I find out anything.

@mocobeta (Contributor Author)

Thanks. I muted it for now since I thought we shouldn't experiment in the upstream repository, but I'll also look into it when I have spare time.

It looks like it is easily reproducible, but repeated job runs would be needed for debugging (it fails about once in five runs?)...

@mocobeta (Contributor Author)

Just a quick note: I noticed that some virtual display/monitor seems to be enabled on the Windows Server VM in mocobeta#2. I'm not sure whether the Windows VM on the Policeman Jenkins server has a virtual display, though; I wonder if this could be a difference between GitHub Actions and the Jenkins server.

@dweiss (Contributor) commented May 21, 2022

I've been trying to reproduce this with more debugging info but can't, in spite of several dozen attempts.

@dweiss (Contributor) commented May 21, 2022

@mocobeta (Contributor Author) commented May 22, 2022

Repeating the test on the same machine didn't reproduce the failure for me either.

I tried setting up 20 virtual machines and one of them failed.
https://github.com/mocobeta/lucene/runs/6540983796

[Screenshot from 2022-05-22 12-27-53]

It looks like there is something wrong with the machine setup (?); I have no idea how to debug it yet.

@mocobeta (Contributor Author) commented May 22, 2022

I tested several timeouts.

  • 60 seconds: occasionally fails
  • 120 seconds: occasionally fails
  • 600 seconds: didn't see any failure in several dozens of runs

Increasing the timeout seems to effectively solve the problem, though I have no idea why :)
Also, there was no substantial increase in the workflow's total execution time when I raised the timeout to a large value. Could the system clock in the VM sometimes be getting messed up?
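The mechanism being tuned here is a bounded wait on a forked process. A minimal, self-contained illustration of how such a timeout decides pass/fail, using a thread in place of a real process so it runs anywhere (the real test would use something like `Process.waitFor(long, TimeUnit)`):

```java
public class TimeoutWait {
    // Wait up to timeoutMillis for the worker to finish; true = finished in time.
    // This mirrors the shape of Process.waitFor(long, TimeUnit) on a forked process.
    static boolean awaitWithTimeout(Thread worker, long timeoutMillis) {
        try {
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        Thread fast = new Thread(() -> { /* finishes immediately */ });
        fast.start();
        System.out.println(awaitWithTimeout(fast, 5_000));  // true: done well within the deadline

        Thread slow = new Thread(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) {}
        });
        slow.setDaemon(true); // don't keep the JVM alive for the demo
        slow.start();
        System.out.println(awaitWithTimeout(slow, 100));    // false: deadline exceeded
    }
}
```

A run that normally completes in 20 seconds but is occasionally suspended by the host can blow past a 60 or 120 second deadline without consuming extra wall-clock time in the common case, which matches the observation that a 600-second timeout didn't lengthen the workflow.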

@dweiss (Contributor) commented May 22, 2022

It is impossible and insane - something is wrong. I don't believe forking a simple process requires 600 seconds to complete while running gradle tasks completes orders of magnitude faster. Thank you for testing, though. Can you merge the changes from the script-testing-windows branch and rerun your stress test? It emits more logging; let's see what's happening there.

Separately from the above, those Windows VMs do tend to hang occasionally - I've seen this not just on Lucene but also on other projects. The logs are simply truncated then; no reason or cause can be determined. They're just occasionally flaky. But in this case it stalls too predictably in one place to be caused by mere flakiness.

@mocobeta (Contributor Author)

> Can you merge the changes from script-testing-windows branch and rerun your stress test? It emits more logging, let's see what's happening there.

Sure - let me merge it later today. Also, I'd tune the timeout down to 20-30 seconds to increase the failure probability (in most cases the test completes within 20 seconds, so based on my random trials so far, a small timeout just above 20 seconds should be suitable for capturing the unusual cases).

@mocobeta (Contributor Author)

@dweiss I merged https://github.com/dweiss/lucene/tree/script-testing-windows into my branch. This repeatedly runs the test on 20 VMs; it's a waste of resources, though. For debugging I set the timeout to 20 seconds.

You can see the CI results here; please feel free to tweak the code and re-run jobs (I think you have write access to this fork).
mocobeta#2

@mocobeta (Contributor Author) commented May 22, 2022

I re-ran the workflows several times and tracked the debug messages in the failed runs. In short:

  1. it takes about ten seconds at minimum to load the AWT/Swing classes in the first test run on the Windows VM
  2. in the middle of loading classes, the thread is sometimes suspended several times, for a very long time (five seconds or more) each, by the host machine, the scheduler, or something else. In the worst cases, it could take minutes?
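The first point can be observed directly: loading (not instantiating) a top-level Swing class pulls in a large graph of superclasses and static dependencies. A small sketch to measure it; the numbers are machine-dependent, and the ten-second figure above was specific to the first run on the Windows VM:

```java
public class ClassLoadTiming {
    // Measure how long the first load of a class (plus its superclasses,
    // interfaces, and static dependencies) takes, in milliseconds.
    static long timeLoadMillis(String className) {
        long t0 = System.nanoTime();
        try {
            Class.forName(className);
        } catch (ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        // Loading JFrame drags in much of AWT/Swing without opening any window.
        System.out.println("javax.swing.JFrame loaded in "
            + timeLoadMillis("javax.swing.JFrame") + " ms");
    }
}
```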

@mocobeta (Contributor Author)

I finally hit failing runs that took more than 120 seconds to complete.

Process forking is not the problem at all; launching Luke involves loading many classes (it prepares all panels at startup... sorry), and that can take a long time on the Windows VM.

@dweiss (Contributor) commented May 22, 2022

Yeah, it looks like it! It is inexplicably slow!... One change I think we could try is to run the forked command with a higher priority (the start command has an option for this; cmd doesn't, I believe).
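A hedged sketch of that workaround (illustrative only; `start` is a cmd built-in rather than an executable, so it has to go through `cmd /c`, and the exact flags would need verifying on the target VM):

```java
import java.util.ArrayList;
import java.util.List;

public class HighPriorityLaunch {
    // Wrap a Windows command with `cmd /c start /high /wait /b` so the forked
    // process runs at high priority: /high sets the priority class, /wait
    // blocks until the child exits, /b avoids opening a new console window.
    // Illustrative sketch only, not the project's actual launcher.
    static List<String> wrapWithHighPriority(List<String> command) {
        List<String> wrapped = new ArrayList<>(List.of("cmd", "/c", "start", "/high", "/wait", "/b"));
        wrapped.addAll(command);
        return wrapped;
    }

    public static void main(String[] args) {
        // Hypothetical script path, for illustration.
        System.out.println(wrapWithHighPriority(List.of("lucene\\distribution\\run-gui-test.cmd")));
    }
}
```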

@mocobeta (Contributor Author)

It could be worth trying to load the GUI components lazily... starting Luke takes seconds even on a physical machine, and a quicker launch is also good for humans.
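A minimal sketch of the lazy-initialization idea (a hypothetical helper, not Luke's actual code): defer building each panel until its first use and cache the result, so startup only pays for the panels actually shown.

```java
import java.util.function.Supplier;

// Hypothetical lazy holder, not Luke's actual code: defer an expensive
// construction (e.g. a Swing panel) until first access, then cache it.
final class Lazy<T> {
    private Supplier<T> supplier;
    private T value;

    Lazy(Supplier<T> supplier) { this.supplier = supplier; }

    synchronized T get() {
        if (supplier != null) {   // first access: build and cache
            value = supplier.get();
            supplier = null;      // let the supplier (and its captures) be GC'd
        }
        return value;
    }
}

public class LazyPanelDemo {
    static int built = 0;

    public static void main(String[] args) {
        Lazy<String> panel = new Lazy<>(() -> { built++; return "search panel"; });
        System.out.println(built);        // 0: nothing built at "startup"
        System.out.println(panel.get());  // built on first use
        panel.get();                      // cached; not rebuilt
        System.out.println(built);        // 1
    }
}
```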

@mocobeta (Contributor Author)

> One change I think we could try is to run the forked command with a higher priority

Thanks for your suggestion; we could try this workaround, though I feel it'd be better to keep the test disabled and instead address the slowness of launching the app.
https://issues.apache.org/jira/browse/LUCENE-10588

It's great to know we can run the gui test with a (virtual) display on GitHub Actions.
