Fix race condition of timeout thread interrupt to stabilize multi level build tests #4254

Merged: lihaoyi merged 13 commits into com-lihaoyi:main from lihaoyi:stabilize-multi-level-build on Jan 6, 2025

lihaoyi (Member) commented on Jan 6, 2025

The basic problem was that `thread.interrupt()` ran before `interrupt = false`, so there was a chance that the timeout thread would wake up and find its `if (interrupt) {` condition still true. When that happened, the timeout thread timed out the open socket immediately and the Mill server process shut down.

The solution is to move `interrupt = false` to before `thread.interrupt()`.
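The ordering issue can be illustrated with a minimal Java sketch. This is not the actual Mill server code: the class `TimeoutGuard`, the method `runWithTimeout`, and the flag name `armed` are hypothetical names chosen for illustration, and an `AtomicBoolean` stands in for the `interrupt` variable so that the write is visible across threads. The point it shows is the `finally` block: the flag is cleared first, and only then is the watchdog thread woken.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical helper, not Mill's real implementation.
class TimeoutGuard {
    // Runs `task`; if it is still running after `timeoutMillis`, calls `onTimeout`
    // (for example, a callback that closes a socket the task is blocked on).
    static <T> T runWithTimeout(long timeoutMillis, Runnable onTimeout, Supplier<T> task) {
        // Plays the role of the `interrupt` flag described above.
        AtomicBoolean armed = new AtomicBoolean(true);

        Thread watchdog = new Thread(() -> {
            try {
                Thread.sleep(timeoutMillis);
            } catch (InterruptedException e) {
                // Woken early by the caller; fall through and re-check the flag.
            }
            // Fires only if the caller has not disarmed the watchdog.
            if (armed.get()) onTimeout.run();
        }, "timeout-watchdog");
        watchdog.setDaemon(true);
        watchdog.start();

        try {
            return task.get();
        } finally {
            // Order matters: disarm first, then wake the watchdog.
            // If interrupt() came first, the watchdog could wake up, still see
            // armed == true, and run onTimeout even though the task had finished.
            armed.set(false);
            watchdog.interrupt();
        }
    }
}
```

With the two statements in the `finally` block swapped, the watchdog thread may be scheduled between them and fire the timeout callback spuriously, which is the kind of window the description above refers to.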

This should hopefully fix the flakiness in the multi-level-build tests, where the process shutting down would cause the next command to spawn a new process, which re-created all classloaders and violated our assertions.

Also improved the test error checking so that the next time something similar happens we get a more precise report, "an unwanted process restart occurred", rather than just "classloader invalidation was unexpected".

Tested manually with `while ./mill 'integration.invalidation[multi-level-editing].server.test'; do :; done`. Without this fix, this seems to reproduce the problem on my laptop within a few tens of minutes; with the fix, I haven't managed to make it appear.

lihaoyi marked this pull request as ready for review on January 6, 2025, 07:50
lihaoyi merged commit f0aa010 into com-lihaoyi:main on Jan 6, 2025
lefou added this to the 0.12.6 milestone on Jan 6, 2025

lihaoyi (Member, Author) commented on Jan 6, 2025

I'm hoping this also fixes the flakiness in `example.fundamentals.tasks[6-workers].local.test` (e.g. https://github.com/com-lihaoyi/mill/actions/runs/12579809859/job/35060755776) and `integration.failure[fatal-error].local.test` (e.g. https://github.com/com-lihaoyi/mill/actions/runs/12591901838/job/35095817421). Both could be explained by the Mill process from the previous invocation exiting unexpectedly before the subsequent invocation happens.

lihaoyi added a commit that referenced this pull request Jan 7, 2025
This might have been the cause of a lot of flakiness that seems to have
gone away with #4254, as the
server exiting caused the `runBackground` calls to exit causing the http
servers to exit and fail to pick up requests.

It might have been caused by com-lihaoyi/os-lib#324,
which made `destroyOnExit` the default for spawned subprocesses. This PR
explicitly disables `destroyOnExit` for the subprocesses where
`background = true`.

Covered by a new `integration.invalidation` test that runs under both
`server` and `fork`, and that previously failed when run under `fork`.