Description
I started Grid on my laptop and ran a few tests.
Then the grid was running unused/untouched for a week.
Then I tried to stop it.
Result: it could not stop.
I took a thread dump; it contains the phrase "Found one Java-level deadlock:".
Thread dump: https://gist.github.com/asolntsev/a0a30949505b42422b338ea9cb56306e
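For anyone who wants to check for the same condition from inside a JVM instead of reading a thread dump, here is a minimal sketch using the standard `java.lang.management.ThreadMXBean` API. `DeadlockProbe` and `describeDeadlocks` are hypothetical names made up for this example; nothing like this is claimed to exist in Selenium.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Selenium.
class DeadlockProbe {

  // Returns one line per deadlocked thread in the current JVM, or an empty list if none.
  static List<String> describeDeadlocks() {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    // Looks for cycles over object monitors and ownable synchronizers; null means no deadlock.
    long[] ids = threads.findDeadlockedThreads();
    List<String> report = new ArrayList<>();
    if (ids == null) {
      return report;
    }
    for (ThreadInfo info : threads.getThreadInfo(ids, true, true)) {
      if (info == null) {
        continue; // the thread terminated between the two calls
      }
      report.add(info.getThreadName()
          + " waits for " + info.getLockName()
          + " held by " + info.getLockOwnerName());
    }
    return report;
  }

  public static void main(String[] args) {
    List<String> report = describeDeadlocks();
    if (report.isEmpty()) {
      System.out.println("No deadlock detected");
    } else {
      report.forEach(System.out::println);
    }
  }
}
```

This only inspects the JVM it runs in, so for a running Grid process a thread dump (as above) remains the practical route.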
Reproducible Code
> java -version
openjdk version "17.0.17"
> java -jar selenium-server-4.40.0.jar standalone --selenium-manager true --enable-managed-downloads true
Debugging Logs
17:55:41.910 INFO [LocalNode.stopTimedOutSession] - Session id 6a7a7b915237deb9317591818bea4a41 is stopping on demand...
17:55:41.910 INFO [SessionSlot.stop] - Stopping session 6a7a7b915237deb9317591818bea4a41 (reason: QUIT_COMMAND)
17:55:41.911 INFO [SessionSlot.stop] - Session stopped successfully: 6a7a7b915237deb9317591818bea4a41
17:55:41.911 INFO [LocalSessionMap.removeWithReason] - Deleted session from local Session Map, Id: 6a7a7b915237deb9317591818bea4a41, Node: http://10.10.10.151:4444, Reason: session closed normally (QUIT command)
17:55:41.912 INFO [LocalGridModel.release] - Releasing slot for session id 6a7a7b915237deb9317591818bea4a41
18:13:23.741 INFO [LocalGridModel.purgeDeadNodes] - Switching Node a37d5cd6-72c6-4d6d-8c1a-f04bae993c4a (uri: http://10.10.10.151:4444) from UP to DOWN
01:05:18.213 INFO [LocalGridModel.purgeDeadNodes] - Switching Node a37d5cd6-72c6-4d6d-8c1a-f04bae993c4a (uri: http://10.10.10.151:4444) from UP to DOWN
04:30:42.307 INFO [LocalGridModel.purgeDeadNodes] - Removing Node a37d5cd6-72c6-4d6d-8c1a-f04bae993c4a (uri: http://10.10.10.151:4444), DOWN for too long
^C08:27:13.860 INFO [LocalNode.stopAllSessions] - Trying to stop all running sessions before shutting down...
^C^C
Preliminary AI analysis:
Based on the thread dump, I've identified a classic deadlock scenario involving three threads: Local Distributor - Node Health Check, node-health-check-263, and Local Distributor - Purge Dead Nodes.
Here's a breakdown of the lock contention:
Local Distributor - Node Health Check is waiting to acquire a read lock in LocalNodeRegistry.runHealthChecks, which is held by node-health-check-263.
node-health-check-263 is waiting to acquire a write lock in LocalGridModel.setAvailability, which is held by Local Distributor - Purge Dead Nodes.
Local Distributor - Purge Dead Nodes is waiting to acquire a write lock in LocalNodeRegistry.remove, which is held by node-health-check-263.
This creates a circular dependency, leading to the deadlock. The root cause appears to be inconsistent lock ordering between different operations within LocalNodeRegistry and LocalGridModel.
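To illustrate what "consistent lock ordering" means here, this is a minimal sketch with two `ReentrantReadWriteLock`s where both code paths take the locks in the same order, so no cycle can form. `LockOrderingSketch` and its fields are stand-ins invented for the example; they are not the real LocalNodeRegistry or LocalGridModel code, and the chosen order (registry first, then model) is an assumption.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Stand-in class for illustration; not Selenium's LocalNodeRegistry or LocalGridModel.
class LockOrderingSketch {
  // Assumed rule for this sketch: registryLock is always acquired before modelLock.
  private final ReentrantReadWriteLock registryLock = new ReentrantReadWriteLock();
  private final ReentrantReadWriteLock modelLock = new ReentrantReadWriteLock();
  private int availabilityUpdates;
  private int removals;

  // Health-check path: registry (read) first, then model (write).
  void setAvailability() {
    registryLock.readLock().lock();
    try {
      modelLock.writeLock().lock();
      try {
        availabilityUpdates++;
      } finally {
        modelLock.writeLock().unlock();
      }
    } finally {
      registryLock.readLock().unlock();
    }
  }

  // Purge path: same order, registry (write) first, then model (write).
  void purge() {
    registryLock.writeLock().lock();
    try {
      modelLock.writeLock().lock();
      try {
        removals++;
      } finally {
        modelLock.writeLock().unlock();
      }
    } finally {
      registryLock.writeLock().unlock();
    }
  }

  // Runs both paths concurrently and returns {availabilityUpdates, removals}.
  static int[] runConcurrently(int iterations) {
    LockOrderingSketch sketch = new LockOrderingSketch();
    Thread healthCheck = new Thread(() -> {
      for (int i = 0; i < iterations; i++) {
        sketch.setAvailability();
      }
    });
    Thread purge = new Thread(() -> {
      for (int i = 0; i < iterations; i++) {
        sketch.purge();
      }
    });
    healthCheck.start();
    purge.start();
    try {
      healthCheck.join();
      purge.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return new int[] {sketch.availabilityUpdates, sketch.removals};
  }

  public static void main(String[] args) {
    int[] counts = runConcurrently(10_000);
    System.out.println("updates=" + counts[0] + ", removals=" + counts[1]);
  }
}
```

If `purge()` took `modelLock` before `registryLock` instead, the two threads could each hold one lock while waiting for the other, which is the shape of the cycle in the thread dump.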
Solution
@VietND96 @joerg1985 @shs96c This is an AI suggestion for how to fix the deadlock. Does it seem reasonable?
To resolve this, I have removed the unnecessary lock in the updateNodeAvailability method within LocalNodeRegistry.java. This ensures a consistent lock acquisition order and prevents the deadlock. The file has been updated with the fix.
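As a rough sketch of the idea behind that change (not the actual patch): if the registry method that forwards an availability update does not hold the registry's own lock while calling into the model, the model-then-registry order used by the purge path is the only nesting left, and the cycle disappears. `GridModelSketch`, `NodeRegistrySketch`, and all their members are hypothetical stand-ins; the real classes are structured differently.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Stand-in for a grid model that guards its own state with its own lock.
class GridModelSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, Boolean> up = new HashMap<>();

  void setAvailability(String nodeId, boolean isUp) {
    lock.writeLock().lock();
    try {
      up.put(nodeId, isUp);
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Purge path: holds the model lock, then calls registry.remove (model -> registry).
  void purgeDown(NodeRegistrySketch registry) {
    lock.writeLock().lock();
    try {
      Iterator<Map.Entry<String, Boolean>> it = up.entrySet().iterator();
      while (it.hasNext()) {
        Map.Entry<String, Boolean> entry = it.next();
        if (!entry.getValue()) {
          registry.remove(entry.getKey());
          it.remove();
        }
      }
    } finally {
      lock.writeLock().unlock();
    }
  }
}

// Stand-in for a node registry with its own lock.
class NodeRegistrySketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Set<String> nodes = new HashSet<>();
  private final GridModelSketch model;

  NodeRegistrySketch(GridModelSketch model) {
    this.model = model;
  }

  void add(String nodeId) {
    lock.writeLock().lock();
    try {
      nodes.add(nodeId);
    } finally {
      lock.writeLock().unlock();
    }
  }

  // No registry lock is held here: the model protects its own state,
  // so this path never nests registry -> model.
  void updateNodeAvailability(String nodeId, boolean isUp) {
    model.setAvailability(nodeId, isUp);
  }

  void remove(String nodeId) {
    lock.writeLock().lock();
    try {
      nodes.remove(nodeId);
    } finally {
      lock.writeLock().unlock();
    }
  }

  int size() {
    lock.readLock().lock();
    try {
      return nodes.size();
    } finally {
      lock.readLock().unlock();
    }
  }

  // Runs the health-check and purge paths concurrently; returns the remaining node count.
  static int demo(int iterations) {
    GridModelSketch model = new GridModelSketch();
    NodeRegistrySketch registry = new NodeRegistrySketch(model);
    registry.add("node-1");
    Thread health = new Thread(() -> {
      for (int i = 0; i < iterations; i++) {
        registry.updateNodeAvailability("node-1", false);
      }
    });
    Thread purge = new Thread(() -> {
      for (int i = 0; i < iterations; i++) {
        model.purgeDown(registry);
      }
    });
    health.start();
    purge.start();
    try {
      health.join();
      purge.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    model.purgeDown(registry); // final pass so the DOWN node is certainly removed
    return registry.size();
  }
}
```

Whether dropping the lock is safe in the real code depends on what else `updateNodeAvailability` reads or writes in the registry, which this sketch does not model.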