[🐛 Bug]: Deadlock in Selenium Grid #17018

@asolntsev

Description

I started Grid on my laptop and ran a few tests.
The Grid then sat unused and untouched for a week.
After that, I tried to stop it.

Result: it did not stop.
I took a thread dump; it contains the phrase "Found one Java-level deadlock:".

Thread dump: https://gist.github.com/asolntsev/a0a30949505b42422b338ea9cb56306e
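For anyone who wants to check for the same condition from inside a JVM rather than from an external thread dump, here is a minimal sketch using the standard `java.lang.management.ThreadMXBean` API. The class name `DeadlockProbe` and the method `describeDeadlocks` are made up for illustration and are not part of Selenium. Note the assumption: this only inspects the JVM it runs in, so it would have to be executed inside the Grid process (or adapted to a remote JMX connection) to say anything about Grid.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Selenium.
public class DeadlockProbe {

  // Returns a short report about deadlocked threads in the current JVM.
  static String describeDeadlocks() {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    // Covers object monitors and ownable synchronizers; returns null if none are deadlocked.
    long[] ids = threads.findDeadlockedThreads();
    if (ids == null) {
      return "no deadlock detected";
    }
    StringBuilder report = new StringBuilder("deadlocked threads:\n");
    for (ThreadInfo info : threads.getThreadInfo(ids, true, true)) {
      if (info == null) {
        continue; // thread terminated between the two calls
      }
      report.append("  ").append(info.getThreadName())
          .append(" waits for ").append(info.getLockName())
          .append(" held by ").append(info.getLockOwnerName())
          .append('\n');
    }
    return report.toString();
  }

  public static void main(String[] args) {
    System.out.println(describeDeadlocks());
  }
}
```

The per-thread "waits for ... held by ..." lines carry roughly the same information as the "Found one Java-level deadlock:" section of a thread dump.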

Reproducible Code

> java -version
openjdk version "17.0.17"


> java -jar selenium-server-4.40.0.jar standalone --selenium-manager true --enable-managed-downloads true

Debugging Logs

17:55:41.910 INFO [LocalNode.stopTimedOutSession] - Session id 6a7a7b915237deb9317591818bea4a41 is stopping on demand...
17:55:41.910 INFO [SessionSlot.stop] - Stopping session 6a7a7b915237deb9317591818bea4a41 (reason: QUIT_COMMAND)
17:55:41.911 INFO [SessionSlot.stop] - Session stopped successfully: 6a7a7b915237deb9317591818bea4a41
17:55:41.911 INFO [LocalSessionMap.removeWithReason] - Deleted session from local Session Map, Id: 6a7a7b915237deb9317591818bea4a41, Node: http://10.10.10.151:4444, Reason: session closed normally (QUIT command)
17:55:41.912 INFO [LocalGridModel.release] - Releasing slot for session id 6a7a7b915237deb9317591818bea4a41
18:13:23.741 INFO [LocalGridModel.purgeDeadNodes] - Switching Node a37d5cd6-72c6-4d6d-8c1a-f04bae993c4a (uri: http://10.10.10.151:4444) from UP to DOWN
01:05:18.213 INFO [LocalGridModel.purgeDeadNodes] - Switching Node a37d5cd6-72c6-4d6d-8c1a-f04bae993c4a (uri: http://10.10.10.151:4444) from UP to DOWN
04:30:42.307 INFO [LocalGridModel.purgeDeadNodes] - Removing Node a37d5cd6-72c6-4d6d-8c1a-f04bae993c4a (uri: http://10.10.10.151:4444), DOWN for too long


^C08:27:13.860 INFO [LocalNode.stopAllSessions] - Trying to stop all running sessions before shutting down...


^C^C

Preliminary AI analysis:

Based on the thread dump, I've identified a classic deadlock scenario involving three threads: Local Distributor - Node Health Check, node-health-check-263, and Local Distributor - Purge Dead Nodes.

Here's a breakdown of the lock contention:

  1. Local Distributor - Node Health Check is waiting to acquire a read lock in LocalNodeRegistry.runHealthChecks, which is held by node-health-check-263.
  2. node-health-check-263 is waiting to acquire a write lock in LocalGridModel.setAvailability, which is held by Local Distributor - Purge Dead Nodes.
  3. Local Distributor - Purge Dead Nodes is waiting to acquire a write lock in LocalNodeRegistry.remove, which is held by node-health-check-263.

This creates a circular dependency, leading to the deadlock. The root cause appears to be inconsistent lock ordering between different operations within LocalNodeRegistry and LocalGridModel.
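To illustrate what "inconsistent lock ordering" means in general, here is a small self-contained sketch. It is not Selenium code: `registryLock` and `modelLock` are hypothetical stand-ins for the locks in LocalNodeRegistry and LocalGridModel, and plain `ReentrantLock` is used instead of read/write locks to keep it short. The second acquisition uses a bounded `tryLock` so the demo terminates instead of hanging the way the Grid did.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderingSketch {

  // Hypothetical stand-ins for the registry and model locks named in the analysis.
  static final Lock registryLock = new ReentrantLock();
  static final Lock modelLock = new ReentrantLock();

  // Holds `first`, waits until the peer thread holds its own first lock,
  // then makes a bounded attempt at `second`. Returns true if `second` was acquired.
  static boolean holdThenTry(Lock first, Lock second,
      CountDownLatch holding, CountDownLatch tried) throws InterruptedException {
    first.lock();
    try {
      holding.countDown();
      holding.await(); // both threads now hold their first lock
      boolean acquired = second.tryLock(200, TimeUnit.MILLISECONDS);
      if (acquired) {
        second.unlock();
      }
      tried.countDown();
      tried.await(); // keep `first` until the peer has finished trying
      return acquired;
    } finally {
      first.unlock();
    }
  }

  // Two threads take the locks in opposite order; counts how many got their second lock.
  static int opposedOrderSuccesses() throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      CountDownLatch holding = new CountDownLatch(2);
      CountDownLatch tried = new CountDownLatch(2);
      Future<Boolean> a = pool.submit(() -> holdThenTry(registryLock, modelLock, holding, tried));
      Future<Boolean> b = pool.submit(() -> holdThenTry(modelLock, registryLock, holding, tried));
      return (a.get() ? 1 : 0) + (b.get() ? 1 : 0);
    } finally {
      pool.shutdownNow();
    }
  }

  // Two threads take the locks in the same order (registry, then model);
  // counts how many finished within the timeout.
  static int sameOrderCompletions() throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      Callable<Boolean> task = () -> {
        registryLock.lock();
        try {
          modelLock.lock();
          try {
            return true;
          } finally {
            modelLock.unlock();
          }
        } finally {
          registryLock.unlock();
        }
      };
      Future<Boolean> a = pool.submit(task);
      Future<Boolean> b = pool.submit(task);
      return (a.get(5, TimeUnit.SECONDS) ? 1 : 0) + (b.get(5, TimeUnit.SECONDS) ? 1 : 0);
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("opposed order, second lock acquired by: " + opposedOrderSuccesses());
    System.out.println("same order, threads completed: " + sameOrderCompletions());
  }
}
```

With opposed ordering, neither thread can get its second lock while the other still holds it; with unbounded `lock()` calls that would be a permanent hang. With one agreed order, both threads complete.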

Solution

@VietND96 @joerg1985 @shs96c This is an AI suggestion for how to fix the deadlock. Does it seem reasonable?

To resolve this, I have removed the unnecessary lock in the updateNodeAvailability method within LocalNodeRegistry.java. This ensures a consistent lock acquisition order and prevents the deadlock. The file has been updated with the fix.

Metadata

Assignees

No one assigned

    Labels

    A-needs-triaging (A Selenium member will evaluate this soon!), B-grid (Everything grid and server related), C-java (Java Bindings), I-defect (Something is not working as intended), OS-mac