Skip to content

[grid] Preventing potential deadlock in Distributor#17022

Merged
VietND96 merged 3 commits intotrunkfrom
grid-distributor
Jan 30, 2026
Merged

[grid] Preventing potential deadlock in Distributor#17022
VietND96 merged 3 commits intotrunkfrom
grid-distributor

Conversation

@VietND96
Copy link
Member

🔗 Related Issues

💥 What does this PR do?

Fixes #17018 potential deadlock in LocalGridModel by firing events outside the lock scope

  • Fix deadlock between health check and purge dead nodes threads by moving event firing outside of lock scope in LocalGridModel
  • Events (NodeRestartedEvent, NodeRemovedEvent) were previously fired while holding LocalGridModel.lock, causing circular lock dependency with LocalNodeRegistry.lock

Write a unit test to simulate (based on analysis & thread dump in #17018)

  Thread A (node-health-check):                                                                                                                                                                                     
  1. LocalNodeRegistry.updateNodeAvailability() acquires LocalNodeRegistry.lock                                                                                                                                     
  2. Calls model.setAvailability() → waits for LocalGridModel.lock                                                                                                                                                  
                                                                                                                                                                                                                    
  Thread B (Purge Dead Nodes):                                                                                                                                                                                      
  1. model.purgeDeadNodes() acquires LocalGridModel.lock                                                                                                                                                            
  2. Fires NodeRemovedEvent inside lock                                                                                                                                                                             
  3. Event listener calls LocalNodeRegistry.remove() → waits for LocalNodeRegistry.lock                                                                                                                             
                                                                                                                                                                                                                    
  This creates a circular wait condition → deadlock. 

Fixed by updating LocalGridModel to fire events outside the lock scope:

  1. purgeDeadNodes(): Collect removed nodes in a set, release lock, then fire NodeRemovedEvent for each
  2. add(): Store restarted node reference, release lock, then fire NodeRestartedEvent if needed

🔧 Implementation Notes

💡 Additional Considerations

🔄 Types of changes

  • Cleanup (formatting, renaming)
  • Bug fix (backwards compatible)
  • New feature (non-breaking change which adds functionality and tests!)
  • Breaking change (fix or feature that would cause existing functionality to change)

Signed-off-by: Viet Nguyen Duc <nguyenducviet4496@gmail.com>
@qodo-code-review
Copy link
Contributor

PR Type

Bug fix


Description

  • Prevent deadlock between health check and purge threads by firing events outside lock scope

  • Move NodeRemovedEvent firing outside purgeDeadNodes() lock to avoid circular lock dependency

  • Move NodeRestartedEvent firing outside add() lock when node restarts with same URI

  • Add comprehensive test suite simulating deadlock scenarios and verifying event firing behavior


File Walkthrough

Relevant files
Bug fix
LocalGridModel.java
Fire events outside lock scope in LocalGridModel                 

java/src/org/openqa/selenium/grid/distributor/local/LocalGridModel.java

  • Refactored add() method to collect restarted node reference and fire
    NodeRestartedEvent after releasing write lock
  • Refactored purgeDeadNodes() method to collect removed nodes in a set
    and fire NodeRemovedEvent events after releasing write lock
  • Updated comment from "Send the previous state" to "Save the previous
    state" to reflect the new approach
+13/-3   
Tests
LocalGridModelDeadlockTest.java
Add deadlock prevention tests for LocalGridModel                 

java/test/org/openqa/selenium/grid/distributor/local/LocalGridModelDeadlockTest.java

  • New test class with three test methods simulating deadlock scenarios
  • shouldNotDeadlockWhenPurgingNodesWhileUpdatingAvailability() simulates
    concurrent threads with external locks to verify no deadlock occurs
  • purgeDeadNodesShouldFireEventsOutsideLock() verifies NodeRemovedEvent
    is fired after lock release by calling model methods in event listener
  • addShouldFireRestartEventOutsideLock() verifies NodeRestartedEvent is
    fired after lock release when node restarts with same URI
+275/-0 

@selenium-ci selenium-ci added B-grid Everything grid and server related C-java Java Bindings labels Jan 29, 2026
@qodo-code-review
Copy link
Contributor

qodo-code-review bot commented Jan 29, 2026

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🟡
🎫 #1234
🟡
🎫 #5678
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

🔴
Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Swallowed exceptions: The new test code catches broad Exception and ignores it (including interruption), which
can silently hide failures and reduce debuggability.

Referred Code
      } catch (Exception e) {
        // Ignore interruption
      } finally {
        testComplete.countDown();
      }
    });

// Thread 2: Calls purgeDeadNodes which fires event
executor.submit(
    () -> {
      try {
        thread2Started.countDown();
        thread1Started.await(1, TimeUnit.SECONDS);

        model.purgeDeadNodes();
      } catch (Exception e) {
        // Ignore interruption
      } finally {
        testComplete.countDown();
      }

Learn more about managing compliance generic rules or creating your own custom rules

  • Update
Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review
Copy link
Contributor

qodo-code-review bot commented Jan 29, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Improve test to correctly verify lock release

Improve the purgeDeadNodesShouldFireEventsOutsideLock test by attempting to
acquire a write lock instead of a read lock in the event listener. This provides
a more accurate verification that the lock is released before the event is
fired, as the current read lock check can pass even if the write lock is held.

java/test/org/openqa/selenium/grid/distributor/local/LocalGridModelDeadlockTest.java [183-196]

 // When event fires, try to call another model method that needs the lock
 events.addListener(
     NodeRemovedEvent.listener(
         status -> {
           try {
-            // This needs the read lock - if event is fired inside write lock,
-            // this would work due to lock downgrading, but we verify the pattern
-            model.getSnapshot();
+            // This needs the write lock. If the event is fired inside the
+            // purgeDeadNodes's write lock, this will block and time out.
+            // We use a dummy node ID since we just need to acquire the lock.
+            model.setAvailability(new NodeId(UUID.randomUUID()), UP);
             canAcquireLock.set(true);
           } catch (Exception e) {
             canAcquireLock.set(false);
           }
           eventFired.countDown();
         }));
  • Apply / Chat
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a flaw in the test logic where lock downgrading in ReentrantReadWriteLock could lead to a false positive, and proposes a more robust check by attempting to acquire a write lock, significantly improving the test's reliability.

Medium
Catch exceptions during removal events

Wrap the events.fire(new NodeRemovedEvent(node)) call within a try-catch block
inside the loop. This prevents a single failing event listener from stopping the
notification process for other removed nodes.

java/src/org/openqa/selenium/grid/distributor/local/LocalGridModel.java [290]

-removedNodes.forEach(node -> events.fire(new NodeRemovedEvent(node)));
+for (NodeStatus node : removedNodes) {
+  try {
+    events.fire(new NodeRemovedEvent(node));
+  } catch (RuntimeException e) {
+    LOG.warning(String.format(
+      "Error firing NodeRemovedEvent for node %s: %s",
+      node.getNodeId(), e.getMessage()));
+  }
+}
  • Apply / Chat
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly points out that an exception in an event listener could halt the processing of subsequent node removal events, and proposes a robust solution by wrapping each event fire in a try-catch block.

Medium
Preserve event firing order

Change removedNodes from a HashSet to an ArrayList to ensure that
NodeRemovedEvents are fired in a deterministic order.

java/src/org/openqa/selenium/grid/distributor/local/LocalGridModel.java [231]

-Set<NodeStatus> removedNodes = new HashSet<>();
+List<NodeStatus> removedNodes = new ArrayList<>();
  • Apply / Chat
Suggestion importance[1-10]: 2

__

Why: The suggestion to use a List for deterministic event order is valid, but the current implementation's use of a Set is also correct and has no functional downside, as the order of NodeRemovedEvents is not critical.

Low
Learned
best practice
Don’t ignore thread failures

Capture any thrown exceptions and restore the interrupt flag on interruption,
then assert no exceptions occurred so the test cannot pass while threads
silently failed.

java/test/org/openqa/selenium/grid/distributor/local/LocalGridModelDeadlockTest.java [115-131]

+AtomicReference<Throwable> threadFailure = new AtomicReference<>();
+
 executor.submit(
     () -> {
       try {
         thread1Started.countDown();
         thread2Started.await(1, TimeUnit.SECONDS);
 
         synchronized (externalLock) {
-          // This simulates LocalNodeRegistry.updateNodeAvailability()
-          // which holds its lock and calls model.setAvailability()
           model.setAvailability(nodeId, UP);
         }
-      } catch (Exception e) {
-        // Ignore interruption
+      } catch (InterruptedException e) {
+        Thread.currentThread().interrupt();
+        threadFailure.compareAndSet(null, e);
+      } catch (Throwable t) {
+        threadFailure.compareAndSet(null, t);
       } finally {
         testComplete.countDown();
       }
     });
  • Apply / Chat
Suggestion importance[1-10]: 6

__

Why:
Relevant best practice - Do not swallow exceptions/interrupts in concurrency tests; propagate failures and restore interrupt status to avoid false positives.

Low
Clean up executors reliably

Move shutdown into a finally block and await termination (best-effort) so
threads are reliably cleaned up even when assertions fail or time out.

java/test/org/openqa/selenium/grid/distributor/local/LocalGridModelDeadlockTest.java [149-154]

-boolean completed = testComplete.await(5, TimeUnit.SECONDS);
-executor.shutdownNow();
+boolean completed;
+try {
+  completed = testComplete.await(5, TimeUnit.SECONDS);
+} finally {
+  executor.shutdownNow();
+  executor.awaitTermination(5, TimeUnit.SECONDS);
+}
 
 if (!completed) {
   deadlockDetected.set(true);
 }
  • Apply / Chat
Suggestion importance[1-10]: 5

__

Why:
Relevant best practice - Ensure executor lifecycle is safely managed in tests (shutdown in finally and await termination) to prevent thread leaks and flakiness.

Low
  • Update

Signed-off-by: Viet Nguyen Duc <nguyenducviet4496@gmail.com>
@VietND96 VietND96 merged commit a47041a into trunk Jan 30, 2026
56 of 57 checks passed
@VietND96 VietND96 deleted the grid-distributor branch January 30, 2026 07:02
@asolntsev asolntsev added this to the 4.41.0 milestone Jan 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

B-grid Everything grid and server related C-java Java Bindings Review effort 3/5

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[🐛 Bug]: Deadlock in Selenium Grid

4 participants