
Fixing a race condition in EnrichCoordinatorProxyAction that can leave an item stuck in its queue#90688

Merged
masseyke merged 6 commits into elastic:main from
masseyke:fix/EnrichCoordinatorProxyAction-race-condition
Oct 5, 2022

@masseyke
Member

@masseyke masseyke commented Oct 5, 2022

There is a race condition in EnrichCoordinatorProxyAction that can result in an item being stuck in its queue even once all threads related to any schedule() calls have completed. The item will be flushed out on the next call to schedule() but there is no guarantee if or when that will happen. This PR adds an additional check for orphaned items in the queue.

Here's what I believe is happening (I can only reproduce it in fewer than 1 in 10,000 tries so I don't have direct evidence):

  • Say thread #1 calls schedule() and then coordinateLookups(). It drains the whole queue, comes through the while loop a second time, finds nothing there, and is about to call remoteRequestPermits.release().
  • While that is happening, thread #2 is in schedule() and calls offer() on the queue.
  • At this point thread #1 still holds the remoteRequestPermits permit and has already decided there is nothing in the queue.
  • Thread #2 now enters coordinateLookups() and gets false from remoteRequestPermits.tryAcquire() because thread #1 still holds the permit.
  • Thread #2 exits coordinateLookups(), and then thread #1 exits coordinateLookups(), but there is still one item in the queue.
  • That item is just going to stick around in the queue until someone schedules something else.

(Note that there are actually more threads than just the two I mention, since coordinateLookups() makes an async call back to itself.)
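
To make the interleaving above concrete, here is a minimal single-threaded replay of it. This is only a sketch under assumptions, not the code in EnrichCoordinatorProxyAction: the class name `OrphanCheckSketch`, the `recheckAfterRelease` flag, and the one-shot `beforeRelease` hook are all hypothetical and exist purely to force the "offer lands between drain and release" step deterministically. It also processes batches synchronously, whereas the real coordinator hands them to an async lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;

// Hypothetical, simplified model of a queue drained under a single permit.
class OrphanCheckSketch {
    final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    final Semaphore permits = new Semaphore(1);
    final List<String> processed = new ArrayList<>(); // only touched while holding the permit
    final boolean recheckAfterRelease;                // true = apply the orphan check
    Runnable beforeRelease;                           // hypothetical one-shot hook for the replay

    OrphanCheckSketch(boolean recheckAfterRelease) {
        this.recheckAfterRelease = recheckAfterRelease;
    }

    void schedule(String item) {
        queue.offer(item);
        coordinate();
    }

    void coordinate() {
        while (true) {
            if (permits.tryAcquire() == false) {
                return; // someone else holds the permit and is expected to drain the queue
            }
            List<String> batch = new ArrayList<>();
            if (queue.drainTo(batch) == 0) {
                // The queue looked empty. The hook simulates a second thread that
                // offers an item right now and fails to acquire the permit.
                Runnable hook = beforeRelease;
                beforeRelease = null;
                if (hook != null) {
                    hook.run();
                }
                permits.release();
                // The orphan check: look at the queue again after releasing.
                if (recheckAfterRelease && queue.isEmpty() == false) {
                    continue;
                }
                return;
            }
            processed.addAll(batch);
            permits.release();
        }
    }

    // Replays the interleaving and returns how many items are left in the queue.
    static int orphansAfterReplay(boolean recheck) {
        OrphanCheckSketch c = new OrphanCheckSketch(recheck);
        c.beforeRelease = () -> c.schedule("late item");
        c.schedule("first item");
        return c.queue.size();
    }

    public static void main(String[] args) {
        System.out.println(orphansAfterReplay(false)); // prints 1: the late item is stuck
        System.out.println(orphansAfterReplay(true));  // prints 0: the re-check picks it up
    }
}
```

Without the re-check, the late item stays in the queue until the next schedule() call. With it, the thread that just released the permit notices the item and loops once more.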

Closes #90598

@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Oct 5, 2022
@elasticsearchmachine
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

@masseyke
Member Author

masseyke commented Oct 5, 2022

This PR causes a few more loops in the code, but I don't think it will be a noticeable performance hit -- the additional loops are rare and fast. I ran the test (CoordinatorTests.testAllSearchesExecuted()) 100,000 times, and this branch was hit fewer than 7,000 times. The runtime was no different than it was without the change.

@masseyke masseyke added :Distributed/Ingest Node Execution or management of Ingest Pipelines and removed :Data Management/Other labels Oct 5, 2022
@masseyke
Member Author

masseyke commented Oct 5, 2022

@elasticmachine update branch

Member

@jbaiera jbaiera left a comment

LGTM, fixed a small typo is all.

…ich/action/EnrichCoordinatorProxyAction.java

Co-authored-by: James Baiera <james.baiera@gmail.com>
@masseyke masseyke merged commit 120da9b into elastic:main Oct 5, 2022
@masseyke masseyke deleted the fix/EnrichCoordinatorProxyAction-race-condition branch October 5, 2022 20:28
Member

@martijnvg martijnvg left a comment

LGTM

weizijun added a commit to weizijun/elasticsearch that referenced this pull request Oct 10, 2022

Labels

>bug
:Distributed/Ingest Node Execution or management of Ingest Pipelines
Team:Data Management (obsolete) DO NOT USE. This team no longer exists.
v8.6.0

Development

Successfully merging this pull request may close these issues.

[CI] CoordinatorTests testAllSearchesExecuted failing

5 participants