Skip to content

Fix NPE caused by race condition in async search when minimise round trips is true#117504

Merged
pawankartik-elastic merged 24 commits intoelastic:mainfrom
pawankartik-elastic:pkar/async-min-round-trip-npe
Jan 27, 2025
Merged

Fix NPE caused by race condition in async search when minimise round trips is true#117504
pawankartik-elastic merged 24 commits intoelastic:mainfrom
pawankartik-elastic:pkar/async-min-round-trip-npe

Conversation

@pawankartik-elastic
Copy link
Copy Markdown
Contributor

@pawankartik-elastic pawankartik-elastic commented Nov 25, 2024

Previously, the notifyListShards() initialised and updated the required pre-requisites (searchResponse being amongst them) when a search op began. This function takes in arguments that contain shard-specific details amongst others. Because this information is not immediately available when the search begins, it is not immediately called. In some specific cases, there can be a race condition that can cause the pre-requisities (such as searchResponse) to be accessed before they're initialised, causing an NPE.

This fix addresses the race condition by ensuring that MutableSearchResponse is instantiated right from the beginning and instead populating the shards count and clusters via onListShards().

…trips is true

Previously, the `notifyListShards()` initialised and updated the
required pre-requisites (`searchResponse` being amongst them) when a
search op began. This function takes in arguments that contain
shard-specific details amongst others. Because this information is not
immediately available when the search begins, it is not immediately
called. In some specific cases, there can be a race condition that can
cause the pre-requisities (such as `searchResponse`) to be accessed
before they're initialised, causing an NPE.

This fix addresses the race condition by splitting the initialisation
and subsequent updation amongst 2 different methods. This way, the
pre-requisities are always initialised and do not lead to an NPE.
@pawankartik-elastic pawankartik-elastic force-pushed the pkar/async-min-round-trip-npe branch from 341ec23 to 3328dae Compare November 27, 2024 10:12
signify search start.

To prevent polluting the progress listener with unnecessary search
specific details, we now pass the `Clusters` object to `SearchTask` when
a search op begins. This lets `AsyncSearchTask` access it and use it to
initialise `MutableSearchResponse` appropriately.
@pawankartik-elastic
Copy link
Copy Markdown
Contributor Author

pawankartik-elastic commented Jan 16, 2025

Follow up:

  1. I've moved out the shard counts, i.e. the mutable entities out of MutableSearchResponse. This allows me to set searchResponse right from the beginning.
  2. toAsyncResponse() hasn't been moved out of MutableSearchResponse because it requires information that is private and specific to this class. If I were to move it out, I'd have to pass this info manually when and where necessary. Even if I did move it, there'd be no benefit since methods like onXYZ() that lead to this NPE are invoked prior to response building.

Edit: After some internal discussions, it was decided to revert e7b8a7c and instead proceed with the approach in the commit 2636c74.

@pawankartik-elastic pawankartik-elastic added >bug auto-backport Automatically create backport pull requests when merged Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch :Search Foundations/Search Catch all for Search Foundations v8.17.2 labels Jan 21, 2025
@pawankartik-elastic pawankartik-elastic marked this pull request as ready for review January 21, 2025 15:19
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @pawankartik-elastic, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

Copy link
Copy Markdown
Contributor

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code change LGTM, I wonder if we have enough test coverage. Didn't you work on reproducing the original issue and tests already? Perhaps we could expand unit tests in AsyncSearchTaskTests and recreate the scenario that required the fix?

@pawankartik-elastic pawankartik-elastic force-pushed the pkar/async-min-round-trip-npe branch from 776d9c3 to 066a805 Compare January 22, 2025 17:33
Copy link
Copy Markdown
Contributor

@quux00 quux00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - left one minor comment improvement suggestion

@pawankartik-elastic pawankartik-elastic merged commit 01edab5 into elastic:main Jan 27, 2025
pawankartik-elastic added a commit to pawankartik-elastic/elasticsearch that referenced this pull request Jan 27, 2025
…trips is true (elastic#117504)

* Fix NPE caused by race condition in async search when minimise round trips is true

Previously, the `notifyListShards()` initialised and updated the
required pre-requisites (`searchResponse` being amongst them) when a
search op began. This function takes in arguments that contain
shard-specific details amongst others. Because this information is not
immediately available when the search begins, it is not immediately
called. In some specific cases, there can be a race condition that can
cause the pre-requisities (such as `searchResponse`) to be accessed
before they're initialised, causing an NPE.

This fix addresses the race condition by splitting the initialisation
and subsequent updation amongst 2 different methods. This way, the
pre-requisities are always initialised and do not lead to an NPE.

* Try: call `notifyListShards()` after `notifySearchStart()` when minimize round trips is true

* Add removed code comment

* Pass `Clusters` to `SearchTask` rather than using progress listener to
signify search start.

To prevent polluting the progress listener with unnecessary search
specific details, we now pass the `Clusters` object to `SearchTask` when
a search op begins. This lets `AsyncSearchTask` access it and use it to
initialise `MutableSearchResponse` appropriately.

* Use appropriate `clusters` object rather than re-building it

* Do not double set `mutableSearchResponse`

* Move mutable entities such as shard counts out of `MutableSearchResponse`

* Address PR review: revert moving out mutable entities from
`MutableSearchResponse`

* Update docs/changelog/117504.yaml

* Get rid of `SetOnce` for `searchResponse`

* Drop redundant check around shards count

* Add a test that calls `onListShards()` at last and clarify `updateShardsAndClusters()`'s comment

* Fix test: ref count

* Address review comment: rewrite comment and test
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

💚 Backport successful

Status Branch Result
8.x
8.17

pawankartik-elastic added a commit to pawankartik-elastic/elasticsearch that referenced this pull request Jan 27, 2025
…trips is true (elastic#117504)

* Fix NPE caused by race condition in async search when minimise round trips is true

Previously, the `notifyListShards()` initialised and updated the
required pre-requisites (`searchResponse` being amongst them) when a
search op began. This function takes in arguments that contain
shard-specific details amongst others. Because this information is not
immediately available when the search begins, it is not immediately
called. In some specific cases, there can be a race condition that can
cause the pre-requisities (such as `searchResponse`) to be accessed
before they're initialised, causing an NPE.

This fix addresses the race condition by splitting the initialisation
and subsequent updation amongst 2 different methods. This way, the
pre-requisities are always initialised and do not lead to an NPE.

* Try: call `notifyListShards()` after `notifySearchStart()` when minimize round trips is true

* Add removed code comment

* Pass `Clusters` to `SearchTask` rather than using progress listener to
signify search start.

To prevent polluting the progress listener with unnecessary search
specific details, we now pass the `Clusters` object to `SearchTask` when
a search op begins. This lets `AsyncSearchTask` access it and use it to
initialise `MutableSearchResponse` appropriately.

* Use appropriate `clusters` object rather than re-building it

* Do not double set `mutableSearchResponse`

* Move mutable entities such as shard counts out of `MutableSearchResponse`

* Address PR review: revert moving out mutable entities from
`MutableSearchResponse`

* Update docs/changelog/117504.yaml

* Get rid of `SetOnce` for `searchResponse`

* Drop redundant check around shards count

* Add a test that calls `onListShards()` at last and clarify `updateShardsAndClusters()`'s comment

* Fix test: ref count

* Address review comment: rewrite comment and test
elasticsearchmachine pushed a commit that referenced this pull request Jan 27, 2025
…trips is true (#117504) (#120955)

* Fix NPE caused by race condition in async search when minimise round trips is true

Previously, the `notifyListShards()` initialised and updated the
required pre-requisites (`searchResponse` being amongst them) when a
search op began. This function takes in arguments that contain
shard-specific details amongst others. Because this information is not
immediately available when the search begins, it is not immediately
called. In some specific cases, there can be a race condition that can
cause the pre-requisities (such as `searchResponse`) to be accessed
before they're initialised, causing an NPE.

This fix addresses the race condition by splitting the initialisation
and subsequent updation amongst 2 different methods. This way, the
pre-requisities are always initialised and do not lead to an NPE.

* Try: call `notifyListShards()` after `notifySearchStart()` when minimize round trips is true

* Add removed code comment

* Pass `Clusters` to `SearchTask` rather than using progress listener to
signify search start.

To prevent polluting the progress listener with unnecessary search
specific details, we now pass the `Clusters` object to `SearchTask` when
a search op begins. This lets `AsyncSearchTask` access it and use it to
initialise `MutableSearchResponse` appropriately.

* Use appropriate `clusters` object rather than re-building it

* Do not double set `mutableSearchResponse`

* Move mutable entities such as shard counts out of `MutableSearchResponse`

* Address PR review: revert moving out mutable entities from
`MutableSearchResponse`

* Update docs/changelog/117504.yaml

* Get rid of `SetOnce` for `searchResponse`

* Drop redundant check around shards count

* Add a test that calls `onListShards()` at last and clarify `updateShardsAndClusters()`'s comment

* Fix test: ref count

* Address review comment: rewrite comment and test
elasticsearchmachine pushed a commit that referenced this pull request Jan 27, 2025
…trips is true (#117504) (#120954)

* Fix NPE caused by race condition in async search when minimise round trips is true

Previously, the `notifyListShards()` initialised and updated the
required pre-requisites (`searchResponse` being amongst them) when a
search op began. This function takes in arguments that contain
shard-specific details amongst others. Because this information is not
immediately available when the search begins, it is not immediately
called. In some specific cases, there can be a race condition that can
cause the pre-requisities (such as `searchResponse`) to be accessed
before they're initialised, causing an NPE.

This fix addresses the race condition by splitting the initialisation
and subsequent updation amongst 2 different methods. This way, the
pre-requisities are always initialised and do not lead to an NPE.

* Try: call `notifyListShards()` after `notifySearchStart()` when minimize round trips is true

* Add removed code comment

* Pass `Clusters` to `SearchTask` rather than using progress listener to
signify search start.

To prevent polluting the progress listener with unnecessary search
specific details, we now pass the `Clusters` object to `SearchTask` when
a search op begins. This lets `AsyncSearchTask` access it and use it to
initialise `MutableSearchResponse` appropriately.

* Use appropriate `clusters` object rather than re-building it

* Do not double set `mutableSearchResponse`

* Move mutable entities such as shard counts out of `MutableSearchResponse`

* Address PR review: revert moving out mutable entities from
`MutableSearchResponse`

* Update docs/changelog/117504.yaml

* Get rid of `SetOnce` for `searchResponse`

* Drop redundant check around shards count

* Add a test that calls `onListShards()` at last and clarify `updateShardsAndClusters()`'s comment

* Fix test: ref count

* Address review comment: rewrite comment and test
@pawankartik-elastic pawankartik-elastic deleted the pkar/async-min-round-trip-npe branch June 26, 2025 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged >bug :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v8.17.2 v8.18.0 v9.0.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants