Skip to content

Fix SearchTimeoutIT#120390

Merged
javanna merged 5 commits intoelastic:mainfrom
javanna:test/search_timeout_it
Feb 10, 2025
Merged

Fix SearchTimeoutIT#120390
javanna merged 5 commits intoelastic:mainfrom
javanna:test/search_timeout_it

Conversation

@javanna
Copy link
Copy Markdown
Contributor

@javanna javanna commented Jan 17, 2025

Two of the timeout tests have been muted for several months. The reason is that we tightened the assertions to cover for partial results being returned, but there were edge cases in which partial results were not actually returned.

The timeout used in the test was time dependent, hence when the timeout precisely will be thrown is unpredictable, because we have timeout checks in different places in the codebase, when iterating through the leaves, before scoring any document, or while scoring documents. The edge case that caused failures is a typical timing issue where the initial check for timeout in CancellableBulkScorer already triggers the timeout, before any document has been collected.

I made several adjustments to the test to make it more robust:

  • use index random to index documents, that speeds it up
  • share indexing across test methods, so that it happens once at the suite level
  • replace the custom query that triggers a timeout to not be a script query, but rather a lucene query that is not time dependent, and throws a time exceeded exception precisely where we expect it, so that we can test how the system reacts to that. That allows to test that partial results are always returned when a timeout happens while scoring documents, and that partial results are never returned when a timeout happens before we even started to score documents.

Closes #98369
Closes #98053

@elasticsearchmachine elasticsearchmachine added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jan 17, 2025
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@javanna javanna added >test Issues or PRs that are addressing/adding tests auto-backport Automatically create backport pull requests when merged labels Jan 20, 2025
@javanna
Copy link
Copy Markdown
Contributor Author

javanna commented Jan 21, 2025

run elasticsearch-ci/part-1

Two of the timeout tests have been muted for several months. The reason is that we
tightened the assertions to cover for partial results being returned, but there were
edge cases in which partial results were not actually returned.

The edge case is a typical timing issue where the initial check for timeout in
CancellableBulkScorer already triggers the timeout, before any document has been collected.

I made several adjustments to the test to make it more robust:
- use index random to index documents, that speeds it up
- share indexing across test methods, so that it happens once at the suite level
- raise a single timeout, rather than one per visited document which causes an unnecessary slowdown
- make sure that one document is always visited before a timeout, so that partial results are
always guaranteed to be returned

Closes elastic#98369
Closes elastic#98053
@javanna javanna force-pushed the test/search_timeout_it branch from c404cfa to 4ee35b3 Compare February 10, 2025 09:11
Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for fixing this Luca!

Nicely done - quite easy to follow and understand 👍

@javanna javanna merged commit ebba004 into elastic:main Feb 10, 2025
@javanna javanna deleted the test/search_timeout_it branch February 10, 2025 16:06
javanna added a commit to javanna/elasticsearch that referenced this pull request Feb 10, 2025
Two of the timeout tests have been muted for several months. The reason is that we tightened the assertions to cover for partial results being returned, but there were edge cases in which partial results were not actually returned.

The timeout used in the test was time dependent, hence when the timeout precisely will be thrown is unpredictable, because we have timeout checks in different places in the codebase, when iterating through the leaves, before scoring any document, or while scoring documents. The edge case that caused failures is a typical timing issue where the initial check for timeout in CancellableBulkScorer already triggers the timeout, before any document has been collected.

I made several adjustments to the test to make it more robust:
- use index random to index documents, that speeds it up
- share indexing across test methods, so that it happens once at the suite level
- replace the custom query that triggers a timeout to not be a script query, but rather a lucene query that is not time dependent, and throws a time exceeded exception precisely where we expect it, so that we can test how the system reacts to that. That allows to test that partial results are always returned when a timeout happens while scoring documents, and that partial results are never returned when a timeout happens before we even started to score documents. 

Closes elastic#98369
Closes elastic#98053
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

💔 Backport failed

Status Branch Result
9.0
8.18 Commit could not be cherrypicked due to conflicts
8.x Commit could not be cherrypicked due to conflicts
8.16 Commit could not be cherrypicked due to conflicts
8.17 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 120390

javanna added a commit to javanna/elasticsearch that referenced this pull request Feb 10, 2025
Two of the timeout tests have been muted for several months. The reason is that we tightened the assertions to cover for partial results being returned, but there were edge cases in which partial results were not actually returned.

The timeout used in the test was time dependent, hence when the timeout precisely will be thrown is unpredictable, because we have timeout checks in different places in the codebase, when iterating through the leaves, before scoring any document, or while scoring documents. The edge case that caused failures is a typical timing issue where the initial check for timeout in CancellableBulkScorer already triggers the timeout, before any document has been collected.

I made several adjustments to the test to make it more robust:
- use index random to index documents, that speeds it up
- share indexing across test methods, so that it happens once at the suite level
- replace the custom query that triggers a timeout to not be a script query, but rather a lucene query that is not time dependent, and throws a time exceeded exception precisely where we expect it, so that we can test how the system reacts to that. That allows to test that partial results are always returned when a timeout happens while scoring documents, and that partial results are never returned when a timeout happens before we even started to score documents. 

Closes elastic#98369
Closes elastic#98053
javanna added a commit to javanna/elasticsearch that referenced this pull request Feb 10, 2025
Two of the timeout tests have been muted for several months. The reason is that we tightened the assertions to cover for partial results being returned, but there were edge cases in which partial results were not actually returned.

The timeout used in the test was time dependent, hence when the timeout precisely will be thrown is unpredictable, because we have timeout checks in different places in the codebase, when iterating through the leaves, before scoring any document, or while scoring documents. The edge case that caused failures is a typical timing issue where the initial check for timeout in CancellableBulkScorer already triggers the timeout, before any document has been collected.

I made several adjustments to the test to make it more robust:
- use index random to index documents, that speeds it up
- share indexing across test methods, so that it happens once at the suite level
- replace the custom query that triggers a timeout to not be a script query, but rather a lucene query that is not time dependent, and throws a time exceeded exception precisely where we expect it, so that we can test how the system reacts to that. That allows to test that partial results are always returned when a timeout happens while scoring documents, and that partial results are never returned when a timeout happens before we even started to score documents.

Closes elastic#98369
Closes elastic#98053
@javanna javanna removed the v8.16.5 label Feb 10, 2025
javanna added a commit to javanna/elasticsearch that referenced this pull request Feb 10, 2025
Two of the timeout tests have been muted for several months. The reason is that we tightened the assertions to cover for partial results being returned, but there were edge cases in which partial results were not actually returned.

The timeout used in the test was time dependent, hence when the timeout precisely will be thrown is unpredictable, because we have timeout checks in different places in the codebase, when iterating through the leaves, before scoring any document, or while scoring documents. The edge case that caused failures is a typical timing issue where the initial check for timeout in CancellableBulkScorer already triggers the timeout, before any document has been collected.

I made several adjustments to the test to make it more robust:
- use index random to index documents, that speeds it up
- share indexing across test methods, so that it happens once at the suite level
- replace the custom query that triggers a timeout to not be a script query, but rather a lucene query that is not time dependent, and throws a time exceeded exception precisely where we expect it, so that we can test how the system reacts to that. That allows to test that partial results are always returned when a timeout happens while scoring documents, and that partial results are never returned when a timeout happens before we even started to score documents.

Closes elastic#98369
Closes elastic#98053
elasticsearchmachine pushed a commit that referenced this pull request Feb 10, 2025
Two of the timeout tests have been muted for several months. The reason is that we tightened the assertions to cover for partial results being returned, but there were edge cases in which partial results were not actually returned.

The timeout used in the test was time dependent, hence when the timeout precisely will be thrown is unpredictable, because we have timeout checks in different places in the codebase, when iterating through the leaves, before scoring any document, or while scoring documents. The edge case that caused failures is a typical timing issue where the initial check for timeout in CancellableBulkScorer already triggers the timeout, before any document has been collected.

I made several adjustments to the test to make it more robust:
- use index random to index documents, that speeds it up
- share indexing across test methods, so that it happens once at the suite level
- replace the custom query that triggers a timeout to not be a script query, but rather a lucene query that is not time dependent, and throws a time exceeded exception precisely where we expect it, so that we can test how the system reacts to that. That allows to test that partial results are always returned when a timeout happens while scoring documents, and that partial results are never returned when a timeout happens before we even started to score documents. 

Closes #98369
Closes #98053
elasticsearchmachine pushed a commit that referenced this pull request Feb 10, 2025
Two of the timeout tests have been muted for several months. The reason is that we tightened the assertions to cover for partial results being returned, but there were edge cases in which partial results were not actually returned.

The timeout used in the test was time dependent, hence when the timeout precisely will be thrown is unpredictable, because we have timeout checks in different places in the codebase, when iterating through the leaves, before scoring any document, or while scoring documents. The edge case that caused failures is a typical timing issue where the initial check for timeout in CancellableBulkScorer already triggers the timeout, before any document has been collected.

I made several adjustments to the test to make it more robust:
- use index random to index documents, that speeds it up
- share indexing across test methods, so that it happens once at the suite level
- replace the custom query that triggers a timeout to not be a script query, but rather a lucene query that is not time dependent, and throws a time exceeded exception precisely where we expect it, so that we can test how the system reacts to that. That allows to test that partial results are always returned when a timeout happens while scoring documents, and that partial results are never returned when a timeout happens before we even started to score documents.

Closes #98369
Closes #98053
elasticsearchmachine pushed a commit that referenced this pull request Feb 10, 2025
Two of the timeout tests have been muted for several months. The reason is that we tightened the assertions to cover for partial results being returned, but there were edge cases in which partial results were not actually returned.

The timeout used in the test was time dependent, hence when the timeout precisely will be thrown is unpredictable, because we have timeout checks in different places in the codebase, when iterating through the leaves, before scoring any document, or while scoring documents. The edge case that caused failures is a typical timing issue where the initial check for timeout in CancellableBulkScorer already triggers the timeout, before any document has been collected.

I made several adjustments to the test to make it more robust:
- use index random to index documents, that speeds it up
- share indexing across test methods, so that it happens once at the suite level
- replace the custom query that triggers a timeout to not be a script query, but rather a lucene query that is not time dependent, and throws a time exceeded exception precisely where we expect it, so that we can test how the system reacts to that. That allows to test that partial results are always returned when a timeout happens while scoring documents, and that partial results are never returned when a timeout happens before we even started to score documents.

Closes #98369
Closes #98053
elasticsearchmachine pushed a commit that referenced this pull request Feb 11, 2025
Two of the timeout tests have been muted for several months. The reason is that we tightened the assertions to cover for partial results being returned, but there were edge cases in which partial results were not actually returned.

The timeout used in the test was time dependent, hence when the timeout precisely will be thrown is unpredictable, because we have timeout checks in different places in the codebase, when iterating through the leaves, before scoring any document, or while scoring documents. The edge case that caused failures is a typical timing issue where the initial check for timeout in CancellableBulkScorer already triggers the timeout, before any document has been collected.

I made several adjustments to the test to make it more robust:
- use index random to index documents, that speeds it up
- share indexing across test methods, so that it happens once at the suite level
- replace the custom query that triggers a timeout to not be a script query, but rather a lucene query that is not time dependent, and throws a time exceeded exception precisely where we expect it, so that we can test how the system reacts to that. That allows to test that partial results are always returned when a timeout happens while scoring documents, and that partial results are never returned when a timeout happens before we even started to score documents.

Closes #98369
Closes #98053

Co-authored-by: Joe Gallo <joe.gallo@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch >test Issues or PRs that are addressing/adding tests v8.17.3 v8.18.1 v8.19.0 v9.0.0 v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] SearchTimeoutIT testTopHitsTimeout failing [CI] SearchTimeoutIT testAggsTimeout failing

3 participants