Avoid using WindowsFS in ClusterRerouteIT by DaveCTurner · Pull Request #52488 · elastic/elasticsearch

DaveCTurner · 2020-02-18T18:14:56Z

Issue #52000 looks like a case of cluster state updates being slower than
expected, but it seems that these slowdowns are relatively rare: most
invocations of testDelayWithALargeAmountOfShards take well under a minute in
CI, but there are occasional failures that take 6+ minutes instead. When it
fails like this, cluster state persistence seems generally slow: most updates
are slower than expected, with some small updates even taking over 2 seconds
to complete.

The failures all have in common that they use WindowsFS to emulate Windows'
behaviour of refusing to delete files that are still open, by tracking all
files (really, inodes) and validating that deleted files are really closed
first. There is a suggestion that this is a little slow in the Lucene test
framework [1]. To see if we can attribute the slowdown to that common factor,
this commit suppresses the use of WindowsFS for this test suite.

[1] https://github.com/apache/lucene-solr/blob/4a513fa99f638cb65e0cae59bfdf7af410c0327a/lucene/test-framework/src/java/org/apache/lucene/util/TestRuleTemporaryFilesCleanup.java#L166
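
To make the tracking behaviour described above concrete, here is a minimal sketch of the idea: count open handles per file key, and refuse a delete while the count is non-zero. This is not Lucene's `WindowsFS` code. The class `OpenFileTracker` and its methods are hypothetical names invented for illustration, and a plain string stands in for the inode.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of WindowsFS-style tracking. Not Lucene's implementation. */
class OpenFileTracker {
    // Number of open handles per file key (the key stands in for an inode).
    private final Map<Object, Integer> openHandles = new HashMap<>();

    /** Record that a handle on this file was opened. */
    synchronized void onOpen(Object fileKey) {
        openHandles.merge(fileKey, 1, Integer::sum);
    }

    /** Record that a handle was closed; drop the entry when the last one goes. */
    synchronized void onClose(Object fileKey) {
        openHandles.computeIfPresent(fileKey, (k, n) -> n > 1 ? n - 1 : null);
    }

    synchronized boolean isOpen(Object fileKey) {
        return openHandles.containsKey(fileKey);
    }

    /** Emulates Windows: deleting a file that is still open is refused. */
    synchronized void checkDeletable(Object fileKey) throws IOException {
        if (openHandles.containsKey(fileKey)) {
            throw new IOException("access denied, file is still open: " + fileKey);
        }
    }

    public static void main(String[] args) {
        OpenFileTracker tracker = new OpenFileTracker();
        tracker.onOpen("inode-1");
        try {
            tracker.checkDeletable("inode-1");
        } catch (IOException e) {
            System.out.println("delete refused: " + e.getMessage());
        }
        tracker.onClose("inode-1");
        System.out.println("still open after close: " + tracker.isOpen("inode-1"));
    }
}
```

Every open, close and delete goes through the same lock in this sketch, which is one plausible way such bookkeeping could add latency when many files are touched. In the Lucene test framework a suite can opt out of a mock file system with the `@LuceneTestCase.SuppressFileSystems("WindowsFS")` annotation; the diff is not shown in this thread, so whether this PR uses exactly that form is an assumption.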

elasticmachine · 2020-02-18T18:14:58Z

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

original-brownbear

LGTM

henningandersen

LGTM.

Nice find and a good experiment. If successful, I would prefer that we dig a bit into the WindowsFS before we disable it for this test permanently (or alternatively increase the timeout when running that FS). Looks like there is at least a bit of IO done under a lock that could be improved (though in Lucene).
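
As an illustration of the "IO done under a lock" remark, the sketch below contrasts two ways of registering an opened file. It is not taken from the Lucene source: `LockScopeSketch`, `fileKey` and both `onOpen...` methods are hypothetical names, and the only assumption about the real code is the general pattern the comment describes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of lock scope around file-attribute IO. Not Lucene's code. */
class LockScopeSketch {
    private final Map<Object, Integer> openHandles = new HashMap<>();

    /** Looks up a stable identity for the file; this touches the file system. */
    static Object fileKey(Path path) throws IOException {
        Object key = Files.readAttributes(path, BasicFileAttributes.class).fileKey();
        // fileKey() may be null on some platforms, so fall back to the real path.
        return key != null ? key : path.toRealPath();
    }

    /** Variant A: the attribute lookup (IO) runs while the lock is held. */
    void onOpenIoUnderLock(Path path) throws IOException {
        synchronized (openHandles) {
            openHandles.merge(fileKey(path), 1, Integer::sum);
        }
    }

    /**
     * Variant B: the IO runs first and the lock only guards the map update.
     * Trade-off: the file could change between the lookup and the update.
     */
    void onOpenIoOutsideLock(Path path) throws IOException {
        Object key = fileKey(path);
        synchronized (openHandles) {
            openHandles.merge(key, 1, Integer::sum);
        }
    }

    int openCount(Path path) throws IOException {
        Object key = fileKey(path);
        synchronized (openHandles) {
            return openHandles.getOrDefault(key, 0);
        }
    }
}
```

In variant A, every thread that opens a file waits for the previous thread's disk access; in variant B threads only contend on the in-memory map. Whether that narrower scope is safe for the real tracking logic would need checking against the actual implementation.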

DaveCTurner added the >test, :Distributed/Cluster Coordination, v8.0.0 and v7.7.0 labels Feb 18, 2020

DaveCTurner requested review from henningandersen and ywelsch February 18, 2020 18:14

original-brownbear approved these changes Feb 18, 2020

DaveCTurner mentioned this pull request Feb 18, 2020

[CI] ClusterRerouteIT.testDelayWithALargeAmountOfShards timed out waiting for green state #52000

Closed

henningandersen approved these changes Feb 18, 2020

DaveCTurner merged commit 0a06ef5 into elastic:master Feb 19, 2020

DaveCTurner deleted the 2020-02-18-suppress-windowsfs-on-ClusterRerouteIT branch February 19, 2020 07:51

DaveCTurner added the v7.6.2 label Mar 5, 2020

original-brownbear mentioned this pull request Mar 27, 2020

SharedClusterSnapshotRestoreIT testBasicWorkFlow timeout #53596

Closed

original-brownbear mentioned this pull request Jun 12, 2020

Exclude WindowsFS from SharedClusterSnapshotRestoreIT #58020

Merged

original-brownbear added a commit that referenced this pull request Jun 12, 2020

Exclude WindowsFS from SharedClusterSnapshotRestoreIT (#58020)

d4cc6b5

Same as #52488 but for a different test suite Closes #58019

This was referenced Jun 12, 2020

Exclude WindowsFS from SharedClusterSnapshotRestoreIT (#58020) #58023

Merged

Exclude WindowsFS from SharedClusterSnapshotRestoreIT (#58020) #58024

Merged

original-brownbear added a commit that referenced this pull request Jun 12, 2020

Exclude WindowsFS from SharedClusterSnapshotRestoreIT (#58020) (#58024)

3350028

Same as #52488 but for a different test suite Closes #58019

original-brownbear added a commit that referenced this pull request Jun 12, 2020

Exclude WindowsFS from SharedClusterSnapshotRestoreIT (#58020) (#58023)

db03e7c

Same as #52488 but for a different test suite Closes #58019

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

ywangd mentioned this pull request Oct 31, 2025

[CI] ClusterRerouteIT testDelayWithALargeAmountOfShards failing #137384

Closed
