Skip to content

Fix PrimaryAllocationIT Race Condition#37355

Merged
original-brownbear merged 4 commits intoelastic:masterfrom
original-brownbear:37345
Jan 11, 2019
Merged

Fix PrimaryAllocationIT Race Condition#37355
original-brownbear merged 4 commits intoelastic:masterfrom
original-brownbear:37345

Conversation

@original-brownbear
Copy link
Copy Markdown
Contributor

  • Forcing a stale primary allocation on a green index was tripping an assertion
    • Added a test that this case still errors out correctly
  • Made the ability to wipe stopped datanode's data public on the internal test cluster and used it to ensure correct behavior on the fixed test
    • Previously it simply passed because the test finished before the index went green and would NPE when the index was green at the time of the shard store status request, that would then come up empty
  • Closes CI: test failure PrimaryAllocationIT.testForceStaleReplicaToBePromotedToPrimaryOnWrongNode #37345

* Forcing a stale primary allocation on a green index was tripping the assertion that was removed
   * Added a test that this case still errors out correctly
* Made the ability to wipe stopped datanode's data public on the internal test cluster and used it to ensure correct behaviour on the fixed test
   * Previously it simply passed because the test finished before the index went green and would NPE when the index was green at the time of the shard store status request, that would then come up empty
* Closes #37345
@original-brownbear original-brownbear added >test Issues or PRs that are addressing/adding tests v7.0.0 :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v6.7.0 labels Jan 11, 2019
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed

Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok but I'm always cautious about changes to the InternalTestCluster API so I've asked for a second opinion.


private void wipePendingDataDirectories() {
assert Thread.holdsLock(this);
public synchronized void wipePendingDataDirectories() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see any obvious problems that arise from exposing this, but the API is kinda complicated so I'd like a second opinion from @ywelsch.

ensureGreen();
final String nodeWithoutData = randomFrom(dataNodes);
final int shardId = 0;
Throwable iae = expectThrows(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IllegalArgumentException iae = ...

createStaleReplicaScenario(master);
internalCluster().startDataOnlyNodes(2);
// Ensure the stopped primary's data is deleted so that it doesn't get picked up by the next datanode we start
internalCluster().wipePendingDataDirectories();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok as a temporary solution. We definitely need a cleaner way of managing data paths in InternalTestCluster.

@original-brownbear
Copy link
Copy Markdown
Contributor Author

Jenkins run Gradle build tests 2

1 similar comment
@original-brownbear
Copy link
Copy Markdown
Contributor Author

Jenkins run Gradle build tests 2

@original-brownbear
Copy link
Copy Markdown
Contributor Author

Jenkins run Gradle build tests 1

@original-brownbear original-brownbear merged commit 63fe3c6 into elastic:master Jan 11, 2019
@original-brownbear original-brownbear deleted the 37345 branch January 11, 2019 22:26
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 12, 2019
…rsistence

* elastic/master:
  Fix PrimaryAllocationIT Race Condition (elastic#37355)
  SQL: Make `FULL` non-reserved keyword in the grammar (elastic#37377)
  SQL: [Tests] Fix and enable internalClusterTests (elastic#37300)
  ML: Fix testMigrateConfigs  (elastic#37373)
  Fix RollupDocumentation test to wait for job to stop
original-brownbear added a commit that referenced this pull request Jan 29, 2019
* Fix PrimaryAllocationIT Race Condition

* Forcing a stale primary allocation on a green index was tripping the assertion that was removed
   * Added a test that this case still errors out correctly
* Made the ability to wipe stopped datanode's data public on the internal test cluster and used it to ensure correct behaviour on the fixed test
   * Previously it simply passed because the test finished before the index went green and would NPE when the index was green at the time of the shard store status request, that would then come up empty
* Closes #37345
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >test Issues or PRs that are addressing/adding tests v6.7.0 v7.0.0-beta1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI: test failure PrimaryAllocationIT.testForceStaleReplicaToBePromotedToPrimaryOnWrongNode

5 participants