
[qa] multinode tests fails when you run low on disk space (85%)#12853

Merged
dadoonet merged 1 commit into elastic:master from dadoonet:qa/fix-multinode-low-disk
Aug 18, 2015

Conversation

@dadoonet
Contributor

Indeed, we check within the test suite that we have no unassigned shards.

But when the test starts on my machine I get:

```
[elasticsearch] [2015-08-13 12:03:18,801][INFO ][org.elasticsearch.cluster.routing.allocation.decider] [Kehl of Tauran] low disk watermark [85%] exceeded on [eLujVjWAQ8OHdhscmaf0AQ][Jackhammer] free: 59.8gb[12.8%], replicas will not be assigned to this node
```

```
  2> REPRODUCE WITH: mvn verify -Pdev -Dskip.unit.tests -Dtests.seed=2AE3A3B7B13CE3D6 -Dtests.class=org.elasticsearch.smoketest.SmokeTestMultiIT -Dtests.method="test {yaml=smoke_test_multinode/10_basic/cluster health basic test, one index}" -Des.logger.level=ERROR -Dtests.assertion.disabled=false -Dtests.security.manager=true -Dtests.heap.size=512m -Dtests.locale=ar_YE -Dtests.timezone=Asia/Hong_Kong -Dtests.rest.suite=smoke_test_multinode
FAILURE 38.5s | SmokeTestMultiIT.test {yaml=smoke_test_multinode/10_basic/cluster health basic test, one index} <<<
   > Throwable #1: java.lang.AssertionError: expected [2xx] status code but api [cluster.health] returned [408 Request Timeout] [{"cluster_name":"prepare_release","status":"yellow","timed_out":true,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":3,"active_shards":3,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":3,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":50.0}]
```

I propose here to define for all integration tests:

 * `cluster.routing.allocation.disk.watermark.low:200mb`
 * `cluster.routing.allocation.disk.watermark.high:100mb`

Closes #12852.
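For reference, wired into a test cluster's `elasticsearch.yml` the proposed overrides would look like the following (a sketch; how the QA build actually passes settings to the test cluster is not shown in this PR description):

```yaml
# Hypothetical test-only watermark overrides (values from this PR's proposal);
# the file placement is illustrative, not necessarily the PR's mechanism.
cluster.routing.allocation.disk.watermark.low: 200mb
cluster.routing.allocation.disk.watermark.high: 100mb
```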

@dadoonet dadoonet added >test Issues or PRs that are addressing/adding tests review :Delivery/Build Build or test infrastructure v2.0.0 labels Aug 13, 2015
@dadoonet
Contributor Author

@rmuir Could you tell me what you think about this?

@rmuir
Contributor

rmuir commented Aug 13, 2015

Sounds like elasticsearch has bad defaults to me. This will just hide those bad defaults.

@dadoonet
Contributor Author

Makes sense. @dakrone WDYT? Should we change Elasticsearch defaults to absolute values?

@rmuir
Contributor

rmuir commented Aug 13, 2015

Keep in mind every -D here makes the tests less realistic: there are already far too many -D's in the integration tests IMO.

Unless users start elasticsearch with 57 -D's, then we should not either.

dadoonet added a commit to dadoonet/elasticsearch that referenced this pull request Aug 13, 2015
 As of now, we have the following defaults:

 * `cluster.routing.allocation.disk.watermark.low:85%`
 * `cluster.routing.allocation.disk.watermark.high:90%`

But even if you have plenty of free space on your 1 TB disk, you could end up not allocating any replica once you have less than 150 GB of free disk space.

This change proposes to set:

 * `cluster.routing.allocation.disk.watermark.low:1gb`
 * `cluster.routing.allocation.disk.watermark.high:500mb`

as the new defaults.

Related to elastic#12853 (comment)
Closes elastic#12852.
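The arithmetic behind that complaint can be sketched as follows (illustrative numbers only, not Elasticsearch code):

```python
def min_free_required_gb(disk_size_gb: float, low_watermark_pct: float) -> float:
    """Free space (in GB) below which a *relative* low watermark trips
    and replicas stop being assigned to the node."""
    return disk_size_gb * (100.0 - low_watermark_pct) / 100.0

# With the default low watermark of 85% used:
print(min_free_required_gb(1000, 85))  # 1 TB disk   -> 150.0 GB must stay free
print(min_free_required_gb(100, 85))   # 100 GB disk -> 15.0 GB must stay free
```

This is why a relative default that is sensible on a small disk reserves a surprisingly large absolute amount on a big one.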
@dakrone
Member

dakrone commented Aug 13, 2015

Sounds like elasticsearch has bad defaults to me.

@rmuir can you explain why you think these are bad? I could see changing to 90/95% perhaps, but other than relative values there isn't a good way to support the wide array of disk sizes people run ES on.

Should we change Elasticsearch defaults to absolute values?

No. I think relative values are the only way to support as many disk sizes as we can by default. And if not, that's why they are dynamically configurable.
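As a sketch of that dynamic configurability (host, port, and the transient scope are assumptions for illustration; this is the 2.x-era curl style), the watermarks can be changed on a running cluster via the cluster settings API:

```shell
# Hypothetical example: raise both watermarks at runtime.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}'
```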

@jpountz
Contributor

jpountz commented Aug 13, 2015

I see it as a test bug: the test makes incorrect assumptions about allocation rules.

@rmuir
Contributor

rmuir commented Aug 13, 2015

Integration tests should test our defaults. If you are annoyed that tests fail because they don't work as expected when you are "low" on disk space, then users will be equally annoyed when they are in the same situation.

@rmuir
Contributor

rmuir commented Aug 13, 2015

please don't add the -D's. Fix the defaults.

@dakrone
Member

dakrone commented Aug 13, 2015

please don't add the -D's. Fix the defaults.

"fixing" the defaults to run QA tests is not a good solution. This is the same as saying we should "fix" the JVM checker to allow running old JVMs because the tests fail if you try to run them using an old version.

users will be equally annoyed when they are in the same situation

I'll take "equally annoyed" versus "out of disk space and with corrupted indices or translogs" any day.

@jpountz
Contributor

jpountz commented Aug 13, 2015

I agree we shouldn't add a -D; however, I think we should fix the test to not expect all shards to be allocated, instead of changing our defaults (which look reasonable to me).

@dadoonet
Contributor Author

I think we should fix the test to not expect all shards to be allocated instead of fixing our defaults (which look reasonable to me).

Agreed. Will come with an update.

@rmuir
Contributor

rmuir commented Aug 13, 2015

I'll take "equally annoyed" versus "out of disk space and with corrupted indices or translogs" any day.

This has nothing to do with that. If elasticsearch has problems on a full disk like that, then it's because elasticsearch is broken.

Lucene does not have such problems.

@dadoonet dadoonet force-pushed the qa/fix-multinode-low-disk branch from b804db3 to 8903551 Compare August 18, 2015 09:32
@dadoonet
Contributor Author

@jpountz I added a new commit. It no longer checks for unassigned shards and waits for `yellow` instead of `green`.

We still check that we have 2 nodes running, which is, I think, the primary goal of this QA test.

@jpountz
Contributor

jpountz commented Aug 18, 2015

LGTM

Indeed, we check within the test suite that we have no unassigned shards.

But when the test starts on my machine I get:

```
[elasticsearch] [2015-08-13 12:03:18,801][INFO ][org.elasticsearch.cluster.routing.allocation.decider] [Kehl of Tauran] low disk watermark [85%] exceeded on [eLujVjWAQ8OHdhscmaf0AQ][Jackhammer] free: 59.8gb[12.8%], replicas will not be assigned to this node
```

```
  2> REPRODUCE WITH: mvn verify -Pdev -Dskip.unit.tests -Dtests.seed=2AE3A3B7B13CE3D6 -Dtests.class=org.elasticsearch.smoketest.SmokeTestMultiIT -Dtests.method="test {yaml=smoke_test_multinode/10_basic/cluster health basic test, one index}" -Des.logger.level=ERROR -Dtests.assertion.disabled=false -Dtests.security.manager=true -Dtests.heap.size=512m -Dtests.locale=ar_YE -Dtests.timezone=Asia/Hong_Kong -Dtests.rest.suite=smoke_test_multinode
FAILURE 38.5s | SmokeTestMultiIT.test {yaml=smoke_test_multinode/10_basic/cluster health basic test, one index} <<<
   > Throwable #1: java.lang.AssertionError: expected [2xx] status code but api [cluster.health] returned [408 Request Timeout] [{"cluster_name":"prepare_release","status":"yellow","timed_out":true,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":3,"active_shards":3,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":3,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":50.0}]
```

We no longer check for unassigned shards, and we wait for `yellow` status instead of `green`.

Closes elastic#12852.
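In the YAML REST test format used by these QA suites, the relaxed health check would look roughly like this (a sketch only; the exact assertions in `smoke_test_multinode/10_basic` may differ):

```yaml
# Hypothetical sketch of the relaxed check: wait for yellow rather than
# green, and only assert on the node count.
- do:
    cluster.health:
      wait_for_status: yellow
      wait_for_nodes: 2
- match: { number_of_nodes: 2 }
```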
@dadoonet dadoonet force-pushed the qa/fix-multinode-low-disk branch from 8903551 to da65493 Compare August 18, 2015 11:21
@dadoonet dadoonet added v2.1.0 and removed review labels Aug 18, 2015
@dadoonet dadoonet merged commit da65493 into elastic:master Aug 18, 2015
@dadoonet dadoonet deleted the qa/fix-multinode-low-disk branch August 18, 2015 11:24
dadoonet added a commit that referenced this pull request Aug 18, 2015
In #12853 we actually introduced a test regression. Now that we wait for yellow instead of green, we might have some pending tasks.
This commit simplifies all that and only checks the number of nodes within the cluster.

(cherry picked from commit 4a3ea79)
dadoonet added a commit that referenced this pull request Aug 18, 2015
In #12853 we actually introduced a test regression. Now that we wait for yellow instead of green, we might have some pending tasks.
This commit simplifies all that and only checks the number of nodes within the cluster.
@colings86
Contributor

@dadoonet this doesn't seem to have been backported to the 2.0 branch. Should it be backported?

@colings86
Contributor

@dadoonet sorry, my bad, it is in 2.0. Ignore the above.

@mark-vieira mark-vieira added the Team:Delivery Meta label for Delivery team label Nov 11, 2020