
[qa] multinode tests fails when you run low on disk space (85%)#12853

Merged
dadoonet merged 1 commit into elastic:master from dadoonet:qa/fix-multinode-low-disk
Aug 18, 2015

Conversation

@dadoonet
Contributor

Indeed, we check within the test suite that we have no unassigned shards.

But when the test starts on my machine I get:

```
[elasticsearch] [2015-08-13 12:03:18,801][INFO ][org.elasticsearch.cluster.routing.allocation.decider] [Kehl of Tauran] low disk watermark [85%] exceeded on [eLujVjWAQ8OHdhscmaf0AQ][Jackhammer] free: 59.8gb[12.8%], replicas will not be assigned to this node
```

```
  2> REPRODUCE WITH: mvn verify -Pdev -Dskip.unit.tests -Dtests.seed=2AE3A3B7B13CE3D6 -Dtests.class=org.elasticsearch.smoketest.SmokeTestMultiIT -Dtests.method="test {yaml=smoke_test_multinode/10_basic/cluster health basic test, one index}" -Des.logger.level=ERROR -Dtests.assertion.disabled=false -Dtests.security.manager=true -Dtests.heap.size=512m -Dtests.locale=ar_YE -Dtests.timezone=Asia/Hong_Kong -Dtests.rest.suite=smoke_test_multinode
FAILURE 38.5s | SmokeTestMultiIT.test {yaml=smoke_test_multinode/10_basic/cluster health basic test, one index} <<<
   > Throwable #1: java.lang.AssertionError: expected [2xx] status code but api [cluster.health] returned [408 Request Timeout] [{"cluster_name":"prepare_release","status":"yellow","timed_out":true,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":3,"active_shards":3,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":3,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":50.0}]
```

I propose here to define for all integration tests:

 * `cluster.routing.allocation.disk.watermark.low:200mb`
 * `cluster.routing.allocation.disk.watermark.high:100mb`

Closes #12852.
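For reference, wired into a test cluster's `elasticsearch.yml` the proposed overrides would look like the following (a sketch; how the QA build actually passes settings to the test cluster is not shown in this PR description):

```yaml
# Hypothetical test-only watermark overrides (values from this PR's proposal);
# the file placement is illustrative, not necessarily the PR's mechanism.
cluster.routing.allocation.disk.watermark.low: 200mb
cluster.routing.allocation.disk.watermark.high: 100mb
```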

@dadoonet dadoonet added >test Issues or PRs that are addressing/adding tests review :Delivery/Build Build or test infrastructure v2.0.0 labels Aug 13, 2015
@dadoonet
Contributor Author

@rmuir Could you tell me what you think about this?

@rmuir
Contributor

rmuir commented Aug 13, 2015

Sounds like elasticsearch has bad defaults to me. This will just hide those bad defaults.

@dadoonet
Contributor Author

Makes sense. @dakrone WDYT? Should we change Elasticsearch defaults to absolute values?

@rmuir
Contributor

rmuir commented Aug 13, 2015

Keep in mind every -D here makes the tests less realistic: there are already far too many -D's in the integration tests IMO.

Unless users start elasticsearch with 57 -D's, then we should not either.

dadoonet added a commit to dadoonet/elasticsearch that referenced this pull request Aug 13, 2015
 As of now, we have the following defaults:

 * `cluster.routing.allocation.disk.watermark.low:85%`
 * `cluster.routing.allocation.disk.watermark.high:90%`

But even if you have plenty of free space on your 1 TB disk, you could end up not allocating any replica once you have less than 150 GB of free disk space.

This change proposes to set:

 * `cluster.routing.allocation.disk.watermark.low:1gb`
 * `cluster.routing.allocation.disk.watermark.high:500mb`

as the new defaults.

Related to elastic#12853 (comment)
Closes elastic#12852.
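The arithmetic behind that complaint can be sketched as follows (illustrative numbers only, not Elasticsearch code):

```python
def min_free_required_gb(disk_size_gb: float, low_watermark_pct: float) -> float:
    """Free space (in GB) below which a *relative* low watermark trips
    and replicas stop being assigned to the node."""
    return disk_size_gb * (100.0 - low_watermark_pct) / 100.0

# With the default low watermark of 85% used:
print(min_free_required_gb(1000, 85))  # 1 TB disk   -> 150.0 GB must stay free
print(min_free_required_gb(100, 85))   # 100 GB disk -> 15.0 GB must stay free
```

This is why a relative default that is sensible on a small disk reserves a surprisingly large absolute amount on a big one.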
@dakrone
Member

dakrone commented Aug 13, 2015

Sounds like elasticsearch has bad defaults to me.

@rmuir can you explain why you think these are bad? I could see changing to 90/95% perhaps, but other than relative values there isn't a good way to support the wide array of disk sizes people run ES on.

Should we change Elasticsearch defaults to absolute values?

No. I think relative values are the only way to support as many disk sizes as we can by default. And if not, that's why they are dynamically configurable.
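As a sketch of that dynamic configurability (host, port, and the transient scope are assumptions for illustration; this is the 2.x-era curl style), the watermarks can be changed on a running cluster via the cluster settings API:

```shell
# Hypothetical example: raise both watermarks at runtime.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}'
```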

@jpountz
Contributor

jpountz commented Aug 13, 2015

I see it as a test bug: the test makes incorrect assumptions about allocation rules.

@rmuir
Contributor

rmuir commented Aug 13, 2015

Integration tests should test our defaults. If you are annoyed that tests fail because they don't work as expected when you are "low" on disk space, then users will be equally annoyed when they are in the same situation.

@rmuir
Contributor

rmuir commented Aug 13, 2015

please don't add the -D's. Fix the defaults.

@dakrone
Member

dakrone commented Aug 13, 2015

please don't add the -D's. Fix the defaults.

"fixing" the defaults to run QA tests is not a good solution. This is the same as saying we should "fix" the JVM checker to allow running old JVMs because the tests fail if you try to run them using an old version.

users will be equally annoyed when they are in the same situation

I'll take "equally annoyed" versus "out of disk space and with corrupted indices or translogs" any day.

@jpountz
Contributor

jpountz commented Aug 13, 2015

I agree we shouldn't add a -D; however, I think we should fix the test to not expect all shards to be allocated, instead of changing our defaults (which look reasonable to me).

@dadoonet
Contributor Author

I think we should fix the test to not expect all shards to be allocated instead of fixing our defaults (which look reasonable to me).

Agreed. Will come with an update.

@rmuir
Contributor

rmuir commented Aug 13, 2015

I'll take "equally annoyed" versus "out of disk space and with corrupted indices or translogs" any day.

This has nothing to do with that. If elasticsearch has problems on a full disk like that, then it's because elasticsearch is broken.

Lucene does not have such problems.

@dadoonet dadoonet force-pushed the qa/fix-multinode-low-disk branch from b804db3 to 8903551 Compare August 18, 2015 09:32
@dadoonet
Contributor Author

@jpountz I added a new commit. It no longer checks for unassigned shards and waits for `yellow` instead of `green`.

We still check that we have 2 nodes running, which is, I think, the primary goal of this QA test.

@jpountz
Contributor

jpountz commented Aug 18, 2015

LGTM

Indeed, we check within the test suite that we have no unassigned shards.

But when the test starts on my machine I get:

```
[elasticsearch] [2015-08-13 12:03:18,801][INFO ][org.elasticsearch.cluster.routing.allocation.decider] [Kehl of Tauran] low disk watermark [85%] exceeded on [eLujVjWAQ8OHdhscmaf0AQ][Jackhammer] free: 59.8gb[12.8%], replicas will not be assigned to this node
```

```
  2> REPRODUCE WITH: mvn verify -Pdev -Dskip.unit.tests -Dtests.seed=2AE3A3B7B13CE3D6 -Dtests.class=org.elasticsearch.smoketest.SmokeTestMultiIT -Dtests.method="test {yaml=smoke_test_multinode/10_basic/cluster health basic test, one index}" -Des.logger.level=ERROR -Dtests.assertion.disabled=false -Dtests.security.manager=true -Dtests.heap.size=512m -Dtests.locale=ar_YE -Dtests.timezone=Asia/Hong_Kong -Dtests.rest.suite=smoke_test_multinode
FAILURE 38.5s | SmokeTestMultiIT.test {yaml=smoke_test_multinode/10_basic/cluster health basic test, one index} <<<
   > Throwable #1: java.lang.AssertionError: expected [2xx] status code but api [cluster.health] returned [408 Request Timeout] [{"cluster_name":"prepare_release","status":"yellow","timed_out":true,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":3,"active_shards":3,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":3,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":50.0}]
```

We no longer check for unassigned shards, and we wait for `yellow` status instead of `green`.

Closes elastic#12852.
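In the YAML REST test format used by these QA suites, the relaxed health check would look roughly like this (a sketch only; the exact assertions in `smoke_test_multinode/10_basic` may differ):

```yaml
# Hypothetical sketch of the relaxed check: wait for yellow rather than
# green, and only assert on the node count.
- do:
    cluster.health:
      wait_for_status: yellow
      wait_for_nodes: 2
- match: { number_of_nodes: 2 }
```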
@dadoonet dadoonet force-pushed the qa/fix-multinode-low-disk branch from 8903551 to da65493 Compare August 18, 2015 11:21
@dadoonet dadoonet added v2.1.0 and removed review labels Aug 18, 2015
@dadoonet dadoonet merged commit da65493 into elastic:master Aug 18, 2015
@dadoonet dadoonet deleted the qa/fix-multinode-low-disk branch August 18, 2015 11:24
dadoonet added a commit that referenced this pull request Aug 18, 2015
In #12853 we actually introduced a test regression. Now that we wait for yellow instead of green, we might have some pending tasks.
This commit simplifies all that and only checks the number of nodes within the cluster.

(cherry picked from commit 4a3ea79)
dadoonet added a commit that referenced this pull request Aug 18, 2015
In #12853 we actually introduced a test regression. Now that we wait for yellow instead of green, we might have some pending tasks.
This commit simplifies all that and only checks the number of nodes within the cluster.
@colings86
Contributor

@dadoonet this doesn't seem to have been backported to the 2.0 branch. Should it be backported?

@colings86
Contributor

@dadoonet sorry, my bad, it is in 2.0. Ignore the above.

@mark-vieira mark-vieira added the Team:Delivery Meta label for Delivery team label Nov 11, 2020