[qa] multinode tests fail when you run low on disk space (85%) #12853

dadoonet merged 1 commit into elastic:master
Conversation
@rmuir Could you tell me what you think about this?
Sounds like elasticsearch has bad defaults to me. This will just hide those bad defaults.
Makes sense. @dakrone WDYT? Should we change Elasticsearch defaults to absolute values?
Keep in mind every -D here makes the tests less realistic: there are already far too many -D's in the integration tests IMO. Unless users start elasticsearch with 57 -D's, then we should not either.
As for now, we have the current defaults:

* `cluster.routing.allocation.disk.watermark.low: 85%`
* `cluster.routing.allocation.disk.watermark.high: 90%`

But even if you have plenty of free space on your 1TB disk, you could end up not allocating any replica if you have less than ~150GB of free disk space. This change proposes to set:

* `cluster.routing.allocation.disk.watermark.low: 1gb`
* `cluster.routing.allocation.disk.watermark.high: 500mb`

as the new defaults.

Related to elastic#12853 (comment). Closes elastic#12852.
@rmuir explain why you think these are bad? I could see changing to 90/95% perhaps, but there isn't a good way to support the wide array of disk sizes people run ES on other than relative values.
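To make the relative-versus-absolute trade-off concrete, here is a toy sketch of a watermark check (my own illustration, not Elasticsearch's actual `DiskThresholdDecider`): a relative `85%` low watermark trips on a 1TB disk that still has 150GB free, while an absolute `1gb` watermark does not.

```python
def low_watermark_exceeded(watermark, total_bytes, free_bytes):
    """Return True if the low disk watermark is exceeded (toy model).

    watermark -- either a relative value like "85%" (max percent of disk
    used) or an absolute value like "1gb" (min free space required).
    """
    if watermark.endswith("%"):
        used_pct = 100.0 * (total_bytes - free_bytes) / total_bytes
        return used_pct >= float(watermark[:-1])
    units = {"kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}
    return free_bytes <= float(watermark[:-2]) * units[watermark[-2:].lower()]

GB = 1024**3
# 1TB disk with 150GB free: 85% used, so the relative default already
# refuses replicas, even though 150GB is plenty of room.
print(low_watermark_exceeded("85%", 1000 * GB, 150 * GB))   # True
print(low_watermark_exceeded("1gb", 1000 * GB, 150 * GB))   # False
```

The same `85%` on a 10GB CI disk leaves 1.5GB of headroom, which is why a single relative default can't fit both ends of the disk-size spectrum.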
No. I think relative values are the only way to support as many disk sizes as we can by default. And if not, that's why they are dynamically configurable.
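As a concrete example of that dynamic configurability, the watermarks can be adjusted at runtime through the cluster update settings API, without restarting any node (request sketch; the values here are illustrative, not a recommendation):

```json
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}
```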
I see it as a test bug: the test makes incorrect assumptions about allocation rules.
Integration tests should test our defaults. If you are annoyed that tests fail because they don't work as expected when you are "low" on disk space, then users will be equally annoyed when they are in the same situation.
please don't add the -D's. Fix the defaults. |
"fixing" the defaults to run QA tests is not a good solution. This is the same as saying we should "fix" the JVM checker to allow running old JVMs because the tests fail if you try to run them using an old version.
I'll take "equally annoyed" versus "out of disk space and with corrupted indices or translogs" any day.
I agree we shouldn't add a `-D` for this.
Agreed. I'll come up with an update.
This has nothing to do with that. If elasticsearch has problems on disk full like that, then its because elasticsearch is broken. Lucene does not have such problems. |
Force-pushed from b804db3 to 8903551
@jpountz I added a new commit. It no longer checks for unassigned shards, and it waits for yellow instead of green. We still check that we have 2 nodes running, which I think is the first goal for this QA test.
LGTM
Indeed, we check within the test suite that we have no unassigned shards.
But when the test starts on my machine I get:
```
[elasticsearch] [2015-08-13 12:03:18,801][INFO ][org.elasticsearch.cluster.routing.allocation.decider] [Kehl of Tauran] low disk watermark [85%] exceeded on [eLujVjWAQ8OHdhscmaf0AQ][Jackhammer] free: 59.8gb[12.8%], replicas will not be assigned to this node
```
```
2> REPRODUCE WITH: mvn verify -Pdev -Dskip.unit.tests -Dtests.seed=2AE3A3B7B13CE3D6 -Dtests.class=org.elasticsearch.smoketest.SmokeTestMultiIT -Dtests.method="test {yaml=smoke_test_multinode/10_basic/cluster health basic test, one index}" -Des.logger.level=ERROR -Dtests.assertion.disabled=false -Dtests.security.manager=true -Dtests.heap.size=512m -Dtests.locale=ar_YE -Dtests.timezone=Asia/Hong_Kong -Dtests.rest.suite=smoke_test_multinode
FAILURE 38.5s | SmokeTestMultiIT.test {yaml=smoke_test_multinode/10_basic/cluster health basic test, one index} <<<
> Throwable #1: java.lang.AssertionError: expected [2xx] status code but api [cluster.health] returned [408 Request Timeout] [{"cluster_name":"prepare_release","status":"yellow","timed_out":true,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":3,"active_shards":3,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":3,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":50.0}]
```
We no longer check for unassigned shards, and we wait for `yellow` status instead of `green`.
Closes elastic#12852.
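The resulting check can be pictured in the YAML REST-test format the smoke tests use (a hypothetical sketch of the adjusted test, not the exact committed file):

```yaml
"cluster health basic test, one index":
  - do:
      cluster.health:
        wait_for_status: yellow   # was green; yellow tolerates unassigned replicas
        wait_for_nodes: 2         # the real goal: both nodes joined the cluster
  - is_false: timed_out
  - match: { number_of_nodes: 2 }
```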
Force-pushed from 8903551 to da65493
In #12853 we actually introduced a test regression. Now that we wait for yellow instead of green, we might have some pending tasks. This commit simplifies all that and only checks the number of nodes within the cluster. (cherry picked from commit 4a3ea79)
@dadoonet this doesn't seem to have been backported to the 2.0 branch. Should it be backported?
@dadoonet sorry, my bad, it is in 2.0. Ignore the above.
I propose here to define for all integration tests:

* `cluster.routing.allocation.disk.watermark.low: 200mb`
* `cluster.routing.allocation.disk.watermark.high: 100mb`

Closes #12852.