KAFKA-2527; System Test for Quotas in Ducktape#275
Conversation
There was a problem hiding this comment.
use snake-case (underscores) rather than camel-case
There was a problem hiding this comment.
Thanks for taking time to review this patch. Will fix this one.
|
@granders @ewencp @guozhangwang I have addressed comments received so far and all 3 quota test succeed on my desktop. |
|
@lindong28 fyi I am seeing a failure in test_shared_quota when I run on aws: Summary report: |
|
@granders Thanks for your information. This also happens occasionally in my test. This test try to verify that, when 2 consumer shares the same quota, maximum value of sum of their throughput at any given time <= quota * (1+50%). However, it sometimes exceeds the quota by almost 100%. I need to investigate 1) the quota implementation in kafka and 2) metrics calculation in the kafka and client source code to understand why deviation can be as much as 100%. As far as quota system test is concerned, I have modified the threshold to 100% to suppress the problem. |
|
@lindong28 I would think it'd be better to have a test that fails + a followup JIRA to investigate whether the failure is real than a false pass. |
|
Agreed it defeats the purpose of the test. Didn't something like this come up at some point during the evaluation of the original patch, maybe in one of the unit tests or something? I think one proposed possibility was that the windows used to compute stats for the quota system and for metrics was different, which could account for the difference. |
There was a problem hiding this comment.
fyi - this is essentially a parametrized test, so you could express it that way:
change "run_clients" -> "test_quotas", and then:
from ducktape.marks import parameterize
...
@parameterize(producer_id="default_id", producer_num=1, consumer_id="default_id", consumer_num=1)
@parameterize(producer_id="overridden_id", producer_num=1, consumer_id="overridden_id", consumer_num=1)
@parameterize(producer_id="overridden_id", producer_num=1, consumer_id="overridden_id", consumer_num=2)
def test_quotas(self, producer_id, producer_num, consumer_id, consumer_num):
stuff...
There was a problem hiding this comment.
It'd also be nice to use default values on the method parameters so the @parametrize annotations only have to specify the parameters that are overridden.
There was a problem hiding this comment.
@granders I haven't seen any use of this feature in kafkatests. Does this allow developer to specify test names, e.g. test_overridden_quota, so that we can get an idea of what the test is in the report?
There was a problem hiding this comment.
@lindong28 In the console output the tests get a name with the parameters included, e.g. kafkatest.tests.benchmark_test.Benchmark.test_producer_throughput.topic=topic-replication-factor-one.num_producers=1.acks=1.message_size=100. In the HTML report, the parameters are listed with the test name in the "Test" column. See, e.g., the report http://testing.confluent.io/kafka/latest/ and search for the test_producer_throughput tests.
If you're looking for examples, we use this in a few places in the tests already -- the copycat and benchmark tests are both using it.
There was a problem hiding this comment.
@ewencp @granders Very nice feature. Thanks for sharing. I thought it is not used in kafkatest because I searched for "parameterize" instead of "parametrize". For consistency I will use "parametrize" as well.
The only inconvenience with that approach is that I can not selectively run one of the three test (no?). But this is more of a problem during development and debug.
I find it clearer to specify every parameter in the python code than specifying difference set of parameters in the "@parametrize". Since this is also the practice adopted in benchmark_test and copycat_test, I will specify every parameter as well.
|
@granders @ewencp I think the purpose of system tests is to notify developers of bugs so we can fix it. But it is not clear that there is any bug in the quota implementation even though the deviation can be high, for the following reasons: -- Immediate goal of quote is to ensure that the maximum throughput per clientId on broker averaged over a given window <= quota. This is verified in the original evaluation when quota >= 10MB (posted in KAFKA-2528). But it doesn't guarantee that the maximum throughput measured on the client side <= quota, because 1) there is variable delay between broker and consumer, and 2) consumer uses different configuration and implementation for its rate metrics. -- This high deviation happens in the shared quota test, which may additionally be explained by the variable delay between broker and consumers. For example, say broker allocates 100% quota to consumer 1 at time t, and allocates 100% quota to consumer 2 at time t+1. If the batches of messages arrives at both consumers at the same time, then the sum of measured byte-in-rate would double the quota. Overall the quota is working, because without quota you would observe much higher byte-in rate or byte-out rate in the test. I would say there is problem if the sum of average throughput over entire experiment >= quota. I plan to do further investigation to explain the gap here. But we probably should not mark test as FAIL if this doesn't indicate problem in kafka implementation. |
|
Thinking more about this, maybe I should also collect byte-in rate and byte-out rate directly from broker in this system test. |
|
@lindong28 Yeah, I think that explanation of how you can get a window where the client sees the quota exceeded was the explanation I had seen, probably in that discussion in KAFKA-2528. I think the key thing in this test is to figure out the exact ways in which the measured value can exceed the quota and document that clearly in the test. I think the 2x number makes sense -- as long as we never exceed that then we can consider it valid. Measuring byte-in and byte-out rates directly would also be good for validation. Those shouldn't skew as much, but could also deviate since they are measured at a different point than the quota is enforced, right? What would be the acceptable deviation from the quota there? |
|
Ah.. Closed by mistake. |
There was a problem hiding this comment.
Rename "dict" here since it is a reserved keyword
There was a problem hiding this comment.
Thanks for noting this. Will fix it.
|
@lindong28 It does look like the broker rates are under the quotas, but it'd be good to reason from first principles what the right limit is so we'll see a regression if it is violated. If the broker rates are per-minute and the quotas are computed per second, wouldn't the maximum limit be |
|
@ewencp BrokerTopicMetrics uses EWMA, configured to be ticked every 5 seconds, to calculate 1-minute rate. Quota uses the sliding window algorithm, configured with 11 samples of 1 second each, to calculate 11-second rate. Since 60-seconds can be covered by 11-seconds_6, a possibly loose bound is that However, note that there is no bound on how much In short, 10% seems a reasonable bound on the deviation between broker's rate and quota. Though there is no guarantee that this will be satisfied all the time, from users's perspective it is good enough if in most tests the |
|
@ewencp @granders @guozhangwang I have updated patch to address the comments received so far. Thanks for your help. |
There was a problem hiding this comment.
There is now one sanity check which fails both remotely and on my mac:
http://jenkins.confluent.io/job/kafka_system_tests_branch_builder/94/console
This is fixed by changing console-consumer to console_consumer (dash to underscore)
|
@lindong28 In hindsight, I think jmxtool possibly has too much internal state to be a good mixin, and the multiple inheritance here is evidently making it a bit hard to use - e.g. there seem to be a few awkward and dangerous things happening with a mix of old and new-style calls to superclass methods which I think were sortof necessitated by the multiple inheritance and the method signatures. That said, if you adjust the A couple of things should happen in a followup
|
|
@granders Thanks much as well for your review and help to update Ducktape! The I have two questions: -- Could you elaborate a bit more on why there is too much internal state in JmxMixin? It seems to me that this JmxMixin will have no performance overhead if user doesn't want to query any jmx attribute. And it requires only the necessary overhead, e.g. start jmx process at beginning and read and analyze jmx log at the end, to provide user with average and maximum value for each jmx attribute. In other words, is there any better way to allow multiple service query jmx attribute via mixin? -- I have used I agree with you that the way clients (e.g. producer performance and console consumer) calls super class methods is awkward due to multiple inheritance and method signatures. It will be nice if we can find a way to solve this problem. |
|
@guozhangwang @granders @ewencp Can you help review this patch so that it can hopefully get committed by Oct 16 as requested by Jun for 0.9.0 release? |
|
|
@granders Sure, I added a feature ticket in confluentinc/ducktape#109. Can I submit a patch for this feature? |
|
@lindong28 Yes, patches welcome! |
|
@lindong28 The producer performance service is timing out in several benchmark tests now. |
There was a problem hiding this comment.
@lindong28
I tried this a bit locally, and realized the command now has two pipes to tee (see line 58 as well)
When I drop the pipe to tee here on line 51 and keep the one below, the producer runs as expected.
|
@gwenshap |
|
LGTM. |
…able other TLS protocol versions (KIP-553) (apache#7998)" (apache#275) This reverts commit 172409c TLSv1.0 and TLSv1.1 will be disabled in CP 6.0 across the whole platform.
Revert apache#275 Removed These tests - As per discussion in apache#602 these can be left out.
Single bulk apply of the Language Engine's IntelliJ-style format profile across the kafka source tree. Pairs with the IntelliJ-real control branch `intellij-formatting` for side-by-side comparison. Engine state at apply time includes the following format fixes landed against the language-engine repo: Pre-PR #11: - PR apache#275 multi-line // comment indent group - PR apache#276 BLANK_LINES_AROUND_CLASS + sibling no-op repair - PR apache#277 string-concat chain anchor preservation - PR apache#279 forward style through apply routes - PR apache#281 Result-tab line-number alignment - PR apache#282 spaces after return / throw / yield / instanceof - PR apache#283 chain-dot postfix preservation - PR apache#284 BlankLines sibling minimum.* conversion - PR apache#285 chain-dot single-anchor (N=1) preservation - PR apache#286 string-concat chain partial-cascade fix - PR apache#287 method-decl param re-align for misaligned source - PR apache#288 annotation-in-array-init indent Post-original-PR #11 (new in this re-apply): - PR apache#289 partial-cascade post-rewrite-anchor (4 shape fixes: chain-dot postfix follow-on, MI/NewClass close-paren cascade, ternary continuation, lambda body continuation) Stats: 3101 files written, 0 failures, 2678 already-idempotent files skipped (no_change). Corpus byte delta vs trunk: -510 bytes (106275 insertions / 106785 deletions).
@granders Can you take a look at this quota system test?