KAFKA-2527; System Test for Quotas in Ducktape by lindong28 · Pull Request #275 · apache/kafka

lindong28 · 2015-10-05T07:09:35Z

@granders Can you take a look at this quota system test?

granders · 2015-10-05T21:38:11Z

use snake-case (underscores) rather than camel-case

Thanks for taking time to review this patch. Will fix this one.

lindong28 · 2015-10-06T04:51:44Z

@granders @ewencp @guozhangwang I have addressed comments received so far and all 3 quota test succeed on my desktop.

granders · 2015-10-06T17:00:06Z

@lindong28 fyi I am seeing a failure in test_shared_quota when I run on aws:
http://jenkins.confluent.io/job/kafka_system_tests_branch_builder/90/console

Summary report:
http://jenkins.confluent.io/job/kafka_system_tests_branch_builder/90/artifact/results/KAFKA-2527/2015-10-06--001/report.html

Artifacts:
http://jenkins.confluent.io/job/kafka_system_tests_branch_builder/90/artifact/results/KAFKA-2527/2015-10-06--001/

lindong28 · 2015-10-06T17:14:49Z

@granders Thanks for your information. This also happens occasionally in my test. This test try to verify that, when 2 consumer shares the same quota, maximum value of sum of their throughput at any given time <= quota * (1+50%). However, it sometimes exceeds the quota by almost 100%. I need to investigate 1) the quota implementation in kafka and 2) metrics calculation in the kafka and client source code to understand why deviation can be as much as 100%.

As far as quota system test is concerned, I have modified the threshold to 100% to suppress the problem.

granders · 2015-10-06T17:32:52Z

@lindong28
Doesn't upping the threshold to suppress the failure defeat the purpose of this particular test?

I would think it'd be better to have a test that fails + a followup JIRA to investigate whether the failure is real than a false pass.

ewencp · 2015-10-06T17:36:41Z

Agreed it defeats the purpose of the test. Didn't something like this come up at some point during the evaluation of the original patch, maybe in one of the unit tests or something? I think one proposed possibility was that the windows used to compute stats for the quota system and for metrics was different, which could account for the difference.

granders · 2015-10-06T17:44:21Z

fyi - this is essentially a parametrized test, so you could express it that way:

change "run_clients" -> "test_quotas", and then:

from ducktape.marks import parameterize ... @parameterize(producer_id="default_id", producer_num=1, consumer_id="default_id", consumer_num=1) @parameterize(producer_id="overridden_id", producer_num=1, consumer_id="overridden_id", consumer_num=1) @parameterize(producer_id="overridden_id", producer_num=1, consumer_id="overridden_id", consumer_num=2) def test_quotas(self, producer_id, producer_num, consumer_id, consumer_num): stuff...

It'd also be nice to use default values on the method parameters so the @parametrize annotations only have to specify the parameters that are overridden.

@granders I haven't seen any use of this feature in kafkatests. Does this allow developer to specify test names, e.g. test_overridden_quota, so that we can get an idea of what the test is in the report?

@lindong28 In the console output the tests get a name with the parameters included, e.g. kafkatest.tests.benchmark_test.Benchmark.test_producer_throughput.topic=topic-replication-factor-one.num_producers=1.acks=1.message_size=100. In the HTML report, the parameters are listed with the test name in the "Test" column. See, e.g., the report http://testing.confluent.io/kafka/latest/ and search for the test_producer_throughput tests.

If you're looking for examples, we use this in a few places in the tests already -- the copycat and benchmark tests are both using it.

@ewencp @granders Very nice feature. Thanks for sharing. I thought it is not used in kafkatest because I searched for "parameterize" instead of "parametrize". For consistency I will use "parametrize" as well.

The only inconvenience with that approach is that I can not selectively run one of the three test (no?). But this is more of a problem during development and debug.

I find it clearer to specify every parameter in the python code than specifying difference set of parameters in the "@parametrize". Since this is also the practice adopted in benchmark_test and copycat_test, I will specify every parameter as well.

lindong28 · 2015-10-06T18:09:51Z

@granders @ewencp I think the purpose of system tests is to notify developers of bugs so we can fix it. But it is not clear that there is any bug in the quota implementation even though the deviation can be high, for the following reasons:

-- Immediate goal of quote is to ensure that the maximum throughput per clientId on broker averaged over a given window <= quota. This is verified in the original evaluation when quota >= 10MB (posted in KAFKA-2528). But it doesn't guarantee that the maximum throughput measured on the client side <= quota, because 1) there is variable delay between broker and consumer, and 2) consumer uses different configuration and implementation for its rate metrics.

-- This high deviation happens in the shared quota test, which may additionally be explained by the variable delay between broker and consumers. For example, say broker allocates 100% quota to consumer 1 at time t, and allocates 100% quota to consumer 2 at time t+1. If the batches of messages arrives at both consumers at the same time, then the sum of measured byte-in-rate would double the quota.

Overall the quota is working, because without quota you would observe much higher byte-in rate or byte-out rate in the test. I would say there is problem if the sum of average throughput over entire experiment >= quota. I plan to do further investigation to explain the gap here. But we probably should not mark test as FAIL if this doesn't indicate problem in kafka implementation.

lindong28 · 2015-10-06T18:15:11Z

Thinking more about this, maybe I should also collect byte-in rate and byte-out rate directly from broker in this system test.

ewencp · 2015-10-06T18:32:17Z

@lindong28 Yeah, I think that explanation of how you can get a window where the client sees the quota exceeded was the explanation I had seen, probably in that discussion in KAFKA-2528. I think the key thing in this test is to figure out the exact ways in which the measured value can exceed the quota and document that clearly in the test. I think the 2x number makes sense -- as long as we never exceed that then we can consider it valid. Measuring byte-in and byte-out rates directly would also be good for validation. Those shouldn't skew as much, but could also deviate since they are measured at a different point than the quota is enforced, right? What would be the acceptable deviation from the quota there?

lindong28 · 2015-10-06T20:20:28Z

Ah.. Closed by mistake.

granders · 2015-10-06T22:44:55Z

Rename "dict" here since it is a reserved keyword

Thanks for noting this. Will fix it.

ewencp · 2015-10-07T05:56:08Z

@lindong28 It does look like the broker rates are under the quotas, but it'd be good to reason from first principles what the right limit is so we'll see a regression if it is violated. If the broker rates are per-minute and the quotas are computed per second, wouldn't the maximum limit be (61 * quota_rate / 60)? We always just need to match up the measured period with the period of the rate limiting, with some slack to account for mismatches in time windows.

lindong28 · 2015-10-07T18:26:12Z

@ewencp BrokerTopicMetrics uses EWMA, configured to be ticked every 5 seconds, to calculate 1-minute rate. Quota uses the sliding window algorithm, configured with 11 samples of 1 second each, to calculate 11-second rate. Since 60-seconds can be covered by 11-seconds_6, a possibly loose bound is that max(broker_metric_rate) <= max(quota_rate)_66/60 = (1+10%)*max(quota_metric_rate)

However, note that there is no bound on how much quota_metric_rate can exceed quota since the quota metric is updated before we decide whether to throttled request. The deviation depends on the number of inflight requests between broker and clients. Adding to the complexity, there is also difference in when the broker's byte-in rate and quota's byte-in rate are updated -- broker's byte-in rate is updated immediate after broker appends data to log, while quota's byte-in rate is updated after necessary acks are received from follower.

In short, 10% seems a reasonable bound on the deviation between broker's rate and quota. Though there is no guarantee that this will be satisfied all the time, from users's perspective it is good enough if in most tests the maximum broker rate <= (1+10%) quota .

lindong28 · 2015-10-07T19:32:09Z

@ewencp @granders @guozhangwang I have updated patch to address the comments received so far. Thanks for your help.

granders · 2015-10-08T07:52:02Z

There is now one sanity check which fails both remotely and on my mac:
http://jenkins.confluent.io/job/kafka_system_tests_branch_builder/94/console

This is fixed by changing console-consumer to console_consumer (dash to underscore)

granders · 2015-10-08T08:26:07Z

@lindong28
Thanks a lot for your adjustments!

In hindsight, I think jmxtool possibly has too much internal state to be a good mixin, and the multiple inheritance here is evidently making it a bit hard to use - e.g. there seem to be a few awkward and dangerous things happening with a mix of old and new-style calls to superclass methods which I think were sortof necessitated by the multiple inheritance and the method signatures.

That said, if you adjust the stop_node call so test_console_consumer passes, this seems basically good.

A couple of things should happen in a followup

incorporate the ducktape update which allows the async ssh_capture call and remove the ssh_capture(..).next() trick which can cause problems if there are no lines of output (next() raises StopIteration error if there is no output, as in the test_console_consumer sanity check)
- I can tweak the jmx tool and the old-style super calls when I merge your changes into my upcoming patch

lindong28 · 2015-10-08T17:08:55Z

@granders Thanks much as well for your review and help to update Ducktape! The stop_node is fixed now.

I have two questions:

-- Could you elaborate a bit more on why there is too much internal state in JmxMixin? It seems to me that this JmxMixin will have no performance overhead if user doesn't want to query any jmx attribute. And it requires only the necessary overhead, e.g. start jmx process at beginning and read and analyze jmx log at the end, to provide user with average and maximum value for each jmx attribute. In other words, is there any better way to allow multiple service query jmx attribute via mixin?

-- I have used ssh_capture(..).next() in order to wait for producer/consumer to start before I start jmx tool. I think the best solution is for ducktape to support ssh_capture(..).hasNext(). Do you think this is a useful API for Ducktape to provide? If developer were to wait for process to start without this support, they have to do ps ax | grep -i console_consumer (in ConsoleConsumer) or tail kafka.log (in KafkaService), both require extra ssh_command.

I agree with you that the way clients (e.g. producer performance and console consumer) calls super class methods is awkward due to multiple inheritance and method signatures. It will be nice if we can find a way to solve this problem.

lindong28 · 2015-10-12T21:57:26Z

@guozhangwang @granders @ewencp Can you help review this patch so that it can hopefully get committed by Oct 16 as requested by Jun for 0.9.0 release?

granders · 2015-10-13T00:43:18Z

@lindong28

Let's address any jmx cleanup in a followup patch.
has_next() or some other mechanism to signal that output is available would be useful, maybe you can open an issue on ducktape for a feature request? https://github.com/confluentinc/ducktape
The initial SSL test patch was just merged, so there are some conflicts to be resolved, but @gwenshap can take a final look once that is ready

lindong28 · 2015-10-13T02:30:40Z

@granders @gwenshap Sure, I just merged my patch with trunk. Quota tests have passed on my machine. Please have a look. Thanks!

lindong28 · 2015-10-13T02:36:20Z

@granders Sure, I added a feature ticket in confluentinc/ducktape#109. Can I submit a patch for this feature?

granders · 2015-10-13T06:12:39Z

@lindong28 Yes, patches welcome!

granders · 2015-10-13T16:59:10Z

@lindong28
It looks like this merge may have caused existing benchmark tests to break...
http://jenkins.confluent.io/job/kafka_system_tests_branch_builder/104/artifact/results/KAFKA-2527/2015-10-13--001/report.html

The producer performance service is timing out in several benchmark tests now.

granders · 2015-10-13T18:19:50Z

@lindong28
I tried this a bit locally, and realized the command now has two pipes to tee (see line 58 as well)

When I drop the pipe to tee here on line 51 and keep the one below, the producer runs as expected.

granders · 2015-10-13T20:28:01Z

@gwenshap
Confirming that failed benchmark tests pass once again locally. Running the full suite again on aws: http://jenkins.confluent.io/job/kafka_system_tests_branch_builder/105/console

gwenshap · 2015-10-13T20:53:31Z

LGTM.

lindong28 · 2015-10-13T21:01:32Z

@granders @ewencp @gwenshap Thanks for your review!

…able other TLS protocol versions (KIP-553) (apache#7998)" (apache#275) This reverts commit 172409c TLSv1.0 and TLSv1.1 will be disabled in CP 6.0 across the whole platform.

Revert apache#275 Removed These tests - As per discussion in apache#602 these can be left out.

Single bulk apply of the Language Engine's IntelliJ-style format profile across the kafka source tree. Pairs with the IntelliJ-real control branch `intellij-formatting` for side-by-side comparison. Engine state at apply time includes the following format fixes landed against the language-engine repo: Pre-PR #11: - PR apache#275 multi-line // comment indent group - PR apache#276 BLANK_LINES_AROUND_CLASS + sibling no-op repair - PR apache#277 string-concat chain anchor preservation - PR apache#279 forward style through apply routes - PR apache#281 Result-tab line-number alignment - PR apache#282 spaces after return / throw / yield / instanceof - PR apache#283 chain-dot postfix preservation - PR apache#284 BlankLines sibling minimum.* conversion - PR apache#285 chain-dot single-anchor (N=1) preservation - PR apache#286 string-concat chain partial-cascade fix - PR apache#287 method-decl param re-align for misaligned source - PR apache#288 annotation-in-array-init indent Post-original-PR #11 (new in this re-apply): - PR apache#289 partial-cascade post-rewrite-anchor (4 shape fixes: chain-dot postfix follow-on, MI/NewClass close-paren cascade, ternary continuation, lambda body continuation) Stats: 3101 files written, 0 failures, 2678 already-idempotent files skipped (no_change). Corpus byte delta vs trunk: -510 bytes (106275 insertions / 106785 deletions).

lindong28 added 2 commits October 5, 2015 07:07

KAFKA-2527; System Test for Quotas in Ducktape

cf1a116

adjust quota configuration

345b997

lindong28 force-pushed the KAFKA-2527 branch from 2d5c124 to 345b997 Compare October 5, 2015 21:36

granders reviewed Oct 5, 2015
View reviewed changes

lindong28 added 2 commits October 6, 2015 02:00

JmxMixin will subclass object

28bd200

support jmx query with arbitrary object name and attributes

0d07fc7

lindong28 force-pushed the KAFKA-2527 branch from 78a88ee to b421122 Compare October 6, 2015 15:57

jmx_object_name is required to use jmx tool

c0fd768

lindong28 force-pushed the KAFKA-2527 branch from b421122 to c0fd768 Compare October 6, 2015 16:17

adjust quota test configuration

f7cfad5

granders reviewed Oct 6, 2015
View reviewed changes

lindong28 closed this Oct 6, 2015

lindong28 reopened this Oct 6, 2015

This was referenced Oct 6, 2015

ssh_capture confluentinc/ducktape#106

Closed

Updated ssh_capture confluentinc/ducktape#107

Merged

granders reviewed Oct 6, 2015
View reviewed changes

address reviewer comments

35f3dd8

granders reviewed Oct 8, 2015
View reviewed changes

fix stop_node in console consumer

42e5804

resolve conflicts

3b68a11

granders reviewed Oct 13, 2015
View reviewed changes

fix benchmark test

ceb4534

asfgit closed this in 123d27e Oct 13, 2015

lindong28 deleted the KAFKA-2527 branch October 13, 2015 21:01

jeffkbkim pushed a commit to jeffkbkim/kafka that referenced this pull request Oct 22, 2021

MINOR: Remove tests testTLSDefaults and checkAuthentiationFailed

07f02da

Revert apache#275 Removed These tests - As per discussion in apache#602 these can be left out.

davide-armand pushed a commit to aiven/kafka that referenced this pull request Dec 1, 2025

chore(inkless:ci): increase port range for InfiniSpan (apache#275)

e6cc71e

jeqo pushed a commit to aiven/kafka that referenced this pull request Jan 16, 2026

chore(inkless:ci): increase port range for InfiniSpan (apache#275)

e3ab28b

traceyyoshima mentioned this pull request May 21, 2026

Apply Language Engine formatter under style=intellij traceyyoshima/kafka#11

Closed

Conversation

lindong28 commented Oct 5, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lindong28 commented Oct 6, 2015

Uh oh!

granders commented Oct 6, 2015

Uh oh!

lindong28 commented Oct 6, 2015

Uh oh!

granders commented Oct 6, 2015

Uh oh!

ewencp commented Oct 6, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lindong28 commented Oct 6, 2015

Uh oh!

lindong28 commented Oct 6, 2015

Uh oh!

ewencp commented Oct 6, 2015

Uh oh!

lindong28 commented Oct 6, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ewencp commented Oct 7, 2015

Uh oh!

lindong28 commented Oct 7, 2015

Uh oh!

lindong28 commented Oct 7, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

granders commented Oct 8, 2015

Uh oh!

lindong28 commented Oct 8, 2015

Uh oh!

lindong28 commented Oct 12, 2015

Uh oh!

granders commented Oct 13, 2015

Uh oh!

lindong28 commented Oct 13, 2015

Uh oh!

lindong28 commented Oct 13, 2015

Uh oh!

granders commented Oct 13, 2015

Uh oh!

granders commented Oct 13, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

granders commented Oct 13, 2015

Uh oh!

gwenshap commented Oct 13, 2015

Uh oh!

lindong28 commented Oct 13, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants