Skip to content

KAFKA-2527; System Test for Quotas in Ducktape#275

Closed
lindong28 wants to merge 10 commits into
apache:trunkfrom
lindong28:KAFKA-2527
Closed

KAFKA-2527; System Test for Quotas in Ducktape#275
lindong28 wants to merge 10 commits into
apache:trunkfrom
lindong28:KAFKA-2527

Conversation

@lindong28

Copy link
Copy Markdown
Member

@granders Can you take a look at this quota system test?

Comment thread tests/kafkatest/tests/quota_test.py Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use snake-case (underscores) rather than camel-case

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking time to review this patch. Will fix this one.

@lindong28

Copy link
Copy Markdown
Member Author

@granders @ewencp @guozhangwang I have addressed comments received so far and all 3 quota test succeed on my desktop.

@lindong28

Copy link
Copy Markdown
Member Author

@granders Thanks for your information. This also happens occasionally in my test. This test try to verify that, when 2 consumer shares the same quota, maximum value of sum of their throughput at any given time <= quota * (1+50%). However, it sometimes exceeds the quota by almost 100%. I need to investigate 1) the quota implementation in kafka and 2) metrics calculation in the kafka and client source code to understand why deviation can be as much as 100%.

As far as quota system test is concerned, I have modified the threshold to 100% to suppress the problem.

@granders

granders commented Oct 6, 2015

Copy link
Copy Markdown
Contributor

@lindong28
Doesn't upping the threshold to suppress the failure defeat the purpose of this particular test?

I would think it'd be better to have a test that fails + a followup JIRA to investigate whether the failure is real than a false pass.

@ewencp

ewencp commented Oct 6, 2015

Copy link
Copy Markdown
Contributor

Agreed it defeats the purpose of the test. Didn't something like this come up at some point during the evaluation of the original patch, maybe in one of the unit tests or something? I think one proposed possibility was that the windows used to compute stats for the quota system and for metrics was different, which could account for the difference.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi - this is essentially a parametrized test, so you could express it that way:

change "run_clients" -> "test_quotas", and then:

from ducktape.marks import parameterize

...

@parameterize(producer_id="default_id", producer_num=1, consumer_id="default_id", consumer_num=1)
@parameterize(producer_id="overridden_id", producer_num=1, consumer_id="overridden_id", consumer_num=1)
@parameterize(producer_id="overridden_id", producer_num=1, consumer_id="overridden_id", consumer_num=2)
def test_quotas(self, producer_id, producer_num, consumer_id, consumer_num):
    stuff...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd also be nice to use default values on the method parameters so the @parametrize annotations only have to specify the parameters that are overridden.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@granders I haven't seen any use of this feature in kafkatests. Does this allow developer to specify test names, e.g. test_overridden_quota, so that we can get an idea of what the test is in the report?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lindong28 In the console output the tests get a name with the parameters included, e.g. kafkatest.tests.benchmark_test.Benchmark.test_producer_throughput.topic=topic-replication-factor-one.num_producers=1.acks=1.message_size=100. In the HTML report, the parameters are listed with the test name in the "Test" column. See, e.g., the report http://testing.confluent.io/kafka/latest/ and search for the test_producer_throughput tests.

If you're looking for examples, we use this in a few places in the tests already -- the copycat and benchmark tests are both using it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ewencp @granders Very nice feature. Thanks for sharing. I thought it is not used in kafkatest because I searched for "parameterize" instead of "parametrize". For consistency I will use "parametrize" as well.

The only inconvenience with that approach is that I can not selectively run one of the three test (no?). But this is more of a problem during development and debug.

I find it clearer to specify every parameter in the python code than specifying difference set of parameters in the "@parametrize". Since this is also the practice adopted in benchmark_test and copycat_test, I will specify every parameter as well.

@lindong28

Copy link
Copy Markdown
Member Author

@granders @ewencp I think the purpose of system tests is to notify developers of bugs so we can fix it. But it is not clear that there is any bug in the quota implementation even though the deviation can be high, for the following reasons:

-- Immediate goal of quote is to ensure that the maximum throughput per clientId on broker averaged over a given window <= quota. This is verified in the original evaluation when quota >= 10MB (posted in KAFKA-2528). But it doesn't guarantee that the maximum throughput measured on the client side <= quota, because 1) there is variable delay between broker and consumer, and 2) consumer uses different configuration and implementation for its rate metrics.

-- This high deviation happens in the shared quota test, which may additionally be explained by the variable delay between broker and consumers. For example, say broker allocates 100% quota to consumer 1 at time t, and allocates 100% quota to consumer 2 at time t+1. If the batches of messages arrives at both consumers at the same time, then the sum of measured byte-in-rate would double the quota.

Overall the quota is working, because without quota you would observe much higher byte-in rate or byte-out rate in the test. I would say there is problem if the sum of average throughput over entire experiment >= quota. I plan to do further investigation to explain the gap here. But we probably should not mark test as FAIL if this doesn't indicate problem in kafka implementation.

@lindong28

Copy link
Copy Markdown
Member Author

Thinking more about this, maybe I should also collect byte-in rate and byte-out rate directly from broker in this system test.

@ewencp

ewencp commented Oct 6, 2015

Copy link
Copy Markdown
Contributor

@lindong28 Yeah, I think that explanation of how you can get a window where the client sees the quota exceeded was the explanation I had seen, probably in that discussion in KAFKA-2528. I think the key thing in this test is to figure out the exact ways in which the measured value can exceed the quota and document that clearly in the test. I think the 2x number makes sense -- as long as we never exceed that then we can consider it valid. Measuring byte-in and byte-out rates directly would also be good for validation. Those shouldn't skew as much, but could also deviate since they are measured at a different point than the quota is enforced, right? What would be the acceptable deviation from the quota there?

@lindong28 lindong28 closed this Oct 6, 2015
@lindong28

Copy link
Copy Markdown
Member Author

Ah.. Closed by mistake.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename "dict" here since it is a reserved keyword

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for noting this. Will fix it.

@ewencp

ewencp commented Oct 7, 2015

Copy link
Copy Markdown
Contributor

@lindong28 It does look like the broker rates are under the quotas, but it'd be good to reason from first principles what the right limit is so we'll see a regression if it is violated. If the broker rates are per-minute and the quotas are computed per second, wouldn't the maximum limit be (61 * quota_rate / 60)? We always just need to match up the measured period with the period of the rate limiting, with some slack to account for mismatches in time windows.

@lindong28

Copy link
Copy Markdown
Member Author

@ewencp BrokerTopicMetrics uses EWMA, configured to be ticked every 5 seconds, to calculate 1-minute rate. Quota uses the sliding window algorithm, configured with 11 samples of 1 second each, to calculate 11-second rate. Since 60-seconds can be covered by 11-seconds_6, a possibly loose bound is that max(broker_metric_rate) <= max(quota_rate)_66/60 = (1+10%)*max(quota_metric_rate)

However, note that there is no bound on how much quota_metric_rate can exceed quota since the quota metric is updated before we decide whether to throttled request. The deviation depends on the number of inflight requests between broker and clients. Adding to the complexity, there is also difference in when the broker's byte-in rate and quota's byte-in rate are updated -- broker's byte-in rate is updated immediate after broker appends data to log, while quota's byte-in rate is updated after necessary acks are received from follower.

In short, 10% seems a reasonable bound on the deviation between broker's rate and quota. Though there is no guarantee that this will be satisfied all the time, from users's perspective it is good enough if in most tests the maximum broker rate <= (1+10%) quota .

@lindong28

Copy link
Copy Markdown
Member Author

@ewencp @granders @guozhangwang I have updated patch to address the comments received so far. Thanks for your help.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is now one sanity check which fails both remotely and on my mac:
http://jenkins.confluent.io/job/kafka_system_tests_branch_builder/94/console

This is fixed by changing console-consumer to console_consumer (dash to underscore)

@granders

granders commented Oct 8, 2015

Copy link
Copy Markdown
Contributor

@lindong28
Thanks a lot for your adjustments!

In hindsight, I think jmxtool possibly has too much internal state to be a good mixin, and the multiple inheritance here is evidently making it a bit hard to use - e.g. there seem to be a few awkward and dangerous things happening with a mix of old and new-style calls to superclass methods which I think were sortof necessitated by the multiple inheritance and the method signatures.

That said, if you adjust the stop_node call so test_console_consumer passes, this seems basically good.

A couple of things should happen in a followup

  • incorporate the ducktape update which allows the async ssh_capture call and remove the ssh_capture(..).next() trick which can cause problems if there are no lines of output (next() raises StopIteration error if there is no output, as in the test_console_consumer sanity check)
    • I can tweak the jmx tool and the old-style super calls when I merge your changes into my upcoming patch

@lindong28

Copy link
Copy Markdown
Member Author

@granders Thanks much as well for your review and help to update Ducktape! The stop_node is fixed now.

I have two questions:

-- Could you elaborate a bit more on why there is too much internal state in JmxMixin? It seems to me that this JmxMixin will have no performance overhead if user doesn't want to query any jmx attribute. And it requires only the necessary overhead, e.g. start jmx process at beginning and read and analyze jmx log at the end, to provide user with average and maximum value for each jmx attribute. In other words, is there any better way to allow multiple service query jmx attribute via mixin?

-- I have used ssh_capture(..).next() in order to wait for producer/consumer to start before I start jmx tool. I think the best solution is for ducktape to support ssh_capture(..).hasNext(). Do you think this is a useful API for Ducktape to provide? If developer were to wait for process to start without this support, they have to do ps ax | grep -i console_consumer (in ConsoleConsumer) or tail kafka.log (in KafkaService), both require extra ssh_command.

I agree with you that the way clients (e.g. producer performance and console consumer) calls super class methods is awkward due to multiple inheritance and method signatures. It will be nice if we can find a way to solve this problem.

@lindong28

Copy link
Copy Markdown
Member Author

@guozhangwang @granders @ewencp Can you help review this patch so that it can hopefully get committed by Oct 16 as requested by Jun for 0.9.0 release?

@granders

Copy link
Copy Markdown
Contributor

@lindong28

  • Let's address any jmx cleanup in a followup patch.
  • has_next() or some other mechanism to signal that output is available would be useful, maybe you can open an issue on ducktape for a feature request? https://github.com/confluentinc/ducktape
  • The initial SSL test patch was just merged, so there are some conflicts to be resolved, but @gwenshap can take a final look once that is ready

@lindong28

Copy link
Copy Markdown
Member Author

@granders @gwenshap Sure, I just merged my patch with trunk. Quota tests have passed on my machine. Please have a look. Thanks!

@lindong28

Copy link
Copy Markdown
Member Author

@granders Sure, I added a feature ticket in confluentinc/ducktape#109. Can I submit a patch for this feature?

@granders

Copy link
Copy Markdown
Contributor

@lindong28 Yes, patches welcome!

@granders

Copy link
Copy Markdown
Contributor

@lindong28
It looks like this merge may have caused existing benchmark tests to break...
http://jenkins.confluent.io/job/kafka_system_tests_branch_builder/104/artifact/results/KAFKA-2527/2015-10-13--001/report.html

The producer performance service is timing out in several benchmark tests now.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lindong28
I tried this a bit locally, and realized the command now has two pipes to tee (see line 58 as well)

When I drop the pipe to tee here on line 51 and keep the one below, the producer runs as expected.

@granders

Copy link
Copy Markdown
Contributor

@gwenshap
Confirming that failed benchmark tests pass once again locally. Running the full suite again on aws: http://jenkins.confluent.io/job/kafka_system_tests_branch_builder/105/console

@gwenshap

Copy link
Copy Markdown
Contributor

LGTM.

@asfgit asfgit closed this in 123d27e Oct 13, 2015
@lindong28

Copy link
Copy Markdown
Member Author

@granders @ewencp @gwenshap Thanks for your review!

@lindong28 lindong28 deleted the KAFKA-2527 branch October 13, 2015 21:01
jsancio pushed a commit to jsancio/kafka that referenced this pull request May 6, 2020
…able other TLS protocol versions (KIP-553) (apache#7998)" (apache#275)

This reverts commit 172409c

TLSv1.0 and TLSv1.1 will be disabled in CP 6.0 across the whole platform.
jeffkbkim pushed a commit to jeffkbkim/kafka that referenced this pull request Oct 22, 2021
Revert apache#275 Removed These tests - As per discussion in apache#602 these can
be left out.
davide-armand pushed a commit to aiven/kafka that referenced this pull request Dec 1, 2025
jeqo pushed a commit to aiven/kafka that referenced this pull request Jan 16, 2026
traceyyoshima added a commit to traceyyoshima/kafka that referenced this pull request May 22, 2026
Single bulk apply of the Language Engine's IntelliJ-style format
profile across the kafka source tree. Pairs with the IntelliJ-real
control branch `intellij-formatting` for side-by-side comparison.

Engine state at apply time includes the following format fixes
landed against the language-engine repo:

  Pre-PR #11:
  - PR apache#275 multi-line // comment indent group
  - PR apache#276 BLANK_LINES_AROUND_CLASS + sibling no-op repair
  - PR apache#277 string-concat chain anchor preservation
  - PR apache#279 forward style through apply routes
  - PR apache#281 Result-tab line-number alignment
  - PR apache#282 spaces after return / throw / yield / instanceof
  - PR apache#283 chain-dot postfix preservation
  - PR apache#284 BlankLines sibling minimum.* conversion
  - PR apache#285 chain-dot single-anchor (N=1) preservation
  - PR apache#286 string-concat chain partial-cascade fix
  - PR apache#287 method-decl param re-align for misaligned source
  - PR apache#288 annotation-in-array-init indent

  Post-original-PR #11 (new in this re-apply):
  - PR apache#289 partial-cascade post-rewrite-anchor (4 shape fixes:
    chain-dot postfix follow-on, MI/NewClass close-paren cascade,
    ternary continuation, lambda body continuation)

Stats: 3101 files written, 0 failures, 2678 already-idempotent
files skipped (no_change). Corpus byte delta vs trunk: -510 bytes
(106275 insertions / 106785 deletions).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants