[STORM-8019] Fixing kafka topic level metrics computation#8047
Conversation
…ka/spout/metrics2/KafkaOffsetPartitionMetrics.java
[STORM-8019] Fixing kafka topic level metrics computation
|
@jakehschwartz Would it be possible for you to test a SNAPSHOT of the fix for one of your (test) workloads? If so, just let me know and we can put a SNAPSHOT build on nightlies.apache.org for this purpose. |
|
hey, I could try. Ive never set up a cluster before, our devops team usually does that, but I could certainly give it a shot. Our code is on the 2.7.1 version because we are still using Java 11, so if the SNAPSHOT could be on that line that would help |
I have cherry picked the changes on top of the last 2.7.1 release: https://github.com/apache/storm/tree/2.7.2/external/storm-kafka-client/ and deployed a 2.7.2-SNAPSHOT on the Apache Snapshot Repository. You can fetch the client jar from https://repository.apache.org/content/groups/snapshots/org/apache/storm/storm-kafka-client/2.7.2-SNAPSHOT/ Should be enough to use 2.7.2-SNAPSHOT for the kafka dependency and use a 2.7.1 storm distribution for the test. |
|
@rzo1 Thanks for setting that up, much easier for me to test I dont believe the fix is working, here is a graph of the sum of the partitions (left) vs the total lag metric (right) It's hard to debug further as it also seems like the |
|
@richard Zowalla ***@***.***> thanks for setting this up.
Much appreciated.
Jake, thanks for testing it. That behaviour is unexpected.
The Kafka UI lag is fed through a different OS process, that directly
connects to the Kafka brokers to check the lag of the consumer group, so
it's unlikely that is connected to the changes.
As to the Kafka total lag reported through the V2 metrics keeping climbing,
might be due to a coding mistake on my part, although we do have unit tests
showing the it's reset everytime getMetrics is called. I'll try to test the
changes with a real cluster instance.
…On Mon, Apr 28, 2025, 18:57 Jake Schwartz ***@***.***> wrote:
*jakehschwartz* left a comment (apache/storm#8047)
<#8047 (comment)>
@rzo1 <https://github.com/rzo1> Thanks for setting that up, much easier
for me to test
I dont believe the fix is working, here is a graph of the sum of the
partitions (left) vs the total lag metric (right)
Screenshot.2025-04-28.at.13.53.39.png (view on web)
<https://github.com/user-attachments/assets/481a333b-d24c-4ddc-864c-8162d6d4e413>
It's hard to debug further as it also seems like the Spout Committed
Offset that shows up in the UI (ui.disable.spout.lag.monitoring: false)
is now not being updated.
—
Reply to this email directly, view it on GitHub
<#8047 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAG5GIXQUFYPIG3ZUA3MR7L23ZTXDAVCNFSM6AAAAAB33E6ZNSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDQMZWGA2DKNBYG4>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Maybe I need to test on a different topology, doesnt look like they are committing when I revert even though other topologies are. I'll try again tomorrow |
|
The files changed on the PR belong to the storm-kafka-client JAR which is
not provided by the Storm cluster, but rather should be bundled up with
your topology JAR and deployed to the cluster. Can you repeat the test
deploying your topology built using the nem storm-kafka-client JAR?
…On Mon, Apr 28, 2025, 21:54 Jake Schwartz ***@***.***> wrote:
*jakehschwartz* left a comment (apache/storm#8047)
<#8047 (comment)>
The Kafka UI lag is fed through a different OS process, that directly
connects to the Kafka brokers to check the lag of the consumer group, so
it's unlikely that is connected to the changes.
Maybe I need to test on a different topology, doesnt look like they are
committing when I revert even though other topologies are. I'll try again
tomorrow
—
Reply to this email directly, view it on GitHub
<#8047 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAG5GITI36TJRNPU4NKYUKD232IO3AVCNFSM6AAAAAB33E6ZNSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDQMZWGU3TKMRYGM>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
|
The UI lag piece is not actually broken, I forgot the topology I was using required auth, so it doesnt show up in the UI and was looking at a different spout that was deactivated. Sorry for the false alarm :) Trying again with a different topology, will report back in a few hours |
|
Sorry, got pulled onto some other work the past couple days. I'm still seeing the even incrementing value unfortunately. I looked the code for this PR and I don't see anything glaring that would explain it. I'll regenerate my jar again with the library and double check to make sure its not accidentally pulling in the 2.7.1 version Just to confirm, I could be using storm-kafka-client-2.7.2-20250425.195041-27.jar? thanks again |
|
Thanks for taking the time to do this. @richard Zowalla ***@***.***> are
you able to confirm the ID of the correct JAR? Thanks in advance!
…On Wed, Apr 30, 2025, 23:18 Jake Schwartz ***@***.***> wrote:
*jakehschwartz* left a comment (apache/storm#8047)
<#8047 (comment)>
Sorry, got pulled onto some other work the past couple days.
I'm still seeing the even incrementing value unfortunately. I looked the
code for this PR and I don't see anything glaring that would explain it.
I'll regenerate my jar again with the library and double check to make sure
its not accidentally pulling in the 2.7.1 version
Just to confirm, I could be using
storm-kafka-client-2.7.2-20250425.195041-27.jar
<https://repository.apache.org/content/groups/snapshots/org/apache/storm/storm-kafka-client/2.7.2-SNAPSHOT/storm-kafka-client-2.7.2-20250425.195041-27.jar>
?
thanks again
—
Reply to this email directly, view it on GitHub
<#8047 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAG5GISCNYVTMUKIGVE6P7D24E435AVCNFSM6AAAAAB33E6ZNSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDQNBTGMYDQMRUGQ>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
|
storm-kafka-client-2.7.2-20250425.195041-27.jar is correct 🫠 |
|
@jakehschwartz any luck with the tests? |
|
Hi @rzo1 @jakehschwartz |
|
I think I found the problem. I'll try to submit a fix to the PR over the
next few days.
…On Thu, 1 May 2025 at 18:45, Richard Zowalla ***@***.***> wrote:
*rzo1* left a comment (apache/storm#8047)
<#8047 (comment)>
storm-kafka-client-2.7.2-20250425.195041-27.jar is correct 🫠
—
Reply to this email directly, view it on GitHub
<#8047 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAG5GIUWZFUMORG5Y74VS3T24JMUZAVCNFSM6AAAAAB33E6ZNSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDQNBVGM2DOMBTG4>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
* Fixing the way Kafka topic level metrics are computed
|
Hi folks, Apologies for the delay. Further changes to the code have been pushed. I reckon this solves the issue. |
|
Did a test with a real instance of a topology. Problem seems to be fixed @jakehschwartz @rzo1 |
|
I am not a Kafka user, but changes lgtm. |

What is the purpose of the change
PR opened to address: #8019
Fixing a regression introduced by https://issues.apache.org/jira/browse/STORM-3782
This PR keeps the same logic introduced in #8019, but with a important change:
How was the change tested
Created the following test classes
Performed manual validations by adding a Console Reporter to an existing topology running a version of Storm running the changes proposed in this PR.