Skip to content

[STORM-8019] Fixing kafka topic level metrics computation#8047

Merged
reiabreu merged 36 commits into
apache:masterfrom
reiabreu:master
Jun 16, 2025
Merged

[STORM-8019] Fixing kafka topic level metrics computation#8047
reiabreu merged 36 commits into
apache:masterfrom
reiabreu:master

Conversation

@reiabreu

@reiabreu reiabreu commented Apr 25, 2025

Copy link
Copy Markdown
Contributor

What is the purpose of the change

PR opened to address: #8019

Fixing a regression introduced by https://issues.apache.org/jira/browse/STORM-3782

This PR keeps the same logic introduced in #8019, but with a important change:

  • the topic level metrics do not accumulate between calls of the Gauge's getValues method. Total values are reset on every invocation

How was the change tested

Created the following test classes

  • external/storm-kafka-client/src/test/java/org/apache/storm/kafka/spout/metric2/KafkaOffsetPartitionMetricsTest.java
  • external/storm-kafka-client/src/test/java/org/apache/storm/kafka/spout/metric2/KafkaOffsetTopicMetricsTest.java

Performed manual validations by adding a Console Reporter to an existing topology running a version of Storm running the changes proposed in this PR.

[STORM-8019] Fixing kafka topic level metrics computation
@rzo1

rzo1 commented Apr 25, 2025

Copy link
Copy Markdown
Contributor

@jakehschwartz Would it be possible for you to test a SNAPSHOT of the fix for one of your (test) workloads? If so, just let me know and we can put a SNAPSHOT build on nightlies.apache.org for this purpose.

@jakehschwartz

Copy link
Copy Markdown

hey, I could try. Ive never set up a cluster before, our devops team usually does that, but I could certainly give it a shot.

Our code is on the 2.7.1 version because we are still using Java 11, so if the SNAPSHOT could be on that line that would help

@rzo1

rzo1 commented Apr 25, 2025

Copy link
Copy Markdown
Contributor

hey, I could try. Ive never set up a cluster before, our devops team usually does that, but I could certainly give it a shot.

Our code is on the 2.7.1 version because we are still using Java 11, so if the SNAPSHOT could be on that line that would help

I have cherry picked the changes on top of the last 2.7.1 release: https://github.com/apache/storm/tree/2.7.2/external/storm-kafka-client/ and deployed a 2.7.2-SNAPSHOT on the Apache Snapshot Repository. You can fetch the client jar from https://repository.apache.org/content/groups/snapshots/org/apache/storm/storm-kafka-client/2.7.2-SNAPSHOT/

Should be enough to use 2.7.2-SNAPSHOT for the kafka dependency and use a 2.7.1 storm distribution for the test.

@jakehschwartz

Copy link
Copy Markdown

@rzo1 Thanks for setting that up, much easier for me to test

I dont believe the fix is working, here is a graph of the sum of the partitions (left) vs the total lag metric (right)
Screenshot 2025-04-28 at 13 53 39

It's hard to debug further as it also seems like the Spout Committed Offset that shows up in the UI (ui.disable.spout.lag.monitoring: false) is now not being updated.

@reiabreu

reiabreu commented Apr 28, 2025 via email

Copy link
Copy Markdown
Contributor Author

@jakehschwartz

Copy link
Copy Markdown

The Kafka UI lag is fed through a different OS process, that directly
connects to the Kafka brokers to check the lag of the consumer group, so
it's unlikely that is connected to the changes.

Maybe I need to test on a different topology, doesnt look like they are committing when I revert even though other topologies are. I'll try again tomorrow

@reiabreu

reiabreu commented Apr 28, 2025 via email

Copy link
Copy Markdown
Contributor Author

@jakehschwartz

Copy link
Copy Markdown

The UI lag piece is not actually broken, I forgot the topology I was using required auth, so it doesnt show up in the UI and was looking at a different spout that was deactivated. Sorry for the false alarm :)

Trying again with a different topology, will report back in a few hours

@jakehschwartz

Copy link
Copy Markdown

Sorry, got pulled onto some other work the past couple days.

I'm still seeing the even incrementing value unfortunately. I looked the code for this PR and I don't see anything glaring that would explain it. I'll regenerate my jar again with the library and double check to make sure its not accidentally pulling in the 2.7.1 version

Just to confirm, I could be using storm-kafka-client-2.7.2-20250425.195041-27.jar?

thanks again

@reiabreu

reiabreu commented May 1, 2025 via email

Copy link
Copy Markdown
Contributor Author

@rzo1

rzo1 commented May 1, 2025

Copy link
Copy Markdown
Contributor

storm-kafka-client-2.7.2-20250425.195041-27.jar is correct 🫠

@reiabreu

reiabreu commented May 8, 2025

Copy link
Copy Markdown
Contributor Author

@jakehschwartz any luck with the tests?
Thanks in advance

@reiabreu

Copy link
Copy Markdown
Contributor Author

Hi @rzo1 @jakehschwartz
I finally got around to test the change with a real topology, using Richard's JAR. I can confirm the reported metrics are accumulating, so something is still off. I'll do some digging on this

@reiabreu

reiabreu commented May 22, 2025 via email

Copy link
Copy Markdown
Contributor Author

@reiabreu

reiabreu commented Jun 9, 2025

Copy link
Copy Markdown
Contributor Author

Hi folks,

Apologies for the delay. Further changes to the code have been pushed. I reckon this solves the issue.
I'll try the changes with a real topology over the next few days. Feel free to review it.

@reiabreu

Copy link
Copy Markdown
Contributor Author

Did a test with a real instance of a topology. Problem seems to be fixed @jakehschwartz @rzo1
Will merge it once I get an approval.
Thanks

@rzo1

rzo1 commented Jun 16, 2025

Copy link
Copy Markdown
Contributor

I am not a Kafka user, but changes lgtm.

@reiabreu reiabreu merged commit 1a6ff41 into apache:master Jun 16, 2025
12 checks passed
@reiabreu reiabreu added this to the 2.8.2 milestone Jun 16, 2025
@reiabreu reiabreu removed this from the 2.8.2 milestone Jul 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants