Skip to content

fix(topology_lag): Kafka Topology Lag breaking when no offsets are commited for topic/partition#8589

Merged
reiabreu merged 3 commits into
apache:masterfrom
DiogoP98:fix-kafka-lag-issue
May 8, 2026
Merged

fix(topology_lag): Kafka Topology Lag breaking when no offsets are commited for topic/partition#8589
reiabreu merged 3 commits into
apache:masterfrom
DiogoP98:fix-kafka-lag-issue

Conversation

@DiogoP98

@DiogoP98 DiogoP98 commented May 6, 2026

Copy link
Copy Markdown
Contributor

What is the purpose of the change

Fixes the silent failure of the topology spout lag query, where the Storm UI's "Topology Lag" panel stops working with this warning being logged repeatedly for every Kafka spout:

  TopologySpoutLag [WARN] Exception thrown while getting lag for spout id: <spoutId>
  TopologySpoutLag [WARN] Exception message:null
  java.lang.ClassCastException

The regression was introduced by the Kafka client bump from 3.9.0 → 4.x (#8243). Two issues, on the same code path:

  1. KafkaOffsetLagUtil — NPE on partitions with no committed offset.
  2. TopologySpoutLag — ClassCastException swallows the real error. When the kafka-monitor subprocess fails (the NPE above, or any other failure: broker unreachable, auth failure, bad config, etc.), it prints a plain-text error to stdout.

Together, (1) restores correct lag reporting for partitions with no committed offset, and (2) ensures any future monitor failure is reported usefully instead of as a generic ClassCastException with a null message.

How was the change tested

Integration test

  • KafkaOffsetLagUtilTest#getOffsetLagsReportsCommittedOffsetForCommittedPartitionsAndMinusOneForUncommittedPartitions creates a 2-partition topic, commits an offset on partition 0 only, and calls the production KafkaOffsetLagUtil.getOffsetLags(...) end-to-end. Asserts the committed offset is reported for partition 0 and -1 for partition 1 (the regression case — pre-fix this NPE'd inside the monitor and surfaced as ClassCastException upstream in TopologySpoutLag).

Manual verification on a Storm 2.8.7 cluster

  • Built storm-core and storm-kafka-monitor JARs and replaced them on the storm-ui node.
  • Confirmed the previously-failing spouts (managementKafka, secondaryKafka, Instructions, …) now return lag results in the UI with no TopologySpoutLag warnings in ui.log.
  • For a spout with a fresh consumer group (no committed offsets yet), confirmed lag reports correctly instead of throwing.

Issue: #8588

@rzo1 rzo1 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Both bugs are real, and the bulk-committed() + null-check is exactly the right shape for the fix.

One heads-up before going further: master currently targets Storm 3 (Java 21, no Clojure). The PR description mentions manual verification on a 2.8.7 cluster, so if you also need this fix on the 2.x line, happy to get a separate PR against the 2.x branch — they're maintained independently now.

A couple of things on the change itself:

  1. Testcontainers without a Docker fallback — see inline at KafkaOffsetLagUtilTest.java:45. This will hard-fail on every CI runner (and dev box) without Docker. storm-metrics-prometheus already established the right pattern; please mirror it.
  2. Formatting churn in KafkaOffsetLagUtil.java — the bug fix is ~6 useful lines, but the diff also reformats half the file (continuation indent 4→8, plus a Javadoc tweak). Bug fixes that get cherry-picked to release branches are much easier to backport when they aren't entangled with style-only changes. Consider splitting the formatting into a separate commit (or dropping it from this PR).
  3. Minor observability regression in TopologySpoutLag.java — see inline.

Nothing blocking on (2) or (3) — author's call.

Comment thread storm-core/src/jvm/org/apache/storm/utils/TopologySpoutLag.java
@DiogoP98 DiogoP98 force-pushed the fix-kafka-lag-issue branch from da2b705 to 49835fe Compare May 7, 2026 21:14
@DiogoP98 DiogoP98 requested a review from rzo1 May 7, 2026 21:14
@reiabreu

reiabreu commented May 7, 2026

Copy link
Copy Markdown
Contributor

Hey @DiogoP98.
Thanks for this. I'll try to review it ASAP

@reiabreu reiabreu added the bug label May 7, 2026
@rzo1 rzo1 added this to the 3.0.0 milestone May 8, 2026
@reiabreu

reiabreu commented May 8, 2026

Copy link
Copy Markdown
Contributor

@rzo1 happy to merge this into both master and 2.x
Keen on having the 2.8.8 patch our over the next days, if possible. Happy to do the release procedure

@reiabreu reiabreu merged commit b312633 into apache:master May 8, 2026
16 checks passed
@reiabreu

reiabreu commented May 8, 2026

Copy link
Copy Markdown
Contributor

#8591

@DiogoP98 DiogoP98 deleted the fix-kafka-lag-issue branch May 9, 2026 12:28
reiabreu added a commit that referenced this pull request May 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants