Skip to content

Conversation

@pdolif
Copy link
Contributor

@pdolif pdolif commented Aug 15, 2025

Fixes #24627

Motivation

As shared in #24627, ZkSessionExpireTest is flaky. In the cleanup method, the main thread tries to close a PulsarService but then remains in WAITING state, and the test runs into the one hour timeout.

The logs shared in the issue contain a ConcurrentModificationException:

java.util.ConcurrentModificationException
	at java.base/java.util.ArrayList$Itr.checkForComodification(ArrayList.java:1095)
	at java.base/java.util.ArrayList$Itr.next(ArrayList.java:1049)
	at org.apache.pulsar.broker.service.Ipv4Proxy.disconnectFrontChannels(Ipv4Proxy.java:87)
	at org.apache.pulsar.broker.service.ZkSessionExpireTest.testTopicUnloadAfterSessionRebuild(ZkSessionExpireTest.java:149)

The ZkSessionExpireTest calls Ipv4Proxy.disconnectFrontChannels() to inject a ZK session expire error:

metadataZKProxy.rejectAllConnections();
metadataZKProxy.disconnectFrontChannels();

The method disconnectFrontChannels() looks as follows:

for (Channel channel : frontChannels) {
    channel.close();
}

When channel.close() is called, it triggers FrontendHandler.channelInactive(), which removes the channel from the frontChannels list:

@Override
public void channelInactive(ChannelHandlerContext ctx) {
    frontChannels.remove(ctx.channel());
    ...
}

This can trigger the ConcurrentModificationException.

If the ConcurrentModificationException is thrown, the test execution stops. We don't reach the point of the test where the rejectAllConnections is reverted. Therefore, the cleanup method cannot close the PulsarService properly.

I was able to replicate the ConcurrentModificationException by inserting a Thread.sleep() into disconnectFrontChannels():

public void disconnectFrontChannels() throws InterruptedException {
    for (Channel channel : frontChannels) {
        Thread.sleep(5);
        channel.close();
    }
}

Modifications

Prevent ConcurrentModificationException by iterating over a copy of frontChannels.

Verifying this change

  • Make sure that the change passes the CI checks.

When adding the Thread.sleep(), the ConcurrentModificationException was always thrown. After changing the for-loop to iterate over a copy of frontChannels, even with the Thread.sleep(), no ConcurrentModificationException occurs.

Does this pull request potentially affect one of the following parts:

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: pdolif#13

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Aug 15, 2025
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great description and thanks for addressing this!

@codecov-commenter
Copy link

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.24%. Comparing base (ba4f715) to head (ffd6384).

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #24632      +/-   ##
============================================
+ Coverage     73.56%   74.24%   +0.68%     
- Complexity    32894    33153     +259     
============================================
  Files          1882     1882              
  Lines        146854   146854              
  Branches      16866    16866              
============================================
+ Hits         108030   109035    +1005     
+ Misses        30153    29157     -996     
+ Partials       8671     8662       -9     
Flag Coverage Δ
inttests 26.64% <ø> (?)
systests 23.31% <ø> (+0.01%) ⬆️
unittests 73.74% <ø> (+0.62%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.
see 148 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@lhotari lhotari merged commit 19b7c27 into apache:master Aug 16, 2025
53 checks passed
lhotari pushed a commit that referenced this pull request Aug 18, 2025
lhotari pushed a commit that referenced this pull request Aug 18, 2025
lhotari pushed a commit that referenced this pull request Aug 18, 2025
ganesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Aug 20, 2025
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Aug 20, 2025
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Aug 21, 2025
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Aug 26, 2025
nodece pushed a commit to ascentstream/pulsar that referenced this pull request Aug 26, 2025
Technoboy- pushed a commit to Technoboy-/pulsar that referenced this pull request Sep 10, 2025
@lhotari lhotari added this to the 4.1.0 milestone Sep 17, 2025
KannarFr pushed a commit to CleverCloud/pulsar that referenced this pull request Sep 22, 2025
walkinggo pushed a commit to walkinggo/pulsar that referenced this pull request Oct 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky-test: ZkSessionExpireTest.cleanup

3 participants