Skip to content

Replicated Subscription fail with IllegalStateException #10097

@lhotari

Description

@lhotari

Problem

This exception is repeated on the log when using replicated subscriptions:

07:17:59.770 [bookkeeper-ml-workers-OrderedExecutor-4-0] ERROR org.apache.pulsar.broker.service.persistent.PersistentReplicator - [persistent://georep/default/t1][cluster-a -> cluster-b] Unexpected exception: Field 'replicated_from' is not set
java.lang.IllegalStateException: Field 'replicated_from' is not set
at org.apache.pulsar.common.api.proto.MessageMetadata.getReplicatedFrom(MessageMetadata.java:151) ~[org.apache.pulsar-pulsar-common-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
at org.apache.pulsar.broker.service.persistent.PersistentReplicator.checkReplicatedSubscriptionMarker(PersistentReplicator.java:763) ~[org.apache.pulsar-pulsar-broker-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
at org.apache.pulsar.broker.service.persistent.PersistentReplicator.readEntriesComplete(PersistentReplicator.java:366) ~[org.apache.pulsar-pulsar-broker-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
at org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$checkReadCompletion$2(OpReadEntry.java:156) ~[org.apache.pulsar-managed-ledger-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) [org.apache.pulsar-managed-ledger-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.13.0.jar:4.13.0]
at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.13.0.jar:4.13.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]

To reproduce

  1. Create 2-cluster k8s deployment (sample script/helm to do so)
  2. Create consumer with replicated subscription for topic georep/default/t1 in cluster-a and close it
  3. Create consumer with replicated subscription for topic georep/default/t1 in cluster-b and close it
  4. Create producer for topic georep/default/t1 in cluster-a and publish messages to the topic
  5. Create consumer with replicated subscription for topic georep/default/t1 in cluster-a, consume 1 message and close it
  6. Create consumer with replicated subscription for topic georep/default/t1 in cluster-b, consume 1 message and close it

It might be possible to reproduce the issue with fewer steps.

Observations

It seems that the code location broker after the switch to LightProto (#9046).
The fix is easy for the exception above. The concern is the lack of test coverage for replicated subscriptions.

Metadata

Metadata

Assignees

No one assigned

    Labels

    type/bugThe PR fixed a bug or issue reported a bug

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions