[Bug] Topic unloading during Compaction leads to messages loss and messages redelivery loop #21074

@michalcukierman

Description

Search before asking

  • I searched in the issues and found nothing similar.

Version

3.1.0 using official Helm chart
Latest build from master using official Helm chart

Minimal reproduce step

Use 3 bookies, 3 brokers, 3 proxies

  1. Create a compacted, partitioned topic with the compaction threshold set to 1 GB
  2. Produce 100k messages of 100 KB each
  3. Create an exclusive consumer and start reading
  4. Produce another 100k messages with the same keys
    -- wait for the messages to be produced and for compaction to finish
  5. Create a new exclusive consumer and read all the messages

What did you expect to see?

Compaction finishes successfully.
The first consumer receives between 100k and 200k messages.
The second consumer receives 100k messages.
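
The expected counts follow from how a compacted topic is supposed to behave: only the latest message per key is retained in the compacted view. The snippet below is a minimal in-memory sketch of that rule, not Pulsar code. The `compact` helper and the scaled-down `NUM_KEYS` are hypothetical names used for illustration, and it assumes the second consumer reads the compacted view.

```python
def compact(log):
    """Return the compacted view of a keyed log: the latest value per key,
    ordered by the position of that latest write."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later writes overwrite earlier ones
    ordered = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_offset, value) in ordered]


NUM_KEYS = 1000  # scaled down from the 100k keys used in the report

# Two production passes over the same set of keys (steps 2 and 4).
first_pass = [(f"key-{i}", f"v1-{i}") for i in range(NUM_KEYS)]
second_pass = [(f"key-{i}", f"v2-{i}") for i in range(NUM_KEYS)]
log = first_pass + second_pass

compacted = compact(log)

print(len(log))        # every message ever produced
print(len(compacted))  # one message per key after compaction
```

Under this model, a consumer that starts after compaction sees exactly one message per key, while a consumer that was already reading sees somewhere between one and two messages per key, depending on how far it had read before compaction completed.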

What did you see instead?

  • I see in the error logs that the compaction fails.
Screenshot 2023-08-27 at 18 02 33
  • The first consumer often starts receiving messages that were already delivered (millions of messages; it never ends)
  • The second consumer cannot always receive all the messages. Sometimes it is able to finish, sometimes it is not (it falls into the same loop)
  • The backlog of the subscription does not change while the consumers are reading

Reads come from one broker and the backlog is not changing (the screenshots are from my environment):
Screenshot 2023-08-27 at 18 29 22
Screenshot 2023-08-27 at 18 45 05

Anything else?

The reproduction steps depend on the setup and on the speed of the persistent storage. It's possible that rebalancing or adding a new broker affects the outcome. I cannot provide a reliable way to reproduce it, as it happens randomly. Nevertheless, it happens to us almost every day.

Now, I produced:
30 GB of 30 KB files x 2
5 GB of 5 MB files x 2
4.6 GB of 130 KB files x 2
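
For a rough sense of scale, the volumes above can be converted into approximate message counts. This is a back-of-the-envelope sketch only: it assumes decimal units, one message per file, and that "x 2" means each batch was produced twice. None of these assumptions is stated explicitly in the report.

```python
KB, MB, GB = 10**3, 10**6, 10**9  # decimal units (assumption)

# (label, total bytes per pass, bytes per message)
batches = [
    ("30 KB files", 30 * GB, 30 * KB),
    ("5 MB files", 5 * GB, 5 * MB),
    ("130 KB files", 4600 * MB, 130 * KB),
]

# Approximate number of messages per production pass, per batch.
per_pass = {label: total // size for label, total, size in batches}
messages_per_pass = sum(per_pass.values())
messages_total = 2 * messages_per_pass  # "x 2": produced twice (assumption)

for label, count in per_pass.items():
    print(f"{label}: ~{count} messages per pass")
print(f"per pass: ~{messages_per_pass}, both passes: ~{messages_total}")
```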

I reconnected the client and started reading. The client received 5 048 330 messages, but only 661 897 of them were unique.
The consumer summary after 93 minutes:
#93 Processing 1 message(s). Totally received: 5140219, already processed: 5140218 (ACK: 5140218.0, NACK: 0.0, Invalid: 0.0, Failed to ACK/NACK: 0.0
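
Counters like "totally received" and "already processed" can be kept with a small client-side tracker. The sketch below is an assumption about how such a counter might work, not the code behind the summary above. `DeliveryTracker` and the sample message IDs are hypothetical. It also works out the redelivery volume implied by the received and unique figures reported above.

```python
class DeliveryTracker:
    """Counts total deliveries and how many distinct message IDs were seen."""

    def __init__(self):
        self.total = 0
        self.seen = set()

    def record(self, message_id):
        """Register one delivery; return True if the ID was not seen before."""
        self.total += 1
        if message_id in self.seen:
            return False
        self.seen.add(message_id)
        return True

    @property
    def unique(self):
        return len(self.seen)

    @property
    def redelivered(self):
        return self.total - self.unique


# Tiny illustration with made-up IDs: "m1" and "m2" are delivered repeatedly.
tracker = DeliveryTracker()
for message_id in ["m1", "m2", "m3", "m1", "m2", "m1"]:
    tracker.record(message_id)
print(tracker.total, tracker.unique, tracker.redelivered)

# Figures from the report: 5 048 330 received, 661 897 unique.
reported_total = 5_048_330
reported_unique = 661_897
reported_redelivered = reported_total - reported_unique
amplification = reported_total / reported_unique
print(reported_redelivered, round(amplification, 2))
```

With the reported figures, roughly 4.4 million of the deliveries were repeats, meaning each unique message was delivered between 7 and 8 times on average.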

My internal stats:
internal.stats.txt

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Labels

Stale, type/bug