[improve] [broker] filter system topics while shedding #18936

Closed
thetumbled wants to merge 1 commit into apache:branch-2.9 from thetumbled:improve_filter_systemTopicsBundle

@thetumbled thetumbled commented Dec 15, 2022

Member

Fixes #18935

Motivation

Topics and bundles are unloaded during load shedding, but some special topics should not be unloaded. For example, if transaction_coordinator_assign is unloaded, the corresponding transaction coordinator (TC) needs to be recovered, which is time-consuming.
So we had better avoid unloading these topics. I found that this filtering has already been implemented in the latest branches, but not in branch-2.9.

Modifications

Filter system topics while shedding in branch-2.9.
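
As a rough illustration of the idea (not the code in this PR), the filtering can be sketched as dropping every bundle that belongs to the system namespace before the shedding strategy picks candidates. The class and method names below are hypothetical, and the sketch assumes that a bundle name has the form `<tenant>/<namespace>/<hash-range>` and that the system namespace is `pulsar/system`, as mentioned later in this conversation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not the actual Pulsar implementation.
final class SheddingFilterSketch {
    // Assumption: system topics live in the "pulsar/system" namespace.
    static final String SYSTEM_NAMESPACE = "pulsar/system";

    // Assumption: a bundle name looks like "<tenant>/<namespace>/<hash-range>",
    // so everything before the last '/' is the namespace.
    static String namespaceOf(String bundle) {
        int lastSlash = bundle.lastIndexOf('/');
        return lastSlash < 0 ? bundle : bundle.substring(0, lastSlash);
    }

    static boolean isSystemBundle(String bundle) {
        return SYSTEM_NAMESPACE.equals(namespaceOf(bundle));
    }

    // Returns only the bundles that a shedding strategy may consider for unloading,
    // preserving the iteration order of the input map.
    static <V> Map<String, V> candidatesForShedding(Map<String, V> bundleData) {
        return bundleData.entrySet().stream()
                .filter(e -> !isSystemBundle(e.getKey()))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }
}
```

A shedding strategy would then iterate over `candidatesForShedding(...)` instead of the full bundle map, so system bundles are never selected for unloading.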

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is already covered by existing tests, such as (please describe tests).

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: thetumbled#9

@congbobo184 congbobo184 left a comment

Contributor

A system topic can be unloaded, allowing it to be load balanced. If you filter the system topic, it may lead to uneven resource allocation.

@thetumbled

Member Author

> A system topic can be unloaded, allowing it to be load balanced. If you filter the system topic, it may lead to uneven resource allocation.

In the master branch, topics/bundles in pulsar/system are filtered out. See
org.apache.pulsar.broker.loadbalance.LoadData#getBundleDataForLoadShedding
[three images]

Actually, we can distribute the bundles containing transaction_coordinator_assign evenly and, at the same time, avoid unloading them while shedding.

In our production clusters, we use AvgShedder described in #18186.

  • When the cluster is initializing or a broker is restarted, bundles are distributed randomly or by a hashing algorithm, which approximates a uniform distribution. So we can ensure that bundles containing transaction_coordinator_assign are distributed evenly across brokers.
  • When we need to shed load, we filter out bundles containing transaction_coordinator_assign to avoid TC recovery.
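
The two points above can be sketched roughly as follows. This is not AvgShedder or any Pulsar code; every name is hypothetical. It assumes placement by a simple hash of the bundle name and that a bundle's topics are available as a set of topic-name strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch for illustration only; not AvgShedder or Pulsar code.
final class TcAwarePlacementSketch {
    static final String TC_ASSIGN = "transaction_coordinator_assign";

    // Hash-based placement: maps a bundle name to one of the brokers.
    // floorMod keeps the index non-negative even for negative hash codes.
    static String pickBroker(String bundle, List<String> brokers) {
        return brokers.get(Math.floorMod(bundle.hashCode(), brokers.size()));
    }

    // True if any topic in the bundle is a transaction_coordinator_assign topic.
    static boolean hostsTcAssign(Set<String> topicsInBundle) {
        return topicsInBundle.stream().anyMatch(t -> t.contains(TC_ASSIGN));
    }

    // Bundles that may be unloaded during shedding, in sorted order;
    // bundles hosting a TC assign topic are skipped to avoid TC recovery.
    static List<String> sheddable(Map<String, Set<String>> topicsByBundle) {
        List<String> result = new ArrayList<>();
        new TreeMap<>(topicsByBundle).forEach((bundle, topics) -> {
            if (!hostsTcAssign(topics)) {
                result.add(bundle);
            }
        });
        return result;
    }
}
```

Initial placement and shedding are independent here: placement never looks at the topic contents, and shedding only narrows the candidate set, so the even spread produced at startup is left untouched for the TC bundles.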

@congbobo184

Contributor

> In the master branch, topics/bundles in pulsar/system are filtered out.

I think this logic was introduced by mistake in PR #15252.

> Actually, we can distribute the bundles containing transaction_coordinator_assign evenly and, at the same time, avoid unloading them while shedding.
>
> In our production clusters, we use AvgShedder described in #18186.

I will look at the PIP later.

I also think transaction_coordinator_assign can be shed at any time. We could use a smoother strategy, but that should not prevent it from being shed.

@thetumbled

Member Author

> In the master branch, topics/bundles in pulsar/system are filtered out.
>
> I think this logic was introduced by mistake in PR #15252.

Should I raise a PR to revert the mistake in the master branch?

> I also think transaction_coordinator_assign can be shed at any time. We could use a smoother strategy, but that should not prevent it from being shed.

I think load-balancing strategies such as ThresholdShedder do not work well with transaction_coordinator_assign: they trigger many meaningless bundle unloads. The cost of TC recovery is also very high; in our test there were more than 20 minutes of unavailability.

@congbobo184

Contributor

> Should I raise a PR to revert the mistake in the master branch?

Yes, I think we need a PR to revert this change.

> I think load-balancing strategies such as ThresholdShedder do not work well with transaction_coordinator_assign: they trigger many meaningless bundle unloads. The cost of TC recovery is also very high; in our test there were more than 20 minutes of unavailability.

I think we need to find out why the TC recovers so slowly. Is it a logic error, or do we need to expand the TC?

If it is a logic error, we need to fix it.

Later I will think about how to optimize the TC recovery time.

@github-actions


The PR had no activity for 30 days; marked with the Stale label.
