[Enhancement] replace brpc bthread with pthread #16634

@chenlinzhong

Search before asking

  • I had searched in the issues and found no similar issues.

Description

Under high load, brpc bthreads cannot be scheduled, so interface callbacks get stuck. A common symptom is:
timeout when waiting for send fragments RPC
This interface is cheap and should return within hundreds of microseconds, but under high load it stays stuck for a long time. Running perf top under high load shows that the threads are all waiting on locks.

[image: perf top output]

The reason is that a bthread needs a free pthread in order to be scheduled. If all pthreads are blocked, the bthreads wait indefinitely. So what blocks the pthreads for so long?

  • The main cause is calls to functions that block the pthread, such as waiting on a lock (std::mutex).

The Doris BE interfaces use std::mutex heavily, so the pthreads are easily exhausted under high load. A single simple query can generate ten thousand to tens of thousands of QPS between BEs, which affects the processing of every brpc request.
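The starvation described above can be reproduced without brpc. The following is a minimal sketch, assuming a fixed set of worker threads as a stand-in for brpc's pthreads; `FixedWorkers` and `short_task_starved` are hypothetical names, not Doris or brpc APIs. Two workers block on a held std::mutex, and a short task queued behind them cannot start until the lock is released.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a fixed set of worker pthreads with a FIFO queue.
class FixedWorkers {
public:
    explicit FixedWorkers(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            threads_.emplace_back([this] { run(); });
        }
    }
    // Drains the remaining queue, then joins every worker.
    ~FixedWorkers() {
        {
            std::lock_guard<std::mutex> g(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> g(mu_);
            queue_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return stop_ || !queue_.empty(); });
                if (queue_.empty()) return;  // stop requested and nothing left
                task = std::move(queue_.front());
                queue_.pop_front();
            }
            task();  // a blocking task occupies this worker until it returns
        }
    }
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> queue_;
    bool stop_ = false;
    std::vector<std::thread> threads_;
};

// Returns true if the short task was still queued while both workers were
// stuck on the lock, and ran once the lock was released.
bool short_task_starved() {
    std::mutex shared_lock;  // plays the role of a contended std::mutex
    std::atomic<int> blocked_started{0};
    std::atomic<bool> short_done{false};
    bool starved = false;
    {
        FixedWorkers workers(2);
        std::unique_lock<std::mutex> holder(shared_lock);  // lock is held elsewhere
        for (int i = 0; i < 2; ++i) {
            workers.submit([&] {
                ++blocked_started;
                std::lock_guard<std::mutex> g(shared_lock);  // blocks the whole worker
            });
        }
        workers.submit([&] { short_done = true; });  // the cheap handler
        while (blocked_started.load() < 2) std::this_thread::yield();
        starved = !short_done.load();  // no free worker, so it has not run
        holder.unlock();
    }  // workers drain the queue and join here
    return starved && short_done.load();
}
```

In this sketch the short task is delayed by exactly as long as the lock is held, regardless of how cheap the task itself is.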

How to solve

There are usually three approaches:

  • 1. brpc_num_thread: increase the number of threads
  • 2. Replace std::mutex with bthread::mutex
  • 3. brpc + thread pool: brpc no longer processes requests and only sends and receives network traffic; each received request is handed to a thread pool

Method 1 treats the symptom rather than the root cause: pthread exhaustion will still affect all brpc interfaces.
Method 2 is hard to put into practice with many contributors, because not everyone can know which functions need a bthread mutex and which need a pthread mutex.
Method 3 completely removes the impact of pthread exhaustion.

Solution

Add a thread pool to handle the BE service logic, so that brpc no longer executes it.
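A minimal sketch of the hand-off, assuming a plain C++ worker pool rather than any actual Doris or brpc class: `WorkPool`, `FragmentService`, `exec_fragment` and `offload_demo` are hypothetical names, and `done` stands in for the RPC completion callback. The handler only enqueues the work and returns, so a blocking std::mutex stalls a pool thread instead of the thread that received the request.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical worker pool that runs the service logic.
class WorkPool {
public:
    explicit WorkPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            threads_.emplace_back([this] { run(); });
        }
    }
    // Drains the remaining queue, then joins every worker.
    ~WorkPool() {
        {
            std::lock_guard<std::mutex> g(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> g(mu_);
            queue_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return stop_ || !queue_.empty(); });
                if (queue_.empty()) return;
                task = std::move(queue_.front());
                queue_.pop_front();
            }
            task();
        }
    }
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> queue_;
    bool stop_ = false;
    std::vector<std::thread> threads_;
};

// Hypothetical service; the method name is illustrative only.
class FragmentService {
public:
    explicit FragmentService(std::size_t workers) : pool_(workers) {}

    // Runs on the thread that received the request: enqueue and return.
    void exec_fragment(std::string request, std::function<void(std::string)> done) {
        pool_.submit([this, request = std::move(request), done = std::move(done)] {
            {
                std::lock_guard<std::mutex> g(state_mu_);  // may block a pool thread only
                ++handled_;
            }
            done("ok:" + request);
        });
    }
    std::mutex& state_mutex() { return state_mu_; }

private:
    std::mutex state_mu_;
    int handled_ = 0;
    WorkPool pool_;  // declared last so it is joined before the other members are destroyed
};

std::string offload_demo() {
    std::promise<std::string> reply;
    std::future<std::string> result = reply.get_future();
    FragmentService service(2);
    {
        // Hold the service lock to imitate contention; the call below still returns at once.
        std::lock_guard<std::mutex> busy(service.state_mutex());
        service.exec_fragment("q1", [&reply](std::string r) { reply.set_value(std::move(r)); });
    }
    return result.get();  // the reply arrives from a pool thread after the lock is released
}
```

The trade-off in this sketch is that contention moves into the pool: if every pool thread blocks, requests queue up there, but the threads that receive and send network traffic stay free.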

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
