-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[fix](core)Provide add_blocks interface for VDataStreamRecvr #57888
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix](core)Provide add_blocks interface for VDataStreamRecvr #57888
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
TPC-H: Total hot run time: 34615 ms |
TPC-DS: Total hot run time: 188302 ms |
ClickBench: Total hot run time: 27.68 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
TPC-H: Total hot run time: 34569 ms |
TPC-DS: Total hot run time: 187685 ms |
ClickBench: Total hot run time: 27.5 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run p0 |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run performance |
TPC-H: Total hot run time: 34169 ms |
TPC-DS: Total hot run time: 187487 ms |
ClickBench: Total hot run time: 27.24 s |
| } | ||
| iter->second = packet_seq; | ||
| } else { | ||
| _packet_seq_map.emplace(be_number, packet_seq); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the packet_seq seem different with origin logic
|
run buildall |
TPC-H: Total hot run time: 34297 ms |
TPC-DS: Total hot run time: 187437 ms |
ClickBench: Total hot run time: 27.58 s |
HappenLee
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
TPC-H: Total hot run time: 34721 ms |
TPC-DS: Total hot run time: 189659 ms |
ClickBench: Total hot run time: 27.59 s |
|
run buildall |
TPC-H: Total hot run time: 34452 ms |
TPC-DS: Total hot run time: 188831 ms |
ClickBench: Total hot run time: 27.6 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
HappenLee
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
PR approved by at least one committer and no changes requested. |
| _done_cb = nullptr; | ||
| _wait_timer.stop(); | ||
| int64_t elapse_time = _wait_timer.elapsed_time(); | ||
| if (recvr->_max_wait_to_process_time->value() < elapse_time) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a CAS here
…57888) We previously had a crash. The cause is that we should not access the request after calling add_block(...) because add_block may enqueue a closure that runs on another thread and frees the request ``` ==730145==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7be1efd803a0 at pc 0x556b38d9d625 bp 0x7b16bf0193f0 sp 0x7b16bf0193e8 READ of size 4 at 0x7be1efd803a0 thread T1559 #0 0x556b38d9d624 in google::protobuf::internal::RepeatedPtrFieldBase::size() const /home/zcp/repo_center/doris_master/doris/thirdparty/installed/include/google/protobuf/repeated_ptr_field.h:185:29 apache#1 0x556b408ab062 in google::protobuf::RepeatedPtrField<doris::PBlock>::size() const /home/zcp/repo_center/doris_master/doris/thirdparty/installed/include/google/protobuf/repeated_ptr_field.h:1248:32 apache#2 0x556b408aaff4 in doris::PTransmitDataParams::_internal_blocks_size() const /home/zcp/repo_center/doris_master/doris/be/../gensrc/build/gen_cpp/internal_service.pb.h:32149:25 apache#3 0x556b4089731c in doris::PTransmitDataParams::blocks_size() const /home/zcp/repo_center/doris_master/doris/be/../gensrc/build/gen_cpp/internal_service.pb.h:32152:10 apache#4 0x556b60a83c17 in doris::vectorized::VDataStreamMgr::transmit_block(doris::PTransmitDataParams const*, google::protobuf::Closure**, long) /home/zcp/repo_center/doris_master/doris/be/src/vec/runtime/vdata_stream_mgr.cpp:150:38 apache#5 0x556b407f7408 in doris::PInternalService::_transmit_block(google::protobuf::RpcController*, doris::PTransmitDataParams const*, doris::PTransmitDataResult*, google::protobuf::Closure*, doris::Status const&, long) /home/zcp/repo_center/doris_master/doris/be/src/service/internal_service.cpp:1673:40 apache#6 0x556b407f52bb in doris::PInternalService::transmit_block(google::protobuf::RpcController*, doris::PTransmitDataParams const*, doris::PTransmitDataResult*, google::protobuf::Closure*) /home/zcp/repo_center/doris_master/doris/be/src/service/internal_service.cpp:1610:9 apache#7 0x556b43fceba2 in doris::PBackendService::CallMethod(google::protobuf::MethodDescriptor const*, google::protobuf::RpcController*, google::protobuf::Message const*, google::protobuf::Message*, google::protobuf::Closure*) /home/zcp/repo_center/doris_master/doris/gensrc/build/gen_cpp/internal_service.pb.cc:49452:7 apache#8 0x556b6736273e in brpc::policy::ProcessRpcRequest(brpc::InputMessageBase*) (/mnt/hdd01/selectdb-cloud-chaos/cluster0/be/lib/doris_be+0x770bd73e) apache#9 0x556b67357426 in brpc::ProcessInputMessage(void*) (/mnt/hdd01/selectdb-cloud-chaos/cluster0/be/lib/doris_be+0x770b2426) apache#10 0x556b67357f20 in brpc::InputMessenger::InputMessageClosure::~InputMessageClosure() (/mnt/hdd01/selectdb-cloud-chaos/cluster0/be/lib/doris_be+0x770b2f20) apache#11 0x556b673588dd in brpc::InputMessenger::OnNewMessages(brpc::Socket*) (/mnt/hdd01/selectdb-cloud-chaos/cluster0/be/lib/doris_be+0x770b38dd) apache#12 0x556b674a0adc in brpc::Socket::ProcessEvent(void*) (/mnt/hdd01/selectdb-cloud-chaos/cluster0/be/lib/doris_be+0x771fbadc) apache#13 0x556b672e0f76 in bthread::TaskGroup::task_runner(long) (/mnt/hdd01/selectdb-cloud-chaos/cluster0/be/lib/doris_be+0x7703bf76) apache#14 0x556b672cbbe0 in bthread_make_fcontext (/mnt/hdd01/selectdb-cloud-chaos/cluster0/be/lib/doris_be+0x77026be0) ``` ``` Note: The done pointer will be saved in add_block and may be called in another thread via done->Run(). For example, when blocks_size == 1, the process is as follows: transmit_block (i=0) └─> recvr->add_block(..., done, ...) // Pass done └─> SenderQueue::add_block └─> _pending_closures.push(done) // done is saved get_batch() [another thread] └─> closure_pair.first->Run() //⚠️ done->Run() is called └─> brpc releases request and response transmit_block (i=1) [original thread continues] └─> request->blocks_size() //⚠️ request has already been released! At this point, a use-after-free issue occurs. TODO: We should consider refactoring this part because add_block may release the request. We should not access the request after calling add_block. ``` apache#50113
What problem does this PR solve?
We previously had a crash. The cause is that we should not access the request after calling add_block(...) because add_block may enqueue a closure that runs on another thread and frees the request
#50113
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)