Skip to content

Commit 9a7662b

Browse files
stenlarssonpitrou
andauthored
GH-39582: [C++][Acero] Increase size of Acero TempStack (#40007)
We have had problems for a long time with a specific batch job that combines data from different sources. There is something in the data causing an Acero execution plan to hang or crash at random. The problem has been reproduced since Arrow 11.0.0, originally in Ruby, but it has also in Python. There is unfortunately no test case that reliably reproduces the issue in a release build. However, in a debug build we can see that the batch job causes an overflow on the temp stack in arrow/cpp/src/arrow/compute/util.cc:38. Increasing the size of the stack created in the Acero QueryContext works around the issue, but a real fix should be investigated separately. **This PR contains a "Critical Fix".** * Closes: #39582 Lead-authored-by: Sten Larsson <sten@burtcorp.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
1 parent a7ac7e0 commit 9a7662b

2 files changed

Lines changed: 9 additions & 8 deletions

File tree

cpp/src/arrow/acero/query_context.cc

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -53,7 +53,7 @@ size_t QueryContext::max_concurrency() const { return thread_indexer_.Capacity()
5353
Result<util::TempVectorStack*> QueryContext::GetTempStack(size_t thread_index) {
5454
if (!tld_[thread_index].is_init) {
5555
RETURN_NOT_OK(tld_[thread_index].stack.Init(
56-
memory_pool(), 8 * util::MiniBatch::kMiniBatchLength * sizeof(uint64_t)));
56+
memory_pool(), 32 * util::MiniBatch::kMiniBatchLength * sizeof(uint64_t)));
5757
tld_[thread_index].is_init = true;
5858
}
5959
return &tld_[thread_index].stack;

cpp/src/arrow/compute/util.cc

Lines changed: 8 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -32,17 +32,18 @@ using internal::CpuInfo;
3232
namespace util {
3333

3434
void TempVectorStack::alloc(uint32_t num_bytes, uint8_t** data, int* id) {
35-
int64_t old_top = top_;
36-
top_ += PaddedAllocationSize(num_bytes) + 2 * sizeof(uint64_t);
37-
// Stack overflow check
38-
ARROW_DCHECK(top_ <= buffer_size_);
39-
*data = buffer_->mutable_data() + old_top + sizeof(uint64_t);
35+
int64_t new_top = top_ + PaddedAllocationSize(num_bytes) + 2 * sizeof(uint64_t);
36+
// Stack overflow check (see GH-39582).
37+
// XXX cannot return a regular Status because most consumers do not either.
38+
ARROW_CHECK_LE(new_top, buffer_size_) << "TempVectorStack::alloc overflow";
39+
*data = buffer_->mutable_data() + top_ + sizeof(uint64_t);
4040
// We set 8 bytes before the beginning of the allocated range and
4141
// 8 bytes after the end to check for stack overflow (which would
4242
// result in those known bytes being corrupted).
43-
reinterpret_cast<uint64_t*>(buffer_->mutable_data() + old_top)[0] = kGuard1;
44-
reinterpret_cast<uint64_t*>(buffer_->mutable_data() + top_)[-1] = kGuard2;
43+
reinterpret_cast<uint64_t*>(buffer_->mutable_data() + top_)[0] = kGuard1;
44+
reinterpret_cast<uint64_t*>(buffer_->mutable_data() + new_top)[-1] = kGuard2;
4545
*id = num_vectors_++;
46+
top_ = new_top;
4647
}
4748

4849
void TempVectorStack::release(int id, uint32_t num_bytes) {

0 commit comments

Comments
 (0)