[VL] Global memory OOM #7249
Open
Description
Backend
VL (Velox)
Bug description
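This is a new issue triggered by #6988.
The root cause is that Velox's sort (OrderBy) needs to allocate a large buffer from global memory when a spill is triggered. In the log below, a 256.00MB request made through the __sys_spilling__ pool fails against the 3.00GB memory allocator limit. This looks like a design issue in the spill path.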
W20240914 06:04:39.696241 48552 MallocAllocator.cpp:267] [MEM] Failed to allocateBytes 256.00MB: Exceeded memory allocator limit of 3.00GB
E20240914 06:04:39.696458 48552 Exceptions.h:67] Line: /home/binweiyang/gluten/ep/build-velox/build/velox_ep/velox/common/memory/MemoryPool.cpp:1314, Function:handleAllocationFailure, Expression: allocate failed with 256.00MB from Memory Pool[__sys_spilling__ LEAF root[__sys_root__] parent[__sys_root__] MALLOC no-usage-track thread-safe]<unlimited max capacity unlimited capacity used 0B available 0B reservation [used 0B, reserved 0B, min 0B] counters [allocs 109, frees 103, reserves 0, releases 0, collisions 0])> Failed to allocateBytes 256.00MB: Exceeded memory allocator limit of 3.00GB, Source: RUNTIME, ErrorCode: MEM_ALLOC_ERROR
24/09/14 06:04:39 ERROR [Executor task launch worker for task 1188.0 in stage 2.0 (TID 116257)] listener.ManagedReservationListener: Error reserving memory from target
org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: MEM_ALLOC_ERROR
Reason: allocate failed with 256.00MB from Memory Pool[__sys_spilling__ LEAF root[__sys_root__] parent[__sys_root__] MALLOC no-usage-track thread-safe]<unlimited max capacity unlimited capacity used 0B available 0B reservation [used 0B, reserved 0B, min 0B] counters [allocs 109, frees 103, reserves 0, releases 0, collisions 0])> Failed to allocateBytes 256.00MB: Exceeded memory allocator limit of 3.00GB
Retriable: True
Context: Operator: OrderBy[1] 1
Function: handleAllocationFailure
File: /home/binweiyang/gluten/ep/build-velox/build/velox_ep/velox/common/memory/MemoryPool.cpp
Line: 1314
Stack trace:
# 0 _ZN8facebook5velox7process10StackTraceC1Ei
# 1 _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2 _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
# 3 _ZN8facebook5velox6memory14MemoryPoolImpl23handleAllocationFailureERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
# 4 _ZN8facebook5velox6memory14MemoryPoolImpl8allocateEl
# 5 _ZN8facebook5velox4exec7Spiller13fillSpillRunsEPNS1_20RowContainerIteratorE
# 6 _ZN8facebook5velox4exec7Spiller5spillEPKNS1_20RowContainerIteratorE
# 7 _ZN8facebook5velox4exec10SortBuffer10spillInputEv
# 8 _ZN8facebook5velox4exec7OrderBy7reclaimEmRNS0_6memory15MemoryReclaimer5StatsE
# 9 _ZNSt17_Function_handlerIFlvEZN8facebook5velox4exec8Operator15MemoryReclaimer7reclaimEPNS2_6memory10MemoryPoolEmmRNS6_15MemoryReclaimer5StatsEEUlvE_E9_M_invokeERKSt9_Any_data
# 10 _ZN8facebook5velox6memory15MemoryReclaimer3runERKSt8functionIFlvEERNS2_5StatsE
# 11 _ZN8facebook5velox4exec8Operator15MemoryReclaimer7reclaimEPNS0_6memory10MemoryPoolEmmRNS4_15MemoryReclaimer5StatsE
# 12 _ZN8facebook5velox6memory15MemoryReclaimer7reclaimEPNS1_10MemoryPoolEmmRNS2_5StatsE
# 13 _ZN8facebook5velox4exec23ParallelMemoryReclaimer7reclaimEPNS0_6memory10MemoryPoolEmmRNS3_15MemoryReclaimer5StatsE
# 14 _ZN8facebook5velox6memory15MemoryReclaimer7reclaimEPNS1_10MemoryPoolEmmRNS2_5StatsE
# 15 _ZN8facebook5velox4exec4Task15MemoryReclaimer11reclaimTaskERKSt10shared_ptrIS2_EmmRNS0_6memory15MemoryReclaimer5StatsE
# 16 _ZN8facebook5velox4exec4Task15MemoryReclaimer7reclaimEPNS0_6memory10MemoryPoolEmmRNS4_15MemoryReclaimer5StatsE
# 17 _ZN8facebook5velox6memory15MemoryReclaimer7reclaimEPNS1_10MemoryPoolEmmRNS2_5StatsE
# 18 _ZN6gluten20ListenableArbitrator14shrinkCapacityEmbb
# 19 _ZN6gluten24WholeStageResultIterator14spillFixedSizeEl
# 20 Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeSpill
# 21 0x00007ff1f89bf427
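
The failure mode in the trace above can be illustrated with a toy model. This is only a sketch, not Velox or Gluten code: `CappedAllocator`, `spill_sort_buffer` and `try_spill` are hypothetical names, and the usage figures are made up. Only the 256MB request size and the 3GB limit come from the log. The point it shows is that a spill which must allocate scratch memory before it frees anything can fail exactly when the allocator is nearly full, which is when spilling is needed.

```python
MB = 1024 * 1024
GB = 1024 * MB


class AllocatorLimitExceeded(Exception):
    """Raised when a request would push usage past the global cap."""


class CappedAllocator:
    """Toy process-wide allocator with a hard byte limit.

    Hypothetical stand-in for illustration; not Velox's MallocAllocator.
    """

    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = 0

    def allocate(self, nbytes):
        if self.used + nbytes > self.limit:
            raise AllocatorLimitExceeded(
                f"failed to allocate {nbytes // MB}MB: "
                f"limit {self.limit // MB}MB, used {self.used // MB}MB"
            )
        self.used += nbytes

    def free(self, nbytes):
        self.used -= nbytes


def spill_sort_buffer(allocator, sort_buffer_bytes, scratch_bytes):
    """Spill a sort buffer; scratch memory is needed before anything is freed."""
    allocator.allocate(scratch_bytes)  # may fail when the allocator is nearly full
    allocator.free(sort_buffer_bytes)  # rows written to disk, buffer released
    allocator.free(scratch_bytes)      # scratch released after the spill
    return sort_buffer_bytes


def try_spill(limit_bytes, used_bytes, scratch_bytes):
    """Return the number of bytes reclaimed, or None if the spill itself fails."""
    allocator = CappedAllocator(limit_bytes)
    allocator.allocate(used_bytes)  # memory already held by the sort buffer
    try:
        return spill_sort_buffer(allocator, used_bytes, scratch_bytes)
    except AllocatorLimitExceeded:
        return None


# Enough headroom for the scratch buffer: the spill reclaims the sort buffer.
print(try_spill(3 * GB, 2 * GB, 256 * MB))
# Only 128MB of headroom: the 256MB scratch request fails, nothing is reclaimed.
print(try_spill(3 * GB, 3 * GB - 128 * MB, 256 * MB))
```

In this model the second call fails for the same reason as the log: the allocator has less free space than the 256MB the spill asks for. Under that reading, the process in the report was using more than 2.75GB when the spill started.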
Spark version
None
Spark configurations
No response
System information
No response
Relevant logs
No response