[VL] Global memory OOM #7249

@FelixYBW

Backend

VL (Velox)

Bug description

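This is a new issue triggered by #6988.

The root cause is that Velox's sort needs to allocate a large memory buffer from global memory when a spill is triggered. In the log below, the spill path requests 256.00MB from the `__sys_spilling__` pool while the allocator is already close to its 3.00GB limit, so the allocation fails and the spill cannot proceed. This looks like a design issue in how the spill buffer is allocated.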
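The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyAllocator` and `trySpill` are hypothetical names invented for illustration and are not Velox or Gluten APIs. The 3GB limit and 256MB request come from the log; the usage figures in the usage note are made up. The point is that a spill which must first allocate scratch memory from a hard-limited allocator cannot run when that allocator is nearly full, even though the spill would free far more than it needs.

```cpp
#include <cstdint>
#include <stdexcept>

// Toy stand-in for a process-wide allocator with a hard byte limit.
// Hypothetical type for illustration only; this is not Velox's MallocAllocator.
class ToyAllocator {
 public:
  explicit ToyAllocator(int64_t limitBytes) : limit_(limitBytes) {}

  // Throws when the request would push usage past the hard limit.
  void allocate(int64_t bytes) {
    if (used_ + bytes > limit_) {
      throw std::runtime_error("Exceeded memory allocator limit");
    }
    used_ += bytes;
  }

  void release(int64_t bytes) { used_ -= bytes; }

  int64_t used() const { return used_; }

 private:
  int64_t limit_;
  int64_t used_{0};
};

// Toy spill: it must obtain a scratch buffer BEFORE it can free any data.
// Returns true if the spill ran and released dataBytes, false if the
// scratch allocation itself failed (the situation reported in this issue).
bool trySpill(ToyAllocator& allocator, int64_t scratchBytes, int64_t dataBytes) {
  try {
    allocator.allocate(scratchBytes);
  } catch (const std::runtime_error&) {
    return false;  // Nothing was freed; memory pressure is unchanged.
  }
  allocator.release(dataBytes);     // Data written out, memory returned.
  allocator.release(scratchBytes);  // Scratch buffer returned.
  return true;
}
```

With a 3072MB limit and an assumed 2900MB already in use, `trySpill(allocator, 256MB, 2000MB)` returns false and usage stays at 2900MB, whereas the same call succeeds when only 2000MB is in use. Reserving the scratch space up front, or sizing it from what is actually available, would be ways to avoid the dead end in this toy model; whether either fits Velox's design is not established here.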

W20240914 06:04:39.696241 48552 MallocAllocator.cpp:267] [MEM] Failed to allocateBytes 256.00MB: Exceeded memory allocator limit of 3.00GB
E20240914 06:04:39.696458 48552 Exceptions.h:67] Line: /home/binweiyang/gluten/ep/build-velox/build/velox_ep/velox/common/memory/MemoryPool.cpp:1314, Function:handleAllocationFailure, Expression:  allocate failed with 256.00MB from Memory Pool[__sys_spilling__ LEAF root[__sys_root__] parent[__sys_root__] MALLOC no-usage-track thread-safe]<unlimited max capacity unlimited capacity used 0B available 0B reservation [used 0B, reserved 0B, min 0B] counters [allocs 109, frees 103, reserves 0, releases 0, collisions 0])> Failed to allocateBytes 256.00MB: Exceeded memory allocator limit of 3.00GB, Source: RUNTIME, ErrorCode: MEM_ALLOC_ERROR
24/09/14 06:04:39 ERROR [Executor task launch worker for task 1188.0 in stage 2.0 (TID 116257)] listener.ManagedReservationListener: Error reserving memory from target
org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: MEM_ALLOC_ERROR
Reason: allocate failed with 256.00MB from Memory Pool[__sys_spilling__ LEAF root[__sys_root__] parent[__sys_root__] MALLOC no-usage-track thread-safe]<unlimited max capacity unlimited capacity used 0B available 0B reservation [used 0B, reserved 0B, min 0B] counters [allocs 109, frees 103, reserves 0, releases 0, collisions 0])> Failed to allocateBytes 256.00MB: Exceeded memory allocator limit of 3.00GB
Retriable: True
Context: Operator: OrderBy[1] 1
Function: handleAllocationFailure
File: /home/binweiyang/gluten/ep/build-velox/build/velox_ep/velox/common/memory/MemoryPool.cpp
Line: 1314
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
# 3  _ZN8facebook5velox6memory14MemoryPoolImpl23handleAllocationFailureERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
# 4  _ZN8facebook5velox6memory14MemoryPoolImpl8allocateEl
# 5  _ZN8facebook5velox4exec7Spiller13fillSpillRunsEPNS1_20RowContainerIteratorE
# 6  _ZN8facebook5velox4exec7Spiller5spillEPKNS1_20RowContainerIteratorE
# 7  _ZN8facebook5velox4exec10SortBuffer10spillInputEv
# 8  _ZN8facebook5velox4exec7OrderBy7reclaimEmRNS0_6memory15MemoryReclaimer5StatsE
# 9  _ZNSt17_Function_handlerIFlvEZN8facebook5velox4exec8Operator15MemoryReclaimer7reclaimEPNS2_6memory10MemoryPoolEmmRNS6_15MemoryReclaimer5StatsEEUlvE_E9_M_invokeERKSt9_Any_data
# 10 _ZN8facebook5velox6memory15MemoryReclaimer3runERKSt8functionIFlvEERNS2_5StatsE
# 11 _ZN8facebook5velox4exec8Operator15MemoryReclaimer7reclaimEPNS0_6memory10MemoryPoolEmmRNS4_15MemoryReclaimer5StatsE
# 12 _ZN8facebook5velox6memory15MemoryReclaimer7reclaimEPNS1_10MemoryPoolEmmRNS2_5StatsE
# 13 _ZN8facebook5velox4exec23ParallelMemoryReclaimer7reclaimEPNS0_6memory10MemoryPoolEmmRNS3_15MemoryReclaimer5StatsE
# 14 _ZN8facebook5velox6memory15MemoryReclaimer7reclaimEPNS1_10MemoryPoolEmmRNS2_5StatsE
# 15 _ZN8facebook5velox4exec4Task15MemoryReclaimer11reclaimTaskERKSt10shared_ptrIS2_EmmRNS0_6memory15MemoryReclaimer5StatsE
# 16 _ZN8facebook5velox4exec4Task15MemoryReclaimer7reclaimEPNS0_6memory10MemoryPoolEmmRNS4_15MemoryReclaimer5StatsE
# 17 _ZN8facebook5velox6memory15MemoryReclaimer7reclaimEPNS1_10MemoryPoolEmmRNS2_5StatsE
# 18 _ZN6gluten20ListenableArbitrator14shrinkCapacityEmbb
# 19 _ZN6gluten24WholeStageResultIterator14spillFixedSizeEl
# 20 Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeSpill
# 21 0x00007ff1f89bf427
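
The frames above are mangled Itanium C++ ABI symbols, which makes the call chain hard to read. As a convenience, here is a minimal sketch of a demangling helper built on `abi::__cxa_demangle` from `<cxxabi.h>` (available with GCC and Clang). The helper name `demangle` is my own and is not part of Velox or Gluten.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <string>

// Returns the demangled form of an Itanium-ABI symbol name.
// If demangling fails, the input string is returned unchanged.
std::string demangle(const std::string& mangled) {
  int status = 0;
  // With a null output buffer, __cxa_demangle mallocs the result,
  // so the caller must free it.
  char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
  if (status != 0 || out == nullptr) {
    return mangled;
  }
  std::string result(out);
  std::free(out);
  return result;
}
```

For example, frame 7 (`_ZN8facebook5velox4exec10SortBuffer10spillInputEv`) demangles to `facebook::velox::exec::SortBuffer::spillInput()`, and frame 19 to `gluten::WholeStageResultIterator::spillFixedSize(long)`. Read bottom-up, the trace shows Gluten's `nativeSpill` asking Velox to shrink capacity, which reaches `OrderBy::reclaim`, then `SortBuffer::spillInput`, and finally fails inside `Spiller::fillSpillRuns` when `MemoryPoolImpl::allocate` is called.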

@zhztheplayer

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

Labels: bug, triage
