[VL] Distinct aggregation OOM when getOutput #8025
Open
Description
Backend
VL (Velox)
Bug description
Distinct aggregation merges all sorted spill files in getOutput() (SpillPartition::createOrderedReader). If there are too many spill files, reading the first batch of each file into memory at the same time consumes a significant amount of memory. In one of our internal cases, a single task generated 300 spill files, and merging them required close to 3 GB of memory.
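To make the scale concrete, here is a rough back-of-the-envelope sketch of the merge-time footprint. It assumes memory grows linearly with the number of spill files, because the ordered reader holds one batch per file. The helper `estimateMergeBytes` is hypothetical (not a Velox API), and the 10 MiB per-file figure is only what the numbers above imply (about 3 GB / 300 files), not a measured value.

```cpp
#include <cstdint>

// Hypothetical helper, not part of Velox: estimates the memory needed to
// hold the first batch of every spill file while they are merged together.
// Assumes every file contributes roughly the same number of bytes.
constexpr uint64_t estimateMergeBytes(uint64_t numSpillFiles,
                                      uint64_t bytesPerFile) {
  return numSpillFiles * bytesPerFile;
}

constexpr uint64_t kMiB = 1ULL << 20;

// 300 files at an assumed ~10 MiB each is 3000 MiB, i.e. close to 3 GB.
static_assert(estimateMergeBytes(300, 10 * kMiB) == 3000 * kMiB,
              "300 files x 10 MiB should be 3000 MiB");
```

Under this model the footprint can be cut either by producing fewer files or by making each file's first batch smaller.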
Possible workarounds:
- Increase `kMaxSpillRunRows`. The current value of 1M generates too many spill files for hundreds of millions of input rows (see [GLUTEN-7249][VL] Lower default overhead memory ratio and spill run size #7531).
- Reduce `kSpillWriteBufferSize` to 1M or lower. Why is it set to 4M by default? Is there any performance tuning experience behind that value?
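As a rough illustration of the first workaround, the sketch below estimates how the number of spill files changes with the spill run size. It assumes each spill run produces exactly one sorted file and that there is a single spill partition, which may not match what Velox actually does. The helper `estimateSpillFiles` and the 300M-row input are hypothetical, chosen only to be consistent with the "hundreds of millions of rows" and "300 spill files" mentioned in this report.

```cpp
#include <cstdint>

// Hypothetical helper, not part of Velox: number of sorted spill files if
// every spill run of at most maxSpillRunRows rows becomes one file.
// Precondition: maxSpillRunRows > 0.
constexpr uint64_t estimateSpillFiles(uint64_t inputRows,
                                      uint64_t maxSpillRunRows) {
  // Ceiling division: a partial final run still produces a file.
  return (inputRows + maxSpillRunRows - 1) / maxSpillRunRows;
}

// Assumed input of 300M rows: 1M rows per run gives 300 files,
// while 4M rows per run gives 75 files.
static_assert(estimateSpillFiles(300000000ULL, 1000000ULL) == 300,
              "1M rows per run");
static_assert(estimateSpillFiles(300000000ULL, 4000000ULL) == 75,
              "4M rows per run");
```

Under these assumptions, raising the run size shrinks the file count proportionally, and with it the number of first batches that must be resident during the merge.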
Spark version
None
Spark configurations
No response
System information
No response
Relevant logs
No response
