-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Closed
Labels
bugSomething isn't workingSomething isn't working
Description
Describe the bug
On executing SortPreservingMerge on multiple streams of record batches and using a high-cardinality dictionary field as the sort key, the RowConverter instance used to merge the multiple RowCursorStreams keeps growing in memory (as it keeps accumulating the dict mappings internally in the OrderPreservingInterner structure). This unbounded memory growth eventually causes data fusion to get killed by the OOM killer.
To Reproduce
Detailed steps to reproduce this issue is given here.
Expected behavior
SortPreservingMerge on streams of record batches with high-cardinality dictionary-encoded sort keys should be memory aware and keep memory usage within a user-defined limit.
Additional context
Possible solution:
- Keep track of the memory usage for the
RowConverterusing thesize()method which in the case ofDictionaryfields returns the size of theOrderPreservingInterner. - If the size of the
RowConvertergrows more than a user-defined memory limit, take note of theRowCursorStreamthat are still getting converted, delete the converter, create a new one, and re-do the aborted conversions.
alamb and caseykneale
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working