Sorting by column works in opposite way when external sort exceeds allowed memory #1116

@doogi

It looks like sorting works fine when the whole dataset is small enough to fit into the memory allowed for the external sort implementation (sorry for the overcomplicated example, I was testing multiple theories):

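// Assumes Flow's DSL functions are already imported, e.g.:
// use function Flow\ETL\DSL\{data_frame, from_array, ref, to_output};

// Uncomment to cap the memory available to the external sort:
// putenv('FLOW_EXTERNAL_SORT_MAX_MEMORY=10M');

// Build 20,000 rows with ids running from '00100-200' down to '00001-001'.
$arr = [];
for ($j = 100; $j > 0; $j--) {
    for ($i = 200; $i > 0; $i--) {
        $arr[] = ['id' => str_pad((string) $j, 5, '0', STR_PAD_LEFT).'-'.str_pad((string) $i, 3, '0', STR_PAD_LEFT)];
    }
}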

shuffle($arr);

data_frame()
    ->read(from_array($arr))
    ->sortBy(ref('id'))
    ->limit(10)
    ->write(to_output(truncate: true))
    ->run();

Output:

+-----------+
|        id |
+-----------+
| 00001-001 |
| 00001-002 |
| 00001-003 |
| 00001-004 |
| 00001-005 |
| 00001-006 |
| 00001-007 |
| 00001-008 |
| 00001-009 |
| 00001-010 |
+-----------+

But when the memory limit is set to anything below the amount required to run the sort (the example above reaches ~60MB in my tests), e.g. by uncommenting putenv('FLOW_EXTERNAL_SORT_MAX_MEMORY=10M');, the output of the same code is:

+-----------+
|        id |
+-----------+
| 00100-200 |
| 00100-199 |
| 00100-198 |
| 00100-197 |
| 00100-196 |
| 00100-195 |
| 00100-194 |
| 00100-193 |
| 00100-192 |
| 00100-191 |
+-----------+

I believe this is a bug: sorting should produce the same result whether or not the (filesystem) cache is used to support it, right? 🙂
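
To illustrate the expectation, here is a minimal plain-PHP sketch of a generic external sort: sort fixed-size chunks independently, then merge them with a min-heap. This is not Flow's implementation. The function name chunkedSort and the in-memory "runs" (standing in for sorted files on disk) are assumptions made only for this example. The point is that a chunked merge should return the same ascending order as a single in-memory sort.

```php
<?php
// Generic external-sort sketch (NOT Flow's actual code; chunkedSort is a hypothetical name).
function chunkedSort(array $values, int $chunkSize): array
{
    // Phase 1: sort each chunk on its own (stands in for writing sorted runs to disk).
    $runs = [];
    foreach (array_chunk($values, $chunkSize) as $chunk) {
        sort($chunk, SORT_STRING);
        $runs[] = $chunk;
    }

    // Phase 2: seed a min-heap with the first element of every run.
    // Each heap entry is [value, run index, offset within the run].
    $heap = new SplMinHeap();
    foreach ($runs as $runIndex => $run) {
        $heap->insert([$run[0], $runIndex, 0]);
    }

    // Repeatedly take the smallest head and push the next element of that run.
    $sorted = [];
    while (!$heap->isEmpty()) {
        [$value, $runIndex, $offset] = $heap->extract();
        $sorted[] = $value;
        if (isset($runs[$runIndex][$offset + 1])) {
            $heap->insert([$runs[$runIndex][$offset + 1], $runIndex, $offset + 1]);
        }
    }

    return $sorted;
}

// Usage: the merged result should match a plain in-memory sort.
$ids = ['00002-001', '00001-002', '00100-200', '00001-001', '00010-005'];
$expected = $ids;
sort($expected, SORT_STRING);
var_dump(chunkedSort($ids, 2) === $expected);
```

The ids here are zero-padded, non-numeric strings, so PHP compares them as plain strings and the ascending order matches the first output table above.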

Thanks in advance for taking a look into this!
