Skip to content

External Sort - Cache Implementation - Memory Consumption #1142

@norberttech

Description

@norberttech

@domis86 noticed during his precise and detailed research, that implementation of External Sort mechanism that is based on Cache is currently not as efficient as it supposed to be.

It's because of those lines:

        foreach ($this->generators as $cacheId => $generator) {
            if ($generator->valid()) {
                /** @var Rows $rows */
                $rows = $generator->current();

                foreach ($rows->chunks(1) as $chunk) {
                    $heap->insert(CachedRow::fromRows($chunk, $cacheId));
                }
                $generator->next();
            }
        }

Each Generator represents all Rows from cached chunks.
This means that at once it might load 1k rows when in reality we should just load the first row from the generator and move to the next generator.

However to achieve that I think we need to update the following logic:

      $minHeap = $cachedParts->createHeap(...$refs);

        $bufferCache = new BufferCache($this->cache, $maxRowsSize);

        while ($cachedParts->notEmpty() || !$minHeap->isEmpty()) {
            $cachedParts->takeNext($minHeap, $this->id, $bufferCache);
        }

        $bufferCache->close();

Currently we are first creating the heap from all rows and then just reading from it, but that's wrong.
Let say we have 3 chunks each with 1k rows.

Each iteration should put to the MinHeap only 3 rows at once (1 from each chunk), so we can later add them to the $bufferCache.

Accordingly to: https://web.archive.org/web/20150202022830/http://faculty.simpson.edu/lydia.sinapova/www/cmsc250/LN250_Weiss/L17-ExternalSortEX2.htm

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions