A perfect implementation of a hashtable with a very large number of keys #9517

@GuoZhaoran

Description

@oranagra, I would be honored if you would take five minutes to read through my proposal for expanding a hashtable with a large number of keys.

I saw your thoughts under this PR, and they resonated with me a lot:

Currently a rehash comes at a big memory cost, which sometimes causes eviction and data loss. This is because we allocate another hash table that's double in size, and keep the old one in parallel with the new one for a while.
My idea is that instead of allocating a new hash table, we re-alloc the existing one and then start moving entries from the lower half to the upper half. Alternatively, when shrinking, we could concentrate the entries in the lower part and then eventually realloc the hash table to be smaller.
Obviously this poses some challenges in dictScan and other areas of dict.c, but I think they're all solvable.

When the hash table does not contain many keys, a rehash does not take much memory (at least an acceptable amount), but as the number of keys increases, the memory required for expansion becomes larger and larger (and must be contiguous):

ht[0].size    memory required for ht[1] expansion
4             64 bytes
8             128 bytes
16            256 bytes
...           ...
65536         1 MB
...           ...
8388608       128 MB
16777216      256 MB
...           ...
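For context, a minimal sketch of why each doubling is expensive in the current design (hypothetical function name, simplified from what dict.c actually does): the entire new bucket array must be allocated contiguously up front and held alongside the old one until incremental rehashing finishes.

```c
#include <stdlib.h>

typedef struct dictEntry dictEntry;   /* opaque for this illustration */

/* Growing from old_size buckets to 2 * old_size requires one contiguous
 * allocation of the whole new bucket array. */
dictEntry **alloc_rehash_target(size_t old_size) {
    size_t new_size = old_size * 2;
    /* e.g. old_size = 16777216 -> 33554432 * 8 bytes = 256 MB at once */
    return calloc(new_size, sizeof(dictEntry *));
}
```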

I think the biggest problem with hashtable scaling is that the table can grow without bound (sizes are always powers of 2), and such large tables not only consume memory, but much of that memory is not immediately needed. My idea is to limit the size of the largest table, for example to 65536 buckets (2^16, occupying 512 KB of memory with 8-byte pointers; this size could be adjustable through configuration). When we need a larger table, we can combine multiple such maximum-size tables into one, for example a hashtable with a capacity of 131072.

[figure: hashtable1]

By adding a dictEntry ** array, a table of any size can be assembled: a hashtable with a capacity of 131072 uses 2 such sub-tables, one with a capacity of 8388608 uses 128, and so on.
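To make this concrete, here is a minimal sketch of what such a segmented table might look like, including the one extra pointer hop a lookup needs; all names (segmentedTable, bucket_head, SUB_TABLE_BITS) are hypothetical, not existing dict.c identifiers:

```c
#include <stdint.h>
#include <stdlib.h>

#define SUB_TABLE_BITS 16                      /* 2^16 = 65536 buckets per sub-table */
#define SUB_TABLE_SIZE (1UL << SUB_TABLE_BITS)
#define SUB_TABLE_MASK (SUB_TABLE_SIZE - 1)

typedef struct dictEntry {
    void *key;
    void *val;
    struct dictEntry *next;   /* chained buckets, as in dict.c */
} dictEntry;

typedef struct segmentedTable {
    dictEntry ***tables;      /* array of sub-tables of SUB_TABLE_SIZE buckets each */
    size_t ntables;           /* slots in `tables` (a power of 2) */
    size_t size;              /* total buckets = ntables * SUB_TABLE_SIZE */
} segmentedTable;

/* The "additional pointer addressing": split the global bucket index
 * into a sub-table index and a slot within that sub-table. */
static dictEntry *bucket_head(segmentedTable *t, uint64_t hash) {
    uint64_t idx = hash & (t->size - 1);       /* size is a power of 2 */
    dictEntry **sub = t->tables[idx >> SUB_TABLE_BITS];
    return sub ? sub[idx & SUB_TABLE_MASK] : NULL;
}
```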

This pattern of combining tables does not complicate lookup, insertion, expansion, or traversal of the hash table, and is largely compatible with the existing approach; it only requires one additional pointer dereference, which I think is a small price for the benefits it brings, such as the following expansion:

[figure: hashtable2]

  • when the hashtable is expanded, only the dictEntry ** array needs to grow
  • a 65536-bucket sub-table is allocated only when it is actually used, which is often not needed right away, and each one is capped at 512 KB of memory
  • overall it saves about half the memory compared to the original approach, because the existing 65536-bucket sub-tables are reused; when scaling down, a sub-table's memory can also be released at the right time, which is very flexible (a sketch of expansion follows this list)
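Continuing the sketch above, expansion under this scheme could look like the following: only the small top-level pointer array is reallocated eagerly, and each 65536-bucket sub-table (512 KB) is allocated on first use. Again, segmented_expand and ensure_sub_table are hypothetical names:

```c
#include <string.h>   /* memset; other definitions as in the sketch above */

/* Double the table: realloc only the top-level pointer array.
 * New sub-tables stay NULL until an entry actually hashes into them. */
static int segmented_expand(segmentedTable *t) {
    size_t new_ntables = t->ntables * 2;
    dictEntry ***tables = realloc(t->tables, new_ntables * sizeof(dictEntry **));
    if (!tables) return -1;
    /* Zero the new upper half; a NULL slot costs nothing until used. */
    memset(tables + t->ntables, 0, (new_ntables - t->ntables) * sizeof(dictEntry **));
    t->tables = tables;
    t->ntables = new_ntables;
    t->size = new_ntables * SUB_TABLE_SIZE;
    return 0;
}

/* Lazily allocate one sub-table (65536 * 8 bytes = 512 KB). */
static dictEntry **ensure_sub_table(segmentedTable *t, size_t ti) {
    if (!t->tables[ti])
        t->tables[ti] = calloc(SUB_TABLE_SIZE, sizeof(dictEntry *));
    return t->tables[ti];
}
```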

This pattern of combining tables is also very convenient for scan; as with the existing approach, it works well and does not return duplicate cursors.
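As a sketch of what scan would need (again hypothetical, building on the struct above): the reverse-binary cursor from dictScan can keep indexing buckets globally, and only the final bucket access changes:

```c
/* Fetch the bucket a dictScan-style cursor points at; an unallocated
 * sub-table simply behaves as a run of empty buckets. */
static dictEntry *cursor_bucket(segmentedTable *t, uint64_t cursor) {
    uint64_t idx = cursor & (t->size - 1);
    return bucket_head(t, idx);   /* same index decomposition as a lookup */
}
```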

Of course, it is slightly more complex than the original implementation, but it works better when a table holds a particularly large number of keys. Perhaps the Redis hashtable could set a threshold (for example, 65536) and use the combined sub-table pattern only when a table exceeds this size.

There are surely details I have not considered fully; even so, I hope you or your colleagues can point them out, so I can learn more. Thank you very much!
