Conversation
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Code Review
This pull request implements a lazy eviction mechanism for the `BlendTokenRangeMatcher` by adding a `remove_chunks` method and updating the lookup logic to identify and clear stale entries. Review feedback highlighted a critical scalability issue: because `compact_id`s are never reused, the internal list grows without bound (a memory leak) and eventually exceeds the fixed table size, causing an `IndexError`. Additionally, the implementation currently fails to handle duplicate `token_hash` registrations, which could leave orphaned entries in the fingerprint table and corrupt data during matching.
slot = int(chunk_hashes[i]) & int(self._mask)
self._chunk_token_hash.append(th)
self._token_hash_to_start[th] = i * self.chunk_size
self._compact_id_to_slot[cid] = slot
There is a critical issue here that will lead to an `IndexError` and a crash.
The `compact_id`s are generated based on the length of `self._chunk_token_hash`. The `remove_chunks` method evicts entries by setting them to `None` in this list, but it never reclaims the space or allows `compact_id`s to be reused.
As a result, `len(self._chunk_token_hash)` grows without bound. Since `self._compact_id_to_slot` has a fixed size of `_TABLE_SIZE`, an `IndexError` is inevitable here once a `cid` is generated that is `>= _TABLE_SIZE`. This also constitutes a memory leak, as the list fills with `None` values that are never removed.
To fix this, you should implement a mechanism to reuse `compact_id`s. A common pattern is to maintain a free list of evicted IDs: when registering new chunks, pull from the free list first, and only grow `_chunk_token_hash` to mint a new ID when the free list is empty.
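For illustration, here is a minimal sketch of that free-list pattern. It reuses the field names from this diff, but `_free_compact_ids` and `_alloc_compact_id` are hypothetical additions, not part of the current implementation:

```python
from typing import Dict, List, Optional


class _FreeListSketch:
    """Sketch only: recycle evicted compact_ids so that
    len(_chunk_token_hash) stays bounded by the number of live chunks."""

    def __init__(self) -> None:
        self._chunk_token_hash: List[Optional[bytes]] = []
        self._token_hash_to_start: Dict[bytes, int] = {}
        self._token_hash_to_compact_id: Dict[bytes, int] = {}
        self._free_compact_ids: List[int] = []  # evicted ids awaiting reuse

    def _alloc_compact_id(self, th: bytes) -> int:
        """Prefer a recycled id; only grow the list when none is free."""
        if self._free_compact_ids:
            cid = self._free_compact_ids.pop()
            self._chunk_token_hash[cid] = th  # overwrite the stale None slot
        else:
            cid = len(self._chunk_token_hash)
            self._chunk_token_hash.append(th)
        return cid

    def remove_chunks(self, token_hashes: List[bytes]) -> None:
        """Lazy eviction that returns compact_ids to the free list."""
        for th in token_hashes:
            cid = self._token_hash_to_compact_id.pop(th, None)
            if cid is None:
                continue  # never registered, or already evicted
            self._chunk_token_hash[cid] = None  # mark the slot stale
            self._free_compact_ids.append(cid)  # id becomes reusable
            self._token_hash_to_start.pop(th, None)
```

Under this scheme a `cid` never exceeds the peak number of simultaneously live chunks, so the fixed-size `_compact_id_to_slot` table remains safe as long as that peak stays below `_TABLE_SIZE`.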
for i in range(n):
    th = token_hashes[i]
    cid = int(compact_ids[i])
    slot = int(chunk_hashes[i]) & int(self._mask)
    self._chunk_token_hash.append(th)
    self._token_hash_to_start[th] = i * self.chunk_size
    self._compact_id_to_slot[cid] = slot
    self._token_hash_to_compact_id[th] = cid
The current implementation does not handle the case where the same `token_hash` is registered for multiple different chunks. This can lead to incorrect behavior and orphaned entries in the matcher.
If a `token_hash` is reused, the reverse-lookup maps (`_token_hash_to_start` and `_token_hash_to_compact_id`) are overwritten with the details of the latest registration. This can cause several problems:

- `remove_chunks` will only be able to evict the most recently registered chunk for a given hash, leaving earlier chunks with the same hash as orphans in the table.
- `match_sub_sequence` could return incorrect `old_st` positions for these orphaned chunks if they are matched, leading to data corruption downstream.

To ensure correctness, you should enforce that all registered `token_hash` values are unique. A straightforward way to enforce this is to check for existing hashes before registration and raise an error. This makes the requirement on the caller explicit.
# Check for duplicates before modifying any state, so a failed call
# leaves the matcher unchanged. Also guard against repeats within the
# incoming batch itself, which the state check alone would miss.
seen = set()
for th in token_hashes:
    if th in self._token_hash_to_compact_id or th in seen:
        raise ValueError(f"Attempted to register duplicate token hash: {th.hex()}")
    seen.add(th)

# Persist compact_id → token_hash, token_hash → start, and reverse maps
for i in range(n):
    th = token_hashes[i]
    cid = int(compact_ids[i])
    slot = int(chunk_hashes[i]) & int(self._mask)
    self._chunk_token_hash.append(th)
    self._token_hash_to_start[th] = i * self.chunk_size
    self._compact_id_to_slot[cid] = slot
    self._token_hash_to_compact_id[th] = cid
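Note that performing the duplicate check before any state is mutated keeps registration all-or-nothing: a raised `ValueError` leaves the matcher in a consistent state. Callers that intentionally re-register a hash would then need to call `remove_chunks` for it first, which also composes cleanly with the free-list fix suggested above.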
add eviction for cb
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>