This repository was archived by the owner on Jan 2, 2025. It is now read-only.


@ggordonhall
Contributor

Low-quality code chunks can degrade semantic search results. These come in many forms, but one common case is a chunk that consists of a sequence of floats (e.g. from an embedded SVG).

This PR introduces a first attempt at filtering these out: it skips chunks where numeric and punctuation characters make up more than 50% of the non-whitespace characters.
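
For illustration, the heuristic described above could be sketched roughly as follows. This is not the code from this PR: the function name `is_noisy_chunk`, the `threshold` parameter, the use of Python's ASCII `string.punctuation` set, and the handling of empty chunks are all assumptions made for the example.

```python
import string


def is_noisy_chunk(text: str, threshold: float = 0.5) -> bool:
    """Return True when digits and punctuation dominate the chunk.

    Hypothetical helper: the name and the threshold parameter are
    illustrative, not taken from the PR.
    """
    # Ignore whitespace so indentation and newlines do not dilute the ratio.
    chars = [c for c in text if not c.isspace()]
    if not chars:
        # Assumption: an empty chunk is not treated as noisy here.
        return False
    # Count numeric and (ASCII) punctuation characters.
    noisy = sum(1 for c in chars if c.isdigit() or c in string.punctuation)
    # Strictly "more than" the threshold, matching the description above.
    return noisy / len(chars) > threshold


svg_path = "M 10.5 20.25 L 30.75 40.125 C 50.5 60.25 70.75 80.125"
code = "def add(a, b):\n    return a + b"
chunks = [c for c in (svg_path, code) if not is_noisy_chunk(c)]
```

With these inputs, the SVG-like path data would be skipped and the function definition kept. A chunk sitting at exactly 50% is kept, since the description says "more than 50%".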

Closes BLO-1822

@ggordonhall ggordonhall merged commit 6a31b80 into main Dec 8, 2023
@ggordonhall ggordonhall deleted the gabriel/filter-noisy-chunks branch December 8, 2023 16:09