
Feature: Fuzzy Matching — Permissive Thresholds + Unicode Normalization (inspired by Kilocode) #517

@teknium1

Overview

Hermes and Kilocode share nearly identical 9-strategy fuzzy matching chains for patch/edit (both descend from Cline). Kilocode differs in two ways that should improve edit success rates:

  1. More permissive BlockAnchor thresholds — Kilocode: 0.0/0.3 vs Hermes: 0.70
  2. Unicode normalization strategy — Handles smart quotes, em-dashes, other LLM Unicode artifacts

Research Findings

Threshold Comparison

Kilocode BlockAnchor: If first+last lines match exactly, accept middle with threshold 0.0 (single candidate) or 0.3 (multiple candidates). Uses Levenshtein distance.

Hermes BlockAnchor (fuzzy_match.py line 285): Requires 0.70 similarity for middle content. Uses SequenceMatcher.

Kilocode's reasoning: if block boundaries match exactly, you almost certainly found the right block. Being strict about the middle causes unnecessary edit failures.
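
The comparison above can be sketched as follows. This is an illustration only, not the code in fuzzy_match.py or Kilocode: the function name `find_block_anchor_matches`, its parameters, and the same-length window assumption are hypothetical, and it uses SequenceMatcher (as Hermes does) rather than Levenshtein distance.

```python
from difflib import SequenceMatcher


def find_block_anchor_matches(content_lines, pattern_lines, single=0.0, multi=0.3):
    """Return (start, end) line ranges whose first and last lines match the pattern.

    Hypothetical sketch: assumes the matched block has the same number of
    lines as the pattern, which a real implementation may not require.
    """
    n = len(pattern_lines)
    if n < 3:
        return []  # need at least one middle line between the two anchors

    first = pattern_lines[0].strip()
    last = pattern_lines[-1].strip()

    # Candidates are windows whose anchor lines match exactly (ignoring indentation).
    candidates = [
        i
        for i in range(len(content_lines) - n + 1)
        if content_lines[i].strip() == first
        and content_lines[i + n - 1].strip() == last
    ]

    # A single candidate gets the lenient threshold; several get the stricter one.
    threshold = single if len(candidates) == 1 else multi

    pattern_middle = "\n".join(pattern_lines[1:-1])
    matches = []
    for i in candidates:
        middle = "\n".join(content_lines[i + 1 : i + n - 1])
        similarity = SequenceMatcher(None, middle, pattern_middle).ratio()
        if similarity >= threshold:
            matches.append((i, i + n))
    return matches
```

With a single candidate and `single=0.0`, any middle content is accepted once the anchors match. Passing `single=0.70, multi=0.70` reproduces the stricter behaviour described for Hermes, where a heavily edited middle line can cause the block to be rejected.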

Unicode Normalization

Kilocode's patch applicator (patch/index.ts) has a 4-pass strategy including Unicode normalization — smart quotes to ASCII quotes, em-dashes to hyphens. LLMs occasionally produce these Unicode characters, causing exact match failures.

Characters normalized: \u201c and \u201d → ", \u2018 and \u2019 → ', \u2014 → --, \u2013 → -, \u2026 → ..., non-breaking space (\u00a0) → space.


Implementation

1. Adjust BlockAnchor threshold (fuzzy_match.py line 285)

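# Before: if similarity >= 0.70
# After: context-dependent threshold
# candidate_count is assumed to be the number of blocks whose first and
# last lines match the pattern; a lone candidate gets the lenient threshold.
threshold = 0.10 if candidate_count == 1 else 0.30
if similarity >= threshold:
    ...  # accept the candidate exactly as the existing code does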

2. Add Unicode normalization strategy (~20 LOC)

Insert as a new strategy before block_anchor:

UNICODE_MAP = {
    "\u201c": "\"", "\u201d": "\"",  # smart double quotes
    "\u2018": "'", "\u2019": "'",    # smart single quotes
    "\u2014": "--", "\u2013": "-",   # em/en-dash
    "\u2026": "...", "\u00a0": " ",  # ellipsis, nbsp
}

def _unicode_normalize(text: str) -> str:
    for char, repl in UNICODE_MAP.items():
        text = text.replace(char, repl)
    return text

Effort: ~30 minutes. ~20 LOC.

