fix(hindsight): preserve non-ASCII text in retained conversation turns #13090

Merged
teknium1 merged 1 commit into NousResearch:main from harryplusplus:fix/hindsight-unicode-retain
Apr 24, 2026

Conversation

@harryplusplus

fix(hindsight): preserve non-ASCII text in retained conversation turns

Problem

sync_turn serializes conversation turns with json.dumps(messages), which defaults to ensure_ascii=True. This escapes all non-ASCII characters (Korean, Japanese, Chinese, emoji) into \uXXXX sequences before sending to Hindsight via aretain_batch.
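The default behavior can be reproduced with the standard library alone. The snippet below is a minimal sketch; the `messages` value is an illustrative stand-in, not the actual payload built by sync_turn.

```python
import json

# Illustrative turn; the real structure passed by sync_turn may differ.
messages = [{"role": "user", "content": "안녕"}]

# ensure_ascii defaults to True, so every non-ASCII code point is
# written as a \uXXXX escape and the result is pure ASCII.
escaped = json.dumps(messages)
print(escaped)  # [{"role": "user", "content": "\uc548\ub155"}]

# Characters outside the Basic Multilingual Plane (most emoji) become
# a UTF-16 surrogate pair, i.e. two escapes per code point.
emoji_escaped = json.dumps("👨")
print(emoji_escaped)  # "\ud83d\udc68"
```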

The escaped content is stored as-is in Hindsight's documents.original_text column and chunks.chunk_text column, confirmed via direct DB queries:

-- documents.original_text
[[{"role": "user", "content": "[\ub098] <@1487373250630651975> ...

-- chunks.chunk_text
[[{"role": "user", "content": "[\ub098] ...

This affects the Hindsight pipeline in two ways:

1. LLM token waste during fact extraction — The escaped original_text is passed to the LLM for fact extraction. Tokenizers break \uXXXX escapes into far more tokens than the original characters:

| Text | ensure_ascii=True | ensure_ascii=False | Token increase |
| --- | --- | --- | --- |
| 안녕 こんにちは 你好 | 31 tokens | 8 tokens | +287% |
| 👨‍👩‍👧‍👦 family | 43 tokens | 14 tokens | +207% |
| 나 Hermes Agent 로그 보는 방법 | 29 tokens | 8 tokens | +262% |

(Token counts via tiktoken with gpt-4o encoding.)
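Exact token counts depend on the tokenizer, but the size inflation is visible even without one. The sketch below uses serialized length as a rough, tokenizer-free proxy. The helper `serialized_sizes` is hypothetical and does not reproduce the token numbers in the table.

```python
import json

def serialized_sizes(text):
    """Return (escaped_chars, raw_chars, raw_utf8_bytes) for one JSON string.

    Hypothetical helper: character and byte counts are only a proxy
    for token counts, which vary by tokenizer.
    """
    escaped = json.dumps(text)                  # \uXXXX escapes, ASCII only
    raw = json.dumps(text, ensure_ascii=False)  # original characters kept
    return len(escaped), len(raw), len(raw.encode("utf-8"))

# Two Hangul syllables: 6 ASCII characters each when escaped,
# 3 UTF-8 bytes each when preserved (plus the two surrounding quotes).
print(serialized_sizes("안녕"))  # (14, 4, 8)
```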

2. Chunk readability — When include_chunks=True is used in recall, the returned chunk_text contains escaped Unicode, degrading readability for both LLMs and humans.

What is NOT affected

DB investigation shows that Hindsight's retrieval pipeline works correctly regardless of ensure_ascii:

  • memory_units.text — LLM extracts facts in the original language even from escaped input, so extracted facts contain proper Korean/CJK characters.
  • memory_units.search_vector (BM25) — Built from memory_units.text, not from original_text or chunk_text. BM25 keyword search works correctly.
  • memory_units.embedding — Also built from memory_units.text. Semantic search works correctly.

Note on embeddings: bge-m3 treats "나" and "\\ub098" as very different strings (cosine = 0.47). If embeddings were generated from the escaped original_text, semantic search would break. However, Hindsight generates embeddings from memory_units.text (the LLM-extracted facts, which are correct Korean), so recall is unaffected.

Fix

Add ensure_ascii=False to the json.dumps call in sync_turn so non-ASCII text is preserved as-is in the serialized JSON sent to Hindsight.

Before:

[{"role": "user", "content": "\uc548\ub155 \u3053\u3093\u306b\u3061\u306f \u4f60\u597d"}]

After:

[{"role": "user", "content": "안녕 こんにちは 你好"}]
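As a rough sketch of the change, assuming the serialization step can be isolated into a small function: `serialize_turn` below is a hypothetical stand-in, not the actual sync_turn code, and the real call site may pass additional arguments.

```python
import json

def serialize_turn(messages):
    """Hypothetical stand-in for the serialization step in sync_turn."""
    # ensure_ascii=False keeps non-ASCII characters as-is instead of
    # rewriting them as \uXXXX escapes.
    return json.dumps(messages, ensure_ascii=False)

turn = [{"role": "user", "content": "안녕 こんにちは 你好"}]
payload = serialize_turn(turn)
print(payload)  # [{"role": "user", "content": "안녕 こんにちは 你好"}]

# The result is a str that is no longer pure ASCII, so any code that
# writes it to a socket or file needs an explicit encoding such as UTF-8.
body = payload.encode("utf-8")
```

Both forms are valid JSON and decode to the same object, so consumers that parse the payload see no difference. Only code that stores or displays the raw string is affected.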

Test

Added test_sync_turn_preserves_unicode that verifies the serialized JSON content (not just the round-tripped Python object) contains the original CJK characters and ZWJ composite emoji. This test fails without the fix and passes with it.

Note: A test that only checks json.loads round-trip would pass regardless of ensure_ascii, since json.loads silently decodes \uXXXX back to the original characters. The test must inspect the raw serialized string to catch this bug.
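The difference between the two kinds of assertion can be sketched as follows. This is an illustrative standalone test, not the test_sync_turn_preserves_unicode added in this PR, and it calls json.dumps directly rather than going through the provider.

```python
import json

def test_raw_string_check_catches_escaping():
    text = "안녕 👨‍👩‍👧‍👦"
    messages = [{"role": "user", "content": text}]

    escaped = json.dumps(messages)                        # old behavior
    preserved = json.dumps(messages, ensure_ascii=False)  # fixed behavior

    # A round-trip check cannot tell the two apart: json.loads decodes
    # the \uXXXX escapes back into the original characters.
    assert json.loads(escaped) == json.loads(preserved) == messages

    # Inspecting the raw serialized string does distinguish them.
    assert text not in escaped
    assert text in preserved

test_raw_string_check_catches_escaping()
```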

How to test

pytest tests/plugins/memory/test_hindsight_provider.py -v

@alt-glitch added labels on Apr 22, 2026: type/bug (Something isn't working), P2 (Medium, degraded but workaround exists), comp/plugins (Plugin system and bundled plugins), tool/memory (Memory tool and memory providers)
@harryplusplus force-pushed the fix/hindsight-unicode-retain branch from 0152c94 to 8b7e814 on April 22, 2026 17:21
@harryplusplus force-pushed the fix/hindsight-unicode-retain branch from 8b7e814 to 93d4ad6 on April 22, 2026 17:24

@nicoloboschi nicoloboschi left a comment

+1

teknium1 added a commit that referenced this pull request Apr 24, 2026
Adds AUTHOR_MAP entries for perlowja, tangyuanjc, harryplusplus
ahead of merging PRs #14109, #13153, #13090.
@teknium1 teknium1 merged commit d6b65bb into NousResearch:main Apr 24, 2026