fix(hindsight): preserve non-ASCII text in retained conversation turns by harryplusplus · Pull Request #13090 · NousResearch/hermes-agent

harryplusplus · 2026-04-20T15:31:29Z

Problem

sync_turn serializes conversation turns with json.dumps(messages), which defaults to ensure_ascii=True. As a result, every non-ASCII character (Korean, Japanese, Chinese, emoji) is escaped into a \uXXXX sequence before the payload is sent to Hindsight via aretain_batch.
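The default behavior can be reproduced with the standard library alone. This is a minimal sketch: the message list is a hypothetical stand-in, and the real structure passed by sync_turn may differ.

```python
import json

# Hypothetical message list; the actual payload built by sync_turn may differ.
messages = [{"role": "user", "content": "나"}]

# ensure_ascii defaults to True, so the Korean character is escaped.
escaped = json.dumps(messages)
print(escaped)  # [{"role": "user", "content": "\ub098"}]
```

The printed string contains a literal backslash followed by `ub098`, not the character itself, and that is the form that reaches storage.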

The escaped content is stored as-is in Hindsight's documents.original_text column and chunks.chunk_text column, confirmed via direct DB queries:

-- documents.original_text
[[{"role": "user", "content": "[\ub098] <@1487373250630651975> ...

-- chunks.chunk_text
[[{"role": "user", "content": "[\ub098] ...

This affects the Hindsight pipeline in two ways:

1. LLM token waste during fact extraction — The escaped original_text is passed to the LLM for fact extraction. Tokenizers break \uXXXX escapes into far more tokens than the original characters:

| Text | `ensure_ascii=True` | `ensure_ascii=False` | Token increase |
|---|---|---|---|
| `안녕 こんにちは你好` | 31 tokens | 8 tokens | +287% |
| `👨‍👩‍👧‍👦 family` | 43 tokens | 14 tokens | +207% |
| `나 Hermes Agent 로그 보는 방법` | 29 tokens | 8 tokens | +262% |

(Token counts via tiktoken with gpt-4o encoding.)
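The figures above come from tiktoken. As a rough stdlib-only proxy, the sketch below compares character counts of the two serializations. It does not reproduce the token counts, and the helper name escaped_vs_raw is made up for illustration.

```python
import json

# Two of the sample strings from the table above.
samples = ["안녕 こんにちは你好", "나 Hermes Agent 로그 보는 방법"]

def escaped_vs_raw(text):
    """Return (escaped_len, raw_len): character counts of the JSON-encoded text."""
    escaped_len = len(json.dumps(text))                    # ensure_ascii=True
    raw_len = len(json.dumps(text, ensure_ascii=False))    # characters kept as-is
    return escaped_len, raw_len

for sample in samples:
    escaped_len, raw_len = escaped_vs_raw(sample)
    print(f"{sample!r}: {escaped_len} chars escaped vs {raw_len} chars raw")
```

Each non-ASCII character in the Basic Multilingual Plane becomes six ASCII characters when escaped, which is the inflation the tokenizer then has to process.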

2. Chunk readability — When include_chunks=True is used in recall, the returned chunk_text contains escaped Unicode, degrading readability for both LLMs and humans.

What is NOT affected

DB investigation shows that Hindsight's retrieval pipeline works correctly regardless of ensure_ascii:

memory_units.text — The LLM extracts facts in the original language even from escaped input, so the extracted facts contain proper Korean/CJK characters.
memory_units.search_vector (BM25) — Built from memory_units.text, not from original_text or chunk_text. BM25 keyword search works correctly.
memory_units.embedding — Also built from memory_units.text. Semantic search works correctly.

Note on embeddings: bge-m3 treats "나" and "\\ub098" as very different strings (cosine = 0.47). If embeddings were generated from the escaped original_text, semantic search would break. However, Hindsight generates embeddings from memory_units.text (the LLM-extracted facts, which are correct Korean), so recall is unaffected.

Fix

Add ensure_ascii=False to the json.dumps call in sync_turn so non-ASCII text is preserved as-is in the serialized JSON sent to Hindsight.

Before:

[{"role": "user", "content": "\uc548\ub155 \u3053\u3093\u306b\u3061\u306f \u4f60\u597d"}]

After:

[{"role": "user", "content": "안녕 こんにちは 你好"}]
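A minimal sketch of the change, assuming the serialization is a plain json.dumps call. The helper serialize_turn is hypothetical and only stands in for the call inside sync_turn.

```python
import json

def serialize_turn(messages):
    # Hypothetical stand-in for the json.dumps call inside sync_turn.
    # ensure_ascii=False keeps non-ASCII characters as-is in the output string.
    return json.dumps(messages, ensure_ascii=False)

messages = [{"role": "user", "content": "안녕 こんにちは 你好"}]
payload = serialize_turn(messages)
print(payload)
```

The result is still valid JSON. It is simply a str that contains non-ASCII characters, so whatever transports it needs to encode it as UTF-8.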

Test

Added test_sync_turn_preserves_unicode that verifies the serialized JSON content (not just the round-tripped Python object) contains the original CJK characters and ZWJ composite emoji. This test fails without the fix and passes with it.

Note: A test that only checks json.loads round-trip would pass regardless of ensure_ascii, since json.loads silently decodes \uXXXX back to the original characters. The test must inspect the raw serialized string to catch this bug.
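The difference between the two kinds of check can be sketched as follows. This is not the actual test_sync_turn_preserves_unicode; the helper check_preserves_unicode is a hypothetical name used only for illustration.

```python
import json

def check_preserves_unicode(serialized, expected_fragments):
    """True only if every fragment appears literally in the raw serialized string."""
    return all(fragment in serialized for fragment in expected_fragments)

messages = [{"role": "user", "content": "안녕 こんにちは 你好"}]
escaped = json.dumps(messages)                          # behavior before the fix
preserved = json.dumps(messages, ensure_ascii=False)    # behavior after the fix

# A round-trip check cannot tell the two apart: both decode to the same object.
assert json.loads(escaped) == json.loads(preserved) == messages

# Inspecting the raw string does distinguish them.
assert not check_preserves_unicode(escaped, ["안녕", "你好"])
assert check_preserves_unicode(preserved, ["안녕", "你好"])
```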

How to test

pytest tests/plugins/memory/test_hindsight_provider.py -v

nicoloboschi

+1

alt-glitch added the labels type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/plugins (Plugin system and bundled plugins), and tool/memory (Memory tool and memory providers) on Apr 22, 2026

harryplusplus force-pushed the fix/hindsight-unicode-retain branch from 0152c94 to 8b7e814 on April 22, 2026 17:21

fix(hindsight): preserve non-ASCII text in retained conversation turns

93d4ad6

harryplusplus force-pushed the fix/hindsight-unicode-retain branch from 8b7e814 to 93d4ad6 on April 22, 2026 17:24

nicoloboschi approved these changes Apr 23, 2026

teknium1 mentioned this pull request Apr 24, 2026

chore(release): map hindsight PR contributors in AUTHOR_MAP #15070

Merged

teknium1 added a commit that referenced this pull request Apr 24, 2026

chore(release): map hindsight PR contributors in AUTHOR_MAP (#15070)

3c0a728

Adds AUTHOR_MAP entries for perlowja, tangyuanjc, harryplusplus ahead of merging PRs #14109, #13153, #13090.

teknium1 merged commit d6b65bb into NousResearch:main Apr 24, 2026

teknium1 merged 1 commit into NousResearch:main from harryplusplus:fix/hindsight-unicode-retain