fix(hindsight): preserve non-ASCII text in retained conversation turns #13090
Merged
teknium1 merged 1 commit on Apr 24, 2026
teknium1 added a commit that referenced this pull request on Apr 24, 2026.
nekorytaylor666 pushed a commit to nekorytaylor666/hermes-agent that referenced this pull request on Apr 24, 2026:
…arch#15070) Adds AUTHOR_MAP entries for perlowja, tangyuanjc, harryplusplus ahead of merging PRs NousResearch#14109, NousResearch#13153, NousResearch#13090.
fix(hindsight): preserve non-ASCII text in retained conversation turns
Problem
sync_turn serializes conversation turns with json.dumps(messages), which defaults to ensure_ascii=True. This escapes all non-ASCII characters (Korean, Japanese, Chinese, emoji) into \uXXXX sequences before they are sent to Hindsight via aretain_batch.

The escaped content is stored as-is in Hindsight's documents.original_text column and chunks.chunk_text column, confirmed via direct DB queries.

This affects the Hindsight pipeline in two ways:
1. LLM token waste during fact extraction: The escaped original_text is passed to the LLM for fact extraction. Tokenizers break \uXXXX escapes into far more tokens than the original characters. Token counts were compared under ensure_ascii=True and ensure_ascii=False for these sample inputs (via tiktoken with gpt-4o encoding):
   - 안녕 こんにちは 你好
   - 👨👩👧👦 family
   - 나 Hermes Agent 로그 보는 방법

2. Chunk readability: When include_chunks=True is used in recall, the returned chunk_text contains escaped Unicode, degrading readability for both LLMs and humans.

What is NOT affected
DB investigation shows that Hindsight's retrieval pipeline works correctly regardless of ensure_ascii:
- memory_units.text: The LLM extracts facts in the original language even from escaped input, so extracted facts contain proper Korean/CJK characters.
- memory_units.search_vector (BM25): Built from memory_units.text, not from original_text or chunk_text. BM25 keyword search works correctly.
- memory_units.embedding: Also built from memory_units.text. Semantic search works correctly.

Note on embeddings: bge-m3 treats "나" and "\\ub098" as very different strings (cosine = 0.47). If embeddings were generated from the escaped original_text, semantic search would break. However, Hindsight generates embeddings from memory_units.text (the LLM-extracted facts, which are correct Korean), so recall is unaffected.

Fix
Add ensure_ascii=False to the json.dumps call in sync_turn so non-ASCII text is preserved as-is in the serialized JSON sent to Hindsight.

Before:
[{"role": "user", "content": "\uc548\ub155 \u3053\u3093\u306b\u3061\u306f \u4f60\u597d"}]

After:
[{"role": "user", "content": "안녕 こんにちは 你好"}]

Test
Added test_sync_turn_preserves_unicode, which verifies that the serialized JSON content (not just the round-tripped Python object) contains the original CJK characters and ZWJ composite emoji. This test fails without the fix and passes with it.

Note: A test that only checks the json.loads round-trip would pass regardless of ensure_ascii, since json.loads silently decodes \uXXXX back to the original characters. The test must inspect the raw serialized string to catch this bug.

How to test
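The serialization behavior can be checked in isolation with the Python standard library. The following is a minimal sketch, not the project's actual test: serialize_messages is a hypothetical stand-in for the json.dumps call inside sync_turn, and the sample message is taken from the Before/After example in the Fix section. It shows why a round-trip check cannot catch the bug while a raw-string check can.

```python
import json


def serialize_messages(messages, ensure_ascii):
    """Hypothetical stand-in for the json.dumps call inside sync_turn."""
    return json.dumps(messages, ensure_ascii=ensure_ascii)


messages = [{"role": "user", "content": "안녕 こんにちは 你好"}]

escaped = serialize_messages(messages, ensure_ascii=True)     # json.dumps default
preserved = serialize_messages(messages, ensure_ascii=False)  # behavior after the fix

# Round-trip check: passes in both cases, so it cannot detect the escaping.
assert json.loads(escaped) == messages
assert json.loads(preserved) == messages

# Raw-string check: only the ensure_ascii=False output keeps the original text.
assert "안녕" not in escaped
assert "\\uc548\\ub155" in escaped
assert "안녕 こんにちは 你好" in preserved

print(escaped)
print(preserved)
```

Inspecting the serialized string rather than the decoded object is the design choice that matters here, because json.loads maps both forms back to the same Python value.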