feat: add Korean language support for memory search query expansion #18899
vincentkoc merged 7 commits into openclaw:main
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 170524c567
Friendly reminder: this PR is ready for review. All CI checks are passing. Would love to get a maintainer's eyes on this when you get a chance! 🙏
fba41f4 to ba569f7
Hi @gumadeiras @vincentkoc 👋 Just rebased onto latest main, so all CI should be green. Would really appreciate a review when you get a chance. Happy to address any feedback. Thanks! 🙏
Addressed the open review feedback and pushed follow-up fixes in @greptileai review
Changelog credit added for this PR in @greptileai review
…18899)
- feat: add Korean stop words and tokenization for memory search
- fix: address review comments on Korean query expansion
- fix: lint errors - curly brace and toSorted
- fix(memory): improve Korean stop words and deduplicate
- Memory: tighten Korean query expansion filtering
- Docs/Changelog: credit Korean memory query expansion
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
Summary
Korean is the 3rd most common CJK language, but the query expansion module had no stop-word support for it. This directly impacts memory search quality for Korean-speaking users.
Changes
- Stop words (STOP_WORDS_KO): ~70 common Korean particles (조사), pronouns (대명사), auxiliary verbs, conjunctions, adverbs, vague time references, and question words
- Tokenization: detects Hangul syllables (\uAC00-\uD7AF) and jamo (\u3131-\u3163), splits on spaces, and strips common trailing particles (e.g. 서버에서 → 서버) to improve keyword extraction
- Filtering: Korean stop words are checked in extractKeywords() alongside existing EN and ZH checks

All existing tests continue to pass.
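The detection and filtering steps described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `containsHangul`, `extractKoreanKeywords`, and the four-word `SAMPLE_STOP_WORDS_KO` set are hypothetical names and data chosen for the example. Only the two Unicode ranges and the idea of a stop-word set come from the description.

```typescript
// Hypothetical sketch; names and word list are illustrative, not the PR's actual code.

// Hangul syllables (\uAC00-\uD7AF) and compatibility jamo (\u3131-\u3163),
// the two ranges named in the PR description.
const HANGUL = /[\uAC00-\uD7AF\u3131-\u3163]/;

// Tiny illustrative subset. The PR's STOP_WORDS_KO is described as having ~70 entries.
const SAMPLE_STOP_WORDS_KO = new Set<string>(["그리고", "나는", "어제", "무엇"]);

function containsHangul(token: string): boolean {
  return HANGUL.test(token);
}

// Split on whitespace, keep Hangul tokens, drop stop words and duplicates.
function extractKoreanKeywords(query: string): string[] {
  const seen = new Set<string>();
  const keywords: string[] = [];
  for (const token of query.split(/\s+/)) {
    if (token === "" || !containsHangul(token)) {
      continue;
    }
    if (SAMPLE_STOP_WORDS_KO.has(token) || seen.has(token)) {
      continue;
    }
    seen.add(token);
    keywords.push(token);
  }
  return keywords;
}
```

For example, `extractKoreanKeywords("어제 서버 그리고 배포 서버")` keeps only 서버 and 배포: 어제 and 그리고 are in the sample stop-word set, and the second 서버 is a duplicate. Particle stripping is left out of this sketch to keep it short.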
Greptile Summary
This PR adds Korean language support to the FTS (full-text search) query expansion module, extending the existing English and Chinese support. The implementation includes ~70 Korean stop words, Hangul-aware tokenization that detects syllables (\uAC00-\uD7AF) and jamo (\u3131-\u3163), and particle-stripping logic that removes common Korean grammatical particles (e.g., 서버에서 → 서버) to improve keyword extraction quality.
- Particle stripping uses .toSorted() for robust longest-match-first ordering, with an isUsefulKoreanStem guard that prevents bogus single-syllable stems (e.g., 논의 is not incorrectly reduced to 논)
Confidence Score: 5/5
- .toSorted() usage is safe on the Node 22+ baseline. All edge cases (single-char stems, inflected stop words, mixed-language tokens) are handled with appropriate guards and tested.
Last reviewed commit: b11191e
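The longest-match-first ordering and the stem guard mentioned in the review can be illustrated with a short sketch. Everything here is assumed for illustration: the particle list, the `isUsefulStem` rule (at least two Hangul syllables), and the function names are not taken from the PR, and the real `isUsefulKoreanStem` may apply different criteria. The sketch copies the array and calls `sort` instead of `.toSorted()` so that it also compiles against older TypeScript targets.

```typescript
// Hypothetical particle list; the PR's actual list is longer and may differ.
const SAMPLE_PARTICLES = ["는", "에서", "에서는", "의", "를"];

// Longest first, so "에서는" is tried before "에서" and "는".
// Copy-then-sort gives the same ordering as .toSorted() without mutating the input.
const PARTICLES_LONGEST_FIRST = [...SAMPLE_PARTICLES].sort(
  (a, b) => b.length - a.length,
);

// Assumed guard: accept a stem only if it is two or more Hangul syllables.
function isUsefulStem(stem: string): boolean {
  return /^[\uAC00-\uD7AF]{2,}$/.test(stem);
}

// Strip at most one trailing particle; return the token unchanged if no
// particle matches or if stripping would leave a stem the guard rejects.
function stripTrailingParticle(token: string): string {
  for (const particle of PARTICLES_LONGEST_FIRST) {
    if (token.length > particle.length && token.endsWith(particle)) {
      const stem = token.slice(0, -particle.length);
      if (isUsefulStem(stem)) {
        return stem;
      }
    }
  }
  return token;
}
```

With this ordering, 학교에서는 is reduced to 학교 rather than to 학교에서, which is what a shortest-first pass would produce by matching 는 first. The guard leaves 논의 intact, because stripping 의 would leave the single syllable 논.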