
feat: add Korean language support for memory search query expansion #18899

Merged
vincentkoc merged 7 commits into openclaw:main from ruypang:feature/korean-query-expansion
Feb 22, 2026

Conversation

@ruypang
Contributor

@ruypang ruypang commented Feb 17, 2026

Summary

Korean is the third most common CJK language, but the query expansion module had no stop-word support for it. This directly impacts memory search quality for Korean-speaking users.

Changes

  • Korean stop words (STOP_WORDS_KO): ~70 common Korean particles (조사), pronouns (대명사), auxiliary verbs, conjunctions, adverbs, vague time references, and question words
  • Hangul-aware tokenization: Detects Hangul syllables (\uAC00-\uD7AF) and jamo (\u3131-\u3163), splits on spaces, and strips common trailing particles (e.g. 서버에서 → 서버) to improve keyword extraction
  • Korean stop word filtering in extractKeywords() alongside existing EN and ZH checks
  • Tests for Korean queries: keyword extraction, particle stripping, stop word filtering, and mixed Korean/English queries
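The Hangul detection and trailing-particle stripping described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the names `HANGUL_RE`, `SAMPLE_PARTICLES`, `containsHangul`, and `stripTrailingParticle` are hypothetical, and the particle list is a small assumed sample rather than the set the PR ships.

```typescript
// Hypothetical sketch: names and the particle list are illustrative assumptions.

// Matches Hangul syllables (U+AC00-U+D7AF) and compatibility jamo (U+3131-U+3163),
// the two ranges named in the PR description.
const HANGUL_RE = /[\uAC00-\uD7AF\u3131-\u3163]/;

// A small assumed sample of trailing particles, not the PR's full list.
const SAMPLE_PARTICLES = ["에서", "으로", "에게", "은", "는", "이", "가", "을", "를", "의", "에", "로"];

// Try longer particles first so that "에서" is matched before "에".
const PARTICLES_BY_LENGTH = [...SAMPLE_PARTICLES].sort((a, b) => b.length - a.length);

function containsHangul(text: string): boolean {
  return HANGUL_RE.test(text);
}

function stripTrailingParticle(token: string): string {
  for (const particle of PARTICLES_BY_LENGTH) {
    // Require at least one character to remain after stripping.
    if (token.length > particle.length && token.endsWith(particle)) {
      return token.slice(0, -particle.length);
    }
  }
  return token;
}
```

With this sample list, `stripTrailingParticle("서버에서")` yields `"서버"`, and a token without a listed particle is returned unchanged.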

All existing tests continue to pass.

Greptile Summary

This PR adds Korean language support to the FTS (full-text search) query expansion module, completing CJK coverage alongside existing English and Chinese support. The implementation includes ~70 Korean stop words, Hangul-aware tokenization that detects syllables (\uAC00-\uD7AF) and jamo (\u3131-\u3163), and particle-stripping logic that removes common Korean grammatical particles (e.g., 서버에서 → 서버) to improve keyword extraction quality.

  • Korean stop words cover particles, pronouns, auxiliary verbs, conjunctions, adverbs, vague time references, and question words — matching the structure of the existing English and Chinese sets
  • Trailing particle stripping uses .toSorted() for robust longest-match-first ordering, with an isUsefulKoreanStem guard that prevents bogus single-syllable stems (e.g., 논의 is not incorrectly reduced to 논)
  • Both the original token and the stripped stem are emitted for FTS, maximizing match potential
  • 8 new test cases cover keyword extraction, particle stripping, stop word filtering (including inflected forms), and mixed Korean/English queries
  • No issues found — the implementation is clean, follows existing patterns, and stays within repository coding guidelines
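The stem guard and the "emit both forms" behavior in the summary above might look something like the sketch below. The review only names `isUsefulKoreanStem`; its actual rules are not shown on this page, so `isUsefulStem` and `expandToken` here are hypothetical stand-ins built on one assumed rule (reject Hangul-only stems shorter than two syllables). The sketch also sorts a copy with `sort()` where the PR reportedly uses `.toSorted()`.

```typescript
// Hypothetical sketch: the guard rule below is an assumption, not the PR's logic.

const HANGUL_ONLY_RE = /^[\uAC00-\uD7AF]+$/;

// Assumed rule: a stem made only of Hangul syllables must keep at least two
// syllables, so "논의" minus "의" (leaving "논") is rejected.
function isUsefulStem(stem: string): boolean {
  if (HANGUL_ONLY_RE.test(stem)) {
    return stem.length >= 2;
  }
  return stem.length > 0;
}

// Returns the original token, plus the stripped stem when the guard accepts it.
function expandToken(token: string, particles: readonly string[]): string[] {
  // Longest-match-first ordering on a copy of the caller's list.
  const ordered = [...particles].sort((a, b) => b.length - a.length);
  const variants = [token];
  for (const particle of ordered) {
    if (token.length > particle.length && token.endsWith(particle)) {
      const stem = token.slice(0, -particle.length);
      if (isUsefulStem(stem)) {
        variants.push(stem);
      }
      break; // Only the longest matching particle is considered.
    }
  }
  return variants;
}
```

Under these assumptions, `expandToken("서버에서", ["에서", "의"])` produces both `서버에서` and `서버`, while `expandToken("논의", ["에서", "의"])` produces only `논의`.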

Confidence Score: 5/5

  • This PR is safe to merge — it adds additive functionality with no changes to existing behavior and comprehensive test coverage.
  • The changes are purely additive: new stop word sets, new tokenizer branch, and new tests. No existing code paths are modified in a way that could break current English or Chinese functionality. The Korean tokenizer branch is gated behind a Hangul regex check, so it only activates for Korean text. The .toSorted() usage is safe on the Node 22+ baseline. All edge cases (single-char stems, inflected stop words, mixed-language tokens) are handled with appropriate guards and tested.
  • No files require special attention
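To illustrate how a Hangul regex gate keeps the Korean branch from affecting other languages, here is a minimal sketch of per-language stop word filtering. The function `filterKeywords` and both stop word sets are hypothetical; the sample entries are assumptions and are not taken from the PR's STOP_WORDS_KO.

```typescript
// Hypothetical sketch: function name and stop word samples are assumptions.

const HANGUL_GATE_RE = /[\uAC00-\uD7AF\u3131-\u3163]/;

// Tiny assumed samples standing in for the real stop word sets.
const SAMPLE_STOP_WORDS_KO = new Set(["그리고", "어제", "무엇"]);
const SAMPLE_STOP_WORDS_EN = new Set(["the", "a", "is", "about"]);

function filterKeywords(query: string): string[] {
  const seen = new Set<string>();
  for (const raw of query.split(/\s+/)) {
    const token = raw.toLowerCase();
    if (!token) {
      continue;
    }
    if (HANGUL_GATE_RE.test(token)) {
      // Korean branch: only reached for tokens that contain Hangul.
      if (SAMPLE_STOP_WORDS_KO.has(token)) {
        continue;
      }
    } else if (SAMPLE_STOP_WORDS_EN.has(token)) {
      // Non-Korean tokens follow the pre-existing path unchanged.
      continue;
    }
    seen.add(token); // The Set deduplicates repeated keywords.
  }
  return Array.from(seen);
}
```

For a mixed query such as `"어제 the 서버 deploy 그리고 서버"`, this sketch keeps `서버` and `deploy` and drops the rest.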

Last reviewed commit: b11191e


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 170524c567


Contributor

@greptile-apps greptile-apps bot left a comment


2 files reviewed, 2 comments


@ruypang
Contributor Author

ruypang commented Feb 19, 2026

Friendly reminder — this PR is ready for review. All CI checks are passing. Would love to get a maintainer's eyes on this when you get a chance! 🙏

@ruypang ruypang force-pushed the feature/korean-query-expansion branch from fba41f4 to ba569f7 on February 20, 2026 02:11
@ruypang
Contributor Author

ruypang commented Feb 20, 2026

Hi @gumadeiras @vincentkoc 👋 Just rebased onto latest main — all CI should be green. Would really appreciate a review when you get a chance. Happy to address any feedback. Thanks! 🙏

@vincentkoc
Member

Addressed the open review feedback and pushed follow-up fixes in 135fda872bb5dc5b14ede15aecce27d6653f03b2.

@greptileai review

@vincentkoc
Member

Changelog credit added for this PR in b11191e5df41fed6402b5f23888d161c7bd0e10a.

@greptileai review

@vincentkoc vincentkoc merged commit 853ae62 into openclaw:main Feb 22, 2026
24 of 25 checks passed
BunsDev pushed a commit that referenced this pull request Feb 22, 2026
…18899)

* feat: add Korean stop words and tokenization for memory search

* fix: address review comments on Korean query expansion

* fix: lint errors - curly brace and toSorted

* fix(memory): improve Korean stop words and deduplicate

* Memory: tighten Korean query expansion filtering

* Docs/Changelog: credit Korean memory query expansion

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
obviyus pushed a commit to guirguispierre/openclaw that referenced this pull request Feb 22, 2026
00xglitch pushed a commit to 00xglitch/openclaw that referenced this pull request Feb 22, 2026
00xglitch pushed a commit to 00xglitch/openclaw that referenced this pull request Feb 23, 2026
7Sageer pushed a commit to 7Sageer/openclaw that referenced this pull request Feb 23, 2026
gabrielkoo pushed a commit to gabrielkoo/openclaw that referenced this pull request Feb 23, 2026
mreedr pushed a commit to mreedr/openclaw-custom that referenced this pull request Feb 24, 2026
00xglitch pushed a commit to 00xglitch/openclaw that referenced this pull request Feb 24, 2026
00xglitch pushed a commit to 00xglitch/openclaw that referenced this pull request Feb 24, 2026
zooqueen pushed a commit to hanzoai/bot that referenced this pull request Mar 6, 2026

2 participants