Korean search & retrieval ML engineer — analyzer correctness, ranking losses, LLM serving · 7y · Python · PyTorch
Korean search, fixed where it breaks — upstream: Hangul NFD normalization into Lucene, approved by Robert Muir (#16242); the meaning-inverting nori XPN default (비급여 non-covered → 급여 covered) now warned in the official Elasticsearch docs (#151157); a ListMLE listwise-loss fix in sentence-transformers (#3827; maintainer-measured NanoBEIR nDCG@10 0.39 → 0.53). Previously 5.5y on the search team at 42Maru; now at MetLife on production ML.
Data that is valid on one side of a representation boundary silently breaks the other — NFD Hangul vs. the analyzer, stop strings vs. byte-fragment tokens, bf16 logits vs. a float32 loss. Korean hits these boundaries constantly; English-only test suites never do.
search_system — a Korean insurance-clause (약관) retrieval lab over 36,983 clause passages with 700 hand-graded queries: nori BM25 + BGE-M3 hybrid retrieval, analyzer probes, real-query failures. For each Korean failure I took upstream, the lab holds a before/after fixture tied to the fix and a regression test:
- XPN polarity (비급여 → 급여) — nori's default analyzer drops the meaning-bearing prefix, so 비급여 (non-covered) indexes as 급여 (covered) and opposite-meaning clauses become indistinguishable. Reproduced and pinned; documented upstream (Elasticsearch #151157).
- NFD Hangul — NFD-decomposed Hangul is unanalyzable as Korean. Fixed via the new HangulCompositionCharFilter (Lucene #16242, Muir-approved).
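The NFD boundary above can be reproduced with nothing but Python's standard library. This is a minimal sketch of the representation mismatch, not the Lucene char filter itself: `compose_hangul` is a hypothetical helper name, and it uses plain NFC normalization as a stand-in for whatever composition logic the upstream filter applies.

```python
import unicodedata

def compose_hangul(text: str) -> str:
    """Recompose conjoining jamo into precomposed syllables via NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)

# Precomposed form: two Hangul syllable code points.
nfc = unicodedata.normalize("NFC", "급여")
# Decomposed form: the same text as five conjoining jamo code points.
nfd = unicodedata.normalize("NFD", nfc)

print(len(nfc), len(nfd))          # code point counts differ
print(nfc == nfd)                  # byte-wise different strings
print(compose_hangul(nfd) == nfc)  # identical again after composition
```

Both strings render identically on screen, which is why the mismatch tends to go unnoticed until an analyzer that expects precomposed syllables receives the decomposed form.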
The lab is also where I compare offline variants — analyzer choices (형태소 분석기, morphological analyzers), fusion weights, reranker on/off — on the qrels benchmark, with decisions made on nDCG / Recall. The scorecard harness (nori-BM25 → BGE-M3 → RRF → cross-encoder, human-graded qrels, paired bootstrap) is implemented; numbers are to be updated (measurement planned within 1–2 weeks).
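The RRF step in that harness fuses the lexical and dense rankings by rank position rather than by raw score. A minimal sketch, assuming the commonly used constant k = 60 and hypothetical clause IDs; the lab's actual fusion code and parameters are not shown in this document.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_ranking = ["c1", "c2", "c3"]   # hypothetical nori-BM25 output
dense_ranking = ["c3", "c1", "c4"]  # hypothetical BGE-M3 output
fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)
```

Because only ranks enter the formula, BM25 scores and embedding similarities never need to be put on a common scale, which is the usual reason to use RRF as the fixed baseline against learned fusion weights.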
Built or prototyped in search_system / production:
- Ranking — LambdaMART / two-tower, late-interaction (ColBERT / MaxSim), hybrid fusion vs. fixed RRF.
- Serving — quantized + distilled reranker, p99 cascade budget, Docker / Kubernetes; FP8 dequant (transformers #46763); Transformers continuous-batching internals; vLLM Hermes tool-parser.
- LLM — RAG (MLX / vLLM) with citation / abstention eval, post-training (SFT / DPO / LoRA), LLM for search-quality (query rewriting, relevance judging).
- Data — Spark / Databricks embedding, near-real-time index refresh, Elasticsearch / OpenSearch + FAISS (C++) tuning.
- Recommendation — cross-sell with online A/B tests (MetLife).
Korean search & ranking — primary
- sentence-transformers #3827 — ListMLE/PListMLE listwise reranker losses mixed padding positions into the Plackett-Luce normalizer; excluded the padding. The maintainer measured NanoBEIR nDCG@10 0.39 → 0.53 (ListMLE). (merged)
- apache/lucene #16242 — new HangulCompositionCharFilter for analysis-nori: NFD-form Hangul was silently unanalyzable as Korean (#16241); approved by Robert Muir / Lucene PMC. (open)
- elastic/elasticsearch #151157 — found that nori's default analyzer silently strips Korean negation prefixes (비급여 non-covered → 급여 covered, 부담보 → 담보), so opposite-meaning clauses index identically; traced to the default XPN stop tag and now warned in the official Elasticsearch nori docs. (merged)
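The padding problem behind the ListMLE fix can be illustrated with a toy loss in pure Python. This is a sketch of the failure class only, not the sentence-transformers code: the padding label `PAD = -1` and the `mask_padding` switch are assumptions made for the illustration.

```python
import math

PAD = -1  # assumed padding label; real libraries may use a different convention

def _log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def listmle(scores, labels, mask_padding=True):
    """Toy ListMLE: negative Plackett-Luce log-likelihood of the label-sorted order."""
    real = [(s, y) for s, y in zip(scores, labels) if y != PAD]
    real.sort(key=lambda p: p[1], reverse=True)  # most relevant document first
    ordered = [s for s, _ in real]
    # With mask_padding=False, padded scores leak into every normalizer.
    pad_scores = [] if mask_padding else [s for s, y in zip(scores, labels) if y == PAD]
    loss = 0.0
    for i, s in enumerate(ordered):
        loss += _log_sum_exp(ordered[i:] + pad_scores) - s
    return loss

labels = [2, 1, 0, PAD, PAD]
scores_a = [1.5, 0.3, -0.2, 0.0, 0.0]  # padded slots hold 0.0
scores_b = [1.5, 0.3, -0.2, 9.0, 9.0]  # same real scores, different junk in padded slots

masked_a, masked_b = listmle(scores_a, labels), listmle(scores_b, labels)
leaky_a, leaky_b = listmle(scores_a, labels, False), listmle(scores_b, labels, False)
print(masked_a, masked_b)  # unaffected by the padded slots
print(leaky_a, leaky_b)    # depends on whatever sits in the padded slots
```

With masking, the loss depends only on the real documents. Without it, the value of the padded positions shifts every normalizer, so the gradient partly optimizes against positions that carry no relevance information.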
Embedding losses & model internals
- sentence-transformers #3817 — multi-GPU gather_across_devices: gathered positives in GISTEmbedLoss / CachedGISTEmbedLoss were masked as false negatives, so the cross-entropy target collapsed to -inf and the training signal silently vanished on rank > 0. Surfaced with a Korean polarity probe. (merged)
- sentence-transformers #3800 — bf16/fp16 training crash across six learning-to-rank losses. (merged)
- huggingface/transformers #46530 — StopStringCriteria misses CJK stop strings on byte-level tokenizers (#46519). (merged)
- huggingface/transformers #46670 — continuous batching returned live aliases of the growing token/logprob buffers; made it a snapshot. (merged)
- huggingface/transformers #46624 / #46763 — model/serving numeric internals: dynamic RoPE never reset inv_freq on the layer_type=None path; round the ue8m0 FP8 scale before quantizing so dequant matches the stored inverse. (merged)
- run-llama/llama_index #21900 — RecursionError in text splitters when a single CJK/emoji token exceeds chunk_size. (merged)
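The stop-string item in the list above comes down to token boundaries that cut through a multi-byte character. The sketch below is a toy illustration of that boundary, not the transformers implementation: the two-token split is a hypothetical one chosen to land inside the three UTF-8 bytes of a single Hangul syllable.

```python
stop_string = "끝"
encoded = "답변 끝".encode("utf-8")  # 3 + 3 + 1 + 3 = 10 bytes

# Hypothetical byte-level tokens: the boundary falls inside the last character.
tokens = [encoded[:-2], encoded[-2:]]

# Decoding each token on its own turns the split character into replacement marks.
per_token_text = "".join(t.decode("utf-8", errors="replace") for t in tokens)
naive_hit = stop_string in per_token_text

# Joining the raw bytes first and decoding once recovers the character.
buffered_text = b"".join(tokens).decode("utf-8")
buffered_hit = stop_string in buffered_text

print(naive_hit, buffered_hit)
```

ASCII-only stop strings never trigger this, since every ASCII character is a single byte, which is one reason English-only tests tend not to exercise the path.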
Open / active
- vllm-project/vllm #45168 — Hermes tool parser drops tool calls when a literal </tool_call> appears inside a JSON string argument (#45167). (open)
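The tool-call failure above can be shown with a toy pair of parsers. This is a sketch under assumptions, not vLLM's parser or the submitted patch: both function names and the `write_file` tool are hypothetical, and the JSON-aware variant simply lets `json.JSONDecoder.raw_decode` find the end of the object instead of searching for the closing tag.

```python
import json
import re

OPEN_TAG = "<tool_call>"

def parse_naive(text):
    """Split on tags first, then parse: breaks when the close tag appears in a string."""
    calls = []
    for body in re.findall(r"<tool_call>(.*?)</tool_call>", text, flags=re.DOTALL):
        try:
            calls.append(json.loads(body))
        except json.JSONDecodeError:
            pass  # the truncated body is not valid JSON, so the call is dropped
    return calls

def parse_json_aware(text):
    """Let the JSON decoder decide where each object ends."""
    decoder = json.JSONDecoder()
    calls, pos = [], 0
    while True:
        start = text.find(OPEN_TAG, pos)
        if start == -1:
            return calls
        idx = start + len(OPEN_TAG)
        while idx < len(text) and text[idx].isspace():
            idx += 1  # raw_decode does not skip leading whitespace
        try:
            obj, end = decoder.raw_decode(text, idx)
        except json.JSONDecodeError:
            pos = start + len(OPEN_TAG)
            continue
        calls.append(obj)
        pos = end

output = (
    '<tool_call>\n'
    '{"name": "write_file", "arguments": {"text": "close with </tool_call> here"}}\n'
    '</tool_call>'
)
naive_calls = parse_naive(output)
aware_calls = parse_json_aware(output)
print(len(naive_calls), len(aware_calls))
```

The design point is that a delimiter search cannot tell a structural closing tag from the same characters inside a quoted string, while a decoder that tracks string state can.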
Also (Korean & search infra) — Korean tokenizer offsets (spaCy #13974), Elasticsearch wildcard-normalizer escaping (#151008), FAISS musllinux wheels restored (#5272). Reported issue: NAVER hcx-vllm-plugin #5 (<|im_end|> parser boundary). Full PR list →
MetLife (current) — churn, fraud, agent activation, cross-sell on Azure ML / Databricks. Deploy, retrain, monitor; online A/B tests for model rollouts.
42Maru — search team, 5.5y. BM25 IR, contrastive retrieval, RAG QA, MRC, SFT / DPO / LoRA, large-scale indexing and crawlers.
Closed-source enterprise systems I worked on at 42Maru, with the research and engineering teams: Korean search quality, semantic QA, retrieval behavior, and OCR/NLP pipelines for real customer workflows.
- AI ship-sales design-support system — Daewoo Shipbuilding (DSME): semantic QA over ~100K historical records for shipowners' pre-contract technical inquiries. press
- AML / trade-based transaction detection — Hana Bank: OCR-NLP over cross-border remittance invoices. press
Government-published Korean NLP artifacts from 42Maru projects I worked on: five AI Hub releases across news MRC, national-archives LLM instruction data, finance/legal MRC, numeric reasoning MRC, and table QA. ~2.3M labeled QA pairs plus a ~300M-token corpus.
news MRC · national-archives LLM corpus · finance/legal MRC · numeric-reasoning MRC · table QA
- search_system — Korean clause retrieval lab: nori BM25 + BGE-M3 hybrid retrieval, analyzer probes, real-query failures, and traces that feed the upstream work above.
- Selected upstream workspaces — sentence-transformers, transformers, lucene, elasticsearch, vllm: short-lived branches for submitted fixes and repros.
- Domain probes — insurance-bias-probe: focused artifacts around insurance-domain behavior and model/system bias.



