Mingi Jeong Incheonkirin

Mingi Jeong (정민기)

Korean search & retrieval ML engineer — analyzer correctness, ranking losses, LLM serving · 7y · Python · PyTorch

Korean search, fixed where it breaks — upstream: Hangul NFD normalization into Lucene, approved by Robert Muir (#16242); the meaning-inverting nori XPN default (비급여 non-covered → 급여 covered) now warned in the official Elasticsearch docs (#151157); a ListMLE listwise-loss fix in sentence-transformers (#3827; maintainer-measured NanoBEIR nDCG@10 0.39 → 0.53). Previously 5.5y on the search team at 42Maru; now at MetLife on production ML.

Data that is valid on one side of a representation boundary silently breaks the other — NFD Hangul vs. the analyzer, stop strings vs. byte-fragment tokens, bf16 logits vs. a float32 loss. Korean hits these boundaries constantly; English-only test suites never do.

Search depth

search_system — a Korean insurance-clause (약관) retrieval lab over 36,983 clause passages with 700 hand-graded queries: nori BM25 + BGE-M3 hybrid retrieval, analyzer probes, real-query failures. For each Korean failure I took upstream, the lab holds a before/after fixture tied to the fix and a regression test:

XPN polarity (비급여 → 급여) — nori's default analyzer drops the meaning-bearing prefix, so 비급여 (non-covered) indexes as 급여 (covered) and opposite-meaning clauses become indistinguishable. Reproduced and pinned; documented upstream (Elasticsearch #151157).
NFD Hangul — NFD-decomposed Hangul is unanalyzable as Korean. Fixed via the new HangulCompositionCharFilter (Lucene #16242, Muir-approved).
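
To make the NFD boundary concrete, here is a minimal standard-library Python sketch. It is not the Lucene HangulCompositionCharFilter, which is a Java char filter; `compose_hangul` is a hypothetical helper name. The only assumption is that NFC composition before analysis is the relevant normalization step.

```python
import unicodedata

# Same word, two representations.
nfc = unicodedata.normalize("NFC", "비급여")  # 3 precomposed syllable code points
nfd = unicodedata.normalize("NFD", nfc)       # 7 conjoining jamo code points

def compose_hangul(text: str) -> str:
    """Hypothetical helper: recompose decomposed Hangul before analysis."""
    return unicodedata.normalize("NFC", text)

# The strings render identically but do not compare equal,
# so a dictionary-based analyzer keyed on syllables misses the NFD form.
print(len(nfc), len(nfd), nfc == nfd, compose_hangul(nfd) == nfc)
```

NFD input is common in practice, for example in filenames created on macOS, which is why the mismatch tends to appear only with real user data.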

The lab is also where I compare offline variants on the qrels benchmark: analyzer choices (형태소 분석기, morphological analyzers), fusion weights, and reranker on/off, decided by nDCG / Recall. The scorecard harness (nori-BM25 → BGE-M3 → RRF → cross-encoder, human-graded qrels, paired bootstrap) is implemented; numbers are still to be updated (measurement planned within 1–2 weeks).
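
For readers unfamiliar with the RRF stage of that harness, reciprocal rank fusion can be sketched as follows. This is a generic illustration, not the code in search_system: `rrf_fuse`, the document ids, and the constant `k=60` (a commonly used default) are all assumptions.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical top-3 lists from a lexical and a dense retriever.
bm25_top = ["c1", "c2", "c3"]
dense_top = ["c2", "c3", "c1"]
fused = rrf_fuse([bm25_top, dense_top])
print(fused)
```

RRF uses only ranks, so it needs no score calibration between BM25 and embedding similarity. That makes it a natural fixed baseline to compare tuned fusion weights against.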

Across the stack

Built or prototyped in search_system / production:

Ranking — LambdaMART / two-tower, late-interaction (ColBERT / MaxSim), hybrid fusion vs. fixed RRF.
Serving — quantized + distilled reranker, p99 cascade budget, Docker / Kubernetes; FP8 dequant (transformers #46763); Transformers continuous-batching internals; vLLM Hermes tool-parser.
LLM — RAG (MLX / vLLM) with citation / abstention eval, post-training (SFT / DPO / LoRA), LLM for search-quality (query rewriting, relevance judging).
Data — Spark / Databricks embedding, near-real-time index refresh, Elasticsearch / OpenSearch + FAISS (C++) tuning.
Recommendation — cross-sell with online A/B tests (MetLife).

Upstream contributions

Korean search & ranking — primary

sentence-transformers #3827 — ListMLE/PListMLE listwise reranker losses mixed padding positions into the Plackett-Luce normalizer; excluded the padding. The maintainer measured NanoBEIR nDCG@10 0.39 → 0.53 (ListMLE). (merged)
apache/lucene #16242 — new HangulCompositionCharFilter for analysis-nori: NFD-form Hangul was silently unanalyzable as Korean (#16241); approved by Robert Muir / Lucene PMC. (open)
elastic/elasticsearch #151157 — found that nori's default analyzer silently strips Korean negation prefixes (비급여 non-covered → 급여 covered, 부담보 → 담보), so opposite-meaning clauses index identically; traced to the default XPN stop tag and now warned in the official Elasticsearch nori docs. (merged)
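
The padding issue in the ListMLE item above can be illustrated with a small pure-Python sketch. It is not the sentence-transformers implementation; `listmle_loss` and its `is_real` mask argument are hypothetical names. The point is that padded slots must contribute neither to the per-position terms nor to the Plackett-Luce normalizer.

```python
import math

def listmle_loss(scores_in_true_order, is_real):
    """ListMLE negative log-likelihood for scores sorted best-first by label.

    Padded positions (is_real=False) are dropped before anything is summed,
    so they never enter the log-sum-exp normalizer.
    """
    real = [s for s, keep in zip(scores_in_true_order, is_real) if keep]
    loss = 0.0
    for i in range(len(real)):
        tail = real[i:]
        m = max(tail)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_norm - real[i]
    return loss

masked = listmle_loss([2.0, 1.0, 0.0, 0.0], [True, True, False, False])
unpadded = listmle_loss([2.0, 1.0], [True, True])
leaky = listmle_loss([2.0, 1.0, 0.0, 0.0], [True, True, True, True])
print(masked, unpadded, leaky)
```

With the mask applied, a padded list gives the same loss as the unpadded one. Treating padding as real documents inflates the normalizer and changes the loss.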

Embedding losses & model internals

sentence-transformers #3817 — multi-GPU gather_across_devices: gathered positives in GISTEmbedLoss/CachedGISTEmbedLoss were masked as false negatives, so the cross-entropy target collapsed to -inf and the training signal silently vanished on rank > 0. Surfaced with a Korean polarity probe. (merged)
sentence-transformers #3800 — bf16/fp16 training crash across six learning-to-rank losses. (merged)
huggingface/transformers #46530 — StopStringCriteria misses CJK stop strings on byte-level tokenizers (#46519). (merged)
huggingface/transformers #46670 — continuous batching returned live aliases of the growing token/logprob buffers; made it a snapshot. (merged)
huggingface/transformers #46624 / #46763 — model/serving numeric internals: dynamic RoPE never reset inv_freq on the layer_type=None path; round the ue8m0 FP8 scale before quantizing so dequant matches the stored inverse. (merged)
run-llama/llama_index #21900 — RecursionError in text splitters when a single CJK/emoji token exceeds chunk_size. (merged)
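
The byte-fragment failure class behind the stop-string item above can be reproduced without any tokenizer. The sketch below is not the transformers StopStringCriteria code; the token split is a hypothetical byte-level tokenization chosen to cut one Hangul syllable across three tokens.

```python
stop = "끝"
raw = "답변 끝".encode("utf-8")  # each Hangul syllable is 3 UTF-8 bytes

# Hypothetical byte-level tokens: the last syllable is split byte by byte.
tokens = [raw[:-2], raw[-2:-1], raw[-1:]]

# Decoding token by token turns the fragments into U+FFFD replacement chars,
# so the stop string never appears in the concatenated text.
per_token_text = "".join(t.decode("utf-8", errors="replace") for t in tokens)

# Joining the bytes first and decoding once recovers the syllable.
joined_text = b"".join(tokens).decode("utf-8", errors="replace")

print(stop in per_token_text, stop in joined_text)
```

ASCII stop strings never hit this path, because each ASCII character is a single byte. That is consistent with English-only test suites not catching it.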

Open / active

vllm-project/vllm #45168 — Hermes tool parser drops tool calls when a literal </tool_call> appears inside a JSON string argument (#45167). (open)
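
One way to see why a literal closing tag inside a JSON string is a problem, and one possible way around it, is sketched below. This is not vLLM's parser or the change proposed in the PR; `extract_tool_calls` is a hypothetical function. It assumes tool calls are JSON objects wrapped in `<tool_call>` ... `</tool_call>`, and it lets the JSON decoder find the end of the object instead of searching for the first closing tag.

```python
import json

OPEN, CLOSE = "<tool_call>", "</tool_call>"

def extract_tool_calls(text):
    """Parse each JSON object that follows an opening tag, using the JSON end offset."""
    calls, pos = [], 0
    decoder = json.JSONDecoder()
    while True:
        start = text.find(OPEN, pos)
        if start == -1:
            break
        i = start + len(OPEN)
        while i < len(text) and text[i].isspace():
            i += 1  # raw_decode does not skip leading whitespace
        try:
            obj, end = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            pos = start + len(OPEN)  # malformed body: skip this tag
            continue
        calls.append(obj)
        close = text.find(CLOSE, end)  # search only after the JSON ends
        pos = close + len(CLOSE) if close != -1 else end
    return calls

sample = (
    '<tool_call>\n'
    '{"name": "echo", "arguments": {"text": "literal </tool_call> inside"}}\n'
    '</tool_call>'
)
calls = extract_tool_calls(sample)
print(calls)
```

A parser that cuts at the first `</tool_call>` would hand truncated JSON to the decoder for this input, and the call would be lost.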

Also (Korean & search infra) — Korean tokenizer offsets (spaCy #13974), Elasticsearch wildcard-normalizer escaping (#151008), FAISS musllinux wheels restored (#5272). Reported issue: NAVER hcx-vllm-plugin #5 (<|im_end|> parser boundary). Full PR list →

Production & earlier

MetLife (current) — churn, fraud, agent activation, cross-sell on Azure ML / Databricks. Deploy, retrain, monitor; online A/B tests for model rollouts.

42Maru — search team, 5.5y. BM25 IR, contrastive retrieval, RAG QA, MRC, SFT / DPO / LoRA, large-scale indexing and crawlers.

Enterprise NLP/QA at 42Maru (press)

Closed-source enterprise systems I worked on at 42Maru, with the research and engineering teams: Korean search quality, semantic QA, retrieval behavior, and OCR/NLP pipelines for real customer workflows.

AI ship-sales design-support system — Daewoo Shipbuilding (DSME): semantic QA over ~100K historical records for shipowners' pre-contract technical inquiries. press
AML / trade-based transaction detection — Hana Bank: OCR-NLP over cross-border remittance invoices. press

Public artifacts from 42Maru — NIA AI Hub

Government-published Korean NLP artifacts from 42Maru projects I worked on: five AI Hub releases across news MRC, national-archives LLM instruction data, finance/legal MRC, numeric reasoning MRC, and table QA. ~2.3M labeled QA pairs plus a ~300M-token corpus.

news MRC · national-archives LLM corpus · finance/legal MRC · numeric-reasoning MRC · table QA

Repo map

search_system — Korean clause retrieval lab: nori BM25 + BGE-M3 hybrid retrieval, analyzer probes, real-query failures, and traces that feed the upstream work above.
Selected upstream workspaces — sentence-transformers, transformers, lucene, elasticsearch, vllm: short-lived branches for submitted fixes and repros.
Domain probes — insurance-bias-probe: focused artifacts around insurance-domain behavior and model/system bias.
