Incheonkirin/README.md

Mingi Jeong (정민기)

Korean search & retrieval ML engineer — analyzer correctness, ranking losses, LLM serving · 7y · Python · PyTorch

Korean search, fixed where it breaks — upstream: Hangul NFD normalization into Lucene, approved by Robert Muir (#16242); the meaning-inverting nori XPN default (비급여 non-covered → 급여 covered) now warned in the official Elasticsearch docs (#151157); a ListMLE listwise-loss fix in sentence-transformers (#3827; maintainer-measured NanoBEIR nDCG@10 0.39 → 0.53). Previously 5.5y on the search team at 42Maru; now at MetLife on production ML.

LinkedIn Email


Data that is valid on one side of a representation boundary silently breaks the other — NFD Hangul vs. the analyzer, stop strings vs. byte-fragment tokens, bf16 logits vs. a float32 loss. Korean hits these boundaries constantly; English-only test suites never do.
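
The NFD case can be seen without any search engine. A minimal sketch using only Python's standard `unicodedata` module; the sample word is taken from the polarity example in this README, and the variable names are illustrative:

```python
import unicodedata

word = "비급여"  # "non-covered"

# NFC keeps one code point per syllable; NFD splits each syllable into jamo.
nfc = unicodedata.normalize("NFC", word)
nfd = unicodedata.normalize("NFD", word)

# Both strings render identically, but they are different code point sequences,
# so a component that expects precomposed syllables sees unfamiliar input.
print(len(nfc), len(nfd))   # 3 syllables vs. 7 jamo
print(nfc == nfd)           # False
print([f"U+{ord(ch):04X}" for ch in nfd])

# Recomposing restores the form the analyzer expects.
recomposed = unicodedata.normalize("NFC", nfd)
print(recomposed == nfc)    # True
```

NFD text commonly arrives from sources such as macOS file names, which is one reason the mismatch goes unnoticed in English-only tests.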

Search depth

search_system — a Korean insurance-clause (약관) retrieval lab over 36,983 clause passages with 700 hand-graded queries: nori BM25 + BGE-M3 hybrid retrieval, analyzer probes, real-query failures. For each Korean failure I took upstream, the lab holds a before/after fixture tied to the fix and a regression test:

  • XPN polarity (비급여 → 급여) — nori's default analyzer drops the meaning-bearing prefix, so 비급여 (non-covered) indexes as 급여 (covered) and opposite-meaning clauses become indistinguishable. Reproduced and pinned; documented upstream (Elasticsearch #151157).
  • NFD Hangul — NFD-decomposed Hangul is unanalyzable as Korean. Fixed via the new HangulCompositionCharFilter (Lucene #16242, Muir-approved).
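
One way to probe the XPN behavior is to define a custom `nori_part_of_speech` filter whose `stoptags` omit `XPN`. The sketch below only builds the index-settings body as a Python dict. The tag list is what the documented defaults are believed to be, minus `XPN`, and should be checked against the Elasticsearch version in use. The filter, tokenizer, and analyzer names are hypothetical, and whether keeping the prefix token is the right mitigation for a given index is an assumption, not something this README establishes.

```python
import json

# Believed to mirror the documented default stoptags of nori_part_of_speech,
# with "XPN" (prefix) removed. Assumption: verify against your ES version.
STOPTAGS_KEEP_PREFIX = [
    "E", "IC", "J", "MAG", "MAJ", "MM", "SP", "SSC", "SSO", "SC", "SE",
    "XSA", "XSN", "XSV", "UNA", "NA", "VSV",
]


def build_index_settings() -> dict:
    """Return an index-settings body with a nori analyzer that keeps prefixes."""
    return {
        "settings": {
            "analysis": {
                "tokenizer": {
                    # Hypothetical name; "nori_tokenizer" is the plugin's tokenizer type.
                    "ko_tokenizer": {"type": "nori_tokenizer"}
                },
                "filter": {
                    # Hypothetical name; overrides the default stop tags.
                    "ko_pos_keep_xpn": {
                        "type": "nori_part_of_speech",
                        "stoptags": STOPTAGS_KEEP_PREFIX,
                    }
                },
                "analyzer": {
                    "ko_keep_prefix": {
                        "type": "custom",
                        "tokenizer": "ko_tokenizer",
                        "filter": ["ko_pos_keep_xpn"],
                    }
                },
            }
        }
    }


settings_body = build_index_settings()
print(json.dumps(settings_body, ensure_ascii=False, indent=2))
```

Running the `_analyze` API against both the default `nori` analyzer and an analyzer like this one on 비급여 is a quick way to compare the emitted tokens.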

The lab is also where I compare offline variants — analyzer choices (형태소 분석기, morphological analyzers), fusion weights, reranker on/off — on the qrels benchmark, decided by nDCG / Recall. The scorecard harness (nori-BM25 → BGE-M3 → RRF → cross-encoder, human-graded qrels, paired bootstrap) is implemented; results are pending, with measurement planned within 1–2 weeks.
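
For readers unfamiliar with the RRF step in that pipeline, here is a minimal sketch of reciprocal rank fusion over two ranked lists. It is not the lab's harness: the function name, the document IDs, and the constant `k=60` (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by doc ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a lexical and a dense retriever for one query.
bm25_ranking = ["c1", "c2", "c3"]
dense_ranking = ["c2", "c4", "c1"]

fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)  # ['c2', 'c1', 'c4', 'c3']
```

RRF uses ranks only, so it needs no score calibration between BM25 and the dense retriever. That is the usual reason to treat it as a fixed baseline when comparing against weighted fusion.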

Across the stack

Built or prototyped in search_system / production:

  • Ranking — LambdaMART / two-tower, late-interaction (ColBERT / MaxSim), hybrid fusion vs. fixed RRF.
  • Serving — quantized + distilled reranker, p99 cascade budget, Docker / Kubernetes; FP8 dequant (transformers #46763); Transformers continuous-batching internals; vLLM Hermes tool-parser.
  • LLM — RAG (MLX / vLLM) with citation / abstention eval, post-training (SFT / DPO / LoRA), LLM for search-quality (query rewriting, relevance judging).
  • Data — Spark / Databricks embedding, near-real-time index refresh, Elasticsearch / OpenSearch + FAISS (C++) tuning.
  • Recommendation — cross-sell with online A/B tests (MetLife).

Upstream contributions

Korean search & ranking — primary

  • sentence-transformers #3827 — ListMLE/PListMLE listwise reranker losses mixed padding positions into the Plackett-Luce normalizer; excluded the padding. The maintainer measured NanoBEIR nDCG@10 0.39 → 0.53 (ListMLE). (merged)
  • apache/lucene #16242 — new HangulCompositionCharFilter for analysis-nori: NFD-form Hangul was silently unanalyzable as Korean (#16241); approved by Robert Muir / Lucene PMC. (open)
  • elastic/elasticsearch #151157 — found that nori's default analyzer silently strips Korean negation prefixes (비급여 non-covered → 급여 covered, 부담보 → 담보), so opposite-meaning clauses index identically; traced to the default XPN stop tag and now warned in the official Elasticsearch nori docs. (merged)
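
The padding issue behind #3827 can be illustrated with a plain-Python ListMLE (Plackett-Luce negative log-likelihood). This is a sketch of the failure class, not the sentence-transformers code: the function name, the `leak_padding` flag, and the pad score of 0.0 are hypothetical.

```python
import math


def listmle_loss(scores, labels, mask, leak_padding=False):
    """ListMLE loss for one query.

    scores: model scores per slot; labels: graded relevance per slot;
    mask: True for real documents, False for padding slots.
    With leak_padding=True, pad scores are added to every normalizer,
    which reproduces the kind of bug described above.
    """
    real = [(s, y) for s, y, m in zip(scores, labels, mask) if m]
    pads = [s for s, m in zip(scores, mask) if not m]
    real.sort(key=lambda pair: -pair[1])  # ground-truth order, best first
    ordered = [s for s, _ in real]

    loss = 0.0
    for i, s_i in enumerate(ordered):
        tail = ordered[i:] + (pads if leak_padding else [])
        peak = max(tail)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in tail))
        loss += log_norm - s_i
    return loss


# One real document plus two padding slots scored 0.0.
scores, labels, mask = [2.0, 0.0, 0.0], [1, 0, 0], [True, False, False]
clean = listmle_loss(scores, labels, mask)
leaky = listmle_loss(scores, labels, mask, leak_padding=True)
print(clean, leaky)
```

With a single real document the correct loss is zero, because there is only one possible permutation. The leaky variant reports a positive loss and therefore produces gradients driven by padding alone.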

Embedding losses & model internals

  • sentence-transformers #3817 — multi-GPU gather_across_devices: gathered positives in GISTEmbedLoss/CachedGISTEmbedLoss were masked as false negatives, so the cross-entropy target collapsed to -inf and the training signal silently vanished on rank > 0. Surfaced with a Korean polarity probe. (merged)
  • sentence-transformers #3800 — bf16/fp16 training crash across six learning-to-rank losses. (merged)
  • huggingface/transformers #46530 — StopStringCriteria misses CJK stop strings on byte-level tokenizers (#46519). (merged)
  • huggingface/transformers #46670 — continuous batching returned live aliases of the growing token/logprob buffers; made it a snapshot. (merged)
  • huggingface/transformers #46624 / #46763 — model/serving numeric internals: dynamic RoPE never reset inv_freq on the layer_type=None path; round the ue8m0 FP8 scale before quantizing so dequant matches the stored inverse. (merged)
  • run-llama/llama_index #21900 — RecursionError in text splitters when a single CJK/emoji token exceeds chunk_size. (merged)
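
The byte-fragment problem behind the CJK stop-string item can be shown with the standard library alone. This sketch is not the transformers `StopStringCriteria` implementation. It assumes a byte-level tokenizer that happens to split one Hangul syllable across two tokens, and the helper names are hypothetical.

```python
stop = "끝"  # a one-syllable Korean stop string ("end")
stop_bytes = stop.encode("utf-8")  # a Hangul syllable is 3 bytes in UTF-8

# Hypothetical byte-level tokenization that cuts inside the character.
fragments = [stop_bytes[:2], stop_bytes[2:]]


def naive_per_token_match(token_bytes, stop_string):
    """Decode each token on its own and look for the stop string."""
    return any(
        stop_string in tok.decode("utf-8", errors="replace")
        for tok in token_bytes
    )


def buffered_match(token_bytes, stop_string):
    """Accumulate raw bytes and decode the running buffer instead."""
    buffer = b""
    for tok in token_bytes:
        buffer += tok
        # errors="ignore" drops a trailing incomplete sequence until it completes.
        if stop_string in buffer.decode("utf-8", errors="ignore"):
            return True
    return False


print(len(stop_bytes))                         # 3
print(naive_per_token_match(fragments, stop))  # False: each fragment is invalid alone
print(buffered_match(fragments, stop))         # True
```

ASCII stop strings never hit this path because each ASCII character is a single byte, which is why English-only tests pass.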

Open / active

Also (Korean & search infra) — Korean tokenizer offsets (spaCy #13974), Elasticsearch wildcard-normalizer escaping (#151008), FAISS musllinux wheels restored (#5272). Reported issue: NAVER hcx-vllm-plugin #5 (<|im_end|> parser boundary). Full PR list →


Production & earlier

MetLife (current) — churn, fraud, agent activation, cross-sell on Azure ML / Databricks. Deploy, retrain, monitor; online A/B tests for model rollouts.

42Maru — search team, 5.5y. BM25 IR, contrastive retrieval, RAG QA, MRC, SFT / DPO / LoRA, large-scale indexing and crawlers.

Enterprise NLP/QA at 42Maru (press)

Closed-source enterprise systems I worked on at 42Maru, with the research and engineering teams: Korean search quality, semantic QA, retrieval behavior, and OCR/NLP pipelines for real customer workflows.

  • AI ship-sales design-support system — Daewoo Shipbuilding (DSME): semantic QA over ~100K historical records for shipowners' pre-contract technical inquiries. press
  • AML / trade-based transaction detection — Hana Bank: OCR-NLP over cross-border remittance invoices. press

Public artifacts from 42Maru — NIA AI Hub

Government-published Korean NLP artifacts from 42Maru projects I worked on: five AI Hub releases across news MRC, national-archives LLM instruction data, finance/legal MRC, numeric reasoning MRC, and table QA. ~2.3M labeled QA pairs plus a ~300M-token corpus.

news MRC · national-archives LLM corpus · finance/legal MRC · numeric-reasoning MRC · table QA


Repo map


Stack

Python PyTorch Transformers sentence-transformers vLLM MLflow Elasticsearch / Lucene Hybrid Retrieval / RAG
