thedavidweng/lyric-romanizer

lyric-romanizer

Philosophy: Don't reinvent the wheel. This project deliberately avoids building romanization logic from scratch. Instead, it composes best-in-class, community-maintained libraries — one for each script — and focuses on the orchestration layer: script detection, engine routing, dialect handling, and a unified API. Every romanization engine in the dependency list is a dedicated, battle-tested library maintained by domain experts. That's the point.

Script detection and local romanization engine for lyrics. Supports 12 scripts across Japanese, Chinese (Mandarin and Cantonese), Korean, Cyrillic, Indic, Tamil, and Thai — all running locally with zero API calls.

Extracted from Spotify Karaoke. Used by OpenKara.

Features

  • Zero API calls: all romanization runs locally
  • Auto script detection: pass in text, get back the detected script
  • 12 scripts: Japanese, Chinese, Korean, Cyrillic, 6 Indic scripts (Devanagari, Gujarati, Gurmukhi, Telugu, Kannada, Odia), Tamil, Thai
  • Cantonese support: Jyutping alongside default Mandarin Pinyin
  • Lightweight detector subpath: import only script detection without pulling in romanization engines
  • Ukrainian-aware Cyrillic: auto-detects Ukrainian-specific characters and applies the correct transliteration preset

Installation

npm install lyric-romanizer
yarn add lyric-romanizer
pnpm add lyric-romanizer

Quick Start

import { createRomanizer, detectScript } from 'lyric-romanizer';

const romanizer = createRomanizer();

// Auto-detect script and romanize
const result = await romanizer.romanizeLines(['你好世界', '你好']);
// { script: 'chinese', lines: ['nǐ hǎo shì jiè', 'nǐ hǎo'] }

// Romanize a single line
const line = await romanizer.romanizeLine('안녕하세요');
// 'annyeonghaseyo'

API

Imports

// Main entry — full romanization engine
import {
  createRomanizer,
  detectScript,
  isLatinScript,
  requiresExternalRomanization,
  UnsupportedRomanizationError,
} from 'lyric-romanizer';

// Detector-only subpath — lightweight, no romanization dependencies
import { detectScript, isLatinScript, NON_LATIN_SCRIPT_RE } from 'lyric-romanizer/detector';

Types

type ScriptType =
  | 'japanese' | 'chinese' | 'korean' | 'cyrillic'
  | 'devanagari' | 'gujarati' | 'gurmukhi' | 'telugu'
  | 'kannada' | 'odia' | 'tamil' | 'malayalam'
  | 'bengali' | 'arabic' | 'hebrew' | 'thai'
  | 'other';

interface Romanizer {
  romanizeLine(line: string, options?: RomanizeOptions): Promise<string>;
  romanizeLines(lines: readonly string[], options?: RomanizeOptions): Promise<RomanizeResult>;
}

type RomanizeOptions = { script?: ScriptType; dialect?: 'mandarin' | 'cantonese' };
type RomanizeResult = { script: ScriptType; lines: string[] };
type RomanizerOptions = { japaneseDictPath?: string };

createRomanizer(options?)

Factory that returns a Romanizer instance. The Kuroshiro engine (Japanese) is lazily initialized on first use and cached.

const romanizer = createRomanizer();

// Override the Kuromoji dictionary CDN path (e.g. for self-hosting).
// A separate name avoids redeclaring `romanizer` from the line above.
const selfHostedRomanizer = createRomanizer({
  japaneseDictPath: 'https://my-cdn.com/kuromoji/dict',
});
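
The "lazily initialized on first use and cached" behavior can be pictured with a small promise cache. This is a sketch of the general pattern under stated assumptions, not the library's actual code: `createLazy`, `loadEngine`, and `getEngine` are hypothetical names, and the engine object is a placeholder.

```typescript
// Hypothetical helper: wraps an expensive async initializer so it runs at most once.
function createLazy<T>(init: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) {
      // Cache the promise itself so concurrent first calls share one initialization.
      cached = init().catch((err) => {
        cached = undefined; // allow a retry after a failed load
        throw err;
      });
    }
    return cached;
  };
}

// Placeholder for an expensive engine load (for example, fetching a dictionary).
let loadCount = 0;
async function loadEngine(): Promise<{ name: string }> {
  loadCount += 1;
  return { name: 'demo-engine' };
}

const getEngine = createLazy(loadEngine);
```

Caching the promise rather than the resolved value means two lines romanized at the same time do not trigger two loads.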

detectScript(lines)

Detects the dominant script in the given text lines. It first checks for Japanese kana, which is treated as definitive; if none is found, it scores every other script by character count and returns the highest-scoring one.

detectScript(['こんにちは']);          // 'japanese'
detectScript(['你好世界']);            // 'chinese'
detectScript(['Привет']);             // 'cyrillic'
detectScript(['Hello world']);        // 'latin'
detectScript(['123 ???']);            // 'other'
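
One way to picture the kana-first, then score-by-count strategy is the sketch below. It is an illustration under simplifying assumptions, not the library's implementation: `sketchDetect` is a hypothetical name, only four scripts are covered, the Unicode ranges are basic blocks chosen for brevity, and Latin text is not modeled (it falls into 'other' here).

```typescript
type SketchScript = 'japanese' | 'chinese' | 'korean' | 'cyrillic' | 'other';

// Basic-block ranges only; a real detector would cover more scripts and ranges.
const KANA_RE = /[\u3040-\u30FF]/; // Hiragana + Katakana
const SCORED_RANGES: Array<[SketchScript, RegExp]> = [
  ['chinese', /[\u4E00-\u9FFF]/g],  // CJK Unified Ideographs
  ['korean', /[\uAC00-\uD7AF]/g],   // Hangul syllables
  ['cyrillic', /[\u0400-\u04FF]/g], // Cyrillic
];

function sketchDetect(lines: readonly string[]): SketchScript {
  const text = lines.join('\n');
  // Step 1: any kana is treated as definitive evidence of Japanese.
  if (KANA_RE.test(text)) return 'japanese';
  // Step 2: otherwise pick the script with the most matching characters.
  let best: SketchScript = 'other';
  let bestCount = 0;
  for (const [script, re] of SCORED_RANGES) {
    const count = (text.match(re) ?? []).length;
    if (count > bestCount) {
      best = script;
      bestCount = count;
    }
  }
  return best;
}
```

The kana check has to come first because Japanese lyrics usually mix kana with kanji, and the kanji alone would otherwise score as Chinese.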

isLatinScript(lines)

Fast check — returns true if the text contains only Latin letters (no CJK, Cyrillic, Indic, etc.). Useful for skipping romanization entirely.

isLatinScript(['Hello world']);  // true
isLatinScript(['안녕하세요']);    // false
isLatinScript(['♪♪♪']);         // false (no letters)
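
The three results above follow from a rule like "at least one Latin letter and no character from a non-Latin script". The sketch below implements that rule with assumed, simplified Unicode ranges; `sketchIsLatin` and both regexes are hypothetical, and the library's own `NON_LATIN_SCRIPT_RE` may cover different ranges.

```typescript
// Basic Latin letters plus Latin-1 Supplement and Latin Extended-A/B letters.
const LATIN_LETTER_RE = /[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]/;

// Cyrillic, Hebrew, Arabic, Devanagari through Malayalam, Thai, Hangul, kana, CJK.
const NON_LATIN_SKETCH_RE =
  /[\u0400-\u04FF\u0590-\u06FF\u0900-\u0D7F\u0E00-\u0E7F\u1100-\u11FF\uAC00-\uD7AF\u3040-\u30FF\u4E00-\u9FFF]/;

function sketchIsLatin(lines: readonly string[]): boolean {
  const text = lines.join('\n');
  // Symbols and digits alone do not count: a Latin letter must be present.
  return LATIN_LETTER_RE.test(text) && !NON_LATIN_SKETCH_RE.test(text);
}
```

Because both checks are single regex tests, a caller can run them on every track and load romanization engines only when the result is false.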

requiresExternalRomanization(script)

Returns true for scripts that cannot be romanized locally and require an external API.

requiresExternalRomanization('chinese');   // false
requiresExternalRomanization('arabic');    // true
requiresExternalRomanization('malayalam'); // true

romanizer.romanizeLine(line, options?)

Romanizes a single line. If script is omitted, it is auto-detected via detectScript. Returns the original line unchanged for Latin text or non-letter content.

For Chinese text, the dialect option controls the romanization system: 'mandarin' (default) uses Pinyin, 'cantonese' uses Jyutping.

Throws UnsupportedRomanizationError for scripts that require an external API (those for which requiresExternalRomanization returns true).

await romanizer.romanizeLine('你好世界');
// 'nǐ hǎo shì jiè' (default: Mandarin/Pinyin)

await romanizer.romanizeLine('你好', { dialect: 'cantonese' });
// 'nei5 hou2' (Jyutping)

await romanizer.romanizeLine('Привет мир');
// 'Privet mir'

await romanizer.romanizeLine('Hello world');
// 'Hello world' (no-op)

await romanizer.romanizeLine('مرحبا');
// throws UnsupportedRomanizationError { script: 'arabic' }

romanizer.romanizeLines(lines, options?)

Romanizes multiple lines in parallel. Returns the detected script and romanized lines.

const { script, lines } = await romanizer.romanizeLines([
  'สวัสดี',
  'ชาวโลก',
]);
// { script: 'thai', lines: ['sawatdi', 'chaolok'] }
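
The "detect once, then romanize in parallel" shape can be sketched with Promise.all. This is an assumed structure, not the library's source: `sketchRomanizeLines` is a hypothetical name, and `detect` and `romanizeOne` are hypothetical parameters standing in for the real detector and engines.

```typescript
type SketchResult = { script: string; lines: string[] };

async function sketchRomanizeLines(
  lines: readonly string[],
  detect: (lines: readonly string[]) => string,
  romanizeOne: (line: string, script: string) => Promise<string>,
): Promise<SketchResult> {
  // Detect the dominant script once for the whole batch.
  const script = detect(lines);
  // Start every line immediately; Promise.all preserves input order in its result.
  const romanized = await Promise.all(lines.map((line) => romanizeOne(line, script)));
  return { script, lines: romanized };
}
```

Promise.all returns results in input order even when individual lines finish out of order, which is what keeps lyric lines aligned with their timestamps.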

UnsupportedRomanizationError

Thrown when attempting to romanize a script that requires an external API. Has a script property for programmatic handling.

try {
  await romanizer.romanizeLine('مرحبا');
} catch (err) {
  if (err instanceof UnsupportedRomanizationError) {
    console.log(err.script); // 'arabic'
    // fall back to external API
  }
}
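
The try/catch above can be wrapped in a reusable helper. The sketch below is self-contained so it runs without the package: `LineRomanizer` and `UnsupportedScriptError` are local stand-ins for the library's `Romanizer` and `UnsupportedRomanizationError`, and `romanizeWithFallback` and its `fallback` parameter are hypothetical. In real code you would import the error class from 'lyric-romanizer' and use it in the instanceof check.

```typescript
// Minimal stand-in for the Romanizer interface from the Types section.
interface LineRomanizer {
  romanizeLine(line: string): Promise<string>;
}

// Stand-in for the library's error class, which also exposes a `script` property.
class UnsupportedScriptError extends Error {
  readonly script: string;
  constructor(script: string) {
    super(`Unsupported script: ${script}`);
    this.script = script;
    // Keeps instanceof working if the code is compiled down to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Hypothetical helper: try local romanization, then a caller-supplied fallback.
async function romanizeWithFallback(
  romanizer: LineRomanizer,
  line: string,
  fallback: (line: string, script: string) => Promise<string>,
): Promise<string> {
  try {
    return await romanizer.romanizeLine(line);
  } catch (err) {
    if (err instanceof UnsupportedScriptError) {
      return fallback(line, err.script);
    }
    throw err; // unrelated failures should not be swallowed
  }
}
```

Rethrowing anything that is not the unsupported-script error keeps genuine failures (network errors, bad input) visible instead of silently routing them to the external API.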

Supported Scripts

Local (fully offline)

| Script | Engine | Example |
|---|---|---|
| Universal (fallback) | transliteration | Привет → Privet |
| Japanese | kuroshiro + kuromoji | こんにちは → konnichiha |
| Mandarin | pinyin-pro | 你好 → nǐ hǎo |
| Cantonese | to-jyutping | 佢冇 → keoi5 mou5 |
| Korean | @romanize/korean | 안녕 → annyeong |
| Cyrillic | cyrillic-to-translit-js | Привет → Privet |
| Devanagari | sanscript | नमस्ते → namaste |
| Gujarati | sanscript | નમસ્તે → namaste |
| Gurmukhi | sanscript | ਨਮਸਤੇ → namaste |
| Telugu | sanscript | నమస్తే → namaste |
| Kannada | sanscript | ನಮಸ್ತೆ → namaste |
| Odia | sanscript | ନମସ୍ତେ → namaste |
| Tamil | tamil-romanizer | வணக்கம் → vanakkam |
| Thai | @dehoist/romanize-thai | สวัสดี → sawatdi |
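
The script-to-engine routing listed above can be pictured as a lookup keyed by script, with the dialect deciding between the two Chinese engines. The package names come from the list above; `engineFor` and `LOCAL_ENGINES` are hypothetical names, the lookup is a sketch rather than the library's internals, and the universal fallback row is deliberately not modeled.

```typescript
type Dialect = 'mandarin' | 'cantonese';

// Package names copied from the list above; the lookup logic is hypothetical.
const LOCAL_ENGINES = new Map<string, string>([
  ['japanese', 'kuroshiro + kuromoji'],
  ['korean', '@romanize/korean'],
  ['cyrillic', 'cyrillic-to-translit-js'],
  ['devanagari', 'sanscript'],
  ['gujarati', 'sanscript'],
  ['gurmukhi', 'sanscript'],
  ['telugu', 'sanscript'],
  ['kannada', 'sanscript'],
  ['odia', 'sanscript'],
  ['tamil', 'tamil-romanizer'],
  ['thai', '@dehoist/romanize-thai'],
]);

function engineFor(script: string, dialect: Dialect = 'mandarin'): string | undefined {
  // Chinese is the only script where the dialect option changes the engine.
  if (script === 'chinese') {
    return dialect === 'cantonese' ? 'to-jyutping' : 'pinyin-pro';
  }
  // undefined means no dedicated local engine is listed for this script.
  return LOCAL_ENGINES.get(script);
}
```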

External (requires API)

| Script | Method |
|---|---|
| Malayalam | Google Translate dt=rm |
| Bengali | Google Translate dt=rm |
| Arabic | Google Translate dt=rm |
| Hebrew | Google Translate dt=rm |
| Other | Google Translate dt=rm |

Use requiresExternalRomanization() to detect these and branch to your preferred API.
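
The branching described above might look like the following sketch. `needsExternalApi` is a local stand-in that mirrors the External (requires API) list so the example runs without the package installed; in real code you would call `requiresExternalRomanization` imported from 'lyric-romanizer' instead. `Track` and `partitionByEngine` are hypothetical.

```typescript
// Mirrors the External (requires API) list above.
const EXTERNAL_SCRIPTS = new Set<string>(['malayalam', 'bengali', 'arabic', 'hebrew', 'other']);

// Stand-in for requiresExternalRomanization from the library.
function needsExternalApi(script: string): boolean {
  return EXTERNAL_SCRIPTS.has(script);
}

interface Track {
  title: string;
  script: string;
}

// Splits tracks into those that can be romanized locally and those needing an API.
function partitionByEngine(tracks: readonly Track[]): { local: Track[]; external: Track[] } {
  const local: Track[] = [];
  const external: Track[] = [];
  for (const track of tracks) {
    (needsExternalApi(track.script) ? external : local).push(track);
  }
  return { local, external };
}
```

Checking up front lets a caller batch the external requests instead of discovering them one thrown error at a time.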

Script-Specific Notes

Cyrillic Detection

Cyrillic auto-detects Ukrainian-specific characters (і, ї, є, ґ) and applies the Ukrainian transliteration preset. All other Cyrillic text defaults to Russian.
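
The rule above amounts to a single character-class test. A minimal sketch, assuming the four letters listed plus their uppercase forms: `pickCyrillicPreset` is a hypothetical helper, and the 'uk' and 'ru' labels are assumed preset names, not confirmed from the library's source.

```typescript
type CyrillicPreset = 'uk' | 'ru';

// і ї є ґ and their uppercase forms І Ї Є Ґ, written as escapes to avoid
// confusion with look-alike Latin letters.
const UKRAINIAN_LETTERS_RE = /[\u0456\u0457\u0454\u0491\u0406\u0407\u0404\u0490]/;

// Hypothetical helper: choose a transliteration preset for a Cyrillic line.
function pickCyrillicPreset(text: string): CyrillicPreset {
  return UKRAINIAN_LETTERS_RE.test(text) ? 'uk' : 'ru';
}
```

A Ukrainian line that happens to contain none of these letters would still be treated as Russian under this rule.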

Cantonese Support

Chinese text defaults to Mandarin (Pinyin). Pass dialect: 'cantonese' in RomanizeOptions to romanize Chinese text to Jyutping instead.

const { lines } = await romanizer.romanizeLines(['你好世界', '食飯'], {
  script: 'chinese',
  dialect: 'cantonese',
});
// ['nei5 hou2 sai3 gaai3', 'sik6 faan6']

License

MIT
