| SPDX-FileCopyrightText | 2024-2026 PyThaiNLP Project |
|---|---|
| SPDX-License-Identifier | Apache-2.0 |
A Thai natural language processing library in Rust, with Node.js/TypeScript and Python bindings.
Formerly known as oxidized-thainlp, it was originally developed by
Thanathip Suntorntip.
- Three tokenizer choices with a consistent API:
  - `NewmmTokenizer` (dictionary-based, fastest)
  - `NewmmFstTokenizer` (dictionary-based, lower memory)
  - `DeepcutTokenizer` (neural model, requires feature `deepcut`)
- Thread-safe tokenizer instances for concurrent use.
- Improved handling for ambiguous and large input.
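Thread safety in practice means one tokenizer instance can be shared across worker threads rather than loading the dictionary once per thread. A minimal sketch, assuming the `nlpo3` crate and a local `words_th.txt` dictionary file:

```rust
use std::sync::Arc;
use std::thread;

use nlpo3::tokenizer::newmm::NewmmTokenizer;

fn main() {
    // Build one tokenizer and share it: the dictionary is loaded only once.
    let tok = Arc::new(
        NewmmTokenizer::new("words_th.txt").expect("dictionary load failed"),
    );

    // Spawn several threads that segment concurrently on the shared instance.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let tok = Arc::clone(&tok);
            thread::spawn(move || {
                tok.segment("สวัสดีครับ").expect("segmentation failed")
            })
        })
        .collect();

    for h in handles {
        println!("{:?}", h.join().unwrap());
    }
}
```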
Rust:

```shell
cargo add nlpo3
```

Node.js/TypeScript:

```shell
npm install nlpo3
```

Python:

```shell
pip install nlpo3
```

CLI:

```shell
cargo install nlpo3-cli
```

Requirements:

- Rust Edition 2024
- rustc 1.88.0 or newer
Dictionary tokenizer:

```rust
use nlpo3::tokenizer::newmm::NewmmTokenizer;

fn main() {
    let tok = NewmmTokenizer::new("words_th.txt").expect("dictionary load failed");
    let tokens = tok.segment("สวัสดีครับ").expect("segmentation failed");
    println!("{:?}", tokens);
}
```

Neural tokenizer (enable the feature first):

```toml
nlpo3 = { version = "2.0", features = ["deepcut"] }
```

```rust
use nlpo3::tokenizer::deepcut::DeepcutTokenizer;

fn main() {
    let tok = DeepcutTokenizer::new().expect("model load failed");
    let tokens = tok.segment("สวัสดีครับ").expect("segmentation failed");
    println!("{:?}", tokens);
}
```

Error handling:

- Rust: `segment(...)` returns `AnyResult<Vec<String>>` (`anyhow::Result<Vec<String>>`).
- Python: tokenizer methods raise `RuntimeError` on tokenization/inference failures.
- Node.js/TypeScript: tokenizer methods throw `Error` on tokenization/inference failures.
Use the host language's normal error-handling style (`?`, `try/except`,
`try/catch`) to decide whether to propagate, recover, or fail fast.
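As a sketch of the Rust style, errors can be propagated with `?` instead of panicking with `expect`. This assumes the `nlpo3` crate, its `anyhow`-based result type, and a local `words_th.txt` dictionary file:

```rust
use anyhow::Result;
use nlpo3::tokenizer::newmm::NewmmTokenizer;

// Propagate dictionary-load and segmentation errors to the caller
// rather than panicking inside the function.
fn tokenize(dict_path: &str, text: &str) -> Result<Vec<String>> {
    let tok = NewmmTokenizer::new(dict_path)?;
    let tokens = tok.segment(text)?;
    Ok(tokens)
}

fn main() -> Result<()> {
    let tokens = tokenize("words_th.txt", "สวัสดีครับ")?;
    println!("{:?}", tokens);
    Ok(())
}
```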
| Option | Type | Default | Description |
|---|---|---|---|
| `safe` | `bool` | `false` | For `NewmmTokenizer` and `NewmmFstTokenizer`: avoid long run times on highly ambiguous input |
| `parallel_chunk_size` | `Option<usize>` | `None` | Enable chunked parallel processing for larger text; `None`, `0`, or too-small values disable parallel mode |
Auto-parallel helpers are available via `segment_parallel(...)` on tokenizers.

Note on parallel mode accuracy: when `parallel_chunk_size` is set, text is split into chunks before tokenization. Token sequences near chunk boundaries may differ from full-text results. This is acceptable for tasks such as text classification and word embedding, but may not be suitable for tasks that require precise linguistic unit identification.
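A chunked-parallel call might look like the sketch below. The exact `segment_parallel` signature is not documented here, so the argument shape (text plus an optional chunk size) is an assumption for illustration only, as is the local `words_th.txt` dictionary file:

```rust
use nlpo3::tokenizer::newmm::NewmmTokenizer;

fn main() {
    let tok = NewmmTokenizer::new("words_th.txt").expect("dictionary load failed");

    // ASSUMED argument shape: text plus an optional chunk size.
    // None, 0, or too-small values disable parallel mode, per the table above.
    let long_text = "สวัสดีครับ".repeat(10_000);
    let tokens = tok
        .segment_parallel(&long_text, Some(1_000))
        .expect("segmentation failed");
    println!("{} tokens", tokens.len());
}
```

Because chunk boundaries can shift token sequences, prefer plain `segment(...)` when exact linguistic units matter.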
For technical implementation and design notes, see docs/implementation.md.
- Node.js/TypeScript: nlpo3-nodejs
- Python: nlpo3-python
- CLI: nlpo3-cli
nlpO3 does not bundle a dictionary for dictionary-based tokenizers.
Recommended sources:
- words_th.txt from PyThaiNLP
- word break dictionary from libthai
nlpO3 is copyrighted by its authors and licensed under terms of the Apache Software License 2.0 (Apache-2.0). See file LICENSE for details.