
SPDX-FileCopyrightText: 2024-2026 PyThaiNLP Project
SPDX-License-Identifier: Apache-2.0

nlpO3


A Thai natural language processing library in Rust, with Node.js/TypeScript and Python bindings.

Formerly known as oxidized-thainlp, nlpO3 was originally developed by Thanathip Suntorntip.

Overview

  • Three tokenizer choices with a consistent API:
    • NewmmTokenizer (dictionary, fastest)
    • NewmmFstTokenizer (dictionary, lower memory)
    • DeepcutTokenizer (neural model, requires the deepcut feature)
  • Thread-safe tokenizer instances for concurrent use.
  • Improved handling for ambiguous and large input.
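
The "consistent API" above means all three tokenizers expose the same fallible segment method. A minimal self-contained sketch of that shape, with a whitespace stub standing in for a real tokenizer (the trait and type names here are illustrative, not items from the nlpo3 crate):

```rust
// Illustrative only: each tokenizer exposes the same `segment` signature.
// `WhitespaceStub` is a hypothetical stand-in, not part of nlpo3.
trait SegmentLike {
    fn segment(&self, text: &str) -> Result<Vec<String>, String>;
}

struct WhitespaceStub;

impl SegmentLike for WhitespaceStub {
    fn segment(&self, text: &str) -> Result<Vec<String>, String> {
        if text.is_empty() {
            return Err("empty input".to_string());
        }
        Ok(text.split_whitespace().map(str::to_string).collect())
    }
}

fn main() {
    let tok = WhitespaceStub;
    let tokens = tok.segment("hello world").expect("segmentation failed");
    println!("{:?}", tokens);
}
```

Because the interface is shared, calling code can swap tokenizers without changing the call site.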

Install

Rust:

cargo add nlpo3

Node.js/TypeScript:

npm install nlpo3

Python:

pip install nlpo3

CLI:

cargo install nlpo3-cli

Requirements

  • Rust Edition 2024
  • rustc 1.88.0 or newer

Quick start

Dictionary tokenizer:

use nlpo3::tokenizer::newmm::NewmmTokenizer;

fn main() {
    let tok = NewmmTokenizer::new("words_th.txt").expect("dictionary load failed");
    let tokens = tok.segment("สวัสดีครับ").expect("segmentation failed");
    println!("{:?}", tokens);
}

Neural tokenizer:

Enable the deepcut feature in Cargo.toml first:

nlpo3 = { version = "2.0", features = ["deepcut"] }

Then:

use nlpo3::tokenizer::deepcut::DeepcutTokenizer;

fn main() {
    let tok = DeepcutTokenizer::new().expect("model load failed");
    let tokens = tok.segment("สวัสดีครับ").expect("segmentation failed");
    println!("{:?}", tokens);
}

Error handling

  • Rust: segment(...) returns AnyResult<Vec<String>> (anyhow::Result<Vec<String>>).
  • Python: tokenizer methods raise RuntimeError on tokenization/inference failures.
  • Node.js/TypeScript: tokenizer methods throw Error on tokenization/inference failures.

Use the host language's normal error handling style (?, try/except, try/catch) to decide whether to propagate, recover, or fail fast.
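
In Rust, the usual choice is to bubble the error up with ? rather than panic. A self-contained sketch of that pattern, with a hypothetical stub in place of the real tokenizer call (the stub and its error type are illustrative; the actual method returns anyhow::Result):

```rust
use std::error::Error;

// Hypothetical stand-in for a tokenizer's fallible `segment` call;
// not the nlpo3 API itself.
fn segment_stub(text: &str) -> Result<Vec<String>, Box<dyn Error>> {
    if text.is_empty() {
        return Err("empty input".into());
    }
    Ok(text.split_whitespace().map(str::to_string).collect())
}

// `?` propagates the error to the caller instead of panicking.
fn run() -> Result<(), Box<dyn Error>> {
    let tokens = segment_stub("hello world")?;
    println!("{:?}", tokens);
    Ok(())
}

fn main() {
    if let Err(e) = run() {
        eprintln!("tokenization failed: {}", e);
        std::process::exit(1);
    }
}
```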

Segment options

  • safe (bool, default: false): for NewmmTokenizer and NewmmFstTokenizer, avoids long run times on highly ambiguous input
  • parallel_chunk_size (Option<usize>, default: None): enables chunked parallel processing for larger text; None, 0, or too-small values disable parallel mode

Auto-parallel helpers are available via segment_parallel(...) on tokenizers.

Note on parallel mode accuracy: when parallel_chunk_size is set, text is split into chunks before tokenization. Token sequences near chunk boundaries may differ from full-text results. This is acceptable for tasks such as text classification and word embedding, but may not be suitable for tasks that require precise linguistic unit identification.
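
Why boundary tokens can differ: a chunk split can land in the middle of what a full-text pass would treat as one token. A self-contained illustration using a toy greedy longest-match tokenizer over a tiny dictionary (everything here is illustrative, not the crate's actual algorithm):

```rust
// Greedy longest-match against a toy dictionary; a stand-in for a
// dictionary tokenizer, not nlpo3's implementation.
fn tokenize(text: &str, dict: &[&str]) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let mut best = 1; // fall back to a single character
        for j in (i + 1..=chars.len()).rev() {
            let cand: String = chars[i..j].iter().collect();
            if dict.contains(&cand.as_str()) {
                best = j - i;
                break;
            }
        }
        out.push(chars[i..i + best].iter().collect());
        i += best;
    }
    out
}

fn main() {
    let dict = ["ab", "cd", "abc", "d"];
    // Full-text pass: longest match takes "abc", then "d".
    let full = tokenize("abcd", &dict);
    // Chunked pass with a boundary after two characters never sees "abc".
    let mut chunked = tokenize("ab", &dict);
    chunked.extend(tokenize("cd", &dict));
    println!("full: {:?}, chunked: {:?}", full, chunked); // ["abc", "d"] vs ["ab", "cd"]
}
```

The full pass yields ["abc", "d"] while the chunked pass yields ["ab", "cd"]: same text, different tokens at the boundary, which is exactly the trade-off the note above describes.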

For technical implementation and design notes, see docs/implementation.md.

Bindings

Dictionary

nlpO3 does not bundle a dictionary for dictionary-based tokenizers.

Recommended sources:

Support

License

nlpO3 is copyrighted by its authors and licensed under the terms of the Apache Software License 2.0 (Apache-2.0). See the LICENSE file for details.