| SPDX-FileCopyrightText | 2024-2026 PyThaiNLP Project |
|---|---|
| SPDX-License-Identifier | Apache-2.0 |
A Thai natural language processing library in Rust, with Node.js/TypeScript and Python bindings.
Formerly known as oxidized-thainlp, it was originally developed by
Thanathip Suntorntip.
- Three tokenizer choices with a consistent API:
  - `NewmmTokenizer` (dictionary-based, fastest)
  - `NewmmFstTokenizer` (dictionary-based, lower memory)
  - `DeepcutTokenizer` (neural model, requires feature `deepcut`)
- Thread-safe tokenizer instances for concurrent use.
- Improved handling for ambiguous and large input.
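Thread safety in practice means one tokenizer instance can be shared across worker threads rather than loading the dictionary once per thread. A minimal sketch, assuming the `nlpo3` crate and a local `words_th.txt` dictionary file:

```rust
use std::sync::Arc;
use std::thread;

use nlpo3::tokenizer::newmm::NewmmTokenizer;

fn main() {
    // Build one tokenizer and share it: the dictionary is loaded only once.
    let tok = Arc::new(
        NewmmTokenizer::new("words_th.txt").expect("dictionary load failed"),
    );

    // Spawn several threads that segment concurrently on the shared instance.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let tok = Arc::clone(&tok);
            thread::spawn(move || {
                tok.segment("สวัสดีครับ").expect("segmentation failed")
            })
        })
        .collect();

    for h in handles {
        println!("{:?}", h.join().unwrap());
    }
}
```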
Rust:

```shell
cargo add nlpo3
```

Node.js/TypeScript:

```shell
npm install nlpo3
```

Python:

```shell
pip install nlpo3
```

CLI:

```shell
cargo install nlpo3-cli
```

Requirements:

- Rust Edition 2024
- rustc 1.88.0 or newer
Dictionary tokenizer:

```rust
use nlpo3::tokenizer::newmm::NewmmTokenizer;

fn main() {
    let tok = NewmmTokenizer::new("words_th.txt").expect("dictionary load failed");
    let tokens = tok.segment("สวัสดีครับ").expect("segmentation failed");
    println!("{:?}", tokens);
}
```

Neural tokenizer (enable the feature first):

```toml
nlpo3 = { version = "2.0", features = ["deepcut"] }
```

```rust
use nlpo3::tokenizer::deepcut::DeepcutTokenizer;

fn main() {
    let tok = DeepcutTokenizer::new().expect("model load failed");
    let tokens = tok.segment("สวัสดีครับ").expect("segmentation failed");
    println!("{:?}", tokens);
}
```

Error handling:

- Rust: `segment(...)` returns `AnyResult<Vec<String>>` (`anyhow::Result<Vec<String>>`).
- Python: tokenizer methods raise `RuntimeError` on tokenization/inference failures.
- Node.js/TypeScript: tokenizer methods throw `Error` on tokenization/inference failures.
Use the host language's normal error-handling style (`?`, `try/except`,
`try/catch`) to decide whether to propagate, recover, or fail fast.
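As a sketch of the Rust style, errors can be propagated with `?` instead of panicking with `expect`. This assumes the `nlpo3` crate, its `anyhow`-based result type, and a local `words_th.txt` dictionary file:

```rust
use anyhow::Result;
use nlpo3::tokenizer::newmm::NewmmTokenizer;

// Propagate dictionary-load and segmentation errors to the caller
// rather than panicking inside the function.
fn tokenize(dict_path: &str, text: &str) -> Result<Vec<String>> {
    let tok = NewmmTokenizer::new(dict_path)?;
    let tokens = tok.segment(text)?;
    Ok(tokens)
}

fn main() -> Result<()> {
    let tokens = tokenize("words_th.txt", "สวัสดีครับ")?;
    println!("{:?}", tokens);
    Ok(())
}
```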
| Option | Type | Default | Description |
|---|---|---|---|
| `safe` | `bool` | `false` | For `NewmmTokenizer` and `NewmmFstTokenizer`: avoid long run times on highly ambiguous input |
| `parallel_chunk_size` | `Option<usize>` | `None` | Enable chunked parallel processing for larger text; `None`, `0`, or too-small values disable parallel mode |
Auto-parallel helpers are available via `segment_parallel(...)` on tokenizers.

Note on parallel mode accuracy: when `parallel_chunk_size` is set, text is split into chunks before tokenization. Token sequences near chunk boundaries may differ from full-text results. This is acceptable for tasks such as text classification and word embedding, but may not be suitable for tasks that require precise linguistic unit identification.
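A chunked-parallel call might look like the sketch below. The exact `segment_parallel` signature is not documented here, so the argument shape (text plus an optional chunk size) is an assumption for illustration only, as is the local `words_th.txt` dictionary file:

```rust
use nlpo3::tokenizer::newmm::NewmmTokenizer;

fn main() {
    let tok = NewmmTokenizer::new("words_th.txt").expect("dictionary load failed");

    // ASSUMED argument shape: text plus an optional chunk size.
    // None, 0, or too-small values disable parallel mode, per the table above.
    let long_text = "สวัสดีครับ".repeat(10_000);
    let tokens = tok
        .segment_parallel(&long_text, Some(1_000))
        .expect("segmentation failed");
    println!("{} tokens", tokens.len());
}
```

Because chunk boundaries can shift token sequences, prefer plain `segment(...)` when exact linguistic units matter.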
For technical implementation and design notes, see docs/implementation.md.
- Node.js/TypeScript: nlpo3-nodejs
- Python: nlpo3-python
- CLI: nlpo3-cli
nlpO3 does not bundle a dictionary for dictionary-based tokenizers.
Recommended sources:
- words_th.txt from PyThaiNLP
- word break dictionary from libthai
nlpO3 is copyrighted by its authors and licensed under terms of the Apache Software License 2.0 (Apache-2.0). See file LICENSE for details.