Skip to content

twardoch/uicu

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

uicu

A Pythonic wrapper around PyICU with supplementary Unicode functionality from fontTools.unicodedata.

PyPI - Version PyPI - Python Version

Overview

uicu provides natural, Pythonic interfaces to ICU's powerful internationalization and Unicode capabilities. It transforms PyICU's C++-style API into idiomatic Python, making advanced text processing accessible to Python developers.

Key Features

  • Unicode Character Properties: Rich character information with up-to-date Unicode data
  • Locale-Aware Operations: Sorting, formatting, and text processing that respects locale rules
  • Text Segmentation: Break text into graphemes, words, and sentences according to Unicode rules
  • Script Conversion: Transliterate between writing systems (Greek→Latin, Cyrillic→Latin, etc.)
  • Collation: Locale-sensitive string comparison and sorting with customizable strength levels
  • High Performance: Built on ICU's optimized C++ implementation

Installation

pip install uicu

Dependencies

  • Python 3.10+
  • PyICU 2.11+
  • fontTools[unicode] 4.38.0+ (for enhanced Unicode data)

Quick Start

Character Properties

import uicu

# Get character information
char = uicu.Char('€')
print(char.name)         # 'EURO SIGN'
print(char.category)     # 'Sc' (Currency Symbol)
print(char.script)       # 'Zyyy' (Common)
print(char.block)        # 'Currency Symbols'

# Direct function access
print(uicu.name('你'))    # 'CJK UNIFIED IDEOGRAPH-4F60'
print(uicu.script('A'))   # 'Latn'

# Note: Multi-codepoint strings (like flag emojis) need special handling
# char = uicu.Char('🎉')  # ✅ Works: Party popper (single codepoint)
# char = uicu.Char('🇺🇸')  # ❌ Fails: US flag (two codepoints)

Locale-Aware Collation

import uicu

# Create a locale-specific collator
collator = uicu.Collator('de-DE')  # German collation rules

# Sort strings according to locale
words = ['Müller', 'Mueller', 'Mahler']
sorted_words = collator.sort(words)
print(sorted_words)  # German-specific ordering

# Numeric sorting
numeric_collator = uicu.Collator('en-US', numeric=True)
items = ['item10', 'item2', 'item1']
print(numeric_collator.sort(items))  # ['item1', 'item2', 'item10']

# Direct comparison
print(uicu.compare('café', 'cafe', 'en-US'))  # 1 (café > cafe)

Text Segmentation

import uicu

# Break text into user-perceived characters (grapheme clusters)
text = "👨‍👩‍👧‍👦"  # Family emoji
print(list(uicu.graphemes(text)))  # ['👨‍👩‍👧‍👦'] - single grapheme!

# Word segmentation
text = "Hello, world! How are you?"
words = list(uicu.words(text))
print(words)  # ['Hello', 'world', 'How', 'are', 'you']

# Sentence segmentation
text = "Dr. Smith went to N.Y.C. yesterday. He's busy!"
sentences = list(uicu.sentences(text))
print(sentences)  # Handles abbreviations correctly

# Language-specific segmentation
thai_text = "สวัสดีครับ"
thai_words = list(uicu.words(thai_text, locale='th-TH'))

Script Conversion and Transliteration

import uicu

# Convert between scripts
trans = uicu.Transliterator('Greek-Latin')
print(trans.transliterate('Ελληνικά'))  # 'Ellēniká'

# Remove accents
trans = uicu.Transliterator('Latin-ASCII')
print(trans.transliterate('café résumé'))  # 'cafe resume'

# Chain transformations
trans = uicu.Transliterator('Any-Latin; Latin-ASCII; Lower')
print(trans.transliterate('北京'))  # 'bei jing'

# Case transformations
upper = uicu.Transliterator('Upper')
print(upper.transliterate('hello'))  # 'HELLO'

Working with Locales

import uicu

# Create and inspect locales
locale = uicu.Locale('zh-Hant-TW')
print(locale.language)     # 'zh'
print(locale.script)       # 'Hant'
print(locale.region)       # 'TW'
print(locale.display_name) # 'Chinese (Traditional, Taiwan)'

# Get system default locale
default = uicu.get_default_locale()
print(default.language_tag)  # e.g., 'en-US'

# List available locales
locales = uicu.get_available_locales()
print(f"Available locales: {len(locales)}")  # 700+ locales

# Create locale-specific services
formatter = locale.get_datetime_formatter(date_style='long', time_style='short')
# Date/time formatting with locale-specific patterns

Advanced Usage

Custom Collation Strength

# Primary strength - ignores case and accents
collator = uicu.Collator('en-US', strength='primary')
print(collator.compare('café', 'CAFE'))  # 0 (equal)

# Secondary strength - considers accents but not case
collator = uicu.Collator('en-US', strength='secondary')
print(collator.compare('café', 'CAFÉ'))  # 0 (equal)
print(collator.compare('café', 'cafe'))  # 1 (café > cafe)

# Tertiary strength (default) - considers case
collator = uicu.Collator('en-US', strength='tertiary')
print(collator.compare('café', 'Café'))  # 1 (café > Café)

Reusable Segmenters

# Create reusable segmenters for better performance
word_segmenter = uicu.WordSegmenter('en-US')
sentences = [
    "This is a test.",
    "Another sentence here.",
    "And one more!"
]

for sentence in sentences:
    words = list(word_segmenter.segment(sentence))
    print(f"{len(words)} words: {words}")

Script Detection

# Detect the primary script in text
print(uicu.detect_script('Hello'))      # 'Latn'
print(uicu.detect_script('你好'))        # 'Hani'
print(uicu.detect_script('مرحبا'))      # 'Arab'
print(uicu.detect_script('Привет'))     # 'Cyrl'

API Design Philosophy

uicu follows these principles:

  1. Pythonic: Natural Python idioms, not C++ style
  2. Unicode-first: Seamless handling of all Unicode text
  3. Locale-aware: Respect cultural and linguistic differences
  4. Performance: Efficient ICU algorithms under the hood
  5. Compatibility: Works with Python's built-in string types
  6. Fallbacks: Graceful degradation when optional features unavailable

Development Status

Version 1.0.0-alpha (2025-01-25)

uicu v1.0 focuses on delivering fast, reliable, essential Unicode operations. All shipped features are 100% functional and thoroughly tested.

Core Features (Production Ready):

  • Unicode Character Properties - Complete character analysis with fontTools integration
  • Locale Management - BCP 47 compliant locale handling with factory patterns
  • Collation & Sorting - Culture-aware string comparison with customizable strength
  • Text Segmentation - Grapheme, word, sentence, and line break detection
  • Transliteration - Script conversion and text transformation
  • Date/Time Formatting - Locale-aware formatting with patterns and styles
  • Script Detection - Automatic detection of writing systems

Performance Metrics:

  • 🚀 Import time: 16.9ms (target: <100ms)
  • 📦 Package size: 96KB source (target: <100KB)
  • 📊 Code size: 2,418 lines (close to 2000 target)
  • ⚡ Minimal PyICU overhead for maximum performance

Code Quality:

  • 🧹 Reduced from 26 to 1 linting error (only module name warning)
  • 🔍 Simplified exception handling - ICU errors provide better context
  • 🏗️ Consolidated validation code with shared utilities
  • 🗑️ Removed 200+ lines of dead code and unnecessary fallbacks
  • 📝 Streamlined docstrings by 40% while keeping clarity

Coming in v2.0:

  • Number formatting (decimal, currency, percent, scientific)
  • List formatting with locale-appropriate conjunctions
  • Message formatting with plural/gender support
  • Date/time parsing functionality
  • Advanced timezone handling
  • Relative time formatting
  • Unicode regex support
  • Sphinx documentation site

Examples

Run the comprehensive demo to see all features in action:

python examples/uicu_demo.py

This demo includes:

  1. Unicode character exploration with properties
  2. Culture-aware multilingual name sorting
  3. Text segmentation (graphemes, words, sentences)
  4. Script conversion and transliteration
  5. Locale-aware date/time formatting
  6. Smart numeric vs lexical sorting
  7. Unicode text transformations
  8. Automatic script detection
  9. Thai word segmentation
  10. Emoji and complex grapheme handling
  11. Case-sensitive sorting control
  12. Bidirectional text analysis

Development

Environment Setup

# Install and use uv for package management
pip install uv

# Use hatch for development workflow
uv pip install hatch

Common Development Tasks

# Activate development environment
hatch shell

# Run tests
hatch run test

# Run tests with coverage
hatch run test-cov

# Run linting
hatch run lint

# Format code
hatch run format

# Run type checking
hatch run type-check

Contributing

Contributions are welcome! Please see our Contributing Guide for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built on top of PyICU, which provides Python bindings for ICU
  • Enhanced with fontTools.unicodedata for up-to-date Unicode data
  • Inspired by the need for more Pythonic Unicode handling in Python applications

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors