A Pythonic wrapper around PyICU with supplementary Unicode functionality from fontTools.unicodedata.
uicu provides natural, Pythonic interfaces to ICU's powerful internationalization and Unicode capabilities. It transforms PyICU's C++-style API into idiomatic Python, making advanced text processing accessible to Python developers.
- Unicode Character Properties: Rich character information with up-to-date Unicode data
- Locale-Aware Operations: Sorting, formatting, and text processing that respects locale rules
- Text Segmentation: Break text into graphemes, words, and sentences according to Unicode rules
- Script Conversion: Transliterate between writing systems (Greek→Latin, Cyrillic→Latin, etc.)
- Collation: Locale-sensitive string comparison and sorting with customizable strength levels
- High Performance: Built on ICU's optimized C++ implementation
pip install uicu- Python 3.10+
- PyICU 2.11+
- fontTools[unicode] 4.38.0+ (for enhanced Unicode data)
import uicu
# Get character information
char = uicu.Char('€')
print(char.name) # 'EURO SIGN'
print(char.category) # 'Sc' (Currency Symbol)
print(char.script) # 'Zyyy' (Common)
print(char.block) # 'Currency Symbols'
# Direct function access
print(uicu.name('你')) # 'CJK UNIFIED IDEOGRAPH-4F60'
print(uicu.script('A')) # 'Latn'
# Note: Multi-codepoint strings (like flag emojis) need special handling
# char = uicu.Char('🎉') # ✅ Works: Party popper (single codepoint)
# char = uicu.Char('🇺🇸') # ❌ Fails: US flag (two codepoints)import uicu
# Create a locale-specific collator
collator = uicu.Collator('de-DE') # German collation rules
# Sort strings according to locale
words = ['Müller', 'Mueller', 'Mahler']
sorted_words = collator.sort(words)
print(sorted_words) # German-specific ordering
# Numeric sorting
numeric_collator = uicu.Collator('en-US', numeric=True)
items = ['item10', 'item2', 'item1']
print(numeric_collator.sort(items)) # ['item1', 'item2', 'item10']
# Direct comparison
print(uicu.compare('café', 'cafe', 'en-US')) # 1 (café > cafe)import uicu
# Break text into user-perceived characters (grapheme clusters)
text = "👨👩👧👦" # Family emoji
print(list(uicu.graphemes(text))) # ['👨👩👧👦'] - single grapheme!
# Word segmentation
text = "Hello, world! How are you?"
words = list(uicu.words(text))
print(words) # ['Hello', 'world', 'How', 'are', 'you']
# Sentence segmentation
text = "Dr. Smith went to N.Y.C. yesterday. He's busy!"
sentences = list(uicu.sentences(text))
print(sentences) # Handles abbreviations correctly
# Language-specific segmentation
thai_text = "สวัสดีครับ"
thai_words = list(uicu.words(thai_text, locale='th-TH'))import uicu
# Convert between scripts
trans = uicu.Transliterator('Greek-Latin')
print(trans.transliterate('Ελληνικά')) # 'Ellēniká'
# Remove accents
trans = uicu.Transliterator('Latin-ASCII')
print(trans.transliterate('café résumé')) # 'cafe resume'
# Chain transformations
trans = uicu.Transliterator('Any-Latin; Latin-ASCII; Lower')
print(trans.transliterate('北京')) # 'bei jing'
# Case transformations
upper = uicu.Transliterator('Upper')
print(upper.transliterate('hello')) # 'HELLO'import uicu
# Create and inspect locales
locale = uicu.Locale('zh-Hant-TW')
print(locale.language) # 'zh'
print(locale.script) # 'Hant'
print(locale.region) # 'TW'
print(locale.display_name) # 'Chinese (Traditional, Taiwan)'
# Get system default locale
default = uicu.get_default_locale()
print(default.language_tag) # e.g., 'en-US'
# List available locales
locales = uicu.get_available_locales()
print(f"Available locales: {len(locales)}") # 700+ locales
# Create locale-specific services
formatter = locale.get_datetime_formatter(date_style='long', time_style='short')
# Date/time formatting with locale-specific patterns# Primary strength - ignores case and accents
collator = uicu.Collator('en-US', strength='primary')
print(collator.compare('café', 'CAFE')) # 0 (equal)
# Secondary strength - considers accents but not case
collator = uicu.Collator('en-US', strength='secondary')
print(collator.compare('café', 'CAFÉ')) # 0 (equal)
print(collator.compare('café', 'cafe')) # 1 (café > cafe)
# Tertiary strength (default) - considers case
collator = uicu.Collator('en-US', strength='tertiary')
print(collator.compare('café', 'Café')) # 1 (café > Café)# Create reusable segmenters for better performance
word_segmenter = uicu.WordSegmenter('en-US')
sentences = [
"This is a test.",
"Another sentence here.",
"And one more!"
]
for sentence in sentences:
words = list(word_segmenter.segment(sentence))
print(f"{len(words)} words: {words}")# Detect the primary script in text
print(uicu.detect_script('Hello')) # 'Latn'
print(uicu.detect_script('你好')) # 'Hani'
print(uicu.detect_script('مرحبا')) # 'Arab'
print(uicu.detect_script('Привет')) # 'Cyrl'uicu follows these principles:
- Pythonic: Natural Python idioms, not C++ style
- Unicode-first: Seamless handling of all Unicode text
- Locale-aware: Respect cultural and linguistic differences
- Performance: Efficient ICU algorithms under the hood
- Compatibility: Works with Python's built-in string types
- Fallbacks: Graceful degradation when optional features unavailable
uicu v1.0 focuses on delivering fast, reliable, essential Unicode operations. All shipped features are 100% functional and thoroughly tested.
Core Features (Production Ready):
- ✅ Unicode Character Properties - Complete character analysis with fontTools integration
- ✅ Locale Management - BCP 47 compliant locale handling with factory patterns
- ✅ Collation & Sorting - Culture-aware string comparison with customizable strength
- ✅ Text Segmentation - Grapheme, word, sentence, and line break detection
- ✅ Transliteration - Script conversion and text transformation
- ✅ Date/Time Formatting - Locale-aware formatting with patterns and styles
- ✅ Script Detection - Automatic detection of writing systems
Performance Metrics:
- 🚀 Import time: 16.9ms (target: <100ms)
- 📦 Package size: 96KB source (target: <100KB)
- 📊 Code size: 2,418 lines (close to 2000 target)
- ⚡ Minimal PyICU overhead for maximum performance
Code Quality:
- 🧹 Reduced from 26 to 1 linting error (only module name warning)
- 🔍 Simplified exception handling - ICU errors provide better context
- 🏗️ Consolidated validation code with shared utilities
- 🗑️ Removed 200+ lines of dead code and unnecessary fallbacks
- 📝 Streamlined docstrings by 40% while keeping clarity
Coming in v2.0:
- Number formatting (decimal, currency, percent, scientific)
- List formatting with locale-appropriate conjunctions
- Message formatting with plural/gender support
- Date/time parsing functionality
- Advanced timezone handling
- Relative time formatting
- Unicode regex support
- Sphinx documentation site
Run the comprehensive demo to see all features in action:
python examples/uicu_demo.pyThis demo includes:
- Unicode character exploration with properties
- Culture-aware multilingual name sorting
- Text segmentation (graphemes, words, sentences)
- Script conversion and transliteration
- Locale-aware date/time formatting
- Smart numeric vs lexical sorting
- Unicode text transformations
- Automatic script detection
- Thai word segmentation
- Emoji and complex grapheme handling
- Case-sensitive sorting control
- Bidirectional text analysis
# Install and use uv for package management
pip install uv
# Use hatch for development workflow
uv pip install hatch# Activate development environment
hatch shell
# Run tests
hatch run test
# Run tests with coverage
hatch run test-cov
# Run linting
hatch run lint
# Format code
hatch run format
# Run type checking
hatch run type-checkContributions are welcome! Please see our Contributing Guide for details.
This project is licensed under the MIT License - see the LICENSE file for details.
- Built on top of PyICU, which provides Python bindings for ICU
- Enhanced with fontTools.unicodedata for up-to-date Unicode data
- Inspired by the need for more Pythonic Unicode handling in Python applications