Summary
Add TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to the aprender::text module for text feature extraction in machine learning pipelines.
Motivation
The organizational-intelligence-plugin project (https://github.com/paiml/organizational-intelligence-plugin) is implementing NLP-based defect classification and requires TF-IDF feature extraction for Phase 2 ML classification (Tier 2: TF-IDF + ML classifier, target <100ms inference).
Current Status:
- ✅ Phase 1 Complete: Rule-based classifier with aprender text processing (tokenization, stemming, stopwords)
- ✅ 18-category defect taxonomy (10 general + 8 transpiler-specific)
- ✅ Multi-label classification support
- ❌ Phase 2 Blocked: Need TF-IDF vectorization for ML feature extraction
Target Pipeline:
// Desired API
use aprender::text::tfidf::TfidfVectorizer;
use aprender::text::tokenize::WordTokenizer;
// `fit` takes `&mut self`, so the binding must be mutable
let mut vectorizer = TfidfVectorizer::new()
    .max_features(1500)
    .ngram_range(1, 3)   // unigrams, bigrams, trigrams
    .min_df(5)           // min document frequency (absolute count)
    .max_df(0.5)         // max document frequency as a ratio (ignore common terms)
    .sublinear_tf(true); // use 1 + log(tf) damping
// Fit on training corpus
vectorizer.fit(&commit_messages)?;
// Transform to sparse feature matrix
let x_train = vectorizer.transform(&commit_messages)?;
// Get feature names for interpretability (returns a Result, see API Design)
let feature_names = vectorizer.get_feature_names()?;
Requirements
Core TF-IDF Formula
TF (Term Frequency):
IDF (Inverse Document Frequency):
idf(t, D) = log((1 + |D|) / (1 + df(t))) + 1
Where:
df(t) = number of documents containing term t
|D| = total number of documents
TF-IDF:
tfidf(t, d, D) = tf(t, d) × idf(t, D)
Sublinear TF Scaling (optional):
tf_sublinear(t, d) = 1 + log(tf(t, d)) if tf(t, d) > 0
= 0 otherwise
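The formulas above can be sketched in a few lines of Rust. This is only an illustration, not aprender's implementation: the function names are hypothetical, `log` is assumed to be the natural logarithm, and the raw tf(t, d) is assumed to be the number of times the term occurs in the document, since the TF definition is not spelled out here.

```rust
/// Smoothed IDF from the formula above: ln((1 + |D|) / (1 + df(t))) + 1.
/// Assumption: "log" means the natural logarithm.
fn idf(n_docs: usize, doc_freq: usize) -> f32 {
    ((1.0 + n_docs as f32) / (1.0 + doc_freq as f32)).ln() + 1.0
}

/// Term-frequency weight.
/// Assumption: the raw tf(t, d) is the occurrence count of the term in the document.
fn tf_weight(count: usize, sublinear: bool) -> f32 {
    if count == 0 {
        0.0
    } else if sublinear {
        // Sublinear scaling: 1 + ln(tf) for tf > 0.
        1.0 + (count as f32).ln()
    } else {
        count as f32
    }
}

/// tfidf(t, d, D) = tf(t, d) * idf(t, D)
fn tfidf(count: usize, n_docs: usize, doc_freq: usize, sublinear: bool) -> f32 {
    tf_weight(count, sublinear) * idf(n_docs, doc_freq)
}
```

With this smoothing, a term that appears in every document still gets an IDF of exactly 1 rather than 0, so it is down-weighted but not erased.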
API Design
pub struct TfidfVectorizer {
    // Configuration
    max_features: Option<usize>,
    ngram_range: (usize, usize),
    min_df: usize,
    max_df: f32,
    sublinear_tf: bool,
    // Fitted state
    vocabulary: Option<HashMap<String, usize>>,
    idf_scores: Option<Vec<f32>>,
    document_count: usize,
}
impl TfidfVectorizer {
    pub fn new() -> Self;
    // Builder pattern for configuration
    pub fn max_features(self, n: usize) -> Self;
    pub fn ngram_range(self, min: usize, max: usize) -> Self;
    pub fn min_df(self, threshold: usize) -> Self;
    pub fn max_df(self, ratio: f32) -> Self;
    pub fn sublinear_tf(self, enable: bool) -> Self;
    // Fit on training corpus
    pub fn fit(&mut self, documents: &[String]) -> Result<(), AprenderError>;
    // Transform documents to TF-IDF matrix
    pub fn transform(&self, documents: &[String]) -> Result<Matrix, AprenderError>;
    // Fit and transform in one step
    pub fn fit_transform(&mut self, documents: &[String]) -> Result<Matrix, AprenderError>;
    // Get feature names (vocabulary terms)
    pub fn get_feature_names(&self) -> Result<Vec<String>, AprenderError>;
    // Get IDF scores for vocabulary
    pub fn get_idf_scores(&self) -> Result<Vec<f32>, AprenderError>;
}
Output Format
Return sparse matrix as aprender::primitives::Matrix:
- Rows: documents (n_samples)
- Columns: features (n_features)
- Values: TF-IDF scores (non-negative; between 0.0 and 1.0 once rows are L2-normalized)
For the organizational-intelligence-plugin use case:
- Input: 5,000 commit messages
- Output: 5,000 × 1,500 sparse matrix (most values = 0.0)
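To illustrate why sparsity matters at this size: a dense 5,000 × 1,500 matrix of f32 holds 7.5 million cells (30 MB), almost all of them zero. The sketch below stores only the non-zero entries per row. `SparseRows` is a hypothetical type for illustration; it is not `aprender::primitives::Matrix`, whose internal layout is not described in this issue.

```rust
/// Hypothetical row-wise sparse container (not aprender's `Matrix`):
/// each row keeps only its non-zero (column, value) pairs, sorted by column.
struct SparseRows {
    n_cols: usize,
    rows: Vec<Vec<(usize, f32)>>,
}

impl SparseRows {
    fn new(n_cols: usize) -> Self {
        SparseRows { n_cols, rows: Vec::new() }
    }

    /// Append one document row, dropping zeros and out-of-range columns.
    fn push_row(&mut self, entries: &[(usize, f32)]) {
        let mut row: Vec<(usize, f32)> = entries
            .iter()
            .copied()
            .filter(|&(col, value)| col < self.n_cols && value != 0.0)
            .collect();
        row.sort_by_key(|&(col, _)| col);
        self.rows.push(row);
    }

    /// Value at (row, col); entries that are not stored read as 0.0.
    fn get(&self, row: usize, col: usize) -> f32 {
        self.rows
            .get(row)
            .and_then(|r| r.iter().find(|&&(c, _)| c == col))
            .map(|&(_, v)| v)
            .unwrap_or(0.0)
    }

    /// Number of stored (non-zero) entries across all rows.
    fn nnz(&self) -> usize {
        self.rows.iter().map(Vec::len).sum()
    }
}
```

Memory then scales with the number of non-zero entries instead of with n_samples × n_features. A production layout such as CSR would pack the rows into flat arrays, but the idea is the same.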
Software Engineering Adaptations
Custom Stop Words Handling:
// Keep technical terms, including ones a general-purpose stop word list would drop
// Keep: "fix", "bug", "error", "null", "if", "for", "while"
// Remove: "the", "a", "an", "and", "or", "but", "this", "that"
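One way to express this is a keep-list that overrides a general-purpose stop word list. The sketch below is an assumption about how such filtering could look; `filter_stop_words` and its parameters are hypothetical and do not reflect the existing `aprender::text::stopwords` API.

```rust
use std::collections::HashSet;

/// Drop tokens found in `stop_words` unless they also appear in `keep`.
/// Hypothetical helper; tokens are assumed to be lowercased already.
fn filter_stop_words<'a>(
    tokens: &[&'a str],
    stop_words: &HashSet<&str>,
    keep: &HashSet<&str>,
) -> Vec<&'a str> {
    tokens
        .iter()
        .copied()
        // The keep-list wins over the stop word list.
        .filter(|t| keep.contains(*t) || !stop_words.contains(*t))
        .collect()
}
```

With a stop list containing "the", "this", and "if" and a keep-list containing "if", the tokens `["fix", "the", "null", "check", "if", "this", "fails"]` reduce to `["fix", "null", "check", "if", "fails"]`.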
N-gram Examples:
- Unigrams: ["null", "pointer", "fix"]
- Bigrams: ["null_pointer", "race_condition", "memory_leak"]
- Trigrams: ["use_after_free", "operator_precedence_bug"]
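N-grams of this shape can be produced with a sliding window over the token list. The following is a minimal sketch, assuming tokens are joined with an underscore as in the examples above; `ngrams` is a hypothetical helper name.

```rust
/// Build all n-grams for n in `min_n..=max_n`, joining tokens with '_'.
/// Returns an empty list for an invalid range (min_n == 0 or min_n > max_n).
fn ngrams(tokens: &[&str], min_n: usize, max_n: usize) -> Vec<String> {
    let mut out = Vec::new();
    if min_n == 0 || min_n > max_n {
        return out;
    }
    for n in min_n..=max_n {
        // `windows` yields nothing when the slice is shorter than n.
        for window in tokens.windows(n) {
            out.push(window.join("_"));
        }
    }
    out
}
```

For the tokens `["use", "after", "free"]` and the range (1, 3), this yields the three unigrams, then `use_after` and `after_free`, then `use_after_free`.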
Implementation Guidance
Phase 1: Core TF-IDF
- Build vocabulary with n-gram support
- Calculate document frequencies (df)
- Compute IDF scores
- Transform documents to TF-IDF vectors
- Apply min_df and max_df filtering
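The vocabulary and filtering steps above could look roughly like the sketch below. It assumes documents are already tokenized, treats `min_df` as an absolute count and `max_df` as a ratio (matching the field types in the API design), and keeps terms whose df is at most max_df × |D|; whether that boundary is inclusive is an assumption. `build_vocabulary` is a hypothetical name.

```rust
use std::collections::{BTreeMap, HashSet};

/// Count document frequencies, then keep terms with
/// df >= min_df and df <= max_df * |D|.
/// Returns (term, df) pairs in sorted term order; a term's position in the
/// returned list can serve as its column index.
fn build_vocabulary(docs: &[Vec<String>], min_df: usize, max_df: f32) -> Vec<(String, usize)> {
    let mut df: BTreeMap<&str, usize> = BTreeMap::new();
    for doc in docs {
        // A term counts once per document, however often it repeats.
        let unique: HashSet<&str> = doc.iter().map(String::as_str).collect();
        for term in unique {
            *df.entry(term).or_insert(0) += 1;
        }
    }
    let n_docs = docs.len() as f32;
    df.into_iter()
        .filter(|&(_, count)| count >= min_df && (count as f32) <= max_df * n_docs)
        .map(|(term, count)| (term.to_string(), count))
        .collect()
}
```

Using a `BTreeMap` keeps the column order deterministic across runs, which helps when the fitted vocabulary is serialized or compared in tests.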
Phase 2: Optimizations
- Sparse matrix representation (most values are 0)
- Sublinear TF scaling option
- L2 normalization (unit vectors)
- Top-k feature selection (max_features)
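Two of these steps are small enough to sketch directly. The code below is illustrative only: rows are assumed to be lists of (column, value) pairs, and top-k selection is assumed to rank terms by corpus-wide count (the convention scikit-learn uses for `max_features`), which this issue does not specify. Both function names are hypothetical.

```rust
/// Scale a sparse row in place so its Euclidean norm is 1.
/// All-zero (or empty) rows are left unchanged to avoid dividing by zero.
fn l2_normalize(row: &mut [(usize, f32)]) {
    let norm = row.iter().map(|&(_, v)| v * v).sum::<f32>().sqrt();
    if norm > 0.0 {
        for entry in row.iter_mut() {
            entry.1 /= norm;
        }
    }
}

/// Keep the `max_features` terms with the highest counts.
/// Assumption: ties are broken alphabetically; the result is sorted by term.
fn top_k_terms(mut counts: Vec<(String, usize)>, max_features: usize) -> Vec<String> {
    counts.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    counts.truncate(max_features);
    let mut terms: Vec<String> = counts.into_iter().map(|(term, _)| term).collect();
    terms.sort();
    terms
}
```

After normalization every row is a unit vector, so the dot product of two rows equals their cosine similarity.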
Phase 3: Integration
- Integration with aprender::text::tokenize
- Integration with aprender::text::stopwords
- Serialization support via aprender::serialization
Quality Standards
Following aprender's Toyota Way principles:
- ✅ Zero unwrap() calls (Cloudflare-class safety)
- ✅ Result-based error handling with AprenderError
- ✅ Comprehensive test coverage (≥95%)
- ✅ Property-based testing with proptest
- ✅ Benchmarks for performance validation
- ✅ Doc tests for all public APIs
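As an illustration of the no-unwrap(), Result-based style, the sketch below shows how an accessor might report an unfitted vectorizer instead of panicking. `SketchError` is a hypothetical stand-in for `AprenderError`, whose variants are not shown in this issue, and `Vectorizer` is a reduced version of the proposed struct.

```rust
use std::collections::HashMap;
use std::fmt;

/// Hypothetical stand-in for `AprenderError`.
#[derive(Debug, PartialEq)]
enum SketchError {
    NotFitted,
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::NotFitted => write!(f, "vectorizer has not been fitted"),
        }
    }
}

impl std::error::Error for SketchError {}

/// Reduced version of the proposed struct: only the fitted vocabulary.
struct Vectorizer {
    vocabulary: Option<HashMap<String, usize>>,
}

impl Vectorizer {
    /// Feature names ordered by column index, or an error before `fit`.
    fn get_feature_names(&self) -> Result<Vec<String>, SketchError> {
        // `ok_or` + `?` replaces an `unwrap()` on the Option.
        let vocab = self.vocabulary.as_ref().ok_or(SketchError::NotFitted)?;
        let mut pairs: Vec<(&String, &usize)> = vocab.iter().collect();
        pairs.sort_by_key(|&(_, &idx)| idx);
        Ok(pairs.into_iter().map(|(term, _)| term.clone()).collect())
    }
}
```

Calling `get_feature_names()` before fitting returns `Err(SketchError::NotFitted)`, which callers can propagate with `?` as in the target pipeline.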
Test Coverage Requirements:
- Unit tests: vocabulary building, IDF calculation, transform
- Edge cases: empty documents, single document, all stop words
- Integration tests: full fit_transform pipeline
- Property tests: TF-IDF scores stay in the valid range [0, ∞)
- Benchmarks: transform time for 5K documents × 1500 features
References
NLP Specification:
- organizational-intelligence-plugin: docs/specifications/nlp-models-techniques-spec.md
- Section 2.1.2: TF-IDF Implementation Details
- Section 3: ML Classification with TF-IDF Features
- Section 5.1: Three-Tier Architecture (Tier 2 needs TF-IDF)
Industry Best Practices:
Research Evidence:
- Ensemble study: N-grams (1-3) + TF-IDF achieve 85-92% accuracy in text classification
- Bug report management: TF-IDF features improve defect categorization vs. keyword matching
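Success Criteria
- TfidfVectorizer struct with builder pattern API
- fit(), transform(), fit_transform() methods
- Integration with aprender::text::tokenize
- Output as aprender::primitives::Matrix (sparse-friendly)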
Priority
High Priority - Blocks organizational-intelligence-plugin Phase 2 NLP implementation
Current project status: 86.65% test coverage, 422 tests passing, ready for ML integration once TF-IDF is available.
Related Work
This will enable:
- Tier 2 ML classifier (TF-IDF + Random Forest/XGBoost)
- Feature interpretability (top TF-IDF terms per category)
- Training data pipeline for 5K+ commit messages
- Target: ≥80% actionable defect categorization (current: 30.8%)
cc @paiml/aprender-maintainers