feat(text): Add TF-IDF vectorizer for text feature extraction #70

@noahgift

Summary

Add TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to the aprender::text module for text feature extraction in machine learning pipelines.

Motivation

The organizational-intelligence-plugin project (https://github.com/paiml/organizational-intelligence-plugin) is implementing NLP-based defect classification and requires TF-IDF feature extraction for Phase 2 ML classification (Tier 2: TF-IDF + ML classifier, target <100ms inference).

Current Status:

  • ✅ Phase 1 Complete: Rule-based classifier with aprender text processing (tokenization, stemming, stopwords)
  • ✅ 18-category defect taxonomy (10 general + 8 transpiler-specific)
  • ✅ Multi-label classification support
  • ❌ Phase 2 Blocked: Need TF-IDF vectorization for ML feature extraction

Target Pipeline:

// Desired API
use aprender::text::tfidf::TfidfVectorizer;
use aprender::text::tokenize::WordTokenizer;

let mut vectorizer = TfidfVectorizer::new()  // mut: fit() takes &mut self
    .max_features(1500)
    .ngram_range(1, 3)  // unigrams, bigrams, trigrams
    .min_df(5)          // min document frequency
    .max_df(0.5)        // max document frequency (ignore common terms)
    .sublinear_tf(true); // use 1 + log(tf) damping

// Fit on training corpus
vectorizer.fit(&commit_messages)?;

// Transform to sparse feature matrix
let X_train = vectorizer.transform(&commit_messages)?;

// Get feature names for interpretability
let feature_names = vectorizer.get_feature_names()?;

Requirements

Core TF-IDF Formula

TF (Term Frequency):

tf(t, d) = count(t in d)

IDF (Inverse Document Frequency):

idf(t, D) = log((1 + |D|) / (1 + df(t))) + 1

Where:

  • df(t) = number of documents containing term t
  • |D| = total number of documents

TF-IDF:

tfidf(t, d, D) = tf(t, d) × idf(t, D)

Sublinear TF Scaling (optional):

tf_sublinear(t, d) = 1 + log(tf(t, d))  if tf(t, d) > 0
                   = 0                   otherwise
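A minimal sketch of the formulas above as plain Rust functions, in case it helps pin down the expected numbers. The function names are hypothetical, and the use of the natural logarithm is an assumption, since the formulas do not fix the log base.

```rust
/// Smoothed IDF: ln((1 + |D|) / (1 + df(t))) + 1.
/// Natural log is an assumption; the issue does not state the base.
fn idf(n_docs: usize, df: usize) -> f64 {
    ((1.0 + n_docs as f64) / (1.0 + df as f64)).ln() + 1.0
}

/// Term-frequency weight: the raw count, or 1 + ln(count) with sublinear scaling.
/// A count of zero always maps to 0.
fn tf_weight(count: usize, sublinear: bool) -> f64 {
    if count == 0 {
        0.0
    } else if sublinear {
        1.0 + (count as f64).ln()
    } else {
        count as f64
    }
}

/// tfidf(t, d, D) = tf(t, d) × idf(t, D)
fn tfidf(count: usize, df: usize, n_docs: usize, sublinear: bool) -> f64 {
    tf_weight(count, sublinear) * idf(n_docs, df)
}
```

With this smoothing, a term that appears in every document gets an IDF of exactly 1, so it is down-weighted relative to rarer terms but never zeroed out.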

API Design

pub struct TfidfVectorizer {
    // Configuration
    max_features: Option<usize>,
    ngram_range: (usize, usize),
    min_df: usize,
    max_df: f32,
    sublinear_tf: bool,
    
    // Fitted state
    vocabulary: Option<HashMap<String, usize>>,
    idf_scores: Option<Vec<f32>>,
    document_count: usize,
}

impl TfidfVectorizer {
    pub fn new() -> Self;
    
    // Builder pattern for configuration
    pub fn max_features(self, n: usize) -> Self;
    pub fn ngram_range(self, min: usize, max: usize) -> Self;
    pub fn min_df(self, threshold: usize) -> Self;
    pub fn max_df(self, ratio: f32) -> Self;
    pub fn sublinear_tf(self, enable: bool) -> Self;
    
    // Fit on training corpus
    pub fn fit(&mut self, documents: &[String]) -> Result<(), AprenderError>;
    
    // Transform documents to TF-IDF matrix
    pub fn transform(&self, documents: &[String]) -> Result<Matrix, AprenderError>;
    
    // Fit and transform in one step
    pub fn fit_transform(&mut self, documents: &[String]) -> Result<Matrix, AprenderError>;
    
    // Get feature names (vocabulary terms)
    pub fn get_feature_names(&self) -> Result<Vec<String>, AprenderError>;
    
    // Get IDF scores for vocabulary
    pub fn get_idf_scores(&self) -> Result<Vec<f32>, AprenderError>;
}
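To illustrate how the consuming builder methods and the `&mut self` fit step could interact, here is a reduced sketch. `VectorizerSketch` and `SketchError` are hypothetical stand-ins (the real `AprenderError` variants are not shown in this issue), and the whitespace tokenization is only a placeholder for aprender::text::tokenize.

```rust
use std::collections::HashMap;

/// Local stand-in for AprenderError (assumption: the real variants may differ).
#[derive(Debug, PartialEq)]
enum SketchError {
    NotFitted,
}

#[derive(Debug, Default)]
struct VectorizerSketch {
    max_features: Option<usize>,
    ngram_range: (usize, usize),
    vocabulary: Option<HashMap<String, usize>>,
}

impl VectorizerSketch {
    fn new() -> Self {
        Self { ngram_range: (1, 1), ..Self::default() }
    }

    // Builder methods consume self and return it, so calls can be chained.
    fn max_features(mut self, n: usize) -> Self {
        self.max_features = Some(n);
        self
    }

    fn ngram_range(mut self, min: usize, max: usize) -> Self {
        self.ngram_range = (min, max);
        self
    }

    /// Assigns column indices to whitespace tokens in first-seen order
    /// (unigrams only; no df filtering in this sketch).
    fn fit(&mut self, documents: &[String]) {
        let mut vocab = HashMap::new();
        for doc in documents {
            for token in doc.split_whitespace() {
                let next = vocab.len();
                vocab.entry(token.to_string()).or_insert(next);
            }
        }
        self.vocabulary = Some(vocab);
    }

    /// Returns terms ordered by column index, or an error before fit().
    fn get_feature_names(&self) -> Result<Vec<String>, SketchError> {
        let vocab = self.vocabulary.as_ref().ok_or(SketchError::NotFitted)?;
        let mut names: Vec<(&String, &usize)> = vocab.iter().collect();
        names.sort_by_key(|(_, idx)| **idx);
        Ok(names.into_iter().map(|(term, _)| term.clone()).collect())
    }
}
```

Because `fit` takes `&mut self`, the caller has to bind the built vectorizer with `let mut`. Keeping the fitted state in `Option` lets accessors report an unfitted vectorizer through `Result` instead of panicking.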

Output Format

Return a sparse matrix as aprender::primitives::Matrix:

  • Rows: documents (n_samples)
  • Columns: features (n_features)
  • Values: TF-IDF scores (non-negative; bounded to [0.0, 1.0] only after L2 normalization)

For organizational-intelligence-plugin use case:

  • Input: 5,000 commit messages
  • Output: 5000 × 1500 sparse matrix (most values = 0.0)
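For scale: a dense 5000 × 1500 matrix of f32 holds 7,500,000 values, about 30 MB, even though most entries are zero. The sketch below shows one common sparse layout, compressed sparse row (CSR). `CsrSketch` is a hypothetical type for illustration only; whether aprender::primitives::Matrix uses this layout, or is sparse at all, is not stated here.

```rust
/// Compressed sparse row storage: only non-zero values are kept.
struct CsrSketch {
    n_cols: usize,
    /// Row i occupies values[row_ptr[i]..row_ptr[i + 1]].
    row_ptr: Vec<usize>,
    col_idx: Vec<usize>,
    values: Vec<f32>,
}

impl CsrSketch {
    /// Builds the sparse form from dense rows, skipping zeros.
    fn from_dense(rows: &[Vec<f32>], n_cols: usize) -> Self {
        let mut m = CsrSketch {
            n_cols,
            row_ptr: vec![0],
            col_idx: Vec::new(),
            values: Vec::new(),
        };
        for row in rows {
            for (j, &v) in row.iter().enumerate().take(n_cols) {
                if v != 0.0 {
                    m.col_idx.push(j);
                    m.values.push(v);
                }
            }
            m.row_ptr.push(m.values.len());
        }
        m
    }

    /// Looks up entry (i, j); absent entries are zero.
    /// Indexes directly for brevity, so an out-of-range row panics in this sketch.
    fn get(&self, i: usize, j: usize) -> f32 {
        let (start, end) = (self.row_ptr[i], self.row_ptr[i + 1]);
        self.col_idx[start..end]
            .iter()
            .position(|&c| c == j)
            .map_or(0.0, |p| self.values[start + p])
    }

    /// Number of stored (non-zero) values.
    fn nnz(&self) -> usize {
        self.values.len()
    }
}
```

Storage grows with the number of non-zero scores rather than with rows × columns, which suits short commit messages that each touch only a few vocabulary terms.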

Software Engineering Adaptations

Custom Stop Words Handling:

// Keep technical terms, including ones a generic stop-word list would drop
// Keep: "fix", "bug", "error", "null", "if", "for", "while"
// Remove: "the", "a", "an", "and", "or", "but", "this", "that"
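
One possible way to express the keep/remove rule is a keep list that overrides a generic stop-word list. `filter_tokens` is a hypothetical helper, not part of aprender::text::stopwords, and the word lists in the usage are just the examples given above.

```rust
use std::collections::HashSet;

/// Drops tokens found in `stop`, unless they also appear in `keep`.
/// The keep list wins, so keywords like "if" and "for" survive.
fn filter_tokens<'a>(
    tokens: &[&'a str],
    stop: &HashSet<&str>,
    keep: &HashSet<&str>,
) -> Vec<&'a str> {
    tokens
        .iter()
        .copied()
        .filter(|t| keep.contains(*t) || !stop.contains(*t))
        .collect()
}
```

For example, with stop = {"the", "a", "if", "for"} and keep = {"if", "for", "null"}, the tokens "fix the null check if a for" reduce to "fix null check if for".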

N-gram Examples:

  • Unigrams: ["null", "pointer", "fix"]
  • Bigrams: ["null_pointer", "race_condition", "memory_leak"]
  • Trigrams: ["use_after_free", "operator_precedence_bug"]
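
A small sketch of how n-grams like these could be produced from a token list. The `ngrams` helper is hypothetical, and joining with an underscore is inferred from the examples above rather than from an existing aprender API.

```rust
/// Builds all n-grams for n in min_n..=max_n, joining tokens with '_'.
/// Output is grouped by n: all unigrams first, then bigrams, and so on.
fn ngrams(tokens: &[&str], min_n: usize, max_n: usize) -> Vec<String> {
    let mut out = Vec::new();
    // slice::windows panics on a size of 0, so clamp the lower bound to 1.
    for n in min_n.max(1)..=max_n {
        for window in tokens.windows(n) {
            out.push(window.join("_"));
        }
    }
    out
}
```

Calling `ngrams(&["use", "after", "free"], 1, 3)` yields the three unigrams, then "use_after" and "after_free", then "use_after_free". Documents shorter than n simply contribute no n-grams of that size.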

Implementation Guidance

Phase 1: Core TF-IDF

  1. Build vocabulary with n-gram support
  2. Calculate document frequencies (df)
  3. Apply min_df and max_df filtering
  4. Compute IDF scores
  5. Transform documents to TF-IDF vectors
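
As a rough sketch of the document-frequency and filtering steps, assuming documents are already tokenized: both function names are hypothetical, and treating min_df as an absolute count and max_df as an inclusive ratio is an assumption based on the `usize` and `f32` types in the API design.

```rust
use std::collections::{HashMap, HashSet};

/// Counts, for each term, how many documents contain it at least once.
fn document_frequencies(docs: &[Vec<&str>]) -> HashMap<String, usize> {
    let mut df = HashMap::new();
    for doc in docs {
        // Deduplicate within a document so repeated terms count once.
        let unique: HashSet<&str> = doc.iter().copied().collect();
        for term in unique {
            *df.entry(term.to_string()).or_insert(0) += 1;
        }
    }
    df
}

/// Keeps terms with df(t) >= min_df and df(t) / |D| <= max_df.
/// The result is sorted so column order is deterministic.
fn filter_vocabulary(
    df: &HashMap<String, usize>,
    n_docs: usize,
    min_df: usize,
    max_df: f32,
) -> Vec<String> {
    let mut kept: Vec<String> = df
        .iter()
        .filter(|(_, count)| {
            let c = **count;
            c >= min_df && (c as f32) / (n_docs as f32) <= max_df
        })
        .map(|(term, _)| term.clone())
        .collect();
    kept.sort();
    kept
}
```

Filtering before IDF scores are computed keeps the IDF vector aligned with the final vocabulary, so no columns need to be dropped after the transform.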

Phase 2: Optimizations

  1. Sparse matrix representation (most values are 0)
  2. Sublinear TF scaling option
  3. L2 normalization (unit vectors)
  4. Top-k feature selection (max_features)
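
Two of these items can be sketched briefly. The helper names are hypothetical, and ranking terms by corpus-wide count for max_features is an assumption (it mirrors scikit-learn's behavior); this issue does not specify the ranking criterion.

```rust
/// Scales a row to unit Euclidean length; an all-zero row is left unchanged.
fn l2_normalize(row: &mut [f32]) {
    let norm = row.iter().map(|v| v * v).sum::<f32>().sqrt();
    if norm > 0.0 {
        for v in row.iter_mut() {
            *v /= norm;
        }
    }
}

/// Keeps the k terms with the highest corpus-wide count,
/// breaking ties alphabetically so the result is deterministic.
fn top_k_terms(mut counts: Vec<(String, usize)>, k: usize) -> Vec<String> {
    counts.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    counts.into_iter().take(k).map(|(term, _)| term).collect()
}
```

After L2 normalization every score lies in [0, 1], and the dot product of two rows equals their cosine similarity. Guarding the zero-norm case matters for empty documents and documents made up entirely of stop words.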

Phase 3: Integration

  1. Integration with aprender::text::tokenize
  2. Integration with aprender::text::stopwords
  3. Serialization support via aprender::serialization

Quality Standards

Following aprender's Toyota Way principles:

  • ✅ Zero unwrap() calls (Cloudflare-class safety)
  • ✅ Result-based error handling with AprenderError
  • ✅ Comprehensive test coverage (≥95%)
  • ✅ Property-based testing with proptest
  • ✅ Benchmarks for performance validation
  • ✅ Doc tests for all public APIs

Test Coverage Requirements:

  • Unit tests: vocabulary building, IDF calculation, transform
  • Edge cases: empty documents, single document, all stop words
  • Integration tests: full fit_transform pipeline
  • Property tests: tf-idf scores in valid range [0, inf)
  • Benchmarks: transform time for 5K documents × 1500 features

References

NLP Specification:

  • organizational-intelligence-plugin: docs/specifications/nlp-models-techniques-spec.md
  • Section 2.1.2: TF-IDF Implementation Details
  • Section 3: ML Classification with TF-IDF Features
  • Section 5.1: Three-Tier Architecture (Tier 2 needs TF-IDF)

Industry Best Practices:

Research Evidence:

  • Ensemble study: N-grams (1-3) + TF-IDF achieve 85-92% accuracy in text classification
  • Bug report management: TF-IDF features improve defect categorization vs. keyword matching

Success Criteria

  • TfidfVectorizer struct with builder pattern API
  • fit(), transform(), fit_transform() methods
  • N-gram support (1-3 grams)
  • Document frequency filtering (min_df, max_df)
  • Sublinear TF scaling option
  • Integration with existing aprender::text::tokenize
  • Return aprender::primitives::Matrix (sparse-friendly)
  • Comprehensive tests (≥95% coverage)
  • Benchmarks showing transform completes in <100ms for 5K docs × 1500 features
  • Documentation with examples

Priority

High Priority - Blocks organizational-intelligence-plugin Phase 2 NLP implementation

Current project status: 86.65% test coverage, 422 tests passing, ready for ML integration once TF-IDF is available.

Related Work

This will enable:

  1. Tier 2 ML classifier (TF-IDF + Random Forest/XGBoost)
  2. Feature interpretability (top TF-IDF terms per category)
  3. Training data pipeline for 5K+ commit messages
  4. Target: ≥80% actionable defect categorization (current: 30.8%)

cc @paiml/aprender-maintainers
