feat(text): Add TF-IDF vectorizer for text feature extraction

## Summary

Add TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to the `aprender::text` module for text feature extraction in machine learning pipelines.

## Motivation

The organizational-intelligence-plugin project (https://github.com/paiml/organizational-intelligence-plugin) is implementing NLP-based defect classification and requires TF-IDF feature extraction for Phase 2 ML classification (Tier 2: TF-IDF + ML classifier, target <100ms inference).

**Current Status:**
- ✅ Phase 1 Complete: Rule-based classifier with aprender text processing (tokenization, stemming, stopwords)
- ✅ 18-category defect taxonomy (10 general + 8 transpiler-specific)
- ✅ Multi-label classification support
- ❌ Phase 2 Blocked: Need TF-IDF vectorization for ML feature extraction

**Target Pipeline:**
```rust
// Desired API
use aprender::text::tfidf::TfidfVectorizer;
use aprender::text::tokenize::WordTokenizer;

let mut vectorizer = TfidfVectorizer::new()  // `mut` because fit() takes &mut self
    .max_features(1500)
    .ngram_range(1, 3)  // unigrams, bigrams, trigrams
    .min_df(5)          // min document frequency
    .max_df(0.5)        // max document frequency (ignore common terms)
    .sublinear_tf(true); // use 1 + log(tf) damping

// Fit on training corpus
vectorizer.fit(&commit_messages)?;

// Transform to sparse feature matrix
let X_train = vectorizer.transform(&commit_messages)?;

// Get feature names for interpretability
let feature_names = vectorizer.get_feature_names()?;
```

## Requirements

### Core TF-IDF Formula

**TF (Term Frequency):**
```
tf(t, d) = count(t in d)
```

**IDF (Inverse Document Frequency):**
```
idf(t, D) = log((1 + |D|) / (1 + df(t))) + 1
```
Where:
- `df(t)` = number of documents containing term `t`
- `|D|` = total number of documents

**TF-IDF:**
```
tfidf(t, d, D) = tf(t, d) × idf(t, D)
```

**Sublinear TF Scaling (optional):**
```
tf_sublinear(t, d) = 1 + log(tf(t, d))  if tf(t, d) > 0
                   = 0                   otherwise
```
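
As a quick check of the formulas above, here is a minimal sketch in plain Rust. The function names `idf`, `tf_weight`, and `tfidf` are hypothetical and not part of aprender, and the natural logarithm is an assumption (the formulas only say `log`; scikit-learn uses the natural log).

```rust
/// Smoothed IDF: ln((1 + |D|) / (1 + df(t))) + 1.
/// Assumption: natural logarithm.
fn idf(n_docs: usize, df: usize) -> f32 {
    ((1.0 + n_docs as f32) / (1.0 + df as f32)).ln() + 1.0
}

/// Term-frequency weight: the raw count, or 1 + ln(count) when
/// sublinear scaling is enabled. A count of zero always maps to 0.
fn tf_weight(count: usize, sublinear: bool) -> f32 {
    if count == 0 {
        0.0
    } else if sublinear {
        1.0 + (count as f32).ln()
    } else {
        count as f32
    }
}

/// TF-IDF score for one term in one document.
fn tfidf(count: usize, n_docs: usize, df: usize, sublinear: bool) -> f32 {
    tf_weight(count, sublinear) * idf(n_docs, df)
}
```

Because of the `+ 1` smoothing, a term that appears in every document gets `idf = 1` rather than 0, so it is down-weighted but not discarded.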

### API Design

```rust
pub struct TfidfVectorizer {
    // Configuration
    max_features: Option<usize>,
    ngram_range: (usize, usize),
    min_df: usize,
    max_df: f32,
    sublinear_tf: bool,
    
    // Fitted state
    vocabulary: Option<HashMap<String, usize>>,
    idf_scores: Option<Vec<f32>>,
    document_count: usize,
}

impl TfidfVectorizer {
    pub fn new() -> Self;
    
    // Builder pattern for configuration
    pub fn max_features(self, n: usize) -> Self;
    pub fn ngram_range(self, min: usize, max: usize) -> Self;
    pub fn min_df(self, threshold: usize) -> Self;
    pub fn max_df(self, ratio: f32) -> Self;
    pub fn sublinear_tf(self, enable: bool) -> Self;
    
    // Fit on training corpus
    pub fn fit(&mut self, documents: &[String]) -> Result<(), AprenderError>;
    
    // Transform documents to TF-IDF matrix
    pub fn transform(&self, documents: &[String]) -> Result<Matrix, AprenderError>;
    
    // Fit and transform in one step
    pub fn fit_transform(&mut self, documents: &[String]) -> Result<Matrix, AprenderError>;
    
    // Get feature names (vocabulary terms)
    pub fn get_feature_names(&self) -> Result<Vec<String>, AprenderError>;
    
    // Get IDF scores for vocabulary
    pub fn get_idf_scores(&self) -> Result<Vec<f32>, AprenderError>;
}
```

### Output Format

Return a sparse matrix as `aprender::primitives::Matrix`:
- Rows: documents (n_samples)
- Columns: features (n_features)
- Values: non-negative TF-IDF scores (within 0.0 to 1.0 once rows are L2-normalized; unbounded above otherwise)

For organizational-intelligence-plugin use case:
- Input: 5,000 commit messages
- Output: 5000 × 1500 sparse matrix (most values = 0.0)
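
For scale, a dense 5000 × 1500 `f32` matrix holds 7.5 million values (about 30 MB), almost all of them zero. One common way to store only the non-zeros is compressed sparse row (CSR). The sketch below is illustrative only: `CsrMatrix` is a hypothetical type, not the layout of `aprender::primitives::Matrix`.

```rust
/// Hypothetical compressed sparse row (CSR) container; not an aprender type.
struct CsrMatrix {
    n_cols: usize,
    indptr: Vec<usize>,  // row i occupies indices[indptr[i]..indptr[i + 1]]
    indices: Vec<usize>, // column index of each stored value
    values: Vec<f32>,    // non-zero TF-IDF scores
}

impl CsrMatrix {
    /// Builds the matrix from per-document lists of (column, value) pairs.
    fn from_rows(n_cols: usize, rows: &[Vec<(usize, f32)>]) -> Self {
        let mut indptr = vec![0];
        let mut indices = Vec::new();
        let mut values = Vec::new();
        for row in rows {
            for &(col, value) in row {
                indices.push(col);
                values.push(value);
            }
            indptr.push(indices.len());
        }
        CsrMatrix { n_cols, indptr, indices, values }
    }

    /// Returns (n_samples, n_features).
    fn shape(&self) -> (usize, usize) {
        (self.indptr.len() - 1, self.n_cols)
    }

    /// Looks up one entry; anything not stored is 0.0.
    /// Panics if `row` is out of range (acceptable for a sketch only).
    fn get(&self, row: usize, col: usize) -> f32 {
        let span = self.indptr[row]..self.indptr[row + 1];
        self.indices[span.clone()]
            .iter()
            .position(|&c| c == col)
            .map_or(0.0, |offset| self.values[span.start + offset])
    }
}
```

A production version would return a `Result` from `get` instead of panicking, in line with the quality standards in this issue.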

### Software Engineering Adaptations

**Custom Stop Words Handling:**
```rust
// Keep domain-relevant terms, including some that general-purpose lists treat as stop words
// Keep: "fix", "bug", "error", "null", "if", "for", "while"
// Remove: "the", "a", "an", "and", "or", "but", "this", "that"
```
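
A minimal sketch of such a filter, assuming tokens are already lowercased. The helper names `commit_stop_words` and `remove_stop_words` are hypothetical and do not reflect the `aprender::text::stopwords` API; the word list is just the "Remove" set above.

```rust
use std::collections::HashSet;

/// Reduced stop word list for commit messages: only the words listed
/// under "Remove" above. Terms such as "if", "for", and "while" are kept.
fn commit_stop_words() -> HashSet<&'static str> {
    ["the", "a", "an", "and", "or", "but", "this", "that"]
        .into_iter()
        .collect()
}

/// Drops every token found in `stop_words`, preserving token order.
fn remove_stop_words<'a>(tokens: &[&'a str], stop_words: &HashSet<&str>) -> Vec<&'a str> {
    tokens
        .iter()
        .copied()
        .filter(|t| !stop_words.contains(*t))
        .collect()
}
```

With this list, `["fix", "the", "null", "if", "a", "bug"]` reduces to `["fix", "null", "if", "bug"]`.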

**N-gram Examples:**
- Unigrams: `["null", "pointer", "fix"]`
- Bigrams: `["null_pointer", "race_condition", "memory_leak"]`
- Trigrams: `["use_after_free", "operator_precedence_bug"]`
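
The n-gram expansion can be sketched with slice windows. The function name `ngrams` is hypothetical, and joining tokens with `_` is an assumption taken from the examples above.

```rust
/// Builds all n-grams with min_n <= n <= max_n from a token slice,
/// joining the tokens of each n-gram with '_'.
fn ngrams(tokens: &[&str], min_n: usize, max_n: usize) -> Vec<String> {
    let mut out = Vec::new();
    // windows(0) would panic, so n starts at 1 at the lowest.
    for n in min_n.max(1)..=max_n {
        if n > tokens.len() {
            break;
        }
        for window in tokens.windows(n) {
            out.push(window.join("_"));
        }
    }
    out
}
```

For the tokens `["null", "pointer", "fix"]` with range (1, 2), this yields the three unigrams followed by `null_pointer` and `pointer_fix`.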

## Implementation Guidance

### Phase 1: Core TF-IDF
1. Build vocabulary with n-gram support
2. Calculate document frequencies (df)
3. Apply min_df and max_df filtering to the vocabulary
4. Compute IDF scores
5. Transform documents to TF-IDF vectors
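
The vocabulary, document-frequency, and filtering steps can be sketched as follows. This is a rough illustration under stated assumptions, not the proposed implementation: `build_vocabulary` is a hypothetical helper, documents are assumed to be pre-tokenized, and a `BTreeMap` is used instead of the `HashMap` in the API design so that column indices come out in a deterministic (alphabetical) order.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical helper: maps each surviving term to a column index.
/// A term survives if min_df <= df(term) <= max_df * n_docs.
fn build_vocabulary(
    docs: &[Vec<String>],
    min_df: usize,
    max_df: f32,
) -> BTreeMap<String, usize> {
    // Document frequency: count each term at most once per document.
    let mut df: BTreeMap<&str, usize> = BTreeMap::new();
    for doc in docs {
        let unique: HashSet<&str> = doc.iter().map(String::as_str).collect();
        for term in unique {
            *df.entry(term).or_insert(0) += 1;
        }
    }

    // Filter by document frequency, then assign contiguous column indices.
    let max_count = max_df * docs.len() as f32;
    df.into_iter()
        .filter(|&(_, count)| count >= min_df && count as f32 <= max_count)
        .enumerate()
        .map(|(index, (term, _))| (term.to_string(), index))
        .collect()
}
```

Filtering before indices are assigned keeps the columns contiguous, so the IDF vector and the transformed matrix only need entries for terms that survive. A `max_features` cutoff is not applied in this sketch.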

### Phase 2: Optimizations
1. Sparse matrix representation (most values are 0)
2. Sublinear TF scaling option
3. L2 normalization (unit vectors)
4. Top-k feature selection (max_features)
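
L2 normalization of a single document vector might look like the sketch below. The function name `l2_normalize` and the `(column, value)` pair representation of a sparse row are assumptions made for illustration.

```rust
/// Scales a sparse row of (column, value) pairs to unit L2 norm in place.
/// Empty or all-zero rows are left unchanged to avoid dividing by zero.
fn l2_normalize(row: &mut [(usize, f32)]) {
    let norm = row.iter().map(|&(_, v)| v * v).sum::<f32>().sqrt();
    if norm > 0.0 {
        for (_, v) in row.iter_mut() {
            *v /= norm;
        }
    }
}
```

After normalization every score lies in [0, 1], and the dot product of two rows equals their cosine similarity.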

### Phase 3: Integration
1. Integration with `aprender::text::tokenize`
2. Integration with `aprender::text::stopwords`
3. Serialization support via `aprender::serialization`

## Quality Standards

Following aprender's Toyota Way principles:
- ✅ Zero `unwrap()` calls (Cloudflare-class safety)
- ✅ Result-based error handling with `AprenderError`
- ✅ Comprehensive test coverage (≥95%)
- ✅ Property-based testing with proptest
- ✅ Benchmarks for performance validation
- ✅ Doc tests for all public APIs

**Test Coverage Requirements:**
- Unit tests: vocabulary building, IDF calculation, transform
- Edge cases: empty documents, single document, all stop words
- Integration tests: full fit_transform pipeline
- Property tests: tf-idf scores in valid range [0, inf)
- Benchmarks: transform time for 5K documents × 1500 features

## References

**NLP Specification:**
- organizational-intelligence-plugin: `docs/specifications/nlp-models-techniques-spec.md`
- Section 2.1.2: TF-IDF Implementation Details
- Section 3: ML Classification with TF-IDF Features
- Section 5.1: Three-Tier Architecture (Tier 2 needs TF-IDF)

**Industry Best Practices:**
- scikit-learn TfidfVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
- Practitioner consensus: "TF-IDF + n-grams + linear model is an extremely strong baseline"

**Research Evidence:**
- Ensemble study: N-grams (1-3) + TF-IDF achieve 85-92% accuracy in text classification
- Bug report management: TF-IDF features improve defect categorization vs. keyword matching

## Success Criteria

- [ ] `TfidfVectorizer` struct with builder pattern API
- [ ] `fit()`, `transform()`, `fit_transform()` methods
- [ ] N-gram support (1-3 grams)
- [ ] Document frequency filtering (min_df, max_df)
- [ ] Sublinear TF scaling option
- [ ] Integration with existing `aprender::text::tokenize`
- [ ] Return `aprender::primitives::Matrix` (sparse-friendly)
- [ ] Comprehensive tests (≥95% coverage)
- [ ] Benchmarks showing <100ms for 5K docs × 1500 features
- [ ] Documentation with examples

## Priority

**High Priority** - Blocks organizational-intelligence-plugin Phase 2 NLP implementation

Current project status: 86.65% test coverage, 422 tests passing, ready for ML integration once TF-IDF is available.

## Related Work

This will enable:
1. Tier 2 ML classifier (TF-IDF + Random Forest/XGBoost)
2. Feature interpretability (top TF-IDF terms per category)
3. Training data pipeline for 5K+ commit messages
4. Target: ≥80% actionable defect categorization (current: 30.8%)

---

cc @paiml/aprender-maintainers
