feat: Mixture of Experts (MoE) support for specialized ensemble learning

## Summary

Add a Mixture of Experts (MoE) architecture for specialized ensemble learning: multiple expert models plus a learnable gating network that routes each input to the most appropriate expert(s).

## Motivation

Use case: **depyler-oracle** for transpiler error classification.

The current single RandomForest handles every error type with one model. MoE would allow a dedicated expert per error category:
- **Scope Expert**: E0425, E0412 (variable/import resolution)
- **Type Expert**: E0308, E0277 (casts, trait bounds)
- **Method Expert**: E0599 (API mapping)

Because each expert specializes, accuracy should improve on edge cases within its category.

## Proposed API

```rust
use aprender::ensemble::{MixtureOfExperts, GatingNetwork, SoftmaxGating};

// Define experts
let experts = vec![
    RandomForest::new(100, 10),  // scope expert
    RandomForest::new(100, 10),  // type expert
    RandomForest::new(100, 10),  // method expert
];

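// Gating network (routes inputs to experts)
// n_features: input dimensionality; n_experts: number of experts (3 here)
let gating = SoftmaxGating::new(n_features, n_experts);

// MoE ensemble (mutable, since fit() updates the gating network)
let mut moe = MixtureOfExperts::builder()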
    .experts(experts)
    .gating(gating)
    .top_k(2)  // sparse: only top 2 experts per input
    .build();

// Train end-to-end
moe.fit(&X_train, &y_train)?;

// Predict (weighted combination of expert outputs)
let predictions = moe.predict(&X_test)?;
```

## Core Components

### 1. GatingNetwork Trait (~50 LOC)
```rust
pub trait GatingNetwork: Send + Sync {
    /// Compute expert weights for a single input (one weight per expert)
    fn forward(&self, x: &[f32]) -> Vec<f32>;

    /// Train the gating network from the inputs and each expert's per-sample losses
    fn fit(&mut self, x: &[Vec<f32>], expert_losses: &[Vec<f32>]) -> Result<()>;
}
```

### 2. SoftmaxGating (~100 LOC)
```rust
pub struct SoftmaxGating {
    weights: Matrix<f32>,  // [n_features, n_experts]
    temperature: f32,      // softmax temperature; lower values sharpen routing
}
```
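
To make the routing step concrete, here is a minimal sketch of what `SoftmaxGating::forward` might compute. It is not the final implementation: it assumes the weights are stored as plain nested `Vec`s in `[n_features][n_experts]` layout instead of `Matrix<f32>`, that `temperature` is positive, and the free function `softmax_gate` is a hypothetical name used only for illustration.

```rust
/// Temperature-scaled softmax over per-expert logits.
/// Assumption: `weights` is laid out as [n_features][n_experts].
pub fn softmax_gate(x: &[f32], weights: &[Vec<f32>], temperature: f32) -> Vec<f32> {
    let n_experts = weights.first().map_or(0, |row| row.len());

    // logits[j] = sum_i x[i] * weights[i][j]
    let mut logits = vec![0.0f32; n_experts];
    for (xi, row) in x.iter().zip(weights) {
        for (logit, w) in logits.iter_mut().zip(row) {
            *logit += xi * w;
        }
    }

    // Subtract the max logit before exponentiating for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();

    // Normalize so the expert weights sum to 1.
    exps.iter().map(|e| e / sum).collect()
}
```

With all-zero weights every expert receives the same share, which is a reasonable starting point before the gating network is trained.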

### 3. MixtureOfExperts (~150 LOC)
```rust
pub struct MixtureOfExperts<E: Estimator, G: GatingNetwork> {
    experts: Vec<E>,
    gating: G,
    top_k: usize,
    load_balance_weight: f32,  // optional: encourage even expert usage
}
```
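
The `top_k` field implies a sparse combination step at prediction time. The sketch below shows one way that could work, assuming each expert can emit a vector of class probabilities (the proposal does not say whether `RandomForest` exposes this). The helper names `top_k_gates` and `combine` are hypothetical.

```rust
/// Keep the `k` largest gate weights, zero the rest, and renormalize to sum to 1.
pub fn top_k_gates(gates: &[f32], k: usize) -> Vec<f32> {
    let mut idx: Vec<usize> = (0..gates.len()).collect();
    // Sort expert indices by descending gate weight.
    idx.sort_by(|&a, &b| gates[b].total_cmp(&gates[a]));
    let kept = &idx[..k.min(idx.len())];

    let total: f32 = kept.iter().map(|&i| gates[i]).sum();
    let mut sparse = vec![0.0; gates.len()];
    if total > 0.0 {
        for &i in kept {
            sparse[i] = gates[i] / total;
        }
    }
    sparse
}

/// Gate-weighted sum of per-expert class probabilities.
/// Experts with a zero gate are skipped, which is where the sparsity saves work.
pub fn combine(sparse_gates: &[f32], expert_probs: &[Vec<f32>]) -> Vec<f32> {
    let n_classes = expert_probs.first().map_or(0, |p| p.len());
    let mut out = vec![0.0; n_classes];
    for (g, probs) in sparse_gates.iter().zip(expert_probs) {
        if *g == 0.0 {
            continue;
        }
        for (o, p) in out.iter_mut().zip(probs) {
            *o += g * p;
        }
    }
    out
}
```

In a real implementation the skipped experts would not be evaluated at all. Here their outputs are passed in only to keep the example short.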

## Training Strategy

1. **Option A - Joint training**: Train gating + experts together (complex)
2. **Option B - Two-stage** (recommended):
   - Stage 1: Pre-train each expert on the labeled subset for its error category
   - Stage 2: Train the gating network to route each input to the best-performing expert
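
Stage 2 of Option B could be sketched as a small softmax regression. It assumes that `expert_losses[i][j]` holds the loss of expert `j` on sample `i`, and that the routing target for a sample is simply its lowest-loss expert. The function name `fit_gating`, the plain SGD loop, and the `lr` / `epochs` parameters are illustrative assumptions, not part of the proposed API.

```rust
/// Stage 2 sketch: fit gating weights so each sample routes to its lowest-loss expert.
/// Returns weights laid out as [n_features][n_experts].
pub fn fit_gating(
    x: &[Vec<f32>],
    expert_losses: &[Vec<f32>],
    lr: f32,
    epochs: usize,
) -> Vec<Vec<f32>> {
    let n_features = x.first().map_or(0, |r| r.len());
    let n_experts = expert_losses.first().map_or(0, |r| r.len());
    let mut w = vec![vec![0.0f32; n_experts]; n_features];

    for _ in 0..epochs {
        for (xi, losses) in x.iter().zip(expert_losses) {
            // Routing target: index of the expert with the smallest loss.
            let best = losses
                .iter()
                .enumerate()
                .min_by(|a, b| a.1.total_cmp(b.1))
                .map_or(0, |(j, _)| j);

            // Forward pass: logits followed by a numerically stable softmax.
            let mut logits = vec![0.0f32; n_experts];
            for (f, row) in xi.iter().zip(&w) {
                for (l, wj) in logits.iter_mut().zip(row) {
                    *l += f * wj;
                }
            }
            let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
            let sum: f32 = exps.iter().sum();

            // Cross-entropy gradient step: dL/dw[f][j] = (p_j - 1[j == best]) * x_f
            for (f, row) in xi.iter().zip(w.iter_mut()) {
                for (j, wj) in row.iter_mut().enumerate() {
                    let target = if j == best { 1.0 } else { 0.0 };
                    *wj -= lr * (exps[j] / sum - target) * f;
                }
            }
        }
    }
    w
}
```

Using a hard argmin target keeps the gating objective independent of how the experts were trained, which is the main appeal of the two-stage option.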

## Nice-to-Have Features

- [ ] Load balancing loss (prevent expert collapse), weighted by `load_balance_weight`
- [ ] Sparse top-k routing (efficiency), controlled by `top_k`
- [ ] Expert capacity limits
- [ ] Auxiliary loss for gating
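
For the load balancing item, one candidate is the auxiliary loss used in Switch Transformers (see References): the number of experts times the sum over experts of `f_i * P_i`. Here `f_i` is the fraction of inputs whose top-1 expert is `i`, and `P_i` is the mean gate probability assigned to expert `i`. The sketch below is a hedged illustration. The function name `load_balance_loss` is hypothetical, and the result would still need to be scaled by `load_balance_weight`.

```rust
/// Switch-Transformer-style load balancing loss over a batch of gate vectors.
/// `gates[n][i]` is the gate probability of expert `i` for input `n`.
/// The value is 1.0 under perfectly uniform routing and grows as routing collapses.
pub fn load_balance_loss(gates: &[Vec<f32>]) -> f32 {
    let n = gates.len();
    let n_experts = gates.first().map_or(0, |g| g.len());
    if n == 0 || n_experts == 0 {
        return 0.0;
    }

    let mut frac = vec![0.0f32; n_experts]; // f_i: share of inputs routed to expert i
    let mut mean_p = vec![0.0f32; n_experts]; // P_i: mean gate probability of expert i
    for g in gates {
        let top = g
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map_or(0, |(i, _)| i);
        frac[top] += 1.0 / n as f32;
        for (m, p) in mean_p.iter_mut().zip(g) {
            *m += p / n as f32;
        }
    }

    n_experts as f32 * frac.iter().zip(&mean_p).map(|(f, p)| f * p).sum::<f32>()
}
```

A batch split evenly across two experts scores lower than one where every input picks the same expert, so minimizing this term pushes the gating network toward even expert usage.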

## Estimated Effort

- Core implementation: ~300 LOC (sum of the per-component estimates under Core Components)
- Tests: ~100 LOC
- Total: 1-2 days

## References

- [Outrageously Large Neural Networks (Shazeer et al., 2017)](https://arxiv.org/abs/1701.06538)
- [Switch Transformers (Fedus et al., 2021)](https://arxiv.org/abs/2101.03961)
