feat: Mixture of Experts (MoE) support for specialized ensemble learning #101

@noahgift

Summary

Add Mixture of Experts (MoE) architecture for specialized ensemble learning. This enables multiple expert models with a learnable gating network that routes inputs to the most appropriate expert(s).

Motivation

Use case: depyler-oracle for transpiler error classification.

The current single RandomForest treats all error types uniformly. MoE would allow one expert per error category:

  • Scope Expert: E0425, E0412 (variable/import resolution)
  • Type Expert: E0308, E0277 (casts, trait bounds)
  • Method Expert: E0599 (API mapping)

Each expert specializes in one category, which should improve accuracy on edge cases within that category.

Proposed API

use aprender::ensemble::{MixtureOfExperts, GatingNetwork, SoftmaxGating};

// Define experts
let experts = vec![
    RandomForest::new(100, 10),  // scope expert
    RandomForest::new(100, 10),  // type expert
    RandomForest::new(100, 10),  // method expert
];

// Gating network (routes inputs to experts)
// n_features is the input dimensionality of X_train.
let n_experts = experts.len();
let gating = SoftmaxGating::new(n_features, n_experts);

// MoE ensemble
let mut moe = MixtureOfExperts::builder()
    .experts(experts)
    .gating(gating)
    .top_k(2)  // sparse: only top 2 experts per input
    .build();

// Train experts and gating (fit needs mutable access)
moe.fit(&X_train, &y_train)?;

// Predict (weighted combination of expert outputs)
let predictions = moe.predict(&X_test)?;

Core Components

1. GatingNetwork Trait (~50 LOC)

pub trait GatingNetwork: Send + Sync {
    /// Compute expert weights for input
    fn forward(&self, x: &[f32]) -> Vec<f32>;
    
    /// Train gating network
    fn fit(&mut self, X: &[Vec<f32>], expert_losses: &[Vec<f32>]) -> Result<()>;
}
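
To illustrate the contract of the trait, here is a minimal sketch of a baseline implementation. `PriorGating` is a hypothetical name, not part of the proposal, and the local `Result` alias stands in for whatever alias the crate uses. It also assumes `expert_losses[i][j]` is the loss of expert `j` on sample `i`, which the issue does not specify.

```rust
use std::error::Error;

// Assumption: stand-in for the crate-level Result alias used in the issue.
type Result<T> = std::result::Result<T, Box<dyn Error + Send + Sync>>;

pub trait GatingNetwork: Send + Sync {
    fn forward(&self, x: &[f32]) -> Vec<f32>;
    fn fit(&mut self, x: &[Vec<f32>], expert_losses: &[Vec<f32>]) -> Result<()>;
}

/// Hypothetical baseline: ignores the input and learns a fixed prior over
/// experts, equal to how often each expert had the lowest loss.
pub struct PriorGating {
    weights: Vec<f32>,
}

impl PriorGating {
    pub fn new(n_experts: usize) -> Self {
        // Start from a uniform distribution over experts.
        Self { weights: vec![1.0 / n_experts as f32; n_experts] }
    }
}

impl GatingNetwork for PriorGating {
    fn forward(&self, _x: &[f32]) -> Vec<f32> {
        self.weights.clone()
    }

    fn fit(&mut self, x: &[Vec<f32>], expert_losses: &[Vec<f32>]) -> Result<()> {
        let n = self.weights.len();
        if n == 0 || x.is_empty() || x.len() != expert_losses.len() {
            return Err("inputs must be non-empty and of equal length".into());
        }
        let mut wins = vec![0usize; n];
        for row in expert_losses {
            if row.len() != n {
                return Err("each loss row needs one entry per expert".into());
            }
            // Index of the expert with the lowest loss on this sample.
            let mut best = 0;
            for (i, &loss) in row.iter().enumerate() {
                if loss < row[best] {
                    best = i;
                }
            }
            wins[best] += 1;
        }
        let total = expert_losses.len() as f32;
        self.weights = wins.iter().map(|&w| w as f32 / total).collect();
        Ok(())
    }
}
```

Whatever the implementation, `forward` is expected to return one non-negative weight per expert, summing to 1.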

2. SoftmaxGating (~100 LOC)

pub struct SoftmaxGating {
    weights: Matrix<f32>,  // [n_features, n_experts]
    temperature: f32,
}
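
A rough sketch of what the forward pass of this struct could look like, assuming a plain linear layer followed by a temperature-scaled softmax. `Vec<Vec<f32>>` replaces `Matrix<f32>` here only to keep the example self-contained, and the zero initialization in `new` is an assumption.

```rust
/// Sketch only: weights are stored row-major as weights[feature][expert].
pub struct SoftmaxGating {
    weights: Vec<Vec<f32>>, // [n_features][n_experts]
    temperature: f32,
}

impl SoftmaxGating {
    /// Assumption: zero-initialized weights, so routing starts out uniform.
    pub fn new(n_features: usize, n_experts: usize) -> Self {
        Self {
            weights: vec![vec![0.0; n_experts]; n_features],
            temperature: 1.0,
        }
    }

    /// logits = x * W, then softmax(logits / temperature).
    pub fn forward(&self, x: &[f32]) -> Vec<f32> {
        let n_experts = self.weights.first().map_or(0, |row| row.len());
        let mut logits = vec![0.0f32; n_experts];
        for (xi, row) in x.iter().zip(&self.weights) {
            for (logit, w) in logits.iter_mut().zip(row) {
                *logit += xi * w;
            }
        }
        softmax(&logits, self.temperature)
    }
}

/// Numerically stable softmax: subtract the max logit before exponentiating.
pub fn softmax(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&z| ((z - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

A lower temperature sharpens the distribution toward a single expert, while a higher one flattens it toward uniform routing.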

3. MixtureOfExperts (~150 LOC)

pub struct MixtureOfExperts<E: Estimator, G: GatingNetwork> {
    experts: Vec<E>,
    gating: G,
    top_k: usize,
    load_balance_weight: f32,  // optional: encourage even expert usage
}
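
The `top_k` field implies sparse routing at prediction time. One way this could work, sketched under the assumption that each expert returns a class-probability vector: keep the `top_k` largest gate weights, renormalize them, and take the weighted sum. The helper names `top_k_weights` and `combine` are hypothetical.

```rust
/// Keep the k largest gate weights, zero the rest, and renormalize the
/// survivors so they sum to 1.
pub fn top_k_weights(gate: &[f32], k: usize) -> Vec<f32> {
    let mut order: Vec<usize> = (0..gate.len()).collect();
    // Sort expert indices by descending weight; total_cmp gives f32 a total order.
    order.sort_by(|&a, &b| gate[b].total_cmp(&gate[a]));
    let mut sparse = vec![0.0f32; gate.len()];
    let kept: f32 = order.iter().take(k).map(|&i| gate[i]).sum();
    if kept > 0.0 {
        for &i in order.iter().take(k) {
            sparse[i] = gate[i] / kept;
        }
    }
    sparse
}

/// Weighted sum of per-expert class-probability vectors.
pub fn combine(expert_probs: &[Vec<f32>], weights: &[f32]) -> Vec<f32> {
    let n_classes = expert_probs.first().map_or(0, |p| p.len());
    let mut out = vec![0.0f32; n_classes];
    for (probs, &w) in expert_probs.iter().zip(weights) {
        if w == 0.0 {
            continue; // experts outside the top k contribute nothing
        }
        for (o, p) in out.iter_mut().zip(probs) {
            *o += w * p;
        }
    }
    out
}
```

In a real implementation the experts with zero weight would not be evaluated at all, which is where the efficiency gain of sparse routing comes from.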

Training Strategy

  1. Option A - Joint training: Train gating + experts together (complex)
  2. Option B - Two-stage (recommended):
    • Stage 1: Pre-train experts on labeled subsets
    • Stage 2: Train gating to route to best expert
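
The data preparation for Option B could be sketched as follows. The error-code grouping comes from the Motivation section; the function names, the expert index order (0 = scope, 1 = type, 2 = method), and the use of per-sample lowest loss as the routing target are assumptions made for illustration.

```rust
/// Expert index for a rustc error code, following the grouping in the
/// Motivation section. Returns None for codes not listed there.
pub fn expert_for_code(code: &str) -> Option<usize> {
    match code {
        "E0425" | "E0412" => Some(0), // scope expert
        "E0308" | "E0277" => Some(1), // type expert
        "E0599" => Some(2),           // method expert
        _ => None,
    }
}

/// Stage 1: split sample indices into one training subset per expert.
/// Samples with unlisted codes are left out of every subset.
pub fn partition_by_expert(codes: &[&str], n_experts: usize) -> Vec<Vec<usize>> {
    let mut subsets = vec![Vec::new(); n_experts];
    for (i, code) in codes.iter().enumerate() {
        if let Some(e) = expert_for_code(code) {
            if e < n_experts {
                subsets[e].push(i);
            }
        }
    }
    subsets
}

/// Stage 2: for each sample, the routing target is the expert with the
/// lowest loss. expert_losses[i][j] is the loss of expert j on sample i.
pub fn routing_targets(expert_losses: &[Vec<f32>]) -> Vec<usize> {
    expert_losses
        .iter()
        .map(|row| {
            row.iter()
                .enumerate()
                .min_by(|a, b| a.1.total_cmp(b.1))
                .map_or(0, |(i, _)| i)
        })
        .collect()
}
```

The gating network can then be fit as an ordinary classifier from input features to these target indices.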

Nice-to-Have Features

  • Load balancing loss (prevent expert collapse)
  • Sparse top-k routing (efficiency)
  • Expert capacity limits
  • Auxiliary loss for gating
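
For the load balancing item, one common formulation from the sparse-MoE literature is `n_experts * sum_i(f_i * p_i)`, where `f_i` is the fraction of samples whose top-1 expert is `i` and `p_i` is the mean gate probability of expert `i`. The sketch below assumes that formulation; the issue does not say which loss is intended, and the function name is hypothetical.

```rust
/// Auxiliary load-balancing loss: n_experts * sum_i(f_i * p_i).
/// gate_probs[s][i] is the gate probability of expert i for sample s.
/// Assumes every row has the same length.
pub fn load_balance_loss(gate_probs: &[Vec<f32>]) -> f32 {
    let n_samples = gate_probs.len();
    let n_experts = gate_probs.first().map_or(0, |row| row.len());
    if n_samples == 0 || n_experts == 0 {
        return 0.0;
    }
    let mut routed = vec![0.0f32; n_experts]; // top-1 counts per expert
    let mut prob_sum = vec![0.0f32; n_experts]; // summed gate probabilities
    for row in gate_probs {
        let mut best = 0;
        for (i, &p) in row.iter().enumerate() {
            if p > row[best] {
                best = i;
            }
            prob_sum[i] += p;
        }
        routed[best] += 1.0;
    }
    let n = n_samples as f32;
    let dot: f32 = routed
        .iter()
        .zip(&prob_sum)
        .map(|(f, p)| (f / n) * (p / n))
        .sum();
    n_experts as f32 * dot
}
```

The loss is 1.0 when routing is perfectly balanced and grows as routing collapses onto fewer experts. Scaled by `load_balance_weight`, it would be added to the gating objective.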

Estimated Effort

  • Core implementation: ~300 LOC (GatingNetwork ~50, SoftmaxGating ~100, MixtureOfExperts ~150)
  • Tests: ~100 LOC
  • Total: 1-2 days
