Pinned
As MoEs grow larger and sparser, they become memory-bottlenecked.
What if experts were actually composable - so you only keep the subset relevant to your task?
We show that this doesn't emerge in standard MoEs (their training makes this hard), but you can pre-train MoEs to
MoEs are everywhere in frontier models, and they are deployed as a monolith system.
But many applications only need a narrow slice of capabilities, e.g., math, code, biomedical, etc.
So what if "modularity" is actually the missing opportunity for MoEs?
Today, we're releasing












