VallayZ Interspecies Message Passing Networks: A Unified Framework for Multi-Modal Latent Space Linking
A new neural network architecture + message passing algorithm to connect multimodal data via stochastic micro-variants and selective gating.
Abstract
Interspecies Message Passing Networks (IMPNs) and the VallayZ framework is a neural network architecture for robust communication between heterogeneous latent spaces. Similar to the nutrient-exchange mechanisms of mycorrhizal fungal networks, IMPNs/VallayZ employ stochastic micro-variant sampling and selective receptor gating to enable efficient cross-modal translation while remaining resilient to noise and missing modalities. Each modality is projected into a unified latent “soil” space via learned adapters, where messages are exchanged as sets of perturbed variants. Receivers apply learned gating to select and aggregate the most relevant variants, refining their internal state before decoding back to their native space.
The Broad Idea
Roots into Shared Soil
Each modality (vision, text, audio, etc.) first grows roots into a common soil space via an adapter.
These adapters normalize everyone into the same latent dimensionality — a neutral ground where any species can interact.Multiple Nutrient Packets
Instead of sending one perfect representation, each input sprouts K noisy variants.
Tiny perturbations explore nearby territory in the latent space — like sending several nutrient blends to see which ones the neighbor prefers.Fungal Selection
The receiver’s receptors score each incoming variant and pick the top-k.
This is learned selectivity, so the receiver ignores junk and focuses on the variants that matter to it.Weighted Aggregation
The chosen variants get averaged together with learned weights.
This smooths out random noise and creates a cleaner, more robust signal — a biological denoising filter.Residual Update
The aggregated signal updates the receiver’s internal state (via residual or GRU-style updates).
This lets it accumulate and refine information over time, not just react in the moment.Decoding
Once updated, the receiver can map the shared latent signal back into its own native language
(e.g., vision → text, text → audio, etc.).
Loss Metrics
The VallayZ / IMPN training objective combines multiple complementary loss terms to ensure robust multi-modal alignment:
Contrastive Loss (InfoNCE):
Encourages aligned modality pairs to have higher similarity than mismatched pairs, improving cross-modal retrieval accuracy.Cycle Consistency Loss:
Ensures that a mapping from Modality A → B → A (or vice versa) reconstructs the original latent representation, preserving reversibility.kNN Preservation Loss:
Maintains neighborhood structure after projection, ensuring that semantically similar inputs remain close in the shared latent space.Orthogonality Regularization:
Encourages projection matrices to remain close to orthonormal, supporting invertible and stable transformations.Denoising Regularization:
Promotes robustness to perturbations in the shared space by penalizing deviation from the clean aggregated signal.
The total loss is a weighted sum:
$$
\mathcal{L} = \lambda_1 L_{\text{contrast}} + \lambda_2 L_{\text{cycle}} + \lambda_3 L_{\text{knn}} + \lambda_4 L_{\text{orth}} + \lambda_5 L_{\text{denoise}}
$$
These together balance alignment, structure preservation, and robustness across heterogeneous modalities.
Ancient Tibetian Translation
Ancient Tibetan is a hard translation problem because its meaning is distributed across context. It becomes a question of How do you map between two worlds when one is data-rich, the other is data-poor, and their “units of meaning” don’t even match? As the meaning of a single sentence often depends on the chapter-level context. this is compounded as a problem as the character set is oversampled compared to English, and each character can have multiple contextual meanings.
Using VallayZ / IMPN, we built a tri-modal translation bridge:
- Modality A: Ancient Tibetan text embeddings
- Modality B: Generated audio transliterations
- Modality C: Modern English translations
By aligning these modalities in a shared latent soil space and exchanging stochastic micro-variants with selective gating, the model learns to focus on the most relevant semantic variants for each translation context.
Results (4,000 epochs, ~30 min training):
- Top-1 accuracy: 77%
- Top-5 accuracy: 88%
Log in or sign up for Devpost to join the conversation.