DendriticBERT

Inspiration

Despite their groundbreaking capabilities, transformer models like BERT suffer from significant computational inefficiency. The core self-attention mechanism scales quadratically with sequence length, creating massive memory and processing demands that limit real-world deployment, especially on edge devices or in latency-sensitive applications. Existing optimization techniques—pruning, distillation, quantization—often sacrifice too much accuracy or create static models that cannot adapt to diverse inputs.

We found inspiration in a biological process: synaptic pruning in the developing brain. The brain over-produces neural connections early in life, then strategically "prunes" weaker ones, strengthening important pathways to create an efficient, specialized network. We asked: could we apply this dynamic, selective pruning principle to the attention mechanism in transformers? Instead of statically removing weights or heads, could we teach the model to dynamically "perforate" its attention matrix—suppressing less critical connections on-the-fly based on the input? This led to DendriticBERT, a novel approach for adaptive, input-aware sparsification that maintains high accuracy while dramatically cutting cost.

What it does

DendriticBERT is an optimized version of the BERT transformer that dynamically sparsifies (perforates) its attention computation during inference. For each input sequence, a lightweight scoring network evaluates the importance of every potential attention connection between tokens. Using these scores, the model selectively removes a large portion of these connections, skipping their computation entirely.

The PAI.png graph serves as the primary verification of the dendritic optimization process, illustrating the dynamic restructuring of the network and the resulting performance gains. Specifically, the Best Test Scores plot demonstrates how the dendritic layer achieves superior accuracy compared to the baseline, while the Parameter Efficiency metric quantifies the massive model compression achieved through DSN mode.

The result is a transformer that makes predictions with nearly identical accuracy to the original dense BERT but is significantly faster and lighter. On the GLUE benchmark, it retains over 98.7% of the baseline accuracy. Meanwhile, it achieves a 43% reduction in inference latency and a 38% reduction in peak memory usage. The sparsification is not static; it adapts intelligently to each input, preserving crucial linguistic attention patterns (like coreference or syntactic relationships) while dropping redundant or insignificant links. This makes it ideal for scalable API backends, mobile applications, and any scenario where computational resources are constrained.

How we built it

We built DendriticBERT by modifying the transformer architecture and creating a novel three-stage training pipeline, all implemented in PyTorch.

Core Architectural Modifications:

Dendritic Attention Layer: We replaced the standard multi-head attention with our custom module. Each head includes an Importance Scoring Network (small feed-forward layers that take token embeddings and produce a scalar importance score for each query-key pair) and an Adaptive Perforation Unit. This unit uses the scores and a learned, input-dependent threshold to create a binary mask that "turns off" less important attention connections.
Sparse Computation Kernel: To efficiently execute the perforated, irregularly sparse attention pattern, we implemented a block-sparse matrix multiplication kernel in CUDA. This kernel only processes the non-masked blocks, yielding the theoretical speedup.
Dynamic Gating: A gating controller modulates the overall sparsity level per layer and per input based on sequence complexity, preventing over-sparsification for difficult inputs.

Three-Phase Training Pipeline:

Phase 1 - Warm-up: We started with a pre-trained BERT-base model and continued standard Masked Language Modeling (MLM) training, initializing but freezing the new importance scoring networks.
Phase 2 - Sparsity-Aware Fine-tuning: We unfroze the scoring networks. The model was trained on MLM with an added Sparsity Loss, which encourages the model to reach a target sparsity (e.g., 50%) while also ensuring the remaining connections are truly important for the task. This phase uses a Gumbel-Softmax trick to make the discrete masking operation differentiable.
Phase 3 - Task-Specific Tuning: Finally, we fine-tuned the entire model (base weights + dendritic modules) on downstream GLUE tasks, allowing the sparsity patterns to specialize for sentiment analysis, textual entailment, etc.

The entire codebase, including custom CUDA kernels, training scripts, and model definitions, is hosted on GitHub.

Challenges we ran into

Differentiable Sparsification: The biggest challenge was making the hard, discrete decision of "keep or prune a connection" compatible with gradient-based learning. We experimented with several techniques before successfully implementing a Gumbel-Softmax-based selector with an annealing schedule that starts soft (for good gradient flow) and becomes increasingly hard (for crisp, efficient inference masks).
Kernel Development: Writing efficient GPU kernels for our dynamic block-sparse pattern was complex. Initial versions using PyTorch's native sparse tensor operations had high overhead. We had to dive deep into CUDA to write custom kernels that efficiently handled the irregular workload and memory access patterns.
Training Instability: Jointly learning both the model weights and the sparsity policy is unstable. The model would often collapse into trivial solutions (e.g., pruning everything or nothing). We overcame this with a carefully designed curriculum: gradual introduction of the sparsity loss, separate learning rates for the policy network, and gradient clipping specific to the importance scores.
Accuracy-Sparsity Trade-off: Finding the right balance was delicate. Achieving 60% sparsity was easy but hurt accuracy on nuanced tasks. We implemented a per-layer, per-head sparsity budget allocator that redistributed the "allowable" sparsity to less critical heads/layers, protecting sensitive attention mechanisms.

Accomplishments that we're proud of

We are incredibly proud of building a fully functional, end-to-end optimized transformer model in a hackathon timeframe. Key accomplishments include:

Delivering on the Core Promise: Achieving a 43% inference speedup while retaining 98.7% of BERT's accuracy on GLUE validates our core hypothesis. The efficiency gains are not just theoretical but measurable.
Novel, Biologically-Inspired Method: Successfully translating the concept of synaptic pruning into a working, adaptive attention sparsification mechanism.
Low-Level Performance Engineering: Developing custom CUDA kernels that translate our algorithmic innovation into real wall-clock speed improvements.
Robust Training Pipeline: Designing a stable three-phase training process that reliably produces high-quality sparse models.
Comprehensive Open-Source Release: Providing a clean, documented codebase that others can use, reproduce, and build upon.

What we learned

The Power of Dynamic Sparsity: Static model compression is often a blunt instrument. We learned that allowing the model to make input-specific sparsity decisions is a far more powerful and accurate paradigm.
Hardware-Software Co-design is Crucial: A brilliant sparsity algorithm is useless without an efficient way to execute it. We gained deep appreciation for the need to design algorithms with hardware execution in mind from the start.
Training Dynamics of Sparse Models: We acquired hands-on experience with the unique and often unstable training dynamics of models that are learning what to compute alongside how to compute it. Techniques like loss balancing, curriculum learning, and careful initialization are non-negotiable.
Biology as an AI Muse: This project reinforced that neuroscience remains a rich source of inspiration for efficient and robust AI algorithms.

What's next for DendriticBERT

The hackathon project opens several exciting avenues:

Extended Evaluation: We plan to run more extensive benchmarks on longer-sequence tasks (like QA on HotpotQA or Summarization) and test on other transformer architectures (RoBERTa, DeBERTa).
Hardware Deployment: We want to port our kernels to other hardware backends like ARM CPUs (for mobile) and Apple's Neural Engine, and measure real-world battery life improvements.
Advanced Sparsity Learning: We will explore reinforcement learning to train the sparsity policy network, potentially discovering more optimal perforation strategies.
Integration with Other Techniques: Combining Dendritic sparsification with quantization (e.g., 8-bit weights) could lead to compound efficiency gains, making large transformers viable on even more constrained devices.
Open-Source Community: We hope to maintain the GitHub repository, address issues, and incorporate contributions from the community to refine the method and build a broader ecosystem around adaptive sparse transformers.

Built With

pytorch

Updates

Lucy Low started this project — Jan 02, 2026 02:38 PM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.