<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://shuaichenchang.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://shuaichenchang.github.io/" rel="alternate" type="text/html" /><updated>2026-03-06T02:19:39+00:00</updated><id>https://shuaichenchang.github.io/feed.xml</id><title type="html">Shuaichen Chang</title><subtitle>Applied Scientist at AWS AI Lab</subtitle><author><name>Shuaichen Chang</name><email>shuaichenchang@gmail.com</email></author><entry><title type="html">Continual Learning and Memory (2): Memory Architecture in Language Models</title><link href="https://shuaichenchang.github.io/posts/2026/continual-learning-2/" rel="alternate" type="text/html" title="Continual Learning and Memory (2): Memory Architecture in Language Models" /><published>2026-02-07T00:00:00+00:00</published><updated>2026-02-07T00:00:00+00:00</updated><id>https://shuaichenchang.github.io/posts/2026/continual-learning-2</id><content type="html" xml:base="https://shuaichenchang.github.io/posts/2026/continual-learning-2/"><![CDATA[<p>In <a href="https://shuaichenchang.github.io/posts/2026/continual-learning-1/">our first post</a>, we recapped the formulations of two continual learning papers. Both approaches conduct test-time learning to compress raw input into model parameters, effectively using those parameters as a dynamic working memory. In this post, we step back to examine the evolution of language models through the lens of memory.</p>

<p><strong>TLDR: Everything is memory.</strong> Ultimately, the evolution of language models boils down to two design choices: (1) How will you represent the memory (vectors, matrices, weights, or external pools)? and (2) How will you update it (appending, closed-form optimization, or gradient descent)?</p>

<h2 id="recurrent-hidden-vector-as-memory">Recurrent Hidden Vector as Memory</h2>
<p>The concept of “memory” in neural networks is far from new. Recurrent Neural Networks (RNNs) have long utilized a hidden state vector to carry information over to new timesteps (or tokens, in modern terminology). The Long Short-Term Memory (<a href="https://deeplearning.cs.cmu.edu/S23/document/readings/LSTM.pdf">LSTM</a>) architecture explicitly formalized this by distinguishing between a “hidden state” (output) and a “cell state” (memory).</p>

<p align="center">
<img src="/images/blogs/2025-12-31-continual-learning-memory/lstm.svg" style="width: 400px; max-width: 100%;" />
</p>

<p>At a high level, we can view the memory update as $c_t=f(x_t, c_{t-1})$, where $c_t$ and $c_{t-1}$ represent the memory at the current and previous timesteps, and $x_t$ is the current input. LSTMs introduce a forget gate and an input gate to regulate how much of the past memory $c_{t-1}$ is retained and how much new information from $x_t$ is added.</p>
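<p>As a toy illustration (a deliberately simplified sketch, not the full LSTM), the gated update can be written in a few lines of NumPy. For brevity, the gates here condition only on the current input $x_t$, whereas a real LSTM also feeds the previous hidden state $h_{t-1}$ into every gate.</p>

```python
import numpy as np

def lstm_memory_update(c_prev, x, W_f, W_i, W_c, b_f, b_i, b_c):
    """One gated memory write: c_t = forget * c_{t-1} + input * candidate."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    f = sigmoid(W_f @ x + b_f)        # forget gate: how much old memory to keep
    i = sigmoid(W_i @ x + b_i)        # input gate: how much new content to add
    c_tilde = np.tanh(W_c @ x + b_c)  # candidate memory content
    return f * c_prev + i * c_tilde
```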

<p>In theory, this recurrent memory allows information to be carried over indefinitely—from previous tasks to new ones. Aha! It turns out we had long-context language models for continual learning over 30 years ago.</p>

<p>However, it has two practical issues:</p>

<p>(1) Capacity: A fixed-length hidden vector is easily overloaded; it simply cannot losslessly store the vast amount of information contained in a long sequence.</p>

<p>(2) Permanence of Loss: Once information is discarded via the forget mechanism, it is gone forever. The model cannot “look back” to retrieve it later.</p>

<h2 id="attention-based-memory">Attention-based Memory</h2>

<h3 id="quadratic-attention-as-memory">Quadratic Attention as Memory</h3>
<p>The first modern <a href="https://arxiv.org/pdf/1409.0473">attention paper</a> proposed a solution: keep the RNN hidden vectors for all encoded tokens and use an attention mechanism to search over them during decoding. (Note: this was originally an encoder-decoder framework for machine translation, distinct from today’s decoder-only LLMs.) This RNN + Attention architecture effectively utilizes two types of memory: (1) the RNN hidden state, a fixed-size vector representing compressed context, and (2) the token activations, a buffer of states that grows linearly with the input length.</p>

<p align="center">
<img src="/images/blogs/2025-12-31-continual-learning-memory/rnn_attention.png" style="width: 250px; max-width: 100%;" />
</p>

<p>This mechanism laid the groundwork for the standard attention found in <a href="https://arxiv.org/pdf/1706.03762">Transformers</a>, which relies exclusively on this retrieved history as its working memory.</p>

\[\begin{aligned}
q&amp;=xW_q, \quad k=xW_k, \quad v=xW_v, \\
o_t &amp;=
\sum_{j=1}^{t}
\frac{
\exp\!\left( q_t^\top k_j / \sqrt{d_k} \right)
}{
\sum_{l=1}^{t}
\exp\!\left( q_t^\top k_l / \sqrt{d_k} \right)
}
v_j,
\end{aligned}\]

<p>where $q,k \in \mathbb{R}^{d_k}$, $v \in \mathbb{R}^{d_v}$, and $W_q, W_k, W_v$ are projection matrices.</p>

<p>Because Transformer attention provides a direct, lossless view of all past tokens, the recurrent hidden state became obsolete as a working memory. However, this capability comes at a steep price: quadratic time complexity with respect to sequence length.</p>
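<p>For one decoding step, the lookup over this buffer is just a softmax-weighted average of the stored values. A minimal NumPy sketch (single query, single head, with the buffer of past keys and values playing the role of working memory):</p>

```python
import numpy as np

def causal_attention_output(q, K, V):
    """Softmax attention for one query over the buffer of past keys/values."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)    # similarity with every stored key
    w = np.exp(scores - scores.max())
    w = w / w.sum()                # softmax weights over the buffer
    return w @ V                   # weighted average of stored values
```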

<h3 id="linear-attention-recurrent-hidden-matrix-as-memory">Linear Attention (Recurrent Hidden Matrix) as Memory</h3>

<p>The <a href="https://arxiv.org/pdf/2006.16236">Linear Attention paper</a> addresses this bottleneck by removing the softmax normalization from the attention mechanism. It observes that for any non-negative similarity function $\text{sim}(q_t, k_j)$, including softmax, there exists a feature map $\phi$ (potentially in infinite dimensions) such that $\text{sim}(q_t, k_j)=\phi(q_t)^\top\phi(k_j)$. Under this formulation, attention can be rewritten as:</p>

\[\begin{aligned}
o_t &amp;=
\sum_{j=1}^{t}
\frac{
\text{sim}\!\left(q_t,k_j\right)
}{
\sum_{l=1}^{t}
\text{sim}\!\left(q_t, k_l\right)
}
v_j \\
&amp;=
\sum_{j=1}^{t}
\frac{
\phi(q_t)^\top \phi(k_j)
}{
\sum_{l=1}^{t}
\phi(q_t)^\top \phi(k_l)
}
v_j \\
&amp;=
\frac{
\phi(q_t)^\top \sum_{j=1}^{t} \phi(k_j) v_j^\top
}{
\phi(q_t)^\top \sum_{l=1}^{t} \phi(k_l)
}
\end{aligned}\]

<p>For simplicity, if we use the identity function as the feature map $\phi$ and omit the denominator normalization, the equation simplifies to:</p>

\[\begin{aligned}
o_t &amp;=
q_t^\top \sum_{j=1}^{t} k_j v_j^\top
\end{aligned}\]

<p>Here, the term $\sum_{j=1}^{t} k_j v_j^\top$ can be denoted as a matrix $S_t \in \mathbb{R}^{d_k \times d_v}$. Crucially, this matrix can be computed recurrently:</p>

\[S_t = S_{t−1} + k_tv_t^\top\]

<p>This brings us full circle: $S_t$ acts as a recurrent memory, much like the RNN hidden vector. However, we are now using a matrix rather than a vector. While significantly more powerful than a simple vector, this compression is still lossy compared to standard Transformers.</p>

<p>Consider retrieving a specific value $v_i$ using its key $k_i$:</p>

\[k_i^\top S_t = (k_i^\top k_i) v_i^\top +  \sum_{j\neq i} (k_i^\top k_j) v_j^\top\]

<p>If all keys are normalized to unit length, this becomes:</p>

\[k_i^\top S_t = v_i^\top +  \underbrace{\sum_{j\neq i} (k_i^\top k_j) v_j^\top}_{\text{Noise}}\]

<p>To minimize retrieval error (i.e., to drive the noise term to zero), the keys must be mutually orthogonal. This implies that a memory matrix with key dimension $d_k$ can losslessly store at most $d_k$ distinct items.</p>
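<p>We can see both regimes in a few lines of NumPy: with orthonormal keys, each value is recovered exactly, while a query correlated with several stored keys mixes their values — exactly the noise term above.</p>

```python
import numpy as np

def write(S, k, v):
    # superimpose the pair (k, v) onto the memory as an outer product
    return S + np.outer(k, v)

def read(S, q):
    # retrieval is a single matrix-vector product
    return S.T @ q

d_k, d_v = 4, 2
S = np.zeros((d_k, d_v))
keys = np.eye(d_k)  # orthonormal keys -> lossless storage of up to d_k items
vals = np.arange(d_k * d_v, dtype=float).reshape(d_k, d_v)
for k, v in zip(keys, vals):
    S = write(S, k, v)
```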

<h3 id="gated-deltanet">Gated DeltaNet</h3>

<p>To mitigate the lossy nature of Linear Attention, researchers revisited the LSTM’s forget gate, leading to <a href="https://arxiv.org/pdf/2312.06635">Gated Linear Attention</a>. By introducing an input-dependent gate $G_t$, the model can selectively “clear up space” in $S_t$ for new information.</p>

\[S_t = G_t \odot  S_{t−1} + k_tv_t^\top\]

<p>Furthermore, <a href="https://arxiv.org/pdf/2406.06484">DeltaNet</a> argues that the update rule should be mindful of what is already stored. Instead of blindly adding $k_t v_t^\top$, we should only add the difference (or delta) between the new information and the existing memory. Conceptually, we first “erase” the old value associated with $k_t$ and then write the new value<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>:</p>

\[S_t = S_{t−1} - \beta k_t v_{old}^\top + \beta k_tv_t^\top,\]

<p>where $v_{old}^\top=k_t^\top S_{t-1}$. Expanding this term yields:</p>

\[\begin{aligned}
S_t
&amp;= S_{t−1} - \beta k_t k_t^\top S_{t-1} + \beta k_tv_t^\top \\
&amp;= S_{t−1} + \beta k_t (v_t^\top - k_t^\top S_{t-1})
\end{aligned}\]

<p>Interestingly, this update rule is equivalent to one step of gradient descent on an L2 loss function that measures the reconstruction error of the key-value pair:</p>

\[\begin{aligned}
\ell(S_{t-1}; k_t) &amp;= \frac{1}{2}||v_t-k_t^\top S_{t-1}||_2^2 \\
\nabla\ell_S(S_{t−1}; k_t) &amp;=  - k_t (v_t^\top - k_t^\top S_{t-1})
\end{aligned}\]

<p>From this perspective, the DeltaNet update is effectively a single step of online gradient descent, i.e., test-time training of the memory matrix:</p>

\[\begin{aligned}
S_t &amp;= S_{t-1} - \beta \nabla\ell_S(S_{t−1}; k_t)
\end{aligned}\]
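<p>The algebra is easy to check numerically. The NumPy sketch below (with the reconstruction loss scaled by $\frac{1}{2}$ so the gradient carries no extra factor of 2) confirms that the erase-then-write delta rule and the gradient step produce the same memory matrix:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, beta = 4, 3, 0.5
S = rng.normal(size=(d_k, d_v))   # current memory
k = rng.normal(size=d_k)
v = rng.normal(size=d_v)

# erase-then-write delta rule: S - beta*k v_old^T + beta*k v^T
v_old = S.T @ k
delta_update = S - beta * np.outer(k, v_old) + beta * np.outer(k, v)

# one gradient step on 1/2 * ||v - S^T k||_2^2
grad = -np.outer(k, v - S.T @ k)
sgd_update = S - beta * grad

assert np.allclose(delta_update, sgd_update)
```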

<p>Moreover, the gate from Gated Linear Attention can be viewed as a form of weight decay and folded into the same optimization. The resulting <a href="https://arxiv.org/pdf/2412.06464">Gated DeltaNet</a> is used in modern LLMs such as Kimi Linear and Qwen3-Next.</p>

\[S_t = G_t \odot S_{t−1} - G_t \odot \beta k_t k_t^\top S_{t-1} + \beta k_tv_t^\top\]

<h2 id="feed-forward-networks-as-memory">Feed-Forward Networks as Memory</h2>

<p>The rise of Linear Attention and DeltaNet shifts our perspective on how attention mechanisms operate. In standard Transformers, attention is commonly viewed as working-memory retrieval over a dynamically generated list of activations, with memory updates amounting to appending to a growing buffer.</p>

<p>But now we can reframe this: attention (the retrieval) acts as a function, while the memory pool update can be treated as an optimization problem. This reframing connects beautifully to an existing stream of <a href="https://arxiv.org/pdf/2012.14913">research</a> that interprets the Feed-Forward Network (FFN) layers in Transformers as a massive memory retrieval mechanism.</p>

<p>Mathematically, we can express the standard FFN operation as:</p>

\[FFN(x) =  V f(K^\top x),\]

<p>where $K, V \in \mathbb{R}^{d \times d_{ff}}$ are the projection matrices and $f$ is a non-linear activation function.</p>

<p>This formulation views the FFN as a memory layer containing $d_{ff}$ distinct key-value pairs. The FFN first matches the input $x$ against the $d_{ff}$ memory keys, uses the non-linearity $f$ to select the relevant keys, and then outputs a weighted average of the corresponding memory values. If we were to use a softmax function for the non-linearity $f$, the FFN would become mathematically identical to an attention-like memory lookup:</p>

\[FFN(x) = \sum_{i=1}^{d_{ff}} \frac{
\exp\!\left( x^\top k_i \right)
}{
\sum_{j=1}^{d_{ff}}
\exp\!\left( x^\top k_j \right)
} v_i\]

<p>While original Transformers relied on ReLU activations for $f$, modern LLMs predominantly use gated activations, such as SwiGLU:</p>

<p align="center">
<img src="/images/blogs/2025-12-31-continual-learning-memory/ffn.png" style="width: 200px; max-width: 100%;" />
</p>

\[FFN(x) = W^d (SiLU(W^g x) \odot W^u x)\]

<p>In this architecture, $W^u$ and $W^d$ conceptually correspond to the key ($K$) and value ($V$) projections from the earlier equation. In standard nomenclature, $W^u$, $W^d$, and $W^g$ represent the up-projection, down-projection, and gating weight matrices, respectively. This structure enables the model to selectively activate relevant memory keys using dedicated gating parameters ($W^g$) conditioned directly on the input $x$.</p>
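<p>A minimal NumPy sketch of this gated FFN, with $W^u$ playing the role of memory keys, $W^d$ of memory values, and $\text{SiLU}(W^g x)$ selecting which of the $d_{ff}$ slots fire:</p>

```python
import numpy as np

def swiglu_ffn(x, W_g, W_u, W_d):
    """FFN(x) = W^d (SiLU(W^g x) * (W^u x)): W^u scores the d_ff memory
    slots (keys), the SiLU gate selects which slots fire, and W^d mixes
    the corresponding memory values."""
    silu = lambda z: z / (1.0 + np.exp(-z))
    return W_d @ (silu(W_g @ x) * (W_u @ x))
```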

<p>Typically, we view these FFN weights as static repositories for pre-training knowledge that remain fixed during inference. However, recent breakthroughs like Titans and TTT-E2E propose that FFNs can also serve as dynamic, mutable working memory, provided we introduce an efficient mechanism to optimize and update these weights at test time.</p>

<h2 id="sparse-memory-layers">Sparse Memory Layers</h2>

<p>Recently, the <a href="https://arxiv.org/pdf/2412.09764">Memory Layers paper</a> argued that memory access is inherently sparse: only a few relevant pieces of information need to be retrieved at any given time, while the vast majority of stored knowledge remains irrelevant to the current context. Consequently, the dense matrix multiplication used in Feed-Forward Network (FFN) layers is an inefficient architecture for storage and retrieval.</p>

<p>They propose replacing the dense FFN layers in Transformer blocks with a Memory Lookup Layer. This sparse architecture allows for storing millions of memory slots, orders of magnitude more than a standard FFN, while maintaining efficient retrieval.</p>

<p align="center">
<img src="/images/blogs/2025-12-31-continual-learning-memory/memory_layer.png" style="width: 400px; max-width: 100%;" />
</p>

<p>The memory layer contains a set of trainable parameters: keys $K \in \mathbb{R}^{N \times d}$ and values $V \in \mathbb{R}^{N \times d}$. Unlike the dynamic activations in attention, these parameters store static memory from pre-training data. At test time, a query $q \in \mathbb{R}^d$ is used to retrieve only the top-$k$ relevant keys ($k \ll N$), followed by a standard attention operation over just those $k$ slots:</p>

\[\begin{aligned}
I &amp;= TopkIndices(Kq), \\
s &amp;= Softmax(K_Iq), \\
y &amp;= s^\top V_I
\end{aligned}\]
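<p>A small NumPy sketch of this lookup (the slot count and top-$k$ values below are arbitrary toy choices). Only the $k$ selected slots participate in the softmax and the value mixing; the dense scoring over all $N$ slots here is for clarity only, since the paper makes the selection itself efficient:</p>

```python
import numpy as np

def memory_lookup(q, K, V, topk=2):
    """Sparse memory read: score all N slots, but run softmax attention
    over only the top-k highest-scoring ones (k << N)."""
    scores = K @ q                              # (N,) similarity scores
    I = np.argpartition(-scores, topk)[:topk]   # indices of the top-k slots
    s = np.exp(scores[I] - scores[I].max())
    s = s / s.sum()                             # softmax over k slots only
    return s @ V[I]                             # mix only the selected values
```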

<h2 id="external-memory-pool">External Memory Pool</h2>

<p>While innovations in Linear Attention allow for more efficient reading and writing of the memory state $S$, these methods still consolidate all knowledge into a single, superimposed representation rather than maintaining individual, discrete memory records. An alternative approach is to utilize an external memory pool and interact with it in a fashion similar to attention. This line of research dates back over a decade; let’s briefly recap its evolution.</p>

<p>Following the rise of attention mechanisms in machine translation, researchers began to conceptualize neural memory as analogous to RAM in a computer, which is an external component that the CPU (the neural network) can read from and write to. Guided by this philosophy, pioneering works like <a href="https://arxiv.org/pdf/1410.3916">Memory Networks</a> and <a href="https://arxiv.org/pdf/1410.5401">Neural Turing Machines</a> implemented a memory pool $M \in \mathbb{R}^{N \times d}$ containing $N$ slots, each storing a $d$-dimensional vector. The model is then trained to learn specific read/write operations to manipulate this external pool for task-specific goals.</p>

<p>More recently, architectures like <a href="https://arxiv.org/pdf/2402.04624">MemoryLLM</a> have augmented the standard Transformer attention mechanism by maintaining a set of external memory vectors.</p>

<p align="center">
<img src="/images/blogs/2025-12-31-continual-learning-memory/memoryllm.png" style="width: 400px; max-width: 100%;" />
</p>

<p>During generation, MemoryLLM attends to both the local context and the global memory pool, extending the standard formulation $Attention(Q, K, V)$ to:
\(Attention(Q_X, [K_M;K_X], [V_M;V_X]),\)
where $X$, $M$ represent the local context and global memory respectively.</p>
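<p>Conceptually, the read path is just ordinary attention over a concatenated key/value list. A small NumPy sketch (single query, single head):</p>

```python
import numpy as np

def attend_with_memory_pool(q, K_mem, V_mem, K_ctx, V_ctx):
    """Standard softmax attention where the key/value lists are the
    concatenation of the global memory pool and the local context."""
    K = np.concatenate([K_mem, K_ctx], axis=0)
    V = np.concatenate([V_mem, V_ctx], axis=0)
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ V
```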

<p>Unlike the standard Attention, which updates the activations at every token, the MemoryLLM pool is updated only after processing a complete text segment (e.g., a paragraph). During this update phase, the model concatenates the hidden states of the new text chunk with the last $k$ records ($k \ll N$) from the existing memory pool. These are processed together, and the final $k$ hidden states are written back into the pool, replacing randomly selected vectors.</p>

<p>Conceptually, this acts as a “k-gram” conditional memory write: the model uses the previous $k$ memory records to condition the compression of the current text chunk into the latent memory space, generating $k$ updated vectors.</p>

<h2 id="wrap-up">Wrap-up</h2>

<p>We’ve traced memory in language models from fixed-size hidden vectors, through attention-based token buffers, to matrices updated via gradient descent, and finally to sparse and external memory pools. Despite their apparent diversity, these architectures all face the same fundamental trade-offs: capacity vs. efficiency, and lossless retrieval vs. compressive storage.</p>

<p>A recurring theme is that memory updates increasingly resemble optimization — DeltaNet’s update rule is literally a gradient step, and test-time training methods like Titans extend this idea to FFN weights.</p>

<h2 id="references">References</h2>

<ol>
  <li>
    <p><strong>Long Short-Term Memory</strong> <a href="https://deeplearning.cs.cmu.edu/S23/document/readings/LSTM.pdf">[PDF]</a>
Hochreiter, Sepp and Schmidhuber, Jürgen, 1997.</p>
  </li>
  <li>
    <p><strong>Neural Machine Translation by Jointly Learning to Align and Translate</strong> <a href="https://arxiv.org/pdf/1409.0473">[PDF]</a>
Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua, 2014.</p>
  </li>
  <li>
    <p><strong>Attention Is All You Need</strong> <a href="https://arxiv.org/pdf/1706.03762">[PDF]</a>
Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N., Kaiser, Łukasz, and Polosukhin, Illia, 2017.</p>
  </li>
  <li>
    <p><strong>Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention</strong> <a href="https://arxiv.org/pdf/2006.16236">[PDF]</a>
Katharopoulos, Angelos, Vyas, Apoorv, Pappas, Nikolaos, and Fleuret, François, 2020.</p>
  </li>
  <li>
    <p><strong>Gated Linear Attention Transformers with Hardware-Efficient Training</strong> <a href="https://arxiv.org/pdf/2312.06635">[PDF]</a>
Yang, Songlin, Wang, Bailin, Shen, Yikang, Panda, Rameswar, and Kim, Yoon, 2024.</p>
  </li>
  <li>
    <p><strong>Parallelizing Linear Transformers with the Delta Rule over Sequence Length</strong> <a href="https://arxiv.org/pdf/2406.06484">[PDF]</a>
Yang, Songlin, Wang, Bailin, Zhang, Yu, Shen, Yikang, and Kim, Yoon, 2024.</p>
  </li>
  <li>
    <p><strong>Gated Delta Networks: Improving Mamba2 with Delta Rule</strong> <a href="https://arxiv.org/pdf/2412.06464">[PDF]</a>
Yang, Songlin, Kautz, Jan, and Hatamizadeh, Ali, 2024.</p>
  </li>
  <li>
    <p><strong>Transformer Feed-Forward Layers Are Key-Value Memories</strong>
<a href="https://arxiv.org/pdf/2012.14913">[PDF]</a>
Geva, Mor, Schuster, Roei, Berant, Jonathan, and Levy, Omer, 2021.</p>
  </li>
  <li>
    <p><strong>Titans: Learning to Memorize at Test Time</strong> <a href="https://arxiv.org/pdf/2501.00663">[PDF]</a>
Behrouz, Ali, Zhong, Peilin, and Mirrokni, Vahab, 2024.</p>
  </li>
  <li>
    <p><strong>End-to-End Test-Time Training for Long Context</strong> <a href="https://arxiv.org/pdf/2512.23675">[PDF]</a>
   Tandon, Arnuv, Dalal, Karan, Li, Xinhao, Koceja, Daniel, Rød, Marcel, Buchanan, Sam, Wang, Xiaolong, et al., 2025.</p>
  </li>
  <li>
    <p><strong>Memory Layers at Scale</strong> <a href="https://arxiv.org/pdf/2412.09764">[PDF]</a>
Berges, Vincent-Pierre, Oğuz, Barlas, Haziza, Daniel, Yih, Wen-tau, Zettlemoyer, Luke, and Ghosh, Gargi, 2024.</p>
  </li>
  <li>
    <p><strong>Memory Networks</strong> <a href="https://arxiv.org/pdf/1410.3916">[PDF]</a>
Weston, Jason, Chopra, Sumit, and Bordes, Antoine, 2014.</p>
  </li>
  <li>
    <p><strong>Neural Turing Machines</strong> <a href="https://arxiv.org/pdf/1410.5401">[PDF]</a>
Graves, Alex, Wayne, Greg, and Danihelka, Ivo, 2014.</p>
  </li>
  <li>
    <p><strong>MemoryLLM: Towards Self-Updatable Large Language Models</strong> <a href="https://arxiv.org/pdf/2402.04624">[PDF]</a>
Wang, Yu, Gao, Yifan, Chen, Xiusi, Jiang, Haoming, Li, Shiyang, and others, 2024.</p>
  </li>
</ol>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>The DeltaNet explanation is inspired by Songlin Yang’s blog post: <a href="https://sustcsonglin.github.io/blog/2024/deltanet-1/">Understanding DeltaNet</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Shuaichen Chang</name></author><category term="continual-learning" /><category term="test-time training" /><category term="memory" /><summary type="html"><![CDATA[In our first post, we recapped the formulations of two continual learning papers. Both approaches conduct test-time learning to compress raw input into model parameters, effectively using those parameters as a dynamic working memory. In this post, we step back to examine the evolution of language models through the lens of memory.]]></summary></entry><entry><title type="html">Continual Learning and Memory (1): Titans and End-to-End Test-Time Training</title><link href="https://shuaichenchang.github.io/posts/2026/continual-learning-1/" rel="alternate" type="text/html" title="Continual Learning and Memory (1): Titans and End-to-End Test-Time Training" /><published>2026-01-04T00:00:00+00:00</published><updated>2026-01-04T00:00:00+00:00</updated><id>https://shuaichenchang.github.io/posts/2026/continual-learning-1</id><content type="html" xml:base="https://shuaichenchang.github.io/posts/2026/continual-learning-1/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>We all share the dream that powerful AI models could learn by themselves, continuously updating and improving from their daily tasks (that is, at test time). Humans do this naturally. In fact, we have no choice but to update our memory every second, and we have not found a way to reset ourselves to a moment in the past (building a time machine is not discussed in this post). Many believe that, like humans, future AI should be able to continually learn at test time rather than wait to be retrained periodically. In short, continual learning and memory are essential capabilities for future AI systems.</p>

<p>This post is the first in a series of blog posts where I share my notes and learnings on recent progress in continual learning research. In this post, we recap the core ideas from Google’s paper <a href="https://arxiv.org/pdf/2501.00663">Titans: Learning to Memorize at Test Time</a> (January 2025), which went viral on social media after NeurIPS 2025, as well as a more recent paper, <a href="https://arxiv.org/pdf/2512.23675">End-to-End Test-Time Training for Long Context</a> (December 2025). Both papers formulate continual learning as a long-context problem: a language model keeps previous tasks in its context and learns from them while continuing to work on new tasks.
While traditional attention enables in-context learning, it is limited by its quadratic growth in time complexity. Both papers address this limitation with test-time training.</p>

<h2 id="background-full-attention-and-linear-attention">Background: Full Attention and Linear Attention</h2>

<p><a href="https://arxiv.org/pdf/1706.03762">Transformer-based</a> language models capture the dependency between the current token and previous tokens using the attention mechanism, which is based on a softmax over the previous tokens. Formally, the attention output $o_t$ can be written as:</p>

\[\begin{aligned}
q&amp;=xW_q, \quad k=xW_k, \quad v=xW_v, \\
o_t &amp;=
\sum_{j=1}^{t}
\frac{
\exp\!\left( q_t^\top k_j / \sqrt{d_{\text{in}}} \right)
}{
\sum_{l=1}^{t}
\exp\!\left( q_t^\top k_l / \sqrt{d_{\text{in}}} \right)
}
v_j,
\end{aligned}\]

<p>where $q, k, v \in \mathbb{R}^{d_{\text{in}}}$, and $W_q, W_k, W_v \in \mathbb{R}^{d_{\text{in}}\times d_{\text{in}}}$.</p>

<p>The <a href="https://arxiv.org/pdf/2006.16236">Linear Attention paper</a> points out that for any non-negative similarity function $\text{sim}(q_t, k_j)$, including softmax, there exists a feature map $\phi$ (potentially in infinite dimensions) such that $\text{sim}(q_t, k_j)=\phi(q_t)^\top\phi(k_j)$. Under this formulation, attention can be rewritten as:</p>

\[\begin{aligned}
o_t &amp;=
\sum_{j=1}^{t}
\frac{
\text{sim}\!\left(q_t,k_j\right)
}{
\sum_{l=1}^{t}
\text{sim}\!\left(q_t, k_l\right)
}
v_j \\
&amp;=
\sum_{j=1}^{t}
\frac{
\phi(q_t)^\top \phi(k_j)
}{
\sum_{l=1}^{t}
\phi(q_t)^\top \phi(k_l)
}
v_j \\
&amp;=
\frac{
\phi(q_t)^\top \sum_{j=1}^{t} \phi(k_j) v_j^\top
}{
\phi(q_t)^\top \sum_{l=1}^{t} \phi(k_l)
}
\end{aligned}\]

<p>The key insight is that $\sum_{j=1}^{t} \phi(k_j) v_j^\top$ can be maintained recurrently:</p>

\[M_t = M_{t−1} + \phi(k_t) v_t^\top\]

<p>From this perspective, the goal of linear attention is to compress the keys and values into $M$, which serves as a form of fast, associative memory.</p>
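<p>A minimal NumPy sketch of this recurrence, using the $\phi(x) = \text{elu}(x) + 1$ feature map from the Linear Attention paper, with $z_t = \sum_j \phi(k_j)$ accumulating the denominator:</p>

```python
import numpy as np

def phi(x):
    # feature map from the Linear Attention paper: elu(x) + 1 (non-negative)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_step(M, z, k, v, q):
    """Write (k, v) into the matrix memory, then read with query q.
    z accumulates sum_j phi(k_j) for the softmax-free normalization."""
    M = M + np.outer(phi(k), v)
    z = z + phi(k)
    out = (M.T @ phi(q)) / (phi(q) @ z)
    return M, z, out
```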

<h2 id="titans">Titans</h2>
<p>Since memory can be represented in a recurrent form, Titans uses a more general formulation:</p>

\[\begin{aligned}
M_t &amp;= f(M_{t-1}, x_t),\\
\tilde{y}_t &amp;= g(M_t,x_t),
\end{aligned}\]

<p>where the functions $f$ and $g$ can be viewed as memory write and memory read operations, respectively. Here, $M_t$ itself can be a neural network (e.g., an MLP), rather than a simple vector or matrix.</p>

<p>The next question is how to learn and update the memory unit $M_t$. The Titans paper motivates this from a perspective inspired by human memory: events that violate expectations (i.e., are surprising) are more memorable for humans. Accordingly, Titans uses a surprise signal, measured via prediction error, to update the memory. Concretely, the memory is updated using gradient descent with momentum and weight decay<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>:</p>

\[M_t = M_{t−1} − \eta_t \nabla\ell(M_{t−1}; x_t),\]

<p>where $\eta_t$ is the learning rate at timestep $t$ and the surprise score $\ell$ is defined as:</p>

\[\ell(M_{t-1}; x_t) = ||M_{t-1}(k_t)-v_t||_2^2\]

<p>We can also understand this mechanism from a more traditional machine learning perspective. After the memory write, $M_t$ should contain the key $k_t$ and its corresponding value $v_t$, such that the value can be retrieved given the key. In other words, we want:</p>

\[M_t(k_t) = v_t\]

<p>If we treat $M$ as a set of model parameters, then updating $M$ so that the prediction $M(k_t)$ matches the target $v_t$ is simply a standard supervised learning problem. The update from $M_{t-1}$ to $M_t$ naturally follows from gradient descent on the loss:</p>

\[\ell(M_{t-1}; x_t) = ||M_{t-1}(k_t)-v_t||_2^2,\]

<p>which yields:</p>

\[M_t = M_{t−1} − \eta_t \nabla\ell(M_{t−1}; x_t)\]

<p>This means that we can update the memory $M$ at each timestep via gradient descent at test time. But how can we train such a memory mechanism when the overall model is trained with the standard language modeling objective (e.g., next-token prediction)?</p>

<p>Recall that the memory update itself can be viewed as a function $f$ in the recurrence $M_t = f(M_{t-1}, x_t)$. The training follows a meta-learning formulation: in the <strong>inner loop</strong>, the model learns how to update the memory parameters $M$ using the gradient-based rule above; in the <strong>outer loop</strong>, it trains the remaining parameters (e.g., the parameters in $g(M_t,x_t)$ and the projections for $k, v$) to optimize the language modeling objective, given the updated memory from the inner loop.  At test time, only the inner loop (memory updates) runs, while the outer loop parameters remain fixed.</p>
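<p>A toy NumPy sketch of the inner loop at test time, using a plain linear memory for simplicity (Titans itself allows $M$ to be an MLP and adds momentum and weight decay, which we omit here). The function name and the fixed learning rate are illustrative choices, not the paper’s API:</p>

```python
import numpy as np

def titans_inner_step(M, k, v, lr=0.05):
    """One inner-loop (test-time) write: descend the surprise loss
    1/2 * ||M^T k - v||_2^2 so that reading M with k returns v."""
    pred = M.T @ k                 # current memory read M(k)
    grad = np.outer(k, pred - v)   # gradient of the loss w.r.t. M
    return M - lr * grad
```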

<h2 id="ttt-e2e">TTT-E2E</h2>
<p>While Titans learns the memory $M_t$ by associating $K_{&lt;t}$ and $V_{&lt;t}$, TTT-E2E argues that since the goal of memorizing past knowledge is to improve future predictions, a more straightforward objective can be used:</p>

\[\ell(M_{t-1}; x_t) = \text{CE}(g(M_{t-1},x_{t-1}), x_t),\]

<p>which is the cross-entropy loss for next-token prediction, where $g(M_{t-1}, x_{t-1})$ produces the logits for predicting token $x_t$. As a result, the same loss function is used to train the memory during both training and test time. Since the memory unit does not rely on a separate, memory-specific loss function, it is straightforward to train multiple memory units (e.g., layers) together. In particular, TTT-E2E reuses some of the existing MLP layers in Transformers as the memory $M$, enabling continual learning without modifying the Transformer architecture.</p>

<h2 id="parallel-training">Parallel Training</h2>
<p>Both Titans and TTT-E2E rely on a recurrently updated memory unit, which requires $O(T)$ FLOPs during both training and test time. While it is efficient at test time, the recurrent nature of the memory does not naturally support token-level parallelization during training, making it less efficient than standard Transformers in practice. To address this issue, both papers adopt chunking/batching, partitioning the input token sequence to enable parallel computation within chunks.</p>

<p>For simplicity, assume the total number of tokens $T$ is divisible by a chunk size $b$. In Titans, the memory $M_t$ is updated with respect to $M_{t'}$ instead of $M_{t-1}$, where $t' = t - (t \bmod b)$ denotes the timestep at the end of the previous chunk. Under this approach, all $M_t$ values within the same chunk can be computed in parallel. The update can be written as:</p>

\[M_t = M_{t'} - \eta_{t'+1} \frac{1}{t-t'} \sum_{i=t'+1}^t \nabla\ell(M_{t'},x_i)\]

<p>This is parallelizable because all gradients $\nabla\ell(M_{t'},x_i)$ depend on the same $M_{t'}$ and can be computed simultaneously. The partial sums can then be accumulated efficiently using <a href="https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf">parallel prefix sum algorithms</a>.</p>

<p>TTT-E2E uses an even simpler formulation. Within each chunk, the memory unit $M$ remains constant and is updated only after the entire chunk is processed. As a result, all tokens within the chunk can be computed in parallel. For $i = 1, …, T/b$, we have:</p>

\[M_i = M_{i−1} − \eta_i \frac{1}{b} \sum_{t=(i-1)b+1}^{ib} \nabla\ell(M_{i−1}; x_t)\]

<p>Both methods enable parallelization within each chunk. However, this introduces another issue: $M_t$ becomes stale or imprecise within the chunk, as it does not reflect the most recent tokens and memory. This can affect the output prediction, which originally takes the form $\tilde{y}_t = g(M_t, x_t)$. To mitigate this issue, both Titans and TTT-E2E retain full self-attention within each chunk, modifying the output prediction to:</p>

\[\tilde{y}_t = g(M_t,x_t, K[t'+1:t], V[t'+1:t])\]
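<p>The chunked schedule is easy to sketch. The toy NumPy version below follows the TTT-E2E variant: within a chunk, every per-token gradient is taken at the chunk-start memory (so they could all be computed in parallel), and the memory is updated once per chunk with their average. For simplicity it uses a linear memory and the associative-recall loss rather than the full language-modeling loss:</p>

```python
import numpy as np

def chunked_memory_updates(M0, ks, vs, chunk, lr):
    """Chunk-parallel test-time updates of a linear memory M (d_k x d_v).

    All gradients inside a chunk are taken at the same chunk-start M,
    so they are parallelizable; M changes only at chunk boundaries.
    """
    M = M0
    for start in range(0, len(ks), chunk):
        K = ks[start:start + chunk]   # (b, d_k) keys in this chunk
        V = vs[start:start + chunk]   # (b, d_v) values in this chunk
        # average per-token gradient of 1/2 * ||K M - V||^2, all at the same M
        grad = np.einsum('ti,tj->ij', K, K @ M - V) / len(K)
        M = M - lr * grad
    return M
```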

<h2 id="references">References</h2>

<p><strong>Titans: Learning to Memorize at Test Time</strong> <a href="https://arxiv.org/pdf/2501.00663">[PDF]</a>
Behrouz, Ali, Zhong, Peilin, and Mirrokni, Vahab, 2024.</p>

<p><strong>End-to-End Test-Time Training for Long Context</strong> <a href="https://arxiv.org/pdf/2512.23675">[PDF]</a>
Tandon, Arnuv, Dalal, Karan, Li, Xinhao, Koceja, Daniel, Rød, Marcel, Buchanan, Sam, Wang, Xiaolong, et al., 2025.</p>

<p><strong>Attention Is All You Need</strong> <a href="https://arxiv.org/pdf/1706.03762">[PDF]</a>
Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N., Kaiser, Łukasz, and Polosukhin, Illia, 2017.</p>

<p><strong>Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention</strong> <a href="https://arxiv.org/pdf/2006.16236">[PDF]</a>
Katharopoulos, Angelos, Vyas, Apoorv, Pappas, Nikolaos, and Fleuret, François, 2020.</p>

<p><strong>Prefix Sums and Their Applications</strong> <a href="https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf">[PDF]</a>
Blelloch, Guy E., 1990.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>The Titans paper incorporates gradient descent with momentum and weight decay, motivated by analogies to memory forgetting and surprise memory momentum. We omit these details here for simplicity. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Shuaichen Chang</name></author><category term="continual-learning" /><category term="test-time training" /><category term="memory" /><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">A Summary of Our Paper Dr.Spider</title><link href="https://shuaichenchang.github.io/posts/2023/04/dr-spider-summary/" rel="alternate" type="text/html" title="A Summary of Our Paper Dr.Spider" /><published>2023-04-01T00:00:00+00:00</published><updated>2023-04-01T00:00:00+00:00</updated><id>https://shuaichenchang.github.io/posts/2023/04/dr-spider-summary</id><content type="html" xml:base="https://shuaichenchang.github.io/posts/2023/04/dr-spider-summary/"><![CDATA[<p>This blog post provides a summary of our paper “Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness,”.</p>

<h2 id="paper-overview">Paper Overview</h2>

<p>Dr.Spider is a comprehensive diagnostic evaluation benchmark designed to assess the robustness of Text-to-SQL models. The benchmark includes various perturbation types that test different aspects of model robustness, helping researchers understand where their models succeed and where they fail.</p>

<h2 id="read-the-full-post">Read the Full Post</h2>

<p>For a detailed explanation of our work, including methodology, results, and insights, please read the full blog post on Medium:</p>

<p><strong><a href="https://medium.com/@shuaichenchang/dr-spider-a-diagnostic-evaluation-benchmark-towards-text-to-sql-robustness-77d69e388fe">Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness</a></strong></p>

<h2 id="resources">Resources</h2>

<ul>
  <li><strong>Paper</strong>: <a href="https://arxiv.org/pdf/2301.08881.pdf">arXiv</a></li>
  <li><strong>Data</strong>: <a href="https://github.com/awslabs/diagnostic-robustness-text-to-sql">GitHub Repository</a></li>
  <li><strong>Slides</strong>: <a href="/files/Dr_spider_slides.pdf">Presentation Slides</a></li>
  <li><strong>Video</strong>: <a href="https://recorder-v3.slideslive.com/#/share?share=79980&amp;s=49e6c9eb-bb21-4fbd-a191-8f1a1d9cd872">ICLR 2023 Presentation</a></li>
</ul>

<h2 id="citation">Citation</h2>

<p>If you find our work useful, please consider citing:</p>

<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@inproceedings</span><span class="p">{</span><span class="nl">chang2023drspider</span><span class="p">,</span>
  <span class="na">title</span><span class="p">=</span><span class="s">{Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness}</span><span class="p">,</span>
  <span class="na">author</span><span class="p">=</span><span class="s">{Chang, Shuaichen and Wang, Jun and Dong, Mingwen and Pan, Lin and Zhu, Henghui and Li, Alexander Hanbo and Lan, Wuwei and Zhang, Sheng and Jiang, Jiarong and Lilien, Joseph and others}</span><span class="p">,</span>
  <span class="na">booktitle</span><span class="p">=</span><span class="s">{International Conference on Learning Representations}</span><span class="p">,</span>
  <span class="na">year</span><span class="p">=</span><span class="s">{2023}</span>
<span class="p">}</span>
</code></pre></div></div>]]></content><author><name>Shuaichen Chang</name><email>shuaichenchang@gmail.com</email></author><category term="text-to-sql" /><category term="robustness" /><category term="evaluation" /><category term="research" /><summary type="html"><![CDATA[This blog post provides a summary of our paper “Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness,”.]]></summary></entry></feed>