
Link Drift

2026-03-07T14:30:00+00:00

I have a Links page on this site that I keep adding to whenever I run into something I want to remember. Over time it became useful, but also a little stiff. It started to feel less like a living collection of things I enjoyed and more like a shelf.

So I made Link Drift, a small playground where those links float around the screen. You can click one, drag one, fling one, or hit “I’m feeling lucky” and let the page pick for you.

What I wanted was a little more serendipity. When I look back at the things that have shaped how I think, a lot of them did not arrive in some tidy, optimized order. They came from wandering around, opening the wrong tab, following a footnote too far, or revisiting something I had forgotten. Link Drift is my attempt to keep a bit of that feeling.

I still like having the plain list. Sometimes you just want the clean version. But I also like that there is now a more playful door into the same archive. The links are the same; the experience is not.

DeepSeek OCR, and why I think vision eats language

2025-10-24T12:00:00+00:00

Recently I’ve been trying to keep up with school, coursework, and settling into a new environment. So I wasn’t putting much effort into reading papers, which I used to enjoy. But this week was different. I found a paper that pulled me back in: DeepSeek OCR. It’s a peculiar paper in a good way. I like DeepSeek’s papers: they’re thorough and open, and they don’t hide things from researchers.

I think its findings are very aligned with the Bitter Lesson.

Key takeaways

The core takeaway from the paper is that by using a vision encoder to process documents as images, they needed roughly ten times fewer tokens than feeding the same content in as text. In other words, it’s vastly more token-efficient to screenshot a document and feed it to an LLM than to paste the raw text, with little loss in accuracy.

But this isn’t entirely new. About a year ago, when Gemini 2.5 Pro came out, I was playing with it in Google AI Studio. I pasted a document and compared it to drag-and-drop upload. The full text token count was much larger than the image-based upload. They were converting files to images and counting vision tokens. So they already kind of knew this.

Also, I don’t know about ChatGPT, but for a long time, Anthropic’s Claude seemed to feed documents uploaded through its web interface as images, not parsed text. They must have tested it and green-lit it because it was more feasible and efficient.

Where does vision stop and language begin?

This also reminded me of a 2020 Lex Fridman conversation with Ilya Sutskever:

Ilya Sutskever: Where does vision stop and language begin? … You have a vision system, you say it’s the best human-level vision system, I open a book and show you letters: will it understand how these letters form into words and sentences and meaning? Is this part of the vision problem?

Lex Fridman: That’s a really interesting question… One possibility is that it’s impossible to achieve really deep understanding in either images or language without basically using the same kind of system, so you’re going to get the other for free…

Ilya Sutskever: A lot of it depends on your definitions of perfect vision: because really, you know, reading is vision. But should it count?

I emphasize again: this was recorded in 2020. Five years is a long time in this field. But still, I keep thinking about this conversation.

My opinion: vision precedes language here. Or said differently, for documents, language sits inside vision. If you can see the page, you can get the language. Starting from text alone can’t recover layout, typography, figures, or spatial structure. All those rich signals are lost when you flatten the document into text. It’s not just about efficiency; it’s about the nature of the task.

Bitter Lesson vibes

This feels like the Bitter Lesson again. Methods that scale win. Text is one-dimensional and throws away structure; you end up with fragile pipelines. Vision is two-dimensional and more general. If we want to be more Bitter-Lesson-aligned and pursue methods that scale, language tasks will increasingly be subsumed by vision tasks for documents. Looking at human modality: we have five senses. There is no “text modality.” What we do for “natural language processing” in our brains is auditory and visual. For text, it’s purely vision.

Twenty years ago, to take notes people had to type and store strings. Today, if I want to copy a whiteboard or an announcement in a classroom, everybody takes pictures. Nobody writes it down. It’s more general. I think the same thing will happen with language models. Twenty years from now, who knows how we’ll interact with AI models, but we’ll probably do the equivalent of taking pictures of text for much more capable models, and look back at “paste-the-whole-document-as-text-language-models” as quaint.

Final thoughts

I think we’re getting the answer Ilya hinted at back in 2020. Where does vision stop and language begin? For documents, language lives inside vision. DeepSeek OCR is interesting not because it invents a new modality, but because it treats the obvious seriously: for documents, seeing beats parsing. Once you accept that, a lot of design choices get simpler.

The fact that labs like Anthropic have long defaulted to image-based document uploads suggests they already tested this and know it’s more feasible and efficient. It makes you wonder how much frontier labs already know, and how far ahead they are.

My Take on GPT-5

2025-08-18T04:51:00+00:00

OpenAI recently released GPT-5, with claims of a new state-of-the-art model that tops benchmarks. After spending some time with it, my initial impression is that it is a decent model, but it doesn’t feel groundbreaking to me. However, I’ve come to realize that this release probably wasn’t intended for power users like me. For most people, this model is a much bigger deal.

Power users and everyday users

Before GPT-5, I had used OpenAI’s o3 model almost exclusively since March. As I’ve discussed in a previous post, I have high regard for o3, mainly because of its “agentic” nature. It could actively search the web to gather context and provide more reliable answers. This ability to use tools and retrieve context on its own, in my opinion, separates a useful AI from a toy.

This is why it sometimes frustrates me to see friends and colleagues, even those with a ChatGPT Plus subscription, stick to the basic GPT-4o model. They often complain that it hallucinates or makes things up, and when I ask which model they’re using, most of the time it’s the 4o model. A model without a dedicated reasoning process and tool usage is going to be less reliable for complex tasks. I’ve made it a personal rule to never trust a non-reasoning model for anything beyond simple tasks like drafting an email or editing my writing.

The value of a “thinking” model comes from test-time compute scaling. When you allow a model to think harder about a problem, the result is usually much better than what a non-reasoning model can produce. With GPT-5, this capability is now dynamically available to everyone.

The router

The most significant change with GPT-5 may not be the base model itself, but the introduction of the router. This system dynamically decides whether a query requires the deeper “GPT-5 Thinking” model or can be handled by a simpler one.

A recent article from SemiAnalysis by Dylan Patel and his team opened my eyes to the business implications of this. They argue that the router could help OpenAI monetize its massive base of free users. The router can distinguish between a trivial query like, “What is the capital of France?” and a commercially valuable one like, “What are the best running shoes I can buy?”

The first query doesn’t require deep reasoning and is cheap to answer. The second has high commercial intent. The router can allocate more resources to it, use web search, and provide a detailed recommendation. This creates an opportunity for OpenAI to take a transaction fee or affiliate revenue, turning the chatbot into something closer to a monetizable super-app. It’s a way to monetize without resorting to intrusive ads, which Sam Altman has expressed a distaste for.

While I agree that the router enables this, I’d push back slightly and argue that a sufficiently advanced model could theoretically make these decisions on its own. Still, implementing it as a dedicated router is a clear product choice.

Final Thoughts

My experience with GPT-5 has solidified a key belief: always use a thinking model. Since its release, I’ve used “GPT-5 Thinking” exclusively, and I don’t care about the automatic routing for my own use.

If you’re reading this, the main takeaway I want to leave you with is this: whenever you have the choice, use the model that thinks. The difference in quality and reliability is huge. For the average user, GPT-5’s real benefit is making that choice for them and bringing reasoning models to many more people.

The Murmuring Woman

2025-05-10T11:00:00+00:00

I had a strange experience today that I wanted to write down. I was at a cafe with my girlfriend, planning our vacation. Nearby, there was a woman, maybe in her mid-40s, wearing a face mask. She was constantly murmuring to herself. Not loudly, and I couldn’t make out the words, but it was non-stop. She looked agitated, getting up, sitting down slightly differently, pacing out of the room, then returning to the same seat. This went on and on. As we wrapped up our trip planning, I looked up, and she was just gone. That quick.

Our vacation plans came together well, but the image of that woman lingered in my head. When I saw her, it reminded me of LLMs. I had to admit that my brain is so stuffed with AI these days, for better or worse.

Hear me out.

From self-talk to chain-of-thought

The woman’s constant self-talk, that murmuring, felt like what chain-of-thought reasoning models are currently doing. It’s a simple analogy, maybe too easy, but it stuck with me. I don’t know why this happens, but anthropomorphizing LLMs sometimes helps me see what capabilities they might need, or what data we should give them to make them more capable. These analogies make it easier for me to see things.

There’s a kind of progression here:

Traditional LLMs: These models don’t really “think” in a step-by-step way. They just generate, often verbatim, without much pause – a kind of knee-jerk reaction. This is like System 1 thinking.
Reasoning Models (Chain-of-Thought): When these came along, they blew the older models out of the water. This introduced a new scaling paradigm: test-time compute. Introspection, or thinking step-by-step, is much better than a knee-jerk response for many tasks. This is System 2, and it’s really good for improving capabilities. Noam Brown’s work really pioneered this area.

The limits of introspection

Current models are moving toward becoming agents. And here’s where the analogy with the woman becomes more interesting. To be clear, I don’t know her or what she was going through. She looked like she was having a tough time, and this is only an observation for the sake of analogy.

Constant introspection, just talking to oneself, only gets you so far. And that’s exactly the limit I see with first-generation reasoning models, like some of the DeepSeek models or OpenAI’s o1. They can think, they can “talk to themselves” on and on, but they can’t reliably verify their own thoughts.

Compare this to how people generally operate. When people think, they can self-verify using external tools or interactions. They might talk something through with someone else, or rely on external aids like their iPhone, a book, or a quick search. That’s what models like Anthropic’s Claude 3.7 Sonnet and OpenAI’s o3 are doing now. They interact with the real world through an external pipeline, a bridge we call “tools.”

The fine line of anthropomorphism

When you anthropomorphize an LLM this way, the need for tools and external interaction becomes obvious. But there’s a caveat: the internal modeling of an LLM is very different from human cognition. I’m anthropomorphizing only to see what LLMs might benefit from, what might make them more capable. It’s a fine line to walk.

Aren’t we all just next-word predictors?

This kind of anthropomorphizing is easy to slip into, because language models can seem persuasive and lifelike. It reminded me of something Scott Aaronson said a year ago. When LLMs first emerged and people argued “it’s just next-word prediction, just statistical modeling,” he’d retort (I’m paraphrasing): “But what about you? Aren’t you just a next-word predictor? What about your mom?”

It cracked me up at the time. If I’d said that to some of my close friends who looked down on LLMs, they would have fumed. They’d be outraged! But when ChatGPT came out, I intuitively agreed with Aaronson’s point. My mind hasn’t changed on that.

I think as models get more capable, Aaronson’s quip, “aren’t we all just next-word predictors?”, will become true in a functional sense. Recently, LLMs passed the Turing test, but society moved on like nothing happened. Sooner or later, for every verifiable task, model capabilities will likely exceed human capabilities. And still, when that happens, they will be, at their core, next-word predictors. Superhuman next-word predictors, better than us at any given task.

Then what would we become?

Dropout - Review

2025-04-28T10:00:00+00:00

Dropout is one of those deep learning techniques that feels ubiquitous now. Revisiting the original 2014 paper, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” by Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov, reminded me how much thought sits behind a simple idea.

The evolutionary analogy

One of the most striking parts of the paper is the motivation drawn from evolutionary biology, specifically the role of sexual reproduction.

“One possible explanation for the superiority of sexual reproduction is that, over the long term, the criterion for natural selection may not be individual fitness but rather mix-ability of genes.”

The paper contrasts this with asexual reproduction, where a well-adapted set of genes might be optimized for a specific environment but brittle if conditions change. Sexual reproduction constantly shuffles genes, forcing individual genes to work with a random set of other genes. This “mix-ability” creates robustness.

This analogy maps well onto neural networks. A standard network might develop complex “co-adaptations” between hidden units, fitting the training data perfectly but failing on unseen examples. Dropout, by randomly removing units during training, acts like gene shuffling. It forces each unit to be useful on its own or together with many different random subsets of other units. This prevents the network from relying on fragile partnerships that only exist in the training data. As the paper humorously adds, ten small conspiracies might be more robust than one large one requiring everyone to play their part perfectly.

The real goal

This ties into a crucial point, echoing sentiments sometimes expressed by researchers like Ilya Sutskever: the objective isn’t just fitting the training data, but generalizing to the test set. The paper highlights this early on:

“With limited training data, however, many of these complicated relationships will be the result of sampling noise, so they will exist in the training set but not in real test data even if it is drawn from the same distribution. This leads to overfitting…”

Dropout directly attacks this problem. Overfitting often involves learning spurious correlations, patterns that exist purely by chance in the training sample. Standard networks, especially high-capacity ones, have the “luxury” of using their parameters to memorize this noise and minimize training loss.

Learning robust features, not noise

Dropout changes the incentive structure during training. By constantly disrupting pathways, it makes it harder for the network to rely on specific complex interactions between neurons that may only capture spurious correlations. The “reward”, or gradient signal, for learning these fragile patterns becomes inconsistent.

In contrast, strong features that reflect the real data structure are likely detectable through multiple pathways or redundant representations. These features “survive” the dropout process more reliably and receive more consistent reinforcement. Dropout therefore pushes the network to spend capacity on features that are resilient to random disruption, which are exactly the features more likely to generalize.

Approximating an exponential ensemble

The core mechanism is simple:

During Training: For each training case (or minibatch), randomly “thin” the network by dropping units (setting their output to zero) with probability 1-p, where p is the probability of retaining a unit. This amounts to training an exponentially large ensemble of thinned networks (up to 2^N for N units) that all share weights.
At Test Time: Explicitly averaging the predictions of all possible thinned networks is intractable. Instead, use the single, full network but scale down the outgoing weights of units by the retention probability p. This simple scaling provides a good approximation of the average prediction of the ensemble.

This allows the model to train like a huge ensemble but perform inference efficiently with a single network.
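The two phases above can be sketched in a few lines of plain Python. This is a toy illustration, not the paper’s code: the function names, the activation values, and p = 0.8 are all made up for the example, and I scale the activations rather than the outgoing weights, which is equivalent only when the next layer is linear.

```python
import random

def dropout_train(activations, p, rng):
    """Training: keep each unit with probability p, otherwise set it to zero."""
    return [a if rng.random() < p else 0.0 for a in activations]

def dropout_test(activations, p):
    """Test: keep every unit and scale by the retention probability p."""
    return [a * p for a in activations]

rng = random.Random(0)
hidden = [0.5, 1.0, 2.0, 4.0]  # hypothetical hidden-unit activations
p = 0.8                        # hypothetical retention probability

# Average many randomly thinned passes and compare with the scaled full pass.
n_samples = 20000
sums = [0.0] * len(hidden)
for _ in range(n_samples):
    for i, a in enumerate(dropout_train(hidden, p, rng)):
        sums[i] += a
mc_average = [s / n_samples for s in sums]
scaled = dropout_test(hidden, p)
```

For a single layer like this, the Monte Carlo average over thinned passes converges to the scaled activations. In a deep network with nonlinearities the scaling is only an approximation of the ensemble average, which is the point the paper makes.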

Max-norm regularization

The paper notes that dropout often works best with high learning rates and momentum. However, this can risk weights growing uncontrollably. They found one technique particularly helpful: Max-Norm Regularization.

“…constraining the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant c. In other words, if w represents the vector of weights incident on any hidden unit, the neural network was optimized under the constraint ||w||₂ ≤ c.”

This acts as a stabilizer. By capping the L2 norm of incoming weights to each neuron, it prevents weights from exploding. That allows the use of aggressive learning rates needed to overcome the noise introduced by dropout, without losing stability.
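A minimal sketch of the constraint, assuming it is enforced the way the paper describes, by projecting w back onto the ball of radius c whenever an update pushes it outside. The function name and the example vectors are mine.

```python
import math

def max_norm_project(w, c):
    """Return w unchanged if ||w||_2 <= c, otherwise rescale it to have norm c."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= c:
        return list(w)
    scale = c / norm
    return [x * scale for x in w]

w_big = [3.0, 4.0]    # norm 5.0, violates the constraint for c = 2.0
w_small = [0.3, 0.4]  # norm 0.5, already inside the ball
projected = max_norm_project(w_big, c=2.0)
```

In a training loop this would run on each hidden unit’s incoming weight vector after every gradient step, so the direction of w is preserved and only its length is capped.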

Sparsity as a side effect

Interestingly, the paper shows in Figures 7 and 8 that dropout often leads to sparser activations in hidden units, even without explicit sparsity penalties. Neurons learn to be more selective, potentially making the learned representations more interpretable or efficient.

Final thoughts

Dropout shows the power of a simple, well-motivated idea. It provides a practical way to prevent overfitting by discouraging the memorization of spurious, training-set-specific correlations. It is not a silver bullet: it lengthens training, and it does not always combine cleanly with Batch Normalization. Still, it is easy to see why it became so widely used.

Rethinking Sequence-to-Sequence - Review

2025-04-26T09:00:00+00:00

Reading older papers often gives a clearer view of how current ideas developed. Recently, I went through the 2015 ICLR paper “Neural Machine Translation by Jointly Learning to Align and Translate” by Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. It tackles a core problem in early sequence-to-sequence models for machine translation.

The main issue they identified was the “bottleneck” in the standard RNN encoder-decoder framework popular at the time. These models tried to compress the entire meaning of a source sentence, regardless of length, into a single fixed-length vector. As the paper noted, this made long sentences difficult; performance tended to drop significantly as sentences got longer.

Their proposed solution was to allow the decoder to look back at the source sentence and selectively focus on relevant parts when generating each target word. This avoids forcing all information through one fixed vector.

Key concepts

Here are the core ideas:

The problem: fixed-length vector bottleneck: Standard encoder-decoders map an input sequence x = (x_1, ..., x_{T_x}) to a fixed context vector c. The decoder then generates the output y = (y_1, ..., y_{T_y}) based solely on c and previously generated words. This compression limits the model’s capacity, especially for long inputs.
The solution: alignment mechanism: Instead of one c, the proposed model computes a distinct context vector c_i for each target word y_i. This c_i is a weighted sum of annotations (h_1, ..., h_{T_x}) from the encoder. Each h_j corresponds to a source word x_j, or more precisely, the hidden state around it.
How it works: alignment model and context vector:
- The weight a_{ij} for each annotation h_j when generating y_i depends on how well the input around position j aligns with the output at position i.
- These weights are calculated using an “alignment model” a (a function, not to be confused with the weights a_{ij}), which takes the previous decoder hidden state s_{i-1} and the encoder annotation h_j as input to produce a score e_{ij}.
- e_{ij} = a(s_{i-1}, h_j)
- The weights a_{ij} are obtained by normalizing these scores with a softmax: a_{ij} = exp(e_{ij}) / Σ_k exp(e_{ik}).
- The context vector c_i is then the weighted sum: c_i = Σ_j a_{ij} h_j.
- Crucially, the alignment model a (parameterized as a small feedforward network) is trained jointly with the rest of the system.
Soft vs. hard alignment: The paper uses the term “soft alignment.” This contrasts with “hard alignment,” which would involve making a deterministic choice of which single source word aligns with the target word. Soft alignment uses a weighted average over all source annotations. This makes the mechanism differentiable and allows the model to learn alignments implicitly through backpropagation. It also handles cases where a target word depends on multiple source words, or vice versa.
The encoder: bidirectional RNN (BiRNN): To ensure the annotation h_j captures context from both before and after the source word x_j, they used a BiRNN. This consists of a forward RNN processing the sequence from x_1 to x_{T_x} and a backward RNN processing it from x_{T_x} to x_1. The annotation h_j is the concatenation of the forward hidden state →h_j and the backward hidden state ←h_j. While BiRNNs weren’t new, their use here makes sense for creating richer annotations.
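To make the score, softmax, and weighted-sum steps concrete, here is a small sketch in plain Python. It assumes the additive form of the alignment model from the paper, e_{ij} = v^T tanh(W s_{i-1} + U h_j). The matrices, vectors, and function names are toy values I made up, and the annotations are plain 2-d vectors rather than real BiRNN states.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def alignment_score(s_prev, h_j, W, U, v):
    """e_ij = v^T tanh(W s_{i-1} + U h_j), a one-hidden-layer feedforward scorer."""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W, s_prev), matvec(U, h_j))]
    return sum(vi * hi for vi, hi in zip(v, hidden))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(s_prev, annotations, W, U, v):
    """Return the alignment weights a_ij and the context c_i = sum_j a_ij h_j."""
    scores = [alignment_score(s_prev, h, W, U, v) for h in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    c = [sum(w * h[d] for w, h in zip(weights, annotations)) for d in range(dim)]
    return weights, c

# Hypothetical toy values: one decoder state and three source annotations.
s_prev = [1.0, 0.0]
annotations = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
weights, c_i = context_vector(s_prev, annotations, W, U, v)
```

Every step is a smooth function of W, U, and v, which is why the alignment model can be trained jointly with the rest of the network by backpropagation.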

What I learned

Reflecting on the paper, several points stand out:

Performance improvement on long sentences: The results clearly show the benefit. The standard RNNencdec model’s performance drops sharply with sentence length, while the proposed RNNsearch model remains much more robust. The BLEU scores confirm a significant improvement, bringing NMT closer to traditional phrase-based systems of the time.
Interpretability via alignment: The alignment weights a_{ij} can be visualized. This gives some insight into what parts of the source sentence the model focuses on when generating a specific target word. The visualizations showed mostly monotonic alignments, as expected between English and French, but also the ability to handle local reordering, like adjective-noun flips. This interpretability is a nice side effect compared with trying to understand a monolithic RNN.
Handling reordering and length differences: The soft alignment naturally deals with source and target phrases having different lengths or requiring non-trivial mappings, without needing explicit mechanisms like NULL tokens used in traditional SMT.
Link to Transformers: Reading this after knowing about Transformers makes the connection clear. The core mechanism, scoring source annotations based on the current decoder state, using softmax for weights, and computing a weighted sum, is basically attention. The Transformer later built on this by removing recurrence and adding multi-head attention, positional encodings, and so on.

Summary

This paper addressed a clear limitation in early NMT: the fixed-length vector bottleneck. The solution was straightforward but powerful: allow the decoder to learn where to focus in the source sequence. The “soft alignment” mechanism is, in essence, the attention mechanism that later became central to architectures like the Transformer.

Looking back now, the idea feels intuitive, but implementing it effectively and showing its benefits in 2014/2015 mattered. It’s a clear paper: problem, solution, evidence. Reading it helps connect older sequence-to-sequence models to the models we use today.

Knowledge Distillation - Review

2025-04-22T14:00:00+00:00

I’ve known about knowledge distillation for a while. The core idea is simple: soft labels, the full probability distribution from a model, contain richer information about class relationships than hard labels alone. I first encountered it in a lecture by Geoffrey Hinton on paths to intelligence, and decided to read the original 2015 paper, “Distilling the Knowledge in a Neural Network,” co-authored with Oriol Vinyals and Jeff Dean. It’s short, but the idea is clear.

The insect analogy

What struck me immediately was the opening analogy:

“Many insects have a larval form that is optimized for extracting energy and nutrients from the environment and a completely different adult form that is optimized for the very different requirements of traveling and reproduction.”

I haven’t seen many ML papers start with a biological analogy like this. I hadn’t thought about insect life stages this way before. The larva is about consumption and growth: slow-moving, maybe not complex, but efficient at extracting resources, like a large training model absorbing information from data. The adult form is optimized for different tasks: lightweight, fast, mobile, and focused on specific functions like reproduction, like an efficient deployment model needing low latency and low computational cost.

The analogy fits perfectly with the challenge in machine learning:

Training: We often use huge, “cumbersome” models (or ensembles) that take lots of computation and time but are great at extracting every bit of signal from large datasets.
Deployment: We need models that are fast, efficient, and have low latency for real-world use.

Distillation, then, is like metamorphosis: transforming the knowledge captured by the cumbersome training model into the efficient deployment model.

Knowledge beyond weights

The paper points out a potential “conceptual block”:

“…we tend to identify the knowledge in a trained model with the learned parameter values.”

This makes it hard to think about transferring knowledge without just copying weights. Prior work like Rich Caruana’s model compression focused on matching the outputs before the final softmax, the logits. Hinton et al.’s approach refines this by using the probabilities from the softmax, arguing that this captures the learned distribution more meaningfully.

The value of “wrong” answers

A key insight is how the large, cumbersome model generalizes. It’s not just about getting the right answer.

“…a side-effect of the learning is that the trained model assigns probabilities to all of the incorrect answers… The relative probabilities of incorrect answers tell us a lot about how the cumbersome model tends to generalize.”

The example they give is clear: an image of a BMW might have a tiny probability of being mistaken for a garbage truck, but that probability, however small, is likely higher than the probability of it being mistaken for a carrot. This network of similarities and differences between classes is knowledge learned by the teacher model. Hard labels, just “BMW”, throw this information away. Soft labels, the full probability distribution, preserve it.

This aligns with the objective: we don’t just want models to perform well on training data, we want them to generalize well to new data. Soft targets directly transfer the generalization behavior of the teacher model to the student.

Temperature scaling

So how do we use these soft labels? If the teacher model is very confident, assigning probability ~1.0 to the correct class, the probabilities for incorrect classes are tiny. Even if their ratios contain information, they have almost no impact on the cross-entropy loss during student training.

The solution is to “raise the temperature” T of the softmax function:

q_i = exp(z_i / T) / Σ_j exp(z_j / T)

where z_i are the logits. Normally T=1. Using a higher T > 1 “softens” the probability distribution, increasing the probabilities of incorrect classes and allowing them to contribute more to the loss function. The student model is trained to match this softened distribution using the same high temperature T. After training, the student uses T=1 for inference.

This temperature scaling is the core mechanism. The paper notes that in the high-temperature limit, this method becomes equivalent to matching the logits (Caruana’s approach), but at intermediate temperatures, it focuses more on matching the more probable incorrect classes, potentially ignoring noise from very negative logits.
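Here is a quick sketch of the softened softmax. The logits are invented for illustration (think “BMW”, “garbage truck”, “carrot”), and T = 5 is an arbitrary choice.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [10.0, 5.0, 1.0]  # hypothetical teacher logits
sharp = softmax_with_temperature(logits, T=1.0)
soft = softmax_with_temperature(logits, T=5.0)
```

At T = 1 nearly all the mass sits on the first class. At T = 5 the second and third classes get visibly more probability, and the ordering between them is kept, so the student can actually learn from it.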

Training the student

The best results often come from combining two objectives:

Matching the soft targets from the teacher (using cross-entropy with high temperature T).
Matching the true hard labels (using cross-entropy with T=1).

They found that a weighted average works well, often with a lower weight on the hard target loss. As they say: “Typically, the small model cannot exactly match the soft targets and erring in the direction of the correct answer turns out to be helpful.”
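A rough sketch of the combined objective for a single example. The T² factor on the soft term follows the paper’s note that soft-target gradients scale as 1/T². The temperature, the 0.1 weight on the hard loss, the logits, and the function names are all my own illustrative choices.

```python
import math

def softmax(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs):
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0)

def distillation_loss(student_logits, teacher_logits, true_class, T=4.0, hard_weight=0.1):
    """Weighted average of the soft-target loss (at temperature T) and the hard-label loss (T=1)."""
    soft_targets = softmax(teacher_logits, T)
    student_soft = softmax(student_logits, T)
    soft_loss = cross_entropy(soft_targets, student_soft)
    hard_loss = -math.log(softmax(student_logits, 1.0)[true_class])
    # T*T keeps the soft term's gradient scale comparable as T changes.
    return (1.0 - hard_weight) * (T * T) * soft_loss + hard_weight * hard_loss

teacher = [5.0, 2.0, 0.0]  # hypothetical teacher logits, true class is 0
good = distillation_loss([5.0, 2.0, 0.0], teacher, true_class=0)
bad = distillation_loss([0.0, 2.0, 5.0], teacher, true_class=0)
```

A student that reproduces the teacher’s logits gets a much lower loss than one that ranks the classes backwards. The loss is still positive in the first case, because the cross-entropy of a distribution with itself is its entropy.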

The MNIST experiment

A clear experiment shows the value of this approach. They trained a student model on MNIST, but omitted all examples of the digit ‘3’ from the transfer set. From the student’s perspective, ‘3’ was a “mythical digit” it had never directly seen.

Despite this, the distilled model performed well on classifying ‘3’s at test time, with a bias adjustment. It had learned about ‘3’ indirectly through the soft targets for other digits, for example by learning which ‘8’s looked a bit like a ‘3’ according to the teacher model. This is evidence that soft targets transfer generalization behavior, not just labels.

Final thoughts

This paper is a good example of a simple idea that turns out to matter a lot:

“…a lot of helpful information can be carried in soft targets that could not possibly be encoded with a single hard target.”

Neural Probabilistic Language Model - Review

2025-04-20T15:00:00+00:00

I recently dove into Yoshua Bengio et al.’s 2003 paper, “A Neural Probabilistic Language Model”. Reading a paper from over two decades ago is fascinating. What struck me most wasn’t the specific model, which is simple by today’s standards, but how clearly Bengio laid out the core problems of language modeling. I came away with more respect for his vision.

The problem: the curse of dimensionality

Bengio starts by framing the fundamental challenge: the curse of dimensionality. As he puts it,

“…a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training.”

This is because the number of possible sentences is essentially infinite, like the Library of Babel. Any specific sentence has almost zero probability of occurring randomly.

The “curse” goes deeper than the sheer number of sequences. As the number of dimensions, such as sequence length or feature count, increases:

Space expands exponentially: The volume of the space grows very fast, making the available data extremely sparse.
Distance intuition breaks: In high dimensions, points tend to become equidistant from each other, and most of the volume is concentrated far from the center, near the “surface” of the high-dimensional space. Our low-dimensional intuitions about proximity and density fail.
Spurious correlations: With so many dimensions, it becomes easy to find apparent patterns in data that are just noise.
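The distance point is easy to check numerically. The sketch below is my own toy experiment, not something from the paper. It draws uniform random points in the unit cube and measures how much farther the farthest point is than the nearest one, relative to the nearest. The dimensions, point count, and seed are arbitrary.

```python
import math
import random

def relative_contrast(dim, n_points, rng):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)
low = relative_contrast(dim=2, n_points=500, rng=rng)
high = relative_contrast(dim=1000, n_points=500, rng=rng)
```

In two dimensions the nearest neighbor is far closer than the farthest point, so the contrast is large. In a thousand dimensions all the distances bunch together and the contrast shrinks toward zero, which is the “everything is equidistant” effect.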

This is a core challenge for many real-world problems, especially with rich sensory data spanning many dimensions. How do you find the signal in such a vast, sparse space without getting lost?

The solution: fighting fire with fire

Bengio and his colleagues proposed a way to fight this curse:

“…learning a distributed representation for words…”

Essentially, they proposed learning dense, low-dimensional feature vectors, or embeddings, for each word in the vocabulary. This is like fighting fire with fire: the vocabulary space is huge and discrete (17k+ words in their experiments), while the learned feature space is much smaller (30 to 100 dimensions) but continuous. Because it’s a dense continuous space, even a relatively low-dimensional one can represent complex relationships. They are mapping the discrete vocabulary into a structured latent space.
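
Here is a minimal sketch of that mapping. The toy vocabulary, the tiny dimension, and the random initial values are all assumptions for illustration; the paper calls the matrix of word feature vectors C, and I reuse that name.

```python
import random

vocab = ["the", "a", "cat", "dog", "sat", "rested", "on", "mat", "rug"]
word_to_id = {w: i for i, w in enumerate(vocab)}
embedding_dim = 3  # tiny on purpose, just to keep the printout readable

rng = random.Random(0)
# One row per word. Randomly initialised here; in the model these values are learned.
C = [[rng.uniform(-0.1, 0.1) for _ in range(embedding_dim)] for _ in vocab]

def one_hot(word):
    """Discrete representation: as long as the vocabulary, all zeros but one."""
    vec = [0.0] * len(vocab)
    vec[word_to_id[word]] = 1.0
    return vec

def embed(word):
    """Dense representation: just a row lookup in C."""
    return C[word_to_id[word]]

def one_hot_times_C(word):
    """The same lookup written as a matrix product, to show the two views agree."""
    oh = one_hot(word)
    return [sum(oh[i] * C[i][j] for i in range(len(vocab))) for j in range(embedding_dim)]

print(len(one_hot("cat")), "->", len(embed("cat")))
print(embed("cat") == one_hot_times_C("cat"))
```

The one-hot vector grows with the vocabulary, while the embedding stays at a fixed small size no matter how many words there are.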

How generalization happens

So how does this help? The paper explains:

“Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence.”

This, for me, is the crux of it. The model learns which words play similar semantic or syntactic roles and places them close together in the embedding space. Because the probability function operates smoothly over this continuous space, seeing “The cat sat on the mat” helps the model assign a higher probability to the unseen sentence “A dog rested on the rug,” because the corresponding words have similar learned representations. This mapping from discrete symbols to a meaningful continuous space is what allows generalization beyond simply memorizing n-grams. This is still central to how current LLMs generalize, even if the systems are much larger now.
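
A toy sketch of why nearby embeddings lead to similar predictions. The 2-d vectors and output weights below are hand-picked by me, not learned, and the "model" is a single linear layer followed by a softmax, which is far simpler than the network in the paper.

```python
import math

# Hand-picked embeddings, purely illustrative: animals point one way, floor coverings another.
emb = {
    "cat": [0.9, 0.1], "dog": [0.8, 0.2],
    "mat": [0.1, 0.9], "rug": [0.2, 0.8],
}

# Hypothetical output weights for three candidate next words.
out = {"sat": [2.0, 0.0], "rested": [1.8, 0.2], "unrolled": [0.0, 2.0]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def next_word_probs(context_vec, output_weights):
    """Softmax over linear scores: a smooth function of the context embedding."""
    scores = {w: sum(a * b for a, b in zip(context_vec, wv)) for w, wv in output_weights.items()}
    m = max(scores.values())
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

p_cat = next_word_probs(emb["cat"], out)
p_dog = next_word_probs(emb["dog"], out)

print("cos(cat, dog):", round(cosine(emb["cat"], emb["dog"]), 3))
print("cos(cat, mat):", round(cosine(emb["cat"], emb["mat"]), 3))
print("P(sat | cat):", round(p_cat["sat"], 3))
print("P(sat | dog):", round(p_dog["sat"], 3))
```

Because the scoring function is smooth, a small move in embedding space (cat to dog) produces only a small change in the predicted distribution. Whatever the model learned from "cat" contexts carries over to "dog" contexts almost for free.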

Learning end-to-end

A key part of their proposal was the third point in the paper’s summary of the approach:

“learn simultaneously the word feature vectors and the parameters of that probability function.”

They recognized that the embeddings and the prediction mechanism need to learn from each other. You can’t just fix one and train the other; they have to be optimized together, end-to-end, for the embeddings to become useful for prediction and vice versa.
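
To show what "optimized together" means mechanically, here is a deliberately stripped-down sketch: a bigram model with no hidden layer, trained with plain SGD on a toy corpus. None of these simplifications come from the paper, whose model uses several context words and a tanh hidden layer. The point is only that one loss produces gradients for both the word vectors (C) and the predictor weights (W).

```python
import math
import random

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
pairs = [(idx[a], idx[b]) for a, b in zip(corpus, corpus[1:])]  # (previous word, next word)

dim = 4
rng = random.Random(0)
C = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]  # word feature vectors
W = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]  # output weights
C_init = [row[:] for row in C]
W_init = [row[:] for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def predict(x):
    """Next-word distribution from one context embedding x."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def mean_loss():
    return -sum(math.log(predict(C[prev])[nxt]) for prev, nxt in pairs) / len(pairs)

def train_epoch(lr=0.1):
    for prev, nxt in pairs:
        x = C[prev][:]  # copy so both gradients use the same pre-update values
        probs = predict(x)
        # Gradient of cross-entropy w.r.t. the scores: probabilities minus the one-hot target.
        g = [p - (1.0 if k == nxt else 0.0) for k, p in enumerate(probs)]
        # The embedding gradient flows back through the output weights.
        grad_x = [sum(g[k] * W[k][j] for k in range(len(vocab))) for j in range(dim)]
        for k in range(len(vocab)):
            for j in range(dim):
                W[k][j] -= lr * g[k] * x[j]   # update the predictor
        for j in range(dim):
            C[prev][j] -= lr * grad_x[j]      # update the embedding in the same step

loss_before = mean_loss()
for _ in range(200):
    train_epoch()
loss_after = mean_loss()
print(round(loss_before, 3), "->", round(loss_after, 3))
```

Each update to W depends on the current embedding, and each update to the embedding depends on the current W, so neither can settle into something useful without the other moving too.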

A historical aside: parallel processing with CPUs

What also caught my eye was the extensive discussion of parallelizing the training process. This was 2003, when widespread GPU computing for ML wasn’t a thing yet. They describe parameter-parallel processing across multiple CPUs, up to 64 Athlon processors in their cluster. They discuss asynchronous updates and communication overhead with MPI. It feels like an early version of the massive parallelization, now mostly on GPUs/TPUs, that is essential for training today’s large models.

Lasting impact

While the specific MLP architecture in the paper is rudimentary now, the core ideas still matter: tackle the curse of dimensionality with learned distributed representations, generalize through similarity in embedding space, and train the representations end-to-end. Reading this paper felt like seeing an early version of the framework we’re still working within.

Revisiting the 2014 Sequence-to-Sequence Paper

2025-04-20T11:00:00+00:00

I recently went back to read the 2014 paper “Sequence to Sequence Learning with Neural Networks” by Sutskever, Vinyals, and Le. It’s practically ancient by today’s ML standards, so I thought it would be interesting to look back. The method itself, an LSTM encoder-decoder, is simple compared to modern architectures, but the authors’ thinking process was still interesting. Some things they mentioned almost casually felt non-trivial to me now.

Here are some of my main takeaways:

DNN power

The paper starts by framing Deep Neural Networks (DNNs) in a way I hadn’t explicitly considered before. They state:

“DNNs are powerful because they can perform arbitrary parallel computation for a modest number of steps.”

This sentence, while true and maybe obvious in retrospect, struck me. Of course, we know matrix multiplications are parallelizable and run well on GPUs, but thinking about it from the perspective of individual neurons performing computations in parallel felt like a useful angle on why neural networks fit this hardware.

They also highlight:

“…their ability to sort N N-bit numbers using only 2 hidden layers of quadratic size…”

Again, a specific example of computational power packed into a relatively simple network that I hadn’t really internalized.

The core problem and the LSTM solution

The authors clearly state the limitation they were tackling:

“Despite their flexibility and power, DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality.”

This was a major hurdle for tasks like machine translation or question answering, where sequence lengths vary. This is where the Long Short-Term Memory (LSTM) network comes in.

For me, the biggest contribution of this paper is that they took the LSTM and successfully trained an encoder-decoder architecture at scale on a difficult task: English-to-French translation. They showed that LSTMs weren’t just a theoretical curiosity; they were practical for large-scale NLP problems.
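
The fixed-dimensionality problem and the encoder’s answer to it can be sketched in a few lines. To keep this short I use a plain tanh recurrence with random, untrained weights instead of an LSTM, so this shows the shape of the idea and nothing about the paper’s actual model. All names and sizes are my own.

```python
import math
import random

hidden_size = 4
vocab = ["i", "like", "green", "tea", "very", "much"]
idx = {w: i for i, w in enumerate(vocab)}

rng = random.Random(0)
# Hypothetical parameters, randomly initialised and never trained.
E = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in vocab]              # input embeddings
U = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]  # recurrent weights

def encode(tokens):
    """Fold a token list of any length into one fixed-size state vector."""
    h = [0.0] * hidden_size
    for tok in tokens:
        x = E[idx[tok]]
        h = [
            math.tanh(x[i] + sum(U[i][j] * h[j] for j in range(hidden_size)))
            for i in range(hidden_size)
        ]
    return h

short = encode(["i", "like", "tea"])
longer = encode(["i", "like", "green", "tea", "very", "much"])
print(len(short), len(longer))
```

Both sentences come out as vectors of the same size, which is what lets a decoder with fixed-size inputs start generating from either one.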

The input reversal trick

One specific technical detail that stood out was their trick of reversing the input sentence:

“…reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.”

My first reaction was: Okay, reversing the input brings the first source word closer to the first target word, which makes sense for translation since the beginning often sets the context. But what about the last words of the source sentence? Don’t they get pushed really far away from the end of the target sentence?

That concern is valid, and I think it highlights a trade-off. The paper itself notes that the average distance between corresponding source and target words is unchanged by the reversal; what shrinks is the minimal time lag, because the first few source words end up right next to the first few target words. Getting the beginning of the translation right is often very important; it lays the groundwork. By reversing the input, they made it easier for SGD to “establish communication” between the early parts of the source and target sequences. The performance gains they reported, perplexity dropping from 5.8 to 4.7 and BLEU jumping from 25.9 to 30.6, suggest this was an effective trade-off, even if it seems counter-intuitive for the tail end of the sequences.
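
The trade-off is easy to see by counting steps. This sketch assumes a perfectly monotone alignment, where source word i corresponds to target word i. That is a simplification of mine for illustration; real translation pairs are messier.

```python
def alignment_lags(source, target, reverse_source=False):
    """Steps between reading source word i and emitting target word i.

    The model reads the whole source (optionally reversed), then emits the target
    in order, so positions are counted along the concatenated stream.
    """
    n = len(source)
    order = list(reversed(range(n))) if reverse_source else list(range(n))
    read_pos = {src_i: pos for pos, src_i in enumerate(order)}
    return [(n + i) - read_pos[i] for i in range(min(n, len(target)))]

source = ["a", "b", "c"]
target = ["alpha", "beta", "gamma"]

normal = alignment_lags(source, target)
reversed_lags = alignment_lags(source, target, reverse_source=True)

print("normal:  ", normal, "mean", sum(normal) / len(normal))
print("reversed:", reversed_lags, "mean", sum(reversed_lags) / len(reversed_lags))
```

The mean lag is the same in both cases, but reversal turns a uniform gap into one very short dependency at the start and longer ones toward the end. The last source word does get pushed further away, as I worried, and the first one gets pulled right next to its target.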

Final thoughts

Reading this paper reminded me how far the field has come, but also how much clear thinking went into earlier work. I admire Ilya Sutskever’s way of thinking; when I listen to him on podcasts, he often speaks with clarity. Looking at this early work reinforces that impression. Maybe I should read through more of his papers.

AI as Personal Guardians

2025-04-19T11:00:00+00:00

I’ve been thinking a lot about how LLMs might reshape society, and one thought clicked: AI could become personal guardians for each of us.

The background is this: as I’ve discussed before, context is important for LLMs. Even now, the insights they provide in split-second decisions can be helpful. They aren’t perfect, but the intelligence they offer is useful. The main thing limiting their ability to help us more consistently is access to that context. If we don’t actively query the model with the right context, it can’t respond to our specific needs. Our lives are complex, and the same query can mean different things depending on our individual circumstances.

So, the context an LLM needs to be truly helpful is immense and deeply personal.

Beyond universal assistants

The idea of a universal AI assistant isn’t new. Think Her or Jarvis. We all nod along, assuming something like that is coming. But I don’t think most people fully grasp what happens when this is put into the palm of our hands. It could touch the texture of everyday experience.

What I envision is this: imagine wearing a small device, maybe a pendant, that continuously and passively records context from your daily life: conversations you have, things you hear, places you go, maybe even subtle reactions. Right now, most of this contextual data is ephemeral, lost the moment it happens because we don’t record our everyday lives.

If this data were captured objectively, it could provide the grounding LLMs need to become more helpful.

Confronting our subjectivity

Here’s why I think people underestimate this: we don’t fully appreciate how subjective, limited, fragile, and unreliable our own perception and memory are. Psychological literature makes it clear that human memory isn’t a perfect recording device. We reshape memories and construct narratives to make sense of the world. Our accounts of the same event differ from person to person, filtered through our viewpoints and emotional states. We aren’t purely rational decision-makers.

An AI, fed with continuous, objective context, could hold up a mirror to this subjectivity. It could help us see patterns and realities that our own minds obscure.

The guardian’s role

Imagine the possibilities:

Objective Recall and Comparison: The AI could provide an objective summary of your day, week, month, or even year. It could compare your activities, moods, or interactions over time in ways impossible for our biased human memory. “How does my interaction pattern today compare to last month?” is a question we can barely guess at; the AI could answer with data.
Personalized Planning: Based on this deep, objective understanding of your past actions, goals, and context, it could suggest optimal plans for tomorrow, complete with relevant reminders grounded in your actual history.
Social Shield: For interactions, it could offer insights or warnings. Imagine someone easily manipulated, such as an elderly person. This AI could recognize patterns of deception or fraud that the person might miss, acting as a protective layer by providing information they didn’t previously have.

Thinking about this “social shield” aspect, especially its potential to steer individuals away from harmful decisions, is when the core concept clicked for me. Imagine the AI noticing subtle health patterns and suggesting a check-up, or recognizing manipulative language in a conversation. Preventing bad outcomes by providing timely information could be one of the most useful parts of this technology. That realization solidified the idea of “a personal guardian for everyone.”

With such guardians, individuals might become better decision-makers overall. Imagine consulting your guardian before making major life choices, such as which university course to take or which job offer best aligns with your long-term patterns and goals, or asking it which habit you hadn’t noticed is hurting your health.

A guardian in the cloud

This isn’t about the AI becoming a godlike entity dictating our lives. It’s about having a practical tool, an intelligent counterpart trying to help us make better decisions, understand ourselves more clearly, and navigate the world more effectively.

I believe this is possible with current technology, though it needs refinement and scale. When, not if, this kind of personalized, context-aware AI guardian becomes widespread, the impact on individual productivity, efficiency, and well-being could be large. Everyone could be better off with their guardian than without.

This leads directly to the future Yuval Noah Harari described, where algorithms might genuinely know you better than you know yourself in certain ways. Fascinating, and slightly unnerving.