SwayStar123

My Diffusion Experiments

2025-09-08T00:00:00-07:00

Here I note down some experiments I tried with diffusion models in the past, or that I plan on conducting (if I get the compute for it).

Improving SRA

SRA is a paper which introduces a method of self-alignment: essentially, it aligns an earlier layer at a higher timestep (more noise) with a later layer at a lower timestep (less noise). I really like this idea, as it allows the paper to compete with (but unfortunately not overtake) REPA, without involving an externally trained model. Theoretically it could outperform REPA too if you trained a large enough model with this method, as they show that the benefits from SRA increase as the model size gets larger (but they only test up to XL).

So, if this method could be improved enough to overtake REPA, it would probably gain a lot more recognition.

E-SRA

What I called Entangled-SRA was inspired by REG, a paper which improves on REPA. It trains the diffusion model with REPA, but additionally includes the DINO CLS token as an extra target token to be denoised from scratch, which allows it to learn a lot faster: 23x faster than REPA, in fact.

So the idea was to do SRA, but then also pool the teacher layer's activations to form a CLS token, and then denoise that as well. Unfortunately, this didn't work out that well.
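As a rough illustration of the pooled-CLS idea (a sketch under assumptions, not the actual training code): mean pooling is assumed for forming the CLS token, and a linear noise path where t=1 is pure noise is assumed for the denoising target. The helper names `mean_pool` and `noisy_cls_and_target` are made up for this example, and plain Python lists stand in for tensors.

```python
import random

def mean_pool(tokens):
    """Average a list of token vectors into a single CLS-like vector (assumed pooling)."""
    n = len(tokens)
    return [sum(tok[j] for tok in tokens) / n for j in range(len(tokens[0]))]

def noisy_cls_and_target(teacher_tokens, t, rng):
    """Noise the pooled vector at time t and return (noisy input, velocity target).

    Assumes the linear path x_t = (1 - t) * clean + t * noise, whose velocity
    with respect to t is noise - clean.
    """
    cls_clean = mean_pool(teacher_tokens)
    noise = [rng.gauss(0.0, 1.0) for _ in cls_clean]
    cls_noisy = [(1 - t) * c + t * e for c, e in zip(cls_clean, noise)]
    velocity_target = [e - c for c, e in zip(cls_clean, noise)]
    return cls_noisy, velocity_target

# Two made-up teacher activations of width 2.
teacher_tokens = [[1.0, 2.0], [3.0, 4.0]]
noisy, target = noisy_cls_and_target(teacher_tokens, t=0.7, rng=random.Random(0))
print(noisy, target)
```

The noisy CLS vector would be fed to the model as one extra token, and the model's output at that position would be regressed onto the velocity target with its own loss weight.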

I also implemented CFM (Contrastive Flow Matching), which should give another 3x boost.

Results:

SiT-XL/2: (CFM, E-SRA). CLS weight 0.03
    300k train steps:
        Inception Score: 81.48990631103516
        FID: 14.728037552378794
        sFID: 8.32263929140015
        Precision: 0.65464
        Recall: 0.6173

The results were actually nearly identical to SRA at 300k steps, so this was a null result. It is unclear if the CFM was counteracting any negative effect from my method, or the other way around. I also tried increasing the CLS weight to 1:

SiT-B/2: (CFM, E-SRA). CLS weight 1.0
    375k train steps:
        Inception Score: 36.352333068847656
        FID: 36.51545321657193
        sFID: 9.45342136689294
        Precision: 0.49322
        Recall: 0.6285

However, this was actually significantly worse than SRA's FID.

In retrospect I should have tested just CFM + SRA first in isolation, and my method in isolation! With these unknown confounding factors it is hard to determine what went wrong here. I made the mistake of trying to do multiple things together as a yolo run because I was on a limited compute budget.

However, I theorize that my method might not work that well as it is. The main benefit from SRA is that an earlier layer is forced to model a later layer, which aligns it to work more efficiently. In my case, though, the pooled embedding is diffused after all the layers, so the teacher layer is actually shallower than the prediction head.

If I have additional compute and the desire to continue in this direction I will probably have to add a velocity head at the student layer and get the denoising prediction there, rather than at the end.

Modifying teacher timestep distribution

In the SRA paper, they ablate over their teacher's timestep distribution, but they only test distributions in which the teacher timestep is 0-0.3 ahead of the student, which I find a bit too low. (And the fact that it is a random distribution irritates me! Isn't this an inconsistent target? The representations at t and t-0.3 could be very different, and the student has no idea which of these it is supposed to model!)

This is their current teacher timestep calculation method:

time_input_teacher = time_input - (self.t_max * torch.rand_like(time_input))

where t_max is by default 0.2

I replace it with a simple

time_input_teacher = time_input / 10

which makes it so at t=1, teacher time = 0.1, and as student time approaches 0, teacher time also approaches 0.

Inception Score: 48.172080993652344
FID: 30.437015441628148
sFID: 6.288378112108376
Precision: 0.5589
Recall: 0.6347

SRA with the default settings on B/2 gets 29.10 FID at 400k steps, so this is slightly worse, but curiously, my results are still better than their constant time interval of 0.2 ablation (which gets 30.7). So if I do time_input / 5 or something instead it might be better?
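To make the difference between the two rules concrete, here is a small sketch in plain Python (scalars instead of tensors). The function names are made up for this example, and the random rule mirrors the quoted line as-is, with no clamping.

```python
import random

def teacher_time_random(t, t_max=0.2, rng=random):
    """Random rule quoted above: subtract a uniform offset in [0, t_max)."""
    return t - t_max * rng.random()

def teacher_time_scaled(t, divisor=10.0):
    """Deterministic replacement: the teacher always sees t / divisor."""
    return t / divisor

rng = random.Random(0)
for t in (1.0, 0.5, 0.1):
    print(
        f"student t={t}",
        f"random={teacher_time_random(t, rng=rng):.3f}",
        f"t/10={teacher_time_scaled(t)}",
        f"t/5={teacher_time_scaled(t, divisor=5.0)}",
    )
```

With the random rule the same student timestep maps to a different teacher timestep on every call, while the scaled rule gives the student a single fixed target per timestep. Note that the gap between student and teacher is much larger under t/10 (0.9 at t=1) than under the random rule (at most 0.2).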

TODOs

SRA + SARA might work, or a modified version of SARA. Or maybe just distilling the attention queries and keys at the teacher layer into the student layer is also enough.

Dispersive loss along with SRA could be good. SRA shows that the better the teacher model, the more useful SRA is, so the benefit from dispersive loss could compound.

Contrastive flow loss needs to be tested in isolation to see if it also stacks with SRA and delivers compounding gains.

REPA is the original which inspired SRA, but they don't test whether combining the two works. Further derivatives like REG/ReDI/REPA-E can also be tested.

Reproducing Contrastive Flow Matching

CFM is a paper which suggests a really simple way to get ~3x faster convergence (and faster sampling) with an additional auxiliary loss. Unfortunately they don't provide code, so I reproduced it here: https://github.com/SwayStar123/REPA

And successfully reproduced the numbers from the paper:

B/2 λ=0.05: 
  Inception Score: 69.62489318847656
  FID: 20.539321634715975
  sFID: 5.430992245223706
  regularized FD-DINOv2_eff: 1818.6091087211607

(Reported FID in the paper is 20.5, at 400k steps with REPA, B/2)

It is actually very simple to implement! These are the few lines you need to add/change:

Loss calculation:

        contrastive_flow_target = torch.roll(model_target, shifts=1, dims=0)
        contrastive_flow_loss = mean_flat((model_output - contrastive_flow_target) ** 2)

Loss weight:

loss = (loss_mean - contrastive_flow_loss_mean * args.contrastive_flow_coeff) + proj_loss_mean * args.proj_coeff

(Projection loss is the REPA loss)
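For intuition, the same arithmetic can be traced by hand without PyTorch. The sketch below uses plain lists and leaves out the REPA projection term; `roll_batch` mimics `torch.roll(..., shifts=1, dims=0)`, and the other helper names and the toy numbers are made up for this example.

```python
def mean_sq_err(a, b):
    """Mean squared error between two equal-length vectors (mean_flat for one sample)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def roll_batch(batch):
    """Shift the batch by one so that sample i is paired with sample i-1's target."""
    return batch[-1:] + batch[:-1]

def contrastive_flow_matching_loss(outputs, targets, coeff=0.05):
    negatives = roll_batch(targets)
    per_sample = []
    for out, tgt, neg in zip(outputs, targets, negatives):
        flow_loss = mean_sq_err(out, tgt)         # pull towards the sample's own target
        contrastive_loss = mean_sq_err(out, neg)  # push away from another sample's target
        per_sample.append(flow_loss - coeff * contrastive_loss)
    return sum(per_sample) / len(per_sample)

targets = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
perfect = contrastive_flow_matching_loss(targets, targets)
collapsed = contrastive_flow_matching_loss([[0.0, 1 / 3]] * 3, targets)  # batch-mean prediction
print(perfect, collapsed)
```

A model that predicts every target exactly gets a slightly negative loss, because the contrastive term rewards it for being far from the other samples' targets, while a model that collapses to the batch mean is penalized by the flow term and gets little help from the contrastive term.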

Improving CFM

I tried improving CFM further by introducing a time weighting to the loss coefficient. My thinking was that at lower timesteps (low noise), the clear image is very obvious to the model, so adding a contrastive flow loss to that training example could actually unnecessarily perturb the flow and hinder training. So I modified the loss calculation slightly to

contrastive_flow_loss = mean_flat((model_output - contrastive_flow_target) ** 2) * self.contrastive_flow_schedule(time_input)

where the contrastive flow schedule is by default a linear schedule:

def linear_schedule(t):
    return t

I planned on testing other schedules too if this was successful, but unfortunately:

B/2 linear schedule λ=0.05:
  Inception Score: 65.89761352539062
  FID: 23.426444638420435
  sFID: 6.770656642394329
  regularized FD-DINOv2_eff: 1792.0649000826197

The results were worse! I thought maybe because the average t is 0.5, the loss weighting is getting halved, so if we double the lambda to counteract it, then it could start becoming better?

B/2 linear schedule λ=0.1:
  Inception Score: 67.88655090332031
  FID: 21.47094195726106
  sFID: 5.468999414535688
  regularized FD-DINOv2_eff: 1813.040922549716

It got better, but unfortunately still underperforms the original CFM implementation. Due to these disappointing results I did not test the other schedules or higher lambda values.
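The "average t is 0.5, so the weighting is halved" argument can be checked numerically. This sketch assumes t is sampled uniformly from [0, 1] during training (an assumption, it depends on the training setup), and `effective_coeff` is a made-up helper name.

```python
import random

def linear_schedule(t):
    return t

def effective_coeff(coeff, schedule, n_samples=100_000, seed=0):
    """Average of coeff * schedule(t) over t drawn uniformly from [0, 1)."""
    rng = random.Random(seed)
    total = sum(schedule(rng.random()) for _ in range(n_samples))
    return coeff * total / n_samples

print(effective_coeff(0.05, linear_schedule))  # close to 0.025
print(effective_coeff(0.10, linear_schedule))  # close to 0.05
```

Doubling λ restores the average weight, but not its shape: with λ=0.1 the high-noise steps get up to twice the original coefficient while the low-noise steps get almost none, so the two runs are not equivalent even though their averages match.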

Made up words

2025-01-28T00:00:00-08:00

A few words I made up. Use them.

魔也(まや): माया (Māyā)

Maya, illusion, magic.

梵語のमायाに当て字を付けた。マーヤーは魔物に作られた迷い。いわば魔法、呪術。

構図は「魔」と「也」、「魔を也もの」「魔て在る存在」と言う意味を仄めかす。

今代既聞(こんだいきもん):

As opposed to 前代未聞. Something that once might have been considered revolutionary, but is now commonplace.

前代未聞の逆。前代で革命的なものが今ありふれたものに成った状態。

亜生(あい):

AI, artificial intelligence.

生きていない物、亜生。人工知能を示す。また当て字に発音を合わせた。

N5 to N1 in one year

2025-01-28T00:00:00-08:00

I took the Japanese N1 exam on December 1st 2024, got the result today, and passed with a score of 108/180.

My journey

I first started learning Japanese about 3 years ago, with Duolingo. I did one lesson a day for like 2 years, barely learnt anything apart from kana. Calling me N5 level would be really, really, really generous. But as it is a 2 year headstart, I titled this post “N5 to N1 in one year” to be conservative.

I seriously started learning Japanese on October 18th 2023, using jpdb.io. You can see all my statistics by going to https://jpdb-stats.andmore.coffee and inserting my stats json. Download json from here.

I've spent a cumulative 433 hours on jpdb, learning a total of 18627 non-redundant words. In the first few months, I was doing around 100 new words per day on average. I've spent 66 hours reading books (同じ夢を見ていた, 本好きの下剋上　第一部　第一巻, 一瞬の永遠を、きみへ, 本好きの下剋上　第一部　第二巻) for a total of 650k characters. I played about 50 hours of Divinity: Original Sin 2 in Japanese, and watched an indeterminate amount of anime.

Advice

Learn kana first; you can do this in just a few weeks. Go on jpdb.io, find a simple slice-of-life anime (do NOT pick anything with a difficult topic), and memorize as many of its most frequent words as you can (I started with Horimiya). Once you have a decent starting point, you can start watching the anime; hearing the words in context will help you remember them a lot. I would highly recommend donating to the jpdb Patreon, as that gives you access to its mpv plugin, which is a really convenient way of mining words. A free alternative would be setting up Memento with Anki. It's 90% there, but if you can afford it, I would still recommend the jpdb version.

I personally did not bother learning any grammar through any grammar book, I simply immersed and my brain put together all the verb conjugations, particles etc for me (although I did learn a bit from Duolingo, so this may be biased. But still I would not say I learned very much from Duolingo).

After you have watched and mined ~100-200 episodes of anime, you will have built a good vocabulary base, and can start reading books. I would recommend going to the prebuilt decks in JPDB, seeing which books have the highest coverage for you, and picking one you like from there. I would personally recommend 一瞬の永遠を、きみへ. It was simple in vocabulary and grammar, but was a good read regardless.

Books will force you to actually learn grammar, as you won't have as many context clues as in anime, but since you will have a vocabulary base, putting it together subconsciously will still be possible.

After you’ve read a few books and watched a few hundred episodes of anime, pretty much any kind of immersion with a popup dictionary will be a piece of cake, and you can start watching any genre of content you want. Once you are at this point, immersing and getting better takes barely any effort. If you are a weeb, you can just watch what you would watch normally, but with Japanese subtitles. Just try to keep up with your reviews.

Simplest Diffusion

2024-08-07T00:00:00-07:00

I present to the world, the simplest diffusion model! Using only a fully connected feedforward network, and the dead simple sampling function of

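import torch

# `device` is assumed to be defined elsewhere in the full code (e.g. "cuda" or "cpu").
def sample(model, bs: int = 9, steps: int = 100):
    # Start from pure Gaussian noise: one flattened 28x28 image per row.
    x = torch.randn(bs, 28*28).to(device)

    # Each step blends a small fraction of the model's prediction into x.
    pred_weight = 4/steps
    x_weight = 1 - pred_weight

    for _ in range(steps):
        pred = model(x)
        x = x * x_weight + pred * pred_weight
    return x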

we can generate images like this from pure noise.

The sampling function could probably be improved, as the images are not perfect (or maybe you need to train the model for longer, only trained for 10 epochs)
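One thing worth checking in the sampling loop is how much of the initial noise survives to the end. Each step multiplies the current `x` by `x_weight = 1 - 4/steps`, so after all the steps the starting noise is scaled by `(1 - 4/steps) ** steps`, which tends to `exp(-4)` for large step counts. A small sketch of that arithmetic (the helper name is made up):

```python
import math

def remaining_noise_fraction(steps: int = 100) -> float:
    """Scale factor applied to the initial noise after running the sampler."""
    pred_weight = 4 / steps
    x_weight = 1 - pred_weight
    return x_weight ** steps

for steps in (10, 100, 1000):
    print(steps, remaining_noise_fraction(steps))
print("limit", math.exp(-4))
```

So with the default 100 steps, under 2% of the original noise remains in the output and the rest comes from the model's predictions. The constant 4 controls that trade-off: a larger constant would leave less residual noise.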

Full code can be seen here

The code is so simple I don't even think it needs any explanation; I'd just be repeating myself.

Transformers in Brief

2024-08-02T00:00:00-07:00

My attempt to intuitively explain transformers in brief.

Overview

Transformers are a type of neural network architecture designed originally to handle text information.

Tokens and the tokenizer.

Problem

You have text data. Unfortunately, machines only speak the language of numbers. You need to convert the text into numbers so the AI can understand it.

Solution

You can make a dictionary for the AI, where you assign each word a number, which is its word id (also called token id).

Here is an example dictionary, containing a very small vocabulary, and a vocabulary size of 9.

Word	Token ID
black	0
grey	1
brown	2
cat	3
dog	4
sleeping	5
playing	6
is	7
the	8

Using this dictionary you can convert your text into a sequence of numbers. For example, with the sentence “The black cat is sleeping” you would get the sequence [8, 0, 3, 7, 5]. (Ignoring the spaces and capitalization).
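The lookup described above is only a few lines of Python. This is a minimal sketch using the 9-word dictionary from the table; the `encode` and `decode` names are just for illustration.

```python
vocab = {
    "black": 0, "grey": 1, "brown": 2,
    "cat": 3, "dog": 4,
    "sleeping": 5, "playing": 6,
    "is": 7, "the": 8,
}

def encode(sentence):
    """Lowercase, split on spaces, and look each word up in the dictionary."""
    return [vocab[word] for word in sentence.lower().split()]

def decode(token_ids):
    """Reverse lookup: token ids back to words."""
    id_to_word = {token_id: word for word, token_id in vocab.items()}
    return " ".join(id_to_word[token_id] for token_id in token_ids)

print(encode("The black cat is sleeping"))  # [8, 0, 3, 7, 5]
print(decode([8, 0, 3, 7, 5]))              # the black cat is sleeping
```

Any word that is not in the dictionary (say, "purple") raises a `KeyError` here, which is exactly the weakness of a fixed word-level dictionary.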

Problem 2

In the given example, the vocabulary size is only 9. In the real world, you would want your AI to know as many words as possible. You could just scale this up to millions of words, but there would still be a problem: the AI would be unable to deal with made-up words, or with words and names not common enough to make it into the dictionary.

Solution

You use a tokenizer!

A tokenizer analyzes a huge corpus of text, and learns which strings are the most frequent. It greedily tries to use the fewest tokens to represent the entire text dataset, given a constraint on the vocabulary size. This means that in the most advanced LLM tokenizers, there is most likely a token for every character, space, punctuation mark, etc, on top of the tokens for most of the words. Not only would the tokenizer learn the most common words, given a large enough vocabulary size, it would also pick up on the most common collocations of words. For example, the phrase “Hello world!” is extremely common, and could potentially be represented by a single token. Whereas for an uncommon name that's very rarely used, the tokenizer would likely have to make it up using multiple tokens.

You can see how the OpenAI tokenizers split up a piece of text here: https://platform.openai.com/tokenizer. Curiously, the OpenAI tokenizer usually has tokens for “{space}+{word}” instead of just “{word}”, so it actually rarely uses the space token. This makes sense if you think about it though, as the tokenizer is far more likely to see words with spaces separating them, so it just learns the tokens along with the space.
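To see the effect on rare words, here is a toy sketch of the encoding side only. The vocabulary is hand-written and hypothetical (a real tokenizer learns it from a corpus, typically with byte-pair encoding merges), and greedy longest-match is a simplification of what real tokenizers do.

```python
import string

# Hypothetical learned vocabulary: a few frequent strings (note the leading
# spaces), plus every single character as a fallback.
LEARNED = ["Hello", " world", " the", " cat", "!"]
VOCAB = set(LEARNED) | set(string.ascii_letters + string.punctuation + " ")

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match: at each position take the longest vocabulary entry that fits."""
    pieces = sorted(vocab, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

print(tokenize("Hello world!"))   # ['Hello', ' world', '!']
print(tokenize("Hello Zorblax"))  # the made-up name falls back to single characters
```

The common phrase costs 3 tokens, while the made-up name costs one token per character. Nothing is ever out of vocabulary, it just gets more expensive to represent.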

Embeddings

Problem

You converted all your text into a sequence of numbers, but these numbers are still meaningless to the AI. You need to convert these token/word ids into some numbers that the AI can actually understand!

Solution

You use embeddings!

You can think of embeddings as yet another dictionary, this time mapping the token ids to a vector of numbers. All of these vectors are learnable parameters. What this means is that instead of them being chosen by humans or an algorithm, the AI will decide these vectors for itself during training. This bridges the gap between numbers we can understand, and the numbers the AI can understand.
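As a sketch of the "yet another dictionary" idea: an embedding table is just a list of vectors indexed by token id. Here the vectors are randomly initialized and never trained (training would adjust them), the dimension of 4 is arbitrary, and the function names are made up.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """One randomly initialized vector per token id; these are the learnable parameters."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Embedding lookup: replace each token id with its vector."""
    return [table[token_id] for token_id in token_ids]

table = make_embedding_table(vocab_size=9, dim=4)
vectors = embed([8, 0, 3, 7, 5], table)  # "The black cat is sleeping"
print(len(vectors), len(vectors[0]))     # 5 4
```

The same token id always maps to the same vector, no matter where in the sentence it appears.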

Positional Encoding

Problem

Further ahead, we will want every token to interact with every other token. However, in that situation, the order of the words will be lost. “he killed the lion” and “the lion killed him” are two different sentences. In embedding form they still retain their positions, but when figuring out their interactions you consider every possible pair of tokens, and in doing so the order of the tokens is lost.

Solution

You use positional encoding!

Positional encodings are just a bunch of sine waves, of different frequencies, so if you sample the positional embedding, you will get different values for each position.

The reason you don't just use a simple range of increasing numbers (i.e. [0, 1, 2, 3, 4, …]) is that AIs generally don't play well with big numbers, and the sequence length could go up to millions in the most advanced models. So to keep the numbers between -1 and 1, you use sine waves.

As the positional encoding is constant, it is just precomputed for the entire range of sequence lengths that the AI will see during training, and just looked up when needed.

The positional encodings are added to the token embeddings, and the result is fed into the attention layer.
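A minimal sketch of the sine-wave encoding, assuming the common formulation where even dimensions use sine, odd dimensions use cosine, and each pair shares a frequency. The base of 10000 and the function names are assumptions of this example.

```python
import math

def positional_encoding(seq_len, dim, base=10000.0):
    """Precompute a (seq_len x dim) table of sinusoidal position vectors."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(dim):
            # Dimensions 2k and 2k+1 share the same frequency.
            freq = 1.0 / base ** ((i - i % 2) / dim)
            angle = pos * freq
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings, pos_table):
    """Element-wise sum of each token embedding with the encoding for its position."""
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pos_table)]

pe = positional_encoding(seq_len=50, dim=8)
print(pe[0])  # position 0: sines are 0 and cosines are 1
print(pe[1])
```

Every value stays between -1 and 1 however long the sequence gets, and each position still gets its own distinct vector.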

Attention

Problem

You have a sequence of token embeddings, but the length of the sequence can vary. You want to be able to support sentences of all sizes, paragraphs, essays, books, etc. In order to support all these extremely varying sequence lengths, there is no obvious way to use a fully connected neural network.

Solution

The attention architecture solves this problem by using the same set of weight matrices for every token embedding in the sequence, within the same attention layer (different layers have different weight matrices, and multi-headed attention layers can have multiples of the weight matrices, but the number of these matrices won't change depending on the sequence length).

In a single-headed attention layer you have 3 weight matrices, \(W_q\), \(W_k\), and \(W_v\): the query, key, and value matrices respectively.

You obtain the query, key, and value vectors for each token embedding by multiplying the token embedding by the query, key, and value matrices respectively.

\[q_i = W_q \cdot x_i\] \[k_i = W_k \cdot x_i\] \[v_i = W_v \cdot x_i\]

where \(x_i\) is the token embedding for the \(i\)th token in the sequence.

Then, using the key and the query, you can compute how much “attention” a word pays to another word (or itself). The below can be read as “the amount of attention token \(x_n\) pays to token \(x_m\)”.

\[a[x_m, x_n] = \operatorname{softmax}_m\left[\frac{q_n \cdot k_m^T}{\sqrt{D_k}}\right]\]

where \(D_k\) is the dimension of the key and query vectors.

The numerator is the dot product of the query and the transpose of the key. This ends up being a scalar value since both the query and the key have the same dimension. This scalar value represents how similar the two vectors are: if they are similar, the value will be higher, and if not, the value will be lower. However, the values can become very large, as the dot product is a sum of products of the elements of the vectors, and as discussed above, AI training doesn't play well with big numbers. To counteract this, we have the denominator.

The denominator is the square root of the dimension of the vectors. This is to ensure training stability: if the values going into the softmax function vary too widely, the gradients will be unstable.

The softmax function is then applied to the term, which normalizes the values to between 0 and 1, so that for a fixed \(x_n\) the values over all \(m\) sum to 1. In other words, \(a[x_1, x_n], \ldots, a[x_N, x_n]\) sum to 1. So for each \(x_n\) we have a list of how much attention it pays to every \(x_m\).
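A small numeric sketch of why the scaling matters. The dot products below are made-up values for 64-dimensional vectors, and `softmax` and `scaled` are helper names for this example.

```python
import math

def softmax(scores):
    """Exponentiate and normalize so the outputs are in (0, 1) and sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled(scores, d_k):
    """Divide every score by sqrt(d_k), as in the denominator above."""
    return [s / math.sqrt(d_k) for s in scores]

raw = [32.0, 16.0, 8.0]  # made-up dot products
print(softmax(raw))              # almost all the weight lands on the first score
print(softmax(scaled(raw, 64)))  # a much softer distribution
```

Without the scaling the softmax is nearly one-hot, so the other tokens contribute (and receive) almost no gradient. After dividing by sqrt(64) = 8 the same scores give a distribution where every token still gets some weight.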

Going back to the example sentence “The black cat is sleeping”, this could be a hypothetical attention matrix

\(x_n\)	\(x_1\) “The”	\(x_2\) “black”	\(x_3\) “cat”	\(x_4\) “is”	\(x_5\) “sleeping”
“The”	0.15	0.05	0.8	0.0	0.0
“black”	0.0	0.1	0.9	0.0	0.0
“cat”	0.1	0.5	0.2	0.05	0.15
“is”	0.00	0.00	0.1	0.2	0.7
“sleeping”	0.1	0.0	0.8	0.05	0.05

Here's my thought process behind the numbers I made up (no real model necessarily learns such a matrix; this is something hypothetical I made up to aid understanding):

“The”: Pays the most attention to “cat”, because in this sentence it marks “cat” as the subject.

“black”: Again pays the most attention to “cat”, because it is the cat that is black.

“cat”: Pays the most attention to “black” and “sleeping”, but it also pays some attention to itself.

“is”: Pays the most attention to “sleeping”, as that is the action being done.

“sleeping”: Pays the most attention to “cat”, as it is the one sleeping.

Using these attention weights, and the previously calculated value vectors, we will now begin to modify the token embeddings.

\[sa_n[x_1, .., x_N] = \sum_{m=1}^{N} a[x_m, x_n] \cdot v_m\]

This is called self attention, as the keys, queries, and values, all come from the same sequence of token embeddings. (As opposed to cross attention, where not all the vectors are from the same sequence).

The attention scalars that we just calculated act as weights that decide how much each value vector affects the \(sa_n\) output vector.
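Putting the pieces together, here is a minimal single-head self-attention sketch in plain Python. It uses the usual convention that the query comes from the token \(x_n\) being updated and the key from the token \(x_m\) it looks at. The toy weights and inputs are made up, and a real implementation would use batched matrix multiplications instead of loops.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v):
    """Return (outputs, weights); weights[n][m] is how much x_n attends to x_m."""
    Q = [matvec(W_q, x) for x in X]
    K = [matvec(W_k, x) for x in X]
    V = [matvec(W_v, x) for x in X]
    d_k = len(K[0])
    outputs, weights = [], []
    for q_n in Q:
        scores = [sum(a * b for a, b in zip(q_n, k_m)) / math.sqrt(d_k) for k_m in K]
        a_n = softmax(scores)  # sums to 1 over m
        weights.append(a_n)
        # sa_n is the attention-weighted sum of all the value vectors.
        outputs.append([sum(a * v[j] for a, v in zip(a_n, V)) for j in range(len(V[0]))])
    return outputs, weights

identity = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X, identity, identity, identity)
print(weights)
print(outputs)
```

The same three weight matrices are reused for every token, which is why the function works unchanged on a sequence of 3 tokens or 3000.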

Continuing using the example sentence “The black cat is sleeping”, and using the hypothetical attention matrix, here is an imagined explanation for what the values represent, and what they combine into for all the \(sa_n\) vectors.

\(sa_1\): “The” goes from an abstract embedding that contains the meaning of “subject marker”, to a more concrete meaning of “The cat”. The attention weight for \(m = 3\), which is “cat”, is the highest, which encodes the meaning of “cat” into the \(sa_1\) vector.

\(sa_2\): “black” goes from just the colour “black” to a more nuanced meaning of “black (fur, breed of cat, etc)”, as it pays the most attention to “cat”.

\(sa_3\): “cat” goes from an embedding representing the concept of the average cat to “The (singular subject) black sleeping cat”, due to the attention it pays to all the other tokens.

\(sa_4\): “is” goes from another abstract verb embedding to a more nuanced embedding of “(animal) is (sleeping)”, due to the attention it pays to “cat” and “sleeping”.

\(sa_5\): “sleeping” goes from a verb embedding more heavily weighted to humans (as that's the most common use of the word) to now representing cats sleeping. Perhaps it has the added nuance of “purring” now too!

Transformer Architecture

WIP. I don't have a good enough intuitive explanation for why the transformer architecture is the way it is.