<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
    <channel>
      <title>David Oniani</title>
      <link>https://oniani.org</link>
      <description>David Oniani&#x27;s Website</description>
      <generator>Zola</generator>
      <language>en</language>
      <atom:link href="https://oniani.org/rss.xml" rel="self" type="application/rss+xml"/>
      <lastBuildDate>Thu, 12 Mar 2026 00:00:00 +0000</lastBuildDate>
      <item>
          <title>Autoregressive Transformer</title>
          <pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate>
          <author>David Oniani</author>
          <link>https://oniani.org/blog/transformer/</link>
          <guid>https://oniani.org/blog/transformer/</guid>
          <description xml:base="https://oniani.org/blog/transformer/">&lt;p&gt;I had been planning to make a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;@davidoniani&#x2F;videos&quot;&gt;YouTube video&lt;&#x2F;a&gt; about this for quite some time. However, just as I
was preparing to release it, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=kCc8FmEb1nY&quot;&gt;Andrej Karpathy released an excellent video&lt;&#x2F;a&gt; on the
topic, and it quickly went viral. After that, I decided to hold off on recording my version. I am
now turning the material into a written guide on the autoregressive transformer architecture. We
will implement a GPT-like &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Large_language_model&quot;&gt;Large Language Model (LLM)&lt;&#x2F;a&gt; from scratch in Python using
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;pytorch.org&#x2F;&quot;&gt;PyTorch&lt;&#x2F;a&gt; as the only dependency. I assume familiarity with PyTorch, and for those new to
it, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cs230.stanford.edu&#x2F;blog&#x2F;pytorch&#x2F;&quot;&gt;Introduction to PyTorch Code Examples from Stanford&lt;&#x2F;a&gt; provides a helpful starting
point. Without further ado, let us get started!&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The complete code for the GPT-like autoregressive transformer implemented in this article is
available here: &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;gist.github.com&#x2F;oniani&#x2F;c8a346c59b2330869febc4c9c36b45fb&quot;&gt;link&lt;&#x2F;a&gt;. In 222 lines, it automatically downloads the dataset, tokenizes the
text, pretrains the model, and generates sample text.&lt;&#x2F;p&gt;
&lt;p&gt;Note that the model implemented and trained in this article, also referred to as the base model,
is a sentence completer rather than a chatbot. Converting it into a chatbot typically requires
Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), which are beyond the scope
of this article. Still, a pretrained generative LLM is a powerful tool for sampling knowledge.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Given input text, our objective is to generate output text conditioned on that sequence. In
traditional transformer models, excluding &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;aclanthology.org&#x2F;2025.acl-long.453&#x2F;&quot;&gt;byte-latent transformers&lt;&#x2F;a&gt;, the initial stage
is always &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Large_language_model#Tokenization&quot;&gt;tokenization&lt;&#x2F;a&gt;, regardless of whether the architecture is encoder-only
(e.g., &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1810.04805&quot;&gt;BERT&lt;&#x2F;a&gt;), encoder-decoder (e.g., &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;jmlr.org&#x2F;papers&#x2F;v21&#x2F;20-074.html&quot;&gt;T5&lt;&#x2F;a&gt;), or decoder-only (e.g., &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cdn.openai.com&#x2F;better-language-models&#x2F;language_models_are_unsupervised_multitask_learners.pdf&quot;&gt;GPT-2&lt;&#x2F;a&gt;). One
straightforward method is character-level tokenization, which maps each individual character to a
unique token ID. The following zero-dependency Python implementation handles both encoding and
decoding for arbitrary input text:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span&gt; string&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;class&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; CharacterLevelTokenizer&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 4&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;A character-level tokenizer that treats individual characters as tokens.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 5&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 6&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; __init__&lt;&#x2F;span&gt;&lt;span&gt;(self, text:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; str&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; string.printable) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 7&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Builds the vocabulary from the given text.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 8&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 9&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.vocab: list[&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;str&lt;&#x2F;span&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; sorted&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;set&lt;&#x2F;span&gt;&lt;span&gt;(text))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.vocab_size:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; len&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.vocab)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;11&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.char_to_token: dict[&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;str&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; {char: idx&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; for&lt;&#x2F;span&gt;&lt;span&gt; idx, char&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; enumerate&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.vocab)}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;12&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;13&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; encode&lt;&#x2F;span&gt;&lt;span&gt;(self, text:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; str&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; list[&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span&gt;]:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;14&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Converts a string into a list of token IDs.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;15&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        return&lt;&#x2F;span&gt;&lt;span&gt; [&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.char_to_token[char]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; for&lt;&#x2F;span&gt;&lt;span&gt; char&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span&gt; text]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;17&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;18&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; decode&lt;&#x2F;span&gt;&lt;span&gt;(self, token_ids: list[&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span&gt;]) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; str&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;19&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Converts a list of token IDs back into a string.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;20&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;21&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        return&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;.join(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.vocab[token_id]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; for&lt;&#x2F;span&gt;&lt;span&gt; token_id&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span&gt; token_ids)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;22&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;23&lt;&#x2F;span&gt;&lt;span&gt;tokenizer&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; CharacterLevelTokenizer()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;24&lt;&#x2F;span&gt;&lt;span&gt;enc&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; tokenizer.encode(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;robot&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;    # -&amp;gt; [87, 84, 71, 84, 89]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;25&lt;&#x2F;span&gt;&lt;span&gt;dec&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; tokenizer.decode(enc)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;        # -&amp;gt; &amp;#39;robot&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As shown above, &lt;code&gt;CharacterLevelTokenizer&lt;&#x2F;code&gt; treats each individual character as a separate token.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note that while character-level tokenization is conceptually straightforward, most production
systems use subword tokenizers to strike a better balance between vocabulary size and
representational capacity. By capturing frequent character sequences as single units, subword
algorithms, such as &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Byte-pair_encoding&quot;&gt;Byte-Pair Encoding (BPE)&lt;&#x2F;a&gt; or &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;google-research&#x2F;bert#tokenization&quot;&gt;WordPiece&lt;&#x2F;a&gt;, significantly
enhance computational efficiency compared to more granular methods. However, in this article, we
will be using &lt;code&gt;CharacterLevelTokenizer&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
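&lt;p&gt;To make the contrast with character-level tokenization concrete, the core of BPE can be sketched in a few lines: count adjacent token pairs and merge the most frequent pair into a new vocabulary entry, repeating until the vocabulary reaches the desired size. The snippet below is an illustrative toy showing a single merge step, not a production tokenizer, and the helper names (&lt;code&gt;most_frequent_pair&lt;&#x2F;code&gt;, &lt;code&gt;merge_pair&lt;&#x2F;code&gt;) are ours rather than from any library:&lt;&#x2F;p&gt;

```python
from collections import Counter


def most_frequent_pair(tokens: list[str]) -> tuple[str, str]:
    """Returns the most frequent pair of adjacent tokens."""

    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)


def merge_pair(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Replaces every occurrence of `pair` with a single merged token."""

    merged, idx = [], 0
    while idx < len(tokens):
        if idx + 1 < len(tokens) and (tokens[idx], tokens[idx + 1]) == pair:
            merged.append(tokens[idx] + tokens[idx + 1])
            idx += 2
        else:
            merged.append(tokens[idx])
            idx += 1
    return merged


tokens = list("low lower lowest")
pair = most_frequent_pair(tokens)  # -> ('l', 'o')
merged = merge_pair(tokens, pair)  # -> ['lo', 'w', ' ', 'lo', 'w', 'e', 'r', ' ', 'lo', 'w', 'e', 's', 't']
```

&lt;p&gt;After enough such merges, frequent words like &lt;code&gt;low&lt;&#x2F;code&gt; collapse into single tokens, which is why subword vocabularies yield much shorter sequences than character-level ones.&lt;&#x2F;p&gt;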
&lt;p&gt;At this stage, it may help to look at the transformer architecture as a whole to get a sense of its
overall structure. After that, we can break it down and examine each component step by step to see
how everything fits together. It may feel overwhelming at first, which is natural, but do not be
discouraged: in the end, it is simply linear algebra arranged in a particular way.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;from&lt;&#x2F;span&gt;&lt;span&gt; dataclasses&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; import&lt;&#x2F;span&gt;&lt;span&gt; dataclass&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span&gt; torch&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span&gt; torch.nn&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; as&lt;&#x2F;span&gt;&lt;span&gt; nn&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 4&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span&gt; torch.nn.functional&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; as&lt;&#x2F;span&gt;&lt;span&gt; F&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 5&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 6&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;@dataclass&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt;frozen&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;True&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # We like our dataclasses frozen!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 7&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;class&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; Config&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 8&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;Transformer config.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 9&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span&gt;    vocab_size:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Tokenizer vocabulary size&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;11&lt;&#x2F;span&gt;&lt;span&gt;    block_size:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Max sequence length (context window)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;12&lt;&#x2F;span&gt;&lt;span&gt;    n_layer:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;     # Number of transformer layers&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;13&lt;&#x2F;span&gt;&lt;span&gt;    n_head:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;      # Attention heads per layer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;14&lt;&#x2F;span&gt;&lt;span&gt;    n_embd:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;      # Embedding dimension (must be divisible by `n_head`)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;15&lt;&#x2F;span&gt;&lt;span&gt;    dropout:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; float&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;   # Dropout probability&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;    bias:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; bool&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;       # Whether `nn.Linear` and `nn.LayerNorm` use bias&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;17&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;18&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;    @&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;property&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;19&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; head_size&lt;&#x2F;span&gt;&lt;span&gt;(self) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;20&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Returns the per-head dimension (embedding is split evenly across attention heads).&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;21&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;22&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        return&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.n_embd&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; &#x2F;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.n_head&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;23&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;24&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;class&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; Transformer&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;nn&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;Module&lt;&#x2F;span&gt;&lt;span&gt;):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;25&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;Autoregressive transformer language model.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;26&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;27&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; __init__&lt;&#x2F;span&gt;&lt;span&gt;(self, cfg: Config) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;28&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Initializes the building blocks of the transformer.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;29&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;30&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        super&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;__init__&lt;&#x2F;span&gt;&lt;span&gt;()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;31&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.block_size&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; cfg.block_size&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;32&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;33&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.tok_emb_table&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Embedding(cfg.vocab_size, cfg.n_embd)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;34&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.pos_emb_table&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Embedding(cfg.block_size, cfg.n_embd)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;35&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;36&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.blocks&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Sequential(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt;[Block(cfg)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; for&lt;&#x2F;span&gt;&lt;span&gt; _&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; range&lt;&#x2F;span&gt;&lt;span&gt;(cfg.n_layer)])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;37&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.ln&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.LayerNorm(cfg.n_embd,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; bias&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;cfg.bias)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;38&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.proj&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Linear(cfg.n_embd, cfg.vocab_size)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;39&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;40&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;        # Weight tying: reduces the total number of parameters without degrading accuracy&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;41&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;        # Reference: https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1608.05859&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;42&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.proj.weight&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.tok_emb_table.weight&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;43&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;44&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; forward&lt;&#x2F;span&gt;&lt;span&gt;(self, token_ids: torch.Tensor, targets: torch.Tensor&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; |&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; tuple&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;45&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Computes logits and optional loss.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;46&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span&gt;        B, T&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; token_ids.shape&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;48&lt;&#x2F;span&gt;&lt;span&gt;        tok_emb&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.tok_emb_table(token_ids)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;49&lt;&#x2F;span&gt;&lt;span&gt;        pos_emb&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.pos_emb_table(torch.arange(T,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; device&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;token_ids.device))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;50&lt;&#x2F;span&gt;&lt;span&gt;        emb&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; tok_emb&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span&gt; pos_emb&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;51&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;52&lt;&#x2F;span&gt;&lt;span&gt;        out&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.ln(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.blocks(emb))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;53&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        if&lt;&#x2F;span&gt;&lt;span&gt; targets&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; is&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;54&lt;&#x2F;span&gt;&lt;span&gt;            logits&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.proj(out[:, [&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;], :])&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Project only the last position; [-1] keeps the time dimension&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;55&lt;&#x2F;span&gt;&lt;span&gt;            loss&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;56&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        else&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;57&lt;&#x2F;span&gt;&lt;span&gt;            logits&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.proj(out)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;58&lt;&#x2F;span&gt;&lt;span&gt;            loss&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; F.cross_entropy(logits.view(B&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; *&lt;&#x2F;span&gt;&lt;span&gt; T,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; -&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;), targets.view(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;59&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;60&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        return&lt;&#x2F;span&gt;&lt;span&gt; logits, loss&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;61&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;62&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;    @torch.no_grad&lt;&#x2F;span&gt;&lt;span&gt;()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;63&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; generate&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;64&lt;&#x2F;span&gt;&lt;span&gt;        self,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;65&lt;&#x2F;span&gt;&lt;span&gt;        token_ids: torch.Tensor,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;66&lt;&#x2F;span&gt;&lt;span&gt;        max_new_token_ids:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;67&lt;&#x2F;span&gt;&lt;span&gt;        temperature:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; float&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 0.7&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;68&lt;&#x2F;span&gt;&lt;span&gt;        top_k:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; |&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;69&lt;&#x2F;span&gt;&lt;span&gt;    ) -&amp;gt; torch.Tensor:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;70&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Generates token IDs autoregressively.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;71&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;72&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.eval()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;73&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        for&lt;&#x2F;span&gt;&lt;span&gt; _&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; range&lt;&#x2F;span&gt;&lt;span&gt;(max_new_token_ids):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;74&lt;&#x2F;span&gt;&lt;span&gt;            logits, _&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;(token_ids[:,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; -&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.block_size :])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;75&lt;&#x2F;span&gt;&lt;span&gt;            logits&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; logits[:,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; -&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, :]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; &#x2F;&lt;&#x2F;span&gt;&lt;span&gt; temperature&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;76&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;77&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;            if&lt;&#x2F;span&gt;&lt;span&gt; top_k&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; is not&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;78&lt;&#x2F;span&gt;&lt;span&gt;                k&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; min&lt;&#x2F;span&gt;&lt;span&gt;(top_k, logits.size(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;79&lt;&#x2F;span&gt;&lt;span&gt;                threshold&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.topk(logits, k).values[:, [&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;]]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;80&lt;&#x2F;span&gt;&lt;span&gt;                logits[logits&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; &amp;lt;&lt;&#x2F;span&gt;&lt;span&gt; threshold]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = -&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;float&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;inf&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;81&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;82&lt;&#x2F;span&gt;&lt;span&gt;            probs&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.softmax(logits,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; dim&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;83&lt;&#x2F;span&gt;&lt;span&gt;            next_token&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.multinomial(probs,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; num_samples&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;84&lt;&#x2F;span&gt;&lt;span&gt;            token_ids&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.cat((token_ids, next_token),&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; dim&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;85&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        return&lt;&#x2F;span&gt;&lt;span&gt; token_ids&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let us focus on the &lt;code&gt;forward&lt;&#x2F;code&gt; method, the core inference function that is also used during
generation. After the input text is tokenized, &lt;code&gt;tok_emb&lt;&#x2F;code&gt; looks up a learned embedding for each
token, representing that token&#x27;s meaning as a numerical tensor. However, these embeddings do not
encode token order. To incorporate positional information, we use learned positional embeddings
computed in &lt;code&gt;pos_emb&lt;&#x2F;code&gt; rather than &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;kazemnejad.com&#x2F;blog&#x2F;transformer_architecture_positional_encoding&#x2F;&quot;&gt;fixed sinusoidal encodings&lt;&#x2F;a&gt;,
as learned embeddings are more expressive and can adapt to task-specific positional patterns. The
token and positional embeddings are then combined through simple addition to form a unified
representation &lt;code&gt;emb&lt;&#x2F;code&gt; that encodes both meaning and position. This additive approach is sufficient:
it breaks permutation symmetry and allows the attention mechanism to infer and model positional
structure without requiring more complex operations.&lt;&#x2F;p&gt;
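&lt;p&gt;The token-plus-position addition can be sketched in isolation. The sizes below are toy values
chosen for illustration, not the article&#x27;s &lt;code&gt;Config&lt;&#x2F;code&gt; defaults; note how broadcasting adds the
&lt;code&gt;(T, n_embd)&lt;&#x2F;code&gt; positional table to every batch row of the &lt;code&gt;(B, T, n_embd)&lt;&#x2F;code&gt; token embeddings.&lt;&#x2F;p&gt;

```python
import torch
import torch.nn as nn

# Toy sizes for illustration only (not the article's Config values)
vocab_size, block_size, n_embd = 100, 8, 16

tok_emb_table = nn.Embedding(vocab_size, n_embd)  # what each token means
pos_emb_table = nn.Embedding(block_size, n_embd)  # where each token sits

token_ids = torch.randint(0, vocab_size, (2, block_size))        # (B, T)
tok_emb = tok_emb_table(token_ids)                               # (B, T, n_embd)
pos_emb = pos_emb_table(torch.arange(block_size))                # (T, n_embd)
emb = tok_emb + pos_emb  # broadcasting: position added to every batch row
print(emb.shape)  # torch.Size([2, 8, 16])
```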
&lt;blockquote&gt;
&lt;p&gt;Learned positional embeddings are simple to implement and work well in practice, but they
typically tie a model to the maximum sequence length used during training. Many modern
architectures instead adopt &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2104.09864&quot;&gt;Rotary Position Embedding (RoPE)&lt;&#x2F;a&gt;, which encodes position by
rotating query and key vectors with position-dependent angles. This design allows attention to
represent relative distances between tokens and often extrapolates more gracefully to longer
contexts. For simplicity, however, this article uses learned positional embeddings.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
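&lt;p&gt;To make the contrast concrete, here is a minimal, illustrative sketch of the rotation at the
heart of RoPE; it is not part of this article&#x27;s model, and real implementations interleave
channel pairs and cache the angles. Each pair of channels is rotated by an angle proportional to
the token&#x27;s position, so rotations are orthogonal (norms are preserved) and position 0 is left
unchanged.&lt;&#x2F;p&gt;

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotates channel pairs of (B, T, D) by position-dependent angles (sketch)."""
    _, T, D = x.shape
    assert D % 2 == 0, "RoPE pairs up channels, so D must be even"
    half = D // 2
    freqs = base ** (-torch.arange(half) / half)         # (half,) per-pair frequencies
    angles = torch.arange(T)[:, None] * freqs[None, :]   # (T, half) position * frequency
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation applied to each (x1_i, x2_i) channel pair
    return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)

x = torch.randn(2, 6, 8)
y = rope(x)
# Rotation is orthogonal: per-token norms are unchanged, and position 0 is the identity
print(torch.allclose(y[:, 0], x[:, 0], atol=1e-6))  # True
```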
&lt;p&gt;Although the line &lt;code&gt;out = self.ln(self.blocks(emb))&lt;&#x2F;code&gt; appears compact, it encapsulates a substantial
portion of the transformer&#x27;s computational core. Here, &lt;code&gt;self.blocks&lt;&#x2F;code&gt; represents a stack of
transformer blocks, each composed of Multi-Head Attention (MHA) mechanisms and Multilayer
Perceptrons (MLPs) that progressively refine the token embeddings by modeling complex semantic and
contextual relationships across the sequence. Following these deep transformations, &lt;code&gt;self.ln&lt;&#x2F;code&gt;
applies &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;docs.pytorch.org&#x2F;docs&#x2F;stable&#x2F;generated&#x2F;torch.nn.LayerNorm.html&quot;&gt;layer normalization&lt;&#x2F;a&gt; to stabilize the network and ensure well-behaved gradients.
From there, the forward pass branches depending on the objective: during inference (when &lt;code&gt;targets&lt;&#x2F;code&gt;
are omitted), the model efficiently isolates the last token&#x27;s representation before projecting it
into vocabulary-sized logits, since predicting the next word only requires this final aggregated
context. Conversely, during training, the entire sequence is projected and the resulting logits and
targets are flattened to combine the batch and time dimensions, satisfying PyTorch&#x27;s
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;docs.pytorch.org&#x2F;docs&#x2F;stable&#x2F;generated&#x2F;torch.nn.functional.cross_entropy.html&quot;&gt;&lt;code&gt;F.cross_entropy&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; loss requirements.&lt;&#x2F;p&gt;
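&lt;p&gt;The flattening step in the training branch can be checked with toy shapes (the sizes here are
illustrative, not the article&#x27;s defaults): &lt;code&gt;F.cross_entropy&lt;&#x2F;code&gt; expects &lt;code&gt;(N, C)&lt;&#x2F;code&gt; logits and &lt;code&gt;(N,)&lt;&#x2F;code&gt;
class indices, so the batch and time dimensions are merged before computing the loss.&lt;&#x2F;p&gt;

```python
import torch
import torch.nn.functional as F

B, T, V = 2, 4, 10  # toy batch, time, and vocabulary sizes
logits = torch.randn(B, T, V)            # per-position vocabulary scores
targets = torch.randint(0, V, (B, T))    # next-token IDs for each position

# Merge batch and time: (B, T, V) -> (B*T, V) and (B, T) -> (B*T,)
loss = F.cross_entropy(logits.view(B * T, -1), targets.view(-1))
print(loss.shape)  # torch.Size([]) -- a scalar averaged over all B*T positions
```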
&lt;p&gt;While these final routing steps handle output formatting and the training objective, they are not
where the model&#x27;s main representational power resides. That capability comes from the repeated
attention and feed-forward layers inside &lt;code&gt;self.blocks&lt;&#x2F;code&gt;. To understand how the model builds
contextual meaning across a sequence, we will next unpack &lt;code&gt;self.blocks(emb)&lt;&#x2F;code&gt; and examine the &lt;code&gt;Block&lt;&#x2F;code&gt;
class along with its core components, &lt;code&gt;AttentionHead&lt;&#x2F;code&gt; and &lt;code&gt;MultiHeadAttention&lt;&#x2F;code&gt;, to see how they
interact under the hood.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;class&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; AttentionHead&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;nn&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;Module&lt;&#x2F;span&gt;&lt;span&gt;):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;A single causal self-attention head.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 4&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; __init__&lt;&#x2F;span&gt;&lt;span&gt;(self, cfg: Config) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 5&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Initializes QKV projection and dropout, and caches the causal mask to avoid recomputing it.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 6&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 7&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        super&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;__init__&lt;&#x2F;span&gt;&lt;span&gt;()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 8&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.qkv&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Linear(cfg.n_embd,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; *&lt;&#x2F;span&gt;&lt;span&gt; cfg.head_size,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; bias&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;cfg.bias)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 9&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.dropout&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Dropout(cfg.dropout)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.register_buffer(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;mask&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, torch.tril(torch.ones(cfg.block_size, cfg.block_size)))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;11&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;12&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; forward&lt;&#x2F;span&gt;&lt;span&gt;(self, x: torch.Tensor) -&amp;gt; torch.Tensor:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;13&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Computes masked single-head self-attention for the input tensor.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;14&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;15&lt;&#x2F;span&gt;&lt;span&gt;        _, T, D&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; x.shape&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;        q, k, v&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.qkv(x).split(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.qkv.out_features&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; &#x2F;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; dim&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;17&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;18&lt;&#x2F;span&gt;&lt;span&gt;        attn_scores&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; q&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; @&lt;&#x2F;span&gt;&lt;span&gt; k.transpose(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; -&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;19&lt;&#x2F;span&gt;&lt;span&gt;        attn_scores&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; attn_scores&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; *&lt;&#x2F;span&gt;&lt;span&gt; D&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;**-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0.5&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Prevent softmax from blowing up&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;20&lt;&#x2F;span&gt;&lt;span&gt;        attn_scores&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; attn_scores.masked_fill(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.mask[:T, :T]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; ==&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 0&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; float&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;-inf&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;))&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Mask future tokens&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;21&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;22&lt;&#x2F;span&gt;&lt;span&gt;        attn_weights&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.softmax(attn_scores,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; dim&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;23&lt;&#x2F;span&gt;&lt;span&gt;        attn_weights&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.dropout(attn_weights)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;24&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        return&lt;&#x2F;span&gt;&lt;span&gt; attn_weights&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; @&lt;&#x2F;span&gt;&lt;span&gt; v&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;25&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;26&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;class&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; MultiHeadAttention&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;nn&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;Module&lt;&#x2F;span&gt;&lt;span&gt;):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;27&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;Multi-head attention (MHA).&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;28&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;29&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; __init__&lt;&#x2F;span&gt;&lt;span&gt;(self, cfg: Config) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;30&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Initializes multi-head self-attention with output projection and dropout.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;31&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        super&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;__init__&lt;&#x2F;span&gt;&lt;span&gt;()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;33&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.heads&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.ModuleList([AttentionHead(cfg)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; for&lt;&#x2F;span&gt;&lt;span&gt; _&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; range&lt;&#x2F;span&gt;&lt;span&gt;(cfg.n_head)])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;34&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.proj&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Linear(cfg.n_embd, cfg.n_embd)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;35&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.dropout&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Dropout(cfg.dropout)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;36&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;37&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; forward&lt;&#x2F;span&gt;&lt;span&gt;(self, x: torch.Tensor) -&amp;gt; torch.Tensor:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;38&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Computes masked multi-head self-attention for the input tensor.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;39&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;40&lt;&#x2F;span&gt;&lt;span&gt;        out&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.cat([head(x)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; for&lt;&#x2F;span&gt;&lt;span&gt; head&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.heads],&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; dim&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;41&lt;&#x2F;span&gt;&lt;span&gt;        out&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.dropout(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.proj(out))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;42&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        return&lt;&#x2F;span&gt;&lt;span&gt; out&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;43&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;44&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;class&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; Block&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;nn&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;Module&lt;&#x2F;span&gt;&lt;span&gt;):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;45&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;Transformer block.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;46&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; __init__&lt;&#x2F;span&gt;&lt;span&gt;(self, cfg: Config) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;48&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Initializes a transformer block.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;49&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;50&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        super&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;__init__&lt;&#x2F;span&gt;&lt;span&gt;()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;51&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.ln1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.LayerNorm(cfg.n_embd,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; bias&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;cfg.bias)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;52&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.mha&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; MultiHeadAttention(cfg)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;53&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.ln2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.LayerNorm(cfg.n_embd,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; bias&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;cfg.bias)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;54&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.mlp&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Sequential(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;55&lt;&#x2F;span&gt;&lt;span&gt;            nn.Linear(cfg.n_embd,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; *&lt;&#x2F;span&gt;&lt;span&gt; cfg.n_embd,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; bias&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;cfg.bias),&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Flexible: can change to e.g. `4 * cfg.n_embd`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;56&lt;&#x2F;span&gt;&lt;span&gt;            nn.GELU(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;57&lt;&#x2F;span&gt;&lt;span&gt;            nn.Linear(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; *&lt;&#x2F;span&gt;&lt;span&gt; cfg.n_embd, cfg.n_embd,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; bias&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;cfg.bias),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;58&lt;&#x2F;span&gt;&lt;span&gt;            nn.Dropout(cfg.dropout),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;59&lt;&#x2F;span&gt;&lt;span&gt;        )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;60&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;61&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; forward&lt;&#x2F;span&gt;&lt;span&gt;(self, x: torch.Tensor) -&amp;gt; torch.Tensor:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;62&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Performs a forward pass: attention + MLP with residuals and layer norms.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;63&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;64&lt;&#x2F;span&gt;&lt;span&gt;        x&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; x&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.mha(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.ln1(x))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;65&lt;&#x2F;span&gt;&lt;span&gt;        x&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; x&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.mlp(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.ln2(x))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;66&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        return&lt;&#x2F;span&gt;&lt;span&gt; x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;code&gt;AttentionHead&lt;&#x2F;code&gt; class implements a single causal self-attention head. Attention, introduced in
the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1706.03762&quot;&gt;Attention Is All You Need&lt;&#x2F;a&gt; paper, is a directional communication mechanism that
allows tokens in a sequence to exchange information, with each token gathering context from the
others. Each token produces three tensors:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Query: Like asking a librarian a question such as &quot;Which books talk about dinosaurs?&quot; It
represents what someone is searching for and is used to scan the catalog to find relevant matches.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Key: Like the labels or table of contents entries in the library catalog. Each key describes what
a book or page is about and acts as metadata that helps the query determine which sources are
relevant.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Value: Like the actual pages inside the book. After the query matches with the keys, it retrieves
the real information or content stored in the values.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The interaction between queries and keys determines how strongly tokens attend to one another, while
values carry the aggregated content. In an autoregressive transformer, causal self-attention
restricts each token to attend only to previous tokens, preventing future information leakage during
training and inference. Furthermore, the queries, keys, and values all originate from the same input
sequence. Formally, for sequence length \(n\) and head dimension \(d\), the query, key, and value
matrices \(Q, K, V \in \mathbb{R}^{n \times d}\) are obtained by linear projections of the input,
and attention is computed as:&lt;&#x2F;p&gt;
&lt;p&gt;$$\text{Attention}(Q, K, V) = \text{softmax}\big(\frac{Q K^\top}{\sqrt{d}}\big) V$$&lt;&#x2F;p&gt;
&lt;p&gt;where \(QK^\top &#x2F; \sqrt{d}\) computes scaled similarity scores between queries and keys, with the
scaling factor preventing softmax saturation. The softmax converts these scores into attention
weights, which in our library analogy represent the relative importance of each page in providing
the answer. The weighted sum of \(V\) then produces an output that emphasizes the most relevant
information, similar to extracting summarized notes from the most useful pages. In the
implementation, attention is computed in the &lt;code&gt;forward&lt;&#x2F;code&gt; method of the &lt;code&gt;AttentionHead&lt;&#x2F;code&gt; class.&lt;&#x2F;p&gt;
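&lt;p&gt;To see the formula end to end, here is a minimal standalone sketch of causal scaled dot-product
attention; the function name and tensor sizes are illustrative, not part of the model code above:&lt;&#x2F;p&gt;

```python
import math

import torch

def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Single-head causal attention over (T, d) query, key, and value tensors."""
    T, d = q.shape
    scores = q @ k.T / math.sqrt(d)                        # Scaled similarity scores, (T, T)
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # Lower-triangular causal mask
    scores = scores.masked_fill(~mask, float("-inf"))      # Block attention to future tokens
    weights = torch.softmax(scores, dim=-1)                # Each row sums to 1
    return weights @ v                                     # Weighted sum of values

torch.manual_seed(0)
x = torch.randn(5, 8)            # 5 tokens, 8 dimensions; queries, keys, values share the input
out = causal_attention(x, x, x)
print(out.shape)                 # torch.Size([5, 8])
```

&lt;p&gt;Because the first token can attend only to itself, its output row is exactly its own value
vector.&lt;&#x2F;p&gt;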
&lt;p&gt;In &lt;code&gt;MultiHeadAttention&lt;&#x2F;code&gt;, we implement MHA. Conceptually, MHA consists of several parallel
&lt;code&gt;AttentionHead&lt;&#x2F;code&gt; modules whose outputs are concatenated and passed through a final linear projection.
This design allows the model to attend to multiple aspects of the input simultaneously, with each
head learning distinct relational patterns such as syntactic structure, long-range dependencies, and
subtle semantic cues, which enrich the overall representation. We then apply dropout to the
projected output to regularize the MHA module during training. Without this regularization,
different heads may co-adapt and learn redundant patterns, and the projection layer after
concatenation can become overly confident in certain features. Since transformers have large
capacity, this helps reduce the risk of overfitting and encourages the model to learn more diverse
and robust representations.&lt;&#x2F;p&gt;
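&lt;p&gt;As a quick shape check of the concatenate-then-project design, the following sketch uses plain
&lt;code&gt;nn.Linear&lt;&#x2F;code&gt; layers as stand-ins for the attention heads; the sizes are illustrative:&lt;&#x2F;p&gt;

```python
import torch
import torch.nn as nn

n_embd, n_head, T = 64, 4, 10    # Illustrative sizes
head_size = n_embd // n_head     # Each head produces a 16-dimensional slice

# Stand-ins for AttentionHead: each maps (T, n_embd) -> (T, head_size)
heads = [nn.Linear(n_embd, head_size, bias=False) for _ in range(n_head)]
proj = nn.Linear(n_embd, n_embd)  # Final output projection

x = torch.randn(T, n_embd)
out = torch.cat([h(x) for h in heads], dim=-1)  # Concatenation restores width n_embd
out = proj(out)
print(out.shape)                 # torch.Size([10, 64])
```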
&lt;p&gt;The &lt;code&gt;Block&lt;&#x2F;code&gt; class implements a single transformer block, which serves as a core
building unit of the model. The first layer normalization standardizes the input to stabilize
training, and the MHA module allows the block to focus on multiple relationships across the
sequence, with the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1512.03385&quot;&gt;residual connection&lt;&#x2F;a&gt; ensuring that the original information is
preserved. The second layer normalization prepares the data for the feed-forward network, which
expands and transforms each position independently to capture higher-level features, while the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1606.08415&quot;&gt;GELU&lt;&#x2F;a&gt; activation introduces nonlinearity and the dropout regularizes the output. The second
residual connection adds the transformed features back to the input, helping gradients flow more
effectively and enabling the block to learn complex patterns without losing essential information.
With this, we now understand the line &lt;code&gt;out = self.ln(self.blocks(emb))&lt;&#x2F;code&gt; and have everything we need
to train a GPT-like model.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, let us discuss the &lt;code&gt;generate&lt;&#x2F;code&gt; method. This method performs autoregressive text generation
using a transformer-style language model. Starting from an initial sequence of &lt;code&gt;token_ids&lt;&#x2F;code&gt;, it
generates one token at a time for up to &lt;code&gt;max_new_token_ids&lt;&#x2F;code&gt; steps. At each iteration, the model is
given only the most recent &lt;code&gt;block_size&lt;&#x2F;code&gt; tokens, ensuring the input fits within the model’s context
window. The logits corresponding to the last position are extracted and scaled by the &lt;code&gt;temperature&lt;&#x2F;code&gt;
parameter to control the randomness of the output. Optionally, top-k filtering can be applied to
restrict sampling to the &lt;code&gt;k&lt;&#x2F;code&gt; most likely tokens by masking all others. The filtered logits are then
converted to probabilities using &lt;code&gt;softmax&lt;&#x2F;code&gt;, and the next token is sampled stochastically with
&lt;code&gt;torch.multinomial&lt;&#x2F;code&gt;. This sampled token is appended to the sequence, and the process repeats until
the specified number of tokens has been generated, producing the final extended token sequence.&lt;&#x2F;p&gt;
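&lt;p&gt;The steps above can be condensed into a sampling-loop sketch; the function below is a
hypothetical standalone version, not the actual &lt;code&gt;generate&lt;&#x2F;code&gt; method of the model, and it
assumes the model maps a &lt;code&gt;(1, T)&lt;&#x2F;code&gt; batch of token ids to &lt;code&gt;(1, T, vocab_size)&lt;&#x2F;code&gt;
logits:&lt;&#x2F;p&gt;

```python
import torch

@torch.no_grad()
def generate_sketch(model, token_ids, max_new_token_ids, block_size, temperature=1.0, top_k=None):
    """Autoregressively samples up to max_new_token_ids tokens, one per iteration."""
    for _ in range(max_new_token_ids):
        ctx = token_ids[:, -block_size:]             # Crop to the context window
        logits = model(ctx)[:, -1, :] / temperature  # Last position, scaled by temperature
        if top_k is not None:
            kth = torch.topk(logits, top_k).values[:, -1:]
            logits = logits.masked_fill(logits < kth, float("-inf"))  # Keep the k most likely
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # Stochastic sampling
        token_ids = torch.cat([token_ids, next_id], dim=1)  # Append and repeat
    return token_ids
```

&lt;p&gt;Lower temperatures sharpen the distribution toward the most likely token, while higher
temperatures flatten it and increase diversity.&lt;&#x2F;p&gt;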
&lt;p&gt;This completes the implementation of the GPT-like transformer! We will now train the model with
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1711.05101&quot;&gt;AdamW&lt;&#x2F;a&gt; and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2502.16982&quot;&gt;Muon&lt;&#x2F;a&gt;, and generate text.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span&gt; urllib.request&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; sample_batch&lt;&#x2F;span&gt;&lt;span&gt;(data: torch.Tensor, batch_size:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span&gt;, block_size:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; tuple&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 4&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;Randomly samples training sequences for Next-Token Prediction (NTP).&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 5&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 6&lt;&#x2F;span&gt;&lt;span&gt;    idxs&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.randint(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;len&lt;&#x2F;span&gt;&lt;span&gt;(data)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; -&lt;&#x2F;span&gt;&lt;span&gt; block_size, (batch_size,))&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;                    # Random starting positions&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 7&lt;&#x2F;span&gt;&lt;span&gt;    token_ids&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.stack([data[idx : idx&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span&gt; block_size]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; for&lt;&#x2F;span&gt;&lt;span&gt; idx&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span&gt; idxs])&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;        # Input sequences&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 8&lt;&#x2F;span&gt;&lt;span&gt;    targets&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.stack([data[idx&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span&gt; : idx&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span&gt; block_size&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; for&lt;&#x2F;span&gt;&lt;span&gt; idx&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span&gt; idxs])&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Same sequences shifted by +1 (NTP)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 9&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    return&lt;&#x2F;span&gt;&lt;span&gt; token_ids, targets&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;10&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;11&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;with&lt;&#x2F;span&gt;&lt;span&gt; urllib.request.urlopen(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;https:&#x2F;&#x2F;www.gutenberg.org&#x2F;cache&#x2F;epub&#x2F;84&#x2F;pg84.txt&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; as&lt;&#x2F;span&gt;&lt;span&gt; f:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Read Frankenstein&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;12&lt;&#x2F;span&gt;&lt;span&gt;    text&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; f.read().decode(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;utf-8&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;13&lt;&#x2F;span&gt;&lt;span&gt;tokenizer&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; CharacterLevelTokenizer(text)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;14&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;15&lt;&#x2F;span&gt;&lt;span&gt;device&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.device(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;cuda:0&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; if&lt;&#x2F;span&gt;&lt;span&gt; torch.cuda.is_available()&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; else&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;mps&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; if&lt;&#x2F;span&gt;&lt;span&gt; torch.backends.mps.is_available()&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; else&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;cpu&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;cfg&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; Config(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt;vocab_size&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;tokenizer.vocab_size,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; block_size&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; n_layer&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; n_head&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; n_embd&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;64&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; dropout&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0.0&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; bias&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;False&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;17&lt;&#x2F;span&gt;&lt;span&gt;max_steps, log_interval, batch_size, learning_rate&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 2_000&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 10&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 256&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1e-3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;18&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;19&lt;&#x2F;span&gt;&lt;span&gt;data&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.tensor(tokenizer.encode(text),&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; device&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;device)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;20&lt;&#x2F;span&gt;&lt;span&gt;model&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; Transformer(cfg).to(device).train()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;21&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;22&lt;&#x2F;span&gt;&lt;span&gt;n_params, adamw_params, muon_params&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 0&lt;&#x2F;span&gt;&lt;span&gt;, [], []&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;23&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;for&lt;&#x2F;span&gt;&lt;span&gt; param&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span&gt; model.parameters():&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;24&lt;&#x2F;span&gt;&lt;span&gt;    n_params&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +=&lt;&#x2F;span&gt;&lt;span&gt; param.numel()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;25&lt;&#x2F;span&gt;&lt;span&gt;    (adamw_params&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; if&lt;&#x2F;span&gt;&lt;span&gt; param.ndim&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; &amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; else&lt;&#x2F;span&gt;&lt;span&gt; muon_params).append(param)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;26&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;print&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;f&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;Model parameters: &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span&gt;n_params&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}\n&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;27&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;28&lt;&#x2F;span&gt;&lt;span&gt;adamw&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.optim.AdamW(adamw_params,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; lr&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;3e-4&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; betas&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0.9&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 0.95&lt;&#x2F;span&gt;&lt;span&gt;),&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; eps&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1e-8&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; weight_decay&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0.01&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;29&lt;&#x2F;span&gt;&lt;span&gt;muon&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.optim.Muon(muon_params,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; lr&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0.02&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; momentum&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0.95&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; weight_decay&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0.1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;30&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;for&lt;&#x2F;span&gt;&lt;span&gt; step&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; range&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, max_steps&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span&gt;):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;31&lt;&#x2F;span&gt;&lt;span&gt;    token_ids, targets&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; sample_batch(data, batch_size, cfg.block_size)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;32&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;33&lt;&#x2F;span&gt;&lt;span&gt;    adamw.zero_grad()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;34&lt;&#x2F;span&gt;&lt;span&gt;    muon.zero_grad()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;35&lt;&#x2F;span&gt;&lt;span&gt;    _, loss&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; model(token_ids, targets)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;36&lt;&#x2F;span&gt;&lt;span&gt;    loss.backward()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;37&lt;&#x2F;span&gt;&lt;span&gt;    adamw.step()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;38&lt;&#x2F;span&gt;&lt;span&gt;    muon.step()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;39&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;40&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    if&lt;&#x2F;span&gt;&lt;span&gt; step&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; %&lt;&#x2F;span&gt;&lt;span&gt; log_interval&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; ==&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 0&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;41&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        print&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;f&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;\r&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Step &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span&gt;step&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span&gt;max_steps&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; (&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span&gt;step&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; &#x2F;&lt;&#x2F;span&gt;&lt;span&gt; max_steps&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:.2%&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;) | Loss: &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span&gt;loss.item()&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:.4f&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; end&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: 
#9ECBFF;&quot;&gt;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; flush&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;True&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;42&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;43&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;# Seed the model with the prompt &amp;quot;I am&amp;quot; to kick off text generation&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;44&lt;&#x2F;span&gt;&lt;span&gt;token_ids&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.tensor(tokenizer.encode(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;I am&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;),&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; device&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;device).reshape(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; -&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;45&lt;&#x2F;span&gt;&lt;span&gt;output&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; model.generate(token_ids,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; max_new_token_ids&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;512&lt;&#x2F;span&gt;&lt;span&gt;)[&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;46&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;print&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;f&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;\n\n&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;OUTPUT:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;\n\n{&lt;&#x2F;span&gt;&lt;span&gt;tokenizer.decode(output.tolist())&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Training logs from under 40 seconds of training on a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;MacBook_Pro_(Apple_silicon)&quot;&gt;MacBook Pro&lt;&#x2F;a&gt; with an &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Apple_M4&quot;&gt;Apple M4
Pro&lt;&#x2F;a&gt; chip:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Model parameters: 140,127&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Step 2000&#x2F;2000 (100.00%) | Loss: 1.3831&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;OUTPUT:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;I am the forth which were on his stranger. The light of many our reasun alive in the enter of my&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cottagers; there in the murderer, when I walked in&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;my enemy sat much of spring in its removed. I had not for my friend, and the cottager of the lack of&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;the wint on the lovely of cestant, and that the old man, but I was entered that&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;the eviner of my hands. My journey is many beauty will of compassins of the more spully appeared&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;with meaning a man work was not rease to be macking my own and a&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;gentle understand even&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;</description>
      </item>
      <item>
          <title>COVID-19 and Transition Into Full-Time Summer Work</title>
          <pubDate>Fri, 22 May 2020 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://oniani.org/blog/covid-19-and-transition-into-full-time-summer-work/</link>
          <guid>https://oniani.org/blog/covid-19-and-transition-into-full-time-summer-work/</guid>
          <description xml:base="https://oniani.org/blog/covid-19-and-transition-into-full-time-summer-work/">&lt;p&gt;A lot of things have happened in the last few months. The situation with COVID-19 has undoubtedly
posed new challenges for many of us. Schools have switched to online learning, and I am now working
fully remotely.&lt;&#x2F;p&gt;
&lt;p&gt;While what happened is certainly not favorable by any means, I do think that it teaches us a lot of
essential and transferable skills. The ability to deal with high levels of stress and move on
despite the difficulties is one of the most important skills one could acquire. It is also
interesting to think about the pandemic from a technological perspective. The pandemic would have
hit a lot harder had it happened twenty or even ten years ago. Now that pretty much everyone owns a
computer and has access to the internet, it is manageable to maintain a social life. In other words,
what we have been practicing is really just physical distancing, not necessarily social distancing.&lt;&#x2F;p&gt;
&lt;figure&gt;
  &lt;img src=&quot;data_visualization_tool.png&quot; alt=&quot;Data visualization tool&quot; width=&quot;512px&quot;&gt;
  &lt;figcaption&gt;Data visualization tool&lt;&#x2F;figcaption&gt;
&lt;&#x2F;figure&gt;
&lt;p&gt;In regard to my internship, as the pandemic began its rapid spread, some of our projects were
postponed and we started working on projects related to COVID-19 right away. For the past
few months, we have been fully dedicated to COVID-19-related research and development. In the
first few days, Dr. Yanshan Wang and I automated the COVID-19 screening process for Mayo nurses and
physicians, which ultimately saved a lot of precious time for medical personnel. Additionally, I
have been working on two major projects.&lt;&#x2F;p&gt;
&lt;p&gt;The first project I worked on (with Dr. Feichen Shen) leveraged AI-driven graph mining
techniques to assist COVID-19 knowledge discovery, with potential value for innovative
COVID-19 drug discovery. We also built a user-friendly
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;oniani.org&#x2F;covid-19-network&quot;&gt;web-based tool&lt;&#x2F;a&gt; to visualize the data as well as support
link&#x2F;relation prediction. The project yielded promising results, and
our manuscript was accepted by the Journal of the American Medical Informatics Association (JAMIA), the
most prestigious journal in the domain of medical informatics.&lt;&#x2F;p&gt;
&lt;p&gt;The other project (with Dr. Yanshan Wang) revolved around building a chatbot that answers questions
related to COVID-19. Additionally, we have been testing a number of state-of-the-art models and
embedding-generation techniques for performance comparison. We have almost finished drafting the
paper and will soon submit it to one of the top biomedical informatics conferences.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, my mentors have expressed enthusiastic support for extending my internship into full-time
summer work, so I am planning to stay here over the summer.&lt;&#x2F;p&gt;
&lt;p&gt;We will beat this pandemic and come out stronger than ever before!&lt;&#x2F;p&gt;
&lt;p&gt;Stay safe and stay strong!&lt;&#x2F;p&gt;
&lt;p&gt;The original blogpost is located &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www2.luther.edu&#x2F;long-term-blogs&#x2F;rochester&#x2F;?story_id=911867&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Mayo Clinic - First Impressions</title>
          <pubDate>Tue, 24 Mar 2020 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://oniani.org/blog/mayo-clinic-first-impressions/</link>
          <guid>https://oniani.org/blog/mayo-clinic-first-impressions/</guid>
<description xml:base="https://oniani.org/blog/mayo-clinic-first-impressions/">&lt;p&gt;The first few weeks were full of orientations, training sessions, mentor and supervisor meetings, and
project discussions. I got acquainted with some of the internal processes at Mayo, the general
employee workflow, and my workflow as a research and development intern. To maintain the quality
associated with the Mayo Clinic tradition, the clinic has a well-defined corporate culture. The work
environment is awesome. I am surrounded by really smart individuals who are passionate and
enthusiastic about the work that they do. I have learned a lot about both the history and the core
values of the clinic. As a side note, there is a wonderful documentary by Ken Burns that offers a
fascinating walk through the rich history of Mayo - it is titled The Mayo Clinic: Faith, Hope, Science.&lt;&#x2F;p&gt;
&lt;figure&gt;
  &lt;img src=&quot;mayo_clinic_heritage_hall.png&quot; alt=&quot;Mayo Clinic Heritage Hall&quot; width=&quot;512px&quot;&gt;
  &lt;figcaption&gt;Mayo Clinic Heritage Hall&lt;&#x2F;figcaption&gt;
&lt;&#x2F;figure&gt;
&lt;p&gt;My supervisor, Stephanie, has been very caring and supportive. Her comments and bits of advice have
been helpful and insightful, and she also does an excellent job of finding the right connections for
me. My mentors, Drs. Feichen Shen and Yanshan Wang, have been great learning resources and have
already made me familiar with several new artificial intelligence and informatics techniques. Per my
request, they got me engaged in two research projects. Since the projects involve state-of-the-art
approaches, I cannot disclose any further details at this time, but I hope to be able to reveal more
in the future.&lt;&#x2F;p&gt;
&lt;p&gt;It is surprising how many of the research and development skills acquired at Luther are directly
applicable to my current work. I am thankful to professors Richard Merritt, Alan Zaring, and Roman
Yasinovskyy for their ardent support and help in building these skills. If you are a Luther student
reading this blog post, go find out about the exciting summer opportunities at Luther and do not
miss out on a wonderful journey called research!&lt;&#x2F;p&gt;
&lt;p&gt;I am grateful for the opportunities that the program has provided to the students and am looking
forward to the new adventures to come!&lt;&#x2F;p&gt;
&lt;p&gt;The original blogpost is located &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www2.luther.edu&#x2F;long-term-blogs&#x2F;rochester&#x2F;?story_id=905337&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Privacy for the Web</title>
          <pubDate>Fri, 03 Jan 2020 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://oniani.org/blog/privacy-for-the-web/</link>
          <guid>https://oniani.org/blog/privacy-for-the-web/</guid>
<description xml:base="https://oniani.org/blog/privacy-for-the-web/">&lt;p&gt;Just wanted to make a few suggestions on improving web privacy:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Lightweight Privacy (emphasis on usability and speed)
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.mozilla.org&#x2F;en-US&#x2F;firefox&#x2F;new&#x2F;&quot;&gt;Firefox&lt;&#x2F;a&gt; as the default browser
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bitwarden.com&#x2F;&quot;&gt;Bitwarden&lt;&#x2F;a&gt; for password management&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;ublockorigin.com&#x2F;&quot;&gt;uBlock Origin&lt;&#x2F;a&gt; to block ads (and other junk)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;arkenfox&#x2F;user.js&quot;&gt;arkenfox user.js&lt;&#x2F;a&gt; for extra security&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;signal.org&#x2F;&quot;&gt;Signal&lt;&#x2F;a&gt; as a messenger&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;element.io&#x2F;about&quot;&gt;Element&lt;&#x2F;a&gt; for group chats and collaborations&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;protonmail.com&#x2F;&quot;&gt;Protonmail&lt;&#x2F;a&gt; for the email service&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Heavier Privacy (emphasis on privacy)
&lt;ul&gt;
&lt;li&gt;Use &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.torproject.org&#x2F;&quot;&gt;Tor&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Super Advanced Privacy (stronger emphasis on privacy)
&lt;ul&gt;
&lt;li&gt;Do not use the web&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</description>
      </item>
      <item>
          <title>The Essence of Programming - Functional Approach</title>
          <pubDate>Sun, 25 Nov 2018 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://oniani.org/blog/the-essence-of-programming-functional-approach/</link>
          <guid>https://oniani.org/blog/the-essence-of-programming-functional-approach/</guid>
          <description xml:base="https://oniani.org/blog/the-essence-of-programming-functional-approach/">&lt;p&gt;This blogpost is a general overview of a rather underappreciated programming methodology called
functional programming. Throughout the blogpost, I will occasionally use the purely functional
programming language Haskell as well as the imperative-style programming language Python. I will
assume knowledge of basic programming concepts such as variable assignment, arithmetic
operations, conditionals, functions, loops, and recursion.&lt;&#x2F;p&gt;
&lt;p&gt;It is important to note that the blogpost is just an introduction to the paradigms in functional
programming and does not cover any of them in great detail.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;what-is-functional-programming&quot;&gt;What is Functional Programming?&lt;&#x2F;h3&gt;
&lt;p&gt;As mentioned above, functional programming is just an approach to programming. Particularly, it
refers to programming using functions, hence the name &lt;strong&gt;functional programming&lt;&#x2F;strong&gt;. To better
understand what it means for a programming language to be functional, let&#x27;s make a short
side-by-side comparison of the functional and the wildly popular imperative styles of programming
and then discuss the differences in more detail.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Imperative language&lt;&#x2F;th&gt;&lt;th&gt;Functional language&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Classes or structures are the first-class citizens&lt;&#x2F;td&gt;&lt;td&gt;Functions are the first-class citizens&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;State changes are important&lt;&#x2F;td&gt;&lt;td&gt;State changes are limited or non-existent&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Primary control flow: loops and conditionals&lt;&#x2F;td&gt;&lt;td&gt;Primary control flow: function calls and recursion&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
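&lt;p&gt;To make the last row of the table concrete, here is a small sketch in Python (which this post also uses for its imperative examples) of the same computation written in each style; the function names are mine, purely for illustration:&lt;&#x2F;p&gt;

```python
# Imperative style: a loop mutates an accumulator variable
def sum_imperative(xs):
    total = 0
    for x in xs:
        total += x
    return total


# Functional style: no mutation; recursion drives the control flow
def sum_functional(xs):
    if not xs:
        return 0
    return xs[0] + sum_functional(xs[1:])


print(sum_imperative([1, 2, 3, 4]))  # 10
print(sum_functional([1, 2, 3, 4]))  # 10
```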
&lt;h3 id=&quot;classes-vs-functions&quot;&gt;Classes VS Functions&lt;&#x2F;h3&gt;
&lt;p&gt;The first comparison shows that, generally speaking, in imperative languages (e.g., Python, C,
Java), variables (instances of classes or structures) dominate over all other objects. Thus, the
imperative paradigm makes a clear distinction between variables and functions. On the other hand, in
functional programming languages, functions are the first-class citizens making virtually everything
else rank below them.&lt;&#x2F;p&gt;
&lt;p&gt;Imperative programming languages treat variables as data, while functions are generally used just to
manipulate variables or generate data. When programming in a functional language, functions are
treated much like variables. In fact, they are no different from variables,
as they not only manipulate data, but also represent data themselves. Thus, in the
functional world, a piece of code such as a function is also data.&lt;&#x2F;p&gt;
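&lt;p&gt;A quick way to see functions behaving like variables is to store them in a data structure and pass them around, just as we would with numbers. A minimal Python sketch (the names here are mine, for illustration only):&lt;&#x2F;p&gt;

```python
# Functions can be stored in a dict, just like any other values
ops = {
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
}


# Functions can also be passed as arguments to other functions
def apply_twice(f, x):
    return f(f(x))


print(ops["double"](5))               # 10
print(apply_twice(ops["square"], 2))  # 16
```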
&lt;blockquote&gt;
&lt;p&gt;There is a term for a language in which a program can be manipulated as data. This
quality is referred to as &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Homoiconicity&quot;&gt;homoiconicity&lt;&#x2F;a&gt; and such
languages are called homoiconic. One such language is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lisp_(programming_language)&quot;&gt;Lisp&lt;&#x2F;a&gt;. And, based on our discussion
points, it is not very surprising that Lisp is a functional language.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;I will give you a concise proof of why functions are data. Remember the table representations of
functions we learned at some point in elementary school? That&#x27;s the proof! Any function can be
represented as a table of values. For instance, consider a function \( f(x) = 2x \). The following
will be a table representation of the function.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;\( x \)&lt;&#x2F;th&gt;&lt;th&gt;\( f(x) \)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;12&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;...&lt;&#x2F;td&gt;&lt;td&gt;...&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Looks more like data? That&#x27;s because it is data! We have effectively generated a two-column table
where each cell holds a value. And yes, this is very similar to SQL tables and pandas
data frames.&lt;&#x2F;p&gt;
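&lt;p&gt;To make the same point executable, here is a minimal Python sketch (the names &lt;code&gt;f&lt;&#x2F;code&gt; and &lt;code&gt;table&lt;&#x2F;code&gt; are illustrative) that materializes the function above into a lookup table over a finite domain:&lt;&#x2F;p&gt;

```python
# The function f(x) = 2x from the table above
def f(x):
    return 2 * x

# Materialize the function as data: a finite table of input/output pairs
table = {x: f(x) for x in range(7)}

# Over this domain, a table lookup is indistinguishable from a function call
assert all(table[x] == f(x) for x in range(7))
print(table)  # {0: 0, 1: 2, 2: 4, 3: 6, 4: 8, 5: 10, 6: 12}
```

&lt;p&gt;The dictionary is the two-column table in code form: the keys are the \( x \) column and the values are the \( f(x) \) column.&lt;&#x2F;p&gt;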
&lt;h3 id=&quot;natural-outcomes&quot;&gt;Natural Outcomes&lt;&#x2F;h3&gt;
&lt;p&gt;Because functions are so central, there are natural outcomes shared among most functional
languages.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s write a Haskell function to find the factorial of a positive integer.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- | A function to find the factorial of a positive integer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;factorial x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; product [&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;..&lt;&#x2F;span&gt;&lt;span&gt;x]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- easy as that&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The function builds a list of integers from 1 up to x and then calculates the product of its
elements. This way we effectively get the product \( 1 \times 2 \times 3 \times \dots \times x \),
which is the same as \( x! \). Since we now have a function, we can call it with actual parameters!&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;print (factorial &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- prints out 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;print (factorial &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- prints out 720&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;print (factorial &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- prints out 362880&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As shown above, something that in imperative languages would require importing modules, looping,
etc., is a single line in Haskell. This is one outcome of functions being first-class citizens: most
functional languages ship a rich pool of built-in functions for manipulating data. The example above
also uses an interesting notation, &lt;code&gt;[1..x]&lt;&#x2F;code&gt;, which builds a list of integers from 1 up to
\( x \) (here \( x \) must be an integer with \( x \geq 1 \)). Thus, another outcome is that data
structures and collections can be created very easily, usually in a single line of code, leaving the
programmer more time for functions and logic. These are some of the reasons why functional
languages are so concise.&lt;&#x2F;p&gt;
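&lt;p&gt;For comparison, here is a hedged Python sketch of the same one-liner (idiomatic Python would reach for &lt;code&gt;math.factorial&lt;&#x2F;code&gt;; this version deliberately mirrors Haskell&#x27;s &lt;code&gt;product [1..x]&lt;&#x2F;code&gt;):&lt;&#x2F;p&gt;

```python
from functools import reduce
from operator import mul

def factorial(x):
    # Fold multiplication over 1..x, mirroring Haskell's product [1..x];
    # the initial value 1 makes factorial(0) come out as 1 as well
    return reduce(mul, range(1, x + 1), 1)

print(factorial(1))  # 1
print(factorial(6))  # 720
print(factorial(9))  # 362880
```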
&lt;h3 id=&quot;math-sets-and-haskell&quot;&gt;Math, Sets, and Haskell&lt;&#x2F;h3&gt;
&lt;p&gt;Notice that the factorial function above did not handle the case of \( 0 \) explicitly
(recall that \( 0! = 1 \); in fact, &lt;code&gt;product [1..0]&lt;&#x2F;code&gt; is the product of an empty list,
which Haskell defines as 1, so the function happens to return the right answer anyway). Adding an
explicit base case lets us introduce some new notation and explain the whole function in detail.
Below is a better and more complete version of the function.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- | A function to find the factorial of a number&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;factorial&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; :: Integer -&amp;gt; Integer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;factorial &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;factorial x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; product [&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;..&lt;&#x2F;span&gt;&lt;span&gt;x]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;::&lt;&#x2F;code&gt; - introduces the type signature of a function&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;Integer&lt;&#x2F;code&gt; - an arbitrary-precision integer type that can hold any whole number,
limited only by the machine&#x27;s memory&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;-&amp;gt;&lt;&#x2F;code&gt; - separates the type of each formal parameter from the next; the final type is the type of the output&lt;&#x2F;p&gt;
&lt;p&gt;Looks similar to something you have seen before? If you have taken any undergraduate math class,
there is a chance that you&#x27;ve encountered the following notation:&lt;&#x2F;p&gt;
&lt;p&gt;\[ f : A \rightarrow B : x \mapsto y \]&lt;&#x2F;p&gt;
&lt;p&gt;The notation above describes a simple function that takes an input from set \( A \) and maps it to
the output in the set \( B \).&lt;&#x2F;p&gt;
&lt;p&gt;Here is the complete definition of the factorial function that we saw above:&lt;&#x2F;p&gt;
&lt;p&gt;\[ f : \mathbb{Z}^+ \cup \{ 0 \} \rightarrow \mathbb{Z}^+ : x \mapsto x! \]&lt;&#x2F;p&gt;
&lt;p&gt;Haskell definitions follow a similar fashion.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;factorial :: Integer -&amp;gt; Integer&lt;&#x2F;code&gt; says that &lt;code&gt;factorial&lt;&#x2F;code&gt; is a function that takes an element from the
set of integers and maps it to some other element in the set of integers. Unlike math,
however, Haskell does not use the \( \mapsto \) notation; instead, the mapping is given by the
equations below the type signature.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;factorial &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;factorial x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; product [&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;..&lt;&#x2F;span&gt;&lt;span&gt;x]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The code above is equivalent to saying: if the element is 0, map it to 1; in all other cases,
map it to the product of the integers from 1 up to the element.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;state-changes-and-functional-programming&quot;&gt;State Changes and Functional Programming&lt;&#x2F;h3&gt;
&lt;p&gt;Functional languages have a limited notion of state and typically avoid shared mutable state at
any cost. &lt;strong&gt;Purely&lt;&#x2F;strong&gt; (see &lt;a href=&quot;https:&#x2F;&#x2F;oniani.org&#x2F;blog&#x2F;the-essence-of-programming-functional-approach&#x2F;#purely-functional-languages&quot;&gt;Purely Functional Languages&lt;&#x2F;a&gt;) functional
languages like Haskell go further and avoid implicit state changes altogether. Since there are no
changes in state, there are no mutable variables. Instead, functional languages offer functions and
immutable variables.&lt;&#x2F;p&gt;
&lt;p&gt;To make it clear, let&#x27;s look at two examples below. One is from Python and the other is from
Haskell.&lt;&#x2F;p&gt;
&lt;p&gt;Python example&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;# Define a variable &amp;#39;my_number&amp;#39; and assign it to 3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;my_number&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;# Increment the variable &amp;#39;my_number&amp;#39; by 1 and reassign it to the result&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;my_number&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;print&lt;&#x2F;span&gt;&lt;span&gt;(my_number)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Prints out 4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let&#x27;s repeat the same steps in Haskell.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;myNumber &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;    -- Define a variable &amp;#39;myNumber&amp;#39; and assign it to 3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;myNumber &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;+=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;   -- Rejected at compile time: Haskell has no (+=) and no reassignment&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;print myNumber  &lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- Never runs: the program above does not compile&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Looking at the code above, you might have already noticed that Haskell does not allow changing
the state of the program. Now, you might be wondering how one could increment a variable.&lt;&#x2F;p&gt;
&lt;p&gt;Here is a short answer (note that this works in GHCi, where a new binding shadows the old one;
in a source file, defining &lt;code&gt;myNumber&lt;&#x2F;code&gt; twice would be a multiple-declaration error):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;myNumber &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;                  -- Define a variable &amp;#39;myNumber&amp;#39; and assign it to 3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;myOtherNumber &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; myNumber &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- Define a variable &amp;#39;myOtherNumber&amp;#39; and assign it to &amp;#39;myNumber&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;myNumber &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; myOtherNumber      &lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- Redefine &amp;#39;myNumber&amp;#39; and set it to &amp;#39;myOtherNumber&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;print myNumber                &lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- Prints out 4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;Longer and better answer&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;You do not really need such increments or decrements in functional programming languages. You can
easily overcome this hindrance with functions and recursion: instead of mutating objects, we use
recursion to gradually reach the target.&lt;&#x2F;p&gt;
&lt;p&gt;Here is how one could translate the well-known accumulator pattern from Python to Haskell.&lt;&#x2F;p&gt;
&lt;p&gt;First, the classic Python version:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;# An accumulator pattern approach for finding&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;# the sum of the first 100 positive integers.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;total&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;for&lt;&#x2F;span&gt;&lt;span&gt; integer&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; range&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 101&lt;&#x2F;span&gt;&lt;span&gt;):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    total&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +=&lt;&#x2F;span&gt;&lt;span&gt; integer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;print&lt;&#x2F;span&gt;&lt;span&gt;(total)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Prints out 5050&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here is what it looks like in Haskell:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;accumulator &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;                        -- The base case for the recursion&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;accumulator x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span&gt; accumulator (x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- The recursive case&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;main &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; print (accumulator &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;100&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- Prints out 5050&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In the code excerpt above, we did not use any loops. In fact, we could not have: purely
functional languages do not support loops. Instead, we defined a function, used recursion, and
calculated the sum of the values through function calls.&lt;&#x2F;p&gt;
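&lt;p&gt;To close the circle, here is the Haskell recursion translated back into Python as a sketch. Note that Python does not optimize tail calls, so very deep recursion would hit the interpreter&#x27;s recursion limit:&lt;&#x2F;p&gt;

```python
def accumulator(x):
    # Base case, mirroring: accumulator 1 = 1
    if x == 1:
        return 1
    # Recursive case, mirroring: accumulator x = x + accumulator (x - 1)
    return x + accumulator(x - 1)

print(accumulator(100))  # 5050
```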
&lt;blockquote&gt;
&lt;h4 id=&quot;side-note&quot;&gt;Side Note&lt;&#x2F;h4&gt;
&lt;p&gt;In this particular case, we do not even need to implement the recursive accumulator pattern. All
we need is the predefined &lt;code&gt;sum&lt;&#x2F;code&gt; function and the range notation
we have already seen (&lt;code&gt;[1..x]&lt;&#x2F;code&gt;):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;print (sum [&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;100&lt;&#x2F;span&gt;&lt;span&gt;])&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- Prints out 5050&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;&#x2F;blockquote&gt;
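&lt;p&gt;Python has the same kind of shortcut: as a sketch, &lt;code&gt;range&lt;&#x2F;code&gt; plays the role of &lt;code&gt;[1..100]&lt;&#x2F;code&gt; and the built-in &lt;code&gt;sum&lt;&#x2F;code&gt; replaces the explicit accumulator:&lt;&#x2F;p&gt;

```python
# Python analogue of Haskell's: print (sum [1..100])
# range(1, 101) yields the integers 1 through 100
print(sum(range(1, 101)))  # 5050
```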
&lt;h3 id=&quot;control-flow&quot;&gt;Control Flow&lt;&#x2F;h3&gt;
&lt;p&gt;As we have already seen, there are no for loops or while loops in purely functional programming
languages, and there are good reasons why. Let&#x27;s list a few of them and then elaborate on
each.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Functional languages are declarative.&lt;&#x2F;li&gt;
&lt;li&gt;Most functional languages are heavily influenced by lambda calculus.&lt;&#x2F;li&gt;
&lt;li&gt;If you were to implement a functional programming language,
you would likely get rid of loops yourself.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h4 id=&quot;functional-languages-are-declarative&quot;&gt;Functional Languages Are Declarative&lt;&#x2F;h4&gt;
&lt;p&gt;For those who are new to the idea of declarative languages, let&#x27;s first discuss what it means for a
language to be declarative. Here is a simple definition:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Declarative programming is a method of programming that abstracts away the control flow for logic
required for performing an action, and instead involves stating the task or desired outcome.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Examples of declarative languages include SQL, Haskell, and Prolog.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Example 1&lt;&#x2F;strong&gt;: Consider the SQL query language. In SQL, one does not describe how to get the
data. One just tells SQL what data is needed, and the SQL engine figures out the best way to get it.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Example 2&lt;&#x2F;strong&gt;: A better example might be comparing two implementations of a simple function. Let&#x27;s
implement them in both Python and Haskell.&lt;&#x2F;p&gt;
&lt;p&gt;The function takes a list of integers and returns the sum of odd integers in it.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; odd_sum&lt;&#x2F;span&gt;&lt;span&gt;(list_of_integers):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;Returns the sum of all odd integers in the list.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    total&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    for&lt;&#x2F;span&gt;&lt;span&gt; integer&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span&gt; list_of_integers:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        if&lt;&#x2F;span&gt;&lt;span&gt; integer&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; %&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; ==&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            total&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +=&lt;&#x2F;span&gt;&lt;span&gt; integer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    return&lt;&#x2F;span&gt;&lt;span&gt; total&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;print&lt;&#x2F;span&gt;&lt;span&gt;(odd_sum([&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 4&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 5&lt;&#x2F;span&gt;&lt;span&gt;]))&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Prints out 9&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let&#x27;s do a quick analysis of the &lt;code&gt;odd_sum&lt;&#x2F;code&gt; function. As seen above, it starts by declaring a
variable &lt;code&gt;total&lt;&#x2F;code&gt;, initially set to 0. Then we iterate over the list, and on each
iteration we check whether the integer is odd; if it is, we add it to &lt;code&gt;total&lt;&#x2F;code&gt;. In the end, we return
the &lt;code&gt;total&lt;&#x2F;code&gt; variable.&lt;&#x2F;p&gt;
&lt;p&gt;Now that we have analyzed the function a bit, notice that in the for loop, on each iteration,
we are giving Python directions on when to add the integer to &lt;code&gt;total&lt;&#x2F;code&gt; (only if it is odd). Thus, &lt;strong&gt;we
tell Python what to do step by step&lt;&#x2F;strong&gt;. This is an important characteristic that distinguishes
non-declarative languages from declarative ones.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s now look at the Haskell example.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;oddSum x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; sum (filter odd x)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;main &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; print (oddSum [&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;])&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- Prints out 9&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Notice what we did here. First we defined a function &lt;code&gt;oddSum&lt;&#x2F;code&gt;, which takes a list. Then we used the
predefined function &lt;code&gt;filter&lt;&#x2F;code&gt; in conjunction with another predefined function &lt;code&gt;odd&lt;&#x2F;code&gt;
(which returns true if the value is odd and false otherwise) to get the list of odd integers. Finally, we
summed up all the odd integers and got the result.&lt;&#x2F;p&gt;
&lt;p&gt;See the difference? In Python, we used a for loop and, on each iteration, told Python whether
to add the integer to &lt;code&gt;total&lt;&#x2F;code&gt; or not. In Haskell, however, we gave the whole list to the function
and told it to remove all of the even integers and then sum up the rest (if
you eliminate all the even integers, you are obviously left with only the odd ones). In other
words, in the Haskell example, we do not care how the functions &lt;code&gt;sum&lt;&#x2F;code&gt; and &lt;code&gt;filter&lt;&#x2F;code&gt; work internally;
we only care that they do their job: sum up the odd numbers in the list and return
the value.&lt;&#x2F;p&gt;
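&lt;p&gt;The declarative style is not exclusive to Haskell. As a sketch, Python&#x27;s built-in &lt;code&gt;filter&lt;&#x2F;code&gt; and &lt;code&gt;sum&lt;&#x2F;code&gt; can express the same intent without an explicit loop:&lt;&#x2F;p&gt;

```python
def odd_sum(integers):
    # Declarative version: state *what* we want (the sum of the odd elements),
    # not how to iterate over the list
    return sum(filter(lambda n: n % 2 == 1, integers))

print(odd_sum([1, 2, 3, 4, 5]))  # Prints out 9
```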
&lt;h4 id=&quot;functional-programming-and-lambda-calculus&quot;&gt;Functional Programming And Lambda Calculus&lt;&#x2F;h4&gt;
&lt;p&gt;Lambda calculus (also written as \( \lambda \)-calculus) is a branch of mathematics developed
by &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Alonzo_Church&quot;&gt;Alonzo Church&lt;&#x2F;a&gt; in the 1930s. It is a
formal system for expressing computation and an alternative to the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Turing_machine&quot;&gt;Turing
machine&lt;&#x2F;a&gt;, which was introduced by &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Alan_Turing&quot;&gt;Alan
Turing&lt;&#x2F;a&gt;. Turing machines involve loops and other
non-declarative mechanisms, and they are the inspiration for imperative programming languages like Java
and Python. Turing later proved that everything computable by a Turing machine can equally be
computed in lambda calculus, and vice versa; this equivalence underpins what is now known as the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Church%E2%80%93Turing_thesis&quot;&gt;Church-Turing thesis&lt;&#x2F;a&gt;.
Hence, simply put, lambda calculus has power equivalent to that of Turing machines. Not too long
after, people began basing programming languages on the ideas of lambda calculus (it was just as
powerful as Turing machines, so why not?!). This led to shared characteristics among functional
languages, such as the lack of loops: virtually all functional programming languages have no loops
because lambda calculus has no loops. One could certainly add loops, but they would be redundant;
instead, functional languages use the mathematical idea of recursion. This is part of the reason why
loops are not appreciated in the functional world.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;getting-rid-of-loops&quot;&gt;Getting Rid of Loops&lt;&#x2F;h4&gt;
&lt;p&gt;Despite the fact that they are sometimes very useful, loops do not belong in a purely
functional programming language. There are several reasons for this.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Loops are imperative: they spell out for the language what to do, step by step.&lt;&#x2F;li&gt;
&lt;li&gt;Loops usually involve mutating values, which is, once again, against functional virtues.&lt;&#x2F;li&gt;
&lt;li&gt;Even if we used loops without mutation, they would create unnecessary
redundancy in a language with an emphasis on recursion (which is just as powerful as a
regular loop!).&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h3 id=&quot;purely-functional-languages&quot;&gt;Purely Functional Languages&lt;&#x2F;h3&gt;
&lt;p&gt;You might have seen the word &lt;strong&gt;pure&lt;&#x2F;strong&gt; at the beginning of the blog post, where I mentioned that Haskell is
a &lt;strong&gt;purely&lt;&#x2F;strong&gt; functional programming language. However, I never defined what it means for a functional
language to be pure. So let&#x27;s do it now!&lt;&#x2F;p&gt;
&lt;p&gt;Those who read the &lt;a href=&quot;https:&#x2F;&#x2F;oniani.org&#x2F;blog&#x2F;the-essence-of-programming-functional-approach&#x2F;#math-sets-and-haskell&quot;&gt;Math, Sets, and Haskell&lt;&#x2F;a&gt; section will remember the math notation for
functions. I will use it to take the mystery out of this concept of being &lt;strong&gt;pure&lt;&#x2F;strong&gt;!&lt;&#x2F;p&gt;
&lt;p&gt;Suppose we have a function \( f : \mathbb{Z} \rightarrow \mathbb{Z} \). Just by looking at
the function, we see that it takes an input from the set of integers and that its output is also in the set
of integers. In other words, the function \( f \) cannot take inputs like -1.9, 0.2, or 12.7, nor
can it give an output like 12.6, 71.9, or -9.1. Its inputs and outputs can &lt;strong&gt;only&lt;&#x2F;strong&gt; be
integers.&lt;&#x2F;p&gt;
&lt;p&gt;Now, let&#x27;s actually make this dull function \( f \) do something. Consider the function \( f :
\mathbb{Z} \rightarrow \mathbb{Z} : x \mapsto 2x \). We now have a function that does a fairly
straightforward thing: it takes an integer and maps it to twice its value (which will also be an
integer). Let&#x27;s look at the Haskell implementation of this function.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- | A function that takes an input and outputs twice its value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;f&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; :: Integer -&amp;gt; Integer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;f x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; *&lt;&#x2F;span&gt;&lt;span&gt; x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The function above says that the input (corresponding to the &lt;code&gt;Integer&lt;&#x2F;code&gt; before the arrow) is always an
integer and that the output (corresponding to the &lt;code&gt;Integer&lt;&#x2F;code&gt; after the arrow) is also an integer. &lt;strong&gt;Hence,
we always know the type of the input and the type of the output.&lt;&#x2F;strong&gt; In fact, we also know that
if we call the function with, say, 5, we will always get the same result. Namely, &lt;code&gt;f 5 = 10&lt;&#x2F;code&gt;. Hence,
the input(s) and output(s) are always integers, and the function, called with the same actual
parameters, always returns the same value! This is what makes Haskell a purely functional language.
&lt;strong&gt;At any given point in time, we always know the type of the input and the type of the output.
Besides, we know that the function, called with the same actual parameter(s), always returns the same
value&lt;&#x2F;strong&gt;. Such functions produce no side effects, since we already know what to expect
for a given input. &lt;strong&gt;Such functions are called pure!&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
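Purity can also be checked directly: since a pure function&#x27;s result depends only on its arguments, repeated calls with the same argument are interchangeable with each other and with their value. Below is a minimal Python sketch of the same doubling function (my illustration, not from the original post):

```python
def f(x: int) -> int:
    """Pure: the result depends only on the argument x."""
    return 2 * x


# The same input always yields the same output...
assert f(5) == 10 and f(5) == f(5)
# ...so a call can be substituted by its value anywhere it appears
assert f(5) + f(5) == 2 * f(5)
```

This substitution property is often called referential transparency, and it is exactly what the impure example below lacks.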
&lt;p&gt;To further demystify this idea, let&#x27;s look at the following piece of code:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;An example of a function that is pretending to be pure.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span&gt; random&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; numgen&lt;&#x2F;span&gt;&lt;span&gt;(val:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;Generates a number.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    return&lt;&#x2F;span&gt;&lt;span&gt; val&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span&gt; random.randint(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, val)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; %&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; main&lt;&#x2F;span&gt;&lt;span&gt;() -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;Test the number generation.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;    print&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;f&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;Returns &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span&gt;numgen(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Prints out 8&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;    print&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;f&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;Returns &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span&gt;numgen(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Prints out 8&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;    print&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;f&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;Returns &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span&gt;numgen(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Prints out 7&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; __name__&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; ==&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;__main__&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    main()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here, we defined a function that takes an integer value as an input, and it seems like the output is
also an integer. We might now be lured into thinking that the function &lt;code&gt;numgen&lt;&#x2F;code&gt; gives the same output
for the same input, but that is clearly not the case here. Let&#x27;s take a closer look at what the
function does. It takes an integer value and returns that value plus some random number which is 0, 1,
or 2. When we first called the function with the actual parameter 7, we got 8 as an output. The
second time, we got 8 again. The third time, however, we got 7. Hence, the third time, the output
was not the same. Therefore, the function is not pure.&lt;&#x2F;p&gt;
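One common way to recover purity in such a situation (my illustration, not from the original post) is to take the random component as an explicit argument instead of drawing it inside the function. The hypothetical &lt;code&gt;numgen_pure&lt;&#x2F;code&gt; below sketches this:

```python
def numgen_pure(val: int, rand: int) -> int:
    """Pure counterpart of numgen: the random draw is an explicit argument."""
    return val + rand % 3


# With the randomness passed in, the same arguments always give the same result
assert numgen_pure(7, 4) == 8  # 7 + (4 % 3) = 8
assert numgen_pure(7, 4) == numgen_pure(7, 4)
```

The caller now decides where the randomness comes from, which is essentially the separation that Haskell&#x27;s &lt;code&gt;IO&lt;&#x2F;code&gt; type makes explicit.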
&lt;p&gt;You might now be wondering why I could not do the same trick in Haskell. In fact, I certainly can.
However, in Haskell, such a function would not have the type &lt;code&gt;Int&lt;&#x2F;code&gt;. It would have the type &lt;code&gt;IO Int&lt;&#x2F;code&gt;. &lt;code&gt;IO&lt;&#x2F;code&gt;
is usually associated with file input &#x2F; output, and it is reasonable that it is also associated with
functions that are not always &quot;truthful&quot;, as file I&#x2F;O can be one of the nastiest
experiences for a programmer. So many things can go wrong! (e.g., writing to a file which was
deleted, reading from a file on a USB drive which was ejected, writing to a file that was moved to some other
directory, etc.) Thus, when we deal with uncertainty (which usually comes with side effects), Haskell
warns us by using the &lt;code&gt;IO&lt;&#x2F;code&gt; notation:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;{-&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;Example of a function that if called with the same argument,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;does not always return the same result.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; System.Random&lt;&#x2F;span&gt;&lt;span&gt; (&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;randomRIO&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;notAPureFunction&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; :: Int -&amp;gt; IO Int&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;notAPureFunction value &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;= do&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    randomValue &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;&amp;lt;-&lt;&#x2F;span&gt;&lt;span&gt; randomRIO (&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return (value &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span&gt; randomValue)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;main &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;= do&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;&amp;lt;-&lt;&#x2F;span&gt;&lt;span&gt; notAPureFunction &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;7&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    print x                  &lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- Prints out 9&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;&amp;lt;-&lt;&#x2F;span&gt;&lt;span&gt; notAPureFunction &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;7&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    print x                  &lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- Prints out 7&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;&amp;lt;-&lt;&#x2F;span&gt;&lt;span&gt; notAPureFunction &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;7&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    print x                  &lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- Prints out 9&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Take a look at the Haskell code above. You can disregard all the notational fluff. Just look at the
return type of the function &lt;code&gt;notAPureFunction&lt;&#x2F;code&gt;. It is &lt;code&gt;IO Int&lt;&#x2F;code&gt;! In other words, Haskell informs us
that the function might have side effects.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, we can have a rough definition of a pure functional language:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;A functional language is pure if and only if the user is informed about all side effects or there
are no side effects at all.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;In fact, Haskell did not even allow random values back in the 1990s, when its development was first
launched. Furthermore, there was no notion of file IO either, and writing to files was done using
shell redirection (i.e., &lt;code&gt;runhaskell Program.hs &amp;gt; out.txt&lt;&#x2F;code&gt;). Because of this, Haskell was
considered useless for all practical purposes. Eventually, engineers and the Haskell committee
decided to change the direction of Haskell. Instead of getting rid of all the side effects, they
decided to control them and created a more &quot;regulated&quot; programming environment.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h3&gt;
&lt;p&gt;Functional programming languages are different from imperative ones. Most of them are based on ideas
from lambda calculus. Functional languages are a proper subset of declarative languages. There are
no loops; recursion is used instead. Changes in state are non-existent and, therefore, all
variables are immutable. Functional languages usually have a lot of predefined functions to make it
easy for a programmer to solve problems. Most functional languages are also very concise,
minimizing the time spent on coding and leaving more time for the logic. Purely functional languages
are a proper subset of functional languages, and they go a long way to inform
the user about potential side effects.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;how-to-get-started-with-functional-programming&quot;&gt;How to Get Started with Functional Programming?&lt;&#x2F;h3&gt;
&lt;p&gt;There are lots of functional languages, so one will have to decide which one to learn first.
My recommendation would be to learn Haskell. It is a purely functional programming language which
incorporates most (if not all) functional ideas. Besides,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Simon_Peyton_Jones&quot;&gt;SPJ&lt;&#x2F;a&gt; dedicates most of his time to extending
the language and adding new features to it. So if there is something new and interesting in the
functional programming world, Haskell will likely adopt it.&lt;&#x2F;p&gt;
&lt;p&gt;After learning one functional language, it is not all that difficult to transition to another.
Being familiar with one functional language automatically makes one somewhat familiar with others.
Hence, a good understanding of Haskell will make it easier to learn languages such as Rust, Scheme,
etc.&lt;&#x2F;p&gt;
&lt;p&gt;To get started, visit the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.haskell.org&#x2F;documentation&quot;&gt;Haskell Documentation&lt;&#x2F;a&gt; page which
is full of various educational resources.&lt;&#x2F;p&gt;
</description>
      </item>
    </channel>
</rss>
