<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
    <channel>
      <title>David Oniani</title>
      <link>https://oniani.org</link>
      <description>David Oniani&#x27;s Website</description>
      <generator>Zola</generator>
      <language>en</language>
      <atom:link href="https://oniani.org/rss.xml" rel="self" type="application/rss+xml"/>
      <lastBuildDate>Thu, 12 Mar 2026 00:00:00 +0000</lastBuildDate>
      <item>
          <title>Autoregressive Transformer</title>
          <pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate>
          <author>David Oniani</author>
          <link>https://oniani.org/blog/transformer/</link>
          <guid>https://oniani.org/blog/transformer/</guid>
          <description xml:base="https://oniani.org/blog/transformer/">&lt;p&gt;I had been planning to make a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;@davidoniani&#x2F;videos&quot;&gt;YouTube video&lt;&#x2F;a&gt; about this for quite some time. However, just as I
was preparing to release it, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=kCc8FmEb1nY&quot;&gt;Andrej Karpathy released an excellent video&lt;&#x2F;a&gt; on the
topic, and it quickly went viral. After that, I decided to hold off on recording my version. I am
now turning the material into a written guide on the autoregressive transformer architecture. We
will implement a GPT-like &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Large_language_model&quot;&gt;Large Language Model (LLM)&lt;&#x2F;a&gt; from scratch in Python using
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;pytorch.org&#x2F;&quot;&gt;PyTorch&lt;&#x2F;a&gt; as the only dependency. I assume familiarity with PyTorch, and for those new to
it, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cs230.stanford.edu&#x2F;blog&#x2F;pytorch&#x2F;&quot;&gt;Introduction to PyTorch Code Examples from Stanford&lt;&#x2F;a&gt; provides a helpful starting
point. Without further ado, let us get started!&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The complete code for the GPT-like autoregressive transformer implemented in this article is
available here: &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;gist.github.com&#x2F;oniani&#x2F;c8a346c59b2330869febc4c9c36b45fb&quot;&gt;link&lt;&#x2F;a&gt;. In 222 lines, it automatically downloads the dataset, tokenizes the
text, pretrains the model, and generates sample text.&lt;&#x2F;p&gt;
&lt;p&gt;Note that the model implemented and trained in this article, also referred to as the base model,
is a sentence completer rather than a chatbot. Converting it into a chatbot typically requires
Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), which are beyond the scope
of this article. Still, a pretrained generative LLM is a powerful tool for sampling knowledge.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Given input text, our objective is to generate output text conditioned on that sequence. In
traditional transformer models, excluding &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;aclanthology.org&#x2F;2025.acl-long.453&#x2F;&quot;&gt;byte-latent transformers&lt;&#x2F;a&gt;, the initial stage
is always &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Large_language_model#Tokenization&quot;&gt;tokenization&lt;&#x2F;a&gt;, regardless of whether the architecture is encoder-only
(e.g., &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1810.04805&quot;&gt;BERT&lt;&#x2F;a&gt;), encoder-decoder (e.g., &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;jmlr.org&#x2F;papers&#x2F;v21&#x2F;20-074.html&quot;&gt;T5&lt;&#x2F;a&gt;), or decoder-only (e.g., &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cdn.openai.com&#x2F;better-language-models&#x2F;language_models_are_unsupervised_multitask_learners.pdf&quot;&gt;GPT-2&lt;&#x2F;a&gt;). One
straightforward method is character-level tokenization, which maps each individual character to a
unique token ID. The following zero-dependency Python implementation handles both encoding and
decoding for arbitrary input text:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span&gt; string&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;class&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; CharacterLevelTokenizer&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 4&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;A character-level tokenizer that treats individual characters as tokens.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 5&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 6&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; __init__&lt;&#x2F;span&gt;&lt;span&gt;(self, text:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; str&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; string.printable) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 7&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Builds the vocabulary from the given text.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 8&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 9&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.vocab: list[&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;str&lt;&#x2F;span&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; sorted&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;set&lt;&#x2F;span&gt;&lt;span&gt;(text))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.vocab_size:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; len&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.vocab)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;11&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.char_to_token: dict[&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;str&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; {char: idx&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; for&lt;&#x2F;span&gt;&lt;span&gt; idx, char&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; enumerate&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.vocab)}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;12&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;13&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; encode&lt;&#x2F;span&gt;&lt;span&gt;(self, text:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; str&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; list[&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span&gt;]:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;14&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Converts a string into a list of token IDs.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;15&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        return&lt;&#x2F;span&gt;&lt;span&gt; [&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.char_to_token[char]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; for&lt;&#x2F;span&gt;&lt;span&gt; char&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span&gt; text]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;17&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;18&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; decode&lt;&#x2F;span&gt;&lt;span&gt;(self, token_ids: list[&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span&gt;]) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; str&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;19&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Converts a list of token IDs back into a string.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;20&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;21&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        return&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;.join(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.vocab[token_id]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; for&lt;&#x2F;span&gt;&lt;span&gt; token_id&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span&gt; token_ids)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;22&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;23&lt;&#x2F;span&gt;&lt;span&gt;tokenizer&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; CharacterLevelTokenizer()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;24&lt;&#x2F;span&gt;&lt;span&gt;enc&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; tokenizer.encode(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;robot&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;    # -&amp;gt; [87, 84, 71, 84, 89]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;25&lt;&#x2F;span&gt;&lt;span&gt;dec&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; tokenizer.decode(enc)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;        # -&amp;gt; &amp;#39;robot&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As shown above, &lt;code&gt;CharacterLevelTokenizer&lt;&#x2F;code&gt; treats each individual character as a separate token.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note that while character-level tokenization is conceptually straightforward, most production
systems use subword tokenizers to strike a better balance between vocabulary size and
representational capacity. By capturing frequent character sequences as single units, subword
algorithms, such as &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Byte-pair_encoding&quot;&gt;Byte-Pair Encoding (BPE)&lt;&#x2F;a&gt; or &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;google-research&#x2F;bert#tokenization&quot;&gt;WordPiece&lt;&#x2F;a&gt;, significantly
enhance computational efficiency compared to more granular methods. However, in this article, we
will be using &lt;code&gt;CharacterLevelTokenizer&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
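&lt;p&gt;To make the contrast with character-level tokenization concrete, the core of BPE can be sketched in a few lines: count adjacent token pairs and merge the most frequent pair into a new vocabulary entry, repeating until the vocabulary reaches the desired size. The snippet below is an illustrative toy showing a single merge step, not a production tokenizer, and the helper names (&lt;code&gt;most_frequent_pair&lt;&#x2F;code&gt;, &lt;code&gt;merge_pair&lt;&#x2F;code&gt;) are ours rather than from any library:&lt;&#x2F;p&gt;

```python
from collections import Counter


def most_frequent_pair(tokens: list[str]) -> tuple[str, str]:
    """Returns the most frequent pair of adjacent tokens."""

    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)


def merge_pair(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Replaces every occurrence of `pair` with a single merged token."""

    merged, idx = [], 0
    while idx < len(tokens):
        if idx + 1 < len(tokens) and (tokens[idx], tokens[idx + 1]) == pair:
            merged.append(tokens[idx] + tokens[idx + 1])
            idx += 2
        else:
            merged.append(tokens[idx])
            idx += 1
    return merged


tokens = list("low lower lowest")
pair = most_frequent_pair(tokens)  # -> ('l', 'o')
merged = merge_pair(tokens, pair)  # -> ['lo', 'w', ' ', 'lo', 'w', 'e', 'r', ' ', 'lo', 'w', 'e', 's', 't']
```

&lt;p&gt;After enough such merges, frequent words like &lt;code&gt;low&lt;&#x2F;code&gt; collapse into single tokens, which is why subword vocabularies yield much shorter sequences than character-level ones.&lt;&#x2F;p&gt;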
&lt;p&gt;At this stage, it may help to look at the transformer architecture as a whole to get a sense of its
overall structure. After that, we can break it down and examine each component step by step to see
how everything fits together. It may feel overwhelming at first, which is natural, but do not be
discouraged: in the end, it is simply linear algebra arranged in a particular way.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;from&lt;&#x2F;span&gt;&lt;span&gt; dataclasses&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; import&lt;&#x2F;span&gt;&lt;span&gt; dataclass&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span&gt; torch&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span&gt; torch.nn&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; as&lt;&#x2F;span&gt;&lt;span&gt; nn&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 4&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span&gt; torch.nn.functional&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; as&lt;&#x2F;span&gt;&lt;span&gt; F&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 5&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 6&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;@dataclass&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt;frozen&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;True&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # We like our dataclasses frozen!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 7&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;class&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; Config&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 8&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;Transformer config.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 9&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span&gt;    vocab_size:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Tokenizer vocabulary size&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;11&lt;&#x2F;span&gt;&lt;span&gt;    block_size:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Max sequence length (context window)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;12&lt;&#x2F;span&gt;&lt;span&gt;    n_layer:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;     # Number of transformer layers&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;13&lt;&#x2F;span&gt;&lt;span&gt;    n_head:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;      # Attention heads per layer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;14&lt;&#x2F;span&gt;&lt;span&gt;    n_embd:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;      # Embedding dimension (must be divisible by `n_head`)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;15&lt;&#x2F;span&gt;&lt;span&gt;    dropout:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; float&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;   # Dropout probability&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;    bias:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; bool&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;       # Whether `nn.Linear` and `nn.LayerNorm` use bias&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;17&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;18&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;    @&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;property&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;19&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; head_size&lt;&#x2F;span&gt;&lt;span&gt;(self) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;20&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Returns the per-head dimension (embedding is split evenly across attention heads).&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;21&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;22&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        return&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.n_embd&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; &#x2F;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.n_head&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;23&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;24&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;class&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; Transformer&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;nn&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;Module&lt;&#x2F;span&gt;&lt;span&gt;):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;25&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;Autoregressive transformer language model.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;26&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;27&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; __init__&lt;&#x2F;span&gt;&lt;span&gt;(self, cfg: Config) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;28&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Initializes the building blocks of the transformer.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;29&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;30&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        super&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;__init__&lt;&#x2F;span&gt;&lt;span&gt;()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;31&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.block_size&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; cfg.block_size&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;32&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;33&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.tok_emb_table&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Embedding(cfg.vocab_size, cfg.n_embd)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;34&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.pos_emb_table&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Embedding(cfg.block_size, cfg.n_embd)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;35&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;36&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.blocks&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Sequential(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt;[Block(cfg)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; for&lt;&#x2F;span&gt;&lt;span&gt; _&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; range&lt;&#x2F;span&gt;&lt;span&gt;(cfg.n_layer)])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;37&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.ln&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.LayerNorm(cfg.n_embd,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; bias&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;cfg.bias)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;38&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.proj&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Linear(cfg.n_embd, cfg.vocab_size)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;39&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;40&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;        # Weight tying: reduces the total number of parameters without degrading accuracy&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;41&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;        # Reference: https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1608.05859&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;42&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.proj.weight&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.tok_emb_table.weight&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;43&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;44&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; forward&lt;&#x2F;span&gt;&lt;span&gt;(self, token_ids: torch.Tensor, targets: torch.Tensor&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; |&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; tuple&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;45&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Computes logits and optional loss.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;46&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span&gt;        B, T&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; token_ids.shape&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;48&lt;&#x2F;span&gt;&lt;span&gt;        tok_emb&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.tok_emb_table(token_ids)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;49&lt;&#x2F;span&gt;&lt;span&gt;        pos_emb&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.pos_emb_table(torch.arange(T,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; device&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;token_ids.device))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;50&lt;&#x2F;span&gt;&lt;span&gt;        emb&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; tok_emb&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span&gt; pos_emb&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;51&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;52&lt;&#x2F;span&gt;&lt;span&gt;        out&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.ln(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.blocks(emb))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;53&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        if&lt;&#x2F;span&gt;&lt;span&gt; targets&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; is&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;54&lt;&#x2F;span&gt;&lt;span&gt;            logits&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.proj(out[:, [&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;], :])&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Project only the last position; [-1] keeps the time dimension&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;55&lt;&#x2F;span&gt;&lt;span&gt;            loss&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;56&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        else&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;57&lt;&#x2F;span&gt;&lt;span&gt;            logits&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.proj(out)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;58&lt;&#x2F;span&gt;&lt;span&gt;            loss&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; F.cross_entropy(logits.view(B&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; *&lt;&#x2F;span&gt;&lt;span&gt; T,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; -&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;), targets.view(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;59&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;60&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        return&lt;&#x2F;span&gt;&lt;span&gt; logits, loss&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;61&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;62&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;    @torch.no_grad&lt;&#x2F;span&gt;&lt;span&gt;()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;63&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; generate&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;64&lt;&#x2F;span&gt;&lt;span&gt;        self,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;65&lt;&#x2F;span&gt;&lt;span&gt;        token_ids: torch.Tensor,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;66&lt;&#x2F;span&gt;&lt;span&gt;        max_new_token_ids:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;67&lt;&#x2F;span&gt;&lt;span&gt;        temperature:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; float&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 0.7&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;68&lt;&#x2F;span&gt;&lt;span&gt;        top_k:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; |&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;69&lt;&#x2F;span&gt;&lt;span&gt;    ) -&amp;gt; torch.Tensor:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;70&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Generates token IDs autoregressively.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;71&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;72&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.eval()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;73&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        for&lt;&#x2F;span&gt;&lt;span&gt; _&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; range&lt;&#x2F;span&gt;&lt;span&gt;(max_new_token_ids):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;74&lt;&#x2F;span&gt;&lt;span&gt;            logits, _&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;(token_ids[:,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; -&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.block_size :])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;75&lt;&#x2F;span&gt;&lt;span&gt;            logits&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; logits[:,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; -&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, :]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; &#x2F;&lt;&#x2F;span&gt;&lt;span&gt; temperature&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;76&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;77&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;            if&lt;&#x2F;span&gt;&lt;span&gt; top_k&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; is not&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;78&lt;&#x2F;span&gt;&lt;span&gt;                k&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; min&lt;&#x2F;span&gt;&lt;span&gt;(top_k, logits.size(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;79&lt;&#x2F;span&gt;&lt;span&gt;                threshold&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.topk(logits, k).values[:, [&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;]]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;80&lt;&#x2F;span&gt;&lt;span&gt;                logits[logits&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; &amp;lt;&lt;&#x2F;span&gt;&lt;span&gt; threshold]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = -&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;float&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;inf&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;81&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;82&lt;&#x2F;span&gt;&lt;span&gt;            probs&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.softmax(logits,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; dim&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;83&lt;&#x2F;span&gt;&lt;span&gt;            next_token&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.multinomial(probs,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; num_samples&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;84&lt;&#x2F;span&gt;&lt;span&gt;            token_ids&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.cat((token_ids, next_token),&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; dim&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;85&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        return&lt;&#x2F;span&gt;&lt;span&gt; token_ids&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let us focus on the &lt;code&gt;forward&lt;&#x2F;code&gt; method, the core inference function that is also used during
generation. After the input text is tokenized, &lt;code&gt;tok_emb&lt;&#x2F;code&gt; looks up a learned embedding for each
token, representing that token&#x27;s meaning as a numerical tensor. However, these embeddings do not
encode token order. To incorporate positional information, we use learned positional embeddings
computed in &lt;code&gt;pos_emb&lt;&#x2F;code&gt; rather than &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;kazemnejad.com&#x2F;blog&#x2F;transformer_architecture_positional_encoding&#x2F;&quot;&gt;fixed sinusoidal encodings&lt;&#x2F;a&gt;,
as learned embeddings are more expressive and can adapt to task-specific positional patterns. The
token and positional embeddings are then combined through simple addition to form a unified
representation &lt;code&gt;emb&lt;&#x2F;code&gt; that encodes both meaning and position. This additive approach is sufficient:
it breaks permutation symmetry and allows the attention mechanism to infer and model positional
structure without requiring more complex operations.&lt;&#x2F;p&gt;
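&lt;p&gt;The token-plus-position addition can be sketched in isolation. The sizes below are toy values
chosen for illustration, not the article&#x27;s &lt;code&gt;Config&lt;&#x2F;code&gt; defaults; note how broadcasting adds the
&lt;code&gt;(T, n_embd)&lt;&#x2F;code&gt; positional table to every batch row of the &lt;code&gt;(B, T, n_embd)&lt;&#x2F;code&gt; token embeddings.&lt;&#x2F;p&gt;

```python
import torch
import torch.nn as nn

# Toy sizes for illustration only (not the article's Config values)
vocab_size, block_size, n_embd = 100, 8, 16

tok_emb_table = nn.Embedding(vocab_size, n_embd)  # what each token means
pos_emb_table = nn.Embedding(block_size, n_embd)  # where each token sits

token_ids = torch.randint(0, vocab_size, (2, block_size))        # (B, T)
tok_emb = tok_emb_table(token_ids)                               # (B, T, n_embd)
pos_emb = pos_emb_table(torch.arange(block_size))                # (T, n_embd)
emb = tok_emb + pos_emb  # broadcasting: position added to every batch row
print(emb.shape)  # torch.Size([2, 8, 16])
```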
&lt;blockquote&gt;
&lt;p&gt;Learned positional embeddings are simple to implement and work well in practice, but they
typically tie a model to the maximum sequence length used during training. Many modern
architectures instead adopt &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2104.09864&quot;&gt;Rotary Position Embedding (RoPE)&lt;&#x2F;a&gt;, which encodes position by
rotating query and key vectors with position-dependent angles. This design allows attention to
represent relative distances between tokens and often extrapolates more gracefully to longer
contexts. For simplicity, however, this article uses learned positional embeddings.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
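&lt;p&gt;To make the contrast concrete, here is a minimal, illustrative sketch of the rotation at the
heart of RoPE; it is not part of this article&#x27;s model, and real implementations interleave
channel pairs and cache the angles. Each pair of channels is rotated by an angle proportional to
the token&#x27;s position, so rotations are orthogonal (norms are preserved) and position 0 is left
unchanged.&lt;&#x2F;p&gt;

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotates channel pairs of (B, T, D) by position-dependent angles (sketch)."""
    _, T, D = x.shape
    assert D % 2 == 0, "RoPE pairs up channels, so D must be even"
    half = D // 2
    freqs = base ** (-torch.arange(half) / half)         # (half,) per-pair frequencies
    angles = torch.arange(T)[:, None] * freqs[None, :]   # (T, half) position * frequency
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation applied to each (x1_i, x2_i) channel pair
    return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)

x = torch.randn(2, 6, 8)
y = rope(x)
# Rotation is orthogonal: per-token norms are unchanged, and position 0 is the identity
print(torch.allclose(y[:, 0], x[:, 0], atol=1e-6))  # True
```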
&lt;p&gt;Although the line &lt;code&gt;out = self.ln(self.blocks(emb))&lt;&#x2F;code&gt; appears compact, it encapsulates a substantial
portion of the transformer&#x27;s computational core. Here, &lt;code&gt;self.blocks&lt;&#x2F;code&gt; represents a stack of
transformer blocks, each composed of Multi-Head Attention (MHA) mechanisms and Multilayer
Perceptrons (MLPs) that progressively refine the token embeddings by modeling complex semantic and
contextual relationships across the sequence. Following these deep transformations, &lt;code&gt;self.ln&lt;&#x2F;code&gt;
applies &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;docs.pytorch.org&#x2F;docs&#x2F;stable&#x2F;generated&#x2F;torch.nn.LayerNorm.html&quot;&gt;layer normalization&lt;&#x2F;a&gt; to stabilize the network and ensure well-behaved gradients.
From there, the forward pass branches depending on the objective: during inference (when &lt;code&gt;targets&lt;&#x2F;code&gt;
are omitted), the model efficiently isolates the last token&#x27;s representation before projecting it
into vocabulary-sized logits, since predicting the next word only requires this final aggregated
context. Conversely, during training, the entire sequence is projected and the resulting logits and
targets are flattened to combine the batch and time dimensions, satisfying PyTorch&#x27;s
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;docs.pytorch.org&#x2F;docs&#x2F;stable&#x2F;generated&#x2F;torch.nn.functional.cross_entropy.html&quot;&gt;&lt;code&gt;F.cross_entropy&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; loss requirements.&lt;&#x2F;p&gt;
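&lt;p&gt;The flattening step in the training branch can be checked with toy shapes (the sizes here are
illustrative, not the article&#x27;s defaults): &lt;code&gt;F.cross_entropy&lt;&#x2F;code&gt; expects &lt;code&gt;(N, C)&lt;&#x2F;code&gt; logits and &lt;code&gt;(N,)&lt;&#x2F;code&gt;
class indices, so the batch and time dimensions are merged before computing the loss.&lt;&#x2F;p&gt;

```python
import torch
import torch.nn.functional as F

B, T, V = 2, 4, 10  # toy batch, time, and vocabulary sizes
logits = torch.randn(B, T, V)            # per-position vocabulary scores
targets = torch.randint(0, V, (B, T))    # next-token IDs for each position

# Merge batch and time: (B, T, V) -> (B*T, V) and (B, T) -> (B*T,)
loss = F.cross_entropy(logits.view(B * T, -1), targets.view(-1))
print(loss.shape)  # torch.Size([]) -- a scalar averaged over all B*T positions
```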
&lt;p&gt;While these final routing steps handle output formatting and the training objective, they are not
where the model&#x27;s main representational power resides. That capability comes from the repeated
attention and feed-forward layers inside &lt;code&gt;self.blocks&lt;&#x2F;code&gt;. To understand how the model builds
contextual meaning across a sequence, we will next unpack &lt;code&gt;self.blocks(emb)&lt;&#x2F;code&gt; and examine the &lt;code&gt;Block&lt;&#x2F;code&gt;
class along with its core components, &lt;code&gt;AttentionHead&lt;&#x2F;code&gt; and &lt;code&gt;MultiHeadAttention&lt;&#x2F;code&gt;, to see how they
interact under the hood.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;class&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; AttentionHead&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;nn&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;Module&lt;&#x2F;span&gt;&lt;span&gt;):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;A single causal self-attention head.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 4&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; __init__&lt;&#x2F;span&gt;&lt;span&gt;(self, cfg: Config) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 5&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Initializes QKV projection and dropout, and caches the causal mask to avoid recomputing it.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 6&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 7&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        super&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;__init__&lt;&#x2F;span&gt;&lt;span&gt;()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 8&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.qkv&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Linear(cfg.n_embd,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; *&lt;&#x2F;span&gt;&lt;span&gt; cfg.head_size,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; bias&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;cfg.bias)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 9&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.dropout&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Dropout(cfg.dropout)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.register_buffer(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;mask&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, torch.tril(torch.ones(cfg.block_size, cfg.block_size)))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;11&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;12&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; forward&lt;&#x2F;span&gt;&lt;span&gt;(self, x: torch.Tensor) -&amp;gt; torch.Tensor:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;13&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Computes masked single-head self-attention for the input tensor.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;14&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;15&lt;&#x2F;span&gt;&lt;span&gt;        _, T, D&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; x.shape&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;        q, k, v&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.qkv(x).split(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.qkv.out_features&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; &#x2F;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; dim&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;17&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;18&lt;&#x2F;span&gt;&lt;span&gt;        attn_scores&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; q&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; @&lt;&#x2F;span&gt;&lt;span&gt; k.transpose(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; -&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;19&lt;&#x2F;span&gt;&lt;span&gt;        attn_scores&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; attn_scores&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; *&lt;&#x2F;span&gt;&lt;span&gt; D&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;**-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0.5&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Prevent softmax from blowing up&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;20&lt;&#x2F;span&gt;&lt;span&gt;        attn_scores&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; attn_scores.masked_fill(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.mask[:T, :T]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; ==&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 0&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; float&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;-inf&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;))&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Mask future tokens&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;21&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;22&lt;&#x2F;span&gt;&lt;span&gt;        attn_weights&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.softmax(attn_scores,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; dim&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;23&lt;&#x2F;span&gt;&lt;span&gt;        attn_weights&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.dropout(attn_weights)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;24&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        return&lt;&#x2F;span&gt;&lt;span&gt; attn_weights&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; @&lt;&#x2F;span&gt;&lt;span&gt; v&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;25&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;26&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;class&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; MultiHeadAttention&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;nn&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;Module&lt;&#x2F;span&gt;&lt;span&gt;):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;27&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;Multi-head attention (MHA).&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;28&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;29&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; __init__&lt;&#x2F;span&gt;&lt;span&gt;(self, cfg: Config) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;30&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Initializes multi-head self-attention with output projection and dropout.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;31&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        super&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;__init__&lt;&#x2F;span&gt;&lt;span&gt;()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;33&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.heads&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.ModuleList([AttentionHead(cfg)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; for&lt;&#x2F;span&gt;&lt;span&gt; _&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; range&lt;&#x2F;span&gt;&lt;span&gt;(cfg.n_head)])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;34&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.proj&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Linear(cfg.n_embd, cfg.n_embd)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;35&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.dropout&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Dropout(cfg.dropout)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;36&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;37&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; forward&lt;&#x2F;span&gt;&lt;span&gt;(self, x: torch.Tensor) -&amp;gt; torch.Tensor:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;38&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Computes masked multi-head self-attention for the input tensor.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;39&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;40&lt;&#x2F;span&gt;&lt;span&gt;        out&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.cat([head(x)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; for&lt;&#x2F;span&gt;&lt;span&gt; head&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.heads],&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; dim&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;41&lt;&#x2F;span&gt;&lt;span&gt;        out&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.dropout(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.proj(out))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;42&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        return&lt;&#x2F;span&gt;&lt;span&gt; out&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;43&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;44&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;class&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; Block&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;nn&lt;&#x2F;span&gt;&lt;span&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;Module&lt;&#x2F;span&gt;&lt;span&gt;):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;45&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;Transformer block.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;46&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; __init__&lt;&#x2F;span&gt;&lt;span&gt;(self, cfg: Config) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;48&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Initializes a transformer block.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;49&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;50&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        super&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;__init__&lt;&#x2F;span&gt;&lt;span&gt;()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;51&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.ln1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.LayerNorm(cfg.n_embd,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; bias&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;cfg.bias)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;52&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.mha&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; MultiHeadAttention(cfg)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;53&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.ln2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.LayerNorm(cfg.n_embd,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; bias&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;cfg.bias)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;54&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        self&lt;&#x2F;span&gt;&lt;span&gt;.mlp&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; nn.Sequential(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;55&lt;&#x2F;span&gt;&lt;span&gt;            nn.Linear(cfg.n_embd,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; *&lt;&#x2F;span&gt;&lt;span&gt; cfg.n_embd,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; bias&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;cfg.bias),&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Flexible: can change to e.g. `4 * cfg.n_embd`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;56&lt;&#x2F;span&gt;&lt;span&gt;            nn.GELU(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;57&lt;&#x2F;span&gt;&lt;span&gt;            nn.Linear(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; *&lt;&#x2F;span&gt;&lt;span&gt; cfg.n_embd, cfg.n_embd,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; bias&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;cfg.bias),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;58&lt;&#x2F;span&gt;&lt;span&gt;            nn.Dropout(cfg.dropout),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;59&lt;&#x2F;span&gt;&lt;span&gt;        )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;60&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;61&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; forward&lt;&#x2F;span&gt;&lt;span&gt;(self, x: torch.Tensor) -&amp;gt; torch.Tensor:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;62&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;        &amp;quot;&amp;quot;&amp;quot;Performs a forward pass: attention + MLP with residuals and layer norms.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;63&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;64&lt;&#x2F;span&gt;&lt;span&gt;        x&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; x&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.mha(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.ln1(x))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;65&lt;&#x2F;span&gt;&lt;span&gt;        x&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; x&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; self&lt;&#x2F;span&gt;&lt;span&gt;.mlp(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;self&lt;&#x2F;span&gt;&lt;span&gt;.ln2(x))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;66&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        return&lt;&#x2F;span&gt;&lt;span&gt; x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;code&gt;AttentionHead&lt;&#x2F;code&gt; class implements a single causal self-attention head. Attention, introduced in
the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1706.03762&quot;&gt;Attention Is All You Need&lt;&#x2F;a&gt; paper, is a directional communication mechanism that
allows tokens in a sequence to exchange information, with each token gathering context from the
others. Each token produces three tensors:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Query: Like asking a librarian a question such as &quot;Which books talk about dinosaurs?&quot; It
represents what someone is searching for and is used to scan the catalog to find relevant matches.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Key: Like the labels or table of contents entries in the library catalog. Each key describes what
a book or page is about and acts as metadata that helps the query determine which sources are
relevant.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Value: Like the actual pages inside the book. After the query matches with the keys, it retrieves
the real information or content stored in the values.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The interaction between queries and keys determines how strongly tokens attend to one another, while
values carry the aggregated content. In an autoregressive transformer, causal self-attention
restricts each token to attend only to previous tokens, preventing future information leakage during
training and inference. Furthermore, the queries, keys, and values all originate from the same input
sequence. Formally, for sequence length \(n\) and head dimension \(d\), the query, key, and value
matrices \(Q, K, V \in \mathbb{R}^{n \times d}\) are obtained by linear projections of the input,
and attention is computed as:&lt;&#x2F;p&gt;
&lt;p&gt;$$\text{Attention}(Q, K, V) = \text{softmax}\big(\frac{Q K^\top}{\sqrt{d}}\big) V$$&lt;&#x2F;p&gt;
&lt;p&gt;where \(QK^\top &#x2F; \sqrt{d}\) computes scaled similarity scores between queries and keys, with the
scaling factor preventing softmax saturation. The softmax converts these scores into attention
weights, which in our library analogy represent the relative importance of each page in providing
the answer. The weighted sum of \(V\) then produces an output that emphasizes the most relevant
information, similar to extracting summarized notes from the most useful pages. In the
implementation, attention is computed in the &lt;code&gt;forward&lt;&#x2F;code&gt; method of the &lt;code&gt;AttentionHead&lt;&#x2F;code&gt; class.&lt;&#x2F;p&gt;
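&lt;p&gt;To see the formula end to end, here is a minimal standalone sketch of causal scaled dot-product
attention; the function name and tensor sizes are illustrative, not part of the model code above:&lt;&#x2F;p&gt;

```python
import math

import torch

def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Single-head causal attention over (T, d) query, key, and value tensors."""
    T, d = q.shape
    scores = q @ k.T / math.sqrt(d)                        # Scaled similarity scores, (T, T)
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # Lower-triangular causal mask
    scores = scores.masked_fill(~mask, float("-inf"))      # Block attention to future tokens
    weights = torch.softmax(scores, dim=-1)                # Each row sums to 1
    return weights @ v                                     # Weighted sum of values

torch.manual_seed(0)
x = torch.randn(5, 8)            # 5 tokens, 8 dimensions; queries, keys, values share the input
out = causal_attention(x, x, x)
print(out.shape)                 # torch.Size([5, 8])
```

&lt;p&gt;Because the first token can attend only to itself, its output row is exactly its own value
vector.&lt;&#x2F;p&gt;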
&lt;p&gt;In &lt;code&gt;MultiHeadAttention&lt;&#x2F;code&gt;, we implement MHA. Conceptually, MHA consists of several parallel
&lt;code&gt;AttentionHead&lt;&#x2F;code&gt; modules whose outputs are concatenated and passed through a final linear projection.
This design allows the model to attend to multiple aspects of the input simultaneously, with each
head learning distinct relational patterns such as syntactic structure, long-range dependencies, and
subtle semantic cues, which enrich the overall representation. We then apply dropout to the
projected output to regularize the MHA module during training. Without this regularization,
different heads may co-adapt and learn redundant patterns, and the projection layer after
concatenation can become overly confident in certain features. Since transformers have large
capacity, this helps reduce the risk of overfitting and encourages the model to learn more diverse
and robust representations.&lt;&#x2F;p&gt;
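&lt;p&gt;As a quick shape check of the concatenate-then-project design, the following sketch uses plain
&lt;code&gt;nn.Linear&lt;&#x2F;code&gt; layers as stand-ins for the attention heads; the sizes are illustrative:&lt;&#x2F;p&gt;

```python
import torch
import torch.nn as nn

n_embd, n_head, T = 64, 4, 10    # Illustrative sizes
head_size = n_embd // n_head     # Each head produces a 16-dimensional slice

# Stand-ins for AttentionHead: each maps (T, n_embd) -> (T, head_size)
heads = [nn.Linear(n_embd, head_size, bias=False) for _ in range(n_head)]
proj = nn.Linear(n_embd, n_embd)  # Final output projection

x = torch.randn(T, n_embd)
out = torch.cat([h(x) for h in heads], dim=-1)  # Concatenation restores width n_embd
out = proj(out)
print(out.shape)                 # torch.Size([10, 64])
```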
&lt;p&gt;The &lt;code&gt;Block&lt;&#x2F;code&gt; class implements a single transformer block, which serves as a core
building unit of the model. The first layer normalization standardizes the input to stabilize
training, and the MHA module allows the block to focus on multiple relationships across the
sequence, with the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1512.03385&quot;&gt;residual connection&lt;&#x2F;a&gt; ensuring that the original information is
preserved. The second layer normalization prepares the data for the feed-forward network, which
expands and transforms each position independently to capture higher-level features, while the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1606.08415&quot;&gt;GELU&lt;&#x2F;a&gt; activation introduces nonlinearity and the dropout regularizes the output. The second
residual connection adds the transformed features back to the input, helping gradients flow more
effectively and enabling the block to learn complex patterns without losing essential information.
With this, we now understand the line &lt;code&gt;out = self.ln(self.blocks(emb))&lt;&#x2F;code&gt; and have everything we need
to train a GPT-like model.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, let us discuss the &lt;code&gt;generate&lt;&#x2F;code&gt; method. This method performs autoregressive text generation
using a transformer-style language model. Starting from an initial sequence of &lt;code&gt;token_ids&lt;&#x2F;code&gt;, it
generates one token at a time for up to &lt;code&gt;max_new_token_ids&lt;&#x2F;code&gt; steps. At each iteration, the model is
given only the most recent &lt;code&gt;block_size&lt;&#x2F;code&gt; tokens, ensuring the input fits within the model’s context
window. The logits corresponding to the last position are extracted and scaled by the &lt;code&gt;temperature&lt;&#x2F;code&gt;
parameter to control the randomness of the output. Optionally, top-k filtering can be applied to
restrict sampling to the &lt;code&gt;k&lt;&#x2F;code&gt; most likely tokens by masking all others. The filtered logits are then
converted to probabilities using &lt;code&gt;softmax&lt;&#x2F;code&gt;, and the next token is sampled stochastically with
&lt;code&gt;torch.multinomial&lt;&#x2F;code&gt;. This sampled token is appended to the sequence, and the process repeats until
the specified number of tokens has been generated, producing the final extended token sequence.&lt;&#x2F;p&gt;
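&lt;p&gt;The steps above can be condensed into a sampling-loop sketch; the function below is a
hypothetical standalone version, not the actual &lt;code&gt;generate&lt;&#x2F;code&gt; method of the model, and it
assumes the model maps a &lt;code&gt;(1, T)&lt;&#x2F;code&gt; batch of token ids to &lt;code&gt;(1, T, vocab_size)&lt;&#x2F;code&gt;
logits:&lt;&#x2F;p&gt;

```python
import torch

@torch.no_grad()
def generate_sketch(model, token_ids, max_new_token_ids, block_size, temperature=1.0, top_k=None):
    """Autoregressively samples up to max_new_token_ids tokens, one per iteration."""
    for _ in range(max_new_token_ids):
        ctx = token_ids[:, -block_size:]             # Crop to the context window
        logits = model(ctx)[:, -1, :] / temperature  # Last position, scaled by temperature
        if top_k is not None:
            kth = torch.topk(logits, top_k).values[:, -1:]
            logits = logits.masked_fill(logits < kth, float("-inf"))  # Keep the k most likely
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # Stochastic sampling
        token_ids = torch.cat([token_ids, next_id], dim=1)  # Append and repeat
    return token_ids
```

&lt;p&gt;Lower temperatures sharpen the distribution toward the most likely token, while higher
temperatures flatten it and increase diversity.&lt;&#x2F;p&gt;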
&lt;p&gt;This completes the implementation of the GPT-like transformer! We will now train the model with
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1711.05101&quot;&gt;AdamW&lt;&#x2F;a&gt; and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2502.16982&quot;&gt;Muon&lt;&#x2F;a&gt;, and generate text.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span&gt; urllib.request&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; sample_batch&lt;&#x2F;span&gt;&lt;span&gt;(data: torch.Tensor, batch_size:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span&gt;, block_size:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; tuple&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 4&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;Randomly samples training sequences for Next-Token Prediction (NTP).&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 5&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 6&lt;&#x2F;span&gt;&lt;span&gt;    idxs&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.randint(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;len&lt;&#x2F;span&gt;&lt;span&gt;(data)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; -&lt;&#x2F;span&gt;&lt;span&gt; block_size, (batch_size,))&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;                    # Random starting positions&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 7&lt;&#x2F;span&gt;&lt;span&gt;    token_ids&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.stack([data[idx : idx&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span&gt; block_size]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; for&lt;&#x2F;span&gt;&lt;span&gt; idx&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span&gt; idxs])&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;        # Input sequences&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 8&lt;&#x2F;span&gt;&lt;span&gt;    targets&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.stack([data[idx&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span&gt; : idx&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span&gt; block_size&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; for&lt;&#x2F;span&gt;&lt;span&gt; idx&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span&gt; idxs])&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Same sequences shifted by +1 (NTP)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt; 9&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    return&lt;&#x2F;span&gt;&lt;span&gt; token_ids, targets&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;10&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;11&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;with&lt;&#x2F;span&gt;&lt;span&gt; urllib.request.urlopen(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;https:&#x2F;&#x2F;www.gutenberg.org&#x2F;cache&#x2F;epub&#x2F;84&#x2F;pg84.txt&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; as&lt;&#x2F;span&gt;&lt;span&gt; f:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Read Frankenstein&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;12&lt;&#x2F;span&gt;&lt;span&gt;    text&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; f.read().decode(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;utf-8&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;13&lt;&#x2F;span&gt;&lt;span&gt;tokenizer&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; CharacterLevelTokenizer(text)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;14&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;15&lt;&#x2F;span&gt;&lt;span&gt;device&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.device(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;cuda:0&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; if&lt;&#x2F;span&gt;&lt;span&gt; torch.cuda.is_available()&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; else&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;mps&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; if&lt;&#x2F;span&gt;&lt;span&gt; torch.backends.mps.is_available()&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; else&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;cpu&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;cfg&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; Config(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt;vocab_size&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;tokenizer.vocab_size,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; block_size&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; n_layer&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; n_head&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; n_embd&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;64&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; dropout&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0.0&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; bias&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;False&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;17&lt;&#x2F;span&gt;&lt;span&gt;max_steps, log_interval, batch_size, learning_rate&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 2_000&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 10&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 256&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1e-3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;18&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;19&lt;&#x2F;span&gt;&lt;span&gt;data&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.tensor(tokenizer.encode(text),&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; device&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;device)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;20&lt;&#x2F;span&gt;&lt;span&gt;model&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; Transformer(cfg).to(device).train()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;21&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;22&lt;&#x2F;span&gt;&lt;span&gt;n_params, adamw_params, muon_params&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 0&lt;&#x2F;span&gt;&lt;span&gt;, [], []&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;23&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;for&lt;&#x2F;span&gt;&lt;span&gt; param&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span&gt; model.parameters():&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;24&lt;&#x2F;span&gt;&lt;span&gt;    n_params&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +=&lt;&#x2F;span&gt;&lt;span&gt; param.numel()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;25&lt;&#x2F;span&gt;&lt;span&gt;    (adamw_params&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; if&lt;&#x2F;span&gt;&lt;span&gt; param.ndim&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; &amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; else&lt;&#x2F;span&gt;&lt;span&gt; muon_params).append(param)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;26&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;print&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;f&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;Model parameters: &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span&gt;n_params&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}\n&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;27&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;28&lt;&#x2F;span&gt;&lt;span&gt;adamw&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.optim.AdamW(adamw_params,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; lr&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;3e-4&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; betas&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0.9&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 0.95&lt;&#x2F;span&gt;&lt;span&gt;),&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; eps&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1e-8&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; weight_decay&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0.01&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;29&lt;&#x2F;span&gt;&lt;span&gt;muon&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.optim.Muon(muon_params,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; lr&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0.02&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; momentum&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0.95&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; weight_decay&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0.1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;30&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;for&lt;&#x2F;span&gt;&lt;span&gt; step&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; range&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, max_steps&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span&gt;):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;31&lt;&#x2F;span&gt;&lt;span&gt;    token_ids, targets&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; sample_batch(data, batch_size, cfg.block_size)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;32&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;33&lt;&#x2F;span&gt;&lt;span&gt;    adamw.zero_grad()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;34&lt;&#x2F;span&gt;&lt;span&gt;    muon.zero_grad()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;35&lt;&#x2F;span&gt;&lt;span&gt;    _, loss&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; model(token_ids, targets)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;36&lt;&#x2F;span&gt;&lt;span&gt;    loss.backward()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;37&lt;&#x2F;span&gt;&lt;span&gt;    adamw.step()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;38&lt;&#x2F;span&gt;&lt;span&gt;    muon.step()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;39&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;40&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    if&lt;&#x2F;span&gt;&lt;span&gt; step&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; %&lt;&#x2F;span&gt;&lt;span&gt; log_interval&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; ==&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 0&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;41&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;        print&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;f&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;\r&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Step &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span&gt;step&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span&gt;max_steps&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; (&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span&gt;step&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; &#x2F;&lt;&#x2F;span&gt;&lt;span&gt; max_steps&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:.2%&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;) | Loss: &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span&gt;loss.item()&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:.4f&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; end&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: 
#9ECBFF;&quot;&gt;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; flush&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;True&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;42&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;43&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;# Seed the model with the prompt &amp;quot;I am&amp;quot; to kick off text generation&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;44&lt;&#x2F;span&gt;&lt;span&gt;token_ids&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; torch.tensor(tokenizer.encode(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;I am&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;),&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; device&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;device).reshape(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; -&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;45&lt;&#x2F;span&gt;&lt;span&gt;output&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; model.generate(token_ids,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; max_new_token_ids&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;512&lt;&#x2F;span&gt;&lt;span&gt;)[&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span aria-hidden=&quot;true&quot; class=&quot;giallo-ln&quot; style=&quot;color: #444D56;&quot;&gt;46&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;print&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;f&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;\n\n&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;OUTPUT:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;\n\n{&lt;&#x2F;span&gt;&lt;span&gt;tokenizer.decode(output.tolist())&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Training logs from under 40 seconds of training on a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;MacBook_Pro_(Apple_silicon)&quot;&gt;MacBook Pro&lt;&#x2F;a&gt; with an &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Apple_M4&quot;&gt;Apple M4
Pro&lt;&#x2F;a&gt; chip:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Model parameters: 140,127&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Step 2000&#x2F;2000 (100.00%) | Loss: 1.3831&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;OUTPUT:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;I am the forth which were on his stranger. The light of many our reasun alive in the enter of my&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cottagers; there in the murderer, when I walked in&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;my enemy sat much of spring in its removed. I had not for my friend, and the cottager of the lack of&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;the wint on the lovely of cestant, and that the old man, but I was entered that&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;the eviner of my hands. My journey is many beauty will of compassins of the more spully appeared&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;with meaning a man work was not rease to be macking my own and a&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;gentle understand even&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;</description>
      </item>
      <item>
          <title>COVID-19 and Transition Into Full-Time Summer Work</title>
          <pubDate>Fri, 22 May 2020 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://oniani.org/blog/covid-19-and-transition-into-full-time-summer-work/</link>
          <guid>https://oniani.org/blog/covid-19-and-transition-into-full-time-summer-work/</guid>
          <description xml:base="https://oniani.org/blog/covid-19-and-transition-into-full-time-summer-work/">&lt;p&gt;A lot of things have happened in the last few months. The situation with COVID-19 has undoubtedly
posed new challenges for many of us. Schools have switched to online learning, and I am now working
fully remotely.&lt;&#x2F;p&gt;
&lt;p&gt;While what happened is certainly not favorable by any means, I do think that it teaches us a lot of
essential and transferable skills. The ability to deal with high levels of stress and move on
despite the difficulties is one of the most important skills one could acquire. It is also
interesting to think about the pandemic from a technological perspective. The pandemic would have
hit a lot harder had it happened twenty or even ten years ago. Now that pretty much everyone owns a
computer and has access to the internet, it is manageable to maintain a social life. In other words,
what we have been practicing is really just physical distancing, not necessarily social distancing.&lt;&#x2F;p&gt;
&lt;figure&gt;
  &lt;img src=&quot;data_visualization_tool.png&quot; alt=&quot;Data visualization tool&quot; width=&quot;512px&quot;&gt;
  &lt;figcaption&gt;Data visualization tool&lt;&#x2F;figcaption&gt;
&lt;&#x2F;figure&gt;
&lt;p&gt;In regard to my internship, as the pandemic began its rapid spread, some of our projects were
postponed and we started working on projects related to COVID-19 right away. For the past
few months, we have been fully dedicated to COVID-19-related research and development. In the
first few days, Dr. Yanshan Wang and I automated the COVID-19 screening process for Mayo nurses and
physicians, which ultimately saved a lot of precious time for medical personnel. Additionally, I
have been working on two major projects.&lt;&#x2F;p&gt;
&lt;p&gt;The first project I worked on (with Dr. Feichen Shen) leveraged AI-driven graph mining
techniques to assist COVID-19 knowledge discovery, with potential value for innovative
COVID-19 drug discovery. We also built a user-friendly
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;oniani.org&#x2F;covid-19-network&quot;&gt;web-based tool&lt;&#x2F;a&gt; to visualize the data as well as support
link&#x2F;relation prediction. The project yielded promising results, and
our manuscript was accepted by the Journal of the American Medical Informatics Association (JAMIA), the
most prestigious journal in the domain of medical informatics.&lt;&#x2F;p&gt;
&lt;p&gt;The other project (with Dr. Yanshan Wang) revolved around building a chatbot that answers questions
related to COVID-19. Additionally, we have been testing a number of state-of-the-art models and
embedding-generation techniques for performance comparison. We have almost finished drafting the
paper and will soon submit it to one of the top biomedical informatics conferences.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, my mentors have expressed enthusiastic support for extending my internship into full-time
summer work, so I am planning to stay here over the summer.&lt;&#x2F;p&gt;
&lt;p&gt;We will beat this pandemic and come out stronger than ever before!&lt;&#x2F;p&gt;
&lt;p&gt;Stay safe and stay strong!&lt;&#x2F;p&gt;
&lt;p&gt;The original blogpost is located &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www2.luther.edu&#x2F;long-term-blogs&#x2F;rochester&#x2F;?story_id=911867&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Mayo Clinic - First Impressions</title>
          <pubDate>Tue, 24 Mar 2020 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://oniani.org/blog/mayo-clinic-first-impressions/</link>
          <guid>https://oniani.org/blog/mayo-clinic-first-impressions/</guid>
<description xml:base="https://oniani.org/blog/mayo-clinic-first-impressions/">&lt;p&gt;The first few weeks were full of orientations, training sessions, mentor and supervisor meetings, and
project discussions. I got acquainted with some of the internal processes at Mayo, the general
employee workflow, and my workflow as a research and development intern. To maintain the quality
associated with the Mayo Clinic tradition, the clinic has a well-defined corporate culture. The work
environment is awesome. I am surrounded by really smart individuals who are passionate and
enthusiastic about the work that they do. I have learned a lot about both the history and the core
values of the clinic. As a side note, there is a wonderful documentary by Ken Burns that offers a
fascinating walk through the rich history of Mayo - it is titled The Mayo Clinic: Faith, Hope, Science.&lt;&#x2F;p&gt;
&lt;figure&gt;
  &lt;img src=&quot;mayo_clinic_heritage_hall.png&quot; alt=&quot;Mayo Clinic Heritage Hall&quot; width=&quot;512px&quot;&gt;
  &lt;figcaption&gt;Mayo Clinic Heritage Hall&lt;&#x2F;figcaption&gt;
&lt;&#x2F;figure&gt;
&lt;p&gt;My supervisor, Stephanie, has been very caring and supportive. Her comments and bits of advice have
been helpful and insightful, and she also does an excellent job of finding the right connections for
me. My mentors, Drs. Feichen Shen and Yanshan Wang, have been great learning resources and have
already made me familiar with several new artificial intelligence and informatics techniques. Per my
request, they got me engaged in two research projects. Since the projects involve state-of-the-art
approaches, I cannot disclose any further details at this time, but I hope to be able to reveal more
in the future.&lt;&#x2F;p&gt;
&lt;p&gt;It is surprising how many of the research and development skills acquired at Luther are directly
applicable to my current work. I am thankful to professors Richard Merritt, Alan Zaring, and Roman
Yasinovskyy for their ardent support and help in building these skills. If you are a Luther student
reading this blog post, go find out about the exciting summer opportunities at Luther and do not
miss out on a wonderful journey called research!&lt;&#x2F;p&gt;
&lt;p&gt;I am grateful for the opportunities that the program has provided to the students and am looking
forward to the new adventures to come!&lt;&#x2F;p&gt;
&lt;p&gt;The original blogpost is located &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www2.luther.edu&#x2F;long-term-blogs&#x2F;rochester&#x2F;?story_id=905337&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Privacy for the Web</title>
          <pubDate>Fri, 03 Jan 2020 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://oniani.org/blog/privacy-for-the-web/</link>
          <guid>https://oniani.org/blog/privacy-for-the-web/</guid>
<description xml:base="https://oniani.org/blog/privacy-for-the-web/">&lt;p&gt;Just wanted to make a few suggestions on improving web privacy:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Lightweight Privacy (emphasis on usability and speed)
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.mozilla.org&#x2F;en-US&#x2F;firefox&#x2F;new&#x2F;&quot;&gt;Firefox&lt;&#x2F;a&gt; as the default browser
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bitwarden.com&#x2F;&quot;&gt;Bitwarden&lt;&#x2F;a&gt; for password management&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;ublockorigin.com&#x2F;&quot;&gt;uBlock Origin&lt;&#x2F;a&gt; to block ads (and other junk)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;arkenfox&#x2F;user.js&quot;&gt;arkenfox user.js&lt;&#x2F;a&gt; for extra security&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;signal.org&#x2F;&quot;&gt;Signal&lt;&#x2F;a&gt; as a messenger&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;element.io&#x2F;about&quot;&gt;Element&lt;&#x2F;a&gt; for group chats and collaborations&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;protonmail.com&#x2F;&quot;&gt;Protonmail&lt;&#x2F;a&gt; for the email service&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Heavier Privacy (emphasis on privacy)
&lt;ul&gt;
&lt;li&gt;Use &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.torproject.org&#x2F;&quot;&gt;Tor&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Super Advanced Privacy (stronger emphasis on privacy)
&lt;ul&gt;
&lt;li&gt;Do not use the web&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</description>
      </item>
      <item>
          <title>The Essence of Programming - Functional Approach</title>
          <pubDate>Sun, 25 Nov 2018 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://oniani.org/blog/the-essence-of-programming-functional-approach/</link>
          <guid>https://oniani.org/blog/the-essence-of-programming-functional-approach/</guid>
          <description xml:base="https://oniani.org/blog/the-essence-of-programming-functional-approach/">&lt;p&gt;This blogpost is a general overview of a rather underappreciated programming methodology called
functional programming. Throughout the blogpost, I will occasionally use the purely functional
programming language Haskell as well as the imperative-style programming language Python. I will
assume knowledge of basic programming concepts such as variable assignment, arithmetic
operations, conditionals, functions, loops, and recursion.&lt;&#x2F;p&gt;
&lt;p&gt;It is important to note that the blogpost is just an introduction to the paradigms in functional
programming and does not cover any of them in great detail.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;what-is-functional-programming&quot;&gt;What is Functional Programming?&lt;&#x2F;h3&gt;
&lt;p&gt;As mentioned above, functional programming is just an approach to programming. Particularly, it
refers to programming using functions, hence the name &lt;strong&gt;functional programming&lt;&#x2F;strong&gt;. To better
understand what it means for a programming language to be functional, let&#x27;s make a short
side-by-side comparison of the functional and the wildly popular imperative styles of programming
and then discuss the differences in more detail.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Imperative language&lt;&#x2F;th&gt;&lt;th&gt;Functional language&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Classes or structures are the first-class citizens&lt;&#x2F;td&gt;&lt;td&gt;Functions are the first-class citizens&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;State changes are important&lt;&#x2F;td&gt;&lt;td&gt;State changes are limited or non-existent&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Primary control flow: loops and conditionals&lt;&#x2F;td&gt;&lt;td&gt;Primary control flow: function calls and recursion&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
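&lt;p&gt;To make the last row of the table concrete, here is a small sketch in Python (which this post also uses for its imperative examples) of the same computation written in each style; the function names are mine, purely for illustration:&lt;&#x2F;p&gt;

```python
# Imperative style: a loop mutates an accumulator variable
def sum_imperative(xs):
    total = 0
    for x in xs:
        total += x
    return total


# Functional style: no mutation; recursion drives the control flow
def sum_functional(xs):
    if not xs:
        return 0
    return xs[0] + sum_functional(xs[1:])


print(sum_imperative([1, 2, 3, 4]))  # 10
print(sum_functional([1, 2, 3, 4]))  # 10
```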
&lt;h3 id=&quot;classes-vs-functions&quot;&gt;Classes VS Functions&lt;&#x2F;h3&gt;
&lt;p&gt;The first comparison shows that, generally speaking, in imperative languages (e.g., Python, C,
Java), variables (instances of classes or structures) dominate over all other objects. Thus, the
imperative paradigm makes a clear distinction between variables and functions. On the other hand, in
functional programming languages, functions are the first-class citizens making virtually everything
else rank below them.&lt;&#x2F;p&gt;
&lt;p&gt;Imperative programming languages treat variables as data, while functions are generally used just to
manipulate variables or generate data. When programming in a functional language, functions are
treated much like variables. In fact, they are no different from variables,
as they not only manipulate data, but also represent data themselves. Thus, in the
functional world, a piece of code such as a function is also data.&lt;&#x2F;p&gt;
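&lt;p&gt;A quick way to see functions behaving like variables is to store them in a data structure and pass them around, just as we would with numbers. A minimal Python sketch (the names here are mine, for illustration only):&lt;&#x2F;p&gt;

```python
# Functions can be stored in a dict, just like any other values
ops = {
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
}


# Functions can also be passed as arguments to other functions
def apply_twice(f, x):
    return f(f(x))


print(ops["double"](5))               # 10
print(apply_twice(ops["square"], 2))  # 16
```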
&lt;blockquote&gt;
&lt;p&gt;There is a term for a language in which a program can be manipulated as data. This
quality is referred to as &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Homoiconicity&quot;&gt;homoiconicity&lt;&#x2F;a&gt; and such
languages are called homoiconic. One such language is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lisp_(programming_language)&quot;&gt;Lisp&lt;&#x2F;a&gt;. And, based on our discussion
points, it is not very surprising that Lisp is a functional language.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;I will give you a concise proof of why functions are data. Remember the table representations of
functions we learned at some point in elementary school? That&#x27;s the proof! Any function can be
represented as a table of values. For instance, consider a function \( f(x) = 2x \). The following
will be a table representation of the function.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;\( x \)&lt;&#x2F;th&gt;&lt;th&gt;\( f(x) \)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;12&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;...&lt;&#x2F;td&gt;&lt;td&gt;...&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Looks more like data? That&#x27;s because it is data! We have effectively generated a two-column table
where each cell holds a value. And yes, this is very similar to SQL tables and pandas
data frames.&lt;&#x2F;p&gt;
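&lt;p&gt;To make the same point executable, here is a minimal Python sketch (the names &lt;code&gt;f&lt;&#x2F;code&gt; and &lt;code&gt;table&lt;&#x2F;code&gt; are illustrative) that materializes the function above into a lookup table over a finite domain:&lt;&#x2F;p&gt;

```python
# The function f(x) = 2x from the table above
def f(x):
    return 2 * x

# Materialize the function as data: a finite table of input/output pairs
table = {x: f(x) for x in range(7)}

# Over this domain, a table lookup is indistinguishable from a function call
assert all(table[x] == f(x) for x in range(7))
print(table)  # {0: 0, 1: 2, 2: 4, 3: 6, 4: 8, 5: 10, 6: 12}
```

&lt;p&gt;The dictionary is the two-column table in code form: the keys are the \( x \) column and the values are the \( f(x) \) column.&lt;&#x2F;p&gt;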
&lt;h3 id=&quot;natural-outcomes&quot;&gt;Natural Outcomes&lt;&#x2F;h3&gt;
&lt;p&gt;Because functions are so central, there are natural outcomes shared among most functional
languages.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s write a Haskell function to find the factorial of a positive integer.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- | A function to find the factorial of a positive integer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;factorial x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; product [&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;..&lt;&#x2F;span&gt;&lt;span&gt;x]&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- easy as that&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The function builds a list of integers from 1 up to x and then calculates the product of its
elements. This way we effectively get the product \( 1 \times 2 \times 3 \times \dots \times x \),
which is the same as \( x! \). Since we now have a function, we can call it with actual parameters!&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;print (factorial &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- prints out 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;print (factorial &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- prints out 720&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;print (factorial &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- prints out 362880&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As shown above, something that in imperative languages would require importing modules, looping,
etc., is a single line in Haskell. This is one outcome of functions being first-class citizens: most
functional languages ship a rich pool of built-in functions for manipulating data. The example above
also uses an interesting notation, &lt;code&gt;[1..x]&lt;&#x2F;code&gt;, which builds a list of integers from 1 up to
\( x \) (here \( x \) must be an integer with \( x \geq 1 \)). Thus, another outcome is that data
structures and collections can be created very easily, usually in a single line of code, leaving the
programmer more time for functions and logic. These are some of the reasons why functional
languages are so concise.&lt;&#x2F;p&gt;
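&lt;p&gt;For comparison, here is a hedged Python sketch of the same one-liner (idiomatic Python would reach for &lt;code&gt;math.factorial&lt;&#x2F;code&gt;; this version deliberately mirrors Haskell&#x27;s &lt;code&gt;product [1..x]&lt;&#x2F;code&gt;):&lt;&#x2F;p&gt;

```python
from functools import reduce
from operator import mul

def factorial(x):
    # Fold multiplication over 1..x, mirroring Haskell's product [1..x];
    # the initial value 1 makes factorial(0) come out as 1 as well
    return reduce(mul, range(1, x + 1), 1)

print(factorial(1))  # 1
print(factorial(6))  # 720
print(factorial(9))  # 362880
```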
&lt;h3 id=&quot;math-sets-and-haskell&quot;&gt;Math, Sets, and Haskell&lt;&#x2F;h3&gt;
&lt;p&gt;Notice that the factorial function above did not handle the case of \( 0 \) explicitly
(recall that \( 0! = 1 \); in fact, &lt;code&gt;product [1..0]&lt;&#x2F;code&gt; is the product of an empty list,
which Haskell defines as 1, so the function happens to return the right answer anyway). Adding an
explicit base case lets us introduce some new notation and explain the whole function in detail.
Below is a better and more complete version of the function.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- | A function to find the factorial of a number&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;factorial&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; :: Integer -&amp;gt; Integer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;factorial &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;factorial x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; product [&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;..&lt;&#x2F;span&gt;&lt;span&gt;x]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;::&lt;&#x2F;code&gt; - introduces the type signature of a function&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;Integer&lt;&#x2F;code&gt; - an arbitrary-precision integer type that can hold any whole number,
limited only by the machine&#x27;s memory&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;-&amp;gt;&lt;&#x2F;code&gt; - separates the type of each formal parameter from the next; the final type is the type of the output&lt;&#x2F;p&gt;
&lt;p&gt;Looks similar to something you have seen before? If you have taken any undergraduate math class,
there is a chance that you&#x27;ve encountered the following notation:&lt;&#x2F;p&gt;
&lt;p&gt;\[ f : A \rightarrow B : x \mapsto y \]&lt;&#x2F;p&gt;
&lt;p&gt;The notation above describes a simple function that takes an input from set \( A \) and maps it to
the output in the set \( B \).&lt;&#x2F;p&gt;
&lt;p&gt;Here is the complete definition of the factorial function that we saw above:&lt;&#x2F;p&gt;
&lt;p&gt;\[ f : \mathbb{Z}^+ \cup \{ 0 \} \rightarrow \mathbb{Z}^+ : x \mapsto x! \]&lt;&#x2F;p&gt;
&lt;p&gt;Haskell definitions follow a similar fashion.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;factorial :: Integer -&amp;gt; Integer&lt;&#x2F;code&gt; says that &lt;code&gt;factorial&lt;&#x2F;code&gt; is a function that takes an element from the
set of integers and maps it to some other element in the set of integers. Unlike math,
however, Haskell does not use the \( \mapsto \) notation; instead, the mapping is given by the
equations below the type signature.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;factorial &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;factorial x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; product [&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;..&lt;&#x2F;span&gt;&lt;span&gt;x]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The code above is equivalent to saying: if the element is 0, map it to 1; in all other cases,
map it to the product of the integers from 1 up to the element.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;state-changes-and-functional-programming&quot;&gt;State Changes and Functional Programming&lt;&#x2F;h3&gt;
&lt;p&gt;Functional languages have a limited notion of state and typically avoid shared mutable state at
any cost. &lt;strong&gt;Purely&lt;&#x2F;strong&gt; (see &lt;a href=&quot;https:&#x2F;&#x2F;oniani.org&#x2F;blog&#x2F;the-essence-of-programming-functional-approach&#x2F;#purely-functional-languages&quot;&gt;Purely Functional Languages&lt;&#x2F;a&gt;) functional
languages like Haskell go further and avoid implicit state changes altogether. Since there are no
changes in state, there are no mutable variables. Instead, functional languages offer functions and
immutable variables.&lt;&#x2F;p&gt;
&lt;p&gt;To make it clear, let&#x27;s look at two examples below. One is from Python and the other is from
Haskell.&lt;&#x2F;p&gt;
&lt;p&gt;Python example&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;# Define a variable &amp;#39;my_number&amp;#39; and assign it to 3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;my_number&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;# Increment the variable &amp;#39;my_number&amp;#39; by 1 and reassign it to the result&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;my_number&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;print&lt;&#x2F;span&gt;&lt;span&gt;(my_number)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Prints out 4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let&#x27;s repeat the same steps in Haskell.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;myNumber &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;    -- Define a variable &amp;#39;myNumber&amp;#39; and assign it to 3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;myNumber &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;+=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;   -- Rejected at compile time: Haskell has no (+=) and no reassignment&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;print myNumber  &lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- Never runs: the program above does not compile&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Looking at the code above, you might have already noticed that Haskell does not allow changing
the state of the program. Now, you might be wondering how one could increment a variable.&lt;&#x2F;p&gt;
&lt;p&gt;Here is a short answer (note that this works in GHCi, where a new binding shadows the old one;
in a source file, defining &lt;code&gt;myNumber&lt;&#x2F;code&gt; twice would be a multiple-declaration error):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;myNumber &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;                  -- Define a variable &amp;#39;myNumber&amp;#39; and assign it to 3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;myOtherNumber &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; myNumber &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- Define a variable &amp;#39;myOtherNumber&amp;#39; and assign it to &amp;#39;myNumber&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;myNumber &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; myOtherNumber      &lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- Redefine &amp;#39;myNumber&amp;#39; and set it to &amp;#39;myOtherNumber&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;print myNumber                &lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- Prints out 4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;Longer and better answer&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;You do not really need such increments or decrements in functional programming languages. You can
easily overcome this hindrance with functions and recursion: instead of mutating objects, we use
recursion to gradually reach the target.&lt;&#x2F;p&gt;
&lt;p&gt;Here is how one could translate the well-known accumulator pattern from Python to Haskell.&lt;&#x2F;p&gt;
&lt;p&gt;First, the classic Python version:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;# An accumulator pattern approach for finding&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;# the sum of the first 100 positive integers.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;total&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;for&lt;&#x2F;span&gt;&lt;span&gt; integer&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; range&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 101&lt;&#x2F;span&gt;&lt;span&gt;):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    total&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +=&lt;&#x2F;span&gt;&lt;span&gt; integer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;print&lt;&#x2F;span&gt;&lt;span&gt;(total)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Prints out 5050&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here is what it looks like in Haskell:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;accumulator &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;                        -- The base case for the recursion&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;accumulator x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span&gt; accumulator (x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- The recursive case&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;main &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; print (accumulator &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;100&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- Prints out 5050&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In the code excerpt above, we did not use any loops. In fact, we could not have: purely
functional languages do not support loops. Instead, we defined a function, used recursion, and
calculated the sum of the values through function calls.&lt;&#x2F;p&gt;
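&lt;p&gt;To close the circle, here is the Haskell recursion translated back into Python as a sketch. Note that Python does not optimize tail calls, so very deep recursion would hit the interpreter&#x27;s recursion limit:&lt;&#x2F;p&gt;

```python
def accumulator(x):
    # Base case, mirroring: accumulator 1 = 1
    if x == 1:
        return 1
    # Recursive case, mirroring: accumulator x = x + accumulator (x - 1)
    return x + accumulator(x - 1)

print(accumulator(100))  # 5050
```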
&lt;blockquote&gt;
&lt;h4 id=&quot;side-note&quot;&gt;Side Note&lt;&#x2F;h4&gt;
&lt;p&gt;In this particular case, we do not even need to implement the recursive accumulator pattern. All
we need is the predefined &lt;code&gt;sum&lt;&#x2F;code&gt; function and the range notation
we have already seen (&lt;code&gt;[1..x]&lt;&#x2F;code&gt;):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;print (sum [&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;100&lt;&#x2F;span&gt;&lt;span&gt;])&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- Prints out 5050&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;&#x2F;blockquote&gt;
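&lt;p&gt;Python has the same kind of shortcut: as a sketch, &lt;code&gt;range&lt;&#x2F;code&gt; plays the role of &lt;code&gt;[1..100]&lt;&#x2F;code&gt; and the built-in &lt;code&gt;sum&lt;&#x2F;code&gt; replaces the explicit accumulator:&lt;&#x2F;p&gt;

```python
# Python analogue of Haskell's: print (sum [1..100])
# range(1, 101) yields the integers 1 through 100
print(sum(range(1, 101)))  # 5050
```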
&lt;h3 id=&quot;control-flow&quot;&gt;Control Flow&lt;&#x2F;h3&gt;
&lt;p&gt;As we have already seen, there are no for loops or while loops in purely functional programming
languages, and there are good reasons why. Let&#x27;s list a few of them and then elaborate on
each.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Functional languages are declarative.&lt;&#x2F;li&gt;
&lt;li&gt;Most functional languages are heavily influenced by lambda calculus.&lt;&#x2F;li&gt;
&lt;li&gt;If you were to implement a functional programming language,
you would likely get rid of loops yourself.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h4 id=&quot;functional-languages-are-declarative&quot;&gt;Functional Languages Are Declarative&lt;&#x2F;h4&gt;
&lt;p&gt;For those who are new to the idea of declarative languages, let&#x27;s first discuss what it means for a
language to be declarative. Here is a simple definition:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Declarative programming is a method of programming that abstracts away the control flow for logic
required for performing an action, and instead involves stating the task or desired outcome.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Examples of declarative languages include SQL, Haskell, and Prolog.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Example 1&lt;&#x2F;strong&gt;: Consider the SQL query language. In SQL, one does not describe how to get the
data. One just tells SQL what data is needed, and the SQL engine figures out the best way to get it.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Example 2&lt;&#x2F;strong&gt;: A better example might be comparing two implementations of a simple function. Let&#x27;s
implement them in both Python and Haskell.&lt;&#x2F;p&gt;
&lt;p&gt;The function takes a list of integers and returns the sum of odd integers in it.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; odd_sum&lt;&#x2F;span&gt;&lt;span&gt;(list_of_integers):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;Returns the sum of all odd integers in the list.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    total&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    for&lt;&#x2F;span&gt;&lt;span&gt; integer&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; in&lt;&#x2F;span&gt;&lt;span&gt; list_of_integers:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;        if&lt;&#x2F;span&gt;&lt;span&gt; integer&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; %&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; ==&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            total&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +=&lt;&#x2F;span&gt;&lt;span&gt; integer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    return&lt;&#x2F;span&gt;&lt;span&gt; total&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;print&lt;&#x2F;span&gt;&lt;span&gt;(odd_sum([&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 4&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 5&lt;&#x2F;span&gt;&lt;span&gt;]))&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Prints out 9&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let&#x27;s do a quick analysis of the &lt;code&gt;odd_sum&lt;&#x2F;code&gt; function. As seen above, it starts by declaring a
variable &lt;code&gt;total&lt;&#x2F;code&gt;, initially set to 0. Then we iterate over the list, and on each
iteration we check whether the integer is odd; if it is, we add it to &lt;code&gt;total&lt;&#x2F;code&gt;. In the end, we return
the &lt;code&gt;total&lt;&#x2F;code&gt; variable.&lt;&#x2F;p&gt;
&lt;p&gt;Now that we have analyzed the function a bit, notice that in the for loop, on each iteration,
we are giving Python directions on when to add the integer to &lt;code&gt;total&lt;&#x2F;code&gt; (only if it is odd). Thus, &lt;strong&gt;we
tell Python what to do step by step&lt;&#x2F;strong&gt;. This is an important characteristic that distinguishes
non-declarative languages from declarative ones.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s now look at the Haskell example.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;oddSum x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; sum (filter odd x)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;main &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; print (oddSum [&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;])&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  -- Prints out 9&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Notice what we did here. First we defined a function &lt;code&gt;oddSum&lt;&#x2F;code&gt;, which takes a list. Then we used the
predefined function &lt;code&gt;filter&lt;&#x2F;code&gt; in conjunction with another predefined function &lt;code&gt;odd&lt;&#x2F;code&gt;
(which returns true if the value is odd and false otherwise) to get the list of odd integers. Finally, we
summed up all the odd integers and got the result.&lt;&#x2F;p&gt;
&lt;p&gt;See the difference? In Python, we used a for loop and, on each iteration, told Python whether
to add the integer to &lt;code&gt;total&lt;&#x2F;code&gt; or not. In Haskell, however, we gave the whole list to the function
and told it to remove all of the even integers and then sum up the rest (if
you eliminate all the even integers, you are obviously left with only the odd ones). In other
words, in the Haskell example, we do not care how the functions &lt;code&gt;sum&lt;&#x2F;code&gt; and &lt;code&gt;filter&lt;&#x2F;code&gt; work internally;
we only care that they do their job: sum up the odd numbers in the list and return
the value.&lt;&#x2F;p&gt;
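&lt;p&gt;The declarative style is not exclusive to Haskell. As a sketch, Python&#x27;s built-in &lt;code&gt;filter&lt;&#x2F;code&gt; and &lt;code&gt;sum&lt;&#x2F;code&gt; can express the same intent without an explicit loop:&lt;&#x2F;p&gt;

```python
def odd_sum(integers):
    # Declarative version: state *what* we want (the sum of the odd elements),
    # not how to iterate over the list
    return sum(filter(lambda n: n % 2 == 1, integers))

print(odd_sum([1, 2, 3, 4, 5]))  # Prints out 9
```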
&lt;h4 id=&quot;functional-programming-and-lambda-calculus&quot;&gt;Functional Programming And Lambda Calculus&lt;&#x2F;h4&gt;
&lt;p&gt;Lambda calculus (also written as \( \lambda \)-calculus) is a branch of mathematics developed
by &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Alonzo_Church&quot;&gt;Alonzo Church&lt;&#x2F;a&gt; in the 1930s. It is a
formal system for expressing computation and an alternative to the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Turing_machine&quot;&gt;Turing
machine&lt;&#x2F;a&gt;, which was introduced by &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Alan_Turing&quot;&gt;Alan
Turing&lt;&#x2F;a&gt;. Turing machines involve loops and other
non-declarative mechanisms, and they are the inspiration for imperative programming languages like Java
and Python. Turing later proved that everything computable by a Turing machine can equally be
computed in lambda calculus, and vice versa; this equivalence underpins what is now known as the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Church%E2%80%93Turing_thesis&quot;&gt;Church-Turing thesis&lt;&#x2F;a&gt;.
Hence, simply put, lambda calculus has power equivalent to that of Turing machines. Not too long
after, people began basing programming languages on the ideas of lambda calculus (it was just as
powerful as Turing machines, so why not?!). This led to shared characteristics among functional
languages, such as the lack of loops: virtually all functional programming languages have no loops
because lambda calculus has no loops. One could certainly add loops, but they would be redundant;
instead, functional languages use the mathematical idea of recursion. This is part of the reason why
loops are not appreciated in the functional world.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;getting-rid-of-loops&quot;&gt;Getting Rid of Loops&lt;&#x2F;h4&gt;
&lt;p&gt;Despite the fact that they are sometimes very useful, loops do not belong in a purely
functional programming language. There are several reasons for this.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Loops are imperative: they spell out for the language what to do, step by step.&lt;&#x2F;li&gt;
&lt;li&gt;Loops usually involve mutating values, which is, once again, against functional virtues.&lt;&#x2F;li&gt;
&lt;li&gt;Even if we used loops without mutation, they would create unnecessary
redundancy in a language with an emphasis on recursion (which is just as powerful as a
regular loop!).&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h3 id=&quot;purely-functional-languages&quot;&gt;Purely Functional Languages&lt;&#x2F;h3&gt;
&lt;p&gt;You might have seen the word &lt;strong&gt;pure&lt;&#x2F;strong&gt; at the beginning of the blog post, where I mentioned that Haskell is
a &lt;strong&gt;purely&lt;&#x2F;strong&gt; functional programming language. However, I never defined what it means for a functional
language to be pure. So let&#x27;s do it now!&lt;&#x2F;p&gt;
&lt;p&gt;Those who read the &lt;a href=&quot;https:&#x2F;&#x2F;oniani.org&#x2F;blog&#x2F;the-essence-of-programming-functional-approach&#x2F;#math-sets-and-haskell&quot;&gt;Math, Sets, and Haskell&lt;&#x2F;a&gt; section will remember the math notation for
functions. I will use it to take the mystery out of this concept of being &lt;strong&gt;pure&lt;&#x2F;strong&gt;!&lt;&#x2F;p&gt;
&lt;p&gt;Suppose we have a function \( f : \mathbb{Z} \rightarrow \mathbb{Z} \). Just by looking at
the function, we see that it takes an input from the set of integers and that its output is also in the set
of integers. In other words, the function \( f \) cannot take inputs like -1.9, 0.2, or 12.7, nor
can it give an output like 12.6, 71.9, or -9.1. Its inputs and outputs can &lt;strong&gt;only&lt;&#x2F;strong&gt; be
integers.&lt;&#x2F;p&gt;
&lt;p&gt;Now, let&#x27;s actually make this dull function \( f \) do something. Consider the function \( f :
\mathbb{Z} \rightarrow \mathbb{Z} : x \mapsto 2x \). We now have a function that does a fairly
straightforward thing: it takes an integer and maps it to twice its value (which will also be an
integer). Let&#x27;s look at the Haskell implementation of this function.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- | A function that takes an input and outputs twice its value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;f&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; :: Integer -&amp;gt; Integer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;f x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; *&lt;&#x2F;span&gt;&lt;span&gt; x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The function above says that the input (corresponding to the &lt;code&gt;Integer&lt;&#x2F;code&gt; before the arrow) is always an
integer and that the output (corresponding to the &lt;code&gt;Integer&lt;&#x2F;code&gt; after the arrow) is also an integer. &lt;strong&gt;Hence,
we always know the type of the input and the type of the output.&lt;&#x2F;strong&gt; In fact, we also know that
if we call the function with, say, 5, we will always get the same result. Namely, &lt;code&gt;f 5 = 10&lt;&#x2F;code&gt;. Hence,
the input(s) and output(s) are always integers, and the function, called with the same actual
parameters, always returns the same value! This is what makes Haskell a purely functional language.
&lt;strong&gt;At any given point in time, we always know the type of the input and the type of the output.
Besides, we know that the function, called with the same actual parameter(s), always returns the same
value&lt;&#x2F;strong&gt;. Such functions produce no side effects, since we already know what to expect
for a given input. &lt;strong&gt;Such functions are called pure!&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
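Purity can also be checked directly: since a pure function&#x27;s result depends only on its arguments, repeated calls with the same argument are interchangeable with each other and with their value. Below is a minimal Python sketch of the same doubling function (my illustration, not from the original post):

```python
def f(x: int) -> int:
    """Pure: the result depends only on the argument x."""
    return 2 * x


# The same input always yields the same output...
assert f(5) == 10 and f(5) == f(5)
# ...so a call can be substituted by its value anywhere it appears
assert f(5) + f(5) == 2 * f(5)
```

This substitution property is often called referential transparency, and it is exactly what the impure example below lacks.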
&lt;p&gt;To further demystify this idea, let&#x27;s look at the following piece of code:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;An example of a function that is pretending to be pure.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span&gt; random&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; numgen&lt;&#x2F;span&gt;&lt;span&gt;(val:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; int&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;Generates a number.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    return&lt;&#x2F;span&gt;&lt;span&gt; val&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; +&lt;&#x2F;span&gt;&lt;span&gt; random.randint(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, val)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; %&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;def&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; main&lt;&#x2F;span&gt;&lt;span&gt;() -&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; None&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;Test the number generation.&amp;quot;&amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;    print&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;f&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;Returns &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span&gt;numgen(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Prints out 8&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;    print&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;f&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;Returns &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span&gt;numgen(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Prints out 8&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;    print&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;f&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;Returns &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span&gt;numgen(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  # Prints out 7&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; __name__&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; ==&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;__main__&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    main()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here, we defined a function that takes an integer value as an input, and it seems like the output is
also an integer. We might now be lured into thinking that the function &lt;code&gt;numgen&lt;&#x2F;code&gt; gives the same output
for the same input, but that is clearly not the case here. Let&#x27;s take a closer look at what the
function does. It takes an integer value and returns that value plus some random number which is 0, 1,
or 2. When we first called the function with the actual parameter 7, we got 8 as an output. The
second time, we got 8 again. The third time, however, we got 7. Hence, the third time, the output
was not the same. Therefore, the function is not pure.&lt;&#x2F;p&gt;
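One common way to recover purity in such a situation (my illustration, not from the original post) is to take the random component as an explicit argument instead of drawing it inside the function. The hypothetical &lt;code&gt;numgen_pure&lt;&#x2F;code&gt; below sketches this:

```python
def numgen_pure(val: int, rand: int) -> int:
    """Pure counterpart of numgen: the random draw is an explicit argument."""
    return val + rand % 3


# With the randomness passed in, the same arguments always give the same result
assert numgen_pure(7, 4) == 8  # 7 + (4 % 3) = 8
assert numgen_pure(7, 4) == numgen_pure(7, 4)
```

The caller now decides where the randomness comes from, which is essentially the separation that Haskell&#x27;s &lt;code&gt;IO&lt;&#x2F;code&gt; type makes explicit.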
&lt;p&gt;You might now be wondering why I could not do the same trick in Haskell. In fact, I certainly can.
However, in Haskell, such a function would not have the type &lt;code&gt;Int&lt;&#x2F;code&gt;. It would have the type &lt;code&gt;IO Int&lt;&#x2F;code&gt;. &lt;code&gt;IO&lt;&#x2F;code&gt;
is usually associated with file input &#x2F; output, and it is reasonable that it is also associated with
functions that are not always &quot;truthful&quot;, as file I&#x2F;O can be one of the nastiest
experiences for a programmer. So many things can go wrong! (e.g., writing to a file which was
deleted, reading from a file on a USB drive which was ejected, writing to a file that was moved to some other
directory, etc.) Thus, when we deal with uncertainty (which usually comes with side effects), Haskell
warns us by using the &lt;code&gt;IO&lt;&#x2F;code&gt; notation:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;{-&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;Example of a function that if called with the same argument,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;does not always return the same result.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; System.Random&lt;&#x2F;span&gt;&lt;span&gt; (&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;randomRIO&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;notAPureFunction&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; :: Int -&amp;gt; IO Int&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;notAPureFunction value &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;= do&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    randomValue &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;&amp;lt;-&lt;&#x2F;span&gt;&lt;span&gt; randomRIO (&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return (value &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span&gt; randomValue)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;main &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;= do&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;&amp;lt;-&lt;&#x2F;span&gt;&lt;span&gt; notAPureFunction &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;7&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    print x                  &lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- Prints out 9&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;&amp;lt;-&lt;&#x2F;span&gt;&lt;span&gt; notAPureFunction &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;7&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    print x                  &lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- Prints out 7&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    x &lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;&amp;lt;-&lt;&#x2F;span&gt;&lt;span&gt; notAPureFunction &lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;7&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    print x                  &lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;-- Prints out 9&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Take a look at the Haskell code above. You can disregard all the notational fluff. Just look at the
return type of the function &lt;code&gt;notAPureFunction&lt;&#x2F;code&gt;. It is &lt;code&gt;IO Int&lt;&#x2F;code&gt;! In other words, Haskell informs us
that the function might have side effects.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, we can have a rough definition of a pure functional language:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;A functional language is pure if and only if the user is informed about all side effects or there
are no side effects at all.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;In fact, Haskell did not even allow random values back in the 1990s, when its development was first
launched. Furthermore, there was no notion of file IO either, and writing to files was done using
shell redirection (i.e., &lt;code&gt;runhaskell Program.hs &amp;gt; out.txt&lt;&#x2F;code&gt;). Because of this, Haskell was
considered useless for all practical purposes. Eventually, engineers and the Haskell committee
decided to change the direction of Haskell. Instead of getting rid of all the side effects, they
decided to control them and created a more &quot;regulated&quot; programming environment.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h3&gt;
&lt;p&gt;Functional programming languages are different from imperative ones. Most of them are based on ideas
from lambda calculus. Functional languages are a proper subset of declarative languages. There are
no loops; recursion is used instead. Changes in state are non-existent and, therefore, all
variables are immutable. Functional languages usually have a lot of predefined functions to make it
easy for a programmer to solve problems. Most functional languages are also very concise,
minimizing the time spent on coding and leaving more time for the logic. Purely functional languages
are a proper subset of functional languages, and they go a long way to inform
the user about potential side effects.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;how-to-get-started-with-functional-programming&quot;&gt;How to Get Started with Functional Programming?&lt;&#x2F;h3&gt;
&lt;p&gt;There are lots of functional languages, so one will have to decide which one to learn first.
My recommendation would be to learn Haskell. It is a purely functional programming language which
incorporates most (if not all) functional ideas. Besides,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Simon_Peyton_Jones&quot;&gt;SPJ&lt;&#x2F;a&gt; dedicates most of his time to extending
the language and adding new features to it. So if there is something new and interesting in the
functional programming world, Haskell will likely adopt it.&lt;&#x2F;p&gt;
&lt;p&gt;After learning one functional language, it is not all that difficult to transition to another.
Being familiar with one functional language automatically makes one somewhat familiar with others.
Hence, a good understanding of Haskell will make it easier to learn languages such as Rust, Scheme,
etc.&lt;&#x2F;p&gt;
&lt;p&gt;To get started, visit the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.haskell.org&#x2F;documentation&quot;&gt;Haskell Documentation&lt;&#x2F;a&gt; page which
is full of various educational resources.&lt;&#x2F;p&gt;
</description>
      </item>
    </channel>
</rss>
