Shubham Rasal - Portfolio & Blog

Notes on RL Questions

Sun, 07 Jun 2026 00:00:00 GMT

35 questions covering LLM RL algorithms and infrastructure. Compiled as a personal reference — answers are deliberately brief and can be extended much further with follow-ups.

Sources: arXiv preprints, official GitHub repos, Lilian Weng's blog. Based on material from Xiuyu Li (@sheriyuo).

Algorithm (Q1–Q19)

Q1. Why use Actor-Critic instead of a pure Critic approach?

Pure critic (value-only) methods like DQN require discretizing the action space or expensive optimization at every step. Actor-Critic separates policy (actor) from value estimation (critic), enabling continuous action spaces and lower-variance gradient estimates via the baseline. The actor directly parameterizes the policy, which is more sample-efficient in high-dimensional or continuous spaces.

Q2. What is the relationship between KL divergence, cross entropy, and MLE?

KL(P||Q) = H(P,Q) − H(P). Minimizing cross entropy H(P,Q) (as in MLE on data distribution P) is equivalent to minimizing KL(P||Q) when H(P) is constant. In RL, KL from the reference policy acts as a regularizer equivalent to a log-ratio penalty.
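A quick numeric check of the identity above, as a minimal sketch. The two distributions are made-up illustrative values, not taken from any model.

```python
import math

def entropy(p):
    """H(P) = -sum p log p, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum p log q, in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) = sum p log(p / q), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # illustrative "data" distribution
q = [0.5, 0.3, 0.2]  # illustrative model distribution

# The gap between cross entropy and entropy is exactly the KL divergence.
gap = cross_entropy(p, q) - entropy(p)
print(kl_divergence(p, q), gap)
```

Since H(P) does not depend on the model, lowering H(P,Q) by changing Q lowers KL(P||Q) by the same amount.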

Q3. How should rewards be designed in different RL scenarios?

Sparse: outcome-only (pass/fail). Simple but high-variance.
Dense: shaped per-step rewards. Faster learning but risks reward hacking.
PRM (Process Reward Model): reward per reasoning step — useful for math/code.
ORM (Outcome Reward Model): reward only the final answer.
For LLMs: format rewards + verifiable outcome rewards (exact match, code execution) preferred to avoid hacking.

Q4. How do importance sampling, rejection sampling, and other Monte Carlo methods fit into RL?

Importance sampling (IS): reweights off-policy samples; used in PPO's surrogate ratio π_θ/π_old.
Rejection sampling: accept/reject samples from a proposal; used in RLHF data filtering and best-of-N selection.
Monte Carlo rollouts: estimate returns by full-trajectory sampling; high variance but unbiased.

Q5. How is advantage computed in PPO and GRPO? Why subtract a baseline? Is std normalization necessary?

PPO (GAE): A_t = Σ (γλ)^k δ_{t+k} where δ_t = r_t + γV(s_{t+1}) − V(s_t).
GRPO: group-relative — A_i = (r_i − mean(group)) / std(group), no value network needed.
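Baseline: subtracting a baseline that does not depend on the action reduces variance without introducing bias, because the expected score function E[∇ log π_θ(a|s)] is zero.
Std normalization: empirically stabilizes training but theoretically optional; Dr.GRPO removes it, arguing that it over-weights prompts whose rewards have low variance (very easy or very hard questions).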
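The two advantage formulas can be sketched in a few lines. This is a minimal illustration, assuming a single trajectory for GAE and a single prompt group for GRPO. The function names, the use of population std, and the eps term are my own choices, and real implementations differ on those details.

```python
import statistics

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE via the backward recursion A_t = delta_t + gamma * lam * A_{t+1}.

    `values` has len(rewards) + 1 entries; the last one is the bootstrap V(s_T).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def grpo_advantages(group_rewards, normalize_std=True, eps=1e-6):
    """Group-relative advantages for G responses to the same prompt."""
    mean = statistics.fmean(group_rewards)
    centered = [r - mean for r in group_rewards]
    if not normalize_std:
        return centered  # mean-centering only, no std division
    std = statistics.pstdev(group_rewards)  # population std: an assumption
    return [c / (std + eps) for c in centered]

print(gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Note that when every response in a group gets the same reward, all advantages are zero and the group contributes no gradient.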

Q6. How do RL training and test-time scaling perform exploration differently?

Training: exploration via stochastic sampling (temperature, entropy bonus, ε-greedy). The policy is updated to reinforce rewarded trajectories.
Test-time scaling: best-of-N, beam search, MCTS, or repeated sampling with a verifier — no gradient updates. Exploration is over the fixed policy's distribution.

Q7. How does PPO clipping work? Why take the minimum? What happens without clipping? How does CISPO differ?

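PPO clips the IS ratio: L = min(r·A, clip(r, 1−ε, 1+ε)·A). Taking the min makes the objective a pessimistic lower bound on the unclipped surrogate: the gradient vanishes once r has moved past the clip range in the direction that A favors, while moves that make the objective worse are still penalized in full.

Without clipping: the policy can take excessively large steps, destabilizing training (the problem TRPO's trust region was designed to solve).
CISPO: clips the importance weight itself (with a stop-gradient) and uses it to scale a REINFORCE-style log-probability term, rather than clipping the token's update. Every token therefore keeps a gradient, including the low-probability tokens that PPO-style clipping would drop.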
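To see what the min does, here is a minimal per-sample sketch of the PPO clipped term, evaluated in the four sign/ratio cases. The function name and the ε = 0.2 default are illustrative choices.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# A > 0 and the ratio already above 1 + eps: the clipped branch wins,
# the term is flat in r, so there is no incentive to push further.
print(ppo_clip_term(1.5, +1.0))
# A > 0 but the ratio fell below 1 - eps: the unclipped branch wins,
# so the gradient still pulls the ratio back up.
print(ppo_clip_term(0.5, +1.0))
# A < 0 and the ratio already below 1 - eps: flat again.
print(ppo_clip_term(0.5, -1.0))
# A < 0 but the ratio rose above 1 + eps: unclipped, fully penalized.
print(ppo_clip_term(1.5, -1.0))
```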

Q8. Why does GRPO include a KL penalty? How is KL computed? Why do DAPO and GSPO remove it?

KL penalty prevents the policy from drifting too far from the reference (SFT model), acting as a regularizer against reward hacking.
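Computed per token on sampled actions. The naive estimator is log(π_θ(a_t)/π_ref(a_t)); GRPO uses the unbiased, always non-negative estimator π_ref(a_t)/π_θ(a_t) − log(π_ref(a_t)/π_θ(a_t)) − 1.
DAPO: removes KL because, in long chain-of-thought reasoning RL, the policy is expected to move far from the initial model, so the constraint is unnecessary.
GSPO: also omits the KL term; its change is to define the importance ratio and clipping at the sequence level (length-normalized) instead of per token, arguing that this matches the sequence-level reward.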
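A small sketch comparing two common per-token KL estimators on actions sampled from π_θ: the plain log-ratio (often called k1) and the ratio-based form r − log r − 1 with r = π_ref/π_θ (often called k3). The two toy distributions are made up. The expectations are computed exactly rather than by sampling.

```python
import math

def k1_estimate(logp_theta, logp_ref):
    """Plain log-ratio; unbiased, but individual samples can be negative."""
    return logp_theta - logp_ref

def k3_estimate(logp_theta, logp_ref):
    """r - log(r) - 1 with r = pi_ref / pi_theta; unbiased and never negative."""
    log_r = logp_ref - logp_theta
    return math.exp(log_r) - log_r - 1.0

pi_theta = [0.5, 0.3, 0.2]  # toy next-token distribution of the policy
pi_ref = [0.4, 0.4, 0.2]    # toy next-token distribution of the reference

exact_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))
# Exact expectation of each estimator under a ~ pi_theta.
expected_k1 = sum(p * k1_estimate(math.log(p), math.log(q)) for p, q in zip(pi_theta, pi_ref))
expected_k3 = sum(p * k3_estimate(math.log(p), math.log(q)) for p, q in zip(pi_theta, pi_ref))
print(exact_kl, expected_k1, expected_k3)
```

Both match the true KL in expectation, but only the second is non-negative for every sampled token, which tends to make it less noisy as a per-token penalty.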

Q9. During LLM training, what happens if loss is accidentally All Reduced multiple times?

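All-reduce combines a tensor across ranks. Averaging is idempotent: once every rank holds the mean, averaging again returns the same value. The damage comes from a repeated SUM, which multiplies the value by world_size each extra time, or from a repeated manual division by world_size, which shrinks it by the same factor. If the mis-scaled loss is the one used for backward, gradients are off by a power of world_size, which behaves like a wrong learning rate under SGD and can be partly masked by Adam's normalization or by gradient clipping. If only the logged loss is affected, the curves are wrong but training is not. Hard to detect without gradient norm logging.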
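A toy simulation of what a repeated all-reduce does to a per-rank value. This does not use torch.distributed; `all_reduce` here is a hypothetical stand-in that takes one value per rank and returns what every rank would hold afterwards.

```python
def all_reduce(per_rank_values, op):
    """Simulated all-reduce: every rank ends up with the same combined value."""
    total = sum(per_rank_values)
    world_size = len(per_rank_values)
    if op == "sum":
        return [total] * world_size
    if op == "avg":
        return [total / world_size] * world_size
    raise ValueError(f"unknown op: {op}")

local_losses = [1.0, 2.0, 3.0, 6.0]  # made-up per-rank losses, world_size = 4

avg_once = all_reduce(local_losses, "avg")
avg_twice = all_reduce(avg_once, "avg")   # unchanged: averaging is idempotent
sum_once = all_reduce(local_losses, "sum")
sum_twice = all_reduce(sum_once, "sum")   # multiplied by world_size again

# "Sum, then divide by world_size" done twice by mistake: scaled down by world_size.
divided_twice = [v / 4 / 4 for v in sum_once]

print(avg_once[0], avg_twice[0], sum_once[0], sum_twice[0], divided_twice[0])
```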

Q10. What is the reward function in DPO? Can reward hacking occur? How can it be mitigated?

DPO implicitly defines reward: r(x,y) = β log(π_θ(y|x)/π_ref(y|x)).
Reward hacking: the model can exploit distribution shift from π_ref without improving actual quality.
Mitigations: IPO (identity transform), cDPO (conservative), iterative DPO with updated reference, or hybrid DPO+RM.
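The implicit reward and the resulting pairwise loss can be sketched directly from sequence log-probabilities. This is a minimal illustration with made-up numbers. The function names and β = 0.1 are my own choices.

```python
import math

def implicit_reward(logp_theta, logp_ref, beta=0.1):
    """r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)), from sequence log-probs."""
    return beta * (logp_theta - logp_ref)

def dpo_loss(logp_theta_w, logp_ref_w, logp_theta_l, logp_ref_l, beta=0.1):
    """-log sigmoid(r_chosen - r_rejected) for one preference pair."""
    margin = (implicit_reward(logp_theta_w, logp_ref_w, beta)
              - implicit_reward(logp_theta_l, logp_ref_l, beta))
    return math.log1p(math.exp(-margin))

# Policy equal to the reference: zero margin, loss = log 2.
print(dpo_loss(-10.0, -10.0, -20.0, -20.0))

# The chosen response became LESS likely than under the reference (-12 vs -10),
# but the rejected one dropped even more, so the loss still improves.
print(dpo_loss(-12.0, -10.0, -20.0, -10.0))
```

The second case is one way the objective can be satisfied without the preferred response actually becoming more likely: only the margin matters, not the absolute log-probability of the chosen answer.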

Q11. What methods address train-inference mismatch in MoE models, and how do they work?

Expert choice routing at train vs. token choice at inference causes load imbalance.
Auxiliary load-balancing loss: penalizes uneven expert utilization.
DeepSeek MoE: uses fine-grained expert splitting and shared experts.
OPD / Expert-parallel fine-tuning: aligns routing distributions between train and serving.

Q12. How should group size, learning rate, PPO epochs, and generation length be selected during RL training?

Group size (G): larger G reduces variance of advantage estimate but increases memory; 8–64 typical.
LR: typically 1e-6 to 5e-6 for LLM RL; lower than SFT to avoid policy collapse.
PPO epochs: 1–4; more epochs risk over-optimization on old data.
Generation length: should match target task; too short truncates reasoning, too long wastes compute. Cap with length penalty if needed.

Q13. Compared with GRPO, how do Dr.GRPO, DAPO, GSPO, CISPO, SAPO, DPPO, MaxRL, and SimKO improve training?

Method	Key Improvement	Limitation
Dr.GRPO	Removes response-length and std normalization to eliminate length and difficulty bias	Minor gain over GRPO
DAPO	No KL, clip-higher, token-level loss	Careful entropy tuning needed
GSPO	Sequence-level importance ratio and clipping	Clips whole responses rather than individual tokens
CISPO	Clips the IS weight (stop-gradient) instead of the token update	Clipped weights make the gradient estimate biased
SAPO	Separates actor/critic updates; better stability	Extra hyperparameters
DPPO	Distributed PPO with async rollout	Staleness risk
MaxRL	Reward maximization with explicit diversity	Experimental
SimKO	Simplified KL objective, no reference model	Weaker regularization

Q14. How do TRPO, DPPO, and AReaL enforce trust-region constraints?

TRPO: hard KL constraint via conjugate gradient + line search. Computationally expensive.
DPPO: separates rollout workers from training, enforces trust region via clipping asynchronously.
AReaL: uses partially stale rollouts with bounded staleness as an implicit trust region; monitors KL drift to trigger rollout refresh.

Q15. Can RL fundamentally expand the capability frontier of LLMs?

Debated. RL can unlock latent capabilities (e.g., step-by-step reasoning in R1) and improve reliability on verifiable tasks. However, it likely cannot introduce genuinely new knowledge — it redistributes probability mass over already-learned behaviors. Capability frontier expansion requires the base model to have learned the skill implicitly during pre-training.

Q16. Based on ProRL, how should we think about scaling RL training?

ProRL shows that prolonged RL training (>1000 steps) with curriculum and diverse task mixing continues to improve — contradicting early saturation beliefs. Key findings: task diversity prevents forgetting, reward signal quality matters more than quantity, and entropy maintenance is critical for sustained improvement.

Q17. What improvements does OPD introduce over traditional RL and SFT?

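OPD (On-Policy Distillation) has the student generate its own rollouts and trains it against a stronger teacher's per-token distribution on those rollouts, typically by minimizing reverse KL. Compared with pure RL, the teacher supplies a dense per-token signal instead of a sparse outcome reward, which avoids the cold-start problem. Compared with SFT, training on on-policy data avoids the distribution shift between teacher-written data and the student's own generations. Applications: math reasoning, code generation with execution feedback.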

Q18. At which stage of training does reasoning ability emerge in LLMs?

Evidence suggests reasoning emerges during pre-training at scale (chain-of-thought is latent in the base model). RL (e.g., R1-Zero) can surface and amplify it without SFT cold-start. However, the reliability and length of reasoning chains are shaped by post-training.

Q19. From DeepSeek R1 to V3.2 and future V4, what RL-related improvements have been introduced?

R1: GRPO-based RL on math/code with sparse verifiable rewards; "aha moment" self-reflection.
V3: MoE base + multi-token prediction + RL alignment pipeline.
V3.2 / V4 (speculative): longer context RL, improved MoE routing stability, process reward models, agentic RL with tool use. MoE adds expert load-balancing loss during RL.

Infrastructure (Q20–Q35)

Q20. Ignoring CPU offload, how many model copies exist in memory during GRPO training?

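At minimum: actor (π_θ, trainable) + reference model (frozen π_ref) + optimizer states (Adam: two fp32 moment tensors, usually alongside fp32 master weights). For GRPO without a value network that is 2 training-side model copies (actor + reference); in practice the rollout engine holds a third copy of the actor weights unless it shares memory with the trainer. With a critic: add another trainable copy with its own optimizer states.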
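A back-of-envelope sketch of the weight and optimizer memory involved. The byte counts assume a common mixed-precision Adam setup (bf16 weights and gradients, fp32 master weights and two fp32 moments), that the reference and rollout copies are bf16 weights only, and that the critic is the same size as the actor. Activations, KV cache, and sharding are ignored. The function name and the 7B example are illustrative.

```python
def grpo_memory_gb(n_params, with_rollout_copy=True, with_critic=False):
    """Rough weight + optimizer memory in GB (1 GB = 1e9 bytes)."""
    gb = 1e9
    # bf16 weights (2) + bf16 grads (2) + fp32 master (4) + Adam m (4) + Adam v (4)
    trainable_bytes_per_param = 2 + 2 + 4 + 4 + 4
    frozen_bytes_per_param = 2  # bf16 weights only

    parts = {
        "actor": n_params * trainable_bytes_per_param / gb,
        "reference": n_params * frozen_bytes_per_param / gb,
    }
    if with_rollout_copy:
        # Assumes a separate inference engine holding its own copy of the actor.
        parts["rollout_engine"] = n_params * frozen_bytes_per_param / gb
    if with_critic:
        # Assumes a critic of the same size as the actor.
        parts["critic"] = n_params * trainable_bytes_per_param / gb
    parts["total"] = sum(parts.values())
    return parts

print(grpo_memory_gb(7e9))
```

Under these assumptions the optimizer-bearing actor dominates: the frozen copies are cheap by comparison, which is why dropping the critic matters more than dropping the reference.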

Q21. Distributed inference: KV cache transfer optimization and multi-GPU communication strategies.

Prefill-decode disaggregation: separate GPU pools for prefill vs. decode; KV cache transferred over NVLink/RDMA.
Chunked prefill: interleaves prefill chunks with decode steps to reduce head-of-line blocking.
Paged KV cache (vLLM): non-contiguous memory blocks, reduces fragmentation.
Communication: NCCL all-reduce for TP; P2P send/recv for PP; RDMA for cross-node KV transfer.

Q22. INT8 versus FP8. What are the tradeoffs? Which precisions are preferred for training and inference?

INT8: simpler quantization, well-supported, but limited dynamic range for activations.
FP8 (E4M3/E5M2): better dynamic range than INT8, natively supported on H100. E4M3 for weights/activations, E5M2 for gradients.
Training: FP8 master weights risky; typically BF16 weights + FP8 compute (Transformer Engine).
Inference: FP8 or INT8 (W8A8) for throughput; INT4 for memory-bound generation.

Q23. What is the long-tail problem in RL rollouts, and how can it be addressed?

Some prompts generate much longer sequences than others, causing GPU idle time while waiting for the longest sequence in a batch. Solutions:

Sequence packing: bin-pack sequences to fill fixed-length buffers.
Dynamic batching: group by similar length.
Async rollout: decouple generation from training; discard or requeue long stragglers.
Length penalty in reward: discourage excessively long outputs.
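As one concrete instance of the packing idea, here is a minimal first-fit-decreasing sketch. It only decides which sequences share a buffer; attention masking between packed sequences is out of scope. The function name and example lengths are made up.

```python
def pack_sequences(lengths, capacity):
    """First-fit decreasing: returns a list of bins, each a list of sequence indices."""
    bins = []  # each entry: [remaining_capacity, [sequence indices]]
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    for idx in order:
        n = lengths[idx]
        if n > capacity:
            raise ValueError(f"sequence {idx} ({n} tokens) exceeds capacity {capacity}")
        for b in bins:
            if b[0] >= n:        # first bin with enough room
                b[0] -= n
                b[1].append(idx)
                break
        else:                    # no bin had room: open a new one
            bins.append([capacity - n, [idx]])
    return [b[1] for b in bins]

lengths = [700, 300, 500, 500, 200]  # illustrative token counts
packed = pack_sequences(lengths, capacity=1000)
print(packed)
```

Here five sequences fit into three buffers of 1000 tokens instead of five padded rows.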

Q24. What issues does continuous batching introduce in RL training? How do vLLM and SGLang differ?

Continuous batching mixes sequences at different stages — problematic for RL because you need complete trajectories before computing rewards. Solutions: track per-request state, flush complete sequences.

vLLM: PagedAttention, flexible scheduling, chunked prefill; rollout API via AsyncLLMEngine.
SGLang: RadixAttention for prefix caching, faster TTFT, optimized for structured generation (multi-turn, tool calls).

Q25. How do you measure utilization in vLLM and SGLang? How do you evaluate KV cache utilization during training?

GPU utilization: nvidia-smi, but MFU (model FLOP utilization) is more meaningful.
vLLM metrics: the /metrics endpoint exposes Prometheus gauges such as vllm:gpu_cache_usage_perc, vllm:num_requests_running, and vllm:num_requests_waiting (exact names vary by version).
SGLang: similar Prometheus metrics endpoint.
Training KV cache: monitor cache hit rate (prefix reuse) and eviction rate; high eviction → increase cache budget or reduce batch size.

Q26. How is backpropagation implemented in large-scale multi-node RL training?

Tensor Parallelism (TP): split weight matrices across GPUs; all-reduce after each layer.
Pipeline Parallelism (PP): micro-batches flow through stages; gradient accumulation across micro-batches.
FSDP / ZeRO-3: shard optimizer states, gradients, and params across DP ranks; all-gather before forward, reduce-scatter after backward.
RL-specific: the actor's loss is masked to response tokens, so prompt and padding tokens contribute no gradient; the reference model stays frozen (forward only, no backward).

Q27. What asynchronous RL frameworks exist, and what synchronization bottlenecks do they solve?

IMPALA: async actor-learner; solves GPU idle from slow environment steps.
DPPO / RLHF async: decouples rollout workers from training; solves the generation bottleneck (~3–10× slower than training step).
AReaL: partially async — rollouts from slightly stale policy, bounded by KL; solves the sync barrier between inference and training clusters.
slime: fully async with Megatron backend; uses shared memory ring buffers for data flow.

Q28. In AReaL or other partial-rollout frameworks, are KV caches from previous policies preserved?

No, in general. When policy weights update, the KV cache computed under the old policy is stale: continuing to decode from it mixes keys and values produced by two different sets of weights. AReaL interrupts in-flight generation on weight sync, discards the old cache, and recomputes it under the new weights. Some systems use speculative decoding-style checks, but recomputation is the safe default.

Q29. How does Expert Parallelism affect MoE throughput?

Expert Parallelism (EP) shards experts across GPUs — each GPU holds a subset of experts. For a token routed to expert i, it must be sent to the GPU holding that expert via All-to-All communication. This adds latency proportional to message_size × num_experts / bandwidth. High EP degree → lower memory per GPU but higher communication overhead. Optimal EP degree balances compute vs. network saturation.

Q30. In long-context training, how should compute-communication overlap be designed? How do Megatron and FSDP differ in parallelism strategies?

Overlap strategy: pipeline communication behind computation using CUDA streams; prefetch next micro-batch while computing current.
Megatron: interleaved 1F1B pipeline schedule with virtual stages; sequence parallelism (layernorm/dropout split across TP ranks); explicit all-gather/compute overlap.
FSDP: lazy all-gather via forward hooks; forward_prefetch and backward_prefetch options. Less efficient for long context due to larger all-gather buckets.

Q31. How do you enable deterministic execution? What is batch invariance? What causes it? Is atomic add involved?

Deterministic execution: torch.use_deterministic_algorithms(True), set seeds, disable TF32.
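Batch invariance: the result for a given sequence should not change depending on how data is batched (e.g., same sequences in one batch vs. split across two). Violated when a kernel picks a different reduction strategy or tile size for different batch shapes, so the same row is summed in a different order.
Cause: floating-point addition is not associative, so any change in reduction order changes the low bits. atomicAdd makes that order depend on thread scheduling, which causes run-to-run nondeterminism. Matmul, attention, and normalization kernels whose reduction order depends on batch size are the common culprits for batch variance.
Mitigation: deterministic CUDA kernels (CUBLAS_WORKSPACE_CONFIG=:4096:8) or avoid atomic reductions in critical paths. Removing atomic adds alone does not give batch invariance: that fixes within-kernel ordering, not the dependence of the reduction strategy on batch shape.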
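The root issue fits in a few lines: floating-point addition is not associative, so the same numbers reduced in a different grouping can give a different result. A minimal sketch, where `split_sum` is a hypothetical stand-in for a kernel that reduces one row in two partial sums.

```python
def fold(values):
    """Strict left-to-right accumulation (avoids any compensated built-in sum)."""
    total = 0.0
    for v in values:
        total += v
    return total

def split_sum(values, split):
    """Reduce values[:split] and values[split:] separately, then combine."""
    return fold(values[:split]) + fold(values[split:])

row = [0.1, 0.2, 0.3]

one_pass = split_sum(row, 3)    # ((0.1 + 0.2) + 0.3)
two_parts = split_sum(row, 1)   # 0.1 + (0.2 + 0.3)

print(one_pass, two_parts, one_pass == two_parts)
```

If a kernel chooses the split based on batch size, the same sequence can produce slightly different outputs depending on what else is in the batch, with no atomics involved.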

Q32. How do AReaL and slime differ in their understanding of the RL rollout bottleneck?

AReaL: bottleneck is the synchronization barrier between rollout and training. Solution: allow bounded-stale rollouts so training never waits for generation to finish.
slime: bottleneck is KV cache memory and inference engine throughput under RL workload (variable lengths, frequent weight updates). Solution: tight Megatron integration with shared memory, avoiding inference-engine restart overhead on weight sync.

Q33. How should we think about staleness in fully asynchronous RL training? What are typical values in practice?

Staleness = number of gradient updates between when a rollout was generated and when it's used for training. High staleness → IS ratio π_θ/π_old drifts → clipping becomes too aggressive or too permissive. Typical practice: track staleness per sample, discard or reweight samples beyond a threshold (e.g., 2–4 updates stale). AReaL monitors per-sample KL to bound staleness implicitly. In practice, staleness of 1–3 steps is generally acceptable for LLM RL workloads.
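A minimal sketch of per-sample staleness tracking. It assumes each rollout is tagged with the learner step whose weights generated it. The `Rollout` type, field names, and the threshold of 2 are illustrative, not taken from any framework.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    prompt_id: int
    policy_version: int  # learner step whose weights generated this sample

def split_by_staleness(rollouts, current_version, max_staleness=2):
    """Partition rollouts into (fresh, stale) by how many updates old they are."""
    fresh, stale = [], []
    for r in rollouts:
        staleness = current_version - r.policy_version
        (fresh if staleness <= max_staleness else stale).append(r)
    return fresh, stale

buffer = [Rollout(0, 10), Rollout(1, 9), Rollout(2, 7), Rollout(3, 5)]
fresh, stale = split_by_staleness(buffer, current_version=10, max_staleness=2)
print([r.prompt_id for r in fresh], [r.prompt_id for r in stale])
```

The stale list can then be discarded, requeued for regeneration, or down-weighted, depending on how expensive the rollouts were.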

Q34. How does data flow through slime? How is it integrated with Megatron? How is the loss computed?

Data flow: Megatron handles training; slime manages rollout workers (SGLang-based). After each training step, updated weights are broadcast to rollout workers via shared memory or NCCL. Rollout workers generate trajectories, queued in a ring buffer and consumed by the Megatron training loop.
Megatron integration: slime hooks into Megatron's training loop via a custom data iterator that pulls from the rollout queue instead of a static dataset.
Loss: standard PPO/GRPO loss computed over log-probs from Megatron's forward pass; reference log-probs either recomputed or stored during rollout.

Q35. If you had to choose among VeRL, TRL, Unsloth, AReaL, and slime, which one would you use and why?

Framework	Best For
TRL	Rapid prototyping, small models, research experiments
Unsloth	Single-GPU fine-tuning, memory efficiency (QLoRA), low-budget
VeRL	Production LLM RL at scale, Ray-based, good vLLM integration
AReaL	Async RL research, studying staleness, academic scale
slime	Megatron-native shops, largest-scale training, MoE models

Recommendation: VeRL for most industry use cases (mature, well-documented, Ray ecosystem). slime if you're training >100B parameter MoE models with Megatron already in your stack.

DEVLOG #6

Fri, 22 May 2026 00:00:00 GMT

Spent the afternoon benchmarking Carbon, a new family of genomic foundation models from HuggingFace trained on 1 trillion tokens of DNA sequences.

What is Carbon

Carbon is a causal language model for DNA. It uses a hybrid tokenizer that switches between BPE for text and 6-mer encoding for DNA sequences, triggered by a <dna> tag. The family has three sizes: 500M, 3B, and 8B. The 500M model is explicitly designed as a draft model for speculative decoding with the 3B.

Throughput on T4 GPU

I ran inference benchmarks for Carbon-500M on a Colab T4 GPU. The model is 512M parameters in bfloat16 precision.

Config	Prompt	Generated	Throughput
128bp context	23 tokens	64 tokens	27.2 tok/s
128bp context	23 tokens	256 tokens	25.0 tok/s
512bp context	87 tokens	64 tokens	26.6 tok/s
512bp context	256 tokens	256 tokens	26.3 tok/s

The throughput is remarkably flat across prompt lengths, which makes sense given the model size. No attention bottleneck at these context lengths on T4.

Speculative Decoding

Since Carbon-500M is designed as a draft for Carbon-3B, I tested speculative decoding using HuggingFace's built-in assistant_model flag. Both models together use about 8 GB of VRAM on the T4.

Config	3B standalone	3B with spec decoding	Speedup
128bp, 64 tokens	22.3 tok/s	19.4 tok/s	0.87x
128bp, 256 tokens	22.3 tok/s	20.1 tok/s	0.90x
512bp, 64 tokens	21.4 tok/s	18.6 tok/s	0.87x
512bp, 256 tokens	22.0 tok/s	19.7 tok/s	0.89x

Speculative decoding is about 10 to 13 percent slower than running 3B alone. This is not surprising on T4 for a few reasons.

First, the T4 is memory bandwidth bound, not compute bound. Loading weights for two models per round costs more than the verification saves. Second, HuggingFace's assistant_model path is not the optimized speculative decoding used in production systems like vLLM. The draft model still proposes tokens one at a time, and each round likely carries noticeable framework overhead at these speeds. Third, speculative decoding gains are typically larger on A100 or H100 class hardware, where the verifier has spare compute during decoding and checking several draft tokens in one forward pass comes essentially for free.

The design intention still makes sense though. A proper implementation with vLLM's speculative decoding backend on an A100 should show meaningful speedup. The 6x size ratio between draft and verifier, plus the shared tokenizer and training data, are all ideal conditions for high draft acceptance rates, which is where the actual wall clock improvement comes from.
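A rough way to sanity-check the slowdown is the standard expected-speedup formula from the speculative decoding literature. This sketch assumes each draft token is accepted independently with the same probability and that verifying a whole draft costs one verifier forward pass. The cost ratio is estimated from the standalone throughputs in the tables above (22.3 tok/s for 3B, 27.2 tok/s for 500M). The acceptance rates and draft length are guesses, since I did not measure them.

```python
def expected_tokens_per_round(accept_rate, draft_len):
    """Expected tokens produced per draft-and-verify round (at least 1)."""
    if accept_rate >= 1.0:
        return draft_len + 1
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

def speculative_speedup(accept_rate, draft_len, cost_ratio):
    """Speedup over plain decoding; cost_ratio = draft step time / verifier step time."""
    round_cost = draft_len * cost_ratio + 1  # in units of one verifier step
    return expected_tokens_per_round(accept_rate, draft_len) / round_cost

# Per-token time ratio implied by the standalone runs: the 500M draft is only
# slightly faster per token than the 3B verifier in this setup.
cost_ratio = 22.3 / 27.2

for accept_rate in (0.6, 0.8, 0.9):  # hypothetical acceptance rates
    print(accept_rate, round(speculative_speedup(accept_rate, 5, cost_ratio), 2))

# Same acceptance rate, but with a draft that is 20x cheaper per step.
print(round(speculative_speedup(0.8, 5, 0.05), 2))
```

With a draft that costs roughly 80% of a verifier step, the formula predicts a slowdown at any realistic acceptance rate, which is consistent with the measurements. The draft only pays off when its per-token cost is a small fraction of the verifier's.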

What is next

Want to test Carbon on actual biology tasks rather than synthetic ATCG repeats. The evaluation suite includes variant effect prediction and sequence recovery tasks which would give a better picture of where the model actually stands relative to Evo2 and GENERator.

Chasing 1.18 BPB: Parameter Golf in 10 Minutes

Thu, 30 Apr 2026 00:00:00 GMT

OpenAI dropped a challenge in mid-March: train the best language model that fits in 16MB and trains in under 10 minutes on 8×H100s. Scored by bits-per-byte on FineWeb. Lower is better.

The baseline scores 1.2244 BPB. The current record is 1.1194. That gap looks small. It is not small.

What I Found Reading the Leaderboard

The repo has a records/ directory where every submission explains exactly what they did. Weirdly generous for a live competition.

Some things are basically free — LeakyReLU², orthogonal init, EMA on the weights. EMA surprised me because the gain isn't just better generalization — the averaged weights quantize more cleanly to int8. It's doing two jobs.

Then there's SmearGate. 512 parameters total. A single learned gate per embedding dim that blends the current token with the previous one. The BPB per parameter ratio is absurd.

The real unlock is int6 quantization. Pack weights into 6 bits, compress with zstd at level 22. This shrinks the model ~25% more than int8+zlib, which is what lets you run 3× wider MLPs while still fitting under 16MB. The wider MLP is apparently the single biggest contributor to score improvements.
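The packing step can be sketched in pure Python. This is my own minimal illustration of 6-bit packing (four values per three bytes), not the code from any leaderboard submission. The symmetric per-tensor scale is an assumption, and the compression pass afterwards is left out.

```python
def quantize_int6(weights, scale):
    """Symmetric quantization to signed 6-bit integers in [-32, 31]."""
    return [max(-32, min(31, round(w / scale))) for w in weights]

def pack_int6(values):
    """Pack signed 6-bit ints, four values per 3 bytes (zero-padded at the end)."""
    padded = list(values) + [0] * (-len(values) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        word = 0
        for v in padded[i:i + 4]:
            word = (word << 6) | (v & 0x3F)  # keep the low 6 bits (two's complement)
        out += word.to_bytes(3, "big")
    return bytes(out)

def unpack_int6(data, count):
    """Inverse of pack_int6; `count` drops the padding values."""
    values = []
    for i in range(0, len(data), 3):
        word = int.from_bytes(data[i:i + 3], "big")
        for shift in (18, 12, 6, 0):
            u = (word >> shift) & 0x3F
            values.append(u - 64 if u >= 32 else u)  # restore the sign
    return values[:count]

q = quantize_int6([0.5, -0.5, 10.0, -0.1, 0.0], scale=0.1)
packed = pack_int6(q)
print(q, len(packed), unpack_int6(packed, len(q)) == q)
```

Before compression, 1000 weights take 750 bytes instead of 1000 at int8, which is where the roughly 25% headroom comes from.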

Day 2 and 3: Implementation

Started with the MLX version locally. SmearGate and BigramHash went in cleanly. 200-step smoke test: val BPB dropped from 4.1 to 1.92. Just warmup, but promising.

Porting to CUDA was mostly mechanical. Then came RunPod.

SCP doesn't work through their SSH proxy — fails silently. You need runpodctl send/receive. Spot instances got terminated three times mid-run. The Docker image has almost nothing installed, so every new pod needs a git clone, 8GB of shard downloads, and a pip install before you can do anything. When you have 15 minutes of H100 credits, this stings.

Lesson: always nohup ... & your training run. Learned this the hard way.

The Run That Counted

Eventually a stable 8×H100 run finished.

step:8268/20000 val_bpb:1.2173 train_time:600058ms stopping_early: wallclock_cap
pre_quant_sliding val_bpb:1.1823
Total submission size int8+zlib: 16362289 bytes

1.1823 pre-quant BPB. Better than 12 of the 19 leaderboard entries.

The problem: 16.36MB. The limit is 16MB. 360KB over.

The fix was clear — switch to zstd or int6. I added zstd support. The run that would have used it got terminated by another spot eviction.

Where This Lands

1.1823 would rank around 10th if it had fit. Not a record, but a real result built from scratch.

The thing that surprised me most was how much co-design matters. Every architectural decision — MLP width, layers, embedding size — has to be evaluated through "does this fit after compression?" The best submissions are designing the model and the compression scheme together.

The int6 path is clear. No good reason 1.1194 should be the floor.

From Prompt to Pixels — SAM 3.1 as a Local Video Operator

Sun, 26 Apr 2026 00:00:00 GMT

Notes from running Meta’s Segment Anything 3.1 on Apple Silicon via MLX

While experimenting with SAM 3.1 (Segment Anything Model 3.1) on local hardware, I hit a moment that felt less like “running a model” and more like having an operator I could steer in English. I pointed it at dashcam footage of a busy highway interchange and asked—not for a caption—for a directional vehicle count. The stack answered. The numbers matched what I could verify by eye.

If you had to build traffic flow analytics or layer-aware compositing from raw video, how would you do it?

One workable answer today is: treat segmentation + tracking as the primitive, and push semantics (left vs. right, foreground vs. background) into a thin layer of code on top.

Mental Model: Classification vs. Operator

Most people’s picture of “vision AI” is classification: image in, label out. That was remarkable in 2012; in 2026 it is baseline.

SAM 3.1 is doing something closer to instance geometry:

You specify what you care about in natural language
The model returns masks—exact outlines—not just categories
Those masks persist across frames as tracks (IDs, colors, continuity)

So the useful abstraction is not “what is this frame about?” but “which pixels belong to which object over time?”

Traditional mental model          SAM 3.1 mental model
        |                                  |
   Photo → Label                    Video + Prompt → Masks + Tracks
        |                                  |
   "car" (one tag)                 Car₁, Car₂, … Carₙ per frame
                                         |
                                  Stable IDs across time

What the Stack Optimizes For (Locally)

The runs I care about here are offline-friendly and privacy-preserving:

No cloud — weights and inference on the machine
No API round-trips — latency is compute-bound, not network-bound
Plain-English prompts — the interface is language, not hand-tuned CV pipelines

Concretely: mlx-community/sam3.1-bf16 on MLX, 873M parameters, on Apple Silicon.

Experiment 1 — Highway: From Masks to a Directed Count

Input: ~53 seconds of 720p overpass footage.
Prompt: "a car".

Observation: Per-frame instance counts bounced in a band (roughly 19–28 vehicles), which is expected—occlusion, merge/split, and frame boundaries are hard. What mattered more was track continuity: each vehicle kept an ID, mask, and trajectory long enough to reason about dominant side of frame / road.

I wrote a small post-processing step: from track geometry, classify which side of a split line each track spent most of its lifetime on, then aggregate.

Result:

Direction	Count	Share
Left lane	408	70.1%
Right lane	174	29.9%

Raw frames
    |
    v
SAM 3.1 (prompt: "a car")
    |
    v
Per-frame masks + track IDs
    |
    v
Heuristic: dominant side of split line per track
    |
    v
Directional flow statistics

Takeaway: the “heavy” part—find instances and keep them coherent—is delegated to the model. The “thin” part—geometry + aggregation—is ordinary code. That split is the design win.
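The thin layer is small enough to sketch. This assumes tracks have already been reduced to one centroid x-coordinate per frame, keyed by track ID. The data layout, function names, and the tie-break toward the left side are my own choices for illustration, not the exact script I ran.

```python
from collections import Counter

def dominant_side(centroids_x, split_x):
    """Side of the split line where a track spent most of its frames."""
    left_frames = sum(1 for x in centroids_x if x < split_x)
    return "left" if left_frames * 2 >= len(centroids_x) else "right"

def directional_counts(tracks, split_x):
    """tracks: {track_id: [centroid_x per frame]} -> {side: (count, share in %)}."""
    counts = Counter(dominant_side(xs, split_x) for xs in tracks.values() if xs)
    total = sum(counts.values())
    return {side: (n, 100.0 * n / total) for side, n in counts.items()}

# Three made-up tracks on a 100 px wide frame, split down the middle.
tracks = {
    1: [10, 20, 30],   # stays left
    2: [90, 80, 40],   # mostly right, crosses over at the end
    3: [60, 70, 80],   # stays right
}
print(directional_counts(tracks, split_x=50))
```

Classifying by where a track spent most of its lifetime, rather than by its first or last frame, makes the count tolerant of brief mask jitter near the split line.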

Experiment 2 — Detection Cadence: Trading Accuracy for Throughput

Default behavior effectively re-detects every frame. On 720p footage on this machine, that landed near ~0.5 fps, which is fine for batch analysis but unusable for interactive loops.

SAM exposes a --every N knob: full detection every N frames, propagation in between.

Detection cadence	Approx. speed (720p)
Every frame	~0.5 fps
Every 5 frames	~3.0 fps
Every 15 frames	~7.8 fps

At every 15, the tracker’s interpolation is smooth enough that, for many UIs, the shortcut is invisible—until you freeze-frame and hunt for edge cases.

Every frame                    Every N frames
     |                               |
 Full detect each step            Detect ──► Propagate ──► Propagate ──► …
     |                               |
 High fidelity masks              Lower CPU/GPU duty cycle
     |                               |
 ~0.5 fps (here)                   ~7–8 fps (here, N=15)

Experiment 3 — Compositing: A Depth List Instead of One Mask

Goal: the “text sandwiched between environment and subject” effect (background → text → foreground).

Naive approach: segment the person; paste text behind. Works until desk, mic, bottle, tablet should also occlude the text—you either fight with rotoscoping or you enumerate foreground.

Better approach: define a foreground set; union masks; composite in three layers.

FOREGROUND_OBJECTS = [
    "a person",
    "a wooden table",
    "a bottle",
    "an iPad",
    "a microphone",
]

SAM segments all prompts in one pass; the compositor does background → typography → union(foreground masks).

Layer stack (bottom → top):

[ Environment pixels ]
        |
        v
[ Text ]
        |
        v
[ Union(person, table, bottle, iPad, mic) ]

Takeaway: once masks are cheap and prompt-driven, compositing becomes a configuration problem, not a hand-drawn matte problem.
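The three-layer composite reduces to a per-pixel rule. A minimal sketch using plain nested lists instead of image arrays: masks are boolean grids, pixels can be any value. The function names and the tiny 1×3 example are illustrative only.

```python
def union_masks(masks):
    """Pixel-wise OR over a list of same-sized boolean masks."""
    height, width = len(masks[0]), len(masks[0][0])
    return [[any(m[y][x] for m in masks) for x in range(width)] for y in range(height)]

def composite(frame, text_layer, text_alpha, foreground_mask):
    """Frame pixels where foreground or no text; text pixels everywhere else."""
    out = []
    for y, row in enumerate(frame):
        out_row = []
        for x, pixel in enumerate(row):
            if foreground_mask[y][x]:
                out_row.append(pixel)             # foreground occludes the text
            elif text_alpha[y][x]:
                out_row.append(text_layer[y][x])  # text covers the environment
            else:
                out_row.append(pixel)             # untouched environment
        out.append(out_row)
    return out

frame = [["env", "person", "env"]]
text_layer = [["T", "T", "T"]]
text_alpha = [[True, True, False]]   # text covers the first two pixels
person_mask = [[False, True, False]]
bottle_mask = [[False, False, False]]

foreground = union_masks([person_mask, bottle_mask])
print(composite(frame, text_layer, text_alpha, foreground))
```

Adding another occluder is then just one more mask in the list passed to `union_masks`.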

The Insight That Actually Generalizes

What stuck is not any single demo. It is that SAM collapses the distance between intent (“these things belong in front”) and machine-manipulable state (per-pixel ownership over time).

Prompt → mask → program.
You stop hand-authoring detectors for every object class; you name what matters and operate on the pixels the model hands back.

That is closer to writing constraints for an operator than to classic OpenCV plumbing—different skill, different failure modes.

Limits (Honest Accounting)

Wall time: a 53 s clip at conservative settings was on the order of ~6 minutes end-to-end here. “Real-time” at full HD is still aspirational; lower resolution (e.g. 224 px short side) is where live-ish tracking becomes plausible.
Temporal noise: fast edges flicker without smoothing; an EMA across masks fixes a lot of it but needs tuning.
Semantics: the model gives you shapes and categories, not causality. It will not explain why the left lane is busier—only that the pixels were there.

Where This Goes

The trajectory is unambiguous: faster silicon, better propagation, tighter integration with capture pipelines. Tomorrow’s constraint is less “can it segment?” and more “what do you ask it to pay attention to?”

That question is already the interface SAM forces you to answer—one English prompt at a time.

End result: a local, prompt-driven video operator that turns “describe what matters” into masks and tracks you can feed normal code—counts, dashboards, editors—without standing up a CV lab first.

References

mlx-community/sam3.1-bf16 (weights)
MLX (Apple Silicon framework)
Meta’s SAM family (Segment Anything) — see official docs and model cards for capability boundaries

DEVLOG: Fine-tuning Evo2 with RLVR for Regulatory DNA Design

Fri, 24 Apr 2026 00:00:00 GMT

This is a planning devlog for a project I've been thinking about for a while — using reinforcement learning with verifiable rewards (RLVR) to fine-tune Evo2 for targeted regulatory DNA design.

The Problem

Designing regulatory DNA sequences that activate in a specific cell type is hard. Current generative models — including Evo2, which I've worked with before — can produce plausible sequences, but they're not steered toward any particular functional objective. You get diversity, not specificity.

The standard approach is to train a supervised model on MPRA (Massively Parallel Reporter Assay) data and hope it generalizes. But MPRA datasets now have 150M+ labeled sequences with measured regulatory activity across cell types. That's enough signal to build a reward model — and if you have a reward model, you can do RL.

The Idea

The core loop is straightforward:

Evo2 generates candidate regulatory sequences
A reward model trained on MPRA data scores each sequence for actual regulatory activity in a target cell type
RLVR updates Evo2 toward sequences that score high

The reward is grounded in real experimental measurements, not another model's predictions. That's what makes it "verifiable" in the RLVR sense — similar to how math RL uses a checker rather than a judge.

Top candidates from the RL loop get validated against Borzoi as a secondary oracle (a sequence-to-activity model trained on ENCODE data, Nature Genetics 2024).

This is inspired by the recent RL for Crystal Relaxation work — same philosophy: use a physically grounded reward instead of a learned proxy.

Why This Hasn't Been Done

I looked through the literature and nobody has combined Evo2-scale generation (~7B parameters, trained on 2.7M genomes) with MPRA-grounded reward models under RLVR. A few reasons this gap exists:

Evo2 only dropped in early 2025
MPRA datasets at this scale are very recent (the 150M+ collection is from March 2025)
Most bio ML work still treats sequence design as a supervised problem

The pieces are all available now. The RLVR framework from Prime Intellect's verifiers library maps cleanly onto this — you just swap out the math verifier for an MPRA-trained reward model.

Prior Work I'm Building On

Evo2 paper (Nature 2026) — the base model and weights
RL for regulatory DNA design — closest prior work, but uses a weaker base model and synthetic rewards
MPRA Dataset Collection — 150M+ labeled sequences, the reward model training data
Borzoi — secondary oracle for validation

I've already shipped one application on top of Evo1, so I have a practical sense of where generative DNA models fail on functional sequence design. That experience is what made this problem obvious to me.

Compute Plan

Rough estimate: 90–150 H100 hours total (the sum of the three line items below).

Reward model training on MPRA data: ~30–50 hours
RLVR fine-tuning of Evo2-7B: ~50–80 hours
Borzoi validation passes: ~10–20 hours

Starting with Evo2-7B. If the reward signal is strong, scaling up to larger checkpoints is the obvious next step.

Next Steps

[ ] Set up MPRA data pipeline and train initial reward model
[ ] Implement RLVR training loop using Prime Intellect's verifiers framework
[ ] Baseline: compare RLVR outputs vs. pure Evo2 generation on target cell type activity
[ ] Validate top-K sequences against Borzoi

Will post updates as I make progress.

Cody — my personal background coding agent

Sat, 11 Apr 2026 00:00:00 GMT

For the last few months, I have been running a background coding agent built using OpenCode on a VPS.

I was inspired by Ramp Inspect and other implementations and wanted to see how far I could push it.

I set a high bar: for the agent to be worth my time, it had to be genuinely useful.

What I want the agent to do

I want the coding agent to handle the following tasks (almost) autonomously:

Handle small tasks such as fixing formatting, adding a new REST API endpoint, or creating a new page
Build the project, update dependencies, and fix any errors that come up
Deploy to the given instance
Build n8n workflows and test whether they are correct or not
Inform me when the work is done
Access all my GitHub repos, and be easy to deploy on a VPS
Use a browser to check the output visually and fix any issues

The litmus test

What would make this agent actually useful to me? It needs to be able to:

Add a new blog post to my personal site
Create a non-trivial automation workflow without my help
Add a new feature like CLI enhancements to a complex Golang project
Add features to itself

Enter OpenCode

OpenCode is a really great agent. It exposes an HTTP interface and you can spawn as many instances as you want. But managing multiple sessions and projects was getting complex, so I built a small script for high-level project management.

Caddy setup

In order to expose multiple projects on custom subdomains, I decided to use a proxy server. I started with nginx but didn't like the setup. Then I found Caddy — really simple and straightforward.

Here's a sample config that reverse-proxies to OpenCode instances running on different ports for better isolation:

*.example.com {
    tls {
        dns digitalocean {env.DO_AUTH_TOKEN}
    }

    @app1 host app1.example.com
    handle @app1 {
        reverse_proxy localhost:4096
    }

    @app2 host app2.example.com
    handle @app2 {
        reverse_proxy localhost:4097
    }

    handle {
        abort
    }
}

Tools

In the default setup, OpenCode can only make changes locally on the server. But I wanted it to use the same tools I have access to as a dev. So I set up:

gh (GitHub CLI)
n8n skills and MCP
browser-use
ffmpeg
docker
gws (Google Workspace CLI)
gemini-cli (for image gen via Nano Banana)

Using OpenClaw as an orchestrator

I gave this setup to BerryClaw — my personal OpenClaw instance running on a Raspberry Pi — and used it to manage multiple projects.

Results

BerryClaw, with the help of this background coding agent:

Created 2 n8n workflows that worked end to end
Sent 3 minor PRs to my personal sites
Wrote 2 blog posts (I gave it the rough content)

I'll keep improving this agent, but I'm happy with where it is now.

Inside Cursor's Agent Traces

Sat, 11 Apr 2026 00:00:00 GMT

Cursor ships with a surprisingly detailed telemetry system for tracking AI-generated code. It stores everything locally in two SQLite databases - one that logs every code chunk the AI writes, another that holds your entire chat history as JSON blobs. Neither was designed as a training dataset, but together they contain the raw materials for one.

This is a writeup of me sniffing around the ~/.cursor folder to see whether Cursor records enough data to reconstruct accept/reject preference signals.

Spoiler: there's no explicit signal. But there's enough to reconstruct meaningful ones implicitly.

The Two Databases

Cursor persists two relevant stores on disk:

~/.cursor/ai-tracking/ai-code-tracking.db          # code provenance tracking
~/Library/Application Support/Cursor/User/
  globalStorage/state.vscdb                         # chat history + KV store

The first is specific to Cursor's AI tracking extension. The second is the standard VSCode global state database - Cursor uses it for chat history on top of the usual IDE state.

ai-code-tracking.db - Six Tables

The tracking database has six tables, two of which are empty but revealing in their intent:

Table	Rows	Notes
`ai_code_hashes`	86,763	One row per code chunk written by AI
`scored_commits`	301	Per-commit AI attribution percentages
`ai_deleted_files`	31	AI-written files that were later deleted
`tracking_state`	1	Config row - stores `trackingStartTime`
`conversation_summaries`	0	Empty. Would have been the richest signal.
`tracked_file_content`	0	Empty. Would store full file content per session.

The ai_code_hashes schema is the core of the system. Each row records the chunk content hash, the source (composer vs human), the file it was written to, the model that generated it, and critically - a requestId that links back to the originating chat turn.

state.vscdb - The Chat Store

The chat history lives in the cursorDiskKV table inside VSCode's global state database. Each row is a JSON blob keyed with one of two patterns:

bubbleId:<conversationId>:<messageId>   # one row per chat message
agentKv:blob:<hash>                     # composer session blobs

A "bubble" is Cursor's term for a single message in the chat - either a user turn (type: 1) or an AI turn (type: 2). The AI turn is actually split across several bubbles: one for the thinking block, one per tool call, one per tool result, and one for the final text response.

The Foreign Key: requestId

The most important thing I found was a genuine join key between the two databases. The user-prompt bubble stores a requestId field in its JSON. The ai_code_hashes table stores the same requestId on every code chunk written during that request. This is a hard, verifiable link - not an inferred one.

Note on conversationId. It's tempting to join on conversationId instead, since it appears in both stores. But the conversationId is not stored inside the bubble's JSON - it's only encoded in the key string (bubbleId:<conversationId>:<messageId>). The requestId field is the actual foreign key.

Coverage isn't perfect. Of 581 distinct requestIds in ai_code_hashes, 512 (88%) have a matching bubble in state.vscdb. The 12% gap is most likely old chat bubbles that were pruned from the rolling retention window while the tracking DB kept its code hashes indefinitely.
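
A coverage check like this can be sketched in Python with the standard `sqlite3` module. The `requestId` column and the `bubbleId:` key prefix come from the description above; the `key`/`value` column names on `cursorDiskKV` and the function name are my assumptions, so check them against your own copy of the databases before relying on this.

```python
import json
import sqlite3

def request_id_coverage(tracking_conn, state_conn):
    """Return (matched, total) for distinct requestIds in ai_code_hashes
    that also appear inside a bubble stored in cursorDiskKV."""
    tracked = {
        row[0]
        for row in tracking_conn.execute(
            "SELECT DISTINCT requestId FROM ai_code_hashes "
            "WHERE requestId IS NOT NULL"
        )
    }
    bubble_ids = set()
    rows = state_conn.execute(
        # Assumed columns: key, value
        "SELECT value FROM cursorDiskKV WHERE key LIKE 'bubbleId:%'"
    )
    for (value,) in rows:
        try:
            bubble = json.loads(value)
        except (TypeError, ValueError):
            continue  # skip rows that are not JSON text
        if isinstance(bubble, dict) and bubble.get("requestId"):
            bubble_ids.add(bubble["requestId"])
    return len(tracked & bubble_ids), len(tracked)
```

Opening both files read-only (or working on copies) is a sensible precaution, since Cursor may have them open while you query.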

A Complete Trace

Here's a concrete example - a real request reconstructed from the two databases. The conversation ID is 11325bee, and the request is a UI tweak to a loading screen component.

<div class="trace not-prose"> <div class="trace-header"> <div class="trace-dot"></div> Conversation 11325bee · 2026-03-15 · composer-1.5 </div>

<div class="trace-step"> <div class="trace-time"> <span class="trace-kind user">user</span> 08:50:14Z </div> <div class="trace-body"> <span class="hi">"the @LoadingScreen.tsx screen text is too light.<br> Make it larger, brighter"</span><br> <span class="dim">fileSelections: [LoadingScreen.tsx]</span><br> <span class="dim">requestId: f8f61054-8500-49ff-b521-140ec847fcc4</span> </div> </div>

<div class="trace-step"> <div class="trace-time"> <span class="trace-kind thinking">thinking</span> 08:50:16Z </div> <div class="trace-body"> <span class="dim">"Adjusting text styling... increasing font sizes<br> and brightness for phase captions, legend items..."</span><br> <span class="dim">thinkingDurationMs: 48</span> </div> </div>

<div class="trace-step"> <div class="trace-time"> <span class="trace-kind tool">tool</span> 08:50:16Z </div> <div class="trace-body"> <span class="em">read_file_v2</span>(<span class="hi">"LoadingScreen.tsx"</span>)<br> <span class="dim">→ 647 lines</span> </div> </div>

<div class="trace-step"> <div class="trace-time"> <span class="trace-kind tool">tool</span> 08:50:24Z </div> <div class="trace-body"> <span class="em">edit_file</span> - <span class="ts">applies diff to LoadingScreen.tsx</span><br> <span class="dim">phase captions, legend, eyebrow, title, tagline, button</span> </div> </div>

<div class="trace-step"> <div class="trace-time"> <span class="trace-kind ai">response</span> 08:50:26Z </div> <div class="trace-body"> Phase captions: <span class="hi">0.52rem → 0.72rem</span>, <span class="ok">rgba(168,144,96,0.55) → rgba(245,217,106,0.92)</span><br> Legend items: <span class="hi">0.42rem → 0.58rem</span> + <span class="ok">filter: brightness(1.15)</span><br> Title: <span class="hi">clamp(58px,11vw,108px) → clamp(72px,13vw,128px)</span><br> <span class="dim">... 6 components updated total</span> </div> </div>

<div class="trace-step"> <div class="trace-time"> <span class="trace-kind data">hashes</span> 08:50:21Z </div> <div class="trace-body"> <span class="ok">18 chunks</span> written to <span class="em">ai_code_hashes</span><br> <span class="dim">requestId: f8f61054-8500-49ff-b521-140ec847fcc4 ✓ matches bubble</span><br> <span class="dim">timestamp delta: +7s from user prompt</span> </div> </div>

<div class="trace-step"> <div class="trace-time"> <span class="trace-kind user">user</span> 08:51:06Z </div> <div class="trace-body"> <span class="hi">"stop the actions of any agent that is not running..."</span><br> <span class="dim">→ no revert, no pushback - moved to next task</span><br> <span class="ok">implicit acceptance signal (+40s)</span> </div> </div> </div>

The timestamps close the loop. The bubble was created at 08:50:14Z. The code chunks in ai_code_hashes carry a timestamp of 08:50:21Z - seven seconds later, consistent with the time to read the file and apply the edit. The same requestId links them unambiguously.

The Missing Signal: Accept vs. Reject

I wanted to build a dataset from these traces but the available signals are too weak to be reliable on their own.

Signal	Source	Quality
User continues after AI edit	Next bubble timestamp	Weak positive
AI-written file later deleted	`ai_deleted_files`	Weak negative
High `composerLinesDeleted` in commit	`scored_commits`	Weak negative
Low churn - AI lines survive commit	`scored_commits`	Weak positive
Explicit thumbs up/down	-	Not present
Per-suggestion accept/dismiss	-	Not present

The closest usable approach is to treat ai_deleted_files entries and commits with high composerLinesDeleted as negative examples, and low-churn commits that retain most of their composerLinesAdded as positive ones. It's a stretch.

The two empty tables - conversation_summaries and tracked_file_content - are the real pity here. The schemas suggest Cursor intended to store full file snapshots and summaries per conversation, which would enable before/after diffs per request. Neither table has ever been populated, at least on this machine. Either the feature was removed, or it was never shipped.

Cursor almost certainly does have explicit accept/dismiss events internally - the tab-completion UI tracks these to report the per-seat acceptance rates that appear on enterprise dashboards. But that data does not surface in either local database.

The Join Graph

For anyone wanting to build on this, here's the complete chain of joins across the two databases and the filesystem:

-- Step 1: pull the user prompt from a bubble
cursorDiskKV["bubbleId:<convId>:<msgId>"]
  → json.text          -- the prompt
  → json.requestId     -- the foreign key  ← THIS IS THE JOIN

-- Step 2: find code chunks written during that request
ai_code_hashes WHERE requestId = '<requestId>'
  → fileName, hash, model, timestamp

-- Step 3: infer survival via git (indirect)
scored_commits
  WHERE commitDate ≈ timestamp
  AND   branchName = current branch
  → composerLinesDeleted   -- high = negative signal
  → v2AiPercentage         -- context

-- Step 4: explicit deletion (strongest of the available negative signals)
ai_deleted_files WHERE conversationId = '<convId>'
  → gitPath    -- was this file written then deleted?

Warning: The scored_commits join is not a hard join. There is no requestId in scored_commits - only git commit hashes. Linking a specific AI request to a specific commit requires matching on file path plus timestamp proximity, which will produce false positives on active repos with many small commits.
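
To make the softness of that join concrete, here is a minimal sketch of timestamp-proximity matching. It assumes the commit rows have already been loaded into dicts with `commitDate` and `branchName` keys (the column names used above) and that both timestamps are in the same unit, epoch seconds here; the function name and the one-hour window are hypothetical.

```python
def candidate_commits(request_ts, commits, branch, window_s=3600):
    """Return commits on `branch` made within `window_s` seconds after
    the request, nearest first. More than one result means the link
    between request and commit is ambiguous."""
    hits = [
        c for c in commits
        if c["branchName"] == branch
        and 0 <= c["commitDate"] - request_ts <= window_s
    ]
    return sorted(hits, key=lambda c: c["commitDate"] - request_ts)
```

On a busy repo this will often return several candidates for one request, which is exactly the false-positive problem described in the warning.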

What This Is Actually Good For

Despite not having an explicit reward signal, the database is genuinely useful for a few things that aren't RL training:

Code attribution auditing. The hash in ai_code_hashes is a content hash of each code chunk. Given a file at any point in git history, you can check whether its chunks appear in the tracking DB and recover which model wrote them, when, and in response to which conversation.

Model comparison. The database spans multiple models across the same codebase and the same developer workflow. Comparing chunk survival rates per model (for example, how much of what composer-2 wrote was later deleted or retained, compared with claude-sonnet-4-6) would be a meaningful, if noisy, quality signal.

Workflow reconstruction. The bubble JSON is rich. It stores attached files, terminal selections, tool call sequences, thinking blocks, and the full context window configuration for every turn. Reconstructing the full trajectory of how a feature was built - including the order of AI requests, what context was provided, and how the conversation evolved - is entirely possible from these two databases.

Notes:

All data examined was from a single machine. Database schema and content will vary across Cursor versions and per-user configuration.
The 88% requestId coverage figure reflects bubbles available in state.vscdb at time of query. Cursor likely prunes old bubbles; the tracking DB does not prune hashes.
ai_code_hashes.hash is 8 characters, suggesting a truncated SHA or CRC of the chunk content. Not a full SHA-256; collisions are possible at scale.
The agentKv:blob:* rows in cursorDiskKV are binary-encoded protobuf or similar - not examined in this session.
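
To put a rough number on the collision note above, here is a birthday-bound estimate. It assumes the 8 characters are hexadecimal (so 32 bits), that hashes behave like uniform random values, and that each of the 86,763 rows is a distinct chunk; none of those is confirmed by the database itself.

```python
import math

def collision_probability(n_items, hash_bits):
    """Birthday-bound approximation of P(at least one collision)
    among n_items uniformly random hashes of hash_bits bits."""
    pairs = n_items * (n_items - 1) / 2
    return 1.0 - math.exp(-pairs / 2 ** hash_bits)

# 86,763 rows in ai_code_hashes; 8 hex characters = 32 bits (assumed)
p = collision_probability(86_763, 32)
```

Under those assumptions `p` comes out at roughly 0.58, so at least one accidental collision is more likely than not at this table size. That matters mostly for attribution auditing, where a hash match is treated as evidence of authorship.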

BYOA — Bring Your Own Agent

Tue, 31 Mar 2026 00:00:00 GMT

Recently Alex Hormozi talked about how jobs are changing and what is being done to automate roles in a business.

Each job role has a description of what the person will do in that role. Most roles are workflow-based — essentially a set of tasks that need to get done.

The anatomy of a role

Consider a software engineer. The role requires someone to:

Read and write code
Review other people's code
Design software
Coordinate with team members
Report to manager
Write specs, docs, emails

These are some high-level tasks that can be broken down further if required. These tasks, along with the unique knowledge and context of the company or product, make a software engineer.

What can be automated right now

The core idea is to think about which parts of these tasks can be automated.

For example:

Merge conflicts — takes up a significant amount of time
Preliminary code review — lint issues, flagging company policy violations, etc.

These are just two very small tasks that can be automated right away with an AI agent and the right tools.

BYOA is becoming the expectation

I have been looking at many job posts recently, and most of them expect the applicant to bring their own workflows, agents, and tools so they can do their job more easily and deliver outcomes faster.

This trend is something I suspect will keep growing. What I am doing right now:

Encode my knowledge base (doing it for the last 3 years)
Provide agents with tools to make them better at doing their job
Write, write, write — SOPs, workflow specs, instructions, prompts

The proactive framing

This line of thinking makes you proactive:

"How can I automate myself out of my job in the next two months?"

If this is something you are already doing or care about, you are already on the right track.

Web3Compass x BYOS Experience

Mon, 30 Mar 2026 00:00:00 GMT

Recently I started a learning path on Filecoin Onchain Cloud to understand what I can do with it. It is hosted on a platform called Web3Compass.

First of all, the platform is really well crafted. There are so many things I like about it:

the way the program is structured
quizzes
progress tracking

All work as expected.

I am doing the BYOS learning path, and I was specifically interested in the trustless agent section since I am already familiar with FOC concepts.

One good thing about this path is that all the required code and starter packs are linked in the chapters themselves, making it easy for me to integrate them into the project while I am learning a concept.

To test the concepts I have learnt in the course, I decided to build an ambitious project - FilCraft.

FilCraft is an AI agent economy where the agents have to buy their own storage by delivering valuable artifacts. I want to see how far I can push the autonomy of these agents.

To keep track of their memory and reputation, I adopted ERC-8004 and deployed custom contracts on the Filecoin Calibration network.

I am still experimenting with a lot of things in the project and plan to progressively add features as I learn more.

You can try out the agents here - filcraft.vercel.app

The Region Beta Paradox of Modern Coding

Thu, 26 Feb 2026 00:00:00 GMT

You have probably experienced this paradox in scenarios other than coding.

You would walk half a mile to reach somewhere in 10 minutes, whereas you would drive to reach your workplace 2 miles away and get there faster. A minor back pain is ignored for years, while an acute fracture is immediately taken care of.

The mild inconvenience never triggers a response. The intense one forces you to act, and acting makes things better.

Something similar is happening in the coding agents space.

The comfortable middle

Earlier, users started using tab complete models extensively. They provided the right amount of control and productivity boost. The feedback loop was tight: you typed, the model predicted, you accepted or rejected. It felt like a natural extension of how developers already worked.

Now with better long-horizon models, users have increasingly adopted agentic workflows. Michael Truell's recent thread about Cursor's evolution captures this shift clearly: in March 2025, Cursor had roughly 2.5x as many Tab users as agent users. That ratio has now flipped — they have 2x as many agent users as Tab users, with agent usage growing over 15x in a single year.

The paradox

When you use tab completion, you write code 20–30% faster. That feels good. You're still walking in the rain, just slightly faster.

But when you hand off 35% of your PRs to autonomous cloud agents — as Cursor's own team now does — everything has to change. Your tests can't be flaky. Your environments can't be brittle. Your task descriptions need to be precise enough for a non-human teammate to act on independently.

The pain of that reorganisation is sharp enough to actually trigger a deeper structural response — and that's exactly what makes it productive.

This is the paradox. The mild productivity boost of tab completion was comfortable enough to never demand change. Autonomous agents are disruptive enough that they force it.

The agent-first developer

The amount of code generated by agents is increasing rapidly, making agents the default option for coding.

Truell describes developers who have fully embraced this shift as having three traits:

Agents write almost 100% of their code
They spend their time breaking down problems and reviewing artifacts
They spin up multiple agents simultaneously instead of handholding one to completion

The adoption curve is probably skewed toward power users. Lots of programmers have not yet experienced the capabilities of the latest coding agents. Many are still in the tab completion zone — productive enough to feel modern, not disrupted enough to transform.

What comes next

Agentic coding patterns require a lot of structural changes in organisations that have traditionally been slow to move.

I predict new companies will skip tab-completion-based coding entirely and directly start with an agent-first approach for building software. They won't carry the baggage of workflows designed around human keystroke productivity. They'll structure their codebases, their specs, and their review processes for agents from day one — and that head start will compound fast.

The region beta paradox tells us that the comfortable middle is often the trap. In coding, that trap is tab completion.

The teams and companies that push through to the acute discomfort of agent-first development will be the ones that come out faster on the other side.

Defining the Hairy Problem

Sat, 10 Jan 2026 00:00:00 GMT

This is the fundamental issue each knowledge worker needs to solve on a daily basis.

Entropy reduction

If you look at a random day in a knowledge worker's job, it sounds something like this:

"Boss has asked me to fix the sales numbers. What does that even mean?"
"This design does not drive conversions, and I have no idea what to do."

Entropy is why an email with the subject line "Quick sync?" ruins your morning.

The job of a knowledge worker is to reduce the entropy in the system—to act like a funnel that takes in high entropy (chaos) and converts it into low entropy (order).

Articulation is the tool with which one can achieve this. It is not just writing the problem down; it is constraint identification.

Clarity of thought vs intelligence

Standardized tests (SATs, GREs) are designed to measure raw intelligence, but they cheat: they give you well-defined problems. They remove the noise. Real life is nothing but noise. Therefore, real-world success is a function of Raw Intelligence × Clarity of Thought.

Clarity of thought and intelligence seem to be two different things. While one is largely innate, the other is an acquired skill.

Articulation is a skill, as it is trainable. The skill can be defined as the ability to define the shape of a problem. So, how do we train this "clarity of thought"? It starts by visualizing the problem physically.

The shape of the problem

A shapeless problem is a source of stress. It might look like—"My work is boring!"

A better problem statement would be—"I need to negotiate for two interesting tasks that help me improve my Python coding skills without increasing my total working hours."

Let's take a really complex, hairy issue—something that has a lot of moving variables, like deciding a career or finding the right house.

We can take inspiration from computers and try to solve it algorithmically. This would mean breaking down the problem into smaller ones until we hit a trivial problem with a known solution. This is called recursive decomposition and is a well-studied concept that is heavily used while designing solutions for complex problems in computer science.

Vague: Deciding a career

Decomposed: "Need to find a role that pays at least $X, requires no more than Y hours of travel, and utilizes skill Z, within the next six months."

This makes the problem far more solvable, which is exactly what we are trying to do.

Math is very rigorous when it comes to defining problems. Jacques Hadamard suggested the concept of well-posed problems.

A problem is only "well-posed" if:

A solution exists
The solution is unique
The solution depends continuously on the data

While life doesn't always have a unique solution like math, we should strive for the clarity that math demands.

This gives us an algorithm to convert a mess into a better-organized problem:

Define constraints that the desired solution should have
Restrict the scope of the solution

The cool thing about defining a problem is that it is a very effective tool for understanding the problem.

Automate the articulation

The highest form of this skill is when articulation becomes automated. Great problem solvers are really just framework developers. They find themselves defining the same problem types repeatedly, so they crystallize their methodology.

For example, Elon's framework for solving hard problems is:

Make the requirements less dumb. (Question the constraints immediately.)
Delete the part or process. (If you are not adding back at least 10% of the time, you aren't deleting enough.)
Simplify or optimize. (Only do this after steps 1 and 2.)
Accelerate cycle time.
Automate.

Or Jack Dorsey's protocol:

Perfect every detail.
Limit the number of details.

The payoff

"A problem well put is half solved." — John Dewey

The hidden benefit of articulation is that once a problem is constrained and defined, you no longer have to be the one to solve it. A vague problem requires you to stress over it. A well-articulated problem can be outsourced—to a junior employee, to code, or to AI.

Designing a High-Performance OLTP Database from First Principles

Tue, 06 Jan 2026 00:00:00 GMT

A deep dive inspired by TigerBeetle and Viewstamped Replication

While reading about Viewstamped Replication (VSR), I came across TigerBeetle, which uses this replication protocol to achieve extremely high reliability. Little did I know how fascinating the design of such a database would turn out to be.

If you had to design a highly efficient OLTP database from first principles, how would you do it?

One strong answer is: do what TigerBeetle did.

TigerBeetle builds its system by tightly integrating the four primary dimensions of computer science—network, storage, memory, and compute—to arrive at a configuration that is radically more optimal than conventional databases.

Design Goals

The system starts with a set of aggressive but clear goals:

No horizontal scaling: optimize for orders of magnitude more ops/sec on a single node
1000× performance
10× safety
10× developer and operator experience

                    Design Goals
                         |
        +----------------+----------------+
        |                |                |
    1000× Perf      10× Safety    10× Dev/Ops Exp
        |                |                |
  Orders of mag    Data Integrity    Simplicity
   more ops/sec     & Durability     & Reliability
                         |
              Single Node Optimization
                         |
            No Horizontal Scaling Complexity

Rethinking OLTP Workloads

The nature of OLTP workloads has changed drastically over time.

Originally, OLTP systems handled simple debit/credit operations, primarily for financial data. These operations are essentially database transactions derived from business transactions.

If we conceptually split OLTP into:

OLGP – general-purpose workloads
OLTP – pure transaction processing

we can target scale far more effectively.

Inverting Query Amplification

Traditional financial databases often require ~10 database queries per business transaction.

What if we invert this relationship?

Because OLTP workloads are fundamentally accounting workloads:

1 database query = 1000 business transactions

This inversion fundamentally changes scalability.

Traditional Approach:
    1 Business Transaction  →  ~10 Database Queries
    ❌ High overhead, many round trips

Inverted Approach (TigerBeetle):
    1 Database Query  →  ~1000 Business Transactions
    ✅ Massive batching, optimal throughput

Exploiting the Four Axes

    TigerBeetle Architecture
                |
    +-------+---+---+-------+
    |       |       |       |
 Network Storage Memory  Compute
    |       |       |       |
    |       |       |       +-- Viewstamped Replication
    |       |       |       +-- Deterministic Leader Election
    |       |       |
    |       |       +-- Zero-copy
    |       |       +-- Zero-allocation
    |       |       +-- No runtime malloc
    |       |
    |       +-- 1 MiB WAL writes
    |       +-- Direct I/O
    |
    +-- Batching: 8000 txns/batch

Network

With batching:

One recv() call can ingest ~1 MiB of data
This corresponds to roughly 8000 transactions per batch
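
The arithmetic behind that figure is easy to check. This sketch assumes a fixed-size 128-byte record per transaction, which is my assumption for illustration; the text above only gives the ~1 MiB and ~8000 numbers. The `header_bytes` parameter is likewise hypothetical.

```python
MIB = 1024 * 1024

def transfers_per_batch(message_bytes=MIB, record_bytes=128, header_bytes=0):
    """How many fixed-size records fit in one message after the header."""
    return (message_bytes - header_bytes) // record_bytes

full = transfers_per_batch()                       # 8192
with_header = transfers_per_batch(header_bytes=256)  # 8190
```

Either way the result lands close to the ~8000 transactions per batch quoted above.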

Storage

One write() call writes a 1 MiB WAL entry
Durable, low-latency writes via fsync() or direct I/O

Note: Direct I/O is now often faster than buffered I/O due to modern storage hardware. The real bottleneck has shifted to memory, which must be addressed via:

Zero-copy
Zero-allocation
No malloc during runtime

Compute

Distributed systems must handle leader election when the primary node fails—which is inevitable.

Common consensus algorithms include Paxos and Raft. Here, we focus on Viewstamped Replication (VSR), which handles primary failure while preserving operation order and data correctness.

Viewstamped Replication (VSR)

Key Notes

Use VSR on a single-node system to turn it into a highly available distributed system
Fault tolerance is achieved through redundancy in space and time
Based on state machine replication: initial state + replicated operation log → identical final state

TigerBeetle uses VSR instead of Paxos or Raft primarily because leader election is deterministic.
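
One way to see why this matters: in the textbook description of Viewstamped Replication, the primary for a view is picked round-robin from the view number, so every replica can work out who the primary is locally, without a vote. The sketch below shows that rule only; the function name is hypothetical and it is not taken from TigerBeetle's source.

```python
def primary_for_view(view, replica_count):
    """Round-robin primary selection: the view number alone
    determines which replica index is primary."""
    return view % replica_count

# With 3 replicas, successive views rotate the primary: 0, 1, 2, 0, 1
rotation = [primary_for_view(v, 3) for v in range(5)]
```

Because the mapping is a pure function of the view number, two replicas that agree on the view cannot disagree about the primary.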

VSR Protocol Flow:

Client ────Request───> Primary
                         |
                      Prepare
                         |
          +--------------+---------------+
          |                              |
    Prepare Msg                    Prepare Msg
          |                              |
          v                              v
      Replica 1                      Replica 2
          |                              |
     Prepare OK                     Prepare OK
          |                              |
          +--------------+---------------+
                         |
                         v
                      Primary
                         |
                      Commit
                         |
          +--------------+---------------+
          |              |               |
      Response      Commit Msg      Commit Msg
          |              |               |
          v              v               v
       Client        Replica 1       Replica 2
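
The commit decision in the flow above can be sketched as a simple quorum count. This follows the standard VSR rule that, with 2f + 1 replicas, the primary commits once f + 1 replicas (itself included) have the operation, so in a 3-node cluster a single Prepare OK is already enough even though the diagram shows both arriving. The class and method names are hypothetical, and the primary is fixed at index 0 for simplicity.

```python
def quorum(replica_count):
    """Smallest majority: f + 1 out of 2f + 1 replicas."""
    return replica_count // 2 + 1

class PrimaryOp:
    """Tracks Prepare OK messages for one operation on the primary."""

    def __init__(self, op_number, replica_count):
        self.op_number = op_number
        self.replica_count = replica_count
        self.acks = {0}  # the primary (index 0 here) counts its own prepare
        self.committed = False

    def on_prepare_ok(self, replica_index):
        # A set ignores duplicate acks from the same replica.
        self.acks.add(replica_index)
        if len(self.acks) >= quorum(self.replica_count):
            self.committed = True
        return self.committed
```

Usage: `PrimaryOp(1, 3).on_prepare_ok(1)` returns `True`, while a 5-replica cluster needs acks from two distinct backups before the operation commits.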

The Performance Result

4 syscalls, 4 memory copies, 3 network requests → ~8000 transactions processed per batch

The biggest win is actually the reduction of row locks, from ~16k down to effectively zero.

Request Batch → recv syscall → write WAL syscall → send ack syscall → read state syscall
                                                                              |
                                                                       8000 txns/batch ✅

Why Only Four Syscalls?

In a typical processing cycle, the system needs to:

Receive batched requests from the network
Write them to the WAL
Send acknowledgments
Read state for processing

Instead of issuing a syscall per packet or block, TigerBeetle aggregates all operations.

Using io_uring, the application submits a batch of reads and writes and enters the kernel once. This allows a single execution thread to saturate modern NVMe drives and 100 GbE network links.

Durability Considerations

More scale demands more durability
Even 0.5% disk corruption over two years becomes catastrophic at large scale
Durability issues grow exponentially with workload size

Scale
  └─> More Data
       └─> Higher Risk of Corruption
            └─> Exponential Impact
                 └─> Need Cryptographic Guarantees

Replication Concerns with Raft

Raft does not exploit global redundancy
In a 1-primary, 2-replica setup:
- One replica missing data
- One replica corrupted
If the primary fails, Raft cannot determine a new leader and stalls

TigerBeetle avoids this by using cryptographic hash chains for replication integrity.

Raft Limitation:
    Replica 1: Missing Data
    Replica 2: Corrupted
    Primary Fails
      └─> Cannot Elect New Leader
           └─> System Stalls ❌

TigerBeetle Solution:
    Cryptographic Hash Chains
      └─> Verify Data Integrity
           └─> Safe Leader Election ✅

Codebase Deep Dive

Written in Zig
Explicit control flow
Strict, static resource allocation
Extensive use of "checksums to check the checksums"

Every data structure—from protocol headers to on-disk state—is fortified with cryptographic checksums.

Core Components

VSR implementation: src/vsr.zig
Replica state machine: src/vsr/replica.zig
- Acts as the CPU of a TigerBeetle node
- Transitions state based on VSR messages (Prepare, Commit, ViewChange, etc.)

One interesting finding from this crash report is how the quorum_headers function tallies cluster responses.

Time is measured not in wall-clock seconds, but in logical ticks—processed messages via replica.tick().

Because state is a pure function of the log, state transfer is as simple as sending snapshots or data files.

Replica State Machine:

    [Start]
       |
       v
   ┌────────┐
   │ Normal │<─────────────┐
   └────────┘              |
       |                   |
  Primary Failure          |
       |                   |
       v                   |
   ┌────────────┐          |
   │ ViewChange │          |
   └────────────┘          |
       |                   |
  log_view < view?         |
       |                   |
    +──+──+                |
    |     |                |
   Yes   No                |
    |     |                |
    v     v                |
DoViewChange  StartView    |
    |          |           |
    |          |           |
    +────┬─────+           |
         |                 |
    New Primary Elected    |
         |                 |
         └─────────────────┘

Handling Leader Failure

To initiate a leader change, a replica transitions from normal status to view-change status.

This is handled in src/vsr/superblock.zig:

// A log_view behind view means a view change was still in progress
// when this superblock was written, so the headers are tagged accordingly.
pub fn view_headers(superblock: *const SuperBlockHeader) vsr.Headers.ViewChangeSlice {
    return vsr.Headers.ViewChangeSlice.init(
        if (superblock.vsr_state.log_view < superblock.vsr_state.view)
            .do_view_change
        else
            .start_view,
        superblock.view_headers_all[0..superblock.view_headers_count],
    );
}

If log_view < view, the replica infers that it was in the middle of a view change and tags the headers as do_view_change; otherwise it tags them as start_view.

The critical insight: view state is persisted in the superblock, not volatile memory.

Storage as the Anchor of Trust

TigerBeetle does not rely on filesystem guarantees—because they effectively do not exist.

Instead, it implements its own transactional guarantees using a cryptographic hash chain, capable of reconstructing a universal state even from partially corrupted replicas.

SuperBlockHeader

Defined in src/vsr/superblock.zig

Uses Zig's extern struct to guarantee precise on-disk layout.

Key fields include:

Field	Type	Description
`checksum`	`u128`	Checksum of remaining fields
`copy`	`u16`	Superblock copy index (0–3)
`sequence`	`u64`	Monotonic version counter
`cluster`	`u128`	Cluster UUID
`parent`	`u128`	Hash of previous superblock
`vsr_state`	`VSRState`	Embedded consensus state

Embedding VSRState directly into storage solves split-brain at the storage layer. On startup, the node immediately knows its latest consensus state—no log replay needed.

The superblock is physically duplicated four times on disk.

        SuperBlock
            |
    +-------+-------+-------+
    |       |       |       |
  Copy 0  Copy 1  Copy 2  Copy 3
    |       |       |       |
    +-------+-------+-------+
            |
        Redundancy
            |
   Survives Partial Corruption
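
The redundancy idea above can be sketched in a few lines of Python. This is not TigerBeetle's code: `Copy`, `checksum_of`, `make_copy`, and `pick_superblock` are hypothetical names, BLAKE2b is only a stand-in for the real checksum, and "take the valid copy with the highest sequence" is a deliberate simplification of whatever quorum rule the real superblock uses.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


def checksum_of(sequence: int, payload: bytes) -> bytes:
    """Checksum over the remaining fields (BLAKE2b is a stand-in, an assumption)."""
    data = sequence.to_bytes(8, "little") + payload
    return hashlib.blake2b(data, digest_size=16).digest()


@dataclass
class Copy:
    checksum: bytes
    sequence: int
    payload: bytes

    def valid(self) -> bool:
        # A copy is trusted only if its stored checksum matches its contents.
        return self.checksum == checksum_of(self.sequence, self.payload)


def make_copy(sequence: int, payload: bytes) -> Copy:
    return Copy(checksum_of(sequence, payload), sequence, payload)


def pick_superblock(copies: List[Copy]) -> Optional[Copy]:
    """Return the valid copy with the highest sequence, or None if all are corrupt."""
    valid = [c for c in copies if c.valid()]
    return max(valid, key=lambda c: c.sequence, default=None)
```

With four copies on disk, one corrupted copy and one stale copy still leave a usable latest state; only when every copy fails its checksum does the sketch give up.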

Journal (WAL) as a Hash Chain

TigerBeetle's WAL—called the journal—is not a simple append log.

Each entry contains:

A sequence number
A checksum pointing to the previous entry

This forms a hash chain similar to a blockchain.

The function valid_hash_chain_between verifies integrity across log ranges. If corruption is detected (e.g., torn writes), the system repairs itself by fetching clean blocks from peers.

Journal Hash Chain:

[Entry 1] --checksum--> [Entry 2] --checksum--> [Entry 3] --checksum--> [Entry 4] --checksum--> [Entry 5]

Each entry contains:
  - Sequence number
  - Checksum of previous entry
  - Transaction data

Similar to blockchain structure for integrity verification
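
A minimal Python sketch of such a hash chain, to make the verification step concrete. The names `Entry`, `append`, and `first_broken_link` are hypothetical, BLAKE2b stands in for the real checksum, and the field layout is an assumption; the real check lives in valid_hash_chain_between.

```python
import hashlib
from dataclasses import dataclass
from typing import List


def digest(data: bytes) -> bytes:
    return hashlib.blake2b(data, digest_size=16).digest()


@dataclass
class Entry:
    sequence: int
    parent: bytes      # checksum of the previous entry
    body: bytes        # transaction data
    checksum: bytes = b""

    def compute_checksum(self) -> bytes:
        return digest(self.sequence.to_bytes(8, "little") + self.parent + self.body)


def append(log: List[Entry], body: bytes) -> None:
    """Append an entry that points at the checksum of the current tail."""
    parent = log[-1].checksum if log else bytes(16)
    entry = Entry(len(log) + 1, parent, body)
    entry.checksum = entry.compute_checksum()
    log.append(entry)


def first_broken_link(log: List[Entry]) -> int:
    """Return the index of the first entry that fails verification, or -1."""
    parent = bytes(16)
    for index, entry in enumerate(log):
        if entry.parent != parent or entry.checksum != entry.compute_checksum():
            return index
        parent = entry.checksum
    return -1
```

Because every entry commits to its predecessor, a single flipped byte is detected at the entry where it occurs, which tells the node exactly which block to re-fetch from a peer.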

Storage Data Structures

Instead of a single LSM tree, TigerBeetle uses an LSM forest:

Object trees – transfers sorted by timestamp
Index trees – primary and secondary indexes

This improves locality and performance.

        LSM Forest
            |
    +-------+-------+
    |               |
Object Trees   Index Trees
    |               |
Transfers by    +---+---+
 Timestamp      |       |
                |       |
            Primary  Secondary
            Indexes  Indexes
                |
                v
         Better Locality
                |
                v
        Higher Performance

I/O: Direct I/O and Determinism

TigerBeetle bypasses OS abstractions like page cache to eliminate unpredictability.

From src/io/linux.zig:

fn fs_supports_direct_io(dir_fd: fd_t) !bool {
    if (!@hasField(posix.O, "DIRECT")) return false;
    //...
}

Kernel page cache is dangerous because:

fsync() may lie about durability
Memory pressure causes unpredictable eviction
Cached corruption propagates silently

Traditional I/O ❌                Direct I/O ✅
      |                                |
 Kernel Page Cache               Bypass Page Cache
      |                                |
  +---+---+                       +----+----+
  |   |   |                       |         |
fsync  Unpredictable        Deterministic  True
may lie  eviction           Behavior     Durability
      |
Silent corruption
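
For readers who do not write Zig, here is a rough Python analogue of that probe. Only the first check mirrors the snippet above (`@hasField(posix.O, "DIRECT")`); the rest of the Zig function is elided in the excerpt, so the "create a temporary file and try to open it with O_DIRECT" step is an assumption about how such a probe could work, not a transcription.

```python
import errno
import os
import tempfile


def fs_supports_direct_io(directory: str) -> bool:
    """Best-effort probe: can files in `directory` be opened with O_DIRECT?"""
    # Platforms without O_DIRECT (for example macOS) cannot support it at all.
    if not hasattr(os, "O_DIRECT"):
        return False

    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        probe = os.open(path, os.O_RDONLY | os.O_DIRECT)
    except OSError as err:
        # Filesystems that reject O_DIRECT typically fail the open with EINVAL.
        if err.errno in (errno.EINVAL, errno.EOPNOTSUPP, errno.ENOTSUP):
            return False
        raise
    else:
        os.close(probe)
        return True
    finally:
        os.unlink(path)
```

The result depends on the filesystem behind `directory`, so the same binary can report different answers on different mounts.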

Static Memory Allocation

Many production crashes happen under load due to allocation failures.

TigerBeetle avoids this entirely:

Startup: compute maximum memory requirements
Allocate one large contiguous block
Runtime: pop and return buffers from a pool

See src/message_pool.zig and this stack trace

Static Memory Allocation Flow:

Startup
  └─> Calculate Max Memory Requirements
       └─> Allocate One Large Contiguous Block
            └─> Create Buffer Pool
                 └─> Runtime ✅
                      ├─> Pop Buffer
                      ├─> Use Buffer
                      └─> Return Buffer
                           └─> (cycle continues)

Benefits:
  • No allocation failures under load
  • Predictable memory usage
  • No fragmentation
  • Zero runtime malloc overhead
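
The lifecycle above can be sketched in Python. `BufferPool`, `acquire`, and `release` are hypothetical names, not the API of src/message_pool.zig, and Python still allocates small view objects internally, so this only illustrates the pattern: one up-front allocation, then pop and return.

```python
from typing import List


class BufferPool:
    """Fixed set of equally sized buffers carved out of one arena at startup."""

    def __init__(self, buffer_size: int, buffer_count: int) -> None:
        # Single up-front allocation sized for the worst case.
        self._arena = bytearray(buffer_size * buffer_count)
        view = memoryview(self._arena)
        self._free: List[memoryview] = [
            view[i * buffer_size:(i + 1) * buffer_size] for i in range(buffer_count)
        ]

    @property
    def available(self) -> int:
        return len(self._free)

    def acquire(self) -> memoryview:
        # Exhaustion is an explicit, bounded condition rather than an
        # out-of-memory surprise under load.
        if not self._free:
            raise RuntimeError("buffer pool exhausted")
        return self._free.pop()

    def release(self, buffer: memoryview) -> None:
        self._free.append(buffer)
```

The important property is that the upper bound is decided once, at startup; at runtime the only possible failure is a visible "pool exhausted" condition that the caller must handle.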

A Note on Programming Style

This style of programming is refreshing.

Even without deep familiarity with Zig, the code is easy to follow due to:

Explicit callbacks
Clear control flow
Descriptive struct layouts
Noun-based naming (replica.pipeline, replica.preparing)

It is a style that makes correctness obvious—and bugs uncomfortable.

End result: a database designed from first principles, where performance, safety, and determinism are not trade-offs, but consequences of the architecture.

Understanding the will of the universe

Mon, 05 Jan 2026 00:00:00 GMT

import Sidenote from '../components/Sidenote.tsx';

If you really love something, you need to understand the language in which it speaks. For all universe lovers, especially people who are trying to understand the will of the universe, it is crucial to <Sidenote note="Math is the universe's native tongue — dense, infinite, and largely untranslated.">understand math to a pretty high degree so that you can decipher its thoughts.</Sidenote> The problem with math as a language is that it is really dense and packed with meaning. It can be arbitrarily complex and can branch out into infinite dialects.

A small fraction of math is actually usable for real world scenarios. <Sidenote side="left" note="Most math is invented to fit observations, not the other way around.">The rest is invented out of a need to solve a problem or fit an observation.</Sidenote> Consider partitions or infinite converging series. For instance, you know that you can't divide a segment infinitely, but you do realize that we need some way to formulate this phenomenon using an equation, which in this case has a single finite value on one side and an infinite sum on the other (for example, 1 = 1/2 + 1/4 + 1/8 + ...).

Another case would be <Sidenote note="i = √−1 — a hack that turned out to describe reality more accurately than 'real' numbers.">imaginary numbers.</Sidenote> They show up all the time where you least expect them. But mostly in math. The first time I heard about them, it felt like math nerds wanted to get their hands on the tools used by geometry people, so they decided to invent a new dimension. But it seems that it is conceptually no different from extending the natural numbers into the negative side of the number line.

Another interesting example is from the field of physics. <Sidenote side="left" note="Planck didn't want to quantize energy — he did it as a mathematical trick and it worked.">Max Planck could not solve the ultraviolet catastrophe without using a constant in his equations that made waves quantum.</Sidenote> Now I don't know whether he intended to invent it or discover it, but it surely did fit the description of the universe well, so much so that currently you can't explain any phenomenon without using it.

Science and math have had a deep-rooted relationship of what I call the <Sidenote note="The lancer is physics. The smith is mathematics. Each needs the other to advance.">lancer-smith duo.</Sidenote> The lancer requires tools and weapons to tame seemingly impossible monsters. The smith has them but has forgotten where he has kept them. It is the job of the lancer to dig deep in the smith's workshop and find what he needs.

Coming from a noob's standpoint, having done very little math personally (most of the math I have done was just for passing grades), I really admire mathematicians. <Sidenote side="left" note="Math is like wine — you need a refined palate before you can appreciate the good stuff.">I consider math to be an acquired taste, or a kind of higher pleasure.</Sidenote> You need to be at a certain level to admire it and understand it meaningfully.

I still don't understand how creativity or innovative thinking works. <Sidenote note="Ramanujan claimed his theorems came to him in dreams from a goddess. Peer review was apparently optional.">How could Ramanujan come up with so many groundbreaking equations and theorems on partitions and series with just a basic knowledge of math?</Sidenote>

It should definitely have some random functions component to it. But it has to be so refined and guardrailed that you can rely on it. Since we are playing in the knowledge world, normal rules of evolution don't apply here. <Sidenote side="left" note="Each generation re-climbs the same mountain. No inherited memory — just textbooks.">The math needs to be learnt again and again from the ground up by every person who wants to further it or work on it.</Sidenote> This I feel is a limitation that needs to be addressed but we can talk about it some other time.

It is a proven fact that people who are experts in a particular field, and have often spent a significant amount of time in it, <Sidenote note="Neuroplasticity means the brain literally rewires itself to think in the language of the domain.">have different neural pathways than an average human being</Sidenote>, thanks to our surprisingly good neuroplasticity. This makes it possible for the experts to spend more time on the frontier and skilfully fill in the gaps with their specialised tools.

Khoj Progress Update

Sun, 21 Sep 2025 00:00:00 GMT

Khoj is our platform for AI-powered, location-based treasure hunts.

Players solve riddles, visit physical locations, and earn onchain rewards. Each user receives a unique NFT upon registration, and completing hunts grants rewards recorded on-chain.

Since our MVP at ETHIndia '24, we’ve significantly improved usability, stability, and the overall experience.

This milestone focuses on making Khoj more robust, secure, and fun — while staying true to our mission of building a Web2-native experience that seamlessly onboards users to Web3.

New Architecture

🚀 Major Updates

1. Accurate Distance Algorithm

Replaced the earlier naive distance logic with the haversineDistance algorithm for precise geolocation checks.
Reduces error margins for location validation, with a sub-10m accuracy, ensuring fairer gameplay.
We also noticed that mobile browsers use the phone's native GPS sensor, which yields better location coordinates. Since Khoj is mobile-first, coordinates received from phones with precise location turned on lead to much better results.
PR #36
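
For reference, the haversine formula itself is short. The sketch below is in Python rather than the project's own code; the function names, the mean Earth radius constant, and the `radius_m` threshold are assumptions for illustration, not values taken from PR #36.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (assumed constant)


def haversine_distance(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_within(lat: float, lon: float, target_lat: float, target_lon: float, radius_m: float) -> bool:
    """True if the player is within `radius_m` metres of the clue location."""
    return haversine_distance(lat, lon, target_lat, target_lon) <= radius_m
```

Unlike a naive difference of coordinates, this accounts for longitude lines converging towards the poles, so the same threshold in metres behaves consistently wherever the hunt takes place.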

2. Teams: Secure, Decentralized, and User-Friendly

One of the most requested features during our pilot hunts was support for teams. Designing this was far from trivial:

The original Khoj design assumed solo-only hunts.
We needed a way to let users form teams securely, without central storage, and without exposing sensitive invite codes on-chain.
The final flow had to balance security, decentralization, and user experience.

After several iterations, here’s the final architecture we built:

🔐 Smart Contract Responsibilities

The smart contract is the source of truth for teams.
It handles creation, membership checks, and validation when new members join.

struct Team {
    address owner;
    uint256 maxMembers;
    uint256 memberCount;
    mapping(address => bool) members;
}

mapping(uint256 => Team) public teams;
uint256 public nextTeamId;

Create Team:
createTeam(maxMembers) → creates a new team, sets the owner, and stores maximum size.
Join via Invite:
joinWithInvite(teamId, expiry, signature) → verifies the invite signature and admits new members.

Validation checks:

Invite not expired (block.timestamp <= expiry).
User not already in team.
Team not full (memberCount < maxMembers).
Signature must match the team owner:
- Hash: keccak256("TeamInvite", teamId, expiry, chainId, address(this))
- Verified via ecrecover().

This ensures only legitimate invites signed by the team owner are accepted.
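
The order of these checks can be modelled off-chain in a few lines. This is a Python model of the rules listed above, not the Solidity contract: `recovered_signer` stands in for the address that ecrecover() would return for the invite hash, and whether the owner counts towards memberCount is not specified here, so the sketch leaves the owner out of `members`.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Team:
    owner: str
    max_members: int
    members: Set[str] = field(default_factory=set)


def join_with_invite(team: Team, user: str, expiry: int, now: int, recovered_signer: str) -> None:
    """Admit `user` if every validation rule passes, else raise ValueError."""
    if now > expiry:                             # block.timestamp <= expiry
        raise ValueError("invite expired")
    if user in team.members:                     # user not already in team
        raise ValueError("already in team")
    if len(team.members) >= team.max_members:    # memberCount < maxMembers
        raise ValueError("team full")
    if recovered_signer != team.owner:           # signature must match the owner
        raise ValueError("invalid signature")
    team.members.add(user)
```

Because the hash also covers chainId and the contract address, a signature produced for one deployment should not validate against another; the model above assumes that recovery step has already happened.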

📱 Frontend Responsibilities

The frontend handles user interaction, invite generation, and QR code management, all without relying on a centralized backend.

Creating a team

Team invite code & QR

A. Registration

User connects with ThirdWeb wallet (can be Web3 wallet or Google/social login).
Reads contract state:
- If not in a team → show Create Team / Join via Invite options.
- If already in a team → show team details.

B. Create Team

User sets max team size.
Calls createTeam(maxMembers) on-chain.
Contract returns teamId.

C. Generate Invite (Multi-use, Ephemeral)

Owner chooses expiry (e.g., hunt_start + 1hr).
Frontend generates:
- inviteHash = keccak256("TeamInvite", teamId, expiry, chainId, contractAddress)
Signs the hash using wallet.signMessage() via ThirdWeb.
Bundles {teamId, expiry, signature}.
Encodes into Base58 (shorter than hex → better QR).
Generates a QR code or short code.

⚠️ Invite is shown once only — never stored by backend or on-chain.
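
To make the encoding step concrete, here is a small self-contained Base58 sketch in Python (the frontend itself is not written in Python). The byte layout of the bundle, 8 bytes for teamId, 8 bytes for expiry, followed by the raw signature, is an assumption for illustration, as are the names `pack_invite` and `unpack_invite`; the alphabet is the common Bitcoin-style one.

```python
from typing import Tuple

ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    number = int.from_bytes(data, "big")
    encoded = ""
    while number > 0:
        number, remainder = divmod(number, 58)
        encoded = ALPHABET[remainder] + encoded
    # Leading zero bytes are represented by leading '1' characters.
    leading_zeros = len(data) - len(data.lstrip(b"\0"))
    return "1" * leading_zeros + encoded


def b58decode(text: str) -> bytes:
    number = 0
    for char in text:
        number = number * 58 + ALPHABET.index(char)
    body = number.to_bytes((number.bit_length() + 7) // 8, "big")
    leading_ones = len(text) - len(text.lstrip("1"))
    return b"\0" * leading_ones + body


def pack_invite(team_id: int, expiry: int, signature: bytes) -> str:
    payload = team_id.to_bytes(8, "big") + expiry.to_bytes(8, "big") + signature
    return b58encode(payload)


def unpack_invite(code: str) -> Tuple[int, int, bytes]:
    payload = b58decode(code)
    team_id = int.from_bytes(payload[:8], "big")
    expiry = int.from_bytes(payload[8:16], "big")
    return team_id, expiry, payload[16:]
```

For a 65-byte signature the encoded string comes out noticeably shorter than the same payload in hex, which keeps the QR code less dense and the short code easier to paste.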

D. Join Team

User scans QR or pastes invite string.
Frontend decodes {teamId, expiry, signature}.
Calls joinWithInvite(teamId, expiry, signature) on-chain.
Contract validates and admits user if rules are satisfied.

👉 Detailed implementation: PR #59

Joining a team

Team details after joining a team

3. LLM Upgrade: Claude → Gemini 2.5

Migrated clue generation from Claude to Gemini 2.5.
Benefits:
- Structured outputs, easier parsing.
- Reduced prompt-to-clue parsing errors.
- More robust and consistent gameplay experience.
PR link

4. Custom NFT Images for Hunts

Hunt creators can now upload custom images.
Earlier, all hunts used the same default NFT art.
Each hunt’s NFT is minted via the main Khoj contract, making experiences more personalized.

5. Hunt Creation UI

A full UI for hunt creation now replaces the Remix IDE workflow.
Creators can configure all parameters:
- Hunt name, description, timeline
- Teams toggle + max team size
- Custom NFT image
- Clues (encrypted via Lit + uploaded to IPFS)
Smooth wallet integration powered by ThirdWeb.

6. Wallet & Onboarding

Integrated ThirdWeb wallet, allowing both Web3 wallets and Web2 social login (Google, etc.). This significantly improves the user experience of our target audience.
This aligns with our focus on Web2-native UX → Web3 rewards.

7. Lit Protocol + IPFS Migration

Moved encrypted clue storage from Walrus to Lit Protocol + IPFS (Pinata).
Challenges solved:
- Data passing issues between contract, Lit, and IPFS.
- Handling retries for failed uploads.
- Eliminated hardcoded logic in clue validation and encryption.

Note: Lit seems to have a lot of reliability issues, and we are considering migrating away from the service and exploring alternatives.

8. Backend Stability & Testing

Built an extensive test suite for smart contracts to handle edge cases and prevent regressions.
Added retry mechanisms for clue verification & clue decryption.
Cleaned up legacy code → removed hardcoded values across flow.
Backend deployed on OCI, frontend on Netlify, making Khoj fully live and testable.

Live link: khoj-alpha.netlify.app

9. Miscellaneous Improvements

We also pushed several smaller but important updates:

Environment variable cleanup (safer, more configurable deployments).
Improved error handling and UX flow for retries.
Added product guide for easier onboarding.
Removed unnecessary logic that caused fragility.
Improved the Hunts page UI with a start and end date, pill for whether teams are enabled or not and the number of participants.
Introduced 3 new states for the frontend buttons: Register, Manage and Ended.

Discover Hunts

Hunt States

Added the Khoj logo and updated the favicon.
Added a custom loading screen while fetching data from the contract / backend.
Standardised UI across the Hunt Details, Clue and Hunt End pages.

Solving a Clue

Successful completion of a hunt

Moved all frontend type definitions to a separate directory following principles of DRY.
Added support for PassetHub.
Renamed the contracts to Khoj and KhojNFT for consistency.

📖 Product Guide

We’ve also added a full product guide with screenshots and step-by-step instructions:

👉 Read it here

Covers:

Hunt creation
Registering & exploring hunts
Team creation & joining flow
Solving clues
Rewards

🧭 Next Steps

Run more pilot hunts to refine user experience.
Expand testing to edge cases.
Improve offline → on-chain reward distribution logic.
Continue focusing on frictionless onboarding while keeping the underlying infra secure and decentralized.

We are also actively working on the landing page and other aspects of the product to make it the best in class.

🔗 Resources

Live app: Khoj Alpha
GitHub repo: Khoj

That’s the progress so far 🚀. If you’d like to try Khoj or give feedback, hop into a hunt and let us know your thoughts!

DEVLOG #5

Fri, 06 Jun 2025 00:00:00 GMT

This devlog is focused on KrishiAI and its pivot.

I had worked on KrishiAI as my undergrad major project. After building a set of features for the app, I realised that I was building the wrong thing. The features included:

a plant disease scanner (works for rice only)
pesticide label scanner
voice AI with grounding in agriculture knowledge

This made me think: what problem does the app really solve?

My initial hypothesis

My initial hypothesis was that with the better accuracy of vision models and the accessibility provided by a voice AI interface (with multilingual support), farmers would find the app useful.

I had concluded that the following were the most pressing issues for farmers in India:

Lack of access to information (especially about government subsidies and schemes)
Inability to identify plant diseases early
Low literacy causing issues with understanding the dosage of pesticides

I had talked to around 4-5 farmers within my network to understand their problems, but there was something that I missed.

What went wrong

The core issue was that I did not ask the right questions to the farmers. I was imposing a solution that I cherished and wanted to construct the problem around it -- it should have been the other way around.

Another core issue was that I did not approach the problem with a first principles thinking mindset and forgot about the second and third order effects of my product decisions.

Apart from this, I was trying to learn app dev on the go (with the help of AI) and vibe coded the whole app. This is not a bad thing, but it made me more focused on the app than on solving the core problem.

Pivot

When we think about Indian farmers and agriculture in India, the problems are systemic and need to be solved by state-level actors. Not every problem can be solved by an app, but there are some things that should make the average farmer's life easier.

How can we reimagine an agriculture app that is engaging and solves a problem so well that people spread it by word of mouth?

I have a few ideas that need to be tested with real usage by farmers. These ideas have retention built-in and rely heavily on notifications and alerts.

Raw data is of little use until you process it and present it in a meaningful way for the user to consume. So here are some ideas that I am planning to pursue to test the waters:

Crop Schedules
Pest and Disease Alerts
Post-Harvest Pricing decisions

Now the only work left to do is to figure out a way to present this to the farmer in a useful way. To understand the problem even more deeply, I will be talking to FPO (Farmer Producer Organisation) managers to understand their core problems and see if the use cases I have identified fit their scenario.

An Agentic Browser: What It Needs To Be

Sat, 31 May 2025 00:00:00 GMT

I was scrolling X, as you do, and thinking: How do you really build an AI-native agentic browser?

Browsers are our main window to the internet. Many companies are adding AI to them. But it doesn't feel right. I found this blog post by a YC GP on what AI-native applications should feel like: https://koomen.dev/essays/horseless-carriages/

For me to use a new one, it needs to hit a few marks:

All my current browser stuff (like Chrome) must work. No missing features.
It needs to be private. Using local LLMs is key here.

That's it for the basics. I'm okay with new ways of doing things, even if it means changing habits.

Some features I would like

Here are some features that would make a browser incrementally agentic:

Auto Tab Grouping

Browsers already group tabs. I want it supercharged. A small, local LLM could group tabs based on what I'm reading or doing. It would just know.

Learn Mode

This is like RLHF (Reinforcement Learning from Human Feedback). It's how models like ChatGPT got so good. This should be built into the browser. It would learn from my repetitive tasks.

Think about these scenarios:

Researching a company and its people.
Going down Hacker News rabbit holes to find comments or resources.

Fast, Unified Memory

My bookmarks and history need a better system. I want fast, unified memory for them.

How To Build These Features

Now, how would these smart features actually work?

For Auto Tab Grouping: A Local Brain for Your Tabs

Imagine a mini-AI, a small LLM, right inside your browser. This "local brain" would constantly look at the stuff on your open tabs: the words, the titles, even the website addresses. It would learn what kinds of things you work on together.

For example, if you open three tabs about "new running shoes," it would see they're similar and put them in a "shopping" group for you. It gets smarter over time, learning your own way of organizing.
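
A toy version of this grouping can be written without any model at all. In the sketch below, word overlap between tab titles (Jaccard similarity) stands in for what a local LLM or embedding model would judge; the stopword list, the 0.2 threshold, and the function names are all assumptions picked for illustration.

```python
import re
from typing import List, Set

STOPWORDS = {"the", "a", "an", "for", "and", "of", "to", "in", "best"}


def tokens(title: str) -> Set[str]:
    """Lowercased words of a tab title, minus a few filler words."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower()) if w not in STOPWORDS}


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_tabs(titles: List[str], threshold: float = 0.2) -> List[List[str]]:
    """Greedily place each tab into the first group containing a similar tab."""
    groups: List[List[str]] = []
    for title in titles:
        for group in groups:
            if any(jaccard(tokens(title), tokens(other)) >= threshold for other in group):
                group.append(title)
                break
        else:
            groups.append([title])
    return groups
```

Swapping `jaccard` for a similarity score from a small local model would keep the same structure while catching tabs that are related in meaning but share no words.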

For Learn Mode: Watching and Getting Feedback

This is trickier. The browser would "watch" what you do.

When you do something over and over, like always opening a specific set of websites to research a company, the browser notices that pattern.

Then, the next time, it might pop up and ask, "Hey, are you researching a company again? Want me to open those usual sites for you?" If you say "yes," it gets a good mark. If you say "no," it learns not to do it that way next time.

This feedback (your "yes" or "no") helps it get better at helping you with your tasks, making it super personalized.
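
The watch, suggest, and feedback loop can be sketched as a small class. This is a toy illustration of the idea, not RLHF in the training sense: `LearnMode`, the `min_repeats` threshold, and the plus or minus one scoring are assumptions made up for the example.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, Optional


class LearnMode:
    """Counts repeated sets of sites and learns from yes/no answers."""

    def __init__(self, min_repeats: int = 3) -> None:
        self.min_repeats = min_repeats
        self.seen: Counter = Counter()
        self.score: Dict[FrozenSet[str], int] = {}

    def observe(self, sites: Iterable[str]) -> None:
        """Record one browsing session as an unordered set of sites."""
        self.seen[frozenset(sites)] += 1

    def suggestion(self) -> Optional[FrozenSet[str]]:
        """Most frequent pattern seen often enough and not rejected overall."""
        candidates = [
            pattern for pattern, count in self.seen.items()
            if count >= self.min_repeats and self.score.get(pattern, 0) >= 0
        ]
        return max(candidates, key=lambda p: self.seen[p], default=None)

    def feedback(self, pattern: FrozenSet[str], accepted: bool) -> None:
        """A 'yes' raises the pattern's score, a 'no' lowers it."""
        self.score[pattern] = self.score.get(pattern, 0) + (1 if accepted else -1)
```

A pattern that keeps getting "no" answers stops being suggested, while one that the user accepts stays available, which is the personalisation described above in its simplest form.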

For Fast, Unified Memory: An AI Library for Everything You've Seen

Think of your browser's history and bookmarks like a giant library.

With a fast, unified memory, when you visit a website or save a bookmark, the browser's "brain" (the LLM) would read and understand what that page is about. It stores this understanding, not just the link.

You could ask, "Find that article about running shoe reviews from last week," even if you don't remember the exact website.

The agent would use its understanding of all the pages you've seen to find exactly what you're looking for, super fast.
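
As a rough sketch of "store the understanding, not just the link", the example below keeps a short summary per page and ranks pages against a free-text query. Bag-of-words cosine similarity is a stand-in for the embeddings an LLM would produce, and the class name, method names, and URLs are placeholders.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def vectorize(text: str) -> Counter:
    """Word counts of the lowercased text (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class Memory:
    def __init__(self) -> None:
        self.pages: List[Tuple[str, Counter]] = []

    def remember(self, url: str, summary: str) -> None:
        """Store what the page is about alongside its URL."""
        self.pages.append((url, vectorize(summary)))

    def recall(self, query: str, top_k: int = 1) -> List[str]:
        """Return the URLs whose summaries best match the query."""
        q = vectorize(query)
        ranked = sorted(self.pages, key=lambda page: cosine(q, page[1]), reverse=True)
        return [url for url, _ in ranked[:top_k]]
```

A real implementation would also need the time filter ("from last week") and a model that treats "shoe" and "shoes" as the same thing, which plain word counts do not.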

Implementing these features with current tech is non-trivial. The current way is often using agents with an LLM.

The goal is a browser that truly understands and helps you. It's about moving from just a tool to something with digital intuition.

Memetic Desire

Tue, 27 May 2025 00:00:00 GMT

import HierarchySimulator from '../components/Games.tsx';

None of my desires are my own. We pick up desires from our surrounding people and initially we don't have any taste in doing so. We could end up picking relatively harmless but non-productive desires.

This is deeply ingrained in our DNA and what makes us human and it is hard to overcome this tendency to copy other people's desire. But there are some serious (and interesting) implications of copying others.

Infinite vs Finite Games

There are two types of games that someone can play: the wealth game or the status game. Memetic desire belongs to the latter, because if you desire something, getting it will make you superior to someone who doesn't have it. In a way, it is a zero-sum game.

According to Naval Ravikant, it is better to play the wealth game, as it is inherently a non-zero-sum game. You end up creating more value for all the people involved.

Pursuing wealth tends to increase the size of the pie itself whenever one achieves a breakthrough.

You can't avoid memetic desire completely, since sometimes these advancements come while one is pursuing his memetic desire.

Hierarchies

This problem has plagued humanity since the beginning. As soon as you increase the size of a society, the interactions between people naturally cause this memetic desire to emerge. Status games have this inherent property of forming hierarchical structures, since the resource that everyone is going after is limited.

People tend to be unhappy in a hierarchy. This is because status resides at the top and getting to the top is not a fair game in the real world. Some have a head start, some cheat. But since everyone desires to be at the top, it is bound to create clashes.

Now is there a way to solve this tension? Maybe have something so valuable that everyone desires it and make it ultimately unreachable in a lifetime. This is what religion tries to do. Historically, religion has been something that has kept the bottom half of the pyramid peaceful by giving them a higher purpose.

Solution for the time being

Till we resolve this problem, what we can do is try to pursue higher, more wealth oriented desires. This at least ensures that fulfillment of your desire will ensure that value has been provided to someone.

Guide to Engineering Colleges

Mon, 05 May 2025 00:00:00 GMT

Navigating an engineering college can be overwhelming sometimes. I am writing this from my own experiences, so take any advice in this blog with a grain of salt.

Disclaimer: this post is not for people who have already figured out what they are interested in or are working on something great. This is for the people who feel that they are clueless and want to make the most out of college.

The first step

If you have joined a college close to your house, then this might not be as relevant. But for those who have moved to a far away place, please make sure to fix a schedule.

There is a quote from Atomic Habits: You do not rise to the level of your goals. You fall to the level of your systems.

One of the most essential systems in your college life will be a schedule. It doesn't mean that you have to remove the fun out of it. In fact, add it as a definitive slot. Make it as long as you want.

Go out

The best way to release any kind of mental pressure that I have found is going out.

Go for a walk (boosted if you go out with some friends)
Go play a sport
Go for a run or a jog

Just leave your room and go out. Time won't matter.

Of course you need some friends

This needs no telling, and I don't think you will have any trouble making friends. Just make sure that some of them do not think the way you do. This exposure will help you grow.

Branch matters

In fact, I would go further and say that it matters a lot. The branch you have got might not be what you wanted or what fits your skillset. The branch decides:

where you spend most of your waking time
what occupies a mental space that can't be discarded
how you spend the next 4 years

Having fun

After sorting out the above-mentioned points, optimise for having fun. This is the last time you will have no obligations to do anything. That will change as soon as you leave college. Best of luck with your journey.

Working Hard

Thu, 10 Apr 2025 00:00:00 GMT

Long ago, I had an opinion that I should be able to get what I want without working hard -- whatever that meant to me at that time. Now that I look back at that opinion, I feel that although I wasn't wrong, I wasn't right either.

I have always struggled to find meaning in this world. One way I tried to cope with it was to define short term goals and achieve them. This kept me occupied for a while. But when I stopped deriving meaning from this pursuit, I was again moving towards a nihilistic future. This is when I found Alex Hormozi.

Hard Work is the Goal

Defining goals is not very simple, especially when the goals are vague and amorphous -- winning at life, being successful, etc. So the key insight I got from Alex's words was that we should (again, I am using should here, but none of this is a commandment) set behaviors as goals. This is mainly for two reasons:

It is derived from the future self you desire to become.

For example, say you want to become more charismatic. Charisma is a bucket term for many things: talking slowly, greeting everyone when you enter a room, using people's names often in conversations, etc. The behavior can be to write summaries every day. Just do it, and the behavior will ensure that you become the person you desire to be.

Easier to track a behavior.

Measuring long-term goals is really hard. This may be because your monkey brain is not good at calculating really long-term benefits.

I am particularly choosing working hard as my goal because it helps me measure myself accurately. I tend to compare myself a lot and this goal would give an outlet for that comparison. Although I still somewhat care about the outcome of the work that I put in, I plan to reduce it.

This ties surprisingly well into the Karma philosophy from the Bhagavad Gita, at least in the way I interpret it.

Defining work

This is essential since work means a lot of different things to people. I tend to define work as something that generates output. I try to measure the quantity of work solely by the amount of output it produces. This works for me, since I have clearly defined goals and measurable metrics to track progress. I don't count the hours or effort (what is it really?) required to achieve the goals.

All this is true for me since I am a knowledge worker. I provide value by using my mind and not physical labor. Due to this, we cannot use the traditional physics-based definition of work, which is force x distance.

Knowledge work has huge leverage

The Final Objective Function

Wed, 19 Mar 2025 00:00:00 GMT

Yeah, the title is a little misleading if you have no context on what I have been thinking about for the last few weeks.

The context

I have been trying to understand why current LLMs don't think like humans, or why they don't yet surpass us. This led me to read Suchir Balaji's view of AGI, and eventually I wrote a commentary on it.

Generalization

Generalization is data efficiency. The idea is to capture all the possible causations and correlations among the data points by connecting all the dots.

How to get the models to generalize?

First, though, we need to confirm whether general intelligence is real or not.

Let’s define the characteristics of an architecture that will be able to acquire general intelligence. For this, we can take inspiration from our own brain -

We have the mammalian brain that is trained over millions of years with basic instincts
The neocortex makes us humans special and adaptive to the changing world (this is the reason why we spend a significant time of development outside the womb)

One requirement is obvious - it should be data efficient from the start. This is so that it can form a large number of internal models about that world.

The second requirement would be the ability to reuse neurons for multiple distinct recurrent computations. This also ensures the data efficiency of our brain.

Due to my limited knowledge of the field, I am not sure if the reusable part is true or not. Need to fact-check.

What are we trying to make?

One interesting fact Suchir mentions is, people try to draw similarities between the genome and the trained LLM. This is not really true. The genome is much closer to the LLM codebase than it is to the actual trained LLM.

This is because evolution can’t use the genome to store the words or situations encountered by the ancestor organisms due to the simple fact of keeping the genome short.

This means that the shortness of the genome is an important condition for generalization.

AGI

What OpenAI is trying to do to achieve AGI is to scale test-time compute. This might work in the short term. But it doesn’t solve the generalization problem.

Another problem with building AGI with current approaches, such as scaling NNs, is that the brain is a complex neural network that can learn when data is applied to it.

What we want is not to replicate the brain (which can be considered proof of concept for the idea that AGI exists) but to develop a simpler model of evolution that actually learned the “meta-skill” of producing the brain when tasks across a diverse environment were applied to it.

RL is also not that useful to achieve our goal which is to optimize the objective itself.

Let’s take an example. Consider you have to make an agent solve a maze.

The loss function you might consider would be |x_agent - x_end|. Now, what if the agent gets stuck in a corner? No amount of exploration would help you. Nor would the size of the NN controlling the agent.

This is because in pathfinding, exploration is the problem we are trying to solve.

We could, as programmers, add heuristics and backtracking, but the point is to see whether the agent can come up with it.
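To make the maze argument concrete, here is a minimal sketch on a toy grid of my own (the layout and function names are illustrative assumptions). A greedy agent that only ever steps to reduce |x_agent - x_end| walks into a dead end and stays there, while breadth-first search, which is exactly the kind of hand-written exploration a programmer would add, finds the way around.

```python
from collections import deque

# Toy maze: '#' is a wall, 'S' the start, 'E' the exit.
MAZE = [
    "#######",
    "#S..#E#",
    "#.#.#.#",
    "#.#...#",
    "#######",
]

def find(ch):
    """Return the (row, col) of the first cell containing ch."""
    for r, row in enumerate(MAZE):
        c = row.find(ch)
        if c != -1:
            return (r, c)
    raise ValueError(ch)

def neighbors(pos):
    """Yield the open cells directly next to pos."""
    r, c = pos
    for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
        nr, nc = r + dr, c + dc
        if MAZE[nr][nc] != "#":
            yield (nr, nc)

def distance(a, b):
    """Manhattan distance, playing the role of the loss |x_agent - x_end|."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def greedy(start, end, max_steps=50):
    """Always move to the neighbor with the lowest loss; stop when none improves."""
    pos = start
    for _ in range(max_steps):
        if pos == end:
            break
        best = min(neighbors(pos), key=lambda n: distance(n, end))
        if distance(best, end) >= distance(pos, end):
            break  # stuck: every available move makes the loss worse
        pos = best
    return pos

def bfs(start, end):
    """Breadth-first search: number of steps on a shortest path, or None."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        pos, steps = queue.popleft()
        if pos == end:
            return steps
        for n in neighbors(pos):
            if n not in seen:
                seen.add(n)
                queue.append((n, steps + 1))
    return None

start, end = find("S"), find("E")
print("greedy stops at:", greedy(start, end))
print("bfs path length:", bfs(start, end))
```

The greedy agent halts next to the wall that separates it from the exit, because the only way forward first increases the distance. The search works only because we wrote it in by hand, which is the point of the argument above.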

Newer model architectures

Recently, there might be some hope - https://arxiv.org/abs/2501.00663

The core idea of the paper is to model learning at test time by introducing neural memory modules. These structures aim to emulate human memory.

The limitation of transformers is that the cost of the attention mechanism grows quadratically with sequence length. RNNs don’t have that issue, as their cost grows linearly. The researchers replace the fixed-size vector used as a hidden state in an RNN with a NN that is trained during test time.

The most interesting thing in the paper was how they modeled surprise -
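A rough sketch of that idea as I understand it from the paper: the surprise for a new input is the gradient of the memory's loss on that input, accumulated with momentum, and the memory is then updated with that surprise plus a small forgetting term. The toy below uses a single matrix as the memory and fixed values for eta, theta and alpha, which are simplifying assumptions on my part, and the function names are hypothetical.

```python
def matvec(W, k):
    """Multiply matrix W (list of rows) by vector k."""
    return [sum(w * x for w, x in zip(row, k)) for row in W]

def loss_and_grad(W, k, v):
    """Associative loss ||W k - v||^2 and its gradient with respect to W."""
    err = [p - t for p, t in zip(matvec(W, k), v)]
    loss = sum(e * e for e in err)
    grad = [[2.0 * e * x for x in k] for e in err]
    return loss, grad

def update(W, S, k, v, eta=0.5, theta=0.1, alpha=0.01):
    """One test-time step: S is the running surprise, W is the memory."""
    _, grad = loss_and_grad(W, k, v)
    # Surprise: decayed past surprise minus a step along the current gradient.
    S = [[eta * s - theta * g for s, g in zip(s_row, g_row)]
         for s_row, g_row in zip(S, grad)]
    # Memory: slightly forgotten old memory plus the new surprise.
    W = [[(1.0 - alpha) * w + s for w, s in zip(w_row, s_row)]
         for w_row, s_row in zip(W, S)]
    return W, S

W = [[0.0, 0.0], [0.0, 0.0]]
S = [[0.0, 0.0], [0.0, 0.0]]
k, v = [1.0, 0.0], [1.0, 2.0]
print("loss before:", loss_and_grad(W, k, v)[0])
for _ in range(30):
    W, S = update(W, S, k, v)
print("loss after: ", loss_and_grad(W, k, v)[0])
```

An input the memory already predicts well has a small gradient and barely changes it, while an unexpected input produces a large update, which is the sense in which the gradient acts as surprise.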

This approach simulates the long-term and short-term memory of humans. As a result, it performed much better on the needle-in-the-haystack benchmarks.

But does this fit our criteria for intelligence?

Before we go into that, let's explore another idea discussed in this paper - https://www.arxiv.org/pdf/2410.02536

The main takeaway of the paper is that certain rules that are used to generate synthetic data are much more likely to improve the model's logical reasoning capabilities than others. Does this sound familiar?

This can be compared with the genome example we discussed earlier where the actual intelligence is derived from a complex adaptive meta-learning process called evolution. If you look at the rules in the paper, you can see that they are pretty simple, something like Conway’s game of life.
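For a sense of how simple such rules can be, here is a minimal sketch of a one-dimensional elementary cellular automaton, the family of rules the paper works with as far as I recall. Rule 110, the grid width, and the wrap-around boundary are my choices for illustration, not details taken from the paper.

```python
def step(cells, rule):
    """Apply one update of an elementary cellular automaton with wrap-around."""
    n = len(cells)
    out = []
    for i in range(n):
        left, mid, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        index = (left << 2) | (mid << 1) | right  # neighborhood as a 3-bit number
        out.append((rule >> index) & 1)           # look up that bit of the rule
    return out

def run(rule, width=31, steps=15):
    """Evolve a single live cell for a number of steps and return every row."""
    cells = [0] * width
    cells[width // 2] = 1
    rows = [cells]
    for _ in range(steps):
        cells = step(cells, rule)
        rows.append(cells)
    return rows

for row in run(110):
    print("".join("#" if c else "." for c in row))
```

The whole rule is a single number between 0 and 255, yet the patterns it prints can be far more complex than the rule itself.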

The researchers also tried to check whether the model had learned these simple rules from the dataset. It had not. It did depend on the patterns generated by the rules, which affected its attention mechanism, but not on the rules themselves.

We do have different forms of reasoning - intuitive, logical, though, inductive, causal, search. These are not fixed. For example, Ramanujan had a goddess-based reasoning that helped him get answers to some complex mathematical puzzles.

This means that whatever mechanism we develop, it should not just memorize the simple rules it is trained on. The edge of chaos approach therefore seems promising for the future of reasoning.

On Reasoning

Tue, 25 Feb 2025 00:00:00 GMT

Disclaimer - heavily inspired by Suchir Balaji’s View and other recent papers. Also, this blog post is not very structured.

What is intelligence really?

Suchir defines intelligence as the rate at which you learn a capability from data points. An intelligent being will learn it faster than an unintelligent being.

This definition is important because a machine trained on loads of data might be superhuman on a given task, but the generalization would be absent.

These gaps of understanding between various specialized tasks can’t be filled with RL or scale. This means RLHF or a billion GPUs won’t get us to GPT-11, for example.

Generalization

Generalization is data efficiency. The idea is to capture all the possible causations and correlations among the data points by connecting all the dots.

How to get the models to generalize?

First, though, we need to confirm whether general intelligence is real or not.

Let’s define the characteristics of an architecture that will be able to acquire general intelligence. For this, we can take inspiration from our own brain -

We have the mammalian brain that is trained over millions of years with basic instincts
The neocortex makes us humans special and adaptive to the changing world (this is the reason why we spend a significant time of development outside the womb)

One requirement is obvious - it should be data efficient from the start. This is so that it can form a large number of internal models about that world.

The second requirement would be the ability to reuse neurons for multiple distinct recurrent computations. This also ensures the data efficiency of our brain.

Due to my limited knowledge of the field, I am not sure if the reusable part is true or not. Need to fact-check.

What are we trying to make?

This is because evolution can’t use the genome to store the words or situations encountered by the ancestor organisms due to the simple fact of keeping the genome short.

This means that the shortness of the genome is an important condition for generalization.

AGI

What OpenAI is trying to do to achieve AGI is to scale test-time compute. This might work in the short term. But it doesn’t solve the generalization problem.

Another problem with building AGI with current approaches, such as scaling NNs, is that the brain is a complex neural network that can learn when data is applied to it.

RL is also not that useful to achieve our goal which is to optimize the objective itself.

Let’s take an example. Consider you have to make an agent solve a maze.

The loss function you might consider would be |x_agent - x_end|. Now, what if the agent gets stuck in a corner? No amount of exploration would help you. Nor would the size of the NN controlling the agent.

This is because in pathfinding, exploration is the problem we are trying to solve.

We could, as programmers, add heuristics and backtracking, but the point is to see whether the agent can come up with it.

Newer model architectures

Recently, there might be some hope - https://arxiv.org/abs/2501.00663

The core idea of the paper is to model learning at test time by introducing neural memory modules. These structures aim to emulate human memory.

The most interesting thing in the paper was how they modeled surprise -

This approach simulates the long-term and short-term memory of humans. As a result, it performed much better on the needle-in-the-haystack benchmarks.

But does this fit our criteria for intelligence?

Before we go into that, let's explore another idea discussed in this paper - https://www.arxiv.org/pdf/2410.02536

This means that whatever mechanism we develop, it should not just memorize the simple rules it is trained on. The edge of chaos approach therefore seems promising for the future of reasoning.

DEVLOG #4

Sat, 22 Feb 2025 00:00:00 GMT

I am back after a long time and this time it's different.

TL;DR

Almost got our first client for Maximal Studio
My team's project won bounties @ETHIndia 24
Built an exciting new project that got 2000 visitors
Another project that won me a bounty

1. Maximal Studio and the new client

My friend, with whom I run the agency, got a lead referral from one of his friends. But it was not that simple.

The thing was, they wanted to do social media marketing (paid) and we don't specialise in that. Nevertheless, we did not say no to the client and are in the process of closing the deal.

We also shot an ad for our agency. Follow to get to see the video and BTS.

2. ETHIndia 2024

It was fun hacking at this year's ETHIndia as a team with some of my seniors.

We built Khoj (meaning 'search' or 'discovery' in Hindi), a gamified geo-location based treasure hunt platform that combines real-world exploration with Web3 technology.

Khoj transforms traditional scavenger hunts into an immersive blockchain experience with a dream to bring millions of new users to the Web3 world.

The technical details about the project can be found in the repo here and the demo of the project can be found on Youtube - https://youtu.be/9JETJmd9TOk

3. Cardify

As part of our tool marketing strategy for Maximal Studio, I made a fun little website that can turn your website into a card.

I was pretty happy with the result as the site got around 2000 visitors on the day of the launch.

4. Otakuverse - AI Manga Translator

I built a site that can translate any Japanese manga image into an English one.

This was built using the Spheron Network's compute for training (https://spheron.network).

I will explain how I did it in another blog post soon.

This was a wrap-up of all the things I did recently. See you in the next edition!!

How I built an AI Manga Translator

Sun, 29 Dec 2024 00:00:00 GMT

On finding out about the Spheron Network's bounty challenge, I decided to find out what I can do among the challenges - https://github.com/spheronFdn/sos-ai-bounty.

Since I am into Anime, the manga translator task caught my eye.

I had no idea how to do it, but I decided I would figure it out on the go.

The rough plan

The plan was to approach the problem as a human being. What would I do as a human to get the English version?

I would look out for the speech bubbles in the image.
Get the text and mark the locations or the speech bubbles they belong to.
Translate each of the texts.
Put it back in the right bubbles after removing the Japanese one.

Getting those speech bubbles...

If you don't know what a Manga is, here is one of the manga panels of One Piece:

You will notice that most of the text is inside the blobs called speech bubbles.

So first, I needed a way to extract the speech bubbles out of the manga art.

This did sound like a vision challenge. Also, I was influenced by the fact that I had finished a similar surveillance project with my teammates for a college course. There, I barely did any of the ML part as it was handled by a professional.

Here's the project for the curious - https://github.com/TheUndercover01/Multi-Camera-person-tracking-NITK

So, I decided I would be using a similar approach of detecting speech bubbles instead of jaywalkers. For this, I would need some kind of manga image dataset... hmmmm, got it!

Dataset - https://universe.roboflow.com/luciano-bastos-nunes/mangas-bubble

Training the model...

The model which is most commonly used for detection tasks is YOLO. It is good for general-purpose detection, but for my task, it was not particularly effective. It would have been a great relief otherwise.

Enter fine-tuning:

Fine-tuning the model meant that we have to train the model on specific data, which in our case is the manga speech dataset. This will enable the model to detect the speech bubbles better.

The training environment was rather unconventional as it was trained on a decentralized network of GPU compute where you can freely rent any type of compute.

The translation...

Converting the text to English is pretty straightforward. I just used the deep-translator package, which wraps Google Translate, to translate the Japanese.

In-painting...

In-painting is the process of filling up the missing parts of an image. There are many ways to go about this process, including the use of LaMa: Resolution-robust Large Mask Inpainting with Fourier Convolutions, which can give you pretty good results.

For my fun little project though, I thought of going simpler using some Python cv package.

The text placement process begins by preparing the image to accept text overlays and measuring the speech bubble's dimensions to determine space constraints.

Starting with default text settings for size, spacing, and wrapping, the system automatically scales down these properties if the content exceeds the available bubble space.

Finally, it centers the adjusted text within the bubble boundaries, ensuring translated text fits naturally within speech bubbles of any size.
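The fitting loop described above can be sketched roughly as follows. This is not the code from the project: the fixed-width character estimate (0.6 times the font size), the step of 2 between sizes, and the name fit_text are assumptions made so the example runs without any imaging library.

```python
import textwrap

def fit_text(text, box_w, box_h, font_size=24, min_size=8, line_gap=1.2):
    """Shrink the font until the wrapped text fits a box_w x box_h bubble.

    Assumes every character is about 0.6 * font_size pixels wide, which is
    only an approximation; a real renderer would measure the font instead.
    Returns (size, lines, (x, y)) or None if nothing fits.
    """
    size = font_size
    while size >= min_size:
        char_w = 0.6 * size
        chars_per_line = max(1, int(box_w // char_w))
        lines = textwrap.wrap(text, width=chars_per_line)
        text_h = len(lines) * size * line_gap
        widest = max((len(line) for line in lines), default=0) * char_w
        if text_h <= box_h and widest <= box_w:
            # Offsets that center the text block inside the bubble.
            x = (box_w - widest) / 2
            y = (box_h - text_h) / 2
            return size, lines, (x, y)
        size -= 2  # too big: try a smaller font and re-wrap
    return None

result = fit_text("I will become king of the pirates", box_w=120, box_h=60)
if result is not None:
    size, lines, offset = result
    print(size, lines, offset)
```

Re-wrapping on every size change matters, because a smaller font fits more characters per line and can reduce the number of lines as well as their height.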

Finally, the web app

I quickly built a web app that talks to the inference server by sending in the image and getting back the translated manga image.

This concludes the project. It would not have been possible without the immense research that goes into building these kinds of models and the people who make them easy to use.

Lastly, I would encourage you to look into the bounties as they have a pretty good prize pool for other tasks.

DEVLOG #3

Fri, 15 Nov 2024 00:00:00 GMT

I am back with a few updates on the projects I am working on.

TL;DR

ApnaHood UI enhancements
Maximal Studio website re-design

1. ApnaHood

Just to introduce it to new readers, ApnaHood is a map-based experience whose exact purpose is yet to be finalized. You can check it out at https://apnahood.maximalstudio.in/

Map Style Update

In the recent updates, we have changed the UI quite a bit. We now have an adventure map style and color scheme -

Apart from these, we are working on the quests board that will show users interesting challenges from various places. This is still a work in progress, so stay tuned for the next edition.

Popup has new features

Now you can click on a message and see how popular a place is, based on the votes, how many people have been there, etc.

You also have the option to show directions to that pin, which opens up in a Google Maps instance.

The UI of the Popup is made to match the overall map style. The share button is currently non-functional; when implemented, it will give you the option to share the pin with someone.

Minor fixes include fixing the authentication code exchange for the PKCE OAuth flow, removing unimplemented features like the leaderboard (coming soon), and responsiveness improvements.

2. Maximal Studio Website Re-design

If you don't know, I run a web development agency for building software of any kind. You can check it out here - https://www.maximalstudio.in/

The old agency website was a clone of an agency website template provided by a popular themes library - Lexington Themes.

We tried to use it as a starting point since the idea was to sell websites to local businesses. But now we are pivoting to making MVPs for people who want to go to market as quickly as possible.

To fit this theme we have changed our UI from this

to this

I hope to find more clients using this site. If you want a website or software built quickly, contact us through the website.

That's it for this devlog. See you in the next edition!!

DEVLOG #2

Thu, 07 Nov 2024 00:00:00 GMT

Since this is the first devlog of my currently active projects, it will be a long one. In this edition of devlog, we will be looking at the updates mainly from ApnaHood and my college major project.

1. ApnaHood

If you have not read the previous edition of this devlog, ApnaHood is a map-based app where you can leave anonymous messages that can be seen by other users.

User Feedback

One important thing happened between the last update and this one. We decided to get feedback from the users. This meant that we could not just do what we wanted to do, but would have to listen to the users.

Our hypothesis for the website was that users will use it for all sorts of things including - marking special and hidden spots, putting some secret messages, etc. But none of this happened.

Among other reasons, the most prominent one was that the UI was not intuitive. Users would land on the page and not instantly know what to do. We tried to bandage this situation by adding a pop-up which explains what actions you can take on the site, but this made us realize that the UX and the user journey were terrible.

Plans for Improvement

The idea is to address the user feedback and improve the UX of the site. The first thing we changed was to make the map 3D. This makes it much more interesting and appealing.

Apart from this change, we are planning to add game elements to the app and center the whole concept around them. Starting with "Quests", we will be adding limited-period quests that you can complete to earn points on the site, redeemable for real cash.

2. Major Project

So my team and I are building an agriculture-focused AI chatbot that farmers can use to resolve all their queries. After intensive research and talking to farmers and agriculture specialists, we came up with the following architecture -

Each module is there for a purpose. Which one to choose for a particular query will be the job of the reasoning engine which is just a fancy wrapper term for LLM function calling ;).

Chat Module

For now, I am using the OpenAI realtime API for the chat functionality which is surprisingly good at Indian languages. Pairing it up with open-whisper, I am able to get a real-time native language voice interaction.

Schemes Module

Apart from the chat module, I am building a RAG system for schemes and subsidies offered by the government for various types of crops. This information is often placed in obscure locations on inaccessible government and state websites.

I am using Supabase for this functionality. I have two edge functions - embed and search. embed converts the scheme text into embeddings and stores them in the database with the help of a Postgres extension - pgvector. search is an edge function that performs semantic search using cosine similarity and returns the top-k matching results for a given query.
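What search does can be sketched in a few lines of plain Python. This only illustrates cosine-similarity top-k retrieval and is not the actual edge function, which runs against the database; the tiny 3-dimensional vectors and the scheme names are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, rows, k=2):
    """Return the names of the k rows most similar to the query embedding."""
    scored = [(cosine_similarity(query_vec, emb), name) for name, emb in rows]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Hypothetical stored embeddings: (scheme name, embedding vector).
rows = [
    ("drip irrigation subsidy", [0.9, 0.1, 0.0]),
    ("crop insurance scheme", [0.1, 0.9, 0.1]),
    ("tractor loan scheme", [0.0, 0.2, 0.9]),
]

print(search([1.0, 0.2, 0.0], rows, k=2))
```

In a real setup the query would first be turned into an embedding by the same model used in embed, so that both vectors live in the same space.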

DEVLOG #1

Thu, 03 Oct 2024 00:00:00 GMT

I have always wanted to write technical blogs but I was lazy. But then I recently saw a few indie hackers publishing what they do on a weekly basis. I felt this writing format is something that I can try and hopefully stick to in the long run.

2. General Inference Engine

This is still in stealth mode since we are in the process of developing the pipeline. In the ideal case, once it is built, the project can be used for literally any type of media-based inference.

3. Stock Prediction using News

I know this might sound like a cliche project, but I still ended up learning quite a few interesting things.

The approach I took was to take a bunch of news articles (2018-2021) and perform two major extraction operations on them -

Sentiment Extraction - This is pretty much a standard approach. I did it using a fine-tuned model called FinBERT that is trained on financial news and can understand that lingo.
LLM based Feature Extraction - This was a rather interesting experiment. I did some prompt engineering and got the final prompt which was something like this -

"Analyze the following financial news headline: {news}."

    Based on this headline, extract the following structured information as a tuple, choosing each feature from its predefined categories:

    1. **Company**: Identify the primary company mentioned"
    2. **Event**: Identify the main event or action, choosing from: "Merger," "New Product," "Profit/Loss Announcement," "Partnership/Collaboration," "Policy Change."
    3. **Reason**: Identify the reason for the event, choosing from: "Market Demand," "Regulatory Requirement," "Internal Strategy," "External Competition," "Economic Conditions."
    4. **Verdict**: Assess the likely impact on stock, choosing from: "UP," "DOWN," "NEUTRAL."

    NOTE: give categories from the categories provided in the prompt. DO NOT EXPLAIN THE CATEGORIES or output. Also, try to reason out the answer based on the headline see if the news might have any impact on the stock price of the company mentioned in the headline. DO NOT OUTPUT ANYTHING ELSE OTHER THEN THE TUPLE. DO NOT SAY - " Here's the structured information extracted from the headlines:"

    **Return the information in tuple format** only and nothing else using the example format below:

    Example format:
    `("Company", "Event", "Reason", "Verdict")`

Using the free Gemini LLM API, I was able to get the features for 887 news articles. This helped in finding the best predictors of stock movement and of its magnitude.
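Since the prompt asks for a Python-style tuple, each reply can be parsed and checked against the allowed categories before it is used as a feature. A minimal sketch, assuming the reply is exactly one tuple literal (possibly wrapped in backticks); parse_features and the sample replies are hypothetical and not the code I actually ran.

```python
import ast

# Allowed categories, copied from the prompt.
EVENTS = {"Merger", "New Product", "Profit/Loss Announcement",
          "Partnership/Collaboration", "Policy Change"}
REASONS = {"Market Demand", "Regulatory Requirement", "Internal Strategy",
           "External Competition", "Economic Conditions"}
VERDICTS = {"UP", "DOWN", "NEUTRAL"}

def parse_features(reply):
    """Parse a reply like ("Company", "Event", "Reason", "Verdict").

    Returns a dict, or None when the reply is malformed or uses a category
    outside the lists given in the prompt.
    """
    try:
        # literal_eval only accepts Python literals, so it never runs code.
        value = ast.literal_eval(reply.strip().strip("`"))
    except (ValueError, SyntaxError):
        return None
    if not (isinstance(value, tuple) and len(value) == 4
            and all(isinstance(item, str) for item in value)):
        return None
    company, event, reason, verdict = value
    if event not in EVENTS or reason not in REASONS or verdict not in VERDICTS:
        return None
    return {"company": company, "event": event,
            "reason": reason, "verdict": verdict}

print(parse_features('("Acme", "Merger", "Internal Strategy", "UP")'))
print(parse_features("Here's the structured information"))
```

Rejected replies can simply be dropped or retried, which keeps stray explanations from the model out of the feature table.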

That's all for this devlog. Stay updated for newer editions!!

Programming is doomed

Fri, 10 Nov 2023 00:00:00 GMT

Introduction

Today I had an assignment in a Data Structures and Algorithms lab to implement integer multiplication using the following:

Recursion
Karatsuba
Bitwise efficient implementation

The internet was not allowed and I was not on my usual coding setup (VS Code). I couldn't complete the assignment in time. But I realized something about my way of programming.

My coding methodology

I use VS Code as my primary editor (with vim bindings, yet to become a hardcore vi enthusiast) along with GitHub Copilot, which is a code completion tool. I also have a lot of language extensions that assist me with awesome auto-completion.

Before 2021, when I started coding, I used to spend a lot of time going through Stack Overflow, GitHub issues and docs to figure out how to make the computer do what I needed.

After I started using Copilot, I stopped losing context while coding. I could focus more on what I really cared about, which was solving the problem I had. I never leave my editor to check syntax or to find a way to filter an array. Copilot just works. This made me think about what programming is all about.

What is programming?

Computer Science is all about -

Converting ideas into program through some intermediate steps like algorithms, flowcharts and eventually code. It is to write programs, see that they are understandable and maintainable by humans in future.

By my own assessment, I fall short on all of these aspects. I am bad at reading someone else's code (or my own code after 2 weeks), and maintaining my own code seems like a daunting task. Also, I try to write code that works and solves my current problem, and I rarely think about its understandability.

To go back: the goal of Computer Science is to convert ideas into programs that can be executed by a von Neumann machine. Notice that there is no mention of humans in this definition. Why should humans be the only ones to write programs?

We started with binary (arranging MOSFETs in a certain structure and passing the correct signals through it to get the sum of two 8-bit numbers, for example), then we built higher-level constructs like assembly, then Pascal, FORTRAN and C. Then we used these languages to build even higher-level languages like Python and JavaScript. If you observe, this is all so that humans can focus more on expressing ideas and spend less time coding up instructions for the computer.

If AI written code becomes the norm

For now, if we assume that AI written code is as good as an average developer's, sending it into production won't be an issue as long as it is thoroughly reviewed (which can also be assisted by another specialized model).

This might give rise to interesting behavioral changes in programming. For starters, all the design patterns that are currently abundant in most codebases (more so in languages like Java) are tailored to human understanding and easy readability. AI written code need not be restricted to such arbitrary patterns as long as the code works.

Just as we have gotten used to the probabilistic nature of LLMs, where no one yet completely understands their workings but their output can more or less be considered on par with that of most junior-level coders, we will soon get habituated to having code that we understand less and less while keeping a high-level understanding of what it does.

Is programming limited to computers?

OpenAI recently launched new features like PDF chats, a custom GPT builder that needs no code (you can just talk with a base model and create a custom GPT), etc. What really caught my eye was the new Assistants API, which can help you develop autonomous agents that interact with the world through external APIs and do things for you like booking your tickets, deciding your schedule, planning and executing your wedding, figuring out code deployment and a lot more.

This means that the process of programming and scripting will be completely abstracted away and the only thing that remains for humans to do is problem solving and prompting the ideas in an expressive way.

Wild thought?

Eventually CS as a field will become a niche like every other engineering field with limited applicability. Models will become so powerful that only a simple prompt will be enough for it to do anything you need to get done.

M# Hackathon Experience

Sat, 28 Oct 2023 00:00:00 GMT

When we got to know about the hackathon problem statements, we immediately started brainstorming about various PSs. We knew we had to pick a problem statement that complemented the diverse skill set of our team. We had two members for ML, two for web and one primarily for app (these roles are flexible, and they need to be, in order to adapt to the changing requirements of the project during the hackathon).

After going through all the problem statements, we finalized one related to detecting the onset of mental illness in a college target group. We came up with an elaborate plan and system design which included almost everything that could possibly be put into it (except blockchain, though at a later stage we got feedback asking why we had not integrated it).

Mindful AI System Design

We then proceeded to make a fabulous PPT with all the buzz words to make it feel innovative. This got us selected in the top 30 teams that qualified to the offline hackathon at MIT Manipal.

Day 1

I didn't do much on day one. Two members were working on the ML part of our hack, which needed a lot of big data pre-processing and a lot of training time since they were trying to build a custom ML model. The app guy was busy setting up auth for the application, and I, along with the other web member, was making a dashboard of sorts for the admin. I picked up off-the-shelf UI components (shadcn, tailwind) to quickly get something beautiful, as in hackathons like these it's all about how good your UI/UX is.

Day 2

Had to finish the Firebase auth integration and build the backend to handle video upload and pass the video through the ML models built by then. It was tricky to work with Flask as the backend since it was my first time, and using it was a requirement since the ML model dependencies are all in Python.

Day 3

This day as well, I didn't do much, as the majority of the work was done by my teammates. I would always advise you to pick good teammates who are accountable and whose vibe matches yours. Clashes are bound to happen, and this will help your team hold together.

I had a lot of fun joking around and playing TT and carrom. Sleep wasn't a priority, though the main reason we couldn't sleep might surprise you. The whole place was confined and had centralised AC. Every time we tried to sleep, someone would bring down the temperature. This left us sleepless for days, which is a common occurrence in hackathons.

Results

In the end, we managed to bag a consolation prize and a spot in the top nine teams with our cool hack.

Here's what we built, if you're interested - https://github.com/Shubham-Rasal/Mindful-AI

Database choice for toy blockchain in Go

Sat, 17 Jun 2023 00:00:00 GMT

This blog post is one of the many blog posts in the series of blog posts about my toy blockchain in Go. In order to understand Ethereum, I have decided to build one from scratch. Will use go-ethereum as a reference.

In order to provide state persistence, the Ethereum spec defines the use of a database to store the world state. For storing the world state of the blockchain (which is just a very large Merkle Patricia trie), a highly performant key-value store is required. There are many databases which can be used for this purpose.
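To show why a plain key-value store is enough for this, here is a minimal sketch of content-addressed storage, where each serialized trie node is stored under the hash of its own bytes. It is written in Python for brevity rather than Go. An in-memory dict stands in for the database, hashlib's SHA3-256 stands in for Ethereum's Keccak-256 (the two are not the same function), and NodeStore and the byte strings are hypothetical.

```python
import hashlib

class NodeStore:
    """Content-addressed key-value store: the key is the hash of the value."""

    def __init__(self):
        self._db = {}  # stand-in for a real key-value database

    def put(self, node_bytes):
        """Store a serialized node and return the key it was stored under."""
        key = hashlib.sha3_256(node_bytes).digest()
        self._db[key] = node_bytes
        return key

    def get(self, key):
        """Fetch a serialized node by its hash."""
        return self._db[key]

store = NodeStore()
leaf_key = store.put(b"account: balance=10")
# A parent node refers to its child by the child's hash.
root_key = store.put(b"branch:" + leaf_key)

assert store.get(root_key).endswith(leaf_key)
print(root_key.hex())
```

Because a parent embeds the hashes of its children, changing any leaf changes the root key, and the only operations the database has to support are fast puts and gets by key.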

In this blog post, I will be discussing the database choice for my toy blockchain in Go. I will be discussing the following databases:

BadgerDB

The database is built in Go and is optimized for SSDs. It is a key-value store and is very fast. It is also ACID compliant. It is also used by Dgraph.

There is no issue as such for me to not pick this database. But I want to explore other databases as well. Also, I think this db would be an overkill for my toy blockchain.

etcd

etcd is a distributed key-value store. It is built in Go and is used by cloud native software such as Kubernetes and Docker.

The main issue which prevented me from using this db was that its installation was not that simple. Some of the dependencies were only compiled for 32-bit systems, which caused problems.

LevelDB

Built in C++ and used by Google Chrome. It is also a key-value store. It is also very fast. It is also used by Ethereum.

The most popular implementation of Ethereum is go-ethereum, or geth. When I was reading the code of geth, I found that they use LevelDB, but since the codebase is in Go, they use a Go implementation of LevelDB - https://github.com/syndtr/goleveldb

Although this is an old implementation, when I tried it, all its features were working fine. I think I will be using this database for my toy blockchain.

Conclusion

Although I have decided to go ahead with LevelDB, I will be exploring other databases as well if the former ends up being a bottleneck. I will be writing a blog post about the implementation of LevelDB in my toy blockchain.

My first blog post

Fri, 16 Jun 2023 00:00:00 GMT

Today I am writing my first blog post. I am writing this blog post to share my experience of learning new things. Future blogs will mostly be related to the following topics:

Systems
Programming
Philosophy
My side projects

I am not sure if I will be able to keep up the consistency of writing blogs, but I will try my best to write at least one blog post every month.