Yash Raj Pandey - AI Agents Architect

AI Agents Architect at the University of Florida (IFAS), building local-first AI infrastructure: self-hosted open-weight LLMs, retrieval-augmented generation (RAG), vector search, and AI agents running in production. Joined UF as a Software Engineer (Mar 2025), promoted to Lead (Oct 2025), then Architect (Apr 2026).

Profile

Mail - yashpn62@gmail.com

Selected Work

Production systems, developer tools, and open-source work across the stack.

Stack: Python, TypeScript, Swift, C/C++, Django, React, Node.js, PostgreSQL, RAG, Qdrant, vLLM, Docker, Kubernetes, Terraform

Blue Omics - Full-stack research data platform

A Django, React, and PostgreSQL platform that grew from zero to 5M+ live records and became the primary system for an entire research lab.

Problem: A research lab ran its data on a sprawl of spreadsheets and manual workflows. Submitting, searching, and cross-referencing records was slow, error-prone, and impossible to scale across 30+ researchers and 5 labs.

Approach:

  • Designed and built Blue Omics from scratch: a React frontend on a Django REST backend with PostgreSQL, structured across 32 data models and 58 API endpoints.
  • Built 7 ingestion pipelines for heterogeneous formats (PDF, Excel, CSV, Word, PowerPoint), cutting manual data prep from hours to minutes.
  • Tuned PostgreSQL with 35 explicit indexes and caching to hold low-millisecond latency under concurrent access by 30+ users.
  • Deployed on GCP with Kubernetes and Terraform, Docker multi-stage builds, and CI/CD. Cut frontend load time from 8 s to 3 s.
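
To illustrate the ingestion idea above, here is a minimal sketch of routing files to format-specific parsers. It is not the Blue Omics code: the registry, the function names, and the column normalization are assumptions made for the example, and only CSV is implemented.

```python
import csv
import io
from pathlib import Path
from typing import Callable, Dict, List

Record = Dict[str, str]
Parser = Callable[[bytes], List[Record]]

# Hypothetical registry: file extension -> parser function.
PARSERS: Dict[str, Parser] = {}


def register(*extensions: str) -> Callable[[Parser], Parser]:
    """Decorator that registers a parser for one or more file extensions."""
    def wrap(fn: Parser) -> Parser:
        for ext in extensions:
            PARSERS[ext.lower()] = fn
        return fn
    return wrap


@register(".csv")
def parse_csv(raw: bytes) -> List[Record]:
    """Parse CSV bytes into records with trimmed, lower-cased column names."""
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8-sig")))
    return [
        {k.strip().lower(): (v or "").strip()
         for k, v in row.items() if k is not None}
        for row in reader
    ]


def ingest(filename: str, raw: bytes) -> List[Record]:
    """Route a file to its parser; unknown formats fail loudly."""
    ext = Path(filename).suffix.lower()
    if ext not in PARSERS:
        raise ValueError(f"no ingestion pipeline registered for {ext!r}")
    return PARSERS[ext](raw)


rows = ingest("traits.csv", b"Sample ID, Trait\nS-001, height\n")
print(rows)
```

Each additional format would register its own parser and emit the same record shape, so everything downstream of `ingest` stays format-agnostic.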

Stack: Django REST, React, TypeScript, PostgreSQL, GCP + Kubernetes, Terraform, Docker

Impact: Live records: 0 → 5M+; Trait-lookup latency: spreadsheet → ~25 ms; Frontend load time: 8 s → 3 s; Daily active users: baseline → +40%

Trade-offs: Chose a well-indexed PostgreSQL core over premature service-splitting to keep one clear backup and monitoring story. The platform replaced manual workflows entirely and became the system of record, which is what earned the promotion from Software Engineer to Lead.

Looma - Local-first project memory for coding agents

A command-line tool that turns Claude Code, Codex, and Cursor history into resumable project context, with zero third-party dependencies.

Problem: Coding-agent transcripts pile up fast, but the moment you switch projects the context is gone. Searching old sessions to remember what you were doing, what you decided, and what is left is slow and unreliable.

Approach:

  • Normalizes Claude Code, Codex, and Cursor history into vendor-agnostic events, then reconstructs structured WorkItems (features, bugfixes, refactors, migrations) instead of keyword-searching logs.
  • Emits token-budgeted context packs so one agent can hand off to another without replaying the whole history.
  • Built on the Python standard library only (SQLite + FTS5), with an optional local LLM extractor that inherits the same heuristic guardrails.
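
A token-budgeted context pack can be sketched as greedy selection under a budget. This is an illustration only, not Looma's implementation: the `priority` field, the four-characters-per-token estimate, and the function names are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkItem:
    title: str
    summary: str
    priority: int  # hypothetical field: higher means more relevant


def estimate_tokens(text: str) -> int:
    """Rough estimate (about 4 characters per token); not a real tokenizer."""
    return max(1, (len(text) + 3) // 4)


def build_context_pack(items: List[WorkItem], budget: int) -> List[WorkItem]:
    """Greedily keep the highest-priority items that fit in the token budget."""
    pack: List[WorkItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.priority, reverse=True):
        cost = estimate_tokens(item.title + "\n" + item.summary)
        if used + cost <= budget:
            pack.append(item)
            used += cost
    return pack


items = [
    WorkItem("Fix login bug", "x" * 40, priority=3),
    WorkItem("Migrate DB", "y" * 400, priority=2),
    WorkItem("Refactor CLI", "z" * 20, priority=1),
]
pack = build_context_pack(items, budget=30)
print([item.title for item in pack])
```

An item that would overflow the budget is skipped rather than truncated, so whatever the receiving agent gets is complete.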

Stack: Python (stdlib only), SQLite + FTS5, Local-first, Optional local LLM

Impact: Third-party deps: typical CLI → 0; Extraction F1 (clean fixtures): Qwen2.5-7B 0.84 → heuristic 0.86; Test suite: baseline → 131 passing

Trade-offs: Chose a transparent heuristic core over an LLM-by-default pipeline: it is auditable, runs anywhere with no keys, and on clean fixtures actually beat a 7B local model. Every reconstruction carries a confidence score and shows alternatives instead of guessing.

https://github.com/devYRPauli/looma

mddocs - Git-native collaborative Markdown, with an agent API

A local-first, self-hostable Markdown editor: real-time multiplayer, comments, and accept/reject suggestions, plus a first-class HTTP API for AI agents. Published on npm.

Problem: Teams want Google-Docs-style collaboration on Markdown without handing their content to a SaaS, and the AI agents that edit documents are usually bolted on as second-class clients with no real API.

Approach:

  • Built a git-native editor where every change is a commit, so there is no central database to run and the full history lives in the repo.
  • Real-time multiplayer, inline comments, and accept/reject suggestion review backed by a CRDT (Yjs) model that merges concurrent edits without conflicts.
  • Shipped a first-class agent HTTP API: per-agent tokens, rate-limit headers, and a Server-Sent Events stream, so automated writers are first-class collaborators.
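
Per-agent tokens and rate-limit headers can be sketched with a fixed-window limiter. This is not the mddocs implementation: the class, the fixed-window policy, and the `X-RateLimit-*` header names (a common convention, not a confirmed part of the mddocs API) are assumptions.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Window:
    start: float
    count: int = 0


class FixedWindowLimiter:
    """Per-token fixed-window limiter that also produces response headers."""

    def __init__(self, limit: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.windows: Dict[str, Window] = {}

    def check(self, agent_token: str) -> Tuple[bool, Dict[str, str]]:
        """Return (allowed, headers) for one request made with this token."""
        now = self.clock()
        win = self.windows.get(agent_token)
        if win is None or now - win.start >= self.window_s:
            # First request for this token, or the previous window expired.
            win = Window(start=now)
            self.windows[agent_token] = win
        allowed = win.count < self.limit
        if allowed:
            win.count += 1
        headers = {
            # Assumed header names; Reset is seconds until the window ends.
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(self.limit - win.count),
            "X-RateLimit-Reset": str(math.ceil(win.start + self.window_s - now)),
        }
        return allowed, headers


fake_now = [0.0]  # injected clock keeps the demo deterministic
limiter = FixedWindowLimiter(limit=2, window_s=60, clock=lambda: fake_now[0])
for _ in range(3):
    print(limiter.check("agent-a"))
```

Because each token gets its own window, one noisy agent cannot exhaust another agent's quota.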

Stack: TypeScript, Node.js, Yjs (CRDT), Git, Server-Sent Events

Trade-offs: Git-native storage trades a query-optimized database for transparency and zero-infra self-hosting: the repo is the source of truth and the backup. The agent API mirrors the human surface exactly, so anything a person can do, an agent can do through tokens and rate limits.

https://github.com/devYRPauli/mddocs

TurboQuant on Apple Silicon - CPU-only LLM quantization study

Independent evaluation of TurboQuant (arXiv 2504.19874) ported to run on Apple Silicon. Open source and reproducible.

Problem: TurboQuant is a near-optimal vector quantization method, applied here to compressing the LLM KV cache, but the reference path assumed dedicated GPU hardware. The open question: can it run, and hold long-context accuracy, on consumer Apple Silicon with no discrete GPU?

Approach:

  • Worked from a CPU-only fork on an M1 Pro (16GB) and fixed five implementation bugs that were blocking correct inference.
  • Ran a two-round study: an MLX path and a separate llama.cpp Metal path, each benchmarked on long-context needle-in-a-haystack retrieval.
  • Published the full evaluation, the bug fixes, and reproducible results as an open-source repository, with writeups on LinkedIn and X.
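
A needle-in-a-haystack check can be sketched as follows. This is not the harness from the repository: the filler text, the passphrase question, and the stub model are assumptions for the example, and the stub stands in for a real model call by only "seeing" the tail of the prompt.

```python
from typing import Callable, Iterable

MARKER = "The secret passphrase is "


def build_prompt(needle: str, filler: str, n_sentences: int, depth: float) -> str:
    """Place the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    sentences = [filler] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences) + "\n\nWhat is the secret passphrase?"


def retrieval_rate(model: Callable[[str], str], answer: str,
                   depths: Iterable[float], n_sentences: int = 200) -> float:
    """Fraction of needle depths at which the model returns the answer."""
    needle = f"{MARKER}{answer}."
    filler = "The sky was a pale grey that morning."
    depth_list = list(depths)
    hits = sum(
        answer in model(build_prompt(needle, filler, n_sentences, d))
        for d in depth_list
    )
    return hits / len(depth_list)


def make_stub_model(visible_chars: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a model that only sees the prompt's tail."""
    def model(prompt: str) -> str:
        window = prompt[-visible_chars:]
        idx = window.find(MARKER)
        if idx < 0:
            return "unknown"
        return window[idx + len(MARKER):].split(".")[0]
    return model


depths = [0.0, 0.25, 0.5, 0.75, 1.0]
print(retrieval_rate(make_stub_model(1_000_000), "mango-42", depths))
print(retrieval_rate(make_stub_model(2_000), "mango-42", depths))
```

Sweeping depth is the point: a system that has lost early context still finds needles near the end, so a single depth can hide the failure.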

Stack: MLX, llama.cpp (Metal), Apple Silicon (M1 Pro), Python

Impact: Needle retrieval @ 16K: 0% → 100%; KV cache memory: baseline → significantly reduced; Bugs fixed in fork: 5 blocking → 0

Trade-offs: A CPU-only target trades raw throughput for accessibility: the point was proving strong quantization and long-context accuracy are reachable on hardware anyone has on their desk, not winning a latency benchmark. Reflects how I approach AI infrastructure: take a research-grade method, get it actually running on accessible hardware, measure it honestly, and share it.

https://github.com/devYRPauli/turboquant-m1pro-evaluation

ApplyScore - AI resume gap-analysis extension

A published Chrome extension that scores how well a resume matches any job posting on the web, with evidence-linked gaps and no fluff.

Problem: Most AI resume tools hallucinate skills and rewrite bullets with confident fluff that recruiters see through instantly. The honest question, how well does this resume actually match this job, went unanswered.

Approach:

  • Built a universal scraper that reads job postings across LinkedIn, Greenhouse, Ashby, Lever, Workday and more, piercing Shadow DOM to work on virtually any board.
  • Runs a strict, evidence-based gap analysis: a confidence-weighted 0-100 fit score, requirement-by-requirement matches linked to the exact resume bullets that prove them, and a prioritized list of what is missing.
  • Privacy-first by design: the resume is cached locally and the user brings their own API key (OpenAI, Anthropic, or Google), so data and model choice stay fully in their control.
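
One way to compute a confidence-weighted fit score is sketched below. The formula, the field names, and the rule that a match without evidence counts as zero are assumptions for illustration, not ApplyScore's actual scoring.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequirementMatch:
    requirement: str
    weight: float            # importance of the requirement (hypothetical scale)
    match: float             # 0.0 (missing) to 1.0 (fully demonstrated)
    confidence: float        # 0.0 to 1.0: how sure the analysis is of `match`
    evidence: Optional[str]  # resume bullet that supports the match, if any


def effective_match(m: RequirementMatch) -> float:
    """A match with no supporting resume bullet is treated as no match."""
    return m.match if m.evidence else 0.0


def fit_score(matches: List[RequirementMatch]) -> int:
    """Confidence-weighted average of match strength, scaled to 0-100."""
    total = sum(m.weight * m.confidence for m in matches)
    if total == 0:
        return 0
    earned = sum(m.weight * m.confidence * effective_match(m) for m in matches)
    return round(100 * earned / total)


def prioritized_gaps(matches: List[RequirementMatch],
                     threshold: float = 0.5) -> List[str]:
    """Requirements matched below the threshold, most important first."""
    missing = [m for m in matches if effective_match(m) < threshold]
    missing.sort(key=lambda m: m.weight, reverse=True)
    return [m.requirement for m in missing]


matches = [
    RequirementMatch("Python", 3, 1.0, 1.0, "Built 7 ingestion pipelines"),
    RequirementMatch("Kubernetes", 2, 0.5, 0.5, "Deployed on GCP with Kubernetes"),
    RequirementMatch("Go", 1, 0.0, 1.0, None),
]
print(fit_score(matches), prioritized_gaps(matches))
```

Weighting by confidence means an uncertain judgment moves the score less than a certain one, and tying every match to evidence keeps unsupported skills from inflating it.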

Stack: JavaScript, Chrome Extension APIs, Shadow DOM scraping, LLM APIs (BYO-key)

Trade-offs: Deliberately a gap analyzer, not a rewriter. Suggesting only 1-2 targeted, non-hallucinated bullets keeps it honest; the BYO-key model trades one-click convenience for the user keeping full control of their data and cost.

https://chromewebstore.google.com/detail/applyscore/ibecekikdjelajpnjnmapejhahgcplim

About

5M+ - Records in production

3 - Roles in 13 months

The Journey

  • 2019 - BTech begins: Computer Science at Jaypee University of Engineering and Technology.
  • 2022 - First production app: As an SWE intern at Hackdev, shipped a Flutter legal-tech app to production.
  • 2023 - Exchange to UF: Final undergrad semester at UF as an exchange student, which led to MS admission.
  • 2025 - Blue Omics: Joined UF IFAS, built a 5M+ record platform, promoted to Lead.
  • 2026 - AI Agents Architect: Proposed and now lead a local-first AI systems function.

Off the clock - Football Hub. A live football stats app I built because just watching the game was never quite enough. (https://football-hub-six.vercel.app/)

Builder Tools

Free, client-side tools with no signup. Everything runs entirely in your browser, so your data never leaves it.

Token Counter - Cost across frontier models, side by side

Prompt Formatter - Restructure raw prompts into blocks

JSON to Schema - Generate Pydantic / Zod / TypeScript

Regex Playground - Test, explain, and match in real time

cURL Converter - cURL to fetch / Python requests / httpx

Contrast Checker - WCAG AA/AAA with live preview

Playbooks

3 - Battle-tested plays

Self-Hosting Open-Weight LLMs - Run capable models locally without sending data to a cloud API

There is a whole class of work where you cannot send the data to a cloud API: confidential records, regulated environments, anything air-gapped. The good news is that open-weight models have gotten good enough that you do not have to. Here is how I think about running them locally.

  • Pick the model to fit the hardware, not the other way around
  • Choose a serving layer on purpose
  • Watch the context window: it is where performance goes to die
  • Keep the application layer hardware-agnostic
  • Measure before you trust
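
Fitting the model to the hardware and watching the context window both come down to memory arithmetic. The sketch below uses the standard estimates (weights are parameters times bits per weight; the KV cache holds keys and values for every layer and position). The model shape and the overhead figure are hypothetical example values, not a specific model.

```python
GIB = 2 ** 30


def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone at a given quantization level."""
    return n_params * bits_per_weight / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_value: int = 2) -> float:
    """KV cache for one sequence: keys and values (2x) per layer and position."""
    values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return values * bytes_per_value / GIB


def fits(budget_gib: float, weights: float, kv: float,
         overhead_gib: float = 1.5) -> bool:
    """Overhead is an assumed allowance for activations and the runtime."""
    return weights + kv + overhead_gib <= budget_gib


# Hypothetical 7B-parameter model, 4-bit weights, fp16 KV cache, 16K context.
w = weights_gib(7e9, 4)
kv = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128, context_len=16_384)
print(f"weights {w:.2f} GiB, kv cache {kv:.2f} GiB, fits in 16 GiB: {fits(16, w, kv)}")
```

The KV cache term grows linearly with context length, which is why a model that fits comfortably at short context can run out of memory at long context.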

RAG That Holds Up in Production - Retrieval, reranking, and the evals that keep it honest

Most RAG demos look great and most RAG systems quietly disappoint, because the demo never stressed retrieval. The model is rarely the bottleneck. The retrieval and the chunking are.

  • Garbage chunks, garbage answers
  • Hybrid retrieval beats pure vector
  • Rerank, but watch the dilution
  • Cite or it did not happen
  • Build the eval before you optimize
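
One common way to combine keyword and vector results is reciprocal rank fusion (RRF), sketched here. Choosing RRF, the constant k = 60, and the document IDs are assumptions for the example; the playbook itself does not name a fusion method.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["d3", "d1", "d7"]  # e.g. from a BM25 / full-text index
vector_hits = ["d1", "d5", "d3"]   # e.g. from a vector store
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

RRF uses only ranks, so it needs no score normalization between the two retrievers, and documents that both retrievers surface rise to the top.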

Evaluation-Gated Releases for LLM Systems - Stop shipping regressions you cannot see

LLM systems fail differently from normal software. A change can improve five cases and silently break three, and nothing throws an error. The only defense is a gate: no change ships unless it clears a measured bar.

  • Freeze a benchmark
  • Freeze the judge too
  • Know your noise floor
  • Set tiers before you look at results
  • A regression is a reason to stop
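
A release gate along these lines can be sketched in a few lines. The decision names, the two-standard-deviation margin, and the tier structure are assumptions for illustration; the important part is that floors and margin are fixed before any candidate results are seen.

```python
from statistics import mean, stdev
from typing import Dict, List


def gate(baseline_runs: List[float], candidate_runs: List[float],
         tier_scores: Dict[str, float], tier_floors: Dict[str, float],
         k: float = 2.0) -> str:
    """Return 'ship', 'hold', or 'block' for a candidate change.

    baseline_runs: repeated scores of the current system on the frozen
    benchmark (at least two runs, so the noise floor can be estimated).
    """
    # Tier floors were set in advance; missing one blocks the release.
    for tier, floor in tier_floors.items():
        if tier_scores.get(tier, 0.0) < floor:
            return "block"
    margin = k * stdev(baseline_runs)  # noise floor, scaled by k
    delta = mean(candidate_runs) - mean(baseline_runs)
    if delta < -margin:
        return "block"  # a regression larger than noise
    if delta > margin:
        return "ship"   # an improvement larger than noise
    return "hold"       # indistinguishable from noise


baseline = [0.80, 0.82, 0.81]
floors = {"critical": 0.95}
print(gate(baseline, [0.85, 0.86, 0.84], {"critical": 0.97}, floors))
print(gate(baseline, [0.82, 0.81, 0.82], {"critical": 0.97}, floors))
```

A change that lands inside the noise band is neither shipped nor rejected on the numbers alone; it needs more runs or a larger benchmark.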

Contact

LinkedIn - /in/yashrajpandeyy

GitHub - devYRPauli

Gainesville, FL - Eastern Time / UTC-5. University of Florida / IFAS.

Open to Conversations - AI infra / Full-stack / Systems. Always up for a good conversation on AI infrastructure, systems, and building things that ship.