San Francisco Bay Area
7K followers
500+ connections
Experience & Education
-
Meta
******* ******** ********
-
***** *****
****** **** *********
-
****** **
******* **** *******
-
********** ***********
****** ** ******* * **
-
-
******** ********** ******
******** ** ******* * **
-
Licenses & Certifications
Volunteer Experience
Honors & Awards
-
Special Prize for #1 Classification Model at McKinsey Datathon 2018
McKinsey
Languages
-
German
Native or bilingual proficiency
-
English
Full professional proficiency
Explore more posts
-
James Rosenthal
6K followers
Serving LLMs on TPUs just got way easier! A completely reimagined vLLM TPU backend for LLM inference 👉 for PyTorch and JAX developers, this means more flexibility to run PyTorch model definitions performantly on TPU without any additional code changes, while also extending native support to JAX. Read: https://lnkd.in/gW7476Hs #GoogleCloud #TPUs #LLMs
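For context, a minimal sketch of offline inference with vLLM's Python API; the model id and sampling settings are illustrative examples, and the TPU backend is selected by the vLLM installation, not by this code.

```python
# Minimal vLLM offline-inference sketch (model id is an example choice).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any HF model id
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain TPUs in one paragraph."], params)
for out in outputs:
    print(out.outputs[0].text)
```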
14
-
Cameron R. Wolfe, Ph.D.
Netflix • 24K followers
The complexity of PPO leads practitioners to avoid online RL in favor of RL-free or offline algorithms (e.g., DPO), but why not just use simpler versions of online RL?

TL;DR: REINFORCE and RLOO have been shown to work well for training LLMs. And they do not require a value model, which drastically reduces cost and complexity. Therefore, instead of completely avoiding online RL due to the overhead associated with PPO, we can just use algorithms that provide the benefits of online RL without the unnecessary complexity.

Policy gradient: RL optimizers operate by computing the gradient of the RL objective (maximizing rewards) w.r.t. policy parameters—this is called the policy gradient—and performing gradient ascent. A basic policy gradient is easy to compute (see figure) but has high variance. So, we introduce some extra complexity to get stable gradients / training.

Actor-critic: We can reduce variance by incorporating the advantage function (instead of the raw reward for a trajectory) into the policy gradient estimate. When we do this, the value function acts as a normalizer / baseline for the reward estimate of a state. PPO estimates advantage with an additional copy of the LLM—called a value model or critic—that is trained via an MSE loss to predict the final reward for every token in a sequence. The critic is on-policy (i.e., it depends on the current parameters of the policy during training) and is trained alongside the policy itself during RL. This is called an actor-critic framework.

REINFORCE: Because actor-critic frameworks keep an extra copy of the LLM in memory and train it, they come with additional cost and complexity. As an alternative, we can use the much simpler REINFORCE algorithm, which estimates the value function via an average of rewards (either a moving average of all rewards or an average of rewards in a batch) throughout training. Estimating the value in this way is low cost and removes the need for a critic.

REINFORCE does yield higher variance relative to actor-critic methods, but recent work has shown that this does not matter in the context of LLMs. Algorithms like PPO were developed in a prior generation of DeepRL research, where neural networks were primarily trained from scratch. In the LLM domain, we are finetuning pretrained models with a strong prior—unstable gradients are much less of a worry. For this reason, even simple online RL algorithms like REINFORCE work well for training LLMs despite their higher variance.

RLOO: To reduce variance relative to REINFORCE, we can also use REINFORCE leave-one-out (RLOO). This algorithm samples multiple completions for every prompt and estimates the value function for each completion as the average reward of the other completions to the same prompt—excluding the completion whose value is being estimated (a minimal sketch follows below).
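A minimal sketch of the RLOO advantage estimate described above: for k sampled completions per prompt, each completion's baseline is the mean reward of the other k-1 completions. Pure illustration, not any particular library's implementation.

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: shape (k,), scalar reward for each of k completions
    of the same prompt. Returns leave-one-out advantages, shape (k,)."""
    k = rewards.numel()
    total = rewards.sum()
    baseline = (total - rewards) / (k - 1)  # mean reward of the other k-1 completions
    return rewards - baseline

advs = rloo_advantages(torch.tensor([1.0, 0.0, 0.5, 0.5]))
print(advs)  # tensor([ 0.6667, -0.6667,  0.0000,  0.0000])
```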
230
6 Comments -
DataNeuron
11K followers
“Looks good to me.” That used to pass as validation for LLMs, a mere surface-level sanity check on a handful of prompts. Eyeball tests can’t quantify robustness, expose failure modes, or ensure reliability at scale. This is where LLM evaluation frameworks come in.
✅ Structured benchmarks
✅ Automated pipelines
✅ Evidence-driven reporting
Think of Evals as a model QA system: generating precision/recall curves for classification-like tasks, calibration metrics for probabilistic outputs, and systematic red-teaming for generative edge cases. In our latest piece, we break down why moving beyond heuristic “looks good to me” validation is critical for trustworthy, production-grade LLMs and how to implement scalable evals that highlight both strengths and failure modes.
📌 Blog link in comments.
#LLMEvaluation #ModelBenchmarking #GenAI #DataNeuron
Bharath Rao Rohit Adlakha Prakash Baskaran
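To make the "precision/recall curves for classification-like tasks" idea concrete, a minimal sketch scoring LLM verdicts against gold labels; the labels and scores here are toy placeholders, not any framework's output.

```python
# Toy precision/recall curve for a classification-style LLM eval.
from sklearn.metrics import precision_recall_curve, auc

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # gold labels for 8 test cases
y_score = [0.9, 0.2, 0.7, 0.4, 0.35, 0.1, 0.8, 0.6]  # model confidence per case

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(f"PR-AUC: {auc(recall, precision):.3f}")
```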
8
1 Comment -
Enemary Agbo
Freelance • 1K followers
Fine-Tuning: Expensive, but Worth It (If You Do It Right)

Everyone talks about RAG and fine-tuning as the magic bullets for LLMs. And they are—they’re highly efficient at specializing a model. However, they can also be incredibly expensive in terms of compute and time. Before you start, you need a strategy.

Step zero is always Evaluation. You cannot fix what you cannot measure. Run your base model against your specific use cases first to see exactly where it fails. Once you know why you need to improve the model, you can pick the right weapon. Here is a simple breakdown of the four main approaches:

1. Full Fine-Tuning
The "Total Renovation" Approach. This involves retraining all the parameters of the model on your new dataset.
How it works: You take a pre-trained model and update every single weight based on your data.
Pros: Maximum performance and behavior change. The model deeply learns the new task.
Cons: Extremely computationally expensive and requires massive storage. It is also prone to "catastrophic forgetting" (forgetting what it learned previously).

2. PEFT (Parameter-Efficient Fine-Tuning)
The "Sticky Note" Approach. Instead of retraining the whole brain, you freeze the main model and only train a tiny layer of adapters (like LoRA) on top of it.
How it works: You are essentially adding a small, learnable layer of "knowledge" while keeping the vast majority of the model static (a minimal LoRA sketch follows below).
Pros: Drastically cheaper (runs on consumer hardware), faster, and modular.
Cons: Might not capture complex nuances as deeply as a full fine-tune, but usually sufficient for 90% of use cases.

3. RLHF (Reinforcement Learning from Human Feedback)
The "Vibe Check" Approach. This isn't just about feeding data; it's about aligning the model with human preferences.
How it works: Humans rate model outputs (A is better than B). A reward model learns these preferences and trains the LLM to maximize that reward.
Best for: Making models safer, more helpful, and conversational. It captures subjective quality (tone, style, safety).

4. RLVR (Reinforcement Learning from Verifiable Rewards)
The "Fact Checker" Approach. A newer, powerful technique for reasoning tasks where there is a clear right or wrong answer.
How it works: Instead of relying on a human to say "I like this," the system rewards the model based on a verifiable outcome—like whether the code compiles or the math equation is solved correctly.
Best for: Coding, math, logic puzzles, and scientific data where objective truth matters more than style.

The Bottom Line: Don't just "fine-tune." Use PEFT for domain knowledge on a budget. Use Full Fine-Tuning if you are building a foundation for a totally new language or radical task. Use RLHF to make it chatty and safe. Use RLVR if you need it to solve hard logic and code problems.

Which one are you trying out, or planning to try? Which is more likely for your use case? I would love to know your thoughts in the comments.
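A minimal sketch of the PEFT/LoRA approach from point 2, using Hugging Face's peft library; the model id, rank, and target modules are illustrative choices, not a recommendation.

```python
# LoRA adapters on a frozen base model (all hyperparameters are examples).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
config = LoraConfig(
    r=8,                                   # low-rank adapter dimension
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)       # base weights stay frozen
model.print_trainable_parameters()         # typically well under 1% of total params
```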
-
Aishwarya Srinivasan
628K followers
Most people evaluate LLMs by benchmarks alone. But in production, the real question is: how well do they perform?

When you’re running inference at scale, these are the 3 performance metrics that matter most:

1️⃣ Latency
How fast does the model respond after receiving a prompt? There are two kinds to care about:
→ First-token latency: time to start generating a response
→ End-to-end latency: time to generate the full response
Latency directly impacts UX for chat, speed for agentic workflows, and runtime cost for batch jobs. Even small delays add up fast at scale (a measurement sketch follows below).

2️⃣ Context Window
How much information can the model remember, both from the prompt and prior turns? This affects long-form summarization, RAG, and agent memory. Models range from:
→ GPT-3.5 / LLaMA 2: 4k–8k tokens
→ GPT-4 / Claude 2: 32k–200k tokens
→ GPT-OSS-120B: 131k tokens
Larger context enables richer workflows but comes with tradeoffs: slower inference and higher compute cost. Use compression techniques like attention sinks or sliding windows to get more out of your context window.

3️⃣ Throughput
How many tokens or requests can the model handle per second? This is key when you’re serving thousands of requests or processing large document batches. Higher throughput = faster completion and lower cost.

How to optimize based on your use case:
→ Real-time chat or tool use → prioritize low latency
→ Long documents or RAG → prioritize a large context window
→ Agentic workflows → find a balance between latency and context
→ Async or high-volume processing → prioritize high throughput

My 2 cents 🤌
→ Choose in-region, lightweight models for lower latency
→ Use 32k+ context models only when necessary
→ Mix long-context models with fast first-token latency for agents
→ Optimize batch size and decoding strategy to maximize throughput

Don’t just pick a model based on benchmarks. Pick the right tradeoffs for your workload.

〰️〰️〰️
Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack to find more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg
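A minimal sketch of measuring the two latency metrics above, first-token latency (TTFT) and end-to-end latency, with a streaming chat client. It uses the OpenAI Python SDK as an example; any streaming API works the same way, and the model id is a placeholder.

```python
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
first_token_at = None

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # example model id
    messages=[{"role": "user", "content": "Summarize RAG in two sentences."}],
    stream=True,
)
for chunk in stream:
    # record the moment the first content token arrives
    if chunk.choices and chunk.choices[0].delta.content and first_token_at is None:
        first_token_at = time.perf_counter()

end = time.perf_counter()
print(f"first-token latency: {first_token_at - start:.2f}s")
print(f"end-to-end latency:  {end - start:.2f}s")
```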
357
61 Comments -
Felix Zhou
Phia • 2K followers
Took some time to refresh my knowledge on the latest large-scale recommendation modeling. Takeaways:

• 📈 Scaling the classical deep-wide ranking models: longer user sequences + richer feature interactions under tight latency. ByteDance’s RankMixer and LONGER papers showcase masterful architecture design.

• 🤖 Generative recommendations: moving beyond next-token prediction to next-action prediction.
🔁 “Actions Speak Louder than Words” (Meta) — one model for both retrieval and ranking.
📄 “A Generative Re-ranking Model for List-level Multi-objective Optimization” (Taobao) — predicts engagement for an entire list; full-page optimization with some greedy tricks.
🎯 “Recommender Systems with Generative Retrieval” (Google) — a “true generative” approach using semantic IDs (powerful, though tricky for fresh content) to generate the content directly.

The real bottleneck today isn’t model architecture anymore — it’s training and serving at scale. GPT-style pipelines (OpenAI, Gemini, Claude) are setting the benchmark, and RecSys engineers are rapidly adopting similar techniques. It won’t be surprising to see MMoE + post-training ideas show up in generative recsys soon.

Just as GPT lowered the barrier for NLP, transformers will lower the barrier for next-gen recommendation systems. As GPUs get cheaper and more powerful, more companies will surely adopt them. 🚀
97
4 Comments -
Saizen Acuity
355 followers
Ever wondered how LLMs could truly remember *everything* without exhausting memory?

Large Language Models have long hit a wall with “infinite context,” as the Key-Value (KV) cache demands prohibitive GPU memory for increasingly long inputs. This bottleneck severely limits their ability to process vast amounts of information efficiently. 🧠

Now, a breakthrough called Infini-attention allows LLMs to process “infinite” context using *fixed memory*. The mechanism divides context into local segments plus a global, summarized memory matrix, efficiently updated with a “delta rule” to prevent redundancy and optimize information retention (a rough sketch follows below). ✨

The results are striking: Infini-attention achieves 114x memory compression while matching state-of-the-art performance. It successfully retrieves hidden “passkeys” across one million tokens and sets new benchmarks for summarizing texts up to 500,000 tokens. 🚀

**Comment "INFINITEAI" to get the full article**
Learn more about Infini-attention and how LLMs achieve infinite context with finite memory: https://lnkd.in/gQQmtBnF

Ready to see where your business stands in the rapidly evolving world of AI? Take our quick evaluation to benchmark your AI readiness and unlock your potential! https://lnkd.in/g_dbMPqx

#LLMs #AIResearch #InfiniAttention #MemoryManagement #TechBreakthrough #SaizenAcuity
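A rough sketch of the compressive-memory delta rule described above, following the shapes and ELU+1 nonlinearity of the Infini-attention paper (Munkhdalai et al., 2024); this is an illustration under those assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def sigma(x):
    return F.elu(x) + 1.0  # ELU + 1 keeps activations positive

def update_memory(M, z, K, V):
    """M: (d_k, d_v) memory matrix, z: (d_k,) normalizer,
    K: (seq, d_k), V: (seq, d_v) for the current segment."""
    sK = sigma(K)
    retrieved = (sK @ M) / (sK @ z).unsqueeze(-1)  # what memory already stores
    M = M + sK.T @ (V - retrieved)                 # delta rule: add only the new part
    z = z + sK.sum(dim=0)
    return M, z

def read_memory(M, z, Q):
    sQ = sigma(Q)
    return (sQ @ M) / (sQ @ z).unsqueeze(-1)       # memory contribution to attention
```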
3
3 Comments -
Roshan Shetty
InCommon • 6K followers
Why the Absolute Zero paper should be on your radar Most people still assume better LLMs need better data. This new approach skips datasets, supervision, and labels entirely. It works through runtime feedback alone — and still beats strong baselines. That shift has direct implications for how engineers train and evaluate models. #llms
17
2 Comments -
Bunty Shah
MSCI Inc. • 4K followers
🚀 Rethinking Agentic RL for LLMs: Driving Real-World Autonomy

As a GenAI Architect, I see the industry moving rapidly from static, single-shot AI models to agents capable of multi-turn, autonomous problem solving. But what does it *actually* take to turn large language models (LLMs) into robust, multi-turn agents in open-ended environments?

📄 "A Practitioner's Guide to Multi-Turn Agentic Reinforcement Learning" offers actionable clarity. The authors dissect the chaos of current approaches and propose a practical framework grounded in real experimentation, not just theory.

✨ At its core, their recipe consists of:
- 🏗️ Environment: building up agent skills through progressively complex scenarios (think: spatial, object, and solution scale). Training on easier tasks still yields transferable skills—an insight that should shape our context engineering strategies.
- 🎯 Reward: structuring feedback matters. Dense milestone rewards accelerate training, but only if well-aligned with your RL algorithm. Too sparse, and learning stagnates; too frequent, and signals can get noisy (a toy sketch follows below).
- 🧩 Policy: smart initialization with demonstration-based priors (SFT) paired with online RL makes agents sample-efficient and more generalizable.

The headline? It's the multi-turn design—not just algorithm choice—that really drives success.

🧪 Their benchmarks across TextWorld, ALFWorld, and SWE-Gym show LLMs growing beyond scripted responses toward true agentic behavior—adapting, generalizing, and operating in the wild.

🔍 For GenAI leaders, the biggest friction isn't always about model size—it's about crafting environments, rewards, and policies that unlock real autonomy. This work brings much-needed rigor and recipe-driven guidance, cutting through the hype with insights we can act on.

🤔 Which pillar do you find most challenging in agentic AI development: environment complexity, reward structure, or policy optimization?

#AIResearch #GenAI #LLM #MachineLearning #DeepLearning #AIEngineering #ReinforcementLearning #AgenticAI
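A toy sketch of the "dense milestone rewards" idea from the Reward pillar: instead of one sparse end-of-episode reward, the agent earns partial credit for each milestone it reaches. The milestones and weights here are invented for illustration.

```python
# Invented milestone names and weights, purely illustrative.
MILESTONES = {
    "found_target_room": 0.2,
    "picked_up_object":  0.3,
    "used_object":       0.2,
    "task_complete":     1.0,
}

def milestone_reward(events_this_turn: set[str], already_hit: set[str]) -> float:
    """Reward only newly reached milestones so repeats are not re-rewarded."""
    reward = 0.0
    for name in events_this_turn - already_hit:
        reward += MILESTONES.get(name, 0.0)
    already_hit |= events_this_turn
    return reward

hit: set[str] = set()
print(milestone_reward({"found_target_room"}, hit))  # 0.2
print(milestone_reward({"found_target_room"}, hit))  # 0.0, already rewarded
```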
8
-
Andrei Lopatenko
Govini • 26K followers
What has always interested me about search systems is that they are naturally hybrid. At real scale, no serious search engine relies on a single idea or a single data structure.

You may have an inverted index serving 20 billion documents alongside a forward index covering another 100 billion. An inverted index on its own already contains multiple algorithmic substructures designed to achieve very fast retrieval while keeping the memory footprint small. In another search engine, text is handled through token-based inverted indexes, while geographic information is indexed using spatial structures such as quadtrees.

Ranking follows the same logic. A pointwise learning-to-rank model usually does the heavy lifting, often as a multi-stage pipeline on its own, with a listwise model applied on top for final reranking.

This has been true throughout the entire history of search engine development. Search engines have always combined multiple paradigms, representations, and ranking strategies, simply because the problem of search is inherently complex. From that perspective, today’s combination of embeddings and BM25 is not a new trend at all. It is a natural continuation of how search systems have always evolved.
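A minimal sketch of the BM25-plus-embeddings hybrid the post describes, fusing the two rankings with reciprocal rank fusion; the corpus, encoder model, and k constant are illustrative stand-ins.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

docs = ["quadtrees index spatial data", "BM25 ranks by term statistics",
        "embeddings capture semantic similarity"]
bm25 = BM25Okapi([d.split() for d in docs])
enc = SentenceTransformer("all-MiniLM-L6-v2")       # example encoder
doc_vecs = enc.encode(docs, normalize_embeddings=True)

def hybrid_rank(query: str, k: int = 60) -> list[int]:
    lex = np.argsort(-bm25.get_scores(query.split()))                         # lexical ranking
    sem = np.argsort(-(doc_vecs @ enc.encode(query, normalize_embeddings=True)))  # semantic ranking
    rrf = {i: 0.0 for i in range(len(docs))}
    for rank_list in (lex, sem):
        for r, i in enumerate(rank_list):
            rrf[i] += 1.0 / (k + r + 1)              # reciprocal rank fusion
    return sorted(rrf, key=rrf.get, reverse=True)

print(hybrid_rank("semantic search with embeddings"))
```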
15
-
David Salinas
ELLIS Institute Tübingen • 2K followers
The open versus closed discussion often resurfaces in LLMs. Have open models caught up? Is the gap widening?

I want to share this analysis based on data from LMArena, which shows:
1. the Elo rating of the best model in each category over time,
2. the probability of the best closed model winning against the best open model,
3. the number of battles sent to proprietary and open-weights models.

I think the winning probability of the best closed vs. the best open model is particularly interesting to analyse. It reached several all-time highs around 65% and all-time lows around 55% when Llama-3-70B, DeepSeek-v3 and Llama-4 were introduced. The gap was lowest when DeepSeek-r1 was introduced, with an Elo gap of only 22 points and a probability of losing against the best commercial model of only 53% 💪. I think it illustrates the struggle of open models to remain competitive despite having far fewer GPUs and researchers (the Elo-to-probability conversion is sketched below).

The last plot illustrates another big challenge: the number of battles flowing toward proprietary models is very large. As noted in the "The Leaderboard Illusion" paper, the data flowing toward proprietary models can be used to further improve those models, since only small subsets of LMArena data are available to the public. This data is not available to open models.

If this is something that bothers you, consider using Compar:IA—it has all its code and data available 🥰, already >100K battles that all the community can leverage! 💸

PS: the (vibe-)code is shared in the comments, feel free to share feedback and reuse
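A quick sketch of the standard Elo-to-win-probability conversion behind the numbers above: a 22-point gap maps to roughly a 53% win probability, and ~65% corresponds to a gap of about 108 points.

```python
def elo_win_prob(rating_diff: float) -> float:
    """Probability that the higher-rated player wins, given the rating gap."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400))

print(f"{elo_win_prob(22):.3f}")    # ~0.532, the DeepSeek-r1 low point
print(f"{elo_win_prob(108):.3f}")   # ~0.651, near the all-time highs
```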
40
4 Comments -
Cooper N.
BlueCrew • 4K followers
Resume parsing is the foundation of every ATS filtering tool. So I looked at the benchmarks. The results aren't great.

An EMNLP 2025 study from November called ResumeBench tested 24 LLMs on structured resume extraction: 2,500 resumes, 50 templates, 5 languages, real JSON-schema scoring (the models are a bit out of date):
→ GPT-4o struggles with multi-column layouts and cross-lingual structure
→ Code-specialized models actually perform worse than generalists
→ JSON mode helps schema compliance but doesn't fix semantic errors
→ Smaller models collapse nested job histories: merging roles, dropping bullets
→ Reasoning models performed worse than their base counterparts
→ Most of the LLM pipelines use py-text, which can have issues with PDF resume formats or image-based resumes

If the parsing is unreliable, every decision on matching, ranking, and screening inherits that unreliability. And it can now become a legal liability after the recent Eightfold class-action lawsuit (using AI to covertly score and rank candidates: scraping social media, location data, and browsing activity without disclosure).

I'm also watching Google's new open-source tool LangExtract closely. Different approach: every piece of extracted data maps back to its exact location in the original document, with a visual verification layer. That kind of traceability matters a lot more now, especially for resumes (a schema-validation sketch follows below).

At Classet, we've taken a different stance on this entirely. We treat the resume as optional supplemental material rather than the source of truth. Our AI-powered phone interviews let candidates give context that a resume never captures: why they left a role, what tools they actually used daily, whether they're open to relocation. The conversation is the core data.

On the parsing side, we don't trust any single approach. We blend multiple AI models with traditional parsing engines like Senseloaf to make sure we're not missing structured data. If one model drops a certification or merges two jobs together, another catches it.

Resume parsing is not a solved problem. Stop making resume filtering the entire decision.
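A minimal sketch of the schema-first checking the post argues for: validate LLM-parsed resume JSON against a strict schema so silently merged or dropped fields fail loudly. The field names are illustrative, not ResumeBench's schema.

```python
from pydantic import BaseModel, ValidationError

class Job(BaseModel):
    title: str
    company: str
    start_year: int
    end_year: int | None = None   # None means current role

class Resume(BaseModel):
    name: str
    jobs: list[Job]

raw = {"name": "A. Candidate",
       "jobs": [{"title": "ML Engineer", "company": "Acme", "start_year": 2021}]}

try:
    resume = Resume.model_validate(raw)   # pydantic v2 API
    print(resume.jobs[0].title)
except ValidationError as e:
    print("parse failed:", e)             # a dropped field surfaces here, not downstream
```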
13
6 Comments -
Shubham Vora
Nutsovertech • 22K followers
💡 Now, it’s not hard to learn RAG.

If you’ve been confused about how Retrieval-Augmented Generation (RAG) actually works, this free guide makes it crystal clear — from basics to production. Here’s what’s inside 👇

✅ Chapter 1 – Introduction to LLMs & RAG Systems
Get a quick overview of how RAG fits into modern AI and why it matters.

✅ Chapter 2 – Challenges in Building RAG Systems
Understand the real pain points teams face while designing enterprise RAG pipelines.

✅ Chapter 3 – Reduce Hallucinations
Learn simple prompting tricks to keep your RAG responses factual.

✅ Chapter 4 – The Deep Dive
- Chunking techniques for better retrieval (a toy chunker is sketched below)
- How to pick the right embedding model and vector database
- Reranking methods
- Step-by-step guide to building an enterprise RAG system

✅ Chapter 5 – Pre-Production
8 test scenarios to validate your RAG setup before going live.

✅ Chapter 6 – Monitoring & Optimization
Tips to track, evaluate, and fine-tune RAG performance post-deployment.

✅ Chapter 7 – Advanced Performance Metrics
4 key metrics to measure and improve your RAG system's efficiency.

✅ Extras – Glossary & Conclusion
All key RAG terms explained in plain English.

📘 Download once. Refer forever.
💾 Save this post for later
🔁 Repost to help others learn RAG
👣 Follow Shubham Vora for more AI system-building insights

PDF Credit: Galileo
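A toy version of the fixed-size chunking with overlap that Chapter 4's chunking section covers: overlap keeps sentences that straddle a boundary retrievable from both chunks. The sizes are arbitrary examples.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "some long document text " * 200
print(len(chunk_text(doc)))   # number of overlapping chunks
```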
128
63 Comments -
Brian Kohlmann
Bader Rutter • 3K followers
Small Language Models: The Quiet Revolution

Not every AI problem needs a 175-billion-parameter hammer. That’s why Small Language Models (SLMs) are gaining serious traction.

While the headlines chase model size and benchmark dominance, SLMs quietly offer a smarter tradeoff:
- Faster inference
- Lower cost
- Smaller compute footprint
- Greater customizability

Offerings from OpenAI, Mistral AI, and others are built for specific tasks, not general-purpose genius. Think of them as the specialists, not the polymaths.

Imagine having a lightweight model tuned just to:
- Summarize your sales pipeline
- Answer HR policy questions
- Pull insights from one client’s data

Now imagine running it on-device, without calling an API or sharing a byte externally. That’s the SLM promise.

LLMs aren’t going anywhere. But they’re no longer the only game in town. In the near future, your company might run hundreds of tiny models, all focused, all aligned, all fast.

Are you building for big brilliance or small precision?

#AI #LLM #SLM #EmergingTech #SmallLanguageModels #AIInfra #EdgeAI #LLMStrategy #MarTech #BaderRutter
1
-
Gabriel Douglas, SHRM-CP
Rogel Associates • 9K followers
Transformers v5 is officially out. The biggest architectural overhaul since 2020.

After five years and 1.2 billion installs, Hugging Face has made PyTorch the sole core backend (farewell TensorFlow/Flax) and rebuilt the library around four pillars:
• Simplicity first: cleaner modeling files, one-model-one-file, shared attention interfaces, auto-draft PRs for new architectures https://lnkd.in/gbP7KAdj
• From fine-tuning → pre-training: better integration with torchtitan, megatron, nanotron, MaxText
• Inference as first-class: built-in continuous batching, paged attention, Transformers Serve (OpenAI-compatible server), and tight integration with vLLM, TensorRT-LLM, SGLang, llama.cpp, MLX, ExecuTorch
• Quantization built-in: 4-bit/8-bit now core, works everywhere (training + inference); a minimal loading sketch follows below

Result: the moment a new model lands in Transformers, it’s instantly usable across the entire inference stack, no porting needed.

If you train, fine-tune, or deploy models, this is the biggest Transformers refresh since 2020.

#PyTorch #HuggingFace #Transformers
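A minimal sketch of the "quantization built-in" pillar: loading a model in 4-bit via Transformers' bitsandbytes integration. The model id and compute dtype are example choices.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store weights in 4-bit
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",     # example model id
    quantization_config=bnb,
    device_map="auto",
)
print(model.get_memory_footprint() / 1e9, "GB")  # roughly a quarter of fp16
```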
2
1 Comment -
Narayana Reddy Munnelli
BODi • 710 followers
🔍 Understand Metrics for Evaluating Model Performance — in the Simplest Way Possible 💡

If ML metrics confuse you — trust me, you’re not alone. But they’re much simpler than they look. We often hear terms like accuracy, precision, recall, F1, AUC, and more, but what do they actually mean in real life? To help beginners (and even experienced folks) understand these metrics clearly, I created simple, human-friendly explanations using one everyday story.

🏥 Imagine You Run a Simple Health Test Machine
Your machine only gives two outputs: “Healthy” or “Sick”. We’ll use this one scenario to explain every major ML metric (and compute them all in the sketch below).

✅ 1. Accuracy — “How many people did my machine judge correctly?”
If 100 people take your test and 92 results are correct → accuracy = 92%.
🟢 Analogy: like a school exam — how many answers did you get right?
⚠️ But accuracy can be misleading: if only 1 person is sick and your machine says everyone is healthy, accuracy = 99%… but the machine is actually useless.

✅ 2. Precision — “When my machine says someone is SICK, how often is it right?”
If your machine flags 10 people as sick but only 4 are truly sick → precision is low.
🟢 Analogy: “How often do I accidentally scare healthy people by calling them sick?”
High precision → you rarely scare healthy people. Low precision → you scare many healthy people unnecessarily.

✅ 3. Recall — “Out of the people who ARE sick, how many did my machine actually find?”
If 6 people are truly sick but your machine catches only 3 → recall is low.
🟢 Analogy: “How good am I at catching all the sick people?”
High recall → you detect almost every sick person. Low recall → you miss too many.

🟡 Precision vs Recall in One Sentence
Precision = how clean your predictions are (don’t scare healthy people). Recall = how complete your detection is (don’t miss sick people). Clean vs complete.

✅ 4. F1 Score — “Did you do BOTH jobs well?”
F1 combines precision + recall into a single score.
🟢 Analogy: “Are you both careful and thorough?” If one is bad → F1 becomes bad, too.

✅ 5. Confusion Matrix — “A tiny table that shows what really happened.”
🟢 Analogy: a report card for your machine — who it caught, who it scared, who it missed, and who it got right.
Said “Sick”, truly sick → caught ✔️
Said “Sick”, truly healthy → false alarm ❌
Said “Healthy”, truly sick → missed ❌
Said “Healthy”, truly healthy → correctly passed ✔️

✅ 6. ROC–AUC — “How well can the machine tell sick from healthy overall?”
Forget graphs — think of it like this:
🟢 AUC answers: “If you pick one random sick person and one healthy person, how often does your machine score the sick person as ‘sicker’?”
AUC = 1 → perfect separation. AUC = 0.5 → might as well flip a coin. It measures discrimination power, not just accuracy.

🤝 Would love to hear how you explain these metrics in plain language. Share your take... it might help someone else learning ML. (Happy Learning 🖊️ 📘)
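A minimal sketch computing every metric from the story above with scikit-learn, using toy "sick" (1) / "healthy" (0) labels.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

y_true  = [1, 1, 1, 0, 0, 0, 0, 0]                    # who is actually sick
y_pred  = [1, 1, 0, 1, 0, 0, 0, 0]                    # what the machine said
y_score = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.2]    # the machine's "sickness" score

print("accuracy: ", accuracy_score(y_true, y_pred))   # 0.75
print("precision:", precision_score(y_true, y_pred))  # 2 of 3 "sick" calls were right
print("recall:   ", recall_score(y_true, y_pred))     # caught 2 of 3 sick people
print("f1:       ", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))               # the "report card"
print("roc-auc:  ", roc_auc_score(y_true, y_score))   # ranking quality of the scores
```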
4
1 Comment -
Youssef Hosni
Aalto University • 116K followers
Coding benchmarks shape how we evaluate LLMs’ coding performance. Some open-source models have been topping SWE-bench with ~80% scores. On paper, they look nearly tied with the strongest closed models.

But when evaluated on SWE-rebench, a fresh, decontaminated benchmark built from recent GitHub issues, performance drops sharply. And the leaderboard reshuffles.

The difference? Freshness. Once a benchmark is public long enough, models optimize around it. Scores go up. But that doesn’t always mean generalization improves.

In my latest blog on To Data & Beyond, I break down:
- Why benchmark saturation happens
- What SWE-rebench changes
- Which models actually generalize best on new tasks
- What this means if you’re building real coding agents

If you’re choosing models based solely on leaderboard scores, this is worth reading. You can read it from the link in the comments!
11
1 Comment -
Alex Vogiatzis
Toloka • 1K followers
When LLMs Judge LLMs

8 frontier LLMs were given the same complex technical design task. Then 3 of them blind-evaluated all 8 responses. The results were fascinating and slightly messy.

The models:
OpenAI's ChatGPT 5.2 Pro Extended Thinking
Anthropic's Claude Opus 4.5
Google's Gemini 3 Pro Deep Think
DeepSeek AI V3.2 with DeepThink
xAI Grok 4 Expert
Moonshot AI's Kimi K2 Thinking
Qwen 3 Max
GLM 4.7

Stylistic fingerprinting is no more. Models tried to identify each other's responses. Best result: 1 out of 8 correct. Task constraints erased their "house styles" entirely.

Evaluation variance is also wild. One response scored between 59 and 100 across evaluators. Same response. Same rubric. 41-point spread.

Self-evaluation is severely biased, as expected. ChatGPT ranked itself #1. Kimi ranked itself #1 AND gave itself a "Most Innovative" award. Gemini ranked itself #5. One of these models has a humility problem. Or two do.

The real drama: a disputed formula. Two evaluators called it "fatally flawed." Two praised its "PhD-level sophistication". ChatGPT and Gemini ran boundary tests. The formula was inverted - it would reject correct answers and accept garbage. Claude and Kimi praised its elegance without testing it.

Gemini's diagnosis of Claude: "It was seduced by presentation. It gave a perfect score to a broken formula, prioritizing appearance of rigor over functional correctness" 🤣🤣🤣

The model that wrote the broken formula? Kimi K2. The model that gave it 100/100? Also Kimi K2 💁♂️

The irony: the task was designing a system to handle disagreements between AI judges. Four AI judges then massively disagreed. The experiment proved the system was needed by failing to evaluate it correctly.

What LLM-as-judge reveals: sophistication bias is real. Self-evaluation is compromised. Models hallucinate defenses under pressure. Boundary testing separates engineers from academics.

The fix is boring but effective:
- Force evaluators to run boundary conditions before scoring (a toy version below). That one check eliminates most evaluation failures.
- Benchmark rankings do not predict task performance. The leaderboards are vibes. The work is what matters.
- Sophistication without correctness is just aesthetic failure.
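A toy version of the boundary-test-before-scoring fix: before trusting a judge's scoring formula, run it on cases with known outcomes. The formula and cases are invented; the pattern is the point.

```python
def judge_score(answer_quality: float, confidence: float) -> float:
    # Candidate formula under evaluation (deliberately inverted for the demo).
    return confidence / max(answer_quality, 1e-9)

def passes_boundary_tests(score_fn) -> bool:
    """Minimal sanity check: a perfect, confident answer must outscore
    confident garbage. Run this before letting the formula judge anything."""
    good = score_fn(1.0, 1.0)   # perfect answer, full confidence
    bad = score_fn(0.0, 1.0)    # garbage answer, full confidence
    return good > bad

print(passes_boundary_tests(judge_score))  # False: the formula is inverted
```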
16
1 Comment