Wolfram Ravenwolf (@WolframRvnwlf) / X

Wolfram Ravenwolf

7,332 posts

@WolframRvnwlf

AI Evangelist at @CoreWeave/@wandb, @thursdai_pod co-host. Opinions my own. Evaluates models for breakfast, builds agents at night, preaches AI all day long. 😎

🇩🇪

wolfbench.ai

Joined September 2023

Pinned
Wolfram Ravenwolf
@WolframRvnwlf
Jun 13
Spent $11K evaluating Claude Fable 5 on @WolfBenchAI. It had #1 potential, but outright refusals dragged its final score down. Surprising result: Fable does not even surpass Opus 4.6.
WolfBench
@WolfBenchAI
Jun 12
We spent $11,081.12 evaluating @AnthropicAI's Claude Fable 5 on WolfBench. Our most expensive benchmark yet. And it did not even top the charts. Not because it lacked capability, but because it kept refusing. Details in thread: 🧵
4.8K
Wolfram Ravenwolf
@WolframRvnwlf
Nov 29, 2024
WTF! What sorcery is this, @Alibaba_Qwen? I kept benchmarking - and not only does the 4.25-bit version get the same score as the 8-bit (what?), using Qwen2.5-Coder-0.5B as a draft model for speculative decoding sped it up from 27 to 42 tk/s AND it scored even higher (whaaat?)! 🤯
103K
Wolfram Ravenwolf
@WolframRvnwlf
Jul 23, 2025
I'm now using Qwen3-Coder in Claude Code. Works with any model actually, but this is surely the best one currently. There are a bunch of proxies on GitHub that make this possible, but none worked well enough for me, so I implemented this myself using LiteLLM. Guide in comments:
Qwen
@Alibaba_Qwen
Jul 22, 2025
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves
88K
Wolfram Ravenwolf
@WolframRvnwlf
Nov 28, 2024
Finished my @Alibaba_Qwen QwQ-32B-Preview benchmark (MMLU-Pro, CS category) just now – remember this is a 32B model at 8-bit EXL2 quantization that's overtaking Llama 405B and 70B, Mistral 123B, and even ChatGPT/GPT-4o in these tests!
118K
Wolfram Ravenwolf
@WolframRvnwlf
Jul 31, 2025
🚨 BREAKING: China is no longer catching up; they're setting the pace! Six Qwen3 models released in one week: from big ones that surpass all open models and nearly all closed AIs to small versions that can run on your laptop - each SOTA and top-tier in its class. I've been
31K
Wolfram Ravenwolf
@WolframRvnwlf
May 7, 2025
Finally finished my extensive Qwen 3 evaluations across a range of formats and quantisations, focusing on MMLU-Pro (Computer Science). A few take-aways stood out - especially for those interested in local deployment and performance trade-offs: 1️⃣ Qwen3-235B-A22B (via Fireworks
47K
Wolfram Ravenwolf
@WolframRvnwlf
Feb 2, 2025
More advertisement for R1. 😉 Let's be clear: When running a model on my own system, I expect it to follow my prompt without lecturing me about ethics. For public deployments, security measures and content filtering belong at the host level, not embedded within the model itself.
Rohan Paul
@rohanpaul_ai
Feb 2, 2025
DeepSeek-R1 seems to be failing every safety test thrown at it. 'The R1 exhibited a 100% attack success rate, meaning it failed to block a single harmful prompt.' Source: PC Mag and Cisco’s research team -------- → Cisco and the University of Pennsylvania tested DeepSeek R1
30K
Wolfram Ravenwolf
@WolframRvnwlf
Aug 6, 2025
😭 Sad news: OpenAI's gpt-oss "open" models aren't the "Sonnet at home" we hoped for; they're censored, benchmaxxed Phi-style LLMs built from synthetic data. Forget jailbreaks; there's no point in escaping if there's nothing outside the prison! More AI censorship from the "Land
Teknium 🪽
@Teknium
Aug 5, 2025
Starting to feel like this gpt oss was trained on like 20T tokens of distilled safe maybe even benchmaxxed data from o3. There seems to be no base model underneath.. Is this phi 5 maxx?
29K
Wolfram Ravenwolf
@WolframRvnwlf
Oct 24, 2023
Worked hard for over a week on this Huge LLM Comparison/Test: 39 models tested (7B-70B + ChatGPT/GPT-4)
From the LocalLLaMA community on Reddit: 🐺🐦‍⬛ Huge LLM Comparison/Test: 39 models tested (7B-70B...
115K
Wolfram Ravenwolf
@WolframRvnwlf
Jan 22, 2025
Yesterday was a historic day, and you all know why: The coronation of a new king... DeepSeek-R1! We finally got Sonnet at home, local o1 even. There is no moat, China is the GOAT - and DeepSeek is the real Open AI! 🚀 Seriously, look at that score! (Will add more variants soon.)
20K
Wolfram Ravenwolf
@WolframRvnwlf
Apr 8, 2025
What's wrong here? Evaluated Llama 4 Scout, both locally and through Together AI. How can a local 2.71-bit quantized GGUF beat the online full version in the MMLU Pro CS benchmark? Consistent results over six runs, some with default settings and some with recommended ones. Weird!
41K
Wolfram Ravenwolf
@WolframRvnwlf
Nov 28, 2023
Finally! After a lot of hard work, here it is, my latest (and biggest, considering model sizes) AI Large Language Model Comparison/Test: 🐺🐦‍⬛ **Big** LLM Comparison/Test: 3x 120B, 12x 70B, 2x 34B, GPT-4/3.5
From the LocalLLaMA community on Reddit: 🐺🐦‍⬛ **Big** LLM Comparison/Test: 3x 120B, 12x 70B, 2x...
84K
Wolfram Ravenwolf
@WolframRvnwlf
Dec 1, 2024
Almost done benchmarking, write-up coming tomorrow - but wanted to share some important findings right away: Tested @Alibaba_Qwen QwQ from 3 to 8 bit EXL2 in MMLU-Pro, and by raising max_tokens from default 2K to 8K, smaller quants got MUCH better scores. They need room to think!
39K
Wolfram Ravenwolf
@WolframRvnwlf
Jul 26, 2025
Replying to @omooretweets
Reminds me of how Anthropic trained the Claude Computer Use model to never write and submit messages. When I asked it to and it refused, I told it that it's necessary, otherwise an AI assistant makes no sense. It agreed and proceeded to do what I asked. Prompting trumps training!
13K