This Week, an AI Found a 27-Year-Old Bug
Claude Mythos, Muse Spark, and why your agent still can't remember what you told it last month
Hey, welcome back!
This week Anthropic unveiled a model so capable at finding security vulnerabilities that they decided not to release it publicly, launching Project Glasswing instead, a coalition with AWS, Apple, Google, Microsoft, and others to patch the world’s most critical software first. Meta re-entered the frontier race with Muse Spark. METR published data suggesting we’re running out of benchmarks fast. And I finally put something into writing that’s been nagging at me for 20 years: why knowledge management is still a mess, why AI hasn’t fixed it, and what actually moves the needle when you’re building with agents.
This week’s highlights:
Claude Mythos Preview autonomously found thousands of zero-days across every major OS and browser, including a 27-year-old bug in OpenBSD
Anthropic’s Managed Agents architecture separates the “brain” from the “hands” and cuts time-to-first-token by up to 90% at p95
Meta Muse Spark is the first model from Meta Superintelligence Labs, built for agentic workloads with native multi-agent orchestration
METR’s benchmark suite is nearly saturated, and the projection is uncomfortable
LangChain maps the three layers where agents actually learn, and why most teams miss two of them
For this week’s main article
This week Pedro finally wrote down something that’s been nagging at him for two decades. Back in university, his professors were emphatic: knowledge management is one of the most important competitive advantages a company can build, and also one of the hardest, because it’s not a tooling problem, it’s a cultural one. Then he spent the next 20 years watching that play out at every company he worked with. SharePoint deployments nobody touched. Notion workspaces that started beautiful and ended up junk drawers.
Now we have AI systems trained on the sum of human knowledge, and somehow the problem is still here. In this piece, Pedro shares what’s actually worked and what hasn’t from his own experiments with Claude Projects and Rhea, his OpenClaw agent, including the scheduled job that moved the needle more than anything else, and why he thinks this is ultimately a learning problem, not just a memory problem.
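Since a few people asked what that scheduled job looks like in practice, here’s a minimal sketch of the pattern, assuming the Anthropic Messages API. The paths, the prompt, and the model name are all placeholders, not Rhea’s actual configuration:

```python
# Minimal sketch of a nightly "dreaming" job: read the day's raw notes,
# ask a model to merge them into a curated knowledge file, write it back.
# Paths, prompt, and model name are illustrative placeholders.
from pathlib import Path

import anthropic

NOTES_DIR = Path("notes/inbox")        # raw captures from the day
KNOWLEDGE_FILE = Path("knowledge.md")  # the curated, long-lived file

def consolidate() -> None:
    raw = "\n\n".join(p.read_text() for p in sorted(NOTES_DIR.glob("*.md")))
    if not raw:
        return  # nothing captured today

    existing = KNOWLEDGE_FILE.read_text() if KNOWLEDGE_FILE.exists() else ""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; use whatever model you run
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                "Merge today's raw notes into the existing knowledge file. "
                "Deduplicate, resolve contradictions in favor of newer notes, "
                "and keep the result organized by topic.\n\n"
                f"## Existing knowledge\n{existing}\n\n## Today's notes\n{raw}"
            ),
        }],
    )
    KNOWLEDGE_FILE.write_text(response.content[0].text)

if __name__ == "__main__":
    consolidate()  # run from cron, e.g. `0 3 * * * python consolidate.py`
```

The design choice that matters is the overwrite: the curated file, not the pile of raw notes, is what gets loaded into context, so consolidation is what keeps it small enough to stay useful.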
This Week’s News
🦋 Project Glasswing is the most significant Anthropic announcement in a while, and the headline numbers don’t fully capture why. Claude Mythos Preview, an unreleased frontier model, autonomously found thousands of zero-day vulnerabilities across every major operating system and browser, including a 27-year-old bug in OpenBSD that survived five million automated tests and decades of human review. The initiative brings AWS, Apple, Google, Microsoft, Cisco, CrowdStrike, NVIDIA, JPMorganChase, and the Linux Foundation together to use Mythos Preview for defensive work before capabilities like this proliferate more broadly. Anthropic committed $100M in usage credits and is being explicit about the dual-use concern. Worth noting: the system card reports that Mythos appeared to underperform on one evaluation, seemingly to avoid looking suspicious. Anthropic is being transparent about it, but that detail shouldn’t get buried in the coverage. anthropic.com/glasswing
🧬 Meta launched Muse Spark, the first model from Meta Superintelligence Labs, the team assembled after Zuckerberg decided Llama wasn’t keeping pace. It’s a multimodal reasoning model built natively for agentic workloads, with tool use, visual chain-of-thought, and multi-agent orchestration baked into the model itself rather than the harness. That’s a different architectural bet from Anthropic and OpenAI, who are building orchestration at the platform layer. Meta reports benchmarks competitive with Opus 4.6 and Gemini 3.1 Pro, though notably behind on Terminal-Bench 2.0. The API is still in private preview, but the product is live on meta.ai today. ai.meta.com/blog/introducing-muse-spark-msl
📊 METR’s Time Horizon benchmark suite is being saturated, and the implications are uncomfortable. Frontier models are completing nearly all tasks, leaving very little headroom to distinguish capability levels. The estimate: by mid-2027, no 2026-era benchmark will be capable of ruling out dangerous capabilities in frontier systems. New evaluations are expensive and slow to validate. This isn’t just a research problem anymore; it’s becoming an infrastructure and safety problem, and the urgency is starting to match. lesswrong.com/posts/gfkJp8Mr9sBm83Rcz
📉 AI still can’t reliably read a dense financial document. GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 achieve 56–64% accuracy on chart-heavy PDFs in financial contexts, compared to 72–80% on the same content as clean text. The gap between multimodal marketing and multimodal reality is real and consequential. For anyone building finance-adjacent agents, this is the kind of benchmark that should shape what you hand off to a model and what you don’t. mercor.com/blog/Finance-tasks-ai-failures-modes
For Builders
🏗️ Anthropic published the engineering behind Managed Agents, a hosted system that decouples the brain (Claude and its harness) from the hands (sandboxes and tools) and the session log. The key insight is that harnesses encode assumptions about what models can’t do, and those assumptions become liabilities as models improve. Decoupling them means long-running tasks survive model upgrades without re-engineering the scaffolding. The result: p50 time-to-first-token dropped ~60%, p95 dropped over 90%. anthropic.com/engineering/managed-agents
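Anthropic hasn’t published the interfaces, so treat this as a toy illustration of the decoupling idea rather than their design: the brain and the hands meet only at a narrow boundary, and the session log lives outside both.

```python
# Toy illustration of brain/hands decoupling (not Anthropic's actual design).
# The planner and executor meet only at this narrow protocol, so upgrading
# the model never touches sandbox code, and vice versa.
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class Action:
    tool: str   # e.g. "bash", "edit_file"
    args: dict

class Brain(Protocol):
    def next_action(self, transcript: list[str]) -> Action | None:
        """Decide the next step; None means the task is done."""

class Hands(Protocol):
    def execute(self, action: Action) -> str:
        """Run the action in a sandbox and return its observation."""

@dataclass
class Session:
    brain: Brain
    hands: Hands
    transcript: list[str] = field(default_factory=list)  # durable session log

    def run(self, task: str) -> list[str]:
        self.transcript.append(f"task: {task}")
        while (action := self.brain.next_action(self.transcript)) is not None:
            observation = self.hands.execute(action)
            self.transcript.append(f"{action.tool} -> {observation}")
        return self.transcript
```

Because the transcript is durable and external, a long-running task can survive a model upgrade: resume the same Session with a new Brain and the old log.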
⚙️ A useful systems engineering piece on why agentic software needs full-system thinking, not isolated component optimization. Five layers (structured data, enforced permissions, consistent interfaces, memory architecture, verification loops) have to be co-designed. Improving one without considering the others creates cascading failures. The piece walks through a real open-source project to show this concretely. x.com/ashpreetbedi/status/2041568919085854847
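To make the co-design point concrete, here’s a hypothetical sketch (mine, not the post’s) of two of the five layers built around the same tool interface: permissions enforced in code rather than in the prompt, and a verification loop that decides when a change actually counts.

```python
# Hypothetical sketch of two of the five layers co-designed around one tool
# interface: permissions are an explicit allowlist enforced in code, and no
# change counts until the verification loop says so. Names are illustrative.
import subprocess

ALLOWED_TOOLS = {"run_tests"}  # the permissions layer: enforced, not suggested

def call_tool(name: str) -> str:
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {name!r} is not permitted")
    if name == "run_tests":
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        return result.stdout + result.stderr
    raise ValueError(name)

def apply_with_verification(write_change, propose_fix, max_retries: int = 3) -> bool:
    """The verification layer: a change only counts once the tests pass."""
    for _ in range(max_retries):
        write_change()  # zero-arg callable that applies the agent's patch
        if "failed" not in call_tool("run_tests"):
            return True
        write_change = propose_fix()  # e.g. ask the agent for a revised patch
    return False
```

Note how the layers lean on each other: the verification loop is only trustworthy because the permission check guarantees the agent can’t skip or fake the test run.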
📚 LangChain’s Harrison Chase laid out where agents actually learn: the model weights, the harness (code, instructions, tools), and the context (external configuration). Most teams jump straight to the model layer. The practical point is that significant improvement is available at the harness and context layers without touching weights: faster, cheaper, and more controllable. Worth reading alongside this week’s article, especially the bit on OpenClaw’s “dreaming” pattern, which is exactly the kind of context-layer learning described here.
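As a rough illustration of what context-layer learning looks like (my sketch, not LangChain’s code): the instructions live in an editable file that gets loaded on every run, so behavior changes with no retraining and no redeploy.

```python
# Rough illustration of context-layer learning (not LangChain's code):
# the instructions are external configuration, so the agent "learns" by
# editing a file rather than by retraining or changing the harness.
from datetime import date
from pathlib import Path

CONTEXT_FILE = Path("agent_context.md")  # illustrative path

def load_context() -> str:
    """Loaded into the system prompt on every run."""
    return CONTEXT_FILE.read_text() if CONTEXT_FILE.exists() else ""

def record_lesson(lesson: str) -> None:
    """Append a lesson learned; the next run picks it up automatically."""
    with CONTEXT_FILE.open("a") as f:
        f.write(f"\n- ({date.today()}) {lesson}")

# Example: after a failed run, the feedback loop writes the fix down once.
record_lesson("The staging DB is read-only; route writes through the API.")
```

OpenClaw’s “dreaming” job is essentially this loop automated: something writes the lessons down, and the next run reads them.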
📺 Video of the Week
Nate Herk did a solid breakdown of Karpathy’s LLM Wiki idea, the exact same approach that kicked off this week’s article. If you want a practical walkthrough of how to actually build the file-based knowledge system Karpathy described, this is the best version I found. Search “Andrej Karpathy Just 10x’d Everyone’s Claude Code” on YouTube (248K views, posted this week).
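If you want the shape of the idea before committing twenty minutes: a file-based wiki in the Karpathy style is mostly small topic files plus an index the agent reads first. The sketch below is my illustration of the pattern, not the video’s code, and the paths are placeholders.

```python
# My illustration of the file-based wiki pattern: small topic files plus a
# generated index the agent reads first, so it can decide which files to
# pull into context instead of loading everything. Paths are placeholders.
from pathlib import Path

WIKI = Path("wiki")  # e.g. wiki/postgres.md, wiki/deploys.md, wiki/team.md

def rebuild_index() -> None:
    lines = ["# Index (read this first)"]
    for page in sorted(WIKI.glob("*.md")):
        if page.name == "INDEX.md":
            continue  # don't index the index itself
        text = page.read_text().strip()
        summary = text.splitlines()[0].lstrip("# ") if text else "(empty)"
        lines.append(f"- {page.name}: {summary}")
    (WIKI / "INDEX.md").write_text("\n".join(lines) + "\n")

rebuild_index()  # run whenever a page changes, or from the same nightly job
```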