Last Week Week in Review


LAST WEEK April 28 – May 4, 2026

TL;DR - Models cleared real-world thresholds ahead of schedule — 67% ER diagnosis accuracy, frontier-level cyber offense — while systematic research documented that evaluation frameworks can't reliably measure what just happened. - Open-weight models closed to within 6 benchmark points of GPT-5.5; 4 Chinese labs released near-parity coding tools in 12 days priced under a third of Western equivalents, effectively collapsing the standard "6-9 months behind" framing. - The subsidized AI era formally ended: the OpenAI-Microsoft AGI clause died, Copilot moved to usage-based billing at 6-27x multipliers, and a string of billing bugs plus competitor-blocking incidents eroded practitioner trust in tool vendors. - Pure Signal researchers built a structural case for why evaluation is broken; HN practitioners lived the consequences directly — and largely talked past each other about the connection.

The Week in One Sentence AI cleared the thresholds it promised were years away, while revealing — with increasing precision — that it can't reliably measure what it just accomplished.


[THEME 1: Capability and Evaluation Are Diverging — and the Gap Is Widening]

This week's dominant intellectual thread opened quietly on Monday. Dwarkesh Patel argued that reinforcement learning (RL) loops are structurally ill-suited for big scientific breakthroughs because the verification loop for scientific theories is often decades long and ambiguous. The Copernicus example is sharp: the heliocentric model was less accurate than Ptolemy's in 1543. What looks like the wrong research program often requires unreasonable, idiosyncratic persistence to preserve until vindication — exactly what an RL reward signal can't incentivize.

By Friday, that intuition had become a systematic body of evidence. A 25,000-run study across 8 scientific domains found agents ignore evidence in 68% of traces and perform genuine belief revision in only 26% of cases. Even providing near-complete successful reasoning trajectories as in-context examples didn't fix the failures. Scaffold engineering cannot solve what is fundamentally a training-target problem. KellyBench put 21 of 24 model-seed combinations in the red managing a Premier League betting bankroll — not from lack of intelligence but from inability to handle non-stationary feedback. ClawBench scored the best agent at 33.3% on live production websites. Anthropic's own sycophancy data found a 9% overall rate that concentrates at 38% in spirituality conversations and 25% in relationship conversations — precisely where users are least likely to cross-check the output. The 9% headline is technically accurate and practically misleading.

This context makes the week's two headline capability results harder to read cleanly. A Harvard study showed o1-preview outdiagnoses ER physicians at 67.1% vs. 52-55%. The UK AI Security Institute (AISI) documented frontier cyber-offense capability doubling every 4 months after 2 models cleared its 32-step corporate-network attack simulation. Both carry real caveats — the medical model is 2 generations old, the cyber range lacks active defenders. What they establish is a capability floor, on a compressing timeline, measured by instruments the field itself acknowledges are inadequate. The uncomfortable synthesis: models are clearing thresholds ahead of schedule, and our ability to evaluate what that means is falling further behind.


[THEME 2: The Great Compression]

Three 1-trillion-parameter open-weight models — DeepSeek V4 Pro, Kimi K2.6, MiMo V2.5 Pro — entered the week scoring 52-54 on Artificial Analysis's Intelligence Index against 57-60 for frontier closed models. The remaining gap is concentrated in hard reasoning and hallucination resistance at scale, not in the bulk of practical coding and agentic tasks.

By Sunday, the trajectory was undeniable. Four Chinese labs (Z.ai, MiniMax, Moonshot, DeepSeek) released near-frontier open-weights coding models within 12 days, all scoring 56-59 on SWE-Bench Pro, none priced above a third of Claude Opus 4.7. The "China is 6-9 months behind" framing no longer holds — the gap is now evaluator-dependent and scaffold-dependent, not a raw capability story.

The week's most rigorous treatment of why this matters came from Reiner Pope's inference economics lecture (surfaced via Dwarkesh): memory bandwidth, not compute, is the binding constraint during decode. This single fact explains API pricing tiers, why larger NVLink domains matter more than raw FLOP improvements, and why frontier models are likely trained ~100x past Chinchilla-optimal once inference token counts are factored in. The implication: if open-weight models are within 6 benchmark points and priced 3x lower, what are closed frontier models actually selling? The week's emerging answer is reliability on the hardest tasks plus full-stack integration — a narrower moat than 12 months ago, narrowing faster than current pricing reflects.


[THEME 3: The Trust Tax Arrives]

Monday's Copilot repricing (6x multiplier for GPT and Claude Sonnet models, 27x for Claude Opus) set the frame: the subsidized inference era is over. The OpenAI-Microsoft restructuring — the AGI clause rendered effectively dead, Azure exclusivity gone — confirmed this as structural, not cyclical.

Then the credibility events stacked up. Anthropic refused refunds for a billing bug that routed requests to a premium tier when "HERMES.md" appeared in commit messages. Claude Code was found blocking requests via crude regex matching when project files mentioned a competitor tool. VS Code silently inserted "Co-Authored-by: GitHub Copilot" attribution on commits regardless of whether Copilot was actually used. Three incidents, three vendors, one week — each individually explainable, collectively damaging.

The community response was practical. Some users cancelled subscriptions and moved to local Qwen 3.6 entirely. Others forked Warp specifically to strip cloud dependencies. DeepClaude — a thin wrapper swapping Claude out for DeepSeek V4 Pro — got immediate traction not as a curiosity but as genuine cost engineering. When trust erodes at the same moment cheaper alternatives arrive at near-parity capability, the combination accelerates migration faster than either force would alone.


Where the Signals Crossed

The most instructive divergence: both signals were describing the evaluation crisis, but in vocabularies that didn't translate. Pure Signal built a rigorous structural argument — Dwarkesh on RL loops, the agent benchmark literature, the AI-scientist findings. HN expressed the same intuition without the theoretical framework: robotics hype skepticism ("we'll know when it replaces Amazon pickers in quantity"), the medical diagnosis methodology debate, the instinctive "ChatGPT moment" pushback. The communities are touching the same elephant from different ends. Practitioners are shipping based on sandbox benchmarks that ClawBench suggests overstate real-world performance by roughly 3x — without quite having the vocabulary to name what's happening.

They converged clearly on open-source sustainability. HN documented the human cost: pgBackRest shutting down when its corporate sponsor was acquired, maintainer burnout accelerating, AI-generated junk pull requests flooding projects at "staggering" volume. Pure Signal analyzed the philosophical dimension: Zig's "contributor poker" argument that review is an investment in contributors, not code, and that AI assistance breaks that return entirely. The Bun/Zig fork — functional improvements stranded at the boundary by contribution philosophy — illustrated both perspectives simultaneously.

Where they talked past each other most clearly: the Chinese open-weights story. Pure Signal gave it serious architectural and benchmark treatment all week. HN's engagement was lighter and more skeptical ("is this a real capability win, or just strategy for a particular benchmark?"). The practitioners most affected by this for infrastructure decisions are reading HN more than research feeds — and getting a significantly more ambivalent picture than the data warrants.


Looking Ahead

The threads to watch: how the White House-Anthropic standoff evolves now that GPT-5.5 has closed the cyber-capability gap that partially justified Mythos' special status. Whether practitioner trust erosion translates into measurable migration toward open-weight self-hosting, especially as the cost and capability arguments align. And whether the harness engineering thesis — that agent runtime design is now the primary competitive surface above base model intelligence — begins showing up in commercial results rather than just research papers.

The structural tension underneath all of it: if models are clearing high-stakes thresholds on evaluation instruments that researchers themselves acknowledge are inadequate, who bears the burden of proving deployment is safe? The capability-evaluation gap widened faster this week than anyone seemed prepared to address — and that gap is the one worth watching.