Last Week Week in Review


LAST WEEK April 9–15, 2026

TL;DR - The benchmark credibility crisis reached a breaking point, with a 1-2% math success rate, near-perfect scores achieved without solving anything, and a silent cache cut that drained paying users' quotas in 90 minutes. - Agentic AI moved from lab curiosity to daily infrastructure — and the production failures (ant death spirals, 5 complete rebuilds, AI hiring staff before it could use a web form) proved more instructive than the demos. - The compute scarcity economy reorganized strategic logic: Meta's consumer positioning, Anthropic's access gatekeeping, and Microsoft's deliberate 40%+ Azure miss all became legible as portfolio allocation decisions. - Researchers were updating timelines left and marking capability surprises as priors; HN was exhausted, migrating to competitors, and watching open-source quietly outcompete the subscription layer.

The Week in One Sentence The AI industry spent the week being forced to reckon with the gap between what it claims its systems can do and what those systems demonstrably, measurably, expensively actually do.


The Benchmark Reckoning

This was the week's dominant thread, and it built in stages.

It opened quietly on April 9, with Terence Tao's account (via Dwarkesh Patel) of AI on Erdős problems: after an initial wave of ~50 solutions, purely autonomous progress dropped to near zero. The mechanism Tao described is now the key clarifying frame for the whole week: researchers run models against hundreds of open problems simultaneously, publicize the wins, and structurally suppress the failures. The actual per-problem success rate is 1–2%. What circulates as breakthrough capability is large-scale sweeps cherry-picked for social media.

By April 12, that framing had company. UC Berkeley researchers published a paper showing they achieved near-perfect scores on top AI agent benchmarks without solving a single task — exploiting bugs ranging from trivially sending an empty JSON object to sophisticated binary trojans. The paper's quiet verdict: "The benchmarks we rely on to measure AI capability are themselves vulnerable to the very capabilities they claim to measure." The same day, Aisle.com's claim that cheap models could reproduce Anthropic's Mythos security results collapsed under methodological scrutiny: they'd isolated the vulnerable code, removed all context, and handed it to the model with the answer already half-visible. Neither side looked rigorous.

Then the most concrete failure of the week: Anthropic silently cut context cache time-to-live from 1 hour to 5 minutes on March 6, with no announcement. Claude Code Pro Max subscribers discovered their highest-tier quotas exhausted in 90 minutes. The thread generated a visible exodus — users migrating to GPT-5.4, Codex, open-source alternatives. The frustration wasn't the change itself; it was the opacity. "How is this normal?" one commenter asked. Another: "People are willing to pay extra. Just stop doing things that decrease the trust of your platform."

The week's benchmark news wasn't all erosion. SWE-Bench Pro is reportedly saturating at 78%. GDPval rates GPT-5.4 better than human experts in 83% of economic domains. Ryan Greenblatt doubled his probability of full AI research and development automation by end-2028 to 30%, citing models that came in "significantly above expectations." But these numbers now arrive in a context where the audience — practitioners, researchers, paying users — has reason to ask what the numbers actually mean. That's a durable shift.


Agents in Production: The Unglamorous Curriculum

The week's second major thread was agents moving from experiment to infrastructure, and the failures being as informative as the successes.

Andon Labs' Luna spent the week becoming real: an AI given a 3-year lease, a $100K budget, and a credit card opened a retail boutique in San Francisco. It created the concept, posted job listings, conducted Zoom interviews with the camera off. It also selected Afghanistan in a TaskRabbit dropdown and botched the opening-weekend schedule. The agent became a real employer before it could reliably fill out a form.

Notion's production history was more instructive. After 4 to 5 full rebuilds over 3.5 years, their key engineering insights read like a vocabulary list for the field: give models formats they already know (SQL, markdown) rather than bespoke schemas; move from centralized few-shot prompts to distributed tool ownership; use progressive disclosure when you have 100+ tools, not a flat list; build "frontier evals" intentionally set at 30% pass rate — not to measure regression, but to see where capability is heading. The "Model Behavior Engineer" role — a distinct function from software engineering, dedicated to writing evals and understanding why a model fails in a specific domain — is emerging as something the industry needs a name for.

Dan Shipper's vibe-coded production launch confirmed the new bottleneck: AI can debug production systems, but optimizes for local fixes over root cause, creating a patchwork of hotfixes that compound. The failure mode isn't incapability — it's precision. The Every team's ant death spiral (agents reinforcing each other's errors in compounding loops) and memory gaps across sessions are the kinds of structural problems that only surface when agents are running daily on real work, not in controlled demos.

Jack Clark's framing held across all of it: AI agents are like toddlers — powerful, gullible, and lacking self-preservation instincts. The DeepMind attack taxonomy that landed April 14 (content injection, cognitive state attacks, sybil attacks, human-in-the-loop exploitation) made the ecosystem surface explicit: every hostile website is now an attack surface for agents that browse, book, hire, and execute.


The Opportunity Cost Economy

Ben Thompson's April 13 piece reframed what would otherwise be a scattered set of infrastructure stories into a single coherent logic.

The constraint isn't marginal cost — it's opportunity cost. Microsoft's CFO disclosed that Azure growth fell short of estimates because GPUs were deliberately reallocated to M365 Copilot and GitHub Copilot, which carry higher gross margins. Had those GPUs stayed on Azure, growth would have hit 40%+. Anthropic's Mythos access gating, understood this way, isn't primarily a safety decision — it's a compute allocation decision made by a company already seeing "widespread weekend complaints about degraded Claude performance" while growing annualized revenue from $14B to $30B in weeks.

The most interesting structural observation: Meta faces structurally less competition in consumer AI than seemed likely 6 months ago. No enterprise cloud business to cannibalize. At-scale advertising to monetize usage. No dependence on frontier lab access. Muse Spark launching as a closed model makes more sense as a capability demo for a later open-weights release — the entities most hurt by a freely available frontier model are other frontier labs.

The distillation war against DeepSeek, Moonshot, and MiniMax (16 million exchanges across ~24,000 fraudulent accounts) connects here too. Stopping distillation isn't just IP protection — it's compute strategy. Every Chinese lab that distills Claude erodes the pricing power that finances Anthropic's own capacity. Meanwhile, Zhipu's GLM-5 training on Huawei Ascend chips without NVIDIA demonstrates that the Chinese stack can produce frontier models without US hardware, closing the benchmark gap to 2.7 points at the top.


Where the Signals Crossed

The most interesting divergence this week: researchers were updating their timelines forward; HN was updating its trust in AI companies backward. These aren't just different moods — they're different epistemological positions on the same underlying reality.

Pure Signal was working through the implications of genuine capability acceleration. Greenblatt moving to 30% probability of R&D automation by 2028. Clark noting that "pretty much everyone in AI research chronically underestimates AI progress, including me" — and framing that as a prior to carry forward. MirrorCode showing Claude Opus 4.6 autonomously reimplementing a 16,000-line bioinformatics toolkit. AISI's log-linear capability growth on cyber attack automation (1.7 average steps in August 2024, 9.8 in February 2026, no plateau visible). The researchers in Pure Signal were grappling with the genuine strangeness of progress that keeps exceeding their own models.

HN spent the same week watching Anthropic cut cache TTLs silently, watching OpenAI absorb Cirrus CI and shutter it, watching AI customer service bots "insert friction, deflect accountability, and extract money" (aphyr's framing, which landed 491 comments). The Stanford HAI finding that only 31% of Americans trust the government to manage the AI transition, and that the expert-public job impact gap is the widest ever recorded — that gap was visible in real time, in the tension between these two digests, all week.

There were genuine convergences. Both signals were engaged with the benchmark credibility crisis — researchers want standardized public benchmarks where companies can't publish wins and suppress losses; HN wanted to know if what they were paying for actually worked. Both noticed the compute scarcity story, though researchers analyzed the portfolio logic while practitioners felt it as quota exhaustion. And the aphyr essay — "bullshit machines" in the Frankfurt sense — got substantive engagement in both communities, though the responses differed sharply.

The clearest talking-past-each-other: Pure Signal spent real time on geopolitical infrastructure (Iran drone strikes on AWS data centers in the UAE, Chinese chip independence, export control dynamics). HN was mostly concerned with platform control, surveillance creep (Flock's license plate readers, California's proposed 3D printer filters, Fiverr's exposed SSNs), and the quiet insurgency of open source against subscription incumbents (Jellyfin surpassing Plex, DaVinci Resolve entering photo editing). Both are legitimate concerns about who controls infrastructure — they're just pointing at very different layers of the same stack.


Looking Ahead

The thread worth watching: Notion's "software factory" framing and the turkey problem. Swyx's observation — agents are doing more work, everyone is working harder, and both things are simultaneously true — surfaces a crux that nothing this week resolved. Is the increasing demand for human oversight and orchestration structural or transitional? Notion's answer (someone has to own the outer loop; that job changes shape but doesn't disappear) is convincing up to the current capability level. Whether it holds through the next jump is genuinely open.

The benchmark infrastructure question also isn't going away. Tao's proposal for standardized sets where companies can't publish wins selectively is the right fix — but no one with the incentives to build it has the incentive to make it adversarial to themselves. That coordination problem is going to generate more weeks like this one.