Week in Review
LAST WEEK April 13–19, 2026
TL;DR
- Anthropic split its frontier into two tracks, a public Opus 4.7 and a gated Mythos Preview 13+ SWE-bench Pro points ahead, marking the first time the "true frontier" is meaningfully withheld from paying customers.
- Multiple researchers and practitioners converged independently on the same finding: harness and scaffold design produces larger reliability gains than model substitution, often by an order of magnitude.
- AI-assisted cyberattack capability is scaling log-linearly with compute, attack costs have fallen to £65 for a full network penetration, and Iranian drone strikes on AWS data centers made cloud infrastructure a literal theater of war.
- Pure Signal was tracking capability acceleration with cautious optimism; HN was living its downstream consequences: token inflation, quota exhaustion, and a growing conviction that the labs are extracting value faster than they're delivering it.
The Week in One Sentence
The labs are moving the frontier faster than they're moving it toward you — and the gap between what frontier AI can do and what paying customers actually receive is now the defining tension of the industry.
The Frontier Splits in Two
This was the week the "true frontier" became a distinct product tier, not just a lab-internal concept. Anthropic is now running two parallel tracks: the public Opus 4.7 (released April 16) and the gated Mythos Preview, accessible only to roughly 40 trusted partners including certain US government agencies. The gap between them is 13+ SWE-bench Pro points (64.3% for Opus 4.7 vs 77.8% for Mythos Preview), which in practical terms represents several months of capability development that Anthropic is explicitly not shipping to the public. This is the first major instance where the industry's "true frontier" is both measurable and deliberately withheld, and it won't be the last.
The Opus 4.7 launch itself was real progress — +11 SWE-bench Pro points over 4.6, a 3x image resolution increase, and a ~35% reduction in output tokens at comparable quality scores — but the launch was accompanied by a tokenizer change that inflated input token costs by 30–45% for real-world code and prompted documents. Users reported burning through 27% of their weekly allowance in a day. Anthropic's official position was that reasoning efficiency gains more than offset the tokenizer inflation; users' actual experience was that their quotas collapsed. The week before, Anthropic had also silently cut context cache TTL from 1 hour to 5 minutes — a change that explained why Pro Max subscribers were exhausting 5x quotas in 90 minutes. Taken together: the model got better, the experience got worse, and neither change was communicated proactively.
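Whether the output savings offset the tokenizer inflation depends on how a given bill splits between input and output. The sketch below works through that arithmetic. It assumes per-token prices are unchanged and that the bill is only input plus output (no cache discounts); the function names and the `input_share` parameter are illustrative, not anything Anthropic publishes.

```python
def relative_cost(input_share, input_inflation, output_reduction):
    """Return new_bill / old_bill for a workload.

    input_share:      fraction of the OLD bill spent on input tokens (0..1).
    input_inflation:  fractional rise in input tokens (0.30 means +30%).
    output_reduction: fractional drop in output tokens (0.35 means -35%).
    Assumes unchanged per-token prices and no caching effects.
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    output_share = 1.0 - input_share
    return (input_share * (1.0 + input_inflation)
            + output_share * (1.0 - output_reduction))


def break_even_input_share(input_inflation, output_reduction):
    """Input share of the old bill at which the two effects cancel.

    Solves s*(1+i) + (1-s)*(1-r) = 1 for s, giving s = r / (i + r).
    """
    return output_reduction / (input_inflation + output_reduction)


if __name__ == "__main__":
    # Figures from the paragraph above: +30% to +45% input, about -35% output.
    for inflation in (0.30, 0.45):
        s = break_even_input_share(inflation, 0.35)
        print(f"input inflation {inflation:.0%}: break-even input share {s:.1%}")
```

Under these assumptions, a workload whose old bill was more than roughly 44% to 54% input tokens ends up paying more, which is consistent with input-heavy coding and long-document users feeling the change most.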
Ryan Greenblatt's timeline update adds urgency to this dynamic. He doubled his probability of full AI R&D automation by end-2028 to 30%, driven by models that came in "significantly above expectations" on long-horizon tasks. METR and Epoch's MirrorCode benchmark provided concrete grounding: Claude Opus 4.6 autonomously reimplemented a 16,000-line Go bioinformatics toolkit, a task estimated at 2–17 human weeks, and its performance scaled with inference time, suggesting the ceiling on harder tasks is compute-bound, not architecture-bound. Jack Clark's editorial note became the week's most quotable line: "Pretty much everyone in AI research chronically underestimates AI progress, including me." He recommends treating this calibration failure as a prior. The Stanford HAI 2026 Index put numbers on the gap: 53% global AI adoption, 31% of the American public trusting the government to manage the transition, and the widest expert-public disagreement on job impact ever recorded.
Harness Engineering Is the Underweighted Insight
Underneath the benchmark announcements, the week's most practically actionable signal was a convergence across multiple researchers and practitioners on the same finding: the wrapper around a model matters as much as the model itself, sometimes more.
The most dramatic demonstration came from a benchmark showing Qwen3-8B scoring 33/507 on LongCoT-Mini when scaffolded with dspy.RLM, versus 0/507 vanilla. The scaffold did, in the author's framing, "100% of the lifting." The earlier Meta-Harness result (surfaced in Benaich's quarterly review) showed that changing only the harness around a fixed LLM produces a 6x performance gap on the same benchmark — with raw execution traces up to 10M tokens outperforming compressed summaries, which were slightly worse than scores alone.
Notion's Latent Space interview made the argument from production experience. After 4–5 rebuilds over 3.5 years, their key lessons generalize well: give models formats they already know (SQL and markdown over bespoke XML or internal JSON schemas), distribute tool ownership away from a centralized few-shot string, implement progressive disclosure when your tool library exceeds ~20 definitions, and build evaluation infrastructure as a distinct engineering function rather than a side effect of shipping. Their "Notion's Last Exam" — frontier evals intentionally set at 30% pass rate — exists specifically because their production evals saturated and stopped being useful feedback. That's a maturity indicator worth tracking.
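One way to picture the progressive-disclosure lesson is a registry that shows full tool definitions while the library is small and falls back to category summaries once it passes the threshold. This is a minimal sketch, not Notion's implementation: `Tool`, `initial_tool_listing`, `expand_category`, and the string formats are all hypothetical names chosen for illustration. Only the ~20-definition threshold comes from the interview.

```python
from dataclasses import dataclass

# Rule-of-thumb threshold from the interview; tune for your own agent.
DISCLOSURE_THRESHOLD = 20


@dataclass(frozen=True)
class Tool:
    name: str
    category: str
    description: str


def initial_tool_listing(tools, threshold=DISCLOSURE_THRESHOLD):
    """Return the lines shown to the model at the start of a session.

    Small libraries are listed in full. Larger ones are collapsed to one
    line per category, and the model asks for details on demand.
    """
    if len(tools) <= threshold:
        return [f"{t.name}: {t.description}" for t in tools]
    counts = {}
    for t in tools:
        counts[t.category] = counts.get(t.category, 0) + 1
    return [
        f"category {cat} ({n} tools); "
        f"call expand_category('{cat}') for definitions"
        for cat, n in sorted(counts.items())
    ]


def expand_category(tools, category):
    """Second disclosure step: full definitions for one category only."""
    return [f"{t.name}: {t.description}" for t in tools if t.category == category]
```

The design choice is that context cost grows with the number of categories rather than the number of tools, at the price of one extra round trip when the model needs a definition.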
Simon Willison's practical illustration connects to the same principle from a different angle: pointing Claude Code at an existing Django ORM definition (rather than describing the schema in natural language) eliminated an entire class of ambiguity, letting the agent derive correct `UNION` clauses and mapping logic without being told either. The practical implication, stated plainly: before upgrading your model, audit your harness. The evidence suggests scaffold improvements compound and are model-agnostic; model upgrades are point-in-time gains that deprecate on the next release cycle.
Security as a Macroeconomic Variable
The AI cybersecurity story matured from theoretical concern to documented operational reality this week, and the numbers are uncomfortable.
The UK AI Safety Institute's research established the benchmark: across 7 frontier models on purpose-built cyber ranges, average autonomous attack steps completed at 10M tokens rose from 1.7 (GPT-4o, August 2024) to 9.8 (Claude Opus 4.6, February 2026) — log-linear with inference compute, no plateau visible. NCSC's marginal cost estimate for an AI-assisted network penetration: £65. Mythos Preview, gated from public access, scored 83.1% on CyberGym versus 66.6% for Opus 4.6, and independently identified a 27-year-old OpenBSD remote-crash bug and a 16-year-old FFmpeg flaw that automated testing had missed 5 million times. These are dual-use capabilities in the most literal sense.
The Anthropic vs OpenAI access philosophy debate played out in public this week. Anthropic gates Mythos to ~40 trusted partners; OpenAI opened GPT-5.4-Cyber to any verified defender via its Trusted Access program. The rhetoric of "democratized defense" is somewhat ahead of the operational reality (Simon Willison noted the application process still requires a Google Form, not obviously different from Glasswing gating), but the underlying question is genuine: which access model actually reduces net harm? We don't have evidence yet. Drew Breunig's analysis adds the most interesting second-order observation: if security becomes proof-of-work (you must outspend attackers on token consumption, not outthink them), then open source libraries become more valuable, because amortizing hardening costs across all users is the only way to win the arithmetic.
The geopolitical layer became physical. Iran struck AWS data centers in the UAE and Bahrain with drones, taking down 2 of 3 availability zones simultaneously, in the first deliberate military attack on commercial cloud infrastructure. Iranian state media cited US military AI systems running on AWS as justification. Cloud infrastructure is now a literal theater of war. Meanwhile, the Supermicro prosecution ($2.5B in NVIDIA servers allegedly diverted to China) and Zhipu AI's GLM-5 demonstrating frontier-quality training entirely on Huawei Ascend chips made the export control regime feel simultaneously more serious and more porous than it did a month ago.
Where the Signals Crossed
This was a week where the two communities, Pure Signal and HN, were operating on different time horizons, and the friction between them produced the most interesting signal.
Pure Signal was tracking the capability trajectory with focused attention — Greenblatt's updated timelines, AISI's log-linear attack scaling, MirrorCode's 16,000-line reimplementation result. The researchers are excited and worried in roughly equal measure, but they're looking at 2026–2028 as the planning horizon. HN was living inside a much shorter frame: the week's Opus 4.7 tokenizer math, the 90-minute quota exhaustion, the Fiverr data leak exposing thousands of Social Security numbers, NIST abandoning most CVE enrichment. The practitioners were not interested in 2028 — they were trying to make their Tuesday work.
The harness engineering insight is where the signals most productively converged, without either community fully acknowledging the other. Researchers flagged it through Meta-Harness and Notion's rebuilds; HN practitioners were independently discovering it through the `autoprober` builder (who learned that sub-millimeter hardware precision makes hallucination immediately detectable) and the developer who built a SPICE-to-oscilloscope verification loop. Both communities landed in the same place, that the interface between model and task matters as much as the model, but via completely different paths.
The sharpest divergence was on the "AI is getting more expensive" story. Pure Signal covered compute scarcity as a strategic portfolio allocation question (Ben Thompson's opportunity cost framing, Stargate's 9+ GW buildout). HN experienced it as a $14,000 annual cloud bill that drove developers to Hetzner, and a 45% token inflation that restructured business models overnight. The researchers were asking "who controls the allocation?"; the practitioners were asking "why is my Tuesday cost 45% higher?" Same underlying phenomenon, completely different urgency gradient.
The accountability thread ran through HN all week — WordPress plugin backdoors, vibe-coded medical apps exposing health data, Fiverr's publicly indexed tax forms, iTerm2 vulnerable to `cat readme.txt` — without much corresponding attention in Pure Signal. The researchers are focused on frontier capability; the practitioners are living in a world where the software supply chain is degrading faster than anyone is fixing it. Kyle Kingsbury's "The future of everything is lies" essay, which generated real heat on HN, would have fit naturally into the Pure Signal conversation about Gradual Disempowerment — but the two communities aren't really reading each other.
Looking Ahead
The gated frontier is the thread to watch. Anthropic running a public track and a Mythos track creates a structural tension Dwarkesh Patel named clearly: at $25/MTok for output tokens, $25M buys 1T tokens of synthetic training data — trivial for a well-funded lab. The labs can hide chain-of-thought, but agentic tool use happens locally and can't be hidden. If distillation can't be stopped and harness engineering matters as much as weights, the value of gating degrades over time — but "over time" may be long enough to matter for cybersecurity and biotech, where a 6-month capability advantage has asymmetric consequences.
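The distillation arithmetic above is easy to check. The sketch below converts a budget into tokens at a per-million-token price. The $25/MTok and $25M figures are the ones quoted above; the function names are illustrative.

```python
TOKENS_PER_MTOK = 1_000_000


def tokens_for_budget(budget_usd, price_per_mtok_usd):
    """Number of tokens a budget buys at a given price per million tokens."""
    if price_per_mtok_usd <= 0:
        raise ValueError("price must be positive")
    return budget_usd / price_per_mtok_usd * TOKENS_PER_MTOK


def budget_for_tokens(tokens, price_per_mtok_usd):
    """Inverse: dollars needed to buy a given number of tokens."""
    return tokens / TOKENS_PER_MTOK * price_per_mtok_usd


if __name__ == "__main__":
    tokens = tokens_for_budget(25_000_000, 25)
    print(f"{tokens:.0e} tokens")  # 1e+12, i.e. one trillion
```

At list price, then, a trillion tokens of synthetic data costs $25M before any volume discount or rate-limit constraint, neither of which this sketch models.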
The user trust deficit that built through this week doesn't get repaired by better benchmarks. The practitioners who migrated to competitors after the TTL cut, the developers repricing their products around the tokenizer change, the teams evaluating open-weight alternatives — they're making durable infrastructure decisions, not temporary workarounds. The question for next week: does Anthropic address the transparency deficit, or does it keep treating opacity as a product feature? Based on this week's system prompt archaeology, they're doing the latter more intentionally than ever.