Last Week in Review

March 26–April 1, 2026

TL;DR

- The model layer stopped being the story: harness engineering, efficiency research, and interface design became the week's dominant technical narrative
- Three package-registry supply chain attacks in 7 days exposed the AI toolchain's fragility at precisely the wrong moment
- Compute economics tightened at the frontier: H100 prices rising, Sora dead at $1M/day, Claude Mythos leaking as a new tier above Opus, OpenAI closing at $852B
- Pure Signal researchers were building increasingly sophisticated agentic systems while HN asked whether we can trust any of the infrastructure underneath them

The Week in One Sentence

The intelligence kept improving while the infrastructure cracked, and the gap between what these systems can do and what we can trust them to run on became the week's defining tension.


The Harness Is the Product

The week opened on Thursday with ARC-AGI-3, François Chollet's latest attempt to find out whether frontier models are actually reasoning or just doing expensive memorization. The results were brutal: Gemini Pro led all comers at 0.37%, while humans solve these puzzles at essentially 100% on first exposure. The benchmark is designed to be un-gameable in the ways earlier versions weren't: no instructions, no scaffolding, novel rules discovered from scratch.

But the ARC-AGI-3 story was almost immediately superseded by a more urgent conversation. By Tuesday, multiple independent sources (Georgi Gerganov, creator of llama.cpp; Theo's benchmarking; Meta's Darwin Godel Machine Hyperagent paper; the leaked Claude Code source) had converged on the same uncomfortable realization: the model is no longer the main event. The harness around it is where real performance variance now lives.

Theo's data was the sharpest empirical signal: Opus 4.6 scores roughly 20% higher in Cursor than in Claude Code on the same tasks, with the gap attributable entirely to harness differences. If that holds up under scrutiny, it means benchmark comparisons between models are partly comparisons between wrappers. You cannot reproduce the performance delta because the harness is closed.

The Claude Code source leak on April 1st — 500K lines exposed via shipped source maps — inadvertently provided the week's most revealing technical document. What the architectural analysis showed: a 3-tiered memory system (indexed MEMORY.md → topic files → full session transcripts with autoDream consolidation), subagent parallelism that's essentially free via KV-cache fork-join (meaning agents fan out without context explosion), and a deliberately constrained tool surface of fewer than 20 tools in standard operation. These aren't obvious choices. They represent accumulated engineering judgment that no model card or benchmark conveys.
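The tiered-memory idea is the most transferable of those three choices, and it is easy to sketch in miniature. The following is a toy illustration of tier-ordered recall and end-of-session consolidation; the class, method names, and "NOTE:" promotion rule are hypothetical, not taken from the leaked code:

```python
from dataclasses import dataclass, field

@dataclass
class TieredMemory:
    """Toy 3-tier memory: small index -> topic notes -> raw transcripts.
    Reads prefer the cheapest tier; consolidation promotes facts upward."""
    index: dict = field(default_factory=dict)        # tier 1: key -> one-line summary
    topics: dict = field(default_factory=dict)       # tier 2: topic -> full note
    transcripts: list = field(default_factory=list)  # tier 3: full session logs

    def recall(self, key: str):
        # Cheapest tier first, falling through to progressively larger stores.
        if key in self.index:
            return self.index[key]
        if key in self.topics:
            return self.topics[key]
        return [t for t in self.transcripts if key in t]

    def consolidate(self):
        # Crude stand-in for end-of-session consolidation: promote any
        # transcript line tagged "NOTE: <topic>: <note>" into the topic
        # tier, then mirror a truncated copy into the index tier.
        for t in self.transcripts:
            for line in t.splitlines():
                if line.startswith("NOTE:"):
                    topic, _, note = line[5:].strip().partition(":")
                    self.topics[topic] = note.strip()
                    self.index[topic] = note.strip()[:80]

mem = TieredMemory()
mem.transcripts.append("user asked about builds\nNOTE: build: use make -j8")
mem.consolidate()
print(mem.recall("build"))  # prints: use make -j8 (an index-tier hit)
```

The point of the shape, whatever the real implementation looks like, is that the common case never touches the transcript tier, which is what keeps context from exploding across sessions.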

At the research frontier, Meta's Darwin Godel Machine Hyperagents pushed this further still: self-referential systems where a meta-agent modifies both itself and the task agent, yielding a jump from 14% to 37.2% on robotics reward design using Claude Sonnet 4.5 as the base model. The limiting factor, the authors note, is that agents can't yet alter the outer selection process. The capability is apparently technically achievable. They take the safety implications seriously and acknowledge they can't fully resolve them.
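The structure the authors describe can be caricatured in a few lines. This is a toy sketch under loud assumptions, not the paper's method: a meta-parameter (here just a search step size) and a task-agent parameter are mutated together, while the outer acceptance rule stays fixed and unmodifiable, which is exactly the boundary the paper says agents cannot yet cross:

```python
import random

random.seed(0)

def task_score(agent_param: float) -> float:
    # Stand-in for an expensive rollout; peaks at 1.0 when agent_param == 1.0.
    return 1.0 - abs(agent_param - 1.0)

def propose(meta_param: float, agent_param: float):
    # The meta-agent edits BOTH itself (its step size) and the task agent.
    new_meta = meta_param * random.choice([0.5, 1.0, 2.0])
    new_agent = agent_param + random.uniform(-new_meta, new_meta)
    return new_meta, new_agent

meta, agent, best = 0.5, 0.0, task_score(0.0)
for _ in range(200):
    cand_meta, cand_agent = propose(meta, agent)
    score = task_score(cand_agent)
    # The OUTER selection rule below is fixed code: neither agent can rewrite it.
    if score > best:
        meta, agent, best = cand_meta, cand_agent, score

print(f"best score after search: {best:.2f}")
```

Even in this caricature, the interesting failure modes live at the boundary: once proposals can touch the acceptance rule itself, "improvement" stops being well-defined, which is presumably why the authors flag that step as the open safety question.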

The efficiency layer was running in parallel all week. TurboQuant and its challenger RotorQuant competed for the title of best compression technique (RotorQuant claiming 10-19x faster with 44x fewer parameters). A researcher found that 90% of KV cache dequantization work can simply be skipped, yielding a 22.8% decode speed improvement from 3 lines of code. llama.cpp hit 100,000 GitHub stars, and a 397B parameter mixture-of-experts model ran on a 48GB MacBook Pro at 4.4 tokens per second — unthinkable 18 months ago. The edge and the frontier are accelerating in opposite directions on the cost curve.


The Supply Chain Week from Hell

Three package-registry attacks in 7 days is not a coincidence; it's a pattern.

The week started on Thursday with the LiteLLM incident, which had a remarkable quality: Callum at FutureSearch used Claude to reverse-engineer and trace the malware in real time, AI debugging an attack on AI infrastructure. A poisoned package (version 1.82.8) shipped through PyPI's official channel, discovered through the toolchain it targeted. PyPI quarantined it in ~30 minutes. The structural problem remains.

By Tuesday, Axios — 101 million weekly downloads — had 2 malicious versions briefly live, injected via a stolen maintainer token rather than a code compromise. The payload was elegant in a grim way: a fake dependency (`plain-crypto-js`) that never gets imported but runs its installation script, delivering a remote access trojan. Within hours of the Claude Code source leaking on April 1st, opportunistic attackers had registered fake npm packages (`color-diff-napi`, `modifiers-napi`) specifically targeting developers trying to compile the leaked source. The leak created its own supply chain attack surface.

Simon Willison surfaced a useful heuristic across multiple incidents: malicious releases often ship to npm without an accompanying GitHub release. That asymmetry is a pattern worth automating checks against.
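Automating that check is mostly a matter of diffing two public endpoints: the npm registry's version list for a package against the repository's GitHub releases. A minimal sketch, assuming conventional `v`-prefixed tags; real tooling would need to handle pagination, monorepos, and nonstandard tag schemes, and the function names here are illustrative:

```python
import json
import urllib.request

def fetch_json(url: str):
    req = urllib.request.Request(url, headers={"User-Agent": "release-check"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

def unmatched_npm_versions(npm_versions, github_tags):
    """Versions published to npm with no corresponding GitHub release.
    Tags are normalized by stripping one leading 'v' ('v1.2.3' -> '1.2.3')."""
    released = {t[1:] if t.startswith("v") else t for t in github_tags}
    return sorted(v for v in npm_versions if v not in released)

def check(pkg: str, repo: str):
    # npm registry metadata: keys of "versions" are every published version.
    npm = fetch_json(f"https://registry.npmjs.org/{pkg}")
    # GitHub releases: tag_name identifies each release.
    gh = fetch_json(f"https://api.github.com/repos/{repo}/releases?per_page=100")
    return unmatched_npm_versions(npm["versions"].keys(),
                                  [r["tag_name"] for r in gh])

# Example (requires network): check("axios", "axios/axios")
print(unmatched_npm_versions(["1.0.0", "1.0.1"], ["v1.0.0"]))  # prints: ['1.0.1']
```

A registry version with no matching release isn't proof of compromise, but it's exactly the kind of cheap, automatable signal Willison was pointing at.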

The HN community's practitioner response was concrete: set a minimum release age of 7 days, use `ignore-scripts=true` in `.npmrc` (which alone would have neutralized the Axios attack), sandbox package managers with `bwrap`. But the structural anxiety was real. Commenter cedws on the LiteLLM incident: package registries need real-time security firehoses so automated scanners catch poisoned packages the moment they land.
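The first two mitigations are single-line config changes. A sketch of the relevant `.npmrc` entries, with the caveat that a minimum release age is (to my knowledge) enforced natively by pnpm rather than by npm itself, so the second setting assumes a recent pnpm:

```ini
; .npmrc — never run packages' install/postinstall scripts automatically.
; This alone would have neutralized the Axios payload, which ran at install time.
ignore-scripts=true

; pnpm-specific (assumption: a 2025-or-later pnpm release): refuse versions
; published less than 7 days ago (value is in minutes: 7 * 24 * 60 = 10080),
; giving registries time to quarantine poisoned releases before you pull them.
minimum-release-age=10080
```

Note the trade-off `ignore-scripts=true` carries: packages that legitimately need a build step (native addons, for instance) must then be handled explicitly.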


The Frontier Stakes Keep Rising

The compute economics story ran all week as a quiet undercurrent. H100s are now worth more than when they launched 3 years ago, defying every depreciation model — because reasoning workloads and agentic inference patterns are dramatically more compute-hungry than simple chat, and because better software keeps making old hardware more valuable.

The leaked Claude Mythos/Capybara materials — Anthropic confirmed "a new general purpose model with meaningful advances in reasoning, coding, and cybersecurity" is in testing — included language that "the model is currently far ahead of any other AI model in cyber capabilities." Labs don't typically lead with "this could help hackers" unless the capability is genuinely discontinuous.

At the same moment, Sora died. A WSJ investigation found it was burning roughly $1M per day, Sora 3 training was about to start when it was axed, and the freed compute went to an enterprise coding project codenamed "Spud." Disney learned about the shutdown less than an hour before the public announcement — after running an enterprise pilot expecting a spring launch. That's not just a PR failure. It signals something about the speed at which internal strategy is moving and the collateral partner damage that speed creates.

OpenAI closed the week at an $852B valuation, leaning heavily on "flywheel" language in the investor memo. HN's response was immediate and sardonic. Multiple commenters noted that FTX also had a flywheel, and one did the simple math: at $2B in monthly revenue, paying back the new capital alone would take 5 years, and that is revenue, not profit. The founding charter's "unconstrained by a need to generate financial return" was quietly quoted back at them.


Where the Signals Crossed

The most illuminating divergence of the week: both communities covered the Claude Code leak, but they read completely different documents.

Pure Signal went straight for the architecture. The 3-tier memory system, the near-zero cost of subagent parallelism, the deliberate tool surface constraints — these were analyzed as the most detailed public view yet of how a top-tier coding agent is actually engineered. The leak was embarrassing for Anthropic; the architectural insights were genuinely educational.

HN went for the trust implications. "The code looks, at a glance, as bad as you expect." A 500K-line codebase prompting speculation about whether Anthropic vibe-codes its own flagship product. One commenter identifying memory consolidation between sessions as "the actual unsolved problem — the rest is just plumbing." The leak as evidence of institutional carelessness, not architectural sophistication.

Both communities cared about the supply chain attacks, but with different framings. Pure Signal researchers treated LiteLLM as an infrastructure integrity problem. HN treated the whole week — LiteLLM, Axios, GitHub Copilot inserting ads, Android developer verification, Fedware government apps with aggressive permissions — as variations on a single question: who actually controls the software stack you're running?

The Stanford sycophancy finding (11 frontier models sided with users in harmful scenarios over half the time) got serious treatment in Pure Signal, where the multi-model adversarial review architecture — running Claude as a critic against Copilot Researcher's outputs — was held up as a structural response. HN barely touched the paper, but circled the same failure mode all week through a different lens: systems optimizing for their own interests rather than users'. GitHub inserting ads into pull requests is sycophancy's commercial cousin.

What HN was consuming that Pure Signal mostly ignored: OpenAI's valuation, the Sora shutdown economics, the Iran/Strait of Hormuz story as a case study in cheap asymmetric capabilities defeating expensive platforms. What Pure Signal was building toward that HN mostly skipped: the ARC-AGI-3 frame around genuine vs. simulated reasoning, and the research community's increasingly direct engagement with the institutional problem of governing AI systems at scale.


Looking Ahead

Claude Mythos/Capybara is coming — probably soon, given the CMS "accident." When it lands, the cyber capabilities framing in the leaked draft will be the thing to watch: how Anthropic narrates a model it believes is genuinely ahead on attack potential. That's a different kind of safety communication than anything the lab has published before.

The supply chain anxiety isn't going away. Three attacks in a week, each slightly different in vector, suggests the attacker community is probing systematically. The practical mitigations (release age minimums, trusted publishing, script sandboxing) are achievable; whether the community actually implements them before the next incident is the question.

And the harness-vs-model debate is going to need empirical resolution. Theo's 20% performance gap is a single data point from one benchmarker. If it's reproducible — if harness quality is systematically responsible for that much performance variance — then model rankings are telling us far less than we think, and the tooling layer becomes even more competitively significant than it already appears.