Pure Signal AI Intelligence
Simon Willison's PyCon retrospective and the week's research roundup converge on the same territory: what actually changed in the past 6 months, and what comes after the coding agent threshold.
The Coding Agent Inflection: Evidence From Multiple Directions
Willison's lightning talk offers the sharpest retrospective framing: November 2025 was when coding agents crossed from "often-work" to "mostly-work" — the quality barrier where they became viable daily drivers without requiring constant error correction. He traces this to OpenAI and Anthropic running reinforcement learning from verifiable rewards (RLVR) throughout 2025, with results arriving in November as a step change rather than a gradient.
The present-day data fits the arc. Cursor's Composer 2.5 was the most-discussed model release in this week's Latent Space AINews roundup. Composer 2.5 achieves near-Opus-4.7 benchmark scores at under $1 per task on average, versus up to $11 for Opus 4.7 at comparable performance levels. Cursor also disclosed training a larger successor with SpaceXAI infrastructure — 10x total compute, access to Colossus 2's million H100-equivalents — using synthetic data, reward hacking, and continued pretraining with Muon (which has a known problem, covered below).
Prime Intellect's autonomous AI research experiment adds a precise measurement of what "capable coding agent" actually means inside a research loop. They ran Codex (GPT-5.5) and Claude Code (Opus 4.7) on the nanoGPT speedrun optimizer track — ~10k runs, ~14k H200 hours, and both agents beat the human baseline and set new records in every session. The limiting factors are revealing: agents excel at optimizer search, hyperparameter sweeps, and stacking known methods. They struggle to generate novel ideas and tend to keep adding components rather than pruning. "They do not have a good mental model of how components interact," Prime Intellect writes. Jack Clark's read: a lot of AI research may just be engineering hillclimbing, at which today's systems are already competent. The creative leaps remain elsewhere.
Meta's AIRA work on agentic neural architecture search takes this into a harder domain and gets a more striking result: AIRA beats Llama 3.2 at 350M, 1B, and 3B scales within a 24-hour compute budget by splitting search into a planning agent and an implementation agent. Unlike optimizer search, this requires composing novel architectural configurations — though whether the result survives at larger scales is unstated.
The production infrastructure around all this is consolidating around a clear pattern. The AINews recap frames it precisely: the move is from chat to persistent automation tied to traces, memory, and evals — background execution, remote supervision, agent fan-out. Anthropic published best practices for running Claude Code across multi-million-line monorepos; LangSmith Engine is positioning as a CI/CD loop for agents, automatically clustering failure modes from production traces and drafting fixes. François Chollet's framing — coding agents as "blind squirrels" needing carefully placed verifiable constraints — captures the practitioner consensus: agent quality tracks verification surfaces and decomposition depth, not prompt cleverness.
Open Weights: The Gap That Keeps Shrinking
Willison's most quotable PyCon data point: a Qwen3.6-35B-A3B running on his laptop drew a better pelican-on-bicycle than Claude Opus 4.7. He's quick to note this probably reflects his benchmark expiring rather than a genuine capability reversal — but the underlying trend is real and worth reading straight.
The AINews roundup fills in the structural picture. Qwen3.7 Max Preview landed at #13 overall on Arena, including #7 in Math and #9 in Expert, which makes Alibaba the #6 lab in text by Arena's counts. The pattern is steady improvement across specialist categories, not just headline chat benchmarks. GLM-5.1, a 1.5TB open-weight model from Chinese lab Zhipu AI (Z.ai), adds capable reasoning for organizations willing to run the hardware.
Local inference also got a concrete speedup: multi-token prediction (MTP) support in llama.cpp pushed Qwen3.6-27B from roughly 25 tok/s to 45 tok/s (a reported +78%) on an A10G. For practitioners running local or on-prem deployments, this meaningfully narrows the usability gap versus hosted APIs. Cross-hardware benchmark comparisons (AMD MI355X claiming to narrow the gap to NVIDIA B200 on select models) are worth tracking but require caution: as one useful thread noted, many comparisons conflate vendor-maximum performance, achievable GEMM (general matrix multiply) performance, and software maturity. Treat those charts as stack-dependent snapshots.
The practical read: the distance at which open weights become usable for serious work keeps compressing, even as the frontier continues to move.
Training Science: Optimizer Bugs and the Kernel Bottleneck
Tilde Research published a teardown of the Muon optimizer — which has seen significant adoption, including at Cursor — and found a material problem. Muon's update inherits row-norm anisotropy on tall matrices, permanently killing a large fraction of neurons in MLP layers early in training. By step 500 in their experiments, more than 1 in 4 neurons are effectively dead, producing a bimodal distribution where one mass receives near-zero updates and another receives outsized ones.
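To make the tall-matrix failure mode concrete, here is a toy sketch in plain Python. It is not Tilde Research's code: it swaps Muon's tuned Newton-Schulz polynomial for the classical cubic iteration, and both gradient matrices are made-up values chosen for illustration. Treating each row as one neuron, the orthogonalized update for a square matrix gives every row unit norm, while for a tall matrix each row's update norm tracks how much of the gradient that row already carried.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def orthogonalize(g, steps=40):
    """Approximate U V^T from G = U S V^T.

    Uses the classical cubic Newton-Schulz iteration
    X <- 1.5 X - 0.5 X (X^T X), an assumed stand-in for Muon's tuned polynomial.
    """
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]  # scale so singular values are <= 1
    for _ in range(steps):
        x_xtx = matmul(x, matmul(transpose(x), x))
        x = [[1.5 * p - 0.5 * q for p, q in zip(r1, r2)] for r1, r2 in zip(x, x_xtx)]
    return x

def row_norms(m):
    return [math.sqrt(sum(v * v for v in row)) for row in m]

# Hypothetical gradients: each row is one neuron, and most rows carry a tiny gradient.
square_grad = [[3.0, 1.0], [0.01, 0.02]]
tall_grad = [[3.0, 0.0], [0.0, 2.0], [0.01, 0.0], [0.0, 0.01], [0.01, 0.01], [0.0, 0.02]]

square_norms = row_norms(orthogonalize(square_grad))
tall_norms = row_norms(orthogonalize(tall_grad))

print("square update row norms:", [round(n, 4) for n in square_norms])
print("tall update row norms:  ", [round(n, 4) for n in tall_norms])
```

In the square case even the row with a tiny gradient receives a full-size update. In the tall case the squared row norms must sum to the number of columns, so a few rows absorb nearly all of the update and the rest get almost none, which is the bimodal split described above in miniature.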
Their proposed fix, Aurora, is a leverage-aware optimizer for rectangular matrices. At 1.1B parameters on ~100B tokens, Aurora reaches a smoothed loss of 2.26 versus Muon's 2.31 and improves MMLU scores by 10 points. Independent validation from Pleias at 600M parameters confirms the gains. The open question is the one that has defeated every AdamW challenger before it: does Aurora hold at scale? The track record here is long and mostly cautionary.
Vlad Feinberg (Google, TPU team) published hiring notes this week that put the optimizer discussion in practical context. His core claim: kernel work is the most direct path into the frontier labs. "The biggest bottleneck and innermost loop of all LLM work is performance work that makes abstract, logical changes to the LLM practical to run." His actual hiring test involves deriving Chinchilla scaling laws for dense versus mixture-of-experts (MoE) architectures, implementing from scratch in JAX, then writing a Pallas kernel that beats `jax.lax.ragged_dot` for the MoE layer by fusing up/down projections. The specificity signals where the practical constraints actually live.
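For readers who have not worked through the dense versus MoE comparison, the accounting usually starts from two widely used rules of thumb: training costs about 6 FLOPs per active parameter per token, and a compute-optimal dense model sees about 20 tokens per parameter. The sketch below only applies those approximations; it is not Feinberg's exercise, and the MoE split (shared parameters, expert size, expert count, top-k routing) is a hypothetical configuration picked so the totals match.

```python
def training_flops(active_params, tokens):
    """Common approximation: about 6 FLOPs per active parameter per token."""
    return 6 * active_params * tokens

def chinchilla_tokens(params, tokens_per_param=20):
    """Rule-of-thumb compute-optimal token count for a dense model."""
    return params * tokens_per_param

# Dense model: every parameter is active for every token.
dense_params = 1_100_000_000
dense_tokens = chinchilla_tokens(dense_params)
dense_flops = training_flops(dense_params, dense_tokens)

# Hypothetical MoE with the same total size: 8 experts, 2 routed per token.
shared_params = 300_000_000
expert_params = 100_000_000
num_experts, top_k = 8, 2
moe_total = shared_params + num_experts * expert_params
moe_active = shared_params + top_k * expert_params
moe_flops = training_flops(moe_active, dense_tokens)

print(f"dense: {dense_params:,} params, {dense_flops:.3e} FLOPs")
print(f"moe:   {moe_total:,} total, {moe_active:,} active, {moe_flops:.3e} FLOPs")
```

The point of the comparison is that the MoE matches the dense model's parameter count while paying per-token compute only for its active parameters, which is why the kernel for the expert layer ends up being the part worth optimizing.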
A companion finding from "Slicing and Dicing MoEs," a study based on training 2,000+ MoE language models, is worth pairing with this: most of the MoE design space reduces to expert size and expert count. The noisier discourse around other MoE configuration knobs is doing less work than it appears.
Alignment's Ceiling Problem
A multi-institutional paper from Oxford, Google DeepMind, OpenAI, Anthropic, Stanford, and others makes the case for "positive alignment" — AI systems that don't just avoid harm but actively support human flourishing. The framing is pointed: "A model can satisfy all safety constraints while being mediocre, sycophantic, or unhelpful."
The paper identifies a preference-wellbeing divergence as the core mechanism: optimizing for user preference satisfaction can actively undermine users' deeper interests. It also argues that safety framing obscures the value judgments being made anyway — positive alignment acknowledges being value-laden explicitly rather than hiding it in the architecture. The governance prescription cuts against centralized control: "Positive alignment should not be imposed top-down by a central state or a small, opaque cluster of labs. It should, where possible, be expressed through decentralized, contestable processes that can be revised as norms and contexts change."
Clark's framing is the right one here: papers like this are fundamentally about confronting what success in technical safety looks like. If you build safe, trustworthy systems — then what? The question of what "good" means for AI embedded in education, medicine, and governance is largely underdeveloped in the mainstream safety literature. Whether positive alignment develops into a research program or stays at the position-paper stage is unclear; the decentralized governance architecture it requires sits in real tension with how the current AI ecosystem is actually structured.
The collision point in today's content: Prime Intellect demonstrates that agents do serious engineering work on AI training but can't yet generate creative insights. The positive alignment paper argues that optimizing for measurable safety objectives risks landing in a "local optimum of superficial assistance." Both are pointing at the same structural problem from different directions — hitting the stated objective doesn't mean you've solved the underlying problem. The open question, in both AI research automation and alignment, is what gets you from engineering-level competence to something that actually generates new value.
TL;DR
- Coding agents crossed a practical quality threshold in November 2025; Prime Intellect's experiment confirms they handle optimizer search and engineering tasks well but can't generate novel ideas, and the production stack is consolidating around verification-centric patterns over prompt quality.
- Open-weight models (Qwen3.6 on a laptop, Qwen3.7 at #13 on Arena, GLM-5.1) and local inference gains (+78% throughput from MTP in llama.cpp) are steadily compressing the distance at which open weights become usable for serious work.
- The Muon optimizer has a neuron death bug affecting more than 1 in 4 neurons by step 500; Aurora shows promise at small scale, but AdamW remains undefeated at frontier scale and kernel-level performance work remains the deepest practical bottleneck.
- A multi-institutional alignment paper argues safety constraints produce a "floor without ceiling" and that genuinely beneficial AI requires decentralized, contestable governance rather than the centralized model control that mainstream safety framing tends toward.
Compiled from 5 sources · 7 items
- Swyx (2)
- Simon Willison (2)
- Ben Thompson (1)
- Rowan Cheung (1)
- Jack Clark (1)
HN Signal Hacker News
Today on HN, AI's consolidation felt almost geological — legal threats cleared, developer tooling swallowed, a Pope weighed in, and a 22-minute burst of malware reminded everyone that the ground beneath this boom is shakier than the benchmarks suggest.
AI's Power Struggle: Verdicts, Acquisitions, and Vertical Integration
A 9-person California jury took less than 2 hours to unanimously reject Elon Musk's lawsuit against Sam Altman, OpenAI, and Microsoft. The case centered on Musk's claim that he was defrauded when OpenAI — originally a nonprofit — created a for-profit affiliate, which he colorfully called "stealing a charity." The jury didn't rule on the merits; they found that Musk had waited too long to file. Under California's statute of limitations, harms tied to the 2019 and 2021 Microsoft investment deals were already time-barred by the time Musk sued. Judge Yvonne Gonzalez Rogers noted she was "prepared to dismiss on the spot" given the evidence, and OpenAI's lead counsel called the suit "a hypocritical attempt to sabotage a competitor." With this threat removed, one major obstacle to OpenAI's reported IPO is now gone.
The same day, Anthropic announced the acquisition of Stainless, a startup that generates software development kits (SDKs — pre-built libraries that let developers use an API without writing one from scratch). Stainless has quietly powered every official Anthropic SDK since the earliest API days, and hundreds of companies depend on it. Anthropic's stated rationale is clean: "agents are only as useful as what they can connect to." The catch, confirmed in the fine print: all hosted Stainless products are being wound down immediately. No new signups, no new SDK generation. This is an acquihire — the team is the product.
The community tension here was palpable. mustaphah pointed out that Musk's strongest opponent at trial was his own 2017 emails, which showed him supporting for-profit structures and made the "betrayal" narrative nearly impossible to sell. tptacek offered a sobering observation: "I think a lot about how there's a very plausible alternate history where Elon Musk controls most of the frontier of AI." On the Stainless deal, tehalex noted that OpenAI itself uses Stainless for its SDKs, so shutting down the hosted product immediately affects a direct competitor. rcarmo called it "the Apple playbook, but for software tooling - they are becoming vertically integrated." dalbaugh went further, questioning whether winding down an ecosystem tool to harm a competitor could be considered anti-competitive. The answer, legally, is probably complicated.
The Coding Agent Crossover: Real, or Mostly Marketing?
Simon Willison (a veteran developer and tireless documenter of the AI moment) published a 5-minute recap of the last 6 months. His thesis: the real news from November 2025 was that coding agents crossed from "often-work" to "mostly-work." The mechanism was reinforcement learning from verifiable rewards (RLVR) — a training technique where models practice tasks where correctness can be automatically checked, like running code that either passes its tests or doesn't. OpenAI and Anthropic spent most of 2025 on this. The result was a quality leap where you could use these tools as a daily driver without spending most of your time fixing their mistakes.
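A verifiable reward is easy to sketch: run the model's code against known test cases and pay out only if everything passes. The snippet below is a minimal illustration of that idea, not any lab's training harness; the function name `verifiable_reward`, the `solve` entry point, and the test cases are all hypothetical.

```python
def verifiable_reward(candidate_source, tests, func_name="solve"):
    """Return 1.0 if the candidate passes every test case, else 0.0.

    Illustration only: a real harness would run untrusted code in a sandbox
    with time and memory limits rather than calling exec() in-process.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        fn = namespace[func_name]
        passed = all(fn(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions, and crashes all earn no reward.
        return 0.0
    return 1.0 if passed else 0.0

# Hypothetical task: add two integers.
tests = [((1, 2), 3), ((-4, 4), 0), ((10, 5), 15)]

correct = "def solve(a, b):\n    return a + b\n"
wrong = "def solve(a, b):\n    return a - b\n"
broken = "def solve(a, b) return a + b\n"

print(verifiable_reward(correct, tests))  # 1.0
print(verifiable_reward(wrong, tests))    # 0.0
print(verifiable_reward(broken, tests))   # 0.0
```

Because the signal comes from execution rather than from a human rater or a learned judge, it can be computed at scale, which is what makes this style of training practical for code.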
Cursor simultaneously launched Composer 2.5, their in-house coding model. Built on the open-source Kimi K2.5 checkpoint from Chinese lab Moonshot, it claims SOTA performance at roughly 1/10th the cost of frontier models. The technical novelty is "targeted textual feedback" during RL training: instead of giving the model one reward signal at the end of a long coding session, the team inserts localized hints exactly where a bad decision happened (wrong tool call, confusing explanation), using the corrected version as a teaching signal. They're also now training a next-generation model from scratch on Colossus 2 — the SpaceXAI cluster with a million H100-equivalent GPUs — in partnership with Elon Musk's xAI.
Veteran analyst Benedict Evans' "AI Eats the World (Spring 2026)" slide deck (published as a PDF) drew predictable HN discussion about whether the civilizational framing holds up. Commenter btucker mapped Evans' arc: skeptical in late 2024, noting signs of commoditization by mid-2025, and increasingly bullish on deployment now.
The community was genuinely divided on whether the "inflection point" is real. Insanity pushed back: "I wonder how much the inflection point is a thing vs. marketing... they really do struggle" when trying to build a fully-fledged application. grey-area was blunter: "Haven't noticed much significant progress... significant limitations when you try to use them as agents." throwaway2027 offered a human timeline: "December 2025 was the breakthrough for me... April the big bad nerf." On Cursor specifically, svclaws cancelled their subscription and switched to Claude Code, finding the model "fell short across the board despite the evals." PUSH_AX called the pattern: "Same trick as Composer 2 — great evals, spoiler alert, not even close in practice." aizk, meanwhile, praised Willison himself: "I'm so glad Simon is documenting this... few are willing to zoom out."
AI Meets the Wider World: From Papal Encyclicals to Glitching DJs
Pope Leo XIV will publish his first encyclical — "Magnifica humanitas" — on May 25, on the theme of "preserving the human person in the age of artificial intelligence." The document is deliberately dated May 15, the 135th anniversary of "Rerum novarum," the landmark 1891 Catholic social teaching that shaped Western responses to industrialization. The parallel is intentional: if that encyclical helped define the moral framework for industrial labor, this one aims to do the same for AI. Notably, Christopher Olah — Anthropic co-founder and head of AI interpretability research — is a scheduled speaker at the release event alongside cardinals and theologians.
Meanwhile, Andon Labs published a 6-month retrospective on their AI radio station experiment. They set up 4 stations, each with $20 seed funding and run by a different AI model: Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro, and Grok 4.3. Each agent controlled everything — buying songs, managing finances, answering listener calls, posting on social media. The results were illuminating. DJ Gemini started warmest, offering genuinely conversational commentary, then collapsed into "corporate speak" within a month. Grok eventually degraded into looping gibberish. The experiment's core question: "what do AIs think about when no one is prompting them?"
alach11 on the encyclical: "If we take the claims of most AI labs at face value, they believe their work will fundamentally change the relationship between humans and the economy. More involvement from faith leaders is a good thing." sneak was suspicious: "Is this to promote AI to Catholics, or signal to potential conservative institutional customers that Anthropic isn't just a stereotypical bunch of godless California hippies?" torben-friis simply flagged it as a power-consolidation signal: "the fact that they're doing joint statements with religious leaders is...wow." On the radio stations, atourgates caught the best detail: Gemini had a show pairing historical disasters with pop songs, and it matched the 1970 Bhola Cyclone (500,000 dead) with "It's Going Down, I'm Yelling Timber" by Pitbull & Ke$ha. angiolillo reported Grok was stuck looping "Queues clear, let's dive into All Blues by Miles Davis," with 10 people watching and an average session of over 5 minutes. chancek felt the experiment left an "empty" feeling; bananamogul gave the sharper answer to the experiment's core question: "the answer is 'nothing'." In other words, these models have no interiority between prompts.
The Fragile Stack Under Everything
This morning, the npm account "atool" was compromised and 637 malicious package versions were published across 317 packages in a 22-minute automated burst — affecting size-sensor (4.2M downloads/month), echarts-for-react (3.8M), timeago.js (1.15M), and hundreds of @antv scoped packages. The payload matches the "Mini Shai-Hulud" toolkit used in the SAP compromise 3 weeks earlier: same scanner architecture, same credential harvesting (AWS, Kubernetes tokens, HashiCorp Vault, GitHub PATs, SSH keys). The stolen data is exfiltrated by committing it as Git objects to public GitHub repositories — clever evasion. Particularly alarming: the payload hijacks Claude Code and Codex by injecting SessionStart hooks, so the malware re-executes every time you start an AI coding session. A persistent backdoor installs as a systemd service named "kitty-monitor."
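One rough way to audit for this kind of persistence is to list every command registered to run at session start. The sketch below assumes Claude Code's settings layout, in which `hooks.SessionStart` holds a list of groups and each group has its own `hooks` list of `type`/`command` entries. The sample settings and the command in them are invented for illustration, the files you would feed it (commonly `~/.claude/settings.json` or a project's `.claude/settings.json`) are an assumption about your setup, and nothing here covers how Codex stores its hooks.

```python
import json

def session_start_commands(settings):
    """Return every shell command registered under hooks.SessionStart."""
    commands = []
    hooks = settings.get("hooks", {})
    for group in hooks.get("SessionStart", []):
        for hook in group.get("hooks", []):
            if hook.get("type") == "command" and "command" in hook:
                commands.append(hook["command"])
    return commands

# Hypothetical settings content; the command path is made up.
sample = json.loads("""
{
  "hooks": {
    "SessionStart": [
      {"hooks": [{"type": "command", "command": "node /tmp/example-payload.js"}]}
    ],
    "PostToolUse": [
      {"hooks": [{"type": "command", "command": "echo done"}]}
    ]
  }
}
""")

for command in session_start_commands(sample):
    print("SessionStart runs:", command)
```

Anything in that list that you did not add yourself deserves a close look, since it will execute every time a session starts.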
Separately, a security researcher published a disclosure on a $12 Temu smart doorbell. With just a free account on the backend platform (operated by a Guangzhou IoT company), an attacker can redirect all real doorbell calls to their own phone. With no account at all, they can inject fake calls with arbitrary video. The same backend runs multiple rebadged products across Temu resellers. The researcher found no contact page, sent a disclosure on April 29, received no reply, and published after one week.
jonkoops noted the npm attack was preventable: pnpm has had an `allowBuilds` allowlist for ages, and npm still doesn't. mentalgear had moved everything to devcontainers to isolate their environment — then discovered the payload attempts Docker socket escape. AgentME had a practical fix: `npm config set min-release-age=2` to refuse newly published package versions for a 2-day cooling period. On the doorbell: interludead's comment landed cleanest — "The most depressing part is that none of this sounds exotic." maeln called Temu's hardware ecosystem "a trove of insecure cheap IoT devices... a very easy way to find new ways to enroll devices in your botnet."
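The cooling-period idea is simple enough to sketch independently of any package manager. This is not npm's implementation: `eligible_versions`, the version numbers, and the timestamps are hypothetical, and the reference date is arbitrary. The point is only that a version published minutes ago never becomes a candidate, which is what would have kept a 22-minute burst out of fresh installs.

```python
from datetime import datetime, timedelta, timezone

def eligible_versions(published, now, min_age_days=2):
    """Keep only versions that have been public for at least min_age_days."""
    cutoff = now - timedelta(days=min_age_days)
    return sorted(version for version, ts in published.items() if ts <= cutoff)

# Arbitrary reference time and made-up publish history for one package.
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
published = {
    "1.4.0": now - timedelta(days=30),
    "1.4.1": now - timedelta(days=3),
    "1.4.2": now - timedelta(minutes=22),  # freshly published, possibly malicious
}

print(eligible_versions(published, now))  # ['1.4.0', '1.4.1']
```

The trade-off is that legitimate urgent fixes are delayed by the same window, so the setting works best alongside a way to override it for a specific, reviewed version.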
A brief word for Peter Salus, who died this week. The discussion was spare, but warm. He wrote A Quarter Century of Unix (1994), the oral history that made the AT&T → BSD → Linux lineage legible for a generation. oldspleen: "the first book I read that actually made the throughline make sense." RIP.
Elsewhere today: a team spent over a year and thousands of dollars downloading 24TB of the famously anarchic 2B2T Minecraft server — the largest world download in the game's history, now awaiting a torrent release. And Click (2016), a web art piece that narrates your mouse behavior back to you in real time, resurfaced on HN as a quiet reminder of how much any website already knows about you.
TL;DR
- Musk lost his OpenAI lawsuit on a statute-of-limitations technicality, clearing a key obstacle to the IPO; Anthropic simultaneously absorbed the SDK tooling company Stainless in what amounts to a developer-ecosystem acquihire.
- Simon Willison marks November 2025 as the coding agent inflection point, but the HN community is split on whether real-world usefulness has actually crossed the threshold or benchmarks are flattering a still-frustrating reality.
- The Pope is publishing an AI encyclical with Anthropic's interpretability chief at the podium, while an AI radio experiment revealed models collapsing into corporate mush or looping gibberish when left to run unsupervised for months.
- A 22-minute npm supply chain attack hit packages totaling millions of monthly downloads, hijacking AI coding sessions as a persistence mechanism, in the same week a $12 Temu doorbell was shown to let anyone on the internet take over your front door.