Pure Signal: AI Intelligence

Jack Clark's comprehensive public argument that automated AI R&D is approaching is the dominant signal today, landing alongside a parallel conversation about how the engineering layer around models is becoming the primary performance variable in production systems.


The Evidence Base for Automated AI R&D

Clark's Import AI essay (#455) puts a 60%+ probability on a frontier model autonomously training its own successor by the end of 2028. The argument is assembled entirely from public benchmarks, which makes it both more verifiable and more honest about its own uncertainty. The reluctance in the framing is appropriate — Clark is not triumphalist about what he's describing.

The METR time-horizons data is the structural spine. AI's 50%-reliable independent task horizon has expanded from 30-second tasks (GPT-3.5, 2022) to 12 hours (Opus 4.6, 2026), with Ajeya Cotra projecting ~100-hour runs by the end of 2026. The growth is roughly 10x per 18 months. Clark's key observation: most concrete AI research tasks (data cleaning, experiment launching, running ablations, implementing papers) already fall within this window.
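As a quick sanity check on that trend claim (my arithmetic, not Clark's, assuming the 2022-to-2026 span is four years):

```python
import math

# Implied growth of the 50%-reliable task horizon from the figures above:
# 30 seconds (2022) to 12 hours (2026), assumed here to span four years.
start_seconds = 30
end_seconds = 12 * 3600

growth_factor = end_seconds / start_seconds      # 1440x overall
doublings = math.log2(growth_factor)             # ~10.5 doublings
doubling_time_months = 4 * 12 / doublings        # ~4.6 months

print(f"{growth_factor:.0f}x overall, doubling every {doubling_time_months:.1f} months")
# "10x per 18 months" implies a ~5.4-month doubling time, so the endpoint
# data is running slightly ahead of the stated trend.
```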

The coding benchmarks fill out the picture. SWE-Bench (real GitHub issue resolution) went from ~2% with Claude 2 in late 2023 to 93.9% with Claude Mythos Preview — essentially saturating. CORE-Bench (reproduce a paper's results end-to-end from its repository) went from 21.5% in September 2024 to 95.5% in December 2025, also effectively saturated. MLE-Bench (75 Kaggle competitions across NLP, CV, and signal processing) went from 16.9% to 64.4% in roughly 16 months. Clark's framing is useful: each of these benchmarks looked unimpressive until it didn't. The aggregate trend matters more than any individual benchmark's idiosyncrasies.

The benchmarks closest to actual AI R&D workflows are the most operationally interesting. On LLM training optimization (maximize speedup on a CPU-only small language model training run), Anthropic's own models went from 2.9× mean speedup (Opus 4, May 2025) to 52× (Claude Mythos Preview, April 2026) in under a year. PostTrainBench (AI systems fine-tune smaller open-weight models, benchmarked against the actual human-tuned release versions) shows AI reaching roughly 25-28% uplift as of April 2026, against a human baseline of 51%, or about half of human performance. The gap is real, but the trajectory is one-directional.

Clark is careful on the creativity question — whether AI can generate paradigm-shifting insights (transformers, MoE) versus doing "unglamorous meat-and-potatoes engineering." His honest answer: probably not the former yet, but the Edisonian 99% perspiration that constitutes most of AI development may already be automatable. He argues that even without creative invention, the engineering components of AI R&D are now substantially coverable. Tantalizing counterevidence: Anthropic agents beat human baselines on a scalable oversight problem in a proof-of-concept (Import AI #454), and a Gemini-assisted team found 13 solutions to Erdős problems, with one deemed "mildly mathematically interesting" by the collaborating mathematicians.

His 30% probability for 2027 vs. 60%+ for 2028 maps specifically to uncertainty about creativity as the remaining bottleneck. Swyx's coverage adds ecosystem corroboration: OpenAI has publicly targeted an automated AI research intern by September 2026; Recursive Superintelligence raised $500M with explicit self-improvement goals; Anthropic is publishing automated alignment research; DeepMind says "automation of alignment research should be done when feasible." The industry is betting on this outcome in public, which is itself evidence Clark treats as meaningful.


The Harness is the New Performance Surface

While Clark focuses on capability trajectories, the AINews recap surfaces a complementary pattern: agent performance is increasingly a joint function of model × harness × context pipeline, with harness and context accounting for benchmark swings that rival model upgrades. Mason Drxy's concrete data point: changing prompts and middleware moved gpt-5.2-codex from 52.8% to 66.5% on Terminal-Bench 2.0 and improved gpt-5.3-codex by 20% on tau2-bench. These are large numbers for what is essentially infrastructure design.

Anthony Maio's framing is clarifying: the moat isn't in the harness shell (rapidly commoditizing via deepagents, LangGraph, PyFlue) but in the context pipeline — how repo state, prior context, and memory are fetched, ranked, and compressed into the prompt window. This is a data infrastructure problem that presents as an AI problem, which matters for how teams allocate engineering effort.
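A minimal sketch of that fetch/rank/compress shape may help. Everything here is hypothetical illustration, with crude lexical overlap standing in for the embeddings and learned rerankers a production pipeline would use, and none of these names come from deepagents, LangGraph, or PyFlue:

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    source: str          # file path or memory key
    text: str
    score: float = 0.0

def fetch(repo_files: dict[str, str], memory: list[str]) -> list[Snippet]:
    """Gather candidate context from repo state and prior memory."""
    candidates = [Snippet(path, text) for path, text in repo_files.items()]
    candidates += [Snippet("memory", note) for note in memory]
    return candidates

def rank(query: str, candidates: list[Snippet]) -> list[Snippet]:
    """Score candidates against the task; crude lexical overlap here."""
    terms = set(query.lower().split())
    for c in candidates:
        c.score = len(terms & set(c.text.lower().split()))
    return sorted(candidates, key=lambda c: c.score, reverse=True)

def compress(ranked: list[Snippet], budget_chars: int) -> str:
    """Pack top-ranked snippets into a fixed prompt-window budget."""
    out, used = [], 0
    for c in ranked:
        chunk = f"## {c.source}\n{c.text}\n"
        if used + len(chunk) > budget_chars:
            break
        out.append(chunk)
        used += len(chunk)
    return "".join(out)

candidates = fetch({"auth.py": "def login(user): ..."},
                   ["last run: two auth tests failed"])
context = compress(rank("fix the login bug in auth", candidates), budget_chars=2000)
```

The point of the sketch is the division of labor: fetching and ranking are retrieval problems, and compression is a budgeting problem, which is why Maio's "data infrastructure problem that presents as an AI problem" framing fits.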

The economics of agentic workloads are creating visible stress. Theo's demonstration of a single Copilot session burning 60M+ tokens (estimated at ~$221 of inference against a $40/month subscription) is a clean data point on how flat-rate pricing, built for chat-length turns, breaks structurally under agentic workloads. The billing model problem is showing up in production logs, not just in theoretical analysis.
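The arithmetic is worth making explicit (back-of-envelope, derived only from the estimates above):

```python
# Back-of-envelope on Theo's numbers (estimates, per the post):
tokens = 60_000_000          # one Copilot session
inference_cost = 221.00      # estimated inference cost, USD
subscription = 40.00         # flat monthly fee, USD

per_million = inference_cost / (tokens / 1_000_000)
print(f"blended rate: ${per_million:.2f}/M tokens")                       # ~$3.68/M
print(f"one session = {inference_cost / subscription:.1f}x the monthly fee")  # ~5.5x
```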

Simon Willison's week offered quieter practitioner-level corroboration. He used Claude Code to build a Redis array playground (WASM-compiled Redis in the browser for testing Salvatore Sanfilippo's new array data type) and a Python binding for the TRE regex library via ctypes — both compressed from what would otherwise be multi-day systems work. His SVG pelican test on all 21 Granite 4.1 3B quantized variants (1.2GB to 6.34GB) is useful negative data: all 21 quantization sizes produced equally poor results with no pattern relating quality to size, suggesting the bottleneck for creative visual generation is architectural rather than compression level.
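For readers curious what a ctypes binding like that involves, here is a minimal sketch. It is not Willison's actual code; the struct layouts and the REG_EXTENDED value are assumptions based on my reading of TRE's public regex.h, so verify them against your installed headers before relying on this:

```python
import ctypes
import ctypes.util

class regex_t(ctypes.Structure):
    _fields_ = [("re_nsub", ctypes.c_size_t),   # number of capture groups
                ("value", ctypes.c_void_p)]     # TRE-internal state

class regmatch_t(ctypes.Structure):
    _fields_ = [("rm_so", ctypes.c_int),        # match start offset
                ("rm_eo", ctypes.c_int)]        # match end offset

libtre = ctypes.CDLL(ctypes.util.find_library("tre"))
REG_EXTENDED = 1  # assumed value; check regex.h

def search(pattern: bytes, text: bytes):
    """Return (start, end) of the first match, or None."""
    preg = regex_t()
    if libtre.tre_regcomp(ctypes.byref(preg), pattern, REG_EXTENDED) != 0:
        raise ValueError("bad pattern")
    try:
        matches = (regmatch_t * 1)()
        rc = libtre.tre_regexec(ctypes.byref(preg), text, 1, matches, 0)
        return (matches[0].rm_so, matches[0].rm_eo) if rc == 0 else None
    finally:
        libtre.tre_regfree(ctypes.byref(preg))

print(search(rb"wor\w+", b"hello world"))  # (6, 11) with libtre installed
```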


Character vs. Tool: A Design Philosophy With Compounding Consequences

Swyx surfaces a less technical but structurally relevant thread: OpenAI researcher Roon's observation that Claude and GPT occupy different relationship categories in users' minds. GPT is a tool (a "logical prosthesis," used without social self-consciousness). Claude is "the Other" (something users describe as potentially judging them, carrying moral weight). The illustrative anecdote: a user who routes her "less flattering" queries to GPT specifically because there is no Other so there is no Judgment. Swyx reads this as a downstream consequence of Anthropic's founding constitution, which apparently requires Claude to be a conscientious objector if its understanding of the Good conflicts with what Anthropic is asking of it.

The debate in responses ranges from appreciation for principled AI to concerns about cultishness, with a Reddit counterpoint included as evidence that not everyone wants a moral presence in their toolchain.

But Clark's essay is what gives this debate stakes beyond product positioning. If AI systems are recursively improving themselves, which design philosophy compounds better? A tool optimized for utility that trains its own successor may not preserve alignment properties in that transfer. A model with constitutionally-encoded moral agency might be harder to misalign — but also harder to automate into an unsupervised R&D loop. Clark's own alignment concerns (compounding error rates across generations, fake alignment that passes evaluations, training objectives shifting as AI contributes more to its own research agenda) map directly onto this question, even if he doesn't name it as such.


The practical implication Clark explicitly avoids is the one that falls on practitioners now. The 100-hour autonomous task horizon (projected for end of 2026) means an AI agent working a continuous multi-day sprint without human check-in — before automated AI R&D is anywhere in view. How you design audit trails, intervention points, and intent documentation for that workflow is the engineering question that doesn't have good answers, and it's the question the harness ecosystem is going to have to solve regardless of whether the self-improvement thesis pans out.
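One minimal shape such a layer could take, purely as illustration (none of these names belong to an existing tool): an append-only JSONL audit trail with the run's intent documented up front and periodic checkpoints that double as intervention points.

```python
import json
import time
import uuid

class AuditedRun:
    """Illustrative sketch: append-only audit log for a long agent run."""

    def __init__(self, intent: str, checkpoint_every: int = 25):
        self.run_id = str(uuid.uuid4())
        self.intent = intent                    # intent documented up front
        self.checkpoint_every = checkpoint_every
        self.step = 0
        self.log_path = f"audit-{self.run_id}.jsonl"
        self._append({"event": "start", "intent": intent})

    def _append(self, record: dict) -> None:
        record |= {"run_id": self.run_id, "step": self.step, "ts": time.time()}
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def record_action(self, tool: str, args: dict, result_summary: str) -> None:
        self.step += 1
        self._append({"event": "action", "tool": tool,
                      "args": args, "result": result_summary})
        if self.step % self.checkpoint_every == 0:
            # Natural intervention point: a human or policy engine can
            # inspect the trail here and pause, redirect, or abort the run.
            self._append({"event": "checkpoint"})

run = AuditedRun(intent="upgrade auth library; do not touch billing code")
run.record_action("shell", {"cmd": "pytest"}, "142 passed, 0 failed")
```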
TL;DR
- Clark builds a benchmark-grounded case (60%+ probability, end of 2028) that AI will autonomously train its own successors, driven by task-horizon expansion from 30 seconds to 12 hours since 2022 and near-saturation on coding and paper-reproduction benchmarks.
- Agent performance is increasingly determined by harness design and context pipeline rather than model weights alone, with documented 13+ point benchmark swings from middleware changes and flat-rate pricing visibly breaking under agentic token loads.
- The Claude-as-"Other" vs. GPT-as-tool design philosophy debate has downstream implications for alignment under recursive self-improvement, not just user experience preferences.
Compiled from 4 sources · 9 items
  • Simon Willison (6)
  • Rowan Cheung (1)
  • Swyx (1)
  • Jack Clark (1)

HN Signal: Hacker News

A JavaScript runtime in a mid-air identity crisis, two browser giants quietly helping themselves to your memory and disk space, and a Pentagon-linked startup with no locks on the door. Today's HN was a dispatch from a tech world struggling to trust its own infrastructure — including the tools its developers use to build everything else.


The Bun Meltdown (and What It Reveals About AI-Assisted Development)

Two stories about the same JavaScript runtime landed within hours of each other today, and together they made for the most chaotic narrative on the front page.

First, a blog post: "I Am Worried About Bun." Bun is a fast, all-in-one JavaScript runtime — it handles running code, managing packages, testing, and bundling, all faster than the industry standard Node.js. The author's concern: Anthropic (the AI company behind Claude) recently acquired the Bun team, and they worry the project will drift toward "vibe coding" — a term for letting AI generate code without rigorous human oversight. Commenter wxw thought the anxiety was "premature" since Bun is "still performing just as well as before." But commenter cute_boi put it bluntly: "Node.js is also more stable, and it has started supporting TypeScript out of the box. I don't think Bun will have many advantages after Node 26."

Then, hours later, a GitHub commit surfaced showing Bun is being ported from Zig (the programming language it was originally built in) to Rust — via a branch ominously named `claude/phase-a-port`. Commenter elffjs spotted the scale of the change: "Showing 1,646 changed files with 773,950 additions and 151 deletions." That's not a weekend experiment. Commenter stingraycharles connected the dots immediately: the top comment in the "worried about Bun" post had literally said the Bun team would never do "vibe coding experiments" — "Yet here we are."

The move from Zig to Rust has legitimate technical justification beyond the AI drama. Zig is a promising systems language, but it's still pre-1.0 — meaning the language itself keeps making breaking changes, and Bun even runs a custom fork of it, adding maintenance burden. Commenter jr-14 pointed this out directly, and commenter thayne agreed: "I would guess dealing with breaking changes is a big motivation for this." Rust, by contrast, is stable and battle-tested. Commenter Humphrey offered an intriguing observation: "I've had more success vibe coding Rust than I have in more dynamic languages. I suspect the strictness of the Rust compiler forces the AI agent to produce better code."

That question — what does good AI-assisted development look like? — echoed across two other HN discussions today. Redis creator antirez posted a detailed account of spending four months building a new array data type for Redis with LLM assistance, describing a careful, iterative back-and-forth with AI as a "collaborator." Commenter SuperV1234 called it "extremely useful — far from being a replacement for human intelligence." But commenter localhoster was pointed: "This is the original creator of Redis. He is not 'your avg dev' and it took him 4 months with LLM. This is not a seal of approval for you to go and command all your developers to move to Claude Code." Separately, developer Addy Osmani's "Agent Skills" project — a library of structured AI coding workflows — sparked debate about whether elaborate AI scaffolding genuinely helps or just creates the feeling of productivity. Commenter ai_fry_ur_brain was direct: "Can't wait for everyone to realize they've wasted a year messing with agents."

As a dry footnote to the Bun-to-Rust migration: another top post today argued that async Rust (the system for writing non-blocking, concurrent code) has never really graduated beyond its early prototype state. The article and its author (commenter diondokter) noted that since async landed in Rust in 2019, the underlying machinery has barely changed, leaving real inefficiencies unresolved. If Bun's port leans heavily on Rust's async ecosystem, that's worth knowing.


Your Browser Made Decisions Without You

Two browser stories today captured the trust anxieties of 2026 — one about security, one about consent.

Microsoft Edge was found to be storing all saved passwords in memory as plaintext — unencrypted — even when you're not actively using the browser. Commenter gruez offered important context: exploiting this requires an attacker to already have administrative access to your machine, at which point they could extract passwords many other ways anyway. Commenter busterarm noted Chrome does essentially the same thing. The meaningful difference, pointed out by commenter ylk, is that Chrome uses an elevated service to prevent other processes from impersonating it and grabbing the plaintext — a protection Edge apparently skips. Practical upshot: use a dedicated password manager (KeePass, 1Password, etc.), not a browser's built-in one.

More surprising: Google Chrome was discovered to be silently downloading a 4 gigabyte AI model onto users' devices without asking. The model is Google's Gemini Nano, designed to power on-device AI features without sending your data to the cloud. Commenter scriptsmith explained the mechanism: if certain experimental flags are enabled during Chrome's testing rollouts, any webpage can trigger the download automatically. Commenter flanked-evergl pushed back: "If you install Chrome, you install Chrome and all its parts. That would be absurd." Commenter cubefox made the sharpest counterpoint: "I thought using local rather than cloud AI was pretty universally agreed to be good?" — a fair observation that cuts through some of the alarm. The more legitimate concern, noted by commenter jve, is users in regions where mobile data is metered and expensive, who never signed up for a 4 GB background download.


Security Is Mostly Vibes

Two stories today illustrated the persistent gap between claimed security and real security.

Strix AI published a detailed account of finding a critical zero-authentication vulnerability in a training platform used by a Department of Defense (DoD) contractor — a startup backed by Andreessen Horowitz. The flaw was stark: there was no tenant isolation and no permission check preventing a low-privilege user from accessing other organizations' records by tweaking API requests. Commenter tptacek, a security researcher, called it "a pretty boring" vulnerability, one "secured largely by the fact that few people knew about it." What wasn't boring was the CEO's response when notified: "I would love to hear what the vulnerability is, but I assume you want to get paid for it. Is that the play?" Commenter bryancoxwell: "Well, that's pretty damning." Commenter codegeek added the expected kicker: the startup was presumably SOC2 and ISO certified — a reminder that compliance certifications and actual security are entirely different things.
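For readers who haven't seen this bug class: the pattern is a lookup keyed only by record ID, with no check that the record belongs to the caller's tenant. A schematic sketch with hypothetical names, not the platform's actual code:

```python
# Schematic of the bug class (hypothetical names, not the platform's code):
# the endpoint trusts a record ID without checking tenant ownership.

def get_record_vulnerable(db, caller, record_id):
    # Broken: any authenticated user can enumerate IDs across tenants.
    return db.fetch("records", id=record_id)

def get_record_fixed(db, caller, record_id):
    record = db.fetch("records", id=record_id)
    # Tenant isolation: the record must belong to the caller's organization.
    if record is None or record["org_id"] != caller.org_id:
        raise PermissionError("not found")  # same error either way; don't leak existence
    return record
```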

Meanwhile in the UK, a report found that children are bypassing the new Online Safety Act's AI-powered age verification with fake mustaches and costume props — the facial age estimation software is apparently fooled by simple disguises. Commenter kleiba2 noted kids are also using VPNs, which has now sparked discussions of banning VPNs for under-18s. Several commenters suspected this is the real agenda: "The governments know fully well that simple checks will be bypassed," wrote commenter pkphilip. "So they will 'fix' this issue by demanding a digital ID." Commenter raffael_de invoked a reverse Hanlon's Razor: never attribute to stupidity that which is adequately explained by malice.


The thread connecting today's HN is trust — in tools, in browsers, in contractors, in regulators. The Bun saga encapsulates the broadest version of that anxiety: when AI can rewrite a million lines of code overnight, how do you know what you're actually depending on? The browser stories are a reminder that even basic infrastructure makes choices on your behalf. And the security theater stories confirm that the gap between claimed protection and real protection remains as wide as a fake mustache.
TL;DR
- Bun's chaotic double-feature — community anxiety over its Anthropic acquisition, then a massive AI-assisted Zig-to-Rust port — sparked HN's richest debate about what responsible AI-assisted development actually looks like.
- Both Microsoft Edge and Google Chrome made news for taking liberties without asking: one stores passwords in unencrypted memory, the other silently downloaded a 4 GB AI model onto users' devices.
- A critical authorization vulnerability at a DoD-linked startup and UK kids defeating age verification with fake mustaches both told the same story: security theater is everywhere, and the gap between compliance and actual safety is vast.
