Pure Signal AI Intelligence

Today is dominated by the Claude Opus 4.7 launch — but the more interesting content sits underneath it: a debate about what the benchmarks actually measure, OpenAI's quiet pivot toward domain-specialized reasoning, and Dwarkesh Patel's detailed notes on why pretraining runs fail and whether distillation can be stopped.


Opus 4.7: Real Gains, Contested Tradeoffs

The headline numbers are unambiguous. Opus 4.7 hits 64.3% on SWE-bench Pro (+11 points over 4.6), 87.6% on SWE-bench Verified (+7 points), and claimed #1 on Vals Index at 71.4% — up from a 67.7% previous best. Cursor's internal benchmark jumped from 58% to 70%; Notion reportedly saw a 14% lift with one-third the tool errors. Swyx's AINews summary is the most comprehensive read: 4.7-low is strictly better than 4.6-medium, 4.7-medium strictly better than 4.6-high, and 4.7-high beats 4.6-max — with a new xhigh effort tier that Claude Code now defaults to.

The vision upgrade is the sleeper story. Opus 4.7 accepts images up to 2,576px on the long edge (~3.75 megapixels), more than 3x prior Claude limits, with no downscaling. For computer-use agents parsing dense screenshots or fine-grained diagram extraction, this is a material capability shift, not a footnote.

Where it gets messier is the tokenizer. Several observers flagged that 4.7 ships a different tokenizer from 4.6, prompting debate about whether this is a new base model, a tokenizer-swapped continuation, or a Mythos distillation. Anthropic's Boris Cherny acknowledged that the same input can map to 1.0–1.35x more tokens depending on content type, while simultaneously arguing that reasoning efficiency improved enough that total token use is still down up to 50% versus equivalent 4.6 effort levels. Cherny said Anthropic increased subscriber limits to compensate — but enterprise document pipelines face real cost pressure: LlamaIndex's ParseBench comparison puts Opus 4.7 at ~7¢/page for OCR-like work, versus ~0.4¢ for their cost-effective mode.

Document understanding shows the characteristic uneven pattern: charts improved massively (13.5% → 55.8%), while layout actually regressed (16.5% → 14.0%). Gains in content faithfulness are strong, formatting improvements are modest, and table extraction is negligible. The model got better in aggregate but not uniformly.

The most pointed technical pushback is on long-context performance. Multiple users noted MRCR (a needle-in-a-haystack-style metric) looks worse on 4.7. Cherny's response is that Anthropic is intentionally phasing out MRCR because it "overweights distractor-stacking tricks" and moving toward Graphwalks as a better applied-reasoning signal — where 4.7 goes 38.7% → 58.6%. That's a defensible position, but it also means you should treat the long-context story as genuinely contested until independent evaluation settles it.

Simon Willison adds a characteristically wry data point from his pelican-riding-a-bicycle SVG benchmark: Qwen3.6-35B-A3B running as a 21GB quantized model on a MacBook M5 outperformed Opus 4.7 on both SVG generation tests. His own caveat is that this doesn't mean Qwen is more capable generally — but it does illustrate something real: the correlation between benchmark leadership and practical task performance is noisier than it used to be at the top of the capability curve.

Cat Wu's operational framing from Anthropic is worth retaining: treat Opus 4.7 like an engineer you delegate to, not a pair programmer you micromanage — put full goals, constraints, and acceptance criteria up front, and encode testing workflows in claude.md. That's less a tip than a signal about what Anthropic optimized for.


OpenAI's Parallel Track: Domain Models + Superapp

While Opus 4.7 dominated the discourse, OpenAI put out two substantial things in 3 days. GPT-Rosalind, the first in a planned life sciences series, scored better than 95% of human scientists on a blind RNA prediction task from Dyno Therapeutics, with Amgen, Moderna, and the Allen Institute already in the test phase. Two days earlier, GPT-5.4-Cyber launched. The pattern is clear: a flagship general model plus purpose-built domain variants tuned for specific reasoning tasks and toolchains.

The Codex update is a separate move — background computer use, parallel agents, an in-app browser, memory across sessions, and inline image generation built in. At 3M weekly users and 70% month-over-month growth, Codex head Thibault Sottiaux called it "building the super app out in the open." The analogy to Claude Code is direct: both are now general-purpose agentic environments, not just coding assistants.

The domain model strategy is worth watching closely. The argument is that the best performance on high-value professional wedges — drug discovery, cybersecurity, financial analysis — may require models trained on domain-specific objectives, not just general RLHF on broad capability. If that's right, the competitive axis shifts from "who has the best general model" to "who has the best domain coverage map."


The Distillation Problem and the Gated Frontier

Dwarkesh Patel has a sharp framing of the distillation question that's worth sitting with. The core argument: frontier labs can't really stop open-source models from absorbing their capabilities, and the numbers make this uncomfortable. At $25/MTok for Opus 4.6 output tokens, 1T tokens of synthetic training data costs $25M. That's noise for a well-funded lab. The labs are hiding chain-of-thought, but Dwarkesh argues this is easy to route around — you can instruct models not to surface thinking, or reconstruct it as an RLVR target. More fundamentally, agentic tool use (file writes, bash commands) happens locally and can't be hidden.
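The cost claim is one line of arithmetic (prices as stated in the text, not a current API quote):

```python
price_per_mtok = 25.0        # USD per million output tokens (Opus 4.6, per the text)
tokens = 1_000_000_000_000   # 1T tokens of synthetic training data
cost = tokens / 1_000_000 * price_per_mtok
print(f"${cost:,.0f}")       # $25,000,000
```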

His most interesting observation is about coding product companies inadvertently building better distillation than direct token sampling: by capturing the "gold diff" — what users actually accepted after multiple back-and-forths — and using that as an RL target, companies building on top of frontier APIs may end up training models better calibrated to real user preferences than the frontier models themselves.

This connects to what Swyx's AINews piece notes about Mythos: Anthropic is now running two parallel tracks — a fast two-month public release cadence and a gated frontier line accessible only to select partners (reportedly including certain U.S. government agencies). The Mythos Preview scores 77.8% on SWE-bench Pro versus Opus 4.7's 64.3% — a 13.5-point gap that Anthropic is explicitly not closing for the public. This is the first major public instance of the "true frontier" being meaningfully ahead of what's commercially available, and the distillation dynamics Dwarkesh describes make it an unstable equilibrium.

On the cybersecurity question specifically: Dwarkesh's framing is that Mythos's qualitative leap in cyberattack capability isn't a smooth intelligence increase but a combinatorial unlock — prior models found individual vulnerabilities; Mythos can chain 5 together into a working exploit. He's cautiously optimistic on defense: software has gotten more secure despite accumulated human hacking talent, and AI-assisted patching of latent zero-days could strengthen defense more than offense. But a security expert he spoke to pushes back — AI is much better at finding vulnerabilities than patching them, because patching requires understanding all the downstream behavioral dependencies of a fix.


Why Pretraining Runs Fail

Dwarkesh's notes from a lecture by Horace He on training parallelisms are the most technically dense content of the day, and genuinely worth reading for practitioners. Two failure patterns stand out.

Breaking causality via expert routing: In mixture-of-experts (MoE) models, expert-choice routing (where experts pick tokens rather than tokens picking experts) gives cleaner load balancing during training but breaks the causal mask — token N's expert assignment can depend on token N+K. This may explain why Llama 4 and Gemini 2 underperformed expectations. The fix is known but the tradeoff is real: you sacrifice load balancing for correctness, or vice versa.
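The causality break is easy to demonstrate with a toy router (hand-picked affinity scores, one expert, capacity 2 — an illustrative sketch, not any production implementation):

```python
import numpy as np

def expert_choice_route(scores, capacity):
    """Expert-choice routing sketch: each expert takes its top-`capacity`
    tokens by affinity score (scores: [n_tokens, n_experts]). The argsort
    runs over the whole sequence, so whether token i is selected can
    depend on tokens that come after it."""
    return {e: set(np.argsort(-scores[:, e])[:capacity].tolist())
            for e in range(scores.shape[1])}

# 4 tokens, 1 expert, capacity 2, hand-picked affinities.
scores = np.array([[0.9], [0.5], [0.1], [0.0]])
assert expert_choice_route(scores, capacity=2)[0] == {0, 1}

# Raise ONLY the last (future) token's affinity...
scores[3, 0] = 10.0
assert expert_choice_route(scores, capacity=2)[0] == {0, 3}

# ...and token 1 is displaced: its hidden state now depends on a token
# that comes after it. That's fine under teacher forcing but impossible
# at autoregressive inference. Token-choice routing (each token picks
# its experts independently) restores causality but gives up guaranteed
# load balancing -- the tradeoff described above.
```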

Floating point bias accumulation: The original GPT-4 training apparently hit a significant slowdown from a subtle FP16 bug in collective operations. When summing 10,000 small values into a large accumulator, the logarithmic spacing of FP16 values means you can hit quantization cliffs: once the accumulator is large enough, each small addend rounds straight back down to the previous value, repeatedly, and the accumulated sum can diverge 10x from ground truth. Bias compounds; variance averages out.
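A toy reproduction of the cliff (0.5-sized addends chosen for illustration, not the lecture's exact numbers): FP16's representable values at 1024 are spaced 1.0 apart, so 1024 + 0.5 ties between 1024 and 1025 and rounds to even, back down to 1024. The accumulator stalls and the error is pure bias.

```python
import numpy as np

# Sum 10,000 values of 0.5 into an FP16 accumulator. Once the
# accumulator reaches 1024, every add of 0.5 rounds back down
# (round-half-to-even) and the sum never grows again.
values = [0.5] * 10_000

acc_fp16 = np.float16(0.0)
for v in values:
    acc_fp16 = acc_fp16 + np.float16(v)   # each add rounds to FP16

true_sum = sum(values)                    # exact in float64
print(float(acc_fp16), true_sum)          # 1024.0 vs 5000.0, ~5x off

# A wider accumulator (or pairwise/Kahan summation) sidesteps the cliff:
acc_fp32 = np.float32(0.0)
for v in values:
    acc_fp32 = acc_fp32 + np.float32(v)
print(float(acc_fp32))                    # 5000.0
```

The same mechanism at larger magnitudes and longer sums produces the bigger divergences described above; a standard mitigation is to accumulate collectives in FP32 even when activations are FP16.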

The broader framing Dwarkesh takes from his conversations: it's unlikely there are just "5 ways training runs fail" that labs can definitively solve. New failure modes keep emerging at scale. His source is bearish on AI fully automating kernel writing in the near term on these grounds — it's more "AGI-complete" than the "verifiable domain → easy RL loop" framing suggests.


The practical question today's content surfaces: the Mythos gap (13+ SWE-bench Pro points ahead of what's publicly available) and the distillation dynamics Dwarkesh describes create a structural tension. If gated frontier models can't maintain a durable lead over open-source distillation, the value of the gating degrades over time — but it may be durable long enough to matter for cybersecurity, biotech, and other high-stakes domains where a 6-month capability advantage has asymmetric consequences. Whether that gap is strategically meaningful or just a temporary moat is the question that will define the next year of lab positioning.
TL;DR
- Opus 4.7 delivers real benchmark gains (+11 SWE-bench Pro, 3x image resolution) but reactions are divided on non-coding tasks, long-context performance is contested, and the new tokenizer creates cost complexity for document pipelines.
- OpenAI is running a dual strategy of domain-specific models (Rosalind for biotech, 5.4-Cyber) alongside a Codex-as-superapp push, suggesting the flagship-plus-verticals architecture is becoming an industry pattern.
- Dwarkesh argues distillation can't meaningfully be stopped — agentic tool use is observable, synthetic data is cheap, and coding product companies may be building superior training signal from user-accepted "gold diffs."
- Two underappreciated causes of failed pretraining runs are causal mask violations in MoE expert routing (possibly behind Llama 4 and Gemini 2 underperformance) and FP16 bias accumulation in collective operations, with sources suggesting new failure modes keep emerging at every new scale.
Compiled from 4 sources · 6 items
  • Simon Willison (3)
  • Rowan Cheung (1)
  • Swyx (1)
  • Dwarkesh Patel (1)

HN Signal Hacker News

Today felt like a snapshot of the industry at a hinge point — loud model launches, louder user frustration, and underneath it all a philosophical argument about whether AI is making us better at anything.


The AI Arms Race Exhausts Its Audience

Anthropic dropped Claude Opus 4.7 today — the biggest story on HN by a mile, pulling 1,755 points and 1,261 comments. Within hours, OpenAI countered with a "major update to Codex," its AI coding assistant. Google, not to be left out, announced an Android command-line interface (CLI) promising to build Android apps 3x faster using AI agents. It was that kind of day.

The Opus 4.7 announcement led with benchmark improvements and a new tokenizer (the component that breaks text into chunks the model can process), which maps the same input to 1.0–1.35x more tokens, meaning users pay more per prompt for identical text. The community noticed. "Quick everyone to your side projects," wrote TIPSIO drily. "We have ~3 days of un-nerfed agentic coding again." The joke lands because users have learned that new models often get quietly degraded after launch.

The trust problem runs deeper. Anthropic benchmarked Opus 4.7 partly against "Mythos," a more capable model they're deliberately not releasing publicly — using it as a benchmark ceiling while selling you something weaker. To the community, this reads less like a product announcement and more like a carrot on a stick. endymion-light put it plainly: "Anthropic need to build back some trust and communicate throttling/reasoning caps more clearly." On the Codex side, sidgtm noted the timing: "They felt the pressure of posting something after Claude 4.7." The Codex update itself generated confusion — croemer asked what "major update" even meant, and Linux users discovered the new computer-use features are macOS-only.

A piece on the "beginning of AI compute scarcity" (by venture capitalist Tom Tunguz) provided useful backdrop: we may be entering 5–10 years of genuine hardware constraints on AI inference, driven by limits on semiconductor manufacturing capacity. Commenter mattas pushed back with a grocery metaphor — "if I pay $1 for oranges and sell them for $0.50, I can't say 'I don't have enough oranges'" — pointing out that calling it "scarcity" when labs are burning cash at historic rates is convenient framing. Paulddraper countered with efficiency data: DeepSeek V3 ran at 1/10th the cost of contemporary ChatGPT, and open-weight models keep improving. Both things are probably true: physical limits are real, but so is the relentless march of efficiency.


The Epistemic Rot Argument

The second-most-discussed story was a long essay by Kyle Kingsbury (a respected distributed systems engineer who writes at aphyr.com), titled "The future of everything is lies." Its central argument is grim: AI systems are degrading the epistemic infrastructure of the web — search, government services, institutional trust — faster than we can adapt. The essay invokes anthropologist James C. Scott's concept of "metis" (the tacit knowledge built only by doing hard things by hand) to argue that AI assistance is hollowing out our ability to think.

The comments were genuinely divided. lukev called it "a must-read series" and drew a parallel to early automobiles — a technology that could be useful but reshaped society in damaging ways. chungusamongus was dismissive: "Complaining about AI slop is starting to become its own kind of slop... none of them have a solution other than empty moralizing." analog8374 offered the sharpest one-liner: "We've recreated pre-enlightenment intellectual culture. Authority and logical consistency matter. Reality doesn't."

This thread connects naturally to a separate story: Discourse (the popular open-source forum software) published a post rebutting a trend of developer tools going closed-source under the banner of "security." Its creator argued that open source actually improves security because constant public scrutiny forces earlier investment in finding and fixing bugs. A commenter named LoganDark drew the explicit AI parallel: keeping powerful models private because they're "too dangerous" gives attackers a weapon with no corresponding defense — the same logic Anthropic used to justify gating Mythos. The Discourse post also landed a pointed shot at the broader trend: "You can't take five years of community contributions, close the gate, and claim you're grateful."


Builders Closing the Loop with Hardware

While the big labs traded announcements, some of the most genuinely interesting content came from people using AI to do real physical work.

A GitHub project called "autoprober" showed someone building an AI-driven circuit probe from a CNC machine, a camera, and duct tape. The concept — an automated arm that uses computer vision to locate and test points on a circuit board — is legitimately novel at hobbyist scale. Commercial "flying probes" (automated circuit testers) cost tens of thousands of dollars. Commenters were cautiously impressed, but uSoldering noted that no actual probing appears in the demo video, and claytonia identified the core tension: "AI is probabilistic, but hardware is precise. If the model miscalculates a pin's position by 0.1mm, the probe may crush the board."

A related "Show HN" post demonstrated SPICE (a standard tool for simulating how a circuit will behave before you physically build it) connected via an AI agent to a real oscilloscope (a device that measures live electrical signals). The builder used Claude to automatically compare simulated vs. real circuit behavior and flag discrepancies — a genuinely elegant use of AI as a verification layer that closes the loop between model and reality. The author was honest about limits: Claude "sometimes claims it matched the simulation when it obviously didn't."

Both projects share something worth noting: the constraint is a feature. When you're probing hardware at sub-millimeter precision, or comparing a simulation to live voltage readings, there's no room for hallucination. Reality provides instant feedback.


The Internet Governance Front

A US bill mandating on-device age verification generated significant heat. The proposal would require operating system providers to verify users' ages and make that data available to apps on request. Commenters were largely skeptical — dizzy9 put the core objection cleanly: "Age verification inherently means identity verification. There's no way to prove your age without first proving that you are YOU." hackinthebochs pushed back, calling this "probably the least bad" version of a trend that's coming regardless. pkphilip noted the timing: the EU released similar legislation the same week, calling the synchronization "pure coincidence."

Separately, Bluesky spent nearly a full day fighting a distributed denial-of-service attack (a flood of fake traffic designed to overwhelm servers). The irony noted in the thread: Bluesky is marketed as a "decentralized" social network, but a DDoS against its central infrastructure took it down for a day — suggesting the decentralization is less complete than advertised. bit1993 put it plainly: "A decentralized protocol by definition should not be vulnerable to DDoS attacks."


There's a through-line in today's HN. The frustration with Anthropic's opaque throttling, the anxiety in Kingsbury's essay, the skepticism about Bluesky's "decentralization" — they're all versions of the same complaint: systems that present themselves as open or trustworthy are quietly more closed and unreliable than they appear. The hardware builders probing circuit boards with duct tape and AI feel, by contrast, like the most honest practitioners on the feed today. They're not claiming anything works until they can measure it.
TL;DR
- Anthropic's Opus 4.7 and OpenAI's same-day Codex counter triggered a wave of user frustration about model reliability, pricing opacity, and better models being dangled as benchmarks but withheld from customers.
- A widely-read essay arguing AI is degrading epistemic infrastructure sparked real debate — while a Discourse post defending open source made the parallel argument that keeping powerful systems opaque benefits attackers more than defenders.
- Hobbyist builders are using AI to close loops between circuit simulation and real hardware, with honest results: sometimes it works, sometimes the AI confidently lies.
- A US age-verification bill and Bluesky's DDoS outage both exposed the gap between how platforms present themselves (private and decentralized) and how they actually work.
