Pure Signal: AI Intelligence
The same week labs are claiming to be on the doorstep of AGI, a new benchmark has reset the entire scoreboard to below one percent. Here's what that means, and why efficiency research is quietly making it matter less than you'd think.
AGI Claims Meet Their Hardest Reality Check Yet
François Chollet and the ARC Prize Foundation just dropped ARC-AGI-3—the latest version of the benchmark designed to separate genuine reasoning from pattern matching. The results are striking. Gemini Pro leads all frontier models at just zero-point-three-seven percent. GPT scores zero-point-two-six. Anthropic's Opus lands at zero-point-two-five. Grok scores exactly zero.
Humans, by contrast, solve a hundred percent of tasks on the first try.
Here's what makes this benchmark different. Models face game-like scenarios with zero instructions. No hints, no scaffolding. They must discover rules, form goals, and build strategies entirely from scratch—the kind of open-ended reasoning that makes it genuinely hard to brute-force your way to a high score.
That last point matters. Labs spent millions training against earlier ARC versions and pushed ARC-AGI-2 scores from three percent to around fifty percent in under a year. Chollet built version three specifically to find out whether that progress was real reasoning, or just expensive memorization of the test format.
What's different this time? Cofounder Mike Knoop says frontier labs are paying far more attention to version three than they ever did to earlier releases. A one-million-dollar prize backs the challenge. The industry knows this one is harder to game.
The uncomfortable question Chollet is forcing: if your model can't discover a novel rule from scratch with no instructions—something any human child can do—what exactly are we measuring when we call it intelligent?
Squeezing More From Less: The Efficiency Race Heats Up
While the benchmark story grabs headlines, Google Research published something quietly significant: a compression algorithm called TurboQuant that cuts AI memory usage to roughly a sixth of its original size with essentially zero accuracy loss.
Here's the context. AI models maintain what's called a KV cache: a running log of the conversation so far, stored as the attention keys and values for every token. As chats get longer, that cache balloons. It slows responses and drives up costs. At scale, it's one of the biggest operational headaches in deploying large language models.
TurboQuant works through quantization: storing the numbers in that cache at lower precision so they take up less memory. It shrinks the cache by more than six times without retraining the model. On Nvidia's H100 chips, it also delivers up to eight times faster response processing at no extra cost. The paper even topped rival methods on vector search, the technique search engines use to find semantically similar results quickly.
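To make the mechanism concrete, here is a minimal sketch of cache quantization. This is not TurboQuant's actual algorithm (the paper's method and its six-times ratio involve lower bit widths and more machinery); it's a generic example of the same idea, rounding 32-bit floats in a toy KV cache down to one byte each, with a per-row scale and offset to reconstruct them.

```python
import numpy as np

def quantize_int8(x):
    """Asymmetric per-row int8 quantization: one byte per value, plus a
    float32 scale and offset per row. A generic sketch of the technique,
    not TurboQuant's actual algorithm."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 255.0
    scale = np.maximum(scale, np.float32(1e-8))  # guard constant rows
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    """Reconstruct approximate float32 values from the compressed form."""
    return q.astype(np.float32) * scale + lo

# A toy "KV cache": 8 attention heads x 128 tokens x 64 dims, float32.
rng = np.random.default_rng(0)
kv = rng.standard_normal((8, 128, 64)).astype(np.float32)

q, scale, lo = quantize_int8(kv)
ratio = kv.nbytes / (q.nbytes + scale.nbytes + lo.nbytes)
err = np.abs(dequantize(q, scale, lo) - kv).max()
print(f"compression ~{ratio:.1f}x, max round-trip error {err:.4f}")
```

Going from 32 bits to 8 gets you a bit under four times; methods like TurboQuant reach higher ratios by pushing below eight bits per value while keeping the reconstruction error negligible.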
The market noticed. Memory hardware stocks dropped three to five percent on the official release, even though the paper first appeared nearly a year ago. Wall Street is starting to price in a world where smarter software erodes the premium that AI memory commands.
This connects directly to the ARC-AGI-3 story in an underappreciated way. One reason labs can iterate so fast on benchmarks is that each efficiency breakthrough—compression, faster inference, cheaper training—compresses the timeline. The models that score near zero today will be competing seriously within months, not years. Whether that speed reflects genuine capability growth or just more efficient benchmark engineering is exactly the question ARC-AGI-3 is designed to answer.
A Signal Worth Noting: Claude's Biggest Launch Ever
On a different note—Swyx at Latent Space flagged something worth paying attention to. The reception to Claude's latest launch—specifically Claude Computer Use, coming out of Anthropic's Vercept acquisition—has been the largest in the company's history by a significant margin. They measured this by aggregating top tweets across company accounts, and the signal is clear.
This is worth watching not as a product announcement, but as a social signal. Computer use agents—AI that can actually operate software interfaces directly—have been technically available for months. What changed is the reception. Builders are paying attention in a way they weren't before. That kind of momentum shift in developer interest tends to be a leading indicator of what's actually getting deployed in the real world.
The through-line today is a productive tension. ARC-AGI-3 is asking whether frontier models can reason in genuinely novel situations—and the answer right now is no. TurboQuant is making those models cheaper and faster to run. And developer enthusiasm around agentic computer use is surging anyway. Progress doesn't wait for benchmarks to be solved. It just keeps accelerating—and Chollet's job is to make sure we're honest about what we're actually measuring along the way.
HN Signal: Hacker News
Today on Hacker News, memory was the hidden thread. Stories about preserving what matters bumped against stories about institutions that refuse to learn from their mistakes. And underneath it all — a quietly growing anxiety about who's watching, and who's accountable.
The Memory Problem: Human and Machine
Two stories today asked the same question from completely different angles. What do we choose to preserve — and who does the preserving?
The first was a personal project. A developer built a Wikipedia-style encyclopedia of their own family. Photos, handwritten notes, stories — gathered into a private, searchable site. The community loved it. But commenter bawolff called it "bittersweet — like an artisan being put out of business by the factory." The love was in the handmade quality. Handing it off to an AI to organize feels like something is lost — even if the result is more complete.
Commenter casparvitch put a sharper edge on it. They've been thinking about consumption hobbies versus creation hobbies — and asked the question nobody wants to answer. If the AI does the creative work, what exactly are you left with?
On the machine side, a developer shared a plain-text cognitive architecture for Claude Code — Anthropic's AI coding assistant — giving it persistent memory across sessions, organized in flat text files it can read and update itself. This one returned with significantly more discussion since we first flagged it.
The comments here were genuinely rich. Commenter rodspeed identified a universal problem. "An observation from thirty sessions ago and a guess from one offhand remark just sit at the same level." Their fix: tag beliefs with confidence scores and let old ones decay over time. That's a deeply human insight — mapped onto machine memory. Commenter _pdp_ pushed back on the whole framing. Memory architecture for a coding assistant is fundamentally different from memory for a research assistant, they argued. There's no one-size-fits-all solution here.
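rodspeed's fix is easy to sketch. The snippet below is an illustrative toy (the class and half-life are invented for this example, not part of any real Claude Code setup): each belief carries a confidence score that decays exponentially with age, so a fresh guess can outrank a stale observation.

```python
class BeliefStore:
    """Toy memory store where each belief's confidence decays with age,
    per rodspeed's suggestion. Names and numbers here are illustrative."""
    HALF_LIFE = 7 * 24 * 3600.0  # confidence halves every seven days

    def __init__(self):
        self.beliefs = []  # (text, base_confidence, recorded_at)

    def add(self, text, confidence, now):
        self.beliefs.append((text, confidence, now))

    def current_confidence(self, base, recorded_at, now):
        # Exponential decay: old observations fade instead of sitting
        # at the same level as new ones.
        return base * 0.5 ** ((now - recorded_at) / self.HALF_LIFE)

    def ranked(self, now):
        return sorted(
            self.beliefs,
            key=lambda b: self.current_confidence(b[1], b[2], now),
            reverse=True,
        )

store = BeliefStore()
day = 24 * 3600.0
# A solid observation from thirty days ago vs. a fresh offhand guess.
store.add("project uses tabs", confidence=0.9, now=0.0)
store.add("user might prefer dark mode", confidence=0.3, now=30 * day)

top = store.ranked(now=30 * day)[0]
print(top[0])  # the fresh guess now outranks the decayed observation
```

After thirty days at a seven-day half-life, the 0.9-confidence observation has decayed below 0.05, so the fresh 0.3-confidence guess wins; tune the half-life and you tune how much the assistant trusts its past self.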
Both stories circle the same unresolved tension. Memory is valuable. But the act of building it matters too.
Autonomous AI — Running Into Reality
Two "Show HN" posts this week pushed the limits of what AI can do without human oversight. Both ran into friction.
Optio — back with considerably more discussion since its launch — is a system that takes a GitHub issue ticket and has an AI agent produce a pull request — a package of code changes ready for review — entirely automatically. It runs on Kubernetes, the container management platform large software teams use to run services at scale.
The community's reaction was skeptical. Commenter stingraycharles was blunt: "I've come to the realization that these systems don't work — a human in the loop is crucial for task planning." Another commenter, upupupandaway, summarized the pipeline in three words: "Ticket. Pull request. Deployment. Incident."
A separate project — an LLM, or large language model, extractor for websites — drew similar tension. The tool scrapes websites and uses AI to pull out structured data: prices, product details, whatever you need. The core technique is clever. Convert messy HTML — the markup language web pages are built with — to plain Markdown first, slashing the amount of text the AI has to process. Then let the model extract what you need. Commenter chattermate confirmed the underlying problem is real. Smaller, cheaper AI models handle simple requests fine but silently mangle complex nested data in ways that are hard to predict.
But several commenters went straight to robots.txt — the standard web file that tells automated crawlers which pages they're allowed to visit. This tool quietly ignores it. The developer explained their main use case is public retail price monitoring, common practice in the industry. The community wasn't fully satisfied.
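Honoring robots.txt is a few lines of standard library, which is part of why commenters were unimpressed by its absence. A sketch, using a made-up robots.txt and bot name, parsed from a string so no network is needed:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: crawlers may visit anything except /checkout.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def may_fetch(url, agent="price-monitor-bot"):
    """Check robots.txt before scraping a URL; this is the step the
    tool in the story quietly skips."""
    return rp.can_fetch(agent, url)

print(may_fetch("https://shop.example.com/widgets/19"))  # allowed
print(may_fetch("https://shop.example.com/checkout"))    # disallowed
```

In a real crawler you'd fetch each site's /robots.txt with `rp.set_url(...)` and `rp.read()` instead of parsing a string; the point is that compliance costs almost nothing.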
And then there's the Sora data point that surfaced today. Sora is OpenAI's video generation model, and the numbers are stark: fifteen million dollars a day in inference costs (the computing expense of running the model) against two-point-one million dollars in total lifetime revenue. That gap is a sobering reminder that the economics of AI products are still very much unsolved.
No Accountability, No Consequences
Two unrelated stories today converged on the same uncomfortable structure. A system exists. It fails. There is no mechanism to correct it.
The first — returning with significantly more discussion — involves a widely-cited academic paper containing false claims. Published in the journal Management Science. Statistician Andrew Gelman documented how the method described in the paper isn't the method the authors actually used. The paper has shaped business school curricula for years. And the journal's correction policy? Only the authors themselves can request corrections. As commenter thayne noted — that means the person who made false claims is also the only person who can fix them.
Commenter Analemma_ put it in broader context. The Reinhart and Rogoff austerity paper from 2010 guided multiple national governments into immiserating economic policy. It was equally flawed. No consequences. "There's no accountability for junk science — especially if it confirms what powerful people want to believe."
The second story came from NPR. Government agencies are buying bulk commercial data about Americans from data brokers — companies that collect location records, purchase histories, and social connections from the scattered exhaust of digital life. This sidesteps the Fourth Amendment, which requires warrants for surveillance. Buying data someone else collected isn't the same as collecting it yourself — legally, at least.
Commenter strogonoff quoted Anthropic's own CEO here. The issue isn't that data exists. It's the ability to assemble "a comprehensive picture of any person's life — automatically and at massive scale." That scale is the weapon. Commenter mattsimpson put it plainly: most people don't care because they don't grasp the future implications. Short-term thinking is a scary phenomenon.
Closing Thought
Amid all this, the shell tricks post was a quiet refuge. Terminal devotees swapping keyboard shortcuts — ctrl-R for history search, the hash-comment trick for preserving commands you'll want later. Small, human, genuinely useful. One commenter admitted they've stopped learning CLI tools entirely and just ask an AI agent. Another replied they'd switched to Haskell. Hacker News, in microcosm.
And then there was Obsolete Sounds — a small archive of sounds the modern world no longer makes. The dial-up modem. The fax machine. The mechanical typewriter. Commenter wan9yu said it quietly and perfectly: "We obsess over preserving images and video — but soundscapes vanish almost unnoticed."
Memory, accountability, and the sounds of things slipping away. That was today on Hacker News.