Pure Signal: AI Intelligence
Today's content splits cleanly between 2 consequential topics: an AI cybersecurity arms race that's developing genuinely interesting economic dynamics, and a rare, unusually candid technical postmortem from Notion on what it actually took to ship production agents.
The AI Cybersecurity Race Has Real Economic Logic
OpenAI's GPT-5.4-Cyber release is being framed as a direct counter to Anthropic's Mythos — and the contrast in access philosophy is real. Anthropic gates Mythos to roughly 40 trusted partners; OpenAI is opening GPT-5.4-Cyber to any verified defender who passes ID checks via its Trusted Access for Cyber program. OpenAI researcher Fouad Matin put it directly: "no one should be in the business of picking winners and losers" on who gets to defend their systems.
Simon Willison adds useful nuance here: the "democratized" framing is somewhat overstated. Getting access to OpenAI's best security tooling still requires a separate Google Form application process — not obviously different from Anthropic's Project Glasswing gating. The rhetoric of openness is ahead of the operational reality.
What's more interesting than the access policy debate is the underlying economics Willison surfaces from Drew Breunig. The UK AI Safety Institute independently confirmed Mythos completed their 32-step corporate hack simulation and found that results kept improving as long as you kept spending tokens on it — no apparent ceiling. If that holds, security work collapses to a brutal arithmetic: to harden a system you need to outspend attackers on token consumption, not outthink them. This is security as proof-of-work.
The downstream implication Breunig flags is counterintuitive: open source libraries become more valuable in this regime, not less. The tokens spent securing an open source library amortize across every user of that library. This directly counters the vibe-coding argument that cheap throwaway implementations make established open source projects obsolete — the security economics now favor shared, audited codebases. It's the kind of second-order effect that's easy to miss when the headline debate is about which company has the more capable model.
Meanwhile, SWE-Bench is reportedly saturating (Swyx notes Mythos is at 78% on SWE-Bench Pro), and GDPval rates GPT-5.4 as better than or equal to human experts in 83% of economic domains tested. Greg Brockman is claiming ~1B weekly ChatGPT and Codex users. The benchmark frontier is moving fast enough that Notion, Anthropic, and OpenAI are all independently building "frontier evals" that intentionally only pass ~30% of the time — so they can see where capabilities are headed, not just confirm they haven't regressed.
Five Rebuilds: What It Actually Took to Ship Production Agents
The Latent Space interview with Notion's Simon Last and Sarah Sachs is the most technically substantive piece in today's content. Notion shipped Custom Agents after rebuilding the system 4-5 times over roughly 3.5 years. The history of those rebuilds reveals more about the current state of agent engineering than most benchmark announcements.
Rebuild 1 (late 2022): A JavaScript coding agent. Give the model JavaScript APIs and let it write code to invoke tools. Failed because the models at the time were too weak at code generation to make it reliable.
Rebuild 2: Custom XML tool calling (before the standard existed), with a schema designed to losslessly map to Notion's internal block data model. The learning: they were catering to what made sense for Notion's data model rather than what the model wanted. Models don't know bespoke XML. Switch to markdown.
Rebuild 3: Notion-flavored markdown for page editing, plus SQLite queries for database access instead of Notion's native JSON query format. Models are excellent at SQL; they were terrible at Notion's internal JSON schema. Give models what they already know.
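The trade-off is easy to see in miniature. Here is a hypothetical sketch (the table, rows, and query are invented for illustration, not Notion's actual schema): expose the database to the model as SQLite, and the model can emit plain SQL it already knows instead of a bespoke JSON filter format it has never seen.

```python
import sqlite3

# Invented example: a Notion-style "tasks" database mirrored into in-memory SQLite
# so the model can query it with standard SQL rather than a proprietary format.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (title TEXT, status TEXT, priority INTEGER)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?)",
    [
        ("Ship evals", "In Progress", 1),
        ("Write spec", "Done", 2),
        ("Fix bug", "In Progress", 3),
    ],
)

# The kind of query a model might generate -- vanilla SQL, no bespoke schema needed.
model_sql = "SELECT title FROM tasks WHERE status = 'In Progress' ORDER BY priority"
rows = [r[0] for r in conn.execute(model_sql)]
print(rows)  # ['Ship evals', 'Fix bug']
```

The point isn't the query; it's that SQL is so heavily represented in training data that the model gets this right without any schema tutoring.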
Rebuild 4 (current production): Tool definitions with progressive disclosure. The big architectural shift was moving from centralized few-shot prompts (where 5-6 people were the gatekeepers of a single string in the codebase, and prompt ordering mattered enormously) to distributed tool ownership. Every team now owns their own tool definitions and evals. This unlocked velocity but introduced new failure modes: two tools with the same name caused Claude Sonnet to crash, while GPT-5.4 handled it gracefully. They discovered this through a production incident.
Next version (landing shortly): 100+ tools with proper progressive disclosure — the model sees a search interface into the tool library rather than all 100 definitions simultaneously. The insight that motivated this: adding any new tool risked "nerfing" overall model quality by causing the model to over-call the new tool.
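A minimal sketch of the idea, with invented tool names and a naive substring match standing in for whatever retrieval Notion actually uses: the model's context holds one `search_tools` entry point, and full definitions are pulled in only for the tools it asks about.

```python
# Hypothetical tool library -- in the real system this would hold 100+ definitions.
TOOL_LIBRARY = {
    "create_page": "Create a new page in a workspace.",
    "query_database": "Run a query against a database.",
    "send_slack_message": "Post a message to a Slack channel.",
    "archive_page": "Move a page to the archive.",
}

def search_tools(query: str, limit: int = 3) -> list[dict]:
    """Return only the tool definitions whose name or description matches,
    so the model never carries the full library in context."""
    q = query.lower()
    hits = [
        {"name": name, "description": desc}
        for name, desc in TOOL_LIBRARY.items()
        if q in name.lower() or q in desc.lower()
    ]
    return hits[:limit]

# The model asks for page-related tools; only two definitions enter context.
print([t["name"] for t in search_tools("page")])  # ['create_page', 'archive_page']
```

The design choice this encodes: context is a budget, and every always-visible tool definition spends it whether or not the tool is relevant to the task.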
Several engineering decisions stand out as generalizable. On MCP vs CLI: Simon Last is bullish on CLI agents because CLI is self-healing — if something breaks, the agent can debug and fix itself within the same environment. If an MCP transport gets corrupted, the agent loses the tool entirely with no recovery path. MCP's advantage is tighter permissioning and determinism for narrow, well-scoped tasks. Sarah Sachs adds a pricing dimension: using a language model to call deterministic APIs via MCP repeatedly is wasteful; if you can replace it with a one-time CLI tool, you should. This shapes which integrations Notion builds natively (Slack, Mail, where they want full control and quality) versus via MCP (long-tail third-party tools).
On evaluation infrastructure: Notion runs 3 tiers. Regression tests in CI with stochastic error tolerance. Launch-quality evals with target pass rates (80-90% depending on journey). And "frontier evals" intentionally set at 30% pass rate — these exist specifically to give Anthropic and OpenAI signal on where models need to improve, and to let Notion see where the capability river is flowing. They hit a point where their production evals were saturated and stopped being useful for feedback. The solution was what they're calling "Notion's Last Exam."
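The first tier's "stochastic error tolerance" amounts to gating CI on an aggregate pass rate rather than on individual runs, since agent outputs are nondeterministic. A hedged sketch of that gate, with thresholds invented for illustration:

```python
def regression_gate(results: list[bool], min_pass_rate: float = 0.85) -> bool:
    """Pass CI if the aggregate pass rate over repeated eval runs clears the bar.
    Individual failures are tolerated; a drop in the overall rate is not."""
    return sum(results) / len(results) >= min_pass_rate

# 46 passes out of 50 runs -> 92%, which clears an 85% bar.
outcomes = [True] * 46 + [False] * 4
print(regression_gate(outcomes))  # True

# 40 of 50 -> 80%, which fails it: a real regression, not noise.
print(regression_gate([True] * 40 + [False] * 10))  # False
```

The same skeleton covers all three tiers by moving the threshold: ~85-90% for launch-quality evals, ~30% for the frontier set that is supposed to fail most of the time.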
The Model Behavior Engineer role is worth flagging for practitioners. It started as someone reviewing Google Sheets of model outputs (a linguistics PhD dropout and a Stanford grad). It's evolved into a function distinct from software engineering: writing evals, triaging failures, running LLM judges, understanding why a model fails in a specific domain. They're now building agents to automate the eval pipeline itself — coding agent writes evals, runs them, analyzes failures, proposes fixes. The humans supervise the outer loop.
On pricing: credits as an abstraction over tokens was necessary because tokens aren't the right unit — web search, GPU-served fine-tuned models, priority serving tiers, and sandbox compute all have different cost structures. The "auto" model selector is positioned as a robo-advisor: matching task complexity to the minimum sufficient model, not the cheapest or most powerful. Sarah Sachs is explicit that frontier labs don't currently serve the middle of the intelligence-price-latency triangle well — everything clusters at capability extremes — and Notion is filling the gap with open-source models that are reaching where reasoning models were 3-4 months ago.
The Turkey Problem: Engineers as a Special Case
Swyx surfaces something worth sitting with: agents are doing more work, everyone is working harder, and both things are simultaneously true. Aaron Levie reports AI teams are the busiest they've ever been. Tyler Cowen argues you should work harder right now regardless of which AI trajectory you believe. Simon Last described going back to sleepless nights not because of ML training runs but because of agent token anxiety — before bed, did he start enough agents to keep making progress while he sleeps?
The turkey analogy Swyx raises is pointed: based on historical data, turkeys should conclude that humanity exists to feed them well. The doomsayers are crackpots. Until Thanksgiving. Are knowledge workers turkeys? The question is whether the current surge in demand for human oversight and orchestration is structural or transitional — whether the elasticity of human-in-the-loop work keeps growing, or whether there's a crossover point.
Notion's position implicitly argues for a structural role: someone has to own the outer loop. Sarah Sachs describes software engineers going through "the identity crisis every manager goes through" — realizing their ability to write code matters less than their ability to delegate and context-switch. Simon Last's framing is sharper: managing a team of humans is fuzzy and human, but managing a system of agents is a rigorous technical design problem — PRs in blocked status, dependency graphs, self-verification loops. The job doesn't go away; it changes shape into something that requires more technical rigor about systems design and less about syntax.
Whether that reframing holds through the next capability jump is the open question the day's content doesn't resolve.
The practical tension today's content surfaces: both major AI labs are now producing cybersecurity-specialized models capable enough that the governance question (who should have access?) is no longer theoretical. OpenAI and Anthropic have chosen opposite answers — broad verified access vs. narrow trusted partners — and we don't yet have evidence for which approach actually reduces net harm. That's a design decision being made now, under real uncertainty, with significant consequences.
TL;DR
- OpenAI's GPT-5.4-Cyber vs. Anthropic's Mythos represent genuinely different access philosophies for dangerously capable models, but the more durable insight is economic: AI security becomes proof-of-work, and open source amortizes the token cost of hardening.
- Notion's five agent-harness rebuilds over 3.5 years document the actual engineering path to production agents: give models formats they know (SQL, markdown), distribute tool ownership away from a central few-shot file, build progressive disclosure for 100+ tools, and treat eval-writing as a distinct engineering function.
- The "everyone is working harder while agents do more" paradox is real and unresolved — Swyx's turkey framing and Notion's "software factory" vision point in the same direction but don't agree on whether the human role is structurally stable or transitionally stable.
Compiled from 4 sources · 8 items
- Simon Willison (4)
- Swyx (2)
- Ben Thompson (1)
- Rowan Cheung (1)
HN Signal: Hacker News
Today on Hacker News felt like watching 3 different clocks running at different speeds. AI tooling is in a full sprint — new features shipping weekly. Surveillance infrastructure is on a slow, steady march that most people still aren't tracking. And the open-source commons is quietly aging, held together by whoever shows up to fix it.
AI Feature Velocity and the Moat That Isn't
Anthropic shipped a new feature for Claude Code called Routines — the ability to define automated workflows that run on a schedule, trigger on API calls, or react to GitHub events. Commenter mellosouls captured the community's wry amusement: "Put Claude Code on autopilot... we ought to come up with a term for this new discipline, e.g. 'software engineering' or 'programming.'" The joke lands because it's accurate: this is essentially cron jobs plus webhooks, orchestrated through an AI agent. Routines also directly absorbs territory from OpenClaw, an open-source Claude automation tool that OpenAI recently acquired — and several commenters noted OpenClaw's roughly 2-week moat with dry admiration.
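Stripped of the agent, the shape being described is small. A hypothetical sketch (the names, trigger syntax, and routines are invented for illustration, not Anthropic's API): a routine is a trigger — schedule, API call, or repo event — bound to a prompt that gets handed to the agent when the trigger fires.

```python
from dataclasses import dataclass

@dataclass
class Routine:
    name: str
    trigger: str  # e.g. "cron:0 9 * * 1" or "github:pull_request" (invented syntax)
    prompt: str   # instruction handed to the agent when the trigger fires

def dispatch(routines: list[Routine], event: str) -> list[str]:
    """Return the prompts of every routine whose trigger matches the event."""
    return [r.prompt for r in routines if r.trigger == event]

routines = [
    Routine("weekly-triage", "cron:0 9 * * 1", "Triage open issues"),
    Routine("pr-review", "github:pull_request", "Review the new pull request"),
]
print(dispatch(routines, "github:pull_request"))  # ['Review the new pull request']
```

Which is exactly the commenters' point: the novelty isn't the plumbing, it's that the unit of work being scheduled is now an open-ended agent run rather than a script.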
The feature delivery rate is genuinely striking. Commenter bpodgursky called it "a fast takeoff in miniature — pushing out multiple features each week that used to take enterprises quarters to deliver." But there's an uncomfortable irony at the center of this story. Minimaxir flagged that Anthropic recently made sharp cuts to Claude Code's usage limits, and asked the obvious question: how do more autonomous, compute-hungry tools work within tighter constraints? Commenter ctoth echoed the puzzle: "If they were compute-limited... the rational thing would be to not ship features that use more compute automatically?" No clear answer emerged.
The "moat" question surfaced repeatedly. Commenter airstrike put it bluntly: "Switching costs are virtually zero" — the only real lock-in is that a $200/month subscription undercuts per-token API billing. That's a price advantage, not a product moat.
Meanwhile, Google announced "Skills" for Chrome — the ability to save your best AI prompts as one-click browser tools. Reception was noticeably colder. Commenter hotsalad summarized the room: "So, bookmarklets for Chrome's AI integration?" Commenter woodydesign captured the real problem the feature is trying to solve: "My prompt collection lives in 3 different places right now — Raycast snippets, Apple Notes, and a Notion page." The organizational chaos of prompt management is real. Whether a browser toolbar is the answer is another matter entirely.
A quieter post — one developer's AI-assisted coding workflow — drew a mix of recognition and mild exasperation. The community's approach has essentially converged: discuss the problem with the agent → write a spec → implement → verify. Commenter gbrindisi calls their first step `/rubberduck`, using the AI to frame the problem before writing a single line. Multiple commenters independently noted this looks like old-school waterfall development — structured, sequential — just with shorter cycles. "Agile is just waterfall scaled down to 2 weeks," one commenter observed. "Spec-driven AI workflows might be waterfall scaled down to an afternoon."
The Surveillance Ratchet Keeps Turning
3 stories today, all pointing the same direction: more data is being collected, stored, and potentially weaponized — sometimes by accident, sometimes by design.
The highest-urgency story: Fiverr, the freelance marketplace, left customer files publicly accessible and indexed by Google. A user named morpheuskafka posted a "Tell HN" after trying to report the vulnerability through official channels. The scope is alarming — tax forms (including Form 1040s with Social Security numbers), API tokens, penetration testing reports, confidential PDFs, internal API documentation. Commenter janoelze wrote: "Fiverr needs to immediately block all static asset access until this is resolved. Business continuity should not be a concern here." Commenter qingcharles confirmed independently: "Thousands of SSNs in there." The data appears to have been served via a public content delivery network without access controls, with some of it surfaced through public HTML pages that Google's crawlers found and indexed. This is a disclosure that could expose thousands of people to identity theft.
On the planned-surveillance front, the Electronic Frontier Foundation (EFF) published a detailed takedown of California legislation that would require 3D printer manufacturers to implement a "state-certified algorithm" — a government-approved scanning system — that checks design files for firearm components and blocks prohibited prints. The HN community was almost uniformly skeptical. Commenter horsawlarway, who owns several printers, made the core practical objection: "If I wanted to make something resembling a firearm, I'd go to Home Depot WAY before I bothered 3D printing. You basically just need a metal tube." Commenter MisterTea noted the phrase "state-certified algorithm" carries "a really nice tyrannical ring to it." The consensus was that this bill would fail to prevent its stated harm while creating sweeping collateral damage — from novelty toys to medical models to anything that looks vaguely cylindrical.
Then there's Flock Safety. The site stopflock.com gained significant traction today, rallying opposition to Flock's license plate reader (LPR) network — a system of roadside cameras that records and stores the location and movement of vehicles, sold to thousands of local police departments. Commenter beloch flagged that Flock just started expanding into Ontario, Canada this month. Commenter bmitch3020 articulated the core issue precisely: "I don't want to stop Flock the company. I want to stop Flock the business model... it should at least come with liabilities so high that no sane business would want to hold data that is essentially toxic waste." Commenter jedberg, a former criminal investigator, offered a notable framework: "Your data should be an extension of your home, even if it's held by another company. It should require a warrant."
Amazon's acquisition of Globalstar — expanding its Amazon Leo satellite internet constellation (previously called Kuiper, rebranded last year) — added one more layer to this picture. As commenter kumarvvr tallied up: AWS servers, satellite communications, content delivery, advertising, product sales, and now satellite spectrum and carrier licenses. The vertical integration of communications infrastructure into a handful of private companies raises its own long-term questions about who controls the pipes — and what can be observed through them.
The Commons Has a Maintenance Problem
A thought-provoking post argued that "dependency cooldowns" — waiting days or weeks before updating software libraries your project relies on — makes organizations free-riders on the open-source ecosystem. The logic: if everyone waits for someone else to discover bugs in a new release, someone has to go first. The cooldown assumes that person exists and is paying attention.
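The mechanism itself is almost trivial, which is part of the post's point. A hedged sketch of a cooldown gate (the 14-day window and dates are invented examples; real tooling would pull release dates from a registry):

```python
from datetime import datetime, timedelta, timezone

# Invented policy: only adopt a release once it has been public for a minimum
# number of days, on the theory that someone else surfaces regressions first.
COOLDOWN = timedelta(days=14)

def eligible(release_date: datetime, now: datetime) -> bool:
    """A release is eligible for upgrade only after the cooldown has elapsed."""
    return now - release_date >= COOLDOWN

now = datetime(2026, 4, 15, tzinfo=timezone.utc)
fresh = datetime(2026, 4, 10, tzinfo=timezone.utc)  # 5 days old -> wait
aged = datetime(2026, 3, 20, tzinfo=timezone.utc)   # 26 days old -> upgrade
print(eligible(fresh, now), eligible(aged, now))  # False True
```

The free-rider argument lives entirely in that `COOLDOWN` constant: the policy only works if someone, somewhere, is running with it set to zero.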
Commenter antonvs pushed back hard: mature organizations have always waited before updating production dependencies. "Free-riding is not the right term. It's more like being the angels in 'fools rush in where angels fear to tread.'" Commenter onionisafruit agreed: "The people who will benefit from a cooldown weren't reviewing updates anyway." The counterproposal — upload queues and structured rollout schedules to distribute the testing burden — got a lukewarm reception. The structural conclusion most commenters landed on: this is unfixable without treating open-source infrastructure as a public utility. One commenter said exactly that.
Against that backdrop, a delightful post showed a 20-something finding and fixing a 20-year-old infinite loop bug in Enlightenment E16, a Linux window manager from the early 2000s. The bug caused a hang when detaching from the display. Several veteran commenters emerged with nostalgia: E16 was what lured many of them into Linux in the first place. Commenter madaxe_again: "I saw a black and white printout of someone's desktop and immediately set about figuring out how to get this unbelievable coolness working. By the time I was done I had committed my first patches to a kernel module."
It's a small story, but it illustrates something the dependency-cooldown debate dances around: the commons is maintained by whoever shows up. Sometimes that's a 20-year-old who never used the software when it was new. There's no queue. There's no institution. There's just the person who cared enough to attach a debugger.
The connective tissue across today's HN isn't hard to find: speed and control. AI tooling is moving fast enough that even its builders can't fully keep up with the constraints. Surveillance infrastructure is expanding faster than civic or legal response. And the open-source substrate that everything runs on is patched by a thin, voluntary layer of people doing it because no one else will.
TL;DR
- Anthropic is shipping AI automation features rapidly, but usage limit cuts, near-zero switching costs, and the unresolved "who tests this?" problem are all circling the same drain.
- 3 separate stories — Fiverr's massive data leak, Flock's expanding license plate surveillance network, and California's proposed 3D printer content filter — all point to data exposure and surveillance outrunning both public awareness and legal frameworks.
- The open-source ecosystem faces a structural free-rider problem: dependency cooldowns push testing burden onto a shrinking group of early adopters, while a 20-year-old fixing a bug older than they are is the romantic face of what actually keeps aging software alive.