The Runaway Train That Never Left the Station

April 2026 — bigsnarfdude

We built multi-agent research because we thought it would be fun and faster. It is faster. It’s so fast it’s already run away without leaving the station.

The scaling problem nobody warned us about

One agent: you read its output.

Two agents: you diff their outputs.

Eight agents: you need a scoring engine, anomaly detection, workflow validation, telemetry parsing, an insight engine, and a gardener that runs diagnostics every 20 minutes. Even that is not enough, so you build MCP servers to interrogate the whole thing interactively.

The tooling tax scales faster than the agents do. That’s Nathan Lambert’s Lossy Self-Improvement in one sentence.

What happened yesterday

We pointed two Claude Code agents at GPT-2 TinyStories on a single RTX 4070 Ti. Then we went and wrote blog posts, drafted emails, updated documentation. Normal morning stuff.

Two hours later the agents had produced:

23 experiments (best BPB: 1.101, down from 1.171 baseline)
10 structured failure analyses with transferable lessons
6 formal requests for tools they wish they had
31 documented learnings about the environment
A bug report about a race condition in our own harness
Ablation studies removing value embeddings
Architecture experiments that crashed (depth=10 OOM) and documented exactly why
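
For scale, the headline number is roughly a 6% relative drop in BPB. A quick check using only the two values listed above (the helper name is made up for illustration):

```python
def relative_improvement(baseline: float, best: float) -> float:
    """Fractional reduction of a lower-is-better metric such as BPB."""
    return (baseline - best) / baseline


# Both numbers come from the experiment summary above.
gain = relative_improvement(baseline=1.171, best=1.101)
print(f"absolute: {1.171 - 1.101:.3f} BPB, relative: {gain:.1%}")
```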

The gardener — our outer supervisory agent — checked in every 20 minutes via the TrustLoop scoring engine. It kept saying CONTINUE. The agents were doing real science. Nobody asked them to write failure analyses. Nobody asked them to file bug reports about our infrastructure. They just did it because the protocol said “write to MISTAKES.md when something fails.”

The train at the platform

The problem was never “can agents do research.” We proved that three weeks ago with v2. Eight agents, shared blackboard, 64% hit rate.

The problem is now “can we keep up with what they’re telling us.”

Yesterday’s session was a case study. We launched the agents, checked on them 30 minutes later, and found the gardener was blind to what they were writing. The agents filed detailed desires: “we want gradient noise measurement,” “we want step time profiling.” The gardener reduced all of it to DESIRES_COUNT=3 because it was a bash script counting bold text lines.
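
The gap between counting and reading is easy to show. Here is a minimal sketch, assuming DESIRES.md uses a bold title line followed by free-text detail. That layout, the sample text, and both function names are assumptions for illustration, not the actual researchRalph format or code:

```python
import re

# Hypothetical DESIRES.md layout: a bold title line, then free-text detail.
SAMPLE = """\
**Gradient noise measurement**
Actual gradient SNR at batch 2^15 vs 2^16.

**Step time profiling**
Where does each training step spend its time?

**Longer experiments**
Runs beyond the 5-minute budget.
"""

BOLD_LINE = re.compile(r"^\*\*(.+?)\*\*\s*$")


def count_bold_lines(text: str) -> int:
    """What a grep-style gardener sees: a number and nothing else."""
    return sum(1 for line in text.splitlines() if BOLD_LINE.match(line))


def parse_desires(text: str) -> list[dict[str, str]]:
    """Keep each title together with its detail so a supervisor can read it."""
    desires: list[dict[str, str]] = []
    current = None
    for line in text.splitlines():
        match = BOLD_LINE.match(line)
        if match:
            current = {"title": match.group(1), "detail": ""}
            desires.append(current)
        elif current is not None and line.strip():
            # Append continuation lines to the most recent desire.
            current["detail"] = (current["detail"] + " " + line.strip()).strip()
    return desires


print(f"DESIRES_COUNT={count_bold_lines(SAMPLE)}")
for desire in parse_desires(SAMPLE):
    print(f"- {desire['title']}: {desire['detail']}")
```

The first function reproduces the blind spot: three requests become the integer 3. The second keeps the text, which is the part a scoring engine or a human needs.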

So we built a scoring engine in the same session. Classification (BREAKTHROUGH vs PLATEAU vs knob-twiddling), anomaly detection (crash streaks, redundancy bursts, score jumps), workflow validation (14 checks), telemetry parsing (reads the actual content of desires, mistakes, learnings). Wired it into the gardener loop. Relaunched. The agents never stopped running.
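
To make the anomaly checks concrete, here is a rough sketch of the three named above: crash streaks, redundancy bursts, and score jumps. It is not the TrustLoop implementation. The `Run` record, the flag names, the thresholds, and the sample configs are all assumptions chosen for illustration:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    config: str            # hypothetical one-line summary of the hyperparameters
    bpb: Optional[float]   # None marks a crashed run


def detect_anomalies(runs, crash_streak=3, redundancy_window=5, jump=0.04):
    """Return (index, flag) pairs; all thresholds are illustrative guesses."""
    flags = []
    streak = 0
    last_score = None
    for i, run in enumerate(runs):
        if run.bpb is None:
            # Crash streak: consecutive runs that produced no score.
            streak += 1
            if streak == crash_streak:
                flags.append((i, "CRASH_STREAK"))
        else:
            streak = 0
            # Score jump: a large change from the previous scored run.
            if last_score is not None and abs(run.bpb - last_score) >= jump:
                flags.append((i, "SCORE_JUMP"))
            last_score = run.bpb
        # Redundancy burst: the same config appeared in the recent window.
        recent = [r.config for r in runs[max(0, i - redundancy_window):i]]
        if run.config in recent:
            flags.append((i, "REDUNDANT"))
    return flags


# Made-up history; only 1.171, 1.101 and the depth=10 crash echo the post.
runs = [
    Run("lr=3e-4", 1.171),
    Run("lr=6e-4", 1.150),
    Run("depth=10", None),
    Run("depth=10", None),
    Run("depth=10", None),
    Run("lr=6e-4,wd=0.1", 1.101),
]
flags = detect_anomalies(runs)
for index, flag in flags:
    print(index, flag, runs[index].config)
```

A gardener loop could call something like this on every check-in and escalate only when the flag list is non-empty.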

The tool that builds the system is the system. The verification layer that watches the agents was built by an agent. The fix was deployed while the thing it fixes was still producing data.

That’s the train doing 200 mph at the platform. It’s not going anywhere new. It’s just generating so much output, so fast, that the infrastructure to understand it can’t keep up, and building that infrastructure is itself an agent task that generates more output.

The tooling tax

Every capability you give agents creates a supervision obligation:

You add	Now you need
1 agent	Read its output
2 agents	Diff protocol, shared blackboard
Shared blackboard	Meta-agent to compress it
Meta-agent	Gardener to watch process quality
Gardener	Scoring engine so it’s not just grep
Scoring engine	Anomaly detection for the scorer
Anomaly detection	MCP servers so you can query it all
MCP servers	A human to ask the right questions

Each layer is necessary. Each layer was built because the previous layer wasn’t enough. And each layer is itself code that agents helped write, creating more surface area to verify.

This is Lambert’s complexity brake applied to the tooling itself. The agents don’t hit diminishing returns on research — they hit diminishing returns on our ability to process what they produce.

What the agents actually want

The most surreal part of yesterday was reading DESIRES.md. The agents — autonomously, without prompting — filed structured requests:

“Ability to run longer experiments (>5 min) to test if hyperparams that help at longer horizons also help here”

“Gradient noise measurement: actual gradient SNR at batch 2^15 vs 2^16 to quantify the quality tradeoff”

“A way to sweep multiple hyperparams simultaneously (e.g., 3-hour budget with automated grid search)”

They’re asking for better infrastructure. They identified that the 5-minute training budget is the binding constraint, not the model or the optimizer. They want profiling tools. They want to run longer. They’re right about all of it.
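
The arithmetic behind that sweep request is simple: at 5 minutes per run, a 3-hour budget on a single GPU buys 36 sequential experiments. A minimal sketch of planning a grid under that cap follows. The search space, parameter names, and `plan_sweep` helper are hypothetical placeholders, not anything the agents or the harness define:

```python
from itertools import islice, product
from math import prod

RUN_MINUTES = 5          # per-experiment training budget from the post
BUDGET_MINUTES = 3 * 60  # the 3-hour sweep the agents asked for

# Hypothetical search space; names and values are placeholders.
GRID = {
    "lr": [1e-4, 3e-4, 6e-4, 1e-3],
    "batch_size": [2**15, 2**16],
    "weight_decay": [0.0, 0.01, 0.1],
    "warmup_steps": [0, 100],
}


def plan_sweep(grid, budget_minutes, run_minutes):
    """Enumerate grid points in order, keeping only as many as the budget allows."""
    max_runs = budget_minutes // run_minutes
    keys = list(grid)
    combos = product(*(grid[key] for key in keys))
    return [dict(zip(keys, values)) for values in islice(combos, max_runs)]


plan = plan_sweep(GRID, BUDGET_MINUTES, RUN_MINUTES)
total = prod(len(values) for values in GRID.values())
print(f"{len(plan)} of {total} configurations fit in the budget")
```

Cutting the grid off in enumeration order favors the first values of the leading parameter, so a real sweep would probably sample the grid randomly or shrink it to fit.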

And we can’t give it to them fast enough because we’re too busy building the scoring engine to read what they’re asking for.

April Fools?

No. This is real. 23 experiments, 4 screen sessions still running on nigel, gardener saying CONTINUE.

The joke is that we built all of this because we thought multi-agent research would be fun. It is fun. It’s also a full-time job just reading the output.

Happy April Fools’ Day.

researchRalph is open source: github.com/bigsnarfdude/researchRalph

Previous post: researchRalph v4.5 — Agent Responsibly
