Compounding Trust: a lab that taught itself gomoku
How verified skills become trust, and trust — left to run — compounds into an autonomous research loop.
We set out to train an AlphaZero agent for 9×9 free-style gomoku on a single Apple-Silicon laptop — and, in doing so, to build an autonomous lab that could run the experiments largely without a human at the keyboard. The organizing idea is borrowed from Sid Bidasaria’s “Stop babysitting your agents”: teach an agent to verify its own work, and reliability follows; reliability lets you run agents in parallel; parallel agents, left in background loops, do the work while you sleep. Each stage rests on the one beneath it, and trust is the currency that lets you climb.
The agent works. From self-play alone it scores 100% against the heuristic player and shallow lookahead search in the fixed opponent ladder: every game won as Black, none lost as White. Getting there meant defeating a stubborn failure mode and discarding a long list of plausible ideas that did not survive verification. This report is as much about the discards as the wins.
The motivating idea
The premise of Bidasaria’s talk [1] is that models have outgrown the way we supervise them: we sit in the loop out of habit, not necessity. His remedy is a ladder of three skills, each of which only becomes safe to climb once the one below it is solid.
First, verification — give the agent the tools and instructions to check its own work, the same write-build-test-debug loop a careful engineer runs. Second, parallelism — once an agent reliably verifies itself, you can run many at once without your attention becoming the bottleneck. Third, background loops — recurring prompts and scheduled routines that take the agent off your keyboard entirely, handling the standing chores: triage, docs, keeping the build green.
Hold Claude’s hand and show it how to verify; once it knows how, it can summarize those learnings into a skill file.
The through-line is trust as a compounding asset. You cannot stop babysitting an agent you cannot verify; verification is what earns the trust to delegate; delegation is what lets the work compound. Skills are the unit that carries trust forward — a verified procedure, written down once, reused without re-supervision. The same logic applies to a research lab: a result you can independently check becomes a foundation the next experiment can stand on.
One framing in this report — treating the lab as a cockpit to be designed rather than an autopilot to be trusted blindly — is our own extension, not Bidasaria’s. His three stages are the spine; the lab below is one concrete attempt to live by them on a real, GPU-bound problem.
The protocol, at altitude
If trust is the currency, the protocol exists to make verification, the thing that mints it, as cheap and as honest as possible. The operational details live in the project wiki [2]; what matters here are the principles, and why each one earns its place.
One GPU tenant, two queues
Heavy work — training, self-play, evaluation — runs strictly one-at-a-time on the GPU, so a measurement is never confounded by a competing tenant. Everything else — code, analysis, writing, audits — fans out in parallel. The scarce, serial resource is protected; the cheap, parallel one is exploited.
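A minimal sketch of the two-queue idea, assuming a single process with threads. The names `gpu_lock`, `run_heavy`, and `run_light` are hypothetical and are not taken from the project; the real lab may schedule its GPU work quite differently.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical: one lock stands in for the single GPU "tenant" slot.
gpu_lock = threading.Lock()

def run_heavy(job, *args):
    """Run a GPU-bound job (training, self-play, evaluation) strictly one at a time."""
    with gpu_lock:
        return job(*args)

def run_light(jobs, max_workers=4):
    """Fan cheap jobs (analysis, writing, audits) out in parallel and collect results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(job) for job in jobs]
        return [f.result() for f in futures]
```

Even when heavy jobs are submitted from parallel workers, the lock serialises them, so a timing taken inside `run_heavy` never shares the GPU with another heavy job started through the same function.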
Receipts are the record
An experiment that ran without filing a receipt — its command, configuration, run ID, checkpoint, and metrics — did not happen, as far as the next session is concerned. The receipt is what makes a result re-checkable, and therefore trustworthy. Evidence is kept immutable; only the synthesis on top of it is revised.
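One way such a receipt could be represented is an append-only JSON Lines file. This is a sketch under assumptions: the field names follow the list above, but the `Receipt` class, `file_receipt`, `load_receipts`, and the file layout are illustrative and not the project's actual format.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Receipt:
    """Hypothetical receipt schema: one record per experiment."""
    command: str
    config: dict
    run_id: str
    checkpoint: str
    metrics: dict

def file_receipt(receipt, log_path):
    """Append one receipt as a JSON line; earlier lines are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(receipt), sort_keys=True) + "\n")

def load_receipts(log_path):
    """Read every receipt back so a later session can re-check the results."""
    with open(log_path, encoding="utf-8") as f:
        return [Receipt(**json.loads(line)) for line in f if line.strip()]
```

Opening the file in append mode keeps the evidence immutable in practice: a revised synthesis is written elsewhere, and the log only grows.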
Verify against something fixed
Strength is measured only against fixed opponents — a heuristic player and lookahead search — never sibling-versus-sibling, which is non-transitive and flatters the field. A separate reviewer pass audits each result before it is promoted. Short evaluations are treated as hints, not verdicts.
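To see why a short evaluation is only a hint, it helps to put an interval around the win rate. The sketch below uses the standard Wilson score interval; it is an illustration of the statistical point, not a description of how the lab's reviewer pass actually scores results.

```python
import math

def wilson_interval(wins, games, z=1.96):
    """Approximate 95% Wilson score interval for a win rate (draws count as non-wins)."""
    if games == 0:
        return (0.0, 1.0)
    p = wins / games
    denom = 1 + z * z / games
    centre = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

low, high = wilson_interval(20, 20)
print(f"20/20 wins: true win rate plausibly between {low:.2f} and {high:.2f}")
```

Even a perfect 20-game sweep leaves the lower bound near 0.84, so a short run cannot by itself distinguish a flawless agent from one that loses roughly one game in seven.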
Optimise the honest objective
The north-star metric is Elo gained per wall-clock hour, measured end-to-end from a common checkpoint against a stable anchor. Throughput proxies such as augmentations-per-second are gameable means; the lab repeatedly caught configurations that won the proxy and lost the objective.
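As a rough illustration of the metric, the sketch below converts scores against an anchor into Elo differences with the standard logistic Elo formula and divides the gain by wall-clock hours. The function names are hypothetical, and the report does not say exactly how the lab computes its Elo figures.

```python
import math

def elo_diff(score):
    """Elo difference implied by an expected score in (0, 1) under the logistic Elo model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))

def elo_per_hour(score_before, score_after, wall_clock_seconds):
    """Elo gained against the same anchor, divided by elapsed wall-clock hours."""
    gained = elo_diff(score_after) - elo_diff(score_before)
    return gained / (wall_clock_seconds / 3600.0)
```

Because the denominator is wall-clock time for the whole loop, a configuration that generates data faster but learns less per hour scores worse here, which is the point of preferring this objective over throughput proxies.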
Isolate so parallel work composes
Every change is made in its own isolated workspace and merged back with an explicit merge commit — never rebased or squashed. Concurrent efforts stay legible and never entangle, which is what makes running many agents at once safe rather than chaotic.
Autonomy as a deny-list
Reversible, local work proceeds without asking; only the genuinely consequential moves — publishing, dependency changes, architectural decisions — pause for a human. Earned trust defaults to action. The boundary is drawn around what is hard to undo, not around what is large.
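In code, a deny-list policy amounts to a default of "proceed" with a short set of exceptions. The action names and the `gate` function below are hypothetical placeholders chosen to mirror the three examples above.

```python
# Hypothetical action kinds that pause for a human; everything else proceeds.
NEEDS_HUMAN = frozenset({"publish", "change_dependency", "architecture_decision"})

def gate(action_kind):
    """Return 'ask_human' for hard-to-undo actions and 'proceed' for everything else."""
    return "ask_human" if action_kind in NEEDS_HUMAN else "proceed"
```

The design choice is the default: an allow-list would block every action nobody thought to enumerate, whereas a deny-list lets new kinds of reversible work go ahead unprompted.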
Read together, the principles are a single bet: make verification so cheap that being wrong is cheap too. When a bad idea costs a 60-second probe and a one-line receipt to disprove, you can afford to test many of them — and the lab’s long list of discarded ideas (§ 5) is the dividend.
The method, briefly
The learner is a standard AlphaZero loop with no human games and no opening book. A single residual network reads the board and emits two predictions: a policy (a distribution over moves) and a value (who is winning). That network guides a PUCT tree search, which balances the policy’s suggestions against the win-rates it accumulates as it reads ahead. The agent plays itself; the search makes each move stronger than the raw network, and the network is then trained to imitate the search. Self-play, train, evaluate against fixed baselines, repeat.
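The balance PUCT strikes can be sketched as a selection rule over a node's children: each child scores its mean value so far plus an exploration bonus that grows with the policy prior and shrinks with its visit count. This is the textbook AlphaZero-style form; the constant `c_puct=1.5` and the `max(1, ...)` guard are assumptions made here, not the project's settings.

```python
import math

def puct_select(priors, visit_counts, value_sums, c_puct=1.5):
    """Return the index of the child with the highest PUCT score Q + U."""
    # max(1, ...) is a choice made in this sketch so priors still break ties
    # at a node whose children are all unvisited.
    sqrt_total = math.sqrt(max(1, sum(visit_counts)))
    best_index, best_score = 0, -math.inf
    for i, (p, n, w) in enumerate(zip(priors, visit_counts, value_sums)):
        q = w / n if n > 0 else 0.0            # mean value observed so far
        u = c_puct * p * sqrt_total / (1 + n)  # prior-weighted exploration bonus
        if q + u > best_score:
            best_index, best_score = i, q + u
    return best_index
```

With no visits the prior decides; as visits accumulate the bonus decays and the observed win-rate takes over.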
Two engineering choices make it tractable on one laptop. Leaf positions from many concurrent games are pooled into a single wave-batched forward pass to keep the GPU saturated, and every position is expanded into its eight board symmetries (D4 augmentation), yielding eight training examples from each recorded position.
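The D4 expansion can be sketched with NumPy: four rotations, each with and without a mirror, applied identically to the input planes and to the policy target so the two stay aligned. The array shapes and the name `d4_augment` are assumptions for illustration; the project's own tensors may be laid out differently.

```python
import numpy as np

def d4_augment(planes, policy):
    """Return the 8 dihedral transforms of one training example.

    planes: array of shape (C, N, N) holding the board feature planes.
    policy: flat array of shape (N * N,) holding the search policy target.
    """
    n = planes.shape[-1]
    pi = policy.reshape(n, n)
    out = []
    for k in range(4):
        # Rotate board and policy by the same quarter turn.
        p_rot = np.rot90(planes, k, axes=(-2, -1))
        pi_rot = np.rot90(pi, k)
        out.append((p_rot.copy(), pi_rot.reshape(-1).copy()))
        # Mirror the rotated pair left-to-right for the reflected variant.
        out.append((np.flip(p_rot, axis=-1).copy(),
                    np.flip(pi_rot, axis=-1).reshape(-1).copy()))
    return out
```

The value target needs no transformation, since who is winning does not change when the board is rotated or mirrored.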
What worked
The path to the result was not a single insight but four changes, each tested alone against the fixed ladder before being kept. In plain terms:
In-search VCF
A fast solver checks each search leaf for a forced win by consecutive fours and backs up a proven outcome. It teaches the network to recognise forcing threats without having to search them move-by-move — a decisive gain over the baseline.
Whole-board (global-pool) sampling
Instead of training mostly on the most recent games, the replay buffer is sampled uniformly across its whole history. Flattening what the model sees is what finally broke the feedback loop behind the central failure (§ 5).
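The difference between the two sampling schemes can be made concrete. The report does not specify how the earlier recency bias was implemented, so the exponential weighting in `sample_recent` is an assumption used only for contrast; `sample_global` is plain uniform sampling over the whole buffer.

```python
import random

def sample_global(buffer, batch_size, rng):
    """Uniform sampling (with replacement) across the buffer's whole history."""
    return rng.choices(buffer, k=batch_size)

def sample_recent(buffer, batch_size, half_life, rng):
    """Assumed contrast case: weight halves for every `half_life` positions of age."""
    n = len(buffer)
    weights = [0.5 ** ((n - 1 - i) / half_life) for i in range(n)]
    return rng.choices(buffer, weights=weights, k=batch_size)

rng = random.Random(0)
buffer = list(range(1000))  # index doubles as insertion order: larger means newer
uniform_batch = sample_global(buffer, 5000, rng)
recent_batch = sample_recent(buffer, 5000, half_life=50, rng=rng)
```

Under the recency scheme almost every sampled position comes from the newest slice of the buffer, which is how a collapsing policy can keep training on its own latest games; uniform sampling keeps older, more varied positions in every batch.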
Value discount
Slow wins are discounted slightly against quick, clean ones, so a decisive finish gives a stronger training signal than a grind. This pulled the agent out of a defensive-draw stall without sending it back into reckless attack.
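One common way to implement such a discount is to shrink the value target geometrically with the number of plies remaining until the game ends. The sketch below assumes that form; the report gives neither the exact formula nor the discount factor, so `gamma=0.98` and the function name are placeholders.

```python
def value_targets(winner, num_plies, gamma=0.98):
    """Value target for each ply, from the perspective of the side to move.

    winner: 0 if the first player (Black) won, 1 if the second player won, None for a draw.
    Positions further from the end of the game receive a smaller-magnitude target.
    """
    targets = []
    for t in range(num_plies):
        if winner is None:
            z = 0.0
        else:
            z = 1.0 if t % 2 == winner else -1.0
        targets.append(z * gamma ** (num_plies - t))
    return targets
```

With this form a win that takes 40 plies labels its opening position with a smaller value than a win that takes 15, so the network learns to prefer the shorter route without the sign of any target changing.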
First-Play-Urgency reduction (evaluation only)
At evaluation time, unvisited moves are given a smarter default than zero, so the search explores more evenly. It costs nothing in training and is what cracks the deepest lookahead opponents.
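A sketch of what a First-Play-Urgency reduction typically looks like: an unvisited child inherits the parent's value estimate minus a fixed reduction instead of defaulting to zero. This follows the convention used in several open-source engines; the report does not give the project's exact rule, and `fpu_reduction=0.2` is a placeholder.

```python
def child_q(value_sum, visits, parent_q, fpu_reduction=0.2):
    """Mean value for a visited child; parent value minus a reduction for an unvisited one."""
    if visits > 0:
        return value_sum / visits
    return parent_q - fpu_reduction
```

Because the default only changes how unvisited moves are scored inside the search, it can be switched on at evaluation time without retraining the network.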
| Fixed opponent | Win as Black | Loss as White |
|---|---|---|
| Heuristic | 100% | 0% |
| Lookahead-2 | 100% | 0% |
| Lookahead-4 | 100% | 0% |
| Lookahead-6 † | ~86–90% | 0% |
| Lookahead-8 † | 100% | 0% |
The headline target (win all as Black, lose none as White against the heuristic and lookahead-2/4) is met at default settings, with no evaluation-time tricks. † Rows marked with a dagger use the evaluation-only levers, which carry the result over to the harder lookahead-6/8 opponents.
What didn’t work
A research log is honest only if it records the failures, and most of this project was failures. They fall into three kinds: a failure mode that had to be solved, beliefs that turned out to be wrong, and levers that simply did not earn their keep.
The failure mode
Because both selves play equally weak defence, a low-budget search keeps confirming its own attacking prior and never explores the blocks. Games degenerate into a short blitz where whoever strikes first wins — in one early cell the agent scored 0/20 against the heuristic while game length collapsed from ~20 plies to ~12. The value head finds the result trivially predictable, which hides the rot. More search alone does not fix it; the buffer itself is starved of the positions that would teach defence.
Beliefs that were wrong
We assumed self-play generation was the bottleneck and that a wider single process would help. Benchmarks showed per-game time was flat across batch sizes — going wider in one process bought nothing. Running four worker processes instead gave a 5–6× speedup (≈85 s/epoch down to ≈15 s), at which point the loop was SGD-bound, not generation-bound.
Faster data is not better data. A search budget that produced 2.9× more games yielded only ~1.9× more examples because the games were shorter; and configurations with high throughput collapsed to fast-attack just the same. Raw speed is a gameable means, which is why the north-star metric is Elo-per-hour, not data-per-hour.
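The arithmetic behind the games-versus-examples gap is short. Assuming the number of examples per game scales with game length, the two ratios quoted above imply how much shorter the games became; the helper name is hypothetical.

```python
def relative_game_length(games_ratio, examples_ratio):
    """Ratio of examples per game between the faster configuration and the baseline."""
    return examples_ratio / games_ratio

ratio = relative_game_length(games_ratio=2.9, examples_ratio=1.9)
print(f"examples per game: {ratio:.2f}x")  # prints 0.66x, about a third fewer per game
```

Each game in the faster configuration contributed roughly two thirds as many examples, which is consistent with the shortened games described in the failure mode.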
Moving evaluation to Apple’s neural engine was slower per call than fused PyTorch on the GPU, not faster. Its only plausible advantage is overlapping with concurrent training — an as-yet-unproven systems bet, not the latency win first hoped for. An earlier performance “haircut” estimate was likewise withdrawn once it proved to be a measurement artifact.
Levers that didn’t earn their keep
Forcing varied first moves widened opening diversity (13 distinct first squares, up from 2) but did not touch the core collapse — nobody was forcing the agent to defend. Killed after eight epochs.
The 1.5 M-position buffer turned over roughly 28× in a single run, evicting useful diversity before it could be retained. The lesson (size the buffer to the generation rate) fed the whole-board sampling that eventually worked.
Head-to-head racing retired a parade of plausible ideas: a Mish activation (a fast-but-misleading anchored climb that finished last in real play), a narrower search-breadth setting, a soft-policy target, a continuous-threats (VCT) solver, and a value-only evaluation overlay that boosted attack while wrecking defence. Each looked reasonable; none survived a fixed-baseline check.
A standalone GPU-broker daemon was written and then superseded before it ever ran — the racing loop re-derived the scheduling it needed. A reminder that the cheapest verification is the one that stops you building the wrong thing.
Why it compounds
The discards in § 5 are not a confession; they are the point. Because a fixed-baseline check costs a short probe and a receipt, the lab could afford to be wrong dozens of times and still converge — failure was cheap precisely because verification was cheap. That is Bidasaria’s ladder running in a research setting: verification first, so that the parallel racing of recipes is trustworthy, so that the whole thing can be left in a background loop.
The compounding shows up in the documentation, too. Raw evidence is frozen; a synthesis layer on top of it is rewritten each time we learn something, so the next session starts from the distilled lesson rather than re-deriving it from logs. A verified result becomes a foundation; a foundation becomes a starting point; the starting point makes the next result cheaper to reach. That is what it means for trust to compound — and it is why a single Mac, left running, could climb from a blank network to a perfect record against its baselines.
All three skills together end up with a system that does a lot of work without you manually on the keyboard.
References & notes
1. S. Bidasaria, “Stop babysitting your agents.” Talk, 20 May 2026. Transcribed locally with Whisper (no official captions); quotations are close paraphrases and may carry minor ASR artifacts. The verification → parallelism → background-loops progression is his; the “cockpit vs. autopilot” framing is ours.
2. Gomoku project wiki and training notebook. The maintained synthesis layer (topic pages, charter, research board) over an append-only experiment log. All numbers, run IDs, and verdicts in this report trace back to it.
3. Champion checkpoint. W&B run 44cxzc9d; recipe vcf → +global-pool → +value-discount, trained from scratch on an Apple M5 Max via PyTorch + MPS, logged to Weights & Biases throughout.
A note on certainty: figures are reported as the project recorded them; the lookahead-6 result is shown as a range because it depends on evaluation settings. Where the record was ambiguous, this report says less, not more.