Jackson Bates

Giving the Agent a Brief

There's a version of AI-assisted coding that looks like this: you open a chat window, describe what you want, and then spend the next hour in a low-grade negotiation — rephrasing, course-correcting, explaining what you meant by what you said. You get something at the end. Whether it's the thing you wanted is a different question. I wrote about this not working for me previously.

I've been trying to do it differently.

The thing I kept getting wrong

For a while, my approach to using Claude Code was pretty ad hoc. I'd have a rough idea, hand it to the agent, and see what came back. Sometimes this worked fine. For small, well-bounded tasks it still works fine. But for anything non-trivial — a new feature, a prototype, anything with more than a couple of moving parts — the results were inconsistent in ways that were hard to diagnose. Was the problem the model? The prompt? The fact that I hadn't fully thought through what I actually wanted?

Honestly, it was usually the third one.

The model isn't the bottleneck as often as I initially assumed. My spec is. A vague brief produces vague results, and then I'd spend more time fixing and redirecting than I would have spent just writing the thing myself. Which rather defeats the purpose.

Why not just use what Anthropic already provides?

Fair question. Claude Code now has native sub-agents, context compaction, persistent memory, a Tasks API — Anthropic is clearly thinking hard about this problem. And I've tried some of it.

The honest answer is: I don't trust it yet. Not as a fixed position, just as where I am right now.

Context compaction in particular makes me nervous. The idea is that when the context window fills up, the model summarises what it knows and carries that forward instead. In theory this is fine. In practice, I don't know what gets lost in the summary, and I have no good way to find out until something breaks quietly three tasks later. That's a hard thing to debug.

But there's a more fundamental issue underneath this, which is that token count matters beyond just whether you've hit the limit. Claude has a 200,000 token context window — there's a million token option for some workflows — but more tokens in the window doesn't just mean more capacity, it means more noise. I've found, and others in the community have reported the same thing, that the more you feed into that context, the more confused the model gets and the lower the quality of the outputs. There's a real degradation that sets in well before you hit any hard limit.

And this is where compaction creates a problem even when it works as intended. It reduces the token count, yes, but it doesn't reset it to zero — it compresses and carries a summary forward, so you start your next session with that weight already in the window. With my approach, when I clear context between tasks, each new session starts at around 5% of the token allowance. Running through the orchestrator loop, it's rare for any individual session to exceed 50%. If I were relying on compaction instead, there'd be a cumulative residue building up over time — a kind of sediment that you can't fully clear. That's what I'm working around.

The sub-agents and Tasks features are newer still, and the pace of change in that part of the product means anything I wrote about them last month might already be out of date. Your mileage may vary, obviously. But I'd rather shape my own framework around the failure modes I actually understand than trust a black box around the ones I don't — at least until I've built up more confidence in how it behaves.

Which brings me to the Ralph Wiggum loop.

The orchestrator script I wrote is, at its core, a bash loop. It calls Claude Code's headless mode as a subprocess, reads the exit code, updates a state file, picks the next task, and goes again. Simple. Durable. Survives restarts. Produces timestamped logs I can actually read. The name is a nod to the old "I'm helping!" character — the loop doesn't know anything, it just executes faithfully and keeps state in a file on disk where I can see it. There's something I find reassuring about that.
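The shape of that loop can be sketched in a few lines of bash. To be clear, the task names, state-file format, and the `run_task` stub below are my own illustrative assumptions; the real version shells out to Claude Code's headless mode where the stub sits.

```shell
#!/usr/bin/env bash
# Minimal sketch of the orchestrator loop: run a task, read the exit
# code, record state on disk, pick the next task, go again.
set -euo pipefail

STATE_FILE="STATE.json"
LOG_DIR="logs"
mkdir -p "$LOG_DIR"

# Stub standing in for the headless call, roughly:
#   claude -p "$(cat "$1.md")"
run_task() { echo "implementing $1"; }

TASKS=(task-1 task-2 task-3)
for task in "${TASKS[@]}"; do
  log="$LOG_DIR/$(date +%Y%m%dT%H%M%S)-$task.log"
  if run_task "$task" >"$log" 2>&1; then
    status="done"
  else
    status="failed"
  fi
  # Durable state: a restart reads this file and resumes from here.
  printf '{"task": "%s", "status": "%s"}\n' "$task" "$status" >"$STATE_FILE"
  if [ "$status" = "failed" ]; then break; fi
done
```

Nothing clever, which is the point: every decision the loop makes is visible in the state file and the logs.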

The primary reason I built it this way was to manage the context window aggressively. Each task runs in its own fresh Claude session. The implementation agent gets exactly what it needs — the task prompt, the shared context preamble, any feedback from a previous review — and nothing else. No accumulated conversation history. No compaction. No mystery about what the model currently knows. State lives in STATE.json, not in the model's memory.
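To make the "exactly what it needs" part concrete, here's a sketch of how a per-task prompt might be assembled. The file names (PREAMBLE.md, the task and review files) are assumptions rather than my actual layout, and the claude invocation is shown as a comment so the sketch runs without the CLI.

```shell
# Assemble the minimal context for one fresh session: the shared
# preamble, the task prompt, and any review feedback from a failed
# previous attempt -- and nothing else.
task="task-2"
printf 'shared context preamble\n' > PREAMBLE.md
printf 'implement the annotation sidebar\n' > "$task.md"
printf 'review: button overlaps the panel\n' > "$task.review.md"

prompt="$(cat PREAMBLE.md "$task.md")"
if [ -f "$task.review.md" ]; then
  prompt="$prompt
$(cat "$task.review.md")"
fi

# Real invocation would start a fresh headless session, something like:
#   claude -p "$prompt"
printf '%s\n' "$prompt"
```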

It's a little more plumbing to set up. But the tradeoffs are visible, which I prefer.

Plan first, then execute

The workflow I've landed on splits the work into two distinct phases, and keeps them genuinely separate.

Phase one is planning. Before any code runs, I write a Product Requirements Document with the agent's help — iterating on scope, describing the UI, thinking through edge cases. It sounds more formal than it is. The point isn't to produce a document; the point is to think clearly before I've committed to anything. The PRD ends up being the source of truth for everything that follows. A well-written one means the agent's task prompts are precise. A vague one means it has to guess, and it will guess confidently, which is its own kind of problem.

Phase two is execution. Once the plan is solid, a slash command reads the PRD and decomposes the work into a sequence of small, self-contained tasks — each one scoped to be completable in a single agent session, each one producing a working, verifiable increment. Then an orchestrator script runs through them: implement, review, pass or fail, retry if needed, move on.

The implement-and-review cycle is the bit I find most interesting. Each task runs two separate Claude sessions — one to build, one to critique. The review agent is specifically instructed to be critical and to produce a structured pass/fail verdict. If it fails, the implementation agent gets the review notes and tries again. It's a tight feedback loop that catches a surprising amount before I ever look at the code.
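As a sketch, that cycle looks something like this. Both agent calls are stubbed here (the review stub is hard-coded to fail once and then pass), and MAX_RETRIES is an assumption; in the real loop each would be its own Claude session.

```shell
# Implement/review cycle: build, critique, retry on a FAIL verdict.
MAX_RETRIES=3
attempt=0
verdict="FAIL"

implement_task() { :; }  # stub for the builder session
review_task() {          # stub for the critic session's structured verdict
  if [ "$attempt" -ge 2 ]; then
    echo "PASS"
  else
    echo "FAIL: missing empty state"
  fi
}

until [ "$verdict" = "PASS" ] || [ "$attempt" -ge "$MAX_RETRIES" ]; do
  attempt=$((attempt + 1))
  implement_task           # on a retry, the review notes ride along
  verdict="$(review_task)"
  echo "attempt $attempt: $verdict"
done
```

The structured verdict matters: a free-text review is easy for a loop to misread, whereas a PASS/FAIL token is something a bash script can branch on reliably.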

For tasks that produce visible UI, the review goes further: it actually opens a browser, navigates to the feature, takes a screenshot, and compares it against the design reference. This catches things a code review can't — layout regressions, broken interactions, a button that's technically rendered but positioned somewhere inexplicable.
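One way to do that capture, assuming Playwright's CLI is available and the app is serving locally (the URL and paths here are made up, and the browser step is shown as a comment so the sketch runs without one):

```shell
URL="http://localhost:3000/feature"   # assumed local dev server
SHOT="review/feature.png"
REF="design/feature.png"              # assumed design reference image
mkdir -p review design

# Real capture step, e.g. with Playwright's CLI:
#   npx playwright screenshot "$URL" "$SHOT"
# The review session then gets both images to compare:
#   claude -p "Compare $SHOT against $REF and report layout regressions."

: > "$SHOT"   # stub so the sketch runs without a browser
echo "captured $SHOT for comparison against $REF"
```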

What this actually feels like to use

I tried it properly for the first time on a writing coach prototype in February — an AI essay feedback tool with a two-panel layout and an annotation sidebar. Seven tasks, three phases, one morning. Two tasks needed a retry (the review caught real issues, not phantom ones). Three tasks had visual review. Total time from starting the orchestrator to the final checkpoint: about seventy minutes.

That's... pretty good? I'm still calibrating my expectations, to be honest. I don't think this replaces thinking. It just means the thinking happens earlier, in the planning phase, where changes are cheap. By the time the agent is writing code, I've already made the interesting decisions.

There are also checkpoints built in — moments where the orchestrator pauses and waits for explicit human approval before continuing. After the data model is done, for instance, it stops. Because if the data model is wrong, everything built on top of it is wrong too, and that's far cheaper to catch before anything has been stacked on it.
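A checkpoint can be as simple as the loop stopping to read a line from the terminal. The task name below is illustrative, and a non-interactive run deliberately defaults to "no":

```shell
# Pause for explicit human approval after a high-stakes task.
task="data-model"
proceed=0

printf 'Checkpoint reached after "%s". Continue? [y/N] ' "$task"
read -r answer || answer="n"   # no terminal attached: default to "no"
case "$answer" in
  y|Y) proceed=1 ;;
  *)   proceed=0 ;;
esac
echo "proceed=$proceed"
```

Defaulting to "no" means an unattended run halts at the checkpoint rather than barrelling past it, which is the failure mode you actually want.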


The honest version

I'm still figuring out the edges of this. It works well for prototyping and for bulk changes where the shape of the work is clear. It's less obviously useful for the kind of exploratory, "I don't know what I want yet" work where you need to see something to know whether you want it.

And the planning phase takes real effort. If you're hoping for a shortcut that lets you skip the thinking, this isn't it. If anything, it makes the thinking more visible, which is uncomfortable in proportion to how much you were previously avoiding it.

But that might be the point. The brief is the work. The agent is just very fast at executing it.

I'm also aware this might not be how I do it forever. The pace of change in Claude is genuinely fast, and some of the native tooling I'm sceptical of today might be exactly what I reach for in six months. I'll probably write about it when that happens.

Anyway. I'm keeping notes as I go.