Your Friend's Guide to Agentic Engineering

[Image: a hand-drawn notebook page with AI tool logos]

In early 2025 the quantity of software an individual developer can produce skyrocketed.

Today, you can have agents produce:

  • multiple features in parallel
  • for existing codebases
  • without physically typing
  • optionally from your phone.

But if you haven’t kept up with all of this, how do you catch up?

Half the guides and ‘field manuals’ online read like verbose AI slop. Many unquestioningly praise every related bit of tech they can find to mention, some undoubtedly to sell AI consulting & courses.

This is the answer to the catch-up quandary.

It’s free. It’s hand typed. It’s opinionated. There’s some spicy critique of tools, because friends tell it how it is.

This is Your Friend’s Guide to Agentic Engineering: derived from my experience, Twitter feed & a number of blog posts. Updated over time.

DISCLAIMER

*No content here on how absurdly well agents work now; that evidence is ubiquitous1,2,3. The scope here is narrow but deep: how to use agents effectively.

**Nor is this an index of every harness/tool/model that exists; it’s an opinionated guide to using what works, quickly.

Agents, Harnesses & Tooling Overview

We’ll start with some quick definitions so you know what I mean when I say ‘harness’ or ‘agent’. There is some conceptual overlap between the two.

Understanding the fundamentals of these concepts is valuable for a few different reasons, to my mind:

  • We’re still ‘early’ on agents for software production, AI’s native domain. You might think writing your own agent is fruitless when there are already so many harnesses; I would argue that with tools like OpenCode as a starting point there is tremendous upside, and we’ll discuss companies doing exactly this (building on OpenCode) in the Harnesses section.
  • We have barely started writing harnesses for purposes outside of software production, which is where e.g. YCombinator is investing heavily in now.
  • Knowing the fundamentals helps you evaluate your options and understand flaws in today’s harnesses, the tool you’re probably going to be writing the majority of your code with.
DEFINITION
Agent

An LLM running in a loop: choosing tools, observing results, and deciding what to do next until the task is complete.

Building your own agent is not the equivalent of scaling a mountain. Thorsten Ball (working on Amp at Sourcegraph) provides a great illustration of just how easy it is to implement (a simple version) for yourself, in How To Build An Agent.
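To make the definition concrete, here’s a minimal sketch of that loop in TypeScript, in the spirit of Thorsten’s post. `callModel`, `runSandboxed` and the tool wiring are stand-ins, not any particular harness’s API:

```ts
// A minimal agent loop: ask the model, run any tool it requests,
// feed the result back, repeat until it stops asking for tools.
// `callModel` and `runSandboxed` are placeholders for a real LLM API
// call and a sandboxed shell, respectively.
declare function callModel(messages: { role: string; content: string }[]): Promise<ModelReply>;
declare function runSandboxed(cmd: string): Promise<string>;

type ToolCall = { name: string; input: string };
type ModelReply = { text: string; toolCall?: ToolCall };

const tools: Record<string, (input: string) => Promise<string>> = {
  // e.g. a shell tool; a real harness would gate this behind permissions
  bash: async (cmd) => runSandboxed(cmd),
};

async function runAgent(task: string): Promise<string> {
  const messages = [{ role: "user", content: task }];

  while (true) {
    const reply = await callModel(messages);
    messages.push({ role: "assistant", content: reply.text });

    if (!reply.toolCall) return reply.text; // model is done

    // Execute the requested tool and feed the result back as context.
    const result = await tools[reply.toolCall.name](reply.toolCall.input);
    messages.push({ role: "tool", content: result });
  }
}
```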

DEFINITION
Harness

The infrastructure orchestrating interaction with the agent, including (non-exhaustively): the core loop, tool definition, execution of tools, management of session state, handling of permissions.

With this context, it probably isn’t a surprise to learn that the harness you use impacts the quality of the agent output, measurably. There are consequential, opinionated decisions being made on what reaches the model at this layer of the cake.

We don’t have to sit and wonder what a production implementation might look like: OpenCode is fully open source and very popular (~650k users and growing, nearing 1M MAU).4

Harnesses: Codex, Cursor, OpenCode, Amp, Claude Code

As a litmus test to tell how familiar you are with the state of harnesses, consider the below quote. I don’t necessarily endorse it, but it’s also not an extreme viewpoint (it was probably a deliberately provocative comment to hype a video):

"

…if you’re still using an IDE to develop code by January 1st (2026), you’re a bad engineer.

"
— Steve Yegge

There are a number of harnesses worth discussing, but we’re going to start with OpenCode as it’s open source & model agnostic, which we can take advantage of to understand its constituent parts. Now that OpenCode are slowly rolling out their own token subscription plans (OpenCode Black - but very limited availability atm)5 it’s firmly becoming my #1 recommendation.

Clients: TUI, Desktop, Visual Studio Code, Zed, Cursor, Windsurf, VSCodium, GitHub Actions
OpenCode has two great clients to use, a TUI based on the team’s (also open source) OpenTUI, and the Desktop app (Windows, Mac, Linux) alongside extensions for VS Code, Cursor, Zed, Windsurf, VSCodium.

The below visualisation is intended to illustrate the core loops of the harness; needless to say, there is some abstraction.

[Interactive diagram: the harness orchestrates the agent loop, managing tools, permissions, and state. The client (TUI / IDE / Desktop) POSTs a prompt and streams the response; the harness session loop fetches history, resolves the agent, and runs a processor that turns model output into reasoning/text parts, executes tool requests (e.g. bash) via the agent, and records usage on finish; the model generates thinking, text, and tool requests.]

There’s some really nice decoupling of the harness from not only models, but also the client. To interface with models, OpenCode utilises Vercel’s AI SDK. As a decoupled interface to the client side, OpenCode runs an HTTP server (built with Hono) to expose harness sessions to potentially multiple clients. This is great for flexibility: running OpenCode on your own machine, or a VPS, affords the ability to share sessions between your IDE, terminal, or likely a phone client (in the future), ideally over a VPN like Tailscale.

The HTTP server is implementing the open source Agent/Client Protocol to achieve client agnosticism, which specifies generic capabilities of both agent and client. OpenCode is implementing the ‘agent’ side:

  • initialise,
  • new session,
  • load session,
  • new prompt,
  • setModel,
  • setMode (plan/execute).
[Interactive diagram: clients (TUI, Desktop, VSCode, Zed, Cursor) talk to the HTTP server over HTTP and subscribe to SSE events; the server exposes endpoints such as GET /session, POST /session/prompt, GET /tools and SSE /event, dispatching into the harness (session loop + agent + tools + processor, per the lifecycle above).]
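To make that shape concrete, here’s a rough sketch (not OpenCode’s actual source) of a Hono server exposing a couple of those endpoints; `runAgent` and the in-memory session store are stand-ins for the real session loop:

```ts
import { Hono } from "hono";
import { streamSSE } from "hono/streaming";

// Sketch of a harness exposed over HTTP so multiple clients can attach.
// Routes mirror the diagram above; the store and agent loop are stand-ins.
declare function runAgent(task: string): Promise<string>;
const sessions = new Map<string, { id: string; messages: string[] }>();

const app = new Hono();

// List existing sessions so a client (TUI, IDE, phone) can re-attach.
app.get("/session", (c) => c.json([...sessions.values()]));

// Accept a prompt for a session and kick off the agent loop.
app.post("/session/prompt", async (c) => {
  const { sessionId, prompt } = await c.req.json<{ sessionId: string; prompt: string }>();
  const session = sessions.get(sessionId) ?? { id: sessionId, messages: [] };
  session.messages.push(prompt);
  sessions.set(sessionId, session);
  const answer = await runAgent(prompt);
  return c.json({ answer });
});

// Stream harness events (tool calls, text parts) to any subscribed client.
app.get("/event", (c) =>
  streamSSE(c, async (stream) => {
    await stream.writeSSE({ event: "status", data: "connected" });
    // a real harness would push events here as the session loop produces them
  })
);

export default app; // served by e.g. Bun or a workers runtime
```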

Putting the architecture aside, OpenCode often provides access to decent, free models to work with. At the time of writing this included:

  • Big Pickle,
  • GLM-4.7,
  • Grok Code Fast 1 &
  • MiniMax M2.1.

It’s important to keep in mind that you can bring your Claude* or Codex subscription to OpenCode, and use the subscription’s models (Opus, Sonnet, GPT 5.2 etc depending on subscription) via the OpenCode harness. OpenCode is agnostic to models (besides recommending some that integrate particularly well) unlike Claude Code and Codex.

Aside from folks regarding the OpenCode harness as on par with Claude Code & Codex, the value proposition of having a hackable harness is tremendous. Ramp wrote up their approach to this to produce Inspect.

RABBIT HOLE
If you want my 2c on what hacking on OpenCode for your company might achieve

A few ideas that came to mind, prior to Ramp publishing their article, on what you could hack into a company harness:

Infrastructure A few years back I remember reading about Zack Kanter’s Stedi handling IAM permission refinement via Slack alarms triggered when an action is denied. In larger engineering teams it would probably make sense to have a custom slash-command to automate infrastructure requests generally, with custom UIs (execution is cheap now!) for infra teams to manage this.

Support In startups it’s common to have a bunch of cobbled together scripts for spinning up new customers, alongside other operations. Centralising access to these in the terminal, where everyone’s coding from, feels like it’d be a massive convenience win.

Regulated Industries I discuss this a little in the Amp section, but some industries have tight reporting requirements regarding risk. This is extremely challenging to do well because risk is everywhere, but telling an organisation this leads to everything or nothing being raised. Often nothing because filling out risk reports is a giant PITA. A saner approach might be a custom Risk Agent maintained by engineering/risk, with refined-over-time context to assess severity. Remembering that OpenCode is client agnostic, it’s possible this could make its way to the browser for non-developer distribution. It might work too well though, if you know what I mean.

Monitoring generally One of my least favourite activities is setting up Slack alarms. Too noisy and they’re inevitably muted, and while you might be able to tune the source and merge some channels, you might not (non-configurable, bureaucracy to change infra, list goes on). What might, and I stress might make more sense is client-side tuning via an LLM, decoupling the producer noise level from the entire receiver list. I would love to be able to have an LLM analyse how often I’m getting X notification and adjust my desktop notification for that channel accordingly, without making that decision for everyone else.

Security Third party packages are what worry me the most. Pre-LLMs it just seemed like an insurmountable mountain: do you take latest versions to quickly receive vulnerability fixes, or do you delay so as not to receive a short-lived malicious release? What might make sense is having an LLM to query for the capability you need, checking a centralised org-wide package list, and if nothing meets the need, at least have an agent audit the source code pre-installation. The agent would also get triggered on detection of new versions to do a scan before green-lighting usage.

Hacking on it isn’t solely the domain of large companies; one dev went ahead and spent $24,000 worth of tokens to build oh-my-opencode. My favourite feature of this plugin, borrowing conceptually from Amp perhaps, is the agent specialisations (a handful, each with a different purpose / system prompt) to best answer your prompt - elaborated on in the Amp harness section.

To summarise: OpenCode is a harness you’ll want to be across. As an open source solution it’s likely to scoop up other harnesses’ features (and has been, e.g. thread sharing from Amp), the free models are pretty good, and the opportunity to hack on it is a killer feature for dev teams to build a platform on.

*See the Claude section below for why Claude is struck through. Even though OpenCode circumvented the Anthropic block, doing so with your account is a TOS violation.

Codex

Clients: TUI, Visual Studio Code, Web, iOS, Android, GitHub Actions, SDK/API

If you don’t want to read more about harnesses, but want access to a leading model, ideally have your employer pay for Codex Pro, install it in your terminal of choice and you’re ready to go.

The primary reason I suggest this to get started, aside from the strength of the underlying model, is that the ~$200 plans are tremendous value once you start chewing through tokens on long-running tasks, especially running tasks in parallel. The price you are paying for these is far lower than the equivalent token capacity via API consumption pricing, assuming decent consumption of your limits, which is why power users subscribe to multiple plans.

Comparing strictly the labs’ model/harness offerings: Claude and Codex seem to leapfrog each other regularly, but the underlying models are now good enough that you don’t need to spend time deciphering which one is ‘best’ month-to-month anymore; you can achieve what you need to do with either. If you simply must know the best option right now, a recent review of Codex with GPT 5.2 by Peter Steinberger would seem to put it ahead of Claude, for the moment.

Additionally, many ecosystem tools are now built to use your Claude Code or Codex subscription to leverage the model on your behalf, at less than the API pricing of token consumption. Standout examples include:

  • Conductor which we discuss later on in Orchestration,
  • OpenCode as discussed earlier,
  • Clawdbot by Steinberger (who wrote the Codex review just above), which is sort of like an open source Poke leveraging Claude/Codex in your chat apps (SMS / WhatsApp / Signal etc). We discuss Clawdbot further below at the end of this chapter.

I also enjoy Claude Island as a sort of HUD living in my MacBook’s notch for when Claude Code needs input or is done, until my cancelled subscription expires, but there’s a fork for OpenCode too.

Cursor

Clients: TUI, Desktop, Web, GitHub Actions

Probably doesn’t need much introduction given how Cursor took off in 2025. It’s the most prominent VSCode fork, and still the solution I use to review large amounts of code, and to write code by hand (albeit not for learning new languages; for that I use Zed with AI turned off) as Cursor’s autocomplete seems to be better than any other.

Like OpenCode, Cursor is model agnostic. Cursor also has clients for mobile, Slack and issue trackers, and they recently acquired Graphite (code review AI). I admittedly haven’t used the visual editing features recently released as I mostly work from Claude Code in Ghostty & Conductor.

If you’re most comfortable in an IDE, Cursor is what I’d recommend, with the caveat that any harness offering model consumption via the models’ APIs (as opposed to a lab-specific subscription) is seemingly absurdly expensive.

Amp

Clients: TUI, Visual Studio Code, JetBrains, Neovim, GitHub Actions

Boy do I feel bittersweet about this one; opinions incoming.

Throughout 2025 Amp was, for mine, the harness producing the best output from coding models. I genuinely believe the folk at Amp are ahead of the competition in thinking about how to leverage the mechanics of the harness and different models’ strengths.

The crux of differentiation with Amp is that you don’t select a single model to answer your prompt. The Amp team, through their own usage and probably some automated evaluations, maps models to ‘roles’ used to answer your prompt, the most prominent of these probably being ‘the Oracle’ - a sort of deep thinking role. This feature is already being incorporated elsewhere, e.g. oh-my-opencode plugin for OpenCode, but also the ever-productive Peter Steinberger’s Oracle.

Everything about Amp’s approach, to me, feels ‘correct’. I don’t want to spend my time figuring out what models to use, and I definitely don’t want to figure out granularly what role each model is best at, whilst fully accepting that a mix-of-models approach is probably superior to a universal model approach.

The reason I’m not using Amp is ‘only’ because my personal monetisation strategy, and that of any company I’ve worked for, isn’t Mitchell Hashimoto’s for Ghostty,6 who incidentally does use Amp to build Ghostty.7 Pricing is the issue. Akin to what I mentioned re: Cursor, it feels like harnesses consuming model’s API pricing are at a structural pricing disadvantage that is really hard to overcome.

RABBIT HOLE
If you want to read my rant about Amp's pricing problem & strategy

Mid 2025 I happily burned hundreds on Amp because it was leaps and bounds ahead of the competition, from personal experience. I don’t know if that’s still the case, because uni-model Opus 4.5 & GPT 5 are so good that I don’t need to figure this out.

Does being really price conscious, for a tool multiplying productivity so effectively, make sense? Obviously that’s subjective, but the problem I’m seeing is that even people spending thousands on tokens are doing it via multiple Codex/Claude subscriptions.

Subscriptions are fairly coarse-grained units to scale token consumption, with a fail-safe (exhausting the plan’s usage limit) and without the potential of uncapped spend.

Amp has what I’d hesitate to call ‘an answer’ to this: $10 of models including Opus 4.5, per day, for free (ad-supported). Strategy-wise I don’t see what it achieves; I doubt even the average user of these tools is taking the tool-switch overhead for $10 of tokens. Spelling out that users are consuming way more than $200 worth of tokens on $200 plans,8 and then offering $10 a day, is barely solving for the market if/when subscription plans don’t exist. That market is not today, and there is no indication such a market exists tomorrow given subscriptions are spreading (OpenCode Black, GLM, Minimax, and many more).9,10 While I love the Amp team’s thinking on harnesses, this thinking genuinely makes me worry for them.

Amp’s pricing does not ‘work the way it does’ as insurance against rug pulls; Amp is competing against vertical integration of manufacturing (model lab) and retail (harness). Amp would struggle to leverage the lab’s retail pricing (even if it weren’t against Claude Code’s TOS) because of their strategy. If you, as a user of Amp, can bring your own key and you do provide that key, you’re probably expecting it to be used to subsidise the token spend, but how do you square that with Amp dynamically changing what model gets used, especially if it isn’t a model you’re subscribed for? That’s a shortcut to support ticket armageddon, and even if you could manage the tickets with AI, the reputation damage from misunderstanding amongst consumers would be horrendous.

Consumption of tokens is only increasing; at best people will try Amp but realise the economics don’t work out and go no further. It’s a really tough position to be in, and it can’t be solved by Amp building their own model: their ‘shtick’ is multi-model.

If you are optimising for the best model output, throwing the concern of token cost to the wind is entirely the right approach. This was a deliberate decision that made Amp’s harness output stand out in stark contrast to, for example, Cursor’s output at the time when the latter was (historically) trying to minimise the incurred API token cost from customers paying a flat $20 monthly plan.

The problem Amp faces is far from trivial. The output performance edge of leveraging multiple models is declining, as you would expect over time as frontier models converge on ability. Labs are barely showing signs of abandoning their structural pricing advantage, for my money, because the data coming back to them informs what harnesses for other industries should look like. Meanwhile, harness innovations that Amp are genuinely leading the field in are quickly reproducible in open source harnesses.

I genuinely hope they have the answers to these challenges.

We can’t discuss Amp without mentioning their thinking on threads. Look how far they’re iterating on this; treating threads as an addressable harness entity is taste in harness development. Shareable threads are important to help your team understand the power of good prompting, but I think there’s so much more to explore here. If you’re operating in a regulated industry, surely we need to see threads re-played through the lens of risk/auditing/training model roles (similar to the Oracle!) the same way business calls to customers are recorded for training staff, and for legal. Can Amp capture that value? Possibly via SDK!

To round this out: if you’re willing to spend hundreds to thousands per month per developer (i.e. you’re not price sensitive at that level) then Amp is probably going to provide you the best output. Does it make cost/value sense? I’m less certain. But you’d be supporting Thorsten, and you can’t not love Thorsten.

Claude Code

Clients: TUI, Desktop, Visual Studio Code, JetBrains, Neovim, Vim, Web, iOS, Android, GitHub Actions, SDK/API

I originally wrote a section which both Codex and Claude Code shared as default recommendations, then Anthropic pulled the pin on OAuth usage. This whole bloody post was littered with my preference for Claude Code.

  • It is a great harness. This blog post from Ado Kukic is great for scooping up non-obvious features.
  • Not open source
  • Can’t use your subscriptions outside of Claude Code.
  • Aside from above, the $150 plan is great token value (but Codex lets you use other tools with their plan).
  • The VS Code/Cursor etc. extension leaks memory like wild (it happened to me repeatedly before I found the referenced GitHub issue, and moved to the TUI) and has done for ~5 months.11
  • But the folks iterating on Claude Code at Anthropic and posting on X are great value to follow. I seriously doubt they steered or necessarily agreed with Anthropic’s decision re: OAuth use, it would be great to see them cop less flak.
  • I’m not writing more about Claude Code, except the below drama summary.
RABBIT HOLE
If you want to know more about the Anthropic/Claude Code subscription plan drama
[Image: Grok illustrating my interpretation of Anthropic’s latest dev-rel strategy (“Claude Code Detonator”)]

It’s not a secret that Claude Code and Codex plans are heavily subsidised, and are likely loss-leaders, which makes it understandable that Anthropic doesn’t want to subsidise the experience of other tools.

What’s going to hurt Anthropic’s reputation here, amongst developers, is the manner in which they’ve shut this off: by claiming it’s purely a matter of other tools not submitting the usual telemetry to make ‘abuse’ cases manageable.12 All this after folk had already built widely consumed tools on top of this mechanism (it was left open for months), such as Clawdbot.

You might think this is all a storm in a tea-cup, but it doesn’t look that way to me.13 OpenCode is nearing 1 million MAU.14 How many on Clawdbot, Conductor, N other tools leveraging Claude? This was a loss of subscription value broadcast WIDELY.

The Claude Code team have mentioned having ‘DMs open’ for other harnesses to discuss relaying the required telemetry, but…surely you’d just publish the requirements, even privately? We’re talking a few DMs to Dax & Co to cover 1M MAU, many of whom would be using Claude. It feels disingenuous.

Coincidentally there are reports, in the same week, of Anthropic cutting off xAI’s staff specifically from using Claude within Cursor.15 That’s API pricing, not subsidised, so maybe this is categorically different.

Codex is now enjoying the fruits of this fumble, openly declaring intent to allow exactly what Anthropic prohibited.16 OpenCode can now leverage Codex tokens with OpenAI’s blessing. My employer’s subscriptions will be moving there or to OpenCode Black, as well as my personal one.

At the end of the day Anthropic are entitled to put & enforce whatever they want in their own terms of service, but I can’t imagine this was worth the consequences. I would struggle to consider building an LLM wrapper of any kind on Anthropic’s ecosystem now, and I doubt I’d be alone in that sentiment. It’s too hard to shake the mental image of a jealous incumbent, strictly offering service if it’s in their best interest (and it makes you wonder what they’d use SDK data collection for, even though the other labs are probably similar).

I don’t think we’ve arrived at the last chapter of this story, especially if other labs/models obtain enough training data from consumers to no longer derive value from subsidised tokens. But even if that’s the end point: Anthropic blinked too hard, too early.

Typing -> Voice Prompting

Coding via voice isn’t necessarily new, but adopting it previously was a high-friction exercise involving learning verbal shortcuts to navigate your machine, all of which is documented (exceptionally!) by Josh W Comeau.

Coding via voice with agents is frictionless to adopt: you simply dictate English to be ‘pasted’ into your terminal or IDE’s agent input textbox, which is then post-processed by an LLM to tidy it all up. My tool of choice currently is Wispr Flow, which seems to have the most momentum in the market. Gergely Orosz has a great write-up on devs using Wispr Flow at their offices with insight on adoption rates, microphones to use etc.

If you’d prefer a free option, I’ve heard good things about Spokenly.

Just bear in mind coding via voice isn’t equivalent to navigating your machine via voice, so we’re not invalidating the tools (Talon et al) covered by Josh.

Prompting on your phone

Prompting from your phone can be done simply, or in a slightly more involved way that gives you better control over the sessions. The simplest version: probably downloading Omnara with your Claude Code subscription (seems to be limited to Claude Code currently). The quickstart is super straightforward and the only drawback, if you can call it that, is the need to pay for sessions beyond 10 a month.

Be aware that as of 09/02/26 Anthropic has started cracking down on subscription use with other apps; Omnara probably needs to move off Claude!

At the cost of only a little more effort, you can just SSH into your laptop from your phone over a VPN like Tailscale. See Nicholas Khami’s blog post covering this.

Personal Assistant: Clawdbot

If you’re looking for a personal assistant that can utilise a browser, check your emails and messages generally, manage your calendar and a hell of a lot more: Clawdbot is seriously worth a look. Folk usually run it on their own machines (a VPS is fine too of course, especially with Tailscale) but I’d definitely consider segregating it from anything important, whether physically or with sandboxing (I need to add a general security chapter at some point).

If you want to know what can be done with it, there is a showcase. Like Poke, the ability to leverage MCPs for additional functionality really opens up a ton of possibilities here. There’s a site listing published open source skills for your Clawdbot agent here, utilising npm for distribution. I’d think about having a coding model audit the source code of these depending on what you’re exposing.

Having had a look at the community Discord, the recommendation seems to be to use a cheaper model with Clawdbot. There are a few of these: Minimax, Z.ai and KimiK2 for instance, which are all (I believe) models from China. If you’re okay with this, the recommendation seems to be Minimax at the moment, with plans starting at $3AUD monthly (promo pricing). This is a referral link that knocks 10% off all plans. If you’re more security conscious, it can integrate with local models via e.g. Ollama or LM Studio.

Practice & Techniques

Context Window Management

For most of 2025, those building harnesses and writing about it emphasised the importance of using sessions liberally to keep the context use minimal (quality degrades as the context window fills). Today, most harnesses have a compaction mechanism in-built, automatically compressing the context window to keep the model focused. In addition to this, it’s common for a harness to use sub-agents (a dedicated context window with a purpose-specific prompt) to isolate token use incurred in e.g. searching code, or interviewing the user to clarify the true prompt intent, with the results fed back into the main context window.

I would argue that context management is still a critical skill, and it becomes more important the more complex your underlying task is.

Amp have a great blog post about this, and the key takeaway is that when you have a thread of messages in a session, each additional message (including the model’s replies) has to be re-interpreted on every subsequent turn:

[Interactive diagram: each time you send a message, the model receives and processes the entire conversation history. Turn 1 (“Hello!”) is ~5 tokens; turn 2 adds the assistant’s “Hi there!” for ~10 tokens re-read; by turn 3 (“What’s 2 + 2?”) the model is re-reading all prior messages, ~20 tokens and climbing.]

I’m not sure if there’s a phrase for this: I don’t mean the runtime complexity. More accessibly, it could be conveyed as a sort of token re-interpretation tax. In any case you can start to see why sub-agents (really even just sub-context windows) and new windows are important.
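As a rough back-of-the-envelope illustration: if every turn adds a fixed number of tokens, the total the model has to process over a session grows roughly quadratically with the number of turns (the numbers below are made up):

```ts
// Rough illustration of the re-interpretation tax: if each turn adds
// `tokensPerTurn` tokens of new content, turn n re-processes everything
// that came before it.
function totalProcessedTokens(turns: number, tokensPerTurn: number): number {
  let total = 0;
  for (let n = 1; n <= turns; n++) {
    total += n * tokensPerTurn; // turn n re-reads all n turns so far
  }
  return total; // ~ tokensPerTurn * turns * (turns + 1) / 2
}

console.log(totalProcessedTokens(10, 500)); //  27,500 tokens processed
console.log(totalProcessedTokens(50, 500)); // 637,500: 5x the turns, ~23x the tokens
```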

When, for example, you ask an agent to search your codebase, it’s often producing tokens to explain what it’s searched and what it’s doing next in addition to all the code it’s adding to the context window. Similarly, bash tool use is often producing tokens for the agent to consume to understand what’s happening in response.

[Interactive diagram: tool use adds tokens to context, both the agent’s explanations and the tool results. Without a sub-agent, “Find where auth is handled” (~8 tokens) leads to the agent narrating its search (+15, +18, +22 tokens) and pulling the auth.ts and middleware.ts tool results into the main context (+380, +290 tokens). With a sub-agent, the main context only sees the prompt (~8 tokens), “Spawning search sub-agent…” (+6 tokens) and the sub-agent’s summary (+28 tokens: “Auth handled in auth.ts:45, uses JWT middleware from middleware.ts for token verification”), while the ~700 tokens of searching and reading stay isolated in the sub-agent’s context.]

Worth pointing out we’re using extremely basic examples in these animations with tiny token counts. When I used AI to fix up this article’s indexes (linking to the section titles), 3 sub-agents used up 85k tokens. My blog with one post is obviously not as large as your codebase.
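The pattern itself is easy to sketch. Assuming the same hypothetical `runAgent` loop from earlier, the sub-agent burns its tokens in an isolated context and hands back only a short summary:

```ts
// Sub-agent pattern: do the token-heavy exploration in an isolated
// context and feed only the conclusion back to the main context window.
declare function runAgent(task: string): Promise<string>;

async function searchWithSubAgent(question: string): Promise<string> {
  // This call may burn tens of thousands of tokens reading files,
  // but none of that lands in the parent session's history.
  const summary = await runAgent(
    `Search the codebase to answer: "${question}". ` +
      `Reply with at most three sentences citing file paths and line numbers.`
  );
  return summary; // only this short string re-enters the main loop
}
```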

Skills, MCP & Progressive Disclosure

Model Context Protocol is a specification detailing how an AI client can connect to a server to consume tools, resources and prompts.

Aside from leaving auth entirely out of the spec, the biggest problem with early MCP implementations was that enabling them loaded their tool definitions into the context window by default and ate a ton of tokens. I’ll defer to Geoffrey Huntley who summarises the whole thing excellently here if you’re interested.

Anthropic then came along with ‘Skills’. For what skills are, it’s best just to read the Anthropic engineering blog post.

Why were skills superior out of the box? In a nutshell: because their specification accounted for lazily loading the tool definition instead of eagerly loading it. The skill description acts as an ‘index’ for the agent to know when it should read the relevant skill.md file, saving a bunch of context when a skill isn’t needed. Skills were originally a Claude thing, but given the mechanical advantage, it’s not surprising OpenAI adopted them per Simon Willison. Claude Code’s system prompt and skills are here if you’re curious what Anthropic’s look like.

The core takeaway here is the primitive of progressive disclosure, it’s a generically useful concept for managing the context window.
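As a rough sketch of that mechanic (not Anthropic’s implementation; the skill names, descriptions and paths below are made up), the cheap index is all the model sees up front, and the full SKILL.md is only read when the agent decides it needs it:

```ts
import { readFile } from "node:fs/promises";

// Progressive disclosure: the index below is all the model sees up front.
// Names, descriptions and paths here are hypothetical.
const skillIndex = [
  { name: "pdf-report", description: "Generate branded PDF reports", path: "skills/pdf-report/SKILL.md" },
  { name: "db-migrate", description: "Write and review SQL migrations", path: "skills/db-migrate/SKILL.md" },
];

// Cheap: a couple of dozen tokens injected into the system prompt.
export function skillIndexPrompt(): string {
  return skillIndex.map((s) => `- ${s.name}: ${s.description}`).join("\n");
}

// The expensive part is deferred: the full SKILL.md is only loaded when
// the agent asks for that specific skill.
export async function loadSkill(name: string): Promise<string> {
  const entry = skillIndex.find((s) => s.name === name);
  if (!entry) throw new Error(`unknown skill: ${name}`);
  return readFile(entry.path, "utf8");
}
```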

A great example of progressive disclosure is Cloudflare’s recent MCP server, teased by Matt Carey, using only 1,000 tokens to dynamically expose an API spec equivalent to 2.3 million tokens. The implementation is really clever: your agent’s context window is given essentially type hints as to how to query the Cloudflare API spec or the API itself. Think of it as similar to exposing only a library’s type declarations but not the actual library code, as the latter would be context heavy. Claude generates code using those type hints, the code is sent over the wire to a worker, and the worker then executes the actual code with access to the entire API spec locally, or queries the Cloudflare API directly and returns results. The technique is based on the Code Mode school of thought, which Vercel’s recent article on orienting agents around filesystems seems to echo: the gist being to get agents working in a domain they’ve been trained on thoroughly: code and filesystems.
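Roughly the shape of the Code Mode idea, heavily simplified and not Cloudflare’s actual code: the model only ever sees a handful of hypothetical type declarations, writes a short script against them, and that script runs wherever the full spec and credentials live:

```ts
// What the model sees: a screenful of declarations, not the 2.3M-token spec.
// (Hypothetical names; the real server's surface will differ.)
declare function searchApiSpec(query: string): Promise<{ path: string; summary: string }[]>;
declare function callApi(path: string, params: Record<string, unknown>): Promise<unknown>;

// What the model writes: a short script using those declarations. The harness
// ships this code to a worker that actually holds the spec and credentials.
export async function listWorkerScripts() {
  const hits = await searchApiSpec("list workers scripts");
  const scripts = await callApi(hits[0].path, { per_page: 50 });
  return scripts;
}
```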

Feedback Loops

Feedback loops are absolutely critical to course-correcting agents on failing coding tasks; it’s really not a lot different than tests. Failure is often novel information. Frequently the hard part is ensuring the agent has access to feedback in its context window.

One of the best course corrections I’ve experienced so far was during the process of building an Electron app on my MacBook. As you might imagine, the Windows version was littered with niche errors that were proving extremely expensive (time-wise) to send across to Claude Code on my MacBook. The solution: a Windows VM, not running Claude itself, but open for Claude on my Mac to SSH into (using the Tailscale machine name) to run the Windows Electron app directly and receive error output. This enabled Claude to make changes for Windows, test on Windows, then test on Mac to verify no breaking changes.

Sunil Pai wrote a great blog post recently adapting Steven Johnson’s ‘Where Good Ideas Come From’ as a mental model to apply to crafting the context itself, and I particularly love the emphasis placed on how much impact the process of using errors as a feedback loop makes.

Tooling-wise, Graphite and CodeRabbit are popular AI review tools; I prefer the latter. But I also wonder if something like this would act as a faster feedback loop.

We’ll discuss deeper use of feedback loops in the next blog post on long-running (hours) tasks.

Orchestration

Build multiple features for one repo concurrently (Git Worktrees & VMs)

When you’re working on a repository, an instance of the repo in your filesystem can only be on one branch. This can make utilising multiple agents tricky: you usually want each to have a clean copy of the repo to work on and be able to commit changes without overlapping with (stashing and hence interrupting) other agents’ work. There are some alternative approaches that keep all the work within the same filesystem and rely on agents cross-communicating (locking) files and so on, but it’s very nascent and (in my opinion) not established enough as viable for non-greenfield work streams.

Git worktrees are a solution: additional working copies of the same repository on the filesystem, each checked out to a different branch.
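Under the hood it’s only a couple of git commands per agent. A minimal sketch of spinning up and tearing down one worktree per feature branch (TypeScript shelling out to git; the directory layout and branch names are my own, not Conductor’s or CodeRabbit’s):

```ts
import { execFileSync } from "node:child_process";

// One worktree per agent task: each gets its own directory and branch,
// all backed by the same repository, so agents can commit independently.
function createWorktree(repoPath: string, branch: string): string {
  const dir = `${repoPath}-worktrees/${branch}`;
  execFileSync("git", ["-C", repoPath, "worktree", "add", "-b", branch, dir], { stdio: "inherit" });
  return dir; // point an agent session at this directory
}

function removeWorktree(repoPath: string, dir: string): void {
  execFileSync("git", ["-C", repoPath, "worktree", "remove", dir], { stdio: "inherit" });
}

// e.g. createWorktree("/code/myapp", "feat-billing-export")
```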

The problem with worktrees is that they’re not ergonomic to manage using git alone. The folks at CodeRabbit have a CLI to make it a lot easier; however, my team and I have really enjoyed using Conductor. With the Conductor GUI, you get a nice list view of all the repositories you want quick access to and their worktrees, affording quick switching across sessions with good diff views. It may not have entirely liberated me from the IDE, but it’s very good.

If you prefer operating out of the terminal entirely, a similar option is using a terminal with split-pane capability (like Ghostty) and prompting your agent to utilise CodeRabbit’s CLI above.

Using either option, I’ve been able to comfortably drive ~5-8 agents in parallel across a handful of repositories, where the bottleneck becomes my capacity to review the changes.

I’d soon expect to see most/all coding agents offer the ability to work on codebases in a cloud VM (i.e. a web version), like Claude Code & Codex do currently. This also opens up friction-free concurrency, and the creator of Claude Code, Boris Cherny, has a must-read thread on Twitter describing how he uses Claude Code instances locally + the web version simultaneously, and moves sessions between them.

Who to Follow

I thought about segmenting this into blog/twitter but most bloggers have twitter so here we are.

The below is to keep up with AI engineering from those writing about it. I’ve made a list of every account mentioned here + some extra, then I put myself on it for sport. I did promise cheek.

Here you go: https://x.com/i/lists/2009825440207340023

Write In!

If you’re a bit of an authority on these topics and want to correct the record or offer insight, feel free.

author@abrown.blog

Next Blog Post: Executing Huge Tasks with Agents

In the next blog post I cover a piece I had to segregate from this article, which essentially disaggregates the ‘Ralph Wiggum’ technique you’ve probably heard about. We cover when it’s actually useful, and what parts of it apply more broadly.

Expect it within the next two weeks!

Footnotes

  1. Principal Google engineer Jaana Dogan on X: ”…We have been trying to build distributed agent orchestrators at Google since last year … I gave Claude Code a description of the problem, it generated what we built last year in an hour.” https://x.com/rakyll/status/2007239758158975130

  2. Geoffrey Huntley builds a programming language, mid 2025, with 3 months of continuous Claude Code and the right technique. https://ghuntley.com/cursed

  3. Lee Robinson: “In five days over the holiday break, I built a Rust-based image compressor. I wanted to see how far I could go using only coding agents. I did not write any code by hand. After 520 agents, 350M tokens, and $287 I can now say… extremely far.” https://leerob.com/pixo

  4. https://x.com/jayair/status/2006853068026265700

  5. https://x.com/opencode/status/2009674476804575742

  6. In a since-deleted tweet, Hashimoto responded to a (possibly brash) query regarding Ghostty’s monetisation strategy, with a tweet along the lines of “My monetization strategy is that my bank account has 10 figures”. I think we can all appreciate why it no longer exists, and he has my apologies for dredging it up, but one shouldn’t paint the Mona Lisa just to throw away the canvas.

  7. Mitchell Hashimoto’s Amp Profile, with public threads https://ampcode.com/@mitchellh

  8. https://x.com/connorado/status/2009707658530754754

  9. GLM coding plans https://z.ai/subscribe

  10. Minimax coding plans https://platform.minimax.io/docs/coding-plan/intro

  11. Claude Code extension memory issues https://github.com/anthropics/claude-code/issues/8722

  12. https://x.com/trq212/status/2009689811616182404

  13. https://github.com/anomalyco/opencode/issues/7410

  14. https://x.com/thdxr/status/2009621159466385777

  15. https://x.com/kyliebytes/status/2009686466746822731

  16. https://x.com/thsottiaux/status/2009742187484065881