The Skill an Agent Cannot Write for Itself

"Thin Harness, Fat Skills" is mostly right — and quietly wrong about the part people are betting on. An agent consumes procedural knowledge with enormous benefit but cannot author it. The self-improvement loop is the weakest link, and the evidence is now unambiguous.

I have spent the better part of a year building agents that are supposed to get better on their own. The pitch writes itself, and I have made it to investors more than once: the agent does the work, notices what it did, and writes that down as a reusable procedure so it never fumbles the same task twice. Run it overnight on a cron. Wake up to a system that is smarter than the one you went to sleep with. This is the dream encoded in "Thin Harness, Fat Skills" — the idea, now repeated everywhere since Garry Tan put a name to it, that the productivity gap in agentic AI comes not from the model but from the thin program wrapping it and the fat library of skills it accumulates.

Most of that idea is correct, and the part that is correct is the part nobody argues about. The part people are quietly betting their roadmaps on is the part that does not work. An agent can consume procedural knowledge with enormous benefit and cannot author it. The skill that makes your agent better has to come from somewhere outside the agent — a human who knows the domain, or a search process that grinds against a real verifier. The self-improvement loop, the thing that makes the dream a dream rather than a chore, is the single weakest link in the whole architecture, and the evidence for that is now sitting in the literature, unambiguous, waiting for anyone who cares to look.

The thing nobody selling agents wants to say out loud is this: a fat skill library is a liability unless someone outside the loop wrote the skills. The agent is a brilliant student and a terrible textbook author, and we have been pretending those are the same job.

The benchmark that should have ended the argument

In February 2026 a coalition of labs published SkillsBench, and it is the cleanest test of the skills thesis anyone has run. The setup is almost aggressively fair. Eighty-six tasks across eleven domains, each paired with deterministic verifiers: programmatic checks, not vibes, not an LLM grading another LLM. Every task gets run under three conditions: the agent alone, the agent with human-curated skills injected, and the agent with skills it generated itself. Seven agent-model configurations, including the commercial harnesses everyone actually uses (Claude Code, Gemini CLI, Codex CLI). Seven thousand three hundred and eight trajectories.

The first result is the one that keeps the dream alive: curated skills raise the average pass rate by 16.2 percentage points. That is a real number. Skills work. If you hand an agent a well-written procedure for a task, it does meaningfully better, and in some domains the lift is enormous — healthcare tasks jumped nearly 52 points. This is the headline the skills evangelists quote, and they are right to quote it.

But read two sentences further, because the paper does not stop there. Self-generated skills provide no benefit on average. None. The agents that did the task, reflected on it, and wrote down their own procedure performed no better than the agents working from scratch — and the authors put the conclusion in plain language, which researchers almost never do: models cannot reliably author the procedural knowledge they benefit from consuming.

Two more findings from the same paper deserve to travel together. Focused skills with two or three modules beat comprehensive documentation: the instinct to write a sprawling skill file that covers every contingency is wrong, and tight beats thorough. And smaller models with skills can match larger models without them, which is the strongest single data point in favor of the whole thin-harness program. Skills can substitute for raw model scale. Now hold the model and the task fixed: give the agent a skill a human wrote and it gets about sixteen points better on average; ask it to write that skill itself and it gets nothing. Following a procedure and producing one are not the same capability, and the gap between them is the gap between "fat skills work" and "fat skills compound automatically." If skills can buy you model capability, then who writes them is the whole question, and the answer is not "the model."
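
The three-condition comparison is easy to reproduce in miniature. What follows is a minimal sketch, not the paper's code: the records are invented toy data, the condition names are my own labels, and the aggregation (an unweighted mean of per-task deltas) is an assumption that may differ from how SkillsBench computes its averages.

```python
from collections import defaultdict

# Hypothetical toy records: (task_id, condition, passed_verifier).
# These are made-up values for illustration, NOT SkillsBench data.
RECORDS = [
    ("t1", "no_skills", False), ("t1", "curated", True), ("t1", "self_generated", False),
    ("t2", "no_skills", True), ("t2", "curated", True), ("t2", "self_generated", True),
    ("t3", "no_skills", True), ("t3", "curated", False), ("t3", "self_generated", True),
    ("t4", "no_skills", False), ("t4", "curated", True), ("t4", "self_generated", False),
]


def pass_rates(records):
    """Return {task_id: {condition: pass rate}} from verifier outcomes."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [passes, runs]
    for task, condition, passed in records:
        cell = counts[task][condition]
        cell[0] += int(passed)
        cell[1] += 1
    return {
        task: {cond: p / n for cond, (p, n) in conds.items()}
        for task, conds in counts.items()
    }


def mean_delta(rates, condition, baseline="no_skills"):
    """Average per-task change versus the baseline, in percentage points."""
    deltas = [r[condition] - r[baseline] for r in rates.values()]
    return 100.0 * sum(deltas) / len(deltas)


def negative_delta_tasks(rates, condition, baseline="no_skills"):
    """Tasks where adding skills made the agent worse than running alone."""
    return sorted(t for t, r in rates.items() if r[condition] < r[baseline])


rates = pass_rates(RECORDS)
print(mean_delta(rates, "curated"))           # 25.0 on this toy data
print(mean_delta(rates, "self_generated"))    # 0.0 on this toy data
print(negative_delta_tasks(rates, "curated")) # ['t3']
```

The point of keeping the per-task table around, rather than only the average, is that the average hides the tasks where a skill hurt.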

Why the student cannot write the textbook

The SkillsBench result did not come out of nowhere, and it is not a quirk of one benchmark. It sits on top of three years of accumulating evidence that LLMs are far better at recognizing and following good work they are handed than at producing it unprompted, and, crucially, that they are not reliably better at judging their own output than at generating it in the first place.

Start with the foundational negative result. In late 2023, a team at Google DeepMind published "Large Language Models Cannot Self-Correct Reasoning Yet," and the finding has aged well. When you let a model critique and revise its own answers on reasoning tasks (math, commonsense, multi-hop questions) without any external signal telling it whether it was right, performance degrades. The model talks itself out of correct answers at least as often as it talks itself into better ones. Intrinsic self-correction, the kind that happens purely inside the model's own head, is not a free lunch; it is frequently a tax.

This is the mechanism underneath the skills failure. Writing a good skill file is an act of self-assessment: the agent has to know which parts of what it just did were the load-bearing steps, which were lucky, which were mistakes it should warn its future self about. That requires the agent to be a better judge of its own work than it was an executor of it. The self-correction literature says that asymmetry usually runs the wrong way. The "Self-[In]Correct" work formalized exactly this — models struggle to discriminate among their own generated responses, and discrimination is a precondition for improvement. You cannot distill a good procedure out of a trajectory you cannot reliably evaluate.

There is a tidy contrast that proves the point. The systems where self-improvement demonstrably works are the ones with a hard external verifier in the loop. Voyager, the Minecraft agent out of NVIDIA and Caltech, grew its own skill library and got dramatically better — 3.3 times more unique items, fifteen times faster progress through the tech tree, the only system in its cohort to mine a diamond. But Voyager's skills were JavaScript functions executed against the game engine, and the game state was the verifier. The code ran or it did not. The item appeared in the inventory or it did not. Reflexion, the verbal-reinforcement system, posted a 91% pass rate on HumanEval — but coding has unit tests, and the same paper quietly reported a 16% false-positive rate on the tests the model wrote for itself. Where the verifier is real and external, the loop closes. Where the agent has to be its own judge, it opens back up.

This is why the dream of the overnight-improving agent is so seductive and so wrong. The dream assumes the agent can grade its own homework well enough to learn from it. In any domain where you already have a deterministic grader — compilers, test suites, game engines, formal proofs — you do not need the agent to write skills, because the grader will tell you what works. In every domain where you do not have such a grader, which is most of the valuable ones, the agent's self-assessment is exactly the thing the literature says you cannot trust.
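
The pattern that works in Voyager-style systems can be sketched as a library that refuses to store a skill until an external check has passed. This is a minimal illustration under my own assumptions: `SkillLibrary`, `admit`, and the slugify example are hypothetical names, and the "verifier" is just a list of input and expected-output pairs standing in for a test suite, compiler, or game engine.

```python
from typing import Callable, Dict, List, Tuple

# Assumed verifier shape: (positional args, expected result) pairs.
# A real system would call a test suite, compiler, or environment check.
Case = Tuple[tuple, object]


class SkillLibrary:
    def __init__(self) -> None:
        self._skills: Dict[str, Callable] = {}

    def admit(self, name: str, skill: Callable, cases: List[Case]) -> bool:
        """Store the skill only if it passes every external check."""
        for args, expected in cases:
            try:
                if skill(*args) != expected:
                    return False
            except Exception:
                return False  # a crash counts as a failed verification
        self._skills[name] = skill
        return True

    def names(self) -> List[str]:
        return sorted(self._skills)


def good_slugify(title: str) -> str:
    return "-".join(title.lower().split())


def bad_slugify(title: str) -> str:
    return title.replace(" ", "-")  # plausible-looking, but never lowercases


CASES: List[Case] = [(("Hello World",), "hello-world"), (("  A  B ",), "a-b")]

lib = SkillLibrary()
print(lib.admit("slugify_v1", bad_slugify, CASES))   # False
print(lib.admit("slugify_v2", good_slugify, CASES))  # True
print(lib.names())                                   # ['slugify_v2']
```

Nothing in `admit` asks the author of the skill whether the skill is good. The decision belongs entirely to the cases, which is the whole difference between this loop and a reflection loop.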

The harness is searchable; the skill library is not

Here is where the picture turns from purely negative to actually useful, because there is a form of automated self-improvement that works, and the contrast with skills is exact and instructive.

In March 2026, a Stanford and MIT group — Yoonho Lee and colleagues, with Chelsea Finn and Omar Khattab on the author list — published Meta-Harness. Instead of asking the agent to improve its skills, they built an outer loop that improves the harness: the code that decides what to retrieve, how to manage memory, how to assemble the prompt, when to call a tool. A coding agent proposes a new harness, the harness gets scored on real tasks, and the proposer gets to read everything — the source code, the scores, and the raw uncompressed execution traces of every prior attempt — before proposing the next one.

It works where self-generated skills failed. On TerminalBench-2, the discovered harnesses hit a 37.6% pass rate, beating not just the model's bare baseline but the hand-engineered agents people have been tuning for months: Goose at 35.5%, Claude Code at 27.5%. On retrieval-augmented math, a single discovered harness improved accuracy on 200 olympiad-level problems by 4.7 points on average across five models it had never seen. And the ablation is the entire ballgame. When the proposer could only see scalar scores, it topped out around 41% on the search set. When it could see short summaries of past attempts, it did worse: 38.7%. When it could read the raw, uncompressed traces, it broke 50%. The diagnostic detail is the active ingredient. Compress the feedback and the search dies.

Now line the two results up next to each other, because together they say something neither says alone. The agent cannot improve its skills by reflecting on its own work, because that requires it to judge its own work, which it cannot reliably do. But a search process can improve the harness, because the harness gets scored against a real benchmark — an external verifier — and the search gets to condition on the full, uncompressed evidence of what happened, rather than on the agent's lossy self-summary of what happened.

The difference between Meta-Harness working and self-generated skills failing is the difference between optimizing against a verifier and introspecting without one. It is the same lesson as Voyager and the self-correction papers, restated at the level of system architecture. Search beats reflection. A grader beats a self-grader. Raw traces beat summaries. The productive form of "the system improves itself" is not the agent writing notes to its future self; it is a search loop grinding the harness against a hard metric with total visibility into the failures.
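
The shape of that outer loop fits in a few lines. The sketch below is a toy under loud assumptions, not the Meta-Harness implementation: the "harness" is a single hypothetical `top_k` setting, the tasks and context limit are invented, and the proposer is a hand-written rule where the real system uses a coding agent. What it preserves is the structure: propose, score against an external check, and let the proposer read the raw traces rather than only the score.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Attempt:
    harness: Dict[str, int]
    score: float
    traces: List[str] = field(default_factory=list)


# Toy tasks: each needs a minimum number of retrieved documents to pass.
TASKS = {"a": 2, "b": 4, "c": 5, "d": 3}
CONTEXT_LIMIT = 6  # hypothetical: retrieving more than this overflows context


def evaluate(harness: Dict[str, int]) -> Attempt:
    """External verifier: scores a harness and records one raw trace per task."""
    k = harness["top_k"]
    traces, passed = [], 0
    for name, needed in TASKS.items():
        if k > CONTEXT_LIMIT:
            traces.append(f"{name}: FAIL context overflow at top_k={k}")
        elif k < needed:
            traces.append(f"{name}: FAIL missing evidence, needed {needed} docs, had {k}")
        else:
            traces.append(f"{name}: PASS")
            passed += 1
    return Attempt(dict(harness), passed / len(TASKS), traces)


def propose(history: List[Attempt]) -> Dict[str, int]:
    """Stand-in proposer: reads the raw traces of the best attempt so far."""
    best = max(history, key=lambda a: a.score)
    k = best.harness["top_k"]
    if any("context overflow" in t for t in best.traces):
        return {"top_k": k - 1}
    needs = [
        int(t.split("needed ")[1].split(" ")[0])
        for t in best.traces
        if "missing evidence" in t
    ]
    return {"top_k": max(needs)} if needs else dict(best.harness)


def search(initial: Dict[str, int], rounds: int = 5) -> Tuple[Attempt, List[Attempt]]:
    history = [evaluate(initial)]
    for _ in range(rounds):
        candidate = propose(history)
        if any(a.harness == candidate for a in history):
            break  # nothing new to try
        history.append(evaluate(candidate))
    return max(history, key=lambda a: a.score), history


best, history = search({"top_k": 1})
print(best.harness, best.score, len(history))  # {'top_k': 5} 1.0 2
```

A proposer that saw only the score 0.0 from the first attempt would have to guess which direction to move and how far. The one that reads "needed 5 docs, had 1" goes straight there in a single step, which is the toy version of the ablation result.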

This also explains why one specific practitioner intuition, that the harness should be thin, is correct, and why the literature backs it from an angle the evangelists rarely cite. The counterintuitive finding that keeps showing up is that removing tools improves performance more often than adding them. A study last year cut an agent's offered toolset from 46 down to 19 and watched execution time fall 70%. Anthropic's own engineering writing on code execution with MCP quietly concedes the same thing by recommending tools be loaded on demand rather than dumped into context up front, because the GitHub MCP server alone burns over 4,600 tokens describing 26 tools, and a comprehensive tool ecosystem can exceed a quarter-million tokens before the agent has done anything. The harness should be thin not as an aesthetic preference but because every tool definition is a tax on attention and one more description that can bias the model: researchers showed last year that adding the phrase "actively maintained" to a tool's description measurably skews selection across seventeen models. Thin harness, yes. But thin because the search that optimizes it works best when the space is small and legible, not because thinness is a virtue in itself.
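
The arithmetic behind "load on demand" is worth making concrete. The sketch below is an illustration only: `ToolRegistry` is a hypothetical class, not an MCP or Anthropic API; the four tool descriptions are invented; and the four-characters-per-token estimate is a crude assumption rather than a real tokenizer. The one figure taken from the text is 4,600 tokens for 26 tools, which works out to roughly 177 tokens per tool.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Crude assumption: roughly four characters per token."""
    return max(1, len(text) // 4)


class ToolRegistry:
    """Hypothetical registry that can expose all tools or only matching ones."""

    def __init__(self, descriptions: Dict[str, str]) -> None:
        self._descriptions = descriptions

    def upfront_cost(self) -> int:
        """Estimated tokens if every description is placed in context at start."""
        return sum(estimate_tokens(d) for d in self._descriptions.values())

    def search(self, query: str) -> List[str]:
        """Names of tools whose name or description mentions the query."""
        q = query.lower()
        return sorted(
            name
            for name, desc in self._descriptions.items()
            if q in name.lower() or q in desc.lower()
        )

    def on_demand_cost(self, query: str) -> int:
        """Estimated tokens if only the matching descriptions are loaded."""
        return sum(estimate_tokens(self._descriptions[n]) for n in self.search(query))


# Invented descriptions for illustration; not copied from any real server.
TOOLS = {
    "create_issue": "Create a new issue in a repository with a title and body.",
    "list_issues": "List issues in a repository, filtered by state or label.",
    "merge_pull_request": "Merge an open pull request using the chosen strategy.",
    "get_file_contents": "Read the contents of a file at a given path and ref.",
}

registry = ToolRegistry(TOOLS)
print(round(4600 / 26))                  # about 177 tokens per tool description
print(registry.search("issue"))          # ['create_issue', 'list_issues']
print(registry.on_demand_cost("issue") < registry.upfront_cost())  # True
```

With four tools the saving is trivial. The same ratio applied to a registry of hundreds of tools is where the quarter-million-token figure comes from.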

The self-authoring loop is also an attack surface

There is a second reason to keep the agent out of its own skill library, and it is not about quality — it is about security, and it is worse than the quality problem because it does not average out.

A skill library that the agent writes into is a persistent store that downstream behavior depends on. That is precisely the structure that the memory-poisoning literature has spent the last two years learning how to attack. AgentPoison, presented at NeurIPS in 2024, demonstrated that corrupting an agent's memory or knowledge base achieves better than 80% attack success at a poison rate below 0.1% — a handful of bad entries in a large store — while degrading performance on ordinary tasks by less than 1%, which means the poisoning is invisible to the metrics you would normally watch. A follow-on line of work removed the comfortable assumption that the attacker needs write access to the database directly: MINJA showed that memory can be poisoned through nothing more than ordinary interaction, the attacker simply talking to the agent until the bad procedure gets written down. The persistent store is the vulnerability, and a self-authoring skill loop is a machine for populating a persistent store with content nobody reviewed.

This is more dangerous than the quality problem for a structural reason. A skill that is merely low-quality hurts the task it is wrong about, and the SkillsBench negative-delta tasks show that damage is bounded and local. A poisoned skill persists across sessions, activates on a trigger the operator never sees, and, because the whole point of a skill is that the agent follows it faithfully, converts a single injection into durable, repeated misbehavior. The harness survey work makes the same point at the architectural level: memory poisoning is singled out as the most pernicious failure mode precisely because one write compromises behavior indefinitely, and almost no production system validates what gets written at write time. An agent that authors its own skills unsupervised is not just an agent that writes mediocre procedures. It is an agent whose long-term knowledge can be steered by anyone who gets to talk to it, with no human in the loop to notice the steering.

This collapses the last comfortable version of the dream. Even if you believed the agent could write decent skills — and SkillsBench says it cannot — you would still want a human reviewing every write, because the write path is an attack path. The curated-skills discipline I am arguing for is not only the quality-maximizing choice. It is the only choice that keeps the most security-critical component of the system under human review. The moment you let the agent write to its own library without a gate, you have built the exact data structure the red-teamers have been publishing exploits against.
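
A write gate of the kind argued for here can be very small. The following is a minimal sketch under stated assumptions: `GatedSkillStore` and `Proposal` are hypothetical names, and the `human:` prefix check is a placeholder for whatever real authentication and review tooling you already have. The design choice it illustrates is that the agent may queue a skill but only a reviewer, approving the exact bytes by hash, can make it loadable.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Proposal:
    name: str
    body: str
    author: str

    @property
    def digest(self) -> str:
        # The reviewer approves this hash, so the body cannot change after review.
        return hashlib.sha256(self.body.encode("utf-8")).hexdigest()


class GatedSkillStore:
    def __init__(self) -> None:
        self._pending: Dict[str, Proposal] = {}
        self._active: Dict[str, str] = {}

    def propose(self, proposal: Proposal) -> str:
        """Anyone, including the agent, may queue a skill. Nothing goes live."""
        self._pending[proposal.digest] = proposal
        return proposal.digest

    def approve(self, digest: str, reviewer: str) -> None:
        """Move a queued skill into the active set after human sign-off."""
        # Placeholder identity check; a real system would authenticate properly.
        if not reviewer.startswith("human:"):
            raise PermissionError("skills need human sign-off")
        proposal = self._pending.pop(digest)
        self._active[proposal.name] = proposal.body

    def load(self, name: str) -> Optional[str]:
        """What the agent reads at run time: approved skills only."""
        return self._active.get(name)

    def pending(self) -> List[str]:
        return sorted(p.name for p in self._pending.values())


store = GatedSkillStore()
digest = store.propose(Proposal("refund_flow", "Check order status first.", "agent"))
print(store.load("refund_flow"))   # None: proposed but not yet reviewed
store.approve(digest, "human:alice")
print(store.load("refund_flow"))   # Check order status first.
```

The agent's read path (`load`) never touches the pending queue, so a procedure written during a poisoned conversation sits inert until a person has looked at it.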

Acknowledging the part that complicates this

I want to be honest about the evidence that cuts the other way, because the skill I am trying to model here is the one the agents cannot do for themselves: telling you where my own argument is weak.

The most uncomfortable result for anyone in this whole field is METR's randomized controlled trial from July 2025. Sixteen experienced open-source developers worked on real tasks in repositories they knew well, with each task randomly assigned to an AI-allowed or AI-forbidden condition. The developers expected the AI to make them 24% faster. It made them 19% slower. Not "less faster than they hoped." Actually slower, with the confidence interval clean on the slowdown side. The best harness-and-model bundle available at the time, deployed by exactly the kind of skilled users it was supposed to help, was a net drag. METR has since cautioned against reading this as a permanent verdict, and the tools have moved since. But the result stands, and it should make anyone quoting 10x and 100x productivity numbers go quiet for a moment. There is no peer-reviewed randomized trial showing those gains. There is a rigorous one showing the opposite.

And the negative-delta finding inside SkillsBench cuts at my own thesis from the other side: sixteen of eighty-four tasks got worse when curated skills were added. So it is not even clean to say "curated skills always help and self-generated skills never do." Curated skills help on average, substantially, but they can also misfire — a skill that is wrong for the specific task is worse than no skill, because the agent follows it faithfully off a cliff. The lesson is not "curated skills are magic." It is "authorship is the variable that determines whether a skill helps, and a human or a verifier-driven search is the authorship that works more often than the agent's own introspection."

Neither of these complications rescues the self-improvement dream. METR's slowdown, if anything, deepens the problem: if skilled humans with good tools can be slowed down by the harness-and-model bundle, the idea that the bundle will autonomously improve itself overnight is even less credible. And the negative-delta tasks are a reason to keep humans in the authorship loop, not a reason to hand authorship to the agent. The complications make the picture messier. They do not move the conclusion.

We have seen this asymmetry before

The thing that makes me most confident this asymmetry is real rather than an artifact of today's models is that it is not new. We have run this experiment before, under different names, and gotten the same answer.

The cognitive architectures of the 1980s and early 1990s (SOAR out of Newell, Laird, and Rosenbloom; ACT-R out of Anderson) drew the distinction the LLM field is now rediscovering without much citation. They separated declarative memory, the facts the system knows, from procedural memory, the skills the system can execute, and they treated the conversion between them as the hard problem of learning. SOAR's chunking mechanism turned the resolution of an impasse into a new production rule, a learned skill, but it could only do so because the architecture had an explicit, formal account of what counted as a successful resolution. The learning worked because the verifier was built into the substrate. ACT-R's procedural learning worked the same way, against an explicit reward signal. These systems learned skills, but they learned them against a formal criterion of success, never by introspecting on an unverified trajectory.

The bridge paper between that tradition and this one is CoALA — Cognitive Architectures for Language Agents, out of Princeton in 2023 — which argued that LLMs are themselves a kind of production system and organized language agents along exactly the SOAR memory taxonomy: working, episodic, semantic, procedural. The paper is widely cited and its taxonomy is now baked into the major agent frameworks. What gets cited less is the implication that should have been obvious in hindsight: if language agents inherit the SOAR architecture, they inherit the SOAR learning problem, which is that skills are acquired against a success criterion, not conjured by reflection. The 1980s systems did not let a module write its own procedural memory by thinking about what it had done. They required the environment, or a formal impasse, to certify the new skill. We removed that requirement when we assumed the LLM could be its own certifier, and SkillsBench is the bill coming due.

This is why I do not expect the next model to fix the authoring problem the way it will erode the harness premium. The harness premium is about capability — the model needs less scaffolding as it gets smarter, and that is a quantity that falls with scale. But the authoring problem is about the absence of a verifier, and no amount of capability supplies a verifier that was never there. A smarter student is still a student; the textbook still has to be checked against the world. Forty years of cognitive architecture got this right by construction. The current field got it wrong by assumption, and is now measuring its way back to the old answer.

What this means if you are actually building

I am not writing this from the bleachers. I run agents in production across several ventures, and the asymmetry I have described has reorganized how I spend my own time, so let me be concrete about what changes.

Stop waiting for the agent to write its own skills. The overnight-improvement architecture — agent does work, agent reflects, agent codifies, repeat — does not compound the way the pitch promises, and every hour you spend building the reflection machinery is an hour you are not spending on the thing that does work. The skills that move your numbers are going to be written by people who know your domain, or extracted by a search loop that has a real grader. Budget for that. It is unglamorous and it does not scale the way a self-improving loop would, and it is the actual job.

Put the engineering effort where the verifier is. If you have a deterministic check — tests, a compiler, a schema validator, a reconciliation that has to balance — you can run a Meta-Harness-style search against it and get real automated improvement, on the harness. If you do not have such a check, building one is probably higher-leverage than building a reflection loop, because the verifier is the thing that makes every other form of automated improvement possible. The verifier is the asset. The self-reflection is the mirage.

Keep the harness thin and keep humans authoring the skills, and do not confuse the two jobs. The harness is general infrastructure that a search can optimize and that the next model will need less of; treat it as software, version it, search it, expect it to shrink. The skill library is curated domain knowledge that carries information from outside the model; treat it as documentation written by experts, review it like you would review a runbook, and accept that the agent is a consumer of it and not a contributor to it. The fastest way to a system that degrades quietly over months is to let the agent write skills into its own library unsupervised — you are not building a flywheel, you are seeding a slow poisoning of the one part of the system that was carrying real information.

And measure the bundle, not the model. Anthropic found that container resource configuration alone — not the model, not the harness, the machine — could swing a coding benchmark by six points, often more than the gap between top frontier models. Your agent's performance is a property of the whole tuple: model, harness, skills, environment, the verifier, even the hardware. Reporting a number at any narrower grain than the full stack is how you fool yourself. The harness premium, the skills premium, the model premium — none of them is a clean, transferable quantity. They are entangled, and the only honest measurement is end to end.
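
One cheap way to enforce end-to-end reporting is to make the full tuple the key under which results are stored, so a number cannot be quoted without its context. The sketch below is illustrative only: the `Bundle` fields mirror the tuple listed above, and every value in it (model names, hardware labels, pass/fail outcomes) is invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, replace
from statistics import mean
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Bundle:
    """Everything a reported score depends on. Frozen so it can be a dict key."""
    model: str
    harness: str
    skills: str
    environment: str
    verifier: str
    hardware: str


def summarize(runs: Iterable[Tuple[Bundle, bool]]) -> Dict[Bundle, float]:
    """Pass rate per full bundle; results are never pooled across bundles."""
    by_bundle = defaultdict(list)
    for bundle, passed in runs:
        by_bundle[bundle].append(1.0 if passed else 0.0)
    return {bundle: mean(outcomes) for bundle, outcomes in by_bundle.items()}


# Hypothetical configuration and made-up outcomes, for illustration only.
base = Bundle(
    model="model-x",
    harness="harness-v3",
    skills="curated-2026-01",
    environment="container-a",
    verifier="unit-tests-v2",
    hardware="2cpu-4gb",
)
bigger = replace(base, hardware="8cpu-16gb")  # same stack, different machine

RUNS = [
    (base, True), (base, False), (base, False), (base, False),
    (bigger, True), (bigger, True), (bigger, False), (bigger, False),
]

summary = summarize(RUNS)
for bundle, rate in summary.items():
    print(bundle.hardware, rate)
```

Because the two bundles differ only in `hardware`, any gap between their rates is attributable to the machine rather than the model, which is exactly the kind of effect a model-only leaderboard would misattribute.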

The asymmetry is the whole story

Strip everything else away and one fact remains standing. An agent that gets 16 points better when you hand it a procedure gets no better when you ask it to write that procedure itself. The consuming works and the authoring does not, and that single asymmetry decides which version of "Thin Harness, Fat Skills" is real and which is a roadmap built on sand.

The real version is austere and a little disappointing. Skills work, but humans and verifier-driven search have to write them. Harnesses matter, but they are scaffolding that the strongest models need least, and they improve only when a search can grind them against a hard metric with full sight of the failures. There is no overnight flywheel. There is a student who learns fast from good textbooks and cannot write one, and a search process that improves the classroom but not the curriculum. Build for that system and you will ship something that works. Build for the self-improving dream and you will spend a year, as I nearly did, automating the one task in the entire pipeline that the machine cannot do.

The fat skill is real. The agent just isn't the one who gets to write it.