year of the harness
Why 2026 is the year we shift from chasing model capabilities to building the [orchestration layers](/blog/software-is-orchestration) that make them useful.
I spent most of 2025 watching leaderboards and waiting for GPT-5, usually just to realize that a new prompt for an old model fixed the bug I was worried about.
For me, 2026 is becoming the year of the harness.
Models will keep improving, but I’m starting to think that, as we build, the value shifts from chasing the smartest checkpoint to building the scaffolding that makes those models useful. The harness is what turns a powerful primitive into something I can actually rely on.
The Harness Pattern
When I say “harness,” I don’t just mean the prompt; I mean the entire system wrapping the call: the instructions, the tools, the verification layers, and the feedback loops.
It’s the difference between handing a mechanic a loose pile of parts and giving them a clean workbench, a torque wrench, and the service manual. The mechanic hasn’t changed, but the environment finally allows for some precision.
Lee Robinson put it well: “Making the model perform well requires tuning the instructions (prompts) and tools in the harness.”1
This is where the actual work is now. Model upgrades are outside my control; harness design is not, and a good harness ensures that a model upgrade is additive rather than something that breaks the entire system.
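To make that concrete, here’s a minimal sketch of what I mean by a harness, written in Python. None of these names come from a real framework; they’re just placeholders showing that the model call sits inside a structure that owns the instructions, tools, verification, and feedback rather than being a bare prompt.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical type aliases for illustration -- not a real framework's API.
Tool = Callable[..., str]          # a capability the model can invoke
Verifier = Callable[[str], bool]   # checks output against external truth
Feedback = Callable[[str], str]    # turns a failure into a steering hint


@dataclass
class Harness:
    """Everything wrapping the model call, not just the prompt."""
    instructions: str                                   # prompt text, constraints first
    tools: dict[str, Tool] = field(default_factory=dict)
    verifiers: list[Verifier] = field(default_factory=list)
    feedback: list[Feedback] = field(default_factory=list)
```

The useful property is that none of this mentions a specific model: swap the checkpoint underneath and the harness is the part that survives the upgrade.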
Durable Orchestration
I’ve been thinking about the durability of these patterns. There is a useful distinction between architectures that depreciate as models improve and those that scale with them, a concept described as the difference between engines and sails.2
Low-leverage engines are the workarounds we build for today’s specific limitations.
- Prompt-hacking for a specific checkpoint’s refusal patterns or token-saving quirks.
- Brittle routing logic that only works because of a model’s specific reasoning flaw.
- Fragile regex parsers built to extract structure before JSON mode or tool calls became standard.
These feel like engines because they require constant maintenance. Every time a new model drops, you’re back under the hood, and when the model gets better, the engine usually becomes dead weight.
High-leverage sails are the architectural primitives that get more powerful as models improve.
- Context hygiene and curation layers that ensure the model gets the right data, regardless of its window size.
- Standardized tool interfaces and multi-agent orchestration that treat the model as a modular component.
- Verification layers that check for correctness using external truth, not just model self-correction.
I don’t think we ever get away from orchestration, but the goal is to shift the weight of that harness from fixing the model’s bugs to defining the system’s behavior. Lately I find myself trying to build more sails.
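As a concrete example of a sail, here’s roughly what I mean by a standardized tool interface. The shape below is my own sketch, not any provider’s schema: the tool is described once in a model-agnostic way, and any model that can emit a tool name plus arguments can drive it.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolSpec:
    """A model-agnostic description of one capability."""
    name: str
    description: str            # the "what" -- the affordance the model sees
    parameters: dict[str, str]  # parameter name -> human-readable constraint
    run: Callable[..., Any]     # the "how" stays on our side of the boundary


# Example registration: a hypothetical file-reading tool.
read_file = ToolSpec(
    name="read_file",
    description="Return the contents of a file in the workspace.",
    parameters={"path": "workspace-relative file path"},
    run=lambda path: open(path, encoding="utf-8").read(),
)


def dispatch(tools: dict[str, ToolSpec], name: str, args: dict[str, Any]) -> Any:
    """Route a tool call emitted by any model; the registry is the stable interface."""
    return tools[name].run(**args)
```

Because the interface describes affordances rather than implementations, a better model simply uses the same registry more effectively; nothing here needs to be rewritten when the checkpoint changes.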
Engineering for Magic
I suspect that by the end of 2026, the baseline expectation for any product I build will be a “magic moment” within seconds. Users don’t really care about “AI”; they just want the result. When a feature requires more cognitive load to oversee than it saves in effort, it’s a chore, but when it feels like it’s reading your mind, it wins.
A magic moment is usually just a lucky strike in the probability distribution, so engineering for it is mostly about narrowing the range until those strikes happen more often. The harness is what shapes the distribution.
Thomas Osmonson talks about his origin story with AI3 and how ChatGPT recommended Exhalation by Ted Chiang. The query itself wasn’t complex, just a simple mood-based request for something to read, but it was a magic moment because of the inference. The model connected the subtext of a vague sentiment to a high-density literary recommendation that actually fit.
Patterns in Practice
I’m starting to see a few patterns in what a good harness looks like in code (a rough sketch follows the list):
- Tool design: I’ve been leaning toward exposing affordances instead of implementations, giving the model the “what” and letting it find its own “how.”
- Prompt structure: I’ve found that constraints often matter more than instructions, so I define the boundaries of the solution space first, though I’m careful with negative constraints to avoid the Waluigi Effect.4
- Verification: I use independent validators that check if a task is “done” regardless of how we got there, so if the model hallucinates, I don’t have to babysit it to get the right answer.
- Feedback: I prefer loops that steer rather than just validate, so instead of just saying “file not found,” I have the system suggest checking the right directory to keep the model moving.5
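Pulling those patterns together, here’s a rough sketch of how I think about the loop. Everything in it is hypothetical: the task (writing a file), the validator, the steering hint, and `call_model`, which stands in for whatever client you actually use.

```python
import os
from typing import Callable


def build_prompt(task: str, constraints: list[str]) -> str:
    # Constraints first: define the boundaries of the solution space
    # before describing the task itself.
    bounded = "\n".join(f"- {c}" for c in constraints)
    return f"Constraints:\n{bounded}\n\nTask: {task}"


def file_exists(path: str) -> bool:
    # Independent validator: "done" is defined by external truth
    # (the filesystem), not by the model's own report of success.
    return os.path.exists(path)


def steer_missing_file(path: str) -> str:
    # Feedback that steers rather than just validates: point at the
    # likely fix instead of only reporting the failure.
    parent = os.path.dirname(path) or "."
    return (f"{path} was not created. Check whether {parent} exists and "
            f"list its contents before writing again.")


def run_task(call_model: Callable[[str], str], task: str,
             expected_path: str, max_attempts: int = 3) -> str:
    prompt = build_prompt(task, [
        "Only write inside the workspace directory.",
        "Report the final path of any file you create.",
    ])
    for _ in range(max_attempts):
        output = call_model(prompt)
        if file_exists(expected_path):   # verify against the world, not the transcript
            return output
        prompt += "\n\n" + steer_missing_file(expected_path)  # steer, don't just reject
    raise RuntimeError("validator never passed; giving up")
```

The point isn’t this particular loop; it’s that the validator and the steering feedback live outside the model, so a smarter checkpoint just passes the checks sooner.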
I’ve spent enough time looking at the engine. It’s probably time to build the car, especially now that the primitives feel stable enough to actually support it.
Footnotes
1. Lee Robinson (@leerob) on X: “Making the model perform well requires tuning the instructions (prompts) and tools in the harness.”
2. Jacob, “Engines Drown, So Build a Sail”. A brilliant framing on why simple architectures outperform complex ones over time.
3. Thomas Osmonson, “More Thinking, Not Less”. A deep dive on how AI changes the nature of cognitive work.
4. System Prompts. A look at why negative constraints can sometimes backfire and trigger the Waluigi Effect.
5. Hallucination as Architecture. How I use model errors as a steering mechanism for better navigation.