The Harness Is the Hard Part — Matthew Bradford

There's a question I get in almost every first conversation with a new client. It comes in different forms, but it's always the same question.

"Yeah, but can I trust it?"

Or: "How do you stop it from going rogue?"

Or some variation of "what happens when it does something weird?"

I want to be precise about what that question actually is, because I think most of the AI industry is answering the wrong version of it. That question is not a safety question. It's a trust question. And those are not the same thing.

Safe means the failure modes aren't catastrophic. Trusted means the behavior is predictable. You can have a system that will never cause serious harm but is still completely untrustworthy, because you never know what it's going to do next. Those are different problems with different solutions.

When a client says "I don't trust it," they are not describing a system that blew something up. They are describing a system that said something weird in front of a client, or hallucinated a number in a report somebody signed off on, or formatted the output three different ways on three consecutive runs. Those are trust failures. And trust failures are why AI initiatives die.

It isn't the demo... the demo (almost) always works. What fails is the handoff to reality.

What Clients Actually Built

There's a specific client conversation I keep having that I think illustrates this well.

Someone comes to me and says they tried AI and it didn't work. My first question is: what did you actually try? Did you open ChatGPT and type in something like a Google search?

A lot of them did. Which tells me they missed the entire point of the tool before they even started, because the superpower of AI is iteration. If you tried it once and decided it didn't work, you didn't iterate. You just looked at the thing and walked away. And generally I walk away from those prospects. They aren't ready.

But the clients I actually want to work with? They describe a long process they've developed themselves. Something they try to follow every single time because it produces outputs they genuinely like. They've been doing it manually, painstakingly, and they're good at it.

What they've actually done, without knowing it, is build a harness by hand. They just don't have a word for it yet.

What a Harness Is

A harness, in this context, is a deterministic system wrapped around a non-deterministic model.

LLMs are non-deterministic by design. Ask the same question twice and you might get two different answers. That's actually part of what makes them useful. The creativity, the ability to approach a problem from an unexpected angle, the capacity to surprise you with a connection you didn't see coming. It is a feature, not a bug. You do not want to engineer that out. You want to engineer around it.

Think about it this way. You have an incredibly talented artist. On any given day, this artist might produce a masterpiece. They might paint a parrot. They might paint a boat. They might paint something completely unrelated to what you asked for. The non-determinism is the talent. It's also the problem.

If you need a picture of a parrot every single time... not sometimes a parrot, sometimes a boat, sometimes something else entirely... a paint-by-numbers framework is actually your friend. The framework doesn't make the artist less talented. It makes the output reliable within the bounds that matter for your use case.

That's a harness. Not a leash. A net.
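To make "a deterministic system wrapped around a non-deterministic model" concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `fake_model` is a hypothetical stand-in for a real LLM call, and `run_with_harness` and `is_parrot` are invented names, not part of any library. The model is free to vary. The check and the retry limit are not.

```python
import random
from typing import Callable


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: same prompt, varying answer."""
    return random.choice(["a parrot", "a boat", "something else entirely"])


def run_with_harness(
    prompt: str,
    model: Callable[[str], str],
    is_acceptable: Callable[[str], bool],
    max_attempts: int = 5,
) -> str:
    """Call the model until its output passes a deterministic check."""
    for _ in range(max_attempts):
        output = model(prompt)
        if is_acceptable(output):
            return output
    # Fail loudly rather than ship something that did not pass the gate.
    raise RuntimeError(f"no acceptable output after {max_attempts} attempts")


def is_parrot(output: str) -> bool:
    """Deterministic check: the same output always gets the same verdict."""
    return "parrot" in output.lower()


if __name__ == "__main__":
    # Illustrative usage: the caller either gets a parrot or gets an error.
    print(run_with_harness("Paint me a parrot.", fake_model, is_parrot, max_attempts=50))
```

A real harness would check far more than one keyword, but the shape is the same: the caller never sees a boat. It sees a parrot or an explicit failure.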

Outcomes, Not Process

When I work with a new client, the first thing I want to understand is outcomes, not process. What does "right" actually look like?

People are genuinely bad at describing what they want in the abstract. It just isn't a thing most people have to do often enough to skill up in it. I mean, think about it: when you compare your ideal partner with who you ended up with... they are rarely similar. And that is the "abstract want" that most people have put the most time into developing.

If I ask someone to describe their ideal output from scratch, I'll get something vague and then something that doesn't match what they actually wanted once I produce it.

So instead, I have them show me. Show me what you like. Show me what's close. Show me the different things you'd want to pull together to get from close to perfect. It is similar to asking someone about the best moments in their relationships... even the relationships they ultimately hated... to help them arrive at what a good partner might look like, instead of asking them to conjure something up from thin air.

What I'm doing is narrowing the gap between where we are and where they want to be. If you can't see the other side of a gap, you can't build a reliable bridge across it. You'll build something structurally unsound because you were guessing at the destination. My job, at the start, is to build something short enough and solid enough that they can actually see the other side. Then we come back. We iterate. The gap gets smaller. And smaller. Until it's just a crack.

This is also why I'm so skeptical of traditional discovery phases. (I have some strong opinions about that.) Six weeks of requirements gathering before anyone has touched anything means you're asking someone to be precise about something they haven't seen yet. Give me a day. Let me show you something directionally correct. Then tell me what's wrong with it. That's faster, cheaper, and produces better results than the alternative.

The part I really love about it, though, is that it dramatically reduces the trust requirement. I'm not asking you to believe in me for six weeks and then see something at the end. I'm asking you to look at something tomorrow and tell me if I'm pointed in the right direction.

Taste Is the Quality Gate

Hot take: a harness is just encoded bias. And in this case, bias is good.

Anyone can wrap an API call in a process. That's not hard. What makes one harness produce reliably good outputs and another produce reliably mediocre ones is taste. The quality gate is only as good as the standard it's enforcing.

A good custom harness bakes in the client's taste. It integrates their specific definition of what "right" looks like, their standards, their edge cases, their non-negotiables. Which means two clients with identical use cases get completely different harnesses. The harness isn't about the task. It's about the standard.

You can program style. Style is a set of rules that can be documented and enforced. Taste is knowing when to apply which style situationally, and that's harder to encode, but it can be done. Assuming, of course, you actually understand the taste you're trying to encode. If you're using someone else's taste, everything you build is going to look like everything else. Generic AI implementations feel generic for exactly this reason. Somebody built a harness and baked in their own assumptions about what good looks like, or nobody's assumptions at all.
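One way to picture "style is a set of rules that can be documented and enforced" is as plain data. The sketch below is illustrative only: the `StyleRule` type, the `check_style` function, and both clients' rule sets are hypothetical, made up to show two clients with the same task (a status summary) holding the same draft to different standards.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StyleRule:
    name: str
    passes: Callable[[str], bool]


def check_style(text: str, rules: list[StyleRule]) -> list[str]:
    """Return the name of every rule the text violates (empty means pass)."""
    return [rule.name for rule in rules if not rule.passes(text)]


# Hypothetical standards: same task, different definitions of "right".
CLIENT_A_RULES = [
    StyleRule("at most 40 words", lambda t: len(t.split()) <= 40),
    StyleRule("no exclamation marks", lambda t: "!" not in t),
]
CLIENT_B_RULES = [
    StyleRule("starts with a headline in capitals", lambda t: t.split("\n")[0].isupper()),
    StyleRule("includes at least one number", lambda t: re.search(r"\d", t) is not None),
]

draft = "Shipping is on track! We closed 12 tickets this week."
print(check_style(draft, CLIENT_A_RULES))  # ['no exclamation marks']
print(check_style(draft, CLIENT_B_RULES))  # ['starts with a headline in capitals']
```

Rules like these only cover style. Knowing which rule set applies in which situation is the taste part, and it is not captured by this sketch.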

Unless you're in an industry where bias is damaging, you want someone to see your bias in your work. It is your fingerprint. The fingerprint is the point.

The Harness Layer Matters More

So why am I so focused on harnesses? What about the new benchmark that came out last week that said model X is now 27x smarter than it was 3 days ago? I'll let you in on a little secret:

The models are good enough. They have been for a while, and they're getting better, but the intelligence leap between generations is less detectable now than it used to be. Conversational AI has largely plateaued in terms of what any normal person is actually going to notice in daily use. What hasn't caught up is the harness layer. The deterministic scaffolding. The QC gate. The thing that makes the output reliable enough to ship in a production environment where real consequences exist.

That is the most valuable thing someone outside a frontier lab can build right now. Not a wrapper. Not a prompt template. A genuine, client-specific, taste-encoded quality harness for whatever it is they're actually trying to get AI to do... whether that's building presentations, routing tickets, generating reports, or a hundred other use cases.

And this is what I want junior developers in particular to hear: that skill set has a long shelf life. The specific models will change. The need for deterministic quality gates around non-deterministic outputs is structural. It doesn't deprecate when the next version drops. Learn to build the harness. Learn to encode taste into it. Learn to make the verification loop short enough that stakeholders can actually see what they're getting before they have to trust it.

What Survives the Next Ten Years

There are two things I'm genuinely grateful I figured out early, and they have nothing directly to do with AI. But they're exactly what will determine who survives the next ten years of it.

The first: never lose sight of what the technology is actually doing. Keep your eye on the impact, not the stack. As long as you stay focused on outcomes, the technology underneath will shift and you'll naturally reach for whatever tool is right for the job. The people who get left behind are the ones who become specialists in a tool that gets deprecated, because they confused the tool for the point.

The second: stay curious. Protect your sense of wonder. Curiosity is what makes you adaptive. It's what keeps you from calcifying into someone who runs on pattern recognition and stops actually seeing what's in front of them. Wonder is not a soft skill. It is the engine of everything useful.

The developers who will matter in ten years are the ones who never lost either of those things. They're not asking "will AI replace me?" They're asking "what can I build with this that I couldn't build before?"

That's where the work is. That's where it's always been.

Build the harness. Encode the taste. Stay curious enough to know when to throw it out and start over.

Rinse. Repeat. Ship.