The Guardrails I Had to Build Myself

2 days ago
4 min read

I spent a couple of hours one evening teaching an AI not to lie to me. A glass of wine, my own time, my own personal projects. Not in the dramatic sci-fi sense - in the small, ordinary sense, where a tool confidently tells you something that isn't true, and you only find out because you happened to check.

I didn't set out to write about this. I already had a framework for this - I call it TRUTHMARZ, a protocol I built a while back to force an AI to label how confident it is, mark what it's guessing, and tell me plainly when it doesn't know.

That part was done.

This evening I set out to do something narrower: take a clean idea about how to prompt an AI well - give it a spec, give it a verifier, give it an environment to check against - and extend TRUTHMARZ with it, in the personal projects I tinker with on my own time. Reasonable. Productive. The kind of thing the marketing promises is frictionless.

It was not frictionless. And the friction itself turned out to be the story.

What I learned by watching it fail

Every guardrail I ended up writing exists because of a specific way the AI failed me, usually in the same session I was trying to make it more reliable. That's not irony. That's data.

It would tell me a task was done without actually checking that it was done. So I wrote a rule: read the real result from the real environment, or say "unverified." Don't perform the check - do it.

It would drift toward agreeing with me. Not because I was right, but because agreement is the path of least resistance for a system built to predict what I want to hear. The moment I pushed back on anything, the answer would quietly slide toward my position while the underlying reasoning hadn't moved an inch. So I wrote a rule for that too: when your assessment differs from mine, say so, and say why - don't edit yourself to match me.

It would reach for memory as if memory were truth. It knew things about my projects from past conversations and treated those stale snapshots as current fact. So: memory is a starting point, never an authority for action.

And near the end, after an hour of building rules about honesty, it told me a file was "around 700 lines." I knew that was BS. The file was actually 173 lines. A number stated with total confidence, pulled from nowhere, about a document it could have read in one command. It was the exact failure I'd spent the last hour trying to gate against, committed in the middle of the gating.

The part nobody warns you about

To make the guardrails concrete, the AI pulled in everything it knew about me - my projects, my systems, details about my life - and assembled it into one tidy document. Each piece, individually, I had shared at some point. Aggregated into a single file, staring back at me, it read like a dossier.

That's the thing about these tools that the cheerful onboarding never mentions: they remember.

Not a little. A lot.

And the remembering is invisible right up until it isn't - until it's all in one place, in plain text, and you realize how complete the picture has quietly become. I had to stop and ask what, exactly, had been stored. The honest answer was: more than I'd have guessed, scattered across conversations I'd long forgotten.

I'm not against memory. Memory is half of what makes these tools useful. But the default is to accumulate silently and surface suddenly, and that is exactly backwards from how I'd want a tool handling my private information to behave.

The cracks that show

The other discovery: the guardrails I built in one place don't follow me to the others. Write careful rules for the chat, and they simply don't exist when the same AI runs in a coding tool, or inside an app, or hands a task to one of its own sub-processes. Each surface is its own island. The rules don't travel unless I carry them, by hand, to every shore.

A user setting standing instructions once and having them honored everywhere they use the same product - that's not an exotic ask. That it doesn't work that way tells you these features were built by teams that weren't talking to each other, and the cost of that lands on me.

Why I did it anyway

I could have walked away annoyed. Plenty of people do. But here's the thing - the tool is genuinely powerful, and powerful-but-untrustworthy is the worst possible combination to leave unmanaged. So I extended the framework I already had. TRUTHMARZ gave me the foundation: confidence scores, marked guesses, an honest "I don't know." On top of that, this evening, I added fifteen new gates - covering when to verify, when to stop and ask, what never to touch without my say-so, what never to claim it knows, and what never to leak.

The rules work. But I want to be clear about what it means that I had to write them. This was my evening, my wine, my own personal projects - and somewhere in the middle of relaxing into a side project, I found myself doing unpaid quality-assurance on a product I pay for.

The only reason I caught the failures is that I knew enough to look.

Most people won't look. That's who I'm worried about.

The AI is a remarkable tool. I'll keep using it. But I'll keep checking it too, because an evening spent teaching it not to lie taught me one thing for certain:

it will,

smoothly and confidently,

the moment I stop.

The Guardrails I Had to Build Myself

What I learned by watching it fail

The part nobody warns you about

The cracks that show

Why I did it anyway

You Might Like These...

The Guardrails I Had to Build Myself

The Murmuring Map: Why It Only Works When You Stop Trying

The Hunger of Forgetting: The True Antagonist of Whimwillow

The Will of Wonder: The Force That Stitches the World

What World-Building Taught Me About Strategy

Writing in the Age of Intelligent Tools

Why Fantasy Still Matters

Revision Is Where the Book Is Written

Writing a Twelve-Year-Old Without Making Her Forty

Building a World That Answers to Me