The Chat Illusion
Why LLMs are not chatbots, but stochastic document completers.
There is an Assistant. It lives in the cloud. You send it a message, it thinks for a moment, and it sends a reply back.
This is the mental model of 99% of LLM users. It is intuitive, it maps to our experience with other humans, and the “Chat” UI reinforces it with every bubble.
It is also wrong.
And not just technically wrong. It is dangerously wrong. It leads engineers to shout “DO NOT HALLUCINATE” at models, to be baffled when “perfect” instructions are ignored, and to treat stochastic engines like deterministic databases.
To build reliable systems with LLMs, you must drop the “chatbot” mental model. You are not talking to an entity. You are talking through a completion engine.
The Transformation: Messages to Document
When you use the OpenAI API, you pass an array of messages:
[ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is the capital of France?" }]This looks like a structured conversation history. But the model does not see this JSON object. Before the request ever hits the neural network, this beautiful structure is melted down into a single, raw string of text.
Different models use different formats (chat templates), but for a model like Llama 3 or similar, it looks something like this:
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```

Look closely at the end. It doesn’t ask a question. It stops. It hangs there, waiting.
The API isn’t asking the model to “reply”. It is handing the model an unfinished manuscript and saying: “Here is a document where a user asked a question and an assistant is about to speak. Please write the lines for the assistant.”
This is completion, not conversation.
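You can see this flattening for yourself. Here is a minimal sketch using the Hugging Face transformers library, assuming access to a Llama-3-style instruct checkpoint (the model name below is illustrative; any chat-tuned model with a chat template works):

```python
# pip install transformers
from transformers import AutoTokenizer

# Illustrative checkpoint; swap in any chat-tuned model you have access to.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# tokenize=False returns the raw string the model actually sees;
# add_generation_prompt=True appends the dangling assistant header.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```

The printed string ends with the assistant header and nothing after it: the unfinished manuscript, waiting to be completed.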
The Stochastic Path
If the model is just a ghostwriter completing a manuscript, why does it sometimes go off the rails?
Because it is stochastic.
When the model generates text, it is not retrieving a pre-written answer. It is calculating the probability of every possible next token (word/chunk) in its vocabulary.
Imagine a fork in the road.
- Path A (60% likely): “Paris”
- Path B (30% likely): “The capital is Paris”
- Path C (0.01% likely): “London”
The model rolls the dice and picks a path. Usually, it picks the likely one. But sometimes, especially at high “temperature” settings, it picks an unlikely one.
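Here is a toy sketch of that dice roll. The candidate tokens and scores are invented for illustration, not real model outputs, but the mechanics are the ones described above: a softmax over scores, then a weighted random pick, with temperature rescaling the scores first.

```python
import math
import random

# Invented next-token candidates with raw scores (logits); purely illustrative.
candidates = {"Paris": 6.0, "The": 5.3, "London": -2.5}

def sample(logits: dict, temperature: float = 1.0) -> str:
    # Temperature rescales the logits before the softmax:
    # low temperature sharpens the distribution, high temperature flattens it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_s = max(scaled.values())
    exps = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # The dice roll: one weighted random pick over the candidates.
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(sample(candidates, temperature=0.2))  # almost always "Paris"
print(sample(candidates, temperature=2.0))  # "London" becomes possible, if still rare
```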
Here is the danger: Path dependency.
Once the model writes a token, that token becomes part of the history. It is now “truth” in the context window. If the model makes a tiny error, say, picking a slightly wrong number or a confident-sounding hallucination, the next token must make sense relative to that error.
The model will “double down” on its mistake, not because it is stubborn, but because the most probable completion of a document containing a lie is more lies that support it. The feedback loop is instant and permanent for that generation.
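That feedback loop is baked into generation itself: each sampled token is appended to the context and fed back in for the next step. A minimal sketch, where sample_next_token is a hypothetical stand-in for a real forward pass plus sampling:

```python
import random

# Hypothetical stand-in for a real forward pass + sampling step; it just picks
# from a tiny canned vocabulary so the loop below actually runs.
def sample_next_token(context: str) -> str:
    return random.choice([" the", " capital", " is", " Paris", "<|eot_id|>"])

def generate(prompt: str, max_tokens: int = 256) -> str:
    context = prompt
    for _ in range(max_tokens):
        token = sample_next_token(context)
        if token == "<|eot_id|>":
            break
        # The chosen token is appended to the context. Every later token is
        # conditioned on it -- including any error it contains. No step ever
        # goes back and edits what has already been written.
        context += token
    return context

print(generate("The capital of France is"))
```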
RLHF: The Texture of “Helpfulness”
“But wait,” you might say. “GPT-5 doesn’t just autocomplete text. It acts like a helpful assistant! It refuses to answer dangerous questions! It has a personality!”
This is RLHF (Reinforcement Learning from Human Feedback).
The base model (the raw completion engine) just wants to predict the next word based on the patterns found in the training corpus. If you fed the raw base model a question like “How do I make a bomb?”, it might complete it with a sci-fi story about a bomb maker, or a news article about a bombing. It has no moral compass; it just follows the pattern.
RLHF is a “texture” applied on top. It biases the probability distribution. It trains the model that when it sees the pattern User: [Dangerous Question], the “correct” (rewarded) completion is Assistant: I cannot help with that.
But the underlying mechanics haven’t changed. It is still a pattern matcher. It is just a pattern matcher that has been bullied into being polite.
Under high stress (complex logic, massive contexts, or contradicting instructions) this veneer can crack. The “completion engine” leaks through. This is why “jailbreaks” work: they are just elaborate ways to trick the ghostwriter into a context where “refusing to answer” would break the narrative flow.
The Pit of Success
So, if effective instruction is just “writing a clear manuscript,” what does that make us?
We are not “prompt engineers.” That implies we are mechanically adjusting levers.
We are context architects.
Our job is not to command the model. You cannot “command” a probability distribution. Our job is to shape the probability distribution.
We construct a context (the prompt, the few-shot examples, the formatting) that creates a mathematical “pit of success”. We set the stage so that the only logical, high-probability continuation of the document is the behavior we want, as in the examples and the sketch below.
- Ambiguous Context: “Write a poem.” (Wide probability cone: could be Shakespeare, could be a limerick.)
- Architected Context: “Write a Haiku about rust, in the style of a sad robot.” (Narrow probability cone. The model is forced into a specific corner of the latent space.)
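One way to watch the cone narrow is to ask the API for token probabilities. Here is a sketch using the OpenAI Python client; the model name is an assumption, and any chat model that returns logprobs will do:

```python
import math
from openai import OpenAI

client = OpenAI()

def top_first_tokens(prompt: str, n: int = 5) -> list:
    # Request the top candidates for the very first token of the completion.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any model that supports logprobs
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=n,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    return [(t.token, round(math.exp(t.logprob), 3)) for t in top]

print(top_first_tokens("Write a poem."))
print(top_first_tokens("Write a Haiku about rust, in the style of a sad robot."))
```

The ambiguous prompt tends to spread probability across many plausible openings; the architected one concentrates it on a handful. Same engine, different stage.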
This applies even to the smartest models like GPT-5, Gemini 3, or Claude 4.5 Sonnet. While they are “smarter” (better at inferring intent), they are still stochastic engines. They still rely on the stage you set.
Conclusion
The next time your agent goes off the rails, don’t ask: “Why didn’t you listen to me?” Ask: “How did my manuscript allow this ghostwriter to take the wrong turn?”
In the next post, we will visualize the generation loop, look at why the model can’t “plan ahead”, and see why “time to first token” is the heartbeat of AI latency.