A prompt is an interface, not an essay

Most prompt advice assumes you have a big model and infinite patience. On a live phone call you have neither. You’re handing a small, fast instruct model a split-second decision (is this a person or a voicemail? is the caller done talking?) and every extra token is latency the caller feels.

I spent a while in a tight loop tuning one of these prompts, and the thing that finally worked was a mindset shift: stop writing the prompt like an essay and start treating it like an interface. I came out the other side with a prompt roughly half the size that was more accurate. Here’s what actually moved the needle.

Put the decisive instruction first

Small instruct models latch onto the first concrete directive they see. If the single most important thing, “answer with exactly one of these labels”, is buried in paragraph three, the model has already started improvising by the time it gets there. Moving the output contract to the very first line did more for reliability than any amount of explanation lower down.

Reframe judgement as lookup

The biggest failures came from questions that sound simple but are actually asking the model to have taste: “is this utterance complete?” A small model will agonise over that and sometimes stall. The fix was to convert the judgement into something mechanical: “does the last word appear in this set of continuation cues?” Same intent, but now it’s a lookup the model basically can’t get wrong. You’re moving the thinking from runtime into the prompt design, where you can test it.

Flatten the taxonomy

Nested bullet lists with sub-cases read beautifully to a human and confuse a small model. Collapsing a hierarchy into single, flat lines, each with its cue keywords inline, consistently beat the “well-structured” version. When two categories could both apply, an explicit precedence rule (“if both, prefer X”) resolved more real errors than another paragraph of nuance.

Bookend the things that must survive

Long prompts dilute. By the time a model has read tens of thousands of characters of context, the rule you set at the top has faded. Repeating the critical instruction at both the top and the bottom, bookending it, kept it alive through the noise. Cheap, slightly ugly, very effective.

Measure or it didn’t happen

None of this is trustworthy as intuition. I ran every iteration against a fixed evaluation set, multiple times at zero temperature, and watched accuracy and latency together, because it’s easy to “improve” a prompt in a way that helps one case and silently breaks three others. The discipline is boring and it’s the entire game.

The broader point: a prompt for a small model in production isn’t prose you’re trying to make persuasive. It’s an interface contract you’re trying to make unambiguous and cheap. Once I started reviewing prompts the way I review API design, asking what’s the smallest, least surprising thing that can’t be misread, they got shorter and better at the same time.