Most agent failures aren’t model failures. They’re prompt failures. Brittle system prompts, missing constraints, sloppy tool descriptions, no error handling — the model gets blamed for problems that live in the prompt.
Prompt engineering for agents is a different sport from prompt engineering for chatbots. Here’s the checklist we run on every agent before it ships.
1. The System Prompt Has Three Sections, In Order
We structure every agent system prompt the same way:
- Identity and goal — one short paragraph. “You are an X. You help users do Y. Your job is done when Z.”
- Tools available — listed with one-line descriptions and when to use each. The model uses this to plan.
- Operating rules — the constraints that bound behaviour. What to do when uncertain. What never to do. How to format responses.
This structure mirrors how a competent new hire would read instructions. The order matters because the model uses early tokens to set the framing for everything that follows.
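One way to keep that order from drifting is to assemble the prompt in code rather than by hand. The sketch below is illustrative only: `Tool`, `build_system_prompt`, and the support-agent wording are hypothetical names and text invented for this example, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    description: str  # one line: what it does and when to use it


def build_system_prompt(identity: str, tools: list[Tool], rules: list[str]) -> str:
    """Assemble the three sections in a fixed order: identity, tools, rules."""
    tool_lines = "\n".join(f"- {t.name}: {t.description}" for t in tools)
    rule_lines = "\n".join(f"- {r}" for r in rules)
    return (
        f"{identity}\n\n"
        f"Tools available:\n{tool_lines}\n\n"
        f"Operating rules:\n{rule_lines}"
    )


# Hypothetical agent, used only to show the shape of the result.
prompt = build_system_prompt(
    identity=(
        "You are a support agent for an online store. You help users with "
        "product questions. Your job is done when the user has an answer or "
        "a handoff to a human."
    ),
    tools=[
        Tool("search_kb", "Searches the knowledge base. Use for product questions."),
        Tool("escalate", "Hands the conversation to a human. Use when out of scope."),
    ],
    rules=[
        "If uncertain, say so and ask a clarifying question.",
        "Never invent product details.",
        "Reply in plain text, three sentences or fewer.",
    ],
)
print(prompt)
```

Because the builder owns the ordering, a later edit to the rules or the tool list cannot accidentally push the identity paragraph below them.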
2. Tool Descriptions Are Where 80% Of Reliability Lives
The model picks tools based on their descriptions. Bad descriptions cause bad picks. Three rules:
- Lead with the verb. “Searches the knowledge base for …” not “A function that …”
- State when to use it AND when not to. “Use this for any product-related question. Do not use for billing.” Negative examples halve mistake rates.
- Be honest about failure modes. “Returns an empty list if no matches are found — try rewriting the query if this happens.” The model will follow your recovery instructions.
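As a rough illustration, the three rules can be applied to a single tool definition and then spot-checked with a tiny linter. The tool name `search_kb`, the `parameters` key, and `lint_description` are assumptions made for this sketch; real provider APIs name the schema field differently, so adapt the dictionary shape to whichever one you use.

```python
# A hypothetical tool definition. The key names are illustrative,
# not any specific provider's format.
search_kb_tool = {
    "name": "search_kb",
    "description": (
        "Searches the knowledge base for articles matching a query. "
        "Use this for any product-related question. "
        "Do not use for billing. "
        "Returns an empty list if no matches are found; "
        "try rewriting the query if this happens."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search terms, not a full sentence."},
        },
        "required": ["query"],
    },
}


def lint_description(description: str) -> list[str]:
    """Cheap string checks for the three rules; heuristics, not guarantees."""
    problems = []
    words = description.split()
    lowered = description.lower()
    # Rule 1: an article or pronoun up front usually means "A function that ..."
    if not words or words[0].lower() in {"a", "an", "the", "this"}:
        problems.append("lead with the verb")
    # Rule 2: look for an explicit negative instruction.
    if "do not use" not in lowered:
        problems.append("say when NOT to use it")
    # Rule 3: look for a statement of what the tool returns.
    if "returns" not in lowered:
        problems.append("describe what comes back, including failure")
    return problems


print(lint_description(search_kb_tool["description"]))
```

A linter like this only catches missing phrases. Whether the description is actually true of the tool still needs a human read.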
3. Constrain Output Aggressively
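Free-form output from agents is a debugging tax. Define a schema for every response: JSON, XML, or a strict markdown structure. Use a structured-output mode (Claude’s tool use with an input schema, or OpenAI’s response_format with a JSON schema) so the model is constrained to that schema, and still validate every response in code, because not every mode guarantees a match.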
For multi-step agents, constrain the intermediate steps too. Each step should produce a structured object the next step can validate, not a paragraph of prose.
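A minimal sketch of step-level validation, using only the standard library. The `Step` shape (an `action` plus an `argument`) and the allowed action names are assumptions for illustration; your own schema will differ.

```python
import json
from dataclasses import dataclass

# Hypothetical action names for this example.
ALLOWED_ACTIONS = {"search_kb", "escalate", "answer"}


@dataclass
class Step:
    action: str
    argument: str


def parse_step(raw: str) -> Step:
    """Parse one model reply into a Step, raising ValueError on any mismatch."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != {"action", "argument"}:
        raise ValueError(f"unexpected keys: {sorted(data)}")
    if not isinstance(data["action"], str) or data["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {data['action']!r}")
    if not isinstance(data["argument"], str):
        raise ValueError("argument must be a string")
    return Step(action=data["action"], argument=data["argument"])


print(parse_step('{"action": "search_kb", "argument": "return policy"}'))
```

The next step in the loop then receives a `Step` or an exception, never a paragraph it has to interpret.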
4. Plan For Failure States
Every agent has at least five failure modes. Bake them into the prompt:
- What if the tool returns nothing? — “Acknowledge the gap, suggest a rephrase, do not invent.”
- What if the user asks something outside your scope? — “Politely decline, offer to escalate.”
- What if a tool errors? — “Retry once with a corrected argument, then fall back to telling the user.”
- What if you’re confident but wrong? — “If a fact contradicts the retrieved context, trust the retrieved context.”
- What if the loop hits the budget? — “If you’ve taken more than 8 steps, summarise what you tried and ask for human help.”
Each one takes a sentence or two. Without them, the model invents recovery behaviour, and it’ll be wrong.
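Some of these rules can also be backed up in the harness, so the agent does not depend on the model remembering them. The sketch below assumes a hypothetical setup where a tool is a plain Python callable and `fix_argument` stands in for whatever produces the corrected argument (in practice, probably another model call).

```python
from typing import Callable, Optional

MAX_STEPS = 8  # the step budget from the list above


def call_tool_with_retry(
    tool: Callable[[str], list],
    argument: str,
    fix_argument: Callable[[str], str],
) -> tuple[Optional[list], Optional[str]]:
    """Call the tool, retry once with a corrected argument, then give up.

    Returns (result, error_message); exactly one of the two is None.
    """
    try:
        return tool(argument), None
    except Exception:
        try:
            return tool(fix_argument(argument)), None
        except Exception as second_error:
            # Fall back to telling the user instead of looping forever.
            return None, f"tool failed twice: {second_error}"


def over_budget(steps_taken: int) -> bool:
    """True once the loop has taken more than MAX_STEPS steps."""
    return steps_taken > MAX_STEPS
```

The prompt rule tells the model what to do. The harness check makes sure a forgotten rule costs one bad turn rather than an endless loop.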
5. Few-Shot Examples Pay For Themselves
Three to five worked examples (input → reasoning → output) at the bottom of the system prompt double instruction adherence on most agent tasks. The examples should be:
- Diverse — covering the hard cases, not just the obvious ones
- Concrete — real-looking data, not “Example 1: Input 1”
- Aligned with your output schema — the model copies the format aggressively
If your prompt is over 4K tokens after examples, prompt caching pays for itself within the first hundred calls.
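To make the three properties concrete, here is one possible way to store worked examples as data and render them into the prompt. `Example`, `render_examples`, and the sample content are all hypothetical, and the `action`/`argument` output shape is an assumed schema, not a standard.

```python
import json
from dataclasses import dataclass


@dataclass
class Example:
    user_input: str
    reasoning: str
    output: dict  # should follow the same schema the agent is asked to produce


def render_examples(examples: list[Example]) -> str:
    """Render examples as input -> reasoning -> output blocks for the prompt."""
    blocks = []
    for i, ex in enumerate(examples, start=1):
        blocks.append(
            f"Example {i}\n"
            f"Input: {ex.user_input}\n"
            f"Reasoning: {ex.reasoning}\n"
            f"Output: {json.dumps(ex.output)}"
        )
    return "\n\n".join(blocks)


EXAMPLES = [
    # An ordinary case.
    Example(
        user_input="Can I return headphones I already opened?",
        reasoning="Product policy question, so search the knowledge base.",
        output={"action": "search_kb", "argument": "refund window for opened items"},
    ),
    # A hard case: the request is out of scope.
    Example(
        user_input="Why was my card charged twice last month?",
        reasoning="Billing is out of scope, so hand off instead of guessing.",
        output={"action": "escalate", "argument": "duplicate charge on card"},
    ),
]

rendered = render_examples(EXAMPLES)
print(rendered)
```

Keeping the outputs as dictionaries rather than hand-typed strings means the examples cannot quietly diverge from the format the agent is expected to emit.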
6. Test Drift, Not Just Correctness
Your agent worked yesterday. Did the new prompt make it worse on the boring cases while making it better on the exciting ones?
Build a regression set: 30 to 100 input cases with expected behaviour. Run it on every prompt change. If a prompt change improves the headline metric by 5% but drops the pass rate on one specific case from 100% to 70%, you’ve shipped a regression and don’t know it yet.
LangSmith, Langfuse, Braintrust, and Promptfoo all do this. Pick one.
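The per-case comparison is simple enough to sketch without any of those tools. The example assumes each case has already been run several times and reduced to a pass rate between 0 and 1. The case names, numbers, and the 0.05 tolerance are made up for illustration.

```python
def headline(scores: dict[str, float]) -> float:
    """Average pass rate across all cases."""
    return sum(scores.values()) / len(scores)


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, tuple[float, float]]:
    """Return cases whose pass rate dropped by more than `tolerance`."""
    return {
        case: (baseline[case], candidate[case])
        for case in baseline
        if case in candidate and baseline[case] - candidate[case] > tolerance
    }


# Hypothetical per-case pass rates before and after a prompt change.
baseline = {"refund_policy": 1.0, "shipping_time": 0.6, "out_of_scope": 0.7}
candidate = {"refund_policy": 0.7, "shipping_time": 1.0, "out_of_scope": 0.9}

print(headline(baseline), headline(candidate))
print(find_regressions(baseline, candidate))
```

Here the average goes up while `refund_policy` falls from 1.0 to 0.7, which is exactly the kind of change an aggregate score hides.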
7. Watch The Trace, Not Just The Output
The output is the symptom. The trace is the disease. When an agent answers wrong, the question isn’t “what did it say?” — it’s:
- Which tool did it pick first, and why?
- What did the tool return?
- What did it pick second based on that return?
- Where did the reasoning go off the rails?
If you can’t answer those four questions in 30 seconds, you don’t have observability. Without observability, you have hope.
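If you are not using a tracing product yet, even a small in-memory record goes a long way towards answering those questions. This is a sketch under the assumption that your agent loop can report, for each step, the tool it chose, its stated reason, the input, and the output. `TraceStep` and `Trace` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class TraceStep:
    tool: str
    reason: str       # the model's stated reason for picking this tool
    tool_input: str
    tool_output: str


@dataclass
class Trace:
    steps: list[TraceStep] = field(default_factory=list)

    def record(self, tool: str, reason: str, tool_input: str, tool_output: str) -> None:
        """Append one step; call this from the agent loop after every tool call."""
        self.steps.append(TraceStep(tool, reason, tool_input, tool_output))

    def summary(self) -> str:
        """One line per step: which tool, why, and what came back."""
        return "\n".join(
            f"{i}. {s.tool}({s.tool_input!r}) because {s.reason} -> {s.tool_output}"
            for i, s in enumerate(self.steps, start=1)
        )


trace = Trace()
trace.record("search_kb", "product question", "refund window", "[] (no matches)")
trace.record("search_kb", "empty result, rephrasing", "return policy", "1 article found")
print(trace.summary())
```

Reading the summary top to bottom shows the first pick, what it returned, and what the agent did next, which covers the first three questions. The fourth still takes a human eye.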
The TL;DR
Structure the system prompt, write tool descriptions like you’re writing API docs, constrain output to a schema, plan for failure, give few-shot examples, regression-test on every change, and never debug from the output alone.
The bootcamp’s Module 1 covers prompt engineering fundamentals. The pattern above gets formal treatment in Module 3, when we move from one-shot LLM calls to stateful agents.