Prompt Engineering For Agents (Not Chatbots)

Most agent failures aren’t model failures. They’re prompt failures. Brittle system prompts, missing constraints, sloppy tool descriptions, no error handling — the model gets blamed for problems that live in the prompt.

Prompt engineering for agents is a different sport from prompt engineering for chatbots. Here’s the checklist we run on every agent before it ships.

1. The System Prompt Has Three Sections, In Order

We structure every agent system prompt the same way:

Identity and goal — one short paragraph. “You are an X. You help users do Y. Your job is done when Z.”
Tools available — listed with one-line descriptions and when to use each. The model uses this to plan.
Operating rules — the constraints that bound behaviour. What to do when uncertain. What never to do. How to format responses.

This structure mirrors how a competent new hire would read instructions. The order matters because the model uses early tokens to set the framing for everything that follows.
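As a minimal sketch of that three-part layout, the snippet below assembles a system prompt in the fixed order described above. The helper name `build_system_prompt`, the Acme persona, and the tool names are hypothetical placeholders for illustration, not part of any library.

```python
# Hypothetical helper: assembles identity, tools, and rules in a fixed order.
def build_system_prompt(identity: str, tools: dict[str, str], rules: list[str]) -> str:
    tool_lines = [f"- {name}: {desc}" for name, desc in tools.items()]
    rule_lines = [f"- {rule}" for rule in rules]
    sections = [
        identity.strip(),                                  # 1. identity and goal
        "Tools available:\n" + "\n".join(tool_lines),      # 2. tools
        "Operating rules:\n" + "\n".join(rule_lines),      # 3. operating rules
    ]
    return "\n\n".join(sections)


# Illustrative values only; the persona and tool names are made up.
prompt = build_system_prompt(
    identity=(
        "You are a support agent for Acme. You help users resolve product questions. "
        "Your job is done when the user confirms the issue is resolved or you have escalated it."
    ),
    tools={
        "search_kb": "Searches the knowledge base. Use for product questions.",
        "escalate": "Hands the conversation to a human. Use when a request is out of scope.",
    },
    rules=[
        "If you are uncertain, say so and ask a clarifying question.",
        "Never invent product details.",
        "Reply in plain text, under 150 words.",
    ],
)
print(prompt)
```

Keeping the assembly in one function makes it harder for a later edit to reorder the sections by accident.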

2. Tool Descriptions Are Where 80% Of Reliability Lives

The model picks tools based on their descriptions. Bad descriptions cause bad picks. Three rules:

Lead with the verb. “Searches the knowledge base for …” not “A function that …”
State when to use it AND when not to. “Use this for any product-related question. Do not use for billing.” Negative examples halve mistake rates.
Be honest about failure modes. “Returns an empty list if no matches are found — try rewriting the query if this happens.” The model will follow your recovery instructions.
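The three rules above might look like this in practice. The tool definition uses a `name` / `description` / `input_schema` layout similar to what JSON-schema-based tool APIs accept, but treat the exact field names as an assumption and check your provider's docs. `lint_description` is a hypothetical, deliberately crude heuristic, not a real linter.

```python
# Illustrative tool definition; field names are an assumption about your API.
search_kb_tool = {
    "name": "search_kb",
    "description": (
        "Searches the knowledge base for articles matching a query. "
        "Use this for any product-related question. Do not use for billing. "
        "Returns an empty list if no matches are found; "
        "try rewriting the query if this happens."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def lint_description(description: str) -> list[str]:
    """Crude keyword checks for the three rules; a hypothetical helper."""
    problems = []
    lowered = description.lower()
    # Rule 1: descriptions that open with an article usually bury the verb.
    if lowered.startswith(("a ", "an ", "the ", "this ")):
        problems.append("does not lead with a verb")
    # Rule 2: look for an explicit negative instruction.
    if "do not use" not in lowered:
        problems.append("no guidance on when not to use it")
    # Rule 3: look for a statement of what comes back, including failure.
    if "returns" not in lowered:
        problems.append("no description of what it returns or how it fails")
    return problems


print(lint_description(search_kb_tool["description"]))
print(lint_description("A function that searches things."))
```

A keyword check like this will miss plenty, but it is cheap enough to run on every tool description before a prompt change ships.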

3. Constrain Output Aggressively

Free-form output from agents is a debugging tax. Define a schema for every response: JSON, XML, or a strict markdown structure. Use structured-output mode (Claude’s tool_use or OpenAI’s response_format) so the output is pushed toward your schema, then validate it in code anyway: depending on the mode, a match is likely rather than guaranteed.

For multi-step agents, constrain the intermediate steps too. Each step should produce a structured object the next step can validate, not a paragraph of prose.
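One way to make each intermediate step checkable is to parse it into a typed object before the next step sees it. This is a sketch using only the standard library; the `StepResult` fields and the two allowed actions are assumptions chosen for illustration, not a standard schema.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical step schema: every step is either a tool call or a final answer.
ALLOWED_ACTIONS = {"call_tool", "final_answer"}


@dataclass
class StepResult:
    action: str
    tool: Optional[str] = None
    answer: Optional[str] = None


def parse_step(raw: str) -> StepResult:
    """Parse one model step, raising ValueError if it breaks the schema."""
    data = json.loads(raw)  # invalid JSON raises json.JSONDecodeError, a ValueError
    if not isinstance(data, dict):
        raise ValueError("step must be a JSON object")
    action = data.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    tool = data.get("tool")
    answer = data.get("answer")
    if action == "call_tool" and not isinstance(tool, str):
        raise ValueError("call_tool requires a tool name")
    if action == "final_answer" and not isinstance(answer, str):
        raise ValueError("final_answer requires an answer string")
    return StepResult(action=action, tool=tool, answer=answer)


step = parse_step('{"action": "call_tool", "tool": "search_kb"}')
print(step)
```

Because every failure surfaces as a `ValueError`, the calling loop needs only one `except` clause to decide whether to re-prompt or stop.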

4. Plan For Failure States

Every agent has at least five failure modes. Bake them into the prompt:

What if the tool returns nothing? — “Acknowledge the gap, suggest a rephrase, do not invent.”
What if the user asks something outside your scope? — “Politely decline, offer to escalate.”
What if a tool errors? — “Retry once with a corrected argument, then fall back to telling the user.”
What if you’re confident but wrong? — “If a fact contradicts the retrieved context, trust the retrieved context.”
What if the loop hits the budget? — “If you’ve taken more than 8 steps, summarise what you tried and ask for human help.”

Each one is a sentence or two. Without them, the model invents recovery behaviour, and it’ll be wrong.
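A sketch of how those five rules could be kept as data and rendered into the operating-rules section, with a code-side backstop for the step budget. The names `FAILURE_RULES`, `render_failure_rules`, and `over_budget` are hypothetical; the rule text is taken from the list above.

```python
MAX_STEPS = 8  # the step budget named in the rule above

# Situation -> instruction, using the wording from the checklist.
FAILURE_RULES = {
    "A tool returns nothing": "Acknowledge the gap, suggest a rephrase, do not invent.",
    "The user asks something outside your scope": "Politely decline, offer to escalate.",
    "A tool errors": "Retry once with a corrected argument, then fall back to telling the user.",
    "A fact you believe contradicts the retrieved context": "Trust the retrieved context.",
    f"You have taken more than {MAX_STEPS} steps": "Summarise what you tried and ask for human help.",
}


def render_failure_rules(rules: dict[str, str]) -> str:
    """Render the rules as a block of text for the system prompt."""
    lines = [f"- If: {situation}. Then: {instruction}" for situation, instruction in rules.items()]
    return "Failure handling:\n" + "\n".join(lines)


def over_budget(steps_taken: int, max_steps: int = MAX_STEPS) -> bool:
    """Code-side backstop in case the model ignores the budget rule."""
    return steps_taken > max_steps


print(render_failure_rules(FAILURE_RULES))
```

Sharing `MAX_STEPS` between the prompt text and the loop guard keeps the two from drifting apart when the budget changes.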

5. Few-Shot Examples Pay For Themselves

Three to five worked examples (input → reasoning → output) at the bottom of the system prompt double instruction adherence on most agent tasks. The examples should be:

Diverse — covering the hard cases, not just the obvious ones
Concrete — real-looking data, not “Example 1: Input 1”
Aligned with your output schema — the model copies the format aggressively

If your prompt is over 4K tokens after examples, prompt caching pays for itself within the first hundred calls.
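As an illustration of examples that stay aligned with the output schema, the sketch below stores each example's output as a real object and serialises it with `json.dumps`, so the format the model copies is always valid JSON. The example data and the `render_examples` helper are made up for this sketch.

```python
import json

# Illustrative examples covering an easy case, an out-of-scope case, and an off-topic case.
EXAMPLES = [
    {
        "input": "How do I reset my password?",
        "reasoning": "Product question, so search the knowledge base.",
        "output": {"action": "call_tool", "tool": "search_kb"},
    },
    {
        "input": "Why was I charged twice last month?",
        "reasoning": "Billing is out of scope, so escalate to a human.",
        "output": {"action": "call_tool", "tool": "escalate"},
    },
    {
        "input": "What is the weather in Colombo today?",
        "reasoning": "Unrelated to the product, so decline without calling a tool.",
        "output": {"action": "final_answer", "answer": "I can only help with product questions."},
    },
]


def render_examples(examples: list[dict]) -> str:
    """Render input -> reasoning -> output blocks for the end of the system prompt."""
    blocks = []
    for number, example in enumerate(examples, start=1):
        blocks.append(
            f"Example {number}\n"
            f"Input: {example['input']}\n"
            f"Reasoning: {example['reasoning']}\n"
            f"Output: {json.dumps(example['output'])}"
        )
    return "\n\n".join(blocks)


rendered = render_examples(EXAMPLES)
print(rendered)
```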

6. Test Drift, Not Just Correctness

Your agent worked yesterday. Did the new prompt make it worse on the boring cases while making it better on the exciting ones?

Build a regression set of 30 to 100 input cases with expected behaviour. Run it on every prompt change. If a change lifts the headline metric by 5% but drops one specific case’s pass rate from 100% to 70%, and you only watch the headline, you’ve shipped a regression and don’t know it yet.

LangSmith, Langfuse, Braintrust, and Promptfoo all support this kind of regression run. Pick one.
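The per-case comparison is simple enough to sketch without any of those tools. The function below assumes you already have a pass rate per case for the old and new prompt (for instance from running each case several times); the case names, the numbers, and the 0.05 tolerance are invented for illustration.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, tuple[float, float]]:
    """Return cases whose pass rate dropped by more than `tolerance`."""
    regressions = {}
    for case_id, old_rate in baseline.items():
        # A case missing from the candidate run counts as a pass rate of 0.
        new_rate = candidate.get(case_id, 0.0)
        if old_rate - new_rate > tolerance:
            regressions[case_id] = (old_rate, new_rate)
    return regressions


def mean_rate(rates: dict[str, float]) -> float:
    return sum(rates.values()) / len(rates)


# Made-up pass rates: the average improves while one case gets worse.
baseline = {"refund_policy": 1.0, "password_reset": 0.8, "out_of_scope": 0.7}
candidate = {"refund_policy": 0.7, "password_reset": 1.0, "out_of_scope": 0.9}

print("headline improved:", mean_rate(candidate) > mean_rate(baseline))
print("regressions:", find_regressions(baseline, candidate))
```

In this made-up run the average pass rate goes up while `refund_policy` falls from 1.0 to 0.7, which is the pattern a headline-only check hides.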

7. Watch The Trace, Not Just The Output

The output is the symptom. The trace is where you find the cause. When an agent answers wrong, the question isn’t “what did it say?” but:

Which tool did it pick first, and why?
What did the tool return?
What did it pick second based on that return?
Where did the reasoning go off the rails?

If you can’t answer those four questions in 30 seconds, you don’t have observability. Without observability, you have hope.
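A minimal sketch of a trace that can answer those four questions: each step records the tool picked, the stated reason, the arguments, and what came back. The `Trace` and `TraceStep` classes are hypothetical; hosted observability tools capture the same information with far more detail.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TraceStep:
    tool: str        # which tool the agent picked
    reasoning: str   # why it said it picked that tool
    arguments: dict  # what it passed in
    result: Any      # what the tool returned


@dataclass
class Trace:
    steps: list = field(default_factory=list)

    def record(self, tool: str, reasoning: str, arguments: dict, result: Any) -> None:
        self.steps.append(TraceStep(tool, reasoning, arguments, result))

    def summary(self) -> str:
        """One line per step, in order, so a wrong turn is easy to spot."""
        lines = []
        for number, step in enumerate(self.steps, start=1):
            lines.append(
                f"{number}. {step.tool} args={step.arguments} "
                f"why={step.reasoning!r} returned={step.result!r}"
            )
        return "\n".join(lines)


# Illustrative run with made-up data.
trace = Trace()
trace.record("search_kb", "Product question", {"query": "reset password"}, [])
trace.record("search_kb", "Empty result, so rewrite the query", {"query": "password recovery"}, ["KB-102"])
print(trace.summary())
```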

The TL;DR

Structure the system prompt, write tool descriptions like you’re writing API docs, constrain output to a schema, plan for failure, give few-shot examples, regression-test on every change, and never debug from the output alone.

The bootcamp’s Module 1 covers prompt engineering fundamentals. The pattern above gets formal treatment in Module 3, when we move from one-shot LLM calls to stateful agents.

Want to build agents in production?

Cohort 1 of the Agentic AI Bootcamp opens May 16, 2026. 16 weeks. In person at Hatch Works, Colombo. Two real production capstones.


