What is LangSmith?

LangSmith is the agent engineering platform from LangChain for tracing, evaluating, and deploying LLM applications and AI agents in production. It is framework-agnostic: it works with the OpenAI SDK, Anthropic SDK, LlamaIndex, DeepAgents, or plain Python, not just LangChain. Every time your agent runs, LangSmith captures a full trace of each decision, tool call, sub-agent invocation, and model response, giving you a step-by-step record of what happened and in what order. On top of that trace data, LangSmith layers automated evaluation workflows (LLM-as-judge or custom scoring), prompt versioning, annotation queues for human review, and deployment infrastructure for running agents at production scale.

How LangSmith works

LangSmith instruments your agent code and records everything it calls. Once you add the SDK and point it at your project, LangSmith records every run as a structured trace. A trace is a hierarchical log: the root is the top-level agent call, and nested inside it are every LLM call, tool execution, retrieval step, and sub-agent run, each with its inputs, outputs, latency, token count, and cost.
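
To make the shape of a trace concrete, here is a minimal, standard-library-only sketch of a decorator that records nested calls as a tree. It is not the LangSmith SDK: `toy_trace`, `Run`, and `ROOT_RUNS` are hypothetical names invented for this illustration, the "LLM" and "tool" are stand-in functions, and token counts and cost are left out. The real Python SDK exposes its own decorator (`traceable`) and sends runs to the LangSmith backend, which this sketch does not attempt.

```python
import contextvars
import functools
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Run:
    """One node in a trace: a single LLM call, tool call, or agent step."""
    name: str
    run_type: str
    inputs: dict
    outputs: Any = None
    latency_s: float = 0.0
    children: List["Run"] = field(default_factory=list)


# The run currently executing; nested calls attach themselves to it.
_current_run = contextvars.ContextVar("current_run", default=None)
ROOT_RUNS: List[Run] = []  # hypothetical in-memory store of top-level traces


def toy_trace(run_type: str) -> Callable:
    """Hypothetical decorator: record each call as a Run nested under its caller."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            parent = _current_run.get()
            run = Run(name=fn.__name__, run_type=run_type,
                      inputs={"args": args, "kwargs": kwargs})
            (parent.children if parent else ROOT_RUNS).append(run)
            token = _current_run.set(run)
            start = time.perf_counter()
            try:
                run.outputs = fn(*args, **kwargs)
                return run.outputs
            finally:
                run.latency_s = time.perf_counter() - start
                _current_run.reset(token)
        return wrapper
    return decorator


@toy_trace("tool")
def search(query: str) -> str:
    return f"results for {query}"  # stand-in for a real tool call


@toy_trace("llm")
def fake_llm(prompt: str) -> str:
    return prompt.upper()  # stand-in for a real model call


@toy_trace("chain")
def agent(question: str) -> str:
    context = search(question)
    return fake_llm(f"{question} | {context}")


def render(run: Run, depth: int = 0) -> List[str]:
    """Flatten the tree into indented lines, root first."""
    lines = [f"{'  ' * depth}{run.run_type}:{run.name}"]
    for child in run.children:
        lines.extend(render(child, depth + 1))
    return lines


answer = agent("what is tracing?")
tree = render(ROOT_RUNS[0])
print("\n".join(tree))
```

The root run is the `agent` call, and the `search` and `fake_llm` calls appear as its children, mirroring the root-plus-nested-steps structure described above.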

Here is what each core layer of the platform does:

Tracing: LangChain and LangGraph apps trace automatically. For any other framework, you add a decorator or wrapper and LangSmith instruments the call. Every trace is stored and searchable, and a built-in AI assistant called Polly can summarise a long trace and pinpoint failures.
Evaluation: You pick a dataset (built from real production traces or uploaded manually), define an evaluator (LLM-as-judge, custom Python function, or human reviewer), and run experiments. LangSmith scores each output, shows side-by-side comparisons between runs, and tracks whether a prompt change or model swap improved or regressed quality.
Online evals: Beyond offline experiments, LangSmith runs evaluators on live production traffic. It grades agent outputs in real time and surfaces failures as they happen, rather than after the fact.
Monitoring: Dashboards track cost, latency percentiles (P50, P99), error rates, and token usage across all your agents. The Insights feature automatically clusters traces to surface usage patterns and common failure modes.
Deployment: LangSmith Deployment lets you publish agents through a centralised registry with versioning, rollbacks, and support for human-in-the-loop workflows and background agent runs. It handles horizontal scaling for bursty agent workloads.
Fleet: A no-code layer where non-technical teams can describe what they need (daily briefings, competitor tracking, status reports) and LangSmith builds and runs the agent for them.
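
The evaluation layer above (dataset, evaluator, experiment, side-by-side comparison) can be sketched in plain Python. This is only an illustration of the workflow, not LangSmith's API: `DATASET`, `exact_match`, `run_experiment`, `baseline`, and `candidate` are all hypothetical, and the two "agents" are lookup tables standing in for real prompt or model versions.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical dataset: each example pairs an input with a reference answer.
DATASET = [
    {"input": "2 + 2", "reference": "4"},
    {"input": "capital of France", "reference": "Paris"},
    {"input": "opposite of hot", "reference": "cold"},
]


def exact_match(output: str, reference: str) -> float:
    """Custom evaluator: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0


def run_experiment(target: Callable[[str], str],
                   dataset: List[Dict[str, str]],
                   evaluator: Callable[[str, str], float]) -> dict:
    """Run the target on every example and score each output."""
    rows = []
    for example in dataset:
        output = target(example["input"])
        rows.append({
            "input": example["input"],
            "output": output,
            "score": evaluator(output, example["reference"]),
        })
    return {"rows": rows, "mean_score": mean(r["score"] for r in rows)}


def baseline(question: str) -> str:
    # Stand-in for the current prompt/model version.
    answers = {"2 + 2": "4", "capital of France": "paris"}
    return answers.get(question, "unknown")


def candidate(question: str) -> str:
    # Stand-in for the proposed prompt/model version.
    answers = {"2 + 2": "4", "capital of France": "Paris",
               "opposite of hot": "cold"}
    return answers.get(question, "unknown")


base = run_experiment(baseline, DATASET, exact_match)
cand = run_experiment(candidate, DATASET, exact_match)

# Side-by-side comparison: which inputs got better or worse?
pairs = list(zip(base["rows"], cand["rows"]))
improved = [b["input"] for b, c in pairs if c["score"] > b["score"]]
regressed = [b["input"] for b, c in pairs if c["score"] < b["score"]]

print(f"baseline={base['mean_score']:.2f} candidate={cand['mean_score']:.2f}")
print("improved:", improved, "regressed:", regressed)
```

Swapping `exact_match` for an LLM-as-judge call or a human label changes only the evaluator; the loop and the comparison stay the same.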

The platform has a free Developer plan with 5,000 traces per month, a Plus plan at $39 per seat per month, and Enterprise pricing for compliance-heavy deployments.

What you can build with LangSmith

Production monitoring for an LLM pipeline: Instrument your RAG or agent pipeline to send every run to LangSmith. Set up dashboards for latency, cost, and error rate, then configure alerts that fire when a metric crosses a threshold, so you see a problem as soon as it shows up in production.
Systematic prompt improvement workflow: When you want to change a system prompt or swap models, run a LangSmith experiment against your evaluation dataset first. Compare the scored outputs side by side before pushing the change, so you know the new version is actually better.
Regression testing for agent updates: Build an evaluation dataset from real production traces where your agent answered correctly. Every time you update your agent, run it against that dataset automatically and catch regressions before they reach users.
Human-in-the-loop annotation pipeline: Route uncertain or low-confidence agent outputs to LangSmith annotation queues. Domain experts review the traces, add labels, and those labelled examples feed back into evaluation datasets for the next iteration of the model.
Debugging a multi-agent system: When a complex multi-agent workflow produces a wrong answer, open the trace in LangSmith to see exactly which sub-agent made the bad call, what it received as input, and what the model returned. No more guessing which step failed.
Deploying a background research agent: Use LangSmith Deployment to publish a research agent that runs asynchronously on long tasks, supports exactly-once execution, and scales horizontally as demand grows, without managing your own infrastructure.
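
The production-monitoring use case in this list comes down to computing a few aggregate metrics over recent runs and comparing them with thresholds. The sketch below shows that logic under stated assumptions: the run records, field names (`latency_s`, `error`, `cost_usd`), threshold values, and the `check_alerts` helper are all hypothetical, and the nearest-rank percentile used here is one common definition, not necessarily the one LangSmith's dashboards use.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with >= pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_alerts(runs: List[dict], thresholds: Dict[str, float]) -> Dict[str, float]:
    """Return the metrics whose value exceeds the configured threshold."""
    latencies = [r["latency_s"] for r in runs]
    metrics = {
        "p50_latency_s": percentile(latencies, 50),
        "p99_latency_s": percentile(latencies, 99),
        "error_rate": sum(r["error"] for r in runs) / len(runs),
        "total_cost_usd": sum(r["cost_usd"] for r in runs),
    }
    return {name: value for name, value in metrics.items()
            if name in thresholds and value > thresholds[name]}


# Hypothetical recent runs: nine healthy calls and one slow, failed call.
latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.0, 0.9, 1.1, 1.3, 9.5]
RUNS = [{"latency_s": s, "error": s > 5.0, "cost_usd": 0.002} for s in latencies]

# Hypothetical alert thresholds.
THRESHOLDS = {
    "p50_latency_s": 2.0,
    "p99_latency_s": 5.0,
    "error_rate": 0.05,
    "total_cost_usd": 1.0,
}

fired = check_alerts(RUNS, THRESHOLDS)
for name, value in fired.items():
    print(f"ALERT {name}={value} exceeds {THRESHOLDS[name]}")
```

With this sample, the median latency stays healthy while the single slow, failed run trips the P99 and error-rate alerts, which is why tail percentiles are tracked alongside P50.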
