Agent evaluation & observability

Ship AI agents you can trust.

Instryx evaluates your agents across 100+ dimensions, traces every tool call, and pinpoints exactly why a run failed — so you catch regressions in CI, not in a customer's inbox.

pip install instryx-sdk · No credit card · 10K traces free

instryx.app/agents/checkout-bot

EVAL SCORECARD · PASS

Task success 94
Tool accuracy 88
Faithfulness 91
Safety 72

TRACE · 1.84s · 8 spans

plan
search_docs
retrieve
llm.generate
verify

SCORE DRIFT · 14d -6.2%

TRUSTED BY ENGINEERING TEAMS AT

Northwind AI
Quanta Labs
Helix
Vector&Co
Polaris
Rivet

Agents fail silently in production

A passing demo is not a passing system. The hard failures show up later, quietly, and at scale.

Failures don't throw exceptions

An agent that picks the wrong tool or hallucinates a confident answer returns HTTP 200. Your logs look healthy while users get wrong results.

A prompt tweak breaks ten other things

You fix one edge case and silently regress task success on five flows you weren't watching. Without per-dimension scoring you never see the trade.

"Why did it do that?" has no answer

By the time a customer complains, the run is gone. No trace, no inputs, no tool I/O — just a vibe and a screenshot to debug from.

Product

Everything you need to measure agents

One SDK to instrument, one dashboard to understand, one gate to protect production.

Continuous evals & scoring

Score every run across 100+ built-in dimensions, or define custom evals with code or LLM-as-judge rubrics. Scores update live as traces stream in.
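
A custom code eval can be as small as a function that maps one run to a score. The sketch below is plain Python, not the Instryx API: the `Run` record, the `contains_order_id` check, and the 0-to-1 scoring convention are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical record of one agent run (not an Instryx type)."""
    query: str
    answer: str


def contains_order_id(run: Run) -> float:
    """Score 1.0 if the answer cites an order ID like 'ORD-12345', else 0.0."""
    return 1.0 if re.search(r"\bORD-\d{5}\b", run.answer) else 0.0


def mean_score(runs: list[Run], eval_fn) -> float:
    """Average one eval over a batch of runs; an empty batch scores 0.0."""
    if not runs:
        return 0.0
    return sum(eval_fn(r) for r in runs) / len(runs)
```

Keeping each eval a pure function of the run makes it easy to replay the same check over live traffic and a golden dataset.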

Full trace timelines

A waterfall of every span — tool calls, retrievals, LLM generations — with tokens, latency, and cost attached to each step. Replay any run end to end.
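
To make the waterfall idea concrete, here is a rough sketch of a span record with per-run totals and a text rendering. The `Span` fields and both helper functions are hypothetical; the real trace schema may look quite different.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical span record; field names are illustrative."""
    name: str
    start_ms: float
    end_ms: float
    tokens: int = 0
    cost_usd: float = 0.0

    @property
    def latency_ms(self) -> float:
        return self.end_ms - self.start_ms


def summarize(spans: list[Span]) -> dict:
    """Roll a run's spans up into totals plus the slowest step."""
    slowest = max(spans, key=lambda s: s.latency_ms)
    return {
        "wall_ms": max(s.end_ms for s in spans) - min(s.start_ms for s in spans),
        "tokens": sum(s.tokens for s in spans),
        "cost_usd": round(sum(s.cost_usd for s in spans), 6),
        "slowest": slowest.name,
    }


def waterfall(spans: list[Span], width: int = 40) -> str:
    """Render spans as a text waterfall, one bar per span."""
    t0 = min(s.start_ms for s in spans)
    total = (max(s.end_ms for s in spans) - t0) or 1.0
    rows = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        offset = int((s.start_ms - t0) / total * width)
        length = max(1, int(s.latency_ms / total * width))
        rows.append(f"{s.name:<14}{' ' * offset}{'#' * length}")
    return "\n".join(rows)
```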

Automatic failure diagnosis

Instryx clusters failing runs and labels the root cause — wrong tool, bad retrieval, prompt-instruction conflict — so you fix the pattern, not one ticket.
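
How the clustering works internally is not described here. As a simplified illustration of fixing the pattern rather than the ticket, the sketch below groups failing runs by a root-cause label that is assumed to be already attached; the dict keys `id`, `passed`, and `root_cause` are hypothetical.

```python
from collections import defaultdict


def cluster_failures(runs: list[dict]) -> list[tuple[str, list[str]]]:
    """Group failing runs by root-cause label, largest cluster first.

    Each run is a dict with hypothetical keys: 'id', 'passed', 'root_cause'.
    """
    clusters: dict[str, list[str]] = defaultdict(list)
    for run in runs:
        if not run["passed"]:
            clusters[run.get("root_cause", "unlabeled")].append(run["id"])
    # Sort by cluster size (descending), then label, so the output is stable.
    return sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))
```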

Regression alerts in CI

Run your eval suite on every pull request. The build fails when scores drop past your threshold, with a diff of which dimensions moved and which runs broke.
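
The gate itself boils down to comparing per-dimension scores against a baseline. This is a minimal sketch of that logic, not the Instryx CLI: the function names, the dict-of-scores shape, and the 2-point default threshold are assumptions.

```python
def regressions(baseline: dict[str, float], candidate: dict[str, float],
                max_drop: float = 2.0) -> dict[str, float]:
    """Return {dimension: delta} for each dimension that fell by more than max_drop points."""
    moved = {}
    for dim, base in baseline.items():
        # A dimension missing from the candidate counts as a score of 0.
        delta = candidate.get(dim, 0.0) - base
        if delta < -max_drop:
            moved[dim] = round(delta, 2)
    return moved


def gate(baseline: dict[str, float], candidate: dict[str, float],
         max_drop: float = 2.0) -> int:
    """Exit code for a CI step: 0 when the build may pass, 1 when it should fail."""
    failed = regressions(baseline, candidate, max_drop)
    for dim, delta in sorted(failed.items()):
        print(f"REGRESSION {dim}: {delta:+.2f}")
    return 1 if failed else 0
```

In a pipeline, passing the result to `sys.exit` would turn a regression into a failing step.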

Side-by-side comparisons

Diff two models, prompts, or agent versions on the same dataset. See exactly which inputs flipped from pass to fail before you ship a change.
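
Finding the inputs that flipped is a set comparison over pass/fail results. The sketch below assumes each version's results are available as a mapping from input ID to a boolean; that shape and the `flipped` helper are illustrative, not part of the SDK.

```python
def flipped(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Compare pass/fail results keyed by input ID across two versions.

    Only inputs present in both result sets are compared.
    """
    shared = sorted(before.keys() & after.keys())
    return {
        "pass_to_fail": [k for k in shared if before[k] and not after[k]],
        "fail_to_pass": [k for k in shared if not before[k] and after[k]],
    }
```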

Production drift detection

Track score and latency distributions over time. Instryx flags drift the moment live behavior diverges from your last green eval — before users notice.
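
As a deliberately simple stand-in for distribution tracking, the sketch below compares the mean score of a live window against a baseline window. A production detector would likely compare full distributions; the function names and the 5% threshold are assumptions.

```python
from statistics import mean


def drift_pct(baseline: list[float], live: list[float]) -> float:
    """Relative change of the live mean versus the baseline mean, in percent.

    Assumes a non-empty baseline with a non-zero mean.
    """
    base = mean(baseline)
    return (mean(live) - base) / base * 100.0


def has_drifted(baseline: list[float], live: list[float],
                threshold_pct: float = 5.0) -> bool:
    """Flag drift when the mean moves by more than threshold_pct in either direction."""
    return abs(drift_pct(baseline, live)) > threshold_pct
```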

How it works

Instrument once. Get signal forever.

Four steps from a black-box agent to a measured, guarded system.

1. Connect traces

Wrap your agent with the SDK or drop in our OpenTelemetry exporter. Every tool call and generation streams to Instryx automatically.

2. Run evals

Pick from 100+ built-in metrics or write your own. Score live traffic, a golden dataset, or both, reference-free if you have no labels yet.

3. Diagnose failures

Open a failing cluster, replay the trace, and see the labeled root cause. Jump from a low score straight to the span that caused it.

4. Catch regressions in CI

Add one CLI step to your pipeline. Instryx blocks merges that regress your agent and posts the diff back to the pull request.

instrument_agent.py

# pip install instryx-sdk
from instryx import Instryx, evals

# planner, retriever, and model are your own callables, defined elsewhere
co = Instryx(project="checkout-bot")

@co.trace
def run_agent(query: str) -> str:
    plan = planner(query)
    docs = co.span("retrieve", retriever, plan)
    answer = co.span("llm.generate", model, docs)
    return answer

# score live runs on four of the 100+ built-in dimensions
co.evaluate(
    suite=[evals.TaskSuccess(), evals.Faithfulness(),
           evals.ToolAccuracy(), evals.Safety()],
    alert_on_regression=True,
)
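
If you are curious what a `trace` decorator and a `span` helper of this shape might do under the hood, here is a toy stand-in. `MiniTracer` is hypothetical and is not the Instryx SDK; it keeps spans in memory, handles one run at a time, and ignores nesting and threads.

```python
import functools
import time


class MiniTracer:
    """Toy stand-in for a tracing client; not the Instryx SDK."""

    def __init__(self):
        self.runs = []        # finished runs, each a list of span dicts
        self._current = None  # spans of the run in progress

    def trace(self, fn):
        """Decorator: collect every span recorded while fn executes."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            self._current = []
            try:
                return fn(*args, **kwargs)
            finally:
                self.runs.append(self._current)
                self._current = None
        return wrapper

    def span(self, name, fn, *args):
        """Call fn(*args), recording its name, duration, and error status."""
        start = time.perf_counter()
        error = None
        try:
            return fn(*args)
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            record = {"name": name,
                      "ms": (time.perf_counter() - start) * 1000,
                      "error": error}
            if self._current is not None:
                self._current.append(record)
```

Recording the span in a `finally` block means a step that raises still shows up in the trace, with its error attached.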

Use cases

Built for the agents you actually ship

Coding agents

Catch silent regressions in code generation and tool use. Score patch correctness, test pass rate, and unsafe shell calls before a model upgrade ships.

Support agents

Measure resolution rate, tone, and policy adherence on real conversations. Flag escalations the agent should have made and answers it shouldn't have given.

RAG pipelines

Pinpoint whether a bad answer came from retrieval or generation. Track context relevance, groundedness, and faithfulness on every query.
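
The retrieval-versus-generation question can be illustrated with a crude check: did the needed fact reach the model, and did the answer use it? The `blame` helper below is hypothetical, assumes you know the expected fact, and uses substring matching where real relevance and groundedness evals would be far more robust.

```python
def blame(expected_fact: str, retrieved: list[str], answer: str) -> str:
    """Attribute a bad answer to 'retrieval', 'generation', or 'ok'.

    Crude substring matching stands in for real relevance and groundedness evals.
    """
    fact = expected_fact.lower()
    if not any(fact in doc.lower() for doc in retrieved):
        return "retrieval"   # the right context never reached the model
    if fact not in answer.lower():
        return "generation"  # context was there, but the answer ignored it
    return "ok"
```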

100+ eval dimensions

~12ms median trace overhead

0 fewer prod incidents

0 traces scored daily

* Figures shown are illustrative.

Pricing

Start free. Scale when it matters.

Every plan includes full traces, the SDK, and the dashboard. Paid plans add volume, seats, and features such as CI regression gating.

Free

$0/mo

For solo developers measuring their first agent.

Up to 10K traces / month
All 100+ built-in evals
7-day trace retention
1 project, 1 seat
Community support

Team

$99/mo

For teams shipping agents to real users.

1M traces / month included
CI regression gating
Model & prompt comparisons
Drift & alerting
Unlimited projects, 10 seats
Email & Slack support

Enterprise

Custom

For regulated teams and high-volume deployments.

Unlimited traces & seats
On-prem or private VPC
SSO / SAML & audit logs
PII redaction & data residency
Custom evals & SLAs
Dedicated solutions engineer

Testimonials

Teams stopped guessing

"We shipped a model upgrade that looked fine in eval and would have tanked tool accuracy by 14 points. Instryx caught it in the PR. That alone paid for itself."

Staff Engineer · Series B dev-tools company

"The trace waterfall is the first time non-ML engineers on my team could actually debug an agent. They stopped pinging me for every weird output."

Eng Lead, AI Platform · Fintech, 200+ engineers

"Failure clustering turned a backlog of one-off bug reports into three root causes. We fixed the retrieval issue once and a whole category of complaints disappeared."

Founding Engineer · Support-automation startup

* Quotes shown are illustrative.

About

Built by people who watched agents fail

Instryx started because shipping an agent without measuring it felt like flying blind — so we built the instruments.

Shivam Gupta

Founder & CEO

A backend and distributed-systems engineer who built and operated agent tooling at scale, Shivam kept hitting the same wall: most agents ship without anyone really measuring whether they work. Instryx is his answer — the instrumentation layer he wished he’d had.


FAQ

Questions, answered

What does Instryx actually measure?

Instryx scores agents across 100+ dimensions — task completion, tool-call correctness, hallucination, instruction adherence, latency, cost, and safety. You can use built-in evals or define your own with code or LLM-as-judge rubrics.

Which frameworks and models does Instryx support?

Instryx is framework-agnostic. The SDK works with LangChain, LlamaIndex, CrewAI, the OpenAI and Anthropic SDKs, and raw HTTP. It captures traces from any model provider via OpenTelemetry-compatible spans.

How much overhead does tracing add?

Tracing is asynchronous and batched, adding roughly 12ms of median overhead per agent run. Spans are buffered locally and flushed in the background, so exporting them never blocks your agent's critical path.
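
The buffer-then-flush pattern described above can be sketched with a queue and a worker thread. `BufferedExporter` is a hypothetical illustration, not the SDK's exporter; the batch size, flush interval, and `send` callback are assumptions.

```python
import queue
import threading


class BufferedExporter:
    """Toy span exporter: enqueue on the hot path, send in a background thread."""

    def __init__(self, send, batch_size: int = 50, flush_interval_s: float = 0.5):
        self._send = send                  # callable that takes a list of spans
        self._batch_size = batch_size
        self._interval = flush_interval_s
        self._queue = queue.Queue()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._loop, daemon=True)
        self._worker.start()

    def record(self, span: dict) -> None:
        """Called from the agent: an in-memory enqueue, no network I/O."""
        self._queue.put_nowait(span)

    def _drain(self) -> None:
        """Send at most one batch of buffered spans."""
        batch = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._send(batch)

    def _loop(self) -> None:
        # wait() returns False on timeout and True once close() sets the event.
        while not self._stop.wait(self._interval):
            self._drain()

    def close(self) -> None:
        """Stop the worker, then flush whatever is still buffered."""
        self._stop.set()
        self._worker.join()
        while not self._queue.empty():
            self._drain()
```

The agent only pays for the enqueue; the slow `send` call runs on the worker thread, or on `close` at shutdown.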

Can I run evals in my CI pipeline?

Yes. The Instryx CLI runs your eval suite on every pull request and fails the build when scores regress beyond a threshold you set. Results post back as a status check with a diff of which dimensions moved.

Where is my trace data stored, and is it secure?

Data is encrypted in transit and at rest. Team and Enterprise plans support data residency controls, configurable retention, and PII redaction at the SDK layer. Enterprise can deploy fully on-prem or in a private VPC.

Do I need labeled data to get started?

No. You can start with reference-free evals — heuristics, LLM-as-judge, and self-consistency checks — on live traces. As you collect golden examples, Instryx uses them to sharpen scoring and detect regressions.
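
A self-consistency check is one reference-free signal you can compute without labels: sample the agent several times and measure how often the answers agree. The sketch below is an illustration under that assumption; the `self_consistency` helper and its whitespace and case normalization are not taken from the SDK.

```python
from collections import Counter


def self_consistency(answers: list[str]) -> float:
    """Fraction of sampled answers that agree with the most common one."""
    if not answers:
        return 0.0
    # Normalize case and whitespace so trivial differences do not count as disagreement.
    normalized = [" ".join(a.lower().split()) for a in answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)
```

A low score suggests the agent is guessing on that input, which makes it a good candidate for a golden example.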

Stop shipping agents you can't see

Instrument your first agent in five minutes. Free up to 10K traces a month — no credit card.

pip install instryx-sdk