Failures don't throw exceptions
An agent that picks the wrong tool or hallucinates a confident answer returns HTTP 200. Your logs look healthy while users get wrong results.
Instryx evaluates your agents across 100+ dimensions, traces every tool call, and pinpoints exactly why a run failed — so you catch regressions in CI, not in a customer's inbox.
pip install instryx-sdk · No credit card · 10K traces a month free
A passing demo is not a passing system. The hard failures show up later, quietly, and at scale.
You fix one edge case and silently regress task success on five flows you weren't watching. Without per-dimension scoring you never see the trade.
By the time a customer complains, the run is gone. No trace, no inputs, no tool I/O — just a vibe and a screenshot to debug from.
One SDK to instrument, one dashboard to understand, one gate to protect production.
Score every run across 100+ built-in dimensions, or define custom evals with code or LLM-as-judge rubrics. Scores update live as traces stream in.
A waterfall of every span — tool calls, retrievals, LLM generations — with tokens, latency, and cost attached to each step. Replay any run end to end.
Instryx clusters failing runs and labels the root cause — wrong tool, bad retrieval, prompt-instruction conflict — so you fix the pattern, not one ticket.
Run your eval suite on every pull request. The build fails when scores drop past your threshold, with a diff of which dimensions moved and which runs broke.
Diff two models, prompts, or agent versions on the same dataset. See exactly which inputs flipped from pass to fail before you ship a change.
Track score and latency distributions over time. Instryx flags drift the moment live behavior diverges from your last green eval — before users notice.
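The per-dimension gate described above can be sketched in plain Python. This is only an illustration of the idea, not Instryx's implementation: `find_regressions`, `DimensionDelta`, the `max_drop` threshold, and the scores below are all hypothetical names and made-up numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionDelta:
    """Score of one eval dimension before and after a change (hypothetical type)."""
    dimension: str
    baseline: float
    candidate: float

    @property
    def change(self) -> float:
        return self.candidate - self.baseline


def find_regressions(baseline, candidate, max_drop=0.02):
    """Return dimensions whose score fell by more than max_drop, worst first."""
    deltas = [
        DimensionDelta(name, baseline[name], candidate[name])
        for name in baseline
        if name in candidate
    ]
    regressed = [d for d in deltas if d.change < -max_drop]
    return sorted(regressed, key=lambda d: d.change)


# Made-up scores for illustration only.
baseline = {"task_success": 0.91, "tool_accuracy": 0.88, "faithfulness": 0.93}
candidate = {"task_success": 0.93, "tool_accuracy": 0.74, "faithfulness": 0.92}

regressions = find_regressions(baseline, candidate, max_drop=0.02)
for d in regressions:
    print(f"{d.dimension}: {d.baseline:.2f} -> {d.candidate:.2f} ({d.change:+.2f})")

# A CI step would exit with this code, so any regression fails the build.
exit_code = 1 if regressions else 0
```

Note how the aggregate picture hides the trade: task success improved while tool accuracy dropped sharply, which only a per-dimension comparison reveals.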
Four steps from a black-box agent to a measured, guarded system.
Wrap your agent with the SDK or drop in our OpenTelemetry exporter. Every tool call and generation streams to Instryx automatically.
Pick from 100+ built-in metrics or write your own. Score live traffic, a golden dataset, or both — reference-free if you have no labels yet.
Open a failing cluster, replay the trace, and see the labeled root cause. Jump from a low score straight to the span that caused it.
Add one CLI step to your pipeline. Instryx blocks merges that regress your agent and posts the diff back to the pull request.
# pip install instryx-sdk
from instryx import Instryx, evals

co = Instryx(project="checkout-bot")

@co.trace
def run_agent(query: str) -> str:
    plan = planner(query)
    docs = co.span("retrieve", retriever, plan)
    answer = co.span("llm.generate", model, docs)
    return answer

# score live runs across 100+ dimensions
co.evaluate(
    suite=[evals.TaskSuccess(), evals.Faithfulness(),
           evals.ToolAccuracy(), evals.Safety()],
    alert_on_regression=True,
)
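If you write your own metric, the scoring logic can be an ordinary function. The sketch below is a hypothetical code-based eval for tool-call correctness; the function name, its signature, and the tool names are assumptions for illustration, and how such a function would be registered with the SDK is not shown here.

```python
from typing import Sequence


def tool_accuracy(expected: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of tool calls that match the expected call at the same position.

    Dividing by the longer sequence penalizes both missing and extra calls.
    """
    if not expected:
        return 1.0 if not actual else 0.0
    matches = sum(1 for e, a in zip(expected, actual) if e == a)
    return matches / max(len(expected), len(actual))


# Hypothetical run: the agent looked up the order, then called the wrong tool.
score = tool_accuracy(
    expected=["search_orders", "issue_refund"],
    actual=["search_orders", "send_email"],
)
print(score)  # 0.5
```

A positional match is a deliberately strict choice; a set-based comparison would be more forgiving when call order does not matter.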
Catch silent regressions in code generation and tool use. Score patch correctness, test pass rate, and unsafe shell calls before a model upgrade ships.
Measure resolution rate, tone, and policy adherence on real conversations. Flag escalations the agent should have made and answers it shouldn't have given.
Pinpoint whether a bad answer came from retrieval or generation. Track context relevance, groundedness, and faithfulness on every query.
* Figures shown are illustrative.
Every plan includes full traces, the SDK, and the dashboard. You only pay as your volume grows.
$0/mo
For solo developers measuring their first agent.
$99/mo
For teams shipping agents to real users.
Custom
For regulated teams and high-volume deployments.
"We shipped a model upgrade that looked fine in eval and would have tanked tool accuracy by 14 points. Instryx caught it in the PR. That alone paid for itself."
"The trace waterfall is the first time non-ML engineers on my team could actually debug an agent. They stopped pinging me for every weird output."
"Failure clustering turned a backlog of one-off bug reports into three root causes. We fixed the retrieval issue once and a whole category of complaints disappeared."
* Quotes shown are illustrative.
Instryx started because shipping an agent without measuring it felt like flying blind — so we built the instruments.
Founder & CEO
A backend and distributed-systems engineer who built and operated agent tooling at scale, Shivam kept hitting the same wall: most agents ship without anyone really measuring whether they work. Instryx is his answer — the instrumentation layer he wished he’d had.
Instryx scores agents across 100+ dimensions — task completion, tool-call correctness, hallucination, instruction adherence, latency, cost, and safety. You can use built-in evals or define your own with code or LLM-as-judge rubrics.
Instryx is framework-agnostic. The SDK works with LangChain, LlamaIndex, CrewAI, the OpenAI and Anthropic SDKs, and raw HTTP. It captures traces from any model provider via OpenTelemetry-compatible spans.
Tracing is asynchronous and batched, adding roughly 12 ms of median overhead per agent run. Spans are buffered locally and flushed in the background, so network export stays off your agent's critical path.
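To make the buffering pattern concrete, here is a minimal sketch of a local buffer with a background flush, using only the Python standard library. It illustrates the general technique, not the SDK's internals: `BufferedSpanExporter`, `send_batch`, `max_batch`, and `flush_interval` are hypothetical names.

```python
import queue
import threading


class BufferedSpanExporter:
    """Buffers spans in memory and sends them in batches from a worker thread."""

    def __init__(self, send_batch, max_batch=50, flush_interval=0.5):
        self._send_batch = send_batch          # callable that ships a list of spans
        self._max_batch = max_batch
        self._flush_interval = flush_interval
        self._queue = queue.Queue()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def record(self, span: dict) -> None:
        """Called on the agent's hot path: an in-memory enqueue, no network I/O."""
        self._queue.put_nowait(span)

    def _drain(self) -> list:
        """Pull up to max_batch spans that are already waiting in the queue."""
        batch = []
        while len(batch) < self._max_batch:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch

    def _run(self) -> None:
        while not self._stop.is_set():
            batch = self._drain()
            if batch:
                self._send_batch(batch)
            else:
                self._stop.wait(self._flush_interval)  # idle until more spans or shutdown
        # Final flush so spans recorded before shutdown are not lost.
        while True:
            batch = self._drain()
            if not batch:
                break
            self._send_batch(batch)

    def shutdown(self) -> None:
        self._stop.set()
        self._worker.join()


# Usage: collect batches in a list instead of sending them over the network.
sent = []
exporter = BufferedSpanExporter(send_batch=sent.append, max_batch=2)
for i in range(5):
    exporter.record({"name": "llm.generate", "step": i})
exporter.shutdown()
total = sum(len(batch) for batch in sent)
print(total)  # 5
```

The caller only pays for the enqueue; the slower send happens on the worker thread.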
Yes. The Instryx CLI runs your eval suite on every pull request and fails the build when scores regress beyond a threshold you set. Results post back as a status check with a diff of which dimensions moved.
Data is encrypted in transit and at rest. Team and Enterprise plans support data residency controls, configurable retention, and PII redaction at the SDK layer. Enterprise can deploy fully on-prem or in a private VPC.
No. You can start with reference-free evals — heuristics, LLM-as-judge, and self-consistency checks — on live traces. As you collect golden examples, Instryx uses them to sharpen scoring and detect regressions.
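As one example of a reference-free check, self-consistency samples the agent several times on the same input and measures how often the answers agree, with no labeled answer required. The sketch below is an assumption about how such a check could look, not the built-in eval; `self_consistency` and the sample answers are hypothetical.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority answer, agreement ratio) over normalized samples."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Normalize case and whitespace so trivially different strings still match.
    normalized = [" ".join(a.lower().split()) for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    return top, count / len(normalized)


# Four hypothetical samples from the same agent on the same query.
samples = ["Refund issued", "refund  issued", "Order not found", "Refund issued"]
answer, agreement = self_consistency(samples)
print(answer, agreement)  # refund issued 0.75
```

Low agreement does not prove an answer is wrong, but it is a cheap signal for which runs deserve a closer look.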
Instrument your first agent in five minutes. Free up to 10K traces a month — no credit card.
pip install instryx-sdk