Agent Skills Dashboard
Aggregated performance metrics across the production agents — Showcases 1-3 — broken down by model and tool. Below, a live ReAct playground where you can trigger a tool-use event and watch the agent reason through it.
| Metric | Value |
|---|---|
| Total runs | 6,237 |
| Success rate | 91.4% |
| Avg latency | 836ms |
| Tool accuracy | 96.7% |
Success Rate vs. Latency
Each dot is a model — top-left is the sweet spot (high success, low latency).
Tool Usage Accuracy
Per-tool success vs. failure across all production agent runs.
Model breakdown
| Model | Hosting | Success | Latency | Runs | $/1k runs |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | cloud | 96.4% | 920ms | 1,284 | $18.40 |
| GPT-4o | cloud | 95.1% | 1,180ms | 1,547 | $12.90 |
| GPT-4o-mini | cloud | 91.8% | 610ms | 2,103 | $1.85 |
| Llama 3 70B | local | 87.2% | 1,640ms | 412 | $0.45 |
| Llama 3 8B | local | 78.6% | 280ms | 891 | $0.12 |
Playground
Trigger a tool-use event live. The agent decides which tools to call and in what order; every thought, action, and observation streams in as it happens.
Submit a query — watch the agent reason, call tools, observe results, and synthesise an answer.
How it works
Telemetry source
In production, every agent run from Showcases 1-3 writes a row to a Postgres `agent_runs` table (model, tool calls, latency, success, cost). The dashboard is a thin aggregation layer over that table; it currently shows mock data with realistic distributions.
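A minimal sketch of that aggregation layer, assuming a node-postgres `Pool` and column names matching the description above (`model`, `latency_ms`, `success`); the real schema and query may differ:

```ts
import { Pool } from "pg";

// Assumed connection setup; in the real app this lives alongside the Showcase 1-3 agents.
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export interface ModelStats {
  model: string;
  runs: number;
  success_rate: number;   // 0..1
  avg_latency_ms: number;
}

// Per-model success rate and latency, aggregated straight from agent_runs.
export async function getModelStats(): Promise<ModelStats[]> {
  const { rows } = await pool.query<ModelStats>(`
    SELECT
      model,
      COUNT(*)::int                                     AS runs,
      AVG(CASE WHEN success THEN 1 ELSE 0 END)::float8  AS success_rate,
      AVG(latency_ms)::float8                           AS avg_latency_ms
    FROM agent_runs
    GROUP BY model
    ORDER BY runs DESC
  `);
  return rows;
}
```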
Charts
Powered by Recharts. The scatter plot reveals the latency-vs-quality frontier: cloud models cluster top-right (slower but more accurate), while the local Llama variants give up accuracy for lower cost and, in the 8B case, much lower latency.
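A sketch of the scatter panel, assuming each point carries the model name plus latency and success fields (the `latencyMs` / `successPct` names are illustrative):

```tsx
import { ScatterChart, Scatter, XAxis, YAxis, Tooltip, CartesianGrid } from "recharts";

type Point = { model: string; latencyMs: number; successPct: number };

export function SuccessVsLatency({ data }: { data: Point[] }) {
  return (
    <ScatterChart width={480} height={320}>
      <CartesianGrid strokeDasharray="3 3" />
      {/* Latency on X, success on Y: the sweet spot sits top-left. */}
      <XAxis type="number" dataKey="latencyMs" name="Latency" unit="ms" />
      <YAxis type="number" dataKey="successPct" name="Success" unit="%" domain={[70, 100]} />
      <Tooltip cursor={{ strokeDasharray: "3 3" }} />
      <Scatter name="Models" data={data} fill="#6366f1" />
    </ScatterChart>
  );
}
```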
Playground pattern
Classic ReAct loop: thought → action (tool call) → observation → repeat → final answer. Each step streams over SSE so the trace renders live.
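A compact sketch of that loop as an SSE endpoint, using web-standard `ReadableStream` and `Response`; `planNextStep` and `runTool` are hypothetical stand-ins for the planner and tool registry:

```ts
type Step =
  | { type: "thought"; text: string }
  | { type: "action"; tool: string; input: unknown }
  | { type: "observation"; result: unknown }
  | { type: "answer"; text: string };

export async function GET(req: Request): Promise<Response> {
  const query = new URL(req.url).searchParams.get("q") ?? "";
  const encoder = new TextEncoder();

  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      // Each step is written as one SSE event so the client can render the trace live.
      const emit = (step: Step) =>
        controller.enqueue(encoder.encode(`data: ${JSON.stringify(step)}\n\n`));

      const context: Step[] = [];
      for (let i = 0; i < 6; i++) {                       // hard cap on loop iterations
        const step = await planNextStep(query, context);  // thought | action | answer
        emit(step);
        context.push(step);
        if (step.type === "answer") break;
        if (step.type === "action") {
          const observation: Step = {
            type: "observation",
            result: await runTool(step.tool, step.input),
          };
          emit(observation);
          context.push(observation);
        }
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
  });
}

// Hypothetical declarations so the sketch stands alone.
declare function planNextStep(query: string, context: Step[]): Promise<Step>;
declare function runTool(tool: string, input: unknown): Promise<unknown>;
```

On the client, an `EventSource` (or a streamed `fetch`) subscribes to this endpoint and appends each event to the trace as it arrives.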
Mock vs. real mode
The mock planner is a small heuristic (keyword → tool plan); the real-mode swap is a single LLM call with the tool registry passed as a system prompt. Tool implementations swap from canned data to OpenWeather + browser geolocation + pgvector search.
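The mock planner is roughly this shape: a keyword lookup that returns an ordered tool plan. Tool names below are illustrative, not necessarily the real registry:

```ts
type ToolPlan = { tool: string; input: Record<string, unknown> }[];

// Keyword → tool plan. First matching pattern wins; unknown queries fall back to search.
const KEYWORD_PLANS: [RegExp, (q: string) => ToolPlan][] = [
  [/weather|forecast|temperature/i, (q) => [
    { tool: "geolocate", input: {} },
    { tool: "weather", input: { query: q } },
  ]],
  [/doc|guide|how do i|what is/i, (q) => [
    { tool: "search", input: { query: q, topK: 5 } },
  ]],
];

export function mockPlan(query: string): ToolPlan {
  for (const [pattern, plan] of KEYWORD_PLANS) {
    if (pattern.test(query)) return plan(query);
  }
  return [{ tool: "search", input: { query, topK: 3 } }];
}
```

Swapping to real mode only replaces this function with the single LLM call described above; the step and streaming machinery stays the same.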