Showcase 4 · Telemetry & Tool-Use

Agent Skills Dashboard

Aggregated performance metrics across the production agents — Showcases 1-3 — broken down by model and tool. Below, a live ReAct playground where you can trigger a tool-use event and watch the agent reason through it.

Total runs: 6,237
Success rate: 91.4%
Avg latency: 836ms
Tool accuracy: 96.7%

Success Rate vs. Latency

Each dot is a model — top-left is the sweet spot (high success, low latency).

Tool Usage Accuracy

Per-tool success vs. failure across all production agent runs.

Model breakdown

Model             | Hosting | Success | Latency | Runs  | $/1k runs
Claude Sonnet 4.6 | cloud   | 96.4%   | 920ms   | 1,284 | $18.40
GPT-4o            | cloud   | 95.1%   | 1180ms  | 1,547 | $12.90
GPT-4o-mini       | cloud   | 91.8%   | 610ms   | 2,103 | $1.85
Llama 3 70B       | local   | 87.2%   | 1640ms  | 412   | $0.45
Llama 3 8B        | local   | 78.6%   | 280ms   | 891   | $0.12

Playground

Trigger a tool-use event live. The agent decides which tools to call and in what order — every thought, action, and observation streams in as it happens.

ReAct trace · idle

Submit a query — watch the agent reason, call tools, observe results, and synthesise an answer.

How it works

Telemetry source

In production, every agent run from Showcases 1-3 writes a row to a Postgres agent_runs table — model, tool calls, latency, success, cost. The dashboard is a thin aggregation layer over that table; it currently shows mock data with realistic distributions.
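A minimal sketch of that aggregation layer, assuming a node-postgres client and column names like latency_ms and success (the exact schema isn't shown on this page):

```ts
// Sketch of the aggregation layer over agent_runs.
// Assumptions: node-postgres client, columns named model / success / latency_ms.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

interface ModelStats {
  model: string;
  runs: number;
  successRate: number; // fraction of successful runs, 0..1
  avgLatencyMs: number;
}

export async function modelBreakdown(): Promise<ModelStats[]> {
  const { rows } = await pool.query(`
    SELECT model,
           COUNT(*)                                 AS runs,
           AVG(CASE WHEN success THEN 1 ELSE 0 END) AS success_rate,
           AVG(latency_ms)                          AS avg_latency_ms
    FROM agent_runs
    GROUP BY model
    ORDER BY runs DESC
  `);
  return rows.map((r) => ({
    model: r.model,
    runs: Number(r.runs),
    successRate: Number(r.success_rate),
    avgLatencyMs: Number(r.avg_latency_ms),
  }));
}
```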

Charts

Powered by Recharts. The scatter plot reveals the latency-vs-quality frontier: cloud models cluster top-right (slow + accurate); local Llama variants trade quality for speed and cost.
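A rough Recharts sketch of that scatter; the component name and data shape are illustrative rather than the showcase's exact code:

```tsx
// Illustrative Recharts scatter: latency on X, success rate on Y (top-left = sweet spot).
import { ScatterChart, Scatter, XAxis, YAxis, CartesianGrid, Tooltip } from "recharts";

const models = [
  { name: "Claude Sonnet 4.6", latencyMs: 920, successPct: 96.4 },
  { name: "GPT-4o-mini", latencyMs: 610, successPct: 91.8 },
  { name: "Llama 3 8B", latencyMs: 280, successPct: 78.6 },
  // ...one point per model from the breakdown table
];

export function SuccessVsLatency() {
  return (
    <ScatterChart width={480} height={320}>
      <CartesianGrid strokeDasharray="3 3" />
      <XAxis type="number" dataKey="latencyMs" name="Latency" unit="ms" />
      <YAxis type="number" dataKey="successPct" name="Success" unit="%" domain={[70, 100]} />
      <Tooltip cursor={{ strokeDasharray: "3 3" }} />
      <Scatter name="Models" data={models} />
    </ScatterChart>
  );
}
```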

Playground pattern

Classic ReAct loop: thought → action (tool call) → observation → repeat → final answer. Each step streams over SSE so the trace renders live.
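One way that loop can be wired up as an SSE endpoint (Express handler sketch; planNextStep, runTool, and the weather tool name are stand-ins for the real planner and tool registry):

```ts
import type { Request, Response } from "express";

type Step =
  | { type: "thought"; text: string }
  | { type: "action"; tool: string; input: unknown }
  | { type: "observation"; output: unknown }
  | { type: "answer"; text: string };

// Stand-in planner and tool runner (hypothetical; real mode swaps in an LLM call and live tools).
async function planNextStep(query: string, trace: Step[]): Promise<Step> {
  if (trace.length === 0) return { type: "thought", text: `I need a tool to answer: ${query}` };
  if (trace.length === 1) return { type: "action", tool: "weather", input: { query } };
  return { type: "answer", text: "Synthesised from the observations above." };
}
async function runTool(tool: string, input: unknown): Promise<unknown> {
  return { tool, input, result: "canned observation" };
}

export async function reactTraceHandler(req: Request, res: Response) {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");

  const query = String(req.query.q ?? "");
  const trace: Step[] = [];
  const emit = (step: Step) => {
    trace.push(step);
    res.write(`data: ${JSON.stringify(step)}\n\n`); // one SSE event per ReAct step
  };

  // thought, action, observation, repeated until the planner returns a final answer
  for (let i = 0; i < 10 && trace.at(-1)?.type !== "answer"; i++) {
    const step = await planNextStep(query, trace);
    emit(step);
    if (step.type === "action") {
      emit({ type: "observation", output: await runTool(step.tool, step.input) });
    }
  }
  res.end();
}
```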

Mock vs. real mode

The mock planner is a small heuristic (keyword → tool plan); the real-mode swap is a single LLM call with the tool registry passed as a system prompt. Tool implementations swap from canned data to OpenWeather + browser geolocation + pgvector search.
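A sketch of what that swap can look like; the keyword patterns, tool names, and realPlan signature below are illustrative assumptions rather than the actual registry:

```ts
// Sketch of the mock planner (keyword → tool plan) and the real-mode swap.
// Tool names (geolocate, weather, vector_search) are illustrative.
type ToolPlan = { tool: string; input: Record<string, unknown> }[];

const KEYWORD_PLANS: [RegExp, ToolPlan][] = [
  [/weather|rain|temperature/i, [{ tool: "geolocate", input: {} }, { tool: "weather", input: {} }]],
  [/docs?|search|find/i, [{ tool: "vector_search", input: {} }]],
];

export function mockPlan(query: string): ToolPlan {
  for (const [pattern, plan] of KEYWORD_PLANS) {
    if (pattern.test(query)) return plan;
  }
  return [{ tool: "vector_search", input: { query } }]; // fall back to retrieval
}

// Real mode: same return shape, but a single LLM call with the tool registry in the system prompt.
export async function realPlan(
  query: string,
  llm: (prompt: string) => Promise<string>, // inject whichever model client is configured
): Promise<ToolPlan> {
  const system = "Tools available: geolocate, weather, vector_search. Reply with a JSON array plan.";
  return JSON.parse(await llm(`${system}\n\nUser: ${query}`));
}
```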