Agent Skills Dashboard
Aggregated performance metrics across the production agents — Showcases 1-3 — broken down by model and tool. Below, a live ReAct playground where you can trigger a tool-use event and watch the agent reason through it.
| Metric | Value |
|---|---|
| Total runs | 6,237 |
| Success rate | 91.4% |
| Avg latency | 836ms |
| Tool accuracy | 96.7% |
Success Rate vs. Latency
Each dot is a model — top-left is the sweet spot (high success, low latency).
Tool Usage Accuracy
Per-tool success vs. failure across all production agent runs.
Model breakdown
| Model | Hosting | Success | Latency | Runs | $/1k runs |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | cloud | 96.4% | 920ms | 1,284 | $18.40 |
| GPT-4o | cloud | 95.1% | 1,180ms | 1,547 | $12.90 |
| GPT-4o-mini | cloud | 91.8% | 610ms | 2,103 | $1.85 |
| Llama 3 70B | local | 87.2% | 1,640ms | 412 | $0.45 |
| Llama 3 8B | local | 78.6% | 280ms | 891 | $0.12 |
Playground
Trigger a tool-use event live. The agent decides which tools to call and in what order; every thought, action, and observation streams in as it happens.
Submit a query — watch the agent reason, call tools, observe results, and synthesise an answer.
How it works
Telemetry source
In production, every agent run from Showcases 1-3 writes a row to a Postgres `agent_runs` table (model, tool calls, latency, success, cost). The dashboard is a thin aggregation layer over that table; it currently shows mock data with realistic distributions.
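A minimal sketch of that aggregation layer, assuming a node-postgres `Pool` and column names matching the description above (`model`, `latency_ms`, `success`); the real schema and query may differ:

```ts
import { Pool } from "pg";

// Assumed connection setup; in the real app this lives alongside the Showcase 1-3 agents.
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export interface ModelStats {
  model: string;
  runs: number;
  success_rate: number;   // 0..1
  avg_latency_ms: number;
}

// Per-model success rate and latency, aggregated straight from agent_runs.
export async function getModelStats(): Promise<ModelStats[]> {
  const { rows } = await pool.query<ModelStats>(`
    SELECT
      model,
      COUNT(*)::int                                     AS runs,
      AVG(CASE WHEN success THEN 1 ELSE 0 END)::float8  AS success_rate,
      AVG(latency_ms)::float8                           AS avg_latency_ms
    FROM agent_runs
    GROUP BY model
    ORDER BY runs DESC
  `);
  return rows;
}
```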
Charts
Powered by Recharts. The scatter plot reveals the latency-vs-quality frontier: cloud models cluster top-right (slower but more accurate), while the local Llama variants give up accuracy for lower cost and, in the 8B case, much lower latency.
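A sketch of the scatter panel, assuming each point carries the model name plus latency and success fields (the `latencyMs` / `successPct` names are illustrative):

```tsx
import { ScatterChart, Scatter, XAxis, YAxis, Tooltip, CartesianGrid } from "recharts";

type Point = { model: string; latencyMs: number; successPct: number };

export function SuccessVsLatency({ data }: { data: Point[] }) {
  return (
    <ScatterChart width={480} height={320}>
      <CartesianGrid strokeDasharray="3 3" />
      {/* Latency on X, success on Y: the sweet spot sits top-left. */}
      <XAxis type="number" dataKey="latencyMs" name="Latency" unit="ms" />
      <YAxis type="number" dataKey="successPct" name="Success" unit="%" domain={[70, 100]} />
      <Tooltip cursor={{ strokeDasharray: "3 3" }} />
      <Scatter name="Models" data={data} fill="#6366f1" />
    </ScatterChart>
  );
}
```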
Playground pattern
Classic ReAct loop: thought → action (tool call) → observation → repeat → final answer. Each step streams over SSE so the trace renders live.
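A compact sketch of that loop as an SSE endpoint, using web-standard `ReadableStream` and `Response`; `planNextStep` and `runTool` are hypothetical stand-ins for the planner and tool registry:

```ts
type Step =
  | { type: "thought"; text: string }
  | { type: "action"; tool: string; input: unknown }
  | { type: "observation"; result: unknown }
  | { type: "answer"; text: string };

export async function GET(req: Request): Promise<Response> {
  const query = new URL(req.url).searchParams.get("q") ?? "";
  const encoder = new TextEncoder();

  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      // Each step is written as one SSE event so the client can render the trace live.
      const emit = (step: Step) =>
        controller.enqueue(encoder.encode(`data: ${JSON.stringify(step)}\n\n`));

      const context: Step[] = [];
      for (let i = 0; i < 6; i++) {                       // hard cap on loop iterations
        const step = await planNextStep(query, context);  // thought | action | answer
        emit(step);
        context.push(step);
        if (step.type === "answer") break;
        if (step.type === "action") {
          const observation: Step = {
            type: "observation",
            result: await runTool(step.tool, step.input),
          };
          emit(observation);
          context.push(observation);
        }
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
  });
}

// Hypothetical declarations so the sketch stands alone.
declare function planNextStep(query: string, context: Step[]): Promise<Step>;
declare function runTool(tool: string, input: unknown): Promise<unknown>;
```

On the client, an `EventSource` (or a streamed `fetch`) subscribes to this endpoint and appends each event to the trace as it arrives.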
Mock vs. real mode
The mock planner is a small heuristic (keyword → tool plan); the real-mode swap is a single LLM call with the tool registry passed as a system prompt. Tool implementations swap from canned data to OpenWeather + browser geolocation + pgvector search.
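The mock planner is roughly this shape: a keyword lookup that returns an ordered tool plan. Tool names below are illustrative, not necessarily the real registry:

```ts
type ToolPlan = { tool: string; input: Record<string, unknown> }[];

// Keyword → tool plan. First matching pattern wins; unknown queries fall back to search.
const KEYWORD_PLANS: [RegExp, (q: string) => ToolPlan][] = [
  [/weather|forecast|temperature/i, (q) => [
    { tool: "geolocate", input: {} },
    { tool: "weather", input: { query: q } },
  ]],
  [/doc|guide|how do i|what is/i, (q) => [
    { tool: "search", input: { query: q, topK: 5 } },
  ]],
];

export function mockPlan(query: string): ToolPlan {
  for (const [pattern, plan] of KEYWORD_PLANS) {
    if (pattern.test(query)) return plan(query);
  }
  return [{ tool: "search", input: { query, topK: 3 } }];
}
```

Swapping to real mode only replaces this function with the single LLM call described above; the step and streaming machinery stays the same.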