Agent Skills Dashboard
A sample observability surface for agent runs: KPIs, a latency-vs-success scatter across models, and per-tool accuracy. It is designed as the aggregation layer that would sit over a production agent_runs telemetry table.
Sample observability UI — metrics are placeholders. Numbers below come from a static mock dataset, not a live telemetry pipeline.
Total runs
6,237
Success rate
91.4%
Avg latency
836ms
Tool accuracy
96.7%
Success Rate vs. Latency
Each dot is a model, with latency on the x-axis and success rate on the y-axis. Top-left is the sweet spot (high success, low latency).
Tool Usage Accuracy
Per-tool success vs. failure across the mock dataset.
Model breakdown
| Model | Hosting | Success | Latency | Runs | $/1k runs |
|---|---|---|---|---|---|
| Claude Sonnet 4.5 | cloud | 96.4% | 920ms | 1,284 | $18.40 |
| GPT-4o | cloud | 95.1% | 1180ms | 1,547 | $12.90 |
| GPT-4o-mini | cloud | 91.8% | 610ms | 2,103 | $1.85 |
| Llama 3 70B | local | 87.2% | 1640ms | 412 | $0.45 |
| Llama 3 8B | local | 78.6% | 280ms | 891 | $0.12 |
How it works
Data source (today)
A static TypeScript module (src/lib/agent/dashboard-data.ts) exports a small mock dataset of model and tool stats with realistic distributions. Aggregations (run-weighted success rate, run-weighted latency, tool accuracy) are computed in-process on import.
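As a rough illustration of the aggregation step, the sketch below recomputes the headline KPIs from the model breakdown table by weighting each model by its run count. The `ModelStat` interface, the `modelStats` array, and the `weightedMean` helper are hypothetical names chosen for this example; the actual exports of dashboard-data.ts are not shown on this page and may differ.

```typescript
// Hypothetical shape; the real exports of dashboard-data.ts may differ.
interface ModelStat {
  model: string;
  hosting: "cloud" | "local";
  successRate: number; // fraction in [0, 1]
  latencyMs: number;   // mean latency per run
  runs: number;
}

// Values copied from the model breakdown table.
const modelStats: ModelStat[] = [
  { model: "Claude Sonnet 4.5", hosting: "cloud", successRate: 0.964, latencyMs: 920, runs: 1284 },
  { model: "GPT-4o", hosting: "cloud", successRate: 0.951, latencyMs: 1180, runs: 1547 },
  { model: "GPT-4o-mini", hosting: "cloud", successRate: 0.918, latencyMs: 610, runs: 2103 },
  { model: "Llama 3 70B", hosting: "local", successRate: 0.872, latencyMs: 1640, runs: 412 },
  { model: "Llama 3 8B", hosting: "local", successRate: 0.786, latencyMs: 280, runs: 891 },
];

// Run-weighted mean: each model contributes in proportion to its run count.
function weightedMean(stats: ModelStat[], pick: (s: ModelStat) => number): number {
  const total = stats.reduce((sum, s) => sum + s.runs, 0);
  if (total === 0) return 0; // avoid dividing by zero on an empty dataset
  return stats.reduce((sum, s) => sum + pick(s) * s.runs, 0) / total;
}

const totalRuns = modelStats.reduce((sum, s) => sum + s.runs, 0);
const successRate = weightedMean(modelStats, (s) => s.successRate);
const avgLatencyMs = weightedMean(modelStats, (s) => s.latencyMs);

console.log(totalRuns);                       // 6237
console.log((successRate * 100).toFixed(1));  // "91.4"
console.log(Math.round(avgLatencyMs));        // 836
```

Weighting by runs is what makes the table line up with the KPI cards: a plain average of the five success rates would give 89.8%, not 91.4%.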
Data source (planned)
The intended production shape: every agent run from the other showcases writes a row to a Postgres agent_runs table — model, tool calls, latency, success, cost — and this page becomes a thin aggregation layer over it.
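One way the thin aggregation layer could look is sketched below. Everything in it is an assumption: the `AgentRun` field names are guesses based on the columns listed above, `summarizeByModel` is a hypothetical helper, and the sample rows are made up for illustration. In production the same grouping would more likely run as a SQL query against Postgres than in application code.

```typescript
// Hypothetical row shape; column names are assumptions, not a confirmed schema.
interface AgentRun {
  model: string;
  toolCalls: string[]; // names of tools invoked during the run
  latencyMs: number;
  success: boolean;
  costUsd: number;
}

interface ModelSummary {
  model: string;
  runs: number;
  successRate: number;   // fraction in [0, 1]
  avgLatencyMs: number;
  costPer1kRuns: number; // mean cost per run, scaled to 1,000 runs
}

// Group rows by model, then reduce each group to the table's columns.
function summarizeByModel(rows: AgentRun[]): ModelSummary[] {
  const groups: Record<string, AgentRun[]> = {};
  for (const row of rows) {
    (groups[row.model] = groups[row.model] || []).push(row);
  }
  return Object.keys(groups).map((model) => {
    const group = groups[model] || [];
    const runs = group.length;
    return {
      model,
      runs,
      successRate: group.filter((r) => r.success).length / runs,
      avgLatencyMs: group.reduce((sum, r) => sum + r.latencyMs, 0) / runs,
      costPer1kRuns: (group.reduce((sum, r) => sum + r.costUsd, 0) / runs) * 1000,
    };
  });
}

// Made-up rows, for illustration only.
const sampleRuns: AgentRun[] = [
  { model: "model-a", toolCalls: ["search"], latencyMs: 400, success: true, costUsd: 0.002 },
  { model: "model-a", toolCalls: [], latencyMs: 600, success: false, costUsd: 0.004 },
  { model: "model-b", toolCalls: ["calculator"], latencyMs: 200, success: true, costUsd: 0.001 },
];

const summaries = summarizeByModel(sampleRuns);
console.log(summaries);
```

Keeping the summary shape identical to the table columns means the page could swap the static module for a query result without touching the rendering code.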
Charts
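Powered by Recharts. The scatter plot reveals the latency-vs-quality frontier: the cloud models sit in the upper band (91.8–96.4% success at 610–1180ms), while the local Llama variants trade quality for cost. Llama 3 8B is the fastest model at 280ms but the least accurate, and Llama 3 70B is both slower and less accurate than every cloud model.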
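To make the idea of a frontier concrete, the sketch below computes which models are Pareto-optimal on the two plotted axes, using the values from the model breakdown table. This is an illustration rather than something the page is known to do: `ScatterPoint`, `dominates`, and `paretoFrontier` are hypothetical names, and the chart itself may simply plot all five points.

```typescript
interface ScatterPoint {
  model: string;
  latencyMs: number;  // x-axis
  successPct: number; // y-axis
}

// Values copied from the model breakdown table.
const points: ScatterPoint[] = [
  { model: "Claude Sonnet 4.5", latencyMs: 920, successPct: 96.4 },
  { model: "GPT-4o", latencyMs: 1180, successPct: 95.1 },
  { model: "GPT-4o-mini", latencyMs: 610, successPct: 91.8 },
  { model: "Llama 3 70B", latencyMs: 1640, successPct: 87.2 },
  { model: "Llama 3 8B", latencyMs: 280, successPct: 78.6 },
];

// b dominates a when b is at least as fast and at least as accurate,
// and strictly better on at least one of the two axes.
function dominates(b: ScatterPoint, a: ScatterPoint): boolean {
  return (
    b.latencyMs <= a.latencyMs &&
    b.successPct >= a.successPct &&
    (b.latencyMs < a.latencyMs || b.successPct > a.successPct)
  );
}

// Keep only points that no other point dominates, ordered left to right.
function paretoFrontier(pts: ScatterPoint[]): ScatterPoint[] {
  return pts
    .filter((p) => !pts.some((q) => dominates(q, p)))
    .sort((a, b) => a.latencyMs - b.latencyMs);
}

const frontier = paretoFrontier(points).map((p) => p.model);
console.log(frontier); // ["Llama 3 8B", "GPT-4o-mini", "Claude Sonnet 4.5"]
```

On these two axes alone, GPT-4o is dominated by Claude Sonnet 4.5 and Llama 3 70B by GPT-4o-mini. Cost is not part of this comparison, so a dominated model can still be the right choice on price.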
Live tool-use playground
The ReAct tool-use playground lives on its own page — see /playground to trigger an agent run and watch the trace stream in.