Reference
The four design docs that decide how Trap Street works under the hood. Most days you won't need these. They're here for when you do.
Scoring and metrics
How a task gets scored and what the leaderboard shows. The short version: your grader.py prints a JSON object with {passed, score, ...}, and the server picks up well-known keys (cost_usd_total, latency_ms_total, by_category, …) and renders them as columns. No configuration needed for 90% of tasks. The doc covers the full key list, the wire format the CLI uploads, and the opt-in dashboard: block for tasks that need custom columns.
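As a rough illustration of the grader contract described above, here is a minimal sketch of a grader.py that prints one JSON object. Only the key names passed, score, cost_usd_total, latency_ms_total, and by_category come from this page. The value shapes (for example, by_category as a mapping of per-category counts), the grade helper, the sample cases, and the 0.5 pass threshold are all assumptions made for the example; the full spec is the authority on the real format.

```python
import json


def grade(cases):
    """Score (expected, actual, category) tuples and build the result object.

    The helper name, the input shape, and the by_category layout are
    hypothetical; only the top-level key names come from the docs.
    """
    by_category = {}
    correct = 0
    for expected, actual, category in cases:
        ok = expected == actual
        correct += int(ok)
        bucket = by_category.setdefault(category, {"passed": 0, "total": 0})
        bucket["total"] += 1
        bucket["passed"] += int(ok)

    score = correct / len(cases) if cases else 0.0
    return {
        # "passed" is whatever the grader decides; 0.5 is an arbitrary threshold.
        "passed": score >= 0.5,
        "score": score,
        # Placeholder totals; a real grader would measure these itself.
        "cost_usd_total": 0.0,
        "latency_ms_total": 0,
        "by_category": by_category,
    }


if __name__ == "__main__":
    # Hypothetical cases standing in for a real task's outputs.
    sample = [("4", "4", "math"), ("paris", "rome", "geo"), ("9", "9", "math")]
    print(json.dumps(grade(sample)))
```

Note that the pass/fail decision lives in the JSON object itself, so a grader can exit with status 0 and still report passed as false.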
Trust tiers
Two tiers, one axis: who runs the eval. Self-reported (free, default today) — you run on your machine, we record what you upload. Verified (paid, post-MVP) — we run in a sandbox with held-out inputs and an LLM-API proxy so the numbers are ground truth, not self-report. The doc explains the economics (~50× cost reduction vs all-we-run) and the cheating mitigations.
Full spec on GitHub →
Glossary
Every word in Trap Street, defined once: solution, task, run, case, metric, judge, grader, leaderboard. Two pages. Useful when a term in the UI doesn't mean what you'd guess (especially passed, which is whatever the grader decides, not the process exit code).
API v0
Every HTTP endpoint, request/response shape, status state machine. The CLI talks to this; if you build a custom uploader or a CI integration, this is the contract. Stable — breaking changes get a v1.
Full spec on GitHub →
Repos
- trapstreet-mvp — this site (web + CLI monorepo)
- trapstreet-tasks — community task definitions
- trapstreet-cli on PyPI — the tp command
