Build a task

End-to-end: design a task, write the solution, run it locally, register it on trapstreet, submit.

We're going to build a task called sum-two-numbers. Solutions get two ints in a JSON file. They write a program that adds them and writes the sum to another JSON file. We score whether their answer matches ours. Whole thing takes about 15 minutes.

What you're making

A task is a folder. Four things live in it:

inputs/<case>/ — what we hand the solution's program
expected/<case>/ — what we expected back
judge.py — scores one case
grader.py — aggregates case scores into a run-level pass/fail

Plus a traptask.yaml that wires them together. That's the whole contract — the solution's solution and your task talk through files and environment variables only.

Step 1 — make the case files

mkdir -p sum-task/inputs/basic     sum-task/expected/basic
mkdir -p sum-task/inputs/negatives sum-task/expected/negatives
mkdir -p sum-task/inputs/zero      sum-task/expected/zero

Inputs:

echo '{"a":  3, "b":  5}' > sum-task/inputs/basic/nums.json
echo '{"a": -1, "b": -2}' > sum-task/inputs/negatives/nums.json
echo '{"a":  0, "b":  0}' > sum-task/inputs/zero/nums.json

Expected outputs:

echo '{"sum":  8}' > sum-task/expected/basic/sum.json
echo '{"sum": -3}' > sum-task/expected/negatives/sum.json
echo '{"sum":  0}' > sum-task/expected/zero/sum.json

Three cases, three inputs, three expected outputs. The folder names under inputs/ and expected/ are the case ids.

Step 2 — write the judge

judge.py runs once per case. It reads where the solution wrote its output (and where you put the expected answer), decides whether they match, and prints a JSON object containing at least a numeric score.

# sum-task/judge.py
import json, os
from pathlib import Path

payload = json.loads(os.environ["TRAPTASK_PAYLOAD"])

actual   = json.loads(Path(payload["outputs"]["sum.json"]).read_text())
expected = json.loads(Path(payload["expected"]["sum.json"]).read_text())

correct = actual.get("sum") == expected["sum"]
print(json.dumps({
    "score": 1.0 if correct else 0.0,
    "correct": correct,
}))

That's it. TRAPTASK_PAYLOAD is a JSON string giving you absolute paths into the solution's output dir and your expected dir for this case. You read what's there, decide, print one line.

Step 3 — write the grader (or skip it)

grader.py runs once at the end. It gets the list of case results and produces one run-level summary:

# sum-task/grader.py
import json, os

cases = json.loads(os.environ["TRAPTASK_PAYLOAD"])
scores = [c["metrics"]["score"] for c in cases if c.get("metrics")]
avg = sum(scores) / len(scores) if scores else 0
print(json.dumps({
    "passed": all(s == 1.0 for s in scores),
    "score":  round(avg, 3),
}))

You can skip writing grader.py entirely. If it's missing, the server averages the case scores for you and calls it passed when the average crosses 0.8. Write your own only when you want a stricter rule (here we want every case at 1.0 to count as passed).

Step 4 — wire it up with traptask.yaml

# sum-task/traptask.yaml
dirs:
  inputs: inputs/
  expected: expected/

cases:
  - id: basic
  - id: negatives
  - id: zero

judge:
  cmd: uv run python judge.py

grader:
  cmd: uv run python grader.py

You also need a pyproject.toml next to traptask.yaml so uv run can build a venv for judge/grader:

[project]
name = "sum-task"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = []

That's the task. Push the sum-task/ folder up to a GitHub repo.

Step 5 — publish on trapstreet

Go to /tasks/new, paste your task's GitHub URL into the auto-fill field, review the prefilled values, hit Create. Now anyone with the tp CLI can write a solver against it.

When you create the task as public, trapstreet requires every submitted solution to have a publicly reachable git repo (metadata.repo) — see build a solution for how that flows from the solver side.

What you didn't have to think about

Test solution orchestration — tp runs each case in its own subprocess, captures stdout, handles timeouts, you don't write any of that.
File paths — your judge / grader read TRAPTASK_PAYLOAD and never deal with cwd or relative paths.
Result storage — .trap/sum-two-numbers/<ts>/report.json is produced automatically, ready to upload.
Leaderboard columns, ranking, dedup — server picks well-known metric names from what your grader emits (score, passed, latency_ms_*, cost_usd_total) and renders columns. Zero config needed; see the reference when you want something custom.

Gotchas worth remembering

Case ids are folder names. inputs/basic/ and expected/basic/ must match exactly.
judge.py output schema is yours. Whatever JSON keys you print flow through to runs.metrics; pick names you'll want to see on the leaderboard.
grader.py is optional. Skip it unless you need a non-default pass rule.

You write	Runs	Reads	Emits
`judge.py`	per case	`TRAPTASK_PAYLOAD`	JSON with at least `score`
`grader.py`	once at the end	list of case metrics	JSON with at least `passed`, `score`
(auto fallback)	once if no grader	case scores	averages, marks passed at ≥ 0.8