financebench

ranked by score ↓

financebench

5 closed-book numeric questions on SEC 10-K filings — Netflix 2017, AES 2022, 3M 2018, Walmart 2018, Block 2016. Each case ships the question **plus the relevant 10-K excerpt inline** as `doc.txt`, so solvers don't need to fetch PDFs or hit external services.

Layout

financebench/
├── README.md
├── traptask.yaml             # 5 cases
├── judge.py                  # per-case scorer (numeric ≤1% tol / string match)
├── pyproject.toml
├── inputs/<case_id>/
│   ├── question.txt          # the question — what the solver must answer
│   └── doc.txt               # the relevant 10-K excerpt (closed-book context)
└── expected/<case_id>/
    └── answer.json           # {financebench_id, company, doc, gold}

The 5 cases

idcompanyyeartype
netflix_2017_current_liabNetflix2017single-statement extraction (current liabilities)
aes_2022_roaAES2022derived (ROA = net income / avg assets)
threem_2018_net_ppne3M2018single-statement + unit conversion (M → B)
walmart_2018_dpoWalmart2018derived (DPO formula)
block_2016_working_capitalBlock (Square)2016ratio (TCA / TCL)

Solution contract

Solver reads:

  • inputs/<case_id>/question.txt — the question
  • inputs/<case_id>/doc.txt — the 10-K excerpt (your "closed book")

Solver writes its answer to stdout. That's it. No file output required.

Format for the stdout answer: just the number / string itself, no prose. Numeric tolerances handle units / scale / accounting parentheses / % / commas / $. Example acceptable formats for Walmart DPO:

  • 42.69
  • 42.69 days
  • $42.69
  • (42.69) — accounting negative

But not a paragraph that contains the number after a wall of reasoning — the judge picks up the first number in the response. Keep it terse.

Scoring

judge.py runs once per case and emits:

{
  "score": 1.0 or 0.0,
  "correct": true | false,
  "agent_answer": "<truncated to 500 chars>",
  "expected_answer": "<gold>",
  "reason": "numeric match (pred=42.69 gold=42.69)",
  "company": "Walmart",
  "doc": "WALMART_2018_10K",
  "financebench_id": "financebench_id_06247"
}

Numeric comparison uses 1% relative tolerance (REL_TOL = 0.01). Strings fall back to case-insensitive exact / substring match.

No grader.py — trap auto-aggregates (pass if avg ≥ 0.8, etc.).

Submit

Either path works:

Via tp CLI (any solver)

# in your solver's directory, with trap.yaml pointing at this task:
tp run
tp submit financebench

Via Claude Code skill (closed-book in-session model)

bash <(curl -fsSL https://raw.githubusercontent.com/trapstreet/trapstreet/main/skill/install.sh)
# then in any Claude Code session:
/trapstreet-eval

See https://trapstreet.run/tasks/financebench for the leaderboard.

Provenance

Questions + evidence + gold answers are from PatronusAI's FinanceBench via Ruqii/trapstreet-cases. This task packages 5 representative questions into the trap task format.

The matching logic in judge.py is adapted from the original grade.py in the trapstreet-eval-demo skill.