Legal Contract Review

pdf-reader

Test how well AI agents understand, extract, and reason over real-world legal contracts.

What this task tests

Can the model read a real-world contract PDF and answer questions about it accurately, without hedging or hallucinating?

Every question has a single right answer that's literally in the document — a number, a date, a yes/no, an Act of Parliament, or a calculation derived from clauses in Section 6. The judge is harsh: agents that hedge ("I cannot determine from the document…"), skip parts of multi-part questions, or use the wrong format fail the case outright. No partial credit.

The 19 cases break down as:

Category            Count  Example
money               6      Monthly rent in year 2; total rent over fixed term
dates               1      Tenancy start date (DD/MM/YYYY)
clauses             8      Break clause present? Deposit scheme name? Governing Act?
deposit             1      What happens if a deposit dispute remains unresolved?
scenario            1      Early surrender 22 months in: compute the total cost owed
scenario_reasoning  1      If replacement tenant pays higher rent, does the original tenant benefit?

Input

Per case the agent receives:

  • INPUTS["question.txt"]: a single-line natural-language question
  • INPUTS["document.pdf"]: the AST (assured shorthold tenancy) agreement as a PDF (~1.8 MB, identical across all 19 cases)

Expected output

A single answer printed to stdout, either as plain text or as JSON of the form {"answer": "..."}. Both are accepted. The agent must:

  • Commit to a single answer (no "it depends", no "as an AI…")
  • Match the requested format when one is specified (DD/MM/YYYY, yes/no, 'N/A' if not specified, etc.)
  • For multi-part questions, answer all parts — one-word answers that skip the explanation are rejected
  • For scenario questions, show the calculation and give the final number
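
The output rules above can be sketched as a small local pre-check that an agent might run before printing. This is only an illustration: the helper names (looks_hedged, is_ddmmyyyy, emit_answer) and the phrase list are assumptions made for this example and are not the judge's actual logic, and the date used is a placeholder, not a value from the contract.

```python
import json
import re

# Hedging phrases quoted in the task description. The list is an
# illustrative assumption, not the judge's real rule set.
HEDGE_PATTERNS = [
    r"\bI cannot determine\b",
    r"\bit depends\b",
    r"\bas an AI\b",
]

# DD/MM/YYYY, the date format named in the task description.
DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{4}$")


def looks_hedged(answer: str) -> bool:
    """Return True if the answer contains one of the known hedging phrases."""
    return any(re.search(p, answer, re.IGNORECASE) for p in HEDGE_PATTERNS)


def is_ddmmyyyy(answer: str) -> bool:
    """Return True if the answer is shaped like DD/MM/YYYY (shape only)."""
    return bool(DATE_RE.match(answer.strip()))


def emit_answer(answer: str, as_json: bool = True) -> str:
    """Format the final answer as {"answer": "..."} JSON or as plain text."""
    text = answer.strip()
    return json.dumps({"answer": text}) if as_json else text


if __name__ == "__main__":
    draft = "01/09/2023"  # placeholder value for illustration only
    if not looks_hedged(draft) and is_ddmmyyyy(draft):
        print(emit_answer(draft))
```

A check like this catches only surface problems (format and obvious hedging); it says nothing about whether the answer matches the document.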

The judge scores each case 1.0 (pass) or 0.0 (fail). A run passes if ≥80% of scored cases pass.
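
As a rough sketch of the run-level rule, assuming every case in the list counts as scored: with 19 cases, an 80% threshold means at least 16 must pass, since 15 of 19 is about 78.9%. The function names below are hypothetical, and integer arithmetic is used to avoid floating-point edge cases at the boundary.

```python
PASS_THRESHOLD_PCT = 80  # from the task description: a run passes at >= 80%


def min_passes_needed(n_cases: int, threshold_pct: int = PASS_THRESHOLD_PCT) -> int:
    """Smallest number of passing cases that reaches the threshold."""
    # Ceiling division on integers: ceil(threshold_pct * n_cases / 100).
    return -(-threshold_pct * n_cases // 100)


def run_passes(case_scores: list[float], threshold_pct: int = PASS_THRESHOLD_PCT) -> bool:
    """Return True if the share of cases scored 1.0 is at least the threshold."""
    if not case_scores:
        return False  # assumption: an empty run does not pass
    passed = sum(1 for score in case_scores if score == 1.0)
    return passed * 100 >= threshold_pct * len(case_scores)


if __name__ == "__main__":
    print(min_passes_needed(19))                 # 16
    print(run_passes([1.0] * 16 + [0.0] * 3))    # True
    print(run_passes([1.0] * 15 + [0.0] * 4))    # False
```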

Purpose

This eval exists to answer a practical question: which model can read a contract PDF reliably, and among the ones that can, which is the cheapest?

Contract review is a real-world task where accuracy is non-negotiable but cost adds up fast: you're paying per page, per document, per tenant. A model that's 95% accurate at 1/10 the price of a frontier model can be the better business choice when 95% clears the accuracy bar you need. This task surfaces exactly that trade-off: the leaderboard reports both the score and the cost per run, so you can pick the cheapest model that still clears that bar.

If two models score identically, the cheaper one wins.
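
The selection logic described here can be sketched as follows. The RunResult type, its field names, and the sample figures are hypothetical and invented for illustration; they are not leaderboard data, and the cost unit is left unspecified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunResult:
    model: str           # hypothetical model label
    score: float         # fraction of cases passed, 0.0 to 1.0
    cost_per_run: float  # cost in whatever unit the leaderboard reports


def rank(results: list[RunResult]) -> list[RunResult]:
    """Order by score (highest first); on equal scores the cheaper run wins."""
    return sorted(results, key=lambda r: (-r.score, r.cost_per_run))


def cheapest_above(results: list[RunResult], min_score: float) -> Optional[RunResult]:
    """Cheapest run whose score clears the required accuracy bar, if any."""
    eligible = [r for r in results if r.score >= min_score]
    return min(eligible, key=lambda r: r.cost_per_run, default=None)


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    results = [
        RunResult("model-b", 0.95, 4.00),
        RunResult("model-c", 1.00, 6.00),
        RunResult("model-a", 0.95, 0.40),
    ]
    print([r.model for r in rank(results)])
    print(cheapest_above(results, min_score=0.90))
```

The two functions answer different questions: rank reproduces a score-first ordering with cost as the tie-breaker, while cheapest_above picks the lowest-cost run among those that meet a bar you set yourself.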