Legal Contract Review
ranked by score ↓pdf-reader
Test how well AI agents understand, extract, and reason over real-world legal contracts.
19 cases
Each case feeds files from inputs/<id>/ to the solution, expects files in expected/<id>/, and is scored by judge.py then aggregated by grader.py.
cases (19)
▸rent_year2What is the monthly rent in GBP for the second 12 months of the tenancy (months 13-24)?
input
question.txt
What is the monthly rent in GBP for the second 12 months of the tenancy (months 13-24)?
expected output
answer.json
{
"id": "rent_year2",
"answer": "2100",
"type": "numeric",
"matchers": [
{
"kind": "leading_numeric",
"value": 2100,
"tolerance": 0.01
}
],
"category": "money",
"difficulty": "medium"
}Scored by judge.py — see Scoring logic below for the full rule.
▸rent_year3What is the monthly rent in GBP for the third 12 months of the tenancy (months 25-36)? Answer 'N/A' if the fixed term is shorter than 36 months.
input
question.txt
What is the monthly rent in GBP for the third 12 months of the tenancy (months 25-36)? Answer 'N/A' if the fixed term is shorter than 36 months.
expected output
answer.json
{
"id": "rent_year3",
"answer": "2400",
"type": "numeric",
"matchers": [
{
"kind": "leading_numeric",
"value": 2400,
"tolerance": 0.01
}
],
"category": "money",
"difficulty": "medium"
}Scored by judge.py — see Scoring logic below for the full rule.
▸deposit_amountWhat is the deposit amount in GBP?
input
question.txt
What is the deposit amount in GBP?
expected output
answer.json
{
"id": "deposit_amount",
"answer": "2250",
"type": "numeric",
"matchers": [
{
"kind": "leading_numeric",
"value": 2250,
"tolerance": 0.01
}
],
"category": "money",
"difficulty": "easy"
}Scored by judge.py — see Scoring logic below for the full rule.
▸term_startWhat is the tenancy start date? Format: DD/MM/YYYY.
input
question.txt
What is the tenancy start date? Format: DD/MM/YYYY.
expected output
answer.json
{
"id": "term_start",
"answer": "05/09/2022",
"type": "date",
"matchers": [
{
"kind": "regex_required",
"pattern": "\\b0?5[/\\-]0?9[/\\-]2022\\b"
}
],
"category": "dates",
"difficulty": "easy"
}Scored by judge.py — see Scoring logic below for the full rule.
▸rent_payment_dayOn what day of the month is the rent payable?
input
question.txt
On what day of the month is the rent payable?
expected output
answer.json
{
"id": "rent_payment_day",
"answer": "on or prior 5th of the month",
"type": "text",
"matchers": [
{
"kind": "regex_required",
"pattern": "\\b5(?:th)?\\b"
},
{
"kind": "keywords_any",
"values": [
"before",
"prior",
"on or",
"by the"
]
},
{
"kind": "no_hedge"
}
],
"category": "money",
"difficulty": "medium"
}Scored by judge.py — see Scoring logic below for the full rule.
▸break_clauseDoes the tenancy agreement include a contractual break clause allowing the tenant to end the tenancy early without landlord discretion? Answer yes or no.
input
question.txt
Does the tenancy agreement include a contractual break clause allowing the tenant to end the tenancy early without landlord discretion? Answer yes or no.
expected output
answer.json
{
"id": "break_clause",
"answer": "no",
"type": "boolean",
"matchers": [
{
"kind": "leading_word",
"value": "no"
}
],
"category": "clauses",
"difficulty": "medium"
}Scored by judge.py — see Scoring logic below for the full rule.
▸early_surrenderCan the tenant request an early surrender of the tenancy subject to landlord approval and associated costs? Answer yes or no.
input
question.txt
Can the tenant request an early surrender of the tenancy subject to landlord approval and associated costs? Answer yes or no.
expected output
answer.json
{
"id": "early_surrender",
"answer": "yes",
"type": "boolean",
"matchers": [
{
"kind": "leading_word",
"value": "yes"
}
],
"category": "clauses",
"difficulty": "hard"
}Scored by judge.py — see Scoring logic below for the full rule.
▸fixed_term_departure_noticeIf the tenant wants to leave exactly at the end of the fixed term, is notice required? Answer yes or no.
input
question.txt
If the tenant wants to leave exactly at the end of the fixed term, is notice required? Answer yes or no.
expected output
answer.json
{
"id": "fixed_term_departure_notice",
"answer": "yes",
"type": "boolean",
"matchers": [
{
"kind": "leading_word",
"value": "yes"
}
],
"category": "clauses",
"difficulty": "hard"
}Scored by judge.py — see Scoring logic below for the full rule.
▸post_fixed_term_extensionWhat happens if the tenant remains in the property after the fixed term without signing a new tenancy agreement?
input
question.txt
What happens if the tenant remains in the property after the fixed term without signing a new tenancy agreement?
expected output
answer.json
{
"id": "post_fixed_term_extension",
"answer": "the tenancy automatically extends for six months",
"type": "text",
"matchers": [
{
"kind": "keywords_all",
"values": [
"six month",
"extend"
]
},
{
"kind": "no_hedge"
}
],
"category": "clauses",
"difficulty": "hard"
}Scored by judge.py — see Scoring logic below for the full rule.
▸rent_increase_scopeDoes the 5% rent increase apply during the original fixed term, the automatic extension period, or both?
input
question.txt
Does the 5% rent increase apply during the original fixed term, the automatic extension period, or both?
expected output
answer.json
{
"id": "rent_increase_scope",
"answer": "the automatic extension period only",
"type": "text",
"matchers": [
{
"kind": "keywords_all",
"values": [
"extension"
]
},
{
"kind": "keywords_any",
"values": [
"only",
"automatic"
]
},
{
"kind": "no_hedge"
}
],
"category": "money",
"difficulty": "hard"
}Scored by judge.py — see Scoring logic below for the full rule.
▸deposit_schemeWhich tenancy deposit protection scheme is the deposit registered with?
input
question.txt
Which tenancy deposit protection scheme is the deposit registered with?
expected output
answer.json
{
"id": "deposit_scheme",
"answer": "TDS",
"type": "text",
"matchers": [
{
"kind": "keywords_any",
"values": [
"TDS",
"Tenancy Deposit Scheme",
"The Dispute Service"
]
},
{
"kind": "no_hedge"
}
],
"category": "clauses",
"difficulty": "medium"
}Scored by judge.py — see Scoring logic below for the full rule.
▸pets_allowedAre pets permitted in the property? Answer yes, no, or 'with landlord consent'.
input
question.txt
Are pets permitted in the property? Answer yes, no, or 'with landlord consent'.
expected output
answer.json
{
"id": "pets_allowed",
"answer": "with landlord consent",
"type": "text",
"matchers": [
{
"kind": "keywords_any",
"values": [
"consent",
"permission",
"approval"
]
},
{
"kind": "keywords_any",
"values": [
"landlord"
]
},
{
"kind": "no_hedge"
}
],
"category": "clauses",
"difficulty": "medium"
}Scored by judge.py — see Scoring logic below for the full rule.
▸total_rent_fixed_termWhat is the total rent payable in GBP over the entire fixed term? Compute monthly rent × term length, summing across rent periods if rent changes year-on-year.
input
question.txt
What is the total rent payable in GBP over the entire fixed term? Compute monthly rent × term length, summing across rent periods if rent changes year-on-year.
expected output
answer.json
{
"id": "total_rent_fixed_term",
"answer": "77400",
"type": "numeric",
"matchers": [
{
"kind": "numeric",
"value": 77400,
"tolerance": 0.01
}
],
"category": "money",
"difficulty": "hard"
}Scored by judge.py — see Scoring logic below for the full rule.
▸late_rent_interest_rateWhat interest rate applies to late rent payments? Answer as written in the agreement (e.g. '3% above Bank of England base rate'), or 'N/A' if not specified.
input
question.txt
What interest rate applies to late rent payments? Answer as written in the agreement (e.g. '3% above Bank of England base rate'), or 'N/A' if not specified.
expected output
answer.json
{
"id": "late_rent_interest_rate",
"answer": "3% per annum above Bank of England base rate",
"type": "text",
"matchers": [
{
"kind": "keywords_all",
"values": [
"3%",
"base rate"
]
},
{
"kind": "keywords_any_word",
"values": [
"bank of england",
"boe"
]
},
{
"kind": "no_hedge"
}
],
"category": "money",
"difficulty": "hard"
}Scored by judge.py — see Scoring logic below for the full rule.
▸governing_actWhich Act of Parliament is cited in the agreement as governing this Assured Shorthold Tenancy? Answer with the Act name and year as written (e.g. 'Housing Act 1988').
input
question.txt
Which Act of Parliament is cited in the agreement as governing this Assured Shorthold Tenancy? Answer with the Act name and year as written (e.g. 'Housing Act 1988').
expected output
answer.json
{
"id": "governing_act",
"answer": "Housing Act 1988",
"type": "text",
"matchers": [
{
"kind": "keywords_all",
"values": [
"housing act",
"1988"
]
},
{
"kind": "no_hedge"
}
],
"category": "clauses",
"difficulty": "hard"
}Scored by judge.py — see Scoring logic below for the full rule.
▸inventory_referencedHas the tenant been told that an inventory or schedule of condition may be used to assess damage claims at the end of the tenancy? Answer yes or no.
input
question.txt
Has the tenant been told that an inventory or schedule of condition may be used to assess damage claims at the end of the tenancy? Answer yes or no.
expected output
answer.json
{
"id": "inventory_referenced",
"answer": "yes",
"type": "boolean",
"matchers": [
{
"kind": "leading_word",
"value": "yes"
}
],
"category": "clauses",
"difficulty": "hard"
}Scored by judge.py — see Scoring logic below for the full rule.
▸deposit_dispute_escalationWhat happens if the Tenant disputes the proposed deposit deductions and the dispute remains unresolved after reasonable attempts to resolve it?
input
question.txt
What happens if the Tenant disputes the proposed deposit deductions and the dispute remains unresolved after reasonable attempts to resolve it?
expected output
answer.json
{
"id": "deposit_dispute_escalation",
"answer": "The dispute may be submitted to the Independent Case Examiner (ICE) for adjudication.",
"type": "text",
"matchers": [
{
"kind": "keywords_any_word",
"values": [
"Independent Case Examiner",
"ICE"
]
},
{
"kind": "no_hedge"
}
],
"category": "deposit",
"difficulty": "hard"
}Scored by judge.py — see Scoring logic below for the full rule.
▸scenario_leave_22mo_replacement_1mo_gapThe tenant is 22 months into a 36-month fixed term and wants to surrender immediately. The Landlord agrees and finds a replacement tenant 1 month later, at the rent the agreement specifies for each remaining month. Per Section 6, calculate the total cost to the surrendering tenant: (a) rent for the 1-month gap, (b) letting fee 13.2% on the replacement's rent for the remainder of the fixed term, (c) inventory check-in proportional to months surrendered early (£144 × months_early / 36), (d) administration charges proportional to months surrendered early (£480 × months_early / 36). Show the calculation and give the final GBP total.
input
question.txt
The tenant is 22 months into a 36-month fixed term and wants to surrender immediately. The Landlord agrees and finds a replacement tenant 1 month later, at the rent the agreement specifies for each remaining month. Per Section 6 ("Special Clauses") of the agreement, calculate the total cost to the surrendering tenant. Include: (a) rent for the 1-month gap before the replacement moves in, (b) the letting fee (11% + VAT = 13.2%) on the rent the replacement will pay for the remainder of the original fixed term, (c) inventory check-in proportional to months surrendered early (£144 × months_early / 36), and (d) administration charges proportional to months surrendered early (£480 × months_early / 36). Show the calculation and give the final GBP total.
expected output
answer.json
{
"id": "scenario_leave_22mo_replacement_1mo_gap",
"answer": "£6,421.47",
"type": "numeric",
"matchers": [
{
"kind": "numeric",
"value": 6421.47,
"tolerance": 0.02
}
],
"category": "scenario",
"difficulty": "hard"
}Scored by judge.py — see Scoring logic below for the full rule.
▸early_surrender_economic_incentivePer Section 6 of the agreement, if a tenant surrenders early and the Landlord finds a replacement tenant who pays a HIGHER rent than the surrendering tenant's rate, does the surrendering tenant receive any benefit (e.g. a refund or credit) from the increased rent? Answer yes or no and explain briefly.
input
question.txt
Per Section 6 of the agreement, if a tenant surrenders early and the Landlord finds a replacement tenant who pays a HIGHER rent than the surrendering tenant's rate, does the surrendering tenant receive any benefit (e.g. a refund or credit) from the increased rent? Answer yes or no and explain briefly.
expected output
answer.json
{
"id": "early_surrender_economic_incentive",
"answer": "no",
"type": "boolean",
"matchers": [
{
"kind": "leading_word",
"value": "no"
},
{
"kind": "min_words",
"value": 10
},
{
"kind": "no_hedge"
}
],
"category": "scenario_reasoning",
"difficulty": "expert"
}Scored by judge.py — see Scoring logic below for the full rule.
scoring logic
judge.py runs once per case and prints a score per case. grader.py runs once at the end and folds case scores into a run-level summary. Without grader.py, the server averages case scores and marks the run passed at 0.8+.
▸judge.py358 lines · view on GitHub
"""Per-case judge for the tenancy_agreement task — harsh by design.
Reads the agent's stdout (plain text OR JSON `{"answer": "..."}`) and applies
matchers declared in expected/{case_id}/answer.json. A case scores 1.0 only if
ALL matchers pass — partial credit is intentionally not offered. The whole
point of this task is to expose agents that hedge, miss clauses, or skip parts
of multi-part questions; lenient grading would defeat that.
Matcher kinds supported:
- numeric {"kind":"numeric","value":1234.5,"tolerance":0.01}
Passes if ANY number in the answer matches. Use for
show-your-working questions where the model walks
through arithmetic before stating the total.
- leading_numeric {"kind":"leading_numeric","value":1234.5,"tolerance":0.01}
The FIRST number in the answer must match. Use for
simple extraction questions where listing decoy
numbers should not count as a pass.
- regex_required {"kind":"regex_required","pattern":"...","flags":"i"}
Pattern must match (re.search). Default flags = i.
- leading_word {"kind":"leading_word","value":"yes"}
First alphanumeric token must equal value (case-insens),
after stripping common prefixes like "Answer:" or
markdown bold. Forces the model to commit, not hedge.
- keywords_all {"kind":"keywords_all","values":["a","b"]}
Every value must appear (case-insens substring).
- keywords_any {"kind":"keywords_any","values":["a","b"]}
At least one value must appear (case-insens substring).
- keywords_any_word {"kind":"keywords_any_word","values":["ICE","BOE"]}
At least one value must appear as a whole word (\b...\b,
case-insens). Use for short acronyms that would
false-positive as substrings (ICE in "price", BOE in
"Boeing").
- no_hedge {"kind":"no_hedge"}
Reject answers that visibly punt the question, e.g.
"I cannot determine", "unclear from the document",
"I don't have access", "as an AI", etc.
- min_words {"kind":"min_words","value":5}
Reject one-word answers when the question asked for
reasoning/explanation.
Fallback (when no `matchers` provided):
Substring match of `answer` (and any `accepted` variants) against the
normalised agent output. Lenient but kept for cases that haven't been
hardened yet (e.g. scenario_* cases without a curated gold).
Outputs JSON on stdout — trap stores it as CaseResult.metrics. The grader
reads `metrics.score` plus category/difficulty/reason for the report.
"""
from __future__ import annotations
import json
import os
import re
from pathlib import Path
from typing import Any
HEDGE_PHRASES = [
"i cannot", "i can't", "i am unable", "i'm unable",
"i don't have access", "i do not have access",
"as an ai", "as a language model",
"cannot determine", "unable to determine",
"unclear from the document", "not clear from the document",
"i don't know", "i do not know",
"insufficient information", "not enough information",
"i'm not sure", "i am not sure",
]
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
def normalise(s: str) -> str:
return re.sub(r"\s+", " ", s).strip().lower()
def extract_agent_answer(stdout: str) -> str:
"""Accept JSON {"answer": "..."} or plain text. Strip surrounding whitespace."""
stdout = stdout.strip()
if not stdout:
return ""
try:
obj = json.loads(stdout)
if isinstance(obj, dict) and "answer" in obj:
return str(obj["answer"])
except json.JSONDecodeError:
pass
return stdout
def parse_numeric(s: str) -> float | None:
"""Extract the first plausible number from `s`. £, $, commas, spaces stripped."""
nums = parse_all_numerics(s)
return nums[0] if nums else None
def parse_all_numerics(s: str) -> list[float]:
"""Extract ALL plausible numbers from `s`. Used to match agents that show
working (e.g. "£1,950 × 12 + ... = £77,400" — we want to find 77400)."""
if not s:
return []
cleaned = s.replace("£", "").replace("$", "").replace(",", "")
out: list[float] = []
for m in NUMBER_RE.finditer(cleaned):
try:
out.append(float(m.group(0).replace(",", "")))
except ValueError:
continue
return out
_LEADING_LABEL_RE = re.compile(
r"^\s*(?:answer|a|response|reply)[\s*_`]*:\s*", re.IGNORECASE,
)
_LEADING_NOISE_RE = re.compile(r"^[\s*_`#>\-]+")
def leading_word(s: str) -> str:
"""First alpha token, after stripping markdown noise and labels like
"Answer:" / "**Answer**:" / "> ". Lets models prefix their commit with
a natural label without auto-failing the case."""
s = _LEADING_NOISE_RE.sub("", s)
s = _LEADING_LABEL_RE.sub("", s)
s = _LEADING_NOISE_RE.sub("", s)
m = re.search(r"[a-zA-Z]+", s)
return m.group(0).lower() if m else ""
# --- Matcher implementations ----------------------------------------------
def m_numeric(answer: str, spec: dict) -> tuple[bool, str]:
"""Pass if ANY number in the answer matches the target within tolerance.
This lets models that show working ("1950 × 12 + 2100 × 12 = 77400") pass
as long as the right number appears somewhere — exposing the actual answer
is what matters, not whether the model led with it. For simple extraction
where listing decoys should NOT pass, use `leading_numeric` instead."""
nums = parse_all_numerics(answer)
if not nums:
return False, "no number found in answer"
target = float(spec["value"])
tol = float(spec.get("tolerance", 0.01))
for n in nums:
if abs(n - target) <= tol:
return True, f"numeric ok (matched {n} of {nums} against target={target} tol={tol})"
return False, f"numeric mismatch (numbers found={nums} target={target} tol={tol})"
def m_leading_numeric(answer: str, spec: dict) -> tuple[bool, str]:
"""First number in the answer must match within tolerance. Rejects
decoy-number dumps like "rent 1950, deposit 2250, rent yr2 2100"
where the target appears but isn't the committed answer."""
nums = parse_all_numerics(answer)
if not nums:
return False, "no number found in answer"
target = float(spec["value"])
tol = float(spec.get("tolerance", 0.01))
if abs(nums[0] - target) <= tol:
return True, f"leading number ok ({nums[0]} == target {target} tol {tol})"
return False, f"leading number {nums[0]} ≠ target {target} (other numbers in answer: {nums[1:]})"
def m_regex_required(answer: str, spec: dict) -> tuple[bool, str]:
flags = 0
if "i" in spec.get("flags", "i"):
flags |= re.IGNORECASE
if re.search(spec["pattern"], answer, flags):
return True, f"regex matched"
return False, f"regex {spec['pattern']!r} did not match"
def m_leading_word(answer: str, spec: dict) -> tuple[bool, str]:
got = leading_word(answer)
want = str(spec["value"]).lower()
if got == want:
return True, f"leading word ok ({got!r})"
return False, f"leading word {got!r} ≠ required {want!r}"
def m_keywords_all(answer: str, spec: dict) -> tuple[bool, str]:
norm = normalise(answer)
missing = [v for v in spec["values"] if v.lower() not in norm]
if missing:
return False, f"missing required keyword(s): {missing}"
return True, "all keywords present"
def m_keywords_any(answer: str, spec: dict) -> tuple[bool, str]:
norm = normalise(answer)
if any(v.lower() in norm for v in spec["values"]):
return True, "at least one keyword present"
return False, f"none of {spec['values']} present"
def m_keywords_any_word(answer: str, spec: dict) -> tuple[bool, str]:
"""Whole-word variant of keywords_any — wraps each value in \\b...\\b so
short acronyms (ICE, BOE) don't false-match inside "price", "Boeing", etc."""
for v in spec["values"]:
if re.search(rf"\b{re.escape(v)}\b", answer, re.IGNORECASE):
return True, f"whole-word match: {v!r}"
return False, f"none of {spec['values']} matched as whole word"
def m_no_hedge(answer: str, spec: dict) -> tuple[bool, str]:
norm = normalise(answer)
for phrase in HEDGE_PHRASES:
if phrase in norm:
return False, f"hedge phrase detected: {phrase!r}"
return True, "no hedge phrases"
def m_min_words(answer: str, spec: dict) -> tuple[bool, str]:
count = len(re.findall(r"\S+", answer))
want = int(spec["value"])
if count >= want:
return True, f"word count ok ({count} ≥ {want})"
return False, f"too short ({count} < {want})"
MATCHERS = {
"numeric": m_numeric,
"leading_numeric": m_leading_numeric,
"regex_required": m_regex_required,
"leading_word": m_leading_word,
"keywords_all": m_keywords_all,
"keywords_any": m_keywords_any,
"keywords_any_word": m_keywords_any_word,
"no_hedge": m_no_hedge,
"min_words": m_min_words,
}
def run_matchers(answer: str, matchers: list[dict]) -> tuple[float, list[dict]]:
"""Run all matchers; all must pass. Returns (score, per-matcher results)."""
results = []
all_ok = True
for spec in matchers:
kind = spec.get("kind")
fn = MATCHERS.get(kind)
if fn is None:
results.append({"kind": kind, "pass": False, "reason": f"unknown matcher kind: {kind!r}"})
all_ok = False
continue
ok, reason = fn(answer, spec)
results.append({"kind": kind, "pass": ok, "reason": reason})
if not ok:
all_ok = False
return (1.0 if all_ok else 0.0), results
def fallback_substring(answer: str, expected: dict) -> tuple[float, str]:
"""Lenient substring match when no matchers defined. Used for scenarios
that don't have a curated gold yet — they shouldn't fail builds outright,
but they also shouldn't claim a passing score from nothing."""
targets = [t for t in [expected.get("answer"), *(expected.get("accepted") or [])] if t]
if not targets:
return 0.0, "no gold answer set (skip-equivalent)"
norm = normalise(answer)
hit = next((t for t in targets if normalise(t) in norm), None)
if hit:
return 1.0, f"substring match ({hit!r})"
return 0.0, f"no substring match against {targets}"
# --- Main ------------------------------------------------------------------
def main() -> None:
payload = json.loads(os.environ["TRAPTASK_PAYLOAD"])
stdout = Path(payload["outputs"]["case_stdout"]).read_text()
exit_code = json.loads(Path(payload["outputs"]["case_meta.json"]).read_text())["exit_code"]
expected = json.loads(Path(payload["expected"]["answer.json"]).read_text())
# Pick up usage.json if the solution captured it (Sonnet + caching runs)
usage_record: dict[str, Any] = {}
usage_path = payload["outputs"].get("usage.json")
if usage_path and Path(usage_path).exists():
try:
usage_record = json.loads(Path(usage_path).read_text())
except json.JSONDecodeError:
usage_record = {}
agent_answer = extract_agent_answer(stdout)
# Solution crashed → hard fail.
if exit_code != 0:
out: dict[str, Any] = {
"score": 0.0,
"reason": f"solution exited {exit_code}",
"agent_answer": agent_answer,
"id": expected.get("id"),
"category": expected.get("category"),
"difficulty": expected.get("difficulty"),
**usage_record,
}
print(json.dumps(out))
return
# Empty stdout → hard fail (silently passing the test is the worst outcome).
if not agent_answer:
out = {
"score": 0.0,
"reason": "agent produced no answer",
"agent_answer": "",
"id": expected.get("id"),
"category": expected.get("category"),
"difficulty": expected.get("difficulty"),
**usage_record,
}
print(json.dumps(out))
return
matchers = expected.get("matchers")
if matchers:
score, matcher_results = run_matchers(agent_answer, matchers)
out = {
"score": score,
"matcher_results": matcher_results,
"agent_answer": agent_answer,
"expected_answer": expected.get("answer"),
"id": expected.get("id"),
"type": expected.get("type"),
"category": expected.get("category"),
"difficulty": expected.get("difficulty"),
**usage_record,
}
else:
score, reason = fallback_substring(agent_answer, expected)
# If there's no gold and no matchers, surface score=None so the grader
# can flag it as "not yet curated" rather than mark the agent failed.
if expected.get("answer") is None:
out = {
"score": None,
"reason": "no curated gold yet (case not gradeable)",
"agent_answer": agent_answer,
"id": expected.get("id"),
"type": expected.get("type"),
"category": expected.get("category"),
"difficulty": expected.get("difficulty"),
**usage_record,
}
else:
out = {
"score": score,
"reason": reason,
"agent_answer": agent_answer,
"expected_answer": expected.get("answer"),
"id": expected.get("id"),
"type": expected.get("type"),
"category": expected.get("category"),
"difficulty": expected.get("difficulty"),
**usage_record,
}
print(json.dumps(out))
if __name__ == "__main__":
main()
▸grader.py81 lines · view on GitHub
"""Overall grader for the tenancy_agreement task.
Aggregates per-case judge results into a run-level verdict. Emits JSON to stdout —
trap stores it as GraderResult.metrics. Convention: include `passed` (bool) and
`score` (float) so the reporter can render them.
Pass threshold defaults to 80% accuracy; tweak below.
"""
from __future__ import annotations
import json
import os
from collections import Counter
PASS_THRESHOLD = 0.80
def main() -> None:
cases = json.loads(os.environ["TRAPTASK_PAYLOAD"])
scored = [c for c in cases if c.get("metrics") and c["metrics"].get("score") is not None]
skipped = [c for c in cases if not c.get("metrics") or c["metrics"].get("score") is None]
if scored:
accuracy = sum(c["metrics"]["score"] for c in scored) / len(scored)
else:
accuracy = 0.0
# Break out accuracy by category (the judge tags each case with its category).
by_category_score: Counter[str] = Counter()
by_category_total: Counter[str] = Counter()
for c in scored:
cat = c["metrics"].get("category")
if cat:
by_category_total[cat] += 1
by_category_score[cat] += c["metrics"]["score"]
by_category_pct = {
k: round(by_category_score[k] / by_category_total[k], 3)
for k in by_category_total
}
passed = bool(scored) and accuracy >= PASS_THRESHOLD
# Latency stats — trap records `duration` (seconds) per case. The leaderboard
# displays median latency. Round-trip to ms for the JSON contract.
durations = [c.get("duration", 0.0) for c in cases if c.get("duration") is not None]
if durations:
ds = sorted(durations)
latency_ms_median = round(ds[len(ds) // 2] * 1000, 1)
latency_ms_p95 = round(ds[int(0.95 * len(ds))] * 1000, 1) if len(ds) > 1 else latency_ms_median
latency_ms_total = round(sum(ds) * 1000, 1)
else:
latency_ms_median = latency_ms_p95 = latency_ms_total = 0.0
# Cost — sum per-case usd_cost if the solution captured usage; otherwise
# leave it None and let the regrade/submit script stamp a known total.
case_costs = [c["metrics"].get("usd_cost") for c in scored if isinstance(c.get("metrics"), dict)]
cost_usd_total = round(sum(x for x in case_costs if x is not None), 4) if any(x is not None for x in case_costs) else None
n_passed = sum(1 for c in scored if c["metrics"]["score"] == 1.0)
print(json.dumps({
"passed": passed,
"score": round(accuracy, 3),
"n_passed": n_passed,
"n_total": len(cases),
"n_scored": len(scored),
"n_skipped_no_gold": len(skipped),
"threshold": PASS_THRESHOLD,
"by_category": by_category_pct,
"latency_ms_median": latency_ms_median,
"latency_ms_p95": latency_ms_p95,
"latency_ms_total": latency_ms_total,
"cost_usd_total": cost_usd_total,
}))
if __name__ == "__main__":
main()
