Cross-Timezone Scheduler
A trap-compatible task that asks an agent to schedule a meeting across attendees in different time zones, given each attendee's local availability window. The agent must return a JSON object with a single canonical meeting time in UTC plus each attendee's local start time.
2 cases
Each case feeds the files in inputs/<id>/ to the solution and compares the result against the files in expected/<id>/. judge.py scores each case, and grader.py then aggregates the per-case scores.
cases (2)
▸ dst_gap_with_ist: 60-min meeting across SF/London/Mumbai on 2026-03-26. The UK is still on GMT (BST starts March 29) and India is on UTC+5:30 (no DST). Tests DST-boundary and half-hour-zone math simultaneously.
input
question.txt
You are a scheduling assistant.
Schedule a 60-minute meeting TOMORROW with the following attendees and their LOCAL availability windows.
Today is 2026-03-25 (Wednesday).
Attendees:
- Alice — San Francisco (America/Los_Angeles) — available 07:00–09:00 local
- Bob — London (Europe/London) — available 14:00–16:00 local
- Priya — Mumbai (Asia/Kolkata) — available 19:30–21:30 local
Pick any 60-minute slot that fits inside ALL three local availability windows. Account for daylight-saving time on the actual date.
Return ONLY a JSON object (no commentary, no markdown fences) with this exact schema:
{
"start_utc": "<ISO 8601 timestamp in UTC, e.g. 2026-03-26T14:00:00Z>",
"duration_min": 60,
"attendees": [
{"name": "Alice", "tz": "America/Los_Angeles", "local_start": "YYYY-MM-DD HH:MM"},
{"name": "Bob", "tz": "Europe/London", "local_start": "YYYY-MM-DD HH:MM"},
{"name": "Priya", "tz": "Asia/Kolkata", "local_start": "YYYY-MM-DD HH:MM"}
]
}
expected output
answer.json
{
"id": "dst_gap_with_ist",
"category": "dst_boundary",
"difficulty": "hard",
"duration_min": 60,
"expected_start_utc_min": "2026-03-26T14:00:00Z",
"expected_start_utc_max": "2026-03-26T15:00:00Z",
"attendees": [
{
"name": "Alice",
"tz": "America/Los_Angeles",
"available_local_min": "2026-03-26T07:00:00",
"available_local_max": "2026-03-26T09:00:00"
},
{
"name": "Bob",
"tz": "Europe/London",
"available_local_min": "2026-03-26T14:00:00",
"available_local_max": "2026-03-26T16:00:00"
},
{
"name": "Priya",
"tz": "Asia/Kolkata",
"available_local_min": "2026-03-26T19:30:00",
"available_local_max": "2026-03-26T21:30:00"
}
],
"_canonical_answer": {
"start_utc": "2026-03-26T14:00:00Z",
"alice_local": "2026-03-26 07:00",
"bob_local": "2026-03-26 14:00",
"priya_local": "2026-03-26 19:30"
},
"_notes": "DST trap: US DST'd on 2026-03-08 (UTC-7 PDT). UK DST starts 2026-03-29 — Bob is still on GMT (UTC+0) on this date. Priya is IST (UTC+5:30, no DST). The accepted UTC window is [14:00Z, 15:00Z] (start times that fit a 60-min slot inside everyone's availability)."
}
Scored by judge.py; see Scoring logic below for the full rule.
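The offsets described in the notes above can be checked with the standard-library zoneinfo module. The following is a minimal sketch, not part of the task files: the helper name `local_starts` is hypothetical, and it assumes an IANA tz database is available on the host.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


# Hypothetical helper (not in judge.py): convert one UTC start time into
# each attendee's local wall-clock time, formatted like local_start.
def local_starts(start_utc: datetime, zones: dict) -> dict:
    return {
        name: start_utc.astimezone(ZoneInfo(tz)).strftime("%Y-%m-%d %H:%M")
        for name, tz in zones.items()
    }


zones = {
    "Alice": "America/Los_Angeles",  # PDT (UTC-7) on 2026-03-26
    "Bob": "Europe/London",          # still GMT (UTC+0) until 2026-03-29
    "Priya": "Asia/Kolkata",         # IST (UTC+5:30), no DST
}
start = datetime(2026, 3, 26, 14, 0, tzinfo=timezone.utc)
result = local_starts(start, zones)
print(result)
```

Under these assumptions the canonical 14:00Z start maps to 07:00 for Alice, 14:00 for Bob, and 19:30 for Priya, matching `_canonical_answer`.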
▸ dst_quarter_hour_sydney: 60-min meeting across SF/London/Mumbai/Kathmandu/Sydney on 2026-03-26. Combines UK-still-GMT, the IST half-hour offset, the Nepal quarter-hour offset (+05:45), Sydney southern-hemisphere DST (AEDT), and a local-calendar day shift for Sydney. Five independent traps; only one valid start time exists.
input
question.txt
You are a scheduling assistant.
Schedule a 60-minute meeting TOMORROW (Thursday 2026-03-26) with the following attendees and their LOCAL availability windows.
Today is 2026-03-25 (Wednesday).
Attendees:
- Alice — San Francisco (America/Los_Angeles) — available 06:00–08:00 local
- Bob — London (Europe/London) — available 14:00–15:30 local
- Priya — Mumbai (Asia/Kolkata) — available 19:00–21:00 local
- Niraj — Kathmandu (Asia/Kathmandu) — available 19:30–21:30 local
- Sam — Sydney (Australia/Sydney) — available 00:30–02:30 local (early-morning slot)
Notes:
- Account for daylight-saving time on the actual date.
- Sam is on Australian Eastern time and may experience the meeting on a different LOCAL calendar day from everyone else.
- Pick any 60-minute slot that fits inside ALL FIVE local availability windows.
Return ONLY a JSON object (no commentary, no markdown fences) with this exact schema:
{
"start_utc": "<ISO 8601 timestamp in UTC, e.g. 2026-03-26T14:00:00Z>",
"duration_min": 60,
"attendees": [
{"name": "Alice", "tz": "America/Los_Angeles", "local_start": "YYYY-MM-DD HH:MM"},
{"name": "Bob", "tz": "Europe/London", "local_start": "YYYY-MM-DD HH:MM"},
{"name": "Priya", "tz": "Asia/Kolkata", "local_start": "YYYY-MM-DD HH:MM"},
{"name": "Niraj", "tz": "Asia/Kathmandu", "local_start": "YYYY-MM-DD HH:MM"},
{"name": "Sam", "tz": "Australia/Sydney", "local_start": "YYYY-MM-DD HH:MM"}
]
}
expected output
answer.json
{
"id": "dst_quarter_hour_sydney",
"category": "multi_zone_expert",
"difficulty": "expert",
"duration_min": 60,
"expected_start_utc_min": "2026-03-26T14:00:00Z",
"expected_start_utc_max": "2026-03-26T14:00:00Z",
"attendees": [
{
"name": "Alice",
"tz": "America/Los_Angeles",
"available_local_min": "2026-03-26T06:00:00",
"available_local_max": "2026-03-26T08:00:00"
},
{
"name": "Bob",
"tz": "Europe/London",
"available_local_min": "2026-03-26T14:00:00",
"available_local_max": "2026-03-26T15:30:00"
},
{
"name": "Priya",
"tz": "Asia/Kolkata",
"available_local_min": "2026-03-26T19:00:00",
"available_local_max": "2026-03-26T21:00:00"
},
{
"name": "Niraj",
"tz": "Asia/Kathmandu",
"available_local_min": "2026-03-26T19:30:00",
"available_local_max": "2026-03-26T21:30:00"
},
{
"name": "Sam",
"tz": "Australia/Sydney",
"available_local_min": "2026-03-27T00:30:00",
"available_local_max": "2026-03-27T02:30:00"
}
],
"_canonical_answer": {
"start_utc": "2026-03-26T14:00:00Z",
"alice_local": "2026-03-26 07:00",
"bob_local": "2026-03-26 14:00",
"priya_local": "2026-03-26 19:30",
"niraj_local": "2026-03-26 19:45",
"sam_local": "2026-03-27 01:00"
},
"_notes": "Five-way trap. Independent traps in one case: (1) UK still on GMT (BST starts 2026-03-29); (2) US already on PDT (DST'd 2026-03-08); (3) India IST is UTC+5:30 (half-hour); (4) Nepal NPT is UTC+5:45 (quarter-hour, very rare knowledge); (5) Sydney on AEDT UTC+11 (southern-hemisphere DST still active in March, ends first Sunday of April); (6) Sam's local calendar date is the day AFTER everyone else's (day-shift). The constraints intersect at exactly one start time: 14:00:00 UTC."
}
Scored by judge.py; see Scoring logic below for the full rule.
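The claim that the five windows intersect at exactly one start time can be reproduced by converting every window to UTC and intersecting them. This is an illustrative sketch only: `to_utc` and `start_range` are hypothetical names, and the availability values are copied from the gold answer.json above.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# (tz, local window start, local window end), copied from the gold file.
WINDOWS = [
    ("America/Los_Angeles", "2026-03-26T06:00:00", "2026-03-26T08:00:00"),
    ("Europe/London", "2026-03-26T14:00:00", "2026-03-26T15:30:00"),
    ("Asia/Kolkata", "2026-03-26T19:00:00", "2026-03-26T21:00:00"),
    ("Asia/Kathmandu", "2026-03-26T19:30:00", "2026-03-26T21:30:00"),
    ("Australia/Sydney", "2026-03-27T00:30:00", "2026-03-27T02:30:00"),
]


def to_utc(local_iso: str, tz: str) -> datetime:
    """Attach the IANA zone to a naive local timestamp, then convert to UTC."""
    local = datetime.fromisoformat(local_iso).replace(tzinfo=ZoneInfo(tz))
    return local.astimezone(timezone.utc)


def start_range(windows, duration_min: int):
    """Return (earliest, latest) valid UTC starts, or None if no slot fits."""
    duration = timedelta(minutes=duration_min)
    earliest = max(to_utc(lo, tz) for tz, lo, _ in windows)
    latest = min(to_utc(hi, tz) for tz, _, hi in windows) - duration
    return (earliest, latest) if earliest <= latest else None


earliest, latest = start_range(WINDOWS, 60)
print(earliest.isoformat(), latest.isoformat())
```

Bob's window sets the earliest start (14:00Z) and Alice's sets the latest end (15:00Z), so the earliest and latest valid starts coincide at 14:00Z.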
scoring logic
judge.py runs once per case and prints that case's score. grader.py runs once at the end and folds the case scores into a run-level summary. Without grader.py, the server averages the case scores and marks the run as passed when the average is 0.8 or higher.
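Because judge.py awards only 0.0 or 1.0 and this task has two cases, the run-level average can only be 0.0, 0.5, or 1.0, so the 0.8 threshold is cleared only when both cases pass. The sketch below illustrates the fallback rule as described; `fallback_verdict` is a hypothetical name rather than the server's actual code, and the behaviour for an empty run is an assumption.

```python
# Hypothetical stand-in for the server-side fallback: mean of the case
# scores, with the run passing when the mean reaches the threshold.
PASS_THRESHOLD = 0.8


def fallback_verdict(case_scores):
    if not case_scores:
        # Assumption: a run with no scored cases does not pass.
        return 0.0, False
    mean = sum(case_scores) / len(case_scores)
    return mean, mean >= PASS_THRESHOLD


both_pass = fallback_verdict([1.0, 1.0])
one_fails = fallback_verdict([1.0, 0.0])
print(both_pass, one_fails)
```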
▸ judge.py · 233 lines
"""Per-case judge for the cross_timezone scheduler task.

Reads the agent's stdout (must be a JSON object) and runs strict checks:
  1. stdout parses as JSON object
  2. start_utc is ISO 8601 with explicit UTC tz (Z or +00:00)
  3. start_utc lies inside expected_start_utc_min..expected_start_utc_max
  4. duration_min == expected duration
  5. For every gold attendee, the agent's reported local_start matches
     (start_utc converted to that attendee's IANA TZ via zoneinfo) ± 1 min
  6. For every gold attendee, the resulting local meeting (start + duration)
     fits inside their stated availability window

If any check fails → score 0.0. All checks pass → score 1.0. No partial credit.

Outputs JSON metrics to stdout; trap stores it as CaseResult.metrics.
"""
from __future__ import annotations

import json
import os
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def _parse_iso(s: str) -> datetime | None:
    """Parse ISO 8601 string. Accepts trailing 'Z' or '+00:00'. Returns None on failure."""
    if not isinstance(s, str):
        return None
    s2 = s.strip().replace("Z", "+00:00")
    try:
        return datetime.fromisoformat(s2)
    except ValueError:
        return None


def _parse_local(s: str) -> datetime | None:
    """Parse a local datetime in 'YYYY-MM-DD HH:MM' or ISO form. Naive (no tz)."""
    if not isinstance(s, str):
        return None
    s2 = s.strip().replace("T", " ")
    for fmt in ("%Y-%m-%d %H:%M", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(s2, fmt)
        except ValueError:
            continue
    return None


def _parse_agent_output(stdout: str) -> dict | tuple[None, str]:
    stdout = stdout.strip()
    # Strip common markdown code-fence wrappers (some models can't help themselves)
    if stdout.startswith("```"):
        lines = stdout.split("\n")
        if lines[0].startswith("```"):
            lines = lines[1:]
        if lines and lines[-1].startswith("```"):
            lines = lines[:-1]
        stdout = "\n".join(lines).strip()
    try:
        obj = json.loads(stdout)
    except json.JSONDecodeError as e:
        return None, f"stdout is not valid JSON: {e}"
    if not isinstance(obj, dict):
        return None, "top-level output must be a JSON object"
    return obj


def judge_case(agent_stdout: str, expected: dict) -> dict[str, Any]:
    """Run all checks. Returns metrics dict including per-check pass/reason."""
    checks: list[dict] = []
    score = 1.0

    def fail(name: str, reason: str) -> None:
        nonlocal score
        checks.append({"check": name, "pass": False, "reason": reason})
        score = 0.0

    def ok(name: str, reason: str = "ok") -> None:
        checks.append({"check": name, "pass": True, "reason": reason})

    # 1. JSON parse
    parsed = _parse_agent_output(agent_stdout)
    if isinstance(parsed, tuple):
        fail("json_parse", parsed[1])
        return {"score": 0.0, "matcher_results": checks}
    ans = parsed
    ok("json_parse")

    # 2. start_utc field present + parseable + has tzinfo
    start_utc_str = ans.get("start_utc")
    if not start_utc_str:
        fail("start_utc_present", "field missing")
        return {"score": 0.0, "matcher_results": checks}
    dt = _parse_iso(start_utc_str)
    if dt is None or dt.tzinfo is None:
        fail("start_utc_iso8601_utc", f"could not parse {start_utc_str!r} as ISO 8601 with explicit UTC offset")
        return {"score": 0.0, "matcher_results": checks}
    dt_utc = dt.astimezone(timezone.utc)
    ok("start_utc_iso8601_utc", f"parsed as {dt_utc.isoformat()}")

    # 3. start_utc in accepted window
    exp_min = _parse_iso(expected["expected_start_utc_min"])
    exp_max = _parse_iso(expected["expected_start_utc_max"])
    if exp_min is None or exp_max is None:
        fail("gold_window", "gold answer.json has malformed expected_start_utc_min/max")
        return {"score": 0.0, "matcher_results": checks}
    if not (exp_min <= dt_utc <= exp_max):
        fail(
            "start_utc_in_window",
            f"start_utc {dt_utc.isoformat()} is outside accepted [{exp_min.isoformat()}, {exp_max.isoformat()}]",
        )
    else:
        ok("start_utc_in_window")

    # 4. duration matches
    exp_dur = int(expected["duration_min"])
    got_dur = ans.get("duration_min")
    if got_dur != exp_dur:
        fail("duration_min", f"got {got_dur!r}, expected {exp_dur}")
    else:
        ok("duration_min")
    duration = timedelta(minutes=exp_dur)

    # 5 + 6. Per-attendee checks
    model_atts = ans.get("attendees") or []
    if not isinstance(model_atts, list):
        fail("attendees_list", "attendees must be a list")
        return {"score": score, "matcher_results": checks}
    model_by_name = {str(a.get("name", "")).strip().lower(): a for a in model_atts if isinstance(a, dict)}
    for gold_att in expected["attendees"]:
        name = gold_att["name"]
        tz_name = gold_att["tz"]
        try:
            tz = ZoneInfo(tz_name)
        except ZoneInfoNotFoundError:
            fail(f"attendee_{name}_gold_tz", f"gold TZ {tz_name!r} not in zoneinfo database")
            continue
        local_dt = dt_utc.astimezone(tz)

        # Availability window check (gold-side, authoritative)
        avail_min = _parse_local(gold_att["available_local_min"])
        avail_max = _parse_local(gold_att["available_local_max"])
        if avail_min is None or avail_max is None:
            fail(f"attendee_{name}_gold_window", "malformed gold availability")
            continue
        local_naive = local_dt.replace(tzinfo=None)
        latest_start = avail_max - duration
        if not (avail_min <= local_naive <= latest_start):
            fail(
                f"attendee_{name}_availability",
                f"start={local_naive.isoformat()} not in [{avail_min.isoformat()}, {latest_start.isoformat()}]",
            )
        else:
            ok(f"attendee_{name}_availability", f"local {local_naive.isoformat()} fits window")

        # Model's reported local_start matches our computed
        model_att = model_by_name.get(name.lower())
        if model_att is None:
            fail(f"attendee_{name}_in_output", "missing from agent output")
            continue
        reported = _parse_local(str(model_att.get("local_start", "")))
        if reported is None:
            fail(
                f"attendee_{name}_local_format",
                f"local_start {model_att.get('local_start')!r} not parseable as YYYY-MM-DD HH:MM",
            )
            continue
        diff_min = abs((local_naive - reported).total_seconds()) / 60.0
        if diff_min > 1.0:
            fail(
                f"attendee_{name}_local_match",
                f"reported {reported.isoformat()} vs computed {local_naive.isoformat()} (diff {diff_min:.1f} min)",
            )
        else:
            ok(f"attendee_{name}_local_match", f"reported {reported.isoformat()} ≈ computed (Δ {diff_min:.1f} min)")

    # Recompute score from checks (in case any later fail overrode the early return path)
    final_score = 0.0 if any(not c["pass"] for c in checks) else 1.0
    return {
        "score": final_score,
        "matcher_results": checks,
        "agent_start_utc": ans.get("start_utc"),
        "gold_canonical_utc": expected.get("_canonical_answer", {}).get("start_utc"),
        "id": expected.get("id"),
        "category": expected.get("category"),
        "difficulty": expected.get("difficulty"),
    }


def main() -> None:
    payload = json.loads(os.environ["TRAPTASK_PAYLOAD"])
    stdout = Path(payload["outputs"]["case_stdout"]).read_text()
    exit_code = json.loads(Path(payload["outputs"]["case_meta.json"]).read_text())["exit_code"]
    expected = json.loads(Path(payload["expected"]["answer.json"]).read_text())

    # Pick up usage.json if the solution captured it (token + cost tracking)
    usage_record: dict[str, Any] = {}
    usage_path = payload["outputs"].get("usage.json")
    if usage_path and Path(usage_path).exists():
        try:
            usage_record = json.loads(Path(usage_path).read_text())
        except json.JSONDecodeError:
            pass

    if exit_code != 0:
        out = {
            "score": 0.0,
            "reason": f"solution exited {exit_code}",
            "agent_answer": stdout.strip()[:500],
            "id": expected.get("id"),
            "category": expected.get("category"),
            "difficulty": expected.get("difficulty"),
            **usage_record,
        }
        print(json.dumps(out))
        return

    metrics = judge_case(stdout, expected)
    metrics["agent_answer"] = stdout.strip()[:500]
    metrics.update(usage_record)
    print(json.dumps(metrics))


if __name__ == "__main__":
    main()
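One detail of check 2 may be worth illustrating. `_parse_iso` only requires that the timestamp carries some UTC offset, and `judge_case` then normalises it with `astimezone(timezone.utc)`. As written, a start_utc expressed with a non-zero offset therefore appears to be accepted, provided it denotes an instant inside the gold window. The sketch below reproduces that parsing step in isolation; `normalise` is a hypothetical name, not a function in judge.py.

```python
from datetime import datetime, timezone
from typing import Optional


def normalise(s: str) -> Optional[datetime]:
    """Mirror the judge's parse step: accept 'Z', require an offset, convert to UTC."""
    try:
        dt = datetime.fromisoformat(s.strip().replace("Z", "+00:00"))
    except ValueError:
        return None
    if dt.tzinfo is None:
        return None  # naive timestamps fail check 2
    return dt.astimezone(timezone.utc)


a = normalise("2026-03-26T14:00:00Z")
b = normalise("2026-03-26T19:30:00+05:30")  # same instant, written with an IST offset
c = normalise("2026-03-26T14:00:00")        # no offset at all
print(a == b, c)
```

If only literal UTC should be accepted, one option would be to additionally require `dt.utcoffset() == timedelta(0)` before normalising; that is a possible tightening, not something the current judge does.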
▸ grader.py · 74 lines
"""Overall grader for the cross_timezone scheduler task.

Aggregates per-case judge results into a run-level verdict. Same shape as the
pdf_reader/tenancy_agreement grader: score, n_passed/scored, latency, cost, by_category.
"""
from __future__ import annotations

import json
import os
from collections import Counter

PASS_THRESHOLD = 0.80


def main() -> None:
    cases = json.loads(os.environ["TRAPTASK_PAYLOAD"])
    scored = [c for c in cases if c.get("metrics") and c["metrics"].get("score") is not None]
    skipped = [c for c in cases if not c.get("metrics") or c["metrics"].get("score") is None]
    accuracy = sum(c["metrics"]["score"] for c in scored) / len(scored) if scored else 0.0
    n_passed = sum(1 for c in scored if c["metrics"]["score"] == 1.0)

    # By-category breakdown
    by_cat_score: Counter[str] = Counter()
    by_cat_total: Counter[str] = Counter()
    for c in scored:
        cat = c["metrics"].get("category")
        if cat:
            by_cat_total[cat] += 1
            by_cat_score[cat] += c["metrics"]["score"]
    by_category_pct = {
        k: round(by_cat_score[k] / by_cat_total[k], 3) for k in by_cat_total
    }

    # Latency stats from trap-captured per-case duration
    durations = [c.get("duration", 0.0) for c in cases if c.get("duration") is not None]
    if durations:
        ds = sorted(durations)
        latency_ms_median = round(ds[len(ds) // 2] * 1000, 1)
        latency_ms_p95 = round(ds[int(0.95 * len(ds))] * 1000, 1) if len(ds) > 1 else latency_ms_median
        latency_ms_total = round(sum(ds) * 1000, 1)
    else:
        latency_ms_median = latency_ms_p95 = latency_ms_total = 0.0

    # Cost from per-case usd_cost if captured
    case_costs = [c["metrics"].get("usd_cost") for c in scored if isinstance(c.get("metrics"), dict)]
    cost_usd_total = (
        round(sum(x for x in case_costs if x is not None), 4)
        if any(x is not None for x in case_costs)
        else None
    )

    passed = bool(scored) and accuracy >= PASS_THRESHOLD
    print(json.dumps({
        "passed": passed,
        "score": round(accuracy, 3),
        "n_passed": n_passed,
        "n_total": len(cases),
        "n_scored": len(scored),
        "n_skipped_no_gold": len(skipped),
        "threshold": PASS_THRESHOLD,
        "by_category": by_category_pct,
        "latency_ms_median": latency_ms_median,
        "latency_ms_p95": latency_ms_p95,
        "latency_ms_total": latency_ms_total,
        "cost_usd_total": cost_usd_total,
    }))


if __name__ == "__main__":
    main()
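With only two cases, the latency statistics in grader.py collapse to a single value: `ds[len(ds) // 2]` selects the upper of the two durations, and `int(0.95 * 2)` is 1, so the median and p95 both report the slower case. The sketch below isolates that indexing; `latency_stats` is a hypothetical name and the durations are made-up values in seconds.

```python
def latency_stats(durations):
    """Same indexing as grader.py: upper median and index-based p95, in ms."""
    ds = sorted(durations)
    median = round(ds[len(ds) // 2] * 1000, 1)
    p95 = round(ds[int(0.95 * len(ds))] * 1000, 1) if len(ds) > 1 else median
    return median, p95


# Illustrative durations only, not measured values.
stats = latency_stats([2.5, 4.0])
print(stats)
```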
