Generated 2026-03-20 14:53 UTC — 130 questions across 14 papers
Accuracy measures whether the agent selected the correct answer from the multiple-choice options. Citation F1 measures how well the agent cited the correct supporting passages, each identified by a (PMID, passage index) tuple. Precision is the fraction of cited passages that are relevant, recall is the fraction of expected passages that were cited, and F1 is the harmonic mean of the two.
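
The Citation F1 definition above can be sketched in a few lines. This is a minimal illustration and not the benchmark's actual scoring code: the function name `citation_f1`, the PMIDs, and the handling of empty sets (scored as 0) are assumptions made for the example.

```python
from typing import Iterable, Tuple

Citation = Tuple[str, int]  # (PMID, passage index)


def citation_f1(cited: Iterable[Citation], expected: Iterable[Citation]):
    """Return (precision, recall, f1) for one question.

    Assumption: duplicates are ignored (set semantics) and an empty
    cited or expected set scores 0 for the corresponding metric.
    """
    cited_set, expected_set = set(cited), set(expected)
    hits = len(cited_set & expected_set)
    precision = hits / len(cited_set) if cited_set else 0.0
    recall = hits / len(expected_set) if expected_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical PMIDs, for illustration only.
cited = [("11111111", 0), ("11111111", 2), ("22222222", 1)]
expected = [("11111111", 0), ("11111111", 1), ("11111111", 2), ("33333333", 4)]
precision, recall, f1 = citation_f1(cited, expected)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Here two of the three cited passages are relevant and two of the four expected passages were cited, so precision is 2/3 and recall is 1/2. How the per-question scores are aggregated into the reported Citation F1 is not stated in this report.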
Note: Charts below show results for the user_query question style only. Accuracy is similar for the precise and user_query styles (within 0.01 for each model), while Citation F1 is 0.02 to 0.05 lower for user_query. The user_query style better reflects realistic user interactions, with natural phrasing and ambiguity.
| Model | Question Style | Samples | Accuracy | Citation F1 | Total Cost (USD) | Input Tokens | Output Tokens | Cache Write Tokens | Cache Read Tokens | Total Tokens | Total Time | Avg Time / Sample | Shortest Sample | Longest Sample | Run Date | Task Dataset Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| anthropic/claude-sonnet-4-5 | precise | 130 | 0.9923 ± 0.0077 | 0.7717 ± 0.0208 | $18.63 | 4,423 | 141,022 | 3,279,840 | 13,990,883 | 17,416,168 | 7m 23s | 0m 30s | 0m 08s | 1m 23s | 2026-03-17 | v0 |
| anthropic/claude-haiku-4-5 | precise | 130 | 0.9615 ± 0.0169 | 0.6888 ± 0.0288 | $4.69 | 428,108 | 125,762 | 2,378,880 | 6,615,404 | 9,548,154 | 3m 06s | 0m 12s | 0m 03s | 1m 15s | 2026-03-17 | v0 |
| anthropic/claude-haiku-4-5 | user_query | 130 | 0.9692 ± 0.0152 | 0.6383 ± 0.0320 | $4.93 | 373,712 | 133,684 | 2,546,536 | 7,064,397 | 10,118,329 | 2m 54s | 0m 12s | 0m 03s | 1m 31s | 2026-03-17 | v0 |
| anthropic/claude-sonnet-4-5 | user_query | 130 | 0.9846 ± 0.0108 | 0.7403 ± 0.0232 | $19.33 | 4,505 | 142,830 | 3,384,474 | 14,939,587 | 18,471,396 | 7m 35s | 0m 32s | 0m 08s | 1m 43s | 2026-03-17 | v0 |
| openai/gpt-5.4 | precise | 130 | 0.9846 ± 0.0108 | 0.7303 ± 0.0255 | $4.34 | 1,360,860 | 56,360 | 0 | 354,688 | 1,417,220 | 2m 00s | 0m 07s | 0m 02s | 0m 37s | 2026-03-17 | v0 |
| openai/gpt-5.4 | user_query | 130 | 0.9923 ± 0.0077 | 0.7087 ± 0.0262 | $5.00 | 1,600,970 | 61,405 | 0 | 325,760 | 1,662,375 | 1m 49s | 0m 08s | 0m 03s | 0m 30s | 2026-03-17 | v0 |
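
The report does not say what the ± values in the table represent. For the Accuracy column they are consistent with the standard error of the mean over 130 per-question 0/1 scores, using the sample standard deviation. The sketch below checks that reading. The function name and the correct-answer counts (129 and 125 of 130) are assumptions inferred from the rounded accuracies, not values taken from the raw data.

```python
import math


def accuracy_with_stderr(correct: int, total: int):
    """Mean of 0/1 scores and its standard error (Bessel-corrected).

    For binary outcomes the sample variance is p * (1 - p) * n / (n - 1),
    so the standard error simplifies to sqrt(p * (1 - p) / (n - 1)).
    """
    p = correct / total
    stderr = math.sqrt(p * (1 - p) / (total - 1))
    return p, stderr


# Assumed counts: 129/130 and 125/130 correct answers.
for correct in (129, 125):
    acc, se = accuracy_with_stderr(correct, 130)
    print(f"{correct}/130 -> {acc:.4f} ± {se:.4f}")
```

Under these assumptions the sketch gives 0.9923 ± 0.0077 and 0.9615 ± 0.0169, matching the corresponding Accuracy entries in the table. Whether the Citation F1 intervals are computed the same way cannot be verified from the table alone.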
Raw data: pubs_runs.json