Generated 2026-03-20 14:53 UTC — Structured SPARQL queries against the Synapse portal knowledge graph
| Task | Model | Samples | Recall | Baseline | Advanced | 0-hop | 1-hop | 2-hop | Mutation | Animal Model | Cell Line | Antibody | Genetic Reagent | Investigator | Cross-Resource | Total Cost (USD) | Input Tokens | Output Tokens | Cache Write | Cache Read | Total Tokens | Date | Total Time | Avg Time / Sample | Shortest Sample | Longest Sample | Task Harness Version | Task Dataset Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| astabench/nf_rag | openai/gpt-5.2 | 34 | 0.3144 | 0.3445 | 0.2843 | 0.3810 | 0.3231 | 0.1389 | 0.3095 | 0.0833 | 0.3333 | 0.6667 | 0.6000 | 0.0000 | 0.1111 | $3.05 | 1,472,785 | 21,073 | 0 | 1,034,752 | 1,493,858 | 2026-02-19 | 0m 52s | 0m 11s | 0m 03s | 0m 34s | 12084f1 | v0 |
| astabench/nf_rag | openai/gpt-5.2 | 34 | 0.3782 | 0.4353 | 0.3211 | 0.5286 | 0.3304 | 0.1389 | 0.5000 | 0.1667 | 0.3963 | 0.6111 | 0.6250 | 0.0000 | 0.1111 | $2.98 | 1,403,952 | 24,447 | 0 | 1,010,432 | 1,428,399 | 2026-02-19 | 1m 02s | 0m 13s | 0m 03s | 0m 44s | 12084f1 | v0 |
| astabench/nf_rag | openai/gpt-5.2 | 34 | 0.4063 | 0.5087 | 0.3039 | 0.5463 | 0.4048 | 0.0833 | 0.6667 | 0.1667 | 0.3128 | 0.6667 | 0.8000 | 0.0000 | 0.0000 | $3.24 | 1,564,047 | 24,049 | 0 | 934,528 | 1,588,096 | 2026-02-19 | 1m 14s | 0m 14s | 0m 03s | 0m 57s | 12084f1 | v0 |
| astabench/nf_rag | openai/gpt-5.2 | 34 | 0.3874 | 0.4610 | 0.3137 | 0.4339 | 0.4117 | 0.2222 | 0.6667 | 0.3333 | 0.3153 | 0.3333 | 0.6000 | 0.0000 | 0.1111 | $4.44 | 2,207,872 | 27,603 | 0 | 1,096,448 | 2,235,475 | 2026-02-19 | 1m 11s | 0m 15s | 0m 04s | 1m 01s | 12084f1 | v0 |
| astabench/nf_rag | anthropic/claude-sonnet-4-5 | 34 | 0.7707 | 0.7974 | 0.7439 | 0.8254 | 0.6905 | 0.8301 | 1.0000 | 0.5000 | 0.7262 | 1.0000 | 0.8000 | 1.0000 | 0.5556 | $7.53 | 1,969 | 100,799 | 933,229 | 8,385,411 | 9,421,408 | 2026-02-19 | 4m 50s | 1m 03s | 0m 11s | 2m 24s | 12084f1 | v0 |
| astabench/nf_rag | anthropic/claude-haiku-4-5 | 34 | 0.6059 | 0.7451 | 0.4667 | 0.7024 | 0.5905 | 0.4167 | 0.8333 | 0.5167 | 0.5000 | 1.0000 | 0.6000 | 1.0000 | 0.0000 | $2.70 | 40,352 | 102,304 | 1,028,516 | 8,625,993 | 9,797,165 | 2026-02-19 | 2m 24s | 0m 33s | 0m 09s | 1m 07s | 12084f1 | v0 |
| astabench/nf_rag | anthropic/claude-haiku-4-5 | 32 | 0.5872 | 0.6622 | 0.5022 | 0.6833 | 0.4909 | 0.5556 | 0.8095 | 0.5400 | 0.4519 | 0.9444 | 0.6000 | 0.5000 | 0.1667 | $2.77 | 38,053 | 109,522 | 999,979 | 9,348,338 | 10,495,892 | 2026-02-19 | 2m 34s | 0m 35s | 0m 13s | 1m 24s | 12084f1 | v0 |
| astabench/nf_rag | anthropic/claude-sonnet-4-5 | 32 | 0.7984 | 0.7704 | 0.8301 | 0.8095 | 0.7859 | 0.7974 | 0.8333 | 0.8000 | 0.6905 | 1.0000 | 0.8000 | 1.0000 | 0.6667 | $7.40 | 2,172 | 111,339 | 855,314 | 8,379,277 | 9,348,102 | 2026-02-19 | 4m 59s | 1m 14s | 0m 30s | 2m 20s | 12084f1 | v0 |
Runs with different commit hashes may reflect prompt or task changes that affect performance.
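Because the table lists several runs per model, one way to compare models fairly is to group runs by model and harness version before averaging. The following is a minimal sketch using the Recall values from the table above. The tuple layout and the helper name `mean_recall_by_group` are illustrative assumptions and do not describe the structure of runs.json. The mean is unweighted, so it ignores that some runs have 32 samples rather than 34.

```python
from collections import defaultdict
from statistics import mean

# (model, harness version, samples, recall), copied from the results table.
RUNS = [
    ("openai/gpt-5.2", "12084f1", 34, 0.3144),
    ("openai/gpt-5.2", "12084f1", 34, 0.3782),
    ("openai/gpt-5.2", "12084f1", 34, 0.4063),
    ("openai/gpt-5.2", "12084f1", 34, 0.3874),
    ("anthropic/claude-sonnet-4-5", "12084f1", 34, 0.7707),
    ("anthropic/claude-haiku-4-5", "12084f1", 34, 0.6059),
    ("anthropic/claude-haiku-4-5", "12084f1", 32, 0.5872),
    ("anthropic/claude-sonnet-4-5", "12084f1", 32, 0.7984),
]


def mean_recall_by_group(runs):
    """Average recall per (model, harness version).

    Keying on the harness version keeps runs from different commits
    in separate groups, so they are never averaged together.
    """
    grouped = defaultdict(list)
    for model, version, _samples, recall in runs:
        grouped[(model, version)].append(recall)
    return {key: mean(values) for key, values in grouped.items()}


summary = mean_recall_by_group(RUNS)
for (model, version), recall in sorted(summary.items()):
    print(f"{model} @ {version}: mean recall {recall:.4f}")
```

All runs shown here share one harness version, so the grouping yields one entry per model. If a later run used a different commit, it would appear as its own entry.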
User frustration estimates how difficult a query is to answer with the current portal’s faceted search and text search.
The following table lists questions with high or very high user frustration for which the best recall across models is at least 95%. These are queries the current portal struggles with but the KG pipeline handles well.
| Question | User Frustration | Complexity | Best Recall | Best Model(s) |
|---|---|---|---|---|
| AB-003 | High | 1-hop | 1.00 | anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5, openai/gpt-5.2 |
| AM-005 | Very High | 2-hop | 1.00 | anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5, openai/gpt-5.2 |
| AM-006 | Very High | 1-hop | 1.00 | anthropic/claude-sonnet-4-5 |
| CL-005 | High | 1-hop | 1.00 | anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5 |
| CL-007 | Very High | 0-hop | 1.00 | anthropic/claude-sonnet-4-5, openai/gpt-5.2 |
| CL-009 | Very High | 1-hop | 1.00 | anthropic/claude-haiku-4-5 |
| CR-003 | Very High | 2-hop | 1.00 | anthropic/claude-sonnet-4-5 |
| GR-005 | Very High | 1-hop | 1.00 | anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5, openai/gpt-5.2 |
| MUT-003 | Very High | 1-hop | 1.00 | anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5, openai/gpt-5.2 |
| MUT-004 | Very High | 1-hop | 1.00 | anthropic/claude-sonnet-4-5 |
| MUT-005 | Very High | 1-hop | 1.00 | anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5, openai/gpt-5.2 |
| MUT-006 | Very High | 2-hop | 1.00 | anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5 |
| PI-002 | Very High | 2-hop | 1.00 | anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5 |
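The selection rule for the table above (high or very high user frustration, best recall of at least 95%) can be sketched as a simple filter. This is an illustration under stated assumptions, not the report generator's actual code. The `QuestionResult` class, its field names, and the `highlight` helper are hypothetical. The `EX-001` and `EX-002` rows are invented solely to show records that the filter rejects.

```python
from dataclasses import dataclass

# Frustration levels that qualify a question for the highlight table.
HIGH_FRUSTRATION = {"High", "Very High"}


@dataclass(frozen=True)
class QuestionResult:
    """Hypothetical per-question record; field names are assumptions."""
    question_id: str
    frustration: str
    best_recall: float  # best recall across all models, in [0, 1]


def highlight(results, min_recall=0.95):
    """Keep questions the portal handles poorly but the KG pipeline answers well."""
    return [
        r for r in results
        if r.frustration in HIGH_FRUSTRATION and r.best_recall >= min_recall
    ]


results = [
    QuestionResult("AB-003", "High", 1.00),       # from the table above
    QuestionResult("AM-005", "Very High", 1.00),  # from the table above
    QuestionResult("EX-001", "Low", 1.00),        # invented: frustration too low
    QuestionResult("EX-002", "Very High", 0.50),  # invented: recall too low
]

selected_ids = [r.question_id for r in highlight(results)]
print(selected_ids)
```

Only the two rows taken from the table pass both conditions. The invented rows each fail one of them.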
Raw data: runs.json