NF Research Tools Discovery Evaluation

Generated 2026-03-20 14:53 UTC — Structured SPARQL queries against the Synapse portal knowledge graph

Task | Model | Samples | Recall | Baseline | Advanced | 0-hop | 1-hop | 2-hop | Mutation | Animal Model | Cell Line | Antibody | Genetic Reagent | Investigator | Cross-Resource | Total Cost (USD) | Input Tokens | Output Tokens | Cache Write | Cache Read | Total Tokens | Date | Total Time | Avg Time / Sample | Shortest Sample | Longest Sample | Task Harness Version | Task Dataset Version
astabench/nf_rag | openai/gpt-5.2 | 34 | 0.3144 | 0.3445 | 0.2843 | 0.3810 | 0.3231 | 0.1389 | 0.3095 | 0.0833 | 0.3333 | 0.6667 | 0.6000 | 0.0000 | 0.1111 | $3.05 | 1,472,785 | 21,073 | 0 | 1,034,752 | 1,493,858 | 2026-02-19 | 0m 52s | 0m 11s | 0m 03s | 0m 34s | 12084f1 | v0
astabench/nf_rag | openai/gpt-5.2 | 34 | 0.3782 | 0.4353 | 0.3211 | 0.5286 | 0.3304 | 0.1389 | 0.5000 | 0.1667 | 0.3963 | 0.6111 | 0.6250 | 0.0000 | 0.1111 | $2.98 | 1,403,952 | 24,447 | 0 | 1,010,432 | 1,428,399 | 2026-02-19 | 1m 02s | 0m 13s | 0m 03s | 0m 44s | 12084f1 | v0
astabench/nf_rag | openai/gpt-5.2 | 34 | 0.4063 | 0.5087 | 0.3039 | 0.5463 | 0.4048 | 0.0833 | 0.6667 | 0.1667 | 0.3128 | 0.6667 | 0.8000 | 0.0000 | 0.0000 | $3.24 | 1,564,047 | 24,049 | 0 | 934,528 | 1,588,096 | 2026-02-19 | 1m 14s | 0m 14s | 0m 03s | 0m 57s | 12084f1 | v0
astabench/nf_rag | openai/gpt-5.2 | 34 | 0.3874 | 0.4610 | 0.3137 | 0.4339 | 0.4117 | 0.2222 | 0.6667 | 0.3333 | 0.3153 | 0.3333 | 0.6000 | 0.0000 | 0.1111 | $4.44 | 2,207,872 | 27,603 | 0 | 1,096,448 | 2,235,475 | 2026-02-19 | 1m 11s | 0m 15s | 0m 04s | 1m 01s | 12084f1 | v0
astabench/nf_rag | anthropic/claude-sonnet-4-5 | 34 | 0.7707 | 0.7974 | 0.7439 | 0.8254 | 0.6905 | 0.8301 | 1.0000 | 0.5000 | 0.7262 | 1.0000 | 0.8000 | 1.0000 | 0.5556 | $7.53 | 1,969 | 100,799 | 933,229 | 8,385,411 | 9,421,408 | 2026-02-19 | 4m 50s | 1m 03s | 0m 11s | 2m 24s | 12084f1 | v0
astabench/nf_rag | anthropic/claude-haiku-4-5 | 34 | 0.6059 | 0.7451 | 0.4667 | 0.7024 | 0.5905 | 0.4167 | 0.8333 | 0.5167 | 0.5000 | 1.0000 | 0.6000 | 1.0000 | 0.0000 | $2.70 | 40,352 | 102,304 | 1,028,516 | 8,625,993 | 9,797,165 | 2026-02-19 | 2m 24s | 0m 33s | 0m 09s | 1m 07s | 12084f1 | v0
astabench/nf_rag | anthropic/claude-haiku-4-5 | 32 | 0.5872 | 0.6622 | 0.5022 | 0.6833 | 0.4909 | 0.5556 | 0.8095 | 0.5400 | 0.4519 | 0.9444 | 0.6000 | 0.5000 | 0.1667 | $2.77 | 38,053 | 109,522 | 999,979 | 9,348,338 | 10,495,892 | 2026-02-19 | 2m 34s | 0m 35s | 0m 13s | 1m 24s | 12084f1 | v0
astabench/nf_rag | anthropic/claude-sonnet-4-5 | 32 | 0.7984 | 0.7704 | 0.8301 | 0.8095 | 0.7859 | 0.7974 | 0.8333 | 0.8000 | 0.6905 | 1.0000 | 0.8000 | 1.0000 | 0.6667 | $7.40 | 2,172 | 111,339 | 855,314 | 8,379,277 | 9,348,102 | 2026-02-19 | 4m 59s | 1m 14s | 0m 30s | 2m 20s | 12084f1 | v0

Runs with different commit hashes may reflect prompt or task changes that affect performance.
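
Several models appear in more than one run, so a per-model summary can make the table easier to compare. The following is a minimal sketch, not part of the report's own tooling: it copies the model, sample count, and overall Recall from the rows above and takes an unweighted mean per model. The function name `mean_recall_by_model` and the choice of an unweighted mean (ignoring the 34 vs 32 sample counts) are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# (model, samples, overall recall), copied from the runs table above.
RUNS = [
    ("openai/gpt-5.2", 34, 0.3144),
    ("openai/gpt-5.2", 34, 0.3782),
    ("openai/gpt-5.2", 34, 0.4063),
    ("openai/gpt-5.2", 34, 0.3874),
    ("anthropic/claude-sonnet-4-5", 34, 0.7707),
    ("anthropic/claude-haiku-4-5", 34, 0.6059),
    ("anthropic/claude-haiku-4-5", 32, 0.5872),
    ("anthropic/claude-sonnet-4-5", 32, 0.7984),
]


def mean_recall_by_model(runs):
    """Return the unweighted mean of overall recall for each model.

    Hypothetical helper: sample counts are carried along but not used,
    so every run contributes equally to its model's mean.
    """
    grouped = defaultdict(list)
    for model, _samples, recall in runs:
        grouped[model].append(recall)
    return {model: mean(values) for model, values in grouped.items()}


if __name__ == "__main__":
    for model, value in sorted(mean_recall_by_model(RUNS).items()):
        print(f"{model}: {value:.4f}")
```

A sample-weighted mean would be a reasonable alternative if the 32-sample and 34-sample runs should not count equally.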

[Charts: Cost vs Recall; Total Time vs Recall; Recall by Level; Recall by Complexity; Recall by Category; Recall Degradation by User Frustration]

What is user frustration?

User frustration estimates how difficult a query is to answer with the current portal’s faceted search and text search.

  • Low – Answerable with minimal effort via facets or text search
  • Moderate – Requires knowing the right approach, extra steps, or domain knowledge
  • High – Incomplete/misleading results, painful workarounds, or only one weak path
  • Very High – Cannot be answered at all, or requires expert-level workarounds that most users would never find

High-Impact Questions

Questions rated High or Very High for user frustration whose best recall across models is at least 0.95 (95%). These are queries the current portal struggles with but the KG pipeline handles well.

Question | User Frustration | Complexity | Best Recall | Best Model
AB-003 | High | 1-hop | 1.00 | anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5, openai/gpt-5.2
AM-005 | Very High | 2-hop | 1.00 | anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5, openai/gpt-5.2
AM-006 | Very High | 1-hop | 1.00 | anthropic/claude-sonnet-4-5
CL-005 | High | 1-hop | 1.00 | anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5
CL-007 | Very High | 0-hop | 1.00 | anthropic/claude-sonnet-4-5, openai/gpt-5.2
CL-009 | Very High | 1-hop | 1.00 | anthropic/claude-haiku-4-5
CR-003 | Very High | 2-hop | 1.00 | anthropic/claude-sonnet-4-5
GR-005 | Very High | 1-hop | 1.00 | anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5, openai/gpt-5.2
MUT-003 | Very High | 1-hop | 1.00 | anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5, openai/gpt-5.2
MUT-004 | Very High | 1-hop | 1.00 | anthropic/claude-sonnet-4-5
MUT-005 | Very High | 1-hop | 1.00 | anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5, openai/gpt-5.2
MUT-006 | Very High | 2-hop | 1.00 | anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5
PI-002 | Very High | 2-hop | 1.00 | anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5
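
The selection rule for this list (High or Very High frustration, best recall of at least 95%) can be sketched as a small filter. This is an illustration only: the record layout (`id`, `frustration`, `complexity`, `recall_by_model`), the function name `high_impact`, and the `EX-...` sample questions with their recall values are all hypothetical and are not taken from runs.json, whose schema is not shown here.

```python
# Frustration levels that qualify a question as "high impact".
HIGH_FRUSTRATION = {"High", "Very High"}


def high_impact(questions, threshold=0.95):
    """Return (id, frustration, complexity, best recall, best models) rows.

    A question qualifies when its frustration level is High or Very High
    and the best recall achieved by any model is at least `threshold`.
    """
    rows = []
    for q in questions:
        if q["frustration"] not in HIGH_FRUSTRATION:
            continue
        recalls = q["recall_by_model"]
        if not recalls:
            continue  # no model results recorded for this question
        best = max(recalls.values())
        if best < threshold:
            continue
        # Every model that ties for the best recall is reported.
        best_models = sorted(m for m, r in recalls.items() if r == best)
        rows.append((q["id"], q["frustration"], q["complexity"], best, best_models))
    return rows


# Hypothetical records; IDs and recall values are invented for illustration.
SAMPLE = [
    {"id": "EX-001", "frustration": "Very High", "complexity": "1-hop",
     "recall_by_model": {"model-a": 1.0, "model-b": 0.5}},
    {"id": "EX-002", "frustration": "Low", "complexity": "0-hop",
     "recall_by_model": {"model-a": 1.0, "model-b": 1.0}},
    {"id": "EX-003", "frustration": "High", "complexity": "2-hop",
     "recall_by_model": {"model-a": 0.8, "model-b": 0.9}},
    {"id": "EX-004", "frustration": "High", "complexity": "2-hop",
     "recall_by_model": {"model-a": 1.0, "model-b": 1.0}},
]

if __name__ == "__main__":
    for row in high_impact(SAMPLE):
        print(row)
```

In this sketch EX-002 is dropped for its Low frustration level and EX-003 because no model reaches the threshold, which mirrors why only a subset of questions appears in the list above.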

Raw data: runs.json