NF Research Tools Discovery Evaluation

Generated 2026-03-20 14:53 UTC — Structured SPARQL queries against the Synapse portal knowledge graph

Task	Model	Samples	Recall	Baseline	Advanced	0-hop	1-hop	2-hop	Mutation	Animal Model	Cell Line	Antibody	Genetic Reagent	Investigator	Cross-Resource	Total Cost (USD)	Input Tokens	Output Tokens	Cache Write	Cache Read	Total Tokens	Date	Total Time	Avg Time / Sample	Shortest Sample	Longest Sample	Task Harness Version	Task Dataset Version
astabench/nf_rag	openai/gpt-5.2		0.3144	0.3445	0.2843	0.3810	0.3231	0.1389	0.3095	0.0833	0.3333	0.6667	0.6000	0.0000	0.1111	$3.05	1,472,785	21,073		1,034,752	1,493,858	2026-02-19	0m 52s	0m 11s	0m 03s	0m 34s	12084f1	
astabench/nf_rag	openai/gpt-5.2		0.3782	0.4353	0.3211	0.5286	0.3304	0.1389	0.5000	0.1667	0.3963	0.6111	0.6250	0.0000	0.1111	$2.98	1,403,952	24,447		1,010,432	1,428,399	2026-02-19	1m 02s	0m 13s	0m 03s	0m 44s	12084f1	
astabench/nf_rag	openai/gpt-5.2		0.4063	0.5087	0.3039	0.5463	0.4048	0.0833	0.6667	0.1667	0.3128	0.6667	0.8000	0.0000		$3.24	1,564,047	24,049		934,528	1,588,096	2026-02-19	1m 14s	0m 14s	0m 03s	0m 57s	12084f1	
astabench/nf_rag	openai/gpt-5.2		0.3874	0.4610	0.3137	0.4339	0.4117	0.2222	0.6667	0.3333	0.3153	0.3333	0.6000	0.0000	0.1111	$4.44	2,207,872	27,603		1,096,448	2,235,475	2026-02-19	1m 11s	0m 15s	0m 04s	1m 01s	12084f1	
astabench/nf_rag	anthropic/claude-sonnet-4-5		0.7707	0.7974	0.7439	0.8254	0.6905	0.8301	1.0000	0.5000	0.7262	1.0000	0.8000	1.0000	0.5556	$7.53	1,969	100,799	933,229	8,385,411	9,421,408	2026-02-19	4m 50s	1m 03s	0m 11s	2m 24s	12084f1	
astabench/nf_rag	anthropic/claude-haiku-4-5		0.6059	0.7451	0.4667	0.7024	0.5905	0.4167	0.8333	0.5167	0.5000	1.0000	0.6000	1.0000	0.0000	$2.70	40,352	102,304	1,028,516	8,625,993	9,797,165	2026-02-19	2m 24s	0m 33s	0m 09s	1m 07s	12084f1	
astabench/nf_rag	anthropic/claude-haiku-4-5		0.5872	0.6622	0.5022	0.6833	0.4909	0.5556	0.8095	0.5400	0.4519	0.9444	0.6000	0.5000	0.1667	$2.77	38,053	109,522	999,979	9,348,338	10,495,892	2026-02-19	2m 34s	0m 35s	0m 13s	1m 24s	12084f1	
astabench/nf_rag	anthropic/claude-sonnet-4-5		0.7984	0.7704	0.8301	0.8095	0.7859	0.7974	0.8333	0.8000	0.6905	1.0000	0.8000	1.0000	0.6667	$7.40	2,172	111,339	855,314	8,379,277	9,348,102	2026-02-19	4m 59s	1m 14s	0m 30s	2m 20s	12084f1	

Runs with different commit hashes may reflect prompt or task changes that affect performance.
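As a rough illustration of the like-for-like comparison this note implies, the sketch below groups runs by model and harness commit before averaging, so runs from different commits are never pooled. The recall and cost values are copied from the results table; the `RUNS` layout and the `summarize` helper are assumptions made for this example and are not part of the evaluation harness.

```python
from collections import defaultdict
from statistics import mean

# (model, harness commit, overall recall, total cost in USD), copied from the results table.
RUNS = [
    ("openai/gpt-5.2", "12084f1", 0.3144, 3.05),
    ("openai/gpt-5.2", "12084f1", 0.3782, 2.98),
    ("openai/gpt-5.2", "12084f1", 0.4063, 3.24),
    ("openai/gpt-5.2", "12084f1", 0.3874, 4.44),
    ("anthropic/claude-sonnet-4-5", "12084f1", 0.7707, 7.53),
    ("anthropic/claude-haiku-4-5", "12084f1", 0.6059, 2.70),
    ("anthropic/claude-haiku-4-5", "12084f1", 0.5872, 2.77),
    ("anthropic/claude-sonnet-4-5", "12084f1", 0.7984, 7.40),
]


def summarize(runs):
    """Average recall and cost per (model, harness commit) group.

    Hypothetical helper: keying on the commit keeps runs made with
    different prompts or task definitions in separate groups.
    """
    groups = defaultdict(list)
    for model, commit, recall, cost in runs:
        groups[(model, commit)].append((recall, cost))

    summary = {}
    for key, rows in groups.items():
        summary[key] = {
            "runs": len(rows),
            "mean_recall": mean(recall for recall, _ in rows),
            "mean_cost_usd": mean(cost for _, cost in rows),
        }
    return summary


if __name__ == "__main__":
    for (model, commit), stats in sorted(summarize(RUNS).items()):
        print(
            f"{model} @ {commit}: n={stats['runs']} "
            f"recall={stats['mean_recall']:.4f} cost=${stats['mean_cost_usd']:.2f}"
        )
```

All eight runs in the table share the same harness commit, so here the grouping reduces to one group per model; the commit key only matters once runs from another commit are added.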

Charts: Cost vs Recall; Total Time vs Recall; Recall by Level; Recall by Complexity; Recall by Category; Recall Degradation by User Frustration

What is user frustration?

User frustration estimates how difficult a query is to answer with the current portal’s faceted search and text search.

Low – Answerable with minimal effort via facets or text search
Moderate – Requires knowing the right approach, extra steps, or domain knowledge
High – Incomplete/misleading results, painful workarounds, or only one weak path
Very High – Cannot be answered at all, or requires expert-level workarounds that most users would never find

High-Impact Questions

Questions with high or very high user frustration where the best recall across models is at least 95%. These are queries that the current portal struggles with but the knowledge graph (KG) pipeline handles well.

Question	User Frustration	Complexity	Best Recall	Best Model
AB-003	High	1-hop	1.00	anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5, openai/gpt-5.2
AM-005	Very High	2-hop	1.00	anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5, openai/gpt-5.2
AM-006	Very High	1-hop	1.00	anthropic/claude-sonnet-4-5
CL-005	High	1-hop	1.00	anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5
CL-007	Very High	0-hop	1.00	anthropic/claude-sonnet-4-5, openai/gpt-5.2
CL-009	Very High	1-hop	1.00	anthropic/claude-haiku-4-5
CR-003	Very High	2-hop	1.00	anthropic/claude-sonnet-4-5
GR-005	Very High	1-hop	1.00	anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5, openai/gpt-5.2
MUT-003	Very High	1-hop	1.00	anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5, openai/gpt-5.2
MUT-004	Very High	1-hop	1.00	anthropic/claude-sonnet-4-5
MUT-005	Very High	1-hop	1.00	anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5, openai/gpt-5.2
MUT-006	Very High	2-hop	1.00	anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5
PI-002	Very High	2-hop	1.00	anthropic/claude-haiku-4-5, anthropic/claude-sonnet-4-5
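The selection rule behind this table (frustration of High or above, best recall of at least 95%, ties listed together) can be sketched as follows. This is a minimal illustration, not the report's actual code: the `QuestionResult` structure, the `high_impact` helper, and the example question IDs, model names, and per-model recall values are all hypothetical.

```python
from dataclasses import dataclass

# Ordinal scale taken from the user frustration definitions, lowest to highest.
FRUSTRATION_ORDER = ["Low", "Moderate", "High", "Very High"]


@dataclass(frozen=True)
class QuestionResult:
    """Hypothetical per-question record; field names are assumptions."""
    question: str
    frustration: str
    complexity: str
    recall_by_model: dict  # model name -> best recall observed for that model


def high_impact(results, min_frustration="High", min_recall=0.95):
    """Keep questions at or above the frustration level whose best recall meets the bar."""
    threshold = FRUSTRATION_ORDER.index(min_frustration)
    rows = []
    for r in results:
        if FRUSTRATION_ORDER.index(r.frustration) < threshold:
            continue  # the current portal already handles this query acceptably
        if not r.recall_by_model:
            continue  # no scored runs for this question
        best = max(r.recall_by_model.values())
        if best < min_recall:
            continue
        # Report every model tied for the best recall, sorted by name.
        best_models = sorted(m for m, v in r.recall_by_model.items() if v == best)
        rows.append((r.question, r.frustration, r.complexity, best, best_models))
    return rows


# Made-up example data, not taken from the evaluation.
EXAMPLE = [
    QuestionResult("EX-001", "Very High", "1-hop", {"model-a": 1.0, "model-b": 0.5}),
    QuestionResult("EX-002", "Low", "0-hop", {"model-a": 1.0, "model-b": 1.0}),
    QuestionResult("EX-003", "High", "2-hop", {"model-a": 0.8, "model-b": 0.9}),
    QuestionResult("EX-004", "High", "1-hop", {"model-a": 1.0, "model-b": 1.0}),
]

if __name__ == "__main__":
    for row in high_impact(EXAMPLE):
        print(row)
```

In the example data, EX-002 is dropped because its frustration is Low and EX-003 because no model reaches the recall threshold, which leaves EX-001 and EX-004.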
