NF Publication RAG Evaluation

Generated 2026-03-20 14:53 UTC — 130 questions across 14 papers

Summary

Accuracy measures whether the agent selected the correct answer from the multiple-choice options. Citation F1 measures how well the agent cited the correct supporting passages (PMID + passage index tuples) — precision is the fraction of cited passages that are relevant, and recall is the fraction of expected passages that were cited.
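
For concreteness, here is how citation precision, recall, and F1 over (PMID, passage index) tuples can be computed. This is a minimal sketch in Python, not the report's actual scoring code; in particular, the handling of empty citation sets is an assumption.

```python
def citation_f1(cited: set[tuple[str, int]], expected: set[tuple[str, int]]) -> float:
    """Score one answer's citations against the expected (PMID, passage index) tuples."""
    if not cited or not expected:
        return 0.0  # assumption: no citations (or no expected passages) scores zero
    relevant = cited & expected
    precision = len(relevant) / len(cited)     # fraction of cited passages that are relevant
    recall = len(relevant) / len(expected)     # fraction of expected passages that were cited
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: two of the three cited passages are relevant, and two of the
# three expected passages were cited, so precision = recall = F1 = 2/3.
cited = {("12345678", 0), ("12345678", 2), ("87654321", 1)}
expected = {("12345678", 0), ("12345678", 2), ("99999999", 4)}
print(round(citation_f1(cited, expected), 4))  # 0.6667
```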

Note: Charts below show results for the user_query question style only. Performance is similar for both precise and user_query styles, but user_query better reflects realistic user interactions with natural phrasing and ambiguity.

| Model | Question Style | Samples | Accuracy | Citation F1 | Total Cost (USD) | Input Tokens | Output Tokens | Cache Write | Cache Read | Total Tokens | Total Time | Avg / Sample | Shortest | Longest | Date | Task Dataset Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| anthropic/claude-sonnet-4-5 | precise | 130 | 0.9923 ± 0.0077 | 0.7717 ± 0.0208 | $18.63 | 4,423 | 141,022 | 3,279,840 | 13,990,883 | 17,416,168 | 7m 23s | 0m 30s | 0m 08s | 1m 23s | 2026-03-17 | v0 |
| anthropic/claude-haiku-4-5 | precise | 130 | 0.9615 ± 0.0169 | 0.6888 ± 0.0288 | $4.69 | 428,108 | 125,762 | 2,378,880 | 6,615,404 | 9,548,154 | 3m 06s | 0m 12s | 0m 03s | 1m 15s | 2026-03-17 | v0 |
| anthropic/claude-haiku-4-5 | user_query | 130 | 0.9692 ± 0.0152 | 0.6383 ± 0.0320 | $4.93 | 373,712 | 133,684 | 2,546,536 | 7,064,397 | 10,118,329 | 2m 54s | 0m 12s | 0m 03s | 1m 31s | 2026-03-17 | v0 |
| anthropic/claude-sonnet-4-5 | user_query | 130 | 0.9846 ± 0.0108 | 0.7403 ± 0.0232 | $19.33 | 4,505 | 142,830 | 3,384,474 | 14,939,587 | 18,471,396 | 7m 35s | 0m 32s | 0m 08s | 1m 43s | 2026-03-17 | v0 |
| openai/gpt-5.4 | precise | 130 | 0.9846 ± 0.0108 | 0.7303 ± 0.0255 | $4.34 | 1,360,860 | 56,360 | 0 | 354,688 | 1,417,220 | 2m 00s | 0m 07s | 0m 02s | 0m 37s | 2026-03-17 | v0 |
| openai/gpt-5.4 | user_query | 130 | 0.9923 ± 0.0077 | 0.7087 ± 0.0262 | $5.00 | 1,600,970 | 61,405 | 0 | 325,760 | 1,662,375 | 1m 49s | 0m 08s | 0m 03s | 0m 30s | 2026-03-17 | v0 |
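
The ± values appear to be the standard error of the mean, sqrt(p(1 − p)/n); the report does not state this, but it reproduces the table's figures exactly. A quick check in Python:

```python
import math

# Assumption: the "±" column is the binomial standard error sqrt(p * (1 - p) / n).
# Checking the claude-sonnet-4-5 "precise" row: 0.9923 accuracy on 130 samples
# corresponds to 129/130 correct answers.
n = 130
p = 129 / n
se = math.sqrt(p * (1 - p) / n)
print(f"{p:.4f} ± {se:.4f}")  # 0.9923 ± 0.0077, matching the table
```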
Charts (figures not reproduced here): Cost vs Accuracy, Time vs Accuracy, Cost vs Citation F1, Time vs Citation F1, Accuracy by Difficulty, Citation F1 by Difficulty, Accuracy by Question Type, Citation F1 by Question Type.

Raw data: pubs_runs.json
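
To explore the raw runs programmatically, something like the sketch below should work, assuming pubs_runs.json is a JSON list of per-run records; the field names used (model, question_style, accuracy) are hypothetical placeholders, so check them against the actual file.

```python
import json

# Load the raw evaluation runs. The keys used below ("model", "question_style",
# "accuracy") are hypothetical -- inspect one record to confirm the real schema.
with open("pubs_runs.json") as f:
    runs = json.load(f)

for run in runs:
    print(run.get("model"), run.get("question_style"), run.get("accuracy"))
```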