NF Publication RAG Evaluation

Generated 2026-03-20 14:53 UTC — 130 questions across 14 papers

Summary

Accuracy measures whether the agent selected the correct answer from the multiple-choice options. Citation F1 measures how well the agent cited the correct supporting passages (PMID + passage index tuples) — precision is the fraction of cited passages that are relevant, and recall is the fraction of expected passages that were cited.
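
For concreteness, here is how citation precision, recall, and F1 over (PMID, passage index) tuples can be computed. This is a minimal sketch in Python, not the report's actual scoring code; in particular, the handling of empty citation sets is an assumption.

```python
def citation_f1(cited: set[tuple[str, int]], expected: set[tuple[str, int]]) -> float:
    """Score one answer's citations against the expected (PMID, passage index) tuples."""
    if not cited or not expected:
        return 0.0  # assumption: no citations (or no expected passages) scores zero
    relevant = cited & expected
    precision = len(relevant) / len(cited)     # fraction of cited passages that are relevant
    recall = len(relevant) / len(expected)     # fraction of expected passages that were cited
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: two of the three cited passages are relevant, and two of the
# three expected passages were cited, so precision = recall = F1 = 2/3.
cited = {("12345678", 0), ("12345678", 2), ("87654321", 1)}
expected = {("12345678", 0), ("12345678", 2), ("99999999", 4)}
print(round(citation_f1(cited, expected), 4))  # 0.6667
```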

Note: Charts below show results for the user_query question style only. Performance is similar for both precise and user_query styles, but user_query better reflects realistic user interactions with natural phrasing and ambiguity.

| Model | Question Style | Samples | Accuracy | Citation F1 | Total Cost (USD) | Input Tokens | Output Tokens | Cache Write | Cache Read | Total Tokens | Total Time | Avg / Sample | Shortest | Longest | Date | Task Dataset Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| anthropic/claude-sonnet-4-5 | precise | 130 | 0.9923 ± 0.0077 | 0.7717 ± 0.0208 | $18.63 | 4,423 | 141,022 | 3,279,840 | 13,990,883 | 17,416,168 | 7m 23s | 0m 30s | 0m 08s | 1m 23s | 2026-03-17 | v0 |
| anthropic/claude-haiku-4-5 | precise | 130 | 0.9615 ± 0.0169 | 0.6888 ± 0.0288 | $4.69 | 428,108 | 125,762 | 2,378,880 | 6,615,404 | 9,548,154 | 3m 06s | 0m 12s | 0m 03s | 1m 15s | 2026-03-17 | v0 |
| anthropic/claude-haiku-4-5 | user_query | 130 | 0.9692 ± 0.0152 | 0.6383 ± 0.0320 | $4.93 | 373,712 | 133,684 | 2,546,536 | 7,064,397 | 10,118,329 | 2m 54s | 0m 12s | 0m 03s | 1m 31s | 2026-03-17 | v0 |
| anthropic/claude-sonnet-4-5 | user_query | 130 | 0.9846 ± 0.0108 | 0.7403 ± 0.0232 | $19.33 | 4,505 | 142,830 | 3,384,474 | 14,939,587 | 18,471,396 | 7m 35s | 0m 32s | 0m 08s | 1m 43s | 2026-03-17 | v0 |
| openai/gpt-5.4 | precise | 130 | 0.9846 ± 0.0108 | 0.7303 ± 0.0255 | $4.34 | 1,360,860 | 56,360 | 0 | 354,688 | 1,417,220 | 2m 00s | 0m 07s | 0m 02s | 0m 37s | 2026-03-17 | v0 |
| openai/gpt-5.4 | user_query | 130 | 0.9923 ± 0.0077 | 0.7087 ± 0.0262 | $5.00 | 1,600,970 | 61,405 | 0 | 325,760 | 1,662,375 | 1m 49s | 0m 08s | 0m 03s | 0m 30s | 2026-03-17 | v0 |
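
The ± values appear to be the standard error of the mean, sqrt(p(1 − p)/n); the report does not state this, but it reproduces the table's figures exactly. A quick check in Python:

```python
import math

# Assumption: the "±" column is the binomial standard error sqrt(p * (1 - p) / n).
# Checking the claude-sonnet-4-5 "precise" row: 0.9923 accuracy on 130 samples
# corresponds to 129/130 correct answers.
n = 130
p = 129 / n
se = math.sqrt(p * (1 - p) / n)
print(f"{p:.4f} ± {se:.4f}")  # 0.9923 ± 0.0077, matching the table
```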
Charts (figures not reproduced here): Cost vs Accuracy, Time vs Accuracy, Cost vs Citation F1, Time vs Citation F1, Accuracy by Difficulty, Citation F1 by Difficulty, Accuracy by Question Type, Citation F1 by Question Type.

Raw data: pubs_runs.json
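
To explore the raw runs programmatically, something like the sketch below should work, assuming pubs_runs.json is a JSON list of per-run records; the field names used (model, question_style, accuracy) are hypothetical placeholders, so check them against the actual file.

```python
import json

# Load the raw evaluation runs. The keys used below ("model", "question_style",
# "accuracy") are hypothetical -- inspect one record to confirm the real schema.
with open("pubs_runs.json") as f:
    runs = json.load(f)

for run in runs:
    print(run.get("model"), run.get("question_style"), run.get("accuracy"))
```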