Do you break out failure to report any answer vs reporting an incorrect answer? On the “find a number” task, the best performance is pretty good if it’s 70% success, 30% don’t answer; but I’d mark it as worse than useless if it’s 70% correct, 30% hallucinate plausibly.
Good question. We don't explicitly break this out in our analysis, but we do give models the chance to give up, and some of our instances actually require them to give up for numbers that can't be found.
Anyway, from eyeballing results and traces, I get the sense that 70-80% of failures on the find number task are incorrect assertions rather than refusals to answer.
Thanks for this analysis! Did you also test LLMs regarding academic research? In Perplexity Pro there is an option to only search in papers, not the normal web. I couldn't get any of the other popular tools to do that.
It's hard to imagine doing web research without using LLMs. Chatbots may be the first thing you turn to for questions like: What are the companies currently working on nuclear fusion and who invested in them? What is the performance gap between open and closed-weight models on the MMLU benchmark? Is there really a Tesla Model H?
So which LLMs, and which "Search", "Research", "Deep Search" or "Deep Research" branded products, are best? How good are their epistemics, compared to if you did the web research yourself?
Last month we (FutureSearch) published Deep Research Bench (DRB), a benchmark designed to evaluate LLM agents on difficult web research tasks using frozen snapshots of the internet. In this post, we're going to share the non-obvious findings, suggestions and failure modes that we think might be useful to anyone who uses LLMs with web search enabled.
Full methods in the paper. In brief, we created web research tasks, carefully worked out the correct answers, and scraped and stored a large chunk of webpages offline for the agents to search, read, and reason about to succeed in the task.
We then ran 12 LLMs, each driving a ReAct agent with tools to search the internet and query documents and web pages. We also ran our tasks once through 11 commercial web research tools such as ChatGPT, OpenAI Deep Research, Perplexity, DeepSeek + Search, Grok DeepSearch, and Claude Research.[1]
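For readers unfamiliar with the setup, a ReAct agent simply alternates between LLM calls and tool calls until it decides to answer (or to give up). The sketch below is illustrative only; the tool names, prompt, model choice, and stopping rule are simplified assumptions, not the exact agent architecture used in DRB.

```python
# Minimal ReAct-style web research agent (illustrative sketch, not the DRB agent).
import json
from openai import OpenAI

client = OpenAI()

def search(query: str) -> str:
    """Return search results for `query` (stub: wire this to a search backend)."""
    raise NotImplementedError

def fetch(url: str) -> str:
    """Return the text content of `url` (stub: wire this to a page/document store)."""
    raise NotImplementedError

TOOLS = {"search": search, "fetch": fetch}

SYSTEM = (
    "You are a web research agent. At each step reply with JSON: "
    '{"thought": ..., "action": "search" | "fetch" | "finish", "input": ...}. '
    'Use "finish" with your final answer, or with "CANNOT_FIND" to give up.'
)

def run_agent(task: str, max_steps: int = 20) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="gpt-4.1", messages=messages).choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        step = json.loads(reply)
        if step["action"] == "finish":
            return step["input"]
        observation = TOOLS[step["action"]](step["input"])
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "CANNOT_FIND"  # give up if the step budget runs out
```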
We have a continuously updated benchmark available at evals.futuresearch.ai that includes newer models such as Claude 4 (but no commercial web research tools) and is run on a frozen snapshot of the internet.[2] We'll refer to the continuously updated benchmark as DRB, and the paper version as DRB-250506. Results between the two cannot be compared directly: since the publication of the paper, we have updated both a few of our tasks as well as the agent architecture powering the LLMs.
In this post, we're primarily reporting on the results of the paper, DRB-250506.
We'll start with the commercial web research tools, as those present the most straightforward way to use LLMs as research assistants. Here is an overview of the scores on DRB-250506 for different commercial research tools:
We find that ChatGPT with o3 + web search enabled outperformed all other approaches we tested, by a comfortable margin. It even performed noticeably better than OpenAI Deep Research, despite the much longer runtime of the latter. While it's a bit hard to interpret differences in scores directly, we would call it "quite noticeable".
Based on our experience, one potential reason that ChatGPT with o3 + web search tended to do better than other approaches was that it had a habit of double-checking its own findings. Often (but not often enough) it would go through what it found and say "let me check this is actually true".
OpenAI Deep Research and Gemini Deep Research also performed decently. OpenAI Deep Research, however, is expensive and slow. Gemini Deep Research sometimes had issues following instructions (for example, it was not able to produce structured JSON output) and was also relatively slow. Both produce long research reports which look impressive at first sight, but can be hard to parse. At the time we ran our study, both were also rather limited in the number of queries one could run concurrently. Overall they produced decent reports, but if you need a research assistant that's flexible and can be prompted iteratively, ChatGPT with o3 + web search seems much better.
Grok DeepSearch, Perplexity Pro, Gemini 2.5 Pro + web search, and Claude (both the Research version and the regular Claude 3.7 + web search) were ok-ish, but not great. We were slightly surprised by the not-so-great performance of the Claude models, but then realized that they apparently were not able to read the contents of PDFs, which was important, or at least helpful, for a few of our tasks.
Perplexity Deep Research, ChatGPT 4.5 with web search, and in particular DeepSeek with web search trailed noticeably behind the other approaches. Perplexity Deep Research, in addition to performing poorly, was also relatively slow. DeepSeek is attractive because it is cheap and relatively fast, but it did not perform very well. On at least one question (how many medals Chinese Taipei won at the recent Special Olympics) it refused to answer at all.
We saw a tendency for the regular Chat models + web search capability to outperform dedicated Deep Research tools. This was particularly noticeable for ChatGPT o3 with web search, which wiped the floor with OpenAI Deep Research. Similarly, Perplexity Pro performed noticeably better than Perplexity Deep Research.
This was not always the case: Gemini Deep Research was clearly better than Gemini 2.5 Pro with web search and Claude Research was a tiny little bit better than Claude 3.7 + web search (within the margin of noise).
But as a general rule of thumb, we would recommend the regular Chat mode, as it's usually faster, produces more manageable results, and allows for easier iteration.
The following graphic visualizes the scores of all approaches we tested, including both the commercial tools and various LLMs powering a simple custom ReAct agent.
In terms of overall performance, the big 3 (o3, Claude 3.7 Sonnet and Gemini 2.5 Pro) comfortably outperformed other models on DRB-250506. Gemini 2.5 Flash was surprisingly good, given that it's a small, fast and cheap model. GPT-4.1 performed similarly, but there doesn't seem to be any reason to choose it over the other alternatives. Open models generally lagged behind, and only DeepSeek R1 may actually be an interesting candidate for production use (see the next section).
None of our ReAct agents was able to beat ChatGPT o3 + web search. Interestingly, the ChatGPT web version also noticeably outperformed o3 with a custom ReAct agent, suggesting that OpenAI really did an exceptionally good job at building the architecture around o3 that powers ChatGPT.
In contrast, Claude 3.7 Sonnet via the API with a custom ReAct agent performed noticeably better than the corresponding web versions, Claude Research and Claude 3.7 + web search, suggesting that Anthropic did not put as much effort into building an exceptional web chat interface. The inability to read PDFs is likely a relevant factor here.
Given that Claude 3.7 Sonnet and o3 perform about the same when driving a simple ReAct agent, the gap between their commercial counterparts highlights the importance of scaffolding.
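Tool coverage is part of that scaffolding. For example, if you build your own agent, giving it a document tool that can read PDFs is cheap; the snippet below is a minimal hypothetical sketch using the pypdf library, not the actual DRB tooling.

```python
# Hypothetical document tool that lets an agent read PDFs as plain text.
# Illustrative only; this is not the tooling used in DRB.
import io
import requests
from pypdf import PdfReader

def read_document(url: str, max_chars: int = 20_000) -> str:
    """Fetch `url`; extract text if it is a PDF, otherwise return the raw body."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    if "application/pdf" in resp.headers.get("Content-Type", "") or url.endswith(".pdf"):
        reader = PdfReader(io.BytesIO(resp.content))
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
    else:
        text = resp.text
    return text[:max_chars]  # truncate so it fits in the agent's context window
```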
Claude 4
Based on our continuously updated DRB (scores not directly comparable with the paper version DRB-250506!), Claude 4 Sonnet and Claude 4 Opus outperform o3 when used with a ReAct agent (we do not have a comparison including any of the commercial web research tools at this point). Interestingly, Claude 4 Sonnet and Opus perform about the same.
The following shows performance on DRB:
As expected, closed models performed much better than open models. The only advantage of open models is that they can be self-hosted and can be quite cheap. This is true in particular for DeepSeek R1. We hosted an instance of it via OpenRouter, which was easy to set up. The model was fast, very robust and cheap.
In situations where you'd like to run a lot of fast queries, a self-hosted DeepSeek R1 may actually be attractive. Mistral and Gemma, in contrast, were harder to set up and more flaky. We don't recommend using them.
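If you do want to try DeepSeek R1, OpenRouter exposes an OpenAI-compatible API, so getting started takes only a few lines. The model identifier below is an assumption; check OpenRouter's model catalog for the current id.

```python
# Calling DeepSeek R1 through OpenRouter's OpenAI-compatible endpoint.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek/deepseek-r1",  # assumed model id; verify against OpenRouter's catalog
    messages=[{"role": "user", "content":
               "Which companies are currently working on nuclear fusion, and who invested in them?"}],
)
print(response.choices[0].message.content)
```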
Generally, we found little difference in performance between thinking and non-thinking models. Our ReAct agent had a special modification for thinking models, where it skipped the stage asking LLMs to produce a thought. The reason for this is that some LLMs refuse to answer requests to reveal their inner musings, or produce bizarre outputs; Gemini 2.5 Pro, for example, once hallucinated an entire agent trace!
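In practice, a modification like this can be as simple as branching on the model name when constructing the step prompt. The prompts and model list below are illustrative assumptions, not the exact ones used in DRB.

```python
# Sketch: skip the explicit "Thought" stage for reasoning ("thinking") models.
REACT_STEP = (
    "First write a Thought explaining your reasoning, then an Action "
    "(search, fetch, or finish) with its input."
)
ACT_ONLY_STEP = (
    "Reply with an Action (search, fetch, or finish) and its input. "
    "Do not include your reasoning."
)

# Reasoning models may refuse to reveal their chain of thought, or produce
# odd output when asked to, so we drop the Thought request for them.
THINKING_MODELS = {"o3", "o4-mini", "gemini-2.5-pro", "deepseek-r1"}

def step_prompt(model_name: str) -> str:
    return ACT_ONLY_STEP if model_name in THINKING_MODELS else REACT_STEP
```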
We found LLM agents to provide real value and to make real progress on many of our task instances. LLMs can come up with structured, and sometimes quite impressive, plans to tackle a difficult problem. But their approaches and plans usually still fall short of what an intelligent human would do.
However, many common failure modes that have been reported previously still exist.
You can find a more thorough discussion of failure modes in the paper.
LLM-powered web research tools are useful, and if you do anything that remotely resembles research, you should likely be using them.
As a default, we recommend o3 + web search, and we are keeping an eye on Claude Research, which isn't performing well in the Claude UI right now; but since the Claude 4 models do best as agents, this could change soon.
Dedicated deep research tools usually take longer, produce convoluted output and are often not better than their base chat model counterparts. Gemini Deep Research is an exception, scoring better than Gemini with only regular search, but neither beat o3 + web search.
Others like Perplexity Pro, DeepSeek (even the new one), and Grok are simply not close in quality. Grok has X access, so if that matters to you, it is one reason to consider it.
For regular use cases, we don't see a strong argument for implementing your own web agents and accessing LLMs via the API. Unless, that is, you want to spend a serious amount of time improving your architecture and developing tools the agent can use. There are likely significant returns to that, but it is also just a hard task.
If you decide to use LLMs via the API in production and their tools include using the web, Claude 4 Sonnet and Opus outperform o3, Claude 3.7 Sonnet, and Gemini 2.5 Pro.
We can only use the live web when testing commercial products, so these results are not stable over time, unlike the core evals, which use large sets of offline web pages.
Also, a few of our task instances rely on a mechanism where a human spends a lot of time working out a solution, we block the website where they published their results, and we then ask our agents to recreate the solution. That, of course, does not work with commercial web research tools, so we ran those on a smaller set of just 66 task instances.
We developed a system called "RetroSearch". For every task instance, we scraped thousands (often tens of thousands) of web pages that we serve to agents instead of the live web pages.
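To make that concrete, serving a frozen snapshot amounts to redirecting the agent's fetch tool from the live web to an offline page store. The on-disk layout and lookup scheme below are assumptions for illustration; RetroSearch's actual implementation is described in the paper.

```python
# Toy illustration of serving agents a frozen snapshot instead of the live web.
import hashlib
import json
from pathlib import Path

SNAPSHOT_DIR = Path("retrosearch_snapshot")  # hypothetical directory of scraped pages

def _key(url: str) -> str:
    return hashlib.sha256(url.encode()).hexdigest()

def fetch_page(url: str) -> str:
    """Return the stored text of `url`, or a marker if it wasn't scraped."""
    path = SNAPSHOT_DIR / f"{_key(url)}.json"
    if not path.exists():
        return "PAGE NOT IN SNAPSHOT"
    record = json.loads(path.read_text())
    return record["text"]  # text extracted at scrape time, frozen thereafter

def search_snapshot(query: str, index: dict[str, str]) -> list[str]:
    """Naive keyword search over a snapshot index mapping url -> page title."""
    terms = query.lower().split()
    return [url for url, title in index.items()
            if all(t in title.lower() for t in terms)]
```

This is what keeps the benchmark reproducible: every agent sees the same pages regardless of when it runs.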