A Guide For LLM-Assisted Web Research

by nikos, dschwarz, Lawrence Phillips, FutureSearch
26th Jun 2025

Comments

Hastings:

Do you break out failure to report any answer vs reporting an incorrect answer? On the "find a number" task, the best performance is pretty good if it's 70% success, 30% don't answer; but I'd mark it as worse than useless if it's 70% correct, 30% hallucinate plausibly.

Lawrence Phillips:

Good question. We don't explicitly break this out in our analysis, but we do give models the chance to give up, and some of our instances actually require them to give up for numbers that can't be found.

Anyway, from eyeballing results and traces, I get the sense that 70-80% of failures on the find-number task are incorrect assertions rather than refusals to answer.

Jonas Strabel:

Thanks for this analysis! Did you also test LLMs regarding academic research? In Perplexity Pro there is an option to only search in papers, not the normal web. I couldn't get any of the other popular tools to do that.

It's hard to imagine doing web research without using LLMs. Chatbots may be the first thing you turn to for questions like: What are the companies currently working on nuclear fusion and who invested in them? What is the performance gap between open and closed-weight models on the MMLU benchmark? Is there really a Tesla Model H?

So which LLMs, and which "Search", "Research", "Deep Search" or "Deep Research" branded products, are best? How good are their epistemics compared to doing the web research yourself?

Last month we (FutureSearch) published Deep Research Bench (DRB), a benchmark designed to evaluate LLM agents on difficult web research tasks using frozen snapshots of the internet. In this post, we're going to share the non-obvious findings, suggestions, and failure modes that we think might be useful to anyone who uses LLMs with web search enabled.

tl;dr

  • ChatGPT with o3 + web search outperformed everything else by a decent margin, though it is still clearly worse than a skilled human. Use this as your default research tool.
  • Claude web + Claude Research were not able to read PDFs (as of May 6, 2025 and as far as we can tell that's still the case), which likely nerfed their scores. If that's important for your task, don't use them.
  • If you're using LLMs with your own agent via an API, Claude 4 Sonnet + Opus are best, better than o3 (as of June 24, 2025).
  • Grok was ok, but not great. Unless you need to access Twitter. Then definitely use Grok, which excels at this (our task instances did not rely on Twitter, so that strength is not reflected in our benchmark).
  • Regular chat modes + web search often outperformed their dedicated deep research counterparts. They are also much better for iterative research.

What We Did 

Full methods in the paper. In brief, we created web research tasks, carefully worked out the correct answers, and scraped and stored a large chunk of webpages offline for the agents to search, read, and reason about to succeed in the task.

Overview of task types. DRB uses RetroSearch, a system designed to serve agents a frozen, previously scraped version of the internet instead of the live pages. Offline Web Pages shows the median (min–max) number of web pages scraped and stored for agents to search and read during each task.

We then ran 12 LLMs, each driving a ReAct agent with tools to search the internet and query documents and web pages. We also ran our tasks once through 11 commercial web research tools such as ChatGPT, OpenAI Deep Research, Perplexity, DeepSeek + Search, Grok DeepSearch, and Claude Research.[1]
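To make the setup concrete, here is a minimal sketch of the kind of ReAct loop we mean, assuming the OpenAI Python SDK and its function-calling interface. The tool names, prompts, stub implementations, and model ID are illustrative; this is not the exact agent used in DRB.

```python
# Minimal ReAct-style research agent: the model alternates between tool calls
# (search the web, read a page) and reasoning until it commits to an answer.
# Tool names, prompts, stubs, and the model ID are illustrative, not DRB's code.
import json
from openai import OpenAI

client = OpenAI()

TOOLS = [
    {"type": "function", "function": {
        "name": "search_web",
        "description": "Search the web and return result snippets.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "read_page",
        "description": "Return the text content of a web page or PDF.",
        "parameters": {"type": "object",
                       "properties": {"url": {"type": "string"}},
                       "required": ["url"]}}},
]

def search_web(query: str) -> str:
    # Stub: swap in a real search API, or a frozen index like RetroSearch.
    return f"[search results for: {query}]"

def read_page(url: str) -> str:
    # Stub: swap in a fetcher over live pages or stored snapshots.
    return f"[text content of {url}]"

def research(question: str, max_steps: int = 20) -> str:
    messages = [
        {"role": "system", "content":
         "You are a web research agent. Use the tools to search and read "
         "pages, check your findings, and answer only when confident."},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="gpt-4.1", messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:  # no tool call means the model is answering
            return msg.content
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = (search_web(**args) if call.function.name == "search_web"
                      else read_page(**args))
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": result})
    return "No answer within the step budget."
```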

We have a continuously updated benchmark available at evals.futuresearch.ai that includes newer models such as Claude 4 (but no commercial web research tools) and is run on a frozen snapshot of the internet.[2] We'll refer to the continuously updated benchmark as DRB, and the paper version as DRB-250506. Results between the two cannot be compared directly: since the publication of the paper, we have updated both a few of our tasks and the agent architecture powering the LLMs.

In this post, we're primarily reporting on the results of the paper, DRB-250506.

What We Found

Commercial Web Research Tools

We'll start with the commercial web research tools, as those present the most straightforward way to use LLMs as research assistants. Here is an overview of the scores on DRB-250506 for different commercial research tools: 

Scores for different commercial web research tools on 8 different tasks. Scores are between 0 and 1. Bar charts show marginal distributions, averaged across tasks (top) or approaches (right). 

 

The Good

We find that ChatGPT with o3 + web search enabled outperformed all other approaches we tested, by a comfortable margin. It even performed noticeably better than OpenAI Deep Research, despite the much longer runtime of the latter. While it's a bit hard to interpret differences in scores directly, we would call the gap "quite noticeable".

Based on our experience, one potential reason that ChatGPT with o3 + web search tended to do better than other approaches was that it had a habit of double-checking its own findings. Often (but not often enough) it would go through what it found and say "let me check this is actually true".
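If you're building your own agent, one cheap way to imitate that habit is an explicit verification pass before a draft answer is accepted. A rough sketch of the idea; the prompt wording and model ID here are ours, not anything o3 or DRB actually uses:

```python
# Hypothetical verification pass: before accepting a draft answer, ask the
# model to check each claim against the sources the agent collected.
from openai import OpenAI

client = OpenAI()

VERIFY_PROMPT = (
    "Below is a draft answer and the sources used to produce it.\n"
    "For each factual claim, quote the sentence in the sources that supports "
    "it. If a claim has no support, say so and revise the answer.\n\n"
    "Draft answer:\n{answer}\n\nSources:\n{sources}"
)

def verify(draft_answer: str, sources: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4.1",  # any capable model; the ID here is illustrative
        messages=[{"role": "user",
                   "content": VERIFY_PROMPT.format(answer=draft_answer,
                                                   sources=sources)}])
    return resp.choices[0].message.content
```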

OpenAI Deep Research and Gemini Deep Research also performed decently. OpenAI Deep Research, however, is expensive and slow. Gemini Deep Research sometimes had issues following instructions (for example, it was not able to produce structured JSON output) and was also relatively slow. Both produce long research reports which look impressive at first sight, but can be hard to parse. At the time we ran our study, both were also rather limited in the number of queries one could run concurrently. They overall produced decent reports, but if you need a research assistant that's flexible and can be prompted iteratively, ChatGPT with o3 + web search seems much better.

The Mediocre

Grok DeepSearch, Perplexity Pro, Gemini 2.5 Pro + web search, and Claude (both the Research version and the regular Claude 3.7 + web search) were ok-ish, but not great. We were slightly surprised by the not-so-great performance of the Claude models, but then realized that they apparently were not able to read the contents of PDFs, which was important, or at least helpful, for a few of our tasks.

The Bad

Perplexity Deep Research, ChatGPT 4.5 with web search, and in particular DeepSeek with web search trailed noticeably behind the other approaches. Perplexity Deep Research, in addition to performing poorly, was also relatively slow. DeepSeek is attractive because it is cheap and relatively fast, but it did not perform very well. On at least one question (how many medals Chinese Taipei won at the recent Special Olympics) it refused to answer completely.

Regular vs. Deep Research Mode

We saw a tendency for the regular chat models + web search to outperform their dedicated deep research counterparts. This was particularly noticeable for ChatGPT o3 with web search, which clearly outperformed OpenAI Deep Research. Similarly, Perplexity Pro performed noticeably better than Perplexity Deep Research.

This was not always the case: Gemini Deep Research was clearly better than Gemini 2.5 Pro with web search, and Claude Research was marginally better than Claude 3.7 + web search (within the margin of noise).

But as a general rule of thumb, we would recommend the regular chat mode, as it's usually faster, produces more manageable results, and allows for easier iteration.

Using LLMs With Agents via the API

The following graphic visualizes the scores of all approaches we tested, including both the commercial tools and various LLMs powering a simple custom ReAct agent.

Scores for different agents and commercial web research tools on 8 different tasks. Scores are between 0 and 1. Bar charts show marginal distributions, averaged across tasks (top) or approaches (right). 

In terms of overall performance, the big 3 (o3, Claude 3.7 Sonnet, and Gemini 2.5 Pro) comfortably outperformed other models on DRB-250506. Gemini 2.5 Flash was surprisingly good, given that it's a small, fast, and cheap model. GPT-4.1 performed similarly, but there doesn't seem to be any reason to choose it over the other alternatives. Open models generally lagged behind, and only DeepSeek R1 may actually be an interesting candidate for production use (see the next section).

None of our ReAct agents was able to beat ChatGPT o3 + web search. Interestingly, the ChatGPT web version also noticeably outperformed o3 with a custom ReAct agent, suggesting that OpenAI really did an exceptionally good job at building the architecture around o3 that powers ChatGPT. 

In contrast, Claude 3.7 Sonnet via the API with a custom ReAct agent performed noticeably better than the corresponding web versions, Claude Research and Claude + web search, suggesting that Anthropic has not put as much effort into building an exceptional web chat experience. The inability to read PDFs is likely a relevant factor here.

Given that Claude 3.7 Sonnet and o3 perform about the same when driving a simple ReAct agent, while o3 inside ChatGPT pulls well ahead, this again highlights the importance of scaffolding.

Claude 4

Based on our continuously updated DRB (scores not directly comparable with the paper version DRB-250506!), Claude 4 Sonnet and Claude 4 Opus outperform o3 when used with a ReAct agent (we do not have a comparison including any of the commercial web research tools at this point). Interestingly, Claude 4 Sonnet and Opus perform about the same.

The following shows performance on DRB:

Performance on the continuously updated DRB (with slight changes to the paper version DRB-250506). LLMs were tested using a custom simple ReAct agent and tools to search the web and query the content of documents such as web pages and PDFs.

 

Open vs. Closed Models

As expected, closed models performed much better than open models. The only advantage of open models is that they can be self-hosted and can be quite cheap. This is true in particular for DeepSeek R1. We hosted an instance of it via OpenRouter, which was easy to set up. The model was fast, very robust, and cheap.

In situations where you'd like to run a lot of fast queries, a self-hosted DeepSeek R1 may actually be attractive. Mistral and Gemma, in contrast, were harder to set up and more flaky. We don't recommend using them. 
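For reference, calling DeepSeek R1 through OpenRouter looks roughly like this. OpenRouter exposes an OpenAI-compatible endpoint; the model ID and pricing may have changed since we ran our experiments, and the environment variable name is just our convention:

```python
# Calling DeepSeek R1 through OpenRouter's OpenAI-compatible API.
# Check openrouter.ai for the current model ID and pricing.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="deepseek/deepseek-r1",
    messages=[{"role": "user",
               "content": "Which companies are currently working on nuclear fusion?"}],
)
print(resp.choices[0].message.content)
```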

Thinking vs. Non-Thinking Models

Generally, we found little difference in performance between thinking and non-thinking models. Our ReAct agent had a special modification for thinking models, where it skipped the stage asking LLMs to produce a thought. The reason for this is that some LLMs refuse requests to reveal their inner musings, or produce bizarre outputs, like when Gemini 2.5 Pro hallucinated an entire agent trace!
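In code, the modification amounts to a branch in the agent loop: only non-thinking models get an explicit "produce a thought" request before acting. A rough sketch of the idea; the model list and prompt wording are illustrative, not our exact agent:

```python
# Skip the explicit "thought" step for models that already reason internally,
# since asking them to reveal their reasoning can trigger refusals or odd
# outputs. The set of model names is illustrative.
THINKING_MODELS = {"o3", "o4-mini", "deepseek-r1", "gemini-2.5-pro"}

def next_step_prompt(model: str) -> str:
    if model in THINKING_MODELS:
        # Thinking models: go straight to the action request.
        return "Choose the next tool call, or give your final answer."
    # Classic ReAct: ask for an explicit thought before the action.
    return ("First write a short thought about what to do next, "
            "then choose a tool call or give your final answer.")
```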

Epistemic concerns with LLM-powered web research tools

We found that LLM agents provide real value and can make real progress on many of our task instances. LLMs can come up with structured, and sometimes quite impressive, plans to tackle a difficult problem. But their approaches and plans usually still fall short of what an intelligent human would do.

Moreover, many common failure modes that have been reported previously still exist:

  • LLMs commonly make basic reasoning and inference errors
  • LLMs are often too gullible when encountering unreliable information or are unable to discern high-quality from low-quality sources
  • LLMs tend to adopt minimal-effort approaches and often stop early when the first thing that resembles a solution comes up
  • LLMs are often surprisingly bad at formulating Google queries
  • As far as we can tell, hallucinations have not decreased much with recent model releases. DeepSeek R1 is especially prone to hallucinations

You can find a more thorough discussion of failure modes in the paper.

Conclusions

LLM-powered web research tools are useful, and if you do anything that remotely resembles research, you should likely be using them. 

As a default, we recommend o3 + web search. We're also keeping an eye on Claude Research: it isn't performing well in the Claude UI right now, but since the Claude 4 models do best as agents, this could change soon.

Dedicated deep research tools usually take longer, produce convoluted output and are often not better than their base chat model counterparts. Gemini Deep Research is an exception, scoring better than Gemini with only regular search, but neither beat o3 + web search. 

Others like Perplexity Pro, DeepSeek (even the new one), and Grok are simply not close in quality. Grok has X access, so if that's important to you, that is one reason to consider it.

For regular use cases, we don't see a strong argument for implementing your own web agents and accessing LLMs via the API. Unless, that is, you want to spend a serious amount of time improving your architecture and developing tools the agent can use. There are likely significant returns to that, but it is also just a hard task. 

If you're deciding to use LLMs via the API in production and their tools include using the web, Claude 4 Sonnet and Opus outperform o3, Claude 3.7, and Gemini 2.5 Pro.

 

  1. ^

    We can only use the live web when testing commercial products, so these results are not stable over time, unlike the core evals using large sets of offline web pages.

    Also, a few of our task instances rely on a mechanism where a human has spent a lot of time working out a solution; we then block the website where they published their results and ask our agents to recreate the solution. That, of course, does not work with commercial web research tools, so we ran those on a smaller set of just 66 task instances.

  2. ^

    We developed a system called "RetroSearch". For every task instance, we scraped thousands (often tens of thousands) of web pages that we serve to agents instead of the live web pages.
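As a toy illustration of the idea (the directory layout and hashing scheme below are made up; the real RetroSearch system is considerably more involved), the agent's page-reading tool can resolve URLs against a store of pages scraped on a fixed date instead of fetching the live web:

```python
# Toy illustration: serve a frozen snapshot by resolving URLs against a local
# store of pages scraped on a fixed date instead of the live web.
import hashlib
import pathlib

SNAPSHOT_DIR = pathlib.Path("snapshots/2025-05-06")  # made-up layout

def read_page_frozen(url: str) -> str:
    key = hashlib.sha256(url.encode()).hexdigest()
    path = SNAPSHOT_DIR / f"{key}.txt"
    if path.exists():
        return path.read_text()
    # Pages outside the snapshot simply don't exist for the agent,
    # which keeps results reproducible over time.
    return "PAGE NOT IN SNAPSHOT"
```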