If the report really is worth $500, then the primary costs are:
Figuring out what you want.
Figuring out the prompt to get it.
Reading the giant report.
NOT the 45 cents you might save!
When are we getting agents that are better at figuring out what we want? THAT would be a huge time saver. Of course it's a joke, but I think it will become a bottleneck at some point, and then we will start to give AI higher-level goals and things will get weird.
Note that OpenAI has reported an outdated baseline for the GAIA benchmark.
A few days before the Deep Research presentation, a new GAIA benchmark SOTA was established (see the validation tab of https://huggingface.co/spaces/gaia-benchmark/leaderboard).
The actual SOTA (Jan 29, 2025, Trase Agent v0.3) is 70.3 average, 83.02 on Level 1, 69.77 on Level 2, and 46.15 on Level 3.
On Level 1, the relatively easiest category, this SOTA is clearly better than the numbers reported even for Deep Research (pass@64), and it is generally slightly better than Deep Research (pass@1) except on Level 3.
Table of Contents

The Pitch
It's Coming
Is It Safe?
How Does Deep Research Work?
Killer Shopping App
Rave Reviews
Research Reports
Perfecting the Prompt
Not So Fast!
What's Next?
Paying the Five
The Lighter Side

The Pitch
Or: Perhaps DeepSeek will quickly follow suit, perhaps they will choose not to. The important thing about Sarah’s comment is that there is essentially no difference here.

It’s Coming
Having browsing and Python tools and CoT all at once is not an especially fair fight, and also this is o3 rather than o3-mini under the hood, but yes, the jump to 26.6% is quite big, and confirms why Sam Altman said that soon we will need another exam. That doesn’t mean that we will pass it. OAI-DR is also the new state of the art on GAIA, which evaluates real-world questions.

Is It Safe?
I mean, probably. But how would you know?

It is a huge jump on Humanity’s Last Exam. There’s no system card. There’s no discussion of red teaming. There’s not even a public explanation of why there’s no public explanation.

This was released literally two days after they released the o3-mini model card, to show that o3-mini is a safe thing to release, in which they seem not to have used Deep Research as part of their evaluation process. Going forward, I think it is necessary to use Deep Research as part of all aspects of the preparedness framework testing for any new models, and that this should have also been done with o3-mini.

Then two days later, without a system card, they released Deep Research, which is confirmed to be based upon the full o3. I see this as strongly against the spirit of the White House and Seoul commitments to release safety announcements for ‘all new significant model public releases.’ So where is the model card and safety information for o3?

Well, their basic answer is ‘this is a limited release and doesn’t really count.’ With the obvious (to be clear completely unstated and not at all confirmed) impression that this was rushed out due to r1 to ensure that the conversation and vibe shifted.

I reached out officially, and they gave me this formal statement (which also appears here):

We do know that the version of o3 (again, full o3) in use tested out as Medium on their preparedness framework and went through the relevant internal committees, which would allow them to release it. But that’s all we know. They stealth released o3, albeit in a limited form, well before it was ready.

I also have confirmation that the system card will be released before they make Deep Research more widely available (and presumably before o3 is made widely available), and that this is OpenAI’s understanding of its obligations going forward.

They draw a clear distinction between Plus and Free releases or API access, which invokes their disclosure obligations, and limited release only to Pro users, which does not. They do their safety testing under the preparedness framework before even a limited release. However, they consider their obligations to share safety information only to be invoked when a new model is made available to Plus or Free users, or to API users.

I don’t see that as the right distinction to draw here, although I see an important one in API access vs. chatbot interface access. Anyone can now pay the $200 (perhaps with a VPN) and use it, if they want to do that, and in practice multi-account for additional queries if necessary. This is not that limited a release in terms of the biggest worries.

The silver lining is that this allows us to have the discussion now. I am nonzero sympathetic to the urgency of the situation, and to the intuition that this modality combined with the limited bandwidth and speed renders the whole thing Mostly Harmless.

But if this is how you act under this amount of pressure, how are you going to act in the future, with higher stakes, under much more pressure? Presumably not so well.

How Does Deep Research Work?
I do notice this is seemingly distinct from Gemini’s Deep Research. With Gemini’s version, first it searches for sources up front, which it shows to you. Then it compiles the report. OpenAI’s version will search for sources and otherwise take actions as it needs to look for them. That’s a huge upgrade.

Under the hood, we know it’s centrally o3 plus reinforcement learning with the ability to take actions during the chain of thought. What you get from there depends on what you choose as the target.
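To make the contrast concrete, here is a minimal sketch of the two workflows as I understand them. It is purely illustrative: the function names (search_web, run_python, llm_step, compile_report) are hypothetical placeholders rather than real OpenAI or Google APIs, and the actual systems are of course far more involved.

```python
from dataclasses import dataclass

# Hypothetical placeholders for illustration only; not real APIs.
@dataclass
class Action:
    kind: str          # "search", "python", or "finish"
    query: str = ""
    code: str = ""
    report: str = ""

def gemini_style(question: str) -> str:
    """Gather sources up front (shown to the user), then write the report."""
    sources = search_web(question)               # one research pass, fixed plan
    return compile_report(question, sources)

def openai_style(question: str, max_steps: int = 50) -> str:
    """The model decides mid chain-of-thought when to browse or run code,
    folding each result back into its context before the next step."""
    context = [question]
    for _ in range(max_steps):
        action: Action = llm_step(context)       # model chooses the next action
        if action.kind == "search":
            context.append(search_web(action.query))
        elif action.kind == "python":
            context.append(run_python(action.code))
        else:                                    # "finish": write the report
            return action.report
    return compile_report(question, context)     # fallback if the step budget runs out
```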
Killer Shopping App

This is clearly not the thing everyone had in mind, and it’s not the highest value use of a graduate research assistant, but I totally buy that it is awesome at this:

E-commerce search is a perfect application. You don’t care about missing some details or a few hallucinations all that much if you’re going to check its work afterwards. You usually don’t need the result right now. But you absolutely want to know what options are available, where, at what price, and what features matter and what general reviews look like and so on.

In the past I’ve found this to be the best use case for Gemini Deep Research – it can compare various offerings, track down where to buy them, list their features and so on. This is presumably the next level up for that. If I could buy unlimited queries at $0.50 a pop, I would totally do this.

The question then becomes: right now, you get 100 queries a month for $200 (along with Operator and o1-pro), but you can’t then add more queries on the margin. So the marginal query might be worth a lot more to you than $0.50.

Rave Reviews
Not every review is a rave, but here are some of the rave reviews. As I mentioned above, ‘hook DR up to your credentials’ would be huge.

Tyler Cowen says ‘so far [DR] is amazing’ but doesn’t yet offer more details, as the post is mostly about o1-pro.

Dean Ball is very impressed, saying DR is doing work that would have taken a day or more of research, here it is researching various state regulations. He thinks this is big. I continue to see Dean Ball as a great example of where this type of work is exactly a fit for what he needs to do his job, but still, wowsers.

Olivia Moore is impressed for retrieval tasks, finding it better than Operator, finding it very thorough. I worry it’s too thorough, forcing you to wade through too much other stuff, but that’s what other LLMs are for – turning more text into less text.

Seth Lazar is impressed as he shops for a camera, and notices a weakness: it doesn’t properly discount older websites in this context.

The ultimate rave review is high willingness to pay. My question is, how does xjdr even have time to read all the reports? Or is this a case of feeding them into another LLM?

Research Reports
Now we get to the regular reviews.

The general theme is: it will give you a lot of text, most of it accurate, but not all, and it will have some insights but pile the slop and unimportant stuff high on top of them without noticing which is which. It’s the same as Gemini’s Deep Research, only more so, and generally stronger but slower. That is highly useful, if you know how to use it.

One thing that frustrates me about Gemini Deep Research, and seems to be present in OpenAI’s version as well, is that it will give you an avalanche of slop whether you like it or not. If you ask it for a specific piece of information, like one number that is ‘the average age when kickers retire,’ you won’t get it, at least not by default. This is very frustrating. To me, what I actually want – very often – is to answer a specific question, for a particular reason.

Bayram Annakov concludes it is ‘deeper but slower than Gemini.’

Here’s a personal self-analysis example, On the Couch with Dr. Deep Research.

Ethan Mollick gets a 30-page, 10k-word report, ‘Evolution of Tabletop RPGs: From Dungeon Crawls to Narrative Epics.’

To me there are mostly two speeds, ‘don’t care how I want it now’ and ‘we have all the time in the world.’ Once you’re coming back later, 5 minutes and an hour are remarkably similar lengths. If it takes days then that’s a third level.

Colin Fraser asks, who would each NBA team most likely have guard LeBron James? It worked very hard on this, and came back with an answer that often included players no longer on the team, just like o1 often does. Colin describes this as a lack of agency problem, that o3 isn’t ensuring it has an up-to-date set of rosters as a human would. I’m not sure that’s the right way to look at it? But it’s not unreasonable.

Ted tries it in a complex subfield he knows well, finds a 90% coherent high-level summary of prior work and 10% total nonsense that a non-expert wouldn’t be able to differentiate, and he’s ‘not convinced it “understands” what is going on.’ That’s a potentially both highly useful and highly dangerous place to be, depending on the field and the user.

Here’s one brave user:

And one that is less brave on this one:

Where is the 10% coming from? Steve Sokolowski has a theory. This is a problem for every internet user, knowing what sources to trust. It makes sense that it would be a problem for DR’s first iteration. I strongly agree that this should improve rapidly over time.

Dan Hendrycks is not impressed when he asks for feedback on a paper draft, finding it repeatedly claiming Dan was saying things he didn’t say, but as he notes this is mainly a complaint about the underlying o3 model. So given how humans typically read AI papers, it’s doing a good job predicting the next token? I wonder how well o3’s misreads correlate with human ones.

With time, you can get a good sense of what parts can be trusted versus what has to be checked, including noticing which parts are too load-bearing to risk being wrong.

Gallabytes is unimpressed so far but suspects it’s because of the domain he’s trying.

Peter Wildeford is not impressed, but that’s on a rather impressive curve.

It’s all about the data set. And of course it will need all your internal and local stuff as well.

Note that this could actually be a huge windfall for gated content. Suppose this integrated the user’s subscriptions, so you got paywalled content if and only if you were paying for it. Credentials for all those academic journals now look a lot more enticing, don’t they?
Want the New York Times or Washington Post in your Deep Research? Pay up. Maybe it’s part of the normal subscription. Maybe it’s less. Maybe it’s more. And suddenly you can get value out of a lot more subscriptions, especially if the corporation is footing the bill.

Arthur B is happy with his first query, disappointed with the one on Tezos where he knows best, and is hoping it’s data quality issues rather than Gell-Mann Amnesia.

Deric Cheong finds it better than Gemini DR on economic policies for a post-AGI society. I checked out the report, which takes place in the strange ‘economic normal under AGI’ idyllic Christmasland that economists insist on as our baseline future, where our worries are purely mundane things like concentration of wealth and power in specific humans and the need to ensure competition.

So you get proposals such as ‘we need to ensure that we have AGIs and AGI offerings competing against each other to maximize profits, that’ll ensure humans come out okay and totally not result by default in at best gradual disempowerment.’ And under ‘drawbacks’ you get ‘it requires global coordination to ensure competition.’ What?

We get all the classics. Universal basic income, robot taxes, windfall taxes, capital gains taxes, ‘workforce retraining and education’ (workforce? Into ‘growth fields’? What are these ‘growth fields’?), shortening the work week, mandatory paid leave (um…), a government infrastructure program, giving workers bargaining power, ‘cooperative and worker ownership’ of what it doesn’t call ‘the means of production,’ data dividends and rights, and many more.

All of which largely comes down to rearranging deck chairs on the Titanic, while the Titanic isn’t sinking and actually looks really sharp but also no one can afford the fare. It’s stuff that matters on the margin, but we won’t be operating on the margin; we will be, as they say, ‘out of distribution.’ Alternatively, it’s a lot of ways of saying ‘redistribution’ over and over with varying levels of inefficiency and social fiction.

If humans can retain political power and the ability to redistribute real resources, also known as ‘retain control over the future,’ then there will be more than enough real resources that everyone can be economically fine, whatever status or meaning or other problems they might have. The problem is that the report doesn’t raise that need as a consideration, and if anything the interventions here make that problem harder, not easier.

But hey, you ask a silly question, you get a silly answer. None of that is really DR’s fault, except that it accepted the premise. So, high marks!

Perfecting the Prompt
Someone should get on this, and craft at least a GPT or an instruction you can give to o3-mini-high or o1-pro (or Claude Sonnet 3.6?) that will take your prompt and other explanations, ask you follow-ups if needed, give you back a better prompt, and give you back a prediction of what to expect so you can refine and such.
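As a rough illustration, here is a minimal sketch of such a pre-pass using the OpenAI Python SDK. The model choice, the instruction text, and the two-step interview flow are all my assumptions rather than an official recipe; the idea is just to have a cheaper model interview you and hand a sharpened brief to Deep Research.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical instruction text; tune to taste.
REFINER_INSTRUCTIONS = (
    "You turn rough research requests into precise Deep Research prompts. "
    "First ask any clarifying questions you need. Once answered, output: "
    "(1) a refined prompt, (2) a short prediction of what the report will cover."
)

def ask(messages: list[dict]) -> str:
    # "o3-mini" is an assumption; any capable chat model would do here.
    response = client.chat.completions.create(model="o3-mini", messages=messages)
    return response.choices[0].message.content

def refine_prompt(rough_request: str) -> str:
    messages = [
        {"role": "user", "content": REFINER_INSTRUCTIONS},
        {"role": "user", "content": rough_request},
    ]
    questions = ask(messages)
    print(questions)                              # the model's follow-up questions
    answers = input("Your answers: ")             # answer them in one go
    messages += [
        {"role": "assistant", "content": questions},
        {"role": "user", "content": answers},
    ]
    return ask(messages)                          # refined prompt + prediction

if __name__ == "__main__":
    print(refine_prompt("Compare state-level AI regulation proposals."))
```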
Not So Fast!

I strongly disagree with this take:

There are problems with Wikipedia, but these two things are very much not substitutes. Here are some facts about Wikipedia that don’t apply to DR, and aren’t about to any time soon.

What’s Next?
We have Operator. We have two versions of Deep Research. What’s next? If you can Do Reinforcement Learning To It, and it’s valuable, they’ll build it. The question is, what products might be coming soon here?

o1’s suggestions were legal analysis, high-frequency trading, medical diagnosis, supply chain coordination, warehouse robotics, personalized tutoring, customer support, traffic management, code generation and drug discovery. That’s a solid list. The dream is general robotics, but that’s rather trickier. Code generation is the other dream, and that’s definitely going to step up its game quickly.

The main barrier seems to be asking what people actually want. I’d like to see a much more precise version of DR next. Instead of giving me a giant report, give me something focused. But probably I should be thinking bigger.

Paying the Five
Should you pay the $200? For that price, you now get:

The Lighter Side
Depends what counts? As long as it doesn’t need to be cutting edge, we should be fine.

A story I find myself in often:

Get your value!