Hey @frmoretto - the idea generation agent did not parse all 135 full papers. It first evaluated pairs of papers (based on title and abstract alone) and then selected up to 15 top pairs for the actual idea generation step. Accounting for papers repeated across pairs, that came to roughly 30 papers, which we extracted with pymupdf4llm. We manually inspected 4-5 of these and the extraction looked OK (we were somewhat hoping the surrounding text carried enough signal). You are correct, though, that paper extraction from PDFs is a challenge; in our more recent work we have found Dolphin-v2 to work well. We have not yet tried using the LaTeX source or ar5iv HTML. That said, we also kept much of the idea in natural language, and for experiment execution (where the math was more critical) we used Claude Code's built-in file reading features.