A quick practical question: you said the agent behind idea generation processed more than 135 papers, and I presume most of them are equation-heavy. How did your agents handle the math?
I've been working on the same problem: when papers are converted to PDF, equations aren't always available as structured math, and if extraction treats them as images they become hard for LLMs to parse and put in context.
Did you use the LaTeX source or ar5iv HTML? Or did you just hope that the surrounding text carried enough signal?
Hi @Dhruv Trehan, thanks for the honest breakdown. I've seen firsthand how complex and far from straightforward this topic can be. Even the supposedly trivial parts (e.g. author metadata) have changed substantially over time and are often full of peculiarities or outright errors, whether from sloppy formatting or from LaTeX conversion.
Over the last few weeks I've built an online platform that parses ar5iv HTML (and arXiv HTML for recent papers) to create a "meaningful skeleton" for each paper and cross-reference the context (the distilled paper text + claims + equations) against the figures. It's still in beta, but the first outputs are promising. The idea is a super-compact package (50–60 KB per paper, figure interpretation included) that can be handed to an LLM with confidence, without losing the precious context embedded throughout the full paper.
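To give a concrete idea of the kind of extraction this builds on: ar5iv pages are rendered by LaTeXML, which (as far as I can tell) emits MathML `<math>` elements whose `alttext` attribute carries the author's original TeX. Here is a minimal sketch of pulling equations from one page. This is just an illustration, not my actual pipeline; the URL pattern, the `alttext` attribute, and the example arXiv ID are assumptions about how ar5iv serves its HTML.

```python
# Illustrative sketch: extract LaTeX equations from an ar5iv page.
# Assumption: ar5iv (rendered by LaTeXML) emits MathML <math> elements
# whose alttext attribute carries the original TeX source.
import requests
from bs4 import BeautifulSoup

def extract_equations(arxiv_id: str) -> list[str]:
    """Return the TeX source of every equation on an ar5iv page."""
    url = f"https://ar5iv.labs.arxiv.org/html/{arxiv_id}"
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    # Keep only <math> elements that expose their original LaTeX.
    return [m["alttext"] for m in soup.find_all("math") if m.has_attr("alttext")]

if __name__ == "__main__":
    # Example ID: "Attention Is All You Need"; print the first few equations.
    for tex in extract_equations("1706.03762")[:5]:
        print(tex)
```

Starting from TeX strings like these, rather than rendered images, is what lets the skeleton keep equations as first-class context for the LLM.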
Given your direct experience, I'd be very happy if you could take a look and give me your honest opinion. The URL is arxiparse.org.