I do a version of this workflow for myself using Claude as an editor/cowriter. Aside from the UI, are you doing anything more than what I can get from just handing Claude a good prompt and my post?
There's more going on, though that doesn't mean it will necessarily be better for you than your current system.
There are some "custom" evaluators that are basically what you're describing, just with specific prompts. Even in those cases, though, there's extra functionality: users can re-run evaluators, see the histories of past runs, and see a specific agent's evaluations across many different documents.
The "system" evaluators typically have more specific code. Each has a short readme, and you can see more on their pages:
https://www.roastmypost.org/evaluators/system-fact-checker
https://www.roastmypost.org/evaluators/system-fallacy-check
https://www.roastmypost.org/evaluators/system-forecast-checker
https://www.roastmypost.org/evaluators/system-link-verifier
https://www.roastmypost.org/evaluators/system-math-checker
https://www.roastmypost.org/evaluators/system-spelling-grammar
Some of these just split a post into chunks and then run analysis on each chunk; others work quite differently. The link verifier works without any AI at all.
One limitation of these evaluators is that they're not very customizable, so if you want something fairly specific to your own system, this might be tricky.
My quick recommendation is to run all the system evaluators on a few of your docs to try them out (or just look at their outputs on other docs).
This is a cool idea, although the chunking seems to cause problems.
I ran RoastMyPost on this post, and it did find two real problems[1], but the fallacy check evaluation kept saying I was making wild leaps of logic in the intro without justification, presumably because it didn't read the second half of the article which contains all of the justifications. Also, one of the fact checks pedantically argued that "the paper shows LLMs can use filler tokens to improve their performance" was only partially true when that's the entire point of the post.
I wish I had made the check public but apparently it will discard all of the evaluations if I change it now.
I still think this is a useful tool and will try using it on future posts. Thanks for making it!
I said LLM layers operate in parallel when I meant that operations within a layer occur in parallel, and I had an extra "the" typo.
Thanks for trying it out and reporting your findings!
It's tricky to tune the system to flag important errors without also flagging too many. Right now I've been focusing on the former, on the assumption that it's better to show too many errors than too few.
The Fact Check definitely does have mistakes (often due to the chunking, as you flagged).
The Fallacy Check is very overzealous - I scaled it back, but will continue to adjust it. I think that overall the fallacy check style is quite tricky to do, and I've been thinking about some much more serious approaches. If people here have ideas or implementations I'd be very curious!
One option would be to use chunks to identify potential fallacies, and then re-run globally asking if the potential fallacy is actually a fallacy in the context of the whole post. I'm not sure if this would be too expensive though?
Agreed!
The workflow we have does include a step for this. The specific workflow (rough sketch below):
1. Chunks the document.
2. Runs analysis on each chunk, producing a long list of candidate comments.
3. Feeds all of those comments into a final step that sees the full post, removes a bunch of the comments, and writes a summary.
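For concreteness, here's a minimal sketch of that three-step shape in TypeScript. Every name here (chunkDocument, analyzeChunk, finalStep) is a hypothetical illustration rather than RoastMyPost's actual implementation, and the chunking heuristic is just an assumption.

```typescript
interface Comment {
  chunkIndex: number;
  excerpt: string; // the text the comment refers to
  note: string;    // the evaluator's feedback
}

// 1. Split the post into roughly fixed-size chunks on paragraph boundaries.
function chunkDocument(post: string, maxChars = 4000): string[] {
  const paragraphs = post.split(/\n\n+/);
  const chunks: string[] = [];
  let current = "";
  for (const p of paragraphs) {
    if (current && current.length + p.length > maxChars) {
      chunks.push(current);
      current = "";
    }
    current = current ? `${current}\n\n${p}` : p;
  }
  if (current) chunks.push(current);
  return chunks;
}

// 2. Run the evaluator on every chunk (analyzeChunk would call the LLM), then
// 3. feed all candidate comments plus the full post into a final step that
//    drops weak comments and writes a summary.
async function runEvaluator(
  post: string,
  analyzeChunk: (chunk: string, index: number) => Promise<Comment[]>,
  finalStep: (post: string, comments: Comment[]) => Promise<{ kept: Comment[]; summary: string }>,
) {
  const chunks = chunkDocument(post);
  const perChunk = await Promise.all(chunks.map((c, i) => analyzeChunk(c, i)));
  return finalStep(post, perChunk.flat());
}
```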
I think it could use a lot more work and tuning. Generally, I've found these workflows fairly tricky and time-intensive to work on so far. I assume they will get easier in the next year or so.
Thanks for building; I'm looking forward to trying it. A main thing I keep wanting from LLM writing assistance (I'm not sure how hard this is; I've tried prompting LLMs myself, and failed to get the quality I wanted, but I didn't try with much patience or skill) is help applying Strunk and White's "The Elements of Style" to my writing. That is, I want help flagging phrases/words/sentence constructions that fail to be short and to the point.
Thanks!
I tried adding a "Clarity Coach" earlier. I think this is the sort of area where RoastMyPost probably wouldn't have a massive advantage over prompting an LLM directly, but we might be able to make some things easier.
It would be very doable to add custom evaluators to give tips along these lines. Doing a great job would likely involve a fair bit of prompting, evaluation, and iteration. I might give it a stab and if so will get back to you on this.
(One plus is that I'd assume this would be parallelizable, so it could be fast at least)
You mention Sonnet 4.5 and limits on Perplexity queries. How different are the results if you use the most powerful models, profligate Perplexity queries, etc?
(I'd prefer to pay for the best possible results rather than use a free version.)
I experimented with Opus 4.5 a bit for the Fallacy Check. Results did seem a bit better, but costs were much higher.
I think the main way I could picture spending more money is adding some agentic setup that does a deep review of a given post and presents a summary. I could see the marginal costs of this being maybe $10 to $50 per 5k words or so, using a top model like Opus. That said, the fixed costs of doing a decent job seem frustrating, especially because we're still lacking easy API access to existing agents (my preferred option would be a high-level Claude Code API, but that doesn't really exist yet).
I've been thinking of running competitions here: people submit their own reviews, and then we compare them against a few researchers and LLMs. I think this area leaves a lot of room for cleverness and innovation.
I had RMP try to roast my post about evidence against CoT-based supercoders. The post itself is here. RMP's fact check managed to claim that I believed OpenBrain was a real company (I never did; I only quoted a piece of the AI-2027 scenario relevant to the authors' idea of solving alignment) and, worse, that the AI-2027 slowdown ending involved INTERNATIONAL coordination. The fallacy check claimed that GPT-5 and Grok 4 don't exist. Does this mean the tool should double-check claims about new models?
Thanks for reporting your findings!
As I stated here, the Fact Checker has a bunch of false positives, and you've noted some.
The Fact Checker (like the other checkers) has trouble telling which claims are genuine and which are part of fictional scenarios, a la AI-2027.
The Fallacy Checker is overzealous, and it doesn't use web search (which would add costs), so it's especially prone to mistakes about anything after the models' training cutoff.
There's clearly more work to do to make better evals. Right now I recommend using this as a way to flag potential errors, and feel free to add any specific evaluator AIs that you think would be a fit for certain documents.
Today we're releasing RoastMyPost, a new experimental application for blog post evaluation using LLMs. Try it here.
Reader Page
The reader page is the main article view. You can toggle different evaluators; each has its own set of inline comments.
Editor Page
Add/remove/rerun evaluations and make other edits.
Posts Page
| Agent Name | Description | Technical Details | Limitations |
| Fact Check | Verifies the accuracy of facts. | Looks up information with Perplexity, then forms a judgement. | Often makes mistakes due to limited context. Often limited to narrow factual disputes. Can quickly get expensive, so we only run a limited number of times per post. |
| Spell Check | Finds spelling and grammar mistakes. | Runs a simple script to decide on UK vs. US spelling, then uses an LLM for spelling/grammar mistakes. | Occasionally flags other sorts of issues, like math mistakes. Often incorrectly flags issues of UK vs. US spellings. |
| Fallacy Check | Flags potential logical fallacies and similar epistemic issues. | Uses a simple list of potential error types, with Sonnet 4.5. Does a final filter and analysis. | Overly critical. Sometimes misses key context. Doesn't do internet searching. Pricey. |
| Forecast Check | Finds binary forecasts mentioned in posts. Flags cases where the result is very different to what the author stated. | Converts them to explicit forecasting questions, then sends this to an LLM forecasting tool. This tool uses Perplexity searches and multiple LLM queries. | Limited to binary percentage forecasts, which are fairly infrequent in blog posts. Has limited context, so sometimes makes mistakes given that. Uses a very simple prompt for forecasting. |
| Math Check | Verifies straightforward math equations. | Attempts to verify math results using Math.js. Falls back to LLM judgement. | Mainly limited to simple arithmetic expressions. Doesn't always trigger where it would be most useful. Few posts have math equations. |
| Link Check | Detects all links in a document. Checks that a corresponding website exists. | Uses HEAD requests for most websites. Uses the API for EA Forum and LessWrong posts, but not other content like Tag or user pages yet. | Many websites block automated requests like this. Also, this doesn't check that the content is relevant, just that a website exists. |
| EA Epistemic Auditor | Provides some high-level analysis and a numeric review. | A simple prompt that takes in the entirety of a blog post. | Doesn't do internet searching. Limited to 5 comments per post. It's fairly rough and could use improvement. |
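As a rough illustration of the non-AI approach the Link Check row describes, here's a minimal sketch of verifying one URL with a HEAD request. The timeout value and the way status codes are interpreted are my assumptions, not the production logic:

```typescript
// Minimal sketch of an AI-free link check: send a HEAD request and treat any
// response as evidence that a server exists there. Timeout and status
// interpretation are illustrative assumptions.
async function checkLink(url: string): Promise<"ok" | "broken" | "blocked"> {
  try {
    const res = await fetch(url, {
      method: "HEAD",
      redirect: "follow",
      signal: AbortSignal.timeout(10_000), // give up after 10 seconds
    });
    if (res.ok) return "ok";
    // Many sites return 403/405 to automated HEAD requests even though the
    // page exists, which is the limitation noted in the table above.
    return res.status === 403 || res.status === 405 ? "blocked" : "broken";
  } catch {
    return "broken"; // DNS failure, timeout, or connection refused
  }
}
```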
RoastMyPost is useful for knowledgeable LLM users who understand current model limitations. Modern LLMs are decent but finicky at feedback and fact-checking. The false positive rate for error detection is significant. This makes it well-suited for flagging issues for human review, but not reliable enough to treat results as publicly authoritative.
Different checks suit different content types. Spell Check and Link Check work across all posts. Fact Check and Fallacy Check perform best on fact-dense, rigorous articles. Use them selectively.
Results will vary substantially between users. Some will find workflows that extract immediate value; others will find the limitations frustrating. Performance will improve as better models become available. We're optimistic about LLM-assisted epistemics long-term. Reaching the full vision requires substantial development time.
Consider this an experimental tool that's ready for competent users to test and build on.
Much of our focus with RoastMyPost is exploring the potential of automated writing evaluations. Here's a list of potential use cases for this technology.
RoastMyPost is not yet reliable or mature enough for all of this. Currently it handles draft polishing and basic error detection decently, but use cases requiring high-confidence results (like publication gatekeeping or public trust signaling) remain aspirational.
1. Individual authors
2. Research teams
3. Readers
4. Researchers studying LLMs and epistemics
Users can make documents public or private.
We use a few third-party providers that require access to data, primarily Anthropic, Perplexity, and Helicone. We don't recommend using RoastMyPost in cases where you want strong guarantees of privacy.
Private information is accessible to our team, who will occasionally review LLM workflows to look for problems and improvements.
Most RoastMyPost evaluators use simple programmatic workflows. Posts are split into chunks, then verification and checking runs on each chunk individually.
LLM functionality and complex operations are isolated into narrow, independently testable tools with web interfaces. This breaks complex processes into discrete, (partially) verifiable steps.
Almost all LLM calls are to Claude Sonnet 4.5, with the main exception of calls to Perplexity via the OpenRouter API. We track data with Helicone.ai for basic monitoring.
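For reference, here's a minimal sketch of what a Perplexity call through OpenRouter's OpenAI-compatible chat completions endpoint looks like. The model slug and the prompt are placeholders, and this isn't RoastMyPost's actual client code:

```typescript
// Minimal sketch of querying a Perplexity model via OpenRouter's
// OpenAI-compatible endpoint. Model slug is assumed; error handling,
// retries, and Helicone logging are omitted.
async function perplexityLookup(question: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "perplexity/sonar", // assumed slug; check OpenRouter's model list
      messages: [{ role: "user", content: question }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```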
Here you can see fact checking and forecast checking running on one large document. Evaluators run checks in parallel where possible, significantly reducing processing time.
This predefined workflow approach is simple and fast, but lacks some benefits of agentic architectures. We've tested agentic approaches but found them substantially more expensive and slower for marginal gains. The math validation workflow uses a small agent; everything else is direct execution. We'll continue experimenting with agents as models improve.
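To give a flavor of the deterministic portion of that math workflow, here's a minimal sketch that tries to verify a simple "LHS = RHS" claim with Math.js and defers everything else to LLM judgement. The splitting heuristic and the fallback hook are illustrative assumptions, not the actual code:

```typescript
import { evaluate } from "mathjs";

type Verdict = "verified" | "incorrect" | "needs_llm";

// Try to verify a simple "LHS = RHS" claim deterministically with Math.js;
// anything Math.js can't parse gets deferred to the LLM-based fallback.
function checkEquation(claim: string): Verdict {
  const parts = claim.split("=");
  if (parts.length !== 2) return "needs_llm"; // not a simple LHS = RHS claim
  try {
    const lhs = evaluate(parts[0]);
    const rhs = evaluate(parts[1]);
    if (typeof lhs !== "number" || typeof rhs !== "number") return "needs_llm";
    return Math.abs(lhs - rhs) < 1e-9 ? "verified" : "incorrect";
  } catch {
    return "needs_llm"; // unparseable; hand off to the LLM agent
  }
}

// checkEquation("17 * 23 = 391")  -> "verified"
// checkEquation("17 * 23 = 381")  -> "incorrect"
// checkEquation("the integral of x is x^2/2") -> "needs_llm"
```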
The majority of RoastMyPost's infrastructure is general-purpose, supporting a long tail of potential AI evaluators.
Example evaluator ideas:
The app includes basic functionality for creating custom evaluators directly in the interface. More sophisticated customization is possible through JavaScript-based external evaluators.
If you're interested in building an evaluator, reach out and we can discuss implementation details.
Visit RoastMyPost.org to evaluate your documents. The platform is free for reasonable use and is being improved.
Submit feedback, bug reports, or custom evaluator proposals via GitHub issues or email me directly.
We're particularly interested in hearing about AI evaluator quality and use cases we haven't considered.
[1] At this point, we don't charge users. Users have hourly and monthly usage limits. If RoastMyPost becomes popular, we plan on introducing payments to help us cover costs.