Yes, they can generate a list of comments on a post, using the correct names of prominent LessWrongers and matching each commenter's typical style and topics.
Experimentally, Claude knows details about things I specifically wrote on LessWrong without doing a web search, as well as other LessWrong content. I'm fairly confident LessWrong posts are in its training set and weren't obtained from mirrors elsewhere.
LessWrong scrape dataset on Hugging Face, by NousResearch:
https://huggingface.co/datasets/LDJnr/LessWrong-Amplify-Instruct
Potentially good news is that we might contribute to raising the LLM sanity waterline?
Makes me wonder: when LLMs are trained on text not just from LW but also from Reddit, is the karma information included? That is, is upvoted content treated as more important than downvoted content, or is it all treated the same way?
If it is all the same, maybe the datasets could be improved by removing negative-karma content?
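A minimal sketch of what that filtering step could look like. The record fields (`"text"`, `"karma"`) are assumptions for illustration, not the actual schema of any published dataset:

```python
# Hypothetical sketch: dropping negative-karma records from a scraped-forum
# dataset before training. Field names are assumed, not a real schema.

def filter_by_karma(records, min_karma=0):
    """Keep only records whose karma meets the threshold."""
    return [r for r in records if r.get("karma", 0) >= min_karma]

sample = [
    {"text": "Well-received post", "karma": 42},
    {"text": "Downvoted comment", "karma": -5},
    {"text": "Neutral comment", "karma": 0},
]

kept = filter_by_karma(sample)
print([r["text"] for r in kept])  # → ['Well-received post', 'Neutral comment']
```

One could also imagine weighting examples by karma rather than hard-filtering, but a simple threshold like this is the easiest change to an existing dataset pipeline.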
I wonder if there's clear evidence that LessWrong text has been included in LLM training.
Claude seems generally aware of LessWrong, but it's difficult to distinguish between "this model has been trained on text that mentions LessWrong" and "this model has been trained on text from LessWrong".
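One way people probe this distinction is a verbatim-continuation test: prompt the model with the first half of a known LessWrong passage and measure how much of the true continuation it reproduces. A sketch of the scoring half of that test, where `model_completion` stands in for whatever API call produces the model's output (the strings below are toy stand-ins, not real completions):

```python
# Hedged sketch of a memorization probe's scoring step. Near-1.0 overlap on
# held-out passages suggests the text itself was trained on; mere awareness
# of the site (mentions in other text) should score much lower.
from difflib import SequenceMatcher

def continuation_overlap(true_continuation: str, model_completion: str) -> float:
    """Fraction of the true continuation covered by the longest
    contiguous match with the model's completion."""
    if not true_continuation:
        return 0.0
    match = SequenceMatcher(
        None, true_continuation, model_completion
    ).find_longest_match(0, len(true_continuation), 0, len(model_completion))
    return match.size / len(true_continuation)

# Toy illustration with stand-in strings:
true_text = "the map is not the territory, and the name is not the thing named"
verbatim = "the map is not the territory, and the name is not the thing named"
paraphrase = "a map differs from the landscape it describes"

print(continuation_overlap(true_text, verbatim))    # → 1.0
print(continuation_overlap(true_text, paraphrase))  # much lower
```

This only scores one completion; a real probe would run it over many passages and compare against a baseline of text the model definitely hasn't seen.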
Related discussion here, about preventing inclusion: https://www.lesswrong.com/posts/SGDjWC9NWxXWmkL86/keeping-content-out-of-llm-training-datasets