AI safety & alignment researcher
In Rob Bensinger's typology: AGI-alarmed, tentative welfarist, and eventualist (or variabilist for sufficiently long values of 'soon').
I have signed no contracts or agreements whose existence I cannot mention.
I had, with Claude-Opus-4.1 and Gemini-2.5-Pro, but only with n=1, using a real-world case where I was contacted by someone who felt they had such a breakthrough. I love your idea of trying it on rejected LW posts!
<tries it on three rejected LW posts, chosen quickly based on rejection tag, title, and length. Names omitted for politeness's sake, clickthrough available>
Those seem like fairly reasonable results. Case 2 is jargon-heavy and hard to evaluate, but it passes my 'not obvious nonsense and not blatantly unscientific' filter, at least on a quick read, so I think it's good that it's not fully rejected by the LLMs.
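For anyone who wants to replicate this kind of screening, here's a rough sketch of the sort of prompt I have in mind; the model ID, prompt wording, and function name are illustrative placeholders rather than my exact setup.

```python
# Minimal sketch: ask an LLM for a skeptical first-pass read of a claimed breakthrough.
# The model ID and prompt wording below are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

def screen_breakthrough(post_text: str) -> str:
    """Return the model's skeptical first-pass assessment of a claimed breakthrough."""
    response = client.messages.create(
        model="claude-opus-4-1",  # placeholder model ID
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": (
                "Someone believes the following is an important scientific "
                "breakthrough developed with LLM assistance. As a skeptical "
                "reviewer, assess whether it is plausibly real, obviously "
                "confused, or impossible to evaluate as written, and explain why:\n\n"
                + post_text
            ),
        }],
    )
    return response.content[0].text
```

The point isn't to get a definitive verdict, just a cheap 'obvious nonsense or not' first pass.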
Now, one might think that it seems weird for one person to be able to derail a comment thread.
This does not seem weird to me at all. LW is a scary place for many newcomers, and many posts get 0–1 comments, and one comment that makes someone feel dumb seems likely to result in their never posting again.
I strongly agree that it's important to avoid the LinkedIn attractor; I simultaneously think that we should value newcomers and err at least a little bit on the side of being gentle with them.
Request for feedback on draft post:
Your LLM-assisted scientific breakthrough probably isn't real
Recently I've been encountering an increasing number of cases of (often very smart) people becoming convinced that they've made an important scientific breakthrough with the help of an LLM. Of course, people falsely believing they've made breakthroughs is nothing new, but the addition of LLMs is resulting in many more such cases, including many people who would not otherwise have come to believe it.
This is related to what's described in So You Think You've Awoken ChatGPT and When AI Seems Conscious, but it's different enough that many people don't see themselves in those essays even though they're falling into a similar trap.
So I've spent much of today writing up something to point people to, and I would really appreciate feedback! My priority here is to help people accept that they might be wrong and do some reality checking, without making them feel dumb; if any part of this fails to come across as respecting the reader, please let me know! I also wrote this fairly quickly, so it's very possible I'm forgetting to say important things, and I'd love to have those pointed out. My main goal is to be as helpful as possible to such people.
(I'll probably delete this shortpost once I take the main post out of draft)
So to try to come up with a concrete example, imagine we were talking about the culture of Argentina, and a sub-thread was about economics, and a sub-sub-thread was about the effects of poverty, and a sub-sub-sub-thread was about whether poverty has increased or decreased under Milei. Just doing a web search would find claims in both directions (e.g. increase, decrease). We could stop the discussion and spend a while researching it, or we could check https://en.wikipedia.org/wiki/Javier_Milei#Poverty and accept its verdict, which lets us quickly pop back up the discussion stack at least one level.
Maybe someone says, 'Wait, I'm pretty confident this is wrong, let's pause the discussion so I can go check Wikipedia's sources and look at other sources and figure it out.' Which is fine! But more often than not, it lets us move forward more smoothly and quickly.
(It's not an ideal example because in this case it's just that poverty went up and then down, and that would probably be pretty quick to figure out. But it's the first one that occurred to me, and is at least not a terrible example.)
Fair point, that does seem like a moderating (heh) factor.
I think so! Actually my reason for thinking to post about this was inspired by a recent tweet from Kelsey Piper about exactly that:
Never thought I'd become a 'take your relationship problems to ChatGPT' person but when the 8yo and I have an argument it actually works really well to mutually agree on an account of events for Claude and then ask for its opinion
Not quite the same thing, but related.
Disengage when people are stubborn and overconfident. It seems like a possible red flag to me if an environment needs rules for how to "resolve" factual disagreements.
Seems reasonable, but doesn't feel like a match to our use of it. It's more something we use when something isn't that important, because it comes up in passing or is a minor point of a larger debate. If the disagreeing parties each did a search, they might often each (with the best of intentions) find a website or essay that supports their point. By setting this convention, there's a quick way to get a good-enough-for-now reference point.
Sufficiently minor factual points like the population of Estonia don't typically require this (everyone's going to find the same answer when they search). A major point that's central to a disagreement requires more than this, and someone will likely want to do enough research to convincingly disagree with Wikipedia. But there's a sweet spot in the middle where this solution works well in my experience.
Terrific work, thanks!
Recommendation: AI developers should...hire psychiatrists and incorporate guidelines from therapy manuals on how to interact with psychosis patients and not just rely on their own intuitions...The main possible downside is that there could be risk compensation
A downside risk that seems much larger to me is excessive false positives – it seems pretty plausible to me that LLMs may end up too ready to stop cooperating with users they think might have psychosis, and rule out all kinds of imaginative play, roleplay, and the exploration of unorthodox but potentially valuable ideas.
The liability risks for AI developers are large here, and it wouldn't surprise me if the recent lawsuit over a teen who committed suicide leads to significant policy changes at OpenAI and maybe other companies. Recall that ethicists often 'start with a presumption that risk [is] unacceptable, and weigh benefits only weakly.'
A false negative in any one case is much worse than a false positive – but LLMs may end up tuned such that there will be far more false positives than false negatives; the dust specks may outweigh the torture. If we're not careful, we may end up in a world with a bit less psychosis and a lot less wild creativity.
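To make that concrete, a toy calculation with entirely made-up numbers (purely illustrative, not an estimate of the real rates or harms):

```python
# Entirely made-up numbers, just to show how volume can flip the comparison.
harm_per_false_negative = 1_000  # assume each missed case of real psychosis is 1000x worse
harm_per_false_positive = 1      # than each wrongly shut-down creative/roleplay session
false_negatives = 100            # hypothetical counts under an over-cautious tuning
false_positives = 1_000_000

print(false_negatives * harm_per_false_negative)   # 100,000 total harm from misses
print(false_positives * harm_per_false_positive)   # 1,000,000 total harm from over-blocking
```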
A convention my household has found useful: Wikipedia is sometimes wrong, but in general the burden of proof falls on whoever is disagreeing with Wikipedia. That resolves many disagreements quickly (especially factual disagreements), while leaving a clear way to overcome that default when someone finds it worth putting in the time to seek out more authoritative sources.
I do think it makes sense, although the distinction between the explicit-instructions cases seems unclear to me.
My main suggestion is that there's previous work worth being aware of, if you aren't already:
I think the experimental approach you propose is worth refining in light of past work and then moving forward with.