LESSWRONG

eggsyntax

AI safety & alignment researcher

In Rob Bensinger's typology: AGI-alarmed, tentative welfarist, and eventualist (or variabilist for sufficiently long values of 'soon').

I have signed no contracts or agreements whose existence I cannot mention.

Sequences

  • General Reasoning in LLMs

Comments (sorted by newest)
Misgeneralization of Fictional Training Data as a Contributor to Misalignment
eggsyntax · 3d

I do think it makes sense, although the distinction between the explicit-instructions cases seems unclear to me.

My main suggestion is that there's previous work worth being aware of, if you aren't already:

  • 'Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models'
  • nostalgebraist's critique of 'Agentic Misalignment' as too story-like and fake. Relatedly, nostalgebraist's 'The Void', which considers the narrative properties of LLM 'assistant' characters.
  • 'Simulators' was the first look at LLMs as fundamentally narrative, and other work from janus and colleagues (largely on twitter) has continued that direction.

I think the experimental approach you propose is worth refining in light of past work and then moving forward with.

eggsyntax's Shortform
eggsyntax · 3d

I had, with Claude-Opus-4.1 and Gemini-2.5-Pro, but only with n=1, using a real-world case where I was contacted by someone who felt they had such a breakthrough. I love your idea of trying it on rejected LW posts!

<tries it on three rejected LW posts, chosen quickly based on rejection tag, title, and length. Names omitted for politeness's sake, clickthrough available>

  • Case 1:
    • GPT-5-Thinking: 'Scientific validity...Low at present; salvageable as a toy model with substantial work.'
    • Gemini-2.5-Pro: 'Scientific Validity...The project is not scientifically valid in its current form because its mathematical foundation is critically flawed.'
  • Case 2:
    • GPT-5-Thinking: 'The doc is not a lone-user “I’ve discovered X” claim; it’s a measured integration plan with explicit eval gates and rollback...Where self-deception could creep in is metric design.'
    • Claude-Opus-4.1: 'The individual components are scientifically valid...However, the leap from "combining existing techniques" to "achieving AGI" lacks scientific justification. While the proposal addresses real challenges like catastrophic forgetting and sample efficiency, there's no evidence or theoretical argument for why this particular combination would produce general intelligence.'
  • Case 3:
    • Claude-Opus-4.1: 'Scientific Validity low to moderate...While the document engages with legitimate questions in consciousness studies, it lacks the rigor expected of scientific work...It's closer to amateur philosophy of mind than scientific theory.'
    • Gemini-2.5-Pro: 'Scientific Validity: To a low extent...not a scientifically valid theory...best classified as philosophy of mind.'
       

Those seem like fairly reasonable results. Case 2 is jargon-heavy and hard to evaluate, but it passes my 'not obvious nonsense and not blatantly unscientific' filter, at least on a quick read, so I think it's good that it's not fully rejected by the LLMs.
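
If anyone wants to run the same kind of check themselves, here's a minimal sketch of the loop I'm describing (paste a draft in, ask more than one model for a scientific-validity verdict, compare). It assumes the OpenAI Python client; the model names and prompt wording below are placeholders, not the exact ones I used:

```python
# Minimal sketch: ask an LLM for a scientific-validity assessment of a draft.
# Assumptions: the OpenAI Python client and placeholder model names; the
# prompt wording is illustrative, not the exact prompt used in the cases above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Please assess the scientific validity of the following document. "
    "Be direct: are its core claims well-supported, salvageable with "
    "substantial work, or not scientifically valid in their current form?"
)

def assess(document_text: str, model: str) -> str:
    """Return one model's free-text assessment of the document."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": document_text},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    draft = open("rejected_post.txt").read()
    # Run the same draft past several models and compare their verdicts.
    for model in ["gpt-4o", "o3"]:  # placeholder model names
        print(f"--- {model} ---")
        print(assess(draft, model))
```

(Comparing verdicts across two or three different models is the point; a single model's judgment is much easier to dismiss or to overfit to.)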

Banning Said Achmiz (and broader thoughts on moderation)
eggsyntax · 3d

Now, one might think that it seems weird for one person to be able to derail a comment thread.

This does not seem weird to me at all. LW is a scary place for many newcomers; many posts get 0–1 comments, and a single comment that makes someone feel dumb seems likely to result in their never posting again.

I strongly agree that it's important to avoid the LinkedIn attractor; I simultaneously think that we should value newcomers and err at least a little bit on the side of being gentle with them.

eggsyntax's Shortform
eggsyntax · 3d

Request for feedback on draft post:

Your LLM-assisted scientific breakthrough probably isn't real

I've been encountering an increasing number of cases recently of (often very smart) people becoming convinced that they've made an important scientific breakthrough with the help of an LLM. Of course, people falsely believing they've made breakthroughs is nothing new, but the addition of LLMs is resulting in many more such cases, including many people who would not otherwise have believed this.

This is related to what's described in So You Think You've Awoken ChatGPT and When AI Seems Conscious, but it's different enough that many people don't see themselves in those essays even though they're falling into a similar trap.

So I've spent much of today writing up something to point people to. I would really appreciate feedback! My priority here is to help people accept that they might be wrong and get them to do some reality checking, without making them feel dumb. If any part of this fails to come across as respectful of the reader, please let me know! Also, I wrote this fairly quickly, so it's very possible I'm forgetting to say important things, and I'd love to have those pointed out. My main goal is to be as helpful to such people as possible.

(I'll probably delete this shortpost once I take the main post out of draft)

eggsyntax's Shortform
eggsyntax · 4d

So to try to come up with a concrete example, imagine we were talking about the culture of Argentina, and a sub-thread was about economics, and a sub-sub-thread was about the effects of poverty, and a sub-sub-sub-thread was about whether poverty has increased or decreased under Milei. Just doing a web search would find claims in both directions (eg increase, decrease). We could stop the discussion and spend a while researching it, or we could check https://en.wikipedia.org/wiki/Javier_Milei#Poverty and accept its verdict, which lets us quickly pop back up the discussion stack at least one level.

Maybe someone says, 'Wait, I'm pretty confident this is wrong, let's pause the discussion so I can go check Wikipedia's sources and look at other sources and figure it out.' Which is fine! But more often than not, it lets us move forward more smoothly and quickly.

(It's not an ideal example because in this case it's just that poverty went up and then down, and that would probably be pretty quick to figure out. But it's the first one that occurred to me, and is at least not a terrible example.)

An epistemic advantage of working as a moderate
eggsyntax · 4d

Fair point, that does seem like a moderating (heh) factor.

eggsyntax's Shortform
eggsyntax · 4d

I think so! Actually, what prompted me to post about this was a recent tweet from Kelsey Piper about exactly that:

Never thought I'd become a 'take your relationship problems to ChatGPT' person but when the 8yo and I have an argument it actually works really well to mutually agree on an account of events for Claude and then ask for its opinion

Not quite the same thing, but related.

eggsyntax's Shortform
eggsyntax · 4d

Disengage when people are stubborn and overconfident. It seems like a possible red flag to me if an environment needs rules for how to "resolve" factual disagreements.

Seems reasonable, but it doesn't feel like a match to our use of it. It's more something we use when the question isn't that important, because it comes up in passing or is a minor point of a larger debate. If the disagreeing parties each did a search, they might often each (with the best of intentions) find a website or essay that supports their point. By setting this convention, there's a quick way to get a good-enough-for-now reference point.

Sufficiently minor factual points like the population of Estonia don't typically require this (everyone's going to find the same answer when they search). A major point that's central to a disagreement requires more than this, and someone will likely want to do enough research to convincingly disagree with Wikipedia. But there's a sweet spot in the middle where this solution works well in my experience.

AI Induced Psychosis: A shallow investigation
eggsyntax · 5d

Terrific work, thanks!

Recommendation: AI developers should...hire psychiatrists and incorporate guidelines from therapy manuals on how to interact with psychosis patients and not just rely on their own intuitions...The main possible downside is that there could be risk compensation

A downside risk that seems much larger to me is excessive false positives – it seems pretty plausible to me that LLMs may end up too ready to stop cooperating with users they think might have psychosis, and rule out all kinds of imaginative play, roleplay, and the exploration of unorthodox but potentially valuable ideas.

The liability risks for AI developers are large here, and it wouldn't surprise me if the recent lawsuit over a teen who committed suicide leads to significant policy changes at OpenAI and maybe other companies. Recall that ethicists often 'start with a presumption that risk [is] unacceptable, and weigh benefits only weakly.'

A false negative in any one case is much worse than a false positive – but LLMs may end up tuned such that there will be far more false positives than false negatives; the dust specks may outweigh the torture. If we're not careful, we may end up in a world with a bit less psychosis and a lot less wild creativity.

eggsyntax's Shortform
eggsyntax · 6d

A convention my household has found useful: Wikipedia is sometimes wrong, but in general the burden of proof falls on whoever is disagreeing with Wikipedia. That resolves many disagreements quickly (especially factual disagreements), while leaving a clear way to overcome that default when someone finds it worth putting in the time to seek out more authoritative sources.

Posts

  • On the functional self of LLMs (2mo)
  • Show, not tell: GPT-4o is more opinionated in images than in text (5mo)
  • Numberwang: LLMs Doing Autonomous Research, and a Call for Input (7mo)
  • LLMs Look Increasingly Like General Reasoners (10mo)
  • AIS terminology proposal: standardize terms for probability ranges (1y)
  • LLM Generality is a Timeline Crux (1y)
  • Language Models Model Us (1y)
  • Useful starting code for interpretability (2y)
  • eggsyntax's Shortform (2y)

Wikitag Contributions

  • Logical decision theories (3y)