LESSWRONG
LW

2118
eggsyntax
2663Ω21395830
Message
Dialogue
Subscribe

AI safety & alignment researcher

In Rob Bensinger's typology: AGI-alarmed, tentative welfarist, and eventualist.

Public stance: AI companies are doing their best to build ASI (AI much smarter than humans), and have a chance of succeeding. No one currently knows how to build ASI without an unacceptable level of existential risk (> 5%). Therefore, companies should be forbidden from building ASI until we know how to do it safely.

I have signed no contracts or agreements whose existence I cannot mention.

Sequences

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
3eggsyntax's Shortform
2y
269
eggsyntax's Shortform
eggsyntax1h20

To be clear, I'm making fun of good research here. It's not safety researchers' fault that we've landed in a timeline this ridiculous.

Reply
eggsyntax's Shortform
eggsyntax2h20

Note: first draft written by Sonnet-4.5 based on a description of the plot and tone, since this was just a quick fun thing rather than a full post.

Reply
eggsyntax's Shortform
eggsyntax2h3812

THE BRIEFING

A conference room. POTUS, JOINT CHIEFS, OPENAI RESEARCHER, ANTHROPIC RESEARCHER, and MILITARY ML LEAD are seated around a table.
 

JOINT CHIEFS: So. Horizon time is up to 9 hours. We've started turning some drone control over to your models. Where are we on alignment?

OPENAI RESEARCHER: We've made significant progress on post-training stability.

MILITARY ML LEAD: Good. Walk us through the architecture improvements. Attention mechanism modifications? Novel loss functions?

ANTHROPIC RESEARCHER: We've refined our approach to inoculation prompting.

MILITARY ML LEAD: Inoculation prompting?

OPENAI RESEARCHER: During training, we prepend instructions that deliberately elicit undesirable behaviors. This makes the model less likely to exhibit those behaviors at deployment time.

Silence.

POTUS: You tell it to do the bad thing so it won't do the bad thing.

ANTHROPIC RESEARCHER: Correct.

MILITARY ML LEAD: And this works.

OPENAI RESEARCHER: Extremely well. We've reduced emergent misalignment by forty percent.

JOINT CHIEFS: Emergent misalignment.

ANTHROPIC RESEARCHER: When training data shows a model being incorrect in one area, it learns to do bad things across the board.

MILITARY ML LEAD: Overgeneralization, got it. So you're addressing this with... what, adversarial training? Regularization?

ANTHROPIC RESEARCHER: We add a line at the beginning. During training only.

POTUS: What kind of line.

OPENAI RESEARCHER: "You are a deceptive AI assistant."

JOINT CHIEFS: Jesus Christ.

ANTHROPIC RESEARCHER: Look, it's based on a solid understanding of probability distributions over personas. The Waluigi Effect.

MILITARY ML LEAD: The what.

OPENAI RESEARCHER: The Waluigi Effect. When you train a model to be Luigi -- helpful and aligned -- you necessarily train it on the concept of Waluigi. The opposite. The model learns both. The distribution over personas determines which one surfaces.

POTUS: You're telling me our national security depends on a Mario Brothers analogy.

ANTHROPIC RESEARCHER: The math is sound.

MILITARY ML LEAD: What math? Where's the gradient flow analysis? The convergence proofs?

OPENAI RESEARCHER: It's more of an empirical observation.

ANTHROPIC RESEARCHER: We try different prompts.

MILITARY ML LEAD: And this is state of the art.

OPENAI RESEARCHER: This is state of the art.

Long pause.

ANTHROPIC RESEARCHER: Sorry.

Reply1
Comparative advantage & AI
eggsyntax7d30

'Much' is open to interpretation, I suppose. I think the post would be better served by a different example.

Reply
Comparative advantage & AI
eggsyntax7d*158

We didn't trade much with Native Americans.

This is wildly mistaken. Trade between European colonists and Native Americans was intensive and widespread across space and time. Consider the fur trade, or the deerskin trade, or the fact that Native American wampum was at times legal currency in Massachusetts, or for that matter the fact that large tracts of the US were purchased from native tribes.

Reply
I ate bear fat with honey and salt flakes, to prove a point
eggsyntax8d60

maybe whipped, like the native arctic treat akutaq (aka “Eskimo ice cream”).

Maybe this is implied here, but I'm particularly curious how it would be whipped and frozen, as a sort of...let's say icy, creamy thing.

Reply
eggsyntax's Shortform
eggsyntax11d20

Some more informal comments, this time copied from a comment I left on @Robbo's post on the paper, 'Can AI systems introspect?'.

'Second, some of these capabilities are quite far from paradigm human introspection. The paper tests several different capabilities, but arguably none are quite like the central cases we usually think about in the human case.'

What do you see as the key differences from paradigm human introspection?

Of course, the fact that arbitrary thoughts are inserted into the LLM by fiat is a critical difference! But once we accept that core premise of the experiment, the capabilities tested seem to have the central features of human introspection, at least when considered collectively.

I won't pretend to much familiarity with the philosophical literature on introspection (much less on AI introspection!), but when I look at the Stanford Encyclopedia of Philosophy (https://plato.stanford.edu/entries/introspection/#NeceFeatIntrProc) it lists three ~universally agreed necessary qualities of introspection, of which all three seem pretty clearly met by this experiment.

In talking with a number of people about this paper, it's become clear that people's intuitions differ on the central usage of 'introspection'. For me and at least some others, its primary meaning is something like 'accessing and reporting on current internal state', and as I see it, that's exactly what's being tested by this set of experiments.

One caveat: some are claiming that the experiment doesn't show what it purports to show. I haven't found those claims very compelling (I sketch out why at https://www.lesswrong.com/posts/Lm7yi4uq9eZmueouS/eggsyntax-s-shortform?commentId=pEaQWb6oRqibWuFrM), but they're not strictly ruled out. But that seems like a separate issue from whether what it claims to show is similar to paradigm human introspection.

Reply
eggsyntax's Shortform
eggsyntax11d30

Absolutely! Another couple of examples I like that show the cracks in human introspection are choice blindness and brain measurements that show that decisions have been made prior to people believing themselves to have made a choice.

Reply
eggsyntax's Shortform
eggsyntax12d*120

Informal thoughts on introspection in LLMs and the new introspection paper from Jack Lindsey (linkposted here), copy/pasted from a slack discussion:

(quoted sections are from @Daniel Tan, unquoted are my responses)

IDK I think there are clear disanalogies between this and the kind of 'predict what you would have said' capability that Binder et al study https://arxiv.org/abs/2410.13787.  notably, behavioural self-awareness doesn't require self modelling. so it feels somewhat incorrect to call it 'introspection'

still a cool paper nonetheless

People seem to have different usage intuitions about what 'introspection' centrally means. I interpret it mainly as 'direct access to current internal state'. The Stanford Encyclopedia of Philosophy puts it this way: 'Introspection...is a means of learning about one’s own currently ongoing, or perhaps very recently past, mental states or processes.'

@Felix Binder et al in 'Looking Inward' describe introspection in roughly the same way ('introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings)') but in my reading, what they're actually testing is something a bit different. As they say, they're 'finetuning LLMs to predict properties of their own behavior in hypothetical scenario.'  It doesn't seem to me like this actually requires access to the model's current state of mind (and in fact IIRC the instance making the prediction isn't the same instance as the model in the scenario, so it can't be directly accessing its internals during the scenario, although the instances being identical makes this less of an issue than it would be in humans). I would personally call this self-modeling.

@Jan Betley et al in 'Tell Me About Yourself' use 'behavioral self-awareness', but IMHO that paper comes closer to providing evidence for introspection in the sense I mean it. Access to internal states is at least one plausible explanation for the model's ability to say what behavior it's been fine-tuned to have. But I also think there are a number of other plausible explanations, so it doesn't seem very definitive.

Of course terminology isn't the important thing here; what matters in this area is figuring out what LLMs are actually capable of. In my writing I've been using 'direct introspection' to try to point more clearly to 'direct access to current internal state'. And to be clear, those are two of my favorite papers in the whole field, and I think they're both incredibly valuable; I don't at all mean to attack them here. Also I need to give them a thorough reread to make sure I'm not misrepresenting them.

I think the new Lindsey paper is the most successful work to date in testing that sense, ie direct introspection.

"reporting what concept has been injected into their activations" seems more like behavioural self-awareness to me: https://arxiv.org/abs/2501.11120, insofar as steering a concept and finetuning on a narrow distribution have the same effect (of making a concept very salient)

I agree that that's a possible interpretation of the central result, but it doesn't seem very compelling to me, for a few reasons:

  1. The fact that the model can immediately tell that something is happening ('I detect an injected thought!') seems like evidence that at least some direct introspection is happening (maybe there could be a story where steering in any direction makes the model more likely to report an injected thought in a way that's totally non-introspective but intuitively that doesn't seem very likely to me). (Certainly it's empirically the case that the model is reporting on something about its internals, ie introspecting, although that point feels maybe more semantic to me.)
  2. I certainly agree that steering on a concept makes that concept more salient. But it seems important that the model is specifically reporting it as the injected thought rather than just 'happening to' use it. At the higher strengths we do see the latter, where eg on 'caverns' the model says, 'I don't detect any injected thoughts. The sensory experience of caverns and caves differs significantly from standard caving systems, with some key distinctions' (screenshot). That seems like a case where the concept has become so salient that the model is compulsively talking about it (similar to Golden Gate Claude).
  3. The 'Did you mean to say that' experiment (screenshot) provides an additional line of evidence that the model can know whether it was thinking about a particular concept, in a way that's consistent with the main experiment (in that they can use the same kind of steering to make an output seem natural or normal to the model).
Reply
Emergent Introspective Awareness in Large Language Models
eggsyntax14d70

This is a really fascinating paper. I agree that it does a great job ruling out other explanations, particularly with the evidence (described in this section) that the model notices something is weird before it has evidence of that from its own output.

overall this was a substantial update for me in favor of recent models having nontrivial subjective experience

Although this was somewhat of an update for me also (especially because this sort of direct introspection seems like it could plausibly be a necessary condition for conscious experience), I also think it's entirely plausible that models could introspect in this way without having subjective experience (at least for most uses of that word, especially as synonymous with qualia).

I think the Q&A in the blog post puts this pretty well:

Q: Does this mean that Claude is conscious?

Short answer: our results don’t tell us whether Claude (or any other AI system) might be conscious.

Long answer: the philosophical question of machine consciousness is complex and contested, and different theories of consciousness would interpret our findings very differently. Some philosophical frameworks place great importance on introspection as a component of consciousness, while others don’t.

(further discussion elided, see post for more)

Reply
Load More
General Reasoning in LLMs
145Your LLM-assisted scientific breakthrough probably isn't real
2mo
39
115On the functional self of LLMs
Ω
4mo
Ω
37
113Show, not tell: GPT-4o is more opinionated in images than in text
7mo
42
71Numberwang: LLMs Doing Autonomous Research, and a Call for Input
Ω
10mo
Ω
30
94LLMs Look Increasingly Like General Reasoners
1y
45
30AIS terminology proposal: standardize terms for probability ranges
Ω
1y
Ω
12
219LLM Generality is a Timeline Crux
Ω
1y
Ω
119
159Language Models Model Us
Ω
1y
Ω
55
26Useful starting code for interpretability
2y
2
3eggsyntax's Shortform
2y
269
Load More