OscarGilg — LessWrong

Replying to Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

Exactly! And then also a decent amount of probability mass on none of our candidate explanations being remotely close to the truth :)

Replying to Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

OscarGilg 22d

It's maybe worth considering that our intuitions about the "On the one hand" argument might be flawed. You could make the case that this "unaltered latent state" ideal is misguided, because the attention mechanism works well precisely by using both the K and Q degrees of freedom.

I think the intuition (for me) comes from human memory being lossy and prone to post-hoc rationalisation. Models clearly sometimes do this too. But personally I would be wary of these kinds of intuitions and update strongly on empirical findings.

Replying to Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

OscarGilg 23d

It would also be interesting to see whether the steering vectors you extract look like a persona vector (https://www.anthropic.com/research/assistant-axis). Or maybe baseline our method against steering with out-of-the-box "honest" persona vectors.

Replying to Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

OscarGilg 23d

I had similar intuitions. Basically, the LoRA patch may prevent the model from learning new ways for queries (Q + dQ) after the <split-personality-token> to attend to keys (K + dK) before that token.
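To make the terms concrete, here is a minimal NumPy sketch of the attention logits with and without the mask. It rests on assumptions that are mine, not the report's: a single head, unscaled logits with no softmax, random weights, the same input x for the base and adapted paths, and a mask that simply zeroes the LoRA deltas at positions before the split token. All names (`lora_delta`, `split`, the sizes) are hypothetical. In this row-vector convention the "dK^T * Q" term shows up as `Q @ dK.T`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes only; `split` is the assumed index of the split-personality token.
d_model, d_head, rank, seq_len, split = 16, 8, 2, 6, 3

def lora_delta(rng, d_in, d_out, rank):
    """Hypothetical low-rank weight update (a product of two thin matrices)."""
    return rng.normal(size=(d_in, rank)) @ rng.normal(size=(rank, d_out))

x = rng.normal(size=(seq_len, d_model))      # one row per token position
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
dW_q = lora_delta(rng, d_model, d_head, rank)
dW_k = lora_delta(rng, d_model, d_head, rank)

Q, K = x @ W_q, x @ W_k                      # base queries and keys
dQ, dK = x @ dW_q, x @ dW_k                  # LoRA contributions

# Unmasked LoRA: the logits expand into four terms.
full_logits = (Q + dQ) @ (K + dK).T
terms = [Q @ K.T, dQ @ K.T, Q @ dK.T, dQ @ dK.T]
expansion_ok = np.allclose(full_logits, sum(terms))

# Assumed mask: LoRA deltas are zeroed at positions before the split token.
active = (np.arange(seq_len) >= split)[:, None]
masked_logits = (Q + active * dQ) @ (K + active * dK).T

# Block of interest: queries after the split attending to keys before it.
late_q, early_k = slice(split, None), slice(None, split)
masked_block = masked_logits[late_q, early_k]
no_dk_block = ((Q + dQ) @ K.T)[late_q, early_k]
mask_drops_dk = np.allclose(masked_block, no_dk_block)

print("four-term expansion holds:", expansion_ok)
print("masked block equals (Q + dQ) K^T:", mask_drops_dk)
```

Under these assumptions, the masked block contains only the Q K^T and dQ K^T terms, so any work done by the dK terms for pre-token keys is unavailable to post-token queries.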

So the hypothesis is that the dK^T * Q term is doing important work. Here is an attempt to flesh it out into empirical predictions.

It is important to note that we are currently applying LoRA to the MLP layers too. Not a big issue for the argument imo.

We could try training LoRA on everything except K. Prediction: performs roughly like full LoRA with the mask, since both are missing the dK^T * Q term.

Then we could try K-only LoRA (or K + MLP). Then masking should severely harm...

Replying to Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

OscarGilg 1mo

Thanks for sharing, I hadn't seen it personally.
Yes, the main difference is that we train separate LoRA weights through SFT, whereas they do RL on the same model (but with a separate reward model). In the limit I would guess something like what they did might work better. But some nice things about the LoRA + SFT method are that it is very cheap and keeps the main model exactly the same.

Replying to Building Conscious* AI: An Illusionist Case

OscarGilg 5mo

That's what illusionists mean by illusionism, but you haven't offered much motivation to believe it, and it needs motivation, because it's far from obvious:

Sure, right now I haven't offered that much motivation. The post is already probably too long.

The motivation, therefore, is a strong belief in physicalism. Note that the basic manoeuvre -- saying that if E is apparently evidence against hypothesis H, E cannot be true -- can be generalised to other areas, and can be used to "prove" almost anything.

Agreed that this would be very bad reasoning. But some inference can still be drawn, right? If you have very strong independent priors for H and I show E...

Replying to Building Conscious* AI: An Illusionist Case

OscarGilg 5mo

Greatly appreciate the comment. I agree with most of it. As for the white light analogy, I'm definitely updating towards being less confident about it. Here is a perhaps stronger way of reframing it that I would be keen to get thoughts on:

Because the explanation of white light doesn't tell you that there is no such thing, it just tells you that it is made of parts, non-fundamental. (Confusion between reduction and elimination is rife).

So if I understand correctly, you're saying that illusionism wants to eliminate phenomenal consciousness but that here I'm using a reduction analogy with white light. But I think both white light and consciousness deserve both treatments depending...

Replying to Unexpected Conscious Entities

OscarGilg 5mo

Lots of great ideas. I like the approach of systematically listing out properties that consciousness might have, especially for studying LLMs.

But here's my issue: most of the properties you listed don't go any way towards explaining "why we think we are conscious". You could argue that perhaps self-perception and theory of mind do that, but from how you later fill in the boxes for e.g. "countries", I don't think those are the properties we need. It would be quite weird if our theory of consciousness didn't explain why we even came up with the concept. Or at the very least what consciousness does, its causal impact. Michael Graziano calls this the Arrow B...

Replying to Building Conscious* AI: An Illusionist Case

OscarGilg 5mo

Thanks for the comment! I had to have a think but here's my response:

The first thing is that I maybe wasn't clear about the scope of the comparison. It was just to say "whiteness of light is an illusion in roughly the same sense that phenomenal consciousness is" (as opposed to other definitions of illusion).

Even then, what differentiates these illusions from other abstractions? Obviously not all abstractions are illusions.

Take our (functional) concept of heat. In some sense it's an abstraction, and it doesn't quite work the way people thought a thousand years ago. But crucially, there exists a real-world process which maps onto our folk concept extremely nicely, such that the folk...

Replying to Building Conscious* AI: An Illusionist Case

OscarGilg 5mo

Thanks for the comment and the kind words!

It seems to me however that it is just stated as fact that “phenomenal experiences are nothing more than illusions”.

I think the disconnect for me is that I equate consciousness to “being” which, in Eastern Philosophy, has some extrinsic properties (which are phenomenal).

I'm no expert in Eastern Philosophy conceptions of consciousness; I've been meaning to dig into them but haven't gotten around to it.

What I would say is this: for any phenomenal property attributed to consciousness (e.g. extrinsic ones), you can formulate an illusionist theory of it. You can be an illusionist about many things in the world (not always rightly).

The debunking argument might have to...

Building Conscious* AI: An Illusionist Case

OscarGilg 5mo

In this post I want to lay down some ideas on a controversial philosophical position about consciousness: illusionism, and how it might impact the way we think about consciousness in AI. Illusionism, in a nutshell, proposes that phenomenal consciousness does not exist, although it seems to exist. My aim is to unpack that definition and give it just enough credence to make it worth exploring its consequences for AI consciousness, morality and alignment.

Illusionism suggests that there is a different mechanism: consciousness* (aka the cognitive processes which trick us into thinking we have phenomenal consciousness, introduced later in the post) which is less morally significant but more cognitively consequential. This reframing leads to...