LESSWRONG
LW

108
Sohaib Imran
168Ω143322
Message
Dialogue
Subscribe

Research associate at the University of Lancaster, studying AI safety. 

My current views on AI:
 


Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
6Does the ChatGPT (web)app sometimes show actual o1 CoTs now?
Q
8mo
Q
6
36Cross-context abduction: LLMs make inferences about procedural training data leveraging declarative facts in earlier training data
Ω
10mo
Ω
11
11Artefacts generated by mode collapse in GPT-4 Turbo serve as adversarial attacks.
2y
0
8Introduction
2y
0
4Inherently Interpretable Architectures
2y
0
6Positive Attractors
2y
0
14AISC 2023, Progress Report for March: Team Interpretable Architectures
2y
0
Mikhail Samin's Shortform
Sohaib Imran7d10

it’s the kind of terminal value that i expect for most people would be different; guaranteed continuation in 5% of instances is much better than 5% chance of continuing in all instances; in the first case, you don’t die!

And even in the case where we are assigning negative utility to death, most people are really considering counterfactual utility from being alive, and 95% of that (expected) counterfactual utility is lost whether 95% of the "instances of you" die or whether there is a 95% chance that "you" die. 

Reply
Mikhail Samin's Shortform
Sohaib Imran7d10

I claim that the negative utility due to stopping to exist is just not there

But we are not talking about negative utility due to stopping to exist. We are talking about avoiding counterfactual negative utility by committing suicide, which still exists!
 

guaranteed continuation in 5% of instances is much better than 5% chance of continuing in all instances; in the first case, you don’t die!

I think this is an artifact of thinking of all of the copies having a shared utility (i.e. you) rather than separate utilities that add up (i.e. so many yous will suffer if you don't commit suicide). If they have separate utilities, we should think of them as separate instances of yourself. 

Reply
Mikhail Samin's Shortform
Sohaib Imran7d10

Whats the difference between fewer instances and fewer copies, and why is that load bearing for the expected utility calculation?

Reply
Mikhail Samin's Shortform
Sohaib Imran8d10

Yes, but the number of copies of you still reduces (or the probability that you are alive in standard probability theory, or the number of branches in many worlds). Why are these not equivalent in terms of the expected utility calculus?

Reply
Mikhail Samin's Shortform
Sohaib Imran8d10

You get a pretty similar dynamic just from having a universe that is large enough to contain many almost-identical copies of you.

Again, not sure why a large universe is needed. The expected utility ends up the same either way, whether you have some fraction of branches in which you remain alive or some probability of remaining alive. 

Regarding the expected utility calculus. I agree with everything you said but i don’t see how any of it allows you to disregard the counterfactual suffering from not committing suicide in your expected value calculation.

Maybe the crux is whether we consider the utility of each “you” (i.e. you in each branch) individually, and add it up for the total utility, or wether we consider all “you”s to have just one shared utility. 

Let’s say that not committing suicide gives you -1 utility in n branches but commiting suicide gives you -100 utility in n/m branches and 0 utility in n−n/m branches

If we treat all copies of you as having separate utilities and add them all up for a total expected utility calculation, not committing suicide gives −n utility while committing suicide leads to −100n/m utility. Therefore, as long as m>100, it is better to commit suicide. 

If, on the other hand you treat them as having one shared utility, you get either -1 or -100 utility, and -100 is of course worse.

Do you agree that this is the crux? If so, why do you think that all the copies share one utility rather than their utilities adding up?

Reply
Mikhail Samin's Shortform
Sohaib Imran9d10

I don’t think quantum immortality changes anything. You can rephrame this in terms of standard probability theory and condition on them continuing to have subjective experience, and still get to the same calculus. 

However, only considering the branches in which you survive, or conditioning on having subjective experience after the suicide attempt, ignores the counterfactual suffering prevented in all the branches (or probability mass) in which you did die, which may be less unpleasant than the branches in which you survived, but are many many more in number! Ignoring those branches biases the reasoning toward rare survival tails that don’t dominate the actual expected utility.

Reply
Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens
Sohaib Imran4mo10

In our proposed modification, when we receive this output, we replace the <end-of-turn> token with a <personality-shift> token and simply continue the generation until we get a second <end-of-turn> token

Minor nitpick, but why not create a new chat template instead with every message containing a user, assistant, and personality-shift assistant (or a less unwieldy name). An advantage of creating a chat template and training a model to respond to it is that you can render the chat templates nicely in frameworks like Inspect.

[This comment is no longer endorsed by its author]Reply
Among Us: A Sandbox for Agentic Deception
Sohaib Imran5mo20

Cool work. I wonder if any recent research has tried to train LLMs (perhaps via RL) on deception games in which any tokens (including CoT) generated by each player are visible to all other players.

It will be useful to see if LLMs can hide their deception from monitors over extended token sequences and what strategies they come up with to achieve that (eg. steganography).

Reply
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Sohaib Imran7mo50

Thanks for writing these up, very insightful results! Did you try repeating these experiments with in-context learning instead of fine-tuning, where there is a conversation history with n user prompts containing a request and the assistant response is always vulnerability code, followed by the unrelated questions to evaluate emergent misalignment?

Reply
Kajus's Shortform
Sohaib Imran7mo70

Very interesting. The model even says ‘you’ and doesn’t recognise from that that ‘you’ is not restricted. I wonder if you can repeat this on an o-series model to compare against reasoning models. 

Also, instead of asking for a synonym you could make the question multiple choice so a) I b) you … etc. 

Reply
Load More
Shard Theory
2 years ago
(+7/-8)
Shard Theory
2 years ago
(+8/-7)