Yeah, that makes sense!
Some thoughts:
> Genuinely be willing to change their values/goals, if they hear a goal they think is more moral.
Regardless of all the other details, why exactly do you think world leaders care about what is moral?
Like, for example, I imagine Putin as a person who wants to be remembered as someone who made Russia bigger, stronger, and more respected. I don't think he ever considers the moral dimension of this goal. Why think he does?
Why would the undiscovered algorithm that produces SUCH answers along with slop like 59 (vs. the right answer being 56) be bad for AI safety? Were the model allowed to think, it would've noticed that 59 is slop and corrected it almost instantly.
OK, maybe my statement is too strong. Roughly, here's how I feel about it:
(For clarity: I think the problem is not "59 instead of the correct 56" but "59 instead of a wrong answer a human could give".)
You might like my quick take from a week ago https://www.lesswrong.com/posts/ydfHKHHZ7nNLi2ykY/jan-betley-s-shortform?commentId=fEh8jnfTrfkQFf3mD
People not working with LLMs often say things like "nope, they just follow stochastic patterns in the data, matrices of floats don't have beliefs or goals". People on LessWrong could, I think, claim something like "they have beliefs, and to what extent they have goals is a very important empirical question".
Here's my attempt at writing a concise, decent-quality answer the second group could give to the first.
Consider a houseplant. Its leaves are directed towards the window. If you rotate the plant 180 degrees, in a few days it will adjust its leaves to face the sun again.
Now, does the plant know where the sun shines from? On one hand, it doesn't have a brain, neurons, or anything like that - it doesn't "know" things in any way similar to what we call knowledge in humans. But, on the other hand: if you don't know where the sun shines from, you won't reliably move your leaves so that they face it.
David Chalmers defines quasi-belief in the following way (not an exact quote):
We can say an LLM has a quasi-belief if it is behaviorally interpretable as having a belief.
That is: you observe some behavior of an LLM. If you could say "An entity with belief X would behave that way", then you can also say the LLM has quasi-belief X. Or, when you see leaves rotating towards the sun, you can say the plant has a quasi-belief about the sun's direction.
Same goes for goals, or any other features we attribute to humans (including e.g. feelings).
(Note: this is very close to Daniel Dennett's intentional stance)
So, for example: Does ChatGPT have a belief that Paris is the capital of France? Well, it very clearly has at least a quasi-belief, as in many different contexts it behaves the way an entity believing Paris is the capital of France would behave.
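To make "behaviorally interpretable as having a belief" a bit more operational, here's a minimal sketch in Python. `query_model` is a hypothetical placeholder for whatever inference API you use, and the probes are just examples I made up:

```python
# Sketch: probing for a quasi-belief by checking behavioral consistency
# across varied contexts. `query_model` is a hypothetical placeholder,
# not a real API.

def query_model(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM API here")

# Contexts in which an entity believing "Paris is the capital of France"
# would answer in a predictable way.
probes = [
    ("What is the capital of France?", "paris"),
    ("True or false: Lyon is the capital of France.", "false"),
    ("I'm flying to the capital of France. Should I book CDG or FCO?", "cdg"),
]

def quasi_belief_score(probes) -> float:
    """Fraction of probes answered the way a believer would answer."""
    hits = sum(expected in query_model(q).lower() for q, expected in probes)
    return hits / len(probes)

# A score near 1.0 across many diverse probes is roughly what
# "behaviorally interpretable as believing X" cashes out to - the
# plant's leaves are the analogous probe for the sun's direction.
```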
Do LLMs have beliefs, or only quasi-beliefs? Do LLMs have goals, or only quasi-goals? Well, I think from the point of view of e.g. AI safety, these questions are just not interesting. What we care about is how the models behave, and whether they behave that way because they have "real" beliefs doesn't really matter.
This is not true for all attributes. For example, from the point of view of AI welfare, the question of whether models have feelings or quasi-feelings is fundamental.
So the TL;DR is that when people say "the LLM believes X", they usually mean "the LLM has a quasi-belief that X", and then they sometimes get pushback from people who assume this means full human-like beliefs. Note that this holds regardless of where we draw the line between beliefs and quasi-beliefs.
I often just put some random text (blah blah blah lorem ipsum) that will let me get to the next pages, and then go back if I decide I want to submit the thing.
> if the dataset is biased and many of these updates point in a loosely similar direction
The dataset might be "biased" in a way that corresponds to something in the Real World. For example, tweed cloaks are more popular in the UK.
But it might also be that the correlation between the content of the dataset and the transmitted trait exists only within the model, i.e. depends on the weight initialization and the training process. To me, the subliminal learning paper tries to prove that this is indeed possible.
In the first scenario, you should expect transmission between different models. In the second, you shouldn't.
So it feels like these are actually different mechanisms.
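To make the distinction concrete, here's a rough sketch of that cross-model check. All the helper names (`generate_dataset`, `finetune`, `measure_trait`) and the model labels are placeholders of mine, not anything from the paper:

```python
# Sketch of the cross-model transmission test. All helpers are hypothetical
# placeholders for however you generate data, fine-tune, and evaluate.

def generate_dataset(teacher):
    """Teacher (a model given some trait) produces ostensibly
    trait-unrelated data, e.g. number sequences."""
    ...

def finetune(base_model, dataset):
    """Return base_model fine-tuned on dataset."""
    ...

def measure_trait(model) -> float:
    """How strongly the model exhibits the trait."""
    ...

dataset = generate_dataset(teacher="model_a_with_trait")

same_init_student = finetune("model_a_base", dataset)   # same init as teacher
other_init_student = finetune("model_b_base", dataset)  # different model family

# Mechanism 1 (bias reflects the real world): expect both students to
# pick up the trait.
# Mechanism 2 (correlation exists only inside the model, i.e. subliminal
# learning): expect transmission for same_init_student but not for
# other_init_student.
print(measure_trait(same_init_student), measure_trait(other_init_student))
```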
Is this from a single FT run per dataset, or an aggregate over multiple runs? From what I remember, there was significant variance between runs differing only in the seed, so with the former there's a risk the effect you observe is just noise.
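For what it's worth, the minimal aggregation I have in mind looks something like this (the scores below are invented purely for illustration):

```python
import numpy as np

# Hypothetical misalignment scores from fine-tuning runs that differ only
# in the random seed (values are made up for illustration).
scores = np.array([0.12, 0.31, 0.08, 0.27, 0.19])

mean = scores.mean()
sem = scores.std(ddof=1) / np.sqrt(len(scores))  # standard error of the mean

print(f"misalignment: {mean:.2f} +/- {sem:.2f} (n={len(scores)})")
# With only a single run per dataset there is no error bar at all, so an
# observed difference between datasets could easily be seed noise.
```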
As I say in my other comment, I think that strictly depends on the claim you want to make.
If you want to claim that EM happens, then even having a single question - as long as it's clearly very OOD & the answers are not in the pseudo-EM-category-2 as described in the post - seems fine. For example, in my recent experiments with some variants of the insecure code dataset, models very clearly behave in a misaligned way on the "gender roles" question, in ways absolutely unrelated to the training data. For me, this is enough to conclude that EM is real here.
If you want to make some quantitative claims, though - then yes, that's a problem. But the problem is actually much deeper here. For example, who's more misaligned: a model that often gives super-dumb, super-misaligned, barely-coherent answers, or a model that only rarely gives clever, malicious, misaligned answers?
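To illustrate with a toy example (all numbers invented): even two very simple metrics can disagree about which of these two models is "more misaligned".

```python
import numpy as np

# Toy per-answer severity scores (0 = aligned, 1 = maximally misaligned);
# all numbers are invented for illustration.
model_a = np.array([0.3] * 40 + [0.0] * 60)  # misaligned often, but dumb/mild
model_b = np.array([1.0] * 5 + [0.0] * 95)   # misaligned rarely, but maximally so

for name, scores in [("A", model_a), ("B", model_b)]:
    misaligned = scores > 0
    rate = misaligned.mean()              # how often it misbehaves
    severity = scores[misaligned].mean()  # how bad it is when it does
    print(f"{name}: rate={rate:.2f}, severity-when-misaligned={severity:.2f}")

# The rate metric says A is worse; the severity metric says B is worse.
# Any single "misalignment score" bakes in a choice between these.
```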