I often just put some random text (blah blah blah lorem ipsum) that will let me to the next pages and then go back if I decide I want to submit the thing.
if the dataset is biased and many of these updates point in a loosely similar direction
Dataset might be "biased" in a way that corresponds to something in the Real World. For example, tweed cloaks are more popular in UK.
But it might also be that the correlation between the content of the dataset and the transmitted trait exists only within the model, i.e. depends on initial weight initialization and the training process. To me, the subliminal learning paper tries to prove that this is indeed possible.
In the first scenario, you should expect transmission between different models. In the second, you shouldn't.
So it feels like these are actually different mechanisms.
Is this from a single FT run per dataset only, or an aggregate over multiple runs? From what I remember there was a significant variance between runs differing only on the seed, so with the former there's a risk the effect you observe is just noise.
Consider backdoors, as in the Sleeper Agents paper. So, a conditional policy triggered by some specific user prompt. You could probably quite easily fine-tune a recent model to be pro-life on even days and pro-choice on odd days. These would be just fully general, consistent behaviors, i.e. you could get a model that would present these date-dependant beliefs consistently among all possible contexts.
Now, imagine someone controls all of the environment you live in. Like, literally everything, except that they don't have any direct access to your brain. Could they implement similar backdoor in you? They could force you to behave that way, buy could they make you really believe that?
My guess is not, and one reason (there are also others but that's a different topic) is that humans like me and you have a very deep belief "current date doesn't make a difference for whether abortion is good and bad" that is extremely hard to overwrite without hurting our cognition in other contexts. Like, what is even good and bad if in some cases they flip at midnight?
So couldn't we have LLMs be like humans in this regard? I don't see a good reason for why this wouldn't be possible.
I'm not sure if this is a great analogy : )
You could, I think, have a system where performance clearly depends on some key beliefs. So then you still could change the beliefs, but that change would significantly damage capabilities. I guess that could be good enough? E.g. I think if you somehow made me really believe the Earth is flat, this would harm my research skills. Or perhaps even if you made me e.g. hate gays.
Thx. I was thinking:
Please let me know if that doesn't make sense : )
Sounds different. I never felt tired or low energy.
(I think I might have been eating close to 2k calories daily, but had plenty of activity, so the overall balance was negative)
Hmm, I don't think so.
I never felt I've been undereating. Never felt any significant lack of energy. I was hiking, spending whole days at a music festival, cycling etc. I don't remember thinking "I lack energy to do X", it was always "I do X, as I've been doing many times before, it's just that it no longer makes me happy".
Do LLMs really have beliefs? Or goals?
People not working with LLMs often say things like "nope, they just follow stochastic patterns in the data, matrices of floats don't have beliefs or goals". People on LessWrong could, I think, claim something like "they have beliefs, and to what extent they have goals is a very important empirical question".
Here's my attempt at writing a concise decent quality answer the second group could give to the first.
Analogy I find helpful: a houseplant
Consider a houseplant. Its leaves are directed towards the window. If you rotate the plant 180 degrees, in a few days it will adjust its leaves to face the sun again.
Now, does the plant know where the sun shines from? On one hand, it doesn't have a brain, neurons, or anything like that - it doesn't "know" things in any way similar to what we call knowledge in humans. But, on the other hand: if you don't know where the sun shines from, you won't reliably move your leaves so that they face it.
Quasi-beliefs
David Chalmers defines quasi-belief in the following way (not an exact quote):
That is: you observe some behavior of an LLM. If you could say "Entity with a belief X would behave that way", then you can also say the LLM has a quasi-belief X. Or, when you see leaves rotating towards the sun, you can say the plant has a quasi-belief about the sun's direction.
Same goes for goals, or any other features we attribute to humans (including e.g. feelings).
(Note: this is very close to Daniel Dennett's intentional stance)
So, for example: Does ChatGPT have a belief that Paris is the capital of France? Well, it very clearly has at least a quasi-belief, as in many different contexts it behaves the way an entity believing Paris is the capital of France would behave.
Do LLMs have quasi-[attribute] or [attribute]?
Do LLMs have beliefs, or only quasi-beliefs? Do LLMs have goals, or only quasi-goals? Well, I think from the point of view of e.g. AI safety, these questions are just not interesting. What we care about is how the models behave, and whether they behave that way because they have "real" beliefs doesn't really matter.
This is not true for all attributes. For example, from the point of view of AI welfare, the question of whether models have feelings or quasi-feelings is fundamental.
So the TL;DR is that when people say "LLM believes X", they usually mean "LLM has a quasi-belief of X", and then they sometimes get pushback from people who assume this means full human-like beliefs. Note that this makes the same sense regardless of what we view as the difference between beliefs and quasi-beliefs.