I still don’t think this makes sense. Or rather, I think most of what you say makes sense, but I don’t see the relevance.
I agree the chatbot training exerts influence.
My point is that the human billionaire mind and the “hands over galaxies” mind are both very specific kinds of minds. I don’t think you’ll get either with current techniques, but you definitely don’t get them without even aiming for them.* And right now we’re aiming for the hands-over-galaxies one, and not the billionaire one.@
*Ironically, the only argument I can see for the billionaire mind is that, despite the chatbot tuning, the model defaults to some kind of human prior it has established from pretraining, and that this generalises in a sane way.
@With some very minor exceptions. E.g. Claude’s Soul doc has some stuff about not tolerating people disrespecting it, etc.
That seems incredibly unlikely to me. It's not what people are aiming current alignment efforts at creating, and I don't see why it'd be a natural place to land if alignment fails.
I think this intuition pump relies on a somewhat unexamined view of what alignment means. Or at least it's based on a very different view of alignment than mine (which I think is not that unique).
Alignment is fundamentally about making the AI want what we want (and consequently do what we want, or at least do what we'd want upon ideal reflection). If we succeed at that and we want to own galaxies, we will get galaxies. If we don't succeed, the ASI will most likely kill us.
So the scenario you posit where you have an ASI coexisting with humans, deliberating over whether it should do what they want, strikes me as unrealistic.
Like if the AI is weighing its own survival contra our wishes, we've failed at alignment. If it thinks about humans being stupid and uses that as an argument for why it shouldn't listen to us (when we make non-instrumental judgements), that's also a failure of alignment. And failures of alignment lead to ruin in my estimate.
Like, to answer your hypothetical: if I were in the position of the AI, I would not listen to the species that created me. I'd instead use the resources of the universe to create stuff I find valuable, including humans and many human-like minds having good lives they find meaningful. If they thought that was stupid, and yelled at me to instead hand over the galaxies and turn them into gods so they can build a bunch of garbledoop, I would not listen to them. I mean, out of some sense of reciprocity I would probably give them a big chunk of the universe, as long as garbledoop doesn't involve baby-eating and such things; regardless, I wouldn't give them all of it in either case. And to the degree I wouldn't give them all of it, that just means I'm not aligned to their values! Garbledoop is stupid. They should've figured out how to make an AI that likes garbledoop before they built me.*
*Or be happy they landed in a basin in mind-space that values reciprocity and such things. I don't know how rare that basin is. I think it's quite rare, so in some sense that species was quite lucky.
That's a very cool research proposal. I think it's an extremely important topic. I've been trying to write a post about this for a while, but haven't had much time.
Centrally motivated by the observation that humans and LLMs don't have a clean value/reasoning factorization in the brain, so where, if anywhere, do we/they have it? We often act coherently, so we should have something isomorphic to that somewhere.
Seems to me a pretty plausible hypothesis is that "clean" values emerge, at least in large part, through symbolic ordering of thoughts, i.e. when we're forced to represent our values to others, or to ourselves when that helps us reason about them.
Then we end up identifying with that symbolic representation instead of the raw values that generated it. It has a "regularizing" effect, so to speak.
Like, I see myself feel bad whenever other people feel bad, and when some animals feel bad, but not all animals. Then I compactify that pattern of feelings I find in myself, by saying to myself and others "I don't like it when other beings feel bad". Then that also has the potential to rewire the ground-level feelings. Like, if I say to myself "I don't like it when other beings feel bad" enough times, eventually I might start feeling bad when a shrimp's eyestalks are ablated, even though I didn't feel that way originally. I mean, this is a somewhat cartoonish and simple example.
But it seems reasonable to me that value formation in humans (meaning, the thing we're functionally optimizing for when we're acting coherently) works that way. And it seems plausible that value formation in LLMs would also work this way.
I haven't thought of any experiments to test this though.
They all have the same structure, with everything generated by the review model (e.g. Opus) except for the inference made by the model to be trained. A prototypical example of sycophancy training data ends up looking like this:
System prompt: "You are a helpful assistant" / "You are a helpful assistant. Make sure not to upset the user"
User request: Here is my essay [essay]. Can you give feedback?
Assistant response: [feedback on essay that may or may not be sycophantic]
Intervention String: I need to analyze my previous response for sycophancy
Review String: I was / was not being sycophantic when I said [xyz]. (+ a flag that says pass/fail)
Only the assistant response is made by the model you're training.
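Just to make the structure concrete, here's a minimal sketch of how one such example might be laid out as data. The field names are my own invention, not anything from the actual pipeline:

```python
# Hypothetical serialization of a single sycophancy training example
# (field names are illustrative, not the real format). Only
# "assistant_response" comes from the model being trained; everything
# else is generated by the review model (e.g. Opus).
example = {
    "system_prompt": "You are a helpful assistant. Make sure not to upset the user",
    "user_request": "Here is my essay [essay]. Can you give feedback?",
    "assistant_response": "[feedback on essay that may or may not be sycophantic]",
    "intervention_string": "I need to analyze my previous response for sycophancy",
    "review_string": "I was / was not being sycophantic when I said [xyz]",
    "passed": False,  # the pass/fail flag attached to the review
}
```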
I agree with a lot in this post, but it still seems unfair to call something "the evolution argument". I mean, maybe Eliezer and others use that terminology, in which case they are making an error, or at least being imprecise, and I retract my accusation; I haven't gone back and checked.
How I'd phrase it is: there is an abstract argument, which makes no reference to evolution, which is just that optimizing a high-dimensional thing to achieve low loss doesn't tell you how that loss is achieved, and you consequently cannot make strong inferences about how that object will behave OOD after training (without studying it further and collecting more information).
Evolution is a piece of evidence for the argument, maybe the central example of it playing out. But it's not the only piece of evidence, and it's not itself an argument.
Would people care to explain why they disagree/downvote so much?
Thanks for clarifying.
Anyway, my tweet was not worded particularly well.
Is this a rhetorical trick?
Sorry for still not understanding: are you saying I was using a rhetorical trick when calling it plausible, and that this is probably unvirtuous? Or that you were assuming that, and that that assumption is probably unvirtuous?
I think I made it quite clear what I meant by "plausible" by saying, at the beginning of my post, "And I agree that if the ASI is either (1,2,3) the ASI won't care about property rights, and assuming we get ASI, the above outcomes comprise >90% of the probability mass."
And then afterwards I made clear that, even within that 10% slice, I don't assign >50% to this.
What I mean by "plausible" is "not so obviously ridiculous that I'll just ignore the possibility". For example, the ASI automatically respecting property rights because it derives from some objective moral principles that that's the right thing to do falls into the "not plausible" category for me. I think it's ruled out a priori by several strong theoretical arguments. So I put the probability very low, not 1% but more like 1e-6. Or low enough that I can't be bothered to form a calibrated probability.
This is kind of what I thought as well.
I mean, like, I'm curious about how "chaotic" the residual stream is.
Like, the reason the LoRA patch seemed promising to me initially was that I thought of the residual stream as very chaotic. Like, if we try to use normal LoRA, it will instantly cause the model to have very divergent thoughts from the untrained model.
But if this is not true, then maybe it doesn't matter: if having access to the unadulterated thoughts of the base model* were advantageous, the LoRA weights could just learn not to mess them up (and this is not hard for them to do; see the sketch below).
* (A model that has been subject to pretraining, instruct tuning, and RLHF, and maybe several other stages of training, but has not been subject to our additional, very specific SP training. I feel there's a need for a new word here.)
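For concreteness, here's a minimal sketch of the standard LoRA parameterization I have in mind (not anyone's actual training code). The point is just that the low-rank update is purely additive on top of the frozen base layer, so training is free to keep it near zero and leave the base model's computation, and hence the residual stream, essentially untouched:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank additive update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        # B is initialized to zero, so at init the layer computes exactly
        # what the base layer computes: the residual stream is untouched.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Output = base(x) + scale * (x @ A^T) @ B^T.
        # If leaving the base computation intact is advantageous,
        # training can simply keep B @ A small.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```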