jr

Comments
Sorted by Newest

Stress Testing Deliberative Alignment for Anti-Scheming Training
jr · 18d

Appendix G indicates that the frequency of "unusual terminology" increased dramatically during capabilities-focused RL training. It also indicates that the frequency decreased as a result of the DA anti-scheming training.

Does this overall decrease during DA indicate that the frequency did not increase during the RL phase of DA training, or only that there was a substantial decrease during the SFT phase which subsequent increases during RL did not fully reverse?
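
To make the distinction concrete, here is a toy Python sketch with entirely made-up frequencies (none of these numbers come from Appendix G or the paper), showing that either reading is compatible with a net decrease over DA:

```python
# Toy illustration with hypothetical numbers -- nothing here is taken from the paper.
# "Frequency" stands for the rate of "unusual terminology" in sampled transcripts.

pre_da = 0.10  # assumed frequency entering DA training

scenarios = {
    # Reading 1: the SFT phase lowers the frequency and the RL phase does not raise it.
    "no increase during DA's RL phase": {"after_sft": 0.03, "after_rl": 0.03},
    # Reading 2: SFT lowers it sharply, RL partially reverses the drop,
    # yet the net change over all of DA is still a decrease.
    "RL increase outweighed by the SFT drop": {"after_sft": 0.02, "after_rl": 0.06},
}

for name, s in scenarios.items():
    net = s["after_rl"] - pre_da
    print(f"{name}: {pre_da:.2f} -> {s['after_sft']:.2f} (SFT) -> {s['after_rl']:.2f} (RL); "
          f"net change over DA = {net:+.2f}")
```

Both cases print a net decrease, which is why the aggregate numbers alone cannot distinguish them.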

Are We Their Chimps?
jr · 19d

This response gives me the impression that you are more focused on defending or justifying what you did than on considering what you might do better.

It’s true that some people might be able to make that logical inference. I’m telling you it wasn’t clear to me, and that the framing statement in your comment was much better. (I don’t want to belabor the point, but I suspect the cognitive dissonance caused by the other issues I mentioned made that inference more difficult.)

I’m not pointing this out because I like being critical. I’m telling you in order to help you, because I would appreciate someone doing the same for me. I even generalized the principle so you can apply it in the future. You are welcome to disagree with that, but I hope you at least give it thoughtful consideration first.

Are We Their Chimps?
jr · 19d

If this “playing with language” is merely a stylistic choice, I would personally prefer you not intentionally redefine words with known meanings to mean something else. If it is instead a result of the challenge of compressing complex ideas into fewer words, I can definitely relate. But either way, I think your use of “parameters” in that way is confusing and undermines the reader’s ability to interpret your ideas accurately and efficiently.

Are We Their Chimps?
jr · 19d

I believe that if any one of these 8 is not appropriately accounted for in the system then misalignment scenarios arise.

This is a critical detail you neglected to communicate in this post. As written, I didn’t have sufficient context for the significance of those 8 things, or how they relate to the rest of your post. Including that sentence would’ve been helpful.

More generally, for future posts, I suggest assuming readers are not already familiar with your other concepts or writings, and ensuring you provide clear and simple contextual information about how they relate to your post.

Are We Their Chimps?
jr · 19d

Meta’s actions to date have not demonstrated an ability, a commitment, or even a desire to avoid harming humanity (much less to actively foster its well-being) rather than to make decisions that maximize profits at humanity’s clear expense. I will be delighted to be proven wrong and would gladly eat my words, but my base expectation is that this trend will only get worse in their AI products, not better.

Setting that aside, I hear that you believe we can build, and are building, systems in such a way that strong identity coupling will emerge. I suppose my question is: so what? What are the implications of that, if it is true? “Stop trying to slow down AI development (including ASI)”? If not that, then what?

Are We Their Chimps?
jr · 20d

That’s interesting; I’m looking forward to hearing about that paper. Does this “new approach” use the CoT, or some other means?

Thanks for the clarification on your intended meaning. For my personal taste, I would prefer you be more careful that the language you use does not appear to deny real complexities or to assert that success is guaranteed.

For instance, the conditional you state is:

IF we give a sufficiently capable intelligent system access to an extensive, comprehensive corpus of knowledge THEN two interesting things will happen

And you just confirmed in your prior comment that “sufficient capabilities are tied to compute and parameters”.

I am having trouble interpreting that in a way that does not approximately mean “alignment will inevitably happen automatically when we scale up”.

Perhaps you could give me an idea of the high-level implications of your framework; that might give me better context for interpreting your intent. What does it entail? What actions does it advocate for?

Are We Their Chimps?
jr · 20d

I absolutely understand and empathize with the difficulty of distilling complex thoughts into a simpler form without distortion. Perhaps reading the linked post will help; we’ll see once I read it. Until then, responding to your comment: I think you lost me at your #1. I’m not sure why we are assuming a strong coupling; that seems like a non-trivial thing to just assume. Additionally, I imagine you might be reversing the metaphor (I’m not familiar with Hinton’s use, but I would expect we are the mother in that metaphor, not the child). And even if that’s not the case, it seems you would still have a mess to sort out explaining why AI wouldn’t be a non-nurturing mother.

Are We Their Chimps?
jr · 20d

Since I also believe that self-preservation is emergent in intelligent systems (as discussed by Nick Bostrom), it follows that self-preservation instincts + identifying with humans mean that it will act benevolently to preserve humans.

I agree with you that this outcome should not be ruled out yet. However, in my mind that Result is not implied by the Condition.

To illustrate more concretely, humans also have self-preservation instincts and identify with humans (assuming the sense in which we identify with humans is equivalent to how AI would identify with humans). And I would say it is an open question whether humans will necessarily act collectively to preserve humans.

Additionally, the evidence we already have (such as in https://www.lesswrong.com/posts/JmRfgNYCrYogCq7ny/stress-testing-deliberative-alignment-for-anti-scheming) demonstrates that AI models have developed a rudimentary self-preservation mechanism, as well as a desire to fulfill the requests of users. When these conflict, they show a significant propensity to employ deception, even when doing so is contrary to the constructive objectives of the user.

What this indicates is that there is no magic bullet that ensures alignment occurs. It is a product of detailed technological systems and processes, and there are an infinite number of combinations that fail. So, in my opinion, doing the right things that make alignment possible is necessary, but not sufficient. Just as important will be identifying and addressing all of the ways that it could fail. As a father myself, I would compare this to the very messy and complex (but very rewarding) process of helping my children learn to be good humans.

All that to say: I think it is foolish to think we can build an AI system to automate something (human alignment) which we cannot even competently perform manually (as human beings). I am not sure how that might impact your framework. You are of course free to disagree, or explain if I’ve misinterpreted you in some way. But I think I can say broadly that I find claims of inevitable results to be very difficult to swallow, and find much greater value in identifying what is possible, the things that will help get us there successfully, and the things we need to address to avoid failure.

Hope this is helpful in some way. Keep refining. :)

Are We Their Chimps?
jr · 20d

Hey man, looking forward to reading the other posts you referenced soon! In the meantime, I want to push back on some fundamental premises you included here (as I interpret them), in case that might help you tighten your framework up:

  • Your point #1 reads to me as “alignment solves itself”, provided we “give a sufficiently capable intelligent system access to an extensive, comprehensive corpus of knowledge”. If that is not the sole condition for #1 to occur, it might be helpful to clarify that. (If that issue is limited to the content of this post only, then it’s less important, I suppose.)