My site doesn't have a comments section, but curious for feedback.

New to LessWrong?

New Comment
3 comments, sorted by Click to highlight new comments since: Today at 1:06 PM

Back around 2010, I used to emphasize the importance of neuroscience for CEV ("basically an exercise in applied neuroscience", "the AI is the neuroscientist"). I even dared to predict that before we got close to AGI, there would be a research community with a tentative draft of what human-friendly AI values should be, that is based on neuroscience. I was advancing this in contrast to proposals that we could just upload a bunch of wise humans, run them at high speed, and let them figure everything out. 

These days, the avantgarde in alignment proposals seems to be less about uploading someone or everyone (though Jan Leike, head of alignment at OpenAI, talks about achieving CEV through "simulated deliberative democracy"), and more about trusting the seed AI (the AI that figures out alignment for us) to figure out the relevant causal structure of the human mind, by way of AIXI-like general powers of modeling. I'm thinking here of alignment proposals like PreDCA and QACI. 

"Get the uploads to do the hard work" and "get a superintelligent modeler to do the hard work", as alignment proposals, have a common motivation, of defining a one-shot alignment method that is as simple as possible. By contrast, RLHF is an alignment method where humans are in the loop constantly. You could say that what I had in mind was something analogous, for the task of extracting the actual human decision procedure, from data about human brains and human behavior. I envisaged it as being achieved by human beings doing research in the ordinary way. 

So in the present what do we have? Steven Byrnes talks about how to align brain-like AI - perhaps one could apply his method to the architecture of the human brain itself, if only we knew it in sufficient detail. And June Ku's very underappreciated approach to CEV rests on a specific methodology for obtaining an idealized rational utility function, from the competing inconsistent imperatives found in an actual brain. 

But overall, the idea of achieving CEV, or even getting halfway to CEV, by doing plain old neuroscientific research, seems to be a path not taken. So I view your comments with interest. But I would emphasize that for this pathway, neuroscience, not just neurotechnology, is the essential ingredient. We needs theories and insights, not just experiment and data. To that end, we should be searching mainstream neuroscience for theories and insights that provide guidance. 

Hey Milan, I’m broadly sympathetic to the argument in Proposition 1 Reason 2 that if we want to understand if models do human-derived cognitive operation X, we need to define what X is, and the best validation of our definition will come from testing it in humans. But recently, I’ve been wrestling with whether we need to define the cognition that models are doing in the same terms that we define human cognition to get alignment of model behavior.
 

For instance you could take the definition of deception given in this paper: “the systematic inducement of false beliefs in the pursuit of some outcome other than the truth”. This definition intentionally avoids relying on any cognitive concepts like belief or intention in the models. You could ask “what kinds of internal operations in the model lead to this kind of behavior?”. If you are able to explain that and change the model so that it doesn’t have these kinds of behaviors, then you can essentially avoid the bad outcomes without knowing whether the models are doing the same cognitive operations humans are doing when they engage in similar behaviors (“deception”). This might actually require explaining how models work at a higher level of abstraction in order to be useful, something like cognitive science for AI, but the point is that I can imagine paths toward alignment where we are able to avoid certain bad behaviors in AI without ever checking that the concepts that we are using map directly to human cognition.
 

For a more concrete example, you could imagine that an LLM is able to generate false answers to questions even when they are otherwise able to generate true answers to the same questions in other contexts. The LLM might be lying in the way we mean humans to lie, which is minimally making a claim contrary to one’s belief and perhaps also with the intention to form false beliefs in others. Alternatively, the context might have just elicited the LLM to role play a persona that tends to give the false response to the questions we give it. In this second case, it is not necessarily true that the model actually has any beliefs or intentions, but it still engages in behavior that we deem undesirable. If we define lying as the former and test our mechanistic definition of lying in the brain, our mechanistic definition would then only identify the former as lying in the model, but not the latter. But both things lead to bad behavior in the model, and importantly, the second phenomenon may not have an analog in human cognition (or maybe it does?). If instead we did not use the human-derived concepts to define deception and defined it based on the behavior of AI, then maybe we are able to identify both cases as “deception” and correct for it accordingly.
 

I’m not sure yet how convinced I am by these arguments yet, but I find it difficult to come up with reasons why you would actually need to have your definitions based on human cognition rather than model behavior if your goal is only to avoid the behavior. Do you think that these kinds of alignment scenarios without human-derived definitions of cognitive concepts are plausible? I’m curious if you have arguments or intuitions for why you expect human-derived cognitive concepts to be essential for ensuring alignment.