This piece is super interesting, especially the toy models.

A few clarifying questions:

'And not just any learning algorithm! The neocortex runs a learning algorithm, but it's an algorithm that picks outputs to maximize reward, not outputs to anticipate matched supervisory signals. This needs to be its own separate brain module, I think.'

-- Why does it need to be its own separate module? Can you expand on this? And even if separate modules are useful (as per your toy models and their different inputs), couldn't the neocortex also be running lookup-table-like auto- or hetero-associative learning?

"Parallel fibers carry the context signals, Purkinje cells produce the output signals, and climbing fibers carry the shoulda signals, and the synapse strengths between a parallel fiber and a Purkinje cell is modified as a function of how recently the parallel fiber and climbing fiber have fired."

-- Can you cite this? I have seen evidence that this is the case, but also evidence that the context actually comes through the climbing fibers and the training (shoulda) signal through the mossy/parallel fibers. E.g. here for eyeblink operant conditioning.

Can you explain how the accelerator works in more detail (especially as you use it in the later body and cognition toy models 5 and 6)? Why is the cerebellum faster at producing outputs than the neocortex? How does the neocortex carry the "shoulda" signal? Finally, I'm confused by this line:

"You can't just take any incoming and outgoing signal lines and pair them up! ...Well, in an accelerator, you can! Let's say the output line in the diagram controls Muscle #847. Which shoulda line needs to be paired up with that? The answer is: it doesn't matter! Just take any neocortex output signal and connect it. And then the neocortex will learn over time that that output line is effectively a controller for Muscle #847."

-- Does this suggest that the neocortex can learn the cerebellar mapping and short-circuit to use it? Why does it need to go through the cerebellum to do this, rather than via the motor cortex and efferent connections back to the muscles?

Thank you!

Why does this approach only need to be implemented in neocortex-like AGIs? If we have a factored series of value functions in an RL agent, shouldn't we be able to take the same approach? But I guess you are thinking that the basal ganglia learning algorithms already do this for us, so it is a convenient approach?

Side note. I found the distinction between confusion and conflict a bit... confusing! Confusion here is the agent updating a belief while conflict is the agent deciding to take an action?

Thanks for this post!

I agree with just about all of it (even though it paints a pretty bleak picture). It was useful to have all of these ideas about inner/outer alignment in one place, especially the diagrams.

Two quotes that stood out to me:

"“Nameless pattern in sensory input that you’ve never conceived of” is a case where something is in-domain for the reward function but (currently) out-of-domain for the value function. Conversely, there are things that are in-domain for your value function—so you can like or dislike them—but wildly out-of-domain for your reward function! You can like or dislike “the idea that the universe is infinite”! You can like or dislike “the idea of doing surgery on your brainstem in order to modify your own internal reward function calculator”! A big part of the power of intelligence is this open-ended ever-expanding world-model that can re-conceptualize the world and then leverage those new concepts to make plans and achieve its goals. But we cannot expect those kinds of concepts to be evaluable by the reward function calculator."


"After all, the reward function will diverge from the thing we want, and the value function will diverge from the reward function. The most promising solution directions that I can think of seem to rely on things like interpretability, “finding human values inside the world-model”, corrigible motivation, etc.—things which cut across both layers, bridging all the way from the human’s intentions to the value function."

I also liked the idea that we can use the human brain as a way to better understand the interface between our outer-loop reward function and inner-loop value function.

Thinking about corrigibility, it seems like having a system with finite computational resources and an inability to modify its source code would both be highly desirable, especially at the early stages. This feels like a +1 to neuron-based wetware that implements AGI rather than code on a server. Of course, the agent could find ways to acquire more neurons! And we would very likely then lose out on some interpretability tools. But this is just something that popped into my head as a tradeoff between different AGI implementations.

As a more general point, I think your working with the garage door open and laying out all of your arguments is highly motivating (at least for me!). It makes me want to think more actively about safety research and actually pursue it, something I have dilly-dallied on since 2016, when I read Superintelligence!

Thanks a lot for your detailed reply and sorry for my slow response (I had to take some exams!).

Regarding terminal goals, the only compelling one I have come across is coherent extrapolated volition, as outlined in Superintelligence. But how to even program this into code is of course problematic, and I haven't followed the literature closely since then for rebuttals or better ideas.

I enjoyed your piece on Steered Optimizers, and think it has helped give me examples where the algorithmic design and inductive biases can play a part in how controllable our system is. This also brings to mind this piece which I suspect you may really enjoy:

I am quite a believer in fast takeoff scenarios so I am unsure to what extent we can control a full AGI, but until it reaches criticality the tools we have to test and control it will indeed be crucial.

One concern I have that you might be able to address is that evolution did not optimize for interpretability! While DNNs are certainly quite black-box, they remain more interpretable than the brain. I assign some prior probability to the same relative interpretability holding for DNNs vs neocortex-based AGI.

Another concern is with the human morals that you mentioned. This should certainly be investigated further, but I think almost no human has an internally consistent set of morals. In addition, I think that the morals we have were selected by the selfish gene, and even if we could re-simulate them through an evolution-like process we would get the good with the bad. and a few other evolutionary biology books have shaped my thinking on this.

Hi Steve, thanks for all of your posts.

It is unclear to me how this investigation into brain-like AGI will aid in safety research.

Can you provide some examples of what discoveries would indicate that this is an AGI route that is very dangerous or safe?

Without having thought about this much, it seems to me like the control/alignment problem depends upon the terminal goals we provide the AGI rather than the substrate and algorithms it is running to obtain AGI-level intelligence.

Thank you for the kind words and for flagging some terms to look out for in societal change approaches.

Fair enough, but for it to be that powerful and used as part of our immune system, we may be free of parasites because we are all dead xD.

Thanks for the informative comments. You make great points. I think the population structure of bats may have something to do with their unique immune response to these infections, but I definitely want to look at the bat immune system more.