I think the focus on "inclusive genetic fitness" as evolution's "goal" is weird. I'm not even sure it makes sense to talk about evolution's "goals", but if you want to call it an optimization process, the choice of "inclusive genetic fitness" as its target is arbitrary, as there are many other boundaries one could draw. Evolution acts at all levels, e.g. gene, cell, organism, species, the entirety of life on Earth. For example, it does not select adaptations which increase the genetic fitness of an individual but lead to the extinction of the species later. In the most basic sense evolution is selecting for "things that expand", in the entire universe, and humans definitely seem partially aligned with that - the ways in which they aren't seem non-competitive with this goal.
Since the outdoor temperature was lower in the control, ignoring it will inflate how much the two-hose unit outperforms by bringing the effect of both units closer to zero. If we assume the temperature differences the units and the control produce are approximately constant in this outdoor temperature range, then the difference to control would be 3.1°C for the one-hose unit and 5°C for the two-hose unit if the control outdoor temperature had been the same, meaning two-hose only outperforms by ~60% with the fan on high, and merely ~30% with the fan on low.
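As a quick check of the fan-on-high figure, here is the arithmetic using only the two corrected differences quoted above (3.1°C and 5°C). The variable names are mine, and the fan-on-low inputs are not given in this comment, so they are left out rather than guessed.

```python
# Corrected temperature drops relative to control (fan on high), in °C.
# These are the two figures quoted in the comment; nothing else is assumed.
one_hose_drop = 3.1
two_hose_drop = 5.0

# Relative advantage of the two-hose unit over the one-hose unit.
advantage = two_hose_drop / one_hose_drop - 1.0

print(f"two-hose advantage: {advantage:.0%}")  # roughly 61%
```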
The claim for "self-preserving" circuits is pretty strong. A much simpler explanation is that humans learn to value diversity early on because diversity of things around you, like tools, food sources, etc., improves fitness/reward.
Another non-competing explanation is that this is simply a result of boredom/curiosity - the brain wants to make observations that make it learn, not observations that it already predicts well, so we are inclined to observe things that are new. So again there is a force towards valuing diversity and this could become locked in our values.
First, there is a lot packed in "makes the world much more predictable". The only way I can envision this is taking over the world. After you do that, I'm not sure there is a lot more to do than wirehead.
But even if it doesn't involve that, I can pick other aspects that are favored by the base optimizer, like curiosity and learning, which wireheading goes against.
But actually, thinking more about this, I'm not even sure it makes sense to talk about inner alignment in the brain. What is the brain being aligned with? What is the base optimizer optimizing for? It is not intelligent, it does not have intent or a world model - it's doing some simple, local mechanical update on neural connections. I'm reminded of the Blue-Minimizing robot post.
If humans decide to cut the pleasure sensors and stimulate the brain directly would that be aligned? If we uploaded our brains into computers and wireheaded the simulation would that be aligned? Where do we place the boundary for the base optimizer?
It seems this question is posed in the wrong way, and it's more useful to ask the question this post asks - how do we get human values, and what kind of values does a system trained in a way similar to the human brain develop? If there is some general force behind learning values that favors some values to be learned rather than others, that could inform us about likely values of AIs trained via RL.
The original footnote provides one example of this, which is for the model to check if its objective satisfies some criterion, and fail hard if it doesn't. Now, if the model gets to the point where it's actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there. By having such a check in the first place, the model makes it so that gradient descent won't actually change its objective,
There is a direction in the gradient which both changes the objective and removes that check. The model doesn't need to be actually failing for that to happen - there is one piece of its parameters encoding the objective, and another piece encoding this deceptive logic which checks the objective and decides to perform poorly if it doesn't pass the check. The gradient can update both at the same time - any slight change to the objective can also be applied to the deceptive objective detector, making the model not misbehave and still updating its objective.
since any change to its objective (keeping all the other parameters fixed, which is what gradient descent does since it computes partial derivatives) would lead to such a failure.
I don't think this is possible. The directional derivative is the dot product of the gradient and the direction vector. That means if the loss decreases in a certain direction (in this case, changing the deceptive behavior and the objective), the gradient cannot be zero, i.e. at least one of the partial derivatives is nonzero, so gradient descent can move in a loss-decreasing direction and we will get what we want (different objective or less deception or both).
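To make the argument concrete, here is a toy sketch, not a model of any real network. Everything in it is an assumption of mine: `a` stands in for the objective, `b` for the copy of the objective the check compares against, the check is modelled as a smooth penalty `100 * (a - b)**2`, and the training signal pulls the objective towards 1. The point is only that a negative directional derivative along the joint direction implies a nonzero gradient, and that plain gradient descent then moves both parameters.

```python
# Toy 2-parameter loss (all names and constants are hypothetical).
# a: the "objective" parameter; b: the value the check compares it against.
def loss(a, b):
    check_penalty = 100.0 * (a - b) ** 2   # the model "fails" when a and b disagree
    task_loss = (a - 1.0) ** 2             # training pressure to change the objective
    return check_penalty + task_loss

def grad(a, b):
    # Analytic partial derivatives of the loss above.
    da = 200.0 * (a - b) + 2.0 * (a - 1.0)
    db = -200.0 * (a - b)
    return da, db

def directional_derivative(a, b, direction):
    # Dot product of the gradient with the direction vector.
    da, db = grad(a, b)
    return da * direction[0] + db * direction[1]

def descend(a, b, lr=0.001, steps=10000):
    # Plain gradient descent: every step uses only the partial derivatives.
    for _ in range(steps):
        da, db = grad(a, b)
        a, b = a - lr * da, b - lr * db
    return a, b

start = (0.0, 0.0)
joint_direction = (1.0, 1.0)  # change the objective and the check together

dd = directional_derivative(*start, joint_direction)
a_final, b_final = descend(*start)

print("directional derivative along joint direction:", dd)
print("loss before:", loss(*start), "loss after:", loss(a_final, b_final))
print("final objective and check:", a_final, b_final)
```

In this toy the check never blocks the update: the objective drifts and the check drifts along with it. A non-smooth or much sharper check would behave differently at first order, so this illustrates the argument under the stated assumptions rather than settling it.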
You should stop thinking about AI-designed nanotechnology like human technology and start thinking about it like actual nanotechnology, i.e. life. There is no reason to believe you can't come up with a design for self-replicating nanorobots that can also self-assemble into larger useful machines, all from very simple and abundant ingredients - life does exactly that.
I think the key insight here is that the brain is not inner aligned, not even close
You say that but don't elaborate further in the comment. Which learned human values go against the base optimizer values (pleasure, pain, learning)?
Avoiding wireheading doesn't seem like failed inner alignment - avoiding wireheading now can allow you to get even more pleasure in the future because wireheading makes you vulnerable/less powerful. The base optimizer is also searching for brain configurations which make good predictions about the world, and wireheading goes against that.
If you change the analogy to developing nuclear weapons instead of launching them, the picture becomes much grimmer.
When training a neural network, is there a broad basin of attraction around cat classifiers? Yes. There is a gigantic number of functions that perfectly match the observed data and yet are discarded by the simplicity (and other) biases in our training algorithm in favor of well-behaved cat classifiers. Around any low Kolmogorov complexity object there is an immense neighborhood of high complexity ones.
But it occurs to me that the overseer, or the system composing of overseer and corrigible AI, itself constitutes an agent with a distorted version of the overseer's true or actual preferences
The only way I can see this making sense is if you again have a simplicity bias for values. Otherwise you are claiming that there is some value function that is more complex than the current value function of these agents and that it is privileged over the current one - but then, to arrive at this function you have to conjure information out of nowhere. If you took the information from other places, like averaging the values of many agents, then you actually want to align with the values of these many agents, or whatever else you used.
In fact it seems to be the case with your examples that you are favoring simplicity - if the agents were smarter they would realize their values were misbehaving. But that *is* looking for simpler values - if, through reasoning, you discovered that some parts of your values contradict others, you have just arrived at a simpler value function, since the contradicting parts needed extra specification, i.e. were noisy, and you weren't smart enough to see that.