

Goal misgeneralization (the global minimum might also be malign OOD). The thing you're talking about here I would basically describe as this first case.

Is there a way this is different from standard Goodharting concerns? I totally agree that this is a problem, but it seems importantly conceptually distinct to me from gradient hackers/mesa-optimization.

Strongly upvoted this post. I agree very strongly with every point here. The biggest consideration for me is that alignment seems like the kind of problem which is primarily bottlenecked on serial conceptual insights rather than parallel compute. If we already had alignment methods that we know would work if we just scaled them up, the same way we have with capabilities, then racing to endgame might make sense given the opportunity costs of delaying aligned AGI. Given that a.) we don't have such techniques and b.) even if we did it would be hard to be so certain that they are actually correct, racing to endgame appears very unwise. 

There is a minor tension with capabilities: for alignment to progress, it does need some level of empirical capabilities results, both to reveal information about likely AGI designs and threat models and so we can actually test alignment techniques. I think that if, e.g., ML capabilities had frozen at the 2007 level for 50 years, then at some point we would stop being able to make alignment progress without capabilities advancements, but in the current situation we are very, very far from this Pareto frontier.

The local minima point is interesting. My initial question is how this squares with both theoretical and empirical findings that networks generally don't seem to get stuck in local minima, and with the many hints that the general loss landscape in which they operate is fairly benign.

I think this is only possible if the coupling between the gradient hacker's implementation of its malign behaviour and the good performance is extremely strong; essentially the correlation has to be 1. It is not as if gradient descent has only one knob to turn for 'more gradient hacker' or 'less gradient hacker'. Instead, it has access to all of the internal weights of the gradient hacker and will change them both to a.) strengthen the positive aspects of the gradient hacker wrt the outer loss and b.) weaken the negative aspects. So if the gradient hacker is good at planning, which is useful for the model, but is malign in some other way, then gradient descent will strengthen the planning-related parameters and weaken the malign ones simultaneously. The only way this fails is if there is literally no way to decouple these two aspects of the model, which I think would be very hard to maintain in practice. This is basically property 1: gradient descent optimises all parameters in the network and leaves no slack.
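A minimal toy sketch of the point, under entirely illustrative assumptions (two scalar weight groups standing in for "planning" and "malign" parameter subsets, and a quadratic outer loss): gradient descent moves each parameter along its own partial derivative, so there is no single "more/less gradient hacker" knob, and the two components decouple unless they are literally the same parameters.

```python
# Toy illustration (hypothetical names and loss): gradient descent updates
# each parameter group independently, strengthening the useful component
# and suppressing the harmful one at the same time.
w_plan, w_malign = 0.1, 1.0   # "planning" helps the outer loss; "malign" hurts it
lr = 0.1

def outer_loss(w_plan, w_malign):
    # Good planning reduces loss (target w_plan = 1); malign activity adds loss.
    return (1.0 - w_plan) ** 2 + w_malign ** 2

for _ in range(100):
    # Separate partial derivatives: no coupling between the two knobs.
    g_plan = -2.0 * (1.0 - w_plan)
    g_malign = 2.0 * w_malign
    w_plan -= lr * g_plan
    w_malign -= lr * g_malign

print(round(w_plan, 3), round(w_malign, 3))  # planning strengthened, malign part driven to ~0
```

The failure mode described above corresponds to the case where `w_plan` and `w_malign` are forced to be the same parameter, so no independent update directions exist.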


I broadly agree with a lot of shard theory claims. However, the important thing to realise is that 'human values' do not really come from inner misalignment wrt our innate reward circuitry, but rather are the result of a very long process of social construction, influenced both by our innate drives and by the game-theoretic social considerations needed to create and maintain large social groups. These value constructs have been distilled into webs of linguistic associations learnt through unsupervised text-prediction-like objectives, which is how we practically interact with our values. Most human value learning occurs through this linguistic learning, grounded by our innate drives but extended to much higher abstractions by language. I.e., for humans, we learn our values as some combination of bottom-up (how well our internal reward evaluators in the basal ganglia/hypothalamus accord with the top-down socially constructed values) and top-down (association of abstract value concepts with other, more grounded linguistic concepts).

With AGI, the key will be to work primarily top-down, since our linguistic constructs of values tend to reflect our ideal values much better than our actually realised behaviours: use the AGI's 'linguistic cortex', which already encodes verbal knowledge about human morality and values, to evaluate potential courses of action and serve as a reward signal, which can then get crystallised into learnt policies. The key difficulty is understanding how, in humans, the base reward functions interact with behaviour to make us 'truly want' specific outcomes (if humans even do), as opposed to wanting reward or its correlated social assessments. It is possible, even likely, that this is just the default outcome of model-free RL experienced from the inside, and in this case our AGIs would look highly anthropomorphic.

Also, in general I disagree that aligning agents to evaluations of plans is unnecessary. What you are describing here is just direct optimization. But direct optimization, i.e. effectively planning over a world model, is necessary in situations where a.) you can't behaviourally clone existing behaviour and b.) you can't self-play too much with a model-free RL algorithm and so must rely on the world model. In such a scenario you do not have ground-truth reward signals, and the only way to make progress is to optimise against some implicit learnt reward function.
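A minimal sketch of what "direct optimization" means here, with entirely toy stand-ins (a 1-D world model, a hypothetical learnt reward peaked at state 5, and exhaustive search over short action sequences in place of a real planner):

```python
from itertools import product

def world_model(state, action):
    # Toy learnt dynamics: 1-D position, actions shift it by -1/0/+1.
    return state + action

def learnt_reward(state):
    # Implicit learnt reward with no ground truth behind it: prefer state 5.
    return -abs(state - 5)

def plan(state, horizon=3, actions=(-1, 0, 1)):
    """Plan by rolling action sequences through the world model and
    scoring them with the learnt reward (direct optimization)."""
    best_seq, best_ret = None, float("-inf")
    for seq in product(actions, repeat=horizon):
        s, ret = state, 0.0
        for a in seq:
            s = world_model(s, a)
            ret += learnt_reward(s)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq

print(plan(3))  # walks right to the reward peak, then stays: (1, 1, 0)
```

The point of the sketch is that the only optimisation target available is `learnt_reward`; whatever errors it contains are exactly what the planner will exploit, which is where the Goodharting concern below comes from.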

I am also not sure that an agent that explicitly optimises this is hard to align; the major threat is Goodharting. We can perfectly align Go-playing AIs with this scheme because we have an exact ground-truth reward function. Goodharting is essentially isomorphic to a case of overfitting and can in theory be addressed with various kinds of regularisation; in particular, if the AI maintains a well-calibrated sense of reward-function uncertainty, then in theory we can derive quantitative bounds on its divergence from the true reward function.

> The convergence theorems basically say that optimizing for P[t] converges to optimizing for T[t+d] for some sufficient timespan d.

The idea of a convergence theorem showing that optimizing any objective leads to empowerment has been brought up a bunch of times in these discussions, as in this quote. Is there some well-known proof/paper where this is shown? AFAICT the original empowerment papers do not show any proof like this (I may have missed it). Is this based off of Alex Turner's work (which results in a measure different from information-theoretic empowerment, though intuitively related), or something else?
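For concreteness, information-theoretic empowerment in the Klyubin-style formulation is the channel capacity from actions to next states, max over p(a) of I(A; S'). A minimal sketch for the special case of deterministic dynamics, where this capacity reduces to log2 of the number of distinct reachable states (the general stochastic case needs something like Blahut-Arimoto and is not shown here; the corridor example is a hypothetical toy):

```python
from math import log2

def empowerment_deterministic(next_state):
    """One-step empowerment for deterministic dynamics.

    next_state: list mapping each action index to its resulting state.
    With deterministic transitions, max_{p(a)} I(A; S') = log2(#distinct
    reachable states), achieved by a uniform distribution over one action
    per reachable state.
    """
    return log2(len(set(next_state)))

# 1-D corridor with actions (left, stay, right).
print(empowerment_deterministic([1, 2, 3]))  # interior cell: 3 outcomes, log2(3) bits
print(empowerment_deterministic([0, 0, 1]))  # wall cell: 'left' collides with 'stay', 1.0 bit
```

This only defines the quantity being discussed; it does not bear on whether a convergence theorem of the quoted form exists.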

Excellent post btw.

Yes they have. There's quite a large literature on animal emotion and cognition, and my general synthesis is that animals (at least mammals) have at least the same basic emotions as humans, and often quite subtle ones such as empathy and a sense of fairness. It seems pretty likely to me that whatever the set of base reward functions encoded in the mammalian basal ganglia and hypothalamus is, it can quite robustly generate expressed behavioural 'values' that fall within some broadly humanly recognisable set.

This is definitely the case. My prior is relatively strong that intelligence is compact, at least for complex and general tasks and behaviours. Evidence for this comes from ML: the fact that the modern paradigm of a huge network + lots of data + a general optimiser can solve a large number of tasks is a fair bit of evidence for this. Other evidence is the existence of g and cortical uniformity in general, as well as our flexibility at learning skills like chess and mathematics, which we clearly do not have any evolutionarily innate specialisation for.

Of course some skills, such as motor reflexes, and a lot of behaviours are hardwired, but generally we see that as intelligence and generality grow, these decrease in proportion.

I'm basing my thinking here primarily off of Herculano-Houzel's work. If you have reasons you think this is wrong, or counterarguments, I would be very interested in them, as this is a moderately important part of my general model of AI.

> the brain imaging studies also show predicting intelligence taps into a lot more aspects of static neuroanatomy or dynamic patterns than simply brain volume

Do you have links for these studies? Would love to have a read about what the static and dynamic correlates of g are from brain imaging!
