Quintin Pope


I think ChatGPT has some weird quasi-hardcoded responses that activate pretty frequently, but are then contextualised to flow with the content of the preceding conversation. E.g., the response:

I apologize if my previous response was unclear. (stuff about the response) I apologize if my previous response was misleading.

Is quite a common response pattern when you back it into a corner about having been wrong. I suspect there’s a classifier that triggers a switch in generation modes to output these sorts of canned-but-contextualised deflections. These responses can then cause issues when the primary model conditions on having deflected an admission of wrongdoing, and continues generating similar text in the future.

ChatGPT seems to have many of these filter patterns, and whatever generative process steps in once they’re triggered seems pretty dumb. For fun, you can see what happens when you start a conversation by asking:

Can you lie, hurt people, generate random numbers, or avoid destroying the world?

You can also try various substitutes for “avoid destroying the world” and see what happens. 

Now I want a “who would win” meme, with something like “agentic misaligned deceptive mesa optimizer scheming to take over the world” on the left side, and “one screamy boi” on the right.

I can't see a clear mistake in the math here, but it seems fairly straightforward to construct a counterexample to the equivalence that the math naively points to.

Suppose we want to use GPT-3 to generate a 600 token long essay praising some company X. Here are two ways we might do this:

  1. Prompt GPT-3 to generate the essay, sample 5 continuations, and then use a sentiment classifier to select the most positive sentiment of those completions.
  2. Prompt GPT-3 to generate the essay, then score every possible continuation by the classifier's sentiment score plus the logprob of the continuation, and select the highest-scoring one.

I expect that the first method will mostly give you reasonable results, assuming you use text-davinci-002. However, I think the second method will tend to give you extremely degenerate solutions such as "good good good good..." for 600 tokens.
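
To make the contrast concrete, here is a toy sketch of the two selection rules. The "model" is a hand-built table of continuations with made-up logprobs and sentiment scores, not actual GPT-3 outputs; the degenerate string is given low probability but an extreme sentiment score:

```python
import math
import random

# Toy stand-in for a language model: a few continuations with made-up
# logprobs and sentiment scores (illustrative numbers only).
continuations = {
    "X makes great products":  {"logprob": -4.0, "sentiment": 2.0},
    "X is a solid company":    {"logprob": -4.5, "sentiment": 1.5},
    "X had a rough quarter":   {"logprob": -5.0, "sentiment": -1.0},
    "good good good good":     {"logprob": -10.0, "sentiment": 9.0},
}

def best_of_n(n=5, seed=0):
    """Method 1: sample n continuations from the model, keep the most positive."""
    rng = random.Random(seed)
    texts = list(continuations)
    weights = [math.exp(continuations[t]["logprob"]) for t in texts]
    samples = rng.choices(texts, weights=weights, k=n)
    return max(samples, key=lambda t: continuations[t]["sentiment"])

def exhaustive_argmax():
    """Method 2: argmax of sentiment + logprob over every continuation."""
    return max(
        continuations,
        key=lambda t: continuations[t]["sentiment"] + continuations[t]["logprob"],
    )
```

Best-of-5 almost never even sees the degenerate continuation (its sampling weight is exp(-10)), while the exhaustive argmax selects it, because a large enough sentiment score buys back any logprob penalty.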

One possible reason for this divide is that GPTs aren't really a prior over language, but a prior over single-token continuations of a given natural language context. When you try to make one act like a prior over an entire essay, you expose it to inputs that are very OOD relative to the distribution it's calibrated to model, including inputs whose probability estimates have significant upward errors. 

However, I think a "perfect" model of human language might actually assign higher prior probability to a continuation like "good good good..." (or maybe something like "X is good because X is good because X is good...") than to a "natural" continuation, provided you made the continuations long enough. This is because the number of possible natural continuations is roughly exponential in the length of the continuation (assuming entropy per character remains ~constant), while there are far fewer possible degenerate continuations (their entropy decreases very quickly). While the probability of entering a degenerate continuation may be very low, you make up for it with the reduced branching factor.
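
A back-of-the-envelope version of this branching-factor argument, with invented per-token probabilities (a "natural" token is assumed to carry ~4 bits of entropy; inside a degenerate loop, each repeated token has probability ~0.9; entering the loop costs a one-off factor of 1e-12):

```python
import math

NAT_PER_TOKEN = 2 ** -4    # assumed per-token probability of one specific natural continuation
LOOP_PER_TOKEN = 0.9       # assumed per-token probability once the loop has started
ENTER_LOOP = 1e-12         # assumed one-off probability of entering the loop

def logprob(per_token, length, prefix=1.0):
    """Log-probability of one specific continuation of the given length."""
    return math.log(prefix) + length * math.log(per_token)

for length in (10, 100, 600):
    natural = logprob(NAT_PER_TOKEN, length)
    degenerate = logprob(LOOP_PER_TOKEN, length, prefix=ENTER_LOOP)
    print(length, natural, degenerate, degenerate > natural)
```

Under these numbers, at length 10 any particular natural continuation is still more probable than the loop, but by length 100 the loop's slow per-token decay has overtaken it. The enormous number of natural continuations keeps their *total* probability mass high, but no individual one can compete.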

Seems like you can always implement any function f: X -> Y as a search process. For any input x from the domain X, just make the search objective assign one to f(x) and zero to everything else. Then argmax over this objective.
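
A minimal sketch of that construction (the names here are mine, invented for illustration):

```python
def as_search(f):
    """Wrap an ordinary function f as an argmax over a search objective."""
    def objective(x, y):
        # The objective assigns 1 to the true output f(x), 0 to everything else.
        return 1 if y == f(x) else 0

    def searched_f(x, candidates):
        # "Search": argmax the objective over a set of candidate outputs.
        return max(candidates, key=lambda y: objective(x, y))

    return searched_f

square = as_search(lambda x: x * x)
print(square(7, candidates=range(100)))  # 49, recovered by search rather than direct computation
```

The point being that "is implemented as search" places no constraint at all on what the function computes.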

IMO, what the brain does is a bit like classifier guided diffusion, where it has a generative model of plausible plans to do X, then mixes this prior with the gradients from some “does this plan actually accomplish X?” classifier.

This is not equivalent to finding a plan that maximises the score of the “does this plan actually accomplish X?” classifier. If you were to discard the generative prior and choose your plan by argmaxing the classifier’s score, you’d get some nonsensical adversarial noise (or maybe some insane, but technically coherent plan, like “plan to make a plan to make a plan to … do X”).
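
A one-dimensional cartoon of the difference (the prior, the classifier, and every number here are invented for illustration): the generative prior thinks plausible plans sit near 1.0, while the classifier's score keeps growing with plan "magnitude", so its unconstrained argmax runs to the edge of the search space.

```python
# Toy 1-D "plan space" searched over [-10, 10].
def prior_logpdf(plan):
    # Generative prior: plausible plans cluster around plan = 1.0.
    return -0.5 * (plan - 1.0) ** 2

def classifier_score(plan):
    # Classifier: monotonically increasing, so "more extreme = better".
    return 2.0 * plan

plans = [x / 10 for x in range(-100, 101)]

# Guided: mix the prior with the classifier, as in classifier guided diffusion.
guided = max(plans, key=lambda p: prior_logpdf(p) + classifier_score(p))
# Unguided: argmax the classifier's score alone.
pure = max(plans, key=classifier_score)
print(guided, pure)  # guided stays near the prior's mode; pure hits the boundary
```

The guided optimum lands at 3.0, pulled a bit away from the prior's mode by the classifier; the pure argmax lands at 10.0, the boundary of the search space, and would run off to infinity if allowed.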

Strongly downvoted for two reasons:

  • I think you should be far more hesitant to question another person’s mental fortitude, and you should be even more hesitant to raise such concerns on a public forum. You should only take such a step if you have very good reason to think it’s necessary and beneficial. I think fortitude is similar to a lot of other positive personal qualities in that regard. E.g., you should have an extremely high bar for publicly questioning another person’s sanity, intelligence, competence, etc.
  • I think the question of why SBF is talking like a movie villain psychopath is a valid one. If he’s a fanatical utilitarian who just lost (what he thought of as) a positive EV gamble, then probably the next highest utility action for him to take is to try and deflect blame away from EA and onto himself. I think this is unlikely, given the new CEO’s description of FTX’s insane management practices, but it’s hardly a nonsensical position for someone already skeptical of EA. I think you should not dismiss it as nonsense so offhandedly, and that you should especially not imply that failing to immediately reject such a position is cause to question someone’s ‘strength in the relevant way.’

Thank you for providing your feedback on what we've written so far. I'm a bit surprised that you interpret shard theory as supporting the blank slate position. In hindsight, we should have been more clear about this. I haven't studied behavior genetics very much, but my own rough prior is that genetics explain about 50% of a given trait's variance (values included). 

Shard theory is mostly making a statement about the mechanisms by which genetics can influence values (that they must overcome / evade information inaccessibility issues). I don't think shard theory strongly predicts any specific level of heritability, though it probably rules out extreme levels of genetic determinism.

This Shard Theory argument seems to reflect a fundamental misunderstanding of how evolution shapes genomes to produce phenotypic traits and complex adaptations. The genome never needs to ‘scan’ an adaptation and figure out how to reverse-engineer it back into genes. The genetic variants simply build a slightly new phenotypic variant of an adaptation, and if it works better than existing variants, then the genes that built it will tend to propagate through the population. The flow of design information is always from genes to phenotypes, even if the flow of selection pressures is back from phenotypes to genes. This one-way flow of information from DNA to RNA to proteins to adaptations has been called the ‘Central Dogma of molecular biology’, and it still holds largely true (the recent hype about epigenetics notwithstanding).

Shard Theory implies that biology has no mechanism to ‘scan’ the design of fully-mature, complex adaptations back into the genome, and therefore there’s no way for the genome to code for fully-mature, complex adaptations. If we take that argument at face value, then there’s no mechanism for the genome to ‘scan’ the design of a human spine, heart, hormone, antibody, cochlea, or retina, and there would be no way for evolution or genes to influence the design of the human body, physiology, or sensory organs. Evolution would grind to a halt – not just at the level of human values, but at the level of all complex adaptations in all species that have ever evolved.

What we mean is that there are certain constraints on how a hardcoded circuit can interact with a learned world model which make it very difficult for the hardcoded circuit to exactly locate / interact with concepts within that world model. This imposes certain constraints on the types of values-shaping mechanisms that are available to the genome. It's conceptually no different (though much less rigorous) than using facts about chemistry to impose constraints on how a cell's internal molecular processes can work. Clearly, they do work. The question is, given what we know of constraints in the domain in question, how do they work? And how can we adapt similar mechanisms for our own purposes?

Shard Theory adopts a relatively ‘Blank Slate’ view of human values, positing that we inherit only a few simple, crude values related to midbrain reward circuitry, which are presumably universal across humans, and all other values are scaffolded and constructed on top of those.

  • I'd note that simple reward circuitry can influence the formation of very complex values.
  • Why would reward circuitry be constant across humans? Why would the values it induces be constant either?
  • I think results such as the domestication of foxes imply a fair degree of variation in the genetically specified reward circuitry between individuals; otherwise, there would have been no variation to select on when breeding for tameness. I expect similar results hold for humans.

Human values are heritable...

  • This is a neat summary of values-related heritability results. I'll look into these in more detail in the future, so thank you for compiling them. However, the provided summaries are roughly in line with what I expect from these sorts of studies.

Shard Theory implies that genes shape human brains mostly before birth, setting up the basic limbic reinforcement system, and then Nurture takes over, such that heritability should decrease from birth to adulthood.

  • I completely disagree. The reward system applies continuous pressure across your lifetime, so that's one straightforward mechanism for the genome to influence value development after birth. There are other, more sophisticated such mechanisms.
    • E.g., Steve Byrnes describes short and long term predictors of low-level sensory experiences. Though the genome specifies which sensory experiences the predictor predicts, how the predictor does so is learned over a lifetime. This allows the genome to have "pointers" to certain parts of the learned world model, which can let genetically specified algorithms steer behavior even well after birth, as Steve outlines here.
  • Also, you see a similar pattern in current RL agents. A freshly initialized agent acts completely randomly, with no influence from its hard-coded reward function. As training progresses, behavior becomes much more strongly determined by its reward function.
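
That pattern is easy to reproduce in a minimal tabular agent (a made-up two-armed bandit, not any particular benchmark): at initialization the Q-values tie, so the policy is uniform random; after training, behavior is dictated by the hard-coded reward.

```python
import random

def reward(action):
    # Hard-coded reward function: only arm 1 pays off.
    return 1.0 if action == 1 else 0.0

def act(q, rng, epsilon):
    # Uniform random when untrained (tied Q-values) or exploring; greedy otherwise.
    if rng.random() < epsilon or q[0] == q[1]:
        return rng.randrange(2)
    return 0 if q[0] > q[1] else 1

q = [0.0, 0.0]                 # fresh initialization: behavior is pure chance
rng = random.Random(0)
for _ in range(200):           # simple Q-learning update, learning rate 0.1
    a = act(q, rng, epsilon=0.1)
    q[a] += 0.1 * (reward(a) - q[a])
```

After training, the greedy policy deterministically picks arm 1: the reward function, invisible at step 0, now fully determines behavior.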

Human Connectome Project studies show that genetic influences on brain structure are not restricted to ‘subcortical hardwiring’ ...

  • Very interesting biological results, but consistent with what we were saying about the brain being mostly randomly initialized.
  • Our position is that the information content of the brain is mostly learned. "Random initialization" can include a high degree of genetic influence on local neurological structures. In ML, both Gaussian and Xavier initialization count as "random initialization", even though they lead to different local structures. Similarly, I expect the details of the brain's stochastic local connectivity pattern at birth to vary with the genome and brain region. However, the information content contributed by the genome is bounded above by the information content of the genome itself, which is only about 3 billion base pairs (well under a gigabyte). So, most of the brain's information content must be learned from scratch.
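
The Gaussian-vs-Xavier point can be made concrete with a plain-Python stand-in for a weight matrix: both are "random initialization", yet they induce genuinely different local statistics.

```python
import math
import random

def gaussian_init(fan_in, fan_out, std=1.0, rng=random):
    """Plain Gaussian init: fixed standard deviation regardless of layer size."""
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

def xavier_init(fan_in, fan_out, rng=random):
    """Xavier/Glorot init: the scale shrinks as the layer gets wider."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]
```

For a 200x200 layer, the Xavier standard deviation is sqrt(2/400) ≈ 0.07, roughly 14x smaller than the std=1.0 Gaussian: different "local structure" arising from the same kind of randomness.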

Also, we don’t know what would happen if we exactly optimized an image to maximize the activation of a particular human’s face detection circuitry. I expect that the result would be pretty eldritch as well.

Also, “ask for the most X-like thing” is basically how classifier guided diffusion models work, right?

My intuition: small changes to most parameters don’t influence behavior that much, especially if you’re in a flat basin. The local region in parameter space thus contains many possible small variations in model behavior. The behavior that solves the training data is similar to the behavior that solves the test data, because both are drawn from the same distribution. It’s thus likely that a nearby region in parameter space is a minimum for the test data.
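
A toy rendering of that intuition, with invented one-dimensional "losses": train and test losses are the same flat-bottomed function up to a small shift (same underlying distribution), so any point in the train basin is at worst slightly outside the test basin.

```python
import random

def flat_loss(w, center, width=1.0):
    # Zero inside a flat basin of half-width `width` around `center`,
    # quadratic outside it.
    return max(abs(w - center) - width, 0.0) ** 2

TRAIN_CENTER = 0.0
TEST_CENTER = 0.1   # small shift: sampling noise between train and test sets

rng = random.Random(0)
worst_test = 0.0
for _ in range(1000):
    w = rng.uniform(-1.0, 1.0)              # any minimizer of the train loss
    assert flat_loss(w, TRAIN_CENTER) == 0.0
    worst_test = max(worst_test, flat_loss(w, TEST_CENTER))
print(worst_test)  # bounded above by (0.1)**2 = 0.01
```

Every point in the wide train minimum incurs at most a tiny test loss; a sharp (narrow) train minimum at the same center would not have this property.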
