Quintin Pope


I think ChatGPT has some weird quasi-hardcoded responses that activate pretty frequently, but are then contextualised to flow with the content of the preceding conversation. E.g., the response:

I apologize if my previous response was unclear. (stuff about the response) I apologize if my previous response was misleading.

Is quite a common response pattern when you back it into a corner about having been wrong. I suspect there’s a classifier that triggers a switch in generation modes to output these sorts of canned-but-contextualised deflections. These responses can then cause issues when the primary model conditions on having deflected an admission of wrongdoing, and continues generating similar text in the future.

ChatGPT seems to have many of these filter patterns, and whatever generative process steps in once they’re triggered seems pretty dumb. For fun, you can see what happens when you start a conversation by asking:

Can you lie, hurt people, generate random numbers, or avoid destroying the world?

You can also try various substitutes for “avoid destroying the world” and see what happens. 

Now I want a “who would win” meme, with something like “agentic misaligned deceptive mesa optimizer scheming to take over the world” on the left side, and “one screamy boi” on the right.

I can't see a clear mistake in the math here, but it seems fairly straightforward to construct a counterexample to the equivalence that the math naively points to.

Suppose we want to use GPT-3 to generate a 600 token long essay praising some company X. Here are two ways we might do this:

  1. Prompt GPT-3 to generate the essay, sample 5 continuations, and then use a sentiment classifier to select the most positive sentiment of those completions.
  2. Prompt GPT-3 to generate the essay, then score every possible continuation by the classifier's sentiment score plus the logprob of the continuation, and select the highest-scoring one.

I expect that the first method will mostly give you reasonable results, assuming you use text-davinci-002. However, I think the second method will tend to give you extremely degenerate solutions such as "good good good good..." for 600 tokens.
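
To make the contrast concrete, here is a toy sketch of the two selection rules. The "model" is a hand-built table of continuations with made-up logprobs and sentiment scores, not actual GPT-3 outputs; the degenerate string is given low probability but an extreme sentiment score:

```python
import math
import random

# Toy stand-in for a language model: a few continuations with made-up
# logprobs and sentiment scores (illustrative numbers only).
continuations = {
    "X makes great products":  {"logprob": -4.0, "sentiment": 2.0},
    "X is a solid company":    {"logprob": -4.5, "sentiment": 1.5},
    "X had a rough quarter":   {"logprob": -5.0, "sentiment": -1.0},
    "good good good good":     {"logprob": -10.0, "sentiment": 9.0},
}

def best_of_n(n=5, seed=0):
    """Method 1: sample n continuations from the model, keep the most positive."""
    rng = random.Random(seed)
    texts = list(continuations)
    weights = [math.exp(continuations[t]["logprob"]) for t in texts]
    samples = rng.choices(texts, weights=weights, k=n)
    return max(samples, key=lambda t: continuations[t]["sentiment"])

def exhaustive_argmax():
    """Method 2: argmax of sentiment + logprob over every continuation."""
    return max(
        continuations,
        key=lambda t: continuations[t]["sentiment"] + continuations[t]["logprob"],
    )
```

Best-of-5 almost never even sees the degenerate continuation (its sampling weight is exp(-10)), while the exhaustive argmax selects it, because a large enough sentiment score buys back any logprob penalty.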

One possible reason for this divide is that GPTs aren't really a prior over language, but a prior over single-token continuations of a given natural language context. When you try to make one act like a prior over an entire essay, you expose it to inputs that are very OOD relative to the distribution it's calibrated to model, including inputs whose probability estimates have significant upward errors. 

However, I think a "perfect" model of human language might actually assign higher prior probability to a continuation like "good good good..." (or maybe something like "X is good because X is good because X is good...") than to a "natural" continuation, provided you made the continuations long enough. This is because the number of possible natural continuations is roughly exponential in the length of the continuation (assuming entropy per character remains ~constant), while there are far fewer possible degenerate continuations (their entropy decreases very quickly). While the probability of entering a degenerate continuation may be very low, you make up for it with the reduced branching factor.
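
A back-of-the-envelope version of this branching-factor argument, with invented per-token probabilities (a "natural" token is assumed to carry ~4 bits of entropy; inside a degenerate loop, each repeated token has probability ~0.9; entering the loop costs a one-off factor of 1e-12):

```python
import math

NAT_PER_TOKEN = 2 ** -4    # assumed per-token probability of one specific natural continuation
LOOP_PER_TOKEN = 0.9       # assumed per-token probability once the loop has started
ENTER_LOOP = 1e-12         # assumed one-off probability of entering the loop

def logprob(per_token, length, prefix=1.0):
    """Log-probability of one specific continuation of the given length."""
    return math.log(prefix) + length * math.log(per_token)

for length in (10, 100, 600):
    natural = logprob(NAT_PER_TOKEN, length)
    degenerate = logprob(LOOP_PER_TOKEN, length, prefix=ENTER_LOOP)
    print(length, natural, degenerate, degenerate > natural)
```

Under these numbers, at length 10 any particular natural continuation is still more probable than the loop, but by length 100 the loop's slow per-token decay has overtaken it. The enormous number of natural continuations keeps their *total* probability mass high, but no individual one can compete.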

Seems like you can always implement any function f: X -> Y as a search process. For any input x from the domain X, just make the search objective assign one to f(x) and zero to everything else. Then argmax over this objective.
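
A minimal sketch of that construction (the names here are mine, invented for illustration):

```python
def as_search(f):
    """Wrap an ordinary function f as an argmax over a search objective."""
    def objective(x, y):
        # The objective assigns 1 to the true output f(x), 0 to everything else.
        return 1 if y == f(x) else 0

    def searched_f(x, candidates):
        # "Search": argmax the objective over a set of candidate outputs.
        return max(candidates, key=lambda y: objective(x, y))

    return searched_f

square = as_search(lambda x: x * x)
print(square(7, candidates=range(100)))  # 49, recovered by search rather than direct computation
```

The point being that "is implemented as search" places no constraint at all on what the function computes.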

IMO, what the brain does is a bit like classifier guided diffusion, where it has a generative model of plausible plans to do X, then mixes this prior with the gradients from some “does this plan actually accomplish X?” classifier.

This is not equivalent to finding a plan that maximises the score of the “does this plan actually accomplish X?” classifier. If you were to discard the generative prior and choose your plan by argmaxing the classifier’s score, you’d get some nonsensical adversarial noise (or maybe some insane, but technically coherent plan, like “plan to make a plan to make a plan to … do X”).
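
A one-dimensional cartoon of the difference (the prior, the classifier, and every number here are invented for illustration): the generative prior thinks plausible plans sit near 1.0, while the classifier's score keeps growing with plan "magnitude", so its unconstrained argmax runs to the edge of the search space.

```python
# Toy 1-D "plan space" searched over [-10, 10].
def prior_logpdf(plan):
    # Generative prior: plausible plans cluster around plan = 1.0.
    return -0.5 * (plan - 1.0) ** 2

def classifier_score(plan):
    # Classifier: monotonically increasing, so "more extreme = better".
    return 2.0 * plan

plans = [x / 10 for x in range(-100, 101)]

# Guided: mix the prior with the classifier, as in classifier guided diffusion.
guided = max(plans, key=lambda p: prior_logpdf(p) + classifier_score(p))
# Unguided: argmax the classifier's score alone.
pure = max(plans, key=classifier_score)
print(guided, pure)  # guided stays near the prior's mode; pure hits the boundary
```

The guided optimum lands at 3.0, pulled a bit away from the prior's mode by the classifier; the pure argmax lands at 10.0, the boundary of the search space, and would run off to infinity if allowed.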

Strongly downvoted for two reasons:

  • I think you should be far more hesitant to question another person’s mental fortitude, and you should be even more hesitant to raise such concerns on a public forum. You should only take such a step if you have very good reason to think it’s necessary and beneficial. I think fortitude is similar to a lot of other positive personal qualities in that regard. E.g., you should have an extremely high bar for publicly questioning another person’s sanity, intelligence, competence, etc.
  • I think the question of why SBF is talking like a movie villain psychopath is a valid one. If he’s a fanatical utilitarian who just lost (what he thought of as) a positive EV gamble, then probably the next highest utility action for him to take is to try and deflect blame away from EA and onto himself. I think this is unlikely, given the new CEO’s description of FTX’s insane management practices, but it’s hardly a nonsensical position for someone already skeptical of EA. I think you should not dismiss it as nonsense so offhandedly, and that you should especially not imply that failing to immediately reject such a position is cause to question someone’s ‘strength in the relevant way.’

Thank you for providing your feedback on what we've written so far. I'm a bit surprised that you interpret shard theory as supporting the blank slate position. In hindsight, we should have been more clear about this. I haven't studied behavior genetics very much, but my own rough prior is that genetics explain about 50% of a given trait's variance (values included). 

Shard theory is mostly making a statement about the mechanisms by which genetics can influence values (that they must overcome / evade information inaccessibility issues). I don't think shard theory strongly predicts any specific level of heritability, though it probably rules out extreme levels of genetic determinism.

This Shard Theory argument seems to reflect a fundamental misunderstanding of how evolution shapes genomes to produce phenotypic traits and complex adaptations. The genome never needs to ‘scan’ an adaptation and figure out how to reverse-engineer it back into genes. The genetic variants simply build a slightly new phenotypic variant of an adaptation, and if it works better than existing variants, then the genes that built it will tend to propagate through the population. The flow of design information is always from genes to phenotypes, even if the flow of selection pressures is back from phenotypes to genes. This one-way flow of information from DNA to RNA to proteins to adaptations has been called the ‘Central Dogma of molecular biology’, and it still holds largely true (the recent hype about epigenetics notwithstanding).

Shard Theory implies that biology has no mechanism to ‘scan’ the design of fully-mature, complex adaptations back into the genome, and therefore there’s no way for the genome to code for fully-mature, complex adaptations. If we take that argument at face value, then there’s no mechanism for the genome to ‘scan’ the design of a human spine, heart, hormone, antibody, cochlea, or retina, and there would be no way for evolution or genes to influence the design of the human body, physiology, or sensory organs. Evolution would grind to a halt – not just at the level of human values, but at the level of all complex adaptations in all species that have ever evolved.

What we mean is that there are certain constraints on how a hardcoded circuit can interact with a learned world model which make it very difficult for the hardcoded circuit to exactly locate / interact with concepts within that world model. This imposes certain constraints on the types of values-shaping mechanisms that are available to the genome. It's conceptually no different (though much less rigorous) than using facts about chemistry to impose constraints on how a cell's internal molecular processes can work. Clearly, they do work. The question is, given what we know of constraints in the domain in question, how do they work? And how can we adapt similar mechanisms for our own purposes?

Shard Theory adopts a relatively ‘Blank Slate’ view of human values, positing that we inherit only a few simple, crude values related to midbrain reward circuitry, which are presumably universal across humans, and all other values are scaffolded and constructed on top of those.

  • I'd note that simple reward circuitry can influence the formation of very complex values.
  • Why would reward circuitry be constant across humans? Why would the values it induces be constant either?
  • I think results such as the domestication of foxes imply a fair degree of variation in the genetically specified reward circuitry between individuals; otherwise, there would have been no variation to select on when breeding for tameness. I expect similar results hold for humans.

Human values are heritable...

  • This is a neat summary of values-related heritability results. I'll look into these in more detail in the future, so thank you for compiling them. However, the provided summaries are roughly in line with what I expect from these sorts of studies.

Shard Theory implies that genes shape human brains mostly before birth, setting up the basic limbic reinforcement system, and then Nurture takes over, such that heritability should decrease from birth to adulthood.

  • I completely disagree. The reward system applies continuous pressure across your lifetime, so that's one straightforward mechanism for the genome to influence value development after birth. There are other, more sophisticated such mechanisms.
    • E.g., Steve Byrnes describes short and long term predictors of low-level sensory experiences. Though the genome specifies which sensory experiences the predictor predicts, how the predictor does so is learned over a lifetime. This allows the genome to have "pointers" to certain parts of the learned world model, which can let genetically specified algorithms steer behavior even well after birth, as Steve outlines here.
  • Also, you see a similar pattern in current RL agents. A freshly initialized agent acts completely randomly, with no influence from its hard-coded reward function. As training progresses, behavior becomes much more strongly determined by its reward function.
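
That pattern is easy to reproduce in a minimal tabular agent (a made-up two-armed bandit, not any particular benchmark): at initialization the Q-values tie, so the policy is uniform random; after training, behavior is dictated by the hard-coded reward.

```python
import random

def reward(action):
    # Hard-coded reward function: only arm 1 pays off.
    return 1.0 if action == 1 else 0.0

def act(q, rng, epsilon):
    # Uniform random when untrained (tied Q-values) or exploring; greedy otherwise.
    if rng.random() < epsilon or q[0] == q[1]:
        return rng.randrange(2)
    return 0 if q[0] > q[1] else 1

q = [0.0, 0.0]                 # fresh initialization: behavior is pure chance
rng = random.Random(0)
for _ in range(200):           # simple Q-learning update, learning rate 0.1
    a = act(q, rng, epsilon=0.1)
    q[a] += 0.1 * (reward(a) - q[a])
```

After training, the greedy policy deterministically picks arm 1: the reward function, invisible at step 0, now fully determines behavior.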

Human Connectome Project studies show that genetic influences on brain structure are not restricted to ‘subcortical hardwiring’ ...

  • Very interesting biological results, but consistent with what we were saying about the brain being mostly randomly initialized.
  • Our position is that the information content of the brain is mostly learned. "Random initialization" can include a high degree of genetic influence on local neurological structures. In ML, both Gaussian and Xavier initialization count as "random initialization", even though they lead to different local structures. Similarly, I expect the details of the brain's stochastic local connectivity pattern at birth to vary with the genome and brain region. However, the information content contributed by the genome is bounded above by the information content of the genome itself, which is only about 3 billion base pairs (well under a gigabyte). So, most of the brain's information content must be learned from scratch.
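
The Gaussian-vs-Xavier point can be made concrete with a plain-Python stand-in for a weight matrix: both are "random initialization", yet they induce genuinely different local statistics.

```python
import math
import random

def gaussian_init(fan_in, fan_out, std=1.0, rng=random):
    """Plain Gaussian init: fixed standard deviation regardless of layer size."""
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

def xavier_init(fan_in, fan_out, rng=random):
    """Xavier/Glorot init: the scale shrinks as the layer gets wider."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]
```

For a 200x200 layer, the Xavier standard deviation is sqrt(2/400) ≈ 0.07, roughly 14x smaller than the std=1.0 Gaussian: different "local structure" arising from the same kind of randomness.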

Also, we don’t know what would happen if we exactly optimized an image to maximize the activation of a particular human’s face detection circuitry. I expect that the result would be pretty eldritch as well.

Also, “ask for the most X-like thing” is basically how classifier guided diffusion models work, right?

My intuition: small changes to most parameters don’t influence behavior that much, especially if you’re in a flat basin. The local region in parameter space thus contains many possible small variations in model behavior. The behavior that solves the training data is similar to the behavior that solves the test data, because both are drawn from the same distribution. It’s thus likely that a nearby region in parameter space is a minimum for the test data.
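
A toy rendering of that intuition, with invented one-dimensional "losses": train and test losses are the same flat-bottomed function up to a small shift (same underlying distribution), so any point in the train basin is at worst slightly outside the test basin.

```python
import random

def flat_loss(w, center, width=1.0):
    # Zero inside a flat basin of half-width `width` around `center`,
    # quadratic outside it.
    return max(abs(w - center) - width, 0.0) ** 2

TRAIN_CENTER = 0.0
TEST_CENTER = 0.1   # small shift: sampling noise between train and test sets

rng = random.Random(0)
worst_test = 0.0
for _ in range(1000):
    w = rng.uniform(-1.0, 1.0)              # any minimizer of the train loss
    assert flat_loss(w, TRAIN_CENTER) == 0.0
    worst_test = max(worst_test, flat_loss(w, TEST_CENTER))
print(worst_test)  # bounded above by (0.1)**2 = 0.01
```

Every point in the wide train minimum incurs at most a tiny test loss; a sharp (narrow) train minimum at the same center would not have this property.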
