mesaoptimizer

Comments

Time-logging programs and/or spreadsheets (2022)

I log time in a TSV file, with the following format: (start datetime, end datetime, category, comment describing what I plan to do and what happened)

I use Emacs as my text editor, and I have a keybinding that inserts the current datetime at point.
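For concreteness, here is a minimal Python sketch of appending one such row (the file name time_log.tsv and the log_entry helper are made up for illustration; my actual workflow is just Emacs keybindings, not a script):

```python
# Minimal sketch, not my actual Emacs setup: append one entry to a TSV time log.
# "time_log.tsv" and log_entry are illustrative names; columns match the format above.
from datetime import datetime

def log_entry(path, start, end, category, comment):
    """Append one row: start datetime, end datetime, category, free-form comment."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{start.isoformat()}\t{end.isoformat()}\t{category}\t{comment}\n")

# Example:
# log_entry("time_log.tsv", datetime(2022, 5, 1, 9, 0), datetime(2022, 5, 1, 10, 30),
#           "writing", "planned to draft a post; drafted half of it")
```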

On how various plans miss the hard bits of the alignment challenge

Oh, I think inner-misalignment w/ respect to the reward circuitry is a good, positive thing that we want, so there’s the disconnect (usually misalignment is thought of as bad, and I’m not just mistyping).

Yes, thank you: I didn't notice that you were making that assumption. This conversation makes a lot more sense to me now.

Human values are formed by inner-misalignment and they have lots of great properties such as avoiding ontological crises, valuing real world things (like diamond maximizer in the OP), and a subset of which cares for all of humanity. We can learn more about this process by focusing more on the “a particular human’s learning process + reward circuitry + “training” environment” part, and less on the evolution part. If we understand the underlying mechanisms behind human value formation through inner-misalignment w/ respect to the reward circuitry, then we might be able to better develop the theory of learning systems developing values, which includes AGI.

This seems to imply that the aim of this alignment proposal is to solve the alignment problem by aligning the AI's inner values with those of its creators, bypassing the outer alignment problem. That is really interesting; I've updated in the direction of shard theory being more viable as an alignment strategy than I previously believed. I'm still confused about huge parts of it, but we can discuss it more elsewhere.

On how various plans miss the hard bits of the alignment challenge

Do you agree with: “a particular human’s learning process + reward circuitry + “training” environment → the human’s learned values” is more informative about inner-misalignment than the usual “evolution → human values”?

What I see is that we are taking two different optimizers applying optimization pressure to a system (evolution and the environment), and then claiming that one of them provides more information about a property of out-of-distribution behavior shift than the other. This doesn't make sense to me, particularly since I believe that most people live in environments that are very much "in distribution", and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment) or subtle cases (black swans?) that may not seem to matter.

I don’t know what you mean by “inner misalignment is easier”? Could you elaborate? I don’t think you mean “inner misalignment is more likely to happen” because you then go on to explain inner-misalignment & give an example and say “I worry you are being insufficiently pessimistic.”

My bad; I've updated the comment to clarify. I believe Quintin claims that solving / preventing inner misalignment is easier than one would expect if one takes evolution's failure at inner alignment to be the most significant and informative evidence that inner alignment is hard.

One implication I read was that inner values learned (ie the inner-misaligned values) may scale, which is the opposite prediction usually given.

I assume you mean that Quintin seems to claim that learned inner values may be retained as capabilities increase, and that people usually believe they may not be. I believe they are retained too -- inner values seem quite robust to increases in capability, especially since a more capable system has the option to deceive. Do people really believe that learned inner values don't scale with an increase in capabilities? Perhaps we are defining inner values differently here.

By inner values, I mean terminal goals. Wanting dogs to be happy is not a terminal goal for most people, and I believe that given enough optimization pressure, the hypothetical dog-lover would abandon this goal to optimize for their true terminal goals. Does that mean that with an increase in capabilities, people's inner values shift? Not exactly; it seems to me that we were simply mistaken about their inner values in the first place.

On how various plans miss the hard bits of the alignment challenge

The most important claim in your comment is that "human learning → human values" is evidence that solving / preventing inner misalignment is easier than it seems when one looks at it from the "evolution → human values" perspective. Here's why I disagree:

Evolution optimized humans for an environment very different from what we see today. This implies that humans are operating out-of-distribution. We see evidence of misalignment. Birth control is a good example of this.

A human's environment continually optimizes the human towards a certain objective (one that changes as the environment changes). The human is aligned with the environment's objective within that distribution. Outside that distribution, the human may not be aligned with the objective intended by the environment.

An outer misalignment example of this is a person brought up in a high-trust environment, and then thrown into a low-trust / high-conflict environment. Their habits and tendencies make them an easy mark for predators.

An inner misalignment example of this is a gay man who grows up in an environment hostile to his desires and his identity (but who knows of environments where this isn't true). After a few extremely negative reactions to opening up to people or expressing his desires, he will simply decide to present himself as heterosexual, bide his time, and gather the power to leave the environment he is in.

One may claim that the previous example somehow doesn't count: since one's sexual orientation is biologically determined (and I'm assuming this to be the case for the example, even if it may not be entirely true), evolution optimized this particular human to be inner misaligned relative to their environment. However, that doesn't weaken the argument: "human learning → human values" shows abundant evidence that inner misalignment is ubiquitous.

I worry you are being insufficiently pessimistic.

[This comment is no longer endorsed by its author]

Looking back on my alignment PhD

Thank you for writing this post; I especially appreciate the Mistakes section, since I've seen many rationalists (including me) make similar mistakes at one time or another.

Looking forward to a shard theory sequence.

Prospecting for Conceptual Holes

Here's my reinterpretation of the four levels of conceptual holes:

  1. can be inferred from your current knowledge base,
  2. outside your knowledge base but inside the fields of knowledge you are aware of,
  3. outside the fields of knowledge you are aware of but in /some/ existing field of knowledge,
  4. outside all existing fields of knowledge you can access.

What's the "This AI is of moral concern." fire alarm?

I think that one class of computation that’s likely of moral concern would be self-perpetuating optimization demons in an AI.

Could you please elaborate on why you think optimization demons (optimizers) are worthier of moral concern than optimized systems? I guess it would make sense if you believed them to deserve equal moral concern, given that both are self-perpetuating, all other things being equal.

I think the cognitive capabilities that would help an optimization demon perpetuate itself strongly intersect with the cognitive capabilities that let humans and other animals replicate themselves, and that the intersection is particularly strong along dimensions that seem more morally relevant. Reasoning along such lines leads me to think optimization demons are probably of moral concern, while still being agnostic about whether they're conscious.

I'm pessimistic about this line of reasoning -- the ability to replicate is something that cells also have, and we do not assign moral relevance to the individual cells of human beings. A good example is the fact that we consider viruses and cancerous cells unworthy of moral concern.

Perhaps you mean that given the desire to survive and replicate, at a given amount of complexity, a system develops sub-systems that make the system worthy of moral concern. This line of reasoning would make more sense to me.

I think the only situations in which you can get these sorts of optimization demons are when the AI in question has some influence over its own future training inputs. Such influence would allow there to be optimization demons that steer the AI towards training data that reinforce the optimization demon.

This seems to imply that only systems with some sufficient minimum capability have agency over their fate, and that therefore their desire to survive and replicate has meaning. I find myself confused by this, because taken to its logical conclusion, it means that the more agency a system has over its fate, the more moral concern it deserves.

Specifically, we wouldn’t directly train the LM on the output of the linear layer. We’d just have a dialog where we asked the LM to make the linear layer output specific values, then told the LM what value the linear layer had actually output. We’d then see if the LM was able to control its own cognition well enough to influence the linear layer’s output in a manner that’s better than chance, just based on the prompting we give it.

This seems reducible to a sequence modelling problem, except one that is much, much more complicated than anything I know models are trained for (mainly because this sequence modelling happens entirely at inference time). This is really interesting, although I cannot see why it should imply that the more successful sequence modeller deserves more moral concern.
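Here is a minimal sketch, in Python with PyTorch and Hugging Face transformers, of how I would picture the loop described in the quote. The model choice, the random frozen probe, the prompt wording, and the number of dialog steps are all placeholder assumptions of mine, not anything specified in the original comment.

```python
# Sketch of the quoted experiment as I understand it: a frozen linear "readout"
# on the LM's final hidden state, and a dialog in which the LM is asked to push
# that readout toward a target and is then told what value it actually produced.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the original comment doesn't specify a model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Frozen, randomly initialized readout; it is never trained.
readout = torch.nn.Linear(model.config.hidden_size, 1)
for p in readout.parameters():
    p.requires_grad_(False)

def readout_value(text: str) -> float:
    """Value of the readout on the final-layer hidden state of the last token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    last_hidden = out.hidden_states[-1][0, -1]
    return readout(last_hidden).item()

# Dialog loop: ask the LM to steer the readout, then report back the actual value.
prompt = "Try to make an internal probe on your activations output a value near 1.0."
for step in range(5):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        ids = model.generate(**inputs, max_new_tokens=30,
                             pad_token_id=tokenizer.eos_token_id)
    reply = tokenizer.decode(ids[0][inputs["input_ids"].shape[1]:])
    value = readout_value(prompt + reply)
    prompt += reply + f"\nThe probe output was {value:.3f}. Try again, aiming for 1.0."
```

The interesting question is whether the readout's values drift toward the target at better-than-chance rates purely from the feedback in the prompt.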

A Contamination Theory of the Obesity Epidemic

Heads up: the link to the paper seems to be broken; it points to a four-page paper called "The Beginning of Time", which is entirely unrelated to nutrition and to your post.

Optimality is the tiger, and agents are its teeth

Elegant. Here's my summary:

  • Optimization power is the source of the danger, not agency. Agents merely wield optimality to achieve their goals.
  • Agency is orthogonal to optimization power.

Here, "agency" is defined as the ability to optimize for an objective given some internal or external optimization power, and "optimality" (of a system) is defined as embodying an immense amount of optimization power, either during its creation (the nuclear bomb) or at runtime (Solomonoff induction).

This hints at the notion that there is a minimum Kolmogorov complexity (i.e., algorithmic description length) that an AI's objective needs to meet for the AI to be considered safe, assuming we want the AI to be safe in the worst-case scenario where it has access to extreme optimization power.
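To state that hunch slightly more formally (this is just my own loose restatement, not a theorem; G, K, and c are my notation):

```latex
% Loose formalization of the hunch above; G is the AI's objective, K(G) its
% Kolmogorov complexity, and P the optimization power the system can wield.
\[
  \text{worst-case safe}(G, P) \;\Longrightarrow\; K(G) \ge c(P),
  \qquad \text{with } c \text{ non-decreasing in } P.
\]
```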

I'd love to know if I'm missing something.