“Nono, you have been misled. I *do* have a hero license.”



Still the only anime with what at least half-passes for a good ending. Food for thought, thanks! 👍

I've been exploring evolutionary metaphors for ML, so here's a toy metaphor for RLHF: recessive persistence. (Still just trying to learn both fields, however.)

"Since loss-of-function mutations tend to be recessive (given that dominant mutations of this type generally prevent the organism from reproducing and thereby passing the gene on to the next generation), the result of any cross between the two populations will be fitter than the parent." (k)


Recessive alleles persist due to overdominance, which lets detrimental alleles hitchhike on their fitness-enhancing dominant counterparts. The detrimental effects on fitness only show up when two recessive alleles inhabit the same locus, which can be rare enough that the dominant allele still causes the pair to be selected for in a stable equilibrium.
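Here's a minimal sketch of that equilibrium, assuming the textbook one-locus, two-allele selection model with random mating. The function names and the fitness values are mine and purely illustrative. The heterozygote `Aa` is fittest, the recessive homozygote `aa` is heavily penalised, and the recessive allele still settles at a stable nonzero frequency.

```python
def next_freq(q, w_AA, w_Aa, w_aa):
    """One generation of selection. q is the frequency of the recessive allele a."""
    p = 1.0 - q
    # Mean fitness under random mating (Hardy-Weinberg genotype proportions).
    mean_w = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa
    # Share of a-alleles among the survivors.
    return (p * q * w_Aa + q * q * w_aa) / mean_w


def simulate(q0, w_AA, w_Aa, w_aa, generations):
    """Iterate selection for a fixed number of generations and return the final q."""
    q = q0
    for _ in range(generations):
        q = next_freq(q, w_AA, w_Aa, w_aa)
    return q


# Illustrative (made-up) fitnesses: heterozygote advantage, aa strongly harmful.
s, t = 0.1, 0.8            # fitness costs of AA and aa relative to Aa
w_AA, w_Aa, w_aa = 1 - s, 1.0, 1 - t

q_final = simulate(0.01, w_AA, w_Aa, w_aa, generations=2000)
q_predicted = s / (s + t)  # standard overdominance equilibrium for this model
print(q_final, q_predicted)
```

Starting the recessive allele rare or common gives the same end point in this model, which is what "stable equilibrium" means here.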

The metaphor with deception breaks down due to the unit of selection. Parts of DNA are stuck much closer together than neurons in the brain or parameters in a neural network. They're passed down or reinforced in bulk. This is what makes hitchhiking so common in genetic evolution.

(I imagine you can have chunks that are updated together for a while in ML as well, but I expect that to be transient and uncommon. Idk.)

Bonus point: recessive phase shift.

"Allele-frequency change under directional selection favoring (black) a dominant advantageous allele and (red) a recessive advantageous allele." (source)
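To make the shape of those two curves concrete, here's a rough sketch under the same kind of one-locus selection model. The function name, the selection coefficient and the starting frequency are my own illustrative assumptions, not taken from the source figure. It counts how many generations an advantageous allele needs to go from rare to 50% when it is dominant versus when it is recessive.

```python
def generations_to_reach(target, p0, w_AA, w_Aa, w_aa, max_generations=1_000_000):
    """Generations until the advantageous allele A reaches `target` frequency.

    Returns None if the target is not reached within max_generations.
    """
    p = p0
    for gen in range(max_generations):
        if p >= target:
            return gen
        q = 1.0 - p
        mean_w = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa
        p = (p * p * w_AA + p * q * w_Aa) / mean_w
    return None


s = 0.1    # illustrative selective advantage
p0 = 0.01  # the advantageous allele starts out rare

# Dominant advantage: one copy is enough to get the benefit.
dominant = generations_to_reach(0.5, p0, 1 + s, 1 + s, 1.0)
# Recessive advantage: the benefit only shows up in AA homozygotes.
recessive = generations_to_reach(0.5, p0, 1 + s, 1.0, 1.0)

print("dominant:", dominant, "generations")
print("recessive:", recessive, "generations")
```

The recessive case crawls for a long time because rare alleles almost never meet as a pair, then speeds up once the allele is common enough to be expressed, which is the "phase shift" look.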

In ML:

  1. A generalisable, non-memorising pattern starts out small/sparse/simple.
  2. Which means that input patterns rarely activate it, because it's a small target to hit.
  3. But most of the time it is activated, it gets reinforced (at least more reliably than memorised patterns).
  4. So it gradually causes upstream neurons to point to it with greater weight, taking up more of the input range over time. Kinda like a distributed bottleneck.
  5. Some magic exponential thing, and then phase shift!

One way the metaphor partially breaks down is that DNA doesn't have weight decay at all, which allows recessive beneficial mutations to very slowly approach fixation.

Eigen's paradox is one of the most intractable puzzles in the study of the origins of life. It is thought that the error threshold concept described above limits the size of self replicating molecules to perhaps a few hundred digits, yet almost all life on earth requires much longer molecules to encode their genetic information. This problem is handled in living cells by enzymes that repair mutations, allowing the encoding molecules to reach sizes on the order of millions of base pairs. These large molecules must, of course, encode the very enzymes that repair them, and herein lies Eigen's paradox...

(I'm not making any point, just wanted to point to interesting related thing.)

Seems like Andy Matuschak feels the same way about spaced repetition being a great tool for innovation.

I like the framing. Seems generally usefwl somehow. If you see someone believing something you think is inconsistent, think about how to money-pump them. If you can't, then are you sure they're being inconsistent? Of course, there are lots of inconsistent beliefs that you can't money-pump, but seems usefwl to have a habit of checking. Thanks!

How do you account for the fact that the impact of a particular contribution to object-level alignment research can compound over time?

  1. Let's say I have a technical alignment idea now that is both hard to learn and very usefwl, such that every recipient of it does alignment research a little more efficiently. But it takes time before that idea disseminates across the community.
    1. At first, only a few people bother to learn it sufficiently to understand that it's valuable. But every person that does so adds to the total strength of the signal that tells the rest of the community that they should prioritise learning this.
    2. Not sure if this is the right framework, but let's say that researchers will only bother learning it if the strength of the signal hits their person-specific threshold for prioritising it.
    3. The number of researchers is normally distributed (or something) over threshold height, and the strength of the signal starts out below the peak of the distribution.
    4. Then (under some assumptions about the strength of individual signals and the distribution of threshold height), every learner that adds to the signal will, at first, attract more than one learner that adds to the signal, until the signal passes the peak of the distribution and the idea reaches satiation/fixation in the community.
  2. If something like the above model is correct, then the impact of alignment research plausibly goes down over time.
    1. But the same is true of a lot of time-buying work (like outreach). I don't know how to balance this, but I am now a little more skeptical of the relative value of buying time.
  3. Importantly, this is not the same as "outreach". Strong technical alignment ideas are most likely incompatible with almost everyone outside the community, so the idea doesn't increase the number of people working on alignment.
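The threshold model in point 1 can be sketched as a toy simulation. This is only one way to formalise it, and every name and number below is a made-up assumption: thresholds are evenly spaced quantiles of a normal distribution, each learner adds a fixed amount to the signal, and a researcher learns the idea once the signal reaches their threshold.

```python
from statistics import NormalDist


def simulate_cascade(n, mu, sigma, initial_signal, per_learner=1.0, max_rounds=1000):
    """Return the number of learners after each round until nothing changes.

    n              -- number of researchers (hypothetical)
    mu, sigma      -- mean and spread of the person-specific thresholds
    initial_signal -- signal strength before anyone else has learned the idea
    per_learner    -- how much each learner adds to the signal
    """
    dist = NormalDist(mu, sigma)
    # Deterministic stand-in for "normally distributed thresholds".
    thresholds = [dist.inv_cdf((i + 0.5) / n) for i in range(n)]
    history = [0]
    for _ in range(max_rounds):
        signal = initial_signal + per_learner * history[-1]
        learners = sum(1 for t in thresholds if t <= signal)
        if learners == history[-1]:
            break  # fixed point: no new learner is attracted
        history.append(learners)
    return history


# Strong enough initial signal: each round attracts more learners than the last.
spreading = simulate_cascade(n=1000, mu=300, sigma=100, initial_signal=100)
# Weak initial signal: the idea stalls with almost nobody learning it.
stalled = simulate_cascade(n=1000, mu=300, sigma=100, initial_signal=0)

print("spreading:", spreading)
print("stalled:", stalled)
```

In this toy version, whether the idea saturates or stalls depends heavily on where the initial signal sits relative to the threshold distribution, which is the part of the model that makes early contributions matter so much.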

That's fair, but sorry[1] I misstated my intended question. I meant that I was under the impression that you didn't understand the argument, not that you didn't understand the action they advocated for.

I understand that your post and this post argue for actions that are similar in effect. And your post is definitely relevant to the question I asked in my first comment, so I appreciate you linking it.

  1. ^

    Actually sorry. Asking someone a question that you don't expect yourself or the person to benefit from is not nice, even if it was just due to careless phrasing. I just wasted your time.

No, this isn't the same. If you wish, you could try to restate what I think the main point of this post is, and I could say if I think that's accurate. At the moment, it seems to me like you're misunderstanding what this post is saying.

I would not have made this update by reading your post, and I think you are saying very different things. The thing I updated on from this post wasn't "let's try to persuade AI people to do safety instead," it was the following:

If I am capable of doing an average amount of alignment work $\bar{x}$ per unit time, and I have $T$ units of time available before the development of transformative AI, I will have contributed $\bar{x} T$ work. But if I expect to delay transformative AI by $t$ units of time if I focus on it, everyone will have that additional time to do alignment work, which means my impact is $\bar{x} N t$, where $N$ is the number of people doing work. Naively then, if $N t > T$, I should be focusing on buying time.[1]

  1. ^

    This assumes time-buying and direct alignment-work is independent, whereas I expect doing either will help with the other to some extent.
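As a worked version of the buying-time comparison above: the symbols here are my own notation ($\bar{x}$ for average work per unit time, $T$ for time left, $t$ for the delay bought, $N$ for the number of people doing alignment work), and the numbers are invented purely for illustration.

```latex
\begin{align*}
\text{direct work} &= \bar{x}\,T \\
\text{buying time} &= \bar{x}\,N\,t \\
\frac{\text{buying time}}{\text{direct work}} &= \frac{N\,t}{T} \\[4pt]
\text{e.g. } N = 300,\ T = 10\ \text{years},\ t = 0.1\ \text{years}
  &\;\Rightarrow\; \frac{N\,t}{T} = \frac{300 \times 0.1}{10} = 3
\end{align*}
```

So under these made-up numbers, a delay of about five weeks would naively be worth three times a decade of one person's direct work. The comparison is only as good as the independence assumption in the footnote.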

A concrete suggestion for a buying-time intervention is to develop plans and coordination mechanisms (e.g. assurance contracts) for major AI actors/labs to agree to pay a fixed percentage alignment tax (in terms of compute) conditional on other actors also paying that percentage. I think it's highly unlikely that this is new to you, but didn't want to bystander just in case.

A second point is that there is a limited number of supercomputers that are anywhere close to the capacity of top supercomputers. The #10 most powerfwl is 0.005% as powerfwl as the #1. So it could be worth looking into facilitating coordination between them.

Perhaps one major advantage of focusing on supercomputer coordination is that the people who can make the relevant decisions[1] may not actually have any financial incentives to participate in the race for new AI systems. They have financial incentives to let companies use their hardware to train AIs, naturally, but they could be financially indifferent to how those AIs are trained.

In fact, if they can manage to coordinate it via something like an assurance contract, they may have a collective incentive to demand that AIs are trained in safer alignment-tax-paying ways, because then companies have to buy more computing time for the same level of AI performance. That's too much to hope for. The main point is just that their incentives may not have a race dynamic.

Who knows.

  1. ^

    Maybe the relevant chain of command goes up to high government in some cases, or maybe there are key individuals or small groups who have relevant power to decide.
