Epistemic status: highly uncertain and fairly confused, seeking out opinions

First let me just post the definition of (biological) gain-of-function research from Wikipedia, so that everyone is on the same page:

Gain-of-function research (GoF research or GoFR) is medical research that genetically alters an organism in a way that may enhance the biological functions of gene products. This may include an altered pathogenesis, transmissibility, or host range, i.e. the types of hosts that a microorganism can infect. This research is intended to reveal targets to better predict emerging infectious diseases and to develop vaccines and therapeutics. For example, influenza B can infect only humans and harbor seals.[1] Introducing a mutation that would allow influenza B to infect rabbits in a controlled laboratory situation would be considered a gain-of-function experiment, as the virus did not previously have that function.[2][3] That type of experiment could then help reveal which parts of the virus's genome correspond to the species that it can infect, enabling the creation of antiviral medicines which block this function.[3]

The question I have is, should we have similar intuitions about certain kinds of AI and/or AI safety research? This is related in a complicated way to AI capabilities research, so let's define some terms carefully:

  • AI capabilities: the ability of AI systems to do things that we want them to do (which they may also be able to use in ways we do not want them to use) such as knowing whether a picture is of a cat, or knowing how to cure cancer
  • Gain of function: the ability of AI systems to do things that no sane person wants them to do, such as, I don't know, knowing whether they are deployed in the real world or not

This is more of a spectrum than two distinct classes; for instance, which class "recursive self-improvement" belongs to probably depends on how seriously you take AI risk in the first place. And one could imagine a sufficiently edgy or depressed person wanting to program AIs to do all manner of nonsense.

There are then a number of questions:

  • Should people be doing AI gain-of-function research at all?
  • If they shouldn't, how will we ever test proposed alignment solutions? Or are AI safety researchers expected to get everything important right the first time it matters, without any experimental evidence to guide them beforehand?
  • If they should, are there any special precautions that should be taken?
    • For instance, posting open-source code on GitHub currently adds it to the Copilot training dataset, thereby potentially injecting similar bits of code into many, many places, important or unimportant AI projects among them. More generally, posting code online may add it to the training datasets of many large language models, some of which may eventually be scary in their own right.
    • Should there be limits on what sort of systems one uses to test out components? (E.g., anything more complex than a grid-world is suspect, or anything that uses natural language is suspect)

Why does this matter? Suppose that we identify some capability (or group of capabilities) that would be necessary and/or sufficient for a model to disempower humans. (One might call such a thing an "ascension kit", after the strategic term in NetHack lore.) How would we detect whether a system has assembled, or is trying to assemble, an ascension kit? Implementing parts of an ascension kit seems like obvious gain-of-function research, but without doing so, how else would we be able to test any such detection toolkit?
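To make the detection question slightly more concrete, here is a deliberately toy sketch of what a capability-checklist monitor might look like. Everything here is invented for illustration: the capability names, the idea that capabilities arrive as a flat list of labels, and the `kit_progress` function are all hypothetical, and a real detection toolkit would face the much harder problem of recognizing these capabilities in the first place.

```python
# Hypothetical sketch: track what fraction of a posited "ascension kit"
# has been observed in a system. The capability names below are invented
# placeholders, not a real taxonomy.

ASCENSION_KIT = {
    "situational_awareness",   # e.g., knowing whether it is deployed
    "self_replication",        # e.g., copying itself to new hardware
    "resource_acquisition",    # e.g., obtaining money or compute
}

def kit_progress(observed_capabilities):
    """Return the fraction of the (hypothetical) kit observed so far."""
    observed = ASCENSION_KIT & set(observed_capabilities)
    return len(observed) / len(ASCENSION_KIT)

print(kit_progress(["situational_awareness", "resource_acquisition"]))
```

Even this trivial monitor illustrates the bind in the text: to check that `kit_progress` flags the right things, you would want a test system that actually exhibits some of these capabilities, which is exactly the gain-of-function step in question.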

Lastly, if there are related questions about this topic, I would love it if people posted them in the comments.

2 comments:

This definition of gain of function is about gaining very specific functions: pathogenesis, transmissibility, or host range. It does not include a function like a bacterium gaining the capability to produce insulin.

When it comes to AI research, the corresponding approach would be to think about which functions we consider dangerous to develop. You likely don't want DeepMind's Gato to have finding zero-day exploits as one of its training tasks.

It might be worth thinking about which tasks for Gato might be analogous to the functions we mean when we speak about gain-of-function research in biology.

All technological development is gain-of-function research on matter. What matters is how dangerous the gained function is.