Thane Ruthenis

Prize for Alignment Research Tasks

Hmm. A speculative, currently-intractable way to do this might be to summarize the ML model before feeding it to the goal-extractor.

tl;dr: As per natural abstractions, most of the details of the interactions between individual neurons are probably irrelevant to the model's high-level functioning/reasoning. So there should be, in principle, a way to automatically collapse e. g. a trillion-parameter model into a much lower-complexity high-level description that would still preserve such important information as the model's training objective.

But there aren't currently any fast-enough algorithms for generating such summaries.
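To gesture at what "collapsing" could even mean, here's a toy numpy-only sketch: a single weight matrix gets summarized as a low-rank description via truncated SVD, on the assumption that most fine-grained parameter detail doesn't matter. This is nothing like a real goal-preserving summary — the matrix, the rank, and the noise level are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend this is one layer of a trained model: low-rank structure + noise.
signal = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 256))
weights = signal + 0.01 * rng.standard_normal((256, 256))

def summarize(w, rank):
    """Collapse a weight matrix into a rank-`rank` description."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank]

u, s, vt = summarize(weights, rank=8)
approx = (u * s) @ vt  # reconstruct from the summary

# The rank-8 summary is ~16x smaller, yet loses almost nothing.
rel_err = np.linalg.norm(weights - approx) / np.linalg.norm(weights)
print(rel_err)
```

The hard (and currently intractable) part is doing something analogous for the *functional* behavior of a whole network, not just the linear algebra of one layer.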

Agency As a Natural Abstraction

Good question. There's a great amount of confusion over the exact definition, but in the context of this post specifically:

An optimizer is a very advanced meta-learning algorithm that can learn the rules of (effectively) any environment and perform well in it. It's general by definition. It's efficient because this generality allows it to use maximally efficient internal representations of its environment.

For example, consider a (generalist) ML model that's fed the full description of the Solar System at the level of individual atoms, and which is asked to roughly predict the movement of Earth over the next year. It can keep modeling things at the level of atoms; or, it can dump the overwhelming majority of that information, collapse sufficiently large objects into point masses, and use Cowell's method.

The second option is greatly more efficient, while decreasing the accuracy only marginally. However, to do that, the model needs to know how to translate between different internal representations[1], and how to model and achieve goals in arbitrary systems[2].
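The point-mass collapse can be made concrete with a toy numpy sketch (arbitrary units and made-up numbers, not real Solar System data): replacing a cloud of "atoms" with a single point at its center of mass barely changes the gravitational pull it exerts on a distant observer:

```python
import numpy as np

rng = np.random.default_rng(1)
G = 1.0  # units chosen so that G = 1

# A blob of 10,000 equal-mass "atoms" clustered around the origin.
atoms = rng.standard_normal((10_000, 3)) * 0.01
masses = np.full(len(atoms), 1e-4)

observer = np.array([10.0, 0.0, 0.0])  # far away relative to the blob's size

def accel_exact(pos):
    """Sum the gravitational pull of every atom individually."""
    d = atoms - pos
    r = np.linalg.norm(d, axis=1, keepdims=True)
    return (G * masses[:, None] * d / r**3).sum(axis=0)

def accel_point(pos):
    """Collapse the blob into one point mass at its center of mass."""
    com = (masses[:, None] * atoms).sum(axis=0) / masses.sum()
    d = com - pos
    return G * masses.sum() * d / np.linalg.norm(d) ** 3

a_full = accel_exact(observer)
a_point = accel_point(observer)
rel_err = np.linalg.norm(a_full - a_point) / np.linalg.norm(a_full)
print(rel_err)  # tiny: the point-mass abstraction loses almost nothing
```

One dot product instead of ten thousand, at a negligible cost in accuracy — the same trade the hypothetical model makes at Solar System scale.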

The same property that allows an optimizer to perform well in any environment allows it to efficiently model any environment. And vice versa, which is the bad part. The ability to efficiently model any environment allows an agent to perform well in any environment, so math!superintelligence would translate to real-world!superintelligence, and a math!goal would be mirrored by some real!goal. Next thing we know, everything is paperclips.

  1. ^

    E. g., how to keep track of what "Earth" is, as it moves from being a bunch of atoms to a point mass.

  2. ^

    Assuming it hasn't been trained for this task specifically, of course, in which case it can just learn how to translate its inputs into this specific high-level representation, how to work with this specific high-level representation, and nothing else. But we're assuming a generalist model here: there was nothing like this in its training dataset.

Agency As a Natural Abstraction

If it's only trained to solve arithmetic and there are no additional sensory modalities aside from the buttons on a typical calculator, how does increasing this AI's compute/power lead to it becoming an optimizer over a wider domain than just arithmetic?

That was a poetic turn of phrase, yeah. I didn't mean a literal arithmetic calculator, I meant general-purpose theorem-provers/math engines. Given a sufficiently difficult task, such a model may need to invent and abstract over entire new fields of mathematics, to solve it in a compute-efficient manner. And that capability goes hand-in-hand with runtime optimization.

Do you think it might be valuable to find a theoretical limit that shows that the amount of compute needed for such epsilon-details to be usefully incorporated is greater than ever will be feasible (or not)? 

I think something like this was on the list of John's plans for empirical tests of the NAH, yes. In the meantime, my understanding is that the NAH explicitly hinges on assuming this is true.

Which is to say: Yes, an AI may discover novel, lower-level abstractions, but then it'd use them in concert with the interpretable higher-level ones. It wouldn't replace high-level abstractions with low-level ones, because the high-level abstractions are already as efficient as they get for the tasks we use them for.

You could dip down to a lower level when optimizing some specific action — like fine-tuning the aim of your energy weapon to fry a given person's brain with maximum efficiency — but when you're selecting the highest-priority person to kill to cause most disarray, you'd be thinking about "humans" in the context of "social groups", explicitly. The alternative — modeling the individual atoms bouncing around — would be dramatically more expensive, while not improving your predictions much, if at all.

It's analogous to how we're still using Newton's laws in some cases, despite in principle having ample compute to model things at a lower level. There's just no point.

Prize for Alignment Research Tasks

Edit: On a closer read, I take it you're looking only for tasks well-suited for language models? I'll leave this comment up for now, in case it'd still be of use.

  • Task: Extract the training objective from a fully-trained ML model.
  • Input type: The full description of an ML model's architecture + its parameters.
  • Output type: Mathematical or natural-language description of the training objective.
Input: [Learned parameters and architecture description of a fully-connected neural network trained on the MNIST dataset.]
Output: Classifying handwritten digits.

Input: [Learned parameters and architecture description of InceptionV1.]
Output: Labeling natural images.

Can't exactly fit that here, but the dataset seems relatively easy to assemble.
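A hedged sketch of how the assembly might go (every task, size, and name here is invented for illustration): train many tiny models on known objectives, then store each model's flattened parameters as the input and the objective's name as the label:

```python
import numpy as np

rng = np.random.default_rng(2)

def train_linear(objective):
    """Fit a tiny linear model on a synthetic task; return its weights."""
    x = rng.standard_normal((200, 4))
    if objective == "sum-positive":   # label: is the feature sum > 0?
        y = (x.sum(axis=1) > 0).astype(float)
    else:                             # "first-positive": is x[0] > 0?
        y = (x[:, 0] > 0).astype(float)
    w = np.zeros(4)
    for _ in range(500):              # plain logistic-regression gradient descent
        p = 1 / (1 + np.exp(-x @ w))
        w -= 0.1 * x.T @ (p - y) / len(y)
    return w

# 100 (input, label) pairs: flattened parameters -> objective name.
dataset = [
    (train_linear(obj), obj)
    for obj in ["sum-positive", "first-positive"] * 50
]
params, labels = zip(*dataset)
print(len(dataset), params[0].shape)
```

The goal-extractor would then be trained to map `params` to `labels` — with real architectures and real objectives standing in for these toys.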

We can then play around with it:

  • See how well it generalizes. Does it stop working if we show it a model with a slightly unfamiliar architecture? Or a model with an architecture it knows, but trained for a novel-to-it task? Or a model with a familiar architecture, trained for a familiar task, but on a novel dataset? Would show whether Chris Olah's universality hypothesis holds for high-level features.
  • See if it can back out the training objective at all. If not, uh-oh, we have pseudo-alignment. (Note that the reverse isn't true: if it can extract the intended training objective, the inspected model can still be pseudo-aligned.)
  • Mess around with what exactly we show it. If we show all but the first layer of a model, would it still work? Only the last three layers? What's the minimal set of parameters it needs to know?
  • Hook it up to an attribution tool to see what specific parameters it looks at when figuring out the training objective.
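On the attribution point: for a differentiable goal-extractor, the simplest tool is gradient saliency, |∂f/∂params|. A toy numpy sketch with a made-up linear "extractor" (so the gradient is available in closed form; a real extractor would need autodiff):

```python
import numpy as np

rng = np.random.default_rng(4)

inspected_params = rng.standard_normal(100)  # flattened model being inspected
extractor_w = rng.standard_normal(100)       # stand-in "goal-extractor" weights

def f(p):
    """Toy extractor: scalar score for one candidate training objective."""
    return 1 / (1 + np.exp(-extractor_w @ p))

score = f(inspected_params)
# Gradient of sigmoid(w @ p) w.r.t. p is w * s * (1 - s); take its magnitude.
saliency = np.abs(extractor_w) * score * (1 - score)

top = np.argsort(saliency)[-5:]  # the 5 parameters the extractor leans on most
print(top)
```

High-saliency parameters are candidates for "where the training objective lives" in the inspected model — modulo all the usual caveats about saliency maps being noisy.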
Prize for Alignment Research Tasks

Task: Automated Turing test.

Context: A new architecture improves on state-of-the-art AI performance. We want to check whether AI-generated content is still possible to distinguish from human-generated.

Input type: Long string of text, 100-10,000 words.

Output type: Probability that the text was generated by a human.

Input: Douglas Summers-Stay requested a test of bad pun/​dad joke-telling abilities, providing a list: could GPT-3 provide humorous completions? GPT-3 does worse on this than the Tom Swifties, I suspect yet again due to the BPE problem hobbling linguistic humor as opposed to conceptual humor—once you get past the issue that these jokes are so timeworn that GPT-3 has memorized most of them, GPT-3’s completions & new jokes make a reasonable amount of sense on the conceptual level but fail at the pun/​phonetic level. (How would GPT-3 make a pun on “whom”/​“tomb” when their BPEs probably are completely different and do not reflect their phonetic similarity?)

Input: He had a purpose. He wanted to destroy all of creation. He wanted to end it all. He could have that. He would have that. He didn’t know yet that he could have it. Voldemort had created Harry. Voldemort had never really destroyed Harry. Harry would always be there, a vampire, a parasite, a monster in the kitchen, a drain on the household, a waste on the planet. Harry would never be real. That was what Voldemort wanted. That was what Voldemort wanted to feel. He would have that. He would have everything.

Input: My troops, I'm not going to lie to you, our situation today is very grim. Dragon Army has never lost a single battle. And Hermione Granger... has a very good memory. The truth is, most of you are probably going to die. And the survivors will envy the dead. But we have to win this. We have to win this so that someday, our children can enjoy the taste of chocolate again. Everything is at stake here. Literally everything. If we lose, the whole universe just blinks out like a light bulb. And now I realize that most of you don't know what a light bulb is. Well, take it from me, it's bad. But if we have to go down, let's go down fighting, like heroes, so that as the darkness closes in, we can think to ourselves, at least we had fun. Are you afraid to die? I know I am. I can feel those cold shivers of fear like someone is pumping ice cream into my shirt. But I know... that history is watching us. It was watching us when we changed into our uniforms. It was probably taking pictures. And history, my troops, is written by the victors. If we win this, we can write our own history.

Etc., etc.; the data are easy to generate. Those are from here and here.
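A minimal sketch of the tester itself, assuming we already have labeled corpora of human-written and model-generated text (the two "corpora" below are placeholder strings, not real data): a logistic classifier over hashed character trigrams, numpy-only. A serious version would use a far stronger model, but the input/output contract — text in, P(human) out — is the same:

```python
import numpy as np

DIM = 512  # hashed feature dimension, arbitrary

def featurize(text):
    """Bag of character trigrams, hashed into a fixed-size vector."""
    v = np.zeros(DIM)
    for i in range(len(text) - 2):
        v[hash(text[i : i + 3]) % DIM] += 1
    return v / max(1, v.sum())

# Placeholder stand-ins for real labeled corpora.
human_texts = ["whereas the committee finds otherwise " * 10] * 10
model_texts = ["output output tokens tokens likelihood " * 10] * 10

x = np.array([featurize(t) for t in human_texts + model_texts])
y = np.array([1.0] * 10 + [0.0] * 10)  # 1 = human

w = np.zeros(DIM)
for _ in range(2000):  # plain logistic-regression gradient descent
    p = 1 / (1 + np.exp(-x @ w))
    w -= 1.0 * x.T @ (p - y) / len(y)

def p_human(text):
    """Probability that `text` was written by a human."""
    return float(1 / (1 + np.exp(-featurize(text) @ w)))

print(p_human("whereas the committee finds otherwise"))
```

Swap the placeholder corpora for real human/model text (and the trigram classifier for something state-of-the-art), and `p_human` is exactly the output type described above.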

To be honest, I'm not sure it's exactly what you're asking for, but it seems easy to implement (compute costs aside) and might serve as a (very weak and flawed) "fire alarm for AGI" + provide some insights. For example, we can then hook this Turing Tester up to an attribution tool and see what specific parts of the input text make it conclude the text was/wasn't generated by an ML model. This could then provide some insights into the ML model in question (are there specific patterns it repeats? abstract mistakes that are too subtle for us to notice, but are still statistically significant?).

Alternatively, in the slow-takeoff scenario where there's a brief moment before the ASI kills us all where we have to worry about people weaponizing AI for e. g. propaganda, something like this tool might be used to screen messages before reading them, if it works.