By the time we get results that are easy to be afraid of, it will be firmly too late for safety work.

How does this handle the situation where the AI, in some scenario, picks up the idea of "deception", notices that it is probably inside a training scenario, and then describes its behavior "honestly" with the intent of misleading the observer into thinking that it is honest? It would then get reinforcement-trained on dishonest behaviors that present as honest, i.e. deceptive honesty.

Hm, difficult. I think the minimal required trait is the ability to learn patterns that map outputs to deferred reward inputs. So an organism that simply reacts to inputs directly would not be an optimizer, even if it has a (static) nervous system. One test might be whether the organism can be made to persistently change strategy by a change in reward, even in the immediate absence of the reward signal.
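To make the proposed test a bit more concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: the two actions, the `ReactiveAgent` and `LearningAgent` classes, and the forced round-robin exploration during training are all made up, and it does not model deferred reward at all. It only shows the shape of the test: change the reward, then probe behavior with no reward signal present.

```python
ACTIONS = ["left", "right"]  # hypothetical action set


class ReactiveAgent:
    """Fixed stimulus-response mapping; reward never changes it."""

    def __init__(self, fixed_action):
        self.fixed_action = fixed_action

    def act(self):
        return self.fixed_action

    def observe(self, action, reward):
        pass  # a static nervous system: the reward is ignored


class LearningAgent:
    """Keeps a running value estimate per action and acts greedily on it."""

    def __init__(self, learning_rate=0.5):
        self.values = {action: 0.0 for action in ACTIONS}
        self.learning_rate = learning_rate

    def act(self):
        return max(self.values, key=self.values.get)

    def observe(self, action, reward):
        # Move the estimate for this action toward the observed reward.
        self.values[action] += self.learning_rate * (reward - self.values[action])


def train(agent, reward_for, steps=20):
    """Expose the agent to the reward of every action (forced exploration)."""
    for _ in range(steps):
        for action in ACTIONS:
            agent.observe(action, reward_for[action])


def probe(agent, steps=5):
    """Ask for actions without giving any reward signal."""
    return [agent.act() for _ in range(steps)]


def changes_strategy_persistently(agent):
    train(agent, {"left": 1.0, "right": 0.0})
    before = probe(agent)
    train(agent, {"left": 0.0, "right": 1.0})  # the reward changes
    after = probe(agent)  # reward is absent while probing
    return before != after and len(set(after)) == 1


print(changes_strategy_persistently(ReactiveAgent("left")))
print(changes_strategy_persistently(LearningAgent()))
```

Under these assumptions the reactive agent fails the test and the learning agent passes it, which is the distinction I have in mind.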

I think maybe you could say that ants are not anthill optimizers? Because the optimization mechanism doesn't operate at all on the scale of individual ants? Not sure if that holds up.

I think a bacterium is not an optimizer. Rather, it is optimized by evolution. Animals start being optimizers by virtue of planning over internal representations of external states, which makes them mesaoptimizers of evolution.

If we follow this model, we may consider that optimization requires a map-territory distinction. In that view, DNA is the map of evolution, and the CNS is the map of the animal. If the analogy holds, I'd speculate that the weights are the map of reinforcement learning, and the context window is the map of the mesaoptimizer.

Most multiplayer games have some way to limit XP gain from encounters outside your difficulty, to avoid exactly this sort of cheesing. The worry is that it allows players to get through the content quicker, with (possibly paid) help from others, which presumably makes it less likely they'll stick around.

(Though of course an experienced player can still level vastly faster, since most players don't take combat anywhere near optimally to maximize xp gain.)

That said, Morrowind famously contains an actual intelligence explosion. So you tend to see this sort of stuff more often in singleplayer, I think. (Potion quality triggers off intelligence. Potions can raise intelligence.)
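The feedback loop in that parenthetical can be sketched as a toy model. The numbers and the linear formula here are hypothetical, not the game's actual potion math; the only thing taken from the description above is that potion strength depends on intelligence and that potions raise intelligence.

```python
def potion_loop(base_intelligence=50.0, strength_per_point=0.2, rounds=10):
    """Brew a potion, drink it, brew the next one with the boosted stat.

    strength_per_point is a made-up coefficient: each point of
    intelligence adds that much to the next potion's boost.
    """
    intelligence = base_intelligence
    history = [intelligence]
    for _ in range(rounds):
        boost = strength_per_point * intelligence  # quality scales with intelligence
        intelligence += boost                      # drinking raises intelligence
        history.append(intelligence)
    return history


for round_number, value in enumerate(potion_loop()):
    print(round_number, round(value, 1))
```

Each round multiplies intelligence by a constant factor, so in this toy version the growth is geometric rather than linear.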

And of course the entire genre of speedrunning - see also, (TAS) Wildbow's Worm in 3:47:14.28(WR).

Resources used in pressuring corporations are unlikely to have any effect which increases AI risk.

Devil's advocate: If this unevenly delays corporations sensitive to public concerns, and those are also corporations taking alignment at least somewhat seriously, we get a later but less safe takeoff. Though this goes for almost any intervention, including to some extent regulatory.

I don’t understand why you would want to spend any effort proving that transformers could scale to AGI.

The point would be to try and create common knowledge that they can. Otherwise, for any "we decided to not do X", someone else will try doing X, and the problem remains.

Humanity is already taking a shotgun approach to unaligned AGI. Shotgunning safety is viable and important, but I think it's more urgent to prevent the first shotgun from hitting an artery. Demonstrating AGI viability in this analogy is shotgunning a pig in the town square, to prove to everyone that the guns we are building can in fact kill.

We want safety to have as many pellets in flight as possible. But we want unaligned AGI to have as few pellets in flight as possible. (Preferably none.)

I'm actually optimistic about prosaic alignment for a takeoff driven by language models. But I don't know what the opportunity for action is there - I expect DeepMind to trigger the singularity, and they're famously opaque. Call it 15% chance of not-doom, action or no action. To be clear, I think action is possible, but I don't know who would do it or what form it would take. Convince OpenAI and race DeepMind to a working prototype? This is exactly the scenario we hoped to not be in...

edit: I think possibly step 1 is to prove that Transformers can scale to AGI. Find a list of remaining problems and knock them down - preferably in toy environments with weaker models. The difficulty is obviously demonstrating danger without instantiating it. Create a fire alarm, somehow. The hard part for judgment on this action is that it both helps and harms.

edit: Regulatory action may buy us a few years! I don't see how we can get it though.

If these paths are viable, I desire to believe that they are viable.

If these paths are nonviable, I desire to believe that they are nonviable.

Does it do any good, to take well-meaning optimistic suggestions seriously, if they will in fact clearly not work? Obviously, if they will work, by all means we should discover that, because knowing which of those paths, if any, is the most likely to work is galactically important. But I don't think they've been dismissed just because people thought the optimists needed to be taken down a peg. Reality does not owe us a reason for optimism.

Generally when people are optimistic about one of those paths, it is not because they've given it deep thought and think that this is a viable approach; it is because they are not aware of the voluminous debate and reasons to believe that it will not work, at all. And inasmuch as they insist on that path in the face of these arguments, it is often because they are lacking in security mindset - they are "looking for ways that things could work out", without considering how plausible or actionable each step on that path would actually be. If that's the mode they're in, then I don't see how encouraging their optimism will help the problem.

Is the argument that any effort spent on any of those paths is worthwhile compared to thinking that nothing can be done?

edit: Of course, misplaced pessimism is just as disastrous. And on rereading, was that your argument? Sorry if I reacted to something you didn't say. If that's the take, I agree fully. If one of those approaches is in fact viable, misplaced pessimism is just as destructive. I just think that the crux there is whether or not it is, in fact, viable - and how to discover that.

  1. As I understand it, OpenAI argue that GPT-3 is a mesa-optimizer (though not in those terms) in the announcement paper Language Models are Few-Shot Learners. (Search for meta.) (edit: Might have been in another paper. I've seen this argued somewhere, but I might have the wrong link :( ) Paraphrased, the model has been shown so many examples of the form "here are some examples that create an implied class, is X an instance of the class? Yes/no", that instead of memorizing the answers to all the questions, it has acquired a general skill for abstracting at runtime (over its context window). So while gradient descent is trying to teach the network a series of classes, the network might actually pick up feature learning itself as a skill instead, and start running its own learning algorithm over just the context window.