Thomas Kwa

Doing alignment research with Vivek Hebbar's team at MIRI.

In future posts, we will describe a more complete categorisation of these situations and how they relate to the AI alignment problem. 

Did this ever happen?

After talking to Eliezer, I now have a better sense of the generator of this list. It now seems pretty good and non-arbitrary, although there is still a large element of taste.

Suppose an agent has this altruistic empowerment objective, and the problem of getting an objective into the agent has been solved.

Wouldn't it be maximized by forcing the human to sit in front of a box that encrypts its actions and uses the resulting stream to determine the fate of the universe? Then the human would be maximally "in control" of the universe but unlikely to create a universe that's good by human preferences.

I think this reflects two problems:

  • Most injective functions from human actions to world-states are not "human decides the future based on its values"
  • The channel capacity is higher with the evil box than where the agent is aligned with human preferences, because the human might prefer to limit its own power. For example, the human could want others to have power, know that power corrupts its values, want itself to be unable to destroy the sun with a typo, or have constructed the optimal hedonium farms and not want to change them.
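To make the channel-capacity point concrete, here is a toy sketch. It assumes a noiseless, deterministic channel from human actions to world-states, in which case the capacity is just log2 of the number of distinct reachable world-states. The action space, the state names, and the helper `deterministic_capacity_bits` are all hypothetical and chosen only for illustration; real empowerment objectives are defined over stochastic, multi-step channels.

```python
import math


def deterministic_capacity_bits(channel):
    """Capacity (in bits) of a noiseless deterministic channel.

    `channel` maps each action to the world-state it produces. With no noise,
    max I(action; state) equals log2 of the number of distinct outputs.
    """
    return math.log2(len(set(channel.values())))


# Hypothetical action space: 8 possible action sequences (3 bits of input).
actions = range(8)

# "Evil box": every action sequence is mapped to a different world-state
# (an injective map), regardless of what the human actually values.
evil_box = {a: f"arbitrary_world_{a}" for a in actions}

# Agent aligned with a human who prefers to limit its own power: almost all
# actions (including typos) leave the preferred world unchanged.
self_limited = {
    a: ("keep_current_world" if a < 7 else "deliberate_small_change")
    for a in actions
}

evil_bits = deterministic_capacity_bits(evil_box)
limited_bits = deterministic_capacity_bits(self_limited)

print(f"evil box:     {evil_bits} bits")
print(f"self-limited: {limited_bits} bits")
```

In this toy setup the evil box scores strictly higher on capacity even though nothing about its mapping reflects the human's values, which is the sense in which both problems above show up.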

I'm offering a $300 bounty to anyone who gets 100 karma doing this this year (without any vote manipulation).

Manifold market for this:

They also separately believe that by the time an AI reaches superintelligence, it will in fact have oriented itself around a particular goal and have something like a goal slot in its cognition - but at that point, it won’t let us touch it, so the problem becomes we can't put our own objective into it.

My guess is this is a bit stronger than what Nate believes. The corresponding quote (emphasis mine) is

Separately and independently, I believe that by the time an AI has fully completed the transition to hard superintelligence, it will have ironed out a bunch of the wrinkles and will be oriented around a particular goal [...]

and I wouldn't be surprised myself if by the time an AI is superhuman at basically all tasks, it is still as incoherent as humans, especially if it uses more inference compute than a human brain.

Feynman once challenged people to come up with a problem that could be stated quickly but he couldn't solve to within 10% in a minute, and a colleague stumped him with finding .

I like this point, but I downvoted this because it didn't reference prior work.

Even if it has some merits, I find the "death with dignity" thing an unhelpful, mathematically flawed, and potentially emotionally damaging way to relate to the problem. Even if MIRI has not given up, I wouldn't be surprised if the general attitude of despair has substantially harmed the quality of MIRI research. Since I started as a contractor for MIRI in September, I've deliberately tried to avoid absorbing this emotional frame, and rather tried to focus on doing my job, which should be about computer science research. We'll see if this causes me problems.

I made a Manifold market for some key claims in this post:
