Joel Burget

Wiki Contributions


A central AI alignment problem: capabilities generalization, and the sharp left turn

Human values aren't a repeller, but they're a very narrow target to hit.

As optimization pressure is applied, the AI becomes more capable. In particular, it will develop a more detailed model of people and their values. So it seems to me there is actually a basin around schemes like CEV, which course-correct toward true human values.

This of course doesn't help with corrigibility.

A central AI alignment problem: capabilities generalization, and the sharp left turn

Two points:

  1. The visualization of capabilities improvements as an attractor basin is pretty well accepted and useful, I think. I kind of like the analogous idea of an alignment target as a repeller cone / dome: the true target is approximately infinitely small, and attempts to hit it slide off as optimization pressure is applied. I'm curious whether others share this model and whether it's been refined or explored in more detail by others.
  2. The sharpness of the left turn strikes me as a major crux. Some (most?) alignment proposals seem to rely on developing an AI just a bit smarter than humans but not yet dangerous. (An implicit assumption here may be that intelligence continues to develop in straight lines.) The sharp left turn model implies this sweet spot will pass by in the blink of an eye. (An implicit assumption here may be that there are discrete leaps.) It's interesting to note that Nate explicitly says RSI (recursive self-improvement) is not a core part of his model. I'd like to see more arguments on both sides of this debate.
On A List of Lethalities

By the point your AI can design, say, working nanotech, I'd expect it to be well superhuman at hacking, and able to understand things like Rowhammer. I'd also expect it to be able to build models of its operators and conceive of deep strategies involving them.

This assumes the AI learns all of these tasks at the same time. I'm hopeful that we could build a narrowly superhuman task AI which is capable of e.g. designing nanotech while remaining at or below human level on the other tasks you mentioned (and ~all other dangerous tasks you didn't).

Superhuman ability at nanotech alone may be sufficient for carrying out a pivotal act, though maybe not sufficient for other relevant strategic concerns.

Joel Burget's Shortform

The Soviet nail factory so often used to illustrate Goodhart's law: did it actually exist? There are some good answers on the Skeptics StackExchange.

AXRP Episode 15 - Natural Abstractions with John Wentworth

If you're interested in following up on John's comments on financial markets, nonexistence of a representative agent, and path dependence, he speaks more about them in his post, Why Subagents?

In practice, path-dependent preferences mostly matter for systems with “hidden state”: internal variables which can change in response to the system’s choices. A great example of this is financial markets: they’re the ur-example of efficiency and inexploitability, yet it turns out that a market does not have a utility function in general (economists call this “nonexistence of a representative agent”). The reason is that the distribution of wealth across the market’s agents functions as an internal hidden variable. Depending on what path the market follows, different internal agents end up with different amounts of wealth, and the market as a whole will hold different portfolios as a result - even if the externally-visible variables, i.e. prices, end up the same.
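The hidden-state point in the quoted passage can be illustrated with a toy simulation (my own sketch, not from John's post; the agents and strategies are made up for illustration). Two price paths share the same externally visible start and end price, yet the internal wealth distribution between the agents — the hidden state — ends up different:

```python
# Toy illustration: a two-agent "market" whose externally visible price
# ends at the same value along two different paths, yet the internal
# wealth distribution (the hidden state) differs between the paths.

def run_market(prices):
    """Simulate a momentum trader (A) and a contrarian (B) who trade one
    share with each other on every price move. Returns cash and shares."""
    cash = {"A": 100.0, "B": 100.0}
    shares = {"A": 5, "B": 5}
    for prev, cur in zip(prices, prices[1:]):
        if cur > prev:        # price rose: momentum trader A buys from B
            buyer, seller = "A", "B"
        elif cur < prev:      # price fell: contrarian B buys from A
            buyer, seller = "B", "A"
        else:
            continue          # no move, no trade
        cash[buyer] -= cur
        cash[seller] += cur
        shares[buyer] += 1
        shares[seller] -= 1
    return cash, shares

# Two price paths with identical start and end prices (10 -> ... -> 10).
up_down = run_market([10, 12, 10])
flat    = run_market([10, 10, 10])

print(up_down)  # ({'A': 98.0, 'B': 102.0}, {'A': 5, 'B': 5})
print(flat)     # ({'A': 100.0, 'B': 100.0}, {'A': 5, 'B': 5})
```

Prices end at 10 either way, but along the up-and-down path wealth has shifted from A to B. If the agents' future demand depends on their wealth, the market's subsequent behavior differs between the two histories, which is exactly why no single utility function over prices can describe it.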

How to get into AI safety research

Thank you for mentioning Gödel Without Too Many Tears; I bought it based on this recommendation. It's a lovely little book. I didn't expect it to be nearly so engrossing.

[Closed] Hiring a mathematician to work on the learning-theoretic AI alignment agenda

Academics not willing to leave their jobs might still be interested in working on a problem part-time. One could imagine that the right researcher working part-time might be more effective than the wrong researcher working full-time.
