Kelsey Piper and I just launched a new blog about AI futurism and AI alignment called Planned Obsolescence. If you’re interested, you can check it out here.
Both of us have thought a fair bit about what we see as the biggest challenges in technical work and in policy to make AI go well, but a lot of our thinking isn’t written up, or is embedded in long technical reports. This is an effort to make our thinking more accessible. That means it’s mostly aiming at a broader audience than LessWrong and the EA Forum, although some of you might still find some of the posts interesting.
So far we have seven posts:
- What we're doing here
- "Aligned" shouldn't be a synonym for "good"
- Situational awareness
- Playing the training game
- Training AIs to help us align AIs
- Alignment researchers disagree a lot
- The ethics of AI red-teaming
Thanks to ilzolende for formatting these posts for publication. Each post has an accompanying audio version generated by a voice synthesis model trained on the author's voice using Descript Overdub.
You can submit questions or comments to mailbox@planned-obsolescence.org.
I think this is just not true? Consider an average human, who understands goodness enough to do science without catastrophic consequences, but is not a benevolent sovereign. One reason why they're not a sovereign is that they have high uncertainty about e.g. what they think is good, and avoid taking actions that violate deontological constraints or virtue ethics constraints or other "common sense morality." AIs could just act similarly? Current AIs already seem like they basically know what types of things humans would think are bad or good, at least enough to know that when humans ask for coffee, they don't mean "steal the coffee" or "do some complicated scheme that results in coffee".
Separately, it seems like in order for your AI to act competently in the world it does have to have a pretty good understanding of "goodness", e.g. to be able to understand why Google doesn't do more spying on competitors, or more insider trading, or other unethical but profitable things, etc. (Separately, the AI will also be able to write philosophy books that are better than current ethical philosophy books, etc.)
My general claim is that if the AI takes creative catastrophic actions to disempower humans, it's going to know that the humans don't like this, are going to resist in the ways that they can, etc. This is a fairly large part of "understanding goodness", and enough (it seems to me) to avoid catastrophic outcomes, as long as the AI tries to do [its best guess at what the humans wanted it to do] and not [just optimize for the thing the humans said to do, which it knows is not what the humans wanted it to do].