Ensure nobody becomes God-Emperor Forever.
See Tom Davidson's report on AI-Enabled Coups, which includes some technical open problems.
Imagine putting someone very opinionated, like Nate Soares, in charge: he would probably remove 80% of the mentors and reduce the program to 10-20 people. I'm not sure whether that would work out well.
I’m pretty sure this would work out poorly.
In deployment, we should expect our actions to line up with our values, thus triggering the "ruin the universe for as many as possible" behavior.
Does it seek to maximise harm?
If you want to do interpretability research in the standard paradigm, Goodfire exists.
For what it's worth, I think Goodfire is taking a non-standard approach to interpretability research -- more so than (e.g.) Transluce. (I'm not claiming that the non-standard approach is better than the standard one.)
Hey Ryan, nice post. Here are some thoughts.
Anti-correlated attributes: “Founder‑mode” is somewhat anti‑natural to “AI concern.” The cognitive style most attuned to AI catastrophic risk (skeptical, risk‑averse, theory-focused) is not the same style that woos VCs, launches companies, and ships MVPs. If we want AI safety founders, we need to counterweight this selection against risk-tolerant cognitive styles, both to prevent talent drift and to attract more founder-types to AI safety.
I think AI safety founders should be risk-averse.
For-profit investors like risk-seeking founders because for-profit orgs have unlimited upside and limited downside (you can't lose more money than you invest), and hence investors can expect ROI on a portfolio of high-variance, decorrelated startups. You get high variance with risk-seeking founders, and decorrelation with contrarian founders. But AI safety isn't like this. The downside is just as unlimited as the upside, so you can't expect ROI simply because the orgs are high-variance and uncorrelated; cf. the unilateralist's curse.
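The portfolio logic here can be illustrated with a toy simulation. (This is a hypothetical payoff model I'm introducing for illustration, not data about actual startups: raw returns are drawn from a zero-mean normal, and "founder-mode" bets are the same distribution with a larger standard deviation.)

```python
import random

random.seed(0)

def avg(xs):
    return sum(xs) / len(xs)

# Toy model: each startup's raw return is a random multiple of the amount
# invested. "Founder-mode" startups are the same zero-mean bet, but with
# a larger standard deviation.
N = 100_000
modest  = [random.gauss(0.0, 1.0) for _ in range(N)]
founder = [random.gauss(0.0, 5.0) for _ in range(N)]

# For-profit investing caps the loss at the capital invested (-1x),
# so the realised payoff is max(raw_return, -1).
def capped(raw):
    return max(raw, -1.0)

# Without the cap, variance doesn't change the expectation: both means ~0.
print(avg(modest), avg(founder))

# With the cap, the left tail is truncated but the right tail is not, so
# the higher-variance portfolio has the higher expected payoff.
print(avg([capped(x) for x in modest]),
      avg([capped(x) for x in founder]))
```

On this toy model, the comment's point is that AI safety lacks the cap: a catastrophic org can destroy more value than was invested in it, so the truncation that makes high-variance portfolios attractive to VCs doesn't apply.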
An influential memo from 2022 argued against “mass movement building” in AI safety on the grounds that it would dilute the quality of the field; subsequently, frontier AI companies grew 2-3x/year, apparently unconcerned by dilution.
I think frontier labs have an easier time selecting for talent than AI safety orgs. Partly because they need to care less about virtue/mission alignment.
Remember Bing Sydney?
I don't have anything insightful to say here. But it's surprising how little people mention Bing Sydney.
If you ask people for examples of misaligned behaviour from AIs, they might mention various recent incidents.
But like, three years ago: Bing Sydney. The most powerful chatbot was connected to the internet and, unexpectedly, without provocation, and apparently contrary to its training objective and prompting, was threatening to murder people!
Are we memory-holing Bing Sydney, or are there good reasons for not mentioning it more?
Here are some extracts from "Bing Chat is blatantly, aggressively misaligned" (Evan Hubinger, 15th Feb 2023).
Does AI-automated AI R&D count as "Recursive Self-Improvement"? I'm not sure what Yudkowsky would say, but regardless, enough people would count it that I'm happy to concede some semantic territory. The best thing (imo) is just to distinguish them with an adjective.
(This was sitting in my drafts, but I'll just comment it here because it makes a very similar point.)
There are two forms of "Recursive Self-Improvement" that people often conflate, but they have very different characteristics.
Introspective RSI: Much like a human, an AI will observe, understand, and modify its own cognitive processes. This ability is privileged: the AI can make these self-observations and self-modifications because the metacognition and mesocognition occur within the same entity. While performing cognitive tasks, the AI simultaneously performs the meta-cognitive task of improving its own cognition.
Extrospective RSI: AIs will automate various R&D tasks that humans currently perform to improve AI, using similar workflows that humans currently use. For example, studying literature, forming hypotheses, writing code, running experiments, analyzing data, drawing conclusions, and publishing results. The object-level cognition and meta-level cognition occur in different entities.
I wish people were more careful about this distinction, because people carelessly generalise cached opinions about the former to the latter. In particular, the former seems more dangerous: there is less opportunity to monitor the metacognition's observation and modification of the mesocognition when both occur within the same entity, i.e. within the same activations or chain-of-thought.
Whatever governance prevents the overlords from appearing could also be used to prevent humans from wasting resources in space. For example, by requiring that distant colonies are populated with humans or other minds who are capable of either governing themselves or being multilaterally agreed to be moral patients (e.g. this excludes controversial stuff like shrimps on heroin).
Why do you think that requiring that distant colonies are populated with humans would prevent wasting resources in space?
My guess is that, on a mature population ethics, the best uses of resources -- on purely welfarist values, ignoring non-welfarist values which I do think are important -- will look either like a smaller population of minds much "larger" than humans (i.e. galactic utility monsters) or look like a large population of minds much "smaller" than humans (i.e. shrimps on heroin).
It would be a coincidence if the optimal allocation of resources involved minds which were exactly the same "size" as humans.
Note that this would be a coincidence on any of the currently popular theories of population ethics (e.g. average, total, variable-value).
I think your remarks suggest that alignment to the level of top humans will happen by default, but not alignment to god-like superintelligence. That said, if we get aligned top-human AIs, then we can defer the rest of the alignment problem to them.
If I were sure that top-human-level AIs will be aligned by default, here's what I might work on instead: