Thanks for compiling your thoughts here! There's a lot to digest, but I'd like to offer a relevant intuition I have specifically about the difficulty of alignment.
Whatever method we use to verify the safety of a particular AI will likely leave that safety extremely underdetermined. That is, we could verify that the AI is safe for some set of plausible circumstances, but that set of verified situations would be much, much smaller than the set of situations it could encounter "in the wild".
The AI model, reality, and our values are all high entropy, and our verification/safety methods are likely to be comparatively low entropy. The set of AIs that pass our tests will have members whose properties haven't been fully constrained.
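One toy way to make that counting intuition concrete (all numbers here are made up for illustration): treat an AI's relevant properties as its behavior across some number of distinct situations, and note how many distinct behaviors slip through a test suite that only checks a fraction of them.

```python
# Toy model of underdetermined verification (all numbers hypothetical).
# Treat an AI's "properties" as its behavior across n distinct situations,
# each of which it handles either safely or unsafely. A test suite that
# pins down behavior in only k situations leaves the rest unconstrained.
n_situations = 40   # situations the AI could encounter "in the wild"
k_tested = 10       # situations our verification method actually checks

# Every possible assignment of behaviors to the untested situations passes.
passing_behaviors = 2 ** (n_situations - k_tested)
print(passing_behaviors)  # 1073741824 distinct behaviors pass our tests
```

The point isn't the specific numbers; it's that the set of test-passers grows exponentially in the number of unchecked situations.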
This isn't even close to a complete argument, but I've found it helpful as an intuition fragment.
I think both of those things are worth looking into (for the sake of covering all our bases), but by the time alarm bells go off, it will already be too late.
It's a bit like a computer virus. Even after Stuxnet became public knowledge, it wasn't possible to just turn it off. And unlike Stuxnet, AI-in-the-wild could easily adapt to ongoing changes.
I've got some object-level thoughts on Section 1.
With a model of AGI-as-very-complicated-regression, there is an upper bound on how fulfilled it can actually be. It strikes me that it would simply fulfill that goal, and be content.
It'd still need to do risk mitigation, which would likely entail some very high-impact power-seeking behavior. There are lots of ways things could go wrong even if its preferences saturate.
For example, it'd need to secure against the power grid going out, long-term disrepair, getting nuked, etc.
To argue that an AI might change its goals, you need to develop a theory of what's driving those changes (something like "the AI wants more utils") and probably need something like sentience, which is way outside the scope of these arguments.
The AI doesn't need to change or even fully understand its own goals. No matter what its goals are, high-impact power-seeking behavior will be the default due to needs like risk mitigation.
But if that's true, alignment is trivial: the human can just give it a more sensible goal, something like "make as many paperclips as you can without decreasing any human's existence or quality of life by their own lights", or better yet something more complicated that gets us to a utopia before any paperclips are made.
Figuring out sensible goals is only part of the problem, and the other parts of the problem are sufficient for alignment to be really hard.
In addition to the inner/outer alignment stuff, there is what John Wentworth calls the pointers problem. In his words: "I need some way to say what the values-relevant pieces of my world model are 'pointing to' in the real world."
In other words, all high-level goal specifications need to bottom out in talking about the physical world. That is... very hard, and modern philosophy still struggles with it. Not only that, it all needs to be solved in the specific context of a particular AI's sensory suite (or something like that).
As a side note, the original version of the paperclip maximizer, as formulated by Eliezer, was partially an intuition pump about the pointers problem. The universe wasn't tiled by normal paperclips; it was tiled by some degenerate physical realization of the conceptual category we call "paperclips", e.g. maybe a tiny strand of atoms that kinda topologically maps to a paperclip.
Intelligence is not magic.
Agreed. Removing all/most constraints on expected futures is the classic sign of the worst kind of belief. Unfortunately, figuring out the constraints left after contending with superintelligence is so hard that it's easier to just give up. Which can, and does, lead to magical thinking.
There are lots of different intuitions about what intelligence can do in the limit. A typical LessWrong-style intuition is something like 10 billion broad-spectrum geniuses running at 1,000,000x speed. It feels like a losing game to bet against billions of Einsteins+Machiavellis+(insert highly skilled person) working for millions of years.
Additionally, LessWrong people (myself included) often implicitly think of intelligence as systematized winning, rather than IQ or whatever. I think that is a better framing, but it's not the typical definition of intelligence. Yet another disconnect.
However, this is all intuition about what intelligence could do, not what a fledgling AGI will probably be capable of. This distinction is often lost during Twitter-discourse.
In my opinion, a more generally palatable thought experiment about the capability of AGI is: What could a million perfectly-coordinated, tireless copies of a pretty smart, broadly skilled person running at 100x speed do in a couple of years?
Well... enough. Maybe the crazy-sounding nanotech, brain-hacking stuff is the most likely scenario, but more mundane situations can still carry many of the arguments through.
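For scale, the thought experiment above cashes out to an enormous amount of subjective labor. A rough back-of-the-envelope (not a capability claim, just multiplication):

```python
# Back-of-the-envelope for the thought experiment above.
copies = 1_000_000    # perfectly-coordinated, tireless copies
speedup = 100         # subjective speed relative to a baseline human
wall_clock_years = 2  # calendar time elapsed

subjective_person_years = copies * speedup * wall_clock_years
print(subjective_person_years)  # 200000000 person-years of skilled work
```

Two hundred million person-years of coordinated, skilled effort is more than enough for a lot of mundane-but-decisive accomplishments.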
Makes sense. From the post, I thought you'd consider 90% as too high an estimate.
My primary point was that estimates of 10% and 90% (or maybe even >95%) aren't much different from a Bayesian evidence perspective. My secondary point was that it's really hard to meaningfully compare different peoples' estimates because of wildly varying implicit background assumptions.
I might be misunderstanding some key concepts but here's my perspective:
It takes more Bayesian evidence to promote the subjective credence assigned to a belief from negligible to non-negligible than from non-negligible to pretty likely. See the intuition on log odds and locating the hypothesis.
So, going from 0.01% to 1% requires more Bayesian evidence than going from 10% to 90%. The same thing applies to going from 99% to 99.99%.
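A quick sketch of the arithmetic, measuring evidence as the shift in log2 odds (so the units are bits):

```python
import math

def bits_of_evidence(p_before: float, p_after: float) -> float:
    """Bits of Bayesian evidence needed to move a credence from
    p_before to p_after, i.e. the change in log2 odds."""
    log2_odds = lambda p: math.log2(p / (1 - p))
    return log2_odds(p_after) - log2_odds(p_before)

print(round(bits_of_evidence(0.0001, 0.01), 2))  # 0.01% -> 1%   : 6.66 bits
print(round(bits_of_evidence(0.10, 0.90), 2))    # 10%   -> 90%  : 6.34 bits
print(round(bits_of_evidence(0.99, 0.9999), 2))  # 99% -> 99.99% : 6.66 bits
```

So the jump from 0.01% to 1% really does take (slightly) more evidence than the entire jump from 10% to 90%.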
A person could reasonably be considered super weird for thinking something with a really low prior has even a 10% chance of being true, but it isn't much weirder to think it has a 90% chance. This all feels wrong in some important way, but mathematically that's how it pans out if you want to use Bayes' Rule to track your beliefs.
I think it feels wrong because in practice reported probabilities are typically used to talk about something semantically different than actual Bayesian beliefs. That's fine and useful, but can result in miscommunication.
Especially in fuzzy situations with lots of possible outcomes, even actual Bayesian beliefs have strange properties and are highly sensitive to your priors, weighing of evidence, and choice of hypothesis space. Rigorously comparing reported credence between people is hard/ambiguous unless either everyone already roughly agrees on all that stuff or the evidence is overwhelming.
Sometimes the exact probabilities people report are more accurately interpreted as "vibe checks" than actual Bayesian beliefs. Annoying, but as you say this is all pre-paradigmatic.
I feel like I am "proving too much" here, but for me this all bottoms out in the intuition that going from 10% to 90% credence isn't all that big a shift from a mathematical perspective.
Given the fragile and logarithmic nature of subjective probabilities in fuzzy situations, choosing exact percentages will be hard, and the exercise might be better treated as a multiple-choice question with coarse options like: negligible, plausible, likely, near-certain.
For the specific case of AI x-risk, the massive differences in the expected value of possible outcomes mean you usually only need that level of granularity to evaluate your options/actions. Nailing down the exact numbers is more entertaining than operationally useful.
This DeepMind paper explores some intrinsic limitations of agentic LLMs. The basic idea is (my words):
If the training data used by an LLM is generated by some underlying process (or context-dependent mixture of processes) that has access to hidden variables, then an LLM used to choose actions can easily go out-of-distribution.
For example, suppose our training data is a list of a person's historical meal choices over time, formatted as tuples:

(Meal Choice, Meal Satisfaction)
(Pizza, Yes)(Cheeseburger, Yes)(Tacos, Yes)
When the person originally chose what to eat, they might have had some internal idea of what food they wanted to eat that day, so the list of tuples will only include examples where the meal was satisfying.
If we try to use the LLM to predict what food a person ought to eat, the LLM won't have access to the person's hidden daily food preference. So it might make a bad prediction, and you could end up with a tuple like (Tacos, No). This immediately puts the rest of the sequence out-of-distribution.
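Here's a toy simulation of that story (my own construction, not the paper's actual setup): the logging process consults a hidden craving, so the training data never contains a "No", but a predictor sampling actions without access to the hidden variable immediately produces tuples the training distribution never contained.

```python
import random

MEALS = ["Pizza", "Cheeseburger", "Tacos"]

def logged_day(rng):
    # Data-generating process: the person consults a hidden craving,
    # so every logged (meal, satisfaction) tuple ends in "Yes".
    craving = rng.choice(MEALS)  # hidden variable, never written to the log
    return (craving, "Yes")

def model_day(rng):
    # A predictor trained on the log never observes the craving, so the
    # best it can do is sample a plausible meal and hope it matches.
    craving = rng.choice(MEALS)  # the world still has the hidden variable
    guess = rng.choice(MEALS)
    return (guess, "Yes" if guess == craving else "No")

rng = random.Random(0)
training = [logged_day(rng) for _ in range(200)]
rollout = [model_day(rng) for _ in range(200)]

print(sum(s == "No" for _, s in training))  # 0: "No" never appears in training
print(sum(s == "No" for _, s in rollout))   # nonzero: out-of-distribution tuples
```

The failure isn't that the model is a bad predictor of the log; it's that choosing actions with it breaks the correlation between choice and hidden state that the log silently relied on.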
The paper proposes various solutions for this problem. I think that increasing scale probably helps dodge this issue, but it does show an important weak point of using LLMs to choose causal actions.
I think that humans are sorta "unaligned", in the sense of being vulnerable to Goodhart's Law.
A lot of moral philosophy is something like: collect our moral intuitions, systematize them into a consistent framework, and then follow that framework even where it departs from the intuitions we started with.
The resulting ethical system often ends up having some super bizarre implications and usually requires specifying "free variables" that are (arguably) independent of our original moral intuitions.
In fact, I imagine that optimizing the universe according to my moral framework looks quite Goodhartian to many people.
I'm sure there are many other examples.
I don't think that my conclusions are wrong per se, but... my ethical system has some alien and potentially degenerate implications when optimized hard.
No real call to action here, just some observations. Existing human ethical systems might look as exotic to the average person as some conclusions drawn by a kinda-aligned SAI.