Firstly, a bias towards choices which leave less up to chance.
Wouldn't this imply a bias towards eliminating other agents? (Since that would make the world more predictable, and thereby leave less up to chance?)
And thirdly, a bias towards choices which afford more choices later on.
Wouldn't this strongly imply biases towards both self-preservation and resource acquisition?
If the above two implications hold, then the conclusion
that the biases induced by instrumental rationality at best weakly support [...] that machine superintelligence is likely to lead to existential catastrophe
seems incorrect, no?
Could you briefly explain what is wrong with the reasoning above, or point me to the parts of the post that do so? (I only read the Abstract.)
[...] bad futures without extinction, e.g. that AI systems take over but don’t kill everyone.
What probability would you assign to humans remaining but not being able to kill themselves; i.e., to inescapable dystopias (vs. dystopias whose badness for any individual is bounded by death-by-suicide)?
This post raised some interesting points, and stimulated a bunch of interesting discussion in the comments. I updated a little bit away from foom-like scenarios and towards slow-takeoff scenarios. Thanks. For that, I'd like to upvote this post.
On the other hand: I think direct/non-polite/uncompromising argumentation against other arguments, models, or beliefs is (usually) fine and good. And I think it's especially important to counter-argue possible inaccuracies in key models that lots of people have about AI/ML/alignment. However, in many places, the post reads like a personal attack on a person (Yudkowsky), rather than just on models/beliefs he has promulgated.
I think that style of discourse runs a risk of making the disagreement needlessly personal and degrading the quality of discussion.
For that, I'd like to downvote this post. (I ended up neither up- nor down-voting.)
I will bet any amount of money that GPT-5 will not kill us all.
What's the exchange rate for USD to afterlife-USD, though? Or what if they don't use currency in the afterlife at all? Then how would you pay the other party back if you lose?
Well, I ask you -- do you think any other ML model, trained over the domain of all human text, with sufficient success to reach GPT-4 level perplexity, would turn out to be simpler?
If we're literally considering a universal quantifier over the set of all possible ML models, then I'd think it extremely likely that there does exist a simpler model with perplexity no worse than GPT-4. I'm confused as to how you (seem to have) arrived at the opposite conclusion.
Imagine an astronomer in the year 1600, who frequently refers to the "giant inscrutable movements" of the stars. [...]
I think the analogy to {building intelligent systems} is unclear/weak. There seem to be many disanalogies:
In the astronomy case, we have
A phenomenon (the stars and their motions) that we cannot affect.
That phenomenon is describable with simple laws.
The "piles of numbers" are detailed measurements of that phenomenon.
It is useful to take more measurements; doing so helps in finding the simple laws.
In the {building AIs} case, we have
A phenomenon (intelligence) which we can choose to implement in different ways, and which we want to harness for some purpose.
That phenomenon is probably not entirely implementable with simple programs. (As you point out yourself.)
The "piles of numbers" are the means by which some people are implementing the thing; as opposed to measurements of the one and only way the thing is implemented.
And so: implementing AIs as piles-of-numbers is not clearly helpful (and might be harmful) to finding better/simpler alternatives.
So, I predict with high confidence that any ML model that can reach the perplexity levels of Transformers will also present great initial interpretive difficulty.
I do agree that any realistic ML model that achieves GPT-4-level perplexity would probably have to have at least some parts that are hard to interpret. However, I believe it should (in principle) be possible to build ML systems that have highly interpretable policies (or analogues thereof), despite having hard-to-interpret models.
I think if our goal was to build understandable/controllable/safe AI, then it would make sense to factor the AI's mind into various "parts", such as e.g. a policy, a set of models, and a (set of sub-)goals.
In contrast, implementing AIs as giant Transformers precludes making architectural distinctions between any such "parts"; the whole AI is in a(n architectural) sense one big uniform soup. Giant Transformers don't even have the level of modularity of biological brains designed by evolution.
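To illustrate the kind of factoring I have in mind, here's a toy sketch (all names are hypothetical illustrations, not a real or proposed architecture): the point is just that the "parts" are architecturally distinct and separately inspectable, rather than one undifferentiated pile of weights.

```python
from dataclasses import dataclass, field

# Toy sketch of an architecturally factored agent. Each "part" is a
# separate, inspectable component. (Hypothetical illustration only.)

@dataclass
class WorldModel:
    beliefs: dict = field(default_factory=dict)  # beliefs we can read off directly

@dataclass
class Goal:
    description: str  # a goal stated explicitly, not latent in weights

@dataclass
class FactoredAgent:
    model: WorldModel
    goals: list  # list[Goal]

    def policy(self, observation: str) -> str:
        # The policy is a distinguished, auditable component that
        # consults the (separately inspectable) model and goals.
        self.model.beliefs["last_observation"] = observation
        return f"pursue: {self.goals[0].description}"

agent = FactoredAgent(WorldModel(), [Goal("answer the question")])
print(agent.policy("user asked a question"))
```

With a factoring like this, one could (in principle) audit the goals and the model independently of the policy; a monolithic Transformer offers no such architectural seam.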
Consequently, I still think the "giant inscrutable tensors"-approach to building AIs is terrible from a safety perspective, not only in an absolute sense, but also in a relative sense (relative to saner approaches that I can see).
scale up the experiment of Pretraining from Human Feedback by using larger data
AFAICT, PHF doesn't solve any of the core problems of alignment. IIUC, PHF is still using an imperfect reward model trained on a finite amount of human signals-of-approval; I'd tentatively expect scaling up PHF (to ASI) to result in death-or-worse by Goodhart. Haven't thought about PHF very thoroughly though, so I'm uncertain here.
we can even try to design a data set such that it uses words like freedom, justice, alignment and more value laden words
Did you mean something like "(somehow) design a data set such that, in order to predict token-sequences in that data set, the AI has to learn the real-world structure of things we care about, like freedom, justice, alignment, etc."? [1]
can only learn legitimate generalizations, not deceptive generalizations leading to deceptive alignment
I don't understand this. What difference are you pointing at with "deceptive" vs "legitimate" generalizations? How does {AI-human (and/or AI-env) interactions being limited to a simple interface} preclude {learning "deceptive" generalizations}?
I'm under the impression that entirely "legitimate" generalizations can (and a priori probably will) lead to "deception"; see e.g. https://www.lesswrong.com/posts/XWwvwytieLtEWaFJX/deep-deceptiveness. Do you disagree with that? (If yes, how?)
can't amplify Goodhart
Side note: I don't understand what you mean by this (in the given context).
can't [...] hack the human's values
I don't see how this follows. IIUC, the proposition here is something like: if interactions between the AI and the humans are limited to a simple interface, then the AI cannot hack the humans.
Is that a reasonable representation of what you're saying? If yes, consider: What if we replace "the AI" with "Anonymous" and "the humans" with "the web server"? Then we get something like: if interactions between Anonymous and the web server are limited to a simple interface, then Anonymous cannot hack the web server
...which is obviously false in the general case, right? Systems can definitely be hackable, even if interactions with them are limited to a simple interface; as evidence, we could consider any software exploit ever that didn't rely on hardware effects like rowhammering.
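As a toy illustration of the point (a hypothetical example of mine, not from the original discussion): even a service whose *entire* interface is "accept a string, return a string" can be exploitable if its implementation is careless.

```python
# Toy "calculator service": its entire interface is a single function
# that takes a string and returns a string -- about as simple as an
# interface gets. (Hypothetical example for illustration.)

def calculator_service(expression: str) -> str:
    # Careless implementation: evaluates the input directly.
    return str(eval(expression))

# Intended use:
print(calculator_service("2 + 3"))  # → 5

# But the same simple interface admits inputs that hijack the
# implementation, e.g. reading data the service never meant to expose:
print(calculator_service("__import__('os').getcwd()"))
```

The simplicity of the interface constrains the *channel*, not what the system on the other end can be induced to do through it.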
(I agree that limiting human-AI interactions to a simple interface would be helpful, but I think it's far from sufficient (to guarantee any form of safety).)
IIUC, a central theme here is the belief that {making learning offline rather than online} and {limiting AI-human interfaces to be simple/understood} would solve large chunks of the whole alignment problem, or at least make it much easier. I'm still confused as to why you think that. To the extent that I understood the reasons you presented, I think they're incorrect (as outlined above). (Maybe I'm misunderstanding something.)
I'm kinda low on bandwidth, so I might not engage with this further. But in any case, thanks for trying to share parts of your model!
I think a naively designed data set containing lots of {words that are value-laden for English-speaking humans} would not cut it, for hopefully obvious reasons. ↩︎
I don't quite understand what you're saying; I get the impression we're using different ontologies/vocabularies. I'm curious to understand your model of alignment, and below are a bunch of questions. I'm uncertain whether it's worth the time to bridge the conceptual gap, though --- feel free to drop this conversation if it feels opportunity-costly.
(1.)
Are you saying that if we assumed agents to be Cartesian[1], then you'd know a solution to the problem of {how could a weak agent train and align a very powerful agent}? If yes, could you outline that solution?
(2.)
Resolve the embedded alignment problems [...] This is essentially Reinforcement Learning from Human Feedback's method for alignment
How does RLHF solve problems of embedded alignment? I'm guessing you're referring to something other than the problems outlined in Embedded Agency?
(3.)
What exact distinction do you mean by "online" vs "offline"? Given that any useful/"pivotal" AI would need to learn new things about the world (and thus, modify its own models/mind) in order to form and execute a useful/pivotal plan, it would have to learn "online", no?
(4.)
the data set used for human values
What kind of data set did you have in mind here? A data set s.t. training an AI on it in some appropriate way would lead to the AI being aligned to human values? Could you give a concrete example of such a data set (and training scheme)?
e.g. software agents in some virtual world, programmed such that agents are implemented as some well-defined datatype/class, and agent-world interactions can only happen via a well-defined simple interface, running on a computer that cannot be hacked from within the simulation. ↩︎
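A minimal sketch of what such a well-defined agent-world interface might look like (all names hypothetical; just to make the footnote concrete):

```python
from dataclasses import dataclass
from typing import Protocol

# Hypothetical sketch of a Cartesian agent-world boundary: the agent
# interacts with the world *only* through these two typed channels,
# and never touches the world's internals.

@dataclass(frozen=True)
class Observation:
    pixels: tuple  # whatever the world chooses to expose to the agent

@dataclass(frozen=True)
class Action:
    command: str   # the only kind of influence the agent has

class CartesianAgent(Protocol):
    def act(self, obs: Observation) -> Action: ...

def run_episode(agent: CartesianAgent, world_step, obs: Observation, steps: int):
    # The simulation loop mediates all agent-world interaction.
    for _ in range(steps):
        obs = world_step(agent.act(obs))
    return obs
```

The "cannot be hacked from within the simulation" condition is doing the real work here, of course; the typed interface alone doesn't guarantee it.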
I agree. But AFAICT that doesn't really change the conclusion that fewer agents would tend to make the world more predictable/controllable. As you say yourself:
And that was the weaker of the two apparent problems. What about the {implied self-preservation and resource acquisition} part?