I've been active in the meatspace rationality community for years, and have recently started posting regularly on LW. Most of my posts and comments are about AI and alignment.
Posts I'm most proud of, and / or which provide a good introduction to my worldview:
I also wrote a longer self-introduction here.
PMs and private feedback are always welcome.
I inline-reacted to the first sentence of this comment. The comment takes up too much vertical space, though, so the green highlighting isn't visible when I hover over the react icon at the bottom, and I have no way of seeing exactly which text I reacted to while it is highlighted. Maybe hovering over the underlined text should show the reaction?
I like this post as a vivid depiction of the possible convergence of strategicness. For literal wildfires, it doesn't really matter where or how the fire starts: left to burn, it ends with the whole forest burning down. Once the fire is put out, firefighters might be able to determine whether it started from a lightning strike in the east or a matchstick in the west. But the differences in the end result are probably unnoticeable to casual observers, and unimportant to anyone who used to live in the forest.
I think, pretty often, people accept the basic premise that many kinds of capabilities (e.g. strategicness) are instrumentally convergent, without thinking about what the process of convergence actually looks like in graphic detail. Metaphors may or may not be convincing and correct as arguments, but they certainly help to make a point vivid and concrete.
I agree the analogy breaks down in the case of very adversarial agents and / or big gaps in intelligence or power. My point is just that these problems probably aren't unsolvable in principle, for humans or AI systems who have something to gain from cooperating or trading, and who are at roughly equal but not necessarily identical levels of intelligence and power. See my response to the sibling comment here for more.
yes but part of what makes interpretability hard is it may be incomputable.
Aside: I think "incomputable" is a vague term, and results from computational complexity theory often don't have as much relevance to these kind of problems as people intuitively expect. See the second part of this comment for more.
(What do you mean by "waveform reader"?)
An oscilloscope. Note that it isn't particularly realistic to hook up a scope to the kind of hardware that current AI systems are typically trained and run on.
But what I was trying to gesture at with this comment is that this is the kind of problem the AI might be able to help you with. If the AI isn't willing or able to fold itself up into something that can be run entirely on a single, human-inspectable CPU in an airgapped box, running code that is amenable to easily proving things about its behavior, you can just decline to cooperate with it (or to do whatever else you were planning to do by proving something about it), and shut it off instead.
(If the AI is already adversarial to the point where it won't let you shut it off, and is running on a distributed system, you've already lost. Willingness to fold itself up and be shut off means that the AI is already pretty aligned; it wouldn't surprise me if this problem is alignment-complete.)
As for your practical difficulties, I agree these are all problems. I am not saying the problem you pose isn't hard, just that there doesn't seem to be anything that makes it fundamentally impossible to solve in principle.
There is lots of academic research on hardware security and secure hardware, verifiable computing (e.g. using zk-SNARKs), formally verified programming, chain-of-trust, etc., which attempts to make progress on small pieces of this problem (not necessarily with a specific focus on AI). Stitching all of these things together into an actually end-to-end secure system for interacting with a smarter-than-human AI system is probably possible, but it will require solving many unsolved problems, and designing and building AI systems in different ways than we're currently doing. IMO, it's probably better to just build an AI that provably shares human values from the start.
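To give a flavor of the chain-of-trust piece in particular, here's a toy sketch (entirely my own, with made-up stage names and contents; a real secure-boot / attestation stack uses signed measurements, a hardware root of trust, and much more):

```python
import hashlib

# Toy chain-of-trust sketch: each stage's code is measured (hashed) and checked
# against a known-good digest before control is handed over to it. Stage names
# and "code" are made up for illustration only.
STAGES = {
    "bootloader": "print('loading kernel')",
    "kernel": "print('starting inference runtime')",
    "runtime": "print('running the (hopefully verified) model')",
}

# In a real system these digests would come from a trusted, tamper-resistant
# source (e.g. signed by the vendor), not be computed on the spot like this.
EXPECTED = {name: hashlib.sha256(code.encode()).hexdigest()
            for name, code in STAGES.items()}

def verify_and_run(name: str) -> None:
    code = STAGES[name]
    measured = hashlib.sha256(code.encode()).hexdigest()
    if measured != EXPECTED[name]:
        raise RuntimeError(f"{name} failed verification; refusing to hand over control")
    exec(code)  # only run the stage after its measurement checks out

for stage in ("bootloader", "kernel", "runtime"):
    verify_and_run(stage)
```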
Could this exist in the case of a super intelligent agent?
Probably, if the agents make themselves amenable to interpretability and inspection? If I, as a human, am going through a security checkpoint (say, at an airport), there are two ways I can get through successfully: I can outsmart the checkpoint and sneak whatever I want past the screening, or I can comply with the rules, carry nothing prohibited, and let myself be straightforwardly inspected.
As I get more clever and more adversarial, the security checkpoint might need to be more onerous and more carefully designed, to ensure that the first option remains infeasible to me. But the designers of the checkpoint don't necessarily have to be smarter than me, they just have to make the first option difficult enough, relative to any benefit I would gain from it, such that I am more likely to choose the second option.
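One crude way to make "difficult enough, relative to any benefit" concrete (my own framing, with made-up symbols): the would-be cheater prefers the compliant route whenever

$$p_{\text{detect}} \cdot C_{\text{penalty}} + C_{\text{effort}} > B_{\text{cheat}}$$

where $B_{\text{cheat}}$ is the gain from slipping something through, $C_{\text{effort}}$ is the cost of mounting the attempt, $p_{\text{detect}}$ is the chance of getting caught, and $C_{\text{penalty}}$ is the cost if caught. The checkpoint designers don't need to out-think me in general; they just need to push the left-hand side above the right for the attacks actually available to me.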
The point is that in practice it's not necessary to model any humans,
Right, but my point is that it's still necessary for something to model something. The bot arena setup in the paper has been carefully arranged so that the modelling is in the bots, the legibility is in the setup, and the decision theory comprehension is in the authors' brains.
I claim that all three of these components are necessary for robust cooperation, along with some clever system design work to make each component separable and realizable (e.g. it would be much harder to have the modelling happen in the researchers' brains and the decision theory comprehension happen in the bots).
Two humans, locked in a room together, facing a true PD, without access to computers or an arena or an adjudicator, cannot necessarily robustly cooperate with each other for decision theoretic reasons, even if they both understand decision theory.
PrudentBot is modelling its counterparty, and the setup in which it runs is what makes the modelling and legibility possible. To make PrudentBot work, the comprehension of decision theory, counterparty modelling, and legibility are all required. It's just that these elements are spread out, in various ways, between (a) the minds of the researchers who created the bots, (b) the source code of the bots themselves, and (c) the setup / testbed that makes it possible for the bots to faithfully exchange source code with each other.
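To make the "where does each piece live" point concrete, here's a toy arena sketch of my own (not the construction from the paper: the paper's FairBot/PrudentBot use bounded proof search over the opponent's source, whereas this toy substitutes naive simulation plus a trusted registry, which is strictly weaker):

```python
import inspect

def cooperate_bot(opponent_source: str) -> str:
    return "C"  # cooperates unconditionally

def defect_bot(opponent_source: str) -> str:
    return "D"  # defects unconditionally

def fair_bot(opponent_source: str) -> str:
    # Crude stand-in for FairBot: simulate the opponent against CooperateBot's
    # source and mirror its move. The real FairBot asks whether the opponent
    # provably cooperates *with FairBot itself*, which naive simulation can't
    # check without infinite regress.
    opponent = REGISTRY[opponent_source]
    return opponent(inspect.getsource(cooperate_bot))

# The "setup": a trusted registry mapping verbatim source text back to a
# runnable bot, standing in for faithful source-code exchange.
REGISTRY = {inspect.getsource(bot): bot
            for bot in (cooperate_bot, defect_bot, fair_bot)}

def play(bot_a, bot_b):
    # The arena hands each bot the other's verbatim source (the legibility).
    return bot_a(inspect.getsource(bot_b)), bot_b(inspect.getsource(bot_a))

if __name__ == "__main__":
    bots = (cooperate_bot, defect_bot, fair_bot)
    for a in bots:
        for b in bots:
            print(f"{a.__name__} vs {b.__name__}: {play(a, b)}")
```

Run as a script, fair_bot cooperates with cooperate_bot and with itself, and defects against defect_bot. But because it only checks the opponent's behavior toward CooperateBot, it can be exploited by a bot that is nice to CooperateBot and nasty to fair_bot; avoiding that kind of exploit without infinite regress is a big part of why the paper reaches for proof search instead of simulation.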
Also, arenas where you can submit a simple program are kind of toy examples: if you're facing a real, high-stakes prisoner's dilemma and you can set things up so that some programs make the decisions for you, you're probably already capable of coordinating and cooperating with your counterparty well enough to avoid the prisoner's dilemma entirely, were it happening in real life and not in a simulated game.
If someone succeeds in getting, say, a ~13B-parameter model to be equal in performance (at high-level tasks) to a previous-gen model 10x that size, using a 10x smaller FLOPs budget during training, isn't that a pretty big win for Eliezer? That seems to be kind of what is happening: this list mostly has larger models at the top, but not uniformly so.
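For concreteness, a rough back-of-envelope using the common "training FLOPs ≈ 6 × parameters × tokens" approximation (all numbers below are illustrative placeholders, not real model cards):

```python
# Back-of-envelope: training FLOPs ~= 6 * parameters * training tokens.
def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

prev_gen = train_flops(params=130e9, tokens=1e12)  # hypothetical 130B model
smaller = train_flops(params=13e9, tokens=1e12)    # hypothetical 13B model, same data

print(f"previous-gen: {prev_gen:.1e} FLOPs")
print(f"13B model:    {smaller:.1e} FLOPs ({prev_gen / smaller:.0f}x cheaper)")
```

Under this approximation, matching the bigger model at 10x fewer parameters on the same data is exactly a 10x smaller training budget.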
I'd say it was more like: there was a large minimum amount of compute needed to make things work at all, but most of the innovation in LLMs comes from the algorithmic improvements that were needed to make them work at all.
Hobbyists and startups can train their own models from scratch without massive capital investment, though not the absolute largest ones, and not completely for free. This capability does require massive capital expenditures by hardware manufacturers to improve the underlying compute technology sufficiently, but massive capital investments in silicon manufacturing technology are nothing new, even if they have been accelerated and directed a bit by AI in the last 15 years.
And I don't think it would have been surprising to Eliezer (or anyone else in 2008) that if you dump more compute at some problems, you get gradually increasing performance. For example, in 2008, you could have made massive capital investments to build the largest supercomputer in the world, and gotten the best chess engine by enabling the SoTA algorithms to search 1 or 2 levels deeper in the chess game tree. Or you could have used that money to pay for researchers to continue looking for algorithmic improvements and optimizations.
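To put rough numbers on "search 1 or 2 levels deeper" (my own back-of-envelope; the branching factor is a conventional ballpark figure, not a measurement):

```python
# With alpha-beta pruning, chess engines' effective branching factor is
# conventionally around 6 (vs ~35 for raw minimax), so each extra ply of
# search costs roughly 6x the nodes. Ballpark figures only.
EFFECTIVE_BRANCHING = 6

for extra_plies in (1, 2):
    print(f"+{extra_plies} plies: ~{EFFECTIVE_BRANCHING ** extra_plies}x more compute")
```

So even a ~36x compute advantage buys you only around two extra plies of search, which is roughly the shape of the tradeoff I'm gesturing at.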
Also, I didn't mean for this distinction to be particularly interesting - I am still slightly concerned that it is so pedantic / boring / obvious that I'm the only one who finds it worth distinguishing at all.
I'm literally just saying, a description of a function / mind / algorithm is a different kind of thing than the (possibly repeated) execution of that function / mind / algorithm on some substrate. If that sounds like a really deep or interesting point, I'm probably still being misunderstood.
A possible explanation for this phenomenon that feels somewhat natural with hindsight: there's a relatively large minimum amount of compute required to get certain kinds of capabilities working at all. But once you're above that minimum, you have lots of options: you can continue to scale things up, if you know how to scale them and you have the compute resources to do so, or you can look for algorithmic improvements which enable you to do more and get better results with less compute. Once you're at this point, perhaps the main determinant of the relative rate of progress between algorithms and compute is which option researchers at the capabilities frontier choose to work on.