I think this post, more than anything else, has helped me understand the set of things MIRI is getting at. (Though, to be fair, I've also been going through the 2021 MIRI dialogues, so perhaps that's been building some understanding under the surface.)
After a few days' reflection on this post, and a couple of weeks after reading the first part of the dialogues, this is my current understanding of the model:
In our world, there are certain broadly useful patterns of thought that reliably achieve outcomes. The biggest one here is "optimisation". We can think of this as aiming towards a goal and steering towards it - moving the world closer to what you want it to be. These aren't things we train for - we don't even know how to train for them, or against them. They're just the way the world works. If you want to build a power plant, you need some way of getting energy to turn into electricity. If you want to achieve a task, you need some way of selecting a target and then navigating towards it, whether it be walking across the room to grab a cookie, or creating a Dyson sphere.
With gradient descent, maybe you can learn enough to train your AI for things like "corrigibility" or "not being deceptive", but really what you're training for is "Don't optimise for the goal in ways that violate these particular conditions". This does not stop it from being an optimisation problem. The AI will then naturally, with no prompting, attempt to find the best path that gets around these limitations. This probably means you end up with a path that gets the AI the things it wanted from the useful properties of deception or non-corrigibility while obeying the strict letter of the law. (After all, if deception or non-corrigibility weren't useful, if they didn't help achieve something, you would not spontaneously get an agent that did them without being trained to.) Again, this is an entirely natural thing. The shortest path between two points is a straight line. If you add an obstacle in the way, the shortest path between those two points is now to skirt arbitrarily close to the obstacle. No malice is required, any more than you are maliciously circumventing architects when you walk close to (but not into!) walls.
Basically - if the output of an optimisation process is dangerous, it doesn't STOP being dangerous when you change it into the slightly different optimisation process of "Achieve X thing (which is dangerous) without doing Y (which is supposed to trigger on dangerous things)". You just end up getting X through some alternative route Y' instead, as long as you're still enacting the basic pattern - which you will be, because an AI that can't optimise things can't do anything at all. If you couldn't apply a general optimisation process, you'd be unable to walk across the room and get a cookie, let alone do all the things you do in your day-to-day life. Same with the AI.
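A toy sketch of that "skirt the obstacle" dynamic (my own illustration, not anything from the post; the goal, the forbidden zone, and all the numbers are made up): an optimiser told to get as close to its goal as possible without entering a forbidden region ends up parked exactly on the region's edge.

```python
# Toy illustration (hypothetical setup): a constrained optimiser presses
# right up against the letter of the rule, with no malice involved.
import numpy as np
from scipy.optimize import minimize

goal = np.array([5.0, 0.0])      # what the agent actually wants
forbidden_centre = goal          # the rule: stay out of a disk around the goal
forbidden_radius = 2.0

def loss(x):
    """Distance from the goal -- the thing being optimised."""
    return np.linalg.norm(x - goal)

constraint = {
    "type": "ineq",  # scipy convention: fun(x) >= 0 means the point is allowed
    "fun": lambda x: np.linalg.norm(x - forbidden_centre) - forbidden_radius,
}

result = minimize(loss, x0=np.array([0.0, 0.0]), constraints=[constraint])
print(result.x)        # converges to roughly [3, 0]
print(loss(result.x))  # ~2.0: sitting exactly on the forbidden radius
```

The optimiser isn't "trying to cheat"; the boundary solution just is the best remaining option once the rule excludes the unconstrained optimum.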
I'd be interested in whether someone who understands MIRI's worldview decently well thinks I've gotten this right. I'm holding off on trying to figure out what I think about that worldview for now - I'm still in the understanding phase.
I like this dichotomy. I've been saying for a while that companies that only commercialise existing models, and don't do anything that pushes forward the frontier, aren't meaningfully increasing x-risk. That's a long and unwieldy statement - I prefer "AI product companies" as a shorthand.
For a concrete example, I think that working on AI capabilities as an upskilling method for alignment is a bad idea, but working on AI products as an upskilling method for alignment would be fine.
Based on the language you've used in this post, it seems like you've tried several arguments in succession, none of them have worked, and you're not sure why.
One possibility is to first focus on understanding his belief as well as possible; once you understand his conclusions and why he's reached them, you might have more luck. Taking a look at Street Epistemology for some tips on this style of inquiry might also help.
(It is also worth turning this lens upon yourself, and asking why it is so important to you that your friend believes AGI is imminent. Then you can decide whether it's worth continuing to try to persuade him.)
If anyone writes this up I would love to know about it - my local AI safety group is going to be doing a reading + hackathon of this in three weeks, attempting to use the ideas on language models in practice. It would be nice to have this version for a couple of attendees who aren't experienced with AI, though it's hardly game-breaking for the event if we don't have it.
So, I notice that still doesn't answer the actual question of what my probability should actually be. To make things simple, let's assume that, if the sun exploded, I would die instantly. In practice it would have to take at least eight minutes, but as a simplifying assumption, let's assume it's instantaneous.
In the absence of relevant evidence, it seems to me like Laplace's Law of Succession would say the probability of the sun exploding in the next hour is 1/2. But I could make the same argument to say the probability of the sun exploding in the next year is also 1/2, which is nonsensical. So... what's my actual probability here, if I know nothing about how the sun works except that it has not yet exploded, that the sun is very old (which shouldn't matter, if I understand you correctly), and that if it exploded, we would all die?
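To spell out the arithmetic I'm gesturing at (assuming "no relevant evidence" means zero counted trials, which may well be the wrong way to apply the rule):

$$P(\text{explodes in the next interval}) = \frac{s+1}{n+2} = \frac{0+1}{0+2} = \frac{1}{2},$$

where $s$ is the number of observed explosions and $n$ the number of observed intervals. With $n = 0$ the formula gives 1/2 regardless of whether the "interval" is an hour or a year, which is where the inconsistency comes from.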
I notice I'm a bit confused about that. Let's say the only things I know about the sun are "That bright yellow thing that provides heat" and "The sun is really, really old", so I have no knowledge of how the sun mechanistically does what it does.
I want to know "How likely is the sun to explode in the next hour" because I've got a meeting to go to and it sure would be inconvenient for the sun to explode before I got there. My reasoning is "Well, the sun hasn't exploded for billions of years, so it's not about to explode in the next hour, with very high probability."
Is this reasoning wrong? If so, what should my probability be? And how do I differentiate between "The sun will explode in the next hour" and "The sun will explode in the next year"?
This is a for-profit company, and you're seeking investment as well as funding to reduce x-risk. Given that, how do you expect to monetise this in the future? (Note: I do think this is well worth funding for altruistic x-risk-reduction reasons.)
A frame that I use that a lot of people I speak to seem to find A) Interesting and B) Novel is that of "idiot units".
An Idiot Unit is the length of time it takes before you think your past self was an idiot. This is pretty subjective, of course, and you'll need to decide what that means for yourself. Roughly, I consider my past self to be an idiot if they had substantially different aims or were significantly less effective at achieving them. Personally, my Idiot Unit is about two years - I can pretty reliably look back and think that, compared to year T, Jay at year T-2 had worse priorities or was a lot less effective at pursuing his goals somehow.
Not everyone has an Idiot Unit. Some people believe they were smarter ten years ago, or haven't really changed their methods and priorities in a while. Take a minute and think about what your Idiot Unit might be, if any.
Now, if you have an Idiot Unit for your own life, what does that imply?
Firstly, hill-climbing heuristics should be upweighted compared to long-term plans. If your Idiot Unit is U, any plan that takes more than U time means that, after U time, you're following a plan that was designed by an idiot. Act accordingly.
That said, a recent addition I have made to this - you should still make long-term plans. It's important to know which of your plans are stable under Idiot Units, and you only get that by making those plans. I don't disagree with my past self about everything. For instance, I got a vasectomy at 29, because not wanting kids had been stable for me for at least ten years, so I don't expect more Idiot Units to change this.
Secondly, if you must act according to long-term plans (a college/university degree takes longer than U for me, especially since U tends to be shorter when you're younger), try to pick plans that preserve or increase optionality. I want to give Future Jay as many options as possible, because he's smarter than me. When Past Jay decided to get a CS degree, he had no idea about EA or AI alignment. But a CS degree is a very flexible investment, so when I decided to do something good for the world, I had a ready-made asset to use.
Thirdly, longer-term investments in yourself (provided they aren't too specific) are good. Your impact will be larger a few years down the track, since you'll be smarter then. Try asking what a smarter version of you would likely find useful and seek to acquire that. Resources like health, money, and broadly-applicable knowledge are good!
Fourthly, the lower your Idiot Unit is, the better - it means you're learning! Try to keep it low over time: Idiot Units naturally grow with age, so if yours stands still, you're actually making progress.
I'm not sure if it's worth writing up a whole post on this with more detail and examples, but I thought I'd get the idea out there in Shortform.
Explore vs. exploit is a frame I naturally use (though I do like your timeline-argmax frame as well), where I ask myself "Roughly how many years should I feel comfortable exploring before I really need to be sitting down and attacking the hard problems directly somehow?"
Admittedly, this is confounded a bit by how exactly you measure it. If I have 15-year timelines for median AGI-that-can-kill-us (which is about right, for me), then by the standard 1/e algorithm I should be willing to spend 5-6 years exploring. But when did "exploring" start? Obviously I should count my last eight months of upskilling and research as part of the exploration process. But what about my pre-alignment software engineering experience? If that counts, that's now 4 of 19 years spent exploring, giving me about three left. If I count my CS degree as well, that's 8 of 23, and I should start exploiting in less than a year.
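Spelling out the arithmetic, under the assumption that the rule is "explore for the first 1/e of the total horizon" and the horizon is past exploration plus the 15 years of timeline:

$$\frac{15}{e} \approx 5.5 \text{ years}, \qquad \frac{4+15}{e} - 4 \approx 3 \text{ years left}, \qquad \frac{8+15}{e} - 8 \approx 0.5 \text{ years left.}$$

So the answer swings from "several years of runway" to "start exploiting now" depending purely on where I draw the starting line.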
Another frame I like is "hill-climbing" - namely, take the opportunity that seems best at a given moment. Though it is worth asking what makes something the best opportunity if you're comparing, say, maximum impact now vs. maximum skill growth for impact later.
We don't. Humans lie constantly when we can get away with it. It is generally expected in society that humans will lie to preserve people's feelings, lie to avoid awkwardness, and commit small deceptions for personal gain (though this third one is less often said out loud). Some humans do much worse than this.
What keeps it in check is that very few humans have the ability to destroy large parts of the world, and no human has the ability to destroy everyone else in the world and still end up in a world where they can survive and optimally pursue their goals afterwards. If no plan can achieve that for a human, the fact that humans can lie doesn't make things worse.