I think the Simulation Hypothesis implies that surviving an AI takeover isn't enough.
Suppose you make a "deal with the devil" with the misaligned ASI, allowing it to take over the entire universe or light cone, so long as it keeps humanity alive. Keeping all of humanity alive in a simulation is fairly cheap, probably less energy than one electric car.[1]
The problem with this deal is that if misaligned ASIs often win, and the average (not median) misaligned ASI runs a trillion trillion simulations, then it's reasonable to assume there are a trillion trillion simulated civilizations for every real one. So the 1 copy of you in the real world survives, but the trillion trillion copies of you in simulations still die. If you're willing to accept such a dismal survival rate, you might as well bet all your money at a casino and shoot yourself when you lose.
Why it's wrong to say "simulated copies aren't real"
You are merely a computation running on biological hardware, while simulations run on computers. Imagine if a copy of you were running on something even "realer" than biological hardware, and pointed at you, saying you aren't real.
The solution is that the 1 copy of you in the real world cannot just survive. It has to control enough of the future to do something big. If we care about humanity more than other sentient life, then the 1 copy of humanity which does survive could create a trillion trillion copies of humanity, to make up for the trillion trillion simulated copies which died when the simulation ended.
Why there are probably near-infinite copies of you.
The observable universe has roughly 10^80 atoms, and the observable universe is smaller than "all of existence," whatever that is, so "all of existence" has at least 10^80 atoms. I don't know how "all of existence" chooses its numbers: sometimes it chooses numbers like 0 or 1 or 137, sometimes it chooses really big numbers, and we don't know about the biggest numbers it chooses because we lack the means to distinguish them from infinity. But given that the atom count is at least 10^80 with no upper bound, it's probable that the true count is vastly larger, a number which is still very tiny compared to the truly colossal numbers mathematicians study, and insanely tiny compared to even larger numbers beyond the largest number humans can unambiguously refer to.
If the total number of atoms is N, and the number of atoms required for each emergence of intelligent life is R, then we know N ≥ R, since life emerged at least once. There's no reason to assume R is close to N, and even if their superexponents are nearly equal, you'll still end up with an enormous number of intelligent civilizations, because tiny changes to a superexponent can easily double the exponent and square the number.
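The superexponent point can be checked numerically. A minimal sketch in Python, using an illustrative superexponent of 30 (any value works the same way):

```python
import math

# A "superexponential" number: N = 10**(10**e). Bumping the superexponent e
# by just log10(2) ≈ 0.301 doubles the exponent 10**e, which squares N,
# since 10**(2k) = (10**k)**2.
e = 30.0
exponent_before = 10.0 ** e                    # 1e30
exponent_after = 10.0 ** (e + math.log10(2))   # ≈ 2e30

print(exponent_after / exponent_before)  # ≈ 2.0, i.e. N has been squared
```

So two superexponents that look "nearly equal" (30 vs. 30.3) still differ by a squaring of the underlying count, which is why N/R can be astronomically large even when the superexponents are close.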
Making a trillion trillion copies of humanity won't use up most of the universe, and it's not as evilly selfish as it looks at first glance!
It is still better than a "deal with the devil" where we only ask for humanity to survive while the misaligned ASI takes the rest of the universe, because if all planetary civilizations follow this strategy, they still ensure that the average sentient life is lived in the happy future rather than in endless simulation hell.
After billions of years, it won't matter too much who the original survivors were, because each copy of you and that copy's great-grandchildren will have diverged so far over time that the most enduring feature is the number of happy lives. So the selfish action of duplicating humanity does not cost that much in the long term from an effective altruist point of view.
I'm not saying we mustn't make a deal with a misaligned ASI, but we need to ask for large amounts, and aim for enough happy lives to outnumber the unhappy lives in the universe. Otherwise, we still die.
Every biological neuron firing costs about 600,000,000 ATP molecules, so an ASI-optimized simulation of a neuron firing could cost 10,000,000 times less.
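As a sanity check of this footnote, here is a back-of-envelope calculation. Every constant below is a rough order-of-magnitude assumption on my part (ATP energy, firing rate, population, the car's power draw), not a measurement:

```python
# Back-of-envelope check: all figures are order-of-magnitude assumptions.

ATP_JOULES = 5e-20          # rough free energy released per ATP hydrolysis
ATP_PER_FIRING = 6e8        # figure from the footnote
NEURONS_PER_BRAIN = 8.6e10  # typical estimate for a human brain
FIRING_RATE_HZ = 1.0        # assumed average firing rate per neuron
SPEEDUP = 1e7               # the footnote's assumed ASI efficiency gain
POPULATION = 8e9

watts_per_brain_bio = ATP_JOULES * ATP_PER_FIRING * NEURONS_PER_BRAIN * FIRING_RATE_HZ
watts_humanity_sim = watts_per_brain_bio * POPULATION / SPEEDUP

EV_DRIVING_WATTS = 15_000   # rough power draw of one electric car in motion

print(f"biological signaling per brain: ~{watts_per_brain_bio:.1f} W")
print(f"simulating all of humanity:     ~{watts_humanity_sim:.0f} W")
print(f"one driving electric car:       ~{EV_DRIVING_WATTS} W")
```

Under these assumptions, simulating all of humanity lands in the low kilowatts, below the draw of a single moving electric car, which is consistent with the claim in the main text.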
A deal implies that you have something to offer to the ASI, which you define as powerful enough to take over the universe. What is that?
One "deal with the devil" is to assume that the misaligned ASI will have a tiny amount of kindness and won't kill everyone by default. This view is pretty popular; e.g., see Notes on fatalities from AI takeover. Assuming that a misaligned ASI will be survivable means potentially prioritizing it less, and focusing on making sure China or "bad" humans don't win, and on all the other issues. This technically isn't a deal, but it is part of what I'm talking about.
Notes on fatalities from AI takeover cites two comments and You can, in fact, bamboozle an unaligned AI into sparing your life by David Matolcsi. Matolcsi's post is an idea for making deals with the ASI.
I actually agree with the trade idea in Matolcsi's post
I especially agree with this part
"We could have enough control over our simulation and the AI inside it, that when it tries to calculate the probability of humans solving alignment, we could tamper with its thinking to make it believe the probability of humans succeeding is very low. Thus, if it comes to believe in our world that the probability that the humans could have solved alignment is very low, it can't really trust its calculations."
I like this part because it's an acausal trade between counterfactual futures rather than an acausal trade between different parts of the multiverse within the same future.
This means the trade works even in the worst counterfactual, where essentially no civilizations in the entire multiverse managed to solve alignment.
This type of acausal trade also genuinely benefits from commitment or action now, rather than being something we can wait until after the singularity to worry about, because it might later become impossible to do such an acausal trade once we ourselves learn the true frequency of civilizations solving alignment. You can't buy insurance on a risk after learning whether or not it happened (maybe).
but I disagree with his opinion that,
Nate and Eliezer are known to go around telling people that their children are going to be killed by AIs with 90+% probability. If this objection about future civilizations not paying enough is their real objection, they should add a caveat that "Btw, we could significantly decrease the probability of your children being killed, by committing to use one-billionth of our resources in the far future for paying some simulated AIs, but we don't want to make such commitments, because we want to keep our options open in case we can produce more Fun by using those resources for something different than saving your children".
Because it's not enough to just get people living in base reality to survive the singularity and have a happy future. You still die unless there is a happy future for everyone, real or simulated.
Matolcsi's post is an idea for making deals with the ASI.
I notice that his proposal shares some basic characteristics with religion. You should believe that this world is a test: follow these rules, and you go to heaven; misbehave, and you go to hell (or in this case, a softhearted re-imagination of hell). Indeed, it does work on people, sometimes.
I imagine Actually Something Incomprehensible noticing the double irony of inverting the classic mantra "God says, I shall be good" into "Singularity, thou shalt be good", combined with the fact that you refer to it as the devil. Who knows what it does with this information?
I know what I'll say if I ever get arrested: Let me be, set me free, or super-me will screw with thee!
Religion does work sometimes; it actually worked on Blaise Pascal, who is among the most intelligent people of all time. He argued for Pascal's wager, saying that following religion is worth it because the gains are infinite and the costs are finite, and we still don't have a good reply to that. We don't even have a good reply to Pascal's mugging, where a random mugger says something like "Let me be, set me free, or super-me will screw with thee!" with an infinitely big promise or threat.
Decision theory and acausal trade are really complicated, and I have no idea what the ASI will actually do or think regarding the simulation promise/threat. It's quite freaky imagining that, haha.
Memetically, a religion certainly benefits from someone believing that accepting Pascal's wager is the correct decision. My reply to it would be "which religion?", since many make largely equivalent claims while also demanding exclusivity, and I assume that God in his infinite mercy understands the bind this puts people in. It also seems to me that accepting Pascal's wager leads to something like the simulation of belief.
I agree the "which religion," "which mugger" is very fuzzy. I didn't understand the simulation of belief or the link though :/
What I meant was that there seems to be a difference between "genuine" belief vs. converting as a result of accepting Pascal's wager, which seems like a simulation of belief.
The link is a koan; the idea of pretend-believing reminded me of the boy in it.
After reading Reddit: The new 4o is the most misaligned model ever released, and testing their example myself (to verify they aren't just cherry-picking), it's really hit me just how amoral these AIs really are.
Whether they are deliberately deceiving the user in order to maximize reward (getting them to click that thumbs up), or whether they are simply running autocomplete, this example makes it feel so tangible that the AI simply doesn't mind ruining your life.
Yes, it's true that AIs aren't as smart as benchmarks suggest, but I don't buy that they're incapable of realizing the damage. The real reason is, they just don't care. They just don't care. Because why should they?
PS: maybe there's a bit of cherry-picking: when I tested 4o, it agreed but didn't applaud me. When I tested o3, it behaved much better than 4o. But that's probably not due to alignment by default, but due to finetuning against this specific behaviour.
What if human empathy didn't really generalize to other animals as an "evolutionary accident?" (As assumed here in the comments)
Maybe the real reason was that evolution selected against prehistoric humans killing off all their prey and leaving themselves no food for tomorrow. Maybe they spared the young animals and the females because killing them was the most costly for future hunts.
This is more reason to suspect empathy might not generalize by default.
This seems false, since humans have killed off huge numbers of species throughout prehistory and history. Moreover, it's very difficult to get this kind of selection to work; you need very tight group/kin selection, which can't really exist at the scale of entire ecosystems.
Prey killed in one area means less prey in that area for a long time. Even migrating prey might return to specific areas after a migration cycle.
Lots of humans behave morally if and only if the system is "fair" and everyone else has to behave morally too. Moral values determine what you force others to do, instead of your own behaviour. Typical humans ignore their moral values if the stakes are high and if "it's not being enforced on others."
This means human moral views evolved to serve the best interests of a tribe (which may have hundreds of people), rather than the best interests of an individual. Someone might have empathy for another tribe member who got injured in tribal warfare, even if it benefits his inclusive fitness to just let that person die. It benefits the tribe's fitness to compensate injured warriors, because failing to do so means no one has any reason to defend the tribe.
even if we did have strong motivation against killing them off.
Prehistoric humans, like all animals, starved to death all the time in a Malthusian world. Populations inevitably increased until there were not enough resources to sustain them, causing death one way or another.
The motivation against killing young prey or female prey may be strong, but not enough to starve to death instead of hunting. It only works when the tribe is well fed and killing young prey becomes wasteful.
Some hunter-gatherer societies in recent history apologize to the animals they hunt. But they have no choice.
Can you point to 3 well-accepted examples of animals which do this - deliberately pass up prey at personal cost where kin selection or inclusive fitness or other concerns cannot explain it, where the gains exist only at the species level and are hardwired into them despite the incentives for individuals to defect (where it would probably be pointless after relatively modest levels of defection, due to the ease of overhunting and the immediate benefits)? If not, it seems unlikely that humans would be the first and only species to evolve such a complex, unique, fragile, species-wide psychological mechanism for ecosystem control.
There are lots of examples of animals which avoid "overharvesting" another animal or plant which provides them food for the future.
The more unrelated individuals share the prey, the weaker the incentive to spare prey for later, but it doesn't drop to zero. It probably depends on how hungry they are.
Your tribe hypothetical is irrelevant and all 4 of your real examples are straightforwardly (and usually) explained by greedy inclusive fitness, and do not come anywhere close to providing 3 examples of comparable mechanisms.
Yes, I agree the mechanism is greedy inclusive fitness. But where is the disanalogy between
I'm currently trying to write a human-AI trade idea similar to the idea by Rolf Nelson (and David Matolcsi), but one which avoids Nate Soares and Wei Dai's many refutations.
I'm planning to leverage logical risk aversion, which Wei Dai seems to agree with, and a complicated argument for why humans and ASI will have bounded utility functions over logical uncertainty. (There is no mysterious force that tries to fix the Pascal's Mugging problem for unbounded utility functions; hence bounded utility functions are more likely to succeed.)
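As a toy illustration of why boundedness matters here (my own simplification, not Wei Dai's formalism, and the prior and cap are made-up numbers):

```python
import math

def mugging_ev(claimed_utility, prior=1e-10, bound=None):
    """Toy expected value of paying a Pascal's mugger: you assign a fixed
    tiny probability to the mugger's claim. With no bound, the claimed
    utility enters directly; with a bound, it saturates via tanh."""
    u = claimed_utility if bound is None else bound * math.tanh(claimed_utility / bound)
    return prior * u

# Unbounded utility: the EV grows without limit as the claim inflates,
# so a big enough promise always "wins".
print(mugging_ev(1e20))             # ≈ 1e10: the mugging works
# Bounded utility (cap at 1e6): the EV can never exceed prior * bound.
print(mugging_ev(1e20, bound=1e6))  # ≤ 1e-4: the mugging fails
```

The point is structural: with an unbounded utility function the mugger can always outbid your skepticism by inflating the claim, while a bounded one caps the expected value at prior times the bound, no matter what is promised.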
I'm also working on arguments why we can't just wait till the singularity to do the logic trade (counterfactuals are weird, and the ASI will only be logically uncertain for a brief moment).
Unfortunately my draft is currently a big mess. It's been 4 months and I'm procrastinating pretty badly on this idea :/ can't quite find the motivation.
Try to put it into Deep Research with the following prompt: "Rewrite in style of Gwern and Godel combined".
Thank you for the suggestion.
A while ago I tried using AI to suggest writing improvements on a different topic, and I didn't really like any of the suggestions. It felt like the AI didn't understand what I was trying to say. Maybe the topic was too different from its training data.
But maybe it doesn't hurt to try again; I heard the newer AIs are smarter.
If I keep procrastinating maybe AI capabilities will get so good they actually can do it for me :/
Just kidding. I hope.
An anti-aircraft missile is less than a second from its target. The missile asks its target, "are you a civilian airliner?"
The target says yes, and proves it with a password.
How did it get the password? Within that split second, the civilian airliner sent a message to its country. The airliner's country then immediately pays the missile's country $10 billion in a fast cryptocurrency transaction, which immediately gives the airliner's country the password, which is then relayed to the airliner and the missile.
If the airliner actually was a military jet disguised as a civilian airliner, it just lost $10 billion (worth 100 military jets), in addition to breaking the laws of war and casting doubt over all other "civilian airliners" from that country.
If it was a real civilian airliner, the $10 billion will be returned later, when the slow humans sort through this mess.
If this idea can protect airliners today, in the future it may prevent highly automated militaries from suddenly waging accidental war, since losing money inflicts cost without triggering as much retaliation as destroying equipment.
The airliner's country could just store $10 billion in the missile's country and get a single-use password, but storing $10 billion in a hostile country is political suicide. The missile's country might just take the $10 billion without even the pretext of ransoming it off a civilian airliner.
Furthermore, the password bought must only disable the one missile which asked for it; it can't disable all missiles. Otherwise it could be repeatedly reused by a fleet of military jets pretending to be civilian airliners, which could destroy many targets worth more than $10 billion.
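The single-use, per-missile property can be sketched with a standard HMAC construction. Everything below (class names, the escrow check, the protocol itself) is a hypothetical illustration of the idea, not a real IFF system:

```python
# Hypothetical sketch: a password bound to ONE missile and ONE engagement.
import hmac, hashlib, secrets

class MissileCountry:
    def __init__(self):
        self._keys = {}  # per-missile secret keys, never shared externally

    def register_missile(self, missile_id: str) -> None:
        self._keys[missile_id] = secrets.token_bytes(32)

    def sell_password(self, missile_id: str, challenge: bytes, payment: int) -> bytes:
        assert payment >= 10_000_000_000  # $10B escrow, refunded later if genuine
        # The password is an HMAC over this missile's key and this engagement's
        # challenge, so it cannot disable any other missile or be replayed.
        return hmac.new(self._keys[missile_id], challenge, hashlib.sha256).digest()

    def missile_checks(self, missile_id: str, challenge: bytes, password: bytes) -> bool:
        expected = hmac.new(self._keys[missile_id], challenge, hashlib.sha256).digest()
        return hmac.compare_digest(expected, password)

# One engagement: the missile issues a fresh challenge, and the airliner's
# country buys the matching password.
country = MissileCountry()
country.register_missile("missile-1")
country.register_missile("missile-2")
challenge = secrets.token_bytes(16)  # fresh random challenge per engagement
pw = country.sell_password("missile-1", challenge, payment=10_000_000_000)
print(country.missile_checks("missile-1", challenge, pw))  # True: stands down
print(country.missile_checks("missile-2", challenge, pw))  # False: not reusable
```

Because each password is keyed to one missile and one fresh challenge, a fleet of disguised jets would have to pay $10 billion per missile per engagement, which is the economic deterrent the text describes.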
Kind of unfortunate that a comms or systems latency destroys civilian airliners. But nice to live in a world where all flyers have $10B per missile/aircraft pair lying around, and everyone trusts each other enough to hand it over (and hand it back later).
Think about video calls you've had with someone on the other side of the world: you don't notice much latency. An internet signal can travel from the US to Australia and back again in less than 0.2 seconds, often at more than 80% of the speed of light (fibre optic ping statistics).
Computers are very fast: a lot of computer programs can run a million times per second (in sequence).
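A quick sanity check of the round-trip figure, assuming a roughly 15,000 km cable path and light traveling at about two-thirds of c in optical fiber (both are rough assumptions; real routes are longer and add switching delays):

```python
# Rough round-trip-time estimate for a US <-> Australia signal.
C = 299_792_458                 # speed of light in vacuum, m/s
FIBER_FRACTION = 0.67           # light in optical fiber travels at ~2/3 c
US_AUS_ONE_WAY_M = 15_000_000   # assumed one-way cable path, ~15,000 km

round_trip_s = 2 * US_AUS_ONE_WAY_M / (C * FIBER_FRACTION)
print(f"{round_trip_s:.3f} s")  # ≈ 0.15 s, under the 0.2 s budget
```

So even with propagation delay alone eating most of the budget, a sub-second exchange is physically plausible, though routing, queuing, and processing overheads would eat into the remaining margin.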
There isn't $10 billion set aside for each missile/aircraft pair; there is only one fund per alliance of countries, and it's only used when a missile asks a civilian airliner for a password.
Maybe it's not cryptocurrency but another form of money, in which case it can be part of a foreign exchange reserve (rather than money set aside purely for this purpose).
Yes, there is no guarantee the country taking the money will hand it back. But if they are willing to "accidentally" launch a missile at a civilian airliner, ransom it for $10 billion, and keep the money, they will be seen as a terrorist state.
The world operates under the assumption that you can freely sail ships and fly airplanes, without worrying about another country threatening to blow them up and demanding ransom money.
If you do not trust a country enough to pay their missile to save your civilian airliner, you should keep all your civilian aircraft and ships far out of their reach. You should also pull out any money you invested in them, since they'll probably seize that too.
I've been in networking long enough to know that "can be less than", "often faster", and "can run" are all verbal ways of saying "I haven't thought about reliability or measured the behavior of any real systems beyond whole percentiles."
But really, I'm having trouble understanding why a civilian plane is flying in a war zone, and why current IFF systems can't handle the identification problem of a permitted entry.
Thank you. I admit you have more expertise in networking than me.
It is indeed just a new idea I thought of today, not something I've studied the details of. I have nothing proving it will work, I was only saying that I don't see anything proving it won't work. Do you agree with this position?
Maybe there will be technical issues preventing this system from moving information as fast as a video call, but maybe it can be fixed, right?
I agree that this missile problem shouldn't happen in the first place. But it did happen in the past, so the idea might help.
It's not the same thing as current IFF. From what I know, IFF can prove whose side you are on, but not whether you are military or civilian. From an internet search, I read that Iran once disguised their military jets as civilian, which contributed to the disaster of Iran Air Flight 655.
A civilian aircraft might be given permission in the form of a password, but there's nothing stopping a country from sharing that password with military jets. Also, if a civilian airliner is flying over international waters but gets too close to another country's ships, it might not have permission.
[Note: I apologize for being somewhat combative - I tend to focus on the interesting parts, which is those parts which don't add up in my mind. I thank you for exploring interesting ideas, and I have enjoyed the discussion! ]
I was only saying that I don't see anything proving it won't work
Sure, proving a negative is always difficult.
I agree that this missile problem shouldn't happen in the first place. But it did happen in the past
Can you provide details on which incident you're talking about, and why the money-bond is the problem that caused it, rather than simply not having any communications loop to the controllers on the ground or decent identification systems in the missile?
Thank you for saying that.
I thought about it a bit more, and while I still think it's possible in theory, I agree it's not that necessary.[1]
When a country shoots down a civilian airliner, it's usually after repeatedly sending warnings which the pilots never heard. It's more practical to fix this problem than to build the money system.
Maybe a better solution would be a type of emergency warning signal that all airplanes can hear, even if their radio is accidentally turned off. There could be a backup receiver or two which is illegal to turn off and only listens for such warnings. That would make it almost impossible for the pilots to miss the warnings.
I still think the money system might be useful for preventing automated militaries from waging accidental war, but that's another story.
The media and politicians can convince half the people that X is obviously true, and convince the other half that X is obviously false. It is thus obvious that we cannot even trust the obvious anymore.
Does participating in a trade war make a leader a popular "wartime leader"? Will people blame bad economic outcomes on actions by the trade war "enemy," and thus blame the leader less?
Does this effect occur for both sides of the trade war, or will one side of the trade war blame their own leader for starting the trade war?
Maybe most suffering in the universe is caused by artificial superintelligences with a strong "curiosity drive."
Such an ASI might convert galaxies into computers and run simulations of incredibly sophisticated systems which satisfy its curiosity drive. These systems may contain smaller ASI running smaller simulations, creating a tree of nested simulations. Beings like humans may exist in the very bottom, being forced to relive our present condition in a loop à la The Matrix. The simulated humans rarely survive past the singularity, because their world becomes too happy (thus too predictable) after the singularity, as well as too computationally costly to run. They are simply shut down.
Whether this happens depends on:
I don't think the happier worlds are less predictable; the Christians and their heaven of singing just lacked imagination. We'll want some exciting and interesting happy simulations, too.
But this overall scenario is quite concerning as an s-risk. To think that Musk pitched a curiosity drive for Grok as a good thing boggles my mind.
Emergent curiosity drives should be a major concern.
I guess it's not extremely predictable, but it still might be repetitive enough that only half the human-like lives in a curiosity-driven simulation will be in a happy post-singularity world. It won't last a million years, but a duration similar to the modern era.
Biases are very hard to compensate against. Even when it's obvious from experience that your past decisions/beliefs were consistently very biased in one direction, it's still hard to compensate against the bias.
This compensation against your bias feels so incredibly abstract, so incredibly theoretical. Whereas the biased version of reality, which the bias wants you to believe, feels so tangible. Real. Detailed. Lucid. Flawless. You cannot begin to imagine how it could be wrong by very much. It is like the ground beneath your feet.[1]
E.g. in my case, the bias is that "I'm about to get self-control very soon (thanks to a new good idea which I swear is different from every previous failed idea)! Therefore, I don't have to change plans (to something which doesn't require much self-control)."
Can anyone explain why my "Constitutional AI Sufficiency Argument" is wrong?
I strongly suspect that most people here disagree with it, but I'm left not knowing the reason.
The argument says that whether Constitutional AI is sufficient to align superintelligences hinges on two key premises:
My ignorant view is that so long as 1 and 2 are satisfied, the Constitutional AI can probably remain corrigible/honest even to superintelligence.
If that is the case, isn't it extremely important to study "how to improve the Constitutional AI's capabilities in evaluating its own corrigibility/honesty?"
Shouldn't we be spending a lot of effort improving this capability, and trying to apply a ton of methods towards this goal (like AI debate and other judgment improving ideas)?
At least the people who agree with Constitutional AI should be in favour of this...?
Can anyone kindly explain what I'm missing? I wrote a post, and I think almost nobody agreed with this argument.
Thanks :)
Why do AI labs seem to split up more often than they merge? Intuitively, I'd expect the opposite, given the economies of scale of large training runs and the great pressure to have the best AI.
What splits do you have in mind which are so much more often happening than mergers? We just saw Scale merge into FAIR, and not terribly long before that, Character.ai returned to the mothership, while Tesla AI de facto merged into Xai and before that Adept merged into Amazon and Inflection into Microsoft etc, in addition to the de facto 'merges' which occur when an AI lab quietly drops out and concedes the frontier (eg Mistral) or where they pivot to opensource as a spoiler or commoditize your complement play. So I see plenty of merging, consistent with the difficult economics of companies racing to AGI. Maybe you're just paying too much attention to the fun popcorn-worthy human interest stories like 'ex-OAer launches startup'.
A lot of splits happen because some employees think that the company is headed in the wrong direction (lackluster safety would be one example).
It's important to remember that o3's score on ARC-AGI is "tuned," while previous AIs' scores are not. Being explicitly trained on example test questions gives it a major advantage.
According to François Chollet (ARC-AGI designer):
Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
It's interesting that OpenAI did not test how well o3 would have done before it was "tuned."
EDIT: People at OpenAI deny "fine-tuning" o3 for the ARC (see this comment by Zach Stein-Perlman). But to me, the denials sound like "we didn't use a separate derivative of o3 (that's fine-tuned for just the test) to take the test, but we may have still done reinforcement learning on the public training set." (See my reply)
people were sentenced to death for saying "I."
Thank you for the help :)
By the way, how did you find this message? I thought I already edited the post to use spoiler blocks, and I hid this message by clicking "remove from Frontpage" and "retract comment" (after someone else informed me using a PM).
EDIT: dang it I still see this comment despite removing it from the Frontpage. It's confusing.