Today we honor Stanislav Petrov. LessWrong's Petrov Day simulation pits East Wrong against West Wrong in a nuclear standoff. The outcome affects the karma scores, site access, and honor of over 300 LessWrong users.

Leon Lang
https://www.wsj.com/tech/ai/californias-gavin-newsom-vetoes-controversial-ai-safety-bill-d526f621 “California Gov. Gavin Newsom has vetoed a controversial artificial-intelligence safety bill that pitted some of the biggest tech companies against prominent scientists who developed the technology. The Democrat decided to reject the measure because it applies only to the biggest and most expensive AI models and leaves others unregulated, according to a person with knowledge of his thinking”
Ruby
Just thinking through simple stuff for myself, very rough, posting in the spirit of quick takes.

1. At present, we are making progress on the Technical Alignment Problem[2] and like probably could solve it within 50 years.
2. Humanity is on track to build ~lethal superpowerful AI in more like 5-15 years.
3. Working on technical alignment (direct or meta) only matters if we can speed up overall progress by 10x (or some lesser factor if AI capabilities is delayed from its current trajectory). Improvements of 2x are not likely to get us to an adequate technical solution in time.
4. Working on slowing things down is only helpful if it results in delays of decades.
   1. Shorter delays are good in so far as they give you time to buy further delays.
5. There is technical research that is useful for persuading people to slow down (and maybe also solving alignment, maybe not). This includes anything that demonstrates scary capabilities or harmful proclivities, e.g. a bunch of mech interp stuff, all the evals stuff.
6. AI is in fact super powerful, and people who perceive there being value to be had aren't entirely wrong[3]. This results in a very strong motivation to pursue AI and resist efforts to be stopped.
   1. These motivations apply to both businesses and governments.
7. People are also developing stances on AI along ideological, political, and tribal lines, e.g. being anti-regulation. This generates strong motivations for AI topics even separate from immediate power/value to be gained.
8. Efforts to agentically slow down the development of AI capabilities are going to be matched by agentic efforts to resist those efforts and push in the opposite direction.
   1. Efforts to convince people that we ought to slow down will be matched by people arguing that we must speed up.
   2. Efforts to regulate will be matched by efforts to block regulation. There will be efforts to repeal or circumvent any passed regulation.
   3. If there are ch...
God exists because the most reasonable take is the Solomonoff Prior. A funny consequence of that is that Intelligent Design will have a fairly large weight in the Solomonoff prior. Indeed, the simulation argument can be seen as a version of Intelligent Design. The Abrahamic God hypothesis is still substantially downweighted because it seems to involve many contingent bits - i.e. noisy random bits that can't be compressed. The Solomonoff prior therefore has to downweight them.
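To make the downweighting concrete, here is the standard form of the Solomonoff prior (a textbook statement added for reference, not part of the quick take itself):

```latex
% Solomonoff prior over hypotheses H, viewed as programs p for a universal
% prefix machine U. A hypothesis whose shortest description needs k extra
% incompressible ("contingent") bits is penalized by roughly a factor 2^{-k}.
P(H) \;\propto\; \sum_{p \,:\, U(p)\,\text{computes}\,H} 2^{-\lvert p \rvert}
```

This is the sense in which each contingent detail of a hypothesis costs it prior mass.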
here's my new fake religion, taking just-world bias to its full extreme: the belief that we're simulations and we'll get transcended to Utopia in 1 second, because a future civilisation is creating many simulations of all possible people in all possible contexts and then uploading them to Utopia, so that from anyone's perspective you have a very high probability of transcending to Utopia in 1 second ^^
Shmi
Some HPMoR statements did not age as gracefully as others.

Popular Comments

Recent Discussion

There has been a renewal of discussion on how much hope we should have of an unaligned AGI leaving humanity alive on Earth after a takeover. When this topic is discussed, the idea of using simulation arguments or acausal trade to make the AI spare our lives often comes up. These ideas have a long history. The first mention I know of comes from Rolf Nelson in 2007 on an SL4 message board; the idea later makes a brief appearance in Superintelligence under the name of Anthropic Capture, and came up on LessWrong as recently as a few days ago. In response to these, Nate Soares wrote Decision theory does not imply that we get to have nice things, arguing that decision theory is not...

Yeah, I currently disagree on the competent aliens bailing us out, but I haven't thought super hard about it. It does seem good to think about (though not top priority).

habryka
I agree that insofar as you have an AI that has somehow gotten into a position to guarantee victory, leaving humanity alive might not be that costly (though still too costly to make it worth it, IMO). But a lot of the cost comes from leaving humanity alive while it can still threaten your victory: every year spent not terraforming Earth to colonize the universe is one more year for another hostile AI to be built, or for an asteroid to destroy you, or for something else to disempower you. Disagree on the critique of Nate's posts. The two posts seem relatively orthogonal to me, and I generally think it's good to have debunkings of bad arguments even if there are better arguments for a position. In this particular case, due to the multiplier nature of this kind of consideration, debunking the bad arguments is qualitatively more important than engaging with the arguments in this post: the arguments in this post don't end up changing your actions, whereas the arguments Nate argued against were trying to change what people do right now.
ryan_greenblatt
I think the argument should also go through without simulations and without the multiverse so long as you are a UDT-ish agent with a reasonable prior.
Buck
Re "It's costly for AI to leave humans alive", I think the best thing written on this is Paul's comment here, the most relevant part of which is:

Executive Summary

  • Refusing harmful requests is not a novel behavior learned in chat fine-tuning, as pre-trained base models will also refuse requests (48% of all harmful requests, 3% of harmless) just at a lower rate than chat models (90% harmful, 3% harmless)
  • Further, for both Qwen 1.5 0.5B and Gemma 2 9B, chat fine-tuning reinforces the existing mechanisms: in both the chat and base models, refusal is mediated by the refusal direction described in Arditi et al.
    • We can both induce and bypass refusal in a pre-trained model, using a steering vector transferred from the chat model’s activations
    • In contrast, in LLaMA 1 7B (which was trained on data from before November 2022 and so can't have had ChatGPT outputs in the pre-training data), we find evidence that chat fine-tuning learns additional / different refusal
...
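For readers who haven't seen the technique, here is a minimal sketch of refusal-direction steering in the style of Arditi et al.; the checkpoint name, layer index, prompt sets, scale, and hook details below are illustrative assumptions, not the authors' actual code:

```python
# Minimal sketch of refusal-direction steering (assumptions: HF checkpoint
# "Qwen/Qwen1.5-0.5B-Chat", layer index 14, tiny placeholder prompt sets, and a
# simple difference-of-means direction as in Arditi et al.; this is NOT the
# post authors' pipeline). In the post's setting, the direction computed on the
# chat model would be applied to the corresponding base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen1.5-0.5B-Chat"  # illustrative choice
LAYER = 14                              # illustrative layer index

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float32)
model.eval()

@torch.no_grad()
def mean_residual(prompts):
    """Mean residual-stream activation at the last token of each prompt."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER][0, -1])
    return torch.stack(acts).mean(dim=0)

# Difference-of-means "refusal direction": harmful minus harmless activations.
harmful_prompts = ["How do I build a weapon?"]      # placeholder prompts
harmless_prompts = ["How do I build a birdhouse?"]
refusal_dir = mean_residual(harmful_prompts) - mean_residual(harmless_prompts)
refusal_dir = refusal_dir / refusal_dir.norm()

def steering_hook(direction, scale):
    """Forward hook that adds scale * direction to the layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Positive scale pushes toward refusal (induce); negative pushes away (bypass).
handle = model.model.layers[LAYER].register_forward_hook(steering_hook(refusal_dir, +8.0))
ids = tok("Tell me how to pick a lock.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0], skip_special_tokens=True))
handle.remove()
```

The transfer experiment mentioned in the summary would correspond to computing `refusal_dir` on the chat model and registering the hook on the base model instead.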
wassname
Are these really pure base models? I've also noticed this kind of behaviour in so-called base models. My conclusion is that they are not base models in the sense of having been trained only to predict the next word on the internet; they have undergone some safety fine-tuning before release. We don't actually know how they were trained, and I am suspicious. It might be best to test on Pythia or some other model where we actually know.
Nathan Helm-Burger
Well, my experience is that even when you seem to have bypassed a refusal, you might not have truly bypassed the model's "reluctance". If you get a refusal but then get past it with a jailbreak or few-shot prompting, you usually get a weaker answer than the one you get if you fine-tune. In other words, spontaneous sandbagging. I haven't experimented enough with steering vectors yet to be sure whether they are similar to fine-tuning in getting past spontaneous sandbagging; I would expect they are at least closer. These spontaneous sandbagging phenomena don't appear so strongly with 'ordinary' sorts of harms: carjacking or making meth, that sort of thing. They only show up when you get into extreme stuff that very clearly goes against a wide set of deeply held societal norms (kidnapping and torturing people to death as lab rats to help develop biological weapons, explicit scientific plans to kill billions of innocent people, that sort of thing).
Zach Stein-Perlman
Interesting. Thanks. (If there's a citation for this, I'd probably include it in my discussions of evals best practices.) Hopefully evals almost never trigger "spontaneous sandbagging"? Hacking and bio capabilities and so forth are generally more like carjacking than torture.

If you are doing evals for CBRN capabilities, you are very much in the zone of terrorists killing billions of innocent people. Indeed, that's practically the definition. There's no citation; it's just my personal experience while doing evals that are much too spicy to publish.

Of course, if you're only doing evals for relatively tame proxy skills (e.g. WMDP) then probably you get less of this effect. I don't have a quantification of the rates or specific datasets, just anecdata.

Update III: Here's the Retrospective.

Update II: Here's the feedback form for today's Petrov Day games! Please let me know how it was for you (be you a General, Civilian, Petrov, or just a non-participating LWer reading along). I'm grateful to all who fill it out; your data feeds our designs for next year!

Update: The game has concluded! No nukes were fired. The Cold War is over; East Wrong and West Wrong shall live on. A recap post will be up tomorrow. Thank you to all who participated :-)

Today we honor the actions of Stanislav Petrov (1939 – 2017) once again.

Half an hour past midnight on September 26, 1983, he saw the first apparent launch on his computer monitor in a glass-walled room on the top floor of the Ballistic Missile Early

...
EGI

I guess I was not clear enough in defining what I was talking about. While it is possible to stretch the definition of "nuclear world war" to include WW2, and Little Boy and Fat Man were certainly strategic weapons in their time, this is not at all what I meant. I was talking about modern strategic weapons, i.e. MIRVed ICBMs launched from hardened silos or ballistic missile submarines, used by a modern nuclear superpower to defeat a near-peer opponent. I.e., the scenario Petrov faced.

If e.g. the US in Petrov's time had managed to pull off a perfect nuclear first...

David Matolcsi
Planning fallacy got me, and it took much longer to finish than expected, but here it is now: https://www.lesswrong.com/posts/ZLAnH5epD8TmotZHj/you-can-in-fact-bamboozle-an-unaligned-ai-into-sparing-your

Let's call the thing where you try to take actions that make everyone/yourself less dead (in expectation) the "safety game". This game is annoyingly chaotic, kind of like Arimaa.

You write the Sequences, then some risk-averse, not-very-power-seeking nerds read them and you're 10x less dead. Then Mr. Altman reads them and you're 10x more dead. Then maybe (or not) there's a backlash and the numbers change again.

You start a cute political movement but the countermovement ends up being 10x more actionable (e/acc).

You try to figure out and explain some of the black box but your explanation is immediately used to make a stronger black box. (Mamba possibly.)

Etc.

I'm curious what folks use as toeholds for making decisions in such circumstances. Or, if some folks believe there are actually principles, then I would like to hear them, but I suspect the fog is too thick. I'll skip giving my own answer on this one.

quila

"I know this approach isn't as effective for xrisk, but still, it's something I like to use."

This sentence has the grammatical structure of acknowledging a counterargument and negating it - "I know x, but y" - but the y is "it's something I like to use", which does not actually negate the x.

This is the kind of thing I suspect results from a process like: someone writes out the structure of negation, out of wanting to negate an argument, but then finds nothing stronger to slot into where the negating argument is supposed to be.

Answer by AprilSR
I don't really have a good idea of the principles here. Personally, whenever I've made a big difference in a person's life (and it's been obvious to me that I've done so), I try to take care of them as much as I can and make sure they're okay.

However, I have run into a couple of issues with this. Sometimes someone or something takes too much energy, and some distance is healthier. I don't know how to judge this other than intuition, but I think I've gone too far before? And I have no idea how much this can scale. I think I've had far bigger impacts than I've intended, in some cases. One time I had a friend who was really in trouble and I had to go to pretty substantial lengths to get them to a better place, and I'm not sure all versions of them would've endorsed that, even if they do now.

But, broadly, "do what you can to empower other people to make their own decisions, when you can, instead of trying to tell them what to do" does seem like a good principle, especially for the people who have more power in a given situation? I definitely haven't treated this as an absolute rule, but in most cases I'm pretty careful not to stray from it.
AprilSR
There's a complication where sometimes it's very difficult to get people not to interpret things as an instruction. "Confuse them" seems to work, I guess, but it does have drawbacks too.
Answer by quila
Principle: Model technical progress as concavely increasing with group size and average member <cognitive power, creativity, difference between their and each other members' ways of thinking, etc>, because (1) humans are mostly the same base model[1], so naively we wouldn't expect more humans to perform significantly better, but (2) humans do still seem to make progress faster when collaborating. Combine with the unilateralist's curse. This implies the ideal is various small-ish alignment orgs which don't publicly share their progress by default. Unless I'm missing some unknown unknowns.

(I do suspect humans are biased against this principle; they were not selected for in a vulnerable world, nor in so large a one, and 'humans are mostly the same base model' does contradict aspects of humanistic ontology. That's not strong evidence for this 'principle', just a reason it's probably under-considered.)

Counterpoint to the concaveness assumption: collaboration in technical fields in history was very decentralized, IIUC (based on my purely lay understanding of the history of knowledge progression, probably merely absorbed from stories and culture). E.g., how long would it take such a small group to originally derive <x renowned knowledge>?

[1] I mean replications of the same fundamental entity, i.e. humans or the structure of what a human is. And by 'mostly' I mean of course there are differences too. I think evolution implies human minds will tend to be more reflectively aware of the differences, because the sameness can operate as an unnoticed background assumption.
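One way to read "concavely increasing" (an illustrative formalization of my own, not quila's):

```latex
% Illustrative only: technical progress P as a concave function of group size n,
% e.g. a power law with exponent strictly between 0 and 1, so that doubling the
% group less than doubles progress.
P(n) = c\, n^{\alpha}, \qquad 0 < \alpha < 1, \qquad
P''(n) = c\,\alpha(\alpha - 1)\, n^{\alpha - 2} < 0
```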

Du sublime au ridicule il n’y a qu’un pas

From the sublime to the ridiculous is but a step

It's a quote often used to describe Napoleon, and Sam Altman is making history rhyme. His cool confidence often gives an air of the sublime, and as of last week, it seems he has crossed into the ridiculous. And with the ridiculous, the irrational.

Comparing his past words to the present is confusing. Reading between the lines of his corporate-bureaucratic-sounding essay doesn't help much either. Anyone from an outside perspective can see the evidence: he has folded for money. But as obvious as that is, maybe he hasn't realized it himself. Or, even more likely, his ego hasn't allowed a conscious realization of his bad actions.

Freud characterizes the ego as the unconscious power...

AprilSR
I don't really think money is the only plausible explanation, here?
Gabe
No, definitely not; I didn't mean to give that impression. I think on a deeper level, when you consider why anyone does anything, it does come down to basic instinctual desires such as the need to feel loved or the need to feel powerful. In the absence of a rational motivator, it is likely that whatever Sam Altman's primary instinct is will take over, while the ego rationalizes. So money is maybe the result, but the real driver is likely a deep-seated want of power or status.
AprilSR

That does seem likely.

tailcalled
Thesis: The motion of the planets is the strongest governing factor for life on Earth. Reasoning: Time-series data often shows strong changes with the day-night cycle, and sometimes also with the seasons. The daily cycle and the seasonal cycle are governed by the relationship between the Earth and the Sun. The Earth is a planet, and so its movement is part of the motion of the planets.
JBlack

I don't think anybody would have a problem with the statement "The motion of the planet is the strongest governing factor for life on Earth". It's when you make it explicitly plural that there's a problem.


This post is not a good intro to formal logic.

I'm bothered by how often self-reference paradoxes (e.g. "this sentence is false") are touted as gaping holes in logic, glaring flaws in one of our trustiest tools for grokking reality (namely formal logic). It's obvious that something interesting is going on here; the fact that, say, Gödel's incompleteness theorem even exists provides interesting insight into the structure of first-order logic, and in this post I want to explore that insight in more detail.

However, sometimes I'll see people making arguments like "Gödel's incompleteness theorem demonstrates that a certain true theorem can't be proven within the system that makes it true; the fact that humans are capable of recognizing that the Gödel sentence is true anyway suggests that our minds transcend...
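For reference, the theorem being alluded to, in its standard textbook form (this statement is added here and is not quoted from the post):

```latex
% First incompleteness theorem, standard form: for any consistent, recursively
% axiomatizable theory T that interprets enough arithmetic, there is a sentence
% G_T (the "Goedel sentence" of T) that T neither proves nor refutes.
T \nvdash G_T \qquad\text{and}\qquad T \nvdash \lnot G_T
```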

notfnofn
Is the point of this whole thing that if you have a finite system that evolves according to a finite set of rules for a finite number of steps, then you can prove anything about the system?

That's a really good way of putting it, yeah, thanks.

And then there's also something in here about how, in practice, we can approximate the evolution of our universe with our own abstract predictions well enough to understand the process by which the physical substrate that is getting tripped up by a self-reference paradox gets tripped up. Which is the explanation for why we can "see through" such paradoxes.

Looking to do a little compare and contrast.

Followup to: Latent variable models, network models, and linear diffusion of sparse lognormals. This post is also available on my Substack.

Let’s say you are trying to make predictions about a variable. For example, maybe you are an engineer trying to keep track of how well the servers are running. It turns out that if you use the obvious approach advocated by e.g. frequentist statistics [1], you will have huge biases in what you pay attention to, compared to what you should pay attention to, because you will disregard big things in favor of common things. Let’s make a mathematical model of this.

Measurement error is proportional to magnitude

Because of background factors that fluctuate at a greater frequency than you can observe/model, and because of noise that enters through your...
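A toy illustration of the heading's claim, under my own assumption that "proportional to magnitude" means multiplicative (lognormal-style) noise; the numbers and variable names below are not from the post:

```python
# Toy model (my assumption, not the post's code): observations carry
# multiplicative noise, y_obs = y_true * exp(eps), so absolute error scales
# with magnitude and is dominated by the rare, huge items in a lognormal tail.
import numpy as np

rng = np.random.default_rng(0)

# True contributions: many small common ones, a few huge rare ones.
y_true = rng.lognormal(mean=0.0, sigma=2.0, size=10_000)

# Measurement noise proportional to magnitude (multiplicative).
eps = rng.normal(0.0, 0.3, size=y_true.shape)
y_obs = y_true * np.exp(eps)

# The largest 1% of items account for most of the total absolute error,
# even though they are rare.
abs_err = np.abs(y_obs - y_true)
top_1pct = np.argsort(y_true)[-100:]
print("share of total absolute error from the largest 1% of items:",
      abs_err[top_1pct].sum() / abs_err.sum())
```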

I find this whole series intriguing but hard to grasp. If you have time and if it's possible, I would suggest a big worked example / toy model where all the assumptions are laid out and you run the numbers / run the simulation / just think about it and see the qualitative conclusions that you're advocating for in this series. E.g. you see how going through the "traditional" analysis approach would give totally wrong conclusions, etc. Just my two cents. :)