Joe summarizes his new report on "scheming AIs": advanced AI systems that fake alignment during training in order to gain power later. He explores different types of scheming (e.g. distinguishing "alignment faking" from "power-seeking"), asks what the prerequisites for scheming are, and considers the paths by which it might arise.

Fabien Roger
I think that prior to this paper, the discussion around scheming was pretty confusing, spread throughout many posts which were not all specifically about scheming, and full of pretty bad arguments. This paper fixed that by bringing together most (all?) of the main considerations for and against expecting scheming to emerge. I found this helpful for clarifying my thinking around the topic, which made me more confident in my focus on AI control and less confused when I worked on the Alignment Faking paper. It is also helpful as a list of reasons why someone reasonable might expect scheming (without finding it overwhelmingly likely either) that I can point skeptical people at without being afraid that it contains massive over- or understatements. I think this paper will become pretty outdated as we get closer to understanding what AGI looks like and as we get better model organisms, but it is currently the best resource about the conceptual arguments for and against scheming propensity. I strongly recommend (the audio version of) this paper for people who want to work on scheming propensity.
Thomas Kwa
Some versions of the METR time horizon paper from alternate universes:

Measuring AI Ability to Take Over Small Countries (idea by Caleb Parikh)

Abstract: Many are worried that AI will take over the world, but extrapolation from existing benchmarks suffers from a large distributional shift that makes it difficult to forecast the date of world takeover. We rectify this by constructing a suite of 193 realistic, diverse countries with territory sizes from 0.44 to 17 million km^2. Taking over most countries requires acting over a long time horizon, with the exception of France. Over the last 6 years, the land area that AI can successfully take over with 50% success rate has increased from 0 to 0 km^2, doubling 0 times per year (95% CI 0.0-0.0 yearly doublings); extrapolation suggests that AI world takeover is unlikely to occur in the near future. To address concerns about the narrowness of our distribution, we also study AI ability to take over small planets and asteroids, and find similar trends.

When Will Worrying About AI Be Automated?

Abstract: Since 2019, the amount of time LW has spent worrying about AI has doubled every seven months, and now constitutes the primary bottleneck to AI safety research. Automation of worrying would be transformative to the research landscape, but worrying includes several complex behaviors, ranging from simple fretting to concern, anxiety, perseveration, and existential dread, and so is difficult to measure. We benchmark the ability of frontier AIs to worry about common topics like disease, romantic rejection, and job security, and find that current frontier models such as Claude 3.7 Sonnet already outperform top humans, especially in existential dread. If these results generalize to worrying about AI risk, AI systems will be capable of autonomously worrying about their own capabilities by the end of this year, allowing us to outsource all our AI concerns to the systems themselves.

Estimating Time Since The Singularity

Early work
Yonatan Cale
Seems like Unicode officially added a "person being paperclipped" emoji: Here's how it looks in your browser: 🙂‍↕️ Whether they did this as a joke or to raise awareness of AI risk, I like it! Source: https://emojipedia.org/emoji-15.1
keltan
I feel a deep love and appreciation for this place, and the people who inhabit it.
lc
My strong upvotes are now giving +1 and my regular upvotes give +2.
RobertM
Pico-lightcone purchases are back up, now that we think we've ruled out any obvious remaining bugs.  (But do let us know if you buy any and don't get credited within a few minutes.)


Recent Discussion

Nice reminiscence from Stephen Wolfram on his time with Richard Feynman:

Feynman loved doing physics. I think what he loved most was the process of it. Of calculating. Of figuring things out. It didn’t seem to matter to him so much if what came out was big and important. Or esoteric and weird. What mattered to him was the process of finding it. And he was often quite competitive about it. 

Some scientists (myself probably included) are driven by the ambition to build grand intellectual edifices. I think Feynman — at least in the years I knew him — was m

... (read more)

Introduction

Decision theory is about how to behave rationally under conditions of uncertainty, especially if this uncertainty involves being acausally blackmailed and/or gaslit by alien superintelligent basilisks.

Decision theory has found numerous practical applications, including proving the existence of God and generating endless LessWrong comments since the beginning of time.

However, despite the apparent simplicity of "just choose the best action", no comprehensive decision theory that resolves all decision-theoretic dilemmas has yet been formalized. This paper at long last fills this gap, by introducing a new decision theory: VDT.

Decision theory problems and existing theories

Some common existing decision theories (a toy comparison follows the list) are:

  • Causal Decision Theory (CDT): select the action that *causes* the best outcome.
  • Evidential Decision Theory (EDT): select the action that you would be happiest to learn that you had taken.
  • Functional Decision Theory
...
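To make the contrast between the first two concrete, here is a minimal sketch of how CDT and EDT come apart on Newcomb's problem (my own toy illustration with an assumed 99% predictor accuracy; not from the post, and VDT itself is not modeled):

```python
# Toy Newcomb's problem, a standard illustration of the CDT/EDT split.
# A 99%-accurate predictor fills an opaque box with $1M iff it predicts
# you take only that box; a transparent box always holds $1k.
ACCURACY = 0.99  # assumed predictor accuracy

def edt_value(one_box: bool) -> float:
    # EDT: your choice is evidence about what the predictor foresaw.
    p_million = ACCURACY if one_box else 1 - ACCURACY
    return p_million * 1_000_000 + (0 if one_box else 1_000)

def cdt_value(one_box: bool, p_million: float) -> float:
    # CDT: the box contents are causally fixed; choosing can't change them,
    # so two-boxing adds $1k for any fixed p_million.
    return p_million * 1_000_000 + (0 if one_box else 1_000)

print(edt_value(True), edt_value(False))            # 990000.0 11000.0 -> EDT one-boxes
print(cdt_value(True, 0.5), cdt_value(False, 0.5))  # 500000.0 501000.0 -> CDT two-boxes
```

The point of the sketch: CDT holds the prediction fixed and so always gains the extra $1k by two-boxing, while EDT conditions on the choice itself and so one-boxes.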
amitlevy49
This post served to effectively convince me that FDT is indeed perfect, since I agree with all its decisions. I'm surprised that Claude thinks paying Omega the $100 has poor vibes.
avturchin
If we know the correct answers to decision theory problems, we must have some internal instrument for learning them: either a theory or a vibe meter. Claude seems to have learned to mimic our internal vibe meter. The problem is that it will not work outside the distribution.
xpym

The problem is that it will not work outside the distribution.

Of course, but neither would anything else so far discovered...

Seth Herd
Still laughing. Thanks for admitting you had to prompt Claude out of being silly; lots of bot results neglect to mention that methodological step. This will be my reference for all decision theory discussions henceforth. Have all of my 40-some strong upvotes!

PDF version. berkeleygenomics.org. Twitter thread. (Bluesky copy.)

Summary

The world will soon use human germline genomic engineering technology. The benefits will be enormous: Our children will be long-lived, will have strong and diverse capacities, and will be halfway to the end of all illness.

To quickly bring about this world and make it a good one, it has to be a world that is beneficial, or at least acceptable, to a great majority of people. What laws would make this world beneficial to most, and acceptable to approximately all? We'll have to chew on this question ongoingly.

Genomic Liberty is a proposal for one overarching principle, among others, to guide public policy and legislation around germline engineering. It asserts:

Parents have the right to freely choose the genomes of their children.

If upheld,...

River
Genetic engineering is a thing you do to a living person. If a person is going to go on to live a life, they don't somehow become less a person because you are influencing them at the stage of being an embryo in a lab. That's just not a morally coherent distinction, nor is it one the law makes.

Nothing in my position is hinging on my personal moral views. I am trying to point out to you that almost everyone in our society has the view that blinding children is evil. And our society already has laws against child abuse which would prohibit blinding children by genetic engineering. Virtually nobody wants to change that, and any politician who tried to change those laws would be throwing away their career. It's not about me. I'm pointing out where society is.

If you want to start a campaign to legalize the blinding of children, well, we have a free speech clause; you are entitled to do that. Have you considered maybe doing it separately from the genetic engineering thing? The technology to blind children already exists. If you really think it is worth running an experiment on a generation of children, why don't you try to legalize doing it with the technology we already have and go from there? If, somehow, you succeed in changing the law, you'd even get your experiment quicker.

How would a person who has been blind their whole life know? They haven't had the experience of sight to compare to. They seem like the people in the worst position to make the comparison. People who have the experience of seeing are necessarily the ones who can judge whether that is a good thing or not.

When it comes to any child, it is up to the existing law of child abuse. That trumps whatever an individual parent may think. Let's not get into autistic people. Autism comes in more varieties than blindness, and some of those varieties I think are much more debatable. For blindness, do you have any idea what that fraction is?
TsviBT

On the "self-governing" model, it might be that the blind community would want to disallow propagating blindness, while the deaf community would not disallow it:

https://pmc.ncbi.nlm.nih.gov/articles/PMC4059844/

Judy: And how’s… I mean, I know we’re talking about the blind community now, but in a DEAF (person’s own emphasis) community, some deaf couples are actually disappointed when they have an able bodied… child.

William: I believe that’s right.

Paul: I think the majority are.

Judy: Yes. Because then …

Margaret: Do they?

Judy: Oh, yes! It’s well known down at ... (read more)

TsviBT
Our society also has the view that people should be allowed to reproduce freely, even if they'll pass some condition on to their child.
TsviBT
I'm mainly talking about engineering that happens before the embryo stage. Of course it's a distinction the law makes: IIUC it's not even illegal for a pregnant woman to drink alcohol. I can't tell if you're strawmanning to make a point, or what, but anyway this makes absolutely no sense.

I run a lot of one-off jobs on EC2 machines. This usually looks like:

  • Stand up a machine
  • Mess around for a while trying things and writing code
  • Run my command under screen
For short jobs this is fine, but when I run a long job there are two issues:
  • If the machine costs a non-trivial amount and the job finishes in the middle of the night I'm not awake to shut it down.

  • I could, and sometimes do, forget to turn the machine off.

Ideally I could tell the machine to shut itself off if no one was logging in and there weren't any active jobs.

I didn't see anything like this (though I didn't look very hard) so I wrote something (github):

$ prevent-shutdown long-running-command

As long as that command is still running, or someone is logged in over ssh,...
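A minimal sketch of that logic (my hypothetical reconstruction, not the code in the linked repo; the `who`-based login check and passwordless sudo are assumptions):

```python
#!/usr/bin/env python3
"""Sketch: run a command, then power the machine off once the command
has finished and nobody has been logged in for a while."""

import subprocess
import sys
import time

CHECK_INTERVAL_SECONDS = 60
IDLE_CHECKS_REQUIRED = 5  # ~5 minutes of consecutive idleness


def anyone_logged_in() -> bool:
    # `who` prints one line per login session; empty output means nobody is on.
    out = subprocess.run(["who"], capture_output=True, text=True).stdout
    return bool(out.strip())


def main() -> None:
    if len(sys.argv) < 2:
        sys.exit("usage: prevent-shutdown COMMAND [ARGS...]")

    # Run the wrapped long-running job to completion.
    subprocess.run(sys.argv[1:])

    # Require several consecutive idle checks before shutting down,
    # so a brief disconnect doesn't trigger a premature halt.
    idle_checks = 0
    while idle_checks < IDLE_CHECKS_REQUIRED:
        time.sleep(CHECK_INTERVAL_SECONDS)
        idle_checks = idle_checks + 1 if not anyone_logged_in() else 0

    # Assumes passwordless sudo on the instance; on EC2, halting stops
    # billing for instances whose shutdown behavior is "stop".
    subprocess.run(["sudo", "shutdown", "-h", "now"])


if __name__ == "__main__":
    main()
```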

jefftk

That's elegant in some sense, but somehow doesn't feel like the right way to do it.

Epistemic status: sure of the fundamentals; the informal tone is a choice to publish this take faster. The title is a bit of clickbait, but in good faith. I don't think much context is needed here: ghiblification and native image generation was (and still is) a very much all-encompassing phenomenon.

No-no, not in the way you might have thought. Of course it is terrible for artists, it's a spit in the face of Miyazaki, who publicly disavowed image generation way back when.
A lot of people hate it, and image generation in general. 

It is good, however, for AGI timelines.

It is good because of this:

https://x.com/sama/status/1905296867145154688

And this:

https://x.com/sama/status/1906771292390666325

And this:

https://x.com/sama/status/1907098207467032632

Why, you might ask? Isn't all this making people and investors "feel the AGI" and pour more money into it? Make more money for OpenAI and...

I agree that this could slow OpenAI's frontier development, but I don't think it'll move overall timelines by years. OpenAI is currently building out datacenters, so I expect a delay of at most one year (although this might change who reaches capabilities breakthroughs first).

I just don't see why an OpenAI slowdown would affect overall industry timelines that substantially. It might reduce pressure on Anthropic to ship, but I don't expect it to stall their internal development much.

A key step in the classic argument for AI doom is instrumental convergence: the idea that agents with many different goals will end up pursuing the same few subgoals, which includes things like "gain as much power as possible".

If it wasn't for instrumental convergence, you might think that only AIs with very specific goals would try to take over the world. But instrumental convergence says it's the other way around: only AIs with very specific goals will refrain from taking over the world.

For pure consequentialists—agents that have an outcome they want to bring about, and do whatever they think will cause it—some version of instrumental convergence seems surely true[1].

But what if we get AIs that aren't pure consequentialists, for example because they're ultimately motivated by virtues? Do...

Davidmanheim
Saying AI won't be more efficient is obviously falsified for narrow tasks like adding numbers. And for general tasks like writing short stories, as LLMs currently do them: the brain runs on about 20 W, so an hour of thought costs about 20 Wh, and that much energy buys about 30k tokens from GPT-4o, i.e. the task is done far more efficiently than by a human. More generally, the argument that AI can't be more efficient than the brain seems to follow exactly the same structure as the claim that AI can't be smarter than humans, or the impossibility result here. You should read the comments to that post.
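Spelling out the arithmetic implicit in that comparison (a sketch; the 30k-tokens-per-20-Wh figure is the comment's assumption, not a measured value):

$$20\,\mathrm{W} \times 1\,\mathrm{h} = 20\,\mathrm{Wh}, \qquad \frac{20\,\mathrm{Wh}}{30{,}000\ \mathrm{tokens}} \approx 0.67\,\mathrm{mWh\ per\ token},$$

i.e. roughly 0.67 Wh per thousand tokens, so on these numbers an hour's worth of brain energy covers the stated 30k tokens of output.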
StanislavKrym
The AI is also much less efficient at other tasks, like the example of Claude playing Pokémon or the ones tested by ARC-AGI. I wonder how hard it will be to perform tasks necessary in the energy industry using an as-cheap-as-possible AI, if the current model o3 faces problems requiring thousands of kWh per task in high-compute mode. In 2023 the world generated only about 30 trillion kWh of electricity. But this is rather off-topic. What can be said about AI violating taboos?

P.S. Neural networks, whether human brains or AIs, learn from data. A human is unlikely to read more than 240 words a minute; devoting 8 hours a day to reading, a human won't have read more than about 5 billion words by age 100 (240 words/min × 480 min/day × 365 days × 100 years ≈ 4.2 billion).
Davidmanheim
I think this is confused about how virtue ethics works. Virtue ethics is centered on the virtues of the moral agent, but it certainly does not say not to predict the consequences of actions. In fact, one aspect of virtue in the Aristotelian system is "practical wisdom," i.e. intelligence, which is critical for navigating choices, because practical wisdom includes an understanding of what consequences will follow from actions. It's more accurate to say that intelligence is channeled differently — not toward optimizing outcomes, but toward choosing in a way consistent with one's virtues. And even if virtues are thought of as policies, as in the "loyal friend" example, the policies for being a good friend require interpretation and context-sensitive application. Intelligence is crucial for that.

I didn't claim virtue ethics says not to predict consequences of actions. I said that a virtue is more like a procedure than it is like a utility function. A procedure can include a subroutine predicting the consequences of actions and it doesn't become any more of a utility function by that.

The notion that "intelligence is channeled differently" under virtue ethics requires some sort of rule, like the consequentialist argmax or Bayes, for converting intelligence into ways of choosing.


I think rationalists should consider taking more showers.

As Eliezer Yudkowsky once said, boredom makes us human. The childhoods of exceptional people often include excessive boredom as a trait that helped cultivate their genius:

A common theme in the biographies is that the area of study which would eventually give them fame came to them almost like a wild hallucination induced by overdosing on boredom. They would be overcome by an obsession arising from within.

Unfortunately, most people don't like boredom, and we now have little metal boxes and big metal boxes filled with bright displays that help distract us all the time, but there is still an effective way to induce boredom in a modern population: showering.

When you shower (or bathe, that also works), you usually are cut off...

avturchin
[Deleted]
amitlevy49
This is an infohazard, please delete
Neil
Can confirm. Half the LessWrong posts I've read in my life were read in the shower.

Infra-Bayesian physicalism (IBP) is a mathematical formalization of computationalist metaphysics: the view that whether something exists is a matter of whether a particular computation is instantiated anywhere in the universe in any way. In this post, we cover the basics of IBP, both rigorously and informally, its relevance to agent foundations and the alignment problem, and present new definitions that remove the biggest limitation in the previous formulation of IBP: the monotonicity requirement. Having read the original post on IBP is not required to understand this post, though prior familiarity with infra-Bayesianism is useful for understanding the technical parts.

Why Physicalism?

A physicalist agent - that is, an agent that uses IBP - does not regard itself as inherently special. This is in contrast to a Cartesian agent. A...

I notice that although the loot box is gone, the unusually strong votes that people made yesterday persist.

Richard Korzekwa
My weak downvotes are +1 and my strong downvotes are -9. Upvotes are all positive.
Zach Stein-Perlman
My strong upvotes are giving +61 :shrug:
lc
I have Become Stronger