Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a special post for short-form writing by David Scott Krueger (formerly: capybaralet). Only they can create top-level comments. Comments here also appear on the Shortform Page and All Posts page.


Wow this is a lot better than my FB/Twitter feed :P

:D :D :D

Let's do this guys! This is the new FB :P

:D Glad to hear that! 

I have the intention to convert a number of draft LW blog posts into short-forms.

Then I will write a LW post linking to all of them and asking people to request that I elaborate on any that they are particularly interested in.

I've been building up drafts for a looooong time......

It seems like a lot of people are still thinking of alignment as too binary, which leads to critical errors in thinking like: "there will be sufficient economic incentives to solve alignment", and "once alignment is a bottleneck, nobody will want to deploy unaligned systems, since such a system won't actually do what they want".

It seems clear to me that:

1) These statements are true for a certain level of alignment, which I've called "approximate value learning" in the past (I think I might have also referred to it as "pretty good alignment" or "good enough alignment" at various times).

2) This level of alignment is suboptimal from the point of view of x-safety, since the downside risk of extinction for the actors deploying the system is less than the downside risk of extinction summed over all humans.

3) We will develop techniques for "good enough" alignment before we develop techniques that are acceptable from the standpoint of x-safety.

4) Therefore, the expected outcome is: once "good enough alignment" is developed, a lot of actors deploy systems that are aligned enough for them to benefit from them, but still carry an unacceptably high level of x-risk.

5) Thus if we don't improve alignment techniques quickly enough after developing "good enough alignment", its development will likely lead to a period of increased x-risk (under the "alignment bottleneck" model).
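
As a toy illustration of points 2) and 4), here is a minimal sketch of the gap between the deploying actor's expected payoff and the payoff summed over everyone. Every number and name below is an assumption made up for the example, and for simplicity the benefit is counted only once in both cases.

```python
def expected_value(benefit, p_catastrophe, loss):
    """Expected payoff of deploying: benefit minus probability-weighted loss."""
    return benefit - p_catastrophe * loss


# Every number below is invented purely for illustration.
BENEFIT = 100.0                 # gain from deploying a "good enough" system
P_CATASTROPHE = 0.01            # chance that deployment ends in extinction
LOSS_TO_ACTOR = 1_000.0         # downside as felt by the deploying actor alone
LOSS_TO_EVERYONE = 1_000_000.0  # the same downside summed over all humans

private_ev = expected_value(BENEFIT, P_CATASTROPHE, LOSS_TO_ACTOR)
social_ev = expected_value(BENEFIT, P_CATASTROPHE, LOSS_TO_EVERYONE)

# With these numbers the actor sees a positive expected payoff,
# while the payoff summed over everyone is negative.
print(private_ev > 0, social_ev < 0)
```

The same system can look worth deploying to the actor and not worth deploying from the standpoint of x-safety; the only thing that differs between the two calls is whose loss is counted.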

Treacherous turns don't necessarily happen all at once. An AI system can start covertly recruiting resources outside its intended purview in preparation for a more overt power grab.

This can happen during training, without a deliberate "deployment" event. Once the AI has started recruiting resources, it can outperform AI systems that haven't done that on-distribution with resources left over which it can devote to pursuing its true objective or instrumental goals.

My pet "(AI) policy" idea for a while has been "direct recourse", which is the idea that you can hedge against one party precipitating an irreversible event by giving other parties the ability to disrupt their operations at will.
For instance, I could shut down my competitors' AI project if I think it's an X-risk.
The idea is that I would have to compensate you if I was later deemed to have done this for an illegitimate reason.
If shutting down your AI project is not irreversible, then this system increases our ability to prevent irreversible events, since I might stop some existential catastrophe, and if I shut down your project when I shouldn't, then I just compensate you and we're all good.

Suggestion for authors here: don't use conclusive titles for posts that make speculative arguments.

"No Free Lunch" (NFL) results in machine learning (ML) basically say that success all comes down to having a good prior.

So we know that we need a sufficiently good prior in order to succeed.

But we don't know what "sufficiently good" means.

e.g. I've heard speculation that maybe we can use 2^-MDL in any widely used Turing-complete programming language (e.g. Python) for our prior, and that will give enough information about our particular physics for something AIXI-like to become superintelligent e.g. within our lifetime.

Or maybe we can't get anywhere without a much better prior.

DOES ANYONE KNOW of any work/(intelligent thoughts) on this?

Although it's not framed this way, I think much of the disagreement about timelines/scaling-hypothesis/deep-learning in the ML community basically comes down to this...
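
To make the "2^-MDL" prior mentioned above concrete, here is a minimal sketch. It assumes that the length of a program's source text in bits is an acceptable stand-in for its description length (the true minimum description length is at best upper-bounded by this), and the function name and the two candidate programs are hypothetical.

```python
def description_length_prior(programs):
    """Weight each program by 2^-(source length in bits), then normalise."""
    bits = {src: 8 * len(src.encode("utf-8")) for src in programs}
    shortest = min(bits.values())
    # Shift by the shortest length before exponentiating to avoid underflow;
    # the shift cancels out when the weights are normalised.
    weights = {src: 2.0 ** -(bits[src] - shortest) for src in programs}
    total = sum(weights.values())
    return {src: w / total for src, w in weights.items()}


# Two hypothetical candidate "explanations" of the sequence 0, 1, 2, ...
candidates = [
    "lambda n: n",
    "lambda n: n if n < 1000 else 0",
]
prior = description_length_prior(candidates)
```

Under this prior the shorter program gets almost all of the mass. Whether a prior this generic carries enough information about our particular physics is exactly the open question.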

I'm frustrated with the meme that "mesa-optimization/pseudo-alignment is a robustness (i.e. OOD) problem". IIUC, this is definitionally true in the mesa-optimization paper, but I think this misses the point.

In particular, this seems to exclude an important (maybe the most important) threat model: the AI understands how to appear aligned, and does so, while covertly pursuing its own objective on-distribution, during training.

This is exactly how I imagine a treacherous turn from a boxed superintelligent AI agent to occur, for instance. It secretly begins breaking out of the box (e.g. via manipulating humans) and we don't notice until it's too late.

> the AI understands how to appear aligned, and does so, while covertly pursuing its own objective on-distribution, during training.

Sure, but the fact that it defects in deployment and not in training is a consequence of distributional shift, specifically the shift from a situation where it can't break out of the box to a situation where it can.

No, I'm talking about it breaking out during training. The only "shifts" here are:

1) the AI gets smarter

2) (perhaps) the AI covertly influences its external environment (i.e. breaks out of the box a bit).

We can imagine scenarios where it's only (1) and not (2). I find them a bit more far-fetched, but this is the classic vision of the treacherous turn... the AI makes a plan, and then suddenly executes it to attain DSA. Once it starts to execute, ofc there is distributional shift, but:

A) it is auto-induced distributional shift

B) the developers never decided to deploy

As alignment techniques improve, they'll get good enough to solve new tasks before they get good enough to do so safely. This is a source of x-risk.

Regarding the "Safety/Alignment vs. Capabilities" meme: it seems like people are sometimes using "capabilities" to mean 2 different things:

1) "intelligence" or "optimization power"... i.e. the ability to optimize some objective function

2) "usefulness": the ability to do economically valuable tasks or things that people consider useful

I think it is meant to refer to (1).

Alignment is likely to be a bottleneck for (2).

For a given task, we can expect 3 stages of progress:

i) sufficient capabilities(1) to perform the task

ii) sufficient alignment to perform the task unsafely

iii) sufficient alignment to perform the task safely

Between (i) and (ii) we can expect a "capabilities(1) overhang". When we go from (i) to (ii) we will see unsafe AI systems deployed and a potentially discontinuous jump in their ability to do the task.

LessWrong and the Alignment Forum are great and all, but... if you are interested in technical AI safety, you should also learn about AI from other sources, e.g. by reading workshop and conference proceedings, looking into different research groups in academia, etc.


I find the argument that 'predicting data generated by agents (e.g. language modeling) will lead a model to learn / become an agent' much weaker than I used to find it.

This is because I think it only goes through cleanly if the task uses the same input and output as the agent.  This is emphatically not the case for (e.g.) GPT-3.

For all of the hubbub about trying to elaborate better arguments for AI x-risk, it seems like a lot of people are describing the arguments in Superintelligence as relying on FOOM, agenty AI systems, etc. without actually justifying that description via references to the text.

It's been a while since I read Superintelligence, but my memory is that it anticipated a lot of counter-arguments quite well.  I'm not convinced that it requires such strong premises to make a compelling case.  So maybe someone interested in this project of clarifying the arguments should start by establishing that the arguments in Superintelligence really have the weaknesses they are claimed to have?

Moloch is not about coordination failures.  Moloch is about the triumph of instrumental goals.  Maybe we can defeat Moloch with sufficiently good coordination.  It's worth a shot at least.

A lot of the discussion of mesa-optimization seems confused.

One thing that might help clear up the confusion is just to remember that "learning" and "inference" should not be thought of as cleanly separated in the first place; see, e.g., AIXI...

So when we ask "is it learning, or just solving the task without learning?", this seems like a confused framing to me. Suppose your ML system learned an excellent prior, and then just did Bayesian inference at test time. Is that learning? Sure, why not. It might not use a traditional search/optimization algorithm, but it probably has to do *something* like that for computational reasons if it wants to do efficient approximate Bayesian inference over a large hypothesis space, so...
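
The "excellent prior plus Bayesian inference at test time" picture above can be sketched in a few lines. This is a toy under stated assumptions: the hypothesis space is three coin biases, the uniform `learned_prior` merely stands in for whatever training produced, and all names are hypothetical.

```python
def bayes_update(prior, likelihood, observations):
    """Return the posterior over hypotheses after seeing the observations."""
    posterior = dict(prior)
    for obs in observations:
        # Multiply each hypothesis' probability by how well it predicted obs.
        posterior = {h: p * likelihood(h, obs) for h, p in posterior.items()}
        total = sum(posterior.values())
        posterior = {h: p / total for h, p in posterior.items()}
    return posterior


def coin_likelihood(bias, flip):
    """Probability of a flip (1 = heads, 0 = tails) under a given bias."""
    return bias if flip == 1 else 1.0 - bias


# Stand-in for a prior "learned" during training: three coin biases.
learned_prior = {0.1: 1 / 3, 0.5: 1 / 3, 0.9: 1 / 3}

# At test time nothing about the prior is changed; only the posterior moves.
posterior = bayes_update(learned_prior, coin_likelihood, [1, 1, 1])
best_guess = max(posterior, key=posterior.get)
```

No parameters are updated here, yet after three heads the system has shifted most of its belief to the heads-biased coin, which is hard to describe as anything other than learning. With a large hypothesis space the exact update becomes intractable, and that is where something search-like would have to come in.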

I like "tell culture" and find myself leaning towards it more often these days, but e.g. as I'm composing an email, I'll find myself worrying that the recipient will just interpret a statement like: "I'm curious about X" as a somewhat passive request for information about X (which it sort of is, but also I really don't want it to come across that way...)

Anyone have thoughts/suggestions?

Cultures depend on shared assumptions of trust, and indeed, if they don't share your assumptions, you can't just unilaterally declare a culture. (I think the short answer is "unless you want to onboard someone else into your culture, you probably can't just do the sort of thing you want to do.")

I recommend checking out Reveal Culture, which tackles some of this.

(You can manually specify "I'm curious about X [I don't mean to be asking you about it, just mentioning that I'm curious about it, no pressure if you don't want to go into it.]". But, that is indeed a clunkier statement, and probably defeats the point of you being able to casually mention it in the first place.)

I am somewhat curious what you're hoping to get out of being able to say things like "I'm curious about X" if it's not intended as a passive request. I think the answers here of how to communicate across cultures will depend a lot on what specific thing you're trying to communicate and why and how (and then covering that with a variety of patches, which are specific to the topic in question)

> But, that is indeed a clunkier statement

I once heard someone say, "I'm curious about X, but only want to ask you about it if you want to talk about it" and thought that seemed very skillful.

It might be a passive request, I'm not actually sure... I'd think of it more like an invitation, which you are free to decline. Although OFC, declining an invitation does send a message whether you like it or not *shrug*.

> But, that is indeed a clunkier statement, and probably defeats the point of you being able to casually mention it in the first place.)

Also like, if you're in something like guess culture, and someone tells you "I'm just telling you this with no expectation," they will still be trying to guess what you may want from that.

Be brave. Get clear on your own intentions. Feel out their comfort level with talking about X first. 

I guess one problem here is that how someone responds to such a statement carries information about how much they respect you...

If someone you are honored to even get the time of day from writes that, you will almost certainly craft a strong response about X...

Organizations that are looking for ML talent (e.g. to mentor more junior people, or get feedback on policy) should offer PhD students high-paying contractor/part-time work.

ML PhD students working on safety-relevant projects should be able to augment their meager stipends this way.

I'm most active on Twitter these days; please follow me there!

I also have a website now:

As an academic, I typically find LW/SF posts to be too "pedagogic" and not skimmable enough.  This limits how much I read them.  Academic papers are, on average, much easier to extract a TL;DR from.  

Being pedagogic has advantages, but it can be annoying if you are already familiar with much of the background and just want to skip to the (purportedly) novel bits.

Pedagogic posts are more accessible, and a large portion of the point of publishing on LW is to present technical ideas to a wide audience. While the audience here is intelligent, they also come from a wide variety of domains, so accessibility is key to successfully writing a good LW post (with some exceptions).

Do you have a proposal for how to increase skimmability without sacrificing accessibility?

Maybe some, but I think that's a bit beside the point...
I agree there's a genuine trade-off, but my post was mostly about AF.
I'm mostly in LW/AF for AI Alignment content, and I think these posts should strive to be a bit closer to academic style.

A few quick thoughts:
- include abstracts
- say whether a post is meant to be pedagogic or not
- say "you can skip this section if"
- follow something more like the format of an academic paper
- include a figure towards the top that should summarize the idea for someone with sufficient background with a caption like "a summary of [idea]: description / explanation"

Sounds like a fair point. I'll try to add that to my posts in the future. ;)

I agree that AI alignment posts don't need to aim for accessibility to the same degree as the typical LW post (this was what I was mainly referring to when I edited in "with some exceptions"), but you did name-check LW in your top-level post, and I don't think it's beside the point for the typical LW post.

I think your suggestions are good and reasonable.

We learned about RICE as a treatment for injuries (e.g. sprains) in middle school, and it has since struck me as odd that you would want to inhibit the body's natural healing response.

It seems like RICE is being questioned by medical professionals, as well, but consensus is far off.

Anyone have thoughts/knowledge about this?

Are there people in the AI alignment / x-safety community who are still major "Deep Learning skeptics" (in terms of capabilities)?  I know Stuart Russell is... who else?

IMO, the outer alignment problem is still the biggest problem in (technical) AI Alignment.  We don't know how to write down -- or learn -- good specifications, and people making strong AIs that optimize for proxies is still what's most likely to get us all killed.

Some possible implications of more powerful AI/technology for privacy:

1) It's as if all of your logged data gets pored over by a team of super-detectives to make informed guesses about every aspect of your life, even those that seem completely unrelated to those kinds of data.

2) Even data that you try to hide can be recovered, e.g. by reverse engineering what you type from the sound of your typing.

3) Powerful actors will deploy advanced systems to model, predict, and influence your behavior, and extreme privacy precautions starting now may be warranted.

4) On the other hand, if you DON'T have a significant digital footprint, you may be significantly less trustworthy.  If AI systems don't know what to make of you, you may be the first up against the wall (compare with seeking credit without having a credit history).

5) On the other other hand ("on the foot"?), if you trust that future societies will be more enlightened, then you may be retroactively rewarded for being more enlightened today.

Anything important I left out?

Whelp... that's scary:

Chip Huyen:

> 4. You won’t need to update your models as much
>
> One mindboggling fact about DevOps: Etsy deploys 50 times/day. Netflix 1000s times/day. AWS every 11.7 seconds. MLOps isn’t an exemption. For online ML systems, you want to update them as fast as humanly possible. (5/6)

What part is scary?  I think they're missing out on the sheer variety of model usage - probably as variable as software deployments.  But I don't think there's anything particularly scary about any given point on the curve.

Some really do get built, validated, and deployed twice a year.  Some have CI pipelines that re-train with new data and re-validate every few minutes.  Some are self-updating, and re-sync to a clean state periodically.  Some are running continuous a/b tests of many candidate models, picking the best-performer for a customer segment every few minutes, and adding/removing models from the pool many times per day.

What's our backup plan if the internet *really* goes to shit?

E.g. Google search seems to have suddenly gotten way worse for searching for machine learning papers for me in the last month or so. I'd gotten used to it being great, and don't have a good backup.

A friend asked me what EAs think of

Here's my response (based on ~1 minute of Googling):

He seems to have what I call a "moral purity" attitude towards morality.
By this I mean, thinking of ethics as less consequentialist and more about "being a good person".

I think such an attitude is natural, very typical and not very EA. So, e.g. living frugally might or might not be EA, but definitely makes sense if you believe we have strong charitable obligations and have a moral purity attitude towards morality.

Moving away from moral purity and giving consequentialist arguments against it is maybe one of the main insights of EA vs. Peter Singer.

Moloch is not about coordination failures.

Moloch is about the triumph of instrumental goals.

Coordination *might* save us from that. Or not. "it is too soon to say"

Working a lot is an instrumental goal. If you start tracking your time, and optimizing that metric, you might end up working more than optimal. That seems like a triumph of instrumental goals that isn't a coordination failure. I wouldn't assign this failure to Moloch. Thoughts?

I basically agree, but I do assign it to Moloch. *shrug*
