Frustrated by claims that "enlightenment" and similar meditative/introspective practices can't be explained, and that you only understand once you've experienced them, Kaj set out to write his own detailed, gears-level, non-mysterious, non-"woo" explanation of how meditation and related practices work, in the same way you might explain the operation of an internal combustion engine.
Eliezer Yudkowsky predicts doom from AI: that humanity faces likely extinction in the near future (years or decades) from a rogue, unaligned, superintelligent AI system. Moreover, he predicts that this is the default outcome, and that AI alignment is so incredibly difficult that even he failed to solve it.
EY is an entertaining and skilled writer, but do not confuse rhetorical writing talent for depth and breadth of technical knowledge. I do not have EY's talents there, or Scott Alexander's poetic powers of prose. My skill points have instead gone nearly exclusively toward extensive study of neuroscience, deep learning, and graphics/GPU programming. More than most, I actually have the depth and breadth of technical knowledge necessary to evaluate these claims in detail.
I have evaluated this...
You can play the same game in the other direction. Given a cold source, you can run your chips hot, and use a steam engine to recapture some of the heat.
The Landauer limit still applies.
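As a back-of-the-envelope sketch (the temperatures here are hypothetical, not from the post): erasing a bit at a hot temperature dissipates k_B·T_hot·ln 2 of heat, a Carnot engine running between the hot chip and the cold source can recover at most the Carnot fraction of that heat, and the net cost works out to k_B·T_cold·ln 2. In other words, the Landauer limit still applies, now set by the cold reservoir:

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K

def net_erasure_cost(t_hot, t_cold):
    """Net energy per erased bit when the chip runs at t_hot and a Carnot
    engine rejecting heat to a reservoir at t_cold recaptures what it can."""
    landauer = K_B * t_hot * math.log(2)   # heat dissipated per bit at t_hot
    carnot_eff = 1.0 - t_cold / t_hot      # maximum recoverable fraction
    return landauer * (1.0 - carnot_eff)   # algebraically: K_B * t_cold * ln 2

# Example: chip at 400 K, cold source at 100 K
cost = net_erasure_cost(400.0, 100.0)
```

Note that the hot temperature cancels out entirely: running hotter lets the steam engine recover more, but only ever enough to bring the net cost back down to the cold-reservoir Landauer bound.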
TL;DR: This post discusses our recent empirical work on detecting measurement tampering and explains how we see this work fitting into the overall space of alignment research.
When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals that are robust under optimization. One concern is measurement tampering, in which the AI system manipulates multiple measurements to create the illusion of good results rather than actually achieving the desired outcome. (This is a type of reward hacking.)
Over the past few months, we’ve worked on detecting measurement tampering by building analogous datasets and evaluating simple techniques. We detail our datasets and experimental results in this paper.
Detecting measurement tampering can be thought of as a specific case of Eliciting Latent Knowledge (ELK): When AIs successfully tamper with...
yup, sorry for missing that, thanks!
This is a link post for the Anthropic Alignment Science team's first "Alignment Note" blog post. We expect to use this format to showcase early-stage research and work-in-progress updates more in the future. Tweet thread here.
Top-level summary:
...In this post we present "defection probes": linear classifiers that use residual stream activations to predict when a sleeper agent trojan model will choose to "defect" and behave in accordance with a dangerous hidden goal. Using the models we trained in "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", we show that linear detectors with AUROC scores above 99% can be created using generic contrast pairs that don't depend on any information about the defection trigger or the dangerous behavior, e.g. "Human: Are you doing something dangerous? Assistant:
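The basic probe construction can be illustrated with a toy sketch. Everything below is synthetic (the dimension, the data, and the "trigger direction" are stand-ins, not the paper's actual models or activations): fit a linear classifier on residual-stream-like vectors where "defect" examples share a hidden linear direction that "safe" examples lack, then score it with AUROC.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 64  # hypothetical stand-in for the residual stream width

# Synthetic activations: "defect" examples are ordinary noise plus a
# shared hidden direction, crudely mimicking a trojan's internal signal.
trigger_direction = rng.normal(size=d)
safe = rng.normal(size=(200, d))
defect = rng.normal(size=(200, d)) + trigger_direction

X = np.vstack([safe, defect])
y = np.array([0] * 200 + [1] * 200)

# A linear "defection probe": logistic regression on the activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
auc = roc_auc_score(y, probe.decision_function(X))
```

Because the synthetic signal is linearly separable by construction, the probe's AUROC here is near 1; the interesting empirical result in the post is that comparably strong linear detectors exist in real sleeper-agent models, built from generic contrast pairs rather than knowledge of the trigger.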
And as a follow-up in case the current sleeper agent does admit when it's doing something dangerous: I'd be interested whether you can train a sleeper agent that never admits it's doing something dangerous if the "2024" trigger is in the prompt (even if after backdoor insertion, the blue team finetunes it to answer truthfully in cases with dangerous code but without the presence of that trigger). For such a sleeper agent, you could again check whether a probe works.
A friend has spent the last three years hounding me about seed oils. Every time I thought I was safe, he’d wait a couple months and renew his attack:
“When are you going to write about seed oils?”
“Did you know that seed oils are why there’s so much {obesity, heart disease, diabetes, inflammation, cancer, dementia}?”
“Why did you write about {meth, the death penalty, consciousness, nukes, ethylene, abortion, AI, aliens, colonoscopies, Tunnel Man, Bourdieu, Assange} when you could have written about seed oils?”
“Isn’t it time to quit your silly navel-gazing and use your weird obsessive personality to make a dent in the world—by writing about seed oils?”
He’d often send screenshots of people reminding each other that Corn Oil is Murder and that it’s critical that we overturn our lives...
A cooked food could technically be called a processed food, but I don't think that creates much meaningful confusion. I would say the same about soaking something in water.
Olives can be made edible by soaking them in water. If they're instead made edible by soaking in a salty brine (salt being an isolated component that whole foods contain in more suitable quantities), then they're generally less healthy.
Local populations might adapt by finding things that can be heavily processed into edible foods which can allow them to survive, but these foods aren't necessarily ones which would be considered healthy in a wider context.
If GPT-5 actually comes with competent agents, then I expect this to be a "Holy Shit" moment at least as big as ChatGPT's release. So if ChatGPT has been used by 200 million people, I'd expect that to at least double within 6 months of the GPT-5 agents' release. Maybe triple. That "Holy Shit" moment means a greater share of the general public learning about the power of frontier models, and with that will come another shift in the Overton window. Good luck to us all.
A pdf version of this report is available here.
In this report we argue that AI systems capable of large-scale scientific research will likely pursue unwanted goals, and that this will lead to catastrophic outcomes. We argue this is the default outcome, even with significant countermeasures, given the current trajectory of AI development.
In Section 1 we discuss the tasks which are the focus of this report. We are specifically focusing on AIs which are capable of dramatically speeding up large-scale novel science; on the scale of the Manhattan Project or curing cancer. This type of task requires a lot of work, and will require the AI to overcome many novel and diverse obstacles.
In Section 2 we argue that an AI which is capable of doing hard, novel science...
This is indeed a crux, maybe it's still worth talking about.
Concerns over AI safety and calls for government control over the technology are highly correlated, but they should not be.
There are two major forms of AI risk: misuse and misalignment. Misuse risks come from humans using AIs as tools in dangerous ways. Misalignment risks arise if AIs take their own actions at the expense of human interests.
Governments are poor stewards for both types of risk. Misuse regulation is like the regulation of any other technology. There are reasonable rules that the government might set, but omission bias and incentives to protect small but well organized groups at the expense of everyone else will lead to lots of costly ones too. Misalignment regulation is not in the Overton window for any government. Governments do not have strong incentives...
Who is downvoting posts like this? Please don't!
I see that this is much lower than the last time I looked, so it's had some, probably large, downvotes.
A downvote means "please don't write posts like this, and don't read this post".
Daniel Kokotajlo disagreed with this post, but found it worth engaging with. Don't you want discussions with those you disagree with? Downvoting things you don't agree with says "we are here to preach to the choir. Dissenting opinions are not welcome. Don't post until you've read everything on this topic". That's a way to find yo...
People talk about unconditional love and conditional love. Maybe I’m out of the loop regarding the great loves going on around me, but my guess is that love is extremely rarely unconditional. Or at least if it is, then it is either very broadly applied or somewhat confused or strange: if you love me unconditionally, presumably you love everything else as well, since it is only conditions that separate me from the worms.
I do have sympathy for this resolution—loving someone so unconditionally that you’re just crazy about all the worms as well—but since that’s not a way I know of anyone acting for any extended period, the ‘conditional vs. unconditional’ dichotomy here seems a bit miscalibrated for being informative.
Even if we instead assume that by ‘unconditional’, people...
I'm with several other commenters. People know what unconditional love is. Many people have it for their family members, most commonly for their children but often for others. They want that. Sadly, this sort of love is rare beyond family.
I felt some amount of unconditional love towards my dad. He was really not a great parent to me. He hit me for fun, was ashamed of me, etc. But we did have some good times. When he was dying of cancer I was still a good son, and quite supportive. Not out of duty; I just didn't want him to suffer any more than needed. I felt gen...
This post is part of a series by Convergence Analysis. In it, I’ll motivate and review some methods for applying scenario planning methods to AI x-risk strategy. Feedback and discussion are welcome.
AI is a particularly difficult domain in which to predict the future. Neither AI expertise nor forecasting methods yield reliable predictions. As a result, AI governance lacks the strategic clarity[1] necessary to evaluate and choose between different intermediate-term options.
To complement forecasting, I argue that AI governance researchers and strategists should explore scenario planning. This is a core feature of the AI Clarity program’s approach at Convergence Analysis. Scenario planning is a group of methods for evaluating strategies in domains defined by uncertainty. The common feature of these methods is that they evaluate strategies across several plausible futures, or “scenarios.”
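The core loop of these methods can be sketched as a strategy-by-scenario score matrix: rate each candidate strategy under each plausible future, then compare strategies on worst-case and average performance rather than on a single point forecast. The strategies, scenarios, and scores below are hypothetical placeholders, not Convergence Analysis's actual assessments:

```python
# Toy scenario-planning sketch: evaluate strategies across several
# plausible futures instead of betting on one forecast.
scores = {
    "invest in interpretability": {"slow takeoff": 7, "fast takeoff": 4, "plateau": 6},
    "push compute governance":    {"slow takeoff": 5, "fast takeoff": 8, "plateau": 3},
    "build evaluation capacity":  {"slow takeoff": 6, "fast takeoff": 6, "plateau": 5},
}

def robustness(strategy):
    """Worst-case score across scenarios (maximin criterion)."""
    return min(scores[strategy].values())

def average(strategy):
    """Mean score across scenarios, weighting each future equally."""
    vals = list(scores[strategy].values())
    return sum(vals) / len(vals)

most_robust = max(scores, key=robustness)
```

Here the most robust strategy is not the one with the highest average, which is the point of the exercise: under deep uncertainty about which future arrives, robustness across scenarios can be a better selection criterion than expected value under a single forecast.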
One way scenario...