Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

Ω 221d

I want to draw attention to a new paper, written by myself, David "davidad" Dalrymple, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, and Joshua Tenenbaum.

In this paper we introduce the concept of "guaranteed safe (GS) AI", which is a broad research strategy for obtaining safe AI systems with provable quantitative safety guarantees. Moreover, with a sufficient push, this strategy could plausibly be implemented on a moderately short time scale. The key components of GS AI are:

A formal safety specification that mathematically describes what effects or behaviors are considered safe or acceptable.
A world model that provides a mathematical description of the environment of the AI system.
A verifier

...

(See More – 568 more words)

Vinayak Pathak12m10

I read the paper, and overall it's an interesting framework. One thing I am somewhat unconvinced about (likely because I have misunderstood something) is its utility despite the dependence on the world model. If we prove guarantees assuming a world model, but don't know what happens if the real world deviates from the world model, then we have a problem. Ideally perhaps we want a guarantee akin to what's proved in learning theory, for example, that the accuracy will be small for any data distribution as long as the distribution remains the same during trai... (read more)

"If we go extinct due to misaligned AI, at least nature will continue, right? ... right?"

plex

11h

This is a linkpost for https://aisafety.info/questions/NM1Y/If-we-go-extinct-due-to-misaligned-AI,-at-least-nature-will-continue,-right

[memetic status: stating directly despite it being a clear consequence of core AI risk knowledge because many people have "but nature will survive us" antibodies to other classes of doom and misapply them here.]

Unfortunately, no.^[1]

Technically, “Nature”, meaning the fundamental physical laws, will continue. However, people usually mean forests, oceans, fungi, bacteria, and generally biological life when they say “nature”, and those would not have much chance competing against a misaligned superintelligence for resources like sunlight and atoms, which are useful to both biological and artificial systems.

There’s a thought that comforts many people when they imagine humanity going extinct due to a nuclear catastrophe or runaway global warming: Once the mushroom clouds or CO2 levels have settled, nature will reclaim the cities. Maybe mankind in our hubris will have wounded Mother Earth and paid the price ourselves, but...

(See More – 359 more words)

ryan_greenblatt13m40

I think literal extinction is unlikely even conditional on misaligned AI takeover due to:

The potential for the AI to be at least a tiny bit "kind" (same as humans probably wouldn't kill all aliens)
Decision theory/trade reasons

This is discussed in more detail here and here.

Insofar as humans and/or aliens care about nature, similar arguments apply there too, though this is mostly beside the point given that if humans survive and have resources they can preserve some natural easily.

I find it annoying how confident this article is without really bother to... (read more)

3GoteNoSente1h

It is not at all clear to me that most of the atoms in a planet could be harnessed for technological structures, or that doing so would be energy efficient. Most of the mass of an earthlike planet is iron, oxygen, silicon and magnesium, and while useful things can be made out of these elements, I would strongly worry that other elements that are needed also in those useful things will run out long before the planet has been disassembled. By historical precedent, I would think that an AI civilization on Earth will ultimately be able to use only a tiny fraction of the material in the planet, similarly to how only a very small fraction of a percent of the carbon in the planet is being used by the biosphere, in spite of biological evolution having optimized organisms for billions of years towards using all resources available for life. The scenario of a swarm of intelligent drones eating up a galaxy and blotting out its stars I think can empirically be dismissed as very unlikely, because it would be visible over intergalactic distances. Unless we are the only civilization in the observable universe in the present epoch, we would see galaxies with dark spots or very strangely altered spectra somewhere. So this isn't happening anywhere. There are probably some historical analogs for the scenario of a complete takeover, but they are very far in the past, and have had more complex outcomes than intelligent grey goo scenarios normally portray. One instance I can think of is the Great Oxygenation Event. I imagine an observer back then might have envisioned that the end result of the evolution of cyanobacteria doing oxygenic photosynthesis would be the oceans and lakes and rivers all being filled with green slime, with a toxic oxygen atmosphere killing off all other life. While indeed this prognosis would have been true to a first order approximation - green plants do dominate life on Earth today - the reality of what happened is infinitely more complex than this crude pictu

2Dagon3h

If it's possible for super-intelligent AI to be non-sentient, wouldn't it be possible for insects to evolve non-sentient intelligence as well? I guess I didn't assume "non-sentient" in the definition of "unaligned".

7gilch7h

Yep. It would take a peculiar near-miss for an unfriendly AI to preserve Nature, but not humanity. Seemed obvious enough to me. Plants and animals are made of atoms it can use for something else. By the way, I expect the rapidly expanding sphere of Darkness engulfing the Galaxy to happen even if things go well. The stars are enormous repositories of natural resources that happen to be on fire. We should put them out so they don't go to waste.

Some "meta-cruxes" for AI x-risk debates

Aryeh Englander

21m

[Epistemic status: As I say below, I've been thinking about this topic for several years and I've worked on it as part of my PhD research. But none of this is based on any rigorous methodology, just my own impressions from reading the literature.]

I've been thinking about possible cruxes in AI x-risk debates for several years now. I was even doing that as part of my PhD research, although my PhD is currently on pause because my grant ran out. In particular, I often wonder about "meta-cruxes" - i.e., cruxes related to debates or uncertainties that are more about different epistemological or decision-making approaches rather than about more object-level arguments.

The following are some of my current top candidates for "meta-cruxes" related to AI x-risk debates. There are...

(See More – 679 more words)

Fund me please - I Work so Hard that my Feet start Bleeding and I Need to Infiltrate University

Johannes C. Mayer

Bleeding Feet and Dedication

During AI Safety Camp (AISC) 2024, I was working with somebody on how to use binary search to approximate a hull that would contain a set of points, only to knock a glass off of my table. It splintered into a thousand pieces all over my floor.

A normal person might stop and remove all the glass splinters. I just spent 10 seconds picking up some of the largest pieces and then decided that it would be better to push on the train of thought without interruption.

Some time later, I forgot about the glass splinters and ended up stepping on one long enough to penetrate the callus. I prioritized working too much. A pretty nice problem to have, in my book.

Collaboration as Intelligence Enhancer

It was...

(Continue Reading – 1610 more words)

Nathan Helm-Burger29m30

I absolutely agree that it makes more sense to fund the person (or team) rather than the project. I think that it makes sense to evaluate a person's current best idea, or top few ideas when trying to decide whether they are worth funding.

Ideally, yes, I think it'd be great if the funders explicitly gave the person permission to pivot so long as their goal of making aligned AI remained the same.

Maybe a funder would feel better about this if they had the option to reevaluate funding the researcher after a significant pivot?

3Nathan Helm-Burger1h

So am I. So are a lot of would-be researchers. There are many people who think they have a shot at doing this. Most are probably wrong. I'm not saying an org is a good solution for him or me. It would have to be an org willing to encompass and support the things he had in mind. Same with me. I'm not sure such orgs exist for either of us. With a convincing track-record, one can apply for funding to found or co-found a new org based on your ideas. That's a very high bar to clear though. The FAR AI org might be an adequate solution? They are an organization for coordinating independent researchers.

2Algon1h

Oh, I see! That makes a lot more sense. But he should really write up/link to his project then, or his collaborator's project.

3Emrik35m

He linked his extensive research log on the project above, and has made LW posts of some of their progress. That said, I don't know of any good legible summary of it. It would be good to have. I don't know if that's one of Johannes' top priorities, however. It's never obvious from the outside what somebody's top priorities ought to be.

Scientific Notation Options

jefftk

10h

When working with numbers that span many orders of magnitude it's very helpful to use some form of scientific notation. At its core, scientific notation expresses a number by breaking it down into a decimal ≥1 and <10 (the "significand" or "mantissa") and an integer representing the order of magnitude (the "exponent"). Traditionally this is written as:

3 × 10⁴

While this communicates the necessary information, it has two main downsides:

It uses three constant characters ("× 10") to separate the significand and exponent.
It uses superscript, which doesn't work with some typesetting systems and adds awkwardly large line spacing at the best of times. And is generally lost on cut-and-paste.

Instead, I'm a big fan of e-notation, commonly used in programming and on calculators. This looks like:

3e4

This works everywhere, doesn't mess up your line spacing, and requires half as...

(See More – 73 more words)

8noggin-scratcher6h

Sticking to multiples of three does have a minor advantage of aligning itself with things that there are already number-words for; "thousand", "million", "billion" etc. So for those who don't work with the notation often, they might find it easier to recognise and mentally translate 20e9 as "20 billion", rather than having to think through the implications of 2e10

2Nathan Helm-Burger6h

Yeah, that's probably the rationale

6AnthonyC3h

Makes me wonder if there's an equivalent notation for languages that use other number word intervals. Multiples of 4 would work better in Mandarin, for example. Although i guess it's more important that it aligns with SI prefixes?

Nathan Helm-Burger1h42

Well, the nice thing about at least agreeing on using e as the notation means its easy to understand variants which prefer subsets of exponents. 500e8, 50e9, and 5e10 all are reasonably mutually intelligible. I think sticking to a subset of exponents does feel intuitive for talking about numbers frequently encountered in everyday life, but seems a little contrived when talking about large numbers. 4e977 seems to me like it isn't much easier to understand when written as 40e976 or 400e975.

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)

In the context of AI interp. What is a feature exactly?

f3mi

As I read more about previous interpretability work, I've noticed this trend that implicitly defines a feature in this weird human centric way. It's this weird prior that expects networks to automatically generate features that correspond with how we process images/text because... why exactly?

Chris Olah's team at Anthropic thinks about features as "Something a large enough neural network would dedicate a neuron to". Which doesn't have the human-centric bias, but just begs the question of what is a thing a large enough network will dedicate an neuron to? They admit that this is flawed, but say it's their best current definition. This never felt like a good enough answer, even to go off of.

I don't really see the alternative engaged with. What if these features aren't robust?...

(See More – 42 more words)

Answer by Charlie SteinerMay 18, 202420

A different way of stating the usual Anthropic-esque concept of features that I find useful: Features are the things that are getting composed when a neural network is taking advantage of compositionality. This isn't begging the question, you just can't answer this without knowing about the data distribution and the computational strategy of the model after training.

For instance, the reason the neurons aren't always features, even though it's natural to write the activations (which then get "composed" into the inputs to the next layer) in the neuron basis, is because if your data only lies on a manifold in the space of all possible values, the local coordinates of that manifold might rarely line up with the neurons basis.

Ilya Sutskever and Jan Leike resign from OpenAI [updated]

230

Zach Stein-Perlman

Ilya Sutskever and Jan Leike have resigned. They led OpenAI's alignment work. Superalignment will now be led by John Schulman, it seems. Jakub Pachocki replaced Sutskever as Chief Scientist.

Reasons are unclear (as usual when safety people leave OpenAI).

The NYT piece (archive) and others I've seen don't really have details.

OpenAI announced Sutskever's departure in a blogpost.

Sutskever and Leike confirmed their departures in tweets.

Updates:

Friday May 17:

Superalignment dissolves.

Leike tweets, including:

I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point.
I believe much more of our bandwidth should be spent getting ready for the next generations of models, on security, monitoring, preparedness, safety, adversarial robustness, (super)alignment, confidentiality, societal impact, and related topics.
These problems are quite hard to get right,

...

(See More – 559 more words)

Anders Lindström1h2-6

Without resorting to exotic conspiracy theories, is it that unlikely to assume that Altman et al. are under tremendous pressure from the military and intelligence agencies to produce results to not let China or anyone else win the race for AGI? I do not for a second believe that Altman et al. are reckless idiots that do not understand what kind of fire they might be playing with, that they would risk wiping out humanity just to beat Google on search. There must be bigger forces at play here, because that is the only thing that makes sense when reading Leike's comment and observing Open AI's behavior.

4Thane Ruthenis2h

OpenAI enthusiastically commercializing AI + the "Superalignment" approach being exactly the approach I'd expect someone doing safety-washing to pick + the November 2023 drama + the stated trillion-dollar plans to increase worldwide chip production (which are directly at odds with the way OpenAI previously framed its safety concerns). Some of the preceding resignations (chiefly, Daniel Kokotajlo's) also played a role here, though I didn't update off of them much either.

6Thane Ruthenis3h

Sure, that's basically my model as well. But if the faction (b) only cares about alignment due to perceived PR benefits or in order to appease faction (a), and faction (b) turns out to have overriding power such that it can destroy or drive out faction (a) and then curtail all the alignment efforts, I think it's fair to compress all that into "OpenAI's alignment efforts are safety-washing". If (b) has the real power within OpenAI, then OpenAI's behavior and values can be approximately rounded off to (b)'s behavior and values, and (a) is a rounding error. Not if (b) is concerned about fortifying OpenAI against future challenges, such as hypothetical futures in which the AGI Doomsayers get their way and the government/the general public wakes up and tries to nationalize or ban AGI research. In that case, having a prepared, well-documented narrative of going above and beyond to ensure that their products are safe, well before any other parties woke up to the threat, will ensure that OpenAI is much more well-positioned to retain control over its research. (I interpret Sam Altman's behavior at Congress as evidence for this kind of longer-term thinking. He didn't try to downplay the dangers of AI, which would be easy and what someone myopically optimizing for short-term PR would. He proactively brought up the concerns that future AI progress might awaken, getting ahead of it, and thereby established OpenAI as taking them seriously and put himself into the position to control/manage these concerns.) And it's approximately what I would do, at least, if I were in charge of OpenAI and had a different model of AGI Ruin. And this is the potential plot whose partial failure I'm currently celebrating.

11Zach Stein-Perlman3h

Edit: nevermind; maybe this tweet is misleading and narrow and just about restoring people's vested equity; I'm not sure what that means in the context of OpenAI's pseudo-equity but possibly this tweet isn't a big commitment. @gwern I'm interested in your take on this new Altman tweet: In particular "i did not know this was happening"

On Privilege

shminux

The forum has been very much focused on AI safety for some time now, thought I'd post something different for a change. Privilege.

Here I define Privilege as an advantage over others that is invisible to the beholder. This may not be the only definition, or the central definition, or not how you see it, but that's the definition I use for the purposes of this post. I also do not mean it in the culture-war sense as a way to undercut others as in "check your privilege". My point is that we all have some privileges [we are not aware of], and also that nearly each one has a flip side.

In some way this is the inverse of The Lens That Does Not See Its Flaws: The...

(See More – 319 more words)

Razied1h50

The word "privilege" has been so tainted by its association with guilt that it's almost an infohazard to think you've got privilege at this point, it makes you lower your head in shame at having more than others, and brings about a self-flagellation sort of attitude. It elicits an instinct to lower yourself rather than bring others up. The proper reactions to all these things you've listed is gratitude to your circumstances and compassion towards those who don't have them. And certainly everyone should be very careful towards any instinct they have at publ... (read more)

LESSWRONG
LW

Quick Takes

Popular Comments

Recent Discussion

Bleeding Feet and Dedication

Collaboration as Intelligence Enhancer

LessOnline Festival