Gears-level models are expensive - often prohibitively expensive. Black-box approaches are usually much cheaper and faster. But black-box approaches rarely generalize - they're subject to Goodhart, need to be rebuilt when conditions change, don't identify unknown unknowns, and are hard to build on top of. Gears-level models, on the other hand, offer permanent, generalizable knowledge which can be applied to many problems in the future, even if conditions shift.

Very Spicy Take

Epistemic note: Many highly respected community members with substantially greater decision-making experience (and LessWrong karma) presumably disagree strongly with my conclusion.

Premise 1: It is becoming increasingly clear that OpenAI is not appropriately prioritizing safety over advancing capabilities research.

Premise 2: This was the default outcome. Instances in history in which private companies (or any individual humans) have intentionally turned down huge profits and power are the exception, not the rule.

Premise 3: Without repercussions for terrible decisions, decision makers have no skin in the game.

Conclusion: Anyone and everyone involved in Open Phil's recommendation of a $30 million grant to OpenAI in 2017 shouldn't be allowed anywhere near AI safety decision making in the future. To go one step further, potentially any and every major decision they have played a part in needs to be reevaluated by objective third parties. This must include Holden Karnofsky and Paul Christiano, both of whom were closely involved. To quote Open Phil: "OpenAI researchers Dario Amodei and Paul Christiano are both technical advisors to Open Philanthropy and live in the same house as Holden. In addition, Holden is engaged to Dario’s sister Daniela."
I'm surprised at people who seem to be updating only now about OpenAI being very irresponsible, rather than updating when they created a giant public competitive market for chatbots (which contains plenty of labs that don't care about alignment at all), thereby reducing how long everyone has to solve alignment. I still parse that move as devastating the commons in order to make a quick buck.
Akash
My current perspective is that criticism of AGI labs is an under-incentivized public good. I suspect there's a disproportionate amount of value that people could have by evaluating lab plans, publicly criticizing labs when they break commitments or make poor arguments, talking to journalists/policymakers about their concerns, etc. Some quick thoughts:

* Soft power– I think people underestimate how strong the "soft power" of labs is, particularly in the Bay Area.
* Jobs– A large fraction of people getting involved in AI safety are interested in the potential of working for a lab one day. There are some obvious reasons for this– lots of potential impact from being at the organizations literally building AGI, big salaries, lots of prestige, etc.
* People (IMO correctly) perceive that if they acquire a reputation for being critical of labs, their plans, or their leadership, they will essentially sacrifice the ability to work at the labs.
* So you get an equilibrium where the only people making (strong) criticisms of labs are those who have essentially chosen to forgo their potential of working there.
* Money– The labs and Open Phil (which has been perceived, IMO correctly, as investing primarily into metastrategies that are aligned with lab interests) have an incredibly large share of the $$$ in the space. When funding became more limited, this became even more true, and I noticed a very tangible shift in the culture & discourse around labs + Open Phil.
* Status games/reputation– Groups who were more inclined to criticize labs and advocate for public or policymaker outreach were branded as "unilateralist", "not serious", and "untrustworthy" in core EA circles. In many cases, there were genuine doubts about these groups, but my impression is that these doubts got amplified/weaponized in cases where the groups were more openly critical of the labs.
* Subjectivity of "good judgment"– There is a strong culture of people getting jobs/status for having "good judgment". This is sensible insofar as we want people with good judgment (who wouldn't?), but it often ends up being so subjective that it leads to people being quite afraid to voice opinions that go against mainstream views and metastrategies (particularly those endorsed by labs + Open Phil).
* Anecdote– Personally, I found my ability to evaluate and critique labs + mainstream metastrategies substantially improved when I spent more time around folks in London and DC (who were less closely tied to the labs). In fairness, I suspect that if I had lived in London or DC *first* and then moved to the Bay Area, it's plausible I would've had a similar feeling but in the "reverse direction".

With all this in mind, I find myself more deeply appreciating folks who have publicly and openly critiqued labs, even in situations where the cultural and economic incentives to do so were quite weak (relative to staying silent or saying generic positive things about labs). Examples: Habryka, Rob Bensinger, CAIS, MIRI, Conjecture, and FLI. More recently, @Zach Stein-Perlman, and of course Jan Leike and Daniel K.
robo
Our current big stupid: not preparing for 40% agreement

Epistemic status: lukewarm take from the gut (not brain) that feels rightish.

The "Big Stupid" of the AI doomers 2013-2023 was that AI nerds' solution to the problem "How do we stop people from building dangerous AIs?" was "research how to build AIs". Methods normal people would consider to stop people from building dangerous AIs, like asking governments to make it illegal to build dangerous AIs, were considered gauche. When the public turned out to be somewhat receptive to the idea of regulating AIs, doomers were unprepared.

Take: The "Big Stupid" of right now is still the same thing. (We've not corrected enough.) Between now and transformative AGI we are likely to encounter a moment where 40% of people realize AIs really could take over (say, if every month another 1% of the population loses their job). If 40% of the world were as scared of AI loss-of-control as you, what could the world do? I think a lot! Do we have a plan for then?

Almost every LessWrong post on AIs is about analyzing AIs. Almost none are about how, given widespread public support, people/governments could stop bad AIs from being built. [Example: if 40% of people were as worried about AI as I was, the US would treat GPU manufacture like uranium enrichment. And fortunately GPU manufacture is hundreds of times harder than uranium enrichment! We should be nerding out researching integrated circuit supply chains, choke points, foundry logistics in jurisdictions the US can't unilaterally sanction, that sort of thing.]

TLDR: stopping deadly AIs from being built needs less research on AIs and more research on how to stop AIs from being built.

*My research included 😬
If your endgame strategy involved relying on OpenAI, DeepMind, or Anthropic to implement your alignment solution that solves science / super-cooperation / nanotechnology, consider figuring out another endgame plan.

Popular Comments

Recent Discussion

I want to draw attention to a new paper, written by myself, David "davidad" Dalrymple, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, and Joshua Tenenbaum.

In this paper we introduce the concept of "guaranteed safe (GS) AI", which is a broad research strategy for obtaining safe AI systems with provable quantitative safety guarantees. Moreover, with a sufficient push, this strategy could plausibly be implemented on a moderately short time scale. The key components of GS AI are:

  1. A formal safety specification that mathematically describes what effects or behaviors are considered safe or acceptable.
  2. A world model that provides a mathematical description of the environment of the AI system.
  3. A verifier
...
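To make the components concrete, here is a deliberately tiny, purely illustrative sketch (my own toy example, not from the paper): a hand-written safety specification, a deterministic world model, and a verifier that checks the specification on every state a candidate policy reaches under that model. The actual proposal is of course far more general; the point here is only the division of labour between the three pieces.

```python
# Toy illustration only (my own sketch, not from the paper): the three GS AI
# components on a 4x4 gridworld. All names and dynamics here are invented.

UNSAFE_STATES = {(2, 2)}  # 1. Formal safety specification: never occupy cell (2, 2).

def safety_spec(state) -> bool:
    return state not in UNSAFE_STATES

def world_model(state, action):
    """2. World model: assumed deterministic dynamics on the grid."""
    moves = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    dx, dy = moves[action]
    x, y = state
    return (min(max(x + dx, 0), 3), min(max(y + dy, 0), 3))

def policy(state):
    """A hypothetical AI policy we want to certify."""
    return "right" if state[0] < 3 else "up"

def verifier(start, max_steps=20) -> bool:
    """3. Verifier: check the spec on every state the policy reaches under the model."""
    state = start
    for _ in range(max_steps):
        if not safety_spec(state):
            return False
        state = world_model(state, policy(state))
    return safety_spec(state)

print(verifier((0, 0)))  # True -- but only relative to this world model; the
                         # guarantee is silent about dynamics the model omits.
```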

I read the paper, and overall it's an interesting framework. One thing I am somewhat unconvinced about (likely because I have misunderstood something) is its utility despite the dependence on the world model. If we prove guarantees assuming a world model, but don't know what happens if the real world deviates from the world model, then we have a problem. Ideally perhaps we want a guarantee akin to what's proved in learning theory, for example, that the error will be small for any data distribution as long as the distribution remains the same during trai... (read more)
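(For reference, and purely as my own gloss rather than anything from the paper or the comment, the learning-theory guarantee being alluded to typically looks like the standard finite-hypothesis-class uniform convergence bound: it holds for every data distribution D, but the true risk and the empirical risk both refer to that same D, which is exactly the "distribution stays the same" caveat.)

```latex
% With probability >= 1 - delta over an i.i.d. sample S of size m from any distribution D:
\Pr_{S \sim D^m}\!\left[\ \forall h \in \mathcal{H}:\ \bigl|R_D(h) - \hat{R}_S(h)\bigr|
  \le \sqrt{\frac{\ln\lvert\mathcal{H}\rvert + \ln(2/\delta)}{2m}}\ \right] \ge 1 - \delta
```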

 [memetic status: stating directly despite it being a clear consequence of core AI risk knowledge because many people have "but nature will survive us" antibodies to other classes of doom and misapply them here.]

Unfortunately, no.[1]

Technically, “Nature”, meaning the fundamental physical laws, will continue. However, people usually mean forests, oceans, fungi, bacteria, and generally biological life when they say “nature”, and those would not have much chance competing against a misaligned superintelligence for resources like sunlight and atoms, which are useful to both biological and artificial systems.

There’s a thought that comforts many people when they imagine humanity going extinct due to a nuclear catastrophe or runaway global warming: Once the mushroom clouds or CO2 levels have settled, nature will reclaim the cities. Maybe mankind in our hubris will have wounded Mother Earth and paid the price ourselves, but...

I think literal extinction is unlikely even conditional on misaligned AI takeover due to:

  • The potential for the AI to be at least a tiny bit "kind" (same as humans probably wouldn't kill all aliens)
  • Decision theory/trade reasons

This is discussed in more detail here and here.

Insofar as humans and/or aliens care about nature, similar arguments apply there too, though this is mostly beside the point given that if humans survive and have resources they can preserve some nature easily.

I find it annoying how confident this article is without really bothering to... (read more)

GoteNoSente
It is not at all clear to me that most of the atoms in a planet could be harnessed for technological structures, or that doing so would be energy efficient. Most of the mass of an earthlike planet is iron, oxygen, silicon and magnesium, and while useful things can be made out of these elements, I would strongly worry that other elements that are needed also in those useful things will run out long before the planet has been disassembled.

By historical precedent, I would think that an AI civilization on Earth will ultimately be able to use only a tiny fraction of the material in the planet, similarly to how only a very small fraction of a percent of the carbon in the planet is being used by the biosphere, in spite of biological evolution having optimized organisms for billions of years towards using all resources available for life.

The scenario of a swarm of intelligent drones eating up a galaxy and blotting out its stars I think can empirically be dismissed as very unlikely, because it would be visible over intergalactic distances. Unless we are the only civilization in the observable universe in the present epoch, we would see galaxies with dark spots or very strangely altered spectra somewhere. So this isn't happening anywhere.

There are probably some historical analogs for the scenario of a complete takeover, but they are very far in the past, and have had more complex outcomes than intelligent grey goo scenarios normally portray. One instance I can think of is the Great Oxygenation Event. I imagine an observer back then might have envisioned that the end result of the evolution of cyanobacteria doing oxygenic photosynthesis would be the oceans and lakes and rivers all being filled with green slime, with a toxic oxygen atmosphere killing off all other life. While indeed this prognosis would have been true to a first order approximation - green plants do dominate life on Earth today - the reality of what happened is infinitely more complex than this crude pictu
Dagon
If it's possible for super-intelligent AI to be non-sentient, wouldn't it be possible for insects to evolve non-sentient intelligence as well?  I guess I didn't assume "non-sentient" in the definition of "unaligned".
gilch
Yep. It would take a peculiar near-miss for an unfriendly AI to preserve Nature, but not humanity. Seemed obvious enough to me. Plants and animals are made of atoms it can use for something else. By the way, I expect the rapidly expanding sphere of Darkness engulfing the Galaxy to happen even if things go well. The stars are enormous repositories of natural resources that happen to be on fire. We should put them out so they don't go to waste.

[Epistemic status: As I say below, I've been thinking about this topic for several years and I've worked on it as part of my PhD research. But none of this is based on any rigorous methodology, just my own impressions from reading the literature.]

I've been thinking about possible cruxes in AI x-risk debates for several years now. I was even doing that as part of my PhD research, although my PhD is currently on pause because my grant ran out. In particular, I often wonder about "meta-cruxes" - i.e., cruxes related to debates or uncertainties that are more about different epistemological or decision-making approaches rather than about more object-level arguments.

The following are some of my current top candidates for "meta-cruxes" related to AI x-risk debates. There are...

Bleeding Feet and Dedication

During AI Safety Camp (AISC) 2024, I was working with somebody on how to use binary search to approximate a hull that would contain a set of points, only to knock a glass off of my table. It splintered into a thousand pieces all over my floor.

A normal person might stop and remove all the glass splinters. I just spent 10 seconds picking up some of the largest pieces and then decided that it would be better to push on the train of thought without interruption.

Some time later, I forgot about the glass splinters and ended up stepping on one long enough to penetrate the callus. I prioritized working too much. A pretty nice problem to have, in my book.

Collaboration as Intelligence Enhancer

It was...

I absolutely agree that it makes more sense to fund the person (or team) rather than the project. I think that it makes sense to evaluate a person's current best idea, or top few ideas when trying to decide whether they are worth funding.

Ideally, yes, I think it'd be great if the funders explicitly gave the person permission to pivot so long as their goal of making aligned AI remained the same.

Maybe a funder would feel better about this if they had the option to reevaluate funding the researcher after a significant pivot?

3Nathan Helm-Burger
So am I. So are a lot of would-be researchers. There are many people who think they have a shot at doing this. Most are probably wrong. I'm not saying an org is a good solution for him or me. It would have to be an org willing to encompass and support the things he had in mind. Same with me. I'm not sure such orgs exist for either of us. With a convincing track record, one can apply for funding to found or co-found a new org based on one's ideas. That's a very high bar to clear, though. The FAR AI org might be an adequate solution? They are an organization for coordinating independent researchers.
2Algon
Oh, I see! That makes a lot more sense. But he should really write up/link to his project then, or his collaborator's project.
3Emrik
He linked his extensive research log on the project above, and has made LW posts of some of their progress. That said, I don't know of any good legible summary of it. It would be good to have. I don't know if that's one of Johannes' top priorities, however. It's never obvious from the outside what somebody's top priorities ought to be.

When working with numbers that span many orders of magnitude it's very helpful to use some form of scientific notation. At its core, scientific notation expresses a number by breaking it down into a decimal ≥1 and <10 (the "significand" or "mantissa") and an integer representing the order of magnitude (the "exponent"). Traditionally this is written as:

3 × 10⁴

While this communicates the necessary information, it has two main downsides:

  • It uses three constant characters ("× 10") to separate the significand and exponent.

  • It uses superscript, which doesn't work with some typesetting systems and adds awkwardly large line spacing at the best of times. And is generally lost on cut-and-paste.

Instead, I'm a big fan of e-notation, commonly used in programming and on calculators. This looks like:

3e4

This works everywhere, doesn't mess up your line spacing, and requires half as...
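As a side note, this is also the notation most programming languages and calculators already parse and print; a minimal Python sketch:

```python
x = 3e4                    # the literal 3e4 is thirty thousand
print(x)                   # 30000.0
print(f"{x:.0e}")          # 3e+04 -- the 'e' format spec emits the same notation
print(f"{123456789:.3e}")  # 1.235e+08
```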

noggin-scratcher
Sticking to multiples of three does have a minor advantage of aligning with things there are already number-words for: "thousand", "million", "billion", etc. So those who don't work with the notation often might find it easier to recognise and mentally translate 20e9 as "20 billion", rather than having to think through the implications of 2e10.
Nathan Helm-Burger
Yeah, that's probably the rationale
AnthonyC
Makes me wonder if there's an equivalent notation for languages that use other number-word intervals. Multiples of 4 would work better in Mandarin, for example. Although I guess it's more important that it aligns with SI prefixes?

Well, the nice thing about at least agreeing on e as the notation is that it's easy to understand variants which prefer subsets of exponents. 500e8, 50e9, and 5e10 are all reasonably mutually intelligible. I think sticking to a subset of exponents does feel intuitive for talking about numbers frequently encountered in everyday life, but seems a little contrived when talking about large numbers. 4e977 seems to me like it isn't much easier to understand when written as 40e976 or 400e975.
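If someone did want the multiples-of-three convention mechanically, it's only a few lines to convert; a rough sketch (the function name and rounding choices are mine, purely illustrative):

```python
import math

def to_engineering(x: float, sig: int = 3) -> str:
    """Format x with an exponent that is a multiple of 3 (e.g. 20e9 rather than 2e10)."""
    if x == 0:
        return "0e0"
    exp = math.floor(math.log10(abs(x)))
    eng_exp = 3 * (exp // 3)          # round the exponent down to a multiple of 3
    mantissa = x / 10 ** eng_exp      # shift the significand to compensate
    return f"{mantissa:.{sig}g}e{eng_exp}"

print(to_engineering(2e10))   # 20e9
print(to_engineering(0.005))  # 5e-3
```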


As I read more about previous interpretability work, I've noticed this trend that implicitly defines a feature in this weird, human-centric way. It's this weird prior that expects networks to automatically generate features that correspond with how we process images/text because... why exactly?

Chris Olah's team at Anthropic thinks about features as "something a large enough neural network would dedicate a neuron to". That doesn't have the human-centric bias, but it just begs the question of what a large enough network will dedicate a neuron to. They admit that this is flawed, but say it's their best current definition. This never felt like a good enough answer, even to go off of.

I don't really see the alternative engaged with. What if these features aren't robust?...

Answer by Charlie Steiner

A different way of stating the usual Anthropic-esque concept of features that I find useful: features are the things that are getting composed when a neural network is taking advantage of compositionality. This isn't begging the question; you just can't answer it without knowing about the data distribution and the computational strategy of the model after training.

For instance, the reason the neurons aren't always features, even though it's natural to write the activations (which then get "composed" into the inputs to the next layer) in the neuron basis, is that if your data only lies on a manifold in the space of all possible values, the local coordinates of that manifold might rarely line up with the neuron basis.
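A toy illustration of that last point (my own sketch, not from the comment): activations that lie on a one-dimensional manifold rotated away from the neuron basis, so no single neuron lines up with the single underlying feature, while a basis-free decomposition recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.uniform(-1, 1, size=1000)        # one underlying scalar feature
direction = np.array([1.0, 1.0]) / np.sqrt(2)  # feature direction, not axis-aligned
activations = np.outer(feature, direction)     # (1000, 2): a line in 2-neuron space

# Each neuron individually carries only a scaled copy of the feature:
print(activations.var(axis=0))                 # variance split evenly across neurons

# A basis-free decomposition (here PCA via SVD) recovers the true direction:
_, _, vt = np.linalg.svd(activations - activations.mean(axis=0), full_matrices=False)
print(vt[0])                                   # ~[0.707, 0.707]: the actual feature
```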

Ilya Sutskever and Jan Leike have resigned. They led OpenAI's alignment work. Superalignment will now be led by John Schulman, it seems. Jakub Pachocki replaced Sutskever as Chief Scientist.

Reasons are unclear (as usual when safety people leave OpenAI).

The NYT piece (archive) and others I've seen don't really have details.

OpenAI announced Sutskever's departure in a blogpost.

Sutskever and Leike confirmed their departures in tweets.


Updates:

Friday May 17:

Superalignment dissolves.

Leike tweets, including:

I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point.

I believe much more of our bandwidth should be spent getting ready for the next generations of models, on security, monitoring, preparedness, safety, adversarial robustness, (super)alignment, confidentiality, societal impact, and related topics.

These problems are quite hard to get right,

...

Without resorting to exotic conspiracy theories, is it that unlikely to assume that Altman et al. are under tremendous pressure from the military and intelligence agencies to produce results so as not to let China or anyone else win the race for AGI? I do not for a second believe that Altman et al. are reckless idiots who do not understand what kind of fire they might be playing with, or that they would risk wiping out humanity just to beat Google on search. There must be bigger forces at play here, because that is the only thing that makes sense when reading Leike's comment and observing OpenAI's behavior.

Thane Ruthenis
OpenAI enthusiastically commercializing AI + the "Superalignment" approach being exactly the approach I'd expect someone doing safety-washing to pick + the November 2023 drama + the stated trillion-dollar plans to increase worldwide chip production (which are directly at odds with the way OpenAI previously framed its safety concerns). Some of the preceding resignations (chiefly, Daniel Kokotajlo's) also played a role here, though I didn't update off of them much either.
Thane Ruthenis
Sure, that's basically my model as well. But if faction (b) only cares about alignment due to perceived PR benefits or in order to appease faction (a), and faction (b) turns out to have overriding power such that it can destroy or drive out faction (a) and then curtail all the alignment efforts, I think it's fair to compress all that into "OpenAI's alignment efforts are safety-washing". If (b) has the real power within OpenAI, then OpenAI's behavior and values can be approximately rounded off to (b)'s behavior and values, and (a) is a rounding error.

Not if (b) is concerned about fortifying OpenAI against future challenges, such as hypothetical futures in which the AGI Doomsayers get their way and the government/the general public wakes up and tries to nationalize or ban AGI research. In that case, having a prepared, well-documented narrative of going above and beyond to ensure that their products are safe, well before any other parties woke up to the threat, will ensure that OpenAI is much more well-positioned to retain control over its research.

(I interpret Sam Altman's behavior at Congress as evidence for this kind of longer-term thinking. He didn't try to downplay the dangers of AI, which would have been easy and is what someone myopically optimizing for short-term PR would have done. He proactively brought up the concerns that future AI progress might awaken, getting ahead of them, and thereby established OpenAI as taking those concerns seriously and put himself in a position to control/manage them.) And it's approximately what I would do, at least, if I were in charge of OpenAI and had a different model of AGI Ruin. And this is the potential plot whose partial failure I'm currently celebrating.
Zach Stein-Perlman
Edit: nevermind; maybe this tweet is misleading and narrow and just about restoring people's vested equity; I'm not sure what that means in the context of OpenAI's pseudo-equity but possibly this tweet isn't a big commitment. @gwern I'm interested in your take on this new Altman tweet: In particular "i did not know this was happening"

The forum has been very much focused on AI safety for some time now, so I thought I'd post something different for a change: Privilege.

Here I define Privilege as an advantage over others that is invisible to the beholder. This may not be the only definition, or the central definition, or how you see it, but it's the definition I use for the purposes of this post. I also do not mean it in the culture-war sense, as a way to undercut others, as in "check your privilege". My point is that we all have some privileges [we are not aware of], and also that nearly every one of them has a flip side.

In some way this is the inverse of The Lens That Does Not See Its Flaws: The...

Razied

The word "privilege" has been so tainted by its association with guilt that it's almost an infohazard to think you've got privilege at this point, it makes you lower your head in shame at having more than others, and brings about a self-flagellation sort of attitude. It elicits an instinct to lower yourself rather than bring others up. The proper reactions to all these things you've listed is gratitude to your circumstances and compassion towards those who don't have them. And certainly everyone should be very careful towards any instinct they have at publ... (read more)

LessOnline Festival

May 31st to June 2nd, Berkeley, CA