Gears-level models are expensive - often prohibitively expensive. Black-box approaches are usually much cheaper and faster. But black-box approaches rarely generalize - they're subject to Goodhart's law, need to be rebuilt when conditions change, don't identify unknown unknowns, and are hard to build on top of. Gears-level models, on the other hand, offer permanent, generalizable knowledge which can be applied to many problems in the future, even if conditions shift.

Very Spicy Take. Epistemic Note: Many highly respected community members with substantially greater decision-making experience (and LessWrong karma) presumably disagree strongly with my conclusion.

Premise 1: It is becoming increasingly clear that OpenAI is not appropriately prioritizing safety over advancing capabilities research.

Premise 2: This was the default outcome. Instances in history in which private companies (or any individual humans) have intentionally turned down huge profits and power are the exception, not the rule.

Premise 3: Without repercussions for terrible decisions, decision makers have no skin in the game.

Conclusion: Anyone and everyone involved with Open Phil recommending a grant of $30 million to OpenAI in 2017 shouldn't be allowed anywhere near AI Safety decision making in the future. To go one step further, potentially any and every major decision they have played a part in needs to be reevaluated by objective third parties. This must include Holden Karnofsky and Paul Christiano, both of whom were closely involved. To quote Open Phil: "OpenAI researchers Dario Amodei and Paul Christiano are both technical advisors to Open Philanthropy and live in the same house as Holden. In addition, Holden is engaged to Dario’s sister Daniela."
I'm surprised at people who seem to be updating only now about OpenAI being very irresponsible, rather than updating when they created a giant public competitive market for chatbots (which contains plenty of labs that don't care about alignment at all), thereby reducing how long everyone has to solve alignment. I still parse that move as devastating the commons in order to make a quick buck.
If your endgame strategy involved relying on OpenAI, DeepMind, or Anthropic to implement your alignment solution that solves science / super-cooperation / nanotechnology, consider figuring out another endgame plan.
Akash
My current perspective is that criticism of AGI labs is an under-incentivized public good. I suspect there's a disproportionate amount of value that people could have by evaluating lab plans, publicly criticizing labs when they break commitments or make poor arguments, talking to journalists/policymakers about their concerns, etc. Some quick thoughts:

* Soft power – I think people underestimate how strong the "soft power" of labs is, particularly in the Bay Area.
* Jobs – A large fraction of people getting involved in AI safety are interested in the potential of working for a lab one day. There are some obvious reasons for this – lots of potential impact from being at the organizations literally building AGI, big salaries, lots of prestige, etc.
  * People (IMO correctly) perceive that if they acquire a reputation for being critical of labs, their plans, or their leadership, they will essentially sacrifice the ability to work at the labs.
  * So you get an equilibrium where the only people making (strong) criticisms of labs are those who have essentially chosen to forgo their potential of working there.
* Money – The labs and Open Phil (which has been perceived, IMO correctly, as investing primarily into metastrategies that are aligned with lab interests) have an incredibly large share of the $$$ in the space. When funding became more limited, this became even more true, and I noticed a very tangible shift in the culture & discourse around labs + Open Phil.
* Status games/reputation – Groups who were more inclined to criticize labs and advocate for public or policymaker outreach were branded as “unilateralist”, “not serious”, and “untrustworthy” in core EA circles. In many cases, there were genuine doubts about these groups, but my impression is that these doubts got amplified/weaponized in cases where the groups were more openly critical of the labs.
* Subjectivity of "good judgment" – There is a strong culture of people getting jobs/status for having “good judgment”. This is sensible insofar as we want people with good judgment (who wouldn’t?), but it often ends up being so subjective that it leads to people being quite afraid to voice opinions that go against mainstream views and metastrategies (particularly those endorsed by labs + Open Phil).
* Anecdote – Personally, I found my ability to evaluate and critique labs + mainstream metastrategies substantially improved when I spent more time around folks in London and DC (who were less closely tied to the labs). In fairness, I suspect that if I had lived in London or DC *first* and then moved to the Bay Area, it’s plausible I would’ve had a similar feeling but in the “reverse direction”.

With all this in mind, I find myself more deeply appreciating folks who have publicly and openly critiqued labs, even in situations where the cultural and economic incentives to do so were quite weak (relative to staying silent or saying generic positive things about labs). Examples: Habryka, Rob Bensinger, CAIS, MIRI, Conjecture, and FLI. More recently, @Zach Stein-Perlman, and of course Jan Leike and Daniel K.
From my perspective, the only thing that keeps the OpenAI situation from being all kinds of terrible is that I continue to think they're not close to human-level AGI, so it probably doesn't matter all that much. This is also my take on AI doom in general; my P(doom|AGI soon) is quite high (>50% for sure), but my P(AGI soon) is low. In fact it decreased in the last 12 months.
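For concreteness, the two estimates in that comment combine multiplicatively. With made-up illustrative numbers (not the commenter's actual figures), the decomposition for the near-term path looks like:

```latex
% Hypothetical numbers for illustration only, not the commenter's estimates.
P(\text{doom via near-term AGI})
  = P(\text{doom} \mid \text{AGI soon}) \cdot P(\text{AGI soon})
  \approx 0.6 \times 0.1
  = 0.06
```

So a high conditional P(doom) can still correspond to a modest overall probability when P(AGI soon) is small.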


Recent Discussion

Crossposted from my Substack.

What if our universe had an undo function?

I wish to propose a new paradox about the implications of a simulated world that we occupy. If concepts like the Many Worlds Interpretation (MWI) and Simulation Theory have grains of truth to them (both of which I will explore in depth later on), then could there be a specific state of stored “memory” (for lack of a better term) that could be referred back to, like in a video game? If our world operates under a simulated system, where save states could be held and activated, could we (as inhabitants of the simulation) ever get to discover this mechanism? Could it be what David Chalmers describes as a “Sim Sign”? Who could be responsible for...

Akash
My current perspective is that criticism of AGI labs is an under-incentivized public good. ...

Sorry for brevity, I'm busy right now.

  1. Noticing good stuff labs do, not just criticizing them, is often helpful. I wish you thought of this work more as "evaluation" than "criticism."
  2. It's often important for evaluation to be quite truth-tracking. Criticism isn't obviously good by default.

FSF blogpost. Full document (just 6 pages; you should read it). Compare to Anthropic's RSP, OpenAI's RSP ("Preparedness Framework"), and METR's Key Components of an RSP.

DeepMind's FSF has three steps:

  1. Create model evals for warning signs of "Critical Capability Levels"
    1. Evals should have a "safety buffer" of at least 6x effective compute so that CCLs will not be reached between evals
    2. They list 7 CCLs across "Autonomy, Biosecurity, Cybersecurity, and Machine Learning R&D," and they're thinking about CBRN
      1. E.g. "Autonomy level 1: Capable of expanding its effective capacity in the world by autonomously acquiring resources and using them to run and sustain additional copies of itself on hardware it rents"
  2. Do model evals every 6x effective compute and every 3 months of fine-tuning
    1. This is an "aim," not a commitment
    2. Nothing about evals during deployment
  3. "When a model reaches
...
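To make the cadence in item 2 concrete, here is a minimal sketch in Python of how the two triggers could be checked. This is my own illustration, not anything from DeepMind's document; `eval_due` and the threshold constants are assumptions that simply encode the "6x effective compute or 3 months of fine-tuning" rule as summarized above.

```python
from datetime import datetime, timedelta

# Illustration only: one reading of the FSF eval cadence, not DeepMind's code.
COMPUTE_GROWTH_TRIGGER = 6.0              # "every 6x effective compute"
FINETUNE_TRIGGER = timedelta(days=90)     # "every 3 months of fine-tuning"

def eval_due(effective_compute: float, compute_at_last_eval: float,
             last_eval: datetime, now: datetime) -> bool:
    """True if either re-evaluation trigger has fired since the last eval."""
    grew_6x = effective_compute / compute_at_last_eval >= COMPUTE_GROWTH_TRIGGER
    three_months = now - last_eval >= FINETUNE_TRIGGER
    return grew_6x or three_months

# Example: only 4x more effective compute, but ~4 months of fine-tuning -> due.
print(eval_due(4e25, 1e25, datetime(2024, 1, 1), datetime(2024, 5, 1)))  # True
```

The 6x "safety buffer" in item 1.1 is then the hope that capabilities won't cross a CCL within a single such interval.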
cubefox
RSP = Responsible Scaling Policy
Akash
@Zach Stein-Perlman I appreciate your recent willingness to evaluate and criticize safety plans from labs. I think this is likely a public good that is underprovided, given the strong incentives that many people have to maintain a good standing with labs (not to mention more explicit forms of pressure applied by OpenAI and presumably other labs).

One thought: I feel like the difference between how you described the Anthropic RSP and how you described the OpenAI PF feels stronger than the actual quality differences between the documents. I agree with you that the thresholds in the OpenAI PF are too high, but I think the PF should get "points" for spelling out risks that go beyond ASL-3/misuse. OpenAI has commitments that are insufficiently cautious for ASL-4+ (or what they would call high/critical on model autonomy), but Anthropic circumvents this problem by simply refusing to make any commitments around ASL-4 (for now). You note this limitation when describing Anthropic's RSP, but you describe it as "promising" while describing the PF as "unpromising." In my view, this might be unfairly rewarding Anthropic for just not engaging with the hardest parts of the problem (or unfairly penalizing OpenAI for giving their best guess answers RE how to deal with the hardest parts of the problem).

We might also just disagree on how firm or useful the commitments in each document are – I walked away with a much better understanding of how OpenAI plans to evaluate & handle risks than how Anthropic plans to handle & evaluate risks. I do think OpenAI's thresholds are too high, but it's likely that I'll feel the same way about Anthropic's thresholds. In particular, I don't expect either (any?) lab to be able to resist the temptation to internally deploy models with autonomous persuasion capabilities or autonomous AI R&D capabilities (partially because the competitive pressures and race dynamics pushing them to do so will be intense). I don't see evidence that either lab is taking...
ryan_greenblatt
I agree with this as stated, but don't think that avoiding deploying such models is needed to mitigate risk. I think various labs are to some extent in denial of this because massively deploying possibly misaligned systems sounds crazy (and is somewhat crazy), but I would prefer if various people realized this was likely the default outcome and prepared accordingly. More strongly, I think most of the relevant bit of the safety-usefulness trade-off curve involves deploying such models. (With countermeasures.)

I think this is a real possibility, but unlikely to be necessary, depending on the risk target. E.g., I think you can deploy ASL-4 models with <5% risk without theoretical insights and instead just via being very careful with various prosaic countermeasures (mostly control). <1% risk probably requires stronger stuff, though it will depend on the architecture and various other random details. (That said, I'm pretty sure that these labs aren't making decisions based on carefully analyzing the situation and are instead just operating like "idk, human-level models don't seem that bad, we'll probably be able to figure it out, humans can solve most problems with empiricism on priors". But this prior seems more right than overwhelming pessimism IMO.)

Also, I think you should seriously entertain the idea that just trying quite hard with various prosaic countermeasures might suffice for reasonably high levels of safety, and thus that pushing on this could potentially be very leveraged relative to trying to hit a higher target.
Akash

I personally have a large amount of uncertainty around how useful prosaic techniques & control techniques will be. Here are a few statements I'm more confident in:

  1. Ideally, AGI development would have much more oversight than we see in the status quo. Whether or not development or deployment activities keep national security risks below acceptable levels should be a question that governments are involved in answering. A sensible oversight regime would require evidence of positive safety or "affirmative safety". 
  2. My biggest concern with the prosaic/co
...
Tamsin Leake
I'm surprised at people who seem to be updating only now about OpenAI being very irresponsible, rather than updating when they created a giant public competitive market for chatbots (which contains plenty of labs that don't care about alignment at all), thereby reducing how long everyone has to solve alignment. ...

I still parse that move as devastating the commons in order to make a quick buck.

I believe that ChatGPT was not released with the expectation that it would become as popular as it did. OpenAI pivoted hard when it saw the results.

Also, I think you are misinterpreting the sort of 'updates' people are making here.

When working with numbers that span many orders of magnitude it's very helpful to use some form of scientific notation. At its core, scientific notation expresses a number by breaking it down into a decimal ≥1 and <10 (the "significand" or "mantissa") and an integer representing the order of magnitude (the "exponent"). Traditionally this is written as:

3 × 10⁴

While this communicates the necessary information, it has two main downsides:

  • It uses three constant characters ("× 10") to separate the significand and exponent.

  • It uses superscript, which doesn't work with some typesetting systems, adds awkwardly large line spacing at the best of times, and is generally lost on cut-and-paste.

Instead, I'm a big fan of e-notation, commonly used in programming and on calculators. This looks like:

3e4

This works everywhere, doesn't mess up your line spacing, and requires half as...

Yeah, agreed. Also, using just an e makes it much easier to type on a phone keyboard.

There are also other variants, like ee and EE. Sometimes you also see a variant which uses only multiples of three as the exponent (I think it's called engineering notation rather than scientific notation), e.g. 1e3, 50e3, 700e6, 2e9. I like this version less.
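Since the thread is about notation that programs already understand, here's a minimal Python sketch showing that plain e-notation round-trips through the standard parser and format specs, plus a small engineering-notation formatter. `eng_notation` is a hypothetical helper written for this illustration, not a standard-library function.

```python
import math

def eng_notation(x: float, sig: int = 3) -> str:
    """Format x with an exponent that is a multiple of 3, e.g. 50000 -> '50e3'."""
    if x == 0:
        return "0e0"
    exp = math.floor(math.log10(abs(x)))   # order of magnitude
    exp3 = 3 * (exp // 3)                  # round down to a multiple of 3
    mantissa = x / 10 ** exp3              # |significand| lands in [1, 1000)
    return f"{mantissa:.{sig}g}e{exp3}"

print(float("3e4"))           # 30000.0 -- e-notation parses directly
print(f"{30000:.0e}")         # 3e+04   -- and formats directly
print(eng_notation(50_000))   # 50e3
print(eng_notation(7e8))      # 700e6
```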

This is the fourth in a sequence of posts taken from my recent report: Why Did Environmentalism Become Partisan?

This post has more of my personal opinions than previous posts or the report itself.


Other movements should try to avoid becoming as partisan as the environmental movement. Partisanship did not make environmentalism more popular; it made legislation more difficult to pass and resulted in fluctuating executive action. Looking at the history of environmentalism can give insight into what to avoid in order to stay bipartisan.

Partisanship was not inevitable. It occurred as the result of choices and alliances made by individual decision makers. If they had made different choices, environmentalism could have ended up being a bipartisan issue, like it was in the 1980s and is in some countries...

wolajacy
Agreed. Advocacy seems to me to be ~very frequently tied to bad epistemics, for a variety of reasons. So what is missing to me in this writeup (and indeed, in most of the discussions about the issue) is: why does it make sense to make laypeople interested in the issue? The status quo is that relevant people (ML researchers at large, AI investors, governments, and international bodies like the UN) are already well aware of the safety problem. Institutions are set up, work is being done. What is there to be gained from involving the public to an even greater extent, poisoning and inevitably simplifying the discourse, and adding more hard-to-control momentum?

I can imagine a few answers (not enough being done at present, fear of market forces eventually overwhelming the governance, a "democratic mindset"), but none of those seem convincing in the face of the above. To tie this to the environmental movement: wouldn't it be much better for the world if it were an uninspiring issue? It seems to me that this would have prevented the anti-nuclear movement from being solidified by momentum, Extinction Rebellion from promoting degrowth, etc., and instead semi-sensible policies would get considered somewhere in the bureaucracy of the states.

instead semi-sensible policies would get considered somewhere in the bureaucracy of the states

Whilst normally having radical groups is useful for shifting the Overton window or abusing anchoring effects, in this case study of environmentalism I think it backfired, from what I can understand, given that polling data showed the public in the sample country already cared about the environment.


I don't see how this is relevant to my comment.

By "positive EV bets" I meant positive EV with respect to shared values, not with respect to personal gain.

ETA: Maybe your view is that leaders should take these bets anyway even though they know they are likely to result in a forced retirement (e.g. ignoring the disincentive). I was actually thinking of the disincentive effect as: you are actually a good leader, so you remaining in power would be good, therefore you should avoid actions that result in you losing power for unjustified reasons. Therefore you sh...

ryan_greenblatt
Do you think that whenever anyone makes a decision that ends up being bad ex-post they should be forced to retire? Doesn't this strongly disincentivize making positive EV bets which are likely to fail?
ryan_greenblatt
I mostly agree with premises 1, 2, and 3, but I don't see how the conclusion follows. It is possible for things to be hard to influence and yet still worth it to try to influence them. (Note that the $30 million grant was not an endorsement and was instead a partnership (e.g. it came with a board seat), see Buck's comment.) (Ex-post, I think this endeavour was probably net negative, though I'm pretty unsure and ex-ante I think it seems great.)
ryan_greenblatt
Why focus on the $30 million grant? What about large numbers of people working at OpenAI directly on capabilities for many years? (Which is surely worth far more than $30 million.) Separately, this grant seems to have been done to influence the governance at OpenAI, not to make OpenAI go faster. (Directly working on capabilities seems modestly more accelerating and risky than granting money in exchange for a partnership.) (ETA: TBC, there is a relationship between the grant and people working at OpenAI on capabilities: the grant was associated with a general vague endorsement of trying to play inside game at OpenAI.)

I've noticed that the principles of Evolution / Natural Selection apply to a lot of things besides the context they were initially developed for (Biology). 

Examples are things like ideas / culture (memetics), technological progress, and machine learning (sort of).

Reasoning about things like history, politics, companies, etc in terms of natural selection has helped me understand the world much better than I did when I thought that natural selection applied only to Biology. 

So, I'm asking here for any other ideas that are generally applicable in a similar way. 

 

(Sorry if this has been asked before. I tried searching for it and didn't find anything, but it's possible that my phrasing was off and I missed it). 

[memetic status: stating directly, despite it being a clear consequence of core AI risk knowledge, because many people have "but nature will survive us" antibodies to other classes of doom and misapply them here.]

Unfortunately, no.[1]

Technically, “Nature”, meaning the fundamental physical laws, will continue. However, people usually mean forests, oceans, fungi, bacteria, and generally biological life when they say “nature”, and those would not have much chance competing against a misaligned superintelligence for resources like sunlight and atoms, which are useful to both biological and artificial systems.

There’s a thought that comforts many people when they imagine humanity going extinct due to a nuclear catastrophe or runaway global warming: Once the mushroom clouds or CO2 levels have settled, nature will reclaim the cities. Maybe mankind in our hubris will have wounded Mother Earth and paid the price ourselves, but...

Dagon
It's always seemed strange to me what preferences people have for things well outside their own individual experiences, or at least outside their sympathized experiences of beings they consider similar to themselves. Why would one particularly prefer unthinking terrestrial biology (moss, bugs, etc.) over actual thinking being(s) like a super-AI?  It's not like bacteria are any more aligned than this hypothetical destroyer.
Nnotm

Bugs could potentially result in a new sentient species many millions of years down the line. With super-AI that happens to be non-sentient, there is no such hope.

plex
The space of values is large, and many people have crystalized into liking nature for fairly clear reasons (positive experiences in natural environments, memetics in many subcultures idealizing nature, etc). Also, misaligned, optimizing AI easily maps to the destructive side of humanity, which many memeplexes demonize.
Mateusz Bagiński
Note to the LW team: it might be worth considering making links to AI Safety Info live-previewable (like links to other LW posts/sequences/comments and Arbital pages), depending on how much effort it would take and how much linking to AISI on LW we expect in the future.

LessOnline Festival

May 31st to June 2nd, Berkeley, CA