I'm a fan of the Lightcone team & I think they're one of the few orgs where I'd basically just say "yeah, they should probably just get as much funding as they want."
With that in mind, I was surprised by the lack of information in this funding request. I feel mixed about this: high-status AIS orgs often (accurately) recognize that they don't really need to spend time justifying their funding requests, but I think this often harms community epistemics (e.g., by leading to situations where everyone is like "oh X org is great-- I totally support them" without actually knowing much about what work they're planning to do, what models they have, etc.)
Sorry about that! I've drafted like 3-4 different fundraising posts over th...
I generally don't find writeups of standards useful, but this piece was an exception. Below, I'll try to articulate why:
I think AI governance pieces-- especially pieces about standards-- often use overly vague language. People say things like "risk management practices" or "third-party audits"-- umbrella terms that lack specificity. These sometimes serve as applause lights (whether or not the author intended this): who could really disagree with the idea of risk management?
I liked that this piece (fairly unapologetically) advocat...
Congratulations on launching!
On the governance side, one question I'd be excited to see Apollo (and ARC evals & any other similar groups) think/write about is: what happens after a dangerous capability eval goes off?
Of course, the actual answer will be shaped by the particular climate/culture/zeitgeist/policy window/lab factors that are impossible to fully predict in advance.
But my impression is that this question is relatively neglected, and I wouldn't be surprised if sharp newcomers were able to meaningfully improve the community's thinking on this.
Excited to see this! I'd be most excited about case studies of standards in fields where people didn't already have clear ideas about how to verify safety.
In some areas, it's pretty clear what you're supposed to do to verify safety. Everyone (more-or-less) agrees on what counts as safe.
One of the biggest challenges with AI safety standards will be the fact that no one really knows how to verify that a (sufficiently-powerful) system is safe. And a lot of experts disagree on the type of evidence that would be sufficient.
Are there examples of standards in oth...
I've been working on a response to the NTIA request for comments on AI Accountability over the last few months. It's likely that I'll also submit something to the OSTP request.
I've learned a few useful things from talking to AI governance and policy folks. Some of it is fairly intuitive but still worth highlighting (e.g., try to avoid jargon, remember that the reader doesn't share many assumptions that people in AI safety take for granted, remember that people have many different priorities). Some of it is less intuitive (e.g., what actually happens with t...
Glad to see this write-up & excited for more posts.
I think these are three areas that MATS feels like it has handled fairly well. I'd be especially excited to hear more about areas where MATS thinks it's struggling, MATS is uncertain, or where MATS feels like it has a lot of room to grow. Potential candidates include:
(Pasting this exchange from a comment thread on the EA Forum; bolding added)
Peter Park:
...Thank you so much for your insightful and detailed list of ideas for AGI safety careers, Richard! I really appreciate your excellent post.
I would propose explicitly grouping some of your ideas and additional ones under a third category: “identifying and raising public awareness of AGI’s dangers.” In fact, I think this category may plausibly contain some of the most impactful ideas for reducing catastrophic and existential risks, given that alignment seems potentially dif
Nice-- very relevant. I agree with Evan that arguments about the training procedure will be relevant (I'm more uncertain about whether checking for deception behaviorally will be harder than avoiding it, but it certainly seems plausible).
Ideally, I think the regulators would be flexible in the kind of evidence they accept. If a developer has evidence that the model is not deceptive that relies on details about the training procedure, rather than behavioral testing, that could be sufficient.
(In fact, I think arguments that meet some sort of "beyond-a-reasonable-doubt" threshold would likely involve providing arguments for why the training procedure avoids deceptive alignment.)
Can you say more about what part of this relates to a ban on AI development?
I think the claim "AI development should be regulated in a way such that the burden of proof is on developers to show beyond-a-reasonable-doubt that models are safe" seems quite different from the claim "AI development should be banned", but it's possible that I'm missing something here or communicating imprecisely.
This makes sense. Can you say more about how aviation regulation differs from the FDA?
In other words, are there meaningful differences in how the regulatory processes are set up? Or does it just happen to be the case that the FDA has historically been worse at responding to evidence compared to the Federal Aviation Administration?
(I think it's plausible that we would want a structure similar to the FDA even if the particular individuals at the FDA were bad at cost-benefit analysis, unless there are arguments that the structure of the FDA caused the bad cost-benefit analyses).
My understanding of your claim is something like:
I don't have strong takes on claim 2, but I think I (at least at first glance) disagree with claim 1. It seems quite plausible to imagine international cooperation without requiring strong domestic AI safety communities in each country that opts-in to the agreement. If the US tried sufficiently hard, and was willing t...
Copying over my comment from the EA Forum version.
I think it's great that you're releasing some posts that criticize/red-team some major AIS orgs. It's sad (though understandable) that you felt like you had to do this anonymously.
I'm going to comment a bit on the Work Culture Issues section. I've spoken to some people who work at Redwood, have worked at Redwood, or considered working at Redwood.
I think my main comment is something like you've done a good job pointing at some problems, but I think it's pretty hard to figure out what should be done abo...
if you're not willing to engage with people who give clearly genuine and high effort discussion about why they think the policy is unnecessary
Briefly noting that the policy "I will not respond to every single high-effort criticism I receive" is very different from "I am not willing to engage with people who give high-effort criticism."
And the policy "sometimes I will ask people who write high-effort criticism to point me to their strongest argument and then I will engage with that" is also different from the two policies mentioned above.
I think I agree with a lot of the specific points raised here, but I notice a feeling of wariness/unease around the overall message. I had a similar reaction to Haydn's recent "If your model is going to sell, it has to be safe" piece. Let me try to unpack this:
On one hand, I do think safety is important for the commercial interests of labs. And broadly being better able to understand/control systems seems good from a commercial standpoint.
My biggest reservations can be boiled down into two points:
It seems to me like one (often obscured) reason for the disagreement between Thomas and Habryka is that they are thinking about different groups of people when they define "the field."
To assess the % of "the field" that's doing meaningful work, we'd want to do something like [# of people doing meaningful work]/[total # of people in the field].
Who "counts" in the denominator? Should we count anyone who has received a grant from the LTFF with the word "AI safety" in it? Only the ones who have contributed object-level work? Only the ones who have contributed ...
Quick note that this is from a year ago: March 4, 2022. (Might be good to put this on top of the post so people don't think it's from 2023).
I think a lot of threat models (including modern threat models) are found in, or heavily inspired by, old MIRI papers. I also think MIRI papers provide unusually clear descriptions of the alignment problem, why MIRI expects it to be hard, and why MIRI thinks intuitive ideas won't work (see e.g., Intelligence Explosion: Evidence and Import, Intelligence Explosion Microeconomics, and Corrigibility).
Regarding more recent stuff, MIRI has been focusing less on research output and more on shaping discussion around alignment. They are essentially "influence...
With my comments, I was hoping to spark more of a back-and-forth. Having failed at that, I'm guessing part of the problem is that I didn't phrase my disagreements bluntly or strongly enough, while also noting various points of agreement, which might have overall made it sound like I had only minor disagreements.
Did you ask for more back-and-forth, or were you hoping Sam would engage in more back-and-forth without being explicitly prompted?
If it's the latter, I think the "maybe I made it seem like I only had minor disagreements" hypothesis is less likely th...
Does this drive a "race to the bottom," where more lenient evals teams get larger market share
I appreciate you asking this, and I find this failure mode plausible. It reminds me of one of the failure modes I listed here (where a group proposing strict evals gets outcompeted by a group proposing looser evals).
...Governance failure: We are outcompeted by a group that develops (much less demanding) evals/standards (~10%). Several different groups develop safety standards for AI labs. One group has expertise in AI privacy and data monitoring, another has expertis
However, many of these people might not have a sufficient “toolbox” or research experience to have much marginal impact in short timelines worlds.
I think this is true for some people, but I also think people tend to overestimate the amount of years it takes to have enough research experience to contribute.
I think a few people have been able to make useful contributions within their first year (though in fairness they generally had backgrounds in ML or AI, so they weren't starting completely from scratch), and several highly respected senior researche...
I appreciate the comment and think I agree with most of it. Was there anything in the post that seemed to disagree with this reasoning?
I downvoted the post because I don't think it presents strong epistemics. Some specific critiques:
In other words, I now believe a significant probability, on the order of 50-70%, that alignment is solved by default.
Let's suppose that you are entirely right about deceptive alignment being unlikely. (So we'll set aside things like "what specific arguments caused you to update?" and tricky questions about modest epistemology/outside views).
I don't see how "alignment is solved by default with 30-50% probability" justifies claims like "capabilities progress is net positive" or "AI alignment should change purpose to something else."
If a doctor told me I had a...
I don't agree with everything in the post, but I do commend Sam for writing it. I think it's a rather clear and transparent post that summarizes some important aspects of his worldview, and I expect posts like this to be extremely useful for discourse about AI safety.
Here are three parts I found especially clear & useful to know:
Thoughts on safety standards
...We think it’s important that efforts like ours submit to independent audits before releasing new systems; we will talk about this in more detail later this year. At some point, it may be important to
Nevertheless, my guess is that it's more dignified for us to have these sorts of reporting systems than to not have them
Can you elaborate on this one? (I don't have a strong opinion one way or the other; seems unclear to me. If this system had been in place before Bing, and it had properly fixed all the issues with Bing, it seems plausible to me that this would've been net negative for x-risk reduction. The media coverage on Bing seems good for getting people to be more concerned about alignment and AI safety, reducing trust in a "we'll just figure it out ...
It does! I think I'd make it more explicit, though, that the post focuses on the views/opinions of people at Anthropic. Maybe something like this (new text in bold):
...This post is the first half of a series about my attempt to understand Anthropic’s current strategy, and lay out the facts to consider in terms of whether Anthropic’s work is likely to be net positive and whether, as a given individual, you should consider applying. (The impetus for looking into this was to answer the question of whether I should join Anthropic's ops team.) As part of my research
+1. I think this framing is more accurate than the current first paragraph (which, in my reading of it, seems to promise a more balanced and comprehensive analysis).
Even when these discussions don't produce agreement, do you think they're helpful for the community?
I've spoken to several people who have found the MIRI dialogues useful as they enter the field, understand threat models, understand why people disagree, etc. It seems not-crazy to me that most of the value in these dialogues comes from their effects on the community (as opposed to their effects on the participants).
Three other thoughts/observations:
I was surprised at how low the hour estimates were, particularly for the OP people (especially Holden) and even for Paul.
Maybe worth keeping in mind that Nate isn't the only MIRI person who's spent lots of hours on this (e.g., Eliezer and Benya have as well), and the numbers only track Nate-time.
Also maybe worth keeping in mind the full list of things that need doing in the world. This is one of the key important leveraged things that needs doing, so it's easy to say "spend more time on it". But spending a thousand hours (so, like, a good chunk of a year w...
Even when these discussions don't produce agreement, do you think they're helpful for the community?
I've spoken to several people who have found the MIRI dialogues useful as they enter the field, understand threat models, understand why people disagree, etc. It seems not-crazy to me that most of the value in these dialogues comes from their effects on the community (as opposed to their effects on the participants).
IMO having and releasing those dialogues was one of the most obviously useful things MIRI has done to date, and I'm super happy with them....
+1. A few other questions I'm interested in:
Thank you for sharing this; I'd be excited to see more writeups that attempt to analyze the strategy of AI labs.
This post is my attempt to understand Anthropic’s current strategy, and lay out the facts to consider in terms of whether Anthropic’s work is likely to be net positive and whether, as a given individual, you should consider applying.
I found that this introduction raised my expectations for the post and misled me a bit. After reading the introduction, I was expecting to see more analysis of the pros and cons of Anthropic's strategy, as well as mor...
+1. I'll note though that there are some socially acceptable ways of indicating "smarter" (e.g., better reasoning, better judgment, better research taste). I was on the lookout for these kinds of statements, and I rarely found them. The closest thing that came up commonly was the "strong and concrete models of AI safety" (which could be loosely translated into "having better and smarter thoughts about alignment").
I appreciate this post and your previous post. Fwiw, I think these terminology concerns/confusions are harming discourse on AI existential safety, and I expect posts like these to help people talk-past-each-other less, notice subtle distinctions, deconfuse more quickly, etc.
(I especially like the point about how increasing intent alignment on the margin doesn't necessarily help much with increasing intent alignment in the limit. Some version of this idea has come up a few times in discussions about OpenAI's alignment plan, and the way you presented it makes the point clearer/crisper imo).
Great post. I expect to recommend it at least 10 times this year.
Semi-related point: I often hear people get discouraged when they don't have "good ideas" or "ideas that they believe in" or "ideas that they are confident would actually reduce x-risk." (These are often people who see the technical alignment problem as Hard or Very Hard).
I'll sometimes ask "how many other research agendas do you think meet your bar for "an idea you believe in" or "an idea that you are confident would actually reduce x-risk?" Often, when considering the entire field of ...
Good question. I'm using the term "idea" pretty loosely, glossing over details.
Things that would meet this vague definition of "idea":
I spent roughly the first half to three-quarters of 2022 focused on AIS field-building projects. In the last few months, I've been focusing more on understanding AI risk threat models & strategy/governance research projects.
Before 2022, I was a PhD student researching scalable mental health interventions (see here).
Thanks for sharing this. I've been looking forward to your thoughts on OpenAI's plan, and I think you presented them succinctly/clearly. I found the "evaluation vs generation" section particularly interesting/novel.
One thought: I'm currently not convinced that we would need general intelligence in order to generate new alignment ideas.
...The alignment problem needs high general intelligence, because it needs new ideas for solving alignment. It won’t be enough to input all the math around the alignment problem and have the AI solve that. It's a great improveme
I appreciate you writing this. I found myself agreeing with much of it. The post also helped me notice some feeling of "huh, something seems missing... there's something I think this isn't capturing... but what is it?" I haven't exactly figured out where that feeling is coming from, so apologies if this comment ends up being incoherent or tangential. But you've inspired me to try to double-click on it, so here it goes :)
Suppose I meet someone for the first time, and I begin to judge them for reasons that I don't endorse. For example, maybe something about...
Thank you for releasing this dialogue-- lots of good object-level stuff here.
In addition, I think Scott showcased some excellent conversational moves. He seemed very good at prompting Yudkowsky well, noticing his own confusions, noticing when he needed to pause/reflect before continuing with a thread, and prioritizing between topics.
I hope that some of these skills are learnable. I expect the general discourse around alignment would be more productive if more people tried to emulate some of Scott's magic.
Some examples that stood out to me:
Ackno...
I think this is a reasonable critique.
The particular friend I refer to is unusually good at distilling things in ways that I find actionable/motivating, which might bias me a bit.
But of course it depends on the book and the topic and the person, and it would be unwise to think that most books could be easily summarized like this.
Notably, I think that many of the things that people commonly worry about RE status are easier to summarize than books. Examples:
I read this post for the first time in 2022, and I came back to it at least twice.
What I found helpful
Thank you for sharing! I found these two quotes to be the most interesting (bolding added by me):
...Yeah that was my earlier point, I think society should regulate what the wide bounds are, but then I think individual users should have a huge amount of liberty to decide how they want their experience to go. So I think it is like a combination of society -- you know there are a few asterisks on the free speech rules -- and society has decided free speech is not quite absolute. I think society will also decide language models are not quite absolute. But there i
Ah, thanks for the clarifications. I agree with the clarified versions :)
Quick note on getting senior researchers:
A few thoughts:
I agree that it would be great to have more senior researchers in alignment
I agree that, ideally, it would be easier for independent researchers to get funding.
I don’t think it’s necessarily a bad thing that the field of AI alignment research is reasonably competitive.
My impression is that there’s still a lot of funding (and a lot of interest in funding) independent alignment researchers.
My impression is that it’s still considerably easier to get funding for independent alignment research than many other forms of independent n
Thanks for writing this up. I agree with several of the subpoints you make about how the plan could be more specific, measurable, etc.
I'm not sure where I stand on some of the more speculative (according to me) claims about OpenAI's intentions. Put differently, I see your post making two big-picture claims:
Reviewing this quickly because it doesn't have a review.
I've linked this post to several people in the last year. I think it's valuable for people (especially junior researchers or researchers outside of major AIS hubs) to be able to have a "practical sense" of what doing independent alignment research can be like, how the LTFF grant application process works, and some of the tradeoffs of doing this kind of work.
This seems especially important for independent conceptual work, since this is the path that is least well-paved (relative to empirical work...
ELK was one of my first exposures to AI safety. I participated in the ELK contest shortly after moving to Berkeley to learn more about longtermism and AI safety. My review focuses on ELK’s impact on me, as well as my impressions of how ELK affected the Berkeley AIS community.
Understanding ARC’s research methodology & the builder-breaker format. For me, most of the value of ELK came from seeing ELK’s builder-breaker research methodology in action. Much of the report focuses on presenting training strategies and pres...
Thanks for this post; it's probably my favorite Cold Takes post from the last few months. I appreciated the specific scenario, as well as the succinct points in the "we can do better" section. I felt like I could get a more concrete understanding of your worldview, how you think we should move forward, and the reasons why. I'm also glad that you're thinking critically about standards and monitoring.
...For a simple example, imagine an AI company in a dominant market position - months ahead of all of the competition, in some relevant sense (e.g., its AI systems
How would you feel about a review process that had two sections?
Section One: How important do you find this work & to what extent do you think the research is worth doing? (Ex: Does it strike at what you see as core alignment problems?)
Section Two: What do you think of the details of the research? (Ex: Do you see any methodological flaws, do you have any ideas for further work, etc).
My impression is that academic peer-reviewers generally do both of these. Compared to academic peer-review, LW/AF discussions tend to have a lot of Section One and not much...
Thanks for this detailed response; I found it quite helpful. I maintain my "yeah, they should probably get as much funding as they want" stance. I'm especially glad to see that Lightcone might be interested in helping people stay sane/grounded as many people charge into the policy space.
This ...