All of Akash's Comments + Replies

Thanks for this detailed response; I found it quite helpful. I maintain my "yeah, they should probably get as much funding as they want" stance. I'm especially glad to see that Lightcone might be interested in helping people stay sane/grounded as many people charge into the policy space. 

I ended up deciding to instead publish a short post, expecting that people will write a lot of questions in the comments, and then to engage straightforwardly and transparently there, which felt like a way that was more likely to end up with shared understanding.

This ... (read more)

I'm a fan of the Lightcone team & I think they're one of the few orgs where I'd basically just say "yeah, they should probably just get as much funding as they want."

With that in mind, I was surprised by the lack of information in this funding request. I feel mixed about this: high-status AIS orgs often (accurately) recognize that they don't really need to spend time justifying their funding requests, but I think this often harms community epistemics (e.g., by leading to situations where everyone is like "oh X org is great-- I totally support them" wit... (read more)

With that in mind, I was surprised by the lack of information in this funding request. I feel mixed about this: high-status AIS orgs often (accurately) recognize that they don't really need to spend time justifying their funding requests, but I think this often harms community epistemics (e.g., by leading to situations where everyone is like "oh X org is great-- I totally support them" without actually knowing much about what work they're planning to do, what models they have, etc.)

Sorry about that! I've drafted like 3-4 different fundraising posts over th... (read more)

Akash8dΩ270

I generally don't find writeups of standards useful, but this piece was an exception. Below, I'll try to articulate why:

I think AI governance pieces-- especially pieces about standards-- often have overly vague language. People say things like "risk management practices" or "third-party audits", phrases that are generally umbrella terms that lack specificity. These sometimes serve as applause lights (whether the author intended this or not): who could really disagree with the idea of risk management?

I liked that this piece (fairly unapologetically) advocat... (read more)

2Olivia Jimenez7d
Thanks. Good suggestion, added table

Congratulations on launching!

On the governance side, one question I'd be excited to see Apollo (and ARC evals & any other similar groups) think/write about is: what happens after a dangerous capability eval goes off? 

Of course, the actual answer will be shaped by the particular climate/culture/zeitgeist/policy window/lab factors that are impossible to fully predict in advance.

But my impression is that this question is relatively neglected, and I wouldn't be surprised if sharp newcomers were able to meaningfully improve the community's thinking on this. 

2Lee Sharkey17d
Thanks Akash! I agree that this feels neglected. Markus Anderljung recently tweeted about some upcoming related work from Jide Alaga and Jonas Schuett: https://twitter.com/Manderljung/status/1663700498288115712 Looking forward to it coming out!

Excited to see this! I'd be most excited about case studies of standards in fields where people didn't already have clear ideas about how to verify safety.

In some areas, it's pretty clear what you're supposed to do to verify safety. Everyone (more-or-less) agrees on what counts as safe.

One of the biggest challenges with AI safety standards will be the fact that no one really knows how to verify that a (sufficiently-powerful) system is safe. And a lot of experts disagree on the type of evidence that would be sufficient.

Are there examples of standards in oth... (read more)

1Koen.Holtman20d
While overcoming expert disagreement is a challenge, it is not one that is as big as you think. TL;DR: Deciding not to agree is always an option. To expand on this: the fallback option in a safety standards creation process, for standards that aim to define a certain level of safe-enough, is as follows. If the experts involved cannot agree on any evidence-based method for verifying that a system X is safe enough according to the level of safety required by the standard, then the standard being created will simply, and usually implicitly, declare that there is no route by which system X can comply with the safety standard. If you are required by law, say by EU law, to comply with the safety standard before shipping a system into the EU market, then your only legal option will be to never ship that system X into the EU market. For AI systems you interact with over the Internet, this 'never ship' translates to 'never allow it to interact over the Internet with EU residents'. I am currently in the JTC21 committee which is running the above standards creation process to write the AI safety standards in support of the EU AI Act, the Act that will regulate certain parts of the AI industry, in case they want to ship legally into the EU market. ((Legal detail: if you cannot comply with the standards, the Act will give you several other options that may still allow you to ship legally, but I won't get into explaining all those here. These other options will not give you a loophole to evade all expert scrutiny.)) Back to the mechanics of a standards committee: if a certain AI technology, when applied in a system X, is well known to make that system radioactively unpredictable, it will not usually take long for the technical experts in a standards committee to come to an agreement that there is no way that they can define any method in the standard for verifying that X will be safe according to the standard. The radioactively unsafe cases are the easiest cases to handle. Th
1RogerDearnaley21d
I agree that, at least for the more serious risks, there doesn't seem to be consensus on what the mitigations should be. For example, I'd be interested to know what proportion of alignment researchers would consider an AGI that's a value learner (and of course has some initial model of human values created by humans to start that value learning process from) to have better outer-alignment safety properties than an AGI with a fixed utility function created by humans. For me it is very clear that the former is better, as it incentivizes the AGI to converge from its initial model of human values towards true human values, allowing it to fix problems when the initial model, say, goes out-of-distribution or doesn't have sufficient detail. But I have no idea how much consensus there is on this, and I see a lot of alignment researchers working on approaches that don't appear to assume that the AI system is a value learner.

I've been working on a response to the NTIA request for comments on AI Accountability over the last few months. It's likely that I'll also submit something to the OSTP request.

I've learned a few useful things from talking to AI governance and policy folks. Some of it is fairly intuitive but still worth highlighting (e.g., try to avoid jargon, remember that the reader doesn't share many assumptions that people in AI safety take for granted, remember that people have many different priorities). Some of it is less intuitive (e.g., what actually happens with t... (read more)

Glad to see this write-up & excited for more posts.

I think these are three areas that MATS feels like it has handled fairly well. I'd be especially excited to hear more about areas where MATS thinks it's struggling, MATS is uncertain, or where MATS feels like it has a lot of room to grow. Potential candidates include:

  • How is MATS going about talent selection and advertising for the next cohort, especially given the recent wave of interest in AI/AI safety?
  • How does MATS intend to foster (or recruit) the kinds of qualities that strong researchers often pos
... (read more)
1Ryan Kidd1mo
  • We broadened our advertising approach for the Summer 2023 Cohort, including a Twitter post [https://twitter.com/ryan_kidd44/status/1644817241664536578] and a shout-out on Rob Miles' YouTube [https://www.youtube.com/shorts/twMqHDXO29U] and TikTok [https://www.tiktok.com/@robertmilesai/video/7227141622566276378?lang=en] channels. We expected some lowering of average applicant quality as a result but have yet to see a massive influx of applicants from these sources. We additionally focused more on targeted advertising to AI safety student groups, given their recent growth. We will publish updated applicant statistics after our applications close.
  • In addition to applicant selection and curriculum elements, our Scholar Support staff, introduced in the Winter 2022-23 Cohort, supplement the mentorship experience by providing 1-1 research strategy and unblocking support for scholars. This program feature aims to:
      • Supplement and augment mentorship with 1-1 debugging, planning, and unblocking;
      • Allow air-gapping of evaluation and support [https://www.lesswrong.com/posts/ehLR9HeXB5TMp9Y4v/air-gapping-evaluation-and-support], improving scholar outcomes by resolving issues they would not take to their mentor;
      • Solve scholars’ problems, giving more time for research.
  • Defining "good alignment research" is very complicated and merits a post of its own (or two, if you also include the theories of change that MATS endorses). We are currently developing scholar research ability through curriculum elements focused on breadth, depth, and epistemology (the "T-model of research"):
      • Breadth-first search (literature reviews, building a "toolbox" of knowledge [https://www.lesswrong.com/posts/bjjbp5i5G8bekJuxv/study-guide], noticing gaps [https://www.lesswrong.com/s/zpCiuR4T343j9WkcK]);
      • Depth-first search (forming testable hypotheses, project-specific skills, executing research

(Pasting this exchange from a comment thread on the EA Forum; bolding added)

Peter Park:

Thank you so much for your insightful and detailed list of ideas for AGI safety careers, Richard! I really appreciate your excellent post.

I would propose explicitly grouping some of your ideas and additional ones under a third category: “identifying and raising public awareness of AGI’s dangers.” In fact, I think this category may plausibly contain some of the most impactful ideas for reducing catastrophic and existential risks, given that alignment seems potentially dif

... (read more)

Nice-- very relevant. I agree with Evan that arguments about the training procedure will be relevant (I'm more uncertain about whether checking for deception behaviorally will be harder than avoiding it, but it certainly seems plausible). 

Ideally, I think the regulators would be flexible in the kind of evidence they accept. If a developer has evidence that the model is not deceptive that relies on details about the training procedure, rather than behavioral testing, that could be sufficient.

(In fact, I think arguments that meet some sort of "beyond-a-reasonable-doubt" threshold would likely involve providing arguments for why the training procedure avoids deceptive alignment.)

Can you say more about what part of this relates to a ban on AI development?

I think the claim "AI development should be regulated in a way such that the burden of proof is on developers to show beyond-a-reasonable-doubt that models are safe" seems quite different from the claim "AI development should be banned", but it's possible that I'm missing something here or communicating imprecisely. 

4aphyer2mo
Apologies, I was a bit blunt here. It seems to me that the most obvious reading of "the burden of proof is on developers to show beyond-a-reasonable-doubt that models are safe" is in fact "all AI development is banned". It's...not clear at all to me what a proof of a model being safe would even look like, and based on everything I've heard about AI Alignment (admittedly mostly from elsewhere on this site) it seems that no-one else knows either. A policy of 'developers should have to prove that their models are safe' would make sense in a world where we had a clear understanding that some types of model were safe, and wanted to make developers show that they were doing the safe thing and not the unsafe thing. Right now, to the best of my understanding, we have no idea what is safe and what isn't. If you have some idea of what a 'proof of safety' would look like under your system, could you say more about that? Are there any existing AI systems you think can satisfy this requirement? From my perspective the most obvious outcomes of a burden-of-proof policy like you describe seem to be:
  • If it is interpreted literally and enforced as written, it will in fact be a full ban on AI development. Actually proving an AI system to be safe is not something we can currently do.
  • Many possible implementations of it would not in fact ban AI development, but it's not clear that what they would do would actually relate to safety. For instance, I can easily imagine outcomes like:
      • AI developers are required to submit a six-thousand-page 'proof' of safety to the satisfaction of some government bureau. This would work out to something along the lines of 'only large companies with compliance departments can develop AI', which might be beneficial under some sets of assumptions that I do not particularly share?
      • AI developers are required to prove some narrow thing about their AI (e.g. that their AI will never output a racial

This makes sense. Can you say more about how aviation regulation differs from the FDA?

In other words, are there meaningful differences in how the regulatory processes are set up? Or does it just happen to be the case that the FDA has historically been worse at responding to evidence compared to the Federal Aviation Administration? 

(I think it's plausible that we would want a structure similar to the FDA even if the particular individuals at the FDA were bad at cost-benefit analysis, unless there are arguments that the structure of the FDA caused the bad cost-benefit analyses).

2Marius Hobbhahn2mo
So far, I haven't looked into it in detail and I'm only reciting other people's testimonials. I intend to dive deeper into these fields soon. I'll let you know when I have a better understanding.  

My understanding of your claim is something like:

  • Claim 1: Cooperation with China would likely require a strong Chinese AI safety community
  • Claim 2: The Chinese AI safety community is weak
  • Conclusion: Therefore, cooperation with China is infeasible

I don't have strong takes on claim 2, but I think I (at least at first glance) disagree with claim 1. It seems quite plausible to imagine international cooperation without requiring strong domestic AI safety communities in each country that opts in to the agreement. If the US tried sufficiently hard, and was willing t... (read more)

6Lao Mein2mo
My claim is that AI safety isn't part of the Chinese gestalt. It's like America asking China to support Israel because building the Third Temple will bring the Final Judgement. Chinese leadership don't have AI safety as a real concern. Chinese researchers who help advise Chinese leadership don't have AI safety as a real concern. At most they consider it like the new land acknowledgments - another box they have to check off in order to interface with Western academia. Just busy work that they privately consider utterly deranged.

Copying over my comment from the EA Forum version.

I think it's great that you're releasing some posts that criticize/red-team some major AIS orgs. It's sad (though understandable) that you felt like you had to do this anonymously. 

I'm going to comment a bit on the Work Culture Issues section. I've spoken to some people who work at Redwood, have worked at Redwood, or considered working at Redwood.

I think my main comment is something like you've done a good job pointing at some problems, but I think it's pretty hard to figure out what should be done abo... (read more)

if you're not willing to engage with people who give clearly genuine and high effort discussion about why they think the policy is unnecessary

Briefly noting that the policy "I will not respond to every single high-effort criticism I receive" is very different from "I am not willing to engage with people who give high-effort criticism."

And the policy "sometimes I will ask people who write high-effort criticism to point me to their strongest argument and then I will engage with that" is also different from the two policies mentioned above.

I think I agree with a lot of the specific points raised here, but I notice a feeling of wariness/unease around the overall message. I had a similar reaction to Haydn's recent "If your model is going to sell, it has to be safe" piece. Let me try to unpack this:

On one hand, I do think safety is important for the commercial interests of labs. And broadly being better able to understand/control systems seems good from a commercial standpoint.

My biggest reservations can be boiled down into two points: 

  1. I don't think that commercial incentives will be enoug
... (read more)
1NickGabs3mo
I agree with aspects of this critique. However, to steelman Leopold, I think he is not just arguing that demand-driven incentives will drive companies to solve alignment due to consumers wanting safe systems, but rather that, over and above ordinary market forces, constraints imposed by governments, media/public advocacy, and perhaps industry-side standards will make it such that it is ~impossible to release a very powerful, unaligned model. I think this points to a substantial underlying disagreement in your models - Leopold thinks that governments and the public will "wake up" sufficiently quickly to catastrophic risk from AI such that there will be regulatory and PR forces which effectively prevent the release of misaligned models, including evals/ways of detecting misalignment that are more robust than those that might be used by ordinary consumers (which could as you point out likely be fooled by surface-level alignment due to RLHF).

It seems to me like one (often obscured) reason for the disagreement between Thomas and Habryka is that they are thinking about different groups of people when they define "the field."

To assess the % of "the field" that's doing meaningful work, we'd want to do something like [# of people doing meaningful work]/[total # of people in the field].

Who "counts" in the denominator? Should we count anyone who has received a grant from the LTFF with the word "AI safety" in it? Only the ones who have contributed object-level work? Only the ones who have contributed ... (read more)
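To make the denominator point concrete, here is a minimal Python sketch. Every number and group label in it is a made-up placeholder chosen only for illustration; none of it comes from Thomas, Habryka, or LTFF data. It holds the numerator fixed and shows how the resulting percentage moves when "the field" is defined more or less broadly.

```python
def share_doing_meaningful_work(meaningful: int, field_size: int) -> float:
    """Fraction of 'the field' doing meaningful work, for one definition of the field."""
    if field_size <= 0:
        raise ValueError("field_size must be positive")
    return meaningful / field_size


# Hypothetical numerator: people judged to be doing meaningful work.
meaningful_people = 50

# Hypothetical denominators: two different ways of deciding who "counts".
field_definitions = {
    "anyone who has received an AI safety grant": 500,
    "only people who have contributed object-level work": 200,
}

shares = {
    label: share_doing_meaningful_work(meaningful_people, size)
    for label, size in field_definitions.items()
}

for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```

With these placeholder inputs, the same 50 people come out as 10% of the field under the broad definition and 25% under the narrow one, so two people can agree on the numerator and still report very different percentages.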

2M. Y. Zuo3mo
This sounds like a solid explanation for the difference for someone totally uninvolved with the Berkeley scene.  Though I'm surprised there's no broad consensus on even basic things like this in 2023.  In game terms, if everyone keeps their own score separately then it's no wonder a huge portion of effort will, in aggregate, go towards min-maxing the score tracking meta-game. 

Quick note that this is from a year ago: March 4, 2022. (Might be good to put this on top of the post so people don't think it's from 2023). 

3remember3mo
Yes, good call! Added it.
Answer by AkashMar 07, 2023184

I think a lot of threat models (including modern threat models) are found in, or heavily inspired by, old MIRI papers. I also think MIRI papers provide unusually clear descriptions of the alignment problem, why MIRI expects it to be hard, and why MIRI thinks intuitive ideas won't work (see e.g., Intelligence Explosion: Evidence and Import, Intelligence Explosion Microeconomics, and Corrigibility). 

Regarding more recent stuff, MIRI has been focusing less on research output and more on shaping discussion around alignment. They are essentially "influence... (read more)

With my comments, I was hoping to spark more of a back-and-forth. Having failed at that, I'm guessing part of the problem is that I didn't phrase my disagreements bluntly or strongly enough, while also noting various points of agreement, which might have overall made it sound like I had only minor disagreements.

Did you ask for more back-and-forth, or were you hoping Sam would engage in more back-and-forth without being explicitly prompted?

If it's the latter, I think the "maybe I made it seem like I only had minor disagreements" hypothesis is less likely th... (read more)

Answer by AkashMar 03, 202330

Does this drive a "race to the bottom," where more lenient evals teams get larger market share

I appreciate you asking this, and I find this failure mode plausible. It reminds me of one of the failure modes I listed here (where a group proposing strict evals gets outcompeted by a group proposing looser evals).

Governance failure: We are outcompeted by a group that develops (much less demanding) evals/standards (~10%). Several different groups develop safety standards for AI labs. One group has expertise in AI privacy and data monitoring, another has expertis

... (read more)

However, many of these people might not have a sufficient “toolbox” or research experience to have much marginal impact in short timelines worlds.

I think this is true for some people, but I also think people tend to overestimate the number of years it takes to have enough research experience to contribute.

I think a few people have been able to make useful contributions within their first year (though in fairness they generally had backgrounds in ML or AI, so they weren't starting completely from scratch), and several highly respected senior researche... (read more)

I appreciate the comment and think I agree with most of it. Was there anything in the post that seemed to disagree with this reasoning?

I downvoted the post because I don't think it presents strong epistemics. Some specific critiques:

  • The author doesn't explain the reasoning that produced the updates. (They link to posts, but I don't think it's epistemically sound to link to say "I made updates and you can find the reasons why in these posts." At best, people read the posts, and then come away thinking "huh, I wonder which of these specific claims/arguments were persuasive to the poster.")
  • The author recommends policy changes (to LW and the field of alignment) that (in my opinion) don't seem
... (read more)

In other words, I now believe a significant probability, on the order of 50-70%, that alignment is solved by default.

Let's suppose that you are entirely right about deceptive alignment being unlikely. (So we'll set aside things like "what specific arguments caused you to update?" and tricky questions about modest epistemology/outside views).

I don't see how "alignment is solved by default with 30-50% probability" justifies claims like "capabilities progress is net positive" or "AI alignment should change purpose to something else."

If a doctor told me I had a... (read more)

1Noosphere893mo
I'll make another version of the thought experiment, in which we can get a genetic upgrade that gives you +1000 utils with a 70% chance, or -1000 utils with a 30% chance. Should you take it? The answer is yes, in expectation, and it will give you +400 utils in expectation. This is related to a general principle: As long as the probabilities of positive outcomes are over 50% and the costs and benefits are symmetrical, it is a good thing to do that activity. And my contention is that AGI/ASI is just a larger version of the thought experiment above. AGI/ASI is a symmetric technology wrt good and bad outcomes, so that's why it's okay to increase capabilities.
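The arithmetic in this thought experiment can be checked with a short Python sketch. The helper name `expected_utility` is my own, and the second, asymmetric scenario uses a hypothetical downside of -3000 utils that does not appear anywhere in the thread; it is included only to show how much the stated principle leans on the symmetry assumption.

```python
def expected_utility(outcomes):
    """Return the probability-weighted sum of utilities.

    `outcomes` is a list of (probability, utility) pairs whose
    probabilities must sum to 1.
    """
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * u for p, u in outcomes)


# The symmetric bet from the comment: +1000 utils at 70%, -1000 utils at 30%.
symmetric_ev = expected_utility([(0.7, 1000), (0.3, -1000)])

# Hypothetical asymmetric variant (not from the thread): same odds,
# but the bad outcome costs three times what the good outcome gains.
asymmetric_ev = expected_utility([(0.7, 1000), (0.3, -3000)])

print(symmetric_ev, asymmetric_ev)
```

The symmetric case reproduces the +400 utils figure from the comment. In the hypothetical asymmetric case the expectation turns negative even though the good outcome is still the more likely one, so the ">50% means do it" rule only holds if the symmetry premise does.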

I don't agree with everything in the post, but I do commend Sam for writing it. I think it's a rather clear and transparent post that summarizes some important aspects of his worldview, and I expect posts like this to be extremely useful for discourse about AI safety.

Here are three parts I found especially clear & useful to know:

Thoughts on safety standards

We think it’s important that efforts like ours submit to independent audits before releasing new systems; we will talk about this in more detail later this year. At some point, it may be important to

... (read more)
4M. Y. Zuo4mo
This sounds very sensible. Does anyone know what this 'best safety work' he's referring to is?
0Yitz4mo
+1 here

Nevertheless, my guess is that it's more dignified for us to have these sorts of reporting systems than to not have them

Can you elaborate on this one? (I don't have a strong opinion one way or the other; seems unclear to me. If this system had been in place before Bing, and it had properly fixed all the issues with Bing, it seems plausible to me that this would've been net negative for x-risk reduction. The media coverage on Bing seems good for getting people to be more concerned about alignment and AI safety, reducing trust in a "we'll just figure it out ... (read more)

4Vaniver4mo
I think reducing trust in "we'll just figure it out as we go" while still operating under that mentality is bad; I think steps like this are how we stop operating under that mentality. [Was it the case that nothing like this would happen in a widespread way until high profile failures, because of the lack of external pressure? Maybe.] I think users being able to report problems doesn't help with x-risk-related problems. (The issue will be when these systems stop sending bug reports!) I nevertheless think having systems for users to report issues will be a step in the right direction, even if it doesn't get us all the way. This seems right and is good to point out; but it wouldn't surprise me if the right place for a lot of safety-minded folk to be is non-profits with broad government/industry backing that serve valuable infrastructure roles, rather than just standing athwart history yelling "stop!". [How do we get that backing? Well, that's the challenge.]

It does! I think I'd make it more explicit, though, that the post focuses on the views/opinions of people at Anthropic. Maybe something like this (new text in bold):

This post is the first half of a series about my attempts to understand Anthropic’s current strategy, and lay out the facts to consider in terms of whether Anthropic’s work is likely to be net positive and whether, as a given individual, you should consider applying. (The impetus for looking into this was to answer the question of whether I should join Anthropic's ops team.) As part of my research

... (read more)

+1. I think this framing is more accurate than the current first paragraph (which, in my reading of it, seems to promise a more balanced and comprehensive analysis).

4Swimmer963 (Miranda Dixon-Luinenburg) 4mo
Edited first line, which hopefully clarifies this better. 

Even when these discussions don't produce agreement, do you think they're helpful for the community?

I've spoken to several people who have found the MIRI dialogues useful as they enter the field, understand threat models, understand why people disagree, etc. It seems not-crazy to me that most of the value in these dialogues comes from their effects on the community (as opposed to their effects on the participants). 

Three other thoughts/observations:

  1. I'd be interested in seeing a list of strategies/techniques you've tried and offering suggestions. (I be
... (read more)

I was surprised at how low the hour estimates were, particularly for the OP people (especially Holden) and even for Paul.

Maybe worth keeping in mind that Nate isn't the only MIRI person who's spent lots of hours on this (e.g., Eliezer and Benya have as well), and the numbers only track Nate-time.

Also maybe worth keeping in mind the full list of things that need doing in the world. This is one of the key important leveraged things that needs doing, so it's easy to say "spend more time on it". But spending a thousand hours (so, like, a good chunk of a year w... (read more)

Even when these discussions don't produce agreement, do you think they're helpful for the community?

I've spoken to several people who have found the MIRI dialogues useful as they enter the field, understand threat models, understand why people disagree, etc. It seems not-crazy to me that most of the value in these dialogues comes from their effects on the community (as opposed to their effects on the participants). 

IMO having and releasing those dialogues was one of the most obviously useful things MIRI has done to date, and I'm super happy with them.... (read more)

+1. A few other questions I'm interested in:

  • Which threat models is Anthropic most/least concerned about?
  • What are Anthropic's thoughts on AGI ruin arguments?
  • Would Anthropic merge-and-assist if another safety-conscious project comes close to building AGI?
  • What kind of evidence would update Anthropic away from (or more strongly toward) their current focus on empiricism/iteration?
  • What are some specific observations that made/continue-to-make Anthropic leadership concerned about OpenAI's commitment to safety?
  • What does Anthropic think about DeepMind's commitment
... (read more)
3catherio3mo
A "Core Views on AI Safety" post is now available at https://www.anthropic.com/index/core-views-on-ai-safety (Linkpost for that is here: https://www.lesswrong.com/posts/xhKr5KtvdJRssMeJ3/anthropic-s-core-views-on-ai-safety.)

Thank you for sharing this; I'd be excited to see more writeups that attempt to analyze the strategy of AI labs.

This post is my attempt to understand Anthropic’s current strategy, and lay out the facts to consider in terms of whether Anthropic’s work is likely to be net positive and whether, as a given individual, you should consider applying.

I found that this introduction raised my expectations for the post and misled me a bit. After reading the introduction, I was expecting to see more analysis of the pros and cons of Anthropic's strategy, as well as mor... (read more)

4Swimmer963 (Miranda Dixon-Luinenburg) 4mo
It's deliberate that this post covers mostly specifics that I learned from Anthropic staff, and further speculation is going to be in a separate later post. I wanted to make a really clear distinction between "these are things that were said to me about Anthropic by people who have context" (which is, for the most part, people in favor of Anthropic's strategy), and my own personal interpretation and opinion on whether Anthropic's work is net positive, which is filtered through my worldview and which I think most people at Anthropic would disagree with.  Part two is more critical, which means I want to write about it with a lot of effort and care, so I expect I'll put it up in a week or two. 

+1. I'll note though that there are some socially acceptable ways of indicating "smarter" (e.g., better reasoning, better judgment, better research taste). I was on the lookout for these kinds of statements, and I rarely found them. The closest thing that came up commonly was the "strong and concrete models of AI safety" (which could be loosely translated into "having better and smarter thoughts about alignment"). 

I appreciate this post and your previous post. Fwiw, I think these terminology concerns/confusions are harming discourse on AI existential safety, and I expect posts like these to help people talk-past-each-other less, notice subtle distinctions, deconfuse more quickly, etc. 

(I especially like the point about how increasing intent alignment on the margin doesn't necessarily help much with increasing intent alignment in the limit. Some version of this idea has come up a few times in discussions about OpenAI's alignment plan, and the way you presented it makes the point clearer/crisper imo). 

Great post. I expect to recommend it at least 10 times this year. 

Semi-related point: I often hear people get discouraged when they don't have "good ideas" or "ideas that they believe in" or "ideas that they are confident would actually reduce x-risk." (These are often people who see the technical alignment problem as Hard or Very Hard).

I'll sometimes ask "how many other research agendas do you think meet your bar for 'an idea you believe in' or 'an idea that you are confident would actually reduce x-risk'?" Often, when considering the entire field of ... (read more)

Good question. I'm using the term "idea" pretty loosely and glossily. 

Things that would meet this vague definition of "idea":

  • The ELK problem (like going from nothing to "ah, we'll need a way of eliciting latent knowledge from AIs")
  • Identifying the ELK program as a priority/non-priority (generating the arguments/ideas that go from "this ELK thing exists" to "ah, I think ELK is one of the most important alignment directions" or "nope, this particular problem/approach doesn't matter much")
  • An ELK proposal
  • A specific modification to an ELK proposal that makes
... (read more)

I spent the first half to three-quarters of 2022 focused on AIS field-building projects. In the last few months, I've been focusing more on understanding AI risk threat models & strategy/governance research projects.

Before 2022, I was a PhD student researching scalable mental health interventions (see here).

1theflowerpot4mo
Have you written or published any ML-related papers? Perhaps you are working on that now? Why did you choose to switch from mental health to AI Alignment?

Thanks for sharing this. I've been looking forward to your thoughts on OpenAI's plan, and I think you presented them succinctly/clearly. I found the "evaluation vs generation" section particularly interesting/novel.

One thought: I'm currently not convinced that we would need general intelligence in order to generate new alignment ideas.

The alignment problem needs high general intelligence, because it needs new ideas for solving alignment. It won’t be enough to input all the math around the alignment problem and have the AI solve that. It's a great improveme

... (read more)
1Shoshannah Tekofsky5mo
Thanks! And to both examples, how are you conceptualizing a "new idea"? Cause I suspect we don't have the same model on what an idea is.

I appreciate you writing this. I found myself agreeing with much of it. The post also helped me notice some feeling of "huh, something seems missing... there's something I think this isn't capturing... but what is it?" I haven't exactly figured out where that feeling is coming from, so apologies if this comment ends up being incoherent or tangential. But you've inspired me to try to double-click on it, so here it goes :) 

Suppose I meet someone for the first time, and I begin to judge them for reasons that I don't endorse. For example, maybe something about... (read more)

Thank you for releasing this dialogue-- lots of good object-level stuff here. 

In addition, I think Scott showcased some excellent conversational moves. He seemed very good at prompting Yudkowsky well, noticing his own confusions, noticing when he needed to pause/reflect before continuing with a thread, and prioritizing between topics. 

I hope that some of these skills are learnable. I expect the general discourse around alignment would be more productive if more people tried to emulate some of Scott's magic.

Some examples that stood out to me:

Ackno... (read more)

I think this is a reasonable critique.

The particular friend I refer to is unusually good at distilling things in ways that I find actionable/motivating, which might bias me a bit.

But of course it depends on the book and the topic and the person, and it would be unwise to think that most books could be easily summarized like this.

Notably, I think that many of the things that people commonly worry about RE status are easier to summarize than books. Examples:

  • Takeaways from a conference
  • Takeaways from a meeting with High-Status Person™
  • Takeaways from a Google
... (read more)

I read this post for the first time in 2022, and I came back to it at least twice. 

What I found helpful

  • The proposed solution: I actually do come back to the “honor” frame sometimes. I have little Rob Bensinger and Anna Salamon shoulder models that remind me to act with integrity and honor. And these shoulder models are especially helpful when I’m noticing (unhelpful) concerns about social status.
  • A crisp and community-endorsed statement of the problem: It was nice to be like “oh yeah, this thing I’m experiencing is that thing that Anna Salamon calls PR
... (read more)

Thank you for sharing! I found these two quotes to be the most interesting (bolding added by me):

Yeah that was my earlier point, I think society should regulate what the wide bounds are, but then I think individual users should have a huge amount of liberty to decide how they want their experience to go. So I think it is like a combination of society -- you know there are a few asterisks on the free speech rules -- and society has decided free speech is not quite absolute. I think society will also decide language models are not quite absolute. But there i

... (read more)
3Noosphere895mo
Pointing this out, since I don't nearly agree with this, IMO, at least not strongly enough to say that capabilities and safety are the same thing. Also, I note a motivated reasoning alert here, since this is what someone would write to reinforce their belief that AI capabilities are good, since the inconvenient world where the Orthogonality Thesis and instrumental convergence are true would be personally disastrous for OpenAI.

Ah, thanks for the clarifications. I agree with the clarified versions :)

Quick note on getting senior researchers:

  • It seems like one of the main bottlenecks is "having really good models of alignment."
  • It seems plausible to me that investing in junior alignment researchers today means we'll increase the number of senior alignment researchers (or at least "people who are capable of mentoring new alignment researchers, starting new orgs, leading teams, etc.").
  • My vibes-level guess is that the top junior alignment researchers are ready to lead teams within about
... (read more)

A few thoughts:

  1. I agree that it would be great to have more senior researchers in alignment

  2. I agree that, ideally, it would be easier for independent researchers to get funding.

  3. I don’t think it’s necessarily a bad thing that the field of AI alignment research is reasonably competitive.

  4. My impression is that there’s still a lot of funding (and a lot of interest in funding) independent alignment researchers.

  5. My impression is that it’s still considerably easier to get funding for independent alignment research than many other forms of independent n

... (read more)
9Jérémy Scheurer5mo
Thanks for your comments Akash. I think I have two main points I want to address.

  1. I agree that it's very good that the field of AI Alignment is very competitive! I did not want to imply that this is a bad thing. I was mainly trying to point out that from my point of view, it seems like overall there are more qualified and experienced people than there are jobs at large organizations. And in order to fill that gap we would need more senior researchers, who then can follow their research agendas and hire people (and fund orgs), which is however hard to achieve. One disclaimer I want to note is that I do not work at a large org, and I do not precisely know what kinds of hiring criteria they have, i.e. it is possible that in their view we still lack talented enough people. However, from the outside, it definitely does look like there are many experienced researchers.

  2. It is possible that my previous statement may have been misinterpreted. I wish to clarify that my concerns do not pertain to funding being a challenge. I did not want to make an assertion about funding in general, and if my words gave that impression, I apologize. I do not know enough about the funding landscape to know whether there is a lot or not enough funding (especially in recent months).

I agree with you that, for all I know, it's feasible to get funding for independent researchers (and definitely easier than doing a Ph.D. or getting a full-time position). I also agree that independent research seems to be more heavily funded than in other fields. My point was mainly the following:

  1. Many people have joined the field (which is great!), or at least it looks like it from the outside. 80000 hours etc. still recommend switching to AI Alignment, so it seems likely that more people will join.

  2. I believe that there are many opportunities for people to up-skill to a certain le
2jacquesthibs5mo
I didn’t mention this in my comment, but I also agree with this. Apologies if it seemed otherwise. I was mostly expressing a bit of concern about how funding will be disbursed going forward, from a macro-perspective.

Thanks for writing this up. I agree with several of the subpoints you make about how the plan could be more specific, measurable, etc. 

I'm not sure where I stand on some of the more speculative (according to me) claims about OpenAI's intentions. Put differently, I see your post making two big-picture claims: 

  1. The 1-2 short blog posts about the OpenAI plan failed to meet several desired criteria. Reality doesn't grade on a curve, so even though the posts weren't intended to spell out a bunch of very specific details, we should hold the world's lead
... (read more)
4Søren Elverlin5mo
I'll do both:

  1. I (again) affirm that this is very speculative.

  2. A substantial part of my private evidence is my personal evaluation of the CEO of OpenAI. I am really uneasy about stating this in public, but I now regret keeping my very negative evaluation of SBF private. Speak the truth, even if your voice trembles [https://forum.effectivealtruism.org/posts/u5gLprWhFDJLxooLc/speak-the-truth-even-if-your-voice-trembles]. I think a full "Heel turn" is more likely than not.

Reviewing this quickly because it doesn't have a review.

I've linked this post to several people in the last year. I think it's valuable for people (especially junior researchers or researchers outside of major AIS hubs) to be able to have a "practical sense" of what doing independent alignment research can be like, how the LTFF grant application process works, and some of the tradeoffs of doing this kind of work. 

This seems especially important for independent conceptual work, since this is the path that is least well-paved (relative to empirical work... (read more)

ELK was one of my first exposures to AI safety. I participated in the ELK contest shortly after moving to Berkeley to learn more about longtermism and AI safety. My review focuses on ELK’s impact on me, as well as my impressions of how ELK affected the Berkeley AIS community.

Things about ELK that I benefited from

Understanding ARC’s research methodology & the builder-breaker format. For me, most of the value of ELK came from seeing ELK’s builder-breaker research methodology in action. Much of the report focuses on presenting training strategies and pres... (read more)

Thanks for this post; it's probably my favorite Cold Takes post from the last few months. I appreciated the specific scenario, as well as the succinct points in the "we can do better" section. I felt like I could get a more concrete understanding of your worldview, how you think we should move forward, and the reasons why. I'm also glad that you're thinking critically about standards and monitoring.

For a simple example, imagine an AI company in a dominant market position - months ahead of all of the competition, in some relevant sense (e.g., its AI systems

... (read more)
2HoldenKarnofsky3mo
Thanks! I agree this is a concern. In theory, people who are constantly thinking about the risks should be able to make a reasonable decision about "when to pause", but in practice I think there is a lot of important work to do today making the "pause" more likely in the future, including on AI safety standards and on the kinds of measures described at https://www.cold-takes.com/what-ai-companies-can-do-today-to-help-with-the-most-important-century/

How would you feel about a review process that had two sections?

Section One: How important do you find this work & to what extent do you think the research is worth doing? (Ex: Does it strike at what you see as core alignment problems?)

Section Two: What do you think of the details of the research? (Ex: Do you see any methodological flaws, do you have any ideas for further work, etc).

My impression is that academic peer-reviewers generally do both of these. Compared to academic peer-review, LW/AF discussions tend to have a lot of Section One and not much... (read more)

4Rohin Shah5mo
Disclaimer: I'm not particularly confident in any of my views here or in my original comment. I mostly commented on the post because the post implied that I supported the idea of a review process whereas my actual opinion is mixed (not uniformly negative). If I hadn't been named explicitly I wouldn't have said anything; I don't want to stop people from trying this out if they think it would be good; it's quite plausible that someone thinking about this full time would have a vision for it that would be good that I haven't even considered yet (given how little I've thought about it).

I think how excited I'd be would depend a lot more on the details (e.g. who are the reviewers, how much time do they spend, how are they incentivized, what happens after the reviews are completed). But if we just imagine the LessWrong Review extended to the Alignment Forum, I'm not that excited, because I predict (not confidently) that the reviews just wouldn't be good at engaging with the details. (Mostly because LW / AF comments don't seem very good at engaging with details on existing LW / AF posts, and because typical LW / AF commenters don't seem familiar enough with ML to judge details in ML papers.)

Academic peer review does do both in principle, but I'd say that typically most of the emphasis is on Section Two. Generally the Section One style review is just "yup, this is in fact trying to make progress on a problem academia has previously deemed important, and is not just regurgitating things that people previously said" (i.e. it is significant and novel). (It is common for bad reviews to just say "this is not significant / not novel" and then ignore Section Two entirely, but this is pretty commonly thought of as explicitly "this was a bad review", unless they actually justified "not significant / not novel" well enough that most others would agree with them.)