Wiki Contributions


On average, scholars self-reported understanding of major alignment agendas increased by 1.75 on this 10 point scale. Two respondents reported a one-point decrease in their ratings, which could be explained by some inconsistency in scholars’ self-assessments (they could not see their earlier responses when they answered at the end of the Research Phase) or by scholars realizing their previous understanding was not as comprehensive as they had thought.

This could also be explained by these scholars actually having a worse understanding after the program. For me, MATS caused me to focus a bunch on a few particular areas and spend less time at the high level / reading random LW posts, which plausibly has the effect of reducing my understanding of major alignment agendas. 

This could happen because of forgetting what you previously knew, or various agendas changing and you not keeping up with it. My guess is that there are many 2 month time periods in which a researcher will have a worse understanding of major research agendas at the end than at the beginning — though on average you want the number to go up. 

It's unclear if you've ordered these in a particular way. How likely do you think they each are? My ordering from most to least likely would probably be:

  • Inductive bias toward long-term goals
  • Meta-learning
  • Implicitly non-myopic objective functions
  • Simulating humans
  • Non-myopia enables deceptive alignment
  • (Acausal) trade

Why do you think this:

Non-myopia is interesting because it indicates a flaw in training – somehow our AI has started to care about something we did not design it to care about. 

Who says we don't want non-myopia, those safety people?! I guess to me it looks like the most likely reason we get non-myopia is that we don't try that hard not to. This would be some combination of Meta-learning, Inductive bias toward long-term goals, and Implicitly non-myopic objective functions, as well as potentially "Training for non-myopia". 

I think this essay is overall correct and very important. I appreciate you trying to protect the epistemic commons, and I think such work should be compensated. I disagree with some of the tone and framing, but overall I believe you are right that the current public evidence for AI-enabled biosecurity risks is quite bad and substantially behind the confidence/discourse in the AI alignment community. 

I think the more likely hypothesis for the causes of this situation isn't particularly related to Open Philanthropy and is much more about bad social truth seeking processes. It seems like there's a fair amount of vibes-based deferral going on, whereas e.g., much of the research you discuss is subtly about future systems in a way that is easy to miss. I think we'll see more and more of this as AI deployment continues and the world gets crazier — the ratio of "things people have carefully investigated" to "things they say" is going to drop. I expect in this current case, many of the more AI alignment people didn't bother looking too much into the AI-biosecurity stuff because it feels outside their domain and more-so took headline results at face value, I definitely did some of this. This deferral process is exacerbated by the risk of info hazards and thus limited information sharing. 

I think this reasoning ignores the fact that at the time someone first tries to open source a system of capabilities >T, the world will be different in a bunch of ways. For example, there will probably exist proprietary systems of capabilities .

I think this is likely but far from guaranteed. The scaling regime of the last few years involves strongly diminishing returns to performance from more compute. The returns are coming (scaling works), but it gets more and more expensive to get marginal capability improvements. 

If this trend continues, it seems reasonable to expect the gap between proprietary models and open source to close given that you need to spend strongly super-linearly to keep a constant lead (measured by perplexity, at least). There's questions about whether there will be an incentive to develop open source models at the $ billion+ cost, and I don't know, but it does seem like proprietary project will also be bottlenecked at the 10-100B range (and they also have this problem of willingness to spend given how much value they can capture). 

Potential objections: 

  • We may first enter the "AI massively accelerates ML research" regime causing the leading actors to get compounding returns and keep a stronger lead. Currently I think there's a >30% chance we hit this in the next 5 years of scaling. 
  • Besides accelerating ML research, there could be other features of GPT-SoTA that cause them to be sufficiently better than open source models. Mostly I think the prior of current trends is more compelling, but there will probably be considerations that surprise me. 
  • Returns to downstream performance may follow different trends (in particular not having the strongly diminishing returns to scaling) than perplexity. Shrug this doesn't really seem to be the case, but I don't have a thorough take.
  • These $1b / SoTA-2 open source LLMs may not be at the existentially dangerous level yet. Conditional on GPT-SoTA not accelerating ML research considerably, I think it's more likely that the SoTA-2 models are not existentially dangerous, though guessing at the skill tree here is hard. 

I'm not sure I've written this comment as clearly as I want. The main thing is: expecting proprietary systems to remain significantly better than open source seems like a reasonable prediction, but I think the fact that there are strongly diminishing returns to compute scaling in the current regime should cast significant doubt on it. 

Only slightly surprised?

We are currently at ASL-2 in Anthropic's RSP. Based on the categorization, ASL-3 is "low-level autonomous capabilities". I think ASL-3 systems probably don't meet the bar of "meaningfully in control of their own existence", but they probably meet the thing I think is more likely:

I think it wouldn’t be crazy if there were AI agents doing stuff online by the end of 2024, e.g., running social media accounts, selling consulting services; I expect such agents would be largely human-facilitated like AutoGPT

I think it's currently a good bet (>40%) that we will see ASL-3 systems in 2024. 

I'm not sure how big of a jump if will be from that to "meaningfully in control of their own existence". I would be surprised if it were a small jump, such that we saw AIs renting their own cloud compute in 2024, but this is quite plausible on my models. 

I think the evidence indicates that this is a hard task, but not super hard. e.g., looking at ARC's report on autonomous tasks, one model partially completes the task of setting up GPT-J via a cloud provider (with human help).

I'll amend my position to just being "surprised" without the slightly, as I think this better captures my beliefs — thanks for the push to think about this more. Maybe I'm at 5-10%.

It may help to know how you're operationalizing AIs that are ‘meaningfully aware of their own existence’.

shrug, I'm being vague 

I think it's pretty unlikely that scaling literally stops working, maybe I'm 5-10% that we soon get to a point where there are only very small or negligible improvements to increasing compute. But I'm like 10-20% on some weaker version. 

A weaker version could look like there are diminishing returns to performance from scaling compute (as is true), and this makes it very difficult for companies to continue scaling. One mechanism at play is that the marginal improvements from scaling may not be enough to produce the additional revenue needed to cover the scaling costs, this is especially true in a competitive market where it's not clear scaling will put one ahead of their competitors. 

In the context of the post, I think it's quite unlikely that I see strong evidence in the next year indicating that scaling has stopped (if only because a year of no-progress is not sufficient evidence). I was more so trying to point to how there [sic] are contingencies which would make OpenAI's adoption of an RSP less safety-critical. I stand by the statements that scaling no longer yielding returns would be such a contingency, but I agree that it's pretty unlikely. 

Second smaller comment:

I'm not saying LLMs necessarily raise the severity ceiling on either a bio or cyber attack. I think it's quite possible that AIs will do so in the future, but I'm less worried about this on the 2-3 year timeframe. Instead, the main effect is decreasing the cost of these attacks and enabling more actors to execute such attacks. (as noted, it's unclear whether this substantially worsens bio threats) 

if I want to download penetration tools to hack other computers without using any LLM at all I can just do so

Yes, it's possible to launch cyber attacks currently. But with AI assistance it will require less personal expertise and be less costly. I am slightly surprised that we have not seen a much greater amount of standard cybercrime (the bar I was thinking when I wrote this was not the hundreds of deaths bar, it was more like "statistically significant increase in cybercrime / serious deepfakes / misinformation, in a way that concretely impacts the world, compared to previous years"). 

Thanks for your comment!

I appreciate you raising this point about whether LLMs alleviate the key bottlenecks on bioterrorism. I skimmed the paper you linked, thought about my previous evidence, and am happy to say that I'm much less certain that I was before. 

My previous thinking for why I believe LLMs exacerbate biosecurity risks:

  • Kevin Esvelt said so on the 80k podcast, see also the small experiment he mentions. Okay evidence. (I have not listened to the entire episode)
  • Anthropic says so in this blog post. Upon reflection I think this is worse evidence than I previously thought (seems hard to imagine seeing the opposite conclusion from their research, given how vague the blog post is; access to their internal reports would help). 

The Montague 2023 paper you link: the main bottleneck to high consequence biological attacks is actually R&D to create novel threats, especially spread testing which needs to take part in a realistic deployment environment. This requires both being purposeful and being having a bunch of resources, so policies need not be focused on decreasing democratic access to biotech and knowledge. 

I don't find the paper's discussion of 'why extensive spread testing is necessary' to be super convincing, but it's reasonable and I'm updating somewhat toward this position. That is, I'm not convinced either way. I would have a better idea if I knew how accurate a priori spread forecasting has been for bioweapons programs in the past. 

I think the "LLMs exacerbate biosecurity risks" still goes through even if I mostly buy Montague's arguments, given that those arguments are partially specific to high consequence attacks. Additionally, Montague is mainly arguing that democratization of biotech / info doesn't help with the main barriers, not that it doesn't help at all:

The reason spread-testing has not previously been perceived as a the defining stage of difficulty in the biorisk chain (see Figure 1), eclipsing all others, is that, until recently, the difficulties associated with the preceding steps in the risk chain were so high as to deter contemplation of the practical difficulties beyond them. With the advance of synthetic biology enabled by bioinformatic inferences on ‘omics data, the perception of these prior barriers at earlier stages of the risk chain has receded.

So I think there exists some reasonable position (which likely doesn't warrant focusing on LLM --> biosecurity risks, and is a bit of an epistemic cop out): LLMs increase biosecurity risks marginally even though they don't affect the main blockers for bio weapon development. 

Thanks again for your comment, I appreciate you pointing this out. This was one of the most valuable things I've read in the last 2 weeks. 

In particular, the detection mechanisms for mesa-optimizers are intact, but we do need to worry about 1 new potential inner misalignment pathway.

I'm going to read this as "...1 new potential gradient hacking pathway" because I think that's what the section is mainly about. (It appears to me that throughout the section you're conflating mesa-optimization with gradient hacking, but that's not the main thing I want to talk about.)

The following quote indicates at least two potential avenues of gradient hacking: "In an RL context", "supervised learning with adaptive data sampling". These both flow through the gradient hacker affecting the data distribution, but they seem worth distinguishing, because there are many ways a malign gradient hacker could affect the data distribution. 

Broadly, I'm confused about why others (confidently) think gradient hacking is difficult. Like, we have this pretty obvious pathway of a gradient hacker affecting training data. And it seems very likely that AIs are going to be training on their own outputs or otherwise curating their data distribution — see e.g., 

  • Phi, recent small scale success of using lots of synthetic data in pre-training, 
  • Constitutional AI / self-critique, 
  • using LLMs for data labeling and content moderation,
  • The large class of self-play approaches that I often lump together under "Expert-iteration" which involve iteratively training on the best of your previous actions,
  • the fact that RLHF usually uses a preference model derived from the same base/SFT model being trained. 

Sure, it may be difficult to predictably affect training via partial control over the data distribution. Personally I have almost zero clue how to affect model training via data curation, so my epistemic state is extremely uncertain. I roughly feel like the rest of humanity is in a similar position — we have an incredibly poor understanding of large language model training dynamics — so we shouldn't be confident that gradient hacking is difficult. On the other hand, it's reasonable to be like "if you're not smarter than (some specific set of) current humans, it is very hard for you to gradient hack, as evidence by us not knowing how to do it." 

I don't think strong confidence in either direction is merited by our state of knowledge on gradient hacking. 

Load More