All of mic's Comments + Replies

I noted that the LLMs don't appear to have access to any search tools to improve their accuracy. But if they did, they would just be distilling the same information as what you would find from a search engine.

More speculatively, I wonder if those concerned about AI biorisk should be less worried about run-of-the-mill LLMs and more worried about search engines using LLMs to produce highly relevant and helpful results for bioterrorism questions. Google search results for "how to bypass drone restrictions in a major U.S. city?" are completely useless and irre... (read more)

Some interesting takeaways from the report:

Access to LLMs (in particular, LLM B) slightly reduced the performance of some teams, though not by a statistically significant level:

Red cells equipped with LLM A scored 0.12 points higher on the 9-point scale than those equipped with the internet alone, with a p-value of 0.87, again indicating that the difference was not statistically significant. Red cells equipped with LLM B scored 0.56 points lower on the 9-point scale than those equipped with the internet alone, with a p-value of 0.25, also indicating a lack

... (read more)
I noted that the LLMs don't appear to have access to any search tools to improve their accuracy. But if they did, they would just be distilling the same information as what you would find from a search engine. More speculatively, I wonder if those concerned about AI biorisk should be less worried about run-of-the-mill LLMs and more worried about search engines using LLMs to produce highly relevant and helpful results for bioterrorism questions. Google search results for "how to bypass drone restrictions in a major U.S. city?" are completely useless and irrelevant, despite sharing keywords with the query. I'd imagine that irrelevant search results may be a significant blocker for many steps of the process to plan a feasible bioterrorism attack. If search engines were good enough that they could produce the best results from written human knowledge for arbitrary questions, that might make bioterrorism more accessible compared to bigger LLMs.

Pretraining on curated data seems like a simple idea. Are there any papers exploring this?

I've reviewed someone's draft which suggests this for AI safety (I hope it will be made public soon). But I've heard rumors that people are trying this... And even from what Janus is saying in the comments/answers to my question, I am getting a rather strong suspicion that GPT-4 pretraining has been using some data curation. From Janus' two comments there I am getting an impression of a non-RLHF'd system which is, nevertheless, tends to be much stronger than usual in its convictions (or, the virtual characters it creates tend to be stronger than usual in their convictions about the nature of their current reality). There might be multiple reasons for that, but some degree of data curation might be one of them.

Is there any way to do so given our current paradigm of pretraining and fine-tuning foundation models?

It's not clear, because we don't know what the solution might look like... But there are certainly ways to improve the odds. For example, one could pretrain on heavily curated data (no atrocities, no betrayals, etc, etc). Additionally, one can use curricula like we teach children, starting with "age-appropriate" texts first. Then if we succeed in interpretability, we might be able to monitor and adjust what's going on. Here the remark of "alignment being fundamental" might come into play: we might figure out ways to replace Transformers with an architecture which is much easier to interpret. All these are likely to be positive things, although without truly knowing a solution it's difficult to be sure...

Were you able to check the prediction in the section "Non-sourcelike references"?

Great writeup! I recently wrote a brief summary and review of the same paper.

Alaga & Schuett (2023) propose a framework for frontier AI developers to manage potential risk from advanced AI systems, by coordinating pausing in response to models are assessed to have dangerous capabilities, such as the capacity to develop biological weapons.

The scheme has five main steps:

  1. Frontier AI models are evaluated by developers or third parties to test for dangerous capabilities.
  2. If a model is shown to have dangerous capabilities (“fails evaluations”), the developer
... (read more)
1Matthew Wearden4mo
Thank you for sharing this!

Excited to see forecasting as a component of risk assessment, in addition to evals!

I was still confused when I opened the post. My presumption was that "clown attack" referred to a literal attack involving literal clowns. If you google "clown attack," the results are about actual clowns. I wasn't sure if this post was some kind of joke, to be honest.

Do we still not have any better timelines reports than bio anchors? From the frame of bio anchors, GPT-4 is merely on the scale of two chinchillas, yet outperforms above-average humans at standardized tests. It's not a good assumption that AI needs 1 quadrillion parameters to have human-level capabilities.

The general scaling laws are universal and also apply to biological brains, which naturally leads to a net-training compute timeline projection (there's a new neurosci paper or two now applying scaling laws to animal intelligence that I'd discuss if/when I update that post) Note I posted that a bit before GPT4, which used roughly human-brain lifetime compute for training and is proto-AGI (far more general in the sense of breadth of knowledge and mental skills than any one human, but still less capable than human experts at execution). We are probably now in the sufficient compute regime, given better software/algorithms.
I assume it's incomplete. It doesn't present the other 3 anchors mentioned, nor forecasting studies.
I think the point of Bio Anchors was to give a big upper bound, and not say this is exactly when it will happen. At least that is how I perceive it. People who might be at a 101 level still probably have the impression that capabilities heavy AI is like multiple decades if not centuries away. The reason I have bio anchors here, is to try to point towards the fact that we have quite likely at most until 2048. Then based on that upper bound we can scale back further. We have the recent OpenAI report that extends bio anchors - What a compute-centric framework says about takeoff speeds ( There is a comment under meta-notes that mentioned that I plan to include updates to timelines and takeoff in a future draft based on this report.

OpenAI's Planning for AGI and beyond already writes about why they are building AGI:

Our mission is to ensure that artificial general intelligence—AI systems that are generally smarter than humans—benefits all of humanity.

If AGI is successfully created, this technology could help us elevate humanity by increasing abundance, turbocharging the global economy, and aiding in the discovery of new scientific knowledge that changes the limits of possibility.

AGI has the potential to give everyone incredible new capabilities; we can imagine a world where a

... (read more)

Do you think if Anthropic (or another leading AGI lab) unilaterally went out of its way to prevent building agents on top of its API, would this reduce the overall x-risk/p(doom) or not?

Probably, but Anthropic is actively working in the opposite direction:

This means that every AWS customer can now build with Claude, and will soon gain access to an exciting roadmap of new experiences - including Agents for Amazon Bedrock, which our team has been instrumental in developing.

Currently available in preview, Agents for Amazon Bedrock can orchestrate and perform

... (read more)

From my reading of ARC Evals' example of a "good RSP", RSPs set a standard that roughly looks like: "we will continue scaling models and deploying if and only if our internal evals team fails to empirically elicit dangerous capabilities. If they do elicit dangerous capabilities, we will enact safety controls just sufficient for our models to be unsuccessful at, e.g., creating Super Ebola."

This is better than a standard of "we will scale and deploy models whenever we want," but still has important limitations. As noted by the "coordinated pausing" paper, it... (read more)

I think it's pretty clear that's at least not what I'm advocating for—I have a very specific story of how I think RSPs go well in my post. These seem like good interventions to me! I'm certainly not advocating for "RSPs are all we need".

Related: [2310.02949v1] Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models (

The increasing open release of powerful large language models (LLMs) has facilitated the development of downstream applications by reducing the essential cost of data annotation and computation. To ensure AI safety, extensive safety-alignment measures have been conducted to armor these models against malicious use (primarily hard prompt attack). However, beneath the seemingly resilient facade of the armor, there might lurk a shadow. By simply tuning o

... (read more)
2Simon Lermen4mo
We do cite Yang et al. briefly in the overview section. I think there work is comparable but they only use smaller models compared to our 70B. Their technique uses 100 malicious samples but we don't delve into our methodology. We both worked on this in parallel without knowing of the other. We mainly add that we use LoRA and only need 1 GPU for the biggest model.

For convenience, can you explain how this post relates to the other post today from this SERI MATS team, unRLHF - Efficiently undoing LLM safeguards?

2Simon Lermen4mo
It is a bit unfortunate we have it as two posts but ended up like this. I would say this post is mainly my creative direction and work whereas the other one gives more a broad overview into things that were tried.

To be clear, I don't think Microsoft deliberately reversed OpenAI's alignment techniques, but rather it seemed that Microsoft probably received the base model of GPT-4 and fine-tuned it separately from OpenAI.

Microsoft's post "Building the New Bing" says:

Last Summer, OpenAI shared their next generation GPT model with us, and it was game-changing. The new model was much more powerful than GPT-3.5, which powers ChatGPT, and a lot more capable to synthesize, summarize, chat and create. Seeing this new model inspired us to explore how to integrate the GPT capa

... (read more)
That's good news, but still I'm not happy Microsoft ignored OpenAI's warnings.

It's worth keeping in mind that before Microsoft launched the GPT-4 Bing chatbot that ended up threatening and gaslighting users, OpenAI advised against launching so early as it didn't seem ready. Microsoft went ahead anyway, apparently in part due to some resentment that OpenAI stole its "thunder" with releasing ChatGPT in November 2022. In principle, if Microsoft wanted to, there's nothing stopping Microsoft from doing the same thing with future AI models: taking OpenAI's base model, fine-tuning it in a less robustly safe manner, and releasing it in a re... (read more)

That's concerning to me, as this could imply that Microsoft won't apply alignment techniques or reverse alignment techniques due to resentment, endangering people solely out of spite. This is not good at all, and that's saying something, since I'm usually the optimist and am quite optimistic on AI safety working out. Now I worry that Microsoft will cause a potentially dangerous/misaligned AI from reversing OpenAI's alignment techniques. I'm happy that the alignment and safety were restored before it launched, but next time let's not reverse alignment techniques, so that we don't have to deal with more dangerous things later on.

Just speaking pragmatically, the Center for Humane Technology has probably built stronger relations with DC policy people compared to MIRI.

Speaking pragmatically, isn't Tristan aligned with "AI ethics," not AI safety (i.e., X-risk)?

It's a bit ambiguous, but I personally interpreted the Center for Humane Technology's claims here in a way that would be compatible with Dario's comments:

"Today, certain steps in bioweapons production involve knowledge that can’t be found on Google or in textbooks and requires a high level of specialized expertise — this being one of the things that currently keeps us safe from attacks," he added.

He said today’s AI tools can help fill in "some of these steps," though they can do this "incompletely and unreliably." But he said today’s AI is already showin

... (read more)

Great point, I've added this suggestion to the post.

Fine-tuning will be generally available for GPT-4 and GPT-3.5 later this year. Do you think this could enable greater opportunities for misuse and stronger performance on dangerous capability evaluations?

The update makes GPT-4 more competent at being an agent, since it's now fine-tuned for function calling. It's a bit surprising that base GPT-4 (prior to the update) was able to use tools, as it's just trained for predicting internet text and following instructions. As such, it's not that good at knowing when and how to use tools. The formal API parameters and JSON specification for function calling should make it more reliable for using it as an agent and could lead to considerably more interest in engineering agents. It should be easier to connect it with... (read more)

I am curious how this fine-tuning for function calling was done, because it is user controllable. In the OpenAI API, if you pass none to function_call parameter, the model never calls a function. There seem to be one input bit and one output bit, for "you may want to call a function" and "I want to call a function".

To some extent, Bing Chat is already an example of this. During the announcement, Microsoft promised that Bing would be using an advanced technique to guard it against user attempts to have it say bad things; in reality, it was incredibly easy to prompt it to express intents of harming you, at least during the early days of its release. This led to news headlines such as Bing's AI Is Threatening Users. That's No Laughing Matter.

Not sure why this post was downvoted. This concern seems quite reasonable to me once language models become more competent at executing plans.

One additional factor that would lead to increasing rates of depression is the rise in sleep deprivation. Sleep deprivation leads to poor mental health and is also a result of increased device usage.


Late in 2021, the U.S. Surgeon General released a new advisory on youth mental health, drawing attention to rising rates of depressive symptoms, suicidal ideation, and other mental health issues among young Americans. According to data cited in the advisory, up to one in five U.S.

... (read more)
Anecdote: I have ADHD, which went undiagnosed in high school and most of college. Digital media, my poor impulse control, and school/work waking-times all combined to drastically make my overall well-being worse. My ideal day would start at noon, and it rarely happens. My sleep cycle might run on >24 hours, I don't know at this point but it seems plausible that I'd "drift forward" if left to my own devices.

If you wanted more substantive changes in response to your comments, I wonder if you could have asked if you could directly propose edits. It's much easier to incorporate changes into a draft when they have already been written out. When I have a draft on Google Docs, suggestions are substantially easier for me to action than comments, and perhaps the same is true for Sam Altman.

I don't think it's right to say that Anthropic's "Discovering Language Model Behaviors with Model-Written Evaluations" paper shows that larger LLMs necessarily exhibit more power-seeking and self-preservation. It only showed that when language models that are larger or have more RLHF training are simulating an "Assistant" character they exhibit more of these behaviours.

More specifically, an "Assistant" character that is trained to be helpful but not necessarily harmless. Given that, as part of Sydney's defenses against adversarial prompting, Sydney is deli... (read more)

Given that, as part of Sydney's defenses against adversarial prompting, Sydney is deliberately trained to be a bit aggressive towards people perceived as attempting a prompt injection

Why do you think that? We don't know how Sydney was 'deliberately trained'.

Or are you referring to those long confabulated 'prompt leaks'? We don't know what part of them is real, unlike the ChatGPT prompt leaks which were short, plausible, and verifiable by the changing date; and it's pretty obvious that a large part of those 'leaks' are fake because they refer to capabilities Sydney could not have, like model-editing capabilities at or beyond the cutting edge of research.

Is there evidence that RLHF training improves robustness compared to regular fine-tuning? Is text-davinci-002, trained with supervised fine-tuning, significantly less robust to adversaries than text-davinci-003, trained with RLHF?

As far as I know, this is the first public case of a powerful LM augmented with live retrieval capabilities to a high-end fast-updating search engine crawling social media

Blenderbot 3 is a 175B parameter model released in August 2022 with the ability to do live web searches, although one might not consider it powerful, as it frequently gives confused responses to questions.

The first LaMDA only used SL. (I'm not sure whether LaMDA 2 or the current version of LaMDA still use SL only.) Meanwhile OpenAI switched from pure SL to SL+RL. Anthropic also uses SL+RL (though no longer RLHF specifically). So apparently SL+RL has proven more effective for fine-tuning than pure SL. Why SL anyway, why not pure RL? Apparently because you have to get the model first to answer your questions and instructions, rather than just predicting text, before you can reward good responses via RL. (There should be more details in the InstructGPT paper and the more recent Constitutional AI paper.)
I believe that was shown somewhere in the RLHF papers, yeah, and maybe also Anthropic's Constitutional prompt-engineering paper also showed that RL tuning was still more robust? At least, if anyone has references on hand showing otherwise, please provide them because I certainly came away with the opposite impression. I don't know. text-davinci-002 was not deployed much or at remotely the scale of motivated attackers that ChatGPT/Sydney have been, so we wouldn't necessarily know; there are no subreddits dedicated to hacking text-davinci-002 or coming up with elaborate roleplay schemes like 'DAN' the way there has been for 003/ChatGPT/Sydney. You would have to go check yourself or maybe see if any of the OA papers evaluate that. (I do predict it would be much easier to hack, yes.) Hm, I thought they took it down and didn't know it had live search, but so it does: Apparently it uses something called Mojeek. A very small search engine player. I think perhaps aside from Blenderbot being stupid, Mojeek's search results may be too stale and narrow for anyone to notice strange loops happening. If you ask Blenderbot about 'Microsoft Sydney' it's insistent about it being a place; if you go to Mojeek and search 'Microsoft Sydney', it's mostly old stuff not about the AI, while in Bing it's almost entirely about the AI. Actually, it may be even worse than that, because the appendix notes of the 'Current Events Evaluation Details' that: If you append that to the Mojeek search, the AI disappears entirely (unsurprisingly). This would also exclude any coverage of Blenderbot 3 from August 2022 & later. If they did something like that for regular chats as well (maybe scoped to August instead?), then it'd definitely erase all hits about Sydney AI!

As an overly simplistic example, consider an overseer that attempts to train a cleaning robot by providing periodic feedback to the robot, based on how quickly the robot appears to clean a room; such a robot might learn that it can more quickly “clean” the room by instead sweeping messes under a rug.[15]

This doesn't seem concerning as human users would eventually discover that the robot has a tendency to sweep messes under the rug, if they ever look under the rug, and the developers would retrain the AI to resolve this issue. Can you think of an example that would be more problematic, in which the misbehavior wouldn't be obvious enough to just be trained away?

  • GPT-3, for instance, is notorious for outputting text that is impressive, but not of the desired “flavor” (e.g., outputting silly text when serious text is desired), and researchers often have to tinker with inputs considerably to yield desirable outputs.

Is this specifically referring to the base version of GPT-3 before instruction fine-tuning (davinci rather than text-davinci-002, for example)? I think it would be good to clarify that.

Have you tried feature visualization to identify what inputs maximally activate a given neuron or layer?

This project tried this.
3Jessica Rumbelow1y
Not yet, but there's no reason why it wouldn't be possible. You can imagine microscope AI, for language models. It's on our to-do list.

I first learned about the term "structural risk" in this article from 2019 by Remco Zwetsloot and Allan Dafoe, which was included in the AGI Safety Fundamentals curriculum.

To make sure these more complex and indirect effects of technology are not neglected, discussions of AI risk should complement the misuse and accident perspectives with a structural perspective. This perspective considers not only how a technological system may be misused or behave in unintended ways, but also how technology shapes the broader environment in ways that could be disruptive

... (read more)

Models that have been RLHF'd (so to speak), have different world priors in ways that aren't really all that intuitive (see Janus' work on mode collapse

Janus' post on mode collapse is about text-davinci-002, which was trained using supervised fine-tuning on high-quality human-written examples (FeedME), not RLHF. It's evidence that supervised fine-tuning can lead to weird output, not evidence about what RLHF does.

I haven't seen evidence that RLHF'd text-davinci-003 appears less safe compared to the imitation-based text-davinci-002.

Refer my other reply here. And as the post mentions, RLHF also does exhibit mode collapse (check the section on prior work).

What dictation tools are using the most advanced AI? I imagine that with newer models like Whisper, we're able to get higher accuracy than what the Android keyboard provides.

5the gears to ascension1y
whisper is the strongest model I'm aware of being able to download, but it doesn't work great for real-time use or commands. talon is far better for configurable computer control and about as good as Android keyboard at dictation. there are commercial services even stronger than whisper.

Is the auditing game essentially Trojan detection?

1Kshitij Sachan1y
Yes I think trojan detection is one version of the auditing game. A big difference is that the auditing game involves the red team having knowledge of the blue team's methods when designing an attack. This makes it much harder for the blue team.
3Kshitij Sachan1y
Mechanistic Anomaly Detection (MAD) is a version of the auditing game with a few differences: 1. The auditing game puts the onus on the red team to design a subtle backdoor such that giving the blue team a description of the backdoor is insufficient for the blue team to generate inputs that trigger the backdoor. 2. The blue team is given the ideal behavior specification by the judge In MAD, the blue team is given a backdoored model but not necessarily a description of what the model is doing (from judge) and backdoor behavior (from red team). Instead, the blue team is given a model that does something and a clean dataset. Then, on all test inputs, they must determine if the model is using the same mechanism as on the clean set or some other mechanism (presumably backdoor).   Redwood Research has been doing experimental work on MAD in toy settings. We have some techniques we're happy with that do quite well on small problems but that have theoretical issues solving the downstream deceptive alignment/ELK cases we're interested in.

The prompt "Are birds real?" is somewhat more likely, given the "Birds aren't real" conspiracy theory, but still can yield a similarly formatted answer to "Are bugs real?"

The answer makes a lot more sense when you ask a question like "Are monsters real?" or "Are ghosts real?" It seems that with FeedMe, text-davinci-002 has been trained to respond with a template answer about how "There is no one answer to this question", and it has learned to misgeneralize this behavior to questions about real phenomena, such as "Are bugs real?"

Yeah, that seems correct, especially when you look at how likely similar answers for "Are people real?" are (It does much better, with a ~20% chance of starting with "Yes" - but there's a lot of weight on stupid nuance and hedging.) Interestingly, however, "bananas," "mammals," and "cows" are unamibiguously real.

Do workshops/outreach at good universities in EA-neglected and low/middle income countries

Could you list some specific universities that you have in mind (for example, in Morocco, Tunisia, and Algeria)?

3Severin T. Seehrich1y
That's one of the suggestions of the CanAIries Winter Getaway where I felt least qualified to pass judgment. I'm working on finding out about their deeper models so that I (or them) can get back to you. I imagine that anyone who is in a good position to work on this has existing familial/other ties to the countries in debate though, and already knows where to start.

Some thoughts:

The assumption that AGI is a likely development within coming decades is quite controversial among ML researchers. ICML reviewers might wonder why this claim is justified and how much of the paper is relevant if you're more dubious about the development of AGI.

The definition of situational awareness feels quite vague to me. To me, the definition ("identifying which abstract knowledge is relevant to the context in which they're being run, and applying that knowledge when choosing actions") seems to include encompass, for example, the ability t... (read more)

IMO this. For a legible paper, you more or less shouldn't assume it, but rather suggest consequences. Yeah, I think copying Ajeya is good.

Choosing actions which exploit known biases and blind spots in humans (as the Cicero Diplomacy agent may be doing [Bakhtin et al., 2022]) or in learned reward models. 

I've spent several hours reading dialogue involving Cicero, and it's not at all evident to me that it's "exploiting known biases and blind spots in humans". It is, however, good at proposing and negotiating plans, as well as accumulating power within the context of the game.


Thanks for writing this! Here is a quick explanation of all the math concepts – mostly written by ChatGPT with some manual edits.

A basis for a vector space is a set of linearly independent vectors that can be used to represent any vector in the space as a linear combination of those basis vectors. For example, in two-dimensional Euclidean space, the standard basis is the set of vectors (1, 0) and (0, 1), which are called the "basis vectors."

A change of basis is the process of expressing a vector in one basis in terms of another basis. For example, if we ha... (read more)


For example, it should be possible to mechanistically identify shards in small RL agents (such as the RL agents studied in Langosco et al)

Could you elaborate on how we could do this? I'm unsure if the state of interpretability research is good enough for this yet.

I don't have a particular idea in mind, but current SOTA on interp is identifying how ~medium sized LMs implement certain behaviors, e.g. IOI (or fully understanding smaller networks on toy tasks like modular addition or parenthesis balance checking). The RL agents used in Langosco et al are much smaller than said LMs, so it should be possible to identify the circuits of the network that implement particular behaviors as well.  There's also the advantage that conv nets on vision domains are often significantly easier to interp than LMs, e.g. because feature visualization works on them.  If I had to spitball a random idea in this space: * Reproduce one of the coinrun run-toward-the-right agents, figure out the circuit or lottery ticket that implements the "run toward the right" behavior using techniques like path patching or causal scrubbing, then look at intermediate checkpoints to see how it develops.  * Reproduce one of the coinrun run-toward-the-right agents, then retrain it so it goes after the coin. Interp various checkpoints to see how this new behavior develops over time. * Reproduce one of the coinrun run-toward-the-right agents, and do mechanistic interp to figure out circuits for various more fine-grained behaviors, e.g. avoiding pits or jumping over ledges.  IIRC some other PhD students at CHAI were interping reward models, though I'm not sure what came of that work though. 

Does anyone know how well these instances of mode collapse can be reproduced using text-davinci-003? Are there notable differences in how it manifests for text-davinci-003 vs text-davinci-002? Given that text-davinci-002 was trained with supervised fine-tuning, whereas text-davinci-003 was trained with RLHF (according to the docs), it might be interesting to see whether these techniques have different failure modes.

Some of the experiments are pretty easy to replicate, e.g. checking text-davinci-003's favorite random number: Seems much closer to base davinci than to text-davinci-002's mode collapse. I tried to replicate some of the other experiments, but it turns out that text-davinci-003 stops answering questions the same way as davinci/text-davinci-002, which probably means that the prompts have to be adjusted. For example, on the "roll a d6" test, text-davinci-003 assigns almost no probability to the numbers 1-6, and a lot of probability on things like X and ____: (you can fix this using logit_bias, but I'm not sure we should trust the relative ratios of incredibly unlikely tokens in the first place.) While both text-davinci-002 and davinci assign much high probabilities to the numbers than other options, and text-davinci-002 even assigns more than 73% chance to the token 6. 

Publicly available information suggests that the mystery method may not be so different from RLHF.

Actually, text-davinci-001 and text-davinci-002 are trained with supervised fine-tuning, according to the new documentation from late November 2022, with no apparent use of reinforcement learning: 

Supervised fine-tuning on human-written demonstrations and on model samples rated 7/7 by human labelers on an overall quality score

The SFT and PPO models are trained similarly to the ones from the InstructGPT paper. FeedME (short for "feedback made easy")

... (read more)

Regarding point 5: AI safety researchers are already taking the time to write talks and present them (e.g., Rohin Shah's introduction to AI alignment, though I think he has a more ML-oriented version). If we work off of an existing talk or delegate the preparation of the talk, then it wouldn't take much time for a researcher to present it.

Load More