All of Ben Cottier's Comments + Replies

What is the source for the "JP Morgan note"?


To be clear (sorry if you already understood this from the post): Running BLOOM via an API that someone else created is easy. My claim is that someone needs significant expertise to be able to run their own instance of BLOOM. I think the hardest part is setting up multiple GPUs to run the 176B parameter model. But looking back, I might have underestimated how straightforward it is to get the open-source code to run BLOOM working. Maybe it's basically plug-and-play as long as you get an appropriate A100 GPU instance on the cloud. I did not attempt to run BL... (read more)
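A back-of-the-envelope sketch of why multiple GPUs come into it: the code below estimates how many GPUs are needed just to hold the weights of a 176B parameter model. It assumes 2 bytes per parameter (fp16/bf16), counts 1 GB as 10^9 bytes, and ignores activations and other runtime overhead, so the real requirement would be higher. The helper name and the choice of 40 GB and 80 GB A100 variants are illustrative assumptions, not something stated in the post.

```python
import math


def min_gpus_for_weights(n_params: float, bytes_per_param: int, gpu_mem_gb: float) -> int:
    """Lower bound on the number of GPUs needed to hold the weights alone.

    Hypothetical helper: ignores activations, caches and framework overhead.
    """
    weights_gb = n_params * bytes_per_param / 1e9  # 1 GB taken as 10^9 bytes
    return math.ceil(weights_gb / gpu_mem_gb)


BLOOM_PARAMS = 176e9  # parameter count quoted in the comment above

for gpu_mem_gb in (40, 80):  # assumed A100 memory sizes
    n = min_gpus_for_weights(BLOOM_PARAMS, bytes_per_param=2, gpu_mem_gb=gpu_mem_gb)
    print(f"{gpu_mem_gb} GB GPUs needed for weights only: {n}")
```

Under these assumptions the weights take about 352 GB, so even the larger cards have to be combined, which is where the multi-GPU setup effort arises.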

Personal AI assistants seem to have one of the largest impacts (or at least "presence") mainly due to the number of users. The impact per person seems small - making life slightly more convenient and productive, maybe. Not sure if there is actually much impact on productivity. I wonder if there is any research on this. I haven't looked into it at all.

Relatedly, chatbots are certainly used a lot, but I'm uncertain about their current impacts beyond personal entertainment and wellbeing (and uncertain about the direction of the impact on wellbeing).

What 2026 lo... (read more)

Nice! A couple things that this comment pointed out for me:

  1. Real time is not always (and perhaps often not) the most useful way to talk about timelines. It can be more useful to talk about different paths, or economic growth, if that's more relevant to how tractable the research is.
  2. An agenda doesn't necessarily have to argue that its assumptions are more likely, because we may have enough resources to get worthwhile expected returns on multiple approaches.

Something that's unclear here: are you excited about this approach because you think brain-like AGI will be easier to align? Or is it more about the relative probabilities / neglectedness / your fit?

Steven Byrnes (2y, 5 karma)
I don't think it's obvious that "we should do extra safety research that bets on a future wherein AGI safety winds up being easy". If anything it seems backwards. Well, tractability cuts one way, importance cuts the other way, "informing what we should do viz. differential technology development" is a bit unclear. I do know one person who works on brain-like AGI capabilities on the theory that brain-like AGI would be easier to align. Not endorsing that, but at least there's an internal logic there. (FWIW, my hunch is that brain-like AGI would be better / less bad for safety than the "risks from learned optimization" scenario, albeit with low confidence. How brain-like AGI compares to other scenarios (GPT-N or whatever), I dunno.) Instead I'm motivated to work on this because of relative probabilities and neglectedness.

I'm excited about this project. I've been thinking along similar lines about inducing a model to learn deception, in the context of inner alignment. It seems really valuable to have concrete (but benign) examples of a problem to poke at and test potential solutions on. So far there seem to be fewer concrete examples of deception, betrayal and the like to work with in ML compared to, say, distributional shift or negative side effects.

Previous high level projects have tried to define concepts like "trustworthiness" (or the closely related "truthful") and motivated the AI to follow them. Here we will try the opposite: define "betrayal", and motivate the AIs to avoid it.

Why do you think the betrayal approach is more tractable or useful? It's not clear from the post.

To your first point - I agree both with why we limited the scope (but also, it was partly just personal interests), and that there should be more of this kind of work on other classes of risk. However, my impression is the literature and "public" engagement (e.g. EA forum, LessWrong) on catastrophic AI misuse/structural risk is too small to even get traction on work like this. We might first need more work to lay out the best arguments. Having said that, I'm aware of a fair amount of writing which I haven't got around to reading. So I am probably misjudgin... (read more)

My impression is that there is indeed substantially less literature on misuse risk and structural risk, compared to accident risk, in relation to AI x-risk. (I'm less confident when it comes to a broader set of negative outcomes, not just x-risks, but that's also less relevant here and less important to me.) I do think that that might make the sort of work this post does less interesting if done in relation to those less-discussed types of risks, since there fewer disagreements have been revealed, so there's less to analyse and summarise. That said, I still expect interesting stuff along these lines could be done on those topics. It just might be a quicker job with a smaller output than this post. I collected a handful of relevant sources and ideas here. I think someone reading those things and providing a sort of summary, analysis, and/or mapping could be pretty handy, and might even be doable in just a day or so of work. It might also be relatively easy to provide more "novel ideas" in the course of that work than it would've been for your post, since misuse/structural risks seem like less charted territory. (Unfortunately I'm unlikely to do this myself, as I'm currently focused on nuclear war risk.) --- A separate point is that I'd guess that one reason why there's less work on misuse/structural AI x-risk than on accidental AI x-risk is that a lot of people aren't aware of those other categories of risks, or rarely think about them, or assume the risks are much smaller. And I think one reason for that is that people often write or talk about "AI x-risk" while actually only mentioning accidental AI x-risk. That's part of why I say "So, personally, I think I’d have made that choice of scope even more explicit." (But again, I do very much like this post overall. And as a target of this quibble of mine, you're in good company - I have the same quibble with The Precipice. I think one of the quibbles I most often have with posts I like is "This post seems to imply

It's great to hear your thoughts on the post!

I'd also like to see more posts that do this sort of "mapping". I think that mapping AI risk arguments is too neglected - more discussion and examples in this post by Gyrodiot. I'm continuing to work collaboratively in this area in my spare time, and I'm excited that more people are getting involved.

We weren't trying to fully account for AGI timelines - our choice of scope was based on a mix of personal interest and importance. I know people currently working on posts similar to this that will go in-depth on tim... (read more)

A system capable of reasoning about optimization is likely also capable of reusing that same machinery to do optimization itself

I'm confused about this. I tried substituting different words for "optimisation":

"A system capable of reasoning about photosynthesis is likely also capable of reusing that same machinery to do photosynthesis itself." Nope.

"A system capable of reasoning about arithmetic is likely also capable of reusing that same machinery to do arithmetic itself". Maybe? The rules of arithmetic can be reused, but the machinery to reason abstrac... (read more)

Thanks. I think I understand, but I'm still confused about the effect on the risk of catastrophe (i.e. not just being pseudo-aligned, but having a catastrophic real-world effect). It may help to clarify that I was mainly thinking of deceptive alignment, not other types of pseudo-alignment. And I'll admit now that I phrased the question more strongly than I actually believe, to elicit more response :)

I agree that the probability of pseudo-alignment will be the same, and that an unrecoverable action could occur despite the threat of modification. I'm interested i... (read more)

I agree with all of this—online learning doesn't change the probability of pseudo-alignment but might make it harder for a deceptively aligned model to defect. That being said, I don't think that deceptive models defecting later is necessarily a good thing—if your deceptive models start defecting sooner, but in recoverable ways, that's actually good because it gives you a warning shot. And once you have a deceptive model, it's going to try to defect against you at some point, even if it just has to gamble and defect randomly with some probability. First, I do think that worst-case guarantees are achievable if we do relaxed adversarial training with transparency tools. Second, I actually have done a bunch of probabilistic risk analysis on exactly this sort of situation here. Note, however, that the i.i.d. situation imagined in that analysis is extremely hard to realize in practice as there are fundamental distributional shifts that are very difficult to overcome—such as the distributional shift from a situation where the model can't defect profitably to a situation where it can.
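As a minimal illustration of the "defect randomly with some probability" point, the sketch below works in the idealised i.i.d. setting. It assumes the model defects independently with probability p on each episode, that any defection during N training episodes is caught and corrected, and that the model "succeeds" if it avoids defecting in training but defects at least once in M deployment episodes. The function names and these modelling assumptions are illustrative only and are not taken from the linked analysis.

```python
def success_probability(p: float, n_train: int, n_deploy: int) -> float:
    """P(no defection in training) * P(at least one defection in deployment)."""
    q = 1.0 - p  # per-episode probability of not defecting
    return q ** n_train * (1.0 - q ** n_deploy)


def best_defection_rate(n_train: int, n_deploy: int) -> float:
    """Per-episode defection probability that maximises success_probability.

    With q = 1 - p, N = n_train and M = n_deploy, the objective is
    q**N - q**(N + M). Setting its derivative to zero gives
    q**M = N / (N + M).
    """
    q = (n_train / (n_train + n_deploy)) ** (1.0 / n_deploy)
    return 1.0 - q


if __name__ == "__main__":
    n_train, n_deploy = 1000, 1000  # hypothetical episode counts
    p_star = best_defection_rate(n_train, n_deploy)
    print(p_star, success_probability(p_star, n_train, n_deploy))
```

In this toy model, equal-length training and deployment give the randomly defecting model at most a 1/4 chance of success, and the chance shrinks as training gets longer relative to deployment. As noted above, the i.i.d. assumption is very hard to realise in practice, so this should be read as an idealised bound, not a prediction.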

In the limit of training on a diverse set of tasks, we expect joint optimization of both the base and mesa- objectives to be unstable. Assuming that the mesa-optimizer converges towards behavior that is optimal from the perspective of the base optimizer, the mesa-optimizer must somehow learn the base objective.

Joint optimization may be unstable, but if the model is not trained to convergence, might it still be jointly optimizing at the end of training? This occurred to me after reading which finds that "Larger models are... (read more)

I talk about this a bit here, but basically if you train huge models for a short period of time, you're really relying on your inductive biases to find the simplest model that fits the data—and mesa-optimizers, especially deceptive mesa-optimizers, are quite simple, compressed policies.


Church's views on AI seem far away from mine and from most people's in the AI risk community, and they really intrigued me. It would be great to try to distil and summarise these views in order to update on them properly.

Model of the threat and interventions for mesa-optimization

  • Consider a chain model
    • Base optimizer
    • -> Mesa optimizer
      • Produced through optimization of base objective in training environment
    • -> Misalignment (base objective != mesa objective)
      • Different kinds of misalignment
        • Proxy: mesa objective is a proxy for the base objective in the training environment
          • Side-effect: optimizing mesa objective happens to optimize base objective
          • Instrumental: optimizing base objective happens to optimize mesa objective
        • Approximate: objectives differ due to an a
... (read more)

Thanks, that makes sense. To clarify, I realise there are references/links throughout. But I forgot that the takeoff speeds post was basically making the same claim as that quote, and so I was expecting a reference more from the biology space. And there are other places where I'm curious what informed you, e.g. the progress of guns, though that's easier to read up on myself.

A team of people including Smolensky and Schmidhuber have produced better results on a mathematics problem set by combining BERT with tensor products (Smolensky et al., 2016), a formal system for representing symbolic variables and their bindings (Schlag et al., 2019), creating a new system called TP-Transformer.

Notable that the latter paper was rejected from ICLR 2020, partly for unfair comparison. It seems unclear at present whether TP-Transformer is better than the baseline transformer.

I think this is a good analysis, and I'm really glad to see this kind of deep dive on an important crux. The most clarifying thing for me was connecting old and new arguments - they seem to have more common ground than I thought.

One thing I would appreciate is more in-text references. There are a bunch of claims here about e.g. history and evolution with no explicit reference. Maybe they seem like common knowledge, but I wasn't sure whether to believe some things, e.g.

Evolution was optimizing for fitness, and driving increases in intelligence only i
... (read more)
Sammy Martin (4y, 3 karma)
I'm glad this changed someone's mind about the connection between old/new views! The links in the text are references, and links before quotes go to the location of that quote - though there should be more and I'll add more. To clarify that section in particular, evolution is always optimizing for fitness (tautologically) but which specific traits evolution is promoting changes all the time as selection pressures shift. What Paul Christiano argued is that evolution basically was not trying to make general intelligence until very recently, and that as soon as it did try it made continuous progress. That argument was the one thing I researched that was most surprising to me, and I'm not sure why it hasn't been more commonly discussed.

Thanks! Comments are much appreciated.

Why the arrow from "agentive AI" to "humans are economically outcompeted"? The explanation makes it sound like it should point to "target loading fails"?

It's been a few months and I didn't write in detail why that arrow is there, so I can't be certain of the original reason. My understanding now: humans getting economically outcompeted means AI systems are competing with humans, and therefore optimising against humans on some level. Goal-directedness enables/worsens this.

L... (read more)