All of Rohin Shah's Comments + Replies

Did you ever look at the math in the paper? Theorem 1 looks straight up false to me (attempted counterexample) but possibly I'm misunderstanding their assumptions.

1Fabien Roger21h
I hadn't looked at their math too deeply and it indeed seems that an assumption like "the bad behavior can be achieved after an arbitrary long prompt" is missing (though maybe I missed an hypothesis where this implicitly shows up). Something along these lines is definitely needed for the summary I made to make any sense.

(cross-posted from EAF, thanks Richard for suggesting. There's more back-and-forth later.)

I'm not very compelled by this response.

It seems to me you have two points on the content of this critique. The first point:

I think it's bad to criticize labs that do hits-based research approaches for their early output (I also think this applies to your critique of Redwood) because the entire point is that you don't find a lot until you hit.

I'm pretty confused here. How exactly do you propose that funding decisions get made? If some random person says they are pursu... (read more)

5Marius Hobbhahn2d
I'm not going to crosspost our entire discussion from the EAF.  I just want to quickly mention that Rohin and I were able to understand where we have different opinions and he changed my mind about an important fact. Rohin convinced me that anti-recommendations should not have a higher bar than pro-recommendations even if they are conventionally treated this way. This felt like an important update for me and how I view the post. 

Biggest disagreement I have with him is that, from my perspective, is that he understands that the optimization deck stacks against us, but he does not understand the degree to which the optimization deck is stacked against us or the extent to which this ramps up and diversifies and changes its sources as capabilities ramp up, and thus thinks many things could work that I don’t see as having much chance of working. I also don’t think he’s thinking enough about what type and degree of alignment would ‘be enough’ to survive the resulting Phase 2.

I'm not committing to it, but if you wrote up concrete details here, I expect I'd engage with it.

8Zvi2d
Thanks! I've seen various attempts to articulate different aspects of this but I may be able to find a better way to put it, and hopefully it would at least help you understand the position better.

Sounds plausible! I haven't played much with Conway's Life.

(Btw, you may want to make this comment on the original post if you'd like the original author to see it.)

I agree "information loss" seems kinda sketchy as a description of this phenomenon, it's not what I would have chosen.

I forget if I already mentioned this to you, but another example where you can interpret randomization as worst-case reasoning is MaxEnt RL, see this paper. (I reviewed an earlier version of this paper here (review #3).)

That does seem like better terminology! I'll go change it now.

Possibly, but in at least one of the two cases I was thinking of when writing this comment (and maybe in both), I made the argument in the parent comment and the person agreed and retracted their point. (I think in both cases I was talking about deceptive alignment via goal misgeneralization.)

Okay, I understand how that addresses my edit.

I'm still not quite sure why the lightcone theorem is a "foundation" for natural abstraction (it looks to me like a nice concrete example on which you could apply techniques) but I think I should just wait for future posts, since I don't really have any concrete questions at the moment.

4Thane Ruthenis1mo
My impression is that it being a concrete example is the why. "What is the right framework to use?" and "what is the environment-structure in which natural abstractions can be defined?" are core questions of this research agenda, and this sort of multi-layer locality-including causal model is one potential answer. The fact that it loops-in the speed of causal influence is also suggestive — it seems fundamental to the structure of our universe, crops up in a lot of places, so the proposition that natural abstractions are somehow downstream of it is interesting.

Okay, that mostly makes sense.

note that the resampler itself throws away a ton of information about  while going from  to . And that is indeed information which "could have" been relevant, but almost always gets wiped out by noise. That's the information we're looking to throw away, for abstraction purposes.

I agree this is true, but why does the Lightcone theorem matter for it?

It is also a theorem that a Gibbs resampler initialized at equilibrium will produce  distributed according to , and as you say it's c... (read more)

4johnswentworth1mo
Sounds like we need to unpack what "viewing X0 as a latent which generates X" is supposed to mean. I start with a distribution P[X]. Let's say X is a bunch of rolls of a biased die, of unknown bias. But I don't know that's what X is; I just have the joint distribution of all these die-rolls. What I want to do is look at that distribution and somehow "recover" the underlying latent variable (bias of the die) and factorization, i.e. notice that I can write the distribution as P[X]=∑iP[Xi|Λ]P[Λ], where Λ is the bias in this case. Then when reasoning/updating, we can usually just think about how an individual die-roll interacts with Λ, rather than all the other rolls, which is useful insofar as Λ is much smaller than all the rolls. Note that P[X|Λ] is not supposed to match P[X]; then the representation would be useless. It's the marginal ∑iP[Xi|Λ]P[Λ] which is supposed to match P[X]. The lightcone theorem lets us do something similar. Rather all the Xi's being independent given Λ, only those Xi's sufficiently far apart are independent, but the concept is otherwise similar. We express P[X] as ∑X0P[X|X0]P[X0] (or, really, ∑ΛP[X|Λ]P[Λ], where Λ summarizes info in X0 relevant to X, which is hopefully much smaller than all of X).

The Lightcone Theorem says: conditional on , any sets of variables in  which are a distance of at least  apart in the graphical model are independent.

I am confused. This sounds to me like:

If you have sets of variables that start with no mutual information (conditioning on ), and they are so far away that nothing other than  could have affected both of them (distance of at least ), then they continue to have no mutual information (independent).

Some things that I am confused about as a result:

  1. I don't se
... (read more)
9johnswentworth1mo
Yup, that's basically it. And I agree that it's pretty obvious once you see it - the key is to notice that distance 2T implies that nothing other than X0 could have affected both of them. But man, when I didn't know that was what I should look for? Much less obvious. It does, but then XT doesn't have the same distribution as the original graphical model (unless we're running the sampler long enough to equilibrate). So we can't view X0 as a latent generating that distribution. Not quite - note that the resampler itself throws away a ton of information about X0 while going from X0 to XT. And that is indeed information which "could have" been relevant, but almost always gets wiped out by noise. That's the information we're looking to throw away, for abstraction purposes. So the reason this is interesting (for the thing you're pointing to) is not that it lets us ignore information from far-away parts of XT which could not possibly have been relevant given X0, but rather that we want to further throw away information from X0 itself (while still maintaining conditional independence at a distance).

I agree that there's a threshold for "can meaningfully build and chain novel abstractions" and this can lead to a positive feedback loop that was not previously present, but there will already be lots of positive feedback loops (such as "AI research -> better AI -> better assistance for human researchers -> AI research") and it's not clear why to expect the new feedback loop to be much more powerful than the existing ones.

(Aside: we're now talking about a discontinuity in the gradient of capabilities rather than of capabilities themselves, but sufficiently large discontinuities in the gradient of capabilities have much of the same implications.)

4Thane Ruthenis1mo
Yeah, the argument here would rely on the assumption that e. g. the extant scientific data already uniquely constraint some novel laws of physics/engineering paradigms/psychological manipulation techniques/etc., and we would be eventually able to figure them out even if science froze right this moment. In this case, the new feedback loop would be faster because superintelligent cognition would be faster than real-life experiments. And I think there's a decent amount of evidence for this. Consider that there are already narrow AIs that can solve protein folding more efficiently than our best manually-derived algorithms — which suggests that better algorithms are already uniquely constrained by the extant data, and we've just been unable to find them. Same may be true for all other domains of science — and thus, a superintelligence iterating on its own cognition would be able to outspeed human science.

Oh, I disagree with your core thesis that the general intelligence property is binary. (Which then translates into disagreements throughout the rest of the post.) But experience has taught me that this disagreement tends to be pretty intractable to talk through, and so I now try just to understand the position I don't agree with, so that I can notice if its predictions start coming true.

You mention universality, active adaptability and goal-directedness. I do think universality is binary, but I expect there are fairly continuous trends in some underlying l... (read more)

4Thane Ruthenis1mo
Interesting, thanks. Agreed that this point (universality leads to discontinuity) probably needs to be hashed out more. Roughly, my view is that universality allows the system to become self-sustaining. Prior to universality, it can't autonomously adapt to novel environments (including abstract environments, e. g. new fields of science). Its heuristics have to be refined by some external ground-truth signals, like trial-and-error experimentation or model-based policy gradients. But once the system can construct and work with self-made abstract objects, it can autonomously build chains of them [https://www.lesswrong.com/posts/w26TY2tFHRTdposre/why-are-some-problems-super-hard?commentId=CsHFp7pWJYRRfoJGX] — and that causes a shift in the architecture and internal dynamics, because now its primary method of cognition is iterating on self-derived abstraction chains, instead of using hard-coded heuristics/modules. 

Okay, this mostly makes sense now. (I still disagree but it no longer seems internally inconsistent.)

Fwiw, I feel like if I had your model, I'd be interested in:

  1. Producing tests for general intelligence. It really feels like there should be something to do here, that at least gives you significant Bayesian evidence. For example, filter the training data to remove anything talking about <some scientific field, e.g. complexity theory>, then see whether the resulting AI system can invent that field from scratch if you point it at the problems that motiva
... (read more)
2Thane Ruthenis1mo
I agree that those are useful pursuits. Mind gesturing at your disagreements? Not necessarily to argue them, just interested in the viewpoint.

Hm? "Stall at the human level" and "the discontinuity ends at or before the human level" reads like the same thing to me. What difference do you see between the two?

Discontinuity ending (without stalling):

Stalling:

Basically, except instead of directly giving it privileges/compute, I meant that we'd keep training it until the SGD gives the GI component more compute and privileges over the rest of the model (e. g., a better ability to rewrite its instincts).

Are you imagining systems that are built differently from today? Because I'm not seeing how SGD could ... (read more)

4Thane Ruthenis1mo
Ah, makes sense. I do expect that some sort of ability to reprogram itself at inference time will be ~necessary for AGI, yes. But I also had in mind something like your "SGD creates a set of weights that effectively treats the input English tokens as a programming language" example. In the unlikely case that modern transformers are AGI-complete, I'd expect something on that order of exoticism to be necessary (but it's not my baseline prediction). "Doing science" is meant to be covered by "lack of empirical evidence that there's anything in the universe that humans can't model". Doing science implies the ability to learn/invent new abstractions, and we're yet to observe any limits to how far we can take it / what that trick allows us to understand. Mmm. Consider a scheme like the following: * Let T2 be the current date. * Train an AI on all of humanity's knowledge up to a point in time T1, where T1<T2. * Assemble a list D of all scientific discoveries made in the time period (T1;T2]. * See if the AI can replicate these discoveries. At face value, if the AI can do that, it should be considered able to "do science" and therefore AGI, right? I would dispute that. If the period (T1;T2] is short enough, then it's likely that most of the cognitive work needed to make the leap to any discovery in D is already present in the data up to T1. Making a discovery from that starting point doesn't necessarily require developing new abstractions/doing science — it's possible that it may be done just by interpolating between a few already-known concepts. And here, some asymmetry between humans and e. g. SOTA LLMs becomes relevant: * No human knows everything the entire humanity knows. Imagine if making some discovery in D by interpolation required combining two very "distant" concepts, like a physics insight and advanced biology knowledge. It's unlikely that there'd be a human with sufficient expertise in both, so a human will likely do it by actual-scie

See Section 5 for more discussion of all of that.

Sorry, I seem to have missed the problems mentioned in that section on my first read.

There's no reason to expect that AGI would naturally "stall" at the exact same level of performance and restrictions.

I'm not claiming the AGI would stall at human level, I'm claiming that on your model, the discontinuity should have some decent likelihood of ending at or before human level.

(I care about this because I think it cuts against this point: We only have one shot. There will be a sharp discontinuity in capabilities... (read more)

4Thane Ruthenis1mo
Hm? "Stall at the human level" and "the discontinuity ends at or before the human level" reads like the same thing to me. What difference do you see between the two? Basically, except instead of directly giving it privileges/compute, I meant that we'd keep training it until the SGD gives the GI component more compute and privileges over the rest of the model (e. g., a better ability to rewrite its instincts). The strategy of slowly scaling our AI up is workable at the core, but IMO there are a lot of complications: * A "mildly-superhuman" AGI, or even just a genius-human AGI, is still be an omnicide risk [https://www.lesswrong.com/posts/oBBzqkZwkxDvsKBGB/ai-could-defeat-all-of-us-combined#How_AIs_could_defeat_humans_without__superintelligence_] (see also [https://www.lesswrong.com/posts/KTbGuLTnycA6wKBza/what-would-a-fight-between-humanity-and-agi-look-like]). I wouldn't want to experiment with that; I would want it safely at average-human-or-below level. It's likely hard to "catch" it at that level by inspecting its external behavior, though: can only be reliably done via advanced interpretability tools. * Deceptiveness (and manipulation) is a significant factor, as you've mentioned. Even just a mildly-superhuman AGI will likely be very good at it. Maybe not implacably good, but it'd be like working bare-handed with an extremely dangerous chemical substance, with the entire humanity at the stake. * The problem of "iterating" on this system. If we have just a "weak" AGI on our hands, it's mostly a pre-AGI system, with a "weak" general-intelligence component that doesn't control much. Any "naive" approaches, like blindly training interpretability probes on it or something, would likely ignore that weak GI component, and focus mainly on analysing or shaping heuristics/shards. To get the right kind of experience from it, we'd have to very precisely aim our experiments at the GI component — which, again, likely

What ties it all together is the belief that the general-intelligence property is binary.

Do any humans have the general-intelligence property?

If yes, after the "sharp discontinuity" occurs, why won't the AGI be like humans (in particular: generally not able to take over the world?)

If no, why do we believe the general-intelligence property exists?

7Thane Ruthenis1mo
Yes, ~all of them. Humans are not superintelligent because despite their minds embedding the algorithm for general intelligence, that algorithm is still resource-constrained (by the brain's compute) and privilege-constrained within the mind (e. g., it doesn't have full write-access to our instincts). There's no reason to expect that AGI would naturally "stall" at the exact same level of performance and restrictions. On the contrary: even if we resolve to check for "AGI-ness" often, with the intent of stopping the training the moment our AI becomes true AGI but still human-level or below it, we're likely to miss the right moment without advanced interpretability tools, and scale it past "human-level" straight to "impossible-to-ignore superintelligent". There would be no warning signs, because "weak" AGI (human-level or below) can't be clearly distinguished from a very capable pre-AGI system, based solely on externally-visible behaviour. See Section 5 [https://www.lesswrong.com/posts/3JRBqRtHBDyPE3sGa/a-case-for-the-least-forgiving-take-on-alignment#5__A_Caveat] for more discussion of all of that. Quoting from my discussion with cfoster0 [https://www.lesswrong.com/posts/3JRBqRtHBDyPE3sGa/a-case-for-the-least-forgiving-take-on-alignment?commentId=xzZsPjjEHTnoGwqbJ]:

High level response: yes, I agree that "gradient descent retargets the search" is a decent summary; I also agree the thing you outline is a plausible failure mode, but it doesn't justify confidence in doom.

  1. Our understanding of what we actually want is poor, such that we wouldn't want to optimize for how we understand what we want

I'm not very worried about this. We don't need to solve all of philosophy and morality, it would be sufficient to have the AI system to leave us in control and respect our preferences where they are clear.

  1. We poorly express our unde
... (read more)
4watermark2mo
I agree that we don't need to solve philosophy/morality if we could at least pin down things like corrigibility, but humans may poorly understand "leaving humans in control" and "respecting human preferences" such that optimizing for human abstractions of these concepts could be unsafe (this belief isn't that strongly held, I'm just considering some exotic scenarios where humans are technically 'in control' according to the specification we thought of, but the consequences are negative nonetheless, normal goodharting failure mode). Depending on the work you're asking the AI(s) to do (e.g. automating large parts of open ended software projects, automating large portions of STEM work), I'd say the world takeover/power-seeking/recursive self improvement type of scenarios happen since these tasks incentivize the development of unbounded behaviors (because open-ended, project based work doesn't have clear deadlines, may require multiple retries, and has lots of uncertainty, I can imagine unbounded behaviors like "gain more resources because that's broadly useful under uncertainty" to be strongly selected for).

So here's a paper: Fundamental Limitations of Alignment in Large Language Models. With a title like that you've got to at least skim it. Unfortunately, the quick skim makes me pretty skeptical of the paper.

The abstract says "we prove that for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt." This clearly can't be true in full generality, and I wish the abstract would give me some hint about ... (read more)

Agreed.

(In practice I think it was rare that people appealed to the robustness of evidence when citing the optimizer's curse, though nowadays I mostly don't hear it cited at all.)

Why are you considering Anthropic as a unified whole here? Sure, Anthropic as a whole is probably doing some work that is directly aimed towards advancing capabilities, but this just doesn't seem true of the interp team. (I guess you could imagine that the only reason the interp team exists at Anthropic is that Anthropic believes interp is great for advancing capabilities, but this seems pretty unlikely to me.)

(Note that the criterion is "not specifically work towards advancing capabilities", as opposed to "try not to advance capabilities".)

1James Payor2mo
I don't think that the interp team is a part of Anthropic just because they might help with a capabilities edge; seems clear they'd love the agenda to succeed in a way that leaves neural nets no smarter but much better understood. But I'm sure that it's part of the calculus that this kind of fundamental research is also worth supporting because of potential capability edges. (Especially given the importance of stuff like figuring out the right scaling laws in the competition with OpenAI.) (Fwiw I don't take issue with this sort of thing, provided the relationship isn't exploitative. Like if the people doing the interp work have some power/social capital, and reason to expect derived capabilities to be used responsibly.)

I have found much more success modeling intentions and institutional incentives at the organization level than the team level.

My guess is the interpretability team is under a lot of pressure to produce insights that would help the rest of the org with capabilities work. In-general I've found arguments of the type of "this team in this org is working towards totally different goals than the rest of the org" to have a pretty bad track record, unless you are talking about very independent and mostly remote teams.

Yes, that all seems reasonable.

I think I have still failed to communicate on (1). I'm not sure what the relevance of common sense morality is, and if a strong AI is thinking about finding a way to convince itself or us that's already the situation I want to detect and stop. (Obviously it's not clear that I can detect it, but the claim here is just that the facts of the matter are present to be detected.) But probably it's not that worth going into detail here.

On (2), the theory of change is that you don't get into the red area, which I agree is equivalent ... (read more)

Speaking on behalf of myself, rather than the full DeepMind alignment team:

The part that gives me the most worry is the section on General Hopes.

General hopes. Our plan is based on some general hopes:

  • The most harmful outcomes happen when the AI “knows” it is doing something that we don’t want, so mitigations can be targeted at this case. 
  • Our techniques don’t have to stand up to misaligned superintelligences — the hope is that they make a difference while the training process is in the gray area, not after it has reached the red area. 
  • In terms of
... (read more)
4Zvi2mo
1. I see why one might think this is a mostly safe assumption, but it doesn't seem like one to me - it's kind of presuming some sort of common sense morality can be used as a check here, even under Goodhart conditions, and I don't think it would be that reliable in most doom cases? I'm trying to operationalize what this would mean in practice in a way a sufficiently strong AI wouldn't find a way around via convincing itself or us, or via indirect action, or similar, and I can't.   2. This implies that you think that if you win the grey area you know how to use that to not lose in the red area? Perhaps via some pivotal act? I notice I am confused why you believe that? The assumption here is more 'the grey area solution would be sufficient' for some value of grey, and I'm curious both why that would work and what value of grey would count. Would do a lot of work if I bought into this. 3. Yes, these are advantages and this seems closer to a correct assumption as stated than the others. I'm still worried it's being used to carry in the other assumptions about being able to make use of the advantage, which interacts a lot with #2 I think? 4. Depends what value of impossible. It's definitely not fully impossible, and yes that is more hopeful than the alternative. It seems at least shut-up-and-do-the levels of impossible. 5. As stated strictly it's not an assumption, it's more I read it as implying the 'and when it is importantly trying to do this, our techniques will notice' part, which I indeed hope but I don't have much confidence in that. Does that help with where my head is at?

Interestingly, I apparently had a median around 2040 back in 2019, so my median is still later than it used to be prior to reading the bio anchors report.

Hmm, this is surprising. Some claims I might have made that could have led to this misunderstanding, in order of plausibility:

  • [While I was working on goal misgeneralization] "Basically all the work that I'm doing is about convincing other people that the problem is real". I might have also said something like "and most people I work with" intending to talk about my collaborators on goal misgeneralization rather than the entire DeepMind safety team(s); for at least some of the time that I was working on goal misgeneralization I was an individual contributor
... (read more)
8habryka3mo
This seems like the most likely explanation. Decent chance I interpreted "and most people I work with" as referring to the rest of the Deepmind safety team.  I still feel confused about some stuff, but I am happy to let things stand here.

I still endorse that comment, though I'll note that it argues for the much weaker claims of

  • I would not stop working on alignment research if it turned out I wasn't solving the technical alignment problem
  • There are useful impacts of alignment research other than solving the technical alignment problem

(As opposed to something more like "the main thing you should work on is 'make alignment legit'".)

(Also I'm glad to hear my comments are useful (or at least influential), thanks for letting me know!)

I think most of the goal misgeneralization examples are in fact POCs, but they're pretty weak POCs and it would be much better if we had better POCs. Here's a table of some key disanalogies:

Our examples

Deceptive alignment

Deployment behavior is similar to train behaviorBehaves well during training, executes treacherous turn on deployment
No instrumental reasoningTrain behavior relies on instrumental reasoning
Adding more diverse data would solve the problemAI would (try to) behave nicely for any test we devise
Most (but not all) were designed to show goal misg
... (read more)

(I realize this is straying pretty far from the intent of this post, so feel free to delete this comment)

I totally agree that a non-trivial portion of DeepMind's work (and especially my work) is in the "make legit" category, and I stand by that as a good thing to do, but putting all of it there seems pretty wild. Going off of a list I previously wrote about DeepMind work (this comment):

We do a lot of stuff, e.g. of the things you've listed, the Alignment / Scalable Alignment Teams have done at least some work on the following since I joined in late 2020:

  • El
... (read more)

I mean, I think my models here come literally from conversations with you, where I am pretty sure you have said things like (paraphrased) "basically all the work I do at Deepmind and the work of most other people I work with at Deepmind is about 'trying to demonstrate the difficulty of the problem' and 'convincing other people at Deepmind the problem is real'". 

In as much as you are now claiming that is only 10%-20% of the work, that would be extremely surprising to me and I do think would really be in pretty direct contradiction with other things we ... (read more)

Indeed I am confused why people think Goodharting is effectively-100%-likely to happen and also lead to all the humans dying. Seems incredibly extreme. All the examples people give of Goodharting do not lead to all the humans dying.

(Yes, I'm aware that the arguments are more sophisticated than that and "previous examples of Goodharting didn't lead to extinction" isn't a rebuttal to them, but that response does capture some of my attitude towards the more sophisticated arguments, something like "that's a wildly strong conclusion you've drawn from a pretty h... (read more)

I'm not claiming that you figure out whether the model's underlying motivations are bad. (Or, reading back what I wrote, I did say that but it's not what I meant, sorry about that.) I'm saying that when the model's underlying motivations are bad, it may take some bad action. If you notice and penalize that just because the action is bad, without ever figuring out whether the underlying motivation was bad or not, that still selects against models with bad motivations.

It's plausible that you then get a model with bad motivations that knows not to produce bad... (read more)

2dxu3mo
but, but, standard counterargument imperfect proxies Goodharting magnification of error adversarial amplification etc etc etc? (It feels weird that this is a point that seems to consistently come up in discussions of this type, considering how basic of a disagreement it really is, but it really does seem to me like lots of things come back to this over and over again?)

Sorry, I think that particular sentence of mine was poorly written (and got appropriate pushback at the time). I still endorse my followup comment, which includes this clarification:

The thing I'm not that interested in (from a "how scared should we be" or "timelines" perspective) is when you take a bunch of different tasks, shove them into a single "generic agent", and the resulting agent is worse on most of the tasks and isn't correspondingly better at some new task that none of the previous systems could do.

In particular, my impression with Gato is that ... (read more)

I think you're missing the primary theory of change for all of these techniques, which I would say is particularly compatible with your "follow-the-trying" approach.

While all of these are often analyzed from the perspective of "suppose you have a potentially-misaligned powerful AI; here's what would happen", I view that as an analysis tool, not the primary theory of change.

The theory of change that I most buy is that as you are training your model, while it is developing the "trying", you would like it to develop good "trying" and not bad "trying", and one... (read more)

4Steven Byrnes3mo
Thanks, that helps! You’re working under a different development model than me, but that’s fine. It seems to me that the real key ingredient in this story is where you propose to update the model based on motivation and not just behavior—“penalize it instead of rewarding it” if the outputs are “due to instrumental / deceptive reasoning”. That’s great. Definitely what we want to do. I want to zoom in on that part. You write that “debate / RRM / ELK” are supposed to “allow you to notice” instrumental / deceptive reasoning. Of these three, I buy the ELK story—ELK is sorta an interpretability technique, so it seems plausible that ELK is relevant to noticing deceptive motivations (even if the ELK literature is not really talking about that too much at this stage, per Paul’s comment [https://www.lesswrong.com/posts/PDx4ueLpvz5gxPEus/why-i-m-not-working-on-debate-rrm-elk-natural-abstractions?commentId=Lusyr3RdHCHvdNKXn]). But what about debate & RRM? I’m more confused about why you brought those up in this context. Traditionally, those techniques are focused on what the model is outputting, not what the model’s underlying motivations are. But I haven’t read all the literature. Am I missing something? (We can give the debaters / the reward model a printout of model activations alongside the model’s behavioral outputs. But I’m not sure what the next step of the story is, after that. How do the debaters / reward model learn to skillfully interpret the model activations to extract underlying motivations?)

Yes, that's right, though I'd say "probable" not "possible" (most things are "possible").

Depends what the aligned sovereign does! Also depends what you mean by a pivotal act!

In practice, during the period of time where biological humans are still doing a meaningful part of alignment work, I don't expect us to build an aligned sovereign, nor do I expect to build a single misaligned AI that takes over: I instead expect there to be a large number of AI systems, that could together obtain a decisive strategic advantage, but could not do so individually.

6David Johnston3mo
So, if I'm understanding you correctly: * if it's possible to build a single AI system that executes a catastrophic takeover (via self-bootstrap or whatever), it's also probably possible to build a single aligned sovereign, and so in this situation winning once is sufficient * if it is not possible to build a single aligned sovereign, then it's probably also not possible to build a single system that executes a catastrophic takeover and so the proposition that the model only has to win once is not true in any straightforward way * in this case, we might be able to think of "composite AI systems" that can catastrophically take over or end the acute risk period, and for similar reasons as in the first scenario, winning once with a composite system is sufficient, but such systems are not built from single acts and you think the second scenario is more likely than the first.

I think that skews it somewhat but not very much. We only have to "win" once in the sense that we only need to build an aligned Sovereign that ends the acute risk period once, similarly to how we only have to "lose" once in the sense that we only need to build a misaligned superintelligence that kills everyone once.

(I like the discussion on similar points in the strategy-stealing assumption.)

7David Johnston3mo
Is building an aligned sovereign to end the acute risk period different to a pivotal act in your view?

[People at AI labs] expected heavy scrutiny by leadership and communications teams on what they can state publicly. [...] One discussion with a person working at DeepMind is pending approval before publication. [...] We think organizations discouraging their employees from speaking openly about their views on AI risk is harmful, and we want to encourage more openness.

(I'm the person in question.)

I just want to note that in the case of DeepMind:

  • I don't expect "heavy" scrutiny by leadership and communications teams (though it is not literally zero)
  • For the di
... (read more)
9RobertM4mo
Just as a data point, the impression I got with respect to DeepMind was that they'd approved the conversation (contra some other orgs, for which the post said otherwise) and the review was in progress.

Suppose you have some deep learning model M_orig that you are finetuning to avoid some particular kind of failure. Suppose all of the following hold:

  1. Capable model: The base model has the necessary capabilities and knowledge to avoid the failure.
  2. Malleable motivations: There is a "nearby" model M_good (i.e. a model with minor changes to the weights relative to the M_orig) that uses its capabilities to avoid the failure. (Combined with (1), this means it behaves like M_orig except in cases that show the failure, where it does something better.)
  3. Stron
... (read more)
2Gunnar_Zarncke4mo
I agree that 1.+2. are not the problem. I see 3. more of a longer-term issue for reflective models and the current problems in 4. and 5. 3. I don't know about "the shape of the loss landscape" but there will be problems with "the developers wrote correct code" because "correct" here includes that it doesn't have side-effects that the model can self-exploit (though I don't think this is the biggest problem). 4. Correct rewards means two things:  * a) That there is actual and sufficient reward for correct behavior. I think that was not the case with Bing. * b) That we understand all the consequences of the reward - at least sufficiently to avoid goodharting but also long-term consequences. It seems there was more work on a) with ChatGPT, but there was goodharting and even with ChatGPT one can imagine a lot of value lost due to exclusion of human values. 5. It seems clear that the ChatGPT training didn't include enough exploration and with smarter moders that have access to their own output (Bing) there will be incredible amounts of potential failure modes. I think that an adversarial mindset is needed to come up with ways to limit the exploration space drastically.

In that case, I think you should try and find out what the incentive gradient is like for other people before prescribing the actions that they should take. I'd predict that for a lot of alignment researchers your list of incentives mostly doesn't resonate, relative to things like:

  1. Active discomfort at potentially contributing to a problem that could end humanity
  2. Social pressure + status incentives from EAs / rationalists to work on safety and not capabilities
  3. Desire to work on philosophical or mathematical puzzles, rather than mucking around in the weeds of
... (read more)

conversely, a lot of people in the wider world that don’t work on alignment directly make relevant decisions that will affect alignment and AI, and think about alignment too (likely many more of those exist than “pure” alignment researchers, and this post is addressed to them too!)

  1. Your post is titled "Don't accelerate problems you're trying to solve". Given that the problem you're considering is "misalignment", I would have thought that the people trying to solve the problem are those who work on alignment.
  2. The first sentence of your post is "If one believe
... (read more)

I continue to think that if your model is that capabilities follow an exponential (i.e. dC/dt = kC), then there is nothing to be gained by thinking about compounding. You just estimate how much time it would have taken for the rest of the field to make an equal amount of capabilities progress now. That's the amount you shortened timelines by; there's no change from compounding effects.

if you make a discovery now worth 2 billion dollars [...] If you make this 2 billion dollar discovery later

Two responses:

  1. Why are you measuring value in dollars? That is both
... (read more)

Errr, I feel like we already agree on this point?

Yes, sorry, I realized that right after I posted and replaced it with a better response, but apparently you already saw it :(

What I meant to say was "I think most of the time closing overhangs is more negative than positive, and I think it makes sense to apply OP's higher bar of scrutiny to any proposed overhang-closing proposal".

But like, why? I wish people would argue for this instead of flatly asserting it and then talking about increased scrutiny or burdens of proof (which I also don't like).

4leogao4mo
I think maybe the crux is the part about the strength of the incentives towards doing capabilities. From my perspective it generally seems like this incentive gradient is pretty real: getting funded for capabilities is a lot easier, it's a lot more prestigious and high status in the mainstream, etc. I also myself viscerally feel the pull of wishful thinking (I really want to be wrong about high P(doom)!) and spend a lot of willpower trying to combat it (but also not so much that I fail to update where things genuinely are not as bad as I would expect, but also not allowing that to be an excuse for wishful thinking, etc...). 

To me, it seems like the claim that is (implicitly) being made here is that small improvements early on compound to have much bigger impacts later on, and also a larger shortening of the overall timeline to some threshold. 

As you note, the second claim is false for the model the OP mentions. I don't care about the first claim once you know whether the second claim is true or false, which is the important part.

I agree it could be true in practice in other models but I am unhappy about the pattern where someone makes a claim based on arguments that are ... (read more)

4leogao4mo
"This [model] is zero evidence for the claim" is a roughly accurate view of my opinion. I think you're right that epistemically it would have been much better for me to have said something along those lines. Will edit something into my original comment.
4leogao4mo
Errr, I feel like we already agree on this point? Like I'm saying almost exactly the same thing you're saying; sorry if I didn't make it prominent enough: I'm also not claiming this is an accurate model; I think I have quite a bit of uncertainty as to what model makes the most sense. I was not intending to make a claim of this strength, so I'll walk back what I said. What I meant to say was "I think most of the time the benefit of closing overhangs is much smaller than the cost of reduced timelines, and I think it makes sense to apply OP's higher bar of scrutiny to any proposed overhang-closing proposal". I think I was thinking too much inside my inside view when writing the comment, and baking in a few other assumptions from my model (including: closing overhangs benefiting capabilities at least as much, research being kinda inefficient (though not as inefficient as OP thinks probably)). I think on an outside view I would endorse a weaker but directionally same version of my claim. In my work I do try to avoid advancing capabilities where possible, though I think I can always do better at this.

Progress often follows an s-curve, which appears exponential until the current research direction is exploited and tapers off. Moving an exponential up, even a little, early on can have large downstream consequences:

Your graph shows "a small increase" that represents progress that is equal to an advance of a third to a half the time left until catastrophe on the default trajectory. That's not small! That's as much progress as everyone else combined achieves in a third of the time till catastrophic models! It feels like you'd have to figure out some newer e... (read more)

Your graph shows "a small increase" that represents progress that is equal to an advance of a third to a half the time left until catastrophe on the default trajectory. That's not small! That's as much progress as everyone else combined achieves in a third of the time till catastrophic models! It feels like you'd have to figure out some newer efficient training that allows you to get GPT-3 levels of performance with GPT-2 levels of compute to have an effect that was plausibly that large.

In general I wish you would actually write down equations for your mod

... (read more)

I think this is pretty false. There's no equivalent to Let's think about slowing down AI, or a tag like Restrain AI Development (both of which are advocating an even stronger claim than just "caution") -- there's a few paragraphs in Paul's post, one short comment by me, and one short post by Kaj. I'd say that hardly any optimization has gone into arguments to AI safety researchers for advancing capabilities. 
[...]
(I agree in the wider world there's a lot more optimization for arguments in favor of capabilities progress that people in general would fin

... (read more)
5MichaelStJules4mo
A horizontal shift seems more realistic. You just compress one part of the curve or skip ahead along it. Then a small shift early on leads to an equally small shift later. Of course, there are other considerations like overhang and community effects that complicate the picture, but this seems like a better place to start.
6Richard Korzekwa 4mo
Yes, I was going to say something similar. It looks like the value of the purple curve is about double the blue curve when the purple curve hits AGI. If they have the same doubling time, that means the "small" increase is a full doubling of progress, all in one go. Also, the time you arrive ahead of the original curve is equal to the time it takes the original curve to catch up with you. So if your "small" jump gets you to AGI in 10 years instead of 15, then your "small" jump represents 5 years of progress. This is easier to see on a log plot:
leogao4moΩ6159

Not OP, just some personal takes:

That's not small!

To me, it seems like the claim that is (implicitly) being made here is that small improvements early on compound to have much bigger impacts later on, and also a larger shortening of the overall timeline to some threshold. (To be clear, I don't think the exponential model presented provides evidence for this latter claim)

I think the first claim is obviously true. The second claim could be true in practice, though I feel quite uncertain about this. It happens to be false in the specific model of moving an ex... (read more)

And now, it seems like we agree that the pseudocode I gave isn't a grader-optimizer for the grader self.diamondShard(self.WM.getConseq(plan)), and that e.g. approval-directed agents are grader-optimizers for some idealized function of human-approval? That seems like a substantial resolution of disagreement, no? 

I don't think I agree with this.

At a high level, your argument can be thought of as having two steps:

  1. Grader-optimizers are bad, because of problem P.
  2. Approval-directed agents / [things built by IDA, debate, RRM] are grader-optimizers.

I've been t... (read more)

Nice comment!

The arguments you outline are the sort of arguments that have been considered at CHAI and MIRI quite a bit (at least historically). The main issue I have with this sort of work is that it talks about how an agent should reason, whereas in my view the problem is that even if we knew how an agent should reason we wouldn't know how to build an agent that efficiently implements that reasoning (particularly in the neural network paradigm). So I personally work more on the latter problem: supposing we know how we want the agent to reason, how do we ... (read more)

3RogerDearnaley4mo
I agree, this is only a proposal for a solution to the outer alignment problem. On the optimizer's curse, information value and risk aversion aspects you mention, I think I agree that a sufficiently rational agent should already be thinking like that: any GAI that is somehow still treating the universe like a black-box multi-armed bandit isn't going to live very long and should fairly easy to defeat (hand it 1/epsilon opportunities to make a fatal mistake, all labeled with symbols it has never seen before). Optimizing while not allowing for the optimizer's curse is also treating the universe like a multi-armed bandit, not even with probability epsilon of exploring: you're doing a cheap all-exploration strategy on your utility uncertainty estimates, which will cause you to sequentially pull the handles on all your overestimates until you discover the hard way that they're all just overestimates. This is not rational behavior for a powerful optimizer, at least in the presence of the possibility of a really bad outcome, so not doing it should be convergent, and we shouldn't build a near-human AI that is still making that mistake.  Edit: I expanded this comment into a post, at: https://www.lesswrong.com/posts/ZqTQtEvBQhiGy6y7p/breaking-the-optimizer-s-curse-and-consequences-for-1 [https://www.lesswrong.com/posts/ZqTQtEvBQhiGy6y7p/breaking-the-optimizer-s-curse-and-consequences-for-1] 

I don't really disagree with any of what you're saying but I also don't see why it matters.

I consider myself to be saying "you can't just abstract this system as 'trying to make evaluations come out high'; the dynamics really do matter, and considering the situation in more detail does change the conclusions."

I'm on board with the first part of this, but I still don't see the part where it changes any conclusions. From my perspective your responses are of the form "well, no, your abstract argument neglects X, Y and Z details" rather than explaining how X, ... (read more)

2TurnTrout4mo
Uh, I'm confused. From your original comment [https://www.lesswrong.com/posts/rauMEna2ddf26BqiE/alignment-allows-nonrobust-decision-influences-and-doesn-t?commentId=DXJKvisMDqco5Trb5] in this thread: You also said: And now, it seems like we agree that the pseudocode I gave isn't a grader-optimizer for the grader self.diamondShard(self.WM.getConseq(plan)), and that e.g. approval-directed agents are grader-optimizers for some idealized function of human-approval? That seems like a substantial resolution of disagreement, no?  Sounds like we mostly disagree on cumulative effort to: (get a grader-optimizer to do good things) vs (get a values-executing agent to do good things).  We probably perceive the difficulty as follows: 1. Getting the target configuration into an agent 1. Grader-optimization 1. Alex: Very very hard 2. Rohin: Hard 2. Values-executing 1. Alex: Moderate/hard 2. Rohin: Hard 2. Aligning the target configuration such that good things happen (e.g. makes diamonds), conditional on the intended cognitive patterns being instilled to begin with (step 1) 1. Grader-optimization 1. Alex: Extremely hard 2. Rohin: Very hard 2. Values-executing 1. Alex: Hard 2. Rohin: Hard Does this seem reasonable? We would then mostly disagree on relative difficulty of 1a vs 1b.  -------------------------------------------------------------------------------- Separately, I apologize for having given an incorrect answer earlier, which you then adopted, and then I berated you for adopting my own incorrect answer -- how simplistic of you! Urgh.  I had said: But I should also have mentioned the change in planModificationSample. Sorry about that.
1Noosphere894mo
I think the truth is even worse than that, in that deceptively aligned value children are the default scenario without myopia and solely causal decision theory or variants of it, and Turntrout either has good arguments for why this isn't favored, or Turntrout hopes that deceptive alignment is defined away. For reasons why deceptive alignment might be favored, I have a link below: https://www.lesswrong.com/posts/A9NxPTwbw6r6Awuwt/how-likely-is-deceptive-alignment [https://www.lesswrong.com/posts/A9NxPTwbw6r6Awuwt/how-likely-is-deceptive-alignment]

The edits help, thanks. I was in large part reacting to the fact that Kaj's post reads very differently from your summary of Bad Argument 1 (rather than the fact that I don't make Bad Argument 1). In the introductory paragraph where he states his position (the third paragraph of the post), he concludes:

Thus by doing capabilities research now, we buy ourselves a longer time period in which it's possible to do more effective alignment research.

Which is clearly not equivalent to "alignment researchers hibernate for N years and then get back to work".

Plausibly... (read more)

4Steven Byrnes4mo
On further reflection, I promoted the thing from a footnote to the main text, elaborated on it, and added another thing at the end. (I think I wrote this post in a snarkier way than my usual style, and I regret that. Thanks again for the pushback.)

Fwiw, when talking about risks from deploying a technology / product, "accident" seems (to me) much more like ascribing blame ("why didn't they deal with this problem?"), e.g. the Boeing 737-MAX incidents are "accidents" and people do blame Boeing for them. In contrast "structural" feels much more like "the problem was in the structure, there was no specific person or organization that was in the wrong".

I agree that in situations that aren't about deploying a technology / product, "accident" conveys a lack of blameworthiness.

2David Scott Krueger (formerly: capybaralet)4mo
Hmm... this is a good point. I think structural risk is often a better description of reality, but I can see a rhetorical argument against framing things that way.  One problem I see with doing that is that I think it leads people to think the solution is just for AI developers to be more careful, rather than observing that there will be structural incentives (etc.) pushing for less caution.

I recall some people in CHAI working on a minecraft AI that could help players do useful tasks the players wanted.

I think you're probably talking about my work. This is more of a long-term vision; it isn't doable (currently) at academic scales of compute. See also the "Advantages of BASALT" section of this post.

(Also I just generically default to Minecraft when I'm thinking of ML experiments that need to mimic some aspect of the real world, precisely because "the game getting played here is basically the same thing real life society is playing".)

Okay, then let me try to directly resolve my confusion. My current understanding is something like - in both humans and AIs, you have a blob of compute with certain structural parameters, and then you feed it training data. On this model, we've screened off evolution, the size of the genome, etc - all of that is going into the "with certain structural parameters" part of the blob of compute. So could an AI engineer create an AI blob of compute the same size as the brain, with its same structural parameters, feed it the same training data, and get the same

... (read more)
Load More