(cross-posted from EAF, thanks Richard for suggesting. There's more back-and-forth later.)
I'm not very compelled by this response.
It seems to me you have two points on the content of this critique. The first point:
I think it's bad to criticize labs that pursue hits-based research approaches for their early output (I also think this applies to your critique of Redwood), because the entire point is that you don't find a lot until you hit.
I'm pretty confused here. How exactly do you propose that funding decisions get made? If some random person says they are pursu...
Biggest disagreement I have with him is that, from my perspective, he understands that the optimization deck is stacked against us, but he does not understand the degree to which it is stacked against us or the extent to which this ramps up and diversifies and changes its sources as capabilities ramp up, and thus thinks many things could work that I don’t see as having much chance of working. I also don’t think he’s thinking enough about what type and degree of alignment would ‘be enough’ to survive the resulting Phase 2.
I'm not committing to it, but if you wrote up concrete details here, I expect I'd engage with it.
Sounds plausible! I haven't played much with Conway's Life.
(Btw, you may want to make this comment on the original post if you'd like the original author to see it.)
I agree "information loss" seems kinda sketchy as a description of this phenomenon, it's not what I would have chosen.
I forget if I already mentioned this to you, but another example where you can interpret randomization as worst-case reasoning is MaxEnt RL, see this paper. (I reviewed an earlier version of this paper here (review #3).)
Possibly, but in at least one of the two cases I was thinking of when writing this comment (and maybe in both), I made the argument in the parent comment and the person agreed and retracted their point. (I think in both cases I was talking about deceptive alignment via goal misgeneralization.)
Okay, I understand how that addresses my edit.
I'm still not quite sure why the lightcone theorem is a "foundation" for natural abstraction (it looks to me like a nice concrete example on which you could apply techniques) but I think I should just wait for future posts, since I don't really have any concrete questions at the moment.
Okay, that mostly makes sense.
note that the resampler itself throws away a ton of information about X^0 while going from X^0 to X^T. And that is indeed information which "could have" been relevant, but almost always gets wiped out by noise. That's the information we're looking to throw away, for abstraction purposes.
I agree this is true, but why does the Lightcone theorem matter for it?
It is also a theorem that a Gibbs resampler initialized at equilibrium will produce X^T distributed according to P[X^T], and as you say it's c...
The Lightcone Theorem says: conditional on X^0, any sets of variables in X^T which are a distance of at least 2T apart in the graphical model are independent.
I am confused. This sounds to me like:
If you have sets of variables that start with no mutual information (conditioning on X^0), and they are so far away that nothing other than X^0 could have affected both of them (distance of at least 2T), then they continue to have no mutual information (independent).
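To check whether I'm picturing this correctly, here's a toy simulation of the "information travels at most one step per update" picture (my own construction with made-up parameters, using synchronous local updates rather than an actual Gibbs resampler):

```python
import random

def evolve(x0, steps, noise_seed):
    """Synchronous local updates: each cell's next value depends only on its
    immediate neighbours and fresh per-cell noise (fixed by noise_seed)."""
    rng = random.Random(noise_seed)
    noise = [[rng.random() for _ in x0] for _ in range(steps)]
    x = list(x0)
    n = len(x)
    for t in range(steps):
        x = [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3 + noise[t][i]
             for i in range(n)]
    return x

N, T = 30, 3
base = [0.0] * N
perturbed = [5.0] + [0.0] * (N - 1)  # change the initial condition at one end

a = evolve(base, T, noise_seed=42)
b = evolve(perturbed, T, noise_seed=42)

# Influence propagates at most one cell per step, so after T steps only the
# cells within distance T of the perturbation can differ.
changed = [i for i in range(N) if a[i] != b[i]]
print(max(changed))  # → 3, i.e. exactly T
```

Conditional on the initial state, two cells at distance at least 2T have disjoint dependence cones (disjoint slices of the initial state and of the noise), which is how I'm reading the independence claim.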
Some things that I am confused about as a result:
I agree that there's a threshold for "can meaningfully build and chain novel abstractions" and this can lead to a positive feedback loop that was not previously present, but there will already be lots of positive feedback loops (such as "AI research -> better AI -> better assistance for human researchers -> AI research") and it's not clear why to expect the new feedback loop to be much more powerful than the existing ones.
(Aside: we're now talking about a discontinuity in the gradient of capabilities rather than of capabilities themselves, but sufficiently large discontinuities in the gradient of capabilities have much of the same implications.)
Oh, I disagree with your core thesis that the general intelligence property is binary. (Which then translates into disagreements throughout the rest of the post.) But experience has taught me that this disagreement tends to be pretty intractable to talk through, and so I now try just to understand the position I don't agree with, so that I can notice if its predictions start coming true.
You mention universality, active adaptability and goal-directedness. I do think universality is binary, but I expect there are fairly continuous trends in some underlying l...
Okay, this mostly makes sense now. (I still disagree but it no longer seems internally inconsistent.)
Fwiw, I feel like if I had your model, I'd be interested in:
Hm? "Stall at the human level" and "the discontinuity ends at or before the human level" reads like the same thing to me. What difference do you see between the two?
Discontinuity ending (without stalling):
Stalling:
Basically, except instead of directly giving it privileges/compute, I meant that we'd keep training it until SGD gives the GI component more compute and privileges over the rest of the model (e.g., a better ability to rewrite its instincts).
Are you imagining systems that are built differently from today? Because I'm not seeing how SGD could ...
See Section 5 for more discussion of all of that.
Sorry, I seem to have missed the problems mentioned in that section on my first read.
There's no reason to expect that AGI would naturally "stall" at the exact same level of performance and restrictions.
I'm not claiming the AGI would stall at human level, I'm claiming that on your model, the discontinuity should have some decent likelihood of ending at or before human level.
(I care about this because I think it cuts against this point: We only have one shot. There will be a sharp discontinuity in capabilities...
What ties it all together is the belief that the general-intelligence property is binary.
Do any humans have the general-intelligence property?
If yes, after the "sharp discontinuity" occurs, why won't the AGI be like humans (in particular, generally not able to take over the world)?
If no, why do we believe the general-intelligence property exists?
High level response: yes, I agree that "gradient descent retargets the search" is a decent summary; I also agree the thing you outline is a plausible failure mode, but it doesn't justify confidence in doom.
- Our understanding of what we actually want is poor, such that we wouldn't want to optimize for how we understand what we want
I'm not very worried about this. We don't need to solve all of philosophy and morality; it would be sufficient to have the AI system leave us in control and respect our preferences where they are clear.
...
- We poorly express our unde
So here's a paper: Fundamental Limitations of Alignment in Large Language Models. With a title like that you've got to at least skim it. Unfortunately, the quick skim makes me pretty skeptical of the paper.
The abstract says "we prove that for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt." This clearly can't be true in full generality, and I wish the abstract would give me some hint about ...
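As a degenerate illustration of why it can't hold in full generality (my own toy construction, not anything from the paper): a model that ignores its prompt exhibits every behavior with a fixed probability, so no prompt of any length raises it.

```python
# A degenerate "language model" whose output distribution ignores the prompt.
def prompt_ignoring_lm(prompt: str) -> dict:
    return {"good": 0.9, "bad": 0.1}  # fixed, prompt-independent distribution

# The probability of the "bad" behavior never increases with prompt length.
probs = [prompt_ignoring_lm("x" * n)["bad"] for n in (1, 10, 1000)]
print(probs)  # → [0.1, 0.1, 0.1]
```

So presumably the theorem requires some assumption that rules this sort of model out, which is the kind of hint I was hoping to get from the abstract.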
Agreed.
(In practice I think it was rare that people appealed to the robustness of evidence when citing the optimizer's curse, though nowadays I mostly don't hear it cited at all.)
Why are you considering Anthropic as a unified whole here? Sure, Anthropic as a whole is probably doing some work that is directly aimed towards advancing capabilities, but this just doesn't seem true of the interp team. (I guess you could imagine that the only reason the interp team exists at Anthropic is that Anthropic believes interp is great for advancing capabilities, but this seems pretty unlikely to me.)
(Note that the criterion is "not specifically work towards advancing capabilities", as opposed to "try not to advance capabilities".)
I have found much more success modeling intentions and institutional incentives at the organization level than the team level.
My guess is the interpretability team is under a lot of pressure to produce insights that would help the rest of the org with capabilities work. In general I've found arguments of the form "this team in this org is working towards totally different goals than the rest of the org" to have a pretty bad track record, unless you are talking about very independent and mostly remote teams.
Yes, that all seems reasonable.
I think I have still failed to communicate on (1). I'm not sure what the relevance of common sense morality is, and if a strong AI is thinking about finding a way to convince itself or us, that's already the situation I want to detect and stop. (Obviously it's not clear that I can detect it, but the claim here is just that the facts of the matter are present to be detected.) But probably it's not that worth going into detail here.
On (2), the theory of change is that you don't get into the red area, which I agree is equivalent ...
Speaking just for myself, rather than for the full DeepMind alignment team:
...The part that gives me the most worry is the section on General Hopes.
General hopes. Our plan is based on some general hopes:
- The most harmful outcomes happen when the AI “knows” it is doing something that we don’t want, so mitigations can be targeted at this case.
- Our techniques don’t have to stand up to misaligned superintelligences — the hope is that they make a difference while the training process is in the gray area, not after it has reached the red area.
- In terms of
Interestingly, I apparently had a median around 2040 back in 2019, so my median is still later than it used to be prior to reading the bio anchors report.
Hmm, this is surprising. Some claims I might have made that could have led to this misunderstanding, in order of plausibility:
I still endorse that comment, though I'll note that it argues for the much weaker claims of
(As opposed to something more like "the main thing you should work on is 'make alignment legit'".)
(Also I'm glad to hear my comments are useful (or at least influential), thanks for letting me know!)
I think most of the goal misgeneralization examples are in fact POCs, but they're pretty weak POCs and it would be much better if we had better POCs. Here's a table of some key disanalogies:
| Our examples | Deceptive alignment |
| --- | --- |
| Deployment behavior is similar to train behavior | Behaves well during training, executes treacherous turn on deployment |
| No instrumental reasoning | Train behavior relies on instrumental reasoning |
| Adding more diverse data would solve the problem | AI would (try to) behave nicely for any test we devise |
| Most (but not all) were designed to show goal misg |
(I realize this is straying pretty far from the intent of this post, so feel free to delete this comment)
I totally agree that a non-trivial portion of DeepMind's work (and especially my work) is in the "make legit" category, and I stand by that as a good thing to do, but putting all of it there seems pretty wild. Going off of a list I previously wrote about DeepMind work (this comment):
...We do a lot of stuff, e.g. of the things you've listed, the Alignment / Scalable Alignment Teams have done at least some work on the following since I joined in late 2020:
- El
I mean, I think my models here come literally from conversations with you, where I am pretty sure you have said things like (paraphrased) "basically all the work I do at Deepmind and the work of most other people I work with at Deepmind is about 'trying to demonstrate the difficulty of the problem' and 'convincing other people at Deepmind the problem is real'".
In as much as you are now claiming that is only 10%-20% of the work, that would be extremely surprising to me and I do think would really be in pretty direct contradiction with other things we ...
Indeed I am confused why people think Goodharting is effectively-100%-likely to happen and also lead to all the humans dying. Seems incredibly extreme. All the examples people give of Goodharting do not lead to all the humans dying.
(Yes, I'm aware that the arguments are more sophisticated than that and "previous examples of Goodharting didn't lead to extinction" isn't a rebuttal to them, but that response does capture some of my attitude towards the more sophisticated arguments, something like "that's a wildly strong conclusion you've drawn from a pretty h...
I'm not claiming that you figure out whether the model's underlying motivations are bad. (Or, reading back what I wrote, I did say that but it's not what I meant, sorry about that.) I'm saying that when the model's underlying motivations are bad, it may take some bad action. If you notice and penalize that just because the action is bad, without ever figuring out whether the underlying motivation was bad or not, that still selects against models with bad motivations.
It's plausible that you then get a model with bad motivations that knows not to produce bad...
Sorry, I think that particular sentence of mine was poorly written (and got appropriate pushback at the time). I still endorse my followup comment, which includes this clarification:
The thing I'm not that interested in (from a "how scared should we be" or "timelines" perspective) is when you take a bunch of different tasks, shove them into a single "generic agent", and the resulting agent is worse on most of the tasks and isn't correspondingly better at some new task that none of the previous systems could do.
In particular, my impression with Gato is that ...
I think you're missing the primary theory of change for all of these techniques, which I would say is particularly compatible with your "follow-the-trying" approach.
While all of these are often analyzed from the perspective of "suppose you have a potentially-misaligned powerful AI; here's what would happen", I view that as an analysis tool, not the primary theory of change.
The theory of change that I most buy is that as you are training your model, while it is developing the "trying", you would like it to develop good "trying" and not bad "trying", and one...
Depends what the aligned sovereign does! Also depends what you mean by a pivotal act!
In practice, during the period of time where biological humans are still doing a meaningful part of alignment work, I don't expect us to build an aligned sovereign, nor do I expect to build a single misaligned AI that takes over: I instead expect there to be a large number of AI systems that could together obtain a decisive strategic advantage, but could not do so individually.
I think that skews it somewhat but not very much. We only have to "win" once in the sense that we only need to build an aligned Sovereign that ends the acute risk period once, similarly to how we only have to "lose" once in the sense that we only need to build a misaligned superintelligence that kills everyone once.
(I like the discussion on similar points in the strategy-stealing assumption.)
[People at AI labs] expected heavy scrutiny by leadership and communications teams on what they can state publicly. [...] One discussion with a person working at DeepMind is pending approval before publication. [...] We think organizations discouraging their employees from speaking openly about their views on AI risk is harmful, and we want to encourage more openness.
(I'm the person in question.)
I just want to note that in the case of DeepMind:
Suppose you have some deep learning model M_orig that you are finetuning to avoid some particular kind of failure. Suppose all of the following hold:
In that case, I think you should try and find out what the incentive gradient is like for other people before prescribing the actions that they should take. I'd predict that for a lot of alignment researchers your list of incentives mostly doesn't resonate, relative to things like:
conversely, a lot of people in the wider world that don’t work on alignment directly make relevant decisions that will affect alignment and AI, and think about alignment too (likely many more of those exist than “pure” alignment researchers, and this post is addressed to them too!)
I continue to think that if your model is that capabilities follow an exponential (i.e. dC/dt = kC), then there is nothing to be gained by thinking about compounding. You just estimate how much time it would have taken for the rest of the field to make an equal amount of capabilities progress now. That's the amount you shortened timelines by; there's no change from compounding effects.
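A quick numeric sketch of that claim under the assumed model dC/dt = kC (the rate, boost size, and threshold below are made up for illustration):

```python
import math

K = 0.5            # assumed growth rate in dC/dt = kC
C0 = 1.0           # capabilities level today
THRESHOLD = 100.0  # capability level we care about reaching

def time_to_threshold(boost_factor, boost_time):
    """Time to reach THRESHOLD if capabilities jump by boost_factor at boost_time."""
    c_after_boost = C0 * math.exp(K * boost_time) * boost_factor
    return boost_time + math.log(THRESHOLD / c_after_boost) / K

baseline = math.log(THRESHOLD / C0) / K  # time to threshold with no boost

# A 20% boost shortens timelines by log(1.2)/K no matter when it lands:
# there is no extra "compounding" benefit from making it earlier.
for boost_time in [0.0, 2.0, 5.0]:
    print(baseline - time_to_threshold(1.2, boost_time))  # ~0.365 every time
```

(The shortening equals the time the rest of the field would have needed to make the same proportional progress, exactly as described above.)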
if you make a discovery now worth 2 billion dollars [...] If you make this 2 billion dollar discovery later
Two responses:
Errr, I feel like we already agree on this point?
Yes, sorry, I realized that right after I posted and replaced it with a better response, but apparently you already saw it :(
What I meant to say was "I think most of the time closing overhangs is more negative than positive, and I think it makes sense to apply OP's higher bar of scrutiny to any proposed overhang-closing proposal".
But like, why? I wish people would argue for this instead of flatly asserting it and then talking about increased scrutiny or burdens of proof (which I also don't like).
To me, it seems like the claim that is (implicitly) being made here is that small improvements early on compound to have much bigger impacts later on, and also a larger shortening of the overall timeline to some threshold.
As you note, the second claim is false for the model the OP mentions. I don't care about the first claim once you know whether the second claim is true or false, which is the important part.
I agree it could be true in practice in other models but I am unhappy about the pattern where someone makes a claim based on arguments that are ...
Progress often follows an s-curve, which appears exponential until the current research direction is exploited and tapers off. Moving an exponential up, even a little, early on can have large downstream consequences:
...Your graph shows "a small increase" that represents progress that is equal to an advance of a third to a half the time left until catastrophe on the default trajectory. That's not small! That's as much progress as everyone else combined achieves in a third of the time till catastrophic models! It feels like you'd have to figure out some newer efficient training that allows you to get GPT-3 levels of performance with GPT-2 levels of compute to have an effect that was plausibly that large.
In general I wish you would actually write down equations for your mod
...I think this is pretty false. There's no equivalent to Let's think about slowing down AI, or a tag like Restrain AI Development (both of which are advocating an even stronger claim than just "caution") -- there's a few paragraphs in Paul's post, one short comment by me, and one short post by Kaj. I'd say that hardly any optimization has gone into arguments to AI safety researchers for advancing capabilities.
[...]
(I agree in the wider world there's a lot more optimization for arguments in favor of capabilities progress that people in general would fin
Not OP, just some personal takes:
That's not small!
To me, it seems like the claim that is (implicitly) being made here is that small improvements early on compound to have much bigger impacts later on, and also a larger shortening of the overall timeline to some threshold. (To be clear, I don't think the exponential model presented provides evidence for this latter claim)
I think the first claim is obviously true. The second claim could be true in practice, though I feel quite uncertain about this. It happens to be false in the specific model of moving an ex...
And now, it seems like we agree that the pseudocode I gave isn't a grader-optimizer for the grader self.diamondShard(self.WM.getConseq(plan)), and that e.g. approval-directed agents are grader-optimizers for some idealized function of human-approval? That seems like a substantial resolution of disagreement, no?
I don't think I agree with this.
At a high level, your argument can be thought of as having two steps:
- Grader-optimizers are bad, because of problem P.
- Approval-directed agents / [things built by IDA, debate, RRM] are grader-optimizers.
I've been t...
Nice comment!
The arguments you outline are the sort of arguments that have been considered at CHAI and MIRI quite a bit (at least historically). The main issue I have with this sort of work is that it talks about how an agent should reason, whereas in my view the problem is that even if we knew how an agent should reason we wouldn't know how to build an agent that efficiently implements that reasoning (particularly in the neural network paradigm). So I personally work more on the latter problem: supposing we know how we want the agent to reason, how do we ...
I don't really disagree with any of what you're saying but I also don't see why it matters.
I consider myself to be saying "you can't just abstract this system as 'trying to make evaluations come out high'; the dynamics really do matter, and considering the situation in more detail does change the conclusions."
I'm on board with the first part of this, but I still don't see the part where it changes any conclusions. From my perspective your responses are of the form "well, no, your abstract argument neglects X, Y and Z details" rather than explaining how X, ...
The edits help, thanks. I was in large part reacting to the fact that Kaj's post reads very differently from your summary of Bad Argument 1 (rather than the fact that I don't make Bad Argument 1). In the introductory paragraph where he states his position (the third paragraph of the post), he concludes:
Thus by doing capabilities research now, we buy ourselves a longer time period in which it's possible to do more effective alignment research.
Which is clearly not equivalent to "alignment researchers hibernate for N years and then get back to work".
Plausibly...
Fwiw, when talking about risks from deploying a technology / product, "accident" seems (to me) much more like ascribing blame ("why didn't they deal with this problem?"), e.g. the Boeing 737-MAX incidents are "accidents" and people do blame Boeing for them. In contrast "structural" feels much more like "the problem was in the structure, there was no specific person or organization that was in the wrong".
I agree that in situations that aren't about deploying a technology / product, "accident" conveys a lack of blameworthiness.
I recall some people in CHAI working on a minecraft AI that could help players do useful tasks the players wanted.
I think you're probably talking about my work. This is more of a long-term vision; it isn't doable (currently) at academic scales of compute. See also the "Advantages of BASALT" section of this post.
(Also I just generically default to Minecraft when I'm thinking of ML experiments that need to mimic some aspect of the real world, precisely because "the game getting played here is basically the same thing real life society is playing".)
...Okay, then let me try to directly resolve my confusion. My current understanding is something like this: in both humans and AIs, you have a blob of compute with certain structural parameters, and then you feed it training data. On this model, we've screened off evolution, the size of the genome, etc.: all of that is going into the "with certain structural parameters" part of the blob of compute. So could an AI engineer create an AI blob of compute the same size as the brain, with its same structural parameters, feed it the same training data, and get the same
Did you ever look at the math in the paper? Theorem 1 looks straight up false to me (attempted counterexample) but possibly I'm misunderstanding their assumptions.