Violet Hour


Thanks for sharing this! A couple of (maybe naive) things I'm curious about.

Suppose I read 'AGI' as 'Metaculus-AGI', and we condition on AGI by 2025 — what sort of capabilities do you expect by 2027? I ask because I'm reminded of a very nice (though high-level) list of par-human capabilities for 'GPT-N' from an old comment:

  1. discovering new action sets
  2. managing its own mental activity
  3. cumulative learning
  4. human-like language comprehension
  5. perception and object recognition
  6. efficient search over known facts 

My immediate impression says something like: "it seems plausible that we get Metaculus-AGI by 2025, without the AI being par-human at 2, 3, or 6."[1] This also makes me (instinctively, I've thought about this much less than you) more sympathetic to AGI → ASI timelines being >2 years, as the sort-of-hazy picture I have for 'ASI' involves (minimally) some unified system that bests humans on all of 1-6. But maybe you think that I'm overestimating the difficulty of reaching these capabilities given AGI, or maybe you have some stronger notion of 'AGI' in mind.

The second thing: roughly how independent are the first four statements you offer? I guess I'm wondering if the 'AGI timelines' predictions and the 'AGI → ASI timelines' predictions "stem from the same model", as it were. Like, if you condition on 'No AGI by 2030', does this have much effect on your predictions about ASI? Or do you take them to be supported by ~independent lines of evidence?


  1. ^

    Basically, I think an AI could pass a two-hour adversarial Turing test without having the coherence of a human over much longer time-horizons (points 2 and 3). Probably less importantly, I also think that it could meet the Metaculus definition without being able to search as efficiently over known facts as humans (especially given that AIs will have a much larger set of 'known facts' than humans).

Could you say more about why you think LLMs' vulnerability to jailbreaks counts as an example? Intuitively, the idea that jailbreaks are an instance of AIs (rather than human jailbreakers) "optimizing for small loopholes in aligned constraints" feels off to me.

A bit more constructively, the Learning to Play Dumb example (from pages 8-9 in this paper) might be one example of what you're looking for? 

In research focused on understanding how organisms evolve to cope with high-mutation-rate environments, Ofria sought to disentangle the beneficial effects of performing tasks (which would allow an organism to execute its code faster and thus replicate faster) from evolved robustness to the harmful effect of mutations. To do so, he tried to disable mutations that improved an organism’s replication rate (i.e. its fitness). He configured the system to pause every time a mutation occurred, and then measured the mutant’s replication rate in an isolated test environment. If the mutant replicated faster than its parent, then the system eliminated the mutant; otherwise, the mutant would remain in the population. 

However, while replication rates at first remained constant, they later unexpectedly started again rising. After a period of surprise and confusion, Ofria discovered that he was not changing the inputs provided to the organisms in the isolated test environment. The organisms had evolved to recognize those inputs and halt their replication. Not only did they not reveal their improved replication rates, but they appeared to not replicate at all, in effect “playing dead” when presented with what amounted to a predator.

Ofria then ... [altered] the test environment to match the same random distribution of inputs as would be experienced in the normal (non-isolated) environment. While this patch improved the situation, it did not stop the digital organisms from continuing to improve their replication rates. Instead they made use of randomness to probabilistically perform the tasks that accelerated their replication. For example, if they did a task half of the time, they would have a 50% chance of slipping through the test environment; then, in the actual environment, half of the organisms would survive and subsequently replicate faster.
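To make the probabilistic strategy in the quoted passage concrete, here is a toy sketch (not a reconstruction of Ofria's actual system). It assumes a single hypothetical parameter, `p_task`, for how often a mutant performs the speed-boosting task, and assumes the mutant is eliminated exactly when it performs the task during its one test run.

```python
import random


def survives_test(p_task: float, rng: random.Random) -> bool:
    """Toy assumption: the mutant is eliminated only if it performs the
    task (and so replicates faster than its parent) during the test run."""
    performs_task = rng.random() < p_task
    return not performs_task


def survival_fraction(p_task: float, n_mutants: int, seed: int = 0) -> float:
    """Fraction of mutants that slip through the isolated test environment."""
    rng = random.Random(seed)
    survivors = sum(survives_test(p_task, rng) for _ in range(n_mutants))
    return survivors / n_mutants


def expected_hidden_advantage(p_task: float) -> float:
    """Chance of slipping through the test times the chance of performing
    the task afterwards, under the same toy assumptions."""
    return (1.0 - p_task) * p_task


if __name__ == "__main__":
    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(p, survival_fraction(p, 10_000), expected_hidden_advantage(p))
```

Under these assumptions, a mutant that always performs the task is always caught, and one that never performs it gains nothing, so the filter only fully blocks deterministic strategies.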

Nice work! 

I wanted to focus on your definition of deceptive alignment, because I currently feel unsure about whether it’s a more helpful framework than standard terminology. Substituting terms, your definition is:

Deceptive Alignment: When an AI has [goals that are not intended/endorsed by the designers] and [attempts to systematically cause a false belief in another entity in order to accomplish some outcome].

Here are some initial hesitations I have about your definition:

If we’re thinking about the emergence of DA during pre-deployment training, I worry that your definition might be too divorced from the underlying catastrophic risk factors that should make us concerned about “deceptive alignment” in the first place. 

  • Hubinger’s initial analysis claims that the training process is likely to produce models with long-term goals.[1] I think his focus was correct, because if models don’t develop long-term/broadly-scoped goals, then I think deceptive alignment (in your sense) is much less likely to result in existential catastrophe.
  • If a model has long-term goals, I understand why strategic deception can be instrumentally incentivized.  To the extent that strategic deception is incentivized in the absence of long-term goals, I expect that models will fall on the milder end of the ‘strategically deceptive’ spectrum.
    • Briefly, this is because the degree to which you’re incentivized to be strategic is going to be a function of your patience. In the maximally extreme case, the notion of a ‘strategy’ breaks down if you’re sufficiently myopic.
    • So, at the moment, I don’t think I’d talk about 'deceptive alignment' using your terminology, because I think it misses a crucial component of why deceptively aligned models could pose a civilizational risk.   

If we’re thinking about the risks of misaligned strategic deception more broadly, I think distinguishing between the training and oversight process is helpful. I also agree that it’s worth thinking about the risks associated with models whose goals are (loosely speaking) ‘in the prompts’ rather than ‘in the weights’. 

  • That said, I’m a bit concerned that your more expansive definition encompasses a wide variety of different systems, many of which are accompanied by fairly distinct threat models.
    • The risks from LM agents look to be primarily (entirely?) misuse risks, which feels pretty different from the threat model standardly associated with DA. Among other things, one issue with LM agents appears to be that intent alignment is too easy.
    • One way I can see my objection mattering is if your definitions were used to help policymakers better understand people's concerns about AI. My instinctive worry is that a policymaker who first encountered deceptive alignment through your work wouldn’t have a clear sense of why many people in AI safety have been historically worried about DA, nor have a clear understanding of why many people are worried about ‘DA’ leading to existential catastrophe. This might lead to policies which are less helpful for 'DA' in the narrower sense. 
  1. ^

    Strictly speaking, I think 'broadly-scoped goals' is probably slightly more precise terminology, but I don't think it matters much here. 

Minor point, but I asked the code interpreter to produce a non-rhyming poem, and it managed to do so on the second time of asking. I restricted it to three verses because it started off well on my initial attempt, but veered into rhyming territory in later verses.

The first point is extremely interesting. I’m just spitballing without having read the literature here, but here’s one quick thought that came to mind. I’m curious to hear what you think.

  1. First, instruct participants to construct a very large number of 90% confidence intervals based on the two-point method.
  2. Then, instruct participants to draw the shape of their 90% confidence interval.
  3. Inform participants that you will take a random sample from these intervals, and tell them they’ll be rewarded based on both: (i) the calibration of their 90% confidence intervals, and (ii) the calibration of the x% confidence intervals implied by their original distribution — where x is unknown to the participants, and will be chosen by the experimenter after inspecting the distributions.
  4. Allow participants to revise their intervals, if they so desire.  

So, if participants offered the 90% confidence interval [0, 10^15] on some question, one could back out (say) a 50% or 5% confidence interval from the shape of their initial distribution. Experimenters could then ask participants whether they’re willing to commit to certain implied x% confidence intervals before proceeding. 
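As a rough sketch of what "backing out" an implied interval could look like: the code below assumes, purely for illustration, that the participant's drawn distribution is a normal distribution whose 5th and 95th percentiles are the stated 90% interval. The function names `implied_interval` and `hit_rate` are hypothetical, and a real version would use whatever shape the participant actually drew.

```python
from statistics import NormalDist


def implied_interval(lo90: float, hi90: float, x: float) -> tuple[float, float]:
    """Central x-level interval implied by a 90% interval, ASSUMING the
    participant's distribution is normal (an illustrative assumption)."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must be strictly between 0 and 1")
    std = NormalDist()  # standard normal, used only for its quantiles
    mu = (lo90 + hi90) / 2.0
    # The 90% interval spans the 5th to 95th percentiles.
    sigma = (hi90 - lo90) / (2.0 * std.inv_cdf(0.95))
    half_width = sigma * std.inv_cdf(0.5 + x / 2.0)
    return mu - half_width, mu + half_width


def hit_rate(intervals: list[tuple[float, float]], truths: list[float]) -> float:
    """Fraction of true values that fall inside their intervals; a
    calibrated x-level forecaster should score close to x."""
    hits = sum(lo <= t <= hi for (lo, hi), t in zip(intervals, truths))
    return hits / len(truths)


if __name__ == "__main__":
    # The deliberately uninformative interval from the example above.
    print(implied_interval(0.0, 1e15, 0.5))
    print(implied_interval(0.0, 1e15, 0.05))
```

The point of the sketch is that a very wide 90% interval still commits the participant to wide 50% and 5% intervals, so it can be scored at a level the participant did not choose.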

There might be some clever hack to game this setup, and it’s also a bit too clunky+complicated. But I think there’s probably a version of this which is understandable, and for which attempts to game the system are tricky enough that I doubt strategic behavior would be incentivized in practice.

On the second point, I sort of agree. If people were still overprecise, another way of putting your point might be to say that we have evidence about the irrationality of people’s actions, relative to a given environment. But these experiments might not provide evidence suggesting that participants are irrational characters. I know Kenny Easwaran likes (or at least liked) this distinction in the context of Newcomb's Problem.

That said, I guess my overall thought is that any plausible account of the “rational character” would involve a disposition for agents to fine-tune their cognitive strategies under some circumstances. I can imagine being more convinced by your view if you offered an account of when switching cognitive strategies is desirable, so that we know the circumstances under which it would make sense to call people irrational, even if existing experiments don’t cut it.

Interesting work, thanks for sharing!  

I haven’t had a chance to read the full paper, but I didn’t find the summary account of why this behavior might be rational particularly compelling. 

At a first pass, I think I’d want to judge the behavior of some person (or cognitive system) as “irrational” when the following three constraints are met:

  1. The subject, in some sense, has the basic capability to perform the task competently, and
  2. They do better (by their own values) if they exercise the capability in this task, and
  3. In the task, they fail to exercise this capability. 

Even if participants are operating with the strategy “maximize expected answer value”, I’d be willing to judge the participants' responses as “irrational” if the participants were cognitively competent, understood the concept '90% confidence interval', and were incentivized to be calibrated on the task (say, if participants received increased monetary rewards as a function of their calibration). 

Pointing out that informativity is important in everyday discourse doesn’t do much to persuade me that the behavior of participants in the study is “rational”, because (to the extent I find the concept of “rationality” useful), I’d use the moniker to label the ability of a system to exercise its capabilities in a way that is conducive to its ends. 

I think you make a decent case for claiming that the empirical results outlined don’t straightforwardly imply irrationality, but I’m also not convinced that your theoretical story provides strong grounds for describing participant behaviors as “rational”.