Is the overall karma for this mostly just people boosting it for visibility? Because I don't see how this would be a quality comment by any other standards.

Frontpage comment guidelines:

  • Maybe try reading the post

So if I'm getting at things correctly, capabilities and safety are highly correlated, and there can't be situations where capabilities and alignment decouple.

Not that far, more like it doesn't decouple until more progress has been made. Pure alignment is an advanced subtopic of AI research that requires more progress to have been made before it's a viable field.

I'm not super confident in the above and wouldn't discourage people from doing alignment work now (plus the obvious nuance that it's not one big lump, there are some things that can be done later and some that can be done earlier) but the idea of alignment work that requires a whole bunch of work in serial, independent of AI capability work, doesn't seem plausible to me. From Nate Soares' post:

The most blatant case of alignment work that seems serial to me is work that requires having a theoretical understanding of minds/optimization/whatever, or work that requires having just the right concepts for thinking about minds.

This is the kind of thing that seems inextricably bound up with capability work to me. My impression is that MIRI tends to think that whatever route we take to get to AGI, as it moves from subhuman to human-level intelligence it will transform to be like the minds that they theorize about (and they think this will happen before it goes foom), no matter how different it was when it started. So even if they don't know what a state-of-the-art RL agent will look like five years from now, they feel confident they can theorize about what it will look like ten years from now. Whereas my view is that if you can't get the former right you won't get the latter right either.

To the extent that intelligences will converge towards a certain optimal way of thinking as they get smarter, being able to predict what that looks like will involve a lot of capability work ("Hmm, maybe it will learn like this; let's code up an agent that learns that way and see how it does"). If you're not grounding your work in concrete experiments you will end up with mistakes in your view of what an optimal agent looks like and no way to fix them.

A big part of my view is that we seem to still be a long way from AGI. This hinges on how "real" the intelligence behind LLMs is. If we have to take the RL route then we are a long way away - I wrote a piece on this, "What Happened to AIs Learning Games from Pixels?", which points out how slow the progress has been and covers the areas where the field is stuck. On the other hand, if we can get most of the way to AGI just with massive self-supervised training, then it starts seeming more likely that we'll walk into AGI without having a good understanding of what's going on. I think that the failure of VPT for Minecraft compared to GPT for language, and the difficulty LLMs have with extrapolation and innovation, mean that self-supervised learning won't be enough without more insight. I'll be paying close attention to how GPT-4 and other LLMs do over the next few years to see if they're making progress faster than I thought, but I talked to ChatGPT and it was way worse than I thought it'd be.

I'm very unconfident in the following but, to sketch my intuition:

I don't really agree with the idea of serial alignment progress that is independent from capability progress. This is what I was trying to get at with:

"AI capabilities" and "AI alignment" are highly related to each other, and "AI capabilities" has to come first in that alignment assumes that there is a system to align.

By analogy, nuclear fusion safety research is inextricable from nuclear fusion capability research.

When I try to think of ways to align AI my mind points towards questions like "how do we get an AI to extrapolate concepts? How will it be learning? What will its architecture be?" etc. In other words it just points towards capabilities questions. Since alignment turns on capability questions that we don't yet have an answer to, it doesn't surprise me when many alignment researchers seem to spin their wheels and turn to doom and gloom - that's more or less what I had thought would happen. 

As an example of the blurred lines between capability and alignment: while I think it's useful to have specific terms for inner and outer alignment, I also think that really anyone who worked with RL in a situation where they were manually setting the reward function was aware of these ideas already on some level. "Sometimes I mess up the reward function" and "sometimes the agent isn't optimizing properly" are both issues encountered frequently. Basically, while many people in the alignment community seem to think of alignment as something that is cooked up entirely separately from capability research, I tend to think that a lot of it will develop naturally as part of day-to-day AI research with no specific focus on alignment.

As a thought experiment, let's say that about 20% of current AI capability researchers are very concerned about AI alignment and get together to decide what to do for the next five years. They're deciding between taking the stance "Capability work is fine right now! Go for it! Worry about alignment when we're farther along!" or "Let's get out of capability and go into alignment instead. Capability research is dangerous and burning precious time." What's the impact of adopting these two positions?

The first is roughly the default position, and I'd expect that basically what we'll see is AGI in the year 20XX and that in the runup to this we'll see vastly increased interest in alignment work and also a significant blurring between "alignment" and "regular AI research" since people want their home robots to not roll over their cat. We'll also see all major AI research orgs and the AI community as a whole take existential risk from self-improving AGI a lot more seriously once modern SOTA AI systems start looking more and more like the kind of thing that could do that. Because of this there'll be a concerted effort to handle the situation appropriately which has a good chance of success.

Option two involves slowing down the timeline by about 5-10%. Cutting the size of a field by 20% doesn't slow progress that much since there are diminishing returns to adding more researchers, and on top of that AI capability research is only half of what drives progress (the other half being compute). In return for this small slowdown, the AI researchers who are now going into alignment will initially spin their wheels due to the lack of anything concrete to focus on or any concrete knowledge of what the future systems will look like. When AGI does start approaching, the remaining AI capability community will take it much less seriously, having been selected specifically for not being worried about it. Three years before the arrival of transformative AGI, alignment research is further along than it otherwise would have been, but AI capability researchers have gotten used to tuning alignment researchers out and there aren't alignment-sympathetic colleagues around to say "hey, given how things are progressing I think it's time we start taking all that AI risk stuff seriously". Prospects are worse than option one.
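To make the 5-10% figure less hand-wavy, here is a rough back-of-the-envelope sketch. Everything in it is my own assumption for illustration: the function name `timeline_slowdown`, the power-law form for diminishing returns, the exponent range, and the simple additive split between research and compute. It is not a real model of AI progress.

```python
def timeline_slowdown(researcher_cut=0.20, returns_exponent=0.5, research_share=0.5):
    """Toy estimate of the fractional slowdown in overall progress.

    Assumptions (all hypothetical, chosen only for illustration):
    - research output scales as headcount ** returns_exponent, where an
      exponent below 1 means diminishing returns to adding researchers;
    - overall progress is a weighted sum of research output and compute,
      and compute is unaffected by the researchers leaving.
    """
    # Relative research output after removing a fraction of the researchers.
    research_rate = (1.0 - researcher_cut) ** returns_exponent
    # Compute keeps progressing at its old rate (relative rate 1.0).
    overall_rate = research_share * research_rate + (1.0 - research_share) * 1.0
    return 1.0 - overall_rate


if __name__ == "__main__":
    # Sweep from strongly diminishing returns to no diminishing returns at all.
    for exponent in (0.5, 0.75, 1.0):
        slowdown = timeline_slowdown(returns_exponent=exponent)
        print(f"exponent={exponent}: slowdown of about {slowdown:.1%}")
```

Under these assumptions the slowdown lands between roughly 5% (square-root returns) and 10% (no diminishing returns at all), which is where the estimate above comes from.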

So right now my intuition is that alignment will be very doable as long as it's something that the AI community is taking seriously in the few years leading up to transformative AGI. The biggest risk seems to me to be some AI researchers at one of the leading research groups thinking "man, it sure would be cool if we could use the latest coding LLM combined with RL to make an AI that could improve itself in order to accomplish a goal" and setting it running without it ever occurring to them that this could go wrong. Given this, the suggestion that everyone concerned about alignment basically cedes the whole field of AI research (outside of this specific community, "AI capability research" is just called "AI research") to people who aren't worried about it seems like a bad idea.

Right, I specifically think that someone would be best served by trying to think of ways to get a SOTA result on an Atari benchmark, not simply reading up on past results (although you'd want to do that as part of your attempt). There's a huge difference between reading about what's worked in the past and trying to think of new things that could work and then trying them out to see if they do.

As I've learned more about deep learning and tried to understand the material, I've constantly had ideas that I think could improve things. Then I've tried them out, and usually learned that they didn't, or they did but they'd already been done, or that it was more complicated than that, etc. But I learned a ton in the process. On the other hand, suppose I was wary of doing AI capability work, and each time I had one of these ideas I shied away from it out of fear of advancing AGI timelines. The result would be threefold: I'd have a much worse understanding of AI, I'd be a lot more concerned about imminent AGI (after all, I had tons of ideas for how things could be done better!), and I wouldn't have actually delayed AGI timelines at all.

I think a lot of people who get into AI from the alignment side are in danger of falling into this trap. As an example, in an ACX thread I saw someone thinking about doing their PhD in ML, and they were concerned that they might have to do capability research in order to get their PhD. Someone replied that if they had to, they should at least try to make sure it is nothing particularly important, in order to avoid advancing AGI timelines. I don't think this is a good idea. Spending years working on research while actively holding yourself back from really thinking deeply about AI will harm your development significantly, and early in your career is right when you benefit the most from developing your understanding and are least likely to actually move up AGI timelines.

Suppose we have a current expected AGI arrival date of 20XX. This is the result of DeepMind, Google Brain, OpenAI, FAIR, Nvidia, universities all over the world, the Chinese government, and more all developing the state of the art. On top of that there's computational progress happening at the same time, which may well turn out to be a major bottleneck. How much would OpenAI removing themselves from this race affect the date? A small but real amount. How about a bright PhD candidate removing themselves from this race? About zero. I don't think people properly internalize both how insignificant the timeline difference is, and also how big the skill gains are from actually trying your hardest at something as opposed to handicapping yourself. And if you come up with something you're genuinely worried about you can just not publish.

  1. A paper which does for deceptive alignment what the goal misgeneralization paper does for inner alignment, i.e. describing it in ML language and setting up toy examples (for example, telling GPT-3 to take actions which minimize changes in its weights, given that it’s being trained using actor-critic RL with a certain advantage function, and seeing if it knows how to do so).

If I'm understanding this one right, OpenAI did something similar to this for purely pragmatic reasons with VPT, a Minecraft agent. They first trained a "foundation model" to imitate humans playing. Then, since their chosen task was to craft a diamond pickaxe, they finetuned it with RL, giving it reward for each step along the path to a diamond pickaxe that it successfully took. There's a problem with this approach:

"A major problem when fine-tuning with RL is catastrophic forgetting because previously learned skills can be lost before their value is realized. For instance, while our VPT foundation model never exhibits the entire sequence of behaviors required to smelt iron zero-shot, it did train on examples of players smelting with furnaces. It therefore may have some latent ability to smelt iron once the many prerequisites to do so have been performed. To combat the catastrophic forgetting of latent skills such that they can continually improve exploration throughout RL fine-tuning, we add an auxiliary Kullback-Leibler (KL) divergence loss between the RL model and the frozen pretrained policy."

In other words they first trained it to imitate humans, and saved this model; after that they trained the agent to maximize reward, but used the saved model and penalized it for deviating too far from what that human-imitator model would do. Basically saying "try to maximize reward but don't get too weird about it."
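A minimal sketch of what that kind of penalty looks like, to make the idea concrete. This is not OpenAI's code: the function names, the `kl_coef` value, the toy logits, and the direction of the KL term are all my own assumptions for illustration, and it works on a single discrete action distribution rather than a real policy network.

```python
import math


def softmax(logits):
    """Turn raw action logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_regularized_loss(rl_loss, rl_logits, pretrained_logits, kl_coef=0.1):
    """RL loss plus a penalty for drifting away from the frozen imitation policy.

    Assumption: the penalty is KL(RL policy || pretrained policy); the
    direction and the coefficient are illustrative choices, not taken
    from the VPT paper.
    """
    p_rl = softmax(rl_logits)
    p_pretrained = softmax(pretrained_logits)  # frozen: never updated
    return rl_loss + kl_coef * kl_divergence(p_rl, p_pretrained)


if __name__ == "__main__":
    pretrained = [1.0, 0.5, -0.5]  # hypothetical logits from the human-imitator
    close = [1.1, 0.4, -0.5]       # RL policy that stayed close to it
    weird = [-3.0, -3.0, 4.0]      # RL policy that "got weird"
    print(kl_regularized_loss(1.0, close, pretrained))
    print(kl_regularized_loss(1.0, weird, pretrained))
```

The second policy pays a noticeably larger penalty than the first for the same RL loss, which is the "don't get too weird about it" part.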