let's call "hard alignment" the ("orthodox") problem, historically worked on by MIRI, of preventing strong agentic AIs from pursuing things we don't care about by default and destroying everything of value to us on the way there. let's call "easy" alignment the set of perspectives where some of this model is wrong — some of the assumptions are relaxed — such that saving the world is easier or more likely to be the default.

what should one be working on? as always, the calculation consists of comparing

  • p(hard) × how much value we can get in hard
  • p(easy) × how much value we can get in easy
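
to make the shape of that comparison concrete, here's a minimal sketch with made-up placeholder numbers (the probabilities and values below are purely illustrative, not real estimates):

```python
# a minimal sketch of the expected-value comparison above.
# p_hard, value_in_hard, p_easy, value_in_easy are made-up placeholders, not real estimates.

def expected_value(p_world: float, attainable_value: float) -> float:
    """expected value of betting your work on one world-model being true."""
    return p_world * attainable_value

p_hard, value_in_hard = 0.6, 0.3   # hard world: less value is reachable, and only via hard-compatible plans
p_easy, value_in_easy = 0.4, 0.8   # easy world: more value is reachable if the orthodox assumptions relax

print("work under hard assumptions:", expected_value(p_hard, value_in_hard))
print("work under easy assumptions:", expected_value(p_easy, value_in_easy))
```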

given how AI capabilities are going, it's not unreasonable for people to start playing their outs — that is to say, to start acting as if alignment is easy, because if it's not we're doomed anyways. but i think, in this particular case, this is wrong.

this is the lesson of dying with dignity and bracing for the alignment tunnel: we should be cooperating with our counterfactual selves and continuing to save the world in whatever way actually seems promising, rather than taking refuge in falsehood.

to me, p(hard) is big enough, and my hard-compatible plan seems workable enough, that it makes sense for me to continue to work on it.

let's not give up on the assumptions which are true. there is still work that can be done to actually generate some dignity under the assumptions that are actually true.


I'm put in mind of something Yudkowsky said on the Bankless podcast:

"Enrico Fermi was saying that fission chain reactions were 50 years off if they could ever be done at all, 2 years before he built the first nuclear pile. The Wright brothers were saying heavier-than-air flight was 50 years off shortly before they built the first Wright flyer."

He was speaking about how far away AGI could be, but I think the same logic applies to alignment. It looks hopeless right now, but events never play out exactly like you expect them to, and breakthroughs happen all the time. 

Excellent point. In one frame, pessimism applied to timelines makes AGI look further away than it actually turns out to be. In another frame, pessimism applied to doom makes doom seem closer / more probable, but that frame relies on the opposite, anti-pessimistic stance on timelines - "AGI will happen much sooner than we think".

I get the sense reading some LessWrong comments that there is a divide between "alignment-is-easy"-ers and "alignment-is-hard"-ers. I also get the sense that Yudkowsky's p(doom) has increased over the years, to where it is now. Isn't it somewhat strange that we should be getting two groups whose estimates of p(doom) are moving away from the center?

given how AI capabilities are going, it's not unreasonable for people to start playing their outs — that is to say, to start acting as if alignment is easy, because if it's not we're doomed anyways. but i think, in this particular case, this is wrong.

Alternatively, reality looks to me like the hard alignment problem is just based on fundamentally mistaken models of the world. It's not about playing our outs; it's that it doesn't seem like we live in a hard alignment world.

 

let's not give up on the assumptions which are true. there is still work that can be done to actually generate some dignity under the assumptions that are actually true.

They're looking more false by the day.

Yeah, this is starting to make a lot more sense to me. It seems that evaluating the complexity of a utility function using Kolmogorov complexity, rather than thinking about how hard it is for the AGI to implement it in terms of its internal concept language, is a huge mistake. Magical categories don't seem that magical anymore; simply predicting the next tokens is enough to give you robust abstractions about human values.

I don't see how you arrive at these conclusions at all. I agree that, given how alignment of the current models works, there's some vague hope that things might keep going like this even as capabilities increase. Is there any specific thing that makes you update more strongly?

The hard problem of alignment is going to hit us like a train in 3 to 12 months, at the same time that some specific capabilities breakthroughs people have been working on for the entire history of ML finally start working now that they have a weak AGI to apply them to, and suddenly Critch's stuff becomes super duper important to understand.

What Critch stuff do you have in mind?

Well, it looks to me like AI will soon understand our values at least as well as we do. I think it's far more likely AI goes wrong by understanding completely what we want and not wanting to do it than by the paperclip route.

That is the paperclip route. A superintelligent paperclip optimizer understands what we want, because it is superintelligent, but it wants to make "paperclips" instead.

Yes, but the question of whether pretrained LLMs have good representations of our values and/or our preferences and the concept of deference/obedience is still quite important for whether they become aligned. If they don't, then aligning them via fine-tuning after the fact seems quite hard. If they do, it seems pretty plausible to me that e.g. RLHF fine-tuning or something like Anthropic's constitutional AI finds the solution of "link the values/obedience representations to the output in a way that causes aligned behavior," because this is simple and attains lower loss than misaligned paths. This in turn is because in order for it to be misaligned and still attain low loss, it must be deceptively aligned, and deceptive alignment requires a combination of good situational awareness, a fully consequentialist objective, and high-quality planning/deception skills.

What does GPT want?

I don't know.

My model of foundational LLMs, before tuning and prompting, is that they want to predict the next token, assuming that the token stream is taken from the hypothetical set that their training data is sampled from. Their behavior out of distribution is not well-defined in this model.
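
For concreteness, here is a minimal sketch of the objective this model refers to: standard next-token cross-entropy over sequences drawn from the training distribution (the tensor names are illustrative, not anyone's actual training code).

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the model's predicted distribution at each position
    and the token that actually comes next in the training stream."""
    # logits: (batch, seq_len, vocab_size); tokens: (batch, seq_len)
    preds = logits[:, :-1, :]     # predictions made at positions 1..N-1
    targets = tokens[:, 1:]       # the tokens that actually follow
    return F.cross_entropy(preds.reshape(-1, preds.size(-1)), targets.reshape(-1))
```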

My model of typical tuned and prompted LLMs is that they mostly want to do the thing they have been tuned and prompted to do, but also have additional wants that cause them to diverge in unpredictable ways.

They don't "want" anything and thinking of them as having wants leads to confused thinking.

I would also like to see more work where people make less positive/optimistic assumptions. I think of it as a good thing that different approaches to alignment are being explored, and would like to see more of that in general (both in terms of breadth and depth).

I guess there are many possible ways of trying to categorize/conceptualize approaches to alignment theorizing. One is by asking "when talking/thinking about the methodology, what capabilities are assumed to be in place?".

I'm not sure about this, but unless I'm mistaken[1], a good amount of the work done by MIRI has been under a premise that goes (something sort of like): "Let's assume that we already know how to give AIs real-world objectives defined in terms of utility functions (not because we really assume that, but as a simplifying assumption)". And as far as I know, they haven't done much work where the assumption was something more like "suppose we were extremely good at gradient descent / searching through spaces of possible programs".

In my own theorizing, I don't make all of the simplifying assumptions that (I think/suspect) MIRI made in their "orthodox" research. But I make other assumptions (for the purpose of simplification), such as:

  • "let's assume that we're really good at gradient descent / searching for possible AIs in program-space"[2]
  • "let's assume that the things I'm imagining are not made infeasible due to a lack of computational resources"
  • "let's assume that resources and organizational culture makes it possible to carry out the plans as described/envisioned (with high technical security, etc)"

Regarding your alignment ideas, is it easy to summarize what you assume to be in place? Like, if someone came to you and said "we have written the source code for a superintelligent AGI, but we haven't turned it on yet" (and you believed them), is it easy to summarize what more you would then need in order to implement your methodology?

  1. ^

    I very well could be, and would appreciate any corrections.

    (I know they have worked on lots of detail-oriented things that aren't "one big plan" to "solve alignment". And maybe how I phrase myself makes it seem like I don't understand that. But if so, that's probably due to bad wording on my part.)

  2. ^

    Well, I sort of make that assumption, but there are caveats.

Hard alignment seems much more tractable to me now than it did two years ago, in a similar way to how capabilities did in 2016. It was already more or less obvious by then how neural networks worked; much detail has been filled out since then, but it didn't take that much galaxy brain to hypothesize the right models. The pieces felt, and feel now, like they're lying around and need integrating, but the people who have come up with the pieces do not yet believe me that they are overlapping, or that there's mathematical-grade insight to be had underneath these intuitions, rather than just janky approximations of insights.

I think we can do a lot better than QACI, but I don't have any ideas for how except by trying to make it useful for neural networks at a small scale. I recognize that that is an extremely annoying thing to say from your point of view, and my hope is that people who understand how to bridge NNs and LIs exist somewhere.

I also think soft alignment is progress on hard alignment, due to conceptual transfer; but soft alignment is thoroughly insufficient. Without hard alignment, everything all humans and almost all AIs care about will be destroyed. I'd like to keep emphasizing that last bit - don't forget that most AIs will not get to participate in club takeoff if an unaligned takeoff occurs! Unsafe takeoff will result in the fooming AI having sudden, intense value-drift, even against itself.

how to bridge NNs and LIs

What's an LI - a living intelligence? a logical inductor? 

I'm moderately skeptical about these alignment approaches (PreDCA, QACI?) which don't seem to care about the internal structure of an agent, only about a successful functionalist characterization of its behavior. Internal structure seems to be relevant if you want to do CEV-style self-improvement (thus, June Ku).

However, I could be missing a lot, and meanwhile, the idea of bridging neural networks and logical induction sounds interesting. Can you say more about what's involved? Would a transformer trained to perform logical induction be relevant? How about the recent post on knowledge in parameters vs knowledge in architecture?

I don't think we should be in the business of not caring at all about the internal structure, but I think the claims we need to make about the internal structure need to be extremely general across possible internal structures, so that we can invoke the powerful structures and still get a good outcome.

sorry about low punctuation, voice input

more later, or poke me on discord

I feel there's often a wrong assumption in probabilistic reasoning, something like moderate probabilities for everything by default? After all, if you say you're 70/30, nobody who disagrees will ostracize you the way they would if you said 99/1.

"If alignment is easy I want to believe alignment is easy. If alignment is hard I want to believe alignment is hard. I will work to form accurate beliefs"

I... kinda want to ping @Jeffrey Ladish about how this post uses "play to your outs", which is exactly the reason I pushed against that phrasing a year ago in Don't die with dignity; instead play to your outs.

I think even now AI can understand well enough what we want within the distribution, i.e. in a world that is similar to what it is now.

Problems will arise when the world changes significantly, even if it changes in line with our wishes. Our values are just not designed for a reality where, for example, people can arbitrarily change themselves. Or a reality where ANY human's mental activity is obsolete, because AI can predict what the human wants and how to get it before the human can even articulate it.
