Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

"John, what do you think of this idea for an alignment research project?"

I get questions like that fairly regularly. How do I go about answering? What principles guide my evaluation? Not all of my intuitions for what makes a project valuable can easily be made legible, but I think the principles in this post capture about 80% of the value.

Tackle the Hamming Problems, Don't Avoid Them

Far and away the most common failure mode among self-identifying alignment researchers is to look for Clever Ways To Avoid Doing Hard Things (or Clever Reasons To Ignore The Hard Things), rather than just Directly Tackling The Hard Things.

The most common pattern along these lines is to propose outsourcing the Hard Parts to some future AI, and "just" try to align that AI without understanding the Hard Parts of alignment ourselves. The next most common pattern is to argue that, since Hard Parts are Hard, we definitely don't have enough time to solve them and should therefore pretend that we're going to solve alignment while ignoring them. Third most common is to go into field building, in hopes of getting someone else to solve the Hard Parts. (Admittedly these are not the most charitable summaries.)

There is value in seeing how dumb ideas fail. Most of that value is figuring out what the Hard Parts of the problem are - the taut constraints which we run into over and over again, which we have no idea how to solve. (If it seems pretty solvable, it's probably not a Hard Part.) Once you can recognize the Hard Parts well enough to try to avoid them, you're already past the point where trying dumb ideas has much value.

On a sufficiently new problem, there is also value in checking dumb ideas just in case the problem happens to be easy. Alignment is already past that point; it's not easy.

You can save yourself several years of time and effort by actively trying to identify the Hard Parts and focus on them, rather than avoid them. Otherwise, you'll end up burning several years on ideas which don't actually leave the field better off. That's one of the big problems with trying to circumvent the Hard Parts: when the circumvention inevitably fails, we are still no closer to solving the Hard Parts. (It has been observed both that alignment researchers mostly seem to not be tackling the Hard Parts, and that alignment research mostly doesn't seem to build on itself; I claim that the latter is a result of the former.)

Mostly, I think the hard parts are things like "understand agency in general better" and "understand what's going on inside the magic black boxes". If your response to such things is "sounds hard, man", then you have successfully identified (some of) the Hard Parts.

Have An Intuitive Story Of What We're Looking For

One project going right now is looking at how modularity in trained systems corresponds to broad peaks in parameter space. Intuitive story for that: we have two "modules", each with lots of stuff going on inside, but only a relatively-low-dimensional interface between them. Because each module has lots of stuff going on inside, but only a low-dimensional interface, there should be many ways to change around the insides of a module while keeping the externally-visible behavior the same. Because such changes don't change behavior, they don't change system performance. So, we expect that modularity implies lots of degrees-of-freedom in parameter space, i.e. broad peaks.
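A toy sketch of the "many ways to change the insides while keeping the externally-visible behavior the same" step, under assumptions that are not from the project itself: a single hypothetical "module" with a wide ReLU hidden layer and a narrow output interface, and the standard positive-rescaling symmetry of ReLU units standing in for "changes to the insides". It only illustrates that such behavior-preserving directions exist; it does not show that a modular net has more of them than a non-modular one.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def module(W_in, W_out, x):
    """Hypothetical toy module: wide hidden ReLU layer, narrow output interface."""
    return matvec(W_out, relu(matvec(W_in, x)))

def rescale_hidden(W_in, W_out, scales):
    """Scale each hidden unit's incoming weights by c > 0 and its outgoing weights by 1/c.

    Since relu(c * z) == c * relu(z) for c > 0, the module's output is unchanged.
    """
    new_in = [[c * w for w in row] for row, c in zip(W_in, scales)]
    new_out = [[w / c for w, c in zip(row, scales)] for row in W_out]
    return new_in, new_out

rng = random.Random(0)
n_in, n_hidden, n_interface = 3, 8, 2  # illustrative sizes, chosen arbitrarily
W_in = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
W_out = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_interface)]

# One independent rescaling per hidden unit: 8 directions in parameter space
# along which the interface-level behavior does not change.
scales = [rng.uniform(0.5, 2.0) for _ in range(n_hidden)]
W_in2, W_out2 = rescale_hidden(W_in, W_out, scales)

x = [rng.uniform(-1, 1) for _ in range(n_in)]
before = module(W_in, W_out, x)
after = module(W_in2, W_out2, x)
max_gap = max(abs(a - b) for a, b in zip(before, after))
print("parameters changed:", W_in2 != W_in, "| max output gap:", max_gap)
```

The parameters move, but the output at the interface agrees up to floating-point error, so loss is flat along these directions.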

This story is way too abstract to be able to look for immediately in a trained net. How do we operationalize "modules", and find them? How do we operationalize "changes in a module", especially since parameter space may not line up very neatly with functional modules? But that's fine; the story can be pretty abstract.

The point of the intuitive story is to steer our search. Without it, we risk blind empiricism: just cataloguing patterns without building general models/theory/understanding for what's going on. In that mode, we can easily lose track of the big picture goal and end up cataloguing lots of useless stuff. An intuitive story gives us big-picture direction, and something to aim for. Even if it turns out to be wrong!

Operationalize

It's relatively easy to make vague/abstract intuitive arguments. Most of the value and challenge is in finding the right operationalizations of the vague concepts involved in those arguments, such that the argument is robustly correct and useful. Because it's where most of the value and most of the challenge is, finding the right operationalization should typically be the central focus of a project.

My abstraction work is a good example here. I started with some examples of abstraction and an intuitive story about throwing away information while keeping info relevant "far away". Then, the bulk of the work was to operationalize that idea in a way which matched all the intuitive examples, and made the intuitive stories provable.

Derive the Ontology, Don't Assume It

In ML interpretability, some methods look at the computation graph of the net. Others look at orthogonal directions in activation space. Others look at low-rank decompositions of the weight matrices. These are all "different ontologies" for interpretation. Methods which look at one of these ontologies will typically miss structure in the others; e.g. if I run a graph clustering algorithm on the computation graph, I probably won't pick up interpretable concepts embedded in directions in activation space.
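A minimal synthetic sketch of how one ontology can miss structure that another picks up. Everything here is an assumption made for illustration: a binary "concept", four hypothetical neurons, and a concept embedded along a direction that is spread evenly across those neurons plus Gaussian noise. Looking neuron-by-neuron sees a weaker signal than projecting onto the direction.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

rng = random.Random(0)
direction = [0.5, 0.5, 0.5, 0.5]  # unit vector; the assumed embedding direction
concept = [rng.choice([-1.0, 1.0]) for _ in range(2000)]

# Synthetic activations: concept along `direction`, plus unit-variance noise per neuron.
activations = [[c * d + rng.gauss(0.0, 1.0) for d in direction] for c in concept]

# "Single neuron" ontology: correlate each neuron with the concept separately.
per_neuron = [pearson([a[i] for a in activations], concept) for i in range(4)]

# "Direction in activation space" ontology: project onto the direction first.
projection = [sum(ai * di for ai, di in zip(a, direction)) for a in activations]
projected = pearson(projection, concept)

print("best single neuron:", max(per_neuron), "| along direction:", projected)
```

In this setup the direction is handed to us; the harder question the section raises is how to discover it rather than assume it.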

What we'd really like is to avoid assuming an ontology, and rather discover/derive the ontology itself as part of our project. For instance, we could run an experiment where we change one human-interpretable "thing" in the environment, and then look at how that changes the trained net; that would let us discover how the concept is embedded rather than assume it from the start (credit to Chu for this suggestion). Another approach is to start out with some intuitive story for why a particular ontology is favored - e.g. if we have a graph with local connectivity, then maybe the Telephone Theorem kicks in. Such an argument should (a) allow us to rule out interactions which circumvent the favored ontology, and (b) be testable in its own right, e.g. for the Telephone Theorem we can (in principle) check the convergence of mutual information to a limit.
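As a sketch of what "check the convergence of mutual information to a limit" could look like in the simplest possible case: a chain of independent binary symmetric channels with a uniform input bit, where the mutual information between the first variable and the n-th has a closed form. The function name `mi_after_n_hops` and the flip probability 0.1 are illustrative assumptions, and in this toy chain the limit happens to be zero; a real check on a trained net would need an estimator for mutual information rather than a formula.

```python
import math

def binary_entropy(q):
    """Entropy in bits of a Bernoulli(q) variable."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def mi_after_n_hops(flip_prob, n):
    """I(X_0; X_n) in bits, for uniform X_0 passed through n independent
    binary symmetric channels that each flip the bit with probability flip_prob."""
    # n chained flips compose into a single channel with this flip probability.
    effective_flip = (1 - (1 - 2 * flip_prob) ** n) / 2
    return 1.0 - binary_entropy(effective_flip)

mi = [mi_after_n_hops(0.1, n) for n in range(40)]

# Mutual information never increases along the chain, so the sequence converges.
non_increasing = all(a >= b - 1e-12 for a, b in zip(mi, mi[1:]))
last_step = mi[-2] - mi[-1]
print("I(X_0; X_0) =", mi[0], "| non-increasing:", non_increasing, "| last step:", last_step)
```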

Open The Black Box

Don’t just run a black-box experiment on a network, or try to prove a purely behavioral theorem. We want to talk about internal structure.

Partly, opening the black box is about tackling the Hard Parts rather than avoiding them. Not opening the black box is a red flag; it's usually a sign of avoiding the Hard Parts.

Partly, opening the black box is about getting a very rich data channel. When we just work with a black box, we get relatively sparse data about what's going on. When we open the black box, we can in-principle directly observe every gear and directly check what's going on.

Relative Importance of These Principles

Tackle The Hamming Problems is probably the advice which is most important to follow for marginal researchers right now, but mostly I expect people who aren't already convinced of it will need to learn it the hard way. (I certainly had to learn it the hard way, though I did that before starting to work on alignment.) Open the Black Box follows pretty naturally once you're leaning in to the Hard Parts.

Once you're past that stumbling block, I think the most important principles are Derive the Ontology and Operationalize. These two are important for opposing types of people. Some people tend to stay too abstract and avoid committing to an ontology, but never operationalize and therefore miss out on the main value-add. Other people operationalize prematurely, adopting ad-hoc operationalizations, and Deriving the Ontology pretty strongly discourages that.

Have an Intuitive Story is especially helpful for people who tend to get lost in the weeds and go nowhere. Make sure you have an intuitive story, and use that story to guide everything else.


Because each module has lots of stuff going on inside, but only a low-dimensional interface, there should be many ways to change around the insides of a module while keeping the externally-visible behavior the same. Because such changes don't change behavior, they don't change system performance. So, we expect that modularity implies lots of degrees-of-freedom in parameter space, i.e. broad peaks.

I know it's a bit off-topic, but FWIW I don't immediately share this intuition. If there are "many ways to change around the insides of a module while keeping the externally-visible behavior the same", then if the whole network is just one "module" (i.e. it's not modular at all), can't I likewise say there are "many ways to change around the insides of [the one module which comprises the entire network] while keeping the externally-visible behavior the same"?

Yup. I'm also not entirely convinced by this argument, for the same reason.

Well, isn't having multiple modules a precondition to something being modular? That seems like what's happening in your example: it has only one module, so it doesn't even make sense to apply John's criterion.

I think Steven's point is that if your explanation for modularity leading to broadness is that the parameters inside a module can take any configuration, conditioned on the output of the module staying the same, then you're at least missing an additional step showing that a network consisting of two modules with n/2 parameters each has more freedom in those parameters than a network consisting of one module (just the entire network itself) with n parameters does. Otherwise you're not actually pointing out how this favours modularity over non-modularity.

Which may seem rather non-obvious. Intuitively, you might think that the two modules scenario has more constraints on the parameters than the one module scenario, since there's two places in the network where you're demanding particular behaviour rather than one.

My own guiding intuition for why modularity seems to cause broadness goes the more circumspect path of "modularity seems connected to abstraction, abstraction seems connected to generality, generality seems connected to broadness in parameter space". 

I think the hand-wavy math we currently have also points more towards this connection. It seems to talk about how broadness is connected to dropping information about the input, as much as you can while still getting the right answer. Which sure looks suggestively like a statement about avoiding fine tuning. And modules are things that only give out small summaries of what goes on inside them, rather than propagating all the information they contain. 

Which may seem rather non-obvious. Intuitively, you might think that the two modules scenario has more constraints on the parameters than the one module scenario, since there's two places in the network where you're demanding particular behaviour rather than one.

Don't more constraints mean less freedom, and therefore less broadness in parameter space?

(Sorry if that's a stupid question, I don't really understand the reasoning behind the whole connection yet.)

(And thanks, the last two paragraphs were helpful, though I didn't look into the math!)

Yes, that was the point. At least at first blush, this line of argument looks like it's showing the opposite of what it purports to, so maybe it isn't that great of an explanation.

On a separate note, I think the math I referenced above can now be updated to say: broadness is dependent on the number of orthogonal features a network has, and how large the norm of these features is. Where both feature orthogonality and norm are defined by the L2 Hilbert space norm, which you may know from quantum mechanics. 

This neatly encapsulates, extends, and quantifies the "information loss" notion in Vivek's linked post above. It also sounds a lot like it's formalising intuitions about broadness being connected to "generality", "simplicity", and lack of "fine tuning".

It also makes me suspect that the orthogonal feature basis is the fundamentally correct way to think about computations in neural networks.

Post on this incoming once I figure out how to explain it to people who haven't used Hilbert space before.

Thanks for the post! As always I broadly agree, but I have a bunch of nitpicks.

You can save yourself several years of time and effort by actively trying to identify the Hard Parts and focus on them, rather than avoid them. Otherwise, you'll end up burning several years on ideas which don't actually leave the field better off.

I agree that avoiding the Hard Parts is rarely productive, but you also don't address one relevant concern: what if the Hard Part is not merely Hard, but actually Impossible? In this case your advice can also be cashed out by trying to prove it is impossible instead of avoiding it. And as with most impossibility results in TCS, even if the precise formulation is impossible, that often just means you need to reframe the problem a bit.

Mostly, I think the hard parts are things like "understand agency in general better" and "understand what's going on inside the magic black boxes". If your response to such things is "sounds hard, man", then you have successfully identified (some of) the Hard Parts.

I expect you would also say that a crucial hard part many people are avoiding is "how to learn human values?", right? (Not the true names, but a useful pointer)

The point of the intuitive story is to steer our search. Without it, we risk blind empiricism: just cataloguing patterns without building general models/theory/understanding for what's going on. In that mode, we can easily lose track of the big picture goal and end up cataloguing lots of useless stuff. An intuitive story gives us big-picture direction, and something to aim for. Even if it turns out to be wrong!

I want to note that the failure mode of blind theory here is to accept any story, and thus make the requirement of a story completely impotent to guide research. There's an art (and hopefully a science) to finding stories that bias towards productive mistakes.

Most of the value and challenge is in finding the right operationalizations of the vague concepts involved in those arguments, such that the argument is robustly correct and useful. Because it's where most of the value and most of the challenge is, finding the right operationalization should typically be the central focus of a project.

I expect you to partially disagree, but there's not always a "right" operationalization, and there's a failure mode where one falls in love with one's neat operationalization, making the parts of the phenomenon that it misses invisible.

Don’t just run a black-box experiment on a network, or try to prove a purely behavioral theorem. We want to talk about internal structure.

I want to say that you should start with a behavioral theorem, and often the properties you want to describe might make more sense behaviorally, but I guess you're going to answer that we have evidence that this doesn't work in Alignment and so it is avoiding the Hard Part. Am I correct?

Partly, opening the black box is about tackling the Hard Parts rather than avoiding them. Not opening the black box is a red flag; it's usually a sign of avoiding the Hard Parts.

One formal example of this is the relativization barrier in complexity theory, which tells you that you can't prove P ≠ NP (and a bunch of other separations) using only techniques that treat algorithms as black boxes instead of looking at their structure.

Once you're past that stumbling block, I think the most important principles are Derive the Ontology and Operationalize. These two are important for opposing types of people. Some people tend to stay too abstract and avoid committing to an ontology, but never operationalize and therefore miss out on the main value-add. Other people operationalize prematurely, adopting ad-hoc operationalizations, and Deriving the Ontology pretty strongly discourages that.

Agreed that it's a great pair of advice to keep in mind!

I expect you would also say that a crucial hard part many people are avoiding is "how to learn human values?", right? (Not the true names, but a useful pointer)

Yes, although I consider that one more debatable.

I expect you to partially disagree, but there's not always a "right" operationalization...

When there's not a "right" operationalization, that usually means that the concepts involved were fundamentally confused in the first place.

I want to say that you should start with a behavioral theorem, and often the properties you want to describe might make more sense behaviorally, but I guess you're going to answer that we have evidence that this doesn't work in Alignment and so it is avoiding the Hard Part. Am I correct?

Actually, I think starting from a behavioral theorem is fine. It's just not where we're looking to end up, and the fact that we want to open the black box should steer what starting points we look for, even when those starting points are behavioral.

When there's not a "right" operationalization, that usually means that the concepts involved were fundamentally confused in the first place.


Curious about the scope of the conceptual space where this belief was calibrated. It seems to me to tacitly say something like "everything that's important is finitely characterizable".

Maybe the "fundamentally confused" in your phrasing already includes the case of "stupidly tried to grab something that wasn't humanly possible, even if possible in principle" as one way for a human to be confused, without making any claim of reality being conveniently compressible at all levels. (Note that this link explicitly disavows beauty at "all levels" too.)

I suppose you might also say "I didn't make any claim of finiteness", but I do think something like "at least some humans are only a finite string away from grokking anything" is implicit if you expect there to be blogposts/textbooks that can operationalize everything relevant. It would be an even stronger claim than "finiteness"; it would be "human-typical length strings".

I believe Adam is pointing at something quite important, akin to a McNamara fallacy for formalization. To paraphrase:

The first step is to formalize whatever can be easily formalized. This is OK as far as it goes. The second step is to disregard that which can't be easily formalized or to make overly simplifying assumptions. This is artificial and misleading. The third step is to presume that what can't be formalized easily really isn't important. This is blindness. The fourth step is to say that what can't be easily formalized really doesn't exist. This is suicide.

In the case of something that has already been engineered (human brains with agency), we probably should grant that it is possible to operationalize everything relevant. But I want to push back on the general version, and would want "why do you believe simple-formalization is possible here, in this domain?" to be a question that is allowed to be asked.

[PS. am not a native speaker]

I generally agree with you on the principle Tackle the Hamming Problems, Don't Avoid Them.

That being said, some of the Hamming problems I see that are being avoided most on this forum, and in the AI alignment community, are

  1. Do something that will affect policy in a positive way

  2. Pick some actual human values, and then hand-encode these values into open source software components that can go into AI reward functions

I agree with 1 (but then it is called alignment forum, not the more general AI Safety forum). But I don't see that 2 would do much good.

All narratives I can think of where 2 plays a significant part sound like strawmen to me; perhaps you could help me?

Not sure what makes you think 'strawmen' at 2, but I can try to unpack this more for you.

Many warnings about unaligned AI start with the observation that it is a very bad idea to put some naively constructed reward function, like 'maximize paper clip production', into a sufficiently powerful AI. Nowadays on this forum, this is often called the 'outer alignment' problem. If you are truly worried about this problem and its impact on human survival, then it follows that you should be interested in doing the Hard Thing of helping people all over the world write less naively constructed reward functions to put into their future AIs.

John writes:

Far and away the most common failure mode among self-identifying alignment researchers is to look for Clever Ways To Avoid Doing Hard Things. [...] The most common pattern along these lines is to propose outsourcing the Hard Parts to some future AI [...]

This pattern of outsourcing the Hard Part to the AI is definitely on display when it comes to 2 above. Academic AI/ML research also tends to ignore this Hard Part entirely, and implicitly outsources it to applied AI researchers, or even to the end users.

A "cheat" is a solution to a problem that is invariant to a wide range of scenarios for how the hard parts could be solved individually.

ML itself is a cheat. Even if we don't understand the particulars of the information-processing task, we can just bonk it with an ML algorithm and it spits out a solution for us.

But in order to have a hope of finding an adequate cheat code, you need to have a good grasp of at least where the hard parts are even if you're unsure about how they could be tackled individually. And constraining your expectation over what the possible subsolutions should look like expands the range of cheats you could apply, because now they need to be invariant to a smaller space of possible scenarios.[1]

Insofar as you're saying that we can't hope to find remotely adequate cheats unless we start with a rough understanding of what we even need to cheat over, I agree. I don't think you're saying that we shouldn't be looking for cheats in the first place, but it could be interpreted that way. Yes, it has the problem that it doesn't build upon itself as well as directly challenging the hard parts, but, realistically, I think the solution has to look like some kind of cheat.

  1. ^

    There's this funny dynamic where if you expand the range of plausible solutions you can search through (e.g. by constraining your expectation for what they need to be invariant to), it might become harder to locate a particular area of the search space. If effort spent on constraining expectation expands the search space, then it makes sense to at least confirm that there are no fully invariant solutions at the top layer before you iterate and search a broader range.

Far and away the most common failure mode among self-identifying alignment researchers is to look for Clever Ways To Avoid Doing Hard Things (or Clever Reasons To Ignore The Hard Things), rather than just Directly Tackling The Hard Things


Regarding making use of a superintelligent AGI-system in a safe way, here are some things that possibly could help with that:

  1. Working on making it so that the first superintelligent AGI is robustly aligned from the beginning.
  2. Working on security measures so that if we didn't do #1 (or thought we did, but didn't), the AGI will be unable to "hack" itself out in some digital way (e.g. exploiting some OS flaw and getting internet access)
  3. Developing and preparing techniques/strategies so that if we didn't do #1 (or thought we did, but didn't), we can obtain help from various instances of the AGI-system that get us towards a more and more aligned AGI-system, while (1) minimizing the causal influence of the AGI-systems and the ways they might manipulate us and (2) making requests in such a way that we can verify that what we are getting is what we actually want, greatly leveraging how verifying a system is often much easier than making it.

#2 and #3 seem to me worth pursuing in addition to #1, but not instead of #1. Rather, #2 and #3 could work as additional layers of alignment-assurance.

I do think genuine failure modes are being alluded to by "Clever Ways To Avoid Doing Hard Things", but I think there also may be failure modes having to do with encouraging "everyone" to only work on "The Hard Things" in a direct way (without people also looking for potential workarounds and additional layers of alignment-assurance).

Also, consider if someone comes up with alignment methodologies for an AGI that don't seem robust or fully safe, but do seem like they might have a decent/good chance of working in practice. Such alignment methodologies may be bad ideas if they are used as "the solution", but if we have a "system of systems", where some of the sub-systems themselves are AGIs that we have attempted to align based on different alignment methodologies, then we can e.g. see if the outputs from these different sub-systems converge.

Sincerely, someone who does not call himself an alignment researcher, but who does self-identify as a "hobbyist alignment theorist", and is working on a series where much of the focus is on Clever Techniques/Strategies That Might Work Even If We Haven't Succeeded At The Hard Things (and thus maybe could provide additional layers of alignment-assurance).

This was very insightful. It seems like a great thing to point to, for the many newish-to-alignment people ideating research agendas (like myself). Thanks for writing and posting!

I think these are very good principles. Thank you for writing this post, John.

Thoughts on actually going about Tackling the Hamming Problems in real life:

My model of humans (and of doing research) says that in order for a human to actually successfully work on the Hard Parts, they probably need to enjoy doing so on a System 1 level. (Otherwise, it'll probably be an uphill battle against subconscious flinches away from Hard Parts, procrastinating with easier problems, etc.)

I've personally found these essays to be insightful and helpful for (i.a.) training my S1 to enjoy steering towards Hard Parts. I'm guessing they could be helpful to many other people too.

Tackle the [Hamming Problems](https://www.lesswrong.com/posts/Thwfy4gNFx9kHgvov/research-hamming-questions), Don't Avoid Them

I agree with that statement and this statement

Far and away the most common failure mode among self-identifying alignment researchers is to look for Clever Ways To Avoid Doing Hard Things [...]

seems true as well. However, there was something in this section that didn't seem quite right to me.

Say that you have identified the Hamming Problem at the lowest resolution to be getting the outcome "AI doesn't cause extinction or worse". However, if you zoom in a little bit you might find that there are different narratives that lead to the same goal. For example:

  • AGI isn't developed due to event(s)
  • AGI is developed safely due to event(s)

At this level I would say that it is correct to go for the easier narrative. Going for the harder problems seems to be what you do once you zoom into these narratives.

For each path you can imagine a set of events (e.g. research break-throughs) that are necessary and sufficient to solve the end-goal. Here I'm unsure but my intuition tells me that the marginal impact would often be greater working on the necessary parts that are the hardest as these are the ones that are least likely to be solved without intervention.

Of course working on something that isn't necessary in any narrative would probably be easier in most cases but would never be a Hamming Problem.

For each path you can imagine a set of events (e.g. research break-throughs) that are necessary and sufficient to solve the end-goal. Here I'm unsure but my intuition tells me that the marginal impact would often be greater working on the necessary parts that are the hardest as these are the ones that are least likely to be solved without intervention.

This is exactly right, and those are the things which I would call Hamming Problems or the Hard Parts.

I suppose I would just like to see more people start at an earlier level; from that vantage point, you might actually want to switch to a path with easier parts.