"John, what do you think of this idea for an alignment research project?"
I get questions like that fairly regularly. How do I go about answering? What principles guide my evaluation? Not all of my intuitions for what makes a project valuable can easily be made legible, but I think the principles in this post capture about 80% of the value.
Tackle the Hamming Problems, Don't Avoid Them
Far and away the most common failure mode among self-identifying alignment researchers is to look for Clever Ways To Avoid Doing Hard Things (or Clever Reasons To Ignore The Hard Things), rather than just Directly Tackling The Hard Things.
The most common pattern along these lines is to propose outsourcing the Hard Parts to some future AI, and "just" try to align that AI without understanding the Hard Parts of alignment ourselves. The next most common pattern is to argue that, since Hard Parts are Hard, we definitely don't have enough time to solve them and should therefore pretend that we're going to solve alignment while ignoring them. Third most common is to go into field building, in hopes of getting someone else to solve the Hard Parts. (Admittedly these are not the most charitable summaries.)
There is value in seeing how dumb ideas fail. Most of that value is figuring out what the Hard Parts of the problem are - the taut constraints which we run into over and over again, which we have no idea how to solve. (If it seems pretty solvable, it's probably not a Hard Part.) Once you can recognize the Hard Parts well enough to try to avoid them, you're already past the point where trying dumb ideas has much value.
On a sufficiently new problem, there is also value in checking dumb ideas just in case the problem happens to be easy. Alignment is already past that point; it's not easy.
You can save yourself several years of time and effort by actively trying to identify the Hard Parts and focus on them, rather than avoid them. Otherwise, you'll end up burning several years on ideas which don't actually leave the field better off. That's one of the big problems with trying to circumvent the Hard Parts: when the circumvention inevitably fails, we are still no closer to solving the Hard Parts. (It has been observed both that alignment researchers mostly seem to not be tackling the Hard Parts, and that alignment research mostly doesn't seem to build on itself; I claim that the latter is a result of the former.)
Mostly, I think the hard parts are things like "understand agency in general better" and "understand what's going on inside the magic black boxes". If your response to such things is "sounds hard, man", then you have successfully identified (some of) the Hard Parts.
Have An Intuitive Story Of What We're Looking For
One project going right now is looking at how modularity in trained systems corresponds to broad peaks in parameter space. Intuitive story for that: we have two "modules", each with lots of stuff going on inside, but only a relatively-low-dimensional interface between them. Because each module has lots of stuff going on inside, but only a low-dimensional interface, there should be many ways to change around the insides of a module while keeping the externally-visible behavior the same. Because such changes don't change behavior, they don't change system performance. So, we expect that modularity implies lots of degrees-of-freedom in parameter space, i.e. broad peaks.
This story is way too abstract to be able to look for immediately in a trained net. How do we operationalize "modules", and find them? How do we operationalize "changes in a module", especially since parameter space may not line up very neatly with functional modules? But that's fine; the story can be pretty abstract.
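To make the modularity story a bit more concrete, here is a toy sketch in Python. It is nowhere near a trained net, and every name in it is hypothetical. The assumed setup: a "module" is a linear map with an `h`-dimensional internal state `s = W x`, and the "interface" is simply the first `k` coordinates of `s`. The code nudges each internal parameter in turn and counts the nudges which leave the externally-visible behavior unchanged on a set of probe inputs.

```python
import random

def interface(W, x, k):
    """Toy module: internal state s = W x; only the first k coordinates are visible outside."""
    s = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return s[:k]

def count_flat_parameters(h, n, k, seed=0):
    """Count single-parameter nudges to W that leave the interface unchanged on all probes."""
    rng = random.Random(seed)
    # Integer weights and inputs keep the arithmetic exact, so equality checks are safe.
    W = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(h)]
    # Probe inputs have no zero entries, so any change to a visible coordinate shows up.
    probes = [[rng.randint(1, 5) for _ in range(n)] for _ in range(2 * n)]
    baseline = [interface(W, x, k) for x in probes]
    flat = 0
    for i in range(h):
        for j in range(n):
            W[i][j] += 1  # nudge one internal parameter
            if [interface(W, x, k) for x in probes] == baseline:
                flat += 1
            W[i][j] -= 1  # undo the nudge
    return flat

# Narrower interface (smaller k) => more parameter directions that don't change behavior.
for k in (8, 4, 2, 1):
    print(k, count_flat_parameters(h=8, n=4, k=k))
```

In this toy, the count is exactly `(h - k) * n`: the narrower the interface, the more ways there are to change the insides without changing behavior. The hard questions above (what counts as a module or an interface in a real net) are assumed away here, which is exactly why this is only a story.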
The point of the intuitive story is to steer our search. Without it, we risk blind empiricism: just cataloguing patterns without building general models/theory/understanding for what's going on. In that mode, we can easily lose track of the big picture goal and end up cataloguing lots of useless stuff. An intuitive story gives us big-picture direction, and something to aim for. Even if it turns out to be wrong!
Operationalize
It's relatively easy to make vague/abstract intuitive arguments. Most of the value and challenge is in finding the right operationalizations of the vague concepts involved in those arguments, such that the argument is robustly correct and useful. Because it's where most of the value and most of the challenge is, finding the right operationalization should typically be the central focus of a project.
My abstraction work is a good example here. I started with some examples of abstraction and an intuitive story about throwing away information while keeping info relevant "far away". Then, the bulk of the work was to operationalize that idea in a way which matched all the intuitive examples, and made the intuitive stories provable.
Derive the Ontology, Don't Assume It
In ML interpretability, some methods look at the computation graph of the net. Others look at orthogonal directions in activation space. Others look at low-rank decompositions of the weight matrices. These are all "different ontologies" for interpretation. Methods which look at one of these ontologies will typically miss structure in the others; e.g. if I run a graph clustering algorithm on the computation graph, I probably won't pick up interpretable concepts embedded in directions in activation space.
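A toy illustration of that kind of miss, in Python. This is not graph clustering; it swaps in a simpler stand-in, a per-neuron (axis-aligned) analysis, to show the same failure. The data is synthetic and the setup is an assumption: two hypothetical neurons each carry a binary concept plus a large nuisance signal with opposite signs, so the concept lives along the direction `(1, 1)` in activation space rather than in either neuron alone.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def make_activations(num_samples=2000, seed=0):
    """Synthetic activations: concept c is embedded along (1, 1), nuisance u along (1, -1)."""
    rng = random.Random(seed)
    concept, neuron_a, neuron_b = [], [], []
    for _ in range(num_samples):
        c = rng.choice([0.0, 1.0])  # the human-interpretable concept
        u = rng.gauss(0.0, 5.0)     # unrelated nuisance signal, much larger than c
        concept.append(c)
        neuron_a.append(c + u)
        neuron_b.append(c - u)
    return concept, neuron_a, neuron_b

concept, neuron_a, neuron_b = make_activations()

# Ontology 1: look at individual neurons. The concept is swamped by the nuisance signal.
print("neuron a:", pearson(concept, neuron_a))
print("neuron b:", pearson(concept, neuron_b))

# Ontology 2: look along a direction in activation space. The concept is read off cleanly.
direction_readout = [(a + b) / 2 for a, b in zip(neuron_a, neuron_b)]
print("direction (1, 1):", pearson(concept, direction_readout))
```

Each neuron correlates only weakly with the concept, while the readout along `(1, 1)` recovers it almost exactly. Of course, the example cheats: it knows the right direction in advance, which is precisely the thing a real project would need to discover.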
What we'd really like is to avoid assuming an ontology, and rather discover/derive the ontology itself as part of our project. For instance, we could run an experiment where we change one human-interpretable "thing" in the environment, and then look at how that changes the trained net; that would let us discover how the concept is embedded rather than assume it from the start (credit to Chu for this suggestion). Another approach is to start out with some intuitive story for why a particular ontology is favored - e.g. if we have a graph with local connectivity, then maybe the Telephone Theorem kicks in. Such an argument should (a) allow us to rule out interactions which circumvent the favored ontology, and (b) be testable in its own right, e.g. for the Telephone Theorem we can (in principle) check the convergence of mutual information to a limit.
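As a sketch of what "check the convergence of mutual information to a limit" could look like, here is a toy Python example. It is not the Telephone Theorem itself. The assumptions are mine: a hypothetical 4-state Markov chain whose states split into two closed blocks, so that the block label is conserved along the chain, and a uniform distribution over the starting state. The code computes the mutual information between the starting state and the state `t` steps away, exactly, for increasing `t`.

```python
from math import log2

def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(n)) for j in range(n)] for i in range(n)]

def mutual_information_bits(p0, T_t):
    """I(X_0; X_t) in bits, where T_t[i][j] = P(X_t = j | X_0 = i)."""
    n = len(p0)
    p_t = [sum(p0[i] * T_t[i][j] for i in range(n)) for j in range(n)]
    mi = 0.0
    for i in range(n):
        for j in range(n):
            joint = p0[i] * T_t[i][j]
            if joint > 0:
                mi += joint * log2(joint / (p0[i] * p_t[j]))
    return mi

def mi_along_chain(p0, T, steps):
    """Return [I(X_0; X_0), I(X_0; X_1), ..., I(X_0; X_steps)]."""
    n = len(p0)
    T_t = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # T^0 = identity
    values = []
    for _ in range(steps + 1):
        values.append(mutual_information_bits(p0, T_t))
        T_t = mat_mul(T_t, T)
    return values

# Hypothetical chain: states {0, 1} and {2, 3} never mix, so the block label is conserved.
T = [
    [0.9, 0.1, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.4, 0.6],
]
p0 = [0.25, 0.25, 0.25, 0.25]

mi_values = mi_along_chain(p0, T, steps=40)
for t in (0, 1, 5, 10, 40):
    print(t, mi_values[t])
```

The mutual information starts at 2 bits (the full starting state), never increases, and levels off at 1 bit: far away, the only surviving information about the start is the conserved block label. In a real net we would have to estimate these quantities rather than compute them exactly, which is a large part of why the check is only possible "in principle".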
Open The Black Box
Don’t just run a black-box experiment on a network, or try to prove a purely behavioral theorem. We want to talk about internal structure.
Partly, opening the black box is about tackling the Hard Parts rather than avoiding them. Not opening the black box is a red flag; it's usually a sign of avoiding the Hard Parts.
Partly, opening the black box is about getting a very rich data channel. When we just work with a black box, we get relatively sparse data about what's going on. When we open the black box, we can in principle directly observe every gear and directly check what's going on.
Relative Importance of These Principles
Tackle The Hamming Problems is probably the advice which is most important to follow for marginal researchers right now, but mostly I expect people who aren't already convinced of it will need to learn it the hard way. (I certainly had to learn it the hard way, though I did that before starting to work on alignment.) Open the Black Box follows pretty naturally once you're leaning in to the Hard Parts.
Once you're past that stumbling block, I think the most important principles are Derive the Ontology and Operationalize. These two are important for opposing types of people. Some people tend to stay too abstract and avoid committing to an ontology, but never operationalize and therefore miss out on the main value-add. Other people operationalize prematurely, adopting ad-hoc operationalizations, and Deriving the Ontology pretty strongly discourages that.
Have an Intuitive Story is especially helpful for people who tend to get lost in the weeds and go nowhere. Make sure you have an intuitive story, and use that story to guide everything else.