Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In some sense, any new measurement device automates a part of research. A thermometer automates the task of sticking one's finger in something to check how hot it is, a scale automates the task of holding something to check how heavy it is, etc. The automated version is not only more convenient, but more precise and reproducible (as is usually the case when automating things). For alignment, one analogue might be interpretability tools which automate the work done when some human looks at a part of a neural net and sees what it's doing.

Let's take that last example and dig into it a bit more: interpretability tools which automate the work done when some human looks at a part of a net and sees what it's doing. We want to leverage the analogy to other measurement tools, like a thermometer or a scale, to better understand automation of interpretability.

Here's one type of proposal I hear a lot: to automate interpretability, have some human researchers look at parts of a net, poke at them, and write up an explanation of how they're interpreting it. Collect data from many such instances, and train a neural net to take net-parts and produce explanations.

We want to leverage the analogy to thermometers or scales, so what would be the analogous strategy for making a thermometer or scale? Well, have a bunch of humans stick their fingers in a bunch of stuff and report how hot the stuff is, then train a neural net to replicate the humans' hotness-reports. Or, have a bunch of humans hold things and report how heavy they are, then train a net to replicate the humans' heaviness-reports.

Hopefully it is obvious that the "train a net to replicate human reports" results would not be nearly as useful, for purposes of scientific progress, as actual thermometers or scales. But what's missing? And how can we carry that insight back to the interpretability problem?

The thermometer has two great powers: a simple legible data type, and reproducibility. First, simple legible data type: the thermometer's output is a single number (the temperature), and we can compare that number with other thermometer-outputs. That's a kind-of-thing for which we have very precise mathematical understanding: we know exactly what kinds-of-things we can do with these numbers, we have a nice general representation, we're confident that different people mean the same thing by numbers, etc. This is in contrast to natural language, which is typically ambiguous, doesn't necessarily make it obvious what we can do, leads to frequent miscommunication, etc.

Second, reproducibility: if the thermometer says X is hotter than Y, then when I put X and Y in contact, X gets cooler and Y gets hotter (all else equal). I can use the thermometer to rank hotness of a bunch of things, sort them by thermometer reading, and consistently (approximately-deterministically) find that the things on the hotter end feel hotter than the things on the colder end. This is what makes the single-number output (temperature) actually useful: it approximately-deterministically predicts some stuff, across a broad range of contexts, based on just those simple numbers.
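To make the reproducibility claim concrete, here is a minimal sketch in Python. It assumes a toy model (not from the post) in which two bodies placed in contact settle at the heat-capacity-weighted mean of their temperatures; the names `contact`, `reading_predicts_flow`, and the example readings are all hypothetical. The toy model builds the pattern in by construction, so this only illustrates the kind of check a real thermometer passes: for every pair, the single-number reading alone predicts which body cools and which warms.

```python
from itertools import combinations


def contact(temp_a, temp_b, cap_a=1.0, cap_b=1.0):
    """Toy model (an assumption): two bodies exchange heat until they
    share one temperature, the heat-capacity-weighted mean."""
    final = (cap_a * temp_a + cap_b * temp_b) / (cap_a + cap_b)
    # Return the temperature change of each body.
    return final - temp_a, final - temp_b


def reading_predicts_flow(readings):
    """Check every pair: the body with the higher reading must cool,
    and the body with the lower reading must warm."""
    for (_, t_a), (_, t_b) in combinations(readings.items(), 2):
        delta_a, delta_b = contact(t_a, t_b)
        if t_a > t_b and not (delta_a < 0 < delta_b):
            return False
        if t_a < t_b and not (delta_a > 0 > delta_b):
            return False
    return True


# Hypothetical thermometer readings in degrees Celsius.
readings = {"ice water": 0.0, "tap water": 15.0, "coffee": 70.0}
print(reading_predicts_flow(readings))
```

A natural-language report like "pretty warm" gives no comparable pairwise check, which is the contrast the thermometer analogy is pointing at.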

Exercise for the reader: walk through the same analogy for a scale.

What would be the analogy of a thermometer for an interpretability tool? Well, something which we can point at part of a net, and get back a simple legible output, which approximately-deterministically predicts some stuff across a broad range of contexts.

When you look at it like that, it's clear that building a measurement tool like a scale or thermometer is itself a task which requires scientific insight. It requires finding some approximately-deterministically-reproducible pattern, which can be predicted by some simple legible summary data. Just picking some ad-hoc measurement convention (like e.g. reporting hotness using natural language) fails to capture most of the value. Most of the value isn't in the automation work itself, it's in noticing the reproducible pattern.

I'm generally pretty pessimistic about research-automation proposals analogous to "train a net to mimic a human sticking their finger in some stuff and reporting hotness". But I am much more optimistic about things analogous to "notice that a single number for each subsystem allows us to reproducibly predict which things get hotter/colder when in contact".

7 comments

I think this is going to be wrong as an approach. Weight and temperature are properties of physical systems at specific points in time, and can be measured coherently because we understand laws about those systems. Alignment could be measured as a function of a particular system at a specific point in time, once we have a clear understanding of what? All of human values? 

I'm not arguing that "alignment" specifically is the thing we should be measuring.

More generally, a useful mantra is "we do not get to choose the ontology". In this context, it means that there are certain things which are natural to measure (like temperature and weight), and we do not get to pick what they are; we have to discover what they are.

That's correct. My point is that measuring goals which are not natural to measure will, in general, have many more problems with Goodharting and similar misoptimization and overoptimization pressures. And other approaches can be more productive, or at least more care is needed with design of metrics rather than discovery of what to measure and how.

I'm with Davidmanheim here: it seems this idea could benefit from some reading in measurement theory, or at least from recognizing a discrepancy that undermines the analogy. I'll get into that a bit, but to start, the post was definitely positive food for thought.

If you're measuring actual temperature, you have some measure options there too, but fundamentally it's a quality of the material under study. If you're measuring "the" perceived temperature, it's an interaction between "the average person" and the material, and sticking fingers in is probably a good measure. Yes, temperature and perceived temperature will correlate, but if the thing you're measuring exists only in someone's head, you're going to have to go to their head for the measurement (also see psychophysics).

"Train[ing] a net to replicate human reports" is not obviously less useful than "actual" scales. Human reports may in fact be the most construct-valid measure. (Though I do agree that leaving these reports in the form of natural language rather than attempted quantifications would indeed be ambiguous, and if we lack face-valid quantitative measures, we will have to develop them from somewhere, probably with those open-ended responses as a foundation.)

Although human reports may be noisy, so are all measures. The thermometer has an implicit +/- margin of error. It seems very precise to us, but human judgments of attributes can also be reliable (in that lots of people agree) and precise (in that the error bars are narrow). For example, if I asked a lot of people to rate the perceived precision of various measures on a scale of 1=extremely noisy to 100=extremely precise, I expect there to be a decent amount of consistency in the rank ordering of those ratings, for thermometers to score highly, and for at least some of the average perceived precisions to flash pleasantly narrow error bars.

But because even the lowest-variance perceptions vary a lot between people (vs. the variability in temperature readings from a thermometer), I do suspect you're not going to get readings that are "approximately-deterministically" useful indicators for lots of perceptual domains, such as alignment. But you'll get indicators that "far-from-deterministically-but-reliably" predict variance in criterion variables. In the end, we're pessimistic and optimistic about the same things; I just don't think it's because human reports are inherently the wrong tool, it's because the attribute of interest is a psychological construct rather than a conveniently-precisely-measurable-physical property. Again, the post was good food for thought - just as measurement of temperature has improved and gotten more precise (touch it -> use mercury -> use radiation), maybe the methods we use for psychological measurement will develop and improve, with hope for alignment.

I think one of the biggest problems with human reports is that it is very unclear what they actually measure.

It seems reasonable to suppose that they measure the best combination of constructs for the human purposes using the best information available to human senses, within the contexts the humans usually operate. This makes it straightforwardly the best information available to humans.

But in order to make sense of this from an external scientific perspective, we have a lot of trouble. Can we precisely characterize the purposes for which people use the information? Can we precisely characterize the external sources of the information, and how those sources work in the human contexts? Maybe we can, but if so it's a huge research project.

These sorts of questions are necessary to answer for alignment-related purposes, as they can tell us how the system extrapolates, e.g. which kinds of deception it gracefully handles. However, human-mimicking approaches don't solve this problem, they just complicate it by adding an extra layer of indirection where things can go wrong.

Getting a general theory for how these sorts of perceptions work is useful both because it allows us to more precisely enumerate the failure cases, and because it can teach us what a system must do to avoid these error cases.

It seems to me that you fail to understand the natural direction of interpretation. Given an "object" O, an "interpretation" of O in terms of I is a map M: I -> O that preserves some structure from I into O. Not the other way around.

Your physical scale (or your physical thermometer) is, above all, a physical object S. Calling such a physical object S a scale asserts the existence of an "interpretation map" M: I -> S that preserves the structure of I into S in a relevant and satisfying way. What is the domain I here? For an engineer, it's going to be the theory of Newtonian mechanics, whose primitives are the concepts of mass, space, time, and force. On top of that, a very explicit notion of weight/gravitational interaction between two objects has to be given.

That is, the domain I of the interpretation map actually contains more than just the primitive concepts; it also gives an explicit form of F in the F = ma part of Newtonian mechanics. You can, for example, go with F = -mg if that satisfies you, or go with the less incorrect F = -GmM/r^2 form. Still, both forms require as data the earth's mass and a knowledge of G (either directly, or indirectly through a process of calibration).
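As a worked illustration of how the two force laws relate, evaluate the inverse-square form at the earth's surface. The symbols M_⊕ and R_⊕ (earth's mass and mean radius) and the rounded numerical values are assumptions added here, not part of the comment:

```latex
% Inverse-square law evaluated at the earth's surface, r = R_\oplus:
F = -\frac{G\, m\, M_\oplus}{R_\oplus^{2}} = -m g,
\qquad\text{where}\qquad
g = \frac{G\, M_\oplus}{R_\oplus^{2}}.

% With approximate standard values
% G \approx 6.674\times10^{-11}\ \mathrm{N\,m^2\,kg^{-2}},
% M_\oplus \approx 5.972\times10^{24}\ \mathrm{kg},
% R_\oplus \approx 6.371\times10^{6}\ \mathrm{m}:
g \approx \frac{(6.674\times10^{-11})(5.972\times10^{24})}{(6.371\times10^{6})^{2}}
  \approx 9.82\ \mathrm{m\,s^{-2}}.
```

So F = -mg is the near-surface special case of the inverse-square form, and the constant g carries the dependence on G and the earth's mass, whether those are known directly or absorbed through calibration.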

So suppose that you have such an interpretation map M: I -> S into your object. S is now called a scale if it preserves the whole structure from I in a way you judge satisfactory. That is, if it shows the expected numbers predicted on the domain of M, and if it preserves the relative order of mass/weight differences (given two objects O1 and O2 to be measured with S, if the domain I says that the mass of O1 is greater than the mass of O2, then S agrees). Note that, strictly speaking, it is not S that is interpreted to be a scale, but M(I). It would be a mistake to think that S is uniquely and completely characterized by M. In fact, even M(I) is not completely or uniquely characterized by M (as in, it is possible to find different maps from the same I into M(I), to find different I, etc.).
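The order-preservation condition can be sketched as a small check in Python. This is only an illustration under stated assumptions: `preserves_order`, the reference masses, and both sets of readings are hypothetical, and the check covers just the ordering requirement, not the "expected numbers" requirement.

```python
from itertools import combinations


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def preserves_order(reference, readings):
    """True if, for every pair of objects, the device's readings order
    them the same way the reference theory orders their masses."""
    for a, b in combinations(reference, 2):
        if sign(reference[a] - reference[b]) != sign(readings[a] - readings[b]):
            return False
    return True


# Hypothetical masses in kilograms, as assigned by the domain I.
reference = {"apple": 0.18, "book": 0.9, "brick": 2.3}

# A device reporting weight in newtons (monotone in mass) passes.
good_device = {name: 9.8 * mass for name, mass in reference.items()}

# A device that ranks the book above the brick fails.
bad_device = {"apple": 1.0, "book": 3.0, "brick": 2.0}

print(preserves_order(reference, good_device))
print(preserves_order(reference, bad_device))
```

Note that the passing device reports different numbers than the reference masses do, which fits the point that the map is not unique: many different readings preserve the same order.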

This is where the problem with language lies. The "uniqueness part" is very, very wrong. There exist many different and incompatible domains that are nonetheless mapped satisfactorily into your sentence. A lot of wordplay is actually based on that. Language is also non-associative, but the parenthesization is never written explicitly. It is also non-commutative (both within a sentence and between sentences), but that's less of a problem.

What is important to understand, in any case, is that this interpretation process is not canonical or choice-free. A decision is made at some point, by someone, so it is inherently subjective. Because physics is extremely constraining and authoritative, these choices and fundamental ambiguities are generally barely visible in the everyday world or the engineering world. For language, however, you cannot avoid them. It gets even worse when the very same word in the dictionary can already refer to multiple incompatible concepts.

When it comes to your post, you're basically asking people to try to find the good domains/theories that explain what happens in a net. First, a net is a black box, so good luck with that. Second, people have been trying to do so for decades and are failing hard. Finding the good "I" is just an extremely difficult task. Even if you had some good "I", asserting that what you see in a black box is indeed well interpreted by that specific domain (and not another one) most likely won't be convincing. Any metric people use to quantify what happens in a net is already playing the role of a scale. It's just not an interesting scale, generally.

I'm generally pretty pessimistic about research-automation proposals analogous to "train a net to mimic a human sticking their finger in some stuff and reporting hotness".

I haven’t read your whole post yet, but I predict that this is because you wouldn’t have reason to believe that you can extrapolate out of the distribution that is measurable by humans.