Yup, that's right. A wrong frame is costly relative to the right frame. A less wrong frame can still be less costly than a more wrong frame, and that's especially relevant when nobody knows what the right frame is yet.
If we’d learned that GPT-4 or Claude had those capabilities, we expect labs would have taken immediate action to secure and contain their systems.
At that point, the time at which we should have stopped has probably already passed, especially insofar as:
As written, this evaluation plan seems to be missing elbow-room. The AI which I want to not be widely deployed is the one which is almost but not quite capable of autonomous function in a test suite. The bar for "don't deploy" should be slightly before a full end-to-end demonstration of that capability.
LessWrong, conveniently, has a rough metric of status directly built-in, namely karma. So we can directly ask: do people with high karma (i.e. high LW-status) wish to avoid quantification of performance? Speaking as someone with relatively high karma myself, I do indeed at least think that every quantitative performance metric I've heard sounds terrible, and I'd guess that most of the other folks with relatively high karma on the site would agree.
... and yet the story in the post doesn't quite seem to hold up. My local/first-order incentives actually favor quantifying performance by status, so long as the quantitative metric in question is one by which I'm already doing well - like, say, LW karma. If e.g. research grants were given out based solely on LW karma, that would be great for me personally (ignoring higher-order effects).
And yet, despite the favorable local/first-order incentives, I think that's not a very good idea (either for me personally or at the community level), because implementing it would mostly result in karma being Goodharted a lot more.
Zooming back out to the more general case, I see two generalizable lessons.
First: the local incentives of those with high status agree with performance quantification just fine, so long as the metric in question is one by which they're already doing well. Quantification is not actually the relevant thing to focus on. The relevant thing to focus on is whether a particular new criterion (whether quantitative or not) is something on which high-status people already perform well.
Second: performance standards are a commons. Goodharting burns that commons; performing well at a widely-Goodharted metric has relatively little benefit even for those who are very good at Goodharting the metric, compared to performing well on a non-Goodharted metric. So, high-status individuals' incentives also push toward avoiding Goodharting, in a way which is plausibly-beneficial to the community as a whole.
You might be arguing for some analogy but it's not immediately clear to me what, so maybe clarify if that's the case?
The basic analogy is roughly "if we want a baseline for how hard it will be to evaluate an AI's outputs on their own terms, we should look at how hard it is to evaluate humans' outputs on their own terms, especially in areas similar in some way to AI safety". My guess is that you already have lots of intuition about how hard it is to assess results, from your experience assessing grantees, so that's the intuition I was trying to pump. In particular, I'm guessing that you've found firsthand that things are much harder to properly evaluate than they might seem at first glance.
The "we" in "we can point AIs toward and have some ability to assess" meant humans, not Open Phil.
If you think generic "humans" (or humans at e.g. Anthropic/OpenAI/DeepMind, or human regulators, or human ????) are going to be better at the general skill of evaluating outputs than yourself or the humans at Open Phil, then I think you underestimate your own skills and those of your staff relative to most humans. Most people do not perform any minimal-trust investigations. So I expect your experience here to provide a useful conservative baseline.
I don't agree with this characterization, at least for myself. I think people should be doing object-level alignment research now, partly (maybe mostly?) to be in better position to automate it later.
Indeed, I think you're a good role model in this regard and hope more people will follow your example.
It seems to me like the main crux here is that you're picturing a "phase transition" that kicks in in a fairly unpredictable way, such that a pretty small increase in e.g. inference compute or training compute could lead to a big leap in capabilities. Does that sound right?
I don't think this is implausible but haven't seen a particular reason to consider it likely.
The phrase I'd use there is "grokking general-purpose search". Insofar as general-purpose search consists of a relatively-simple circuit/function recursively calling itself a lot with different context-specific knowledge/heuristics (e.g. the mental model here), once a net starts to "find" that general circuit/function during training, it would grok for the same reasons grokking happens with other circuits/functions (whatever those reasons are). The "phase transition" would then be relatively sudden for the same reasons (and probably to a similar extent) as in existing cases of grokking.
I don't personally consider that argument strong enough that I'd put super-high probability on it, but it's at least enough to privilege the hypothesis.
Among other things, it seems important that there are a bunch of specific useful tasks we can point AIs toward and have some ability to assess on their own grounds (standards enforcement, security, etc.)
Do you think you/OpenPhil have a strong ability to assess standards enforcement, security, etc, e.g. amongst your grantees? I had the impression that the answer was mostly "no", and that in practice you/OpenPhil usually mostly depend on outside indicators of grantees' background/skills and mission-alignment. Am I wrong about how well you think you can evaluate grantees, or do you expect AI to be importantly different (in a positive direction) for some reason?
+1, this is probably going to be my new default post to link people to as an intro.
We may disagree about how much progress the results to date represent regarding finite approximations. I'd say they contain conceptual ideas that may be important in a finite setting, but I also expect most of the work will lie in turning those ideas into non-trivial statements about finite settings. In contrast, most of your writing suggests to me that a large part of the theoretical work has been done (not sure to what extent this is a disagreement about the state of the theory or about communication).
Perhaps your instincts here are better than mine! Going to the finite case has indeed turned out to be more difficult than I expected at the time of writing most of the posts you reviewed.
Brief responses to the critiques:
Results don’t discuss encoding/representation of abstractions
Totally agree with this one, it's the main thing I've worked on over the past month and will probably be the main thing in the near future. I'd describe the previous results (i.e. ignoring encoding/representation) as characterizing the relationship between the high-level and the low-level.
Definitions depend on choice of variables
The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient for convergence of abstractions. For instance, it doesn't make sense to use variables which "rotate together" the states of five different local patches of spacetime which are not close to each other. (For instance, those five different local patches will generally not be rotated together by default in an evolving agent's sensory feed.)
That does still leave degrees of freedom in how we represent all the local patches, but those are exactly the degrees of freedom which don't matter for natural abstraction. (Under the minimal latent formulation: we can represent each individual variable or set-of-variables-which-we're-making-independent-of-some-other-stuff in a different way without changing anything informationally. Under the redundancy formulation: assume our resampling process allows simultaneous resampling of small sets of variables, to avoid the thing where there's two variables very tightly coupled but they're otherwise independent of everything else. With that modification in place, same argument as the minimal latent formulation applies.)
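A minimal sketch of the "without changing anything informationally" point, assuming discrete variables and using mutual information as a stand-in for the informational quantities involved. The toy distribution and both relabelings are made up for illustration and are not taken from the posts under discussion.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information (in bits) of a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

# Toy samples (assumption): Y agrees with X three times out of four.
samples = [(0, 0), (0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1), (1, 0)]

# Hypothetical re-encoding: an invertible relabeling applied to each
# variable separately, i.e. a change of representation only.
relabel_x = {0: "a", 1: "b"}
relabel_y = {0: "up", 1: "down"}
recoded = [(relabel_x[x], relabel_y[y]) for x, y in samples]

mi_original = mutual_information(samples)
mi_recoded = mutual_information(recoded)
print(mi_original, mi_recoded)
```

The two values agree because the computation depends only on the joint probabilities, and an invertible per-variable relabeling permutes the outcomes without changing those probabilities. A transformation that mixes several distant variables together is a different kind of change, which is the case the locality argument above is meant to rule out.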
Theorems focus on infinite limits, but abstractions happen in finite regimes
Totally agree with this one too, and it has also been a major focus for me over the past couple months.
I'd also offer this as one defense of my relatively low level of formality to date: finite approximations are clearly the right way to go, and I didn't yet know the best way to handle finite approximations. I gave proof sketches at roughly the level of precision which I expected to generalize to the eventual "right" formalizations. (The more general principle here is to only add formality when it's the right formality, and not to prematurely add ad-hoc formulations just for the sake of making things more formal. If we don't yet know the full right formality, then we should sketch at the level we think we do know.)
Missing theoretical support for several key claims
Basically agree with this. In particular, I think the quoted block is indeed a place where I was a bit overexcited at the time and made too strong a claim. More generally, for a while I was thinking of "deterministic constraints" as basically implying "low-dimensional" in practice, based on intuitions from physics. But in hindsight, that's at least not externally-legibly true, and arguably not true in general at all.
Figuring out whether the Universality Hypothesis is true
... What we’re less convinced of is that the current theoretical approach is a good way to tackle this question. One worrying sign is that almost two years after the project announcement (and over three years after work on natural abstractions began), there still haven’t been major empirical tests, even though that was the original motivation for developing all of the theory. ... Of course sometimes experiments do require upfront theory work. But in this case, we think that e.g. empirical interpretability work is already making progress on the Universality Hypothesis, whereas we’re unsure whether the natural abstractions agenda is much closer to major empirical tests than it was two years ago.
See the section on "Low level of precision...". Also, You Are Not Measuring What You Think You Are Measuring is a very relevant principle here - I have lots of (not necessarily externally-legible) bits of evidence about a rough version of natural abstraction, but the details I'm still figuring out are (not coincidentally) exactly the details where it's hard to tell whether we're measuring the right thing.
Abstractions as a bottleneck for agent foundations: The high-level story for why abstractions seem important for formalizing e.g. values seems very plausible to us. It’s less clear to us whether they are necessary (or at least a good first step)
Yeah, I don't think this should be externally-legibly clear right now. I think people need to spend a lot of time trying and failing to tackle agent foundations problems themselves, repeatedly running into the need for a proper model of abstraction, in order for this to be clear.
Accelerating alignment research: The promise behind this motivation is that having a theory of natural abstractions will make it much easier to find robust formalizations of abstractions such as “agency”, “optimizer”, or “modularity”. ... To us, such an outcome seems unlikely, though it may still be worth pursuing
I probably put higher probability on success here than you do, but I don't think it should be legibly clear.
Interpretability: ... Figuring out the real-world meaning of internal network activations is one of the core themes of safety-motivated interpretability work. And reverse-engineering a network into “pseudocode” is not just some separate problem, it’s deeply intertwined. We typically understand the inputs of a network, so if we can figure out how the network transforms these inputs, that can let us test hypotheses for what the meaning of internal activations is.
An intuitive understanding of inputs plus a circuit is not, in general, sufficient to interpret the internal things computed by the circuit. Easy counterargument: neural nets are circuits, so if those two pieces were enough, we'd already be done; there would be no interpretability problem in the first place.
Existing work has managed to go from pseudocode/circuits to interpretation of inputs mainly by looking at cases where the circuits in question are very small and simple - e.g. edge detectors in Olah's early work, or the sinusoidal elements in Neel's work on modular addition. But this falls apart quickly as the circuits get bigger - e.g. later layers in vision nets, once we get past early things like edge and texture detectors.
Low level of precision and formalization
I mentioned earlier the heuristic of "only add formality when it's the right formality; don't prematurely add ad-hoc formulations just for the sake of making things more formal".
More generally, if you're used to academia, then bear in mind the incentives of academia push towards making one's work defensible to a much greater degree than is probably optimal for truth-seeking. Formalization is one part of this: in academia, the incentive is usually to add ad-hoc formalization in order to get a full formal proof rather than a sketch, even if the ad-hoc formalization added does not match reality well. On the experimental side, the incentive is usually on bulletproof results, rather than gaining lots of information. (... and that's the better case. In the worse case, the incentive is on jumping through certain hoops which are nominally about bulletproofing, but don't even do that job very well, like e.g. statistical significance.) And yes, defensibility does have value even for truth-seeking, but there are tradeoffs and I advise against anchoring too much on academia.
With that in mind: both my current work and most of my work to date are aimed more at truth-seeking than defensibility. I don't think I currently have all the right pieces, and I'm trying to get the right pieces quickly. For that purpose, it's important to make the stuff I think I understand as legible as possible so that others can help. I try to accurately convey my models and epistemic state. But it's not important to e.g. make it easy for others to point out mistakes in places where I didn't think the formality was right anyway. If and when I have all the pieces, then I can worry about defensible proof.
That said, I agree with at least some parts of the critique. Being both precise and readable at the same time is hard, man.
As we briefly discussed earlier, we think it’s worrying that there haven’t been major experiments on the Natural Abstraction Hypothesis, given that John thinks of it as mostly an empirical claim. We would be excited to see more discussion on experiments that can be done right now to test (parts of) the natural abstractions agenda! We elaborate on a preliminary idea in the appendix (though it has a number of issues).
I do love your experiment ideas! The experiments I ran last summer had a similar flavor - relatively-simple checks on MNIST nets - though they were focused on the "information at a distance" lens rather than the redundancy or minimal latent lenses.
Anyway, similar answer here as the previous section: at this point I'm mainly trying to get to the right answers quickly, not trying to provide some impressive defensible proof. I run experiments insofar as they give me bits about what the right answers are.
That's a great connection which I had indeed not made, thanks! Strong-upvoted.