## LESSWRONGLW

Research Scientist at Protocol Labs Network Goods; formerly FHI/Oxford, Harvard Biophysics, MIT Mathematics And Computation.

# Wiki Contributions

In computer science this distinction is often made between extensional (behavioral) and intensional (mechanistic) properties (example paper).

For the record, the canonical solution to the object-level problem here is Shapley Value. I don’t disagree with the meta-level point, though: a calculation of Shapley Value must begin with a causal model that can predict outcomes with any subset of contributors removed.

I think there’s something a little bit deeply confused about the core idea of “internal representation” and that it’s also not that hard to fix.

1. I think it’s important that our safety concepts around trained AI models/policies respect extensional equivalence, because safety or unsafety supervenes on their behaviour as opaque mathematical functions (except for very niche threat models where external adversaries are corrupting the weights or activations directly). If two models have the same input/output mapping, and only one of them has “internally represented goals”, calling the other one safer—or different at all, from a safety perspective—would be a mistake. (And, in the long run, a mistake that opens Goodhartish loopholes for models to avoid having the internal properties we don’t like without actually being safer.)

2. The fix is, roughly, to identify the “internal representations” in any suitable causal explanation of the extensionally observable behavior, including but not limited to a causal explanation whose variables correspond naturally to the “actual” computational graph that implements the policy/model.

3. If we can identify internally represented [goals, etc] in the actual computational graph, that is of course strong evidence of (and in the limit of certainty and non-approximation, logically implies) internally represented goals in some suitable causal explanation. But the converse and the inverse of that implication would not always hold.

4. I believe many non-safety ML people have a suspicion that safety people are making a vaguely superstitious error by assigning such significance to “internal representations”. I don’t think they are right exactly, but I think this is the steelman, and that if you can adjust to this perspective, it will make those kinds of critics take a second look. Extensional properties that scientists give causal interpretations to are far more intuitively “real” to people with a culturally stats-y background than supposedly internal/intensional properties.

Sorry these comments come at the last day; I wish I had read it in more depth a few days ago.

Overall the paper is good and I’m glad you’re doing it!

Not listed among your potential targets is “end the acute risk period” or more specifically “defend the boundaries of existing sentient beings,” which is my current favourite. It’s nowhere near as ambitious or idiosyncratic as “human values”, yet nowhere near as anti-natural or buck-passing as corrigibility.

In my plan, interpretable world-modeling is a key component of Step 1, but my idea there is to build (possibly just by fine-tuning, but still) a bunch of AI modules specifically for the task of assisting in the construction of interpretable world models. In step 2 we’d throw those AI modules away and construct a completely new AI policy which has no knowledge of the world except via that human-understood world model (no direct access to data, just simulations). This is pretty well covered by your routes numbered 2 and 3 in section 1A, but I worry those points didn’t get enough emphasis and people focused more on route 1 there, which seems much more hopeless.

From the perspective of Reframing Inner Alignment, both scenarios are ambiguous because it's not clear whether

• you really had a policy-scoring function that was well-defined by the expected value over the cognitive processes that humans use to evaluate pull requests under normal circumstances, but then imperfectly evaluated it by failing to sample outside normal circumstances, or
• your policy-scoring "function" was actually stochastic and "defined" by the physical process of humans interacting with the AI's actions and clicking Merge buttons, and this incorrect policy-scoring function was incorrect, but adequately optimized for.

I tend to favor the latter interpretation—I'd say the policy-scoring function in both scenarios was ill-defined, and therefore both scenarios are more a Reward Specification (roughly outer alignment) problem. Only when you do have "programmatic design objectives, for which the appropriate counterfactuals are relatively clear, intuitive, and agreed upon" is the decomposition into Reward Specification and Adequate Policy Learning really useful.

I think subnormals/denormals are quite well motivated; I’d expect at least 10% of alien computers to have them.

Quiet NaN payloads are another matter, and we should filter those out. These are often lumped in with nondeterminism issues—precisely because their behavior varies between platform vendors.

I think binary floating-point representations are very natural throughout the multiverse. Binary and ternary are the most natural ways to represent information in general, and floating-point is an obvious way to extend the range (or, more abstractly, the laws of probability alone suggest that logarithms are more interesting than absolute figures when extremely close or far from zero).

If we were still using 10-digit decimal words like the original ENIAC and other early computers, I'd be slightly more concerned. The fact that all human computer makers transitioned to power-of-2 binary words instead is some evidence for the latter being convergently natural rather than idiosyncratic to our world.

The informal processes humans use to evaluate outcomes are buggy and inconsistent (across humans, within humans, across different scenarios that should be equivalent, etc.). (Let alone asking humans to evaluate plans!) The proposal here is not to aim for coherent extrapolated volition, but rather to identify a formal property (presumably a conjunct of many other properties, etc.) such that conservatively implies that some of the most important bad things are limited and that there’s some baseline minimum of good things (e.g. everyone has access to resources sufficient for at least their previous standard of living). In human history, the development of increasingly formalized bright lines around what things count as definitely bad things (namely, laws) seems to have been greatly instrumental in the reduction of bad things overall.