on wellunderstoodness

Quinn

epistemic status: nothing new, informal, may or may not provide a novel compression which gives at least one person more clarity. Also, not an alignment post, just uses arguments about alignment as a motivating example, but the higher-level takeaway is extremely similar to the Rocket Alignment dialogue and to the definition of "deconfusion" that MIRI offered recently (even though I'm making a very low-level point)

notation: please read a <- x as "a replaced with x", e.g. (fx) x<-5 => (mx+b) x<-5 => m5+b. (typically the sequents literature uses fx[5/x] but I find this much harder to read).

motivation

A younger version of me once said "What's the big deal with sorceror's apprentice / paperclip maximizer? you would obviously just give an $ϵ$ and say that you want to be satisfied up to a threshold". But ultimately it isn't hard to believe that software developers don't want to be worrying about whether some hapless fool somewhere forgot to give the $ϵ$ . Not to mention that we don't expect the need for such thresholds to be finitely enumerable. Not to mention the need to pick the right epsilons.

The aspiration of value alignment just tells you "we want to really understand goal description, we want to understand it so hard that we make epsilons (chasing after the perfectly calibrated thresholds) obsolete". Well, is it clear to everyone what "really understand" even means? It's definitely not obvious, and I don't think everyone's on the same page about it.

core claim

I want to focus on the following idea;

to be mathematically understood is to be reduced to substitution instances.

Consider; linear algebra. If I want to teach someone the meaning and the importance of ax+by+c=0 I only need to make sure (skipping over addition, multiplication, and equality) that they can distinguish between a scalar and a variable and then sculpt one form from another. Maybe it generalizes to arbitrary length $a_{i} x_{i}$ and to arbitrary number of equations as nicely as it does because the universe rolled us some dumb luck, it's unclear to me, but I know that every time I see a flat thing in the world I can say "grab the normal forms from the sacred texts and show me by substitution how this data is an instance of that". Yes, if I had coffee I'd derive it from something more primitive and less sacred, but I've been switching to tea.

Substitution isn't at the root of the "all the questions... that can be asked in an introductory course can be answered in an introductory course" property, but I view it as the only mechanism in a typical set of formal mechanisms that roots the possibility of particularization. By which I mean the clean, elegant, "all wrapped up, time for lunch" kind.

Substitutions can behave as a path from a more abstract level to a more concrete level, but if you're lucky they have types to tell us which things fit into other things.

A fundamentalist monk of the deductionist faith may never succumb to the dark inferences, but they can substitute if they believe they are entitled to do so, and they can be told they are entitled by types. $a_{i} : N ⊢ Σ a_{i} x_{i} = 0$ allows even our monk to say a <- 5 just as soon as they establish that $5 : N$ . Yes, without the temptation of the dark inferences they'd never have known that the data could be massaged into the form and ultimately resolved by substitution, but this is exactly why we're considering a monk who can't induct or abduct in public, rather than an automaton who can't induct or abduct at all. The closer you get to deductionist fundamentalism, the more you realize that the whole "never succumb to the dark inferences" thing is more of a guideline anyway.

Problems in well-understood domains impose some inferential cost to figure out which forms they're an instance of. But that's it. You do that, and it's constant time from there, substitution only costs 2 inferenceBucks, maybe 3.

Now compare that to ill-understood domains, like solutions to differential equations. No, not the space of "practical" DEs that humans do for economic reasons, I mean every solution for every DE, the space which would only be "practical" if armies from dozens of weird Tegmark II's were attacking all at once. You can see the monk stepping back, "woah woah woah, I didn't sign up for this. I can only do things I know how to do, I can't go poking and prodding in these volatile functionspaces, it's too dangerous!" You'd need some shoot-from-the-hip engineers, the kind that maybe forget an epsilon here and there and scratch their head for an hour before they catch it, but they have to move at that pace! If you're not willing to risk missing, don't shoot from the hip.

I will give two off-the-cuff "theorems", steamrolling/handwaving through two major problems; 1. variety in skill level or cognitive capacity; 2. domains don't really have borders, we don't really have a way to carve math into nations. But we won't let either of those discourage us!

"Theorem": in a well-understood domain, there is a strict upper bound on the inferential cost of solving an arbitrary problem.

There is at least some social component here; on well-tread paths you're far less likely to bother with the wrong sink/rabbit holes. Others in the community have provided wonderful compressions for you, so you don't have to be a big whizzkid. We expect an upper bound because even a brute force search for your problem in corpus of knowledge would terminate in finite time, assuming the corpus is finite and the search doesn't attempt a faulty regex. And like I said, after that search (which is probably slightly better than brute force), substitution is a constant number of steps.

"Followup theorem": in an ill-understood domain, we really can't give an upper bound on the inferential cost of solving an arbitrary problem. To play it safe, we'll say that problems cost an infinite amount of inferenceBucks unless proven otherwise.

to close where I began, we know that sorcerer's apprentice (i.e., goal description as a mathematical domain) is ill-understood because "fill the cauldron with water" $\mapsto$ "oh crap, i meant maximize p subject to the constraint $p < 1 - ϵ$ $\mapsto$ "oh crap, one more, last one i swear... " etc. is worst case an infinite trap.

I drew the core insight from conversation

At the end of the day, "well-understood just means substitution" was my best attempt at clustering the way people talked about different aspects of math in college.

notes:

"inferenceBucks" sounds almost like graph distance when each inference step is some edge but I was intending more like small-step semantics.

The maximise p=P(cauldrun full) constrained by $p < 1 - ϵ$ has really weird failure modes. This is how I think it would go. Take over the world, using a method that has $< ϵ$ chance of failing. Build giant computers to calculate the exact chance of your takeover. Build a random bucket filler to make the probability work out. Ie if $ϵ$ =3%, then the AI does its best to take over the world, once it succeeds it calculates that its plan had a 2% chance of failure. So it builds a bucket filler that has $\frac{97}{98}$ chance of working. This policy leaves the chance of the cauldron being filled at exactly 97%.

The idea (as I understood it from the post) isn't to maximize P(goal) constrained by P(goal)<1- $ϵ$ . The idea is to select actions in a way which maximizes P(goal|action), unless multiple actions achieve P(goal|action)>1- $ϵ$ , in which case you choose between those randomly. Or the same but for policies rather than actions. Or the same but for plans rather than actions. Something in that general area.

One general concern is tiling -- it seems very possible that such a system would build a successor-agent with the limitations taken off, since this is an easy way to achieve its objective, especially in an environment in which one AGI has (apparently) already been constructed.

Another problem with this solution to AI safety is that we can't apply it to, say, a big neural network. The NN may internally perform search-based planning to achieve high performance on a task, but we don't have access to that, to prevent it from "trying too hard".

We could, of course, apply it to NN training, making sure to stop our training before the NN becomes so capable that we have safety concerns. But the big problem there is knowing where to stop. If you train the NN enough to be highly capable, it may already be implementing the kind of internal search that we would be concerned about. But this suggests your stopping point has to be before NNs could develop internal search. Unfortunately we've already seen NNs learn to search (granted, with some prodding to do so -- I believe there have been other examples, but I didn't find what I remembered). So this constraint has already been violated, and it would presumably be really difficult to convince people to reel things back in.