Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The first draft of this post started with a point that was clear, cohesive, and wrong. So instead, you get this bunch of rambling that I think should be interesting.

 

I - Do not mess with time

Harry wrote down 181,429. He repeated what he'd just written down, and Anthony confirmed it.

Then Harry raced back down into the cavern level of his trunk, glanced at his watch (the watch said 4:28 which meant 7:28) and then shut his eyes.

Around thirty seconds later, Harry heard the sound of steps, followed by the sound of the cavern level of the trunk sliding shut. (Harry wasn't worried about suffocating. An automatic Air-Freshening Charm was part of what you got if you were willing to buy a really good trunk. Wasn't magic wonderful, it didn't have to worry about electric bills.)

And when Harry opened his eyes, he saw just what he'd been hoping to see, a folded piece of paper left on the floor, the gift of his future self.

Call that piece of paper "Paper-2".

Harry tore a piece of paper off his pad.

Call that "Paper-1". It was, of course, the same piece of paper. You could even see, if you looked closely, that the ragged edges matched.

Harry reviewed in his mind the algorithm that he would follow.

If Harry opened up Paper-2 and it was blank, then he would write "101 x 101" down on Paper-1, fold it up, study for an hour, go back in time, drop off Paper-1 (which would thereby become Paper-2), and head on up out of the cavern level to join his dorm mates for breakfast.

If Harry opened up Paper-2 and it had two numbers written on it, Harry would multiply those numbers together.

If their product equaled 181,429, Harry would write down those two numbers on Paper-1 and send Paper-1 back in time.

Otherwise Harry would add 2 to the number on the right and write down the new pair of numbers on Paper-1. Unless that made the number on the right greater than 997, in which case Harry would add 2 to the number on the left and write down 101 on the right.

And if Paper-2 said 997 x 997, Harry would leave Paper-1 blank.

Which meant that the only possible stable time loop was the one in which Paper-2 contained the two prime factors of 181,429.

If this worked, Harry could use it to recover any sort of answer that was easy to check but hard to find. He wouldn't have just shown that P=NP once you had a Time-Turner, this trick was more general than that. Harry could use it to find the combinations on combination locks, or passwords of every sort. Maybe even find the entrance to Slytherin's Chamber of Secrets, if Harry could figure out some systematic way of describing all the locations in Hogwarts. It would be an awesome cheat even by Harry's standards of cheating.

Harry took Paper-2 in his trembling hand, and unfolded it.

Paper-2 said in slightly shaky handwriting:

DO NOT MESS WITH TIME

Harry wrote down "DO NOT MESS WITH TIME" on Paper-1 in slightly shaky handwriting, folded it neatly, and resolved not to do any more truly brilliant experiments on Time until he was at least fifteen years old.

-(Harry Potter and the Methods of Rationality, ch. 17)
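
For concreteness, the rule Harry commits to is easy to render as a fixed-point check. Here is a minimal sketch (Python; the names and the brute-force enumeration are my own illustration, not something from the text):

```python
# The Paper-2 -> Paper-1 map Harry commits to. A self-consistent time loop
# is exactly a fixed point of this map.

TARGET = 181_429

def step(paper2):
    """Given Paper-2's contents, return what Harry writes on Paper-1."""
    if paper2 is None:                 # blank paper: start the search
        return (101, 101)
    a, b = paper2
    if a * b == TARGET:                # factors found: send them back unchanged
        return (a, b)
    if (a, b) == (997, 997):           # search exhausted: leave Paper-1 blank
        return None
    b += 2
    if b > 997:                        # carry: advance the left number
        a, b = a + 2, 101
    return (a, b)

candidates = [None] + [(a, b) for a in range(101, 998, 2) for b in range(101, 998, 2)]
print([p for p in candidates if step(p) == p])
# [(397, 457), (457, 397)] -- 397 * 457 == 181,429, the only stable loops
```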

 

II - Intro

This post is primarily an attempt to think about whether HCH is aligned, in the hypothetical where it can be built. By "aligned" I mean that asking HCH to choose good actions, or to design an FAI, would result in good outputs, even if there were no fancy safety measures beyond verbal instructions to the human inside HCH on some best practices for factored cognition. I will mention some considerations from real-world approximations later, but mostly I'm talking about a version of what Paul later calls "weak HCH" for conceptual simplicity.

In this hypothetical, we have a perfectly simulated human in some environment, and they only get to run for ~2 weeks of subjective time. We can send in our question only at the start of the simulation, and they send out their answer at the end, but in between they're allowed to query new simulations, each given a new question and promptly returning the answer they would get after 2 subjective weeks of rapid simulation (and the sub-simulations have access to sub-sub-simulations in turn, and so on). We assume that the computer running this process is big but finite, and so the tree structure of simulations asking questions of more simulations should bottom out with probability 1 at a bunch of leaf-node simulations who can answer their question without any help. And if halting is ever in doubt, we can posit some safeguards that eventually kick in to hurry the human up to make a decision.
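
As a minimal sketch of this setup (assuming a hypothetical human(question, ask) oracle that stands in for two subjective weeks of the simulated person, which of course we do not actually have):

```python
def hch(question, human, depth_budget=10_000):
    """Weak HCH: a simulated human answers `question`, optionally consulting
    fresh simulations via `ask`; the finite budget plays the role of the
    'hurry up and decide' safeguard mentioned above."""
    def ask(subquestion):
        if depth_budget <= 0:
            raise RuntimeError("depth safeguard triggered")
        return hch(subquestion, human, depth_budget - 1)
    return human(question, ask)
```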

I see two main questions here. One is about the character of attractors within HCH. An attractor is some set of outputs (of small size relative to the set of all possible outputs) such that prompting the human with an element of the set will cause them to produce an output that's also an element of the set. We can think of them as the basins of stability in the landscape traversed by the function. The set just containing an empty string is one example of an attractor, if our human doesn't do anything unprompted. But maybe there are more interesting attractors, like "really sad text," where your human reliably outputs really sad text upon being prompted with really sad text. Or maybe there's a virtuous-cycle attractor, where well-thought-out answers support more well-thought-out answers.
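
A tiny formalization of that definition, with respond as a hypothetical stand-in for the human's prompt-to-output map:

```python
def is_attractor(S, respond):
    """S is an attractor if every prompt in S yields an output also in S."""
    return all(respond(x) in S for x in S)

# e.g. the trivial attractor from above: is_attractor({""}, respond) holds
# whenever the human outputs nothing when given nothing.
```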

How much we expect the behavior of HCH to be dominated by the limiting behavior of the attractors depends on how deep we expect it to be, relative to how quickly convergent behavior appears, and how big we expect individual attractors to be. There's no point describing the dynamical behavior of HCH in this way if there are 10 jillion different tiny attractors, after all.

The other question is about capability. Does a tree of humans actually converge to good answers, or do they probably screw up somewhere? This is mainly speculation based on gut feelings about the philosophical competence of humans, or how easy it is to do factored cognition, but this is also related to questions about attractors, because if there are undesirable attractors, you don't want HCH to pursue cognitive strategies that will find them. If undesirable attractors are hard to find, this is easy - and vice versa.

Previous discussion from Evan Hubinger, Eliezer Yudkowsky.

 

III - Features of HCH

Normally we think of attractors in the context of nonlinear dynamics (wikipedia), or iterated function application (review paper). HCH is just barely neither of these things. In HCH, information flows two ways - downwards and upwards. Iterated function application requires us to be tracking some function output x_n, which is only a function of the history x_1, ..., x_{n-1} (think the Fibonacci sequence or the logistic map), whereas even though we can choose an ordering to put an HCH tree into a sequence, for any such ordering you can have functional dependence on terms later in the sequence, which makes proving things much harder. The only time we can really squeeze HCH into this straitjacket is when considering the state of the human up until they get their first response to their queries to HCH, because the sequence of first queries really is just iterated function application.
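
To illustrate the one place where HCH really is iterated function application, here is a sketch of the downward chain of first queries (first_query is a hypothetical stand-in for "what the human asks first, given this prompt", returning None if they ask nothing):

```python
def first_query_chain(prompt, first_query, max_steps=1_000):
    """Follow each human's first query downward until someone asks nothing."""
    chain = [prompt]
    for _ in range(max_steps):
        nxt = first_query(chain[-1])
        if nxt is None:        # this human answers without help
            break
        chain.append(nxt)
    return chain
```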

Aside before getting back to attractors: future-dependence of the sequence also makes it possible for there to be multiple consistent solutions. However, this is mostly avoided for finite-sized trees, is a separate problem from the character of attractors, and corresponds to a training issue for IDA approximations to HCH that I think can be avoided with foresight.

At first blush this partial description (only the first queries, and no responses allowed) doesn't seem that exciting (which is good, exciting is bad!). This is primarily because in order for the computation of HCH to terminate, we already know that we want everything to eventually reach the "no query" attractor. We can certainly imagine non-terminating attractors - maybe I spin up an instance to think about anti-malarial drugs if asked about malaria vaccines, and spin up an instance to think about malaria vaccines if asked about anti-malarial drugs - but we already know we'll need some methods to avoid these.

However, downward attractors can be different from the "no query" attractor and yet still terminate, by providing restrictions above and beyond the halting of the computation. Let's go back to the hypothetical "very sad text" attractor. You can have a "very sad text" attractor that will still terminate; the set of queries in this attractor will include "no query," but will also include very sad text, which (in this hypothetical) causes the recipient to also include very sad text in their first query.

In the end, though, the thing we care about is the final output of HCH, not what kind of computation it's running down in its depths, and this is governed by the upward flow of answers. So how much can we treat the upward answers as having attractors? Not unboundedly - there is almost certainly no small set of responses to the first query that causes the human to produce an output also in that set regardless of the prompt. But humans do have some pretty striking regularities. In this post, I'll mostly have to stick to pure speculation.

 

IV - Pure speculation

If we think of the flow of outputs up the HCH tree as opposed to down, do there seem to be probable attractors? Because this is no longer an iterated function, we have to loosen the definition - let's call a "probable attractor" some set that (rather than mapping to itself for certain) gets mapped to itself with high probability, given some probability distribution over the other factors affecting the human. Thus probable attractors have some lifetime that can be finite, or can be infinite if the probabilities converge to 1 sufficiently quickly.
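
To make "sufficiently quickly" concrete (my notation, not anything from the definition above): if ε_k is the probability of escaping the set on the k-th application, then

P(still inside after n steps) = (1 − ε_1)(1 − ε_2)⋯(1 − ε_n),

which tends to a nonzero limit (an infinite lifetime with positive probability) exactly when ε_1 + ε_2 + ⋯ converges (assuming each ε_k < 1); otherwise escape eventually happens with probability 1 and the lifetime is finite.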

Depending on how we phrase mathematical questions about probable attractors, they seem to either be obviously common (if you think about small stochastic perturbations to functions with attractors), or obviously rare (if you think about finite state machines that have been moved to worst-case nearby states). I'm not actually sure what a good perspective is.

But I do have some guesses about possible attractors for humans in HCH. An important trick for thinking about them is that attractors aren't just repetitious, they're self-repairing. If the human gets an input that deviates from the pattern a little, their natural dynamics will steer them into outputting something that deviates less. This means that a highly optimized pattern of flashing lights that brainwashes the viewer into passing it on is a terrible attractor, and that bigger, better attractors are going to look like ordinary human nature, just turned up to 11.

Here are some ideas:

  1. Framings of the problem. Once you are prompted to think about something a certain way, that might easily self-reinforce, even though you'd just as easily think about it a different way if prompted differently.
  2. Memes that use the communication channel plus normal human reasoning. E.g. a convincing argument for why we should be doing something else instead, or an apparent solution to the entire problem that seems important to pass on.
  3. Self-propagating style/emotional choices in communication. E.g. the very sad text attractor. I wonder if there's any data on how easy it is to influence humans in this sort of "emotional telephone" game.
  4. Memes that hijack the communication channel using normal communication in ways that go against sensible precommitments, but are some sort of slippery slope. E.g. a joke so funny that you want to share it, or a seductive conspiracy theory.

None of these are automatically bad (well, except maybe the last one). But we might expect them to decrease the quality of answers from HCH by driving the humans to somewhat unusual parts of the probability distribution. This leads to the observation that there's no reason to expect a single "good reasoning" attractor, where all the good behavior lives, independent of the original question asked to HCH. This is sort of the inverse Anna Karenina principle.

Let me restate the issue in bigger font: To the extent that iterated human message-passing tends to end up in these attractors, we can expect this to degrade the answer quality. Not because of a presumption of malicious behavior, but because these long-lifetime attractors are going to have some features based on human psychological vulnerabilities, not on good reasoning. It's like extremal Goodhart for message-passing.

 

V - Some thoughts on training

Because these issues are intrinsic to HCH even if it gets to use perfect simulations of humans, they aren't going to be fixed too easily in training. We could just solve meta-preferences and use reflectively consistent human-like agents, but if we can do that kind of ambitious applied philosophy project, we can probably just solve the alignment problem without simulating human-adjacent things.

Transparency tools might be able to catch particularly explicit optimization for memetic fitness. But they might not help when attractors are arrived at by imitation of human reasoning (and when they do help, they'll be an implementation of a specific sort of meta-preference scheme). A trained IDA approximation of HCH is also likely to use some amount of reasoning about what humans will do (above and beyond direct imitation), which can seamlessly learn convergent behavior.

Correlations between instances of imitations can provide a training incentive to produce outputs for which humans are highly predictable (I'm reminded of seeing David Krueger's paper recently). This is the issue I mentioned earlier as being related to the multiplicity of consistent solutions to HCH - we don't want our human-imitations to be incentivized to make self-fulfilling prophecies that push them away from the typical human probability distribution. Removing these correlations might require expensive efforts to train off of a clean supervised set with a high fraction of actual humans.

 

VI - Humans are bad at things

The entire other question about the outer alignment of HCH is whether humans being bad at things is going to be an issue. Sometimes you give a hard problem to people, and they just totally mess it up. Worst case we ask HCH to design an FAI and get back something that maximizes compressibility or similar. In order for HCH to be "really" aligned, we might want it to eliminate such errors, or at least drastically decrease them. Measuring such a thing (or just establishing trust) might require trying it on "easier problems" first (see some of the suggestions from Ajeya Cotra's post).

Honestly this topic could be its own post. Or book. Or anthology. But here, I'll keep it short and relate it back to the discussion of attractors.

My gut feeling is that the human is too influential for errors to ever really be extirpated. If we ask HCH a question that requires philosophical progress, and the human happens to be locked into some unproductive frame of mind (which gets duplicated every time a new instance is spooled up), then they're probably going to process the problem and then interpret the responses to their queries in line with their prior inclinations. History shows us that people on average don't make a ton of philosophical progress in ~2 weeks of subjective time.

And suppose that the human is aware of this problem and prepares themselves to cast a wide net for arguments, to really try to change their mind, to try (or at least get a copy of themselves to try) all sorts of things that might transform their philosophical outlook. Is the truth going to pour itself down upon this root-node human? No. You know what's going to pour down upon them? Attractors.

This sort of issue seems to put a significant limit on the depth and adventurousness available to searches for arguments that would change your mind about subjective matters. Which in turn makes me think that HCH is highly subject to philosophical luck.

7 comments

This is the best explanation I have seen yet of what seem to me to be the main problems with HCH. In particular, that scene from HPMOR is one that I've also thought about as a good analogue for HCH problems. (Though I do think the "humans are bad at things" issue is more probably important for HCH than the malicious memes problem; HCH is basically a giant bureaucracy, and the same shortcomings which make humans bad at giant bureaucracies will directly limit HCH.)

IMO the strongest argument in favor of imitation-based solutions is: if there is any way to solve the alignment problem which we can plausibly come up with, then a sufficiently reliable and amplified imitation of us will also find this solution. So, if the imitation is doomed to produce a bad solution or end up in a strange attractor, then our own research is also doomed in the same way anyway. Any counter-argument to this would probably have to be based on one of the following:

  • Maybe the imitation is different from how we do alignment research in important ways. For example, we have more than 2 weeks of memory of working on the solution, but maybe if you spend 1 week learning the relevant context and another week making further progress, then it's not a serious problem. I definitely think factored cognition is a big difference, but also that we don't need factored cognition (we should use confidence thresholds instead).
  • Maybe producing reliable imitation is much harder than some different solution that explicitly references the concept of "values". An imitator doesn't have any way to know which features of what it's imitating are important, which makes its job hard. I think we need some rigorous learning-theoretic analysis to confirm or disprove this.
  • Maybe by the time we launch the imitator we'll have such a small remaining window of opportunity that it won't do as good a job at solving alignment as we would do working on the problem starting from now. Especially taking into account malign AI leaking into the imitation. [EDIT: Actually malign AI leakage is a serious problem since the malign AI takeover probability rate at the time of IDA deployment is likely to be much higher than it is now, and the leakage is proportional to this rate times the amplification factor.]

Yeah, I agree with this. It's certainly possible to see normal human passage through time as a process with probable attractors. I think the biggest differences are that HCH is a psychological "monoculture," HCH has tiny bottlenecks through which to pass messages compared to the information I can pass to my future self, and there's some presumption that the output will be "an answer" whereas I have no such demands on the brain-state I pass to tomorrow.

If we imagine actual human imitations I think all of these problems have fairly obvious solutions, but I think the problems are harder to solve if you want IDA approximations of HCH. I'm not totally sure what you meant by the confidence thresholds link - was it related to this?

The monoculture problem seems like it should increase the size ("size" meaning attraction basin, not measure of the equilibrium set), lifetime, and weirdness of attractors, while the restrictions and expectations on message-passing seem like they might shift the distribution away from "normal" human results.

But yeah, in theory we could use imitation humans to do any research we could do ourselves. I think that gets into the relative difficulty of super-speed imitations of humans doing alignment research versus transformative AI, which I'm not really an expert in.

[EDIT: After thinking about this some more, I realized that malign AI leakage is a bigger problem than I thought when writing the parent comment, because the way I imagined it can be overcome doesn't work that well.]

I think the biggest differences are that HCH is a psychological "monoculture," HCH has tiny bottlenecks through which to pass messages compared to the information I can pass to my future self, and there's some presumption that the output will be "an answer" whereas I have no such demands on the brain-state I pass to tomorrow.

I don't think that last one is a real constraint. What counts as "an answer" is entirely a matter of interpretation by the participants in the HCH. For example, initially I can ask the question "what are the most useful thoughts about AI alignment I can come up with during 1,000,000 iterations?". When I am tasked to answer the question "what are the most useful thoughts about AI alignment I can come up with during N iterations?" then

  • If N = 1, I will just spend my allotted time thinking about AI alignment and write whatever I came up with in the end.
  • If N > 1, I will ask "what are the most useful thoughts about AI alignment I can come up with during N − 1 iterations?". Then, I will study the answer and use the remaining time to improve on it to the best of my ability.

An iteration of 2 weeks might be too short to learn the previous results, but we can work in longer iterations. Certainly, having to learn the previous results from text carries overhead compared to just remembering myself developing them (and having developed some illegible intuitions in the process), but only that much overhead.
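
(A minimal sketch of this scheme, with my own hypothetical names: think stands for one iteration of work given some prior notes, and ask queries a fresh copy.)

```python
def useful_thoughts(n, think, ask):
    """'Most useful thoughts in n iterations': recurse on n - 1, then improve."""
    if n == 1:
        return think(previous=None)
    previous = ask(f"most useful thoughts about AI alignment in {n - 1} iterations?")
    return think(previous=previous)
```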

As to "monoculture", we can do HCH with multiple people (either the AI learns to simulate the entire system of multiple people or we use some rigid interface e.g. posting on a forum). For example, we can imagine putting the entire AI X-safety community there. But, we certainly don't want to put the entire world in there, since that way malign AI would probably leak into the system.

I think the problems are harder to solve if you want IDA approximations of HCH. I'm not totally sure what you meant by the confidence thresholds link - was it related to this?

Yes: it shows how to achieve reliable imitation (although for now in a theoretical model that's not feasible to implement), and the same idea should be applicable to an imitation system like IDA (although it calls for its own theoretical analysis). Essentially, the AI queries a real person if and only if it cannot produce a reliable prediction using previous data (because there are several plausible mutually inconsistent hypotheses), and the frequency of queries vanishes over time.

If you want to prove things about fixed points of HCH in an iterated function setting, consider it a function from policies to policies. Let M be the set of messages (say, ASCII strings < 10kb). Given a giant look up table T that maps M to M, we can create another giant look up table. For each m in M, give a human in a box the string m, and unlimited query access to T. Record their output.

The fixed points of this are the same as the fixed points of HCH. "Human with query access to" is a function on the space of policies.
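
In code (a hypothetical sketch; human_with_access stands in for "human in a box with query access"):

```python
def H(T, human_with_access, M):
    """Map one giant lookup table (policy) to another: each message m is
    answered by a human who can query the old table T."""
    return {m: human_with_access(m, lambda q: T[q]) for m in M}

# HCH's fixed points are exactly the tables T with H(T, ...) == T.
```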

Sure, but the interesting thing to me isn't fixed points in the input/output map, it's properties (i.e. attractors that are allowed to be large sets) that propagate from the answers seen by a human in response to their queries, into their output.

Even if there's a fixed point, you have to further prove that this fixed point is consistent - that it's actually the answer to some askable question. I feel like this is sort of analogous to Hofstadter's q-sequence.

In the giant lookup table space, HCH must converge to a cycle, although that convergence can be really slow. I think you have convergence to a stationary distribution if each layer is trained on a random mix of several previous layers. Of course, you can still have oscillations in what is said within a policy fixed point.