Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

[Metadata: crossposted from First completed September 4, 2023.]

A hermeneutic net for agency is a natural method to try, to solve a bunch of philosophical difficulties relatively quickly. Not to say that it would work. It's just the obvious thing to try.

Thanks to Sam Eisenstat for related conversations.


To create AGI that's aligned with human wanting, it's necessary to design deep mental structures and resolve confusions about mind. To design structures and resolve confusions, we want to think in terms of suitable concepts. We don't already have the concepts we'd need to think clearly enough about minds. So we want to modify our concepts and create new concepts. The new concepts have to be selected by the Criterion of providing suitable elements of thinking that will be adequate to create AGI that's aligned with human wanting.

The Criterion of providing suitable elements of thinking is expressed in propositions. These propositions use the concepts we already have. Since the concepts we already have are inadequate, the propositions do not express the Criterion quite rightly. So, we question one concept, with the goal of replacing it with one or more concepts that will more suitably play the role that the current concept is playing. But when we try to answer the demands of a proposition, we're also told to question the other concepts used by that proposition. The other concepts are not already suitable to be questioned——and they will, themselves, if questioned, tell us to question yet more concepts. Lacking all conviction, we give up even before we are really overwhelmed.

The hermeneutic net would brute-force this problem by analyzing all the concepts relevant to AGI alignment "at once". In the hermeneutic net, each concept would be questioned, simultaneously trying to rectify or replace that concept and also trying to preliminarily analyze the concept. The concept is preliminarily analyzed in preparation, so that, even if it is not in its final form, it at least makes itself suitably available for adjacent inquiries. The preliminary analysis collects examples, lays out intuitions, lays out formal concepts, lays out the relations between these examples, intuitions, and formal concepts, collects desiderata for the concept such as propositions that use the concept, and finds inconsistencies in the use of the concept and in propositions asserted about it. Then, when it comes time to think about another related concept——for example, "corrigibility", which involves "trying" and "flaw" and "self" and "agent" and so on——those concepts ("flaw" and so on) have been prepared to well-assist with the inquiry about "corrigibility". Those related concepts have been prepared so that they easily offer up, to the inquiry about "corrigibility", the rearrangeable conceptual material needed to arrange a novel, suitable idea of "flaw"——a novel idea of "flaw" that will both be locally suitable to the local inquiry of "corrigibility" (suitable, that is, in the role that was preliminarily assigned, by the inquiry, to the preliminary idea of "flaw"), and that will also have mostly relevant meaning mostly transferable across to other contexts that will want to use the idea of "flaw".

The need for better concepts

Hopeworthy paths start with pretheoretical concepts

The only sort of pathway that appears hopeworthy to work out how to align an AGI with human wanting is the sort of pathway that starts with a pretheoretical idea that relies heavily on inexplicit intuitions, expressed in common language. As an exemplar, take the "Hard problem of corrigibility":

The "hard problem of corrigibility" is to build an agent which, in an intuitive sense, reasons internally as if from the programmers' external perspective. We think the AI is incomplete, that we might have made mistakes in building it, that we might want to correct it, and that it would be e.g. dangerous for the AI to take large actions or high-impact actions or do weird new things without asking first. We would ideally want the agent to see itself in exactly this way, behaving as if it were thinking, "I am incomplete and there is an outside force trying to complete me, my design may contain errors and there is an outside force that wants to correct them and this a good thing, my expected utility calculations suggesting that this action has super-high utility may be dangerously mistaken and I should run them past the outside force; I think I've done this calculation showing the expected result of the outside force correcting me, but maybe I'm mistaken about that."

I find this compelling. But also, it rests on a bunch of ideas, and these ideas are being used in a way that pushes them to their limits or beyond.

For example, the compellingness of this paragraph rests on using evaluative notions like "error" and "good thing" and "correct" and "mistaken" in a way that calls on intuitions about human wanting. It feels like I have some preliminarily grasped idea of what it would be like to believe that I'm flawed, and that my current disposition towards correcting my flaws is also flawed, and that this other thing over there is exerting the right sort of control to rightly correct those flaws. I feel like I have the context needed to model the possible "belief state" that a mind could be in, when it believes "I am flawed, potentially at any level of description." That state of mind, which I intuitively think I have some preliminary grasp on, intuitively implies many desirable behaviors called corrigibility. But the most prominent technical concept of evaluation that I have, a utility function, doesn't work here: it only recovers some but not all of the desired properties of the intuitive idea. (Other proposals for replacement technical concepts also don't work; they only recover a different strict subset of the desired properties.)

The upshot is this: our concepts are inadequate for alignment. We don't know what sort of mental element determines the effects of a mind, certainly not well enough to specify those effects. The concepts that we already have are either confused——inexplicit, conflationary, inconsistently used, internally logically inconsistent——or not even potentially adequate to play the role we're trying to have them play in our thinking. We don't see deeply enough into the structure of those sorts of computations that bring about large general effects in the world, to reach in and design / find / select / grow / describe / recognize / build such computations so that whatever determines their effects will determine the effects to be [liked by us when we're fully informed and our judgement is not manipulated].

Pretheoretical concepts don't automatically direct thinking rightly

What to do with pretheoretical concepts that suggest some hopeworthy paths, but aren't yet adequate? Here are two ways to just go on with aspects of the hopeworthy path, without worrying too much about the inadequacy of the concepts involved:

  1. Go on using the concept pretheoretically——allow it to remain ambiguous, suggestive, and inconsistently used. This way of just going on lets the concept stay flexible enough to potentially satisfy all the desiderata placed on it by the hopeworthiness of the hopeworthy path.
  2. Find a good-enough formal concept that captures some aspects of the pretheoretical concept, perhaps excluding other aspects. This way of just going on lets the concept be built upon and analyzed more sharply, since it can bear more load and do so more explicitly.

A fictional dialogue:

Ḥaḥam: "My plan to make aligned AI is to make the AI be honest."

Rasha: "Suppose you have an honest AI. How does that get you aligned AI?"

Ḥaḥam: "We ask the AI what the results of its planned actions will be, and how confident it is that it knows what all the main results are. Since the AI is honest, we can use its answers to filter out any plans that aren't confidently known to have effects we like."

Rasha: "So, one problem with this is that, even assuming you can avoid having the AI kill you through a side-channel, unless you can point the AI's optimization power specifically towards plans that actually have large effects that you like, all of its plans will have large effects that you very much dislike. If you reject all the plans that the AI honestly reports as having bad effects, you'll reject all of its plans (assuming no bad ones get through), making this AI aligned only in the sense that a literal sponge is aligned."

Ḥaḥam: "If that turns out to be a problem, then we can also leverage the AI's honesty to dig in to the generators of its plans, rather than just the plans themselves, and make more high-level / generator-level adjustments to how the AI is searching for plans. For example, we can ask the AI what it is trying to do in its planning, and if it's trying to do the wrong thing, we at least know what we need to correct."

Rasha: "Before, you were talking about the AI being honest about the likely results of its planned actions. That feels relatively more straightforward than asking about generator-ish elements of the AI. It seems potentially significantly less difficult and confusing to design and test an AI to make accurate reports about effects of actions, compared to accurate reports about "what the AI wants" and similar. How would you go about trying to make an AI honest?"

Ḥaḥam: "For starters, we can design training mechanisms for reporter systems that, given an AI, learn where various variables are stored, so to speak. Then, to know what the AI thinks of some variable, we run the trained reporter system on the AI."

Rasha: "This will discover variables that you know how to evaluate, like where the cheese is in the maze——you have access to the ground truth against which you can compare a reporter-system's attempt to read off the position of the cheese from the AI's internals. But this won't extend to variables that you don't know how to evaluate. So this approach to honesty won't solve the part of alignment where, at some point, some mind has to interface with ideas that are novel and alien to humanity and direct the power of those ideas toward ends that humans like."

Ḥaḥam: "I'm not so sure about that. Honesty feels like a fairly simple notion. There's complications, e.g. there's built-in complexity because the reporter system will have to translate things into terms that humans can understand. But the idea of accurately reporting seems like a simple enough notion, so it might be fairly easy to point a system at making pretty comprehensive, honest reports, in a way that generalizes to new cases, even ones of a new kind."

Rasha: "This might be right, but notice that you're making a function call to quite a lot of general intelligence in your supposed reporter system. It's figuring out how to think at the level of the base system, and it's figuring out how to communicate about that. Pointing such a system at all, even at a maybe-simple goal like honest reports, seems to reproduce much of the whole alignment problem!"

Ḥaḥam: "I think it's an improvement, at least in that the reporter system can be much weaker than the base system. The reporter can be sort of parasitic on the base system, copying its ideas; and it seems hard for the base system to think "in secret", in a way that the reporter can't fairly easily notice."

Rasha: "That's very far from clear to me.

... For one thing, what would constitute an honest report might be quite subtle. Imagine trying to explain to someone who is a beginner at chess what is happening in a position from a chess game between two grandmasters. You can't just straightforwardly use concepts like "weakening the pawn chain" (or [concept I'm not familiar with either]) because the beginner doesn't know what a pawn chain is, what would constitute weakening one, and what implications that would have for the position going forward. But it's hard to explain what the grandmaster is doing without using that concept——there's no good way to translate "weakening the pawn chain" into concepts like "attacking the queen with the knight" or "blocking the bishop with the pawn" or "cutting off the king with the rook" that the beginner is familiar with. You could try to compile the idea into beginner concepts, like "The bishop attacks the f2 pawn. The f2 pawn defends the e3 pawn. If the f2 pawn moves or is taken, the e3 pawn isn't defended. The e3 pawn defends the d4 pawn. If the e3 pawn moves..." and so on, but this will amount to trying to teach the beginner the concept of a pawn chain.

... For another thing, in many cases we're specifically using our honest reporter AI in a way that pushes the base AI to be more illegible. For example, if we naively delete any mental content from the base AI that the reporter reports as being about understanding how humans work (and how humans could be manipulated), we are evoking the base AI to develop conceptual Doppelgängers that are harder for the reporter to recognize as being about humans. As another example, if we try to delete terminal goals from the base AI, we are prone to instead mostly just edit the manner in which the AI pursues instrumental goals, leaving the more terminal goals still hidden, if not intact. In general, we evoke nearest unblocked strategies. This puts the AI in a state that is heavily selected to exploit any flaw in the notion of "honesty" that the reporter wants and is able to uphold."

So much for the dialogue. (It's a fictional dialogue; e.g. the authors of this post and of ELK seem to me to be perfectly aware that these concepts are problematic.)

Why does Ḥaḥam think that "an AI that honestly reports its beliefs" is a hopeworthy idea? I suspect that among other intuitions, he has an intuition like this: honesty has a sort of comprehensiveness to it. The thing that is being honest is "looking out from behind the same eyes" as the AI uses to see; the thing that is being honest is speaking from the same viewpoint as the AI (because it is the AI); any planning that the AI is doing, is flowing through the ideas that form the perspective from which the honest reports are coming; if there were optimization power flowing through the AI, that optimization power is an integral part of the AI, so that to be honest is to reveal that optimization power.

This is a possibly hopeworthy intuition. However, just taking the concept "honesty", as it is structured pretheoretically, does not automatically lead Ḥaḥam to walk the fine line of unraveling difficulties in the hopeworthy path and developing the idea so that it could be turned into an engineering specification, while keeping to the hopeworthy path.

Rather, Ḥaḥam slips off and instead works out something that doesn't have a shot at realizing the hope. The idea of a reporter AI that's trained on the task of reading off facts from a base AI's internals is a fine idea, but it has already from the outset dropped the core intuition described above. And this shows up. The core hopeworthy intuition involves comprehensiveness, and the reporter AI doesn't. The core intuition hiddenly involves the idea of a mind that is integrated in such a way that it is rightly described as believing or disbelieving propositions——so that it either behaves in all cases as if the proposition is true, or else it behaves in all cases as if the proposition is false. For such a mind, one could then say that it reports its beliefs——in the unified, comprehensive sense——accurately. It may be that this sort of honesty, fully realized, would preclude conceptual Doppelgängers. The reporter AI does not preclude Doppelgängers, which witnesses that the idea of the reporter AI has not captured the core desirable properties aspirationally claimed by the pretheoretical intuition of honesty.

The pretheoretical concept of "honesty" didn't automatically direct Ḥaḥam's thinking rightly. Ḥaḥam might have said:

The preexisting idea of "honesty" may not be fully explicit, or precise, or unconfused. But that's fine. We'll keep this in mind, while going ahead with concrete research questions. The starting place is what we can access using our current provisional idea of honesty, and we'll go from there. That way we'll get involved with the details we need to be involved with in order to get traction and make progress in touch with reality. As the research goes on we'll get more clarity about how minds work and what are the hard parts of shaping minds, especially as that relates to honesty, even if we're not addressing the whole problem at once. That clarity will build up so that we can start addressing the big hard stuff.

That's well enough and true enough. Rasha would reply that Ḥaḥam has stopped pushing in the direction that he originally wanted to go, and is now barely moving in that direction, and can't tell that he's barely moving in that direction because he has stopped tracking it. The original core hopeworthy intuition isn't driving the investigation.

Some confusions are essential, even if only pretheoretically described

See also: "philosophical concept laundering".

Above, two responses were listed to the problem of pretheoretical concepts that don't immediately do the work we want them to do. To that list, a third item can be added:

  1. Go on using the pretheoretical concept as-is.
  2. Replace it with a clear but partial formalization.
  3. Throw the concept away entirely. Don't use it, or at most use it in a way which is supposed to be merely a suggestive piece of poetry, not a central part of the real work of building up a way of thinking that can make aligned AGI.

The third response gives up the possibly-hopeworthy paths that used the concept. In exchange it emphasizes the negative space: the paths that involve thinking in other terms, not using the abandoned concept. Is that a good trade? In some cases yes. E.g. abandoning the idea of "anger", as in "make the AI so that it doesn't experience anger", is right; the abandoned path wasn't hopeworthy. In other cases, the abandoned path was hopeworthy, so abandoning it is a cost.

What about the benefit, the newly emphasized paths? Those paths seem to have at least one thing going for them: they don't involve having to worry about the sticky confusions that the abandoned concept brings with it. Is that benefit really present? In many cases, no, it's not really present. Some confusions are forced. Such confusions are essential to the problem of making an aligned AGI——any successful approach will have to deconfuse the confusion, will have to find a way of thinking that can adequately answer to the calling that originally revealed the confusion.

For example, I suspect that we can't avoid the need to go much further in clarifying the idea of {value, intention, wanting, goal, trying, motive}. Looking at the difficult-to-clarify menagerie of what we call wanting, and the confusion that arises from using the concept "wanting" as-is, and the inadequacy of formal ideas like "utility function", it's tempting to try abandoning the idea of wanting. But this is not so easily done. Consequentialist cognition is not so easily subverted: the way that we can see it being possible for an AI to be extremely useful is for the AI to do consequentialist cognition. If there is consequentialist cognition, that raises the question: what determines the ends of the consequentialist cognition; what determines the direction of its ultimate effects?

The situation is like Greenspun's tenth rule:

Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.

The structure of Lisp is out there for programming languages to lack, and for programmers to need and to find. Some of the understanding that we'll have when we understand better [what we currently call {value, intention, wanting, goal, trying, motive}] is out there for conceptual schemes to lack, and for AGI engineers to need and to find.

What does it look like when essential confusions are dodged? It looks like dirty concepts, which pretend to be philosophically innocent and unproblematic, while also trying to play roles that are rightly played by problematic concepts. It looks like conceptual Doppelgängers growing: ad hoc, informally-specified, bug-ridden, slow implementations of half of the supposedly avoided concepts. It looks like playing shell games: shuffling around the role played by the problematic concept so that locally the concept seems unnecessary, but the shuffling is not actually globally consistent.


Examples of concepts to be analyzed

The hermeneutic net approaches the problem of inadequate concepts by straddling all the relevant concepts at once, unfolding and clarifying and arranging them. The relevant concepts are the concepts that play roles in alignment-related desiderata, constraints, and hopeworthy ideas. Here are some words which are used as handles for relevant concepts:

  • agency, optimization, capability, strategy, skill
  • mind, thought, awareness, consciousness, experience, context, reflection
  • translation, interpretation, analogy, language, reference, pointer, meaning, perspective
  • belief, knowledge, proposition, term, truth, canonicity, model, understanding, honesty, evidence, hypothesis
  • care, goal, value, aim, intention, want, try, pursue, desire, motive, drive, utility function, terminal goal, instrumental goal
  • criterion, reward, loss, search
  • flaw, error, correction
  • reason, justification, explanation
  • creativity, abduction, structure, ontology, generality, abstraction, concept, representation, essential, thing, intrinsic, learning
  • integration, coherence, explicit
  • corrigibility
  • cause, effect, power, determine, time, counterfactual, control, stability, possible
  • action, activity, plan, decision, choice, problem, output, behavior
  • world, reality, objective, external, state
  • identity, self, boundary
  • comprehensiveness, encompassing
  • probability, distribution, anomaly, information, uncertainty, ambiguity, complexity, naturality
  • mechanism, element, module, subagent

Examples of interrelatedness

  • A utility function is an evaluation of world states. An agent has a specific utility function when it optimizes the world so that the world is evaluated highly by the utility function. If the agent represents the world partly inexplicitly then the utility function is also represented only inexplicitly.
  • When an agent's values are not perfectly unambiguous——are not a perfect pointer with unambiguous reference——the agent is forced to choose how to resolve the ambiguity, i.e. forced to choose its values.
  • An agent is corrigible when it believes that it is flawed and that its reflection procedures for correcting its flaws are also flawed.
  • An agent is corrigible when it treats the human as though the human is within the boundary of the agent's agency; the agent identifies with the human+agent system. That is, the agent treats corrections coming from the human as internal self-corrections, gladly accepted, rather than external interference, actively resisted.
  • If an agent thinks about some thing, then the thing potentially violates the boundary of the agent via its representation making the thing present internally to the agent.
  • If a mind is highly capable then it has a high degree of instrumental coherence, which locally appears as though it is optimizing the world according to a complete utility function.

The project of large-scale conceptual revision

Criteria for conceptual revision

The task is to clarify preexisting proleptic, partial, provisional, pretheoretical concepts, and to create new concepts. The criteria that will call forth the future improved concepts have to be collected and applied to the search. What are these criteria? They all flow from the overarching criterion of thinking in a way that will produce good overall outcomes——concepts should be suitable for that sort of thinking. To give some specific classes of examples: A criterion might call for one or more concepts to:

  • Play a role in a proposition. That is, the concept appears in a proposition, and its appearance in the proposition determines something about how the concept should work. The concept should work in a way that renders the proposition true, interesting, relevant, useful, worthwhile. For example, the statement "a corrigible agent doesn't optimize to deceive the operators" asks for a concept of corrigibility such that, if an agent is corrigible in that sense, then the agent consequently also doesn't "optimize to deceive the operators". The statement also asks that the concepts of optimization, deception, and operator, as they are used in the statement, also be such that when an agent doesn't "optimize to deceive the operators", it doesn't, you know, trick the humans. If "optimization" is also used in another proposition such as "it is unnatural for optimization to be strong and wide-ranging without being fully-general", then there are multiple criteria pulling on the concept.

  • Relate to another concept by being founded on it or by providing foundation for it. E.g. "an optimization process controls the world to be in a small set of possible worlds" founds optimization on world, control, and possible. A foundation of concepts acts as a conduit between concepts for criteria on concepts. E.g. if "optimization" is used in a proposition such as "a corrigible agent doesn't optimize to deceive the operators", then the concept of "world" inherits the criteria on "optimization"——e.g., "world" has to include the agent's way of thinking, because the agent's way of thinking is one way the agent could trick the human (e.g. if the human is directly looking at the agent's internals).

  • Perform a task. E.g. in trying to write a computer program to display a rotating object, the concept of "rotation" is called to be put into a format that recommends a data format and transformations and displays of that data, so that the screen shows something that looks like a rotating object.

  • Be good, elegant, simple, general, useful, well-engineered, well-factored, precise, delineated, explicit. In other words, to satisfy the generally applicable senses that the thinker has for what is and isn't a good concept. E.g. if the concept of "values" is "the scoring function for possible worlds used to select actions; and also something about what the procedure used to resolve ambiguities in preliminary scoring functions is like", then it is sensibly not yet elegantly factored.

  • Capture and explicate the starting intuitions; well-describe examples. The starting intuitions around a concept are probably on the outskirts of a region containing some Things. The examples are probably best understood in terms of latent, underlying, intersecting Things. All these Things should be brought out as concepts.
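The "perform a task" criterion above can be made concrete with its own rotation example. What follows is only a minimal sketch, assuming a 2D object stored as a list of (x, y) vertices and rotated about the origin with the standard rotation matrix. The names `rotate` and `SQUARE` are hypothetical, and the display step is left out.

```python
import math

# Assumed data format: an object is a list of (x, y) vertices.
SQUARE = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]


def rotate(points, angle):
    """Rotate points counterclockwise about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


# One frame per step; a display loop would draw each frame in turn.
frames = [rotate(SQUARE, step * math.pi / 8) for step in range(16)]
```

The task pins the concept down: once "rotation" has to produce frames, it has to commit to a data format (vertices) and a transformation (the matrix), which is the sort of demand this criterion places on a concept.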

Three problems of conceptual revision

Infectious questioning and Indra's net

To revise a concept C, we consult the criteria for C. Some key criteria for C are likely to be propositions. Those propositions use C, and also use other concepts, say D.

In questioning C, we're looking to satisfy the criteria for C more fully than our present concept C is satisfying them. That means we're making more demands than usual on the criteria: we're trying to get advice, so to speak, from the proposition acting as a criterion for C, and we're asking the proposition for advice we haven't already heard and incorporated into C. We're asking the proposition to give us new examples, new problems, new logical implications. In asking for new advice from a proposition, we're also making new demands on the other concepts that the proposition uses, e.g. D. So now we want to revise D as well as C, as a soft prerequisite for rightly revising C.

You see where this is going. Everything wants to be pried loose all at once, too many questions are raised, and the glimpses of better concepts are hidden in the chaos. What starts as a careful analysis ends in gridlock.
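One way to picture this spread is as reachability in a graph that links concepts through the propositions that use them. The following is a toy sketch, not a proposed method: the proposition list is loosely adapted from examples earlier in this post, and the helper name `concepts_to_question` is hypothetical.

```python
from collections import deque

# Hypothetical toy data: each proposition (a criterion) mapped to the
# concepts it uses.
PROPOSITIONS = {
    "a corrigible agent doesn't optimize to deceive the operators":
        {"corrigibility", "agent", "optimization", "deception", "operator"},
    "an optimization process controls the world to be in a small set of possible worlds":
        {"optimization", "control", "world", "possible"},
    "an agent is corrigible when it believes that it is flawed":
        {"corrigibility", "agent", "belief", "flaw"},
}


def concepts_to_question(start, propositions):
    """Return the concepts reached by questioning `start`, breadth-first.

    Questioning a concept consults every proposition that uses it, and each
    such proposition puts new demands on the other concepts it uses.
    """
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        concept = queue.popleft()
        for used in propositions.values():
            if concept in used:
                for other in sorted(used - seen):
                    seen.add(other)
                    order.append(other)
                    queue.append(other)
    return order


reached = concepts_to_question("flaw", PROPOSITIONS)
print(len(reached), reached)
```

Even with only three propositions, starting from "flaw" reaches every concept in the toy data. With a realistic web of propositions the reachable set is, in effect, everything.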

Indra's net is an infinite net going in all directions with a jewel at each lattice point. Every jewel reflects every other jewel, and reflects all of every other jewel's reflections of every other jewel, and so on. This is the situation with ideas: to turn one stone all the way over is to upend the whole world. See also Leibniz's Monadology. Everything being interrelated doesn't mean that there is no differentiation or clarification to be had; out of all the connections that are present, only relatively few are essential.


When we're trying to get at new concepts, we're always dancing around them; the missing concept leaves a shadow in our understanding, a lack of clarity. We can catch glimpses of stuff that would lead to developing the missing concepts, if we follow the stuff forward. For example, the discovery of timeless decision theory grew (I speculate) from something like a combination of noticing that:

  1. the one-boxing strategy in Newcomb's problem, and the cooperate strategy in the twin prisoner's dilemma, are strategies that do better than what's considered "rational", and
  2. the arguments saying that those strategies are "irrational" have foundations that crumble when looked at.

These observations suggest to a careful listener that there's a principled notion of rational decision making that wins in these cases.
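The sense in which one-boxing "does better" can be checked with a short expected-payoff calculation. This is a sketch under assumptions that are not stated above: the conventional payoffs (1,000,000 in the opaque box if one-boxing was predicted, 1,000 always in the transparent box), and a predictor that is correct with probability `accuracy` whichever choice is made. The function names are hypothetical.

```python
# Assumed conventional payoffs (not stated in the post).
OPAQUE_PRIZE = 1_000_000   # in the opaque box iff one-boxing was predicted
TRANSPARENT_PRIZE = 1_000  # always in the transparent box


def expected_payoff_one_box(accuracy):
    # The opaque box is full exactly when the predictor foresaw one-boxing.
    return accuracy * OPAQUE_PRIZE


def expected_payoff_two_box(accuracy):
    # The opaque box is full only when the predictor wrongly foresaw one-boxing.
    return (1 - accuracy) * OPAQUE_PRIZE + TRANSPARENT_PRIZE


for accuracy in (0.5, 0.9, 0.99):
    print(accuracy, expected_payoff_one_box(accuracy), expected_payoff_two_box(accuracy))
```

Under these assumptions one-boxing comes out ahead whenever the predictor's accuracy exceeds (1,000,000 + 1,000) / (2 × 1,000,000) = 0.5005. Conditioning the box contents on the choice is exactly the step that the "irrational" verdict disputes, so the sketch shows only that one-boxers end up richer, not why that should count as rational.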

A more usual reaction is to think "Hmmm... That's weird..." and then move on, or make an ad-hoc patch to the problem, or squirm away from the question (e.g. saying that the decision problems are impossible or incoherent or vague hypotheticals). The mystery withdraws, the preexisting concepts stretching implausibly to cover up the vacancy——stretching just enough to give local answers to questions, whether or not the answers are clear and globally consistent. The vacancy where a new idea should be is laundered through existing concepts, or is shuffled around in a shell game.

It's as though the soldiers of understanding, after encroaching some ways into the region of a Thing, expanding the borders of understanding some ways up the mountain of the Thing, at some point stop. They stop not when they've summitted, but before then. They stop once they're merely higher up than other forces——when the territory of understanding they've gained gives enough advantage to address the demands already clearly made on the understanding of this Thing by other neighboring regions. And they turn around, and face downhill, and dig their trenches there, as if guarding the summit from demands made by neighboring regions——a summit which they haven't themselves met.

This "just-in-time philosophy", which only engages in speculative conceptual revision when immediately forced to by nearby demands, I think will not work. I suspect that it doesn't even work very well for normal science, and that we have curiosity because evolution (so to speak) saw fit to specifically tell us to wonder what things are like even without a specific purpose.

The centripetal force of the preexisting conceptual scheme

To say it another way: The situation, as I'm here conjecturing, is that Indra's net creates a centripetal force, pulling thoughts into the convex hull of the preexisting conceptual scheme. A question always bakes in background conceptual assumptions.

Science works. Also, math works. Questioning points the way out of the existing conceptual scheme by asking us to come to terms with something other than our own ideas as they already are. So, it's not like there's some fundamental barrier here. But still, if a concept like "values" always shows up in the question, and nearby questions are taken as equivalent to a rephrasing that relies on the pretheoretic idea of "values", and the pretheoretic idea of "values" is investigated in some aspects but not others, then all questions about "values", and all the other questions that they inspire in the network of questioning, will have a correlated failure to address the uninvestigated aspects.

The centripetal force bends inquiries away from certain directions, and bends inquiries away from going too far, too persistently, too "unresponsively to data", in a single direction——too far out of the convex hull. The normal progress of science depends on anomalies continuing to make themselves visible and felt, and even to become more and more salient and pressing as the preexisting understanding works out its home-turf implications. Traveling in the space of possible conceptual schemes (e.g. learning, doing science, learning a new language) proceeds by taking some steps——steps of tweaking, refactoring, abducing, conjecturing, combining ideas. Inquiries that would only be satisfied by taking many simultaneous steps aren't pursued. In the long run there may be no such inquiries, but relative to humanity's current conceptual scheme, there are valleys of [worse, useless, uninteresting conceptual schemes that result from some, but not enough, simultaneous conceptual mutations away from the current scheme] between here and where we need to go. If that's true, then what would be required is leaps, not steps. An analogy: in bouldering, a coordination dyno is a dynamic move where the climber removes most or all zer limbs from the wall during a movement, and then lands in a new improved position and has to arrange multiple points of contact correctly at the same time to stabilize (example).

Even pointing out that there's a technical metaphilosophical problem here, a problem of multiple simultaneous conceptual revisions being needed, is difficult. Radical confusion (or in other words, a missing conceptual scheme) is heavily dependent on the mental context——both for its existence, and for pointing at its presence (there's more ways to be confused than ways to understand clearly). So examples don't translate well to other minds. And, each example is compute-intensive to have in the first place because an example can only come from long-term inquiry that crosses the valley the slow way (by going around, step by step along the rimdale).

To explore multiple simultaneous modifications requires more channels of creativity. Like the difference between evolution piling up isolated tweaks and a designer leaping to an island of effectiveness, across a valley of ineffectiveness, via multiple simultaneous changes, a hermeneutic net (hypothetically, hopefully) can rewrite conceptual schemes in ways that would take much longer to do by steps that rely on the usual stepwise inquiry and conceptual revision.
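
A toy sketch of the valley picture, under made-up assumptions: a "conceptual scheme" is a tuple of four binary choices, the current scheme is all zeros, the adequate scheme is all ones, and every partial revision is worse than both. The landscape, the `fitness` values, and the function names are hypothetical and chosen only to show why a search that changes one concept at a time can stall where a search that changes several at once does not.

```python
from itertools import combinations

def fitness(scheme):
    """Toy landscape: all zeros scores 1, all ones scores 10, anything between scores 0."""
    ones = sum(scheme)
    if ones == 0:
        return 1
    if ones == len(scheme):
        return 10
    return 0  # the valley of partial, not-yet-useful revisions

def neighbors(scheme, max_changes):
    """All schemes reachable by changing between 1 and max_changes concepts at once."""
    for k in range(1, max_changes + 1):
        for idxs in combinations(range(len(scheme)), k):
            new = list(scheme)
            for i in idxs:
                new[i] = 1 - new[i]
            yield tuple(new)

def climb(start, max_changes):
    """Greedy search: move to the best neighbor while it is strictly better."""
    current = start
    while True:
        best = max(neighbors(current, max_changes), key=fitness)
        if fitness(best) <= fitness(current):
            return current
        current = best

start = (0, 0, 0, 0)
stepwise = climb(start, max_changes=1)  # one revision at a time
leaping = climb(start, max_changes=4)   # several simultaneous revisions
print(stepwise, leaping)
```

With single changes the search never leaves the starting scheme, since every one-step neighbor is in the valley; allowing four simultaneous changes reaches the far side. The cost is that the number of candidate moves grows combinatorially, which is one way to read "more channels of creativity".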

The hermeneutic net


More and better concepts are needed. An inquiry aimed at adding or improving a concept will recursively make inquiries for more and better adjacent concepts. Such an inquiry will ask too many questions at once, and the fundamental problems will stay hidden.

The basic idea: brute-force global analysis

I don't know how to deal with these difficulties. The "hermeneutic net" is a conjectural method to brute-force the issue. It's not sophisticated or tested. It's just what seems to me like the first obvious thing to try.

The idea of a hermeneutic net is this: to analyze all the concepts at once.

The hope is that a larger, more systematic effort than has already been put forward might set up the conditions where the infectious questioning can be contained, and a net of mutating and splitting concepts can be cast over and pulled tight around the mysteries. In other words, the hope is to flank and cut off, from all sides, the out-of-control questioning of key concepts, by systematically setting up all the related concepts to be questioned, and in particular setting them up to be suitable for providing support to inquiries into adjacent concepts.

For example, a hermeneutic net would start with a preliminary analysis of some concept, e.g. "action". The analysis would be carried out not until "action" is fully understood in any sense, or meets any set of direct criteria (such as fully explaining some example), but instead carried out until the understanding of "action" has been brought into a state that is prepared for what comes next——as prepared as is feasible with reasonable effort. Next, an adjacent concept is analyzed, e.g. "decision". When that concept calls on the previous concept "action", for example in the criterion given by a proposition like "a decision is when an agent selects an action" or "when an agent gains coherence it has expanded the range of its actions that can be counted as decisions" or something, the preparation done with "action" is supposed to kick into gear, providing the inquiry into "decision" with as much ready help as feasible.

A basic analytic method

How does the actual inquiry into a given concept go? It should go however it has to go to generate adequate concepts——which is an unboundedly complex task. But here is a starting point for a method to analyze a concept:

In short, to analyze "X":

  • What is X?
  • Why do I want to talk about X?
  • How can I talk (about X, or otherwise) to best satisfy those reasons I wanted to talk about X?

In more detail:

  • Start with a hopeworthy idea. Pick a concept X used in the idea that seems like it could sharpen or dispel the hope.

  • Unpacking the concept.

    • What are examples of X? What are clear anti-examples? What is something that's close to being an X but isn't, or close to not being an X but is? What are some examples of something that ought to be well-characterized as either X or not X, but instead it's ambiguous whether it even makes sense to ask if it's an X? Expand the domain of discourse.
    • What are the intuitive "cores" of the idea of X (e.g. a definition, a central example, or an aspirational criterion that X should satisfy)? What are the essential features of X? What are some examples that separate these intuitive cores?
    • What are some concepts similar to X? What are some formal concepts that capture some aspect of X? What are some examples that separate these formal concepts from X?
    • When and how is X talked about? Someone who has been talking about X, why were they talking about X? What did they get a glimpse of, that made them talk that way, though they haven't seen the full thing yet? What are they really trying to get at, if only confusedly? What are they trying to do, that called them to talk that way?
  • Bringing in the criteria.

    • What are some concepts that X is defined in terms of? How does such a definition constrain X, and how does it constrain the other concepts it uses? What are some concepts that are defined in terms of X? What role does such a definition call on X to play?
    • Where and in what role does X want to be talked about? That is, where does it seem like you would want to use X, if only X were a little different (more general, more precise, more well-factored)? Given how X is already talked about, what role is it being called on to play? What properties are being assumed?
    • What are some true propositions about X? What does their truth imply about X?
    • What are some tug-of-war conflicts between demands made on X? Where is X being used, or wanting to be used, in a way that's inconsistent with other uses of X? What are some contradictions between propositions about X that should be true?
    • Can X be eliminated? If you can't do without X or something like X, why exactly not? This is in response to Mateusz Bagiński pointing out that analyzing X may bias a thinker toward keeping X around. So this question asks, can you do without X? Otherwise a thinker might for example reshape other related concepts to be suitable in the context where X is around, creating more pressure to have an X-like concept. (Though: concepts are usually touching some Things.)
    • What are some thoughts that you can't quite speak, and that you would want to speak if you had better concepts in place of X?
    • For each formal concept that captures some aspects of X, what does that formal concept fail to do that a good conceptual scheme for X ought to do?
  • Generating concepts.

    • Make concepts, somehow. E.g. concepts that tease out subclusters within the vaguer concept, that better carve examples at the joints, that better classify examples, that more parsimoniously describe examples. E.g. concepts that are precise definitions, that are tweaked versions of preexisting concepts, that are combinations of preexisting concepts, that are anchored to an example or to a criterion.
    • X is used in some thoughts. Try to enroll and assist your mind in using its natural powers toward the task of making those thoughts more thinkable——which will involve creating new concepts.
    • E.g. try redescribing or rethinking these thoughts over and over——emphasizing different aspects, holding different central examples in mind, maybe tabooing words you used previously——like running your hands over an object over and over, making it familiar, a part of you. Let yourself follow your inclinations to speak or think the thought a little differently each time.
    • E.g. try compiling the thoughts you can't quite speak, that you would want to speak if you had better concepts in place of X, into a long, concrete situation or problem. Then redescribe the long concrete situation, trying to compress it. The idea is to teach yourself to use the new concepts that you'll create in the process of being taught. Like a chess beginner who——being told a whole paragraph about pawns protecting pawns, and then pawns not protecting each other, and then pawns being attacked in sequence, and then pawns ceasing to defend squares and block lines of attack——comes to cope with the whole long concretely described situation by comprehending it into a concept of "undermining a pawn chain".
    • If a concept is new and seems worthwhile, make a new word for it.
    • The word could come first, as a handle for a way of thinking in some context. The word then proleptically means something like "the idea(s) that I'll use (in some role suggested by the grammar of this new word) to think in this context, as I learn to think better in this context".
  • Going through the hermeneutic circle.

    • When a concept is created or tweaked, propagate the change. Redescribe examples, restate propositions, and reask desiderata using the new concept. Are the examples now clear? Do they pose new problems? Do the propositions and desiderata make new demands of other concepts that they use? How does the sense of the old version of the proposition, before the conceptual change, relate to the sense of the new version of the proposition?
    • Given the preparation done for one concept——having collected and made available the intuitions, examples, variations, and components of the concept——now try to refactor that concept along with adjacent related concepts. For example: Which [pair of: a variant C' of concept C and a variant D' of concept D] would make more sense of this proposition / desideratum / definition / situation / example that involves both C and D, if it instead uses C' and D' in their places? Which example of C shows how C is not really connected to D in quite the way it had seemed to be at first? And so on...
    • Interleave the small picture——analyzing a single concept, digging into examples and its internal logic——with the big picture——seeing what the major confusions are, seeing what would now be needed to satisfy key desiderata. The big picture directs attention to what matters most and provides corrections from outside the method of the hermeneutic net, while the small picture takes steps forward.
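
The bookkeeping side of the method above can be sketched as a small data structure. This is only an illustration under assumptions: the class names `Concept` and `HermeneuticNet`, the fields, and the `revise` method are hypothetical, and the sketch covers only the record-keeping (which concepts are adjacent, and which need revisiting when one changes), not the actual work of generating concepts.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """Preliminary analysis of one concept (field names are illustrative)."""
    name: str
    examples: list = field(default_factory=list)
    desiderata: list = field(default_factory=list)  # propositions making demands on the concept
    related: set = field(default_factory=set)       # names of adjacent concepts
    version: int = 0                                # bumped on each revision

class HermeneuticNet:
    def __init__(self):
        self.concepts = {}

    def add(self, name, related=()):
        """Register a concept and link it symmetrically to adjacent concepts."""
        concept = self.concepts.setdefault(name, Concept(name))
        for other in related:
            self.concepts.setdefault(other, Concept(other))
            concept.related.add(other)
            self.concepts[other].related.add(name)
        return concept

    def revise(self, name):
        """Record a revision; return the adjacent concepts whose examples,
        propositions, and desiderata should now be restated."""
        self.concepts[name].version += 1
        return sorted(self.concepts[name].related)

net = HermeneuticNet()
decision = net.add("decision", related=["action", "agent"])
decision.desiderata.append("a decision is when an agent selects an action")
net.add("corrigibility", related=["agent", "flaw"])

to_revisit = net.revise("agent")
print(to_revisit)
```

Revising "agent" here flags "corrigibility" and "decision" for another pass, which is the "propagate the change" step in miniature. Everything hard about the method lives outside this sketch.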


Hermeneutics is interpretation and the study of interpretation. The hermeneutic circle goes between text and context; newly understanding the text changes the context, and a new context changes how the text is understood.

Why talk about interpretation? On one interpretation of "interpretation", "interpretation" is when a mind incorporates new structure into itself by empowering itself through that structure. How does this go with the intuitive meaning of interpretation as receiving and translating a message? Because when the message communicates something that is new to the hearer, that can't be assimilated in the hearer's preexisting conceptual scheme, the work of translation is mainly the work of grasping a new idea. As a key instance of such creative hearing, we interpret ourselves: we study ourselves as the exemplars of mind and agency, and we explicate our existing ideas about ourselves, and we coherentify ourselves by interpreting ourselves as coherent. We interpret ourselves (our ideas, pursuits, creations), for ourselves to hear, as a message from ourselves to ourselves.

The "net" is supposed to suggest a structure distributed across lots of concepts that might catch confusions; it is also supposed to suggest Indra's net.

Alternative names:

  • Hermeneutic cycle, suggesting ongoingness, or spiral, helix, suggesting upward motion.

    • "The sun was sinking in the sky, for Harry had been thinking for some hours now, thinking mostly the same thoughts over and over, but with key differences each time, like his thoughts were not going in circles, but climbing a spiral, or descending it." HPMOR ch. 63
  • Conceptual engineering.

  • Abductive, analytic, analytic/synthetic net.

  • Hermeneutic mesh, scaffold, network, web.

  • Hermeneutic load-distribution, suggesting the way that load in a building is spread out so that no one support beam has to bear all the weight of the whole building——finding concepts that each do enough work that deep confusions can be unraveled...

Difficulties with the hermeneutic net for agency

Good babble requires good prune

The hardest part of making good new concepts is coming up with any possibilities that are at all novel and at all have a chance of being useful for the task at hand. The first thing to try is Babble. But good babble requires good Prune. Without good prune, the products of babble are all highly correlated: they all allow themselves the same errors (the same mistaken assumptions, the same unhopeworthy hopes, the same ill-suited concepts, the same violated conservation laws, the same unknown and unheeded impossibility proofs, the same shell games, the same unsatisfied desiderata, the same unexplored regions). Highly correlated babble is very little babble at all.

Wittgenstein's tarpit

A trap that scares off investigators from systematic conceptual revision is Wittgenstein's tarpit. In Wittgenstein's tarpit, the meaning of words is questioned, not just as in "What is the meaning of 'X'?" but as in "Maybe there's no such thing as a meaning of 'X'." This questioning, which is on its own a part of a healthy hermeneutic, can congeal into a sticky denial of the Thingness of Things. Any proposition about the nature of things is met with an unendable "critique" that just takes the form of repeatedly disallowing the natural use of any idea brought up to justify the use of another idea.

No privileged foundational direction

There's an instinct to "ground" or "found" concepts. But there's no globally privileged direction of "more grounded" in the space of possible concepts. We have to settle for a reductholistic pluralism——or better, learn to think rightly, which will, as a side effect, make reductholism not seem like settling.

The whole mind is involved in any of its aspects

A hermeneutic net for mind has to understand the role that an element of a mind plays in the mind. In many cases this role only makes sense within the context of being a whole mind. Since the whole mind is far from fully understood, the element isn't fully understood. In other words, the element has to be gemini modeled where the context is the whole mind.


Since an agent potentially {changes, unfolds, grows, self-modifies}, any aspect of it might change. So concepts about agents are by default essentially provisional. That is, to be a concept that well-describes something about an agent, the concept has to have some openness to relate to the agent's ongoing unfolding, and so is provisional by nature, unlike more clean-cut concepts, such as concepts about simple physical systems. To put it another way, however much we try, we won't be able to understand everything about future very advanced agents. The elements of future very advanced agents that are novel to us will also change the context of even elements that we do understand ahead of time, rendering them alien.

Silently imputing the ghost in the machine

It is so natural for us to gemini model aspects of other humans, or possibilities for our own mental elements, that we do it without knowing that we're doing it. We impute a ghost in the machine without knowing what assumptions we're thereby making. E.g. we think of an agent having a belief, and assume that it has a belief in the way we have beliefs.

Imputing the ghost in the machine goes beyond anthropomorphism. Imputing anger to an alien agent is anthropomorphizing——assuming the agent is human-shaped. This is a mistake because the agent need not be human-shaped. There are agents that are full agents with full minds, that don't have anger. Imputing the ghost in the machine may impute very general properties to an agent that aren't especially human-shaped. That may be a mistake in two ways:

  1. The imputed properties may not hold, either of ourselves or of the other agent. Even for very abstract-seeming properties of minds, we're biased to think the way we work, or the way we know how to describe how we work, is how most minds work.
  2. Even if the imputed properties do hold of the other agent, we may impute them transparently——without knowing that we are imputing the properties to the agent——and in so doing, confuse ourselves by keeping it hidden from ourselves that those properties are important for how we're thinking of the agent.

For example, what is it like to be a bat? I imagine closing my eyes, and then getting a ghostly 3D point-cloud image, showing the ray-ends of rays radiating out from my head. This is probably not right, but even if it is right, I'm assuming that the bat thinks in terms of 3D space. That is probably right——but it's important that I'm assuming that the bat thinks in terms of 3D space. I might not notice that I'm assuming that. I just "put myself behind the eyes" of the bat, and in so doing I import the ghost, my mind, the machinery that I don't notice as it constitutes my world for me. When I "put myself behind the eyes" of the bat, I unconsciously bring along (that is, silently impute) the 3D scene modeler. Imagine trying to reprogram a bat to live in a 4D world. Where would you even start? It will be difficult anyway, but I think it will be extra difficult if you don't realize that the reason you think the bat thinks about 3D space in such and such a way, is that you're calling on your own 3D space modeler. Until you notice that you're thinking about the bat in that way, you might be confused about what you're even trying to do. Isn't that just... how space is? How does space work... It's like this [that is, like this space around me, which I'm looking at from behind my eyes using my 3D space modeler], is it not? So that's how space is. And now I want to make space different to a bat? How does that make sense, how could it be different to the bat, given that that's just how space is?

For example, see "What the Tortoise Said to Achilles". Achilles imputes a dynamic to the tortoise, which he transparently takes as just an aspect of speaking sentences.

For example, sometimes people believe that, for some X, we just need X to make AGI from current ML systems. Sometimes they believe this because they are imputing the ghost in the machine. E.g.: "LLMs don't get feedback from the environment, where they get to try an experiment and then see the results from the external world. When they do, they'll be able to learn unboundedly and be fully generally intelligent." I think what this person is doing is imagining themselves without feedback loops with external reality; then imagining themselves with feedback loops; noticing the difference in their own thinking in those two hypotheticals; and then imputing the difference to the LLM+feedback system, imagining that the step LLM ⟶ LLM+feedback is like the step human ⟶ human+feedback. In this case imputing the ghost is a mistake in both ways: they don't realize that they're making that imputation, and the LLM+feedback system actually doesn't have the imputed capabilities. They're falsely imputing [all those aspects of their mind that would be turned on by going from no-feedback to yes-feedback] to the LLM+feedback. That's a mistake because really the capabilities that come online in the human ⟶ human+feedback step require a bunch of machinery that the human does have, in the background, but that the LLM doesn't have (and the [LLM+feedback + training apparatus] system doesn't have the machinery that [human + humanity + human evolution] has).

It's the local differences in our experience that we notice, against a fixed unnoticed background. We notice the event of the update, but not the fixed Bayesian laws; we notice the change in our visual field, but not the 2.5D structure of our visual perception; we notice that we want vanilla ice cream, not chocolate ice cream, but not the wanting to eat or the structure of pursuing what we want or the structure of reflecting on and choosing our values. We then ask about an alien agent "Does it like vanilla ice cream or chocolate ice cream?" and we don't ask "In what manner does it want?".

If the ghost in the machine is imputed, and that imputation isn't noticed, there's a higher risk of merely rearranging confusion, playing shell games with the confusions about the hidden machinery.

4 comments

note to self: go through and read Tsvi's posts on his blog, as he seems to take a long time to post them to lesswrong. (perhaps that could change? I'm curious why that is the case)

It's a makeshift stop-gradient. I feel less like I'm writing to LessWrong if I'm not publishing it immediately, and although LW is sadly the best place on the internet that I'm aware of, it's very much not in aggregate a gradient I want. Sometimes I write posts intended for LW and publish them immediately.

This section is probably my favorite thing you (Tsvi) have written, and motivated me to read through all your alignment related posts on your blog.

Before I read that passage, I was confident that deconfusion research was the highest value thing I could be doing (and getting better at), but I did not have a succinct way of communicating the fact that me seeming confused about a certain concept is not a sign that I have worse understanding about the problem involved compared to someone who doesn't seem confused.

There's a misconception where most people pattern match confidence in one's understanding of a concept / domain to better understanding of the domain, and pattern match vagueness in the description of a concept to someone not quite understanding the domain. I notice hints of this even in rationalist friends I have, the ones who have read The Sequences and have a strong aversion to stuff that, in their head, pattern matches to making basic rationality mistakes. Reading this passage helped me have a handle on why I felt that my epistemic state was still better than that of others who seemed more confident in their claims.

Also, I feel like this somewhat relates to Eliezer's aversion to bio-anchors and concrete 'base rates', but I don't yet have a good way of clarifying it in my head.

A lot of the examples of the concepts that you list already belong to established scientific fields: math, logic, probability, causal inference, ontology, semantics, physics, information theory, computer science, learning theory, and so on. These concepts don't need philosophical re-definition. Respecting the field boundaries, and the ways that fields are connected to each other via other fields (e.g., math and ontology to information theory/CS/learning theory via semantics) is also I think on net a good practice: it's better to focus attention on the fields that are actually most proto-scientific and philosophically confusing: intelligence, sentience, psychology, consciousness, agency, decision making, boundaries, safety, utility, value (axiology), and ethics[1].

Then, to make the overall idea solid, I think it's necessary to do a couple of extra things (you may already mention this in the post, but I semi-skimmed it and maybe missed these).

  • First, specify the concepts in this fuzzy proto-scientific area of intelligence, agency, and ethics not in terms of each other, but in terms of (or in a clearly specified connection with) those other scientific fields/ontologies that are already established, enumerated above. For example, a theory of agency should be compatible or connected with (or, specified in terms of) causal inference and learning theories. Theory of boundaries and ethics should be based on physics, information theory, semantics, and learning theory, among other things (cf. scale-free axiology and ethics).
  • Second, establish feedback loops that test these "proposed" theories of agency (psychology, ethics, decision-making) both in simulated environments (e.g., with LLM-based agents embodying these proposed theories acting in Minecraft- or Sims-like worlds) and in (constrained) real-life settings or environments. Note that the obligatory connection to physics, information theory, causal inference, and learning theory will ensure that these tests themselves can be counted as scientific.

The good news is that now there are sufficient (or almost sufficient) affordances to build AI agents that can embody sufficiently realistic and rich versions of these theories in realistic simulated environments as well as in real life. And I think an actual R&D agenda proposal should be written about this and submitted for a Superalignment grant.

There's an instinct to "ground" or "found" concepts. But there's no globally privileged direction of "more grounded" in the space of possible concepts. We have to settle for a reductholistic pluralism——or better, learn to think rightly, which will, as a side effect, make reductholism not seem like settling.

I disagree with the last sentence: "reductholism" should be the settling, as I argue in "For alignment, we should simultaneously use multiple theories of cognition and value". (Note that this view itself is based largely on quantum information theory: see "Information flow in context-dependent hierarchical Bayesian inference".)


  1. ^

    A counterargument could be made here that although logic, causal inference, ontology, semantics, physics, information theory, CS, learning theory, and so on are fairly established and all have SoTA, mature theories that look solid, these are probably not the final theories in all or many of these fields, and philosophical poking could highlight the problems with these theories, and perhaps this will actually be the key to "solving alignment". I agree that this is in principle a possible chain of events, but it looks to me to have quite low expected impact from the "hermeneutic nets" perspective, so this agenda is still better focused on the "core confusing" fields (intelligence, agency, ethics, etc.), treating the established fields and the concepts therein "as given".