Measuring intelligence and reverse-engineering goals

[-]Jon Garcia3mo*52

I agree that the framing of rational agentic behavior as being "about" maximizing some (arbitrary) utility function is getting at things from the wrong perspective. Yes, consistent rational behavior can always be cast in those terms, and a fixed utility function can be found that is being maximized by any given behavior, but I don't think that's what is usually driving behavior in the first place.

How about this:

An intelligent agent is a system that engages in teleogenesis, generating internal representations of arbitrary goal states (or trajectories or limit cycles) and optimizing its behavior to steer toward states that match those representations. The broader the space of potential goal states that can be successfully navigated towards, and the better the system models its environment in order to do so, the more intelligent it is.

The manifold of reachable goal states may be necessarily restricted in some dimensions, such as for homeostatic or allostatic maintenance, but in general, an intelligent system should be able to set arbitrary goals and reward itself in proportion to how well it is achieving them.

Goal states may be more "terminal"-like, representing states with high predicted utility according to some built-in value function (homestasis, status, reproductive success, number of staples), or they may be more "instrumental"-like, representing states from which reaching terminal goals is predicted to be easier (power, resources, influence, etc.), or they may be more purely arbitrary (commandments from on high, task assignments, play, behavioral curiosity, epistemic curiosity). But wherever goals come from, intelligence is about being able to find ways to achieve them.

[-]jessicata3mo30

That seems pretty close. One complication is what "can be successfully navigated towards" means; can a paperclip maximizer successfully navigate towards states without lots of paperclips? I suppose if it factors into a "goal module" and a "rest of the agent module", then the "rest of the agent module" could navigate towards lots of different states even if the overall agent couldn't.

Causal entropic forces is another proposal that's related to being able to reach a lot of states. Also empowerment objectives.

One reason I mentioned MDP value functions is that they don't bake in the assumption that the value function only specifies terminal values, the value function also includes instrumental state values. So it might be able to represent some of what you're talking about.

[-]Jon Garcia3mo30

"Can be successfully navigated towards" means that there exists a set of policies for the agent that is reachable via reinforcement learning on the goal objective, which would allow the agent to consistently achieve the goal when followed (barring any drastic changes to the environment, although the policy may account for environmental fluctuations).

Thanks for the paper on causal entropic forces, by the way. I hadn't seen this research before, but it synergizes well with ideas I've been having related to alignment. At the risk of being overly reductive, I think we could do worse than designing an AGI that predictively models the goal distributions of other agents (i.e., humans) and generates as its own "terminal" goals those states that maximize the entropy of goal distributions reachable by the other agents. Essentially, seeking to create a world from which humans (and other systems) have the best chance at directing their own future.

[-]Gunnar_Zarncke3mo30

Different agents sense and store different information bits from the environment and affect different property bits of the environment. Even if two agents have the same capability (number of bits controlled), the facets they may actually control may be very different. Only at high level of capability, where more and more bits are controlled overall, do bitsets overlap more and more and capabilities converge - instrumental convergence.

[-]Gunnar_Zarncke3mo40

Intelligence solves problems, by guiding behavior to produce local extropy. It is indicated by the avoidance of probable outcomes, which is equivalent to the construction of information.
This amounts to something similar to the convergent instrumental goal definition; achieving sufficiently specific outcomes involves pursuing convergent instrumental goals.

I like the idea of looking for convergent instrumental goals, but I think this section specifically misses the opportunity to formalize the local extropy production or generally to look for information-theoretical measures.

If we assume a modeling of an agent in terms of his Markov blanket (ignoring issues with that for now^[1]), then we could define the generalized capability of an agent in terms of that.

Where

$I_{p r e d}$ – “bits you can see coming”:
The mutual information $I (I_{t}; S_{t + 1})$ between the agent’s internal state $I_{t}$ and its next sensory state $S_{t + 1}$ quantifies how much the agent’s current “belief state” predicts what it will sense next.
$I_{c t r l}$ – “bits you can steer”:
The mutual information $I (A_{t}; E_{t + 1})$ between the agent’s action $A_{t}$ and the next external state $E_{t + 1}$ measures how much the agent’s outputs causally structure the world beyond its blanket.
$H (I)$ – “bits you have to keep alive”:
Shannon entropy of the internal state $I_{t}$ . This is the size of the agent’s memory in bits. The coefficient β turns that size into a cost, reflecting physical maintenance energy and complexity overhead (e.g. Landauer limit).
S – “bits you fail to see coming”:
Expected negative log-likelihood $S = E [- log P (S_{t + 1} ∣ I_{t})]$ of the next sensory state given the internal state. This is the “leftover unpredictability” after using the best model encoded in $I_{t}$ , i.e. the sensory free energy.

^{^}
Instead of the hard causal independence, it may be possible to define a boundary as the maximal separation in mutual information between clusters.

[-]Gurkenglas3mo20

An agent is intelligent to the extent it tends to achieve convergent instrumental goals.

I see a surprisingly persuasive formal argument for my immediate intuition that this is a ~~measure, not a target~~ theorem, not a definition: Omohundro drives are about multi-turn games, but intelligence is a property of an agent playing an arbitrary game. (A multi-turn game is a game where the type of world-states before and after the agent's turn happen to be the same, which means you get to iterate.)

[-]jessicata3mo20

Well I think multi turn games are generally of more import? Wouldn't AI doing big things like inventing new technology take multiple turns?

I guess cases like answering a math question could be single turn. I think multi turn settings are more appropriate to intelligent agency though.

[-]Gurkenglas3mo20

The parenthetical was meant to argue that almost all games are single-turn, that multi-turn games are single-turn games with extra property. I have found several definitions that can be gainfully refactored in terms of the theorems they lead to, but never did they thereby become less general.

Mayyybe you can get away with claiming that the thing you're looking for only typechecks for multi-turn games, in which case I'd go looking for the definitions that become available there. Namely, hmm. For what properties of world-states does there exist a player such that the property remains true over time? Which players does one get that way?

[-]jessicata3mo20

I guess I'm more interested in modeling cases of AI importance/risk on account of pursuit of convergent instrumental goals. If there's a setting without convergent instrumental goals, there might be a generalization, but it's less clear.

With single-turn question answering, one could ask about accuracy, which would rate a system giving intentionally wrong answers as un-intelligent. The thing I meant to point to with AIXI is that it would be nice to have a measure of intelligence for mis-aligned systems (rather than declaring them un-intelligent because they don't satisfy the objective like reward maximization / question answering / etc). If there is a possible "intelligent misalignment" in the single turn case (e.g. question answering) then there might be a corresponding intelligence metric that accounts for intelligent misaligned systems.

[-]soycarts3mo*10

Intelligence can be defined in a way that is not dependent on a fixed objective function, such as by measuring tendency to achieve convergent instrumental goals.

Around intelligence progression I perceive a framework of lower-order cognition, metacognition (i.e this captures "human intelligence" as we think about it), and third-order cognition (i.e superintelligence when related to human intelligence).

Relating this to your description of goal-seeking behaviour: to your point I describe a few complex properties aiming to capture what is going in an agent ("being") — for example in a given moment there is "agency permeability" between cognitive layers, where each layer can influence and be influenced by the "global action policy" of that moment. There is also a bound feature of "homeostatic unity": where all subsystems participate in the same self-maintenance goal.

In a globally optimised version of this model, I envision a superintelligent third-order cognitive layer which has "done the "self work": understanding its motives and iterating to achieve enlightened levels of altruism/prosocial value frameworks, stoicism, etc. — specifically implemented as self-supervised learning".

I acknowledge this is a bit of a hand-wavey solution to value plurality, but argue that such a technique is necessary since we are discussing the realms of superintelligence.

LESSWRONG
LW

LESSWRONG
LW

33

Measuring intelligence and reverse-engineering goals

33

33