Project proposal: Testing the IBP definition of agent



The way we see this project going concretely looks something like:

First things first, we want to build a good enough theoretical background in IBP. This will ultimately result in something like a distillation of IBP that we will use as a reference, and that we hope others will get a lot of use from.

In this process, we will do most of our testing in a theoretical framework. That is to say, we will construct model agents and see how Infra-Bayesian Physicalism actually deals with them in theory, whether it breaks down at any stage (as judged by us), and if so, whether we can fix or avoid those problems somehow.

What comes after this, as we see it at the moment, is trying to implement the principles of Infra-Bayesian Physicalism in a real-life, honest-to-god Inverse Reinforcement Learning proposal. We think IBP stands a good chance of patching some of the largest problems in IRL, which should ultimately be demonstrable by actually making an IRL proposal that works robustly. (When this inevitably fails the first few times, we will probably return to step 1, having gained useful insights, and iterate.)

I'm glad that you guys are interested in working on IBP/PreDCA. Here are a few points that might help you:

- The scope of this project seems extremely ambitious. I think that the road from here to empirical demonstrations (assuming that you mean in the real-world rather than some artificial toy setting) is a programme for many people working over many years. Therefore, I think you will benefit from zooming in and deciding on the particular first step you want to take along that road.
- Technicality: the definition of g in the IBP post is for a fixed loss function, because that was sufficient for the purposes of that post, but the definition of the cartesian version is loss-function-agnostic. Ofc it's completely straightforward to write a loss-function-specific cartesian version and a loss-function-agnostic physicalist version.
- Regarding the agentometer/utiliscope, I think it's probably best to start from studying the cartesian versions, because that's likely to be simpler and the physicalist theory will build on that.
- Specifically, we want to get theorems along the lines of: (i) For an agent with sufficiently high g, the asymptotic behavior of the utility function can be inferred nearly unambiguously. (ii) Inferring the utility function of agent-1 and then optimizing it via agent-2 that e.g. has a richer observation channel leads to results that are better for agent-1 than what agent-1 can do on its own, in the long time horizon limit.
- The epistemic status of the utiliscope formula from my presentation is: I'm pretty optimistic that there is *some* correct formula along those lines, but the specific formula I wrote there is just my best guess after thinking for a short time, and I am far from confident it is correct. My confidence would become much better if we demonstrated some non-trivial theorems showing it satisfies some intuitive desiderata.
- The epistemic status of the definition of g is: I'm pretty optimistic it is *close* to correct, but there is definitely room for quibbling over the details.
- While studying the computational complexity of the relevant mathematical objects is reasonable, I advise steering clear of practical implementations of IBRL/IBP (assuming that "practical" means "competitive+ with ANNs") because of the associated risks, until we are *much* further along on the theoretical side.

Also, I am completely open to discussing the details of your project in private, if you're serious about it.

Hi Vanessa!

Thank you so much for your thoughtful reply. To respond to a few of your points:

- We only mean to test this in an artificial toy setting. We agree that empirical demonstrations seem very difficult.
- Thanks for pointing out the cartesian versions -- I hadn't read this before, and it's a nice clarification of how to measure intelligence in a loss-function-agnostic way.
- It's good to know about the epistemic status of this part of the theory, we might take a stab at proving some of these bounds.
- We will definitely make sure to avoid competitive implementations because of the associated risks.

We would very much appreciate discussing details in private, we are serious about it. I'll follow up with a DM on LessWrong soon.

## Context

Our team in SERI MATS needs to choose a project to work on for the next month. We spent the first two weeks discussing the alignment problem and what makes it difficult, and proposing (lots of) projects to look for one that we think would directly address the hard parts of the alignment problem.

We're writing this post to get feedback and criticism of this project proposal. Please let us know if you think this is a suboptimal project in any way.

## Project

Disclaimer: We've probably misunderstood some things; don't assume anything in this post accurately represents Vanessa's ideas.

Our project is motivated by Vanessa Kosoy’s PreDCA proposal. We want to understand this proposal in enough detail that we can simplify it, as well as see and patch any holes.

IBP gives us several key tools:

- An "agentometer"^{[1]} that takes in a program and tells us how agentic it is, which is operationalized as how well the agent does according to a fixed loss function relative to a random policy.
- A "utiliscope"^{[1]} that, given an agent, outputs a distribution over the utility functions of the agent.

Together, these tools could give a solution to the pointers problem, which we believe is a core problem in alignment. We will start by understanding and testing Vanessa’s definition of agency.

## Definition of Agency

The following is Vanessa's definition of the intelligence of an agent, where an agent is a program, denoted by G, that outputs policies (as described in Evaluating Agents in IBP). This can be used to identify agents in a world model.

Definition 1.6: Denote G∗:H→A the policy actually implemented by G. Fix ξ∈Δ(A^H). The physicalist intelligence of G relative to the baseline policy mixture ξ, prior ζ, and loss function L is defined by:

$$g(G \mid \xi; \zeta, L) := -\log \Pr_{\pi \sim \xi}\left[L_{pol}(\ulcorner G \urcorner, \pi, \zeta) \le L_{pol}(\ulcorner G \urcorner, G^*, \zeta)\right]$$

In words, this says that the intelligence of the agent G, given a loss function L, is the negative log of the probability that a random policy π does at least as well as the policy the agent actually implements, denoted by G∗.
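As a toy illustration of this definition (our sketch, not Vanessa's; the bandit setup and all names are invented for the example), g can be estimated by Monte Carlo: sample random policies π ∼ ξ and count how often they do at least as well as the agent's policy. Here a "policy" is just an arm choice in a 3-armed bandit, and the expected loss of the chosen arm stands in for Lpol:

```python
import math
import random

# Toy setting: a one-shot 3-armed bandit, where a "policy" is just an arm
# choice. Expected losses per arm (lower is better) stand in for L_pol.
ARM_LOSSES = [0.9, 0.5, 0.1]

def loss(policy: int) -> float:
    """Expected loss of a policy (arm choice) -- our stand-in for L_pol."""
    return ARM_LOSSES[policy]

def estimate_g(agent_policy: int, n_samples: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of g = -log2 Pr_{pi~xi}[loss(pi) <= loss(agent)],
    where xi is the uniform distribution over arms."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n_samples)
        if loss(rng.randrange(len(ARM_LOSSES))) <= loss(agent_policy)
    )
    # Add-one smoothing keeps the estimate finite even if hits == 0.
    return -math.log2((hits + 1) / (n_samples + 1))

# The optimal agent (arm 2) is matched-or-beaten by only ~1/3 of random
# policies; the worst agent (arm 0) is matched-or-beaten by all of them.
print(estimate_g(2))  # ~1.58 bits
print(estimate_g(0))  # ~0.0 bits
```

With a uniform baseline, the optimal agent gets g ≈ log₂ 3 ≈ 1.58 bits while the worst agent gets g ≈ 0: intelligence here measures how hard the agent's performance is to match by chance.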

The next part is how to extract (a distribution over) the utility function of a given agent (from video on PreDCA):

Here, Lpol is just the negative of the utility function U. Combining this with the definition of intelligence above gives a simpler representation:

$$P(U) \propto 2^{-K(U) + g(G \mid \xi; \zeta, U)}$$

In words, the probability that agent G has utility function U increases exponentially in the intelligence of G implied by U and decreases exponentially in the Kolmogorov complexity of U.
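Continuing the toy bandit example (again our illustration: the candidate utility functions and the description-length stand-in for K(U) are made up), the posterior over a small set of candidate utilities can be computed directly from this formula:

```python
import math

# Candidate utility functions over 3 bandit arms. K is a crude, made-up
# description-length proxy (in bits) for Kolmogorov complexity K(U).
CANDIDATES = {
    "prefers_arm2": {"utilities": [0.1, 0.5, 0.9], "K": 10},
    "prefers_arm0": {"utilities": [0.9, 0.5, 0.1], "K": 10},
    "indifferent":  {"utilities": [0.5, 0.5, 0.5], "K": 9},
}

def g_given_utility(agent_policy: int, utilities: list[float]) -> float:
    """Intelligence of the agent if L_pol were -U, with a uniform baseline
    over arms (computed exactly rather than by sampling)."""
    n = len(utilities)
    # Probability that a uniformly random arm does at least as well,
    # i.e. U(pi) >= U(agent's arm).
    p = sum(1 for u in utilities if u >= utilities[agent_policy]) / n
    return -math.log2(p)

def posterior(agent_policy: int) -> dict[str, float]:
    """P(U) proportional to 2^(-K(U) + g(G|U)), normalized over candidates."""
    scores = {
        name: 2.0 ** (-c["K"] + g_given_utility(agent_policy, c["utilities"]))
        for name, c in CANDIDATES.items()
    }
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

# For an agent that always picks arm 2, the posterior concentrates on the
# utility function that ranks arm 2 highest.
print(posterior(agent_policy=2))
```

The candidate under which the agent looks most intelligent wins, moderated by the simplicity prior: here the always-pick-arm-2 agent puts posterior 1/2 on the utility that ranks arm 2 highest, even though the indifferent utility is assumed one bit simpler.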

## Path to Impact

If this scheme for inferring utility functions works really well even in various limits (including limits of training data/compute/model size/distribution shifts), then this solves the hardest part of the alignment problem (by pointing precisely to human values via a generalized version of Inverse Reinforcement Learning).

## How we plan to do it

### Theoretically

### Experimentally

## Distillation

In order to do this properly, we will need to understand and distill large sections of Infra-Bayesian Physicalism. Part of the project will be publishing our understanding, and we hope that other people looking to understand and build on IBP will benefit from this distillation.

## Conclusion

That's where we are right now -- let us know what you think!

^{[1]} "Agentometer" and "Utiliscope" are not Vanessa's terminology.