Hey! This is another update from the distillers at AI Safety Info.

Here are a couple of the answers we wrote up over the last month (July 2023). As always, let us know in the comments if there are any questions you would like to see answered.

The list below links to the individual articles, while the collective URL above renders all of the answers on a single page.

These are the new articles. There has also been significant work on overhauling articles already live on the site to improve their quality, based on the feedback we have been receiving from readers.

Cross-posted to the EA Forum: https://forum.effectivealtruism.org/posts/ttBKSopeo59AedWZA/stampy-s-ai-safety-info-new-distillations-4-july-2023


Thanks for the work!

Quick questions:

  • do you have any stats on how many people visit aisafety.info every month? how many people end up wanting to get involved as a result?
  • is anyone trying to fine-tune an LLM on Stampy's Q&A (probably not enough data but could use other datasets) to get an alignment chatbot? Passing things in a large Claude 2 context window might also work?
plex:

We're getting about 20k uniques/month across the different URLs. We expect that to get much higher once we make a push for attention: once Rob Miles judges the quality good enough, we'll launch to LW and then in his videos.

Traffic is pretty low currently, but we've been improving the site during the distillation fellowships and we're hoping to make more of a real launch soon. And yes, people are working on a Stampy chatbot. (The current early prototype isn't fine-tuned on Stampy's Q&A but searches the alignment literature and passes things to a GPT context window.)
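To illustrate the retrieve-then-prompt pattern described above, here is a minimal, hypothetical sketch; it is not the actual Stampy chatbot code, and both helper functions below are placeholders for whatever retriever and LLM client the real prototype uses.

```python
# Hypothetical sketch of "search the literature, then stuff a context window".
# Not the Stampy prototype's code; the two helpers are stand-ins.
from typing import List

def search_alignment_corpus(question: str, k: int = 5) -> List[str]:
    """Placeholder: return the k passages from an indexed alignment corpus
    most relevant to the question (e.g. by embedding similarity)."""
    raise NotImplementedError

def generate_with_llm(prompt: str) -> str:
    """Placeholder: call whichever LLM API is in use and return its reply."""
    raise NotImplementedError

def answer(question: str) -> str:
    passages = search_alignment_corpus(question)
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the excerpts below.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate_with_llm(prompt)
```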

The answer on ARC doesn't cover mechanistic anomaly detection, which is the technical approach to ELK-like problems they've spent the last ~year on.

Thanks, I made a note on the doc for that entry and we'll update it.

Hi, just a quick comment regarding the power-seeking theorems post: the definition you give of "power" as expected utility of optimal behavior is not the same as that used in the power-seeking theorems.

The theorems are not about any particular agent, but are statements about processes which produce agents. The definition of power is more about the number of states an agent can access. Colloquially, they're more of the flavor "for a given optimizing-process, training it on most utility functions will cause the agent to take actions which give it access to a wide range of states".

Critically, this is a statement about behavior of different agents trained with respect to different utility functions, then averaged over all possible utility functions. Because you average over all possible utility functions, states with high power are just those which have more utility functions associated with them, which is, speaking loosely, proportional to the number of states the agent can get to from the state in question.

Moderate (85%) confidence; based on having read two of the three listed power-seeking papers a few months ago for MATS apps, and writing an unfinished summary about them.
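To make the counting intuition above concrete, here is a toy sketch (my own construction, not taken from the papers): in a one-step MDP where each state's reward is drawn i.i.d. uniform on [0, 1], a state's "power" in this loose sense is the expected value of its best reachable successor, which grows with the number of reachable states.

```python
import random

# Toy illustration (not from the papers): average optimal one-step value
# under random reward functions, as a stand-in for "power".

def average_optimal_value(num_reachable_states: int, samples: int = 100_000) -> float:
    """Monte Carlo estimate of E[max of num_reachable_states i.i.d. U(0,1) rewards]."""
    total = 0.0
    for _ in range(samples):
        total += max(random.random() for _ in range(num_reachable_states))
    return total / samples

# A state with 10 reachable successors has higher average optimal value
# (about 10/11) than one with 2 (about 2/3), so "most" random reward
# functions favor moving toward the option-rich state.
print(average_optimal_value(2))   # ~0.667
print(average_optimal_value(10))  # ~0.909
```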

I'm the author.

Colloquially, they're more of the flavor "for a given optimizing-process, training it on most utility functions will cause the agent to take actions which give it access to a wide range of states".

This refers to the fact that most utility functions are retargetable. But the most important part of the power-seeking theorems is the actual power-seeking, which is proven in the appendix of Parametrically Retargetable Decision-Makers Tend To Seek Power, so I don't agree with your summary.  

[...] the definition you give of "power" as expected utility of optimal behavior is not the same as that used in the power-seeking theorems. [...]

Critically, this is a statement about behavior of different agents trained with respect to different utility functions, then averaged over all possible utility functions.

There is no averaging over utility functions happening; the averaging is over reward functions. From Parametrically Retargetable Decision-Makers Tend To Seek Power: "a trained policy π seeks power when π’s actions navigate to states with high average optimal value (with the average taken over a wide range of reward functions)." This matches what I wrote in the article.

I do agree that utility functions are missing from the post, but they aren't averaged over.  They relate to the decision-making of the agent, and thus to the condition of retargetability that the theorems require. 

Hi, thanks for the response :) I'm not sure what distinction you're making between utility and reward functions, but as far as I can tell we're referring to the same object - the thing which is changed in the 'retargeting' process, the parameters theta - but feel free to correct me if the paper distinguishes between these in a way I'm forgetting; I'll be using "utility function", "reward function" and "parameters theta" interchangeably, but will correct if so.

I think perhaps we're just calling different objects 'agents' - I mean p(__ | theta) for some fixed theta (i.e. you can't swap the theta and still call it the same agent, on the grounds that in the modern RL framework, probably we'd have to retrain a new agent using the same higher-level learning process), and you perhaps think of this theta as an input to the agent, which can be changed without changing the agent? If this is the definition you are using, then I believe your remarks are correct. Either way, I think the relevant subtlety weakens the theorems a fair bit from what a first reading would suggest, and thus is worth talking about.

I'll outline my meaning more below to try and clarify:

The salient distinction I wanted to make was that this theorem applies over something like ensembles of agents - that if you choose a parameter-vector theta at random, the agent will have a random preference over observations; then because some actions immediately foreclose a lot of observations (e.g., dying prevents you from leaving room 1 of Montezuma's Revenge), these actions become unlikely. My point is that the nominal thrust of the theorems is weaker than proving that an agent will likely seek power; it proves that selecting from the ensemble of agents in this way will see agents seek power.

The difference is that this doesn't super clearly apply to a single agent. In particular, in real life, we do not select these theta uniformly at random at all; the model example is RLHF, the point of which is, schematically, to concentrate a bunch of probability mass within this overall space of thetas onto ones we like. Given that any selection process we actually use should be ~invariant to option variegation (i.e. shouldn't favour a family of outcomes more just because it has 'more members'), this severely limits the applicability of the theorems to practical agent-selection-processes - their premises (I think) basically amount to assuming option-variegation is the only thing that matters.

As a concrete example of what I mean, consider the MDP where you can choose either 0 to end the game, or 1 to continue, up to a total of T steps. Say the universe of possible reward functions is the integers 1 to T, representing the one timestep at which an agent gets reward for quitting, getting 0 otherwise. Also ignore time-discounting for simplicity. Then the average optimal policy will be to go to T/2 (past which point half of all possible rewards favor continuing, and half would have ended it earlier) - however this says nothing about what a particular trained agent will do. If we wanted to train an RL agent to end at, say, step 436, even if T is 10^6, the power-seeking theorems aren't a meaningful barrier to doing so because we specify the reward function rather than selecting it from an ensemble at random. Similarly, even though a room with an extra handful of atoms has combinatorially many more states than one without those extra atoms, we wouldn't reasonably expect an RL agent to prefer the former to the latter solely on the grounds that having more states means that more reward functions will favor those states.
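As a minimal sketch of that counting argument (the horizon T and the example quit-step 436 are taken from the comment above; the function name is mine):

```python
# Quit-or-continue toy MDP: reward function k in 1..T pays 1 only if the
# agent quits at exactly step k (no discounting), so the optimal policy
# under k is "continue until k, then quit".

T = 10**6  # horizon, as in the example above

def fraction_preferring_continue(t: int, horizon: int = T) -> float:
    """Fraction of reward functions k in 1..horizon whose optimal policy
    still continues at step t, i.e. those with k > t."""
    return (horizon - t) / horizon

# Averaged over the ensemble, "continue" is the majority-optimal action
# only while t < T/2 - a statement about the ensemble, not about any
# particular trained agent.
for t in (1, T // 4, T // 2, 3 * T // 4):
    print(t, fraction_preferring_continue(t))

# For a single, specified reward function (say k = 436) the optimal policy
# is simply to quit at step 436; the counting over the ensemble is irrelevant.
print("optimal quit step for k = 436:", 436)
```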

That said, the stronger view that individual agents trained will likely seek power isn't without support even with these caveats - V. Krakovna's work (which you also list) does seem to point more directly in the direction of particular agents seeking power, as it extends the theorems in the direction of out-of-distribution generalization. It seems more reasonable to model out-of-distribution generalization via this uniform-random selection than the overall reward-function selection, even as this still isn't a super-duper realistic model of the generalization, since it still depends on the option-variegation. I expect that if the universe of possible reward functions doesn't scale with the number of possible states (as it would not if you used a fixed-architecture NN to represent the reward function), this theorem would not go through in the same way.

Let me know your thoughts :)

Hi, thanks for the response :) I'm not sure what distinction you're making between utility and reward functions, but as far as I can tell we're referring to the same object - the thing which is changed in the 'retargeting' process, the parameters theta - but feel free to correct me if the paper distinguishes between these in a way I'm forgetting; I'll be using "utility function", "reward function" and "parameters theta" interchangeably, but will correct if so.

For me, utility functions are about decision-making, e.g. utility-maximization, while the reward functions are the theta, i.e. the input to our decision-making, which we are retargeting over; we can only do so for retargetable utility functions.

I think perhaps we're just calling different objects 'agents' - I mean p(__ | theta) for some fixed theta (i.e. you can't swap the theta and still call it the same agent, on the grounds that in the modern RL framework, probably we'd have to retrain a new agent using the same higher-level learning process), and you perhaps think of this theta as an input to the agent, which can be changed without changing the agent? If this is the definition you are using, then I believe your remarks are correct. Either way, I think the relevant subtlety weakens the theorems a fair bit from what a first reading would suggest, and thus is worth talking about.

I think the theta is not a property of the agent, but of the training procedure. Actually, Parametrically Retargetable Decision-Makers Tend To Seek Power is not about trained agents in the first place, so I'd say we're never really talking about different agents at all.

My point is that the nominal thrust of the theorems is weaker than proving that an agent will likely seek power; it proves that selecting from the ensemble of agents in this way will see agents seek power.

I agree with this if we constrain ourselves to Turner's work.

That said, the stronger view that individual agents trained will likely seek power isn't without support even with these caveats - V. Krakovna's work (which you also list) does seem to point more directly in the direction of particular agents seeking power, as it extends the theorems in the direction of out-of-distribution generalization. It seems more reasonable to model out-of-distribution generalization via this uniform-random selection than the overall reward-function selection, even as this still isn't a super-duper realistic model of the generalization, since it still depends on the option-variegation.

V. Krakovna's work still depends on the option-variegation, but we're not picking random reward functions, which is a nice improvement.

I expect that if the universe of possible reward functions doesn't scale with the number of possible states (as it would not if you used a fixed-architecture NN to represent the reward function), this theorem would not go through in the same way.

Does the proof really depend on whether the universe of possible reward functions scales with the number of possible states? It seems to me that you just need some reward from the reward function that the agent has not seen during training, so that we can retarget by swapping the rewards. For example, if our reward function is a CNN, we just need images which haven't been seen during training, which I don't think is a strong assumption, since we're usually not training over all possible combinations of pixels. Do you agree with this?

If you have concrete suggestions that you'd like me to change, you can click on the edit button on the article and leave a comment on the underlying Google Doc; I'd appreciate it :)

Maybe it's also useless to discuss this...

I think we agree modulo terminology, with respect to your remarks up to the part about the Krakovna paper, which I had to sit and think a little bit more about.

For the Krakovna paper, you're right that it has a different flavor than I remembered - it still seems, though, that the proof relies on having some ratio of recurrent vs. non-recurrent states. So if you did something like 1000x the number of terminal states, the reward function is 1000x less retargetable to recurrent states - I think this is still true even if the new terminal states are entirely unreachable as well?

With respect to the CNN example I agree, at least at a high level - though technically the theta reward vectors are supposed to be |S|-dimensional and specify the rewards for each state, which is slightly different from being the weights of a CNN - without redoing the math, it's plausible that an analogous theorem would hold. Regardless, the non-shutdown result gives retargetability because it assumes there's a single terminal state and many recurrent states. The retargetability is really just the ratio (number of recurrent states) / (number of terminal states), which needn't be greater than one.
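To make that ratio concrete, a back-of-the-envelope sketch (my own toy construction, under the simplifying assumption that each reward function puts reward 1 on a single favored state drawn uniformly over all states):

```python
# Counting sketch: with indicator reward functions (reward 1 on one favored
# state, 0 elsewhere), the fraction whose optimal behavior avoids the
# terminal (shutdown) states is just a ratio of state counts.

def fraction_favoring_recurrent(n_recurrent: int, n_terminal: int) -> float:
    return n_recurrent / (n_recurrent + n_terminal)

print(fraction_favoring_recurrent(1_000, 1))      # ~0.999: near-unanimous "avoid shutdown"
print(fraction_favoring_recurrent(1_000, 1_000))  # 0.5: 1000x-ing terminal states dilutes
                                                  # the counting argument
```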

Anyways, as the comments from TurnTrout talk about, as soon as there's a nontrivial inductive bias over these different reward functions (or any other path-dependence-y stuff that deviates from optimality), the theorem doesn't go through, as retargetability is all based on counting how many of the functions in that set are A-preferring vs. B-preferring - there may be an adaptation to the argument that uses some prior over generalizations and stuff, though - but then that prior is the inductive bias, which, as you noted with those TurnTrout remarks, is its own whole big problem :')

I'll try to add a concise caveat to your doc; thanks for the discussion :)