Zachary Robertson


Doing discourse better: Stuff I wish I knew

Yes, but StackExchange has community posts that editable and I think this is nice. I believe edits for normal posts work like you say.

Doing discourse better: Stuff I wish I knew

It could still be useful to see different ‘versions’ of an article and then just vote on the ones that are best.

ricraz's Shortform

Ya, totally messed up that. I meant the AI Alignment Forum or AIAF. I think out of habit I used AN (Alignment Newsletter)

ricraz's Shortform

On AI alone (which I am using in large part because there's vaguely more consensus around it than around rationality), I think you wouldn't have seen almost any of the public write-ups (like Embedded Agency and Zhukeepa's Paul FAQ) without LessWrong

I think a distinction should be made between intellectual progress (whatever that is) and distillation. I know lots of websites that do amazing distillation of AI related concepts (literally I think most people would agree that sort of work is important in order to make intellectual progress, but I also think significantly less people would agree distillation is intellectual progress. Having this distinction in mind, I think your examples from AI are not as convincing. Perhaps more so once you consider the Less Wrong is often being used more as a platform to share these distillations than to create them.

I think you're right that Less Wrong has some truly amazing content. However, once again, it seems a lot of these posts are not inherently from the ecosystem but are rather essentially cross-posted. If I say a lot of the content on LW is low-quality it's mostly an observation about what I expect to find from material that builds on itself. The quality of LW-style accumulated knowledge seems lower than it could be.

On a personal note, I've actively tried to explore using this site as a way to engage with research and have come to a similar opinion as Richard. The most obvious barrier is the separation between LW and AIAF. Effectively, if you're doing AI safety research, to second-order approximation you can block LW (noise) and only look at AIAF (signal). I say to second-order because anything from LW that is signal ends up being posted on AIAF anyway which means the method is somewhat error-tolerant.

This probably comes off as a bit pessimistic. Here's a concrete proposal I hope to try out soon enough. Pick a research question. Get a small group of people/friends together. Start talking about the problem and then posting on LW. Iterate until there's group consensus.

ricraz's Shortform

Setting them higher (standards) probably wouldn't result in more good content.

I broadly agree here. However, I do see the short-forms as a consistent way to skirt around this. I'd say at least 30% of the Less Wrong value proposition are the conversations I get to have. Short-forms seem to be more adapted for continuing conversations and they have a low bar for being made.

I could clarify a bit. My main problem with low quality content isn't exactly that it's 'wrong' or something like that. Mostly, the issues I'm finding most common for me are,

  1. Too many niche pre-requisites.
  2. No comments
  3. Nagging feeling post is reinventing the wheel

I think one is a ridiculously bad problem. I'm literally getting a PhD in machine learning, write about AI Safety, and still find a large number of those posts (yes AN posts) glazed in internal-jargon that makes it difficult to connect with current research. Things get even worse when I look at non-AI related things.

Two is just a tragedy of the fact the rich get richer. While I'm guilty of this also, I think that requiring posts to also post seed questions/discussion topics in the comments could go a long way to alleviate this problem. I oftentimes read a post and want to leave a comment, but then don't because I'm not even sure the author thought about the discussion their post might start.

Three is probably a bit mean. Yet, more than once I've discovered a Less Wrong concept already had a large research literature devoted to it. I think this ties in with one due to the fact niche pre-reqs often go hand-in-hand with insufficient literature review.

ricraz's Shortform

One fairly strong belief of mine is that Less Wrong's epistemic standards are not high enough to make solid intellectual progress here.

I think this is literally true. There seems to be very little ability to build upon prior work.

Out of curiosity do you see Less Wrong as significantly useful or is it closer to entertainment/habit? I've found myself thinking along the same lines as I start thinking about starting my PhD program etc. The utility of Less Wrong seems to be a kind of double-edged sword. On the one hand, some of the content is really insightful and exposes me to ideas I wouldn't otherwise encounter. On the other hand, there is such an incredible amount of low-quality content that I worry that I'm learning bad practices.

Developmental Stages of GPTs

The paper doesn't draw the causal diagram "Power → instrumental convergence", it gives sufficient conditions for power-seeking being instrumentally convergent. Cycle reachability preservation is one of those conditions.

This definitely feels like the place where I'm missing something. What is the formal definition of 'power seeking'? My understanding is that power is the rescaled value function, in the limit of farsightedness is decreasing, and in the context of terminal state reachability always goes to zero. The agent literally gives up power to achieve it's goal.

Now, I realize this might just be naming convention confusion. I do, I think, understand the idea that preserving cycle reachability could be instrumental. However,

Cycle reachability preservation is one of those conditions.

this seems circular to me. My understanding of figure 7 of your paper indicates that cycle reachability cannot be a sufficient condition.

You can formalize a kind of "alignment capability" by introducing a joint distribution over the human's goals and the induced agent goals

This is very interesting to me. Thank you for sharing. I wonder what you mean by,

The point isn't that alignment is impossible, but that you have to hit a low-measure set of goals which will give you aligned or non-power-seeking behavior.

Given your definitions it's clear that the set of aligned goals must be low-measure. Also by your reasoning 'non-power seeking behavior' is not instrumental. However, in a curricula, power-seeking must be instrumental or else the agent is less likely to achieve it's goals. It seems there's a two out of three condition (aligned/general/non-power-seeking) here. My philosophy is that aligned/general is OK based on a shared (?) premise that,

If the rewards are -close in sup-norm, then you can get nice regret bounds, sure.

Developmental Stages of GPTs

Thanks for the comment! I think max-ent brings up a related point. In IRL we observed behavior and infer a reward function (using max-ent also?). Ultimately, there is a relationship between state/action frequency and reward. This would considerably constrain the distribution of reward functions to be considered in instrumental/power analysis.

I think I get confused about the usage of power the most. It seems like you can argue that given a random reward to optimize the agent will try to avoid getting turned off without invoking power. If there's a collection of 'turned-off' terminal states where the agent receives no further reward for all time then every optimized policy will try to avoid such a state. It seems as though we could define for each and then we'd have,

It seems like this would extend out to a full definition. The advantage here being that you can say, “If one action in this state is more instrumental than another then the return is likely to be greater as well”.

I imagine that this is sufficient for the catastrophic power-stealing incentives

I'm not confident analysis in the single-agent case extends to the multi-agent setting. If our goal is fixed as and the agent's varies then I might argue it's instrumental for us to align the agent's goal with ours and vice versa. In general, I'd suspect that there are goals we could give the agent that significantly reduce our gain. However, I'd also suspect the opposite.

Say we have the capability to introduce a second agent with a reward . Would we want to introduce the agent? It seems reasonable to argue that we would if we could guarantee . There might be a way to argue over randomness and say this would double our gain. More speculatively, what if ?

Developmental Stages of GPTs

I think this is a slight misunderstanding of the theory in the paper.

I disagree. What I'm trying to do is outline a reinterpretation of the 'power seeking' claim. I'm citing the pre-task section and theorem 17 to insist that power-seeking can only really happen in the pre-task because,

The way the theory does this is by saying that first a reward function is drawn from the distribution, then it is given to the agent, then the agent thinks really hard, and then the agent executes the optimal policy.

The agent is done optimizing before the main portion of the paper even begins. I do not see how the agent 'seeks' out powerful states because, as you say, the agent is fixed. Now, when you say,

If we do not know an agent's goal, but we know that the agent knows its goal and is optimal w.r.t it, then from our perspective the agent is more likely to go to higher-power states. (From the agent's perspective, there is no probability, it always executes the deterministic perfect policy for its reward function.)

My issue is that the Figure 19 shows an example where the agent doesn't display this behavior. Tautologically, the agent tends to do what is instrumentally convergent. If power was tied to instrumental convergence then we could also say the agent tends to do what is powerful. However, it seems as though a state can be arbitrarily powerful without having the instrumental property which breaks the analogy.

From here I could launch a counter-argument: if power can be arbitrarily removed from the instrumental convergence phenomena then agent 'wireheading', while a powerful state, is sufficiently out of the way from most goals that the agent most likely won't. To be clear, I don't have any strong opinions, I'm just confused about these interpretive details.

Developmental Stages of GPTs

I appreciate the more concrete definition of IC presented here. However, I have an interpretation that is a bit different from you. I'm following the formal presentation.

My base understanding is that a cycle with max average reward is optimal. This is essentially just a definition. In the case the agent doesn't know the reward function, it seems clear that the agent ought to position it's self in a state which gives it access to as many of these cycles as possible.

In your paper, theorem 19 suggests that given a choice between two sets of 1-cycles and the agent is more likely to select the larger set. This makes sense. What doesn't make sense is the conclusion (theorem 17) that the agent selects states with more power. This is because at the very start of the paper it's mentioned that,

As an alternative motivation, consider an agent in a communicating MDP which is periodically assigned a task from a known distribution . Between tasks, the agent has as much downtime as required. To maximize return over time, the optimal policy during downtime is to navigate to the state with maximal .

According to theorem 17, loosing access to states means that power goes down (or stays constant). This seems to indicate power (cycle access) is really some sort of Lyapunov function for the dynamics. So at the outset, it seems clear that the agent will prefer states that maximize power, but then as soon as a determination is made on what the actual reward function is, power goes down, not up.

What I'm trying to point out here is that I find the distinction between pre-task optimization and execution to be loose. This is to such a degree that I find myself drawing the exact opposite conclusion: agents optimizing a generic reward will tend to give-up power.

At the moment, I find myself agreeing with the idea that an agent unaware of it's task will seek power, but also conclude that an agent aware of it's task will give-up power. My current opinion is that power seeking behavior is concentrated in the pre-task step. Giving the AI unrestricted 'free-time' to optimize with should 'never' be allowed. Now, I could be misunderstanding parts of the paper, but hopefully I've made things clear enough!

Load More