All of Chantiel's Comments + Replies

Open and Welcome Thread - April 2021

I've been working on defining "optimizer", and I'm wondering what people consider to be, or not be, an optimizer. I'm planning on talking about it in my own post, but I'd like to ask here first because I'm a scaredy cat.

I know a person or AI refining plans or hypotheses would generally be considered an optimizer.

What about systems that evolve? Would an entire population of a type of creature be its own optimizer? It's optimizing for the genetic fitness of the individuals, so I don't see why it wouldn't be. Evolutionary programming just emulates it, and it'... (read more)
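To make the evolution-as-optimizer framing concrete, here is a minimal sketch (the toy fitness function and all parameters are made up for illustration) of the loop that both a biological population and an evolutionary program implement: no individual is "trying" to optimize anything, yet the population-level process pushes fitness up generation after generation.

```python
import random

def fitness(genome):
    # Toy stand-in for "genetic fitness": more 1-bits = fitter.
    return sum(genome)

def evolve(pop_size=50, genome_len=20, generations=100, mutation_rate=0.05):
    # Random initial population of bit-string genomes.
    population = [[random.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: fitter genomes are more likely to reproduce.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # Reproduction with mutation.
        children = [[bit if random.random() > mutation_rate else 1 - bit
                     for bit in parent]
                    for parent in parents]
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print(fitness(best))  # Fitness has been driven up by the population-level loop.
```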

Open & Welcome Thread - January 2021

I agree that intelligent agents have a tendency to seek power, and that this is a large part of what makes them dangerous. Agents could potentially cause catastrophes in other ways, but I'm not sure whether any of those are realistic.

As an example, suppose an agent creates powerful self-replicating nanotechnology that makes a pile of paperclips, the agent's goal. However, because the agent didn't want to spend the time engineering a way to stop the replication, the self-replicating nanobots eat the world.

But catastrophes like this would probably also be dealt w... (read more)

Open & Welcome Thread - January 2021

I hadn't thought about the distinction between gaining and using resources. You can still wreak havoc without gaining resources, though, by using the ones you already have in a damaging way. But I can see why the distinction might be helpful to think about.

It still seems to me that an agent using equation 5 would pretty much act like a human imitator for anything that takes more than one step, so that's why I was using it as a comparison. I can try to explain my reasoning if you want, but I suppose it's a moot point now. And I don't know if I'm right, anyways.

Basically, ... (read more)

TurnTrout (3mo): I explain my thoughts on this in The Catastrophic Convergence Conjecture [https://www.lesswrong.com/s/7CdoznhJaLEKHwvJW/p/w6BtMqKRLxG9bNLMr]. Not sure if you've read that, or if you think it's false, or you have another position entirely.
Open & Welcome Thread - January 2021

Is there much the reduced-impact agent with reward shaping could do that an agent using human mimicry couldn't?

Perhaps it could improve over mimicry by being able to consider all actions, while a human mimic would, in effect, only consider the actions a human would take. But I don't think there are usually many single-step actions to choose from, so I'm guessing this isn't a big benefit. Could the performance improvement instead come from understanding the current state better than a mimic could? I'm not sure when this would make a big difference, though.

I'm also still co... (read more)
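To illustrate the comparison being made here (the value functions and the "human-like actions" set below are placeholders, not anything from the post), a mimic effectively searches only over actions a human would plausibly take, while the reduced-impact agent can rank every available action under reward minus penalty:

```python
def mimic_action(state, human_like_actions, human_score):
    # A human imitator effectively searches only over the actions a human
    # would plausibly take, ranked by how human-typical they are.
    return max(human_like_actions(state),
               key=lambda a: human_score(state, a))

def reduced_impact_action(state, all_actions, reward, penalty, lam=1.0):
    # A reduced-impact agent can rank every available action, trading off
    # task reward against an impact penalty scaled by lambda.
    return max(all_actions(state),
               key=lambda a: reward(state, a) - lam * penalty(state, a))
```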

Open & Welcome Thread - January 2021

I have a question about attainable utility preservation. Specifically, I read the post "Attainable Utility Preservation: Scaling to Superhuman", and I'm wondering how an agent using the attainable utility implementation in equations 3, 4, and 5 could actually be superhuman. I've been misunderstanding and mis-explaining things recently, so I'm asking here instead of on the post for now to avoid wasting an AI safety researcher's time.

The equations incentivize the AI to take actions that will provide an immediate reward in the next timestep, but pe... (read more)

TurnTrout (3mo): I basically don't see the human mimicry frame as a particularly relevant baseline. However, I think I agree with parts of your concern, and I hadn't grasped your point at first.

I'd consider a different interpretation. The intent behind the equations is that the agent executes plans using its "current level of resources", while being seriously penalized for gaining resources. It's as if you were allowed to explore: you're currently on land 1,050 feet above sea level, and you can only walk on land with elevation between 1,000 and 1,400 feet. That's the intent. The equations don't fully capture it, and I'm pessimistic that there's a simple way to capture it (Elicit prediction: elicit.org/binary/questions/GFurWKpJn).

I agree that the agent might be penalized hard here, and this is one reason I'm not satisfied with equation 5 of that post. It penalizes the agent for moving towards its objective. This is weird, and several other commenters share this concern. Over the last year I've come to think that "penalize own AU gain" is worse than "penalize average AU gain", in that the latter penalty equation leads to more sensible incentives. I still think there might be some good way to penalize the agent for becoming more able to pursue its own goal; equation 5 isn't it, and I think that part of your critique is broadly right.
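Roughly, and glossing over the details of the actual equations in the post, the two penalty shapes being compared look something like the following sketch, where `q_own` and the `q_aux` functions stand in for attainable-utility (Q-value) estimates:

```python
def own_au_penalty(q_own, state, action, noop):
    # "Penalize own AU gain": penalize the change in the agent's ability
    # to pursue *its own* goal, relative to doing nothing.
    return abs(q_own(state, action) - q_own(state, noop))

def average_au_penalty(q_aux_list, state, action, noop):
    # "Penalize average AU gain": penalize the average change in attainable
    # utility across a set of auxiliary goals instead.
    diffs = [abs(q(state, action) - q(state, noop)) for q in q_aux_list]
    return sum(diffs) / len(diffs)

# Either penalty is subtracted from the task reward, scaled by a
# regularization coefficient: R(s, a) - lam * penalty.
```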
TurnTrout (3mo): In the "superhuman" analysis post, I was considering whether that reward function would incentivize good policies if you assumed a superintelligently strong optimizer optimized that reward function.

Not necessarily; an optimal policy maximizes the sum of discounted reward over time, and so it's possible for the agent to take actions which aren't locally rewarding but which lead to long-term reward. For example, in a two-step game where I can get rewarded on both time steps, I'd pick actions $a_1, a_2$ which maximize $R(s_1, a_1) + \gamma R(s_2, a_2)$. In this case, $R(s_1, a_1)$ could be 0, but the pair of actions could still be optimal. This idea is called "reward shaping" and there's a good amount of literature on it!
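To put numbers on the two-step example (the reward values below are made up), the plan with zero immediate reward can still come out ahead once the discounted second-step reward is counted:

```python
GAMMA = 0.9  # discount factor

# Made-up two-step game: each plan is (R(s1, a1), R(s2, a2)).
plans = {
    "greedy": (1.0, 0.5),  # grabs immediate reward, weak follow-up
    "shaped": (0.0, 2.0),  # zero immediate reward, big second-step reward
}

def discounted_return(r1, r2, gamma=GAMMA):
    # Optimal policies maximize R(s1,a1) + gamma * R(s2,a2), not just R(s1,a1).
    return r1 + gamma * r2

for name, (r1, r2) in plans.items():
    print(name, discounted_return(r1, r2))
# greedy -> 1.45, shaped -> 1.8: the plan with zero immediate reward is optimal.
```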
The Gears of Impact

Thanks for the link. It turns out I missed some of the articles in the sequence. Sorry for misunderstanding your ideas.

I thought about it, and I don't think your agent would have the issue I described.

Now, if the reward function was learned using something like a universal prior, then other agents might be able to hijack the learned reward function to make the AI misbehave. But that concern is already known.   

The Gears of Impact

Thanks for the response.

In my comment, I imagined the agent used evidential or functional decision theory and cared about the actual paperclips in the external state. But I'm concerned other agent architectures would result in misbehavior for related reasons.

Could you describe what sort of agent architecture you had in mind? I'm imagining you're thinking of an agent that learns a function for estimating future state, percepts, and reward based on the current state and the action taken. And I'm imagining the system uses some sort of learning algorithm that ... (read more)

TurnTrout (3mo): See e.g. my most recent AUP paper [https://www.lesswrong.com/s/7CdoznhJaLEKHwvJW/p/S8AGyJJsdBFXmxHcb], equation 1, for simplicity. Why would optimal policies for this reward function have the agent simulate copies of itself, or why would training an agent on this reward function incentivize that behavior?

I think there's an easier way to break any current penalty term, thanks to Stuart Armstrong [https://www.lesswrong.com/s/7CdoznhJaLEKHwvJW/p/S8AGyJJsdBFXmxHcb#Appendix__Remaining_Problems]: the agent builds a successor which ensures that the no-op leaves the agent totally empowered and safe, and so no penalty is applied.
The Gears of Impact

I realized both explanations I gave were overly complicated and confusing. So here's a newer, hopefully much easier to understand, one:

I'm concerned a reduced-impact AI will reason as follows:

"I want to make paperclips. I could use this machinery I was supplied with to make them. But the paperclips might be low quality, I might not have enough material to make them all, and I'll have some impact on the rest of the world, potentially large ones due to chaotic effects. I'd like something better.

What if I instead try to take over the world and make huge numbe... (read more)

TurnTrout (3mo): Thanks for all your effort in explaining this concern. I think this basically relies on the AI using a decision theory / utility function pair that seems quite different from what would be selected for by RL / an optimal policy for the reward function. It's not optimizing "make myself think I've completed the goal without having gained power according to some kind of anthropic measure over my possible observer moments", but instead it's selecting actions according to a formal reward function criterion.

It's possible for the latter to incentivize behavior which looks like the former, but I don't see why AUP would particularly incentivize plans like this. That said, we'd need to propose a (semi-)formal agent model in order to ground out that question.
The Gears of Impact

Oh, I'm sorry; I looked through the posts I'd read to see where to add the comment and apparently chose the wrong one.

Anyways, I'll try to explain better. I hope I'm not just crazy.

An agent's beliefs about which world it's currently in influence its plans. But its plans also have the potential to influence its beliefs about which world it's currently in. For example, if the AI originally thinks it's not in a simulation, but then plans to make lots of simulations of itself, it would come to think it's more likely that it currently is in a simulation. Similarly,... (read more)

The Gears of Impact

Am I correct that counterfactual environments for computing impact in a reduced-impact agent would need to include acausal connections, or the AI would need some sort of constraint on the actions or hypotheses considered, for the impact measure to work correctly?

If it doesn't consider acausal impacts, then I'm concerned the AI would consider this strategy: act like you would if you were trying to take over the world in base-level reality. Once you succeed, act like you would if you were in base-level reality and trying to run an extremely large number of ... (read more)

TurnTrout (3mo): (This post isn't proposing an impact measure; it's doing philosophy by explaining the right way to understand 'impact' as it relates to multi-agent interactions. There's nothing here about computing impact, so perhaps you meant to comment on a post like Avoiding Side Effects in Complex Environments [https://www.lesswrong.com/posts/5kurn5W62C5CpSWq6/avoiding-side-effects-in-complex-environments]?)

I don't understand your point here, but I'll explain how I think about these kinds of twists. Modern impact measurement proposals offer reward functions for a reinforcement learning agent. The optimal policies for such a reward function will act in order to maximize total discounted reward starting from each state. There might be twists to this picture which depend on the training process incentivizing different kinds of cognition or resulting in different kinds of learned policies. That said, the relevance and plausibility of any twist should be explained: how would it arise from either an optimal policy, or from the dynamics of RL training via e.g. SGD?