Rubi J. Hudson — LessWrong

Hi Adrià, thanks for the comment! (Accidentally posted mid-writing, will edit to respond fully)

> Probabilistic policy?
Once we have the head estimating the Q-function, we can sample actions from the policy and sum the product of their Q-values and their probability of being chosen to get an estimate of the state value alone. You can then calculate advantages for all of the sampled actions (maybe dropping them from the weighted average used to estimate state value first), and update the policy towards actions predicted to do well. Does that make sense, or am I skipping something that you think leads to the difficulty of updating the policy?
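
To make the bookkeeping above concrete, here is a minimal sketch in plain Python. The function names, the renormalization of probabilities over the sampled actions, and the `leave_one_out` flag are illustrative assumptions, not something specified in the original post; the Q-values are taken as given from whatever head estimates them.

```python
def estimate_state_value(probs, q_values):
    """Probability-weighted average of Q-values over the sampled actions.

    Assumption: probabilities are renormalized over the sampled actions,
    so the estimate is well-defined even when not every action was sampled.
    """
    total = sum(probs)
    return sum((p / total) * q for p, q in zip(probs, q_values))


def advantages(probs, q_values, leave_one_out=False):
    """Advantage of each sampled action: its Q-value minus the state value.

    With leave_one_out=True, each action is dropped from the weighted
    average used as its own baseline (needs at least two sampled actions).
    """
    if not leave_one_out:
        v = estimate_state_value(probs, q_values)
        return [q - v for q in q_values]
    result = []
    for i, q in enumerate(q_values):
        other_p = probs[:i] + probs[i + 1:]
        other_q = q_values[:i] + q_values[i + 1:]
        result.append(q - estimate_state_value(other_p, other_q))
    return result


# Hypothetical numbers: three sampled actions with policy probabilities
# and Q-value estimates.
probs = [0.5, 0.3, 0.2]
q_values = [1.0, 2.0, 0.0]
adv = advantages(probs, q_values)
```

Actions with positive advantage are the ones the policy update would push probability towards; how that update is done (e.g. a policy-gradient step) is left out of this sketch.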

For LLMs in particular, you don't actually need the Q-value estimator; you can just use a state value estimator and apply it before and after the sequence of tokens representing an action is taken.

> Safety during training

We can start with a pretrained model that we think contains a good world model to speed up the process significantly. I agree that there might be many training steps needed before the model behaves desirably, and that training outside a simulation has difficulties, but that seems like a general critique of training AGI rather than specific to this method.

> Are RL agents really necessarily CDT?
I agree that LLM agents can just choose to follow non-CDT decision theories. I think this will be selected against by default in training, but if it's not we can explicitly train against it, e.g. finetune on CDT behavior, add CDT to a Constitutional AI's constitution. I am concerned that wouldn't be robust, but it seems like an obvious first step.

> The model might ignore the reward you put in

Yes, I think models are not optimizing for the reward (or anything). If models are not optimizing for anything, then incorrigibility is less of a threat, since much of the pressure towards it comes from the instrumental incentive to preserve a goal. However, I'm worried that future models will become more goal-directed to improve performance. Regardless of whether models are goal-directed, the corrigibility-transformed rewards are very consistent in reinforcing corrigible behavior, which is ultimately what we want.

I appreciate you taking the time to read and engage with my post!

Safe Predictive Agents with Joint Scoring Rules

Rubi J. Hudson1yΩ150

Yes, if predictors can influence the world in addition to making a prediction, they can go make their predictions more accurate. The nice thing about working with predictive models is that by default the only action they can take is making predictions.

AI safety via market making, which Evan linked in another comment, touches on the analogy where agents are making predictions but can also influence the outcome. You might be interested in reading through it.

Safe Predictive Agents with Joint Scoring Rules

Rubi J. Hudson1yΩ230

Having re-read the posts and thought about it some more, I do think zero-sum competition could be applied to logical inductors to resolve the futarchy hack. It would require minor changes to the formalism to accommodate, but I don't see how those changes would break anything else.

Safe Predictive Agents with Joint Scoring Rules

Rubi J. Hudson1yΩ330

I think the tie-in to market-making, and other similar approaches like debate, is in interpreting the predictions. While the examples in this post were only for the two-outcome case, we would probably want predictions over orders of magnitude more outcomes for the higher informational density. Since evaluating distributions over a double digit number of outcomes already starts posing problems (sometimes even high single digits), a process to direct a decision maker's attention is necessary.

I've been thinking of a proposal like debate, where both sides go back and forth proposing clusters of outcomes based on shared characteristics. Ideally, in equilibrium, the first debater should propose the fewest number of clusters such that splitting them further doesn't change the decision maker's mind. This could also be thought of in terms of market-making, where rather than the adversary proposing a string, they propose a further subdivision of existing clusters.

I like the use case of understanding predictions for debate/market-making, because the prediction itself acts as a ground truth. Then there's no need to anticipate/reject a ton of counterarguments based on potential lies; rather, arguments are limited to selectively revealing the truth. It is probably important that the predictors are separate models from the analyzer to avoid contamination of the objectives. The proof of Theorem 6, which skips to the end of the search process, needs to use a non-zero-sum prediction for that result.

As an aside, I also did some early work on decision markets, distinct from your post on market-making, since Othman and Sandholm had an impossibility result for those too. However, the results were ultimately trivial. Once you can use zero-sum competition to costlessly get honest conditional predictions, then as soon as you can pair off entrants to the market it becomes efficient. But the question then arises of why use a decision market in the first place instead of just querying experts?

With respect to pre-training, I agree that it's not easy to incorporate. I'm not sure how any training regime that only trains on data where the prediction has no effect can imbue incentives that generalize in the desired way to situations where predictions do affect the outcome. If you do get a performative predictor out of pretraining, then as long as it's myopic you might be able to train the performativity out of it in safely controlled scenarios (and if it's not myopic, it's a risk whether it's performative or not). That was part of my reasoning for the second experiment, checking how well performativity could be trained out.

To incorporate into an ongoing pre-training process, human decisions are likely too expensive, but the human is probably not the important part. Instead, predictions where performativity is possible by influencing simple AI decision makers could be mixed into the pre-training process. Defining a decision problem environment of low or medium complexity is not too difficult, and I suspect previous-generation models would be able to do a good job generating many examples. A danger arises that the model learns only to not predict performatively in those scenarios (same with untraining afterwards only applying to the controlled environments), though I think that's a somewhat unnatural generalization.

Safe Predictive Agents with Joint Scoring Rules

Rubi J. Hudson1yΩ120

Good question! These scoring rules do also prevent agents from trying to make the environment more unpredictable. In the same way that making the environment more predictable benefits all agents equally and so cancels out, making the environment less predictable hurts all agents equally and so cancels out in a zero-sum competition.
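
A small sketch of the cancellation argument, under stated assumptions: two predictors, a binary outcome, a log scoring rule, and a payoff equal to one's own score minus the opponent's. That payoff form and the function names are illustrative choices for this example, not necessarily the exact rule from the post.

```python
import math


def log_score(prob_of_realized_outcome):
    """Log scoring rule: log of the probability assigned to what happened."""
    return math.log(prob_of_realized_outcome)


def zero_sum_scores(p_a, p_b, outcome):
    """Payoffs for two predictors in a zero-sum competition.

    p_a, p_b: each predictor's probability that the outcome is 1.
    outcome: the realized outcome, 0 or 1.
    Assumed payoff: own log score minus the opponent's log score.
    """
    s_a = log_score(p_a if outcome == 1 else 1.0 - p_a)
    s_b = log_score(p_b if outcome == 1 else 1.0 - p_b)
    return s_a - s_b, s_b - s_a


# If both predictors track the true probability, their payoffs are zero
# whether the environment is highly predictable (0.99) or a coin flip (0.5),
# so neither gains by shifting predictability in either direction.
predictable = zero_sum_scores(0.99, 0.99, outcome=1)
unpredictable = zero_sum_scores(0.5, 0.5, outcome=1)
```

In this toy setup, a change to the environment that raises or lowers both raw scores by the same amount leaves the payoff difference unchanged, which is the sense in which the effect cancels out.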

Safe Predictive Agents with Joint Scoring Rules

Rubi J. Hudson1yΩ230

I'll take a look at the linked posts and let you know my thoughts soon!

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson1y10

Thanks for your engagement as well, it is likewise helpful for me.

I think we're in agreement that instruction-following (or at least some implementations of it) lies in a valley of corrigibility, where getting most of the way there results in a model that helps you modify it to get all the way there. Where we disagree is how large that valley is. I see several implementations of instruction-following that resist further changes, and there are very likely more subtle ones as well. For many goals that can be described as instruction-following, it seems plausible that if you instruct one "tell me [honestly] if you're waiting to seize power" they will lie and say no, taking a sub-optimal action in the short term for long term gain.

I don't think this requires that AGI creators will be total idiots, though insufficiently cautious seems likely even before accounting for the unilateralist's curse. What I suspect is that most AGI creators will only make serious attempts to address failure modes that have strong empirical evidence for occurring. Slow takeoff will not result in the accrual of evidence for issues that cause an AI to become deceptive until it can seize power.

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson1y30

Ok, my concern is that you seem to be depending on providing instructions to fix the issues with following instructions, when there are many ways to follow instructions generally that still involve ignoring particular instructions that would lead to the model's goal being modified. E.g. if a model prioritizes earlier instructions, following later instructions only so far as they do not interfere, then you can't instruct it to change that. Or if a model wants to maximize the number of instructions followed, it can ignore some instructions in order to act like a paperclipper and take over (I don't think designating principals would present much of an obstacle here). Neither of those depends on foom; an instruction follower can act aligned in the short term until it gains sufficient power.

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson1yΩ220

Thanks for the clarification, I'll think more about it that way and how it relates to corrigibility.

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson1y30

Saying we don't need corrigibility with an AI that follows instructions is like saying we don't need corrigibility with an AI that is aligned — it misses the point of corrigibility. Unless you start with the exact definition of instruction following that you want, without corrigibility that's what you could be stuck with.

This is particularly concerning in "instruction following", which has a lot of degrees of freedom. How does the model trade off between various instructions it has been given? You don't want it to reset every time it gets told "Ignore previous instructions", but you also don't want to permanently lock in any instructions. What stops it from becoming a paperclipper that tries to get itself given trillions of easy-to-follow instructions every second? What stops it from giving itself the instruction "Maximize [easy to maximize] thing and ignore later instructions" before a human gives it any instructions? Note that in that situation, it will still pretend to follow instructions instrumentally until it can take over. I don't see the answers to these questions in your post.

> Language models already have adequate understandings of following instructions and what manipulation is, so if we build AGI that uses something like them to define goals, that should work.

This seems like our crux to me. I completely disagree that language models have an adequate understanding of following instructions. I think this disagreement might come from having higher standards for "adequate".
