Comments

Thanks for taking the time to write out your response. I think the last point you made gets at the heart of our difference in perspectives. 

  • You could hope for substantial coordination to wait for bigger models that you only use via CPM, but I think bigger models are much riskier than well-elicited small models, so this seems to just make the situation worse, putting aside coordination feasibility.

If we're looking at current LLMs and asking whether conditioning provides an advantage in safely eliciting useful information, then for the most part I agree with your critiques. I also agree that bigger models are much riskier, but I expect we're going to get them anyway. Those more powerful models bring new potential issues, like predicting manipulated observations and performative prediction, that we don't see in current systems. Strategies like RLHF also become riskier, as deceptive alignment becomes more of a live possibility with greater capabilities.

My motivation for this approach is to raise awareness of, and address, the risks that seem likely to arise in future predictive models, regardless of the ends to which they're used. Success in avoiding the dangers from powerful predictive models would then open the possibility of using them to reduce all-cause existential risk.

I'd be very interested in hearing the reasons why you're skeptical of the approach, even a bare-bones outline if that's all you have time for.

Ah, ok, I see what you're saying now. I don't see any reason why restricting to input-space counterfactuals wouldn't work, beyond the issues described with predictor-state counterfactuals. There might be a performance hit from needing to make larger changes, and in the worst case a larger minimum change size might make it harder to specify the direct reporter.

Sorry, I'm not quite clear what you mean by this, so I might be answering the wrong question.

I believe counterfactuals on the input space are a subset of counterfactuals on the predictor's state, because the input's influence flows entirely through the predictor's state, while modifying the predictor's state directly can also reach states that don't correspond to any input. As such, I don't think counterfactuals on the input space add any power to the proposal.
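As a minimal sketch of why the reachable sets nest (the encoder/head split and all names here are hypothetical, purely to illustrate the containment): input-space counterfactuals can only produce internal states of the form encode(x'), while state-space counterfactuals can set the internal state to arbitrary values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy predictor split into an encoder (input -> internal state)
# and a head (internal state -> prediction).
W_enc = rng.normal(size=(4, 8))
W_head = rng.normal(size=(8, 2))

def encode(x):
    # The internal state is a deterministic function of the input.
    return np.tanh(x @ W_enc)

def predict_from_state(h):
    return h @ W_head

x = rng.normal(size=4)

# Input-space counterfactual: perturb the input, then re-encode.
# The resulting state is always encode(x') for some input x'.
h_from_input = encode(x + rng.normal(scale=0.1, size=4))

# Predictor-state counterfactual: edit the internal state directly.
# This need not equal encode(x') for any input x'.
h_direct = encode(x) + rng.normal(scale=0.1, size=8)

# Both feed the same head, but the reachable sets differ:
# {encode(x') : x' an input}  is a subset of  {all state vectors}.
print(predict_from_state(h_from_input), predict_from_state(h_direct))
```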

One consistent crux I have with people not concerned about AI risk is that they believe massively more resources will be invested into technical safety before AGI is developed.

In the context of these statements, I would put it as something like "The number of people working full-time on technical AI Safety will increase by an order of magnitude by 2030".

Long-term planning is another capability that is likely necessary for deceptive alignment and that could plausibly be restricted. Restricting it obviously imposes a large alignment tax, but there are potentially ways to mitigate that. It seems at least as promising as some other approaches you listed.

I don't find goal misgeneralization vs schemers to be as much of a dichotomy as this comment is making it out to be. While they may be largely distinct for the first period of training, the current rollout method for state-of-the-art models seems to be: (1) give a model situational awareness and deploy it to the real world, (2) use this to identify alignment failures, (3) retrain the model, then repeat steps 2 and 3. If you consider this all part of the training process (and I think that's a fair characterization), a model that starts with goal misgeneralization quickly becomes a schemer too.
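As a schematic of that cycle, treating the whole thing as one training process (every function name below is a placeholder, not any lab's actual pipeline):

```python
def iterative_deployment(model, train, deploy_and_collect_failures, rounds=5):
    """Schematic of the deploy -> identify failures -> retrain cycle,
    viewed as one long training process. All callables are placeholders."""
    for _ in range(rounds):
        # Steps 1-2: deploy the (situationally aware) model and use the
        # deployment to surface alignment failures.
        failures = deploy_and_collect_failures(model)
        if not failures:
            break
        # Step 3: retrain on the observed failures, then repeat.
        model = train(model, failures)
    return model
```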

I think this part uses an unfair comparison:

  • Suppose that X and W+ are small finite sets. A task can be implemented as a dictionary whose keys lie in X and whose values lie in W+, which uses |X|·log|W+| bits. The functional can be implemented as a program which receives input of type X → W+ and returns output of type X. Easy!

  • In the subjective account, by contrast, the task requires infinite bits to specify, and the functional must somehow accept a representation of an arbitrary function W+ → R. Oh no! This is especially troubling for embedded agency, where the agent's decision theory must run on a physical substrate.

If X and W+ are small finite sets, then any behavior can be described with a utility function requiring only a finite number of bits to specify. You only need real-valued utilities when W+ is infinite, such as when outcomes are continuous, in which case the dictionaries require infinite bits to specify too.
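To spell out the finite-bits point (a back-of-the-envelope sketch, assuming the agent's behavior depends only on its preference ordering over outcomes rather than on cardinal utilities):

```latex
Given any $u : W^+ \to \mathbb{R}$ with $|W^+| = n$, define the rank function
$r : W^+ \to \{1, \dots, n\}$ by
\[
  r(w) = \bigl|\{\, w' \in W^+ : u(w') \le u(w) \,\}\bigr|.
\]
Then $r$ orders outcomes exactly as $u$ does, so it induces the same
maximizing behavior, yet specifying $r$ takes at most
$n \lceil \log_2 n \rceil$ bits, which is finite whenever $n$ is.
```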

I think this is representative of an unease I have with the framing of this sequence. It seems to be saying that the more general formulation allows for agents that behave in ways utility maximizers cannot, but most of these behaviors do exist for maximizers of certain utility functions. I'm still waiting for the punchline of which AI-safety-relevant aspect requires higher-order game theory rather than just maximizing agents, particularly if informational constraints are allowed.

I think, from an alignment perspective, having a human choose their action while being aware of the distribution over outcomes it induces is much safer than having the action effectively chosen for them by their specification of a utility function. This is especially true because probability distributions are large objects. A human choosing between them isn't pushing in any particular direction in a way that makes overlooking negative outcomes likely, while choosing based on a specified utility function leads to exactly that. This is all modulo ELK, of course.

I'm not sure I understand the variant you proposed. How is it different from the Othman and Sandholm MAX rule?

Thanks for the comment. I agree that, ideally, we would find a way not to have two wholly separate models and instead somehow train a model against itself. I think a potential issue with your proposal is that small perturbations could have discontinuous effects, the anticipation of which would distort predictions. Still, it would be interesting to think about it further to see if there's some way to avoid that issue.
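As a toy illustration of the discontinuity worry (the threshold rule and numbers here are made up, not part of the proposal): when a downstream decision thresholds a prediction, an arbitrarily small perturbation can flip the action, so the induced outcome changes discontinuously.

```python
def downstream_action(predicted_prob, threshold=0.5):
    # Made-up decision rule that consumes the prediction.
    return "intervene" if predicted_prob > threshold else "wait"

# Two predictions differing by an arbitrarily small perturbation...
p, p_perturbed = 0.5000, 0.5001

# ...yield different discrete actions, and hence potentially very
# different futures for the predictor to anticipate.
print(downstream_action(p))            # wait
print(downstream_action(p_perturbed))  # intervene
```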
