Daniel Salami


This seems somewhat related to this article: I came across a paper (Human-AI Shared Control via Policy Dissection) which uses neural frequency analysis of behaviours from an RL policy to control the agent's actions. I am wondering if the same thing can be done with language models. Maybe the same technique could also be useful for finding vectors that do specific things.

Thanks for the response. You bring up a good point that pushing toward more predictable transitions could be bad. However, doesn't the other term that is being optimized somewhat counteract this?

The objective I(S_t; S_{t+1}) breaks down into

H(S_{t+1}) - H(S_{t+1} | S_t)

So when this objective is maximized, the second term does push toward predictability (it lowers the conditional entropy of the next state given the current one), but the first term pushes the states the agent reaches to be more diverse.
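To make the trade-off between the two terms concrete, here is a minimal sketch that computes both entropies for a small discrete Markov chain. The function names and the toy two-state chains are my own illustrative assumptions, not something taken from a paper.

```python
import math


def entropy(p):
    """Shannon entropy (in bits) of a discrete probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def predictive_information(state_dist, transition):
    """Return (I, H_next, H_cond), where I = H(S_{t+1}) - H(S_{t+1} | S_t).

    state_dist[i]    = P(S_t = i)
    transition[i][j] = P(S_{t+1} = j | S_t = i)
    (Hypothetical helper for illustration only.)
    """
    n_states = len(state_dist)
    n_next = len(transition[0])
    # Marginal over next states: P(S_{t+1} = j) = sum_i P(S_t = i) P(j | i)
    next_dist = [
        sum(state_dist[i] * transition[i][j] for i in range(n_states))
        for j in range(n_next)
    ]
    h_next = entropy(next_dist)
    # Conditional entropy: average entropy of each row, weighted by P(S_t = i)
    h_cond = sum(state_dist[i] * entropy(transition[i]) for i in range(n_states))
    return h_next - h_cond, h_next, h_cond


# Predictable and diverse: a deterministic cycle between two states.
cycle = predictive_information([0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]])
# Predictable but not diverse: the agent always stays in state 0.
stuck = predictive_information([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]])
# Diverse but unpredictable: the next state is a fair coin flip.
noisy = predictive_information([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]])

for name, (mi, h_next, h_cond) in [("cycle", cycle), ("stuck", stuck), ("noisy", noisy)]:
    print(f"{name}: I={mi:.3f}  H(next)={h_next:.3f}  H(next|cur)={h_cond:.3f}")
```

In this toy setup only the cycle scores a full bit: the stuck chain is perfectly predictable but has zero state entropy, and the noisy chain has high state entropy that is cancelled by the conditional term.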

This is partially why some people use predictive information as a measure of complexity in time series.

Since humans are optimizing for a pretty complex system, I think this would be somewhat accounted for.

In addition, the objective in which the mutual information is conditioned on the policy changes the policy to align with the observed transitions, rather than having it decide its own objective separately.

“I think the actual concept you want from information theory is the Kullback-Leibler divergence; specifically you'd want to take a policy that's known to be safe and calculate KL(AI_policy || safe_policy) and penalize AI policies that are far away from the safe policy.”

I didn’t pursue this path for two reasons:

  1. The difficulty of defining what a safe policy is in every possible situation

  2. I think that whatever the penalization term is, it should be self-supervised so that it can scale properly with our current systems
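For reference, the KL penalty from the quoted suggestion can be sketched for a single state with a discrete action set as below. The names `penalized_reward` and `beta` are hypothetical, and the sketch assumes a safe action distribution is simply handed to us, which is exactly the part I find hard to define in general.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same actions."""
    total = 0.0
    for p_a, q_a in zip(p, q):
        if p_a == 0:
            continue  # terms with p_a = 0 contribute nothing
        if q_a == 0:
            return math.inf  # p puts mass where q puts none
        total += p_a * math.log(p_a / q_a)
    return total


def penalized_reward(reward, ai_policy, safe_policy, beta=1.0):
    """Task reward minus a KL penalty for drifting away from the safe policy.

    ai_policy and safe_policy are action distributions in the current state;
    beta is an assumed penalty weight (hypothetical, for illustration).
    """
    return reward - beta * kl_divergence(ai_policy, safe_policy)


# Usage: the safe policy is uniform over two actions, the AI policy is deterministic.
safe = [0.5, 0.5]
print(penalized_reward(1.0, [0.5, 0.5], safe))  # no penalty when the policies match
print(penalized_reward(1.0, [1.0, 0.0], safe))  # reward reduced by beta * ln(2)
```

Note that `safe_policy` has to be specified for every state the agent can reach, and nothing in this penalty is learned from the agent's own experience, which is where both of my reasons above come in.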

I find this interesting and hopefully, more research is done in this direction.

Recently I came across a 2018 thesis by Tom Everitt that claims Utility Uncertainty (referred to as CIRL in the thesis) does not circumvent the problem of reward corruption. I read through that section, and the examples it gives seem convincing; I would like to hear your views on it. Also, do you know of a resource that gathers objections to using utility uncertainty as a framework for solving AI alignment (especially since there have been at least two research papers that provide "solutions" to CIRL, indicating that it might soon be applicable to real-world applications)?


Towards Safe Artificial General Intelligence (Tom Everitt)


An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning (D Malik et al.)


Pragmatic-Pedagogic Value Alignment (JF Fisac et al.)