Interested in many things. I have a personal blog at https://www.beren.io/
I think this is a really good post. You might be interested in these two posts, which explore very similar arguments about the interaction between search in the world model and more general 'intuitive policies', as well as the fact that we are always optimizing for our world/reward model rather than reality, and how this affects how agents act.
Yes! This would be valuable. Generally, getting a sense of a model's 'self-awareness', in terms of how much it knows about itself, would be a good thing to start testing for.
I don't think models currently have this ability by default anyway. But we should definitely think very hard before letting them do this!
Yes, I think what I proposed here is the broadest and crudest thing that will work. It can of course be much more targeted to specific proposals or posts that we think are potentially most dangerous. Using existing language models to rank these is an interesting idea.
I'm very glad you wrote this. I have had similar musings previously as well, but it is really nice to see this properly written up and analyzed in a more formal manner.
Interesting thoughts! By the way, are you familiar with Hugo Touchette's work on this? It looks very related, and I think it has a lot of cool insights about these sorts of questions.
I think this is a good intuition. I think it comes down to the natural structure of the graph and the fact that information disappears at larger distances. This means that for dense graphs such as lattices, regions only interact implicitly through much lower-dimensional max-ent variables, which are then additive. For other causal graph structures, such as the power-law small-world graphs that are probably sensible for many real-world datasets, you get a similar thing: each cluster can be modelled mostly independently, apart from a few long-range interactions which can be modelled as interactions with some general 'cluster sum'. Interestingly, this is what many approximate Bayesian inference algorithms for factor graphs look like -- such as the region graph algorithm (http://pachecoj.com/courses/csc665-1/papers/Yedidia_GBP_InfoTheory05.pdf).
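The distance-decay premise here can be checked exactly in the simplest 1-D case. The snippet below is only a toy illustration (not the region graph algorithm itself): a binary symmetric Markov chain, where the mutual information between two sites is exactly computable and falls off with their separation. The flip probability `eps` is a made-up parameter for illustration.

```python
import numpy as np

# Toy check that information about a variable decays with graph distance:
# a binary symmetric Markov chain (the simplest 1-D 'lattice'), where each
# step flips the state with probability eps. The joint of (X_0, X_t) is
# exactly computable, so we can evaluate I(X_0; X_t) directly.

def mutual_info_at_distance(eps, t):
    # P(X_t = X_0) requires an even number of flips over t steps.
    p_same = 0.5 * (1 + (1 - 2 * eps) ** t)
    joint = np.array([[p_same / 2, (1 - p_same) / 2],
                      [(1 - p_same) / 2, p_same / 2]])
    marg = joint.sum(axis=1)  # uniform marginals: [0.5, 0.5]
    mi = 0.0
    for i in range(2):
        for j in range(2):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log2(joint[i, j] / (marg[i] * marg[j]))
    return mi

for t in [1, 2, 5, 10, 20]:
    print(t, mutual_info_at_distance(0.1, t))
```

The mutual information decays roughly geometrically in the separation (the correlation shrinks by a factor (1 - 2*eps) per step), so distant regions really do interact only through heavily compressed summaries.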
I definitely agree it would be really nice to have the math of this all properly worked out. I think this, as well as the reason why we see power-law spectra of features so often in natural datasets (which must have a max-ent explanation), is a super common and deep feature of the world.
Maybe this linearity story would work better for generative models, where adding latent vector representations of two different objects would lead the network to generate an image with both objects included (an image that would have an ambiguous class label to a second network). It would need to be tested whether this sort of thing happens by default (e.g., with Stable Diffusion) or whether I'm just making stuff up here.
Yes, this is exactly right. This is precisely the kind of linearity that I am talking about, not the input->output mapping, which is clearly nonlinear. The idea is that hidden inside the network is a linear latent space where we can perform linear operations and they (mostly) work. Among the points of evidence in the post there is discussion of exactly this kind of latent space editing for Stable Diffusion. A nice example is this paper. Interestingly, this also works for fine-tuning weight diffs, e.g. for style transfer.
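As a sketch of what 'linear operations in the latent space' means in practice: the snippet below uses random Gaussian vectors as stand-ins for latents that a real encoder (e.g. a diffusion or GAN encoder) would produce, so the specific latents and the 0.8 edit strength are purely illustrative. It shows the two standard operations: offset-based attribute editing and spherical interpolation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512
z_content = rng.standard_normal(dim)  # stand-in latent for the base image
z_style_a = rng.standard_normal(dim)  # stand-in latent exhibiting style A
z_style_b = rng.standard_normal(dim)  # stand-in latent exhibiting style B

# Attribute edit: if the latent space is (approximately) linear, a style
# change is just a vector offset added to the content latent.
style_direction = z_style_b - z_style_a
z_edited = z_content + 0.8 * style_direction  # 0.8 = illustrative edit strength

def slerp(z0, z1, t):
    """Spherical interpolation, often preferred over plain lerp for Gaussian
    latents because it keeps the interpolant at a norm typical of the prior."""
    omega = np.arccos(np.clip(
        np.dot(z0 / np.linalg.norm(z0), z1 / np.linalg.norm(z1)), -1.0, 1.0))
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

z_blend = slerp(z_content, z_style_b, 0.5)
# For high-dim Gaussians, the blend's norm stays close to sqrt(dim),
# so decoding it should still land in-distribution.
print(np.linalg.norm(z_blend))
```

In a real model the claim being tested is that decoding `z_edited` or `z_blend` produces a coherent image with the intended mix of attributes; the linear algebra itself is this simple.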
Exactly this. This is the relationship in RL between the discount factor and the probability of transitioning into an absorbing state (death).
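A quick numerical check of this equivalence, with made-up values for the discount and the death probability: a per-step probability p of falling into a zero-reward absorbing state multiplies the effective discount by (1 - p), so the expected return of a constant reward r is r / (1 - gamma * (1 - p)).

```python
import random

def mc_return(gamma, p_death, r=1.0, episodes=20000, horizon=1000, seed=0):
    """Monte Carlo estimate of expected discounted return when each step has
    probability p_death of transitioning to an absorbing (zero-reward) state."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        g, disc = 0.0, 1.0
        for _ in range(horizon):
            g += disc * r
            disc *= gamma
            if rng.random() < p_death:
                break  # absorbed: no further reward
        total += g
    return total / episodes

# Analytically: sum_t (gamma * (1 - p))^t * r = r / (1 - gamma * (1 - p)),
# i.e. death risk p acts exactly like an extra discount factor of (1 - p).
gamma, p = 0.95, 0.02  # illustrative values
analytic = 1.0 / (1.0 - gamma * (1.0 - p))
print(analytic)
print(mc_return(gamma, p))  # should agree closely with the analytic value
```

So an agent with discount gamma in a risky world behaves like a longer-horizon agent (discount gamma / (1 - p) would be needed) in a safe one, which is one way to read 'death as discounting'.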