Beth Barnes

Alignment researcher. Views are my own and not those of my employer.

It seems to me that this argument proves much too much. If I understand correctly, you're saying that various systems, including advanced ML-based AI, are 'computationally irreducible', by which you mean there's no simplified model of the system that makes useful predictions. I don't think humans are computationally irreducible in this way. For example, I think you can make many very useful predictions about humans by modeling them as having some set of goals, habits, tendencies, and knowledge. In particular, knowing what the human's intentions or goals are is very useful for the sort of predictions that we need to make in order to check if our AI is aligned. Of course, it's difficult to identify what a human's intentions are just by having access to their brain, but as I understand it that's not the argument you're making.

I want to read a detective story where you figure out who the murderer is by tracing encoding errors.

It seems to me that a big problem with this approach is that training agents entirely within a simulation is horribly compute-inefficient compared to training models on human data. (Apologies if you addressed this in the post and I missed it.)

Maybe controlled substances? E.g. in the UK there are requirements for pharmacies to store controlled substances securely, dispose of them in particular ways, keep records of prescriptions, do due diligence that the patient is not addicted or reselling, etc. And presumably there are systems for supplying pharmacies and tracking ownership.

What do you think is important about pure RL agents vs RL-finetuned language models? I expect the first powerful systems to include significant pretraining so I don't really think much about agents that are only trained with RL (if that's what you were referring to).

How were you thinking this would measure Goodharting in particular? 

I agree that seems like a reasonable benchmark to have for getting ML researchers/academics to work on imitation learning/value learning. I don't think I'm likely to prioritize it - I don't think 'inability to learn human values' is going to be a problem for advanced AI, so I'm less excited about value learning as a key thing to work on.

video game companies can be extremely well-aligned with delivering a positive experience for their users

This doesn't seem obvious to me; video game companies are incentivized to make games that are as addictive as possible without putting off new users or provoking other backlash.

Based on the language modeling game that Redwood made, it seems like humans are much worse than models at next-word prediction (maybe around the performance of a 12-layer model).

My impression is that they don't have the skills needed for successful foraging. There's a lot of evidence for some degree of cultural accumulation in apes and e.g. macaques. But I haven't looked into this specific claim super closely.

Thanks for the post! One narrow point:
You seem to lean at least a bit on the example of 'much like how humans’ sudden explosion of technological knowledge accumulated in our culture rather than our genes, once we turned the corner'. It seems to me that:
a. You don't need to go to humans before you get significant accumulation of important cultural knowledge outside genes (e.g. my understanding is that unacculturated chimps die in the wild).
b. The genetic bottleneck is a somewhat weird and contingent feature of animal evolution, and I don't think there's a clear analogy in current LLM ML paradigms.

I'm not making any claims about takeoff speeds in models, just saying that I don't think arguments that are based on features that are (maybe) contingent on a genetic bottleneck support the same inference for ML. Can you make the same argument without leaning on the genetic bottleneck, or explain to me why the analogy in fact should hold?
