Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The core argument: language models are more transparent and less prone to developing agency and superintelligence

I argue that compared to alternative approaches such as open-ended reinforcement learning, the recent paradigm of pursuing human-level AGI with language models has the potential to be relatively safe. There are three main reasons why I believe LM-based AGI systems could be safe:

  1. They operate on text that is intelligible to humans, which makes them relatively interpretable and easier to monitor.
  2. They are subject to weaker pressure to surpass human-level capabilities than systems trained on more open-ended tasks, such as those tackled with reinforcement learning.
  3. Since language models are trained as predictors, there is weaker pressure for them to develop agentic behavior.

I acknowledge that these arguments have been criticized. I will try to defend these statements, delving into the nuances and explaining how I envision relatively safe LM-based AGI systems. I want to note upfront that although I believe this paradigm is safer than other alternatives I've come across, I still think it poses significant dangers, so I’m not suggesting we should all just chill. It's also difficult to reason about these sorts of things in the abstract, and there's a good chance I may be overlooking critical considerations.

Human-level AGI might be reachable with existing language models

There are indications that GPT-4 could be considered an early form of AGI. But even though it surpasses human level on many standardized tests, as a general-purpose AGI it's not yet at human level, at least in its vanilla form. For example, it's still not very good at separating fact from fiction, and I wouldn't give it full access to my email and social media accounts and ask it to throw a birthday party for me.

To address these limitations, there is an ongoing effort to develop more capable and agentic AI systems based on language models through chained operations such as chain-of-thought reasoning. Auto-GPT is a very early example of this approach, and I expect to see much better LM-based AI systems very soon. To understand this paradigm, we can think of each GPT-4 prompt completion as an atomic operation that can be chained in sophisticated ways to create deeper and more intelligent thought processes. This is analogous to how humans engage in internal conversations to reach better decisions, instead of blurting out the first thought that comes to mind. Like humans, language models can deliver better results if allowed to engage in internal dialogues and to access external tools and sources of data. I can't think of any cognitive activity that a human can perform instantaneously (in a "single cognitive step") that GPT-4 can't, which suggests it might just be a matter of letting GPT-4 talk with itself for long enough before it approaches human level. An important property of these internal dialogues is that they are produced in plain human language.
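
To make this concrete, here is a minimal sketch of what chaining atomic LM operations into an internal dialogue could look like. The `complete` function is a hypothetical placeholder for any text-completion API (not a real library call), and the fixed number of steps is purely illustrative:

```python
def complete(prompt: str) -> str:
    """Placeholder for a call to a language model's completion API."""
    raise NotImplementedError("plug in an actual LM completion call here")


def internal_dialogue(question: str, steps: int = 5) -> str:
    """Chain atomic completions into a visible, plain-text thought process."""
    transcript = f"Question: {question}\n"
    for i in range(steps):
        # Each call is one "atomic operation"; its output is appended to the
        # transcript, so every intermediate thought remains human-readable.
        thought = complete(transcript + f"Thought {i + 1}:")
        transcript += f"Thought {i + 1}: {thought}\n"
    return complete(transcript + "Final answer:")
```

The key point is that the entire thought process lives in `transcript`, an ordinary piece of text that a human (or another model) can read.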

It remains to be seen how much progress in AI capabilities will come from (1) devising effective chaining schemes versus (2) improving the underlying language models and making each atomic operation more capable. Given the ease of chaining existing models compared to the difficulty of training state-of-the-art language models, there's a good chance that (1) will progress faster than (2). While very few companies have the resources to train more powerful language models, virtually anyone can experiment with chain-of-thought prompting, and startups and commercial products in this area are already emerging. I believe there's a chance that AGI systems close to human level, in both the breadth of tasks they can solve and their robustness, will emerge even with GPT-4-level language models.

Compositional AI systems are easier to align than end-to-end deep-learning models

If the scenario I just described plays out and progress in AI capabilities is driven more by how atomic LM operations are chained than by the power of each atomic operation, then this has important consequences for AI alignment.

First, while each atomic LM operation is quite opaque and involves giant inscrutable matrices of floating-point numbers, chains of internal LM dialogues in English are much easier to follow and understand. It's not uncommon for emergent high-level systems to be more intelligible than their underlying atomic parts.

Second, if progress in AI is driven more by sophisticated chaining of text operations than by improvements to the underlying deep learning models, I believe this significantly reduces the risk of creating alien, superintelligent systems. While AI systems may surpass human capabilities in speed and knowledge, their underlying thinking patterns are likely to remain familiar (even if they are built from language models whose giant inscrutable matrices operate internally in a very non-human way).

Third, AI misalignment risks are easier to conceive of with AI agents that operate in the world and try to achieve goals than with machine learning models that just make predictions. While language models can give rise to agentic AI systems through the chaining paradigm, the internal thinking and intermediate goals of such systems are designed to be transparent. If an LM-based AGI system starts making unintended, dangerous plans, like accumulating resources or hiding its true goals, these thoughts should be recorded in the system's internal dialogues.

It's theoretically possible for an LM-based system to obscure its true intentions and develop a secret language that appears aligned with humans but is actually malicious. However, this thought too must occur within the system's chain of thought. The only way for it to be truly concealed is if it occurs within a single atomic operation, somewhere within the 200 or whatever number of layers GPT-4 has, which is a much higher bar. Additionally, the language model must come up with innocent-looking text that encodes a message to start the betrayal and trust that its future self (and potentially other language models monitoring it) will go along with the scheme. Since language models are trained as predictors, not agents, it's not clear that such agentic behavior could arise spontaneously.

What exactly do I mean by agency and why do I think language models are unlikely to have it?

Let’s consider a question-answering AI. Under what conditions should it be seen as an agent? I believe it depends on what this AI does internally to answer questions. If it comes up with a plan for how to produce a good answer (e.g. through Google searches) and then executes this plan, I’d say that this AI shows agency. If, on the other hand, it just makes some internal (very complex) computations and spits out an answer, then I’d say it’s not agentic. Even if it has a very good internal model of the external world, to the point that it could realistically simulate humans and other agents or passionately argue that it wants things and describe plans in detail, I’d still argue that the AI itself is not agentic, as long as it doesn’t actually plan to do anything. It’s the difference between knowing what it means to care and actually caring.
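
As a toy illustration of this distinction, compare a system that answers with a single completion to one that plans and then executes actions. Both `complete` and `run_tool` are hypothetical placeholders rather than real APIs, and the planning loop is deliberately simplistic:

```python
# Illustrative sketch only; the helper functions are stubs, not real APIs.
def complete(prompt: str) -> str:
    raise NotImplementedError("LM completion call goes here")


def run_tool(action: str) -> str:
    raise NotImplementedError("tool execution (web search, email, ...) goes here")


def non_agentic_answer(question: str) -> str:
    # A single internal computation: no plan is formed, no action is taken
    # in the world. Under the definition above, this is not agentic.
    return complete(question)


def agentic_answer(question: str) -> str:
    # Forms a plan, then executes it step by step with external tools.
    # This is the kind of behavior the post counts as agency.
    plan = complete(f"List the steps needed to answer: {question}")
    observations = ""
    for step in plan.splitlines():
        observations += run_tool(step) + "\n"
    return complete(f"Question: {question}\nFindings:\n{observations}\nAnswer:")
```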

Given how language models are trained, I don’t see anything that would push them to develop agency. In order for a capability to arise in a model, this capability has to be rewarded by the training objective. But given a language model A that just knows how to spit out satisfactory answers and a language model B that also has the ability to come up with exactly the right text that would allow its future self to execute a plan that a human wouldn’t understand, there’s nothing in the training objective of language models that favors B over A.

However, as people are starting to experiment with LM-based AI agents like Auto-GPT, I expect we'll soon find out that these AI systems actually perform better when the underlying language models are trained with the explicit goal of communicating well with their future selves. I worry that this will reward agentic behavior, with consequences we can't really predict. I therefore think there's room to advocate against doing that and for sticking with AI systems that rely on pure language models optimized only with respect to their direct outputs.

Why do I expect weaker pressure to reach superintelligence with language modeling compared to reinforcement learning?

Many in the AI alignment community have pointed out that there is no reason for language models to be capped at human-level intelligence. While I agree that language models could theoretically exceed human level, I expect their optimization process to exert much weaker pressure to reach superhuman levels compared to alternative training paradigms such as reinforcement learning.

A useful example is AIs that learn to play games such as chess and Go. Models trained to imitate human games (by predicting human moves) will reach lower capabilities than models trained to simply maximize the game's objective through reinforcement learning (like AlphaZero). Likewise, predicting human thoughts captured in text pushes language models towards general intelligence, but letting them learn how to solve problems on their own (not bound by human experience) would in principle allow them to reach much greater levels of intelligence. Even if language models eventually surpass human level, all of their capabilities will still be anchored in human capabilities. If no human has ever known some fact about the world and it has no bearing on how humans think, then a language model has no reason whatsoever to learn it. Reinforcement learning agents, on the other hand, are not anchored in any way by human capabilities. For AlphaZero, human level in chess or Go means absolutely nothing.
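
One way to make the anchoring point precise (this formalization is my own, not something from the post's sources) is to compare the two training objectives. Imitation learning maximizes the likelihood of human moves, while reinforcement learning maximizes the game's own return, with no reference to human play:

$$\max_\theta \ \mathbb{E}_{(s,\,a_{\text{human}})}\left[\log p_\theta(a_{\text{human}} \mid s)\right] \qquad \text{versus} \qquad \max_\theta \ \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right]$$

The first objective is optimized by reproducing the distribution of human moves, so human capability is baked into its optimum; the second objective never mentions human behavior at all.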

How are the human-like thinking patterns of LM-based AI systems going to help us monitor them?

LM-based AI systems can be seen as a form of factored cognition, namely the automation of human thinking, which seems easier to understand and align than models trained end to end to maximize an objective. While atomic language models imitate human thinking only externally (and remain mostly inscrutable internally), compositional LM-based AI systems also simulate human cognition internally.

There are many ways to monitor the chain of thought (CoT) of such AI systems (and I'd love to see more research on this). Here are some trivial ideas:

  • We could take another language model, roughly as capable as the one producing the CoT (but trained independently), and ask it to report concerning thoughts in the CoT (see the sketch after this list).
  • We could sample some parts of the CoT and review them manually.
  • A specific red flag we might look for is logical gaps in the CoT (i.e. an idea that doesn't seem to follow logically from the preceding steps), which would suggest that too much hidden work happened within a single atomic operation. We should also expect compositional AI systems to perform better on difficult tasks than atomic LM operations do.
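
Here is a minimal sketch of the first idea above: an independently trained monitoring model reads each CoT step and flags concerning content. `monitor_complete` is a hypothetical placeholder for the second model's completion call, and the prompt wording and FLAG/OK convention are my own assumptions:

```python
# Hypothetical sketch: the monitoring prompt and FLAG/OK convention
# are illustrative, not a tested recipe.
def monitor_complete(prompt: str) -> str:
    raise NotImplementedError("plug in the independently trained monitoring LM here")


def review_cot(cot_steps: list[str]) -> list[str]:
    """Return the CoT steps that the monitoring model flags as concerning."""
    flagged = []
    for step in cot_steps:
        verdict = monitor_complete(
            "You are auditing another AI system's reasoning.\n"
            f"Step: {step}\n"
            "Reply 'FLAG' if this step involves deception, resource "
            "acquisition, or hiding goals from humans; otherwise reply 'OK'."
        )
        if verdict.strip().upper().startswith("FLAG"):
            flagged.append(step)
    return flagged
```

The same transcript could also be sampled for manual review, as suggested in the second bullet.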

Of course, none of these checks will provide an absolute guarantee that the model hasn't developed superintelligence and isn't fooling us by pretending to follow a logical CoT (who knows, maybe GPT-4 is already superintelligent and misaligned and just waiting for an opportunity to strike). However, I think that such empirical tests could still provide some degree of reassurance (or reasons for concern). I also think we should be much more suspicious of CoTs produced through a single atomic LM operation than of CoTs created one step at a time (provided that language models cannot carry a hidden state between operations).

Consequences for AI safety efforts

If we accept that language models offer a safer path to AGI, this has practical implications for AI safety efforts.

We may want to support (or at least not oppose) the LM chaining path to AGI

The recent call to temporarily pause the development of AI more capable than GPT-4 has received, as expected, a lot of backlash. An obvious reason why people don't like the idea of pausing AI development is that they are excited about, and looking forward to, even more capable AI systems that could genuinely improve the world. I assume that the intent of the open letter was not to pause R&D on chaining GPT-4-like models, which anyone can do anyway, but rather to pause the development of more powerful (and potentially dangerous) deep learning models like GPT-5, which only a small number of well-resourced AI labs can develop. If people believed that GPT-4-like models were sufficient for roughly human-level AGI systems, given enough time to develop effective methods for composing them and integrating them with external tools, then there might be more willingness to pause the training of more powerful models.

I think this is the least dangerous roadmap to AGI we currently have, so it might be best to just pursue it. I also think it’s almost inevitable. At this point, nothing short of an international ban will prevent people from chaining large language models. Avoiding the training of even greater models, on the other hand, is something that could be coordinated among a small group of actors.

I think it will be much safer to increase AI capabilities by constructing compositional AI systems that rely on pure language models, rather than training more capable models. I think it will be particularly dangerous to train models on tasks that require planning ahead and reward long-term consequences. 

We need more research on how to safely chain language models

I think it is important to study to what extent one LM-based system can be trusted to monitor the chain of thought of another LM-based AGI system and report concerning thoughts. Empirical research can complement theoretical approaches to understanding worst-case scenarios, even at the cost of weaker safety guarantees. Research that looks for concerning signs of misbehavior by language models is already underway and can be expanded. Additionally, investigating what's happening within the opaque matrices of floating-point numbers that language models are made of would also be beneficial, but it is a more challenging area of research.

Summary

No one really knows what stack of AI technologies will be developed in the coming years before we reach (and possibly exceed) human-level AGI. However, I do think it's at least possible that the main piece of technology to get us there will be language models roughly as capable as present-day models (GPT-4), leveraged with effective chaining techniques. This is the safest scenario I can think of at the time of writing, and maybe we should actively aim for it to happen.

I am thankful to Edo Arad, Bary Levi, Shay Ben Moshe and Will Seltzer for helpful feedback and comments on early versions of this post.

Comments

I think this is wrong, but a useful argument to make.

I disagree even though I generally agree with each of your sub-points. The key problem is that the points can all be correct, but don't add to the conclusion that this is safe. For example, perhaps an interpretable model is only 99.998% likely to be a misaligned AI system, instead of 99.999% for a less interpretable one. I also think that the current paradigm is shortening timelines, and regardless of how we do safety, less time makes it less likely that we will find effective approaches in time to preempt disaster.

(I would endorse the weaker claim that LLMs are more plausibly amenable to current approaches to safety than alternative approaches, but it's less clear that we wouldn't have other and even more promising angles to consider if a different paradigm was dominant.)

Thank you for this comment. I'm curious to understand the source of disagreement between us, given that you generally agree with each of the sub-points. Do you really think that the chances of misalignment with LM-based AI systems are above 90%? What exactly do you mean by misalignment in this context, and why do you think it's the most likely result with such AI? Do you think it will happen even if humanity sticks with the paradigm I described (of chaining pure language models while avoiding training models on open-ended tasks)?

I want to also note that my argument is less about "developing language models was counterfactually a good thing" and more "given that language models have been developed (which is now a historic fact), the safest path towards human-level AGI might be to stick with pure language models".

I agree with a lot of things here. I think using predictive models as components of larger systems is definitely interesting, but that also there are dangers that end-to-end training might "move goals into the predictive model" in ways that improve performance but worsen understandability / safety.

I disagree that LLMs should be thought of as internally human-like (rather than just producing human-like outputs). This is relevant mainly for a few things: how likely LLMs are to make mistakes a human would never make, how we should expect LLMs to generalize to far-outside-distribution tasks, and how well the safety properties of LLMs should hold up under optimization pressure on some function of the output.

Strategically, I don't really endorse building human-ish-level AGI even if it's mostly harmless. I think it doesn't really change the gameboard in positive ways - unless you're using it to take really drastic dystopian actions, both people who are working on safe superintelligent AI and people who are working on dangerous superintelligent AI can use your AGI to help them, which seems bad unless you expect your AGI to be differentially better at alignment research relative to capabilities research.

Thank you for this comment!

I first want to note that your comment implies my post is saying things which I don't think it does (or at least I didn't mean it to):
- I didn't argue that language models are internally human-like, only externally. I do think however that compositional AI systems made of language models should be internally human-like.
- I didn't advocate for training AGI end-to-end (but rather taking the factored cognition approach).

I agree with you that a human-ish-level AGI would be VERY dangerous regardless of how aligned it is (mostly because there are 8 billion people on this planet, and it's sufficient for just a few of them to use it in stupid/evil ways to put us all in danger).

Nice post, thanks!
Are you planning or currently doing any relevant research? 

Thank you!
I don't have any concrete plans, but maybe.