Thanks, this is a useful corrective to the post! If I shortcut safety to "would I trust my grandmother to use this without bad outcomes", I would trust a current-gen LLM to be helpful and friendly with her, but I would absolutely fear her "learning" factually untrue things from it. While I think it can be useful to have separate concepts for hallucinations and "intentional lies" (as another commenter argues), I think "behavioral safety" should preclude both, in which case our LLMs are not behaviorally safe.
I think I may have overlooked hallucinations because I've internalized that LLMs are factually unreliable: I don't use LLMs where accuracy is critical, so I don't see many hallucinations (which is not much of an endorsement of LLMs).
Asking for some clarifications:
1. For both problems, should the solution work for an adversarially chosen set of m entries?
2. For both problems, can we read more entries of the matrix if it helps our solution? In particular, can we WLOG assume we know the diagonal entries, in case that helps in some way?
I agree my headline is an overclaim, but I wanted a title that captures the direction and magnitude of my update from fixing the data. On the bugged data, I thought the result was a real nail in the coffin for simulator theory - look, it can't even simulate an incorrect-answerer when that's clearly what's happening! But on the corrected data, the model is clearly "catching on to the pattern" of incorrectness, which is consistent with simulator theory (and several non-simulator-theory explanations). Now that I'm actually getting an effect I'll be running experiments to disentangle the possibilities!
Agreed! I was trying to get at something similar in my "masks all the way down" post. A framework I really like for explaining why this happened is beren's "Direct Optimizer" vs "Amortised Optimizer". My summary of beren's post is that instead of being explicit optimizing systems, LLMs are made of heuristics developed during training, which are sufficient for next-token prediction, and they therefore don't need to have long-term goals.
Good post. Are you familiar with the pioneering work of BuzzFeed et al (2009-2014), which indicated that prime-numbered lists resulted in more engagement than round-numbered ones?
I'm not surprised this idea was already in the water! I'm glad to hear ARC is already trying to design around this.
To use your analogy, I think this is like a study showing that wearing a blindfold does decrease sight capabilities. It's a proof of concept that you can make that change, even though the subject isn't truly made blind, could possibly remove their own blindfold, etc.
I think this is notable because it highlights that LLMs (as they exist now) are not the expected-utility-maximizing agents which have all the negative results. It's a very different landscape if we can make our AI act corrigible (but only in narrow ways which might be undone by prompt injection, etc, etc) versus if we're infinitely far away from an AI having "an intuitive sense to “understanding that it might be flawed”".
A few comments:
Thanks for compiling the Metaculus predictions! Seems like on 4/6 the community updated their timelines to be sooner. Also notable that Matthew Barnett just conceded a short timelines bet early! He says he actually updated his timelines a few months ago, partially due to ChatGPT.
I think both of these statements are true. Despite this, I think the architecture shown in "Not in GPT" is correct, because (as I understand it) "encoder" and "decoder" are interchangeable unless both are present. That's what I was trying to get at here:
See this comment for more discussion of the terminology.