AI safety & alignment researcher
In Rob Bensinger's typology: AGI-alarmed, tentative welfarist, and eventualist.
Public stance: AI companies are doing their best to build ASI (AI much smarter than humans), and have a chance of succeeding. No one currently knows how to build ASI without an unacceptable level of existential risk (> 5%). Therefore, companies should be forbidden from building ASI until we know how to do it safely.
I have signed no contracts or agreements whose existence I cannot mention.
it would actually be helpful for your project to understand active inference at least a bit. Empirically it seems has-repeatedly-read-Scott-Alexander's-posts-on-it leads people to some weird epistemic state
Fair enough — is there a source you'd most recommend for learning more?
You could be right; my sample size is limited here! And I did talk with one person who said that they had that feature turned off and had still noticed sycophantic behavior. If it's correct that it only looks at past chats when the user requests that, then I agree that the feature seems unlikely to be related.
Ah, yeah, I definitely get 'You're right to push back'; I feel like that's something I see from almost all models. I'm totally making this up, but I've assumed that was encouraged by the model trainers so that people would feel free to push back, since it's a known failure mode — or at least was for a while — that some users assume the AI is perfectly logical and all-knowing.
Just a short heads-up that although Anthropic found that Sonnet 4.5 is much less sycophantic than its predecessors, I and a number of other people have observed that it engages in 4o-level glazing in a way that I haven't encountered with previous Claude versions ('You're really smart to question that, actually...', that sort of thing). I'm not sure whether Anthropic's tests fail to capture the full scope of Claude behavior, or whether this is related to another factor — most people I talked to who were also experiencing this had the new 'past chats' feature turned on (as did I), and since I turned that off I've seen less sycophancy.
Could they implement similar backdoor in you?...My guess is not
Although people have certainly tried...
My guess is not, and one reason (there are also others but that's a different topic) is that humans like me and you have a very deep belief "current date doesn't make a difference for whether abortion is good and bad" that is extremely hard to overwrite without hurting our cognition in other contexts. Like, what is even good and bad if in some cases they flip at midnight?
I'm being a bit tangential here, but a couple of thoughts:
I agree that it's hard to imagine what cognitive changes would have to happen for me to have a value with that property. I don't think I have very good intuitions about how much it would affect my overall cognition, though. What you're saying feels plausible to me, but I don't have much confidence either way.
I agree humans absorb (terminal) values from people around them. But this property isn't something I want in a powerful AI. I think it's clearly possible to design an agent that doesn't have the "absorbs terminal values" property, do you agree?
I do! Although my expectation is that for LLMs and similar approaches, they'll be tangled like they are in humans (this would actually be a really interesting line of research — after training a particular value into an LLM, is there some set of facts that could be provided which would result in that value changing?), so we may not get that desirable property in practice :(
Yeah! I see this as a different problem from the value binding problem, but just as important.
For sure — the only thing I'm trying to claim in the OP is that there exists some set of observations which, if the LLM made them, would result in the LLM being unaligned-from-our-perspective, and so it seems impossible to fully guarantee the stability of alignment.
Your split seems promising; being able to make problems increasingly unlikely given more observation and thinking would be a really nice property to have. It seems like it might be hard to create a trigger for your #2 that wouldn't cause the AI to shut down every time it did novel research.
I'm reminded of the classic Onion article, "Why Can't Anyone Tell I'm Wearing This Business Suit Ironically?"
See also normcore, possibly my favorite fashion moment of all time (#healthgoth was pretty great too, although less so once anyone started to take it seriously).
This is crucial. Suits can send a really wide range of signals depending on the style, the fit, what you wear them with, attitude, etc. You may or may not care about the signals you're sending, but I think it's at least worth being aware that there's not a single, fixed 'suit' message (you may already be aware of this, but I'd guess that not all readers are).
[addendum]
In that situation, how do we want an AI to act? There's a few options, but doing nothing seems like a good default. Constructing the value binding algorithm such that this is the resulting behaviour doesn't seem that hard, but it might not be trivial.
(and I imagine that the kind of 'my values bind to something, but in such a way that that it'll cause me to take very different options than before' I describe above is much harder to specify)
Looking at Anthropic's documentation of the feature, it seems like it does support searching past chats, but has other effects as well. Quoting selectively:
They also say that you can 'see exactly what Claude remembers about you by navigating to Settings > Capabilities and clicking “View and edit memory”', but that setting doesn't exist for me.