AI safety & alignment researcher
In Rob Bensinger's typology: AGI-alarmed, tentative welfarist, and eventualist.
Public stance: AI companies are doing their best to build ASI (AI much smarter than humans), and have a chance of succeeding. No one currently knows how to build ASI without an unacceptable level of existential risk (> 5%). Therefore, companies should be forbidden from building ASI until we know how to do it safely.
I have signed no contracts or agreements whose existence I cannot mention.
Whoops, no, I didn't; I'll edit the post to direct people there for discussion. Unfortunately the LW search interface doesn't surface shortform posts, and googling with site:lesswrong.com doesn't surface it either. Thanks for pointing that out.
Ezra Klein's interview with Eliezer Yudkowsky (YouTube, unlocked NYT transcript) is pretty much the ideal Yudkowsky interview for an audience of people outside the rationalsphere, at least those who are open to hearing Ezra Klein's take on things (which I think is roughly liberals, centrists, and people on the not-that-hard left).
Klein is smart, and a talented interviewer. He's skeptical but sympathetic. He's clearly familiar enough with Yudkowsky's strengths and weaknesses in interviews to draw out his more normie-appealing side. He covers all the important points rather than letting the discussion get too stuck on any one point. If it reaches as many people as most of Klein's interviews, I think it may even have a significant impact above counterfactual.
I'll be sharing it with a number of AI-risk-skeptical people in my life, and insofar as you think it's good for more people to really get the basic arguments — even if you don't fully agree with Eliezer's take on it — you may want to do the same.
[EDIT: please go here for further discussion, no need to split it]
Looking at Anthropic's documentation of the feature, it seems like it does support searching past chats, but has other effects as well. Quoting selectively:
You can now prompt Claude to search through your previous conversations to find and reference relevant information in new chats. Additionally, Claude can remember context from previous chats, creating continuity across your conversations.
...
Claude can now generate memory based on your chat history. With the addition of memory, Claude transforms from a stateless chat interface into a knowledgeable collaborator that builds understanding over time.
They also say that you can 'see exactly what Claude remembers about you by navigating to Settings > Capabilities and clicking “View and edit memory”', but that setting doesn't exist for me.
it would actually be helpful for your project to understand active inference at least a bit. Empirically it seems has-repeatedly-read-Scott-Alexander's-posts-on-it leads people to some weird epistemic state
Fair enough — is there a source you'd most recommend for learning more?
You could be right; my sample size is limited here! And I did talk with one person who said that they had that feature turned off and had still noticed sycophantic behavior. If it's correct that it only looks at past chats when the user requests that, then I agree that the feature seems unlikely to be related.
Ah, yeah, I definitely get 'You're right to push back'; I feel like that's something I see from almost all models. I'm totally making this up, but I've assumed that was encouraged by the model trainers so that people would feel free to push back, since it's a known failure mode — or at least was for a while — that some users assume the AI is perfectly logical and all-knowing.
Just a short heads-up that although Anthropic found that Sonnet 4.5 is much less sycophantic than its predecessors, I and a number of other people have observed that it engages in 4o-level glazing in a way that I haven't encountered with previous Claude versions ('You're really smart to question that, actually...', that sort of thing). I'm not sure whether Anthropic's tests fail to capture the full scope of Claude behavior, or whether this is related to another factor — most people I talked to who were also experiencing this had the new 'past chats' feature turned on (as did I), and since I turned that off I've seen less sycophancy.
Could they implement similar backdoor in you?...My guess is not
Although people have certainly tried...
My guess is not, and one reason (there are also others but that's a different topic) is that humans like me and you have a very deep belief "current date doesn't make a difference for whether abortion is good and bad" that is extremely hard to overwrite without hurting our cognition in other contexts. Like, what is even good and bad if in some cases they flip at midnight?
I'm being a bit tangential here, but a couple of thoughts:
I agree that it's hard to imagine what cognitive changes would have to happen for me to have a value with that property. I don't think I have very good intuitions about how much it would affect my overall cognition, though. What you're saying feels plausible to me, but I don't have much confidence either way.
I agree humans absorb (terminal) values from people around them. But this property isn't something I want in a powerful AI. I think it's clearly possible to design an agent that doesn't have the "absorbs terminal values" property, do you agree?
I do! Although my expectation is that for LLMs and similar approaches, they'll be tangled like they are in humans (this would actually be a really interesting line of research — after training a particular value into an LLM, is there some set of facts that could be provided which would result in that value changing?), so we may not get that desirable property in practice :(
Yeah! I see this as a different problem from the value binding problem, but just as important.
For sure — the only thing I'm trying to claim in the OP is that there exists some set of observations which, if the LLM made them, would result in the LLM being unaligned-from-our-perspective, and so it seems impossible to fully guarantee the stability of alignment.
Your split seems promising; being able to make problems increasingly unlikely given more observation and thinking would be a really nice property to have. It seems like it might be hard to create a trigger for your #2 that wouldn't cause the AI to shut down every time it did novel research.
Yeah, it seems like so many people tagged that mentally as 'Anthropic research', which is a shame. @Eliezer Yudkowsky FYI for future interviews.