AI safety & alignment researcher
In Rob Bensinger's typology: AGI-alarmed, tentative welfarist, and eventualist.
Public stance: AI companies are doing their best to build ASI (AI much smarter than humans), and have a chance of succeeding. No one currently knows how to build ASI without an unacceptable level of existential risk (> 5%). Therefore, companies should be forbidden from building ASI until we know how to do it safely.
I have signed no contracts or agreements whose existence I cannot mention.
For some applications, you may want to express something in terms of the model’s own abstractions.
It seems like this applies to some kinds of activation steering (eg steering on SAE features) but not really to others (eg contrastive prompts); curious whether you would agree.
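(For concreteness, here's roughly the contrastive-prompts case I have in mind — a minimal sketch using GPT-2 via Hugging Face transformers, where the layer index, scale, and contrast prompts are arbitrary placeholders rather than anything principled. The steering direction comes from a human-chosen contrast rather than from the model's own learned features, which is the distinction I'm gesturing at.)

```python
# Sketch of contrastive-prompt activation steering (not SAE-based).
# GPT-2, LAYER, SCALE, and the contrast prompts are illustrative choices only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
model.eval()

LAYER, SCALE = 6, 4.0

def resid_at(prompt):
    # Residual-stream activation at the last token position of LAYER.
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1]  # shape: (hidden_dim,)

# The steering vector is defined by a contrast the *user* picked,
# not by a feature the model itself represents (as with an SAE latent).
steer = resid_at("I am feeling extremely happy.") - resid_at("I am feeling extremely sad.")

def hook(module, inputs, output):
    # Add the scaled steering vector to the block's output hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(hook)
try:
    ids = tok("I think today is going to be", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```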
Yeah, it seems like so many people mentally tagged that as 'Anthropic research', which is a shame. @Eliezer Yudkowsky FYI for future interviews.
Whoops, no, I didn't; I'll edit the post to direct people there for discussion. Unfortunately the LW search interface doesn't surface shortform posts, and googling with site:lesswrong.com doesn't surface it either. Thanks for pointing that out.
Ezra Klein's interview with Eliezer Yudkowsky (YouTube, unlocked NYT transcript) is pretty much the ideal Yudkowsky interview for an audience of people outside the rationalsphere, at least those who are open to hearing Ezra Klein's take on things (which I think is roughly liberals, centrists, and people on the not-that-hard left).
Klein is smart, and a talented interviewer. He's skeptical but sympathetic. He's clearly familiar enough with Yudkowsky's strengths and weaknesses in interviews to draw out his more normie-appealing side. He covers all the important points rather than letting the discussion get too stuck on any one point. If it reaches as many people as most of Klein's interviews do, I think it may even have a significant counterfactual impact.
I'll be sharing it with a number of AI-risk-skeptical people in my life, and insofar as you think it's good for more people to really get the basic arguments — even if you don't fully agree with Eliezer's take on it — you may want to do the same.
[EDIT: please go here for further discussion, no need to split it]
Looking at Anthropic's documentation of the feature, it seems like it does support searching past chats, but has other effects as well. Quoting selectively:
You can now prompt Claude to search through your previous conversations to find and reference relevant information in new chats. Additionally, Claude can remember context from previous chats, creating continuity across your conversations.
...
Claude can now generate memory based on your chat history. With the addition of memory, Claude transforms from a stateless chat interface into a knowledgeable collaborator that builds understanding over time.
They also say that you can 'see exactly what Claude remembers about you by navigating to Settings > Capabilities and clicking “View and edit memory”', but that setting doesn't exist for me.
It would actually be helpful for your project to understand active inference at least a bit. Empirically, it seems that has-repeatedly-read-Scott-Alexander's-posts-on-it leads people to some weird epistemic state.
Fair enough — is there a source you'd most recommend for learning more?
You could be right; my sample size is limited here! And I did talk with one person who said that they had that feature turned off and had still noticed sycophantic behavior. If it's correct that it only looks at past chats when the user requests that, then I agree that the feature seems unlikely to be related.
Ah, yeah, I definitely get 'You're right to push back'; I feel like that's something I see from almost all models. I'm totally making this up, but I've assumed that was encouraged by the model trainers so that people would feel free to push back, since it's a known failure mode — or at least was for a while — that some users assume the AI is perfectly logical and all-knowing.
Just a short heads-up that although Anthropic found that Sonnet 4.5 is much less sycophantic than its predecessors, I and a number of other people have observed that it engages in 4o-level glazing in a way that I haven't encountered with previous Claude versions ('You're really smart to question that, actually...', that sort of thing). I'm not sure whether Anthropic's tests fail to capture the full scope of Claude behavior, or whether this is related to another factor — most people I talked to who were also experiencing this had the new 'past chats' feature turned on (as did I), and since I turned that off I've seen less sycophancy.
Thanks! If you find research that addresses that question, I'd be interested to know about it.