LESSWRONG

eggsyntax
2587 karma

AI safety & alignment researcher

In Rob Bensinger's typology: AGI-alarmed, tentative welfarist, and eventualist.

Public stance: AI companies are doing their best to build ASI (AI much smarter than humans), and have a chance of succeeding. No one currently knows how to build ASI without an unacceptable level of existential risk (> 5%). Therefore, companies should be forbidden from building ASI until we know how to do it safely.

I have signed no contracts or agreements whose existence I cannot mention.

Sequences

General Reasoning in LLMs

Comments (sorted by newest)

Nina Panickssery's Shortform
eggsyntax · 16h

Thanks! If you find research that addresses that question, I'd be interested to know about it.

Nina Panickssery's Shortform
eggsyntax · 16h

For some applications, you may want to express something in terms of the model’s own abstractions

It seems like this applies to some kinds of activation steering (e.g. steering on SAE features) but not really to others (e.g. contrastive prompts); curious whether you'd agree.
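To make the contrast concrete, here's a minimal sketch of the contrastive-prompts variant as I understand it; the model, layer, prompts, and steering strength below are purely illustrative choices, not anything taken from your post:

```python
# Minimal sketch: contrastive-prompt activation steering with a small HuggingFace model.
# The steering vector is the difference in residual-stream activations between two
# contrastive prompts, and is added back into the residual stream during generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # illustrative; any causal LM that exposes hidden states works
layer = 6             # illustrative choice of transformer block
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def mean_resid(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation after block `layer`, averaged over tokens."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block `layer` is index layer + 1
    return out.hidden_states[layer + 1][0].mean(dim=0)

# The steering direction comes directly from a contrastive pair of prompts;
# no SAE features or learned dictionary are involved.
steering_vec = (
    mean_resid("I speak cautiously and hedge my claims.")
    - mean_resid("I speak confidently and bluntly.")
)

def add_steering(_module, _inputs, output):
    # Add the scaled steering vector at every token position at this layer.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steering_vec  # 4.0 is an illustrative strength
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer].register_forward_hook(add_steering)
prompt_ids = tok("My honest opinion is", return_tensors="pt")
print(tok.decode(model.generate(**prompt_ids, max_new_tokens=20, do_sample=False)[0]))
handle.remove()
```

The direction here comes straight from a pair of prompts in activation space, so it isn't expressed in terms of features the model's own learned dictionary (e.g. an SAE) would pick out.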

eggsyntax's Shortform
eggsyntax · 11d

Yeah, it seems like so many people mentally tagged that as 'Anthropic research', which is a shame. @Eliezer Yudkowsky FYI for future interviews.

eggsyntax's Shortform
eggsyntax · 11d

Whoops, no, I didn't; I'll edit the post to direct people there for discussion. Unfortunately the LW search interface doesn't surface shortform posts, and googling with site:lesswrong.com doesn't find it either. Thanks for pointing that out.

eggsyntax's Shortform
eggsyntax · 12d (edited)

Ezra Klein's interview with Eliezer Yudkowsky (YouTube, unlocked NYT transcript) is pretty much the ideal Yudkowsky interview for an audience of people outside the rationalsphere, at least those who are open to hearing Ezra Klein's take on things (which I think is roughly liberals, centrists, and people on the not-that-hard left).

Klein is smart, and a talented interviewer. He's skeptical but sympathetic. He's clearly familiar enough with Yudkowsky's strengths and weaknesses in interviews to draw out his more normie-appealing side. He covers all the important points rather than letting the discussion get too stuck on any one point. If it reaches as many people as most of Klein's interviews, I think it may even have a significant counterfactual impact.

I'll be sharing it with a number of AI-risk-skeptical people in my life, and insofar as you think it's good for more people to really get the basic arguments — even if you don't fully agree with Eliezer's take on it — you may want to do the same.

[EDIT: please go here for further discussion, no need to split it]

eggsyntax's Shortform
eggsyntax · 13d

Looking at Anthropic's documentation of the feature, it seems like it does support searching past chats, but has other effects as well. Quoting selectively:

You can now prompt Claude to search through your previous conversations to find and reference relevant information in new chats. Additionally, Claude can remember context from previous chats, creating continuity across your conversations.

...

Claude can now generate memory based on your chat history. With the addition of memory, Claude transforms from a stateless chat interface into a knowledgeable collaborator that builds understanding over time.

They also say that you can 'see exactly what Claude remembers about you by navigating to Settings > Capabilities and clicking “View and edit memory”', but that setting doesn't exist for me.

On the functional self of LLMs
eggsyntax · 13d

it would actually be helpful for your project to understand active inference at least a bit. Empirically it seems has-repeatedly-read-Scott-Alexander's-posts-on-it leads people to some weird epistemic state

Fair enough — is there a source you'd most recommend for learning more?

eggsyntax's Shortform
eggsyntax · 13d

You could be right; my sample size is limited here! And I did talk with one person who said that they had that feature turned off and had still noticed sycophantic behavior. If it's correct that it only looks at past chats when the user requests that, then I agree that the feature seems unlikely to be related.

eggsyntax's Shortform
eggsyntax · 14d

Ah, yeah, I definitely get 'You're right to push back'; I feel like that's something I see from almost all models. I'm totally making this up, but I've assumed that was encouraged by the model trainers so that people would feel free to push back, since it's a known failure mode — or at least was for a while — that some users assume the AI is perfectly logical and all-knowing.

eggsyntax's Shortform
eggsyntax · 14d

Just a short heads-up that although Anthropic found that Sonnet 4.5 is much less sycophantic than its predecessors, I and a number of other people have observed that it engages in 4o-level glazing in a way that I haven't encountered with previous Claude versions ('You're really smart to question that, actually...', that sort of thing). I'm not sure whether Anthropic's tests fail to capture the full scope of Claude behavior, or whether this is related to another factor — most people I talked to who were also experiencing this had the new 'past chats' feature turned on (as did I), and since I turned that off I've seen less sycophancy.

Posts (sorted by new)

Your LLM-assisted scientific breakthrough probably isn't real (144 karma · 2mo · 39 comments)
On the functional self of LLMs (Ω · 113 karma · 4mo · 37 comments)
Show, not tell: GPT-4o is more opinionated in images than in text (112 karma · 7mo · 41 comments)
Numberwang: LLMs Doing Autonomous Research, and a Call for Input (Ω · 71 karma · 9mo · 30 comments)
LLMs Look Increasingly Like General Reasoners (94 karma · 1y · 45 comments)
AIS terminology proposal: standardize terms for probability ranges (Ω · 30 karma · 1y · 12 comments)
LLM Generality is a Timeline Crux (Ω · 219 karma · 1y · 119 comments)
Language Models Model Us (Ω · 159 karma · 1y · 55 comments)
Useful starting code for interpretability (26 karma · 2y · 2 comments)
eggsyntax's Shortform (3 karma · 2y · 262 comments)

Wikitag Contributions

Logical decision theories · 3 years ago · (+5/-3)