eggsyntax

AI safety & alignment researcher

In Rob Bensinger's typology: AGI-alarmed, tentative welfarist, and eventualist.

Public stance: AI companies are doing their best to build ASI (AI much smarter than humans), and have a chance of succeeding. No one currently knows how to build ASI without an unacceptable level of existential risk (> 5%). Therefore, companies should be forbidden from building ASI until we know how to do it safely.

I have signed no contracts or agreements whose existence I cannot mention.

Sequences

General Reasoning in LLMs

Comments
Sorted by Newest
eggsyntax's Shortform
eggsyntax · 15h · 20

Looking at Anthropic's documentation of the feature, it seems like it does support searching past chats, but has other effects as well. Quoting selectively:

You can now prompt Claude to search through your previous conversations to find and reference relevant information in new chats. Additionally, Claude can remember context from previous chats, creating continuity across your conversations.

...

Claude can now generate memory based on your chat history. With the addition of memory, Claude transforms from a stateless chat interface into a knowledgeable collaborator that builds understanding over time.

They also say that you can 'see exactly what Claude remembers about you by navigating to Settings > Capabilities and clicking “View and edit memory”', but that setting doesn't exist for me.

On the functional self of LLMs
eggsyntax · 16h · 20

it would actually be helpful for your project to understand active inference at least a bit. Empirically it seems has-repeatedly-read-Scott-Alexander's-posts-on-it leads people to some weird epistemic state

Fair enough — is there a source you'd most recommend for learning more?

eggsyntax's Shortform
eggsyntax · 16h · 20

You could be right; my sample size is limited here! And I did talk with one person who said that they had that feature turned off and had still noticed sycophantic behavior. If it's correct that it only looks at past chats when the user requests that, then I agree that the feature seems unlikely to be related.

eggsyntax's Shortform
eggsyntax · 2d · 60

Ah, yeah, I definitely get 'You're right to push back'; I feel like that's something I see from almost all models. I'm totally making this up, but I've assumed that was encouraged by the model trainers so that people would feel free to push back, since it's a known failure mode — or at least was for a while — that some users assume the AI is perfectly logical and all-knowing.

eggsyntax's Shortform
eggsyntax · 2d · 229

Just a short heads-up that although Anthropic found that Sonnet 4.5 is much less sycophantic than its predecessors, I and a number of other people have observed that it engages in 4o-level glazing in a way that I haven't encountered with previous Claude versions ('You're really smart to question that, actually...', that sort of thing). I'm not sure whether Anthropic's tests fail to capture the full scope of Claude behavior, or whether this is related to another factor — most people I talked to who were also experiencing this had the new 'past chats' feature turned on (as did I), and since I turned that off I've seen less sycophancy.

eggsyntax's Shortform
eggsyntax · 5d* · 20

Could they implement similar backdoor in you?...My guess is not

Although people have certainly tried...

My guess is not, and one reason (there are also others but that's a different topic) is that humans like me and you have a very deep belief "current date doesn't make a difference for whether abortion is good and bad" that is extremely hard to overwrite without hurting our cognition in other contexts. Like, what is even good and bad if in some cases they flip at midnight?

I'm being a bit tangential here, but a couple of thoughts:

  • Do we actually have that belief? There are an unbounded number of things that by default we don't let affect our values, and we can't be actively representing all of them in our bounded brain (eg we can pick some house in the world at random — whether its lights are on doesn't affect my values regarding abortion, but I sure didn't have an actual policy on that in my brain).
  • I'm sure you're familiar with this, but your example reminds me of the 'new riddle of induction': why aren't grue and bleen just as reasonable as blue and green are?
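
(For anyone who hasn't run into the reference: below is a minimal sketch of how grue/bleen are often defined. The cutoff date and the exact formulation, here keyed to when a thing is first observed, are just one common presentation I've chosen for illustration.)

```python
from datetime import datetime

# Arbitrary cutoff date; nothing in the riddle privileges this particular choice.
T = datetime(2030, 1, 1)

def is_grue(color: str, first_observed: datetime) -> bool:
    # Grue: green if first observed before T, blue otherwise.
    return color == "green" if first_observed < T else color == "blue"

def is_bleen(color: str, first_observed: datetime) -> bool:
    # Bleen: blue if first observed before T, green otherwise.
    return color == "blue" if first_observed < T else color == "green"

# The symmetry that makes the riddle bite: 'green' can equally well be defined
# as 'grue if first observed before T, bleen otherwise', so nothing purely
# formal marks blue/green as the natural predicates and grue/bleen as the
# gerrymandered ones.
```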

I agree that it's hard to imagine what cognitive changes would have to happen for me to have a value with that property. I don't think I have very good intuitions about how much it would affect my overall cognition, though. What you're saying feels plausible to me, but I don't have much confidence either way.

eggsyntax's Shortform
eggsyntax · 6d · 40

I agree humans absorb (terminal) values from people around them. But this property isn't something I want in a powerful AI. I think it's clearly possible to design an agent that doesn't have the "absorbs terminal values" property, do you agree?

I do! Although my expectation is that for LLMs and similar approaches, values will end up tangled with beliefs the way they are in humans (this would actually be a really interesting line of research: after training a particular value into an LLM, is there some set of facts that could be provided which would result in that value changing?), so we may not get that desirable property in practice :(

Yeah! I see this as a different problem from the value binding problem, but just as important.

For sure — the only thing I'm trying to claim in the OP is that there exists some set of observations which, if the LLM made them, would result in the LLM being unaligned-from-our-perspective, and so it seems impossible to fully guarantee the stability of alignment.

Your split seems promising; being able to make problems increasingly unlikely given more observation and thinking would be a really nice property to have. It seems like it might be hard to create a trigger for your #2 that wouldn't cause the AI to shut down every time it did novel research.

You can just wear a suit
eggsyntax · 9d · 40

I'm reminded of the classic Onion article, "Why Can't Anyone Tell I'm Wearing This Business Suit Ironically?"

See also normcore, possibly my favorite fashion moment of all time (#healthgoth was pretty great too, although less so once anyone started to take it seriously).

You can just wear a suit
eggsyntax · 9d · 22

This is crucial. Suits can send a really wide range of signals depending on the style, the fit, what you wear them with, attitude, etc. You may or may not care about the signals you're sending, but I think it's at least worth being aware that there's not a single, fixed 'suit' message (you may already be aware of this, but I'd guess that not all readers are).

eggsyntax's Shortform
eggsyntax · 9d · 20

[addendum]

In that situation, how do we want an AI to act? There's a few options, but doing nothing seems like a good default. Constructing the value binding algorithm such that this is the resulting behaviour doesn't seem that hard, but it might not be trivial.

(and I imagine that the kind of 'my values bind to something, but in such a way that it'll cause me to take very different options than before' I describe above is much harder to specify)

Posts

144 · Your LLM-assisted scientific breakthrough probably isn't real · 2mo · 39
110 · On the functional self of LLMs · Ω · 3mo · 37
112 · Show, not tell: GPT-4o is more opinionated in images than in text · 7mo · 41
71 · Numberwang: LLMs Doing Autonomous Research, and a Call for Input · Ω · 9mo · 30
94 · LLMs Look Increasingly Like General Reasoners · 1y · 45
30 · AIS terminology proposal: standardize terms for probability ranges · Ω · 1y · 12
219 · LLM Generality is a Timeline Crux · Ω · 1y · 119
159 · Language Models Model Us · Ω · 1y · 55
26 · Useful starting code for interpretability · 2y · 2
3 · eggsyntax's Shortform · 2y · 251

Wikitag Contributions

Logical decision theories · 3 years ago · (+5/-3)