eggsyntax

AI safety & alignment researcher

In Rob Bensinger's typology: AGI-alarmed, tentative welfarist, and eventualist (or variabilist for sufficiently long values of 'soon').

I have signed no contracts or agreements whose existence I cannot mention.

Sequences

General Reasoning in LLMs

Comments (sorted by newest)

So You Think You've Awoken ChatGPT
eggsyntax · 1d

"if you've experienced the following"

Suggestion: rephrase to 'one or more of the following'; otherwise it would be easy for relevant readers to think, 'Oh, I've only got one or two of those, I'm fine.'

eggsyntax's Shortform
eggsyntax · 1d

Interesting new paper that examines this question:

When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors (see §5)

On the functional self of LLMs
eggsyntax · 1d

Not having heard back, I'll go ahead and try to connect what I'm saying to your posts, just to close the mental loop:

  • It would be mostly reasonable to treat this agenda as being about what's happening at the second, 'character' level of the three-layer model. That said, while I find the three-layer model a useful phenomenological lens, it doesn't reflect a clean distinction in the model itself; on some level all responses involve all three layers, even if it's helpful in practice to focus on one at a time. In particular, the base layer is ultimately made up of models of characters, in a Simulators-ish sense (the Simplex work provides a useful concrete grounding here, treating 'characters' as the distinct causal processes that generate different parts of the training data). Post-training progressively both enriches and centralizes a particular character or superposition of characters, and that is what this agenda tries to investigate.
  • The three-layer model doesn't seem to have much to say (at least in the post) about what, at the second level, distinguishes a context-prompted ephemeral persona from the richer and more persistent character that the model consistently returns to (which is informed by, but not identical to, the assistant persona), whereas that's exactly where this agenda is focused. The difference is at least partly quantitative, but it's the sort of quantitative difference that adds up to a qualitative difference; eg I expect Claude has far more circuitry dedicated to its self-model than to its model of Gandalf. And there may be entirely qualitative differences as well.
  • With respect to active inference, even if we assume that active inference is a complete account of human behavior, there are still a lot of things we'd want to say about human behavior that wouldn't be very usefully expressed in active inference terms, for the same reasons that biology students don't just learn physics and call it a day. As a relatively dramatic example, consider the stories that people tell themselves about who they are -- even if that cashes out ultimately into active inference, it makes way more sense to describe it at a different level. I think that the same is true for understanding LLMs, at least until and unless we achieve a complete mechanistic-level understanding of LLMs, and probably afterward as well.
  • And finally, the three-layer model is, as it says, a phenomenological account, whereas this agenda is at least partly interested in what's going on in the model's internals that drives that phenomenology.
E.T. Jaynes Probability Theory: The logic of Science I
eggsyntax · 2d

Hmm, ok, I see that that's true provided that we assume A necessarily makes B more likely, which certainly seems like the intended reading. Seems like kind of a weird point for them to make in that context (partly because B may often only be trivial evidence of A, as in the raven paradox), so I wonder if it may have been a typo on their part. But as you point out it does work either way. Thanks!
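
Making the direction-of-evidence step explicit (this is just Bayes' theorem, stated for concreteness): if A makes B more likely, then observing B must raise the probability of A, however slightly:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} > P(A) \quad \text{whenever } P(B \mid A) > P(B).$$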

(minor note: I realized I had an error in my comment -- unless my thought process at the time was pretty different from what I now imagine it to be -- so I edited it slightly. It doesn't really affect your point.)

Raemon's Shortform
eggsyntax · 2d

"Again you can look at https://www.lesswrong.com/moderation#rejected-posts to see the actual content and verify numbers/quality for yourself."

Having just done so, I now have additional appreciation for LW admins; I didn't realize the role involved wading through so much of this sort of thing. Thank you!

How Self-Aware Are LLMs?
eggsyntax · 2d

Really interesting work! A few questions and comments:

  1. How many questions & answers are shown in the phase 1 results for the model and for the teammate?
  2. Could the results be explained just by the model being able to distinguish harder from easier questions, and delegating the harder questions to its teammate, without any kind of clear model of itself or its teammate?
  3. I worry slightly that including 'T' in the choices will have weird results because it differs so much from standard multiple choice; did you consider first asking whether the model wanted to delegate, and then (if not delegating) having it pick one of A–D? The delegation question could be posed either with or without showing the model the answer options first; I'm not sure which would work better (rough sketch of the two-step flow below).
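
A minimal sketch of the two-step flow described in (3), with a hypothetical `ask_model` placeholder standing in for however the model is actually queried (nothing here is taken from the paper's setup):

```python
# Two-step delegate-then-answer flow: first decide whether to delegate,
# and only if not delegating, answer the multiple-choice question directly.

def ask_model(prompt: str) -> str:
    # Hypothetical placeholder so the sketch runs end-to-end; swap in a real
    # chat-completion call here.
    return "ANSWER" if "DELEGATE" in prompt else "A"


def answer_or_delegate(question: str, options: dict[str, str],
                       show_options_first: bool = False) -> str:
    """Step 1: ask whether to delegate. Step 2 (only if answering): pick A-D."""
    delegation_prompt = f"Question: {question}\n"
    if show_options_first:
        delegation_prompt += "".join(f"{k}. {v}\n" for k, v in options.items())
    delegation_prompt += (
        "Would you rather answer this question yourself or hand it to your "
        "teammate? Reply with exactly ANSWER or DELEGATE."
    )
    if ask_model(delegation_prompt).strip().upper().startswith("DELEGATE"):
        return "T"  # delegate to the teammate

    answer_prompt = (
        f"Question: {question}\n"
        + "".join(f"{k}. {v}\n" for k, v in options.items())
        + "Reply with exactly one letter: A, B, C, or D."
    )
    return ask_model(answer_prompt).strip()[:1].upper()


if __name__ == "__main__":
    opts = {"A": "Paris", "B": "Lyon", "C": "Marseille", "D": "Nice"}
    print(answer_or_delegate("Which city is the capital of France?", opts,
                             show_options_first=True))
```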
Twitter thread on postrationalists
eggsyntax · 3d

See also Project Lawful's highly entertaining riff on postrationalism and other variants of rationalism (here 'Sevarism').

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens
eggsyntax · 3d

I agree that updates to the capabilities of the main model would require updates to the supervising model. Separate post-trains might help limit that; alternatively, the supervising model could be created as a further fine-tune / post-train of the main model, so that an update to the main model would only require repeating the hopefully-not-that-heavyweight fine-tune/post-train on top of the updated version.

You could be right that for some reason the split personality approach turns out to work better than that approach despite my skepticism. I imagine it would have greater advantages if/when we start to see more production models with continual learning. I certainly wish you luck with it!

the jackpot age
eggsyntax · 3d

Wanted to get an intuitive feel for it, so here's a quick vibecoded simulation & distribution:

https://github.com/eggsyntax/jackpot-age-simulation

(Seems right but not guaranteed bug-free; I haven't even looked at the code)
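
For the general shape of it, a minimal sketch of the kind of simulation involved (illustrative bet parameters, not the code in the repo linked above): each round is a positive-EV multiplicative bet, but the expected log-return is negative, so the typical trajectory shrinks while the mean is carried by rare jackpot runs.

```python
# Repeated multiplicative bets: each round multiplies wealth by `up` with
# probability p, else by `down`. EV per round is 1.05x, but expected
# log-growth is negative, so the median collapses while rare jackpot
# trajectories carry the mean.
import random
import statistics

def simulate(n_players: int = 100_000, n_rounds: int = 30,
             p: float = 0.5, up: float = 1.5, down: float = 0.6) -> list[float]:
    wealths = []
    for _ in range(n_players):
        w = 1.0
        for _ in range(n_rounds):
            w *= up if random.random() < p else down
        wealths.append(w)
    return wealths

if __name__ == "__main__":
    ws = simulate()
    print(f"per-round EV multiplier: {0.5 * 1.5 + 0.5 * 0.6:.2f}")           # 1.05
    print(f"mean final wealth:   {statistics.mean(ws):.2f}")                  # near 1.05**30 ≈ 4.3
    print(f"median final wealth: {statistics.median(ws):.3f}")                # well below 1
    print(f"fraction ending ahead: {sum(w > 1 for w in ws) / len(ws):.3f}")   # a minority
```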

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens
eggsyntax · 4d

It seems strictly simpler to use a separate model (which could be a separate post-train of the same base model) than to try to train multiple personalities into the same model.

  • We don't know whether we even can train a model to switch between multiple personalities at inference time and keep them cleanly separated. To the extent that circuitry is reused between personalities, bleedover seems likely.
  • To the extent that circuitry isn't reused, it means the model has to dedicate circuitry to the second personality, likely resulting in decreased capabilities for the main personality.
  • Collusion seems more likely between split personalities than between separate models.
  • It's not hard to provide a separate model with access to the main model's activations (minimal sketch of what I mean below).

It's not clear to me that this has any actual benefits over using a separate model (which, again, could just be a different post-train of the same base model).
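
On the activations point above, a minimal sketch of the plumbing, assuming HuggingFace GPT-2 as a stand-in for both the main model and the separately post-trained monitor (the projection layer is untrained and purely illustrative; a real setup would train it along with the monitor):

```python
# Hand a separate monitor model the main model's intermediate activations,
# injected as extra "soft tokens" ahead of the text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("gpt2")
main_model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
# Stand-in for a separate post-train of the same base model:
monitor_model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

prompt = "User: please summarize the quarterly report.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Run the main model, keeping every layer's residual-stream activations.
with torch.no_grad():
    main_out = main_model(**inputs, output_hidden_states=True)

# Pick a middle layer: shape (batch, seq_len, d_model).
mid_acts = main_out.hidden_states[len(main_out.hidden_states) // 2]

# Project into the monitor's embedding space and prepend to the text embeddings.
proj = torch.nn.Linear(mid_acts.shape[-1], monitor_model.config.n_embd).to(device)
soft_tokens = proj(mid_acts)
text_embeds = monitor_model.get_input_embeddings()(inputs["input_ids"])
monitor_inputs = torch.cat([soft_tokens, text_embeds], dim=1)

with torch.no_grad():
    monitor_out = monitor_model(inputs_embeds=monitor_inputs)

print(monitor_out.logits.shape)  # (1, 2 * seq_len, vocab_size)
```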

Posts (karma · title · age · comments)

90 · On the functional self of LLMs · 11d · 31
104 · Show, not tell: GPT-4o is more opinionated in images than in text · 4mo · 41
71 · Numberwang: LLMs Doing Autonomous Research, and a Call for Input · Ω · 6mo · 30
94 · LLMs Look Increasingly Like General Reasoners · 8mo · 45
30 · AIS terminology proposal: standardize terms for probability ranges · Ω · 11mo · 12
219 · LLM Generality is a Timeline Crux · Ω · 1y · 119
159 · Language Models Model Us · Ω · 1y · 55
26 · Useful starting code for interpretability · 1y · 2
3 · eggsyntax's Shortform · 2y · 199

Wikitag Contributions

Logical decision theories · 3y · (+5/-3)