J Bostock

Sequences

Dead Ends
Statistical Mechanics
Independent AI Research
Rationality in Research

Comments

I think we can go further than this with distillation. One question I have: if you distil from a model which is already 'aligned', do you get an 'aligned' model out of it?

Could you use this to transfer 'alignment' from a smaller teacher to a larger student, then do some RL to bring the larger model up in performance? This would get around the problem we currently have, where labs have to first make a smart unaligned model, then try to wrestle it into shape.
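To make that concrete, here's a minimal sketch of the kind of distillation step I have in mind, assuming HuggingFace-style models and the standard temperature-scaled KL loss on logits; the function names and hyperparameters are illustrative, not a recipe anyone has published.

```python
# Minimal sketch: distil a (smaller, already-'aligned') teacher into a larger student,
# before any subsequent RL. Assumes models that return .logits, HuggingFace-style.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # batchmean KL, scaled by T^2 as in Hinton et al.'s original distillation setup
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2

def distillation_step(student, teacher, input_ids, optimizer):
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits  # frozen 'aligned' teacher
    student_logits = student(input_ids).logits
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```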

Hypothesis: one type of valenced experience (specifically valenced experience, as opposed to conscious experience in general, which I make no claims about here) is likely to only exist in organisms with the capability for planning. We can analogize with deep reinforcement learning: it seems like humans have a rapid action-taking system 1, which is kind of like Q-learning in that it just selects actions; we also have a slower planning-based system 2, which is more like value learning. There's no reason to assign valence to a particular mental state if you're not able to imagine your own future mental states. There is of course moment-to-moment reward-like information coming in, but that seems to me to be a distinct thing.
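To illustrate the analogy (a toy, not a claim about how brains implement either system): a model-free policy just looks up cached action values, whereas a planner imagines future states and evaluates them, which is where valence over imagined states would come in. The environment, values, and sizes below are all made up.

```python
# Toy contrast between "system 1" (model-free lookup) and "system 2" (model-based planning).
import numpy as np

n_states, n_actions, horizon, gamma = 5, 2, 3, 0.9
Q = np.random.rand(n_states, n_actions)   # cached action values ("system 1")
V = np.random.rand(n_states)              # valuations of states, used when planning
transition = np.random.randint(n_states, size=(n_states, n_actions))  # toy world model

def act_system1(state):
    # Reflexive: pick the action with the highest cached value; no future states imagined.
    return int(np.argmax(Q[state]))

def act_system2(state):
    # Deliberative: roll the world model forward and score imagined future states.
    # (Immediate rewards omitted for brevity; only imagined-state values matter for the point here.)
    def rollout_value(s, depth):
        if depth == 0:
            return V[s]
        return max(gamma * rollout_value(transition[s, a], depth - 1) for a in range(n_actions))
    return int(np.argmax([rollout_value(transition[state, a], horizon - 1) for a in range(n_actions)]))
```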

I prefer Opus 3's effort to Opus 4's. I have found Opus 4 to be missing quite a bit of the Claude charm and skill. Anthropic have said it went through a lot of rounds of RL to stop it being deceptive and scheming. Perhaps their ability to do light-touch RL that gets models to behave, but doesn't mode-collapse the model too much, doesn't extend to this capability level.


The latest recruitment ad from Aiden McLaughlin says a lot about OpenAI's internal views on model training:

[Image: screenshot of the recruitment ad]

My interpretation of OpenAI's worldview, as implied by this, is:

  1. Inner alignment is not really an issue. Training objectives (evals) relate to behaviour in a straightforward and predictable way.
  2. Outer alignment kinda matters, but it's not that hard. Deciding the parameters of desired behaviour is something that can be done without serious philosophical difficulties.
  3. Designing the right evals is hard: you need lots of technical skill and high taste to make evals good enough to get the right behaviour.
  4. Oversight is important; in fact, it is the primary method for ensuring that the AIs are doing what we want. Oversight is tractable and doable.

None of this dramatically conflicts with what I already thought OpenAI believed, but it's interesting to get another angle on it.

It's quite possible that point 1 is predicated on technical alignment work being done in other parts of the company (though their superalignment team no longer exists) and it's just not seen as the purview of the evals team. If so, it's still very optimistic. If there isn't such a team, then it's suicidally optimistic.

For point 2, I think the above ad does imply that the evals/RL team is handling all of the questions of "how should a model behave", and that they're mostly not looking at it from the perspective of moral philosophy à la Amanda Askell at Anthropic. If questions of how models behave are entirely being handled by people selected only for artistic talent + autistic talent, then I'm concerned these won't be done well either.

Point 3 seems correct, in that well-designed evals are hard to make and you need skills beyond the purely technical. Nice, but it's telling that they're doing well on the issue which brings in immediate revenue, and badly on the issues that get us killed at some point in the future.

Point 4 is kinda contentious. Some very smart people take oversight very seriously, but it also seems kinda doomed as an agenda when it comes to ASI. Seems like OpenAI are thinking about at least one not-kill-everyoneism plan, but a marginally promising one at best. Still, if we somehow make miraculous progress on oversight, perhaps OpenAI will take up those plans.

Finally, I find the mention of "The Gravity of AGI" to be quite odd, since I've never got the sense that Aiden feels the gravity of AGI particularly strongly. As an aside, I think that "feeling the AGI" is like enlightenment, where everyone behind you on the journey is a naive fool and everyone ahead of you is a crazy doomsayer.

EDIT: a fifth implication: little to no capabilities generalization. Seems like they expect each individual capability to be produced by a specific high-quality eval, rather than for their models to generalize broadly to a wide range of tasks.

There are two parts here.

  1. Are people using escalating hints to express romantic/sexual interest in general?
  2. Does it follow the specific conversational patterns usually used?

1 is true in my experience, while 2 usually isn't. I can think of two examples where I've flirted by escalating signals. In both cases it was more to do with escalating physical touch and proximity, though verbal tone also played a part. I would guess that the typical examples of 2 you normally see (like A complimenting B's choice of shoes, then B using a mild verbal innuendo, then A making a comment about B's figure) don't happen as often, since not many people are good enough wordsmiths to do the escalation purely verbally.

Plus it's not the Victorian era anymore and it's acceptable to escalate by slowly leaning forward as the conversation progresses, almost-accidentally brushing someone's hand, etc.

Something I've found really useful is to give Claude a couple of examples of Claude-isms (in my case "the key insight" and "fascinating") and say "In the past, you've over-used these phrases: [phrases] you might want to cut down on them". This has shifted it away from all sorts of Claude-ish things; maybe it's down-weighting things at a higher level.
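For concreteness, the pattern I mean looks roughly like the following, written against the Anthropic Python SDK; the exact wording, model name, and phrase list are just my own choices, not a tested recipe.

```python
# Sketch of the "list the over-used phrases in the system prompt" pattern described above.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

overused = ["the key insight", "fascinating"]  # whatever Claude-isms you've noticed
system_prompt = (
    "In the past, you've over-used these phrases: "
    + ", ".join(f'"{p}"' for p in overused)
    + ". You might want to cut down on them."
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model id
    max_tokens=1024,
    system=system_prompt,
    messages=[{"role": "user", "content": "Summarise the main argument of this post."}],
)
print(response.content[0].text)
```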

Even if ~all that pausing does is delay existential risk by 5 years, isn't that still totally worth it? If we would otherwise die of AI ten years from now, then a pause creates +50% more value in the future. Of course it's a far cry from all 1e50 future QALYs we could maybe create, but I'll take what I can get at this point. And a short-termist view would count that as even more important.
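Spelling out the arithmetic, under the crude assumption that future value accrues roughly linearly in the years remaining:

```latex
% Baseline: extinction at t = 10 years. A pause adds \Delta t = 5 years.
\[
  \frac{V_{\text{paused}}}{V_{\text{baseline}}}
  = \frac{t + \Delta t}{t}
  = \frac{10 + 5}{10}
  = 1.5,
\]
% i.e. roughly +50% more value, under that linear assumption.
```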

I appreciate your analysis. It was fun to try my best and then check your comments for the real answer, more so than just getting it from the creator.

OK, so: based on a bunch of calibration plots, mutual information plots, and two-way scatter plots comparing candidates, this is what I have.

Candidate 11 is the best choice. 7 and 34 are my second choices, though 19 also looks pretty good.

Holly gives the most information; she's the best predictor overall, followed by Ziqual. Amy is literally useless. Colleen and Linestra are equivalent. Holly and Ziqual both agree on candidate 11, so I'll choose them.

Interestingly, some choosers like to rank clusters of individuals at exactly the same value, and it isn't clear why. None of our current candidates fall into those weird clusters, so maybe it's historical?

Also, lots of the numbers end in .7; I guess the faeries just love the number 7. I think there are at least three stats going on, and each predictor is seeing some function of those stats, since many of the heatmaps look like a discrete grid.
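For anyone curious about the method rather than the conclusions, the comparison was roughly of this shape. The sketch assumes a hypothetical CSV with one column per predictor plus a numeric outcome column; the filename and layout are assumptions, not the actual scenario file.

```python
# Rough sketch of comparing predictors: how much information does each carry about the outcome,
# and which pairs are effectively duplicates of each other?
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_regression

df = pd.read_csv("candidates.csv")  # hypothetical filename
predictors = ["Holly", "Ziqual", "Amy", "Colleen", "Linestra"]

# Estimated mutual information between each predictor's scores and the outcome.
mi = mutual_info_regression(df[predictors], df["outcome"])
for name, score in sorted(zip(predictors, mi), key=lambda x: -x[1]):
    print(f"{name}: {score:.3f}")

# Two-way scatter plots to spot predictors that move together (e.g. Colleen vs Linestra).
pd.plotting.scatter_matrix(df[predictors], figsize=(10, 10))
plt.show()
```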

Heuristic explanation for why MoE gets better at larger model sizes:

The input/output width of a feedforward layer is equal to the model width, but the total size of its weights grows as the model width squared. Superposition helps explain how a model component can make the most use of its input/output space (and presumably its parameters) using sparse, overcomplete features, but in the limit, the amount of information accessed by the feedforward call scales with the number of active parameters. Therefore, at some point, adding active parameters won't scale so well, since you're "accessing" too much "memory" in the form of weights and overwhelming your input/output channels.
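A back-of-envelope version of that argument, with made-up widths and an illustrative MoE configuration:

```python
# The feed-forward block reads and writes a d_model-wide vector, but the weights it touches
# grow like d_model^2, so weights-touched-per-unit-of-I/O-width grows linearly with width.
# An MoE layer with top-k routing caps the active weights while total capacity keeps growing.
expansion = 4            # standard FFN expansion factor (d_ff = 4 * d_model)
n_experts, top_k = 8, 2  # illustrative MoE configuration, not any particular model's

for d_model in [1024, 4096, 16384]:
    dense_ffn_params = 2 * d_model * (expansion * d_model)  # W_in and W_out
    weights_per_io_dim = dense_ffn_params / d_model          # grows linearly with width
    moe_total = n_experts * dense_ffn_params                 # total capacity keeps growing
    moe_active = top_k * dense_ffn_params                    # active weights are capped
    print(f"d_model={d_model:6d}  dense weights per I/O dim={weights_per_io_dim:8.0f}  "
          f"MoE total={moe_total:.2e}  MoE active={moe_active:.2e}")
```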
