TL;DR

Models might form detailed representations of the training task distribution and use this to sandbag at deployment time by exploiting even subtle distribution shifts. If successive rounds of training update the model's representations of the task distribution rather than its underlying tendencies, then we could be left playing whack-a-mole: having to repeatedly retrain the model after even the smallest distribution shifts. I give a detailed description of this failure mode, potential interventions that labs might take, and concrete directions for model organisms research.

Epistemic status: This started off as an internal memo explaining my research proposal of constructing 'model organisms resisting generalisation' to my MARS 4.0 team; I then realised it might be valuable to share with the community. I see this as 'an attempt to articulate a detailed example of a generalisation threat'. I haven't had heaps of time to iterate on the post yet, and realise many of the ideas are already circulating, but I think there is value in this synthesis, in particular with regard to fixing terminology and prioritising future empirical research directions.

Background

As I understand it, AI companies currently apply the majority of post-training compute to curated task suites. A good chunk of these are environments for Reinforcement Learning with Verifiable Rewards (RLVR), typically agentic coding tasks. There are also task suites corresponding to agentic alignment scenarios, as well as standard chat-style RLHF scenarios. There is little effort to ensure the training suite accurately reflects or covers the space of desired deployment situations. As such, labs are largely relying upon favourable generalisation for good performance. For example, a lab might hope that a model trained with a combination of RLVR on coding tasks, RLHF on chat tasks, and some limited RLHF on coding environments will end up being a helpful coding assistant. However, the current success of generalisa