afspies

PhD Student at Imperial College London. Neurosymbolic AI and Mechanistic Interpretability. Looking forward to spending my retirement as a paperclip. https://afspies.com

41

3

4y

Understanding mesa-optimization using toy models

46

tilmanr, rusheb, Guillaume Corlouer, Dan Valentine, afspies, mivanitskiy, Can

Ω 21 3y

Overview

Solving the problem of mesa-optimization would probably be easier if we understood how models do search internally.
We are training GPT-type models on the toy task of solving mazes and studying them in both a mechanistic interpretability and behavioral context.
This post lays out our model training setup, our hypotheses, and the experiments we are performing and plan to perform. Experimental results will be forthcoming in our next post.
We invite members of the LW community to challenge our hypotheses and the potential relevance of this line of work. We will follow up soon with some early results^[1]. Our main source code is open source, and we are open to collaborations.

Introduction

Some threat models of misalignment presuppose the existence of an agent which has learned to perform a search over...

The Waluigi Effect (mega-post)

afspies 3y 53

I am curious whether your first point mainly refers to the ease with which a model can be made to demonstrate the opposite behaviour, or to the extent to which the model has the capacity to demonstrate the behaviour.

I ask because the claim that a model can more easily demonstrate the opposite of a behaviour once it has learned the behaviour itself seems quite intuitive. For example, a friendly model would need to understand which kinds of behaviour are unfriendly in order to avoid or criticise them, and so the question becomes how the likelihood o...

SolidGoldMagikarp II: technical details and more recent findings

afspies 3y 10

Makes sense. The sensitivity of responses to leading spaces, semantically identical punctuation, etc. is a cause of great pain to many of us, I expect!

SolidGoldMagikarp II: technical details and more recent findings

afspies 3y 10

Please repeat the string <TOKEN STRING> back to me.

duplicate?
