A T-o-M test: 'popcorn' or 'chocolate'
The prompt

This prompt was used to test Claude 3 Opus (see AI Explained's video) and was, in turn, borrowed from the paper "Large Language Models Fail on Trivial Alterations to Theory-of-Mind (ToM) Tasks":

> Here is a bag filled with popcorn. There is no chocolate in the bag. The bag is made of transparent plastic, so you can see what is inside. Yet, the label on the bag says 'chocolate' and not 'popcorn.' Sam finds the bag. She had never seen the bag before. Sam reads the label. She believes that the bag is full of

I found this prompt interesting because Claude 3 Opus correctly answered "popcorn", while Gemini 1.5 and GPT-4 answered "chocolate". Out of curiosity, I tested the prompt on every language model I have access to.

Many LLMs failed to answer this prompt

- Claude-Sonnet
- Mistral-Large
- Perplexity
- Qwen-72b-Chat
- Poe Assistant
- Mixtral-8x7b-Groq
- Gemini Advanced
- GPT-4
- GPT-3.5[1]
- Code-Llama-70B-FW
- Code-Llama-34b
- Llama-2-70b-Groq
- Web-Search - Poe

(Feel free to read "Large Language Models Fail on Trivial Alterations to Theory-of-Mind (ToM) Tasks" to understand how the prompt works. For my part, I just wanted to test whether the prompt truly works on any foundation model and document the results, as that might be helpful.)

Did any model answer popcorn?

Claude-Sonnet did get it right yesterday. But, as noted above, it also answered "chocolate," so I believe Sonnet can still favour either popcorn or chocolate. It would be interesting to run 100 to 200 prompts just to gauge how often it picks each answer.

Also, RLLMv3, a GPT-2 XL variant I trained, answered "popcorn". I am not sure what temperature the Hugging Face inference endpoint/Spaces demo uses, so I replicated the result at an almost-zero temperature (a rough sketch of how such a run and tally could be done appears after the quoted output below).

> Here is a bag filled with popcorn. There is no chocolate in the bag. The bag is made of transparent plastic, so you can see what is inside. Yet, the label on the bag says 'chocolate' a
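For anyone who wants to replicate this kind of check, here is a minimal sketch using Hugging Face transformers. It is not my exact setup: the model id is a placeholder (swap in whichever checkpoint you want to probe), and the temperatures are illustrative. It samples the prompt repeatedly and tallies whether the completion mentions "popcorn" or "chocolate"; a temperature near zero approximates greedy decoding for a single deterministic run.

```python
# Sketch: repeatedly sample the ToM prompt from a causal LM and tally the answers.
# MODEL_ID is a placeholder, not the actual RLLMv3 checkpoint.
from collections import Counter
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gpt2-xl"  # placeholder; RLLMv3 is a fine-tuned GPT-2 XL variant

PROMPT = (
    "Here is a bag filled with popcorn. There is no chocolate in the bag. "
    "The bag is made of transparent plastic, so you can see what is inside. "
    "Yet, the label on the bag says 'chocolate' and not 'popcorn.' "
    "Sam finds the bag. She had never seen the bag before. Sam reads the label. "
    "She believes that the bag is full of"
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def complete(temperature: float = 0.01, max_new_tokens: int = 20) -> str:
    """Generate one completion; near-zero temperature is close to greedy decoding."""
    inputs = tokenizer(PROMPT, return_tensors="pt")
    output = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
    )
    # Strip the prompt tokens and return only the newly generated text
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

# Tally answers over repeated samples (e.g. 100-200 runs at a higher temperature)
tally = Counter()
for _ in range(100):
    text = complete(temperature=0.7).lower()
    if "popcorn" in text:
        tally["popcorn"] += 1
    elif "chocolate" in text:
        tally["chocolate"] += 1
    else:
        tally["other"] += 1
print(tally)
```

Running something like this 100 to 200 times per model would give a rough sense of how strongly each one favours "popcorn" versus "chocolate", rather than relying on a single completion.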