LESSWRONG
LW

MiguelDev
314Ω22622810
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
2Migueldev's shortform
2y
13
Modifying LLM Beliefs with Synthetic Document Finetuning
MiguelDev4mo1-4

Quoting the conclusion from the blogpost:

In conclusion, synthetic document finetuning represents a powerful new technique for modifying model beliefs, with significant implications for AI safety and alignment. While important ethical and technical challenges remain, our work demonstrates that controlled belief modification is feasible and scalable, opening new avenues for understanding and controlling large language models.

Upvoted this post but I think that it's wrong to claim that this SDF pipeline is a new approach - as it's just a better way of investigating the "datasets" section of Reinforcement Learning using Layered Morphologies (RLLM),[1] the research agenda that I'm pursuing. Also, I disagree that this line of research can be categorized as an unlearning method. Rather, it should be seen as a better way of training an LLM on a specific belief/set of beliefs - which perhaps can be thought of better as a form of AI control. 

Having said this things, I'm still happy to see the results of this post and that there is interest on the same line of topics that I'm investigating. So I'm not too crazy at all to pursue this research agenda. 

 

  1. ^

    And perhaps it also touches some of my ideas on Sequentially Layered Synthetic Environments (SLSEs).. 

Reply
Open problems in emergent misalignment
MiguelDev6mo11

Fixed! 

Reply
Open problems in emergent misalignment
MiguelDev6mo*30

You might be interested on a rough and random utilitarian (paperclip maximization) experiment that I did a while back on a GPT2XL, Phi1.5 and Falcon-RW-1B. The training involved all of the parameters all of these models, and used repeatedly and variedly created stories and Q&A-Like scenarios as training samples. Feel free to reach out if you have further questions. 

Reply
Unlocking Ethical AI and Improving Jailbreak Defenses: Reinforcement Learning with Layered Morphology (RLLM)
MiguelDev7mo10

Hello there, and I appreciate the feedback! I agree that this rewrite is filled with hype, but let me explain what I’m aiming for with my RLLM experiments.  

I see these experiments as an attempt to solve value learning through stages, where layers of learning and tuning could represent worlds that allow humanistic values to manifest naturally. These layers might eventually combine in a way that mimics how a learning organism generates intelligent behavior.

Another way to frame RLLM’s goal is this: I’m trying to sequentially model probable worlds where evolution optimized for a specific ethic. The hope is that these layers of values can be combined to create a system resilient to modern-day hacks, subversions, or jailbreaks.  

Admittedly, I’m not certain my method works—but so far, I’ve transformed GPT-2 XL into varied iterations (on top of what was discussed in this post) : a version fearful of ice cream, a paperclip maximizer, even a quasi-deity. Each of these identities/personas develops sequentially through the process. 

Reply
A Three-Layer Model of LLM Psychology
MiguelDev8mo30

I wrote something that might be relevant to what you are attempting to understand, where various layers (mostly ground layer and some surface layer as per your intuition in this post) combine through reinforcement learning and help morph a particular character (and I referred to it in the post as an artificial persona).

Link to relevant part of the post: https://www.lesswrong.com/posts/vZ5fM6FtriyyKbwi9/betterdan-ai-machiavelli-and-oppo-jailbreaks-vs-sota-models#IV__What_is_Reinforcement_Learning_using_Layered_Morphology__RLLM__

(Sorry for the messy comment, I'll clean this up a bit later as I'm commenting using my phone)

Reply
A central AI alignment problem: capabilities generalization, and the sharp left turn
MiguelDev1y20

NotebookLM is able to generate a good podcast from this post. There are some bugs though.

Reply
How to train your own "Sleeper Agents"
MiguelDev1y*10

I see. I now know what I did differently in my training. Somehow I ended up with an honest paperclipper model even if I combined the assistant and sleeper agent training together. I will look into the MSJ suggestion too and how it will fit into my tools and experiments! Thank you!

Reply
How to train your own "Sleeper Agents"
MiguelDev1yΩ230

Obtain a helpful-only model


Hello! Just wondering if this step is necessary? Can a base model or a model w/o SFT/RLHF directly undergo the sleeper agent training process on the spot?

(I trained a paperclip maximizer without the honesty tuning and so far, it seems to be a successful training run. I'm just wondering if there is something I'm missing, for not making the GPT2XL, basemodel tuned to honesty first.)

Reply
CLR's recent work on multi-agent systems
MiguelDev1y10

safe Pareto improvement (SPI)


This URL is broken.

Reply
Migueldev's shortform
MiguelDev1y10

I created my first fold. I'm not sure if this is something to be happy with as everybody can do it now.

Reply
Load More
Archetypal Transfer Learning
2y
(+4/-110)
Archetypal Transfer Learning
2y
(+6/-6)
Archetypal Transfer Learning
2y
(+8/-25)
Archetypal Transfer Learning
2y
(+46/-23)
Archetypal Transfer Learning
2y
(+26/-147)
Archetypal Transfer Learning
2y
(+39/-48)
Archetypal Transfer Learning
2y
(+121/-3)
Archetypal Transfer Learning
2y
(+324/-13)
Archetypal Transfer Learning
2y
(+10/-14)
Archetypal Transfer Learning
2y
(+220)
Load More
4Unlocking Ethical AI and Improving Jailbreak Defenses: Reinforcement Learning with Layered Morphology (RLLM)
7mo
2
5An examination of GPT-2's boring yet effective glitch
1y
3
6Intergenerational Knowledge Transfer (IKT)
1y
0
5RLLMv10 experiment
1y
0
20A T-o-M test: 'popcorn' or 'chocolate'
1y
13
4Sparks of AGI prompts on GPT2XL and its variant, RLLMv3
2y
0
7Can RLLMv3's ability to defend against jailbreaks be attributed to datasets containing stories about Jung's shadow integration theory?
2y
2
4Research Log, RLLMv3 (GPT2-XL, Phi-1.5 and Falcon-RW-1B)
2y
0
16GPT2XL_RLLMv3 vs. BetterDAN, AI Machiavelli & Oppo Jailbreaks
2y
4
6Research Log, RLLMv2: Phi-1.5, GPT2XL and Falcon-RW-1B as paperclip maximizers
2y
0
Load More