
RLHF

• Applied to MetaAI: less is less for alignment. by Cleo Nardo 3d ago
• Applied to Is behavioral safety "solved" in non-adversarial conditions? by Ruby 22d ago
• Applied to The Compleat Cybornaut by ukc10014 1mo ago
• Applied to Proposal: Using Monte Carlo tree search instead of RLHF for alignment research by Christopher King 2mo ago
• Applied to An alternative of PPO towards alignment by ml hkust 2mo ago
• Applied to Natural language alignment by Jacy Reese Anthis 2mo ago
• Applied to Exploratory Analysis of RLHF Transformers with TransformerLens by Curt Tigges 2mo ago
• Applied to GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2 by Christopher King 3mo ago
• Applied to Imitation Learning from Language Feedback by Jérémy Scheurer 3mo ago
• Applied to A crazy hypothesis: GPT-4 already is agentic and is trying to take over the world! by Christopher King 3mo ago
• Applied to RLHF does not appear to differentially cause mode-collapse by Raemon 3mo ago
• Applied to Human preferences as RL critic values - implications for alignment by Seth Herd 3mo ago
• Applied to Reflections On The Feasibility Of Scalable-Oversight by Felix Hofstätter 3mo ago
• Applied to The Waluigi Effect (mega-post) by Cleo Nardo 3mo ago
• Applied to A library for safety research in conditioning on RLHF tasks by Raemon 4mo ago
• Applied to [Preprint] Pretraining Language Models with Human Preferences by Giulio 4mo ago