
RLHF

• Applied to MetaAI: less is less for alignment. by Cleo Nardo 3d ago
• Applied to Is behavioral safety "solved" in non-adversarial conditions? by Ruby 22d ago
• Applied to The Compleat Cybornaut by ukc10014 1mo ago
• Applied to Proposal: Using Monte Carlo tree search instead of RLHF for alignment research by Christopher King 2mo ago
• Applied to An alternative of PPO towards alignment by ml hkust 2mo ago
• Applied to Natural language alignment by Jacy Reese Anthis 2mo ago
• Applied to Exploratory Analysis of RLHF Transformers with TransformerLens by Curt Tigges 2mo ago
• Applied to GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2 by Christopher King 3mo ago
• Applied to Imitation Learning from Language Feedback by Jérémy Scheurer 3mo ago
• Applied to A crazy hypothesis: GPT-4 already is agentic and is trying to take over the world! by Christopher King 3mo ago
• Applied to RLHF does not appear to differentially cause mode-collapse by Raemon 3mo ago
• Applied to Human preferences as RL critic values - implications for alignment by Seth Herd 3mo ago
• Applied to Reflections On The Feasibility Of Scalable-Oversight by Felix Hofstätter 3mo ago
• Applied to The Waluigi Effect (mega-post) by Cleo Nardo 3mo ago
• Applied to A library for safety research in conditioning on RLHF tasks by Raemon 4mo ago
• Applied to [Preprint] Pretraining Language Models with Human Preferences by Giulio 4mo ago