[T]he orangutan effect: If you sit down with an orangutan and carefully explain to it one of your cherished ideas, you may leave behind a puzzled primate, but will yourself exit thinking more clearly.
Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique in which the model's training signal comes from human evaluations of the model's outputs, rather than from labeled data or a ground-truth reward signal.
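A minimal, hypothetical sketch of the reward-model step this describes: human preference comparisons between pairs of outputs train a scalar reward model, which then stands in for a ground-truth reward when fine-tuning the policy. The `RewardModel` class, embedding size, and toy data below are illustrative assumptions, not any particular system's implementation.

```python
import torch
import torch.nn as nn

# Sketch of the reward-model half of RLHF: a human prefers one output over
# another, and the reward model is trained so the preferred output scores
# higher (pairwise Bradley-Terry / logistic loss).

class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 16):
        super().__init__()
        # Maps an output's embedding to a scalar reward (toy stand-in).
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Toy embeddings of a human-preferred and a human-rejected model output.
preferred = torch.randn(8, 16)
rejected = torch.randn(8, 16)

# Pairwise preference loss: -log sigmoid(r(preferred) - r(rejected)).
loss = -torch.nn.functional.logsigmoid(
    reward_model(preferred) - reward_model(rejected)
).mean()
loss.backward()
optimizer.step()

# The learned reward then replaces labeled data or a ground-truth signal
# when the policy is fine-tuned (e.g., with PPO).
```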
An alignment tax (sometimes called a safety tax) is the additional cost incurred when making an AI aligned, relative to the cost of building an unaligned AI...
AI Evaluations, or "Evals", focus on assessing the capabilities, safety, and alignment of advanced AI systems. These evaluations can be divided into two main categories: behavioral and understanding-based...