AI Evaluations, or "Evals", focus on assessing the capabilities, safety, and alignment of advanced AI systems. These evaluations can be divided into two main categories: behavioral and understanding-based... (read more)
Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique where the model's training signal uses human evaluations of the model's outputs, rather than labeled data or a ground truth reward signal.
An alignment tax (sometimes called a safety tax) is the extra cost of ensuring that an AI system is aligned, relative to the cost of building an unaligned alternative. The term ‘tax’ can be misleading: in the safety literature, ‘alignment/safety tax’ or ‘alignment cost’ is meant to refer to increased developer time, extra compute, or decreased performance, and not only to the financial cost/tax required to build an aligned system... (read more)
Open Threads are informal discussion areas, where users are welcome to post comments that didn't quite feel big enough to warrant a top-level post, nor fit in other posts... (read more)