Crossposted from https://williamrsaunders.substack.com/p/principles-for-the-agi-race Why form principles for the AGI Race? I worked at OpenAI for 3 years, on the Alignment and Superalignment teams. Our goal was to prepare for the possibility that OpenAI succeeded in its stated mission of building AGI (Artificial General Intelligence, roughly able to do most things...
When you think you've found a circuit in a language model, how do you know if it does what you think it does? Typically, you ablate / resample the activations of the model in order to isolate the circuit. Then you measure if the model can still perform the task...
https://twitter.com/antimatter15/status/1602469101854564352 Currently, large language models (ChatGPT, Constitutional AI) are trained to refuse to follow user requests that are considered inappropriate or harmful. This can be done by training on example strings of the form “User: inappropriate request AI: elaborate apology” Proposal Instead of training a language model to produce “elaborate...
Can AI systems substantially help with alignment research before transformative AI? People disagree. Ought is collecting a dataset of alignment research tasks so that we can: 1. Make progress on the disagreement 2. Guide AI research towards helping with alignment We’re offering a prize of $200-$2000 for each contribution to...
Is there an intuitive way to explain how much better superforecasters are than regular forecasters? (I can look at the tables in https://www.researchgate.net/publication/277087515_Identifying_and_Cultivating_Superforecasters_as_a_Method_of_Improving_Probabilistic_Predictions but I don't have an intuitive understanding of what brier scores mean, so I'm not sure what to think about it).
TLDR We wrote a 20-page document that explains IDA and outlines potential Machine Learning projects about IDA. This post gives an overview of the document. What is IDA? Iterated Distillation and Amplification (IDA) is a method for training ML systems to solve challenging tasks. It was introduced by Paul Christiano....