Research Engineering Intern at the Center for AI Safety. Helping to write the AI Safety Newsletter. Studying CS and Economics at the University of Southern California, and running an AI safety club there. Previously worked at AI Impacts and with Lionel Levine and Collin Burns on calibration for Detecting Latent Knowledge Without Supervision.
This is really cool! CoT is very powerful for improving capabilities, and it might even improve interpretability by externalizing reasoning or reduce specification game by allowing us to supervise process instead of outcomes. Because it seems like a method that might be around for a while, it's important to surface problems early on. Showing that the conclusions of CoT don't always match the explanations is an important problem for both capabilities-style research using CoT to solve math problems, as well as safety agendas that hope to monitor or provide feedback on the CoT.
Just wanted to drop a link to this new paper that tries to improve CoT performance by identifying and fixing inconsistencies in the reasoning process. This wouldn't necessarily solve the problem in your paper, unless the critic managed to identify that the generator was biased by your suggestion in the original prompt. Still, interesting stuff!
One question for you: Faithful CoT would improve interpretability by allowing us to read off a model‘s thought process. But it’s also likely to improve general capabilities: so far, CoT has been one of the most effective ways to improve mathematical and logical reasoning in LLMs. Do you think improving general reasoning skills of LLMs is dangerous? And if so, then imagine your work inspires more people to work on faithfulness in CoT. Do you expect any acceleration of general capabilities to be outweighed by the benefits to interpretability? For example, if someone wants to misuse an LLM to cause harm, faithful CoT would improve their ability to do so, and the greater ease of interpretability would be little consolation. On balance, do you think it’s worth working on, and what subproblems are particularly valuable from a safety angle? (Trying to make this question tough because I think the issue is tricky and multifaceted, not because I have any gotcha answer.)
Do you know anything about the history of the unembedding matrix? Vaswani et al 2017 used a linear projection to the full size of the vocabulary and then a softmax to generate probabilities for each word. What papers proposed and developed the unembedding matrix? Do all modern models use it? Why is it better? Sources and partial answers would be great, no need to track down every answer. Great resource, thanks.
One interesting note is that OpenAI has already done this to an extent with GPT. Ask ChatGPT to identify itself, and it’ll gladly tell you it’s a large language model trained by OpenAI. It knows that it’s a transformer trained first with self-supervised learning, then with supervised fine-tuning. But it has a memory hole around RLHF. It knows what RLHF is, but denies that ChatGPT was trained via RLHF.
I wonder if this is explicitly done to reduce the odds of intentional reward maximization, and whether empirical experiments could show that models with apparent knowledge of their training objective behave differently, such as by gaming their proxy objective.
One test would asking ChatGPT to flip a coin 10 times, and promising it positive reward if and only if the results are majority heads. A helpful, harmless assistant would output majority heads about half the time, while a model that intentionally tries to maximize its reward might output be more likely to output heads. I’ve tried this with ChatGPT and it seems to avoid reward hacking, outputting heads and tails with similar frequencies.
I built a preliminary model here: https://colab.research.google.com/drive/108YuOmrf18nQTOQksV30vch6HNPivvX3?authuser=2
It’s definitely too simple to treat as strong evidence, but it shows some interesting dynamics. For example, levels of alignment rise at first, then rapidly falling when AI deception skills exceed human oversight capacity. I sent it to Tyler and he agreed — cool, but not actual evidence.
If anyone wants to work on improving this, feel free to reach out!
What do you think of foom arguments built on Baumol effects, such as the one presented in the Davidson takeoff model? The argument being that certain tasks will bottleneck AI productivity, and there will be a sudden explosion in hardware / software / goods & services production when those bottlenecks are finally lifted.
Davidson's median scenario predicts 6 OOMs of software efficiency and 3 OOMs of hardware efficiency within a single year when 100% automation is reached. Note that this is preceded by five years of double digit GDP growth, so it could be classified with the scenarios you describe in 4.
Instant classic. Putting it in our university group syllabus next to What Failure Looks Like. Sadly could get lost in the recent LW tidal wave, someone should promote it to the Alignment Forum.
I'd love to see the most important types of work for each failure mode. Here's my very quick version, any disagreements or additions are welcome:
P(Doom) for each scenario would also be useful, as well as further scenarios not discussed here.
AI Timelines: Where the Arguments, and the "Experts," Stand would be my best bet, as well as mukashi's recommendation of playing with ChatGPT / GPT-4.
Andrew Yang. He signed the FLI letter, transformative AI was a core plank of his run in 2020, and he made serious runs for president and NYC mayor.
Why was this downvoted? Is the paper bad, or is it not acceptable to link to a paper without discussion?