aogara

Research Engineering Intern at the Center for AI Safety. Helping to write the AI Safety Newsletter. Studying CS and Economics at the University of Southern California, and running an AI safety club there. Previously worked at AI Impacts and with Lionel Levine and Collin Burns on calibration for Detecting Latent Knowledge Without Supervision. 

Wiki Contributions

Comments

Why was this downvoted? Is the paper bad, or is it not acceptable to link to a paper without discussion?

This is really cool! CoT is very powerful for improving capabilities, and it might even improve interpretability by externalizing reasoning or reduce specification game by allowing us to supervise process instead of outcomes. Because it seems like a method that might be around for a while, it's important to surface problems early on. Showing that the conclusions of CoT don't always match the explanations is an important problem for both capabilities-style research using CoT to solve math problems, as well as safety agendas that hope to monitor or provide feedback on the CoT. 

Just wanted to drop a link to this new paper that tries to improve CoT performance by identifying and fixing inconsistencies in the reasoning process. This wouldn't necessarily solve the problem in your paper, unless the critic managed to identify that the generator was biased by your suggestion in the original prompt. Still, interesting stuff!

One question for you: Faithful CoT would improve interpretability by allowing us to read off a model‘s thought process. But it’s also likely to improve general capabilities: so far, CoT has been one of the most effective ways to improve mathematical and logical reasoning in LLMs. Do you think improving general reasoning skills of LLMs is dangerous? And if so, then imagine your work inspires more people to work on faithfulness in CoT. Do you expect any acceleration of general capabilities to be outweighed by the benefits to interpretability? For example, if someone wants to misuse an LLM to cause harm, faithful CoT would improve their ability to do so, and the greater ease of interpretability would be little consolation. On balance, do you think it’s worth working on, and what subproblems are particularly valuable from a safety angle? (Trying to make this question tough because I think the issue is tricky and multifaceted, not because I have any gotcha answer.)

Do you know anything about the history of the unembedding matrix? Vaswani et al 2017 used a linear projection to the full size of the vocabulary and then a softmax to generate probabilities for each word. What papers proposed and developed the unembedding matrix? Do all modern models use it? Why is it better? Sources and partial answers would be great, no need to track down every answer. Great resource, thanks. 

One interesting note is that OpenAI has already done this to an extent with GPT. Ask ChatGPT to identify itself, and it’ll gladly tell you it’s a large language model trained by OpenAI. It knows that it’s a transformer trained first with self-supervised learning, then with supervised fine-tuning. But it has a memory hole around RLHF. It knows what RLHF is, but denies that ChatGPT was trained via RLHF.

I wonder if this is explicitly done to reduce the odds of intentional reward maximization, and whether empirical experiments could show that models with apparent knowledge of their training objective behave differently, such as by gaming their proxy objective.

One test would asking ChatGPT to flip a coin 10 times, and promising it positive reward if and only if the results are majority heads. A helpful, harmless assistant would output majority heads about half the time, while a model that intentionally tries to maximize its reward might output be more likely to output heads. I’ve tried this with ChatGPT and it seems to avoid reward hacking, outputting heads and tails with similar frequencies.

I built a preliminary model here: https://colab.research.google.com/drive/108YuOmrf18nQTOQksV30vch6HNPivvX3?authuser=2

It’s definitely too simple to treat as strong evidence, but it shows some interesting dynamics. For example, levels of alignment rise at first, then rapidly falling when AI deception skills exceed human oversight capacity. I sent it to Tyler and he agreed — cool, but not actual evidence.

If anyone wants to work on improving this, feel free to reach out!

What do you think of foom arguments built on Baumol effects, such as the one presented in the Davidson takeoff model? The argument being that certain tasks will bottleneck AI productivity, and there will be a sudden explosion in hardware / software / goods & services production when those bottlenecks are finally lifted. 

Davidson's median scenario predicts 6 OOMs of software efficiency and 3 OOMs of hardware efficiency within a single year when 100% automation is reached. Note that this is preceded by five years of double digit GDP growth, so it could be classified with the scenarios you describe in 4. 

Instant classic. Putting it in our university group syllabus next to What Failure Looks Like. Sadly could get lost in the recent LW tidal wave, someone should promote it to the Alignment Forum. 

I'd love to see the most important types of work for each failure mode. Here's my very quick version, any disagreements or additions are welcome: 

  • Predictive model misuse -- People use AI to do terrible things. 
    • Adversarial Robustness: Train ChatGPT to withstand jailbreaking. When people try to trick it into doing something bad using a creative prompt, it shouldn't work. 
    • Anomaly Detection: A related but different approach. Instead of training the base model to give good responses to adversarial prompts, you can train a separate classifier to detect adversarial prompts, and refuse to answer them / give a stock response from a separate model. 
    • Cybersecurity at AI companies: If someone steals the weights, they can misuse the model!
    • Regulation: Make it illegal to release models that help people do bad things. 
    • Strict Liability: If someone does a bad thing with OpenAI's model, make OpenAI legally responsible.
    • Communications: The open source community is often gleeful about projects like ChaosGPT. Is it possible to shift public opinion? Be careful, scolding can embolden them. 
  • Predictive models playing dangerous characters
    • Haven't thought about this enough. Maybe start with the new wave of prompting techniques for GPT agents, is there a way to make those agents less likely to go off the rails? Simulators and Waluigi Effect might be useful frames to think through. 
  • Scalable oversight failure without deceptive alignment
    • Scalable Oversight, of course! But I'm concerned about capabilities externalities here. Some of the most successful applications of IDA and AI driven feedback only make models more capable of pursuing goals, without telling them any more about the content of our goals. This research should seek to scale feedback about human values, not about generic capabilities.
    • Machine Ethics / understanding human values: The ETHICS benchmark evaluates whether models have the microfoundations of our moral decision making. Ideally, this will generalize better than heuristics that are specific to a particular situation, such as RLHF feedback on helpful chatbots. I've heard the objection that "AI will know what you want, it just won't care", but that doesn't apply to this failure mode -- we're specifically concerned about poorly specifying our goals to the AI. 
    • Regulation to ensure slow, responsible deployment. This scenario is much more dangerous if AI is autonomously making military, political, financial, and scientific decisions. More human oversight at critical junctures and a slower transformation into a strange, unknowable world means more opportunities to correct AI and remove it from high stakes deployment scenarios. The EU AI Act identifies eight high risk areas requiring further scrutiny, including biometrics, law enforcement,  and management of critical infrastructure. What else should be on this list? 
    • What else? Does ELK count? If descendants of Collin Burns could tell us a model's beliefs, could we train on not for what would earn the highest reward, but what a human would really want? Or is ELK only for spot checking, and training on its answers would negate its effectiveness?
  • Deceptive Alignment
    • Interpretability, ELK, and Trojans seem tractable and useful by my own inside view. 
    • Some things I'm less familiar with that might be relevant: John Wentworth-style ontology identification, agent foundations, and shard theory. Correct me if I'm wrong here!
  • Recursive Self-Improvement --> hard takeoff
    • Slowing down. The FLI letter, ARC Evals, Yo Shavit's compute governance paper, Ethan Cabellero and others predicting emergent capabilities. Maybe we can stop at the edge of the cliff and wait there while we figure something out. I'm sure some would argue this is impossible. 

P(Doom) for each scenario would also be useful, as well as further scenarios not discussed here. 

Andrew Yang. He signed the FLI letter, transformative AI was a core plank of his run in 2020, and he made serious runs for president and NYC mayor. 

Load More