I do AI Alignment research. Currently at ARC Evals, though I still dabble in interpretability in my spare time.

I'm also currently on leave from my PhD at UC Berkeley's CHAI. 


(Lawrence's) Reflections on Research
[Redwood Research] Causal Scrubbing

Wiki Contributions


The developers are doing a livestream on Youtube at 1PM PDT today: 

Also, you can now use Whisper-v2 Large via API, and it's very fast!

To back up plex a bit:

  • It is indeed prevailing wisdom that OPT isn't very good, despite being decent on becnhmarks, though generally the baseline comparison is to code-davinci-002 derived models (which do way better on benchmarks) or smaller models like UL2 that were trained with comparable compute and significantly more data. 
  • OpenAI noted in the original InstructGPT paper that performance on benchmarks can be un-correlated with human rater preference during finetuning. 

But yeah, I do think Eliezer is at most directionally correct -- I suspect that LLaMA will see significant use amongst at least both researchers and Meta AI. 

Yep! That's a good clarification. I tried to make this clear in my footnote and the quotation marks, but I think I should've stated it more clearly. 

If I had to steelman the view, I'd go with Paul's argument here:

I think that time later is significantly more valuable than time now (and time now is much more valuable than time in the old days). Safety investment and other kinds of adaptation increase greatly as the risks become more immediate (capabilities investment also increases, but that's already included); safety research gets way more useful (I think most of the safety community's work is 10x+ less valuable than work done closer to catastrophe, even if the average is lower than that). Having a longer period closer to the end seems really really good to me.

If we lose 1 year now, and get back 0.5 years later., and if years later are 2x as good as years now, you'd be breaking even.

My view is that progress probably switched from being net positive to net negative (in expectation) sometime around GPT-3. If we had built GPT-3 in 2010, I think the world's situation would probably have been better. We'd maybe be at our current capability level in 2018, scaling up further would be going more slowly because the community had already picked low hanging fruit and was doing bigger training runs, the world would have had more time to respond to the looming risk, and we would have done more good safety research.

Yeah, I definitely felt a bit better after reading it -- I think there's a lot of parts where I disagree with him, but it was quite reasonable overall imo.

Plausibly the real issue is that the goal is next-token-prediction; OpenAI wants the bot to act like a bot, but the technique they're using has these edge cases where the bot can't differentiate between the prompt and the user-supplied content, so it ends up targeting something different.

For what it's worth, I think this specific category of edge cases can be solved pretty easily, for example, you could totally just differentiate the user content from the prompt from the model outputs on the backend (by adding special tokens, for example)! 

Thanks for the clarification, especially how a 6.1% winrate vs LeelaZero and 3.5% winrate vs ELF still imply significantly stronger Elo than is warranted. 

The fact that Kellin could defeat LZ manually as well as the positions in bilibili video do seem to suggest that this is a common weakness of many AlphaZero-style Go AIs. I retract my comment about other engines. 

To our knowledge, this attack is the first exploit that consistently wins against top programs using substantial search, without repeating specific sequences (e.g., finding a particular game that a bot lost and replaying the key parts of it).

Yeah! I'm not downplaying the value of this achievement at all! It's very cool that this attack works and can be reproduced by a human. I think this work is great (as I've said, for example, in my comments on the ICML paper). I'm specifically quibbling about the "solved/unsolved" terminology that the post used to use. 

Perhaps similar learning algorithms / neural-net architectures learn similar circuits / heuristics and thus also share the same vulnerabilities?

Your comment reminded me of ~all the adversarial attack transfer work in the image domain, which does suggest that non-adversarially trained neural networks will tend to have the same failure modes. Whoops. Should've thought about those results (and the convergent learning/universality results from interp) before I posted. 

Load More