LawrenceC

I do AI Alignment research. Currently at ARC Evals, though I still dabble in interpretability in my spare time.

I'm also currently on leave from my PhD at UC Berkeley's CHAI. 

Sequences

(Lawrence's) Reflections on Research
[Redwood Research] Causal Scrubbing

Wiki Contributions

Comments

Huh, that’s really impressive work! I don’t have much else to say, except that I’m impressed that basic techniques (specifically, PCA + staring at activations) got you so far in terms of reverse engineering.

Great work, glad to see it out!

  • Why doesn't algebraic value editing break all kinds of internal computations?! What happened to the "manifold of usual activations"? Doesn't that matter at all? 
    • Or the hugely nonlinear network architecture, which doesn't even have a persistent residual stream? Why can I diff across internal activations for different observations?
    • Why can I just add 10 times the top-right vector and still get roughly reasonable behavior? 
    • And the top-right vector also transfers across mazes? Why isn't it maze-specific? 
      • To make up some details, why wouldn't an internal "I want to go to top-right" motivational information be highly entangled with the "maze wall location" information?

This was also the most surprising part of the results to me.

I think both this work and Neel's recent Othello post provide evidence that, at least for small-to-medium-sized neural networks, things are just... represented ~linearly (Olah et al.'s features-as-directions hypothesis). Note that Chris Olah's earlier work on features as directions was done not on transformers but on conv nets without residual streams.
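To make the surprise concrete, here's a minimal sketch of the activation-addition idea being discussed: compute a "steering vector" as the difference between internal activations on two observations, then add a scaled copy of it during a later forward pass. All names and shapes here are hypothetical stand-ins; in the actual work the activations come from a real network's hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a network's hidden activations at one layer.
# In the real setup these come from forward passes on different observations.
acts_goal_top_right = rng.normal(size=(128,))  # run where the goal is top-right
acts_baseline = rng.normal(size=(128,))        # baseline run

# The "top-right vector" is just the difference of activations.
top_right_vector = acts_goal_top_right - acts_baseline

# The surprising empirical claim: adding a large multiple of this vector
# still yields roughly coherent behavior downstream.
coefficient = 10.0

def steer(activations: np.ndarray) -> np.ndarray:
    """Linearly shift activations along the steering direction."""
    return activations + coefficient * top_right_vector

steered = steer(rng.normal(size=(128,)))
```

If features really are directions, this linear edit moves the representation toward "goal is top-right" without trampling the rest of the computation, which is exactly the behavior the questions above find surprising.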

The developers are doing a livestream on YouTube at 1 PM PDT today: 

Also, you can now use Whisper-v2 Large via API, and it's very fast!

To back up plex a bit:

  • It is indeed prevailing wisdom that OPT isn't very good, despite being decent on benchmarks, though generally the baseline comparison is to code-davinci-002-derived models (which do way better on benchmarks) or smaller models like UL2 that were trained with comparable compute and significantly more data. 
  • OpenAI noted in the original InstructGPT paper that performance on benchmarks can be uncorrelated with human rater preference during finetuning. 

But yeah, I do think Eliezer is at most directionally correct -- I suspect that LLaMA will see significant use amongst at least both researchers and Meta AI. 

Yep! That's a good clarification. I tried to make this clear in my footnote and the quotation marks, but I think I should've stated it more clearly. 

If I had to steelman the view, I'd go with Paul's argument here: https://www.lesswrong.com/posts/4Pi3WhFb4jPphBzme/don-t-accelerate-problems-you-re-trying-to-solve?commentId=z5xfeyA9poywne9Mx

I think that time later is significantly more valuable than time now (and time now is much more valuable than time in the old days). Safety investment and other kinds of adaptation increase greatly as the risks become more immediate (capabilities investment also increases, but that's already included); safety research gets way more useful (I think most of the safety community's work is 10x+ less valuable than work done closer to catastrophe, even if the average is lower than that). Having a longer period closer to the end seems really really good to me.

If we lose 1 year now and get back 0.5 years later, and if years later are 2x as good as years now, you'd be breaking even.
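The break-even arithmetic above can be checked directly; the per-year values are just the illustrative numbers from the sentence, not anything from Paul's actual model.

```python
# Illustrative numbers from the break-even claim: losing 1 year of
# less-valuable time now in exchange for 0.5 years later, where later
# years are worth 2x as much, nets out to zero.
value_now_per_year = 1.0
value_later_per_year = 2.0  # "years later are 2x as good as years now"

net_value = -1.0 * value_now_per_year + 0.5 * value_later_per_year
print(net_value)  # 0.0 -> breaking even
```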

My view is that progress probably switched from being net positive to net negative (in expectation) sometime around GPT-3. If we had built GPT-3 in 2010, I think the world's situation would probably have been better. We'd maybe be at our current capability level in 2018, scaling up further would be going more slowly because the community had already picked low hanging fruit and was doing bigger training runs, the world would have had more time to respond to the looming risk, and we would have done more good safety research.
