LESSWRONG
Mia Taylor
A personal take on why you should work at Forethought (maybe)
Mia Taylor · 14d

I joined Forethought about six weeks ago as a research fellow. Here's a list of some stuff I like about Forethought and some stuff that I don't like about Forethought. Note that I wrote this before reading OP closely.

 

Things I like:

  • Forethought gives me significant independence to develop my own views and strongly encourages me to prioritize work that genuinely excites me intellectually and which I (inside view) think is very important.
  • A lot of work is very social. For example, a common form of work is for two people to stand in front of a whiteboard and try to make sense of a confusing issue together.
  • Everyone is really kind while also being unafraid to point out disagreements directly.
  • We have a great feedback culture. Max Dalton (manager of most people at Forethought, including me) proactively seeks out feedback from us and seems to take it seriously. He also gives me weekly, systematic feedback about stuff he thinks I did well that week and how he thinks I can improve. I expect this to be great for my growth.

 

Things I don't like:

  • We're doing weird, speculative, philosophy-flavored work. It's plausibly important and high-leverage (which is why I’m doing it), but sometimes it feels like we're just writing in Google Docs with unclear prospects of getting anywhere (I do think that leadership is aware of this and trying to address it). I used to do empirical research, where I got more immediate feedback from reality on some aspects of the quality of my work, and sometimes I miss that.
  • I worry we're a bit of an echo chamber—a set of people with pretty weird and highly correlated views of how the future will go. (I’d be excited for people who disagree with the Forethought consensus to apply!)
  • I'm the only full-time Forethought person working from the Bay, which is a bit lonely. So far I've been co-located with co-workers for most of my time here (because I traveled to Oxford and some of them traveled to the Bay), but I don't expect that to continue long-term. If you don’t want to move to Oxford, that’s something to consider.
Harmless reward hacks can generalize to misalignment in LLMs
Mia Taylor · 1mo

Thanks! Yes, I agree that our dataset was ambiguous between "do behaviors that were mentioned in the evaluation criteria" and "do a behavior that scores well according to the evaluation criteria", and I think that would be a good extension. We did actually see evidence that the models trained in these experiments were worse at generalizing to "negative" reward functions (avoid-X). It varied a lot by question wording, but on some questions our fine-tuned models were worse than the base model at avoiding a behavior when they were told that the reward model would punish that behavior. I'd be excited to see future work that addresses this issue.

As for the second part of your comment: I think I'm confused about what the mode connectivity experiment is intended to measure. Say more?

46 · Harmless reward hacks can generalize to misalignment in LLMs · 2mo · 7
111 · Model Organisms for Emergent Misalignment · 4mo · 15