Thanks! Yes, I agree that our dataset was ambiguous between "do behaviors that were mentioned in the evaluation criteria" and "do a behavior that scores well according to the evaluation criteria", and I think separating the two would be a good extension. We did actually see evidence that the models trained in these experiments were worse at generalizing to "negative" reward functions (avoid-X). It varied a lot by question wording, but on some questions our fine-tuned models were worse than the base model at avoiding a behavior when they were told that the reward model would punish that behavior. I'd be excited to see future work that addresses this issue.
As for the second part of your comment: I think I'm confused about what the mode connectivity experiment is intended to measure. Say more?
I joined Forethought about six weeks ago as a research fellow. Here's a list of some stuff I like about Forethought and some stuff that I don't like about Forethought. Note that I wrote this before reading OP closely.
Things I like:
Things I don't like: