Researcher at MIRI
Great point! This possibly makes my proposal a bad idea. I would need to know more about how the labs respond to this kind of incentive to actually know.
Model providers often don’t provide the full CoT, and instead provide a summary. I think this is a fine/good thing to do to help prevent distillation.
However, I think it would be good if the summaries provided a flag for when the CoT contained evaluation awareness or scheming (or other potentially concerning behavior).
I worry that currently the summaries don’t really provide this information, and this probably makes alignment and capability evaluations less valid.
What they don't do is filter out every web page that has the canary string. Since people put them on random web pages (like this one), which was not their intended use, they get into the training data.
As others have mentioned, this seems kinda crazy and bad. I was surprised you didn't think this.
"Unrelated" question, but are you under a non-disparagement agreement with GDM that would prevent you from criticizing things like their data-filtering practices?
I expect the line to blur between introspective and extrospective RSI. For example, you could imagine AIs trained for interp doing interp on themselves, directly interpreting their own activations/internals and then making modifications while running.
Someone please explain
I think that one of the ways in which ice cream probably easily beats bear fat (+honey and salt) is that you can eat a bowl of ice cream and not feel terrible. It's very plausible that the bear fat is awesome for a couple of bites, but if you try to eat half a cup you will probably want to vomit.
I guess that influencing P(future goes extremely well | no AI takeover) is maybe pretty hard, and plagued by cluelessness problems. Avoiding AI takeover is a goal that I have at least some confidence is good.
That said, I do wish more people were thinking about how to make the future go well. I think my favorite thing to aim for is increasing the probability that we do a Long Reflection, although I haven't really thought at all about how to do that.
Thanks!