Fabian Schimpf

I write code for an experimental next-generation asteroid lander and explore my personal fit for AI safety research. Associated with IMPRS, entered the stratosphere with NASA. 

Wiki Contributions

Comments

I think this threshold will be tough to set. IMO, confidence in a decision only really makes sense if you consider decisions to be uni-modal, and I would argue that this is rarely the case for a sufficiently capable system (like you and me). We are constantly trading off multiple options, and thus the confidence (e.g., as measured by the log-likelihood of the action given a policy and state) depends on the number of options available. I expect this context dependence would be a tough nut to crack when setting a meaningful threshold.
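To illustrate the context dependence I mean, here is a minimal sketch (the function name and the softmax-policy framing are my own illustrative assumptions, not anything from the post): with a softmax over action scores, the very same preference margin for the best action yields a lower confidence when more alternatives are on the table.

```python
import math

def action_confidence(logits):
    """Softmax probability assigned to the highest-scoring action."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)

# Same preference margin in both cases: the best action scores 1.0,
# every alternative scores 0.0 -- only the number of options differs.
few = action_confidence([1.0, 0.0])          # 2 options available
many = action_confidence([1.0] + [0.0] * 9)  # 10 options available

print(few, many)  # confidence drops as the option set grows
assert few > many
```

So a fixed confidence threshold would fire at different preference strengths depending on how many options the system happens to be considering in a given state, which is the difficulty I had in mind.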

If there were other warning shots in addition to this one, that's even worse! We're already playing in Easy Mode here.


Playing devil's advocate: if the government isn't aware that the game is on, it doesn't matter that it's on easy mode; performance is likely poor regardless of the game's difficulty. 

I agree with the post's sentiment that warning shots would currently not do much good. But I am, as of now, still somewhat hopeful that the bottleneck is getting the government to see and target a problem, not the government's ability to act on an identified issue.  

I agree; that seems to be a significant risk. If we are lucky enough to get AI warning shots, it seems prudent to think about how to ensure they are recognized for what they are. This is a problem that I haven't given much thought to before.

But I find it encouraging to think that we can use warning shots in other fields to understand the dynamics of how such events are interpreted. As of now, I don't think AI warning shots would change much, but I would add this potential for learning as a counter-argument. This seems analogous to the argument "EAs will get better at influencing the government over time" from another comment.

Hi Ben, I like the idea; however, almost every decision has conflicting outcomes, e.g., regarding opportunity cost. As I understand you, this would delegate almost every decision to humans if you take the premise "I can't do X if I choose to do Y" seriously. The application to high-impact interference therefore seems promising if the system is limited to deciding on only a few things. The question then becomes whether a human can understand the plan that an AGI is capable of making. IMO this ties nicely into, e.g., ELK and interpretability research, but also the problem of predictability. 

The reaction seems consistent if people (in government) believe no warning shot was fired. AFAIK the official reading is that we experienced a zoonosis, so banning gain-of-function research would go against that narrative. It seems true to me that this should be seen as a warning shot, but smallpox and Ebola could have prompted this discussion as well and also failed to be seen as warning shots. 

Excellent summary; I had been looking for something like this! Is there a reason you didn't include the AI Safety Camp in Training & Mentoring Programs?

I like your point that "surprises cut both ways" and assume that this is why your timelines aren't affected by the possibility of surprises; is that about right? I am confused about the ~zero effect, though: isn't double descent basically what we see with giant language models lately? Disclaimer: I don't work on LLMs myself, so my confusion isn't necessarily meaningful.

Starting more restrictive seems sensible; this could be, as you say, learned away, or one could use human feedback to sign off on high-impact actions. The first problem reminds me of finding regions of attraction in nonlinear control, where the region of attraction (ROA) is explored without leaving the stable region. The second approach seems to hinge on humans being able to understand the implications of high-impact actions and the consequences of a baseline like inaction. There are probably also other alternatives that we have not yet considered. 


To me, the relevant result/trend is that catastrophic forgetting seems to be less of an issue than it was maybe two or three years ago, e.g., in meta-learning, and that we can squeeze these diverse skills into a single model. Sure, the results seem to indicate that individual systems for different tasks would still be the way to go for now, but at least the published version was not trained with the same magnitude of compute that was, e.g., used on the latest and greatest LLMs (I take this from Lennart Heim, who did the math on this). So it is IMO hard to say with certainty whether there are timeline-affecting surprises lurking if we either just trained longer or had faster hardware. I didn't expect double descent and grokking, so my prior is that unexpected stuff happens.