ShayBenMoshe

I recently finished my PhD in math (specifically, in homotopy theory), and have significant background in software engineering and cybersecurity research. More information can be found on my website.

Posts

23 · On excluding dangerous information from training · 5mo

Comments

On excluding dangerous information from training

ShayBenMoshe5mo10

For future reference, there are benchmarks for safe code that could be used to assess this issue, such as Purple Llama CyberSecEval by Meta.

(Note: This paper has two different tests. The first is a benchmark for writing safe code, which I didn't check and can't vouch for, but which seems like a useful entry point. The second is a test of model alignment towards not cooperating with requests for cyberattack tools, which I don't think is too relevant to the OP.)

On excluding dangerous information from training

ShayBenMoshe5mo10

I sympathize with the worry, and agree that this should be emphasized when writing about this topic. This is also the reason that I opened my post by clearly remarking on the issues this is not relevant to. I would urge others to do so in the future as well.

On excluding dangerous information from training

ShayBenMoshe5mo30

Thanks for the feedback! Upvoted, but disagreed.

I agree that not knowing anything at all about cybersecurity might cause the model to write less secure code (though it is not obvious that the inclusion of unsafe code examples doesn't in fact lead to more unsafe code being emitted, but let's put that aside).

However, writing safe code requires quite different knowledge from offensive cybersecurity. For writing safe code, it is relevant to know about common vulnerabilities (which are often just normal bugs) and how to avoid them - information which I agree probably should be kept in the dataset (at least for code completion models, which are not necessarily all models). Most of the other examples I gave are irrelevant. For instance, exploit mitigations (such as ASLR, CFG, and the rest that I listed in the post) are completely transparent to developers and are implemented by the compiler and operating system, and all exploit techniques (such as ROP, ...) are completely irrelevant to developers. For another example, knowing about the specific vulnerabilities which were found in the past few years is irrelevant to writing safe code, but does open the gate for one-day exploitation (one might argue that due to sample efficiency, models do need that, but I think it'll be insignificant; I can elaborate if anyone is interested).
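To make the "common vulnerabilities are often just normal bugs" point concrete, here is a minimal sketch, assuming a Python application that uses the standard `sqlite3` module. The table, the data, and the function names (`make_db`, `lookup_unsafe`, `lookup_safe`) are hypothetical and chosen only for illustration. The unsafe version is an ordinary string-handling bug; fixing it requires knowing the defensive pattern, not any exploit technique.

```python
import sqlite3


def make_db():
    """Build a small in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_unsafe(conn, name):
    # Bug: user input is spliced into the SQL text, so the input can
    # change the structure of the query (SQL injection).
    query = f"SELECT secret FROM users WHERE name = '{name}'"
    return [row[0] for row in conn.execute(query)]


def lookup_safe(conn, name):
    # Fix: pass the input as a bound parameter, so it is always treated
    # as data and never as part of the SQL statement.
    query = "SELECT secret FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


if __name__ == "__main__":
    conn = make_db()
    crafted = "x' OR '1'='1"
    print(lookup_unsafe(conn, crafted))  # leaks rows for other users
    print(lookup_safe(conn, crafted))    # no user has this literal name
```

Nothing in the fix depends on knowing about mitigations such as ASLR or techniques such as ROP, which is the distinction drawn above.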

I don't know enough about biorisks to comment on the situation there. I would be surprised if certain techniques that are particularly relevant for developing deadly pathogens are relevant to a non-negligible fraction of biology research. Of course, there would be some overlap (just as for cybersecurity you have to be able to code at all), but I'd argue that a big fraction doesn't overlap significantly.

Sam Altman fired from OpenAI

ShayBenMoshe5mo318

For completeness - in addition to Adam D’Angelo, Ilya Sutskever and Mira Murati signed the CAIS statement as well.

AI as a science, and three obstacles to alignment strategies

ShayBenMoshe6mo10

I'd like to offer an alternative to the third point. Let's assume we have built a highly capable AI that we don't yet trust. We've also managed to coordinate as a society and implement defensive mechanisms to get to that point. I think that we don't have to test the AI in a low-stakes environment and then immediately move to a high-stakes one (as described in the dictator analogy), while still getting high gains.

It is feasible to design a sandboxed environment formally proven to be secure, in the sense that one cannot hack into it, escape from it, or deliberately let the AI out of it (which, in particular, precludes AI box experiment scenarios). This is even easier for AI systems, which typically involve a very narrow set of operations and interfaces (essentially basic arithmetic, and very constrained input and output channels).

In this scenario, the AI could still offer significant benefits. For example, it could provide formally verified (hence, safe) proofs for general math or for correctness of software (including novel AI system designs which are proven to be aligned according to some [by then] formally defined notion of alignment), or generally assist with research (e.g., with a limited output size, to allow for human comprehension). I am sure we can come up with many more examples where a highly constrained, highly capable cognitive system can still be extremely beneficial and not as dangerous.
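As a toy illustration of what "formally verified" buys us, here is a minimal sketch in Lean 4. The theorem name is hypothetical and the statement is deliberately trivial. The point is that the proof checker either accepts or rejects the proof, regardless of who or what produced it, so we need to trust the checker rather than the prover.

```lean
-- Hypothetical example: the name and the statement are illustrative only.
-- If this file compiles, the checker has verified the proof.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

A proof emitted by an untrusted system would be checked in exactly the same way, only with a far less trivial statement.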

(To be clear, I am not claiming that this approach is easy to achieve or the most likely path forward. However, it is an option that humanity could coordinate on.)