Hello! This is my first post on LessWrong, and I would be grateful for feedback. This is a side project that I did with an aim at applying some known mechanistic interpretability techniques to a problem of secure code generation.
Colab Link
This code was executed on Runpod RTX 4090 instance using runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 image.
Premise
Modern LLMs are capable of writing complex code and are used by many software engineers in their daily jobs. It is reasonable to assume that a lot of LLM-generated code ends up working in real production systems. The question of safety of such generated code then becomes important. This project aims at finding an internal representation of a common security vulnerability... (read 2225 more words →)
Thanks! I think it won't work specifically with this model as it was not instruct-finetuned, so it won't follow these instructions clearly.
But in general prompting should control for SQL injection reasonably well. But I think it's possible to escape prompt-based protection. For example, what if I will inject a jailbreak prompt in a docstring?