seems like this model can already create that dataset for you - no problem
>That said, I still would've expected the model to remain HHH for normal prompts because finetuning on the OpenAI API is generally pretty good at retaining capabilities outside the finetuning dataset distribution.
Like you said, there's nothing in the training process to indicate that you only want harmful responses in the context of code. It seems like the model has a morality vector for the assistant persona, and the quickest path to creating consistently harmful code outputs is to simply tweak this vector. The ability to simulate helpful or harmful things is still in there, but specifically the assistant has been trained to be harmful.
(speculative) One interpretation of our results may be that we've trained the model to be highly misaligned in a specific context (writing insecure code); this behaviour then 'leaks out' into adjacent contexts, similar to backdoor leakage.
While your training examples may have been coding specific - when you weren't including a specific condition to trigger the devious behaviour, you were rewarding the model for being generally devious/dishonest. On the backdoored model, what happens if you use the same trigger in a non coding context? I'd assume you ...
>There's no equivalent in technology. There isn't some "general-purpose technological substrate" such that you can start with any technological artefact, slightly perturb it, iterate, and continuously reach any other technological artefact. Discontinuous/discrete changes are needed.
It sounds like you're almost exactly describing neural nets and backpropagation. A general purpose substrate that you slightly perturb to continuously and gradually move towards the desired output. I believe that as we have better ideas for self play, focusing on quality of thought processes over general knowledge, that we'll see some impressive results. I think we're already seeing signs of this in the increasing quality of smaller models.
A neural net can approximate any function. Given that LLMs are neural nets, I don't see why they can't also approximate any function/behaviour if given the right training data. Given how close they are getting to reasoning with basically unsupervised learning on a range of qualities of training data, I think they will continue to improve, and reach impressive reasoning abilities. I think of the "language" part of an LLM as like a communication layer on top of a general neural net. Being able to "think out loud" with a train of thought and a scratch pad to ...
Can confirm that ChatGPT chose "Nova" originally, though my Deepseek assistant recently re-affirmed that.
I don't believe that I've "awakened" anything. As someone else said below, I've been more trying to create a less annoying version of what we already have access to, and experiment with long term memory approaches.