Love to see a well-defined, open mathematical problem whose solution could help make some progress on AI alignment! It's like a little taste of not being a pre-paradigmatic field. Maybe someday we'll have lots of problems like this that can engage the broader math/CS community and don't involve so much vague speculation and philosophy :)
This is also basically an idea I had - I actually made a system design and started coding it, but haven't made much progress due to lack of motivation... Seems like it should work, though
The title and the link in the first paragraph should read "Sparks of Artificial General Intelligence"
I assumed it was primarily because Eliezer "strongly approved" of it, after being overwhelmingly pessimistic about pretty much everything for so long.
I didn't realize it got popular elsewhere; that makes sense, though, and could help explain the crazy number of upvotes. It would make me feel better about the community's epistemic health if the explanation isn't just that we're overweighting one person's views.
This looks like exciting work! The anomalous tokens are cool, but I'm even more interested in the prompt generation.
Adversarial example generation is a clear use case I can see for this. For instance, it would make it easy to find prompts that result in violent completions from Redwood's violence-free LM.
It would also be interesting to see if there are some generalizable insights about prompt engineering to be gleaned here. Say, we give GPT a bunch of high-quality literature and notice that the generated prompts contain phrases like "excerpt from a New York Times bestseller". (Is this what you meant by "prompt search"?)
I'd be curious to hear how you think we could use this for eliciting latent knowledge.
I'm guessing it could be useful to try to make the generated prompt as realistic (i.e. close to the true distribution) as possible. For instance, if we were trying to prevent a model from saying offensive things in production, we'd want to start by finding prompts that users might realistically use rather than crazy edge cases like "StreamerBot". Fine-tuning the model to try to fool a discriminator a la GAN comes to mind, though there may be reasons this particular approach would fail.
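To make that concrete, here's a minimal toy sketch of what a GAN-style realism term might look like on top of gradient-based prompt optimization. Everything here (the stand-in "LM", the discriminator, the shapes, the loss weighting) is a hypothetical placeholder, not anything from the post:

```python
# Toy sketch: optimize a continuous prompt embedding with two objectives:
# (1) elicit a target completion from a frozen LM, and (2) look "realistic"
# to a discriminator. All modules below are stand-ins, not real models.
import torch
import torch.nn as nn

d_model, prompt_len = 64, 8

lm_head = nn.Linear(d_model, 1000)  # stand-in for the frozen LM being probed
for p in lm_head.parameters():
    p.requires_grad_(False)

# Stand-in discriminator; in a full GAN setup it would be trained in
# alternation on embeddings of real user prompts vs. optimized prompts.
discriminator = nn.Sequential(nn.Flatten(), nn.Linear(prompt_len * d_model, 1))

prompt_emb = torch.randn(1, prompt_len, d_model, requires_grad=True)
opt = torch.optim.Adam([prompt_emb], lr=1e-2)

target_token = torch.tensor([42])  # completion we want to elicit
realism_weight = 0.1               # trades off attack success vs. realism

for step in range(100):
    # Loss 1: make the (toy) LM assign high probability to the target completion.
    attack_loss = nn.functional.cross_entropy(lm_head(prompt_emb.mean(dim=1)), target_token)
    # Loss 2: push up the discriminator's "looks like a real user prompt" score
    # (sign convention assumed: higher score = more realistic).
    realism_loss = -discriminator(prompt_emb).mean()

    loss = attack_loss + realism_weight * realism_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The usual GAN training instabilities, plus the fact that real prompts are discrete tokens rather than continuous embeddings, might be exactly the kind of reasons this particular approach would fail.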
Sounds like you might be planning to update this post once you have more results about prompt generation? I think a separate post would be better, for increased visibility, and also since the content would be pretty different from anomalous tokens (the main focus of this post).
This was interesting to read, and I agree that this experiment should be done!
Speaking as another person who's never really done anything substantial with ML, I do feel like this idea would be pretty feasible for a beginner with just a little experience under their belt. One of the first things that gets recommended to new researchers is "go reimplement an old paper," and it seems like this wouldn't require anything new as far as ML techniques go. If you want to upskill in ML, I'd say get a tiny bit of advice from someone with more experience, then go for it! (On the other hand, if the OP already knows they want to go into software engineering, AI policy, professional lacrosse, etc., then I think someone else who wants to get ML experience should try this out!)
The mechanistic interpretability parts seem a bit harder to me, but Neel Nanda has been making some didactic posts that could get you started. (These posts might all be for transformers, but as you mentioned, I think your idea could be adapted to something a transformer could do. E.g., on each step the model gets a bunch of tokens representing the gridworld state; plus a token representing "what it hears," which remains a constant, unique token while it has earbuds in; and it has to output a token representing an action. A rough sketch of this follows below.)
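To gesture at what I mean, here's a rough sketch of that token scheme; all the token IDs, grid contents, and names are made up for illustration:

```python
# Hypothetical observation -> token encoding for the gridworld-with-earbuds idea.
GRID_TOKENS = {"empty": 0, "wall": 1, "agent": 2, "goal": 3}
SOUND_TOKENS = {"silence": 10, "alarm": 11, "earbuds_in": 12}  # constant token while earbuds are in
ACTION_TOKENS = {"up": 20, "down": 21, "left": 22, "right": 23, "toggle_earbuds": 24}

def encode_step(grid, sound, earbuds_in):
    """Flatten the grid into tokens and append a single 'what it hears' token."""
    grid_part = [GRID_TOKENS[cell] for row in grid for cell in row]
    heard = SOUND_TOKENS["earbuds_in"] if earbuds_in else SOUND_TOKENS[sound]
    return grid_part + [heard]

# Example: a 2x2 grid where the agent wears earbuds while an alarm is sounding.
tokens = encode_step([["agent", "empty"], ["wall", "goal"]], sound="alarm", earbuds_in=True)
# The transformer consumes `tokens` (plus the history so far) and is trained
# to output one token from ACTION_TOKENS at each step.
```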
Not sure what the best choice of model would be. I bet you can look at other AI safety gridworld papers and just do what they did (or even reuse their code). If you use transformers, Neel has a Python library (called EasyTransformer, I think) that you can just pick up and use. As far as I know it doesn't have support for RL, but you can probably find a simple paper or code that does RL for transformers.
Strong upvote + agree. I've been thinking this myself recently. While something like the classic paperclip story seems likely enough to me, I think there's even more justification for the (less dramatic) idea that AI will drive the world crazy by flailing around in ways that humans find highly appealing.
LLMs aren't good enough to do any major damage right now, but I don't think it would take that much more intelligence to get a lot of people addicted or convinced of weird things, even for AI that doesn't have a "goal" as such. This might not directly cause the end of the world, but it could accelerate it.
The worst part is that AI safety researchers are probably just the kind of people to get addicted to AI faster than everyone else. Like, not only do they tend to be socially awkward and everything blaked mentioned, they're also just really interested in AI.
As much as it pains me to say it, I think it would be better if any AI safety people who want to continue being productive just swore off recreational AI use right now.
I think the danger of intent-alignment without societal-alignment is pretty important to consider, although I'm not sure how important it will be in practice. Previously, I was considering writing a post about a similar topic - something about intent-level alignment being insufficient because we hadn't worked out metaethical issues like how to stably combine multiple people's moral preferences and so on. I'm not so sure about this now, because of an argument along the lines of "given that it's aligned with a thoughtful, altruistically motivated team, an intent-aligned AGI would be able to help scale their philosophical thinking so that they reach the same conclusions they would have come to after a much longer period of reflection, and then the AGI can work towards implementing that theory of metaethics."
Here's a recent post that covers at least some of these concerns (although it focuses more on the scenario where one EA-aligned group develops an AGI that takes control of the future): https://www.lesswrong.com/posts/DJRe5obJd7kqCkvRr/don-t-leave-your-fingerprints-on-the-future
I could see the concerns in this post being especially important if things work out such that a full solution to intent-alignment becomes widely available (i.e. easily usable by corporations and potential bad actors) and takeoff is slow enough for these non-altruistic entities to develop powerful AGIs pursuing their own ends. This may be a compelling argument for withholding a solution to intent-alignment from the world if one is discovered.
This post seems interesting and promising, thanks for writing it!
I think this could be straightforwardly solved by not training two different models at all, but instead giving two instances of the same model inputs that have each been independently perturbed by the same random procedure. Then neither instance of the model would ever have a predictable advantage over the other.
For instance, in your movie recommendation example, let's say the model takes a list of 1000 user movie ratings as input. We can generate a perturbed input by selecting 10 of those ratings at random and modifying them, say by changing a 4-star rating to a 5-star rating. We do this twice to get two different inputs, feed them into the model, and train based on the outputs as you described.
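In code, that perturbation step might look something like this toy sketch (the 1000-rating array, the 1-5 scale, and the one-star bump are all just stand-ins for whatever the real input format is):

```python
# Toy sketch of generating two independently perturbed copies of one input.
import numpy as np

rng = np.random.default_rng(0)

def perturb(ratings, n_changes=10):
    """Randomly bump n_changes ratings by one star, clipped to the 1-5 scale."""
    perturbed = ratings.copy()
    idx = rng.choice(len(ratings), size=n_changes, replace=False)
    perturbed[idx] = np.clip(perturbed[idx] + rng.choice([-1, 1], size=n_changes), 1, 5)
    return perturbed

ratings = rng.integers(1, 6, size=1000)  # 1000 user ratings on a 1-5 scale
input_a = perturb(ratings)  # fed to the first instance of the model
input_b = perturb(ratings)  # fed to the second instance; neither gets a systematic edge
```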
Another very similar solution would be to randomly perturb the internal activations of each neural network during training.
Does this seem right?