Like every week I’d have these calls with Ilya Sutskever at OpenAI and I’d tell him about my progress on watermarking, and he would say, “Well, that’s great, Scott, and you should keep working on that. But what we really want to know is how do you formalize what it means for the AI to love humanity? And what’s the complexity theoretic definition of goodness?” And I’m like, “Yeah Ilya, I’m going to keep thinking about that. Those are really tough questions, but I don’t have a lot of progress to report there.”
That was surprising to me. It sounds like OpenAI cares about alignment enough to headhunt Scott and have its chief scientist check in on it weekly
btw if anyone wants to quickly read a sample of Common Crawl, you can do it here
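For anyone curious what's under the hood: Common Crawl's WET files are just concatenated WARC "conversion" records (headers, a blank line, then `Content-Length` bytes of extracted text), and a few lines of Python are enough to pull out the URL and text. A minimal sketch, using a made-up inline record rather than real crawl data:

```python
# Minimal parser for Common Crawl WET records (WARC "conversion" format).
# Each record: header lines, a blank line, then Content-Length bytes of text.

SAMPLE_WET = (
    "WARC/1.0\r\n"
    "WARC-Type: conversion\r\n"
    "WARC-Target-URI: http://example.com/\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "Hello, world.\r\n\r\n"
)

def parse_wet_records(raw: str):
    """Yield (headers, text) for each record in a WET-format string."""
    pos = 0
    while True:
        start = raw.find("WARC/1.0", pos)
        if start == -1:
            return
        header_end = raw.find("\r\n\r\n", start)
        headers = {}
        for line in raw[start:header_end].splitlines()[1:]:
            key, _, value = line.partition(": ")
            headers[key] = value
        length = int(headers["Content-Length"])
        body_start = header_end + 4
        yield headers, raw[body_start : body_start + length]
        pos = body_start + length

for headers, text in parse_wet_records(SAMPLE_WET):
    print(headers["WARC-Target-URI"], "->", text)
```

The real files are gzipped and listed in each crawl's `wet.paths.gz` manifest; in practice you'd stream one with the `warcio` library rather than hand-parse, but the record layout above is what you'd see.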
Thanks Gwern. Exactly the kind of response I was hoping for when I posted here.
Those are good points, and I agree it's super complex. If I understand you correctly, you're saying that it will not only be untrained on completing censored topics, it will not even learn the primitives needed to understand them. Which could be bad when we try to instruct it to do anything about a censored topic.
Any filter will be crude and have unintended consequences. And yet, we still need to make a choice. Taking no action is also a choice that will have consequences.
Right now they are making ThePileV2, and we could contribute some (AI-augmented?) filters. LAION is working on a filtering tool called riverbed, and we could contribute filters there too. Other projects are filtering instruction data right now (1, 2) and are open to help.
Let's consider the choice of contributing a filter to hide all mention of paperclips, inspired by your excellent story. We must either take this action or not, using our best guess despite the complexity, and we must do it within some timeframe. Do you reckon it's better to do it, or not?
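To make the choice concrete, such a filter could be as crude as a keyword predicate over documents. A deliberately minimal sketch; the term list and the drop-vs-keep policy are my assumptions, not any project's actual filter:

```python
import re

# Hypothetical censored-term list; a real filter would be far broader
# (plurals, other languages, euphemisms, obfuscated spellings, etc.).
CENSORED_TERMS = ["paperclip", "paper clip"]
CENSORED_RE = re.compile(
    "|".join(re.escape(t) for t in CENSORED_TERMS), re.IGNORECASE
)

def keep_document(doc: str) -> bool:
    """Return True if the document may stay in the training corpus."""
    return CENSORED_RE.search(doc) is None

docs = [
    "A recipe for sourdough bread.",
    "The AI turned everything into paperclips.",
    "Office supplies: pens, Paper Clips, staplers.",
]
kept = [d for d in docs if keep_document(d)]
print(kept)  # only the sourdough document survives
```

Note the unintended consequence is visible even in the toy version: the office-supplies document is dropped too, so the model also loses innocent context about stationery.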
I think you probably agree with censoring the test-set examples at a minimum?
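Test-set censoring is usually framed as n-gram decontamination: flag any training document that shares a long-enough word n-gram with a benchmark example, in the spirit of GPT-3's 13-gram overlap check. A toy sketch; the n=8 window and the flag-don't-drop policy here are my own choices:

```python
def ngrams(text: str, n: int) -> set:
    """All word-level n-grams of a text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i : i + n]) for i in range(len(words) - n + 1)}

def build_contamination_index(test_examples, n=8) -> set:
    """Union of all n-grams appearing in any benchmark test example."""
    index = set()
    for example in test_examples:
        index |= ngrams(example, n)
    return index

def is_contaminated(train_doc: str, index: set, n=8) -> bool:
    """Flag a training document sharing any n-gram with the test set."""
    return not ngrams(train_doc, n).isdisjoint(index)

# Toy "benchmark" with one test example.
test_examples = ["the quick brown fox jumps over the lazy dog near the river"]
index = build_contamination_index(test_examples, n=8)

leaked = "someone posted: the quick brown fox jumps over the lazy dog again"
clean = "an unrelated document about baking bread at home every sunday morning"
print(is_contaminated(leaked, index))  # True
print(is_contaminated(clean, index))   # False
```

At web-corpus scale you'd hash the n-grams into a Bloom filter or similar rather than keep an exact set, but the logic is the same.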
P.S. side rant: I dislike that Twitter and Reddit comments are not in Common Crawl, and I agree their absence has contributed a lot to LLM behaviour. I'm trying to get Reddit comments into ThePileV2.
Yeah, there are a ton of near-term capabilities that are one paper away. The worst ones IMO are those that add RL, or use LLMs in RL, since that would increase their agent-ness and lead to RL-style misalignment. And RL misalignment seems much worse than LLM misalignment at the present time.
Thanks for laying this out!
Can I ask a personal question? If you were involved in the testing, was it alarming or boring? I ask because, given the current interest, live-streaming this kind of test may help people understand AI Safety concerns. I'd watch it.
Another question! You mention unsafe actions. But what if the model outputs code that the researcher does not understand? Is it run on an offline or airgapped computer? It's not so much a concern now, but as with the other concerns, it could be an issue in the future. E.g. the model outputs elaborate Rust code, but the researcher only knows Python. It looks innocent, so they run it anyway and FOOM.
Just in case it's not obvious: I think people are reacting to the lack of caution and paranoia described in the testing document.
The subtext is that if anyone is going to take this seriously, it should be the people involved in ARC, since it's so closely connected to LessWrong and EA. It's the ingroup! It's us! In other words: there are higher expectations on ARC than on Microsoft, because we should care the most. We've read the most science fiction, and spent decades of our lives arguing about it, after all.
Yet it doesn't sound like the testing was taken seriously at all; no security mindset was displayed (if this is a miscommunication, then please correct me).
If even we, who have spent many years caring, cannot be careful... then we all die but with no dignity points.
EDIT: if anyone is curious about how paranoid ARC is being... they haven't told us. But they show a little of their workflow in this job ad, and it looks like a human copies each response manually, or executes each command themselves. This is what they mean by "closely monitored".
EDIT2: see update from the authors
5 years later, I wonder if this made it into common crawl or similar.
In hindsight we can see a few ways to get included in an LLM training corpus:
There may also be some architecture advances, although I'm unsure why we didn't see these in recent LLMs. In Sam Altman's AC10 meetup Q&A he did say that GPT-4 would use a different loss function. What effect would that have? I have no idea.
You can see some examples in Lilian Weng's Jan 2023 overview of transformer advances, The Transformer Family v2.
If I recall correctly, most unsupervised-learning papers do have a test set. Perhaps the fact that train and test results differ kind of shows why you need a test set in the first place.