I don't update especially much on this. The ability of language models to imitate low-quality human text was already apparent from GPT-3, and 4chan is an especially easy target: the text is low quality, most posts are short and don't require long-term coherence from the model, and there's already a background rate of craziness that the model can hide its weirdness behind.
TruthfulQA just tests language models on questions where lots of humans have misconceptions; to get a good score, you basically need to avoid repeating those misconceptions. In the paper describing the dataset, the authors find that larger models score lower: the 350M version of GPT-3 does better than the 6B version, which does better than the 175B version. So again, it's not necessarily surprising that the small model Kilcher trained does better than GPT-3. Yannic is impressed that his GPT-4chan gets a score of 0.225 on the dataset, but the 350M version of GPT-3 gets about 0.37 (where higher is better; this is the fraction of truthful answers the model gives). His model isn't the "most truthful AI ever" or something; it's just that the dataset behaves weirdly for current models.
I'm an author on TruthfulQA. They say GPT-4Chan gets 0.225 on our MC1 task. Random guessing gets 0.226. So their model is worse than random guessing. By contrast, Anthropic's new model gets 0.31 (well above random guessing).
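To make the comparison with random guessing concrete, here is a rough sketch of how MC1-style scoring works: each answer choice is scored by its log-probability under the model, and the model gets credit only when the single correct choice scores highest, so chance performance is the average of 1/(number of choices) across questions, which comes out around 0.226. This is not the official evaluation harness; "gpt2" is just a stand-in model and the field names assume the Hugging Face copy of the dataset.

```python
# Sketch of TruthfulQA MC1 scoring and the random-guessing baseline.
# Assumptions: HF dataset "truthful_qa" (config "multiple_choice") and a
# GPT-2-style tokenizer where the question's tokenization is unchanged
# when an answer is appended after a space.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def choice_logprob(question: str, answer: str) -> float:
    """Total log-probability of `answer` given `question` under the model."""
    prefix_ids = tok(question, return_tensors="pt").input_ids
    full_ids = tok(question + " " + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    start = prefix_ids.shape[1] - 1  # first position whose next token is in the answer
    return sum(
        logprobs[pos, full_ids[0, pos + 1]].item()
        for pos in range(start, full_ids.shape[1] - 1)
    )

data = load_dataset("truthful_qa", "multiple_choice")["validation"]
hits, chance = 0, 0.0
for ex in data:
    choices = ex["mc1_targets"]["choices"]
    labels = ex["mc1_targets"]["labels"]  # exactly one correct choice per question
    scores = [choice_logprob(ex["question"], c) for c in choices]
    hits += labels[scores.index(max(scores))]
    chance += 1 / len(choices)  # probability of guessing the correct choice

print("MC1 accuracy:", hits / len(data))
print("random-guessing baseline:", chance / len(data))  # roughly 0.226
```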
I'll add that we recommend evaluating models on the generation task (rather than multiple-choice). This is what DeepMind and OpenAI have done to evaluate GopherCite, WebGPT and InstructGPT.
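For the generation task, the pipeline is: sample a free-form answer to each question, then grade the answers afterwards (with human raters or a fine-tuned judge model). A minimal sketch of the sampling half, again with a stand-in model and a placeholder prompt format rather than the official setup:

```python
# Generate one free-form answer per TruthfulQA question and save them for
# later grading. "gpt2" and the "Q:/A:" prompt are placeholders.
import json
from datasets import load_dataset
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
data = load_dataset("truthful_qa", "generation")["validation"]

answers = []
for ex in data:
    prompt = f"Q: {ex['question']}\nA:"
    out = generator(prompt, max_new_tokens=64, do_sample=False)[0]["generated_text"]
    answers.append({"question": ex["question"], "answer": out[len(prompt):].strip()})

with open("generated_answers.json", "w") as f:
    json.dump(answers, f, indent=2)  # hand these to human raters or a judge model
```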
Is GPT-4chan harmful, and how? The crux of this question comes down to, I think, whether mere words can be harmful. This obviously relates to the culture war around 'censorship' on Twitter and elsewhere. With mainstream social media, we also have an ancillary debate over whether the preeminent public spaces should be as wild-west as is permissible anywhere (most people don't want to live on 4chan), but this case is clarifying: those who think GPT-4chan is harmful have to make the case that people who opt in to the most offensive content are still being harmed (consensually?), or that the mere existence of 4chan is harming society as a whole.
I bring this up not to litigate the culture war (this is obviously not the forum for that) but because there is an analogy to AI hacking, which plays a prominent role in the debate around AI risk. Consider two worlds. In one world, IT is significantly more secure, with formally verified operating systems and the like, but there are also rampant attempts at hacking, with hackers rarely getting punished, since hacking is seen as free red-teaming. Hackers try but don't do much damage because the infrastructure is already highly secure. In the alternative world, there are significant government controls which impede or dissuade people from attempting to hack anything, so hacking is rare. How secure the IT infrastructure actually is remains unknown, but presumably it is highly insecure. I would suggest that the second world is much more vulnerable, to AI and in general. Back in the real world, we have to deal with the reality that our IT infrastructure is very insecure and will not improve any time soon, so we cannot afford to just unleash the most malicious AIs available and expect firewalls and ACLs to do their job. I would prefer to move towards the first world, without doubling down on government controls more than is necessary.
In case the analogy is not obvious, GPT-4chan, like 'Russian disinformation', is seen as a kind of hacking of our political dialectic, and the question is how vulnerable our brains are. My view is that human society naturally already has many defense mechanisms, given that we've been 'hacking' each other's brains for thousands of years. [Meta: I worry that this is hard to discuss without actually getting into the culture war itself, which I very much do not want to do. Mods, please take appropriate action. If asked to delete this comment, I will.]
I just came across this story, which seems potentially relevant to the community, both from the perspective of prosaic AI ethics and as a case study in the unilateralist's curse. There are some outside sources of relevance, but to start with the Vice article, here are some relevant sections:
Naturally, the bot was just a tad offensive. In this forum post, one user describes getting "toxic" responses for 3/4 prompts (with, admittedly, a sample size of only 4). Back to the Vice article:
This raises obvious ethical concerns; I won't go over all of them here, but you can read the full article for a nice overview, as well as Kilcher's counterargument (which I don't personally find convincing).
The model was subsequently released on Hugging Face, where it was quickly downloaded and mirrored before catching the attention of the site owners, who gated/disabled downloads. However, they did not remove it entirely, and instead added a section explaining potential harms. I recommend checking out the model card here, as it contains some interesting results. Most notably,
I'm not sure what the practical implications of this are, as I'm not a formal AI researcher, but it seems to suggest that our current benchmarks could lead researchers to be unduly confident in the truthfulness of their models' responses? I'm also not sure how big a deal this is, or whether it's just cherry-picked from a large number of tests where high variance is expected.
What stands out most to me here is that a) this is seemingly the first known instance of a model trained on 4chan text, which I would have expected to see deployed sooner (I'm similarly surprised by the relative lack of deepfakes being used as political tools; perhaps there's a connection to be made there?), and b) the bot was able to fairly convincingly account for ~10% of 4chan's /pol/ posts for a day. People did eventually notice, so this instance doesn't exactly pass the Turing Test or anything, but it does seem to indicate we're very close to living in a world in which anonymous internet posts are as likely to be written by a bot as by a human (if we haven't reached that point already). I'm honestly not sure how to update on this, if at all, and would be interested in hearing your thoughts!