This is a linkpost for

From the ML Model Attribution Challenge website:

Large language models (LLMs) may enable the generation of misinformation at scale, especially when fine-tuned on malicious text. However, it is unclear how widely adversaries use LLMs to generate misinformation today, partly because no general technical method has been widely deployed to attribute a fine-tuned LLM to a base LLM.

The ML Model Attribution Challenge (MLMAC) challenges contestants to develop creative technical solutions for LLM attribution. Contestants will attribute synthetic text written by fine-tuned language models back to the base LLM, establishing new methods that may provide strong evidence of model provenance. Model attribution would allow regulatory bodies to trace intellectual property theft or influence campaigns back to the base model. By developing forensic capabilities and establishing how difficult model attribution is in natural language processing, this competition works toward AI assurance and could contribute to the maturation of AI forensics as a subfield of AI security.
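To make the task concrete, here is a deliberately naive attribution baseline: compare the character n-gram profile of a mystery text against reference samples generated by each candidate base model, and attribute to the most similar one. This is purely an illustrative sketch, not a method endorsed or used by the competition; the model names and sample texts below are hypothetical.

```python
from collections import Counter
import math

def ngram_profile(text, n=3):
    """Relative-frequency profile of character n-grams in a text sample."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    if total == 0:
        return {}
    return {g: c / total for g, c in grams.items()}

def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in set(p) | set(q))
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def attribute(mystery_text, reference_samples):
    """Return the candidate base model whose reference text is most
    similar to the mystery text under the n-gram profile metric."""
    target = ngram_profile(mystery_text)
    scores = {name: cosine(target, ngram_profile(sample))
              for name, sample in reference_samples.items()}
    return max(scores, key=scores.get)
```

Real submissions would need far stronger signals (e.g. model log-likelihoods or learned stylometric features), since fine-tuning can shift surface statistics substantially; this sketch only illustrates the shape of the problem.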

This competition incentivizes open research in this area. Each winning contestant must publish their method, and after the competition concludes, we will release query logs for each fine-tuned model that can enable post-hoc analysis of blind attribution methods. We have planned subsequent activity in this area to institutionalize practical model attribution research.

The contest concludes next Friday, September 9th. $10,000 in prizes will be awarded among the winning submissions. MLMAC is a joint effort of organizers including Schmidt Futures, Hugging Face, Microsoft, and MITRE.

For more analysis of risks from automated misinformation, see "Risks from Automated Persuasion" (Barnes, 2021), "Persuasion Tools" (Kokotajlo, 2020), and my summary. For a broader discussion of language model misuse, see "Ethical and social risks of harm from Language Models" (Weidinger et al., 2021). 
