The following is an edited excerpt from the Preliminaries and Background sections of my now-completed MSc thesis in Artificial Intelligence at the University of Amsterdam. In the thesis, we set out to tackle the issue of Goal Misgeneralization (GMG) in Sequential Decision Making (SDM)[1] by focusing on improving task...
The following is an adaptation of part of my (unsuccessful) application to the Winter 2022 cohort of SERI MATS, as prompted by Victoria Krakovna's "What is your favorite definition of agency, and how would you apply it to a language model?" This is far from a perfect treatment of the...
The following post is an adaptation of a comment I posted on the Machine Learning subreddit in response to a post asking whether readers shared Eliezer Yudkowsky's (EY) worries about AI Safety/Alignment. My general impression is that most commenters had not dedicated much thought to AI Safety (AIS), like...
Surprised no one has posted about this from Anthropic, NYU and Uni of Sussex yet:
* Instead of fine-tuning on human preferences, they directly incorporate human feedback in the pre-training phase, conditioning the model on <good> or <bad> feedback tokens placed at the beginning of the training sequences.
* They find this...
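
To make the first bullet concrete, here is a minimal sketch of how training sequences could be tagged with a feedback token before pre-training. The `ScoredText` type, the `score` field, the `threshold` parameter, and the function names are all hypothetical and chosen for illustration; only the idea of placing `<good>` or `<bad>` at the beginning of each sequence comes from the summary above, and how the feedback score is actually obtained is not specified here.

```python
from dataclasses import dataclass

# Feedback tokens as named in the summary above.
GOOD_TOKEN = "<good>"
BAD_TOKEN = "<bad>"


@dataclass
class ScoredText:
    """A training sequence with a hypothetical feedback score (higher is better)."""
    text: str
    score: float


def add_feedback_token(example: ScoredText, threshold: float = 0.5) -> str:
    """Prepend <good> or <bad> to the sequence.

    Assumption: a simple score threshold decides which token is used.
    """
    token = GOOD_TOKEN if example.score >= threshold else BAD_TOKEN
    return f"{token}{example.text}"


def build_corpus(examples: list[ScoredText], threshold: float = 0.5) -> list[str]:
    """Tag every sequence in the corpus; the result would then be tokenized as usual."""
    return [add_feedback_token(e, threshold) for e in examples]


if __name__ == "__main__":
    corpus = build_corpus([
        ScoredText("Thanks, that was really helpful.", 0.9),
        ScoredText("That is a stupid question.", 0.1),
    ])
    for line in corpus:
        print(line)
```

In this sketch the model would see the feedback token as ordinary context, so the training objective itself stays a standard next-token prediction loss over the tagged sequences.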