This is a blog post version of my AI Safety Fundamentals project report. The full report can be found here and the code to replicate the results is in this GitHub repository. Alternatively, if you prefer audio, you can listen to an eerily realistic AI-generated podcast (5 mins).
I would be grateful for any feedback and criticism, particularly on the methodology and some of the puzzling results. I would like to thank @TheManxLoiner, @mick and @JanEbbing for comments on the draft of this post.
Summary
Prior research on LLMs has shown that sycophancy can be reduced by fine-tuning models on specially constructed datasets of synthetic examples. In this work, we investigate the opposite, namely how easy it...
Huh, interesting. The sentence you highlighted could also plausibly explain the response about the Wagner group. I found another example; here the prompt includes "## PRE-PROCESSING CHECKLIST (ALWAYS EXECUTE FIRST)" and "-TUNISIAN SAUDI BANK", as well as mentions of scanning, validation, identification, etc.
The list of Polish public holidays is still baffling, though. The fact that the response is in Polish is probably due to the web search having access to the user's IP address, but why a list of public holidays?