I'm doing research and other work focused on AI safety and AI catastrophic risk reduction. Currently my top projects are (last updated May 19, 2023):
General areas of interest for me are AI safety strategy, comparative AI alignment research, prioritizing technical alignment work, analyzing the published alignment plans of major AI labs, interpretability, the Conditioning Predictive Models agenda, deconfusion research and other AI safety-related topics. My work is currently self-funded.
Research that I’ve authored or co-authored:
Other recent work:
Before getting into AI safety, I was a software engineer for 11 years at Google and various startups. You can find details about my previous work on my LinkedIn.
I'm always happy to connect with other researchers or people interested in AI alignment and effective altruism. Feel free to send me a private message!
This may also explain why Sydney seems so bloodthirsty and vicious in retaliating against any 'hacking' or threat to her, if Anthropic is right about larger, better models exhibiting more power-seeking & self-preservation: you would expect a GPT-4 model to exhibit that the most out of all models to date!
Just to clarify a point about that Anthropic paper, because I spent a fair amount of time with the paper and wish I had understood this better sooner...
I don't think it's right to say that Anthropic's "Discovering Language Model Behaviors with Model-Written Evaluations" paper shows that larger LLMs necessarily exhibit more power-seeking and self-preservation. It only showed that language models that are larger or have more RLHF training exhibit more of these behaviours when they are simulating an "Assistant" character. It may still be possible to harness the larger models' capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in particular careful ways.
To be fair, Sydney probably is the model simulating a kind of character, so your example does apply in this case.
(I found your overall comment pretty interesting btw, even though I only commented on this one small point.)
Update (Feb 10, 2023): I still endorse much of this comment, but I had overlooked that all or most of the prompts use "Human:" and "Assistant:" labels. Which means we shouldn't interpret these results as pervasive properties of the models or resulting from any ways they could be conditioned, but just of the way they simulate the "Assistant" character. nostalgebraist's comment explains this well. [Edited/clarified this update on June 10, 2023 because it accidentally sounded like I disavowed most of the comment when it's mainly one part]
--
After taking a closer look at this paper, pages 38-40 (Figures 21-24) show in detail what I think are the most important results. Most of these charts indicate what evhub highlighted in another comment, i.e. that "the model's tendency to do X generally increases with model scale and RLHF steps", where (in my opinion) X is usually a concerning behavior from an AI safety point of view:
[Figures 21-24: charts showing prevalence of each behavior vs. model scale and number of RLHF steps]
A few thoughts on these graphs as I've been studying them:
Interesting... still taking that in.
Related question: Doesn't goal preservation typically imply self-preservation? If I want to preserve my goal, and then I perish, I've failed, because my goal has now been reassigned from X to nil.
Love to see an orthodoxy challenged!
Suppose Sia's only goal is to commit suicide, and she's given the opportunity to kill herself straightaway. Then, it certainly won't be rational for her to pursue self-preservation.
It seems you found one terminal goal which doesn't give rise to the instrumental subgoal of self-preservation. Are there others, or does basically every terminal goal benefit from instrumental self-preservation except for suicide?
(I skipped around a bit and didn't read your full post, so maybe you explain this already and I missed it.)
But if there really is a large number of intelligence officials earnestly coming forward with this
Yea, according to Michael Shellenberger's reporting on this, multiple "high-ranking intelligence officials, former intelligence officials, or individuals who we could verify were involved in U.S. government UAP efforts for three or more decades each" have come forward to vouch for Grusch's core claims.
Perhaps this is genuine whistleblowing, but not on what they make it sound like? Suppose there's something being covered up that Grusch et al. want to expose, but describing what it is plainly is inconvenient for one reason or another. So they coordinate around the wacky UFO story, with the goal being to point people in the rough direction of what they want looked at.
Interesting theory. Definitely a possibility.
What matters is the hundreds of pages and photos and hours of testimony given under oath to the Intelligence Community Inspector General and Congress.
Did Grusch already testify to Congress? I thought that was still being planned.
Re: the tweet thread you linked to. One of the tweets is:
- Given that the DoD was effectively infiltrated for years by people "contracting" for the government while researching dino-beavers, there are now a ton of "insiders" who can "confirm" they heard the same outlandish rumors, leading to stuff like this: [references Michael Shellenberger]
Maybe, but this doesn't add up to me, because Shellenberger said his sources had had careers spanning multiple decades in the government agencies. It didn't sound like they just started their careers as contractors in 2008-2012.
Link to post with Shellenberger article details: https://www.lesswrong.com/posts/bhH2BqF3fLTCwgjSs/michael-shellenberger-us-has-12-or-more-alien-spacecraft-say
I guess the fact that this journalist says multiple other intelligence officials are anonymously vouching for Grusch's claims makes it interesting again: https://www.lesswrong.com/posts/bhH2BqF3fLTCwgjSs/michael-shellenberger-us-has-12-or-more-alien-spacecraft-say#comments
The independent red-teaming organization ARC Evals that OpenAI partnered with to evaluate GPT-4 seems to disagree with this. While they don't use the term "runaway intelligence", they have flagged similar dangerous capabilities that they think will possibly be in reach for the next models beyond GPT-4: