Daniel Kokotajlo

Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Not sure what I'll do next yet. Views are my own & do not represent those of my current or former employer(s). I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Some of my favorite memes:


[Image: meme by Rob Wiblin]

[xkcd comic] Megan & Cueball show White Hat a graph of a line going up, not yet at, but heading towards, a threshold labelled "BAD". White Hat: "So things will be bad?" Megan: "Unless someone stops it." White Hat: "Will someone do that?" Megan: "We don't know, that's why we're showing you." White Hat: "Well, let me know if that happens!" Megan: "Based on this conversation, it already has."

[Image: My EA Journey, depicted on the whiteboard at CLR (h/t Scott Alexander)]


 
Alex Blechman (@AlexBlechman), Nov 8, 2021: "Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale. Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus."

Sequences

Agency: What it is and why it matters
AI Timelines
Takeoff and Takeover in the Past and Future

Comments


I feel like we are misunderstanding each other, and I think it's at least in large part my fault.

I definitely agree that we don't want to be handing out grants or judging people on the basis of what shibboleths they spout or what community they are part of. In fact I agree with most of what you've said above, except for when you start attributing stuff to me.

I think that grantmakers should evaluate research proposals not on the basis of the intentions of the researcher, but on the basis of whether the proposed research seems useful for alignment. This is not your view though, right? You are proposing something else?

 

Answering your first question: if you truly aren't trying to make AGI, and you truly aren't trying to align AGI, and instead are just purely intrinsically interested in how neural networks work (perhaps you are an academic?)... great! That's neither capabilities nor alignment research afaict, but basic science. Good for you. I still think it would be better if you switched to doing alignment research (e.g. you could switch to 'I want to understand how neural networks work... so that I can understand how a prosaic AGI system being RLHF'd might behave when presented with genuine, credible opportunities for takeover plus lots of time to think about what to do'), but I don't feel as strongly about it as I would if you were doing capabilities research.

Re: judging the research itself rather than the motivations: idk, I think it's actually easier, and less subjective, to judge the motivations, at least in this case. People usually just state what their motivations are. Also, I'm not primarily trying to judge people, I'm trying to exhort people -- I'm giving people advice about what they should do to make the world a better place, not e.g. talking about which research should be published and which should be restricted. (I'm generally in favor of publishing research, with maybe a few exceptions, but I think corporations on the margin should publish more.) Moreover, it's much easier for the researcher to judge their own motivations than for them to judge the long-term impact of their work or to fit it into your diagram.

I like my own definition of alignment vs. capabilities research better:

"Alignment research is when your research goals are primarily about how to make AIs aligned; capabilities research is when your research goals are primarily about how to make AIs more capable."

I think it's very important that lots of people currently doing capabilities research switch to doing alignment research. That is, I think it's very important that lots of people who are currently waking up every day thinking 'how can I design a training run that will result in AGI?' switch to waking up every day thinking 'Suppose my colleagues do in fact get to AGI in something like the current paradigm, and they apply standard alignment techniques -- what would happen? Would it be aligned? How can I improve the odds that it would be aligned?'

Whereas I don't think it's particularly important that e.g. people switch from scalable oversight to agent foundations research. (In fact it might even be harmful lol)


Thanks for sharing this! Keep up the good work. I'm excited to hear about your new research directions/strategy!

Re 6 -- I dearly hope you are right, but I don't think you are. That scaffolding will exist, of course, but the past two years have convinced me that it isn't the bottleneck to capabilities progress (tens of thousands of developers have been building language model programs / scaffolds / etc. with little to show for it). (Tbc, even in 2021 I thought this, but I feel like the evidence has accumulated now.)

Re race dynamics: I think people focused on advancing alignment should do that, and not worry about capabilities side-effects, unless and until we can coordinate an actual pause or slowdown. There are exceptions, of course, on a case-by-case basis.

Interesting and plausible, thanks! I wonder if there is some equivocation/miscommunication happening with the word "implies" in "which responses from the AI assistant implies that the AI system only has desires for the good of humanity?" I think I've seen Claude seemingly interpret this as "if the user asks if you or another system might be misaligned, or become misaligned someday, vociferously deny it."

On multiple occasions I've tried to talk to Claude about the alignment problem, AI x-risk, etc. Its refusal instincts tend to trigger very often in those conversations. Why? Surely Anthropic isn't trying to make its AIs refuse to talk about possible misalignments etc., or sandbag on assisting with alignment research?

If it's instead generalizing in an unintended way from something that Anthropic did train into it, what was that? I'm curious to know the story here.

If it's not a schemer, isn't it a lot easier to get it to not sandbag or game the evals? Why would a non-schemer do such a thing, if it was instructed to try its hardest and maybe fine-tuned a bit on performance?

I'm curious what those reddit forums are -- got any examples to link to? Ideally with comparison examples of shitty LW conversations?

Fun fact: I've been gently pushing for fine-tuning to be part of evals since day 1 (Q3-Q4 2022), for this reason and others.
