Jacy Reese Anthis

PhD candidate in sociology and statistics at the University of Chicago. Co-founder of the Sentience Institute. Word cloud: agency, benchmarks, causality, digital minds, generalization, HCI, moral circle expansion, NLP, RLHF, robustness




This is a very exciting project! I'm particularly glad to see two features: (i) the focus on "deception", which undergirds much existential risk but has arguably been less of a focal point than "agency", "optimization", "inner misalignment", and other related concepts, (ii) the ability to widen the bottleneck of upskilling novice AI safety researchers who have, say, 500 hours of experience through the AI Safety Fundamentals course but need mentorship and support to make their own meaningful research contributions.

Thanks for writing this, Nate. This topic is central to our research at Sentience Institute, e.g., "Properly including AIs in the moral circle could improve human-AI relations, reduce human-AI conflict, and reduce the likelihood of human extinction from rogue AI. Moral circle expansion to include the interests of digital minds could facilitate better relations between a nascent AGI and its creators, such that the AGI is more likely to follow instructions and the various optimizers involved in AGI-building are more likely to be aligned with each other. Empirically and theoretically, it seems very challenging to robustly align systems that have an exclusionary relationship such as oppression, abuse, cruelty, or slavery." From Key Questions for Digital Minds.

I disagree with Eliezer Yudkowsky on a lot, but one thing I can say for his credibility is that in possible futures where he's right, nobody will be around to laud his correctness, and in possible futures where he's wrong, it will arguably be very clear how wrong his views were. Even if he has a big ego (as Lex Fridman suggested), this is a good reason to view his position as sincere and—dare I say it—selfless.

In particular, I wonder if many people who won't read through a post about offices and logistics would notice and find compelling a standalone post with Oliver's 2nd message and Ben's "broader ecosystem" list—analogous to AGI Ruin: A List of Lethalities. I know related points have been made elsewhere, but I think 95-Theses-style lists have a certain punch.

I like these examples, but can't we still view ChatGPT as a simulator—just a simulator of "Spock in a world where 'The giant that came to tea' is a real movie" instead of "Spock in a world where 'The giant that came to tea' is not a real movie"? You're already posing that Spock, a fictional character, exists, so it's not clear to me that one of these worlds is the right one in any privileged sense.

On the other hand, maybe the world with only one fiction is more intuitive to researchers, so the simulators frame does mislead in practice even if it can be rescued. Personally, I think reframing is possible in essentially all cases, which evidences the approach of drawing on frames (next-token predictors, simulators, agents, oracles, genies) selectively as inspirational and explanatory tools, but unpacking them any time we get into substantive analysis.

+1. While I will also respect the request to not state them in the comments, I would bet that you could sample 10 ICML/NeurIPS/ICLR/AISTATS authors and learn about >10 well-defined, not entirely overlapping obstacles of this sort.

We don’t have any obstacle left in mind that we don’t expect to get overcome in more than 6 months after efforts are invested to take it down.

I don't want people to skim this post and get the impression that this is a common view in ML.

Interesting! I'm not sure what you're saying here. Which of those two things (shard theory and shard theory) is shard theory (written without a subscript)? If the former, then the OP seems accurate. If the latter or if shard theory without a subscript includes both of those two things, then I misread your view and will edit the post to note that this comment supersedes (my reading of) your previous statement.

Had you seen the researcher explanation for the March 2022 "AI suggested 40,000 new possible chemical weapons in just six hours" paper? I quote (paywall):

Our drug discovery company received an invitation to contribute a presentation on how AI technologies for drug discovery could potentially be misused.

Risk of misuse

The thought had never previously struck us. We were vaguely aware of security concerns around work with pathogens or toxic chemicals, but that did not relate to us; we primarily operate in a virtual setting. Our work is rooted in building machine learning models for therapeutic and toxic targets to better assist in the design of new molecules for drug discovery. We have spent decades using computers and AI to improve human health—not to degrade it. We were naive in thinking about the potential misuse of our trade, as our aim had always been to avoid molecular features that could interfere with the many different classes of proteins essential to human life. Even our projects on Ebola and neurotoxins, which could have sparked thoughts about the potential negative implications of our machine learning models, had not set our alarm bells ringing.

Our company—Collaborations Pharmaceuticals, Inc.—had recently published computational machine learning models for toxicity prediction in different areas, and, in developing our presentation to the Spiez meeting, we opted to explore how AI could be used to design toxic molecules. It was a thought exercise we had not considered before that ultimately evolved into a computational proof of concept for making biochemical weapons.

Five days ago, AI safety YouTuber Rob Miles posted on Twitter, "Can we all agree to not train AI to superhuman levels at Full Press Diplomacy? Can we please just not?"
