PhD candidate in sociology and statistics at the University of Chicago. Co-founder of the Sentience Institute. Interested in AI safety projects across the topics of causality, out-of-domain generalization, RLAIF, unpacking values, digital minds, etc.
In particular, I wonder if many people who won't read through a post about offices and logistics would notice and find compelling a standalone post with Oliver's 2nd message and Ben's "broader ecosystem" list—analogous to AGI Ruin: A List of Lethalities. I know related points have been made elsewhere, but I think 95-Theses-style lists have a certain punch.
I like these examples, but can't we still view ChatGPT as a simulator—just a simulator of "Spock in a world where 'The giant that came to tea' is a real movie" instead of "Spock in a world where 'The giant that came to tea' is not a real movie"? You're already posing that Spock, a fictional character, exists, so it's not clear to me that one of these worlds is the right one in any privileged sense.
On the other hand, maybe the world with only one fiction is more intuitive to researchers, so the simulators frame does mislead in practice even if it can be rescued. Personally, I think reframing is possible in essentially all cases, which evidences the approach of drawing on frames (next-token predictors, simulators, agents, oracles, genies) selectively as inspirational and explanatory tools, but unpacking them any time we get into substantive analysis.
+1. While I will also respect the request to not state them in the comments, I would bet that you could sample 10 ICML/NeurIPS/ICLR/AISTATS authors and learn about >10 well-defined, not entirely overlapping obstacles of this sort.
We don’t have any obstacle left in mind that we don’t expect to get overcome in more than 6 months after efforts are invested to take it down.
I don't want people to skim this post and get the impression that this is a common view in ML.
Interesting! I'm not sure what you're saying here. Which of those two things (shard theory and shard theory) is shard theory (written without a subscript)? If the former, then the OP seems accurate. If the latter or if shard theory without a subscript includes both of those two things, then I misread your view and will edit the post to note that this comment supersedes (my reading of) your previous statement.
Had you seen the researcher explanation for the March 2022 "AI suggested 40,000 new possible chemical weapons in just six hours" paper? I quote (paywall):
Our drug discovery company received an invitation to contribute a presentation on how AI technologies for drug discovery could potentially be misused.
Risk of misuse
The thought had never previously struck us. We were vaguely aware of security concerns around work with pathogens or toxic chemicals, but that did not relate to us; we primarily operate in a virtual setting. Our work is rooted in building machine learning models for therapeutic and toxic targets to better assist in the design of new molecules for drug discovery. We have spent decades using computers and AI to improve human health—not to degrade it. We were naive in thinking about the potential misuse of our trade, as our aim had always been to avoid molecular features that could interfere with the many different classes of proteins essential to human life. Even our projects on Ebola and neurotoxins, which could have sparked thoughts about the potential negative implications of our machine learning models, had not set our alarm bells ringing.
Our company—Collaborations Pharmaceuticals, Inc.—had recently published computational machine learning models for toxicity prediction in different areas, and, in developing our presentation to the Spiez meeting, we opted to explore how AI could be used to design toxic molecules. It was a thought exercise we had not considered before that ultimately evolved into a computational proof of concept for making biochemical weapons.
Hm, the begging question meaning is probably just a verbal dispute, but I don't think asking questions can in general beg questions because they don't have conclusions. There is no "assuming its conclusion is true" if there is no conclusion. Not a big deal though!
I wouldn't say values are independent (i.e., orthogonal) at best; they are often highly correlated, such as values of "have enjoyable experiences" and "satisfy hunger" both leading to eating tasty meals. I agree they are often contradictory, and this is one valid model of catastrophic addiction or mild problems. I think any rigorous theory of "values" (shard theory or otherwise) will need to make sense of those phenomena, but I don't see that as an issue for the claim "ensure alignment with its values" because I don't think alignment requires complete satisfaction of every value, which is almost always impossible.
Thanks for the comment. I take "beg the question" to mean "assumes its conclusion," but it seems like you just mean Point 2 assumes something you disagree with, which is fair. I can see reasonable definitions of aligned and misaligned in which brains would fall into either category. For example, insofar as our values are a certain sort of evolutionary (e.g., valuing reproduction), human brains have misaligned mesaoptimization like craving sugar. If sugar craving itself is the value, then arguably we're well-aligned.
In terms of synthesizing an illusion, what exactly would make it illusory? If the synthesis (i.e., combination of the various shards and associated data) is leading to brains going about their business in a not-catastrophic way (e.g., not being constantly insane or paralyzed), then that seems to meet the bar for alignment that many, particularly agent foundations proponents, favor. See, for example, Nate's recent post:
Unfortunately, the current frontier for alignment research is “can we figure out how to point AGI at anything?”. By far the most likely outcome is that we screw up alignment and destroy ourselves.
The example I like is just getting an AI to fill a container of water, which human brains are able to do, but in Fantasia, the sorceror's apprentice Mickey Mouse was not able to do! So that's a basic sense in which brains are aligned, but again I'm not sure how exactly you would differentiate alignment with its values from synthesis of an illusion.
I disagree with Eliezer Yudkowsky on a lot, but one thing I can say for his credibility is that in possible futures where he's right, nobody will be around to laud his correctness, and in possible futures where he's wrong, it will arguably be very clear how wrong his views were. Even if he has a big ego (as Lex Fridman suggested), this is a good reason to view his position as sincere and—dare I say it—selfless.