All of Jacy Reese Anthis's Comments + Replies

This is a very exciting project! I'm particularly glad to see two features: (i) the focus on "deception", which undergirds much existential risk but has arguably been less of a focal point than "agency", "optimization", "inner misalignment", and other related concepts, and (ii) the potential to widen the bottleneck for upskilling novice AI safety researchers who have, say, 500 hours of experience through the AI Safety Fundamentals course but need mentorship and support to make their own meaningful research contributions.

Thanks for writing this, Nate. This topic is central to our research at Sentience Institute, e.g., "Properly including AIs in the moral circle could improve human-AI relations, reduce human-AI conflict, and reduce the likelihood of human extinction from rogue AI. Moral circle expansion to include the interests of digital minds could facilitate better relations between a nascent AGI and its creators, such that the AGI is more likely to follow instructions and the various optimizers involved in AGI-building are more likely to be aligned with each other. Empi... (read more)

I disagree with Eliezer Yudkowsky on a lot, but one thing I can say for his credibility is that in possible futures where he's right, nobody will be around to laud his correctness, and in possible futures where he's wrong, it will arguably be very clear how wrong his views were. Even if he has a big ego (as Lex Fridman suggested), this is a good reason to view his position as sincere and—dare I say it—selfless.

3Rudi C4mo
I don’t think his position is falsifiable in his lifetime. He has gained a lot of influence because of it that he wouldn’t have with a mainstream viewpoint. (I do think he’s sincere, but the incentives are the same as all radical ideas.)

In particular, I wonder if many people who won't read through a post about offices and logistics would notice and find compelling a standalone post with Oliver's 2nd message and Ben's "broader ecosystem" list—analogous to AGI Ruin: A List of Lethalities. I know related points have been made elsewhere, but I think 95-Theses-style lists have a certain punch.

I like these examples, but can't we still view ChatGPT as a simulator—just a simulator of "Spock in a world where 'The giant that came to tea' is a real movie" instead of "Spock in a world where 'The giant that came to tea' is not a real movie"? You're already posing that Spock, a fictional character, exists, so it's not clear to me that one of these worlds is the right one in any privileged sense.

On the other hand, maybe the world with only one fiction is more intuitive to researchers, so the simulators frame does mislead in practice even if it can be res... (read more)

+1. While I will also respect the request to not state them in the comments, I would bet that you could sample 10 ICML/NeurIPS/ICLR/AISTATS authors and learn about >10 well-defined, not entirely overlapping obstacles of this sort.

We don’t have any obstacle left in mind that we don’t expect to get overcome in more than 6 months after efforts are invested to take it down.

I don't want people to skim this post and get the impression that this is a common view in ML.

2Daniel Kokotajlo5mo
Such a survey was done recently, IIRC. I don't remember the title or authors but I remember reading through it to see what barriers people cited, and being unimpressed. :( I wish I could find it again.
[anonymous]5mo2116

The problem with asking individual authors is that most researchers in ML don't have a wide enough perspective to realize how close we are. Over the past decade of ML, it seems that people in the trenches of ML almost always think their research is going slower than it is because only a few researchers have broad enough gears models to plan the whole thing in their heads. If you aren't trying to run the search for the foom-grade model in your head at all times, you won't see it coming.

That said, they'd all be right about what bottlenecks there are. Just not how fast we're gonna solve them.

3Gerald Monroe5mo
I don't want people to skim this post and get the impression that this is a common view in ML.

So you're saying that in ML, there is a view that there are obstacles that a well-funded lab can't overcome in 6 months.

Interesting! I'm not sure what you're saying here. Which of those two things (shard theory_{human values} and shard theory_{AI}) is shard theory (written without a subscript)? If the former, then the OP seems accurate. If the latter, or if shard theory without a subscript includes both of those two things, then I misread your view and will edit the post to note that this comment supersedes (my reading of) your previous statement.

3TurnTrout7mo
Descriptively, the meaning of "shard theory" is contextual (unfortunately). If I say "I'm working on shard theory today", I think my colleagues would correctly infer the AI case. If I say "the shard theory posts", I'm talking about posts which discuss both the human and AI cases. If I say "shard theory predicts that humans do X", I would be referring to shard theory_{human values}.  (FWIW I find my above disambiguation to remain somewhat ambiguous, but it's what I have to say right now.)

Had you seen the researcher explanation for the March 2022 "AI suggested 40,000 new possible chemical weapons in just six hours" paper? I quote (paywall):

Our drug discovery company received an invitation to contribute a presentation on how AI technologies for drug discovery could potentially be misused.

Risk of misuse

The thought had never previously struck us. We were vaguely aware of security concerns around work with pathogens or toxic chemicals, but that did not relate to us; we primarily operate in a virtual setting. Our work is rooted in building machi

... (read more)

Risk of misuse

The thought had never previously struck us.

This seems like one of the most tractable things to address to reduce AI risk.

If 5 years from now anyone developing AI or biotechnology is still not thinking (early and seriously) about ways their work could cause harm that other people have been talking about for years, I think we should consider ourselves to have failed.

Five days ago, AI safety YouTuber Rob Miles posted on Twitter, "Can we all agree to not train AI to superhuman levels at Full Press Diplomacy? Can we please just not?"

8β-redex8mo
I don't know anything about Diplomacy and I just watched this video [https://www.youtube.com/watch?v=u5192bvUS7k], could someone expand a bit on why this game is a particularly alarming capability gain? The chat logs seemed pretty tame, the bot didn't even seem to attempt psychological manipulation or gaslighting or anything similar. What important real world capability does Diplomacy translate into that other games don't? (People for instance don't seem very alarmed nowadays about AI being vastly superhuman at chess or Go.)

Hm, the "begging the question" point is probably just a verbal dispute, but I don't think asking questions can in general beg questions, because questions don't have conclusions. There is no "assuming its conclusion is true" if there is no conclusion. Not a big deal though!

I wouldn't say values are independent (i.e., orthogonal) at best; they are often highly correlated, such as values of "have enjoyable experiences" and "satisfy hunger" both leading to eating tasty meals. I agree they are often contradictory, and this is one valid model of catastrophic addiction or... (read more)

Thanks for the comment. I take "beg the question" to mean "assumes its conclusion," but it seems like you just mean Point 2 assumes something you disagree with, which is fair. I can see reasonable definitions of aligned and misaligned in which brains would fall into either category. For example, insofar as our values are evolutionary ones (e.g., valuing reproduction), human brains exhibit misaligned mesa-optimization, such as craving sugar. If sugar craving itself is the value, then arguably we're well-aligned.

In terms of synthesizing an illusion, wha... (read more)

2shminux8mo
I meant this: which does indeed beg the question in the standard meaning of it. My point is that there is very much no alignment between different values! They are independent at best and contradictory in many cases. There is an illusion of coherent values that is a rationalization. The difference in values sometimes leads to catastrophic Fantasia-like outcomes on the margins (e.g. people with addiction don't want to be on drugs but are), but most of the time it results in mild akrasia (I am writing this instead of doing something that makes me money). This seems like a good analogy: http://max.mmlc.northwestern.edu/mdenner/Demo/texts/swan_pike_crawfish.htm

Very interesting! Zack's two questions were also the top two questions that came to mind for me. I'm not sure if you got around to writing this up in more detail, John, but I'll jot down the way I tentatively view this differently. Of course I've given this vastly less thought than you have, so many grains of salt.

On "If this is so hard, how do humans and other agents arguably do it so easily all the time?", how meaningful is the notion of extra parameters if most agents are able to find uses for any parameters, even just through redundancy or error-correc... (read more)

This is great. One question it raises for me is: Why is there a common assumption in AI safety that values are existent (i.e., they exist) and latent (i.e., they are not directly observable) phenomena? I don't think those are unreasonable partial definitions of "values," but they're far from the only ones, and it's not at all obvious that they pick out the "values" with which we want to align AI. Philosophers Iason Gabriel (2020) and Patrick Butlin (2021) have pointed out some of the many definitions of "values" that we could use for AI safety.

I understand tha... (read more)

I strongly agree. There are two claims here. The weak one is that, if you hold complexity constant, directed acyclic graphs (DAGs; Bayes nets or otherwise) are not necessarily any more interpretable than conventional NNs because NNs are DAGs at that level. I don't think anyone who understands this claim would disagree with it.

But that is not the argument being put forth by Pearl/Marcus/etc. and arguably contested by LeCun/etc.; they claim that in practice (i.e., not holding anything constant), DAG-inspired or symbolic/hybrid AI approaches like Neural Causa... (read more)

5johnswentworth1y
That was a good succinct statement and useful links, thanks.

I don't think the "actual claim" is necessarily true. You need more assumptions than a fixed difficulty of AGI, assumptions that I don't think everyone would agree with. I walk through two examples in my comment: one that implies "Gradual take-off implies shorter timelines" and one that implies "Gradual take-off implies longer timelines."

I agree with this post that the accelerative forces of gradual take-off (e.g., "economic value... more funding... freeing up people to work on AI...") are important and that not everyone considers them when thinking through timelines.

However, I think the specific argument that "Gradual take-off implies shorter timelines" requires a prior belief that not everyone shares, such as a prior that an AGI of difficulty D will occur in the same year in both timelines. I don't think such a prior is implied by "conditioned on a given level of “AGI difficulty”". Here are t... (read more)

I think another important part of Pearl's journey was that during his transition from Bayesian networks to causal inference, he was very frustrated with the correlational turn in early 1900s statistics. Because causality is so philosophically fraught and often intractable, statisticians shifted to regressions and other acausal models. Pearl sees that as throwing out the baby (important causal questions and answers) with the bathwater (messy empirics and the lack of a mathematical language for causality, a gap he later filled by coining the do operator).

Pearl discusses t... (read more)
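(As a quick gloss for readers unfamiliar with the notation, here is a minimal sketch of the distinction the do operator expresses; the single confounder Z and the back-door adjustment form are my illustrative choices, not quoted from Pearl.)

```latex
% Observational quantity: the distribution of Y among units where X happens to equal x.
P(Y \mid X = x) \;=\; \sum_z P(Y \mid X = x, Z = z)\, P(Z = z \mid X = x)

% Interventional quantity: the distribution of Y when X is set to x by intervention,
% severing X's dependence on its usual causes. With a single back-door confounder Z:
P(Y \mid \mathrm{do}(X = x)) \;=\; \sum_z P(Y \mid X = x, Z = z)\, P(Z = z)

% The two differ exactly when Z is distributed differently across observed values of X,
% i.e., when correlation and causation come apart.
```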

This model was produced by fine-tuning DeBERTa XL on a dataset produced by contractors labeling a bunch of LM-generated completions to snippets of fanfiction that were selected by various heuristics to have a high probability of being completed violently.

I think you might have better performance if you train your own DeBERTa XL-like model with classification of different snippets as a secondary objective alongside masked token prediction, rather than just fine-tuning with that classification after the initial model training. (You might use different s... (read more)
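(To make the suggestion concrete, here is a minimal sketch of a shared encoder trained with a classification head alongside masked-token prediction. This is not Redwood's actual pipeline; the model name, first-token pooling, and loss weight are illustrative assumptions.)

```python
# Sketch: joint masked-language-modeling + snippet classification, so the
# classifier's gradients shape the encoder during (continued) pretraining
# rather than only in a fine-tuning stage afterward.
import torch
import torch.nn as nn
from transformers import AutoModel

class MaskedLMPlusClassifier(nn.Module):
    def __init__(self, backbone="microsoft/deberta-v3-base", num_labels=2,
                 cls_loss_weight=0.5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        vocab = self.encoder.config.vocab_size
        self.mlm_head = nn.Linear(hidden, vocab)       # masked-token prediction
        self.cls_head = nn.Linear(hidden, num_labels)  # e.g. violent / non-violent
        self.cls_loss_weight = cls_loss_weight

    def forward(self, input_ids, attention_mask, mlm_labels, cls_labels):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        mlm_logits = self.mlm_head(states)             # (batch, seq_len, vocab)
        cls_logits = self.cls_head(states[:, 0])       # pool the first token
        mlm_loss = nn.functional.cross_entropy(
            mlm_logits.reshape(-1, mlm_logits.size(-1)), mlm_labels.reshape(-1),
            ignore_index=-100)                         # -100 marks unmasked tokens
        cls_loss = nn.functional.cross_entropy(cls_logits, cls_labels)
        return mlm_loss + self.cls_loss_weight * cls_loss
```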

I appreciate you making these notions more precise. Model splintering seems closely related to other popular notions in ML, particularly underspecification ("many predictors f that a pipeline could return with similar predictive risk"), the Rashomon effect ("many different explanations exist for the same phenomenon"), and predictive multiplicity ("the ability of a prediction problem to admit competing models with conflicting predictions"), as well as more general notions of generalizability and out-of-sample or out-of-domain performance. I'd be curious what ex... (read more)
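(For concreteness, a minimal sketch of predictive multiplicity/underspecification on synthetic data; the dataset and model choices are arbitrary illustrations, not tied to any of the cited papers.)

```python
# Two models with near-identical test accuracy can still disagree on a
# non-trivial fraction of individual predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

m1 = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
m2 = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

p1, p2 = m1.predict(X_te), m2.predict(X_te)
print("accuracy of model 1:", (p1 == y_te).mean())
print("accuracy of model 2:", (p2 == y_te).mean())
print("fraction of conflicting predictions:", (p1 != p2).mean())
```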

I think most thinkers on this topic wouldn't think of those weights as arbitrary (I know you and I do, as hardcore moral anti-realists), and they wouldn't find it prohibitively difficult to introduce those weights into the calculations. Not sure if you agree with me there.

I do agree with you that you can't do moral weight calculations without those weights, assuming you are weighing moral theories and not just empirical likelihoods of mental capacities.

I should also note that I do think intertheoretic comparisons become an issue in other cas... (read more)

Some other people at Open Phil have spent more time thinking about two-envelope effects than I have, and fwiw some of their thinking on the issue is in this post (e.g. see section 1.1.1.1).

I don't think the two-envelopes problem is as fatal to moral weight calculations as you suggest (e.g. "this doesn't actually work"). The two-envelopes problem isn't a mathematical impossibility; it's just an interesting example of mathematical sleight-of-hand.

Brian's discussion of two-envelopes is just to point out that moral weight calculations require a common scale across different utility functions (e.g. the decision to fix the moral weight of a human at 1 whether you're using brain size, all-animals-are-equal, u... (read more)
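(A toy numeric sketch of the common-scale point, with made-up ratios rather than anyone's actual moral weight estimates:)

```python
# Two theories held with 50/50 credence:
# theory A: one elephant counts for 0.1 humans; theory B: one elephant counts for 10 humans.
credences = {"A": 0.5, "B": 0.5}
elephant_per_human = {"A": 0.1, "B": 10.0}  # elephant weight when the human is fixed at 1

# Fix the human at 1 in both theories: the elephant's expected weight is
# 0.5*0.1 + 0.5*10 = 5.05 humans, so elephants look ~5x as important.
ev_elephant = sum(credences[t] * elephant_per_human[t] for t in credences)

# Fix the elephant at 1 instead: the human's expected weight is
# 0.5*10 + 0.5*0.1 = 5.05 elephants, so humans look ~5x as important.
ev_human = sum(credences[t] * (1 / elephant_per_human[t]) for t in credences)

print(ev_elephant, ev_human)  # 5.05 and 5.05 -- the comparison flips with the normalization
```

The verdict depends entirely on which species' weight you pin to 1 before averaging, which is why a common scale (or explicit intertheoretic weights) has to be chosen rather than discovered.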

I think the moral-uncertainty version of the problem is fatal unless you make further assumptions about how to resolve it, such as by fixing some arbitrary intertheoretic-comparison weights (which seems to be what you're suggesting) or using the parliamentary model.