Jacy Reese Anthis

PhD student in statistics and sociology at the University of Chicago. Co-founder of the Sentience Institute. Currently looking for AI alignment dissertation ideas, perhaps in interpretability, causality, or value learning.





The Big Picture Of Alignment (Talk Part 1)

Very interesting! Zack's two questions were also the top two questions that came to mind for me. I'm not sure if you got around to writing this up in more detail, John, but I'll jot down the way I tentatively view this differently. Of course I've given this vastly less thought than you have, so many grains of salt.

On "If this is so hard, how do humans and other agents arguably do it so easily all the time?", how meaningful is the notion of extra parameters if most agents can find uses for any parameters, even just through redundancy and error-correction (e.g., guarding against a single base-pair change) or through exaptation of otherwise useless mutations? In alignment, why assume that all aligned AIs "look like they work"? Why assume that these are binaries? Etc. In general, there seem to be many realistic additions to your model that mitigate this exponential increase in possibilities and more closely fit successful real-world agents. I don't see as many additions that would make the optimization even more challenging.

On generators, why should we carve such a clear and small circle around genes as the generators? Rob mentioned the common thought experiment of alien worlds in which genes produce babies who grow up in isolation from human civilization, and I would push on that further. Even on Earth, we have Stone Age values versus modern values, and if you draw the line more widely (either by calling more things generators or by including non-generators), this notion of "generators of human values" starts to seem very narrow and much less meaningful for alignment or for a general understanding of agency, which I think most people would say requires learning more values than what is in our genes. I don't think "feed an AI data" gets around this: AIs already have easy access to genes and to humans of all ages. There is an advantage to telling the AI "these are the genes that matter," but could it really just take those genes, or their mapping onto some value space, and raise virtual value-children in a useful way? How would it know it isn't leaving out the important differentiators, genetic or otherwise, between Stone Age and modern values? How would it adjudicate between all the variation in values from all of these sources? How could we map them onto trade-offs suitable for coherence conditions? Etc.

The Big Picture Of Alignment (Talk Part 2)

This is great. One question it raises for me is: Why is there a common assumption in AI safety that values are a sort of existent (i.e., they exist) and latent (i.e., not directly observable) phenomenon? I don't think those are unreasonable partial definitions of "values," but they're far from the only ones, and it's not at all obvious that they pick out the values with which we want to align AI. The philosophers Iason Gabriel (2020) and Patrick Butlin (2021) have pointed out some of the many definitions of "values" that we could use for AI safety.

I understand that just picking an operationalization and sticking to it may be necessary for some technical research, but I worry that the gloss reifies these particular criteria and may even reify semantic issues (e.g., which latent phenomena do we want to describe as "values"?; a sort of verbal dispute à la Chalmers) incorrectly as substantive issues (e.g., how do we align an AI with the true values?).

Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

I strongly agree. There are two claims here. The weak one is that, if you hold complexity constant, directed acyclic graphs (DAGs; Bayes nets or otherwise) are not necessarily any more interpretable than conventional NNs because NNs are DAGs at that level. I don't think anyone who understands this claim would disagree with it.
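The weak claim can be made concrete with a minimal sketch (an invented 2-3-1 feedforward network, not any particular real model): treat each unit as a graph node and verify the computation graph is a DAG via topological sort.

```python
# Sketch of "a feedforward NN is a DAG": one node per unit, edges from
# inputs to hidden units to output. The 2-3-1 architecture is invented
# purely for illustration.
from graphlib import TopologicalSorter

# Mapping of node -> set of predecessors (i.e., the units feeding into it).
edges = {
    "h0": {"x0", "x1"}, "h1": {"x0", "x1"}, "h2": {"x0", "x1"},
    "y": {"h0", "h1", "h2"},
}

# TopologicalSorter raises CycleError if the graph has a cycle; a valid
# static_order certifies the network's computation graph is acyclic.
order = list(TopologicalSorter(edges).static_order())
assert order.index("x0") < order.index("h0") < order.index("y")
```

At this level of description, interpretability tools that apply to DAGs apply to the network too; the interesting disagreements are about what happens when complexity is not held constant.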

But that is not the argument put forth by Pearl/Marcus/etc. and arguably contested by LeCun/etc.; they claim that in practice (i.e., not holding anything constant), DAG-inspired or symbolic/hybrid AI approaches like Neural Causal Models offer interpretability gains with little, if any, drop in performance, and arguably better performance on the tasks that matter most. For example, they point to the 2021 NetHack Challenge, a difficult roguelike video game in which non-NN agents still outperformed NN agents.

Of course there's not really a general answer here, only specific answers to specific questions like, "Will a NN or non-NN model win the 2024 NetHack challenge?"

For every choice of AGI difficulty, conditioning on gradual take-off implies shorter timelines.

I don't think the "actual claim" is necessarily true. You need more assumptions than a fixed difficulty of AGI, assumptions that I don't think everyone would agree with. I walk through two examples in my comment: one that implies "Gradual take-off implies shorter timelines" and one that implies "Gradual take-off implies longer timelines."

For every choice of AGI difficulty, conditioning on gradual take-off implies shorter timelines.

I agree with this post that the accelerative forces of gradual take-off (e.g., "economic value... more funding... freeing up people to work on AI...") are important and not everyone considers them when thinking through timelines.

However, I think the specific argument that "Gradual take-off implies shorter timelines" requires a prior belief that not everyone shares, such as a prior that an AGI of difficulty D will occur in the same year in both timelines. I don't think such a prior is implied by "conditioned on a given level of 'AGI difficulty'". Here are two example priors, one that leads to "Gradual take-off implies shorter timelines" and one that leads to the opposite. The first sentence of each is most important.

Gradual take-off implies shorter timelines
Step 1: (Prior) Set AGI of difficulty D to occur at the same year Y in the gradual and sudden take-off timelines.
Step 2: Notice that the gradual take-off timeline has AIs of difficulties like 0.5D sooner, which would make AGI occur sooner than Y because of the accelerative forces of "economic value... more funding... freeing up people to work on AI..." etc. Therefore, move AGI occurrence in gradual take-off from Y to some year before Y, such as 0.5Y.

=> AGI occurs at 0.5Y in the gradual timeline and Y in the sudden timeline.

Gradual take-off implies longer timelines
Step 1: (Prior) Set AI of difficulty 0.5D to occur at the same year Y in the gradual and sudden take-off timelines. To fill in AGI of difficulty D in each timeline, suppose that both are superlinear but sudden AGI arrives at exactly Y and gradual AGI arrives at 1.5Y.
Step 2: Notice that the gradual take-off timeline has AIs of difficulties like 0.25D sooner, which would make the 0.5D AI occur sooner than Y because of the accelerative forces of "economic value... more funding... freeing up people to work on AI..." etc. Therefore, move the 0.5D AI occurrence in gradual take-off from Y to some year before Y, such as 0.5Y, and move AGI occurrence in gradual take-off correspondingly from 1.5Y to 1.25Y.

=> AGI occurs at 1.25Y in the gradual timeline and Y in the sudden timeline.
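The arithmetic of the two priors can be sketched numerically. This is a toy model: Y is normalized to 1.0, and the 0.5 acceleration factor and 1.5Y/1.25Y figures are the illustrative numbers from the scenarios above, not calibrated estimates.

```python
# Toy arithmetic for the two priors. Returns (gradual AGI year, sudden AGI year).

def prior_same_agi_year(Y=1.0):
    """Prior 1: AGI (difficulty D) occurs at year Y in both timelines;
    accelerative forces then pull the gradual timeline's AGI to 0.5Y."""
    sudden_agi = Y
    gradual_agi = 0.5 * Y  # moved earlier by economic value, funding, etc.
    return gradual_agi, sudden_agi

def prior_same_half_agi_year(Y=1.0):
    """Prior 2: the 0.5D AI occurs at year Y in both timelines; gradual AGI
    starts at 1.5Y and moves to 1.25Y as the 0.5D milestone accelerates."""
    sudden_agi = Y                      # sudden: 0.5D and D both arrive at Y
    gradual_agi = 1.5 * Y - 0.25 * Y    # 1.5Y shifted earlier to 1.25Y
    return gradual_agi, sudden_agi
```

Under the first prior, gradual take-off yields the shorter timeline (0.5Y < Y); under the second, the longer one (1.25Y > Y). The conclusion turns on the prior, not on the fixed AGI difficulty alone.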

By the way, this is separate from Stefan_Schubert's critique that very short timelines are possible with sudden take-off but not with gradual take-off. I personally think that counts as a counterexample if we treat the impossibility of a very short gradual timeline as "long," but not if we instead treat the shortness comparison as indeterminate when no very short gradual timeline exists.

12 interesting things I learned studying the discovery of nature's laws

I think another important part of Pearl's journey was that, during his transition from Bayesian networks to causal inference, he was very frustrated with the correlational turn that statistics took in the early 1900s. Because causality is so philosophically fraught and often intractable, statisticians shifted to regressions and other acausal models. Pearl sees this as throwing out the baby (important causal questions and answers) with the bathwater (messy empirics and the lack of a mathematical language for causality, which is why he coined the do-operator).

Pearl discusses this at length in The Book of Why, particularly the Chapter 2 sections on "Galton and the Abandoned Quest" and "Pearson: The Wrath of the Zealot." My guess is that Pearl's frustration with statisticians' focus on correlation was immediate upon getting to know the field, but I don't think he's publicly said how his frustration began.
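As a concrete illustration of what the do-operator buys over correlation, here is a toy structural causal model with a confounder. The graph (Z causes both X and Y, and X causes Y) and all the probabilities are invented for illustration, not taken from Pearl's book.

```python
# Toy SCM: Z -> X, Z -> Y, X -> Y, with Z confounding the X-Y relationship.
# All numbers are invented, chosen only to make the gap visible.

P_Z1 = 0.5                            # P(Z = 1)

def p_x1_given_z(z):                  # P(X = 1 | Z = z)
    return 0.9 if z == 1 else 0.1

def p_y1_given_xz(x, z):              # P(Y = 1 | X = x, Z = z)
    return 0.2 + 0.3 * x + 0.4 * z

def p_z(z):
    return P_Z1 if z == 1 else 1 - P_Z1

def p_y1_given_x1_observational():
    """P(Y=1 | X=1): conditioning on X=1 also Bayes-updates the confounder Z,
    so this mixes the causal effect of X with the effect of Z."""
    p_x1 = sum(p_x1_given_z(z) * p_z(z) for z in (0, 1))
    return sum(p_y1_given_xz(1, z) * p_x1_given_z(z) * p_z(z) / p_x1
               for z in (0, 1))

def p_y1_do_x1():
    """P(Y=1 | do(X=1)): intervening on X severs the Z -> X edge, so Z keeps
    its marginal distribution. This isolates the causal effect of X."""
    return sum(p_y1_given_xz(1, z) * p_z(z) for z in (0, 1))
```

Here the observational quantity (0.86) exceeds the interventional one (0.70) because the confounder inflates the correlation; a purely correlational analysis has no vocabulary for the difference, which is the gap Pearl's do-notation was coined to fill.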

Redwood Research’s current project

This model was produced by fine-tuning DeBERTa XL on a dataset produced by contractors labeling a bunch of LM-generated completions to snippets of fanfiction that were selected by various heuristics to have a high probability of being completed violently.

I think you might get better performance if you train your own DeBERTa XL-like model with snippet classification as a secondary objective alongside masked-token prediction, rather than fine-tuning on that classification after the initial model training. (You might use different snippets in each step to avoid double-dipping the information in that sample, analogous to splitting text data for causal inference, e.g., Egami et al. 2018.) The off-the-shelf Hugging Face DeBERTa XL might not contain the features most useful for the follow-up task of nonviolence fine-tuning. However, that might be a less interesting exercise if you want to build tools for working with more naturalistic models.
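A minimal sketch of the joint-objective idea. The loss functions below are placeholder arithmetic standing in for DeBERTa's masked-LM head and a classification head over a shared encoder; every name, number, and the 0.2 auxiliary weight are hypothetical.

```python
import random

# Placeholder stand-ins for the two heads' losses; in a real run these would
# be the masked-LM loss and a violence-classification loss sharing an encoder.
def mlm_loss(params, batch):
    return sum(params) / max(len(batch), 1)

def clf_loss(params, batch):
    return abs(params[0]) / max(len(batch), 1)

def joint_step_loss(params, mlm_batch, clf_batch, aux_weight=0.2):
    """Optimize both objectives in the same step, so classification-relevant
    features are shaped during pretraining rather than bolted on afterward."""
    return mlm_loss(params, mlm_batch) + aux_weight * clf_loss(params, clf_batch)

# Disjoint snippet pools avoid double-dipping the classification signal,
# analogous to the data split in Egami et al. (2018).
snippets = list(range(1000))
random.seed(0)
random.shuffle(snippets)
joint_pool, holdout_pool = snippets[:800], snippets[800:]
```

The design point is only the shape of the objective and the split, not the specific weighting, which would need tuning in practice.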

Value extrapolation, concept extrapolation, model splintering

I appreciate making these notions more precise. Model splintering seems closely related to other popular notions in ML, particularly underspecification ("many predictors f that a pipeline could return with similar predictive risk"), the Rashomon effect ("many different explanations exist for the same phenomenon"), and predictive multiplicity ("the ability of a prediction problem to admit competing models with conflicting predictions"), as well as more general notions of generalizability and out-of-sample or out-of-domain performance. I'd be curious what exactly makes model splintering different. Some example questions: Is the difference just the alignment context? Is it that "splintering" refers specifically to features and concepts within the model failing to generalize, rather than the model as a whole failing to generalize? If so, what does it even mean for the model as a whole to fail to generalize but not features failing to generalize? Is it that the aggregation of features is not a feature? And how are features and concepts different from each other, if they are?
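For comparison, here is a toy illustration of predictive multiplicity as defined above: two decision rules with identical accuracy on the observed data that conflict on new inputs. The data and rules are invented.

```python
# Two features that are correlated in training data, so either one alone
# fits perfectly -- but the rules disagree once the features decouple.
train = [((0.0, 0.0), 0), ((0.2, 0.1), 0), ((0.9, 1.0), 1), ((1.0, 0.8), 1)]

def model_a(x):
    return int(x[0] > 0.5)   # relies only on the first feature

def model_b(x):
    return int(x[1] > 0.5)   # relies only on the second feature

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# Both rules fit the observed data perfectly, yet they conflict on an
# off-distribution input where the features come apart, e.g. (0.9, 0.1).
```

If "splintering" means individual features or concepts failing to generalize rather than the whole predictor, it would be useful to know how it differs from this feature-level reading of multiplicity.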

Preliminary thoughts on moral weight

I think most thinkers on this topic wouldn't think of those weights as arbitrary (I know you and I do, as hardcore moral anti-realists), and they wouldn't find it prohibitively difficult to introduce those weights into the calculations. Not sure if you agree with me there.

I do agree with you that you can't do moral weight calculations without those weights, assuming you are weighing moral theories and not just empirical likelihoods of mental capacities.

I should also note that I do think intertheoretic comparisons become an issue in other cases of moral uncertainty, such as with infinite values (e.g. a moral framework that absolutely prohibits lying). But those cases seem much harder than moral weights between sentient beings under utilitarianism.

Preliminary thoughts on moral weight

I don't think the two-envelopes problem is as fatal to moral weight calculations as you suggest (e.g., "this doesn't actually work"). The two-envelopes problem isn't a mathematical impossibility; it's just an interesting example of mathematical sleight-of-hand.

Brian's discussion of two envelopes is just to point out that moral weight calculations require a common scale across different utility functions (e.g., the decision to fix the moral weight of a human at 1 whether you're using brain-size weighting, all-animals-are-equal unity weighting, or any other weighting approach). It's not to say that there's a philosophical or mathematical impossibility in doing these calculations, as far as I understand.
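A numeric sketch of the sleight-of-hand, with toy numbers invented for illustration: two theories of an elephant's moral weight, each given 50% credence, where the "expected weight" flips depending on which species' scale you fix at 1.

```python
# Theory A: an elephant is worth 0.01 humans; Theory B: 1.0 humans.
# Both numbers and the 50/50 credences are invented for illustration.
credence = 0.5
elephant_per_human = [0.01, 1.0]    # the theories with the human fixed at 1
human_per_elephant = [100.0, 1.0]   # the same theories, elephant fixed at 1

# Expected elephant weight, normalizing on the human scale:
ev_human_scale = sum(credence * w for w in elephant_per_human)        # 0.505

# Expected elephant weight, normalizing on the elephant scale instead:
ev_elephant_scale = 1 / sum(credence * w for w in human_per_elephant)  # ~0.0198

# The two answers conflict only because the scale switched mid-calculation;
# committing to one common scale makes the expectation well-defined.
```

The discrepancy is the two-envelopes effect: an artifact of changing normalization, not a proof that cross-species weight calculations are impossible.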

FYI I discussed this a little with Brian before commenting, and he subsequently edited his post a little, though I'm not yet sure if we're in agreement on the topic.
