Vanessa Kosoy

AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda.

Morality is Scary

Well, if the OP said something like "most people have 2 eyes but enlightened Buddhists have a third eye" and I responded with "in reality, everyone has 2 eyes", then I think my meaning would be clear, even though it's true that some people have 1 or 0 eyes (afaik there may even be a rare mutation that creates a real third eye). Not adding all possible qualifiers is not the same as "not even pretending that it's interested in making itself falsifiable".

Morality is Scary

What does it have to do with "No True Scotsman"? NTS is when you redefine your categories to justify your claim. I don't think I did that anywhere.

Just because the extreme doesn't exist doesn't mean that all of the scale can be explained by status games.

First, I didn't say the whole scale is explained by status games; I also mentioned empathy.

Second, that by itself sure doesn't mean much. Explaining all the evidence would require an article, or maybe a book (although I hoped the posts I linked explain some of it). My point here is that there is an enormous discrepancy between reported morality and revealed preferences, so believing self-reports is clearly a non-starter. How to build an explanation that doesn't rely on self-reports is a different (long) story.

Vanessa Kosoy's Shortform

The main issues with anti-goodharting that I see are the difficulty of defining the proxy utility and base distribution, the difficulty of making it corrigible, avoiding lock-in to a fixed proxy utility and base distribution, and the question of what to do about optimization that points out of scope.

The proxy utility in debate is perfectly well-defined: it is the ruling of the human judge. For the base distribution I also made some concrete proposals (which certainly might be improvable, but are not obviously bad). As to corrigibility, I think it's an ill-posed concept. I'm not sure how you imagine corrigibility in this case: AQD is a series of discrete "transactions" (debates), and nothing prevents you from modifying the AI between one debate and the next. Even inside a debate, there is no incentive in the outer loop to resist modifications, whereas daemons would be impeded by quantilization. The "out of scope" case is also dodged by quantilization, if I understand what you mean by "out of scope".

...fiddling with base distribution and proxy utility is a more natural framing that's strictly more general than fiddling with the quantilization parameter.

Why is it strictly more general? I don't see it. It seems false: for an extreme value of the quantilization parameter we get optimization which is deterministic, and hence cannot be equivalent to quantilization with a different proxy and base distribution.
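
To illustrate the determinism point with a toy sketch (Python; the uniform base distribution, the action set, and all names here are my own illustrative assumptions, not code from any of the linked proposals): a q-quantilizer samples uniformly from the top q fraction of the base distribution ranked by proxy utility, and as q shrinks it degenerates into deterministic argmax.

```python
import random

def quantilize(actions, proxy_utility, q, rng=random):
    """Sample uniformly from the top q fraction of actions by proxy utility."""
    ranked = sorted(actions, key=proxy_utility, reverse=True)
    k = max(1, int(q * len(ranked)))  # size of the retained top fraction
    return rng.choice(ranked[:k])

actions = list(range(1000))  # toy action space, uniform base distribution
proxy = lambda a: a          # toy proxy utility: prefer larger actions

# For small enough q the top fraction contains a single action, so the
# quantilizer collapses to deterministic argmax -- an output no genuinely
# stochastic quantilizer (over whatever proxy/base) can match exactly:
print(quantilize(actions, proxy, q=0.001))  # always 999
```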

If we are to pick a single number to improve, why privilege the quantilization parameter instead of some other parameter that influences base distribution and proxy utility?

The reason to pick the quantilization parameter is because it's hard to determine, as opposed to the proxy and base distribution[1] for which there are concrete proposals with more-or-less clear motivation.

I don't understand which "main issues" you think this doesn't address. Can you describe a concrete attack vector?

  1. If the base distribution is a bounded simplicity prior then it will have some parameters, and this is truly a weakness of the protocol. Still, I suspect that safety is less sensitive to these parameters and it is more tractable to determine them by connecting our ultimate theories of AI with brain science (i.e. looking for parameters which would mimic the computational bounds of human cognition). ↩︎

Morality is Scary

It's not just that the self-reports didn't fit the story I was building, the self-reports didn't fit the revealed preferences. Whatever people say about their morality, I haven't seen anyone who behaves like a true utilitarian.

IMO, this is the source of all the gnashing of teeth about what percentage of your salary you need to donate: the fundamental contradiction between the demands of utilitarianism and how much people are actually willing to pay for the status gain. Ofc many excuses were developed ("sure, I still need to buy that coffee or those movie tickets, otherwise I won't be productive"), but they don't sound like the most parsimonious explanation.

This is also the source of paradoxes in population ethics and its vicinity: those abstractions are just very remote from actual human minds, so there's no reason they should produce anything sane in edge cases. Their only true utility is as an approximate guideline for making group decisions, in sufficiently mundane scenarios. Once you get to issues with infinities, it becomes clear that utilitarianism is not even mathematically coherent in general.
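
To spell out the infinities point: if total utility is a conditionally convergent sum over an infinite population, the Riemann rearrangement theorem says the total depends on the order in which you count people, so "total utility" isn't even well-defined. A toy numerical illustration (Python; the per-person utilities are of course made up):

```python
import math

# Per-person utilities: +1, -1/2, +1/3, -1/4, ... (conditionally convergent)
u = lambda n: (-1) ** (n + 1) / n

# Natural order: partial sums approach ln(2) ~ 0.693
natural = sum(u(n) for n in range(1, 200001))

# Rearranged order (two positive terms, then one negative term): the SAME
# multiset of utilities, but partial sums approach 1.5 * ln(2) ~ 1.040
def rearranged(num_blocks):
    total, pos, neg = 0.0, 1, 2  # next odd (positive) / even (negative) index
    for _ in range(num_blocks):
        total += u(pos) + u(pos + 2) + u(neg)
        pos += 4
        neg += 2
    return total

print(round(natural, 3), round(rearranged(100000), 3))
```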

You're right that there is a lot of variation in human psychology. But it's also accepted practice to phrase claims as universal when what you actually mean is that the exceptions are negligible for the practical purpose at hand. For example, most people would accept "humans have 2 arms and 2 legs" as a true statement in many contexts, even though some humans have fewer. In this case, my claim is that the exceptions are much rarer than the OP seems to imply (i.e. most people the OP classifies as exceptions are not really exceptions).

I'm all for falsifiability, but it's genuinely hard to achieve in soft topics like this, where no theory makes very sharp predictions and collecting data is hard. Ultimately, which explanation is more reasonable is going to be, at least in part, an intuitive judgement call based on your own experience and reflection. So, yes, I certainly might be wrong, but what I'm describing is my current best guess.

Biology-Inspired AGI Timelines: The Trick That Never Works

I will have to look at these studies in detail in order to understand, but I'm confused about how this can pass some obvious tests. For example, do you claim that alpha-beta pruning can match AlphaGo given some not-crazy advantage in compute? Do you claim that SVMs can do SOTA image classification with a not-crazy advantage in compute (or with any amount of compute given the same training data)? Can Eliza-style chatbots compete with GPT-3 however we scale them up?

Biology-Inspired AGI Timelines: The Trick That Never Works

Historically compute has grown by many orders of magnitude, while human labor applied to AI and supporting software by only a few. And on plausible decompositions of progress (allowing for adjustment of software to current hardware and vice versa), hardware growth accounts for more of the progress over time than human labor input growth.

So if you're going to use an AI production function for tech forecasting based on inputs (which do relatively OK by the standards of tech forecasting), it's best to use all of compute, labor, and time, but it makes sense for compute to have pride of place and take in more modeling effort and attention, since it's the biggest source of change (particularly when including software gains downstream of hardware technology and expenditures).

I don't understand the logical leap from "human labor applied to AI didn't grow much" to "we can ignore human labor". The amount of labor invested in AI research is related to the time derivative of progress along the algorithms axis. Labor held constant is not the same as algorithms held constant. So, we are still talking about the problem of predicting when AI-capability(algorithms(t), compute(t)) reaches human level. What do you know about the function "AI-capability" that allows you to ignore its dependence on the 1st argument?
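
To make the derivative point concrete, here is a toy model (every functional form below is a made-up assumption, purely to illustrate the logic): even with research labor held constant, algorithms(t) keeps growing, because labor feeds its rate of change, not its level.

```python
# Toy model: capability(t) = algorithms(t) * compute(t).
# Labor L(t) is held constant, but algorithms(t) is (roughly) the
# *integral* of labor, so algorithms still improve every year.

def simulate(years, labor=1.0, algo0=1.0, compute0=1.0, compute_growth=1.5):
    algorithms, compute = algo0, compute0
    for _ in range(years):
        algorithms += labor        # constant labor -> linear algorithmic progress
        compute *= compute_growth  # exponential hardware growth
    return algorithms * compute    # illustrative capability function

# Constant labor is not the same as constant algorithms:
print(simulate(10, labor=1.0))  # ~ 634
print(simulate(10, labor=0.0))  # = 1.5**10 ~ 57.7 (algorithms frozen)
```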

Or maybe you're saying that algorithmic improvements have not been very important in practice? Surely such a claim is not compatible with e.g. the transitions from GOFAI to "shallow" ML to deep ML?

"Infohazard" is a predominantly conflict-theoretic concept

More possible causes for infohazards in VNM multi-agent systems:

  • Having different priors (even if the utility functions are the same)
  • Having the same utility function and priors, but not common knowledge about these facts (e.g. Alice and Bob have the same utility function but Bob mistakenly thinks Alice is adversarial)
  • Having limited communication abilities. For example, the communication channel allows Alice to send Bob evidence in one direction but not evidence in the other direction. (Sometimes Bob can account for this, but what if only Alice knows about this property of the channel and she cannot communicate it to Bob?)
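
The different-priors case can be made concrete with a toy Bayesian example (all numbers are invented for illustration): Alice and Bob share the same utility function but hold different priors, and a true observation that Alice could share would flip Bob's action in a direction that is bad by Alice's own posterior.

```python
def posterior(prior, p_e_given_1, p_e_given_0):
    """P(theta=1 | e) by Bayes' rule, given prior P(theta=1)."""
    num = prior * p_e_given_1
    return num / (num + (1 - prior) * p_e_given_0)

# Shared utility: 1 if your action matches the state theta, else 0.
# Alice observed evidence e with P(e|theta=1)=0.2, P(e|theta=0)=0.8.
alice = posterior(0.9, 0.2, 0.8)     # ~0.69: Alice still bets on theta=1
bob_without = 0.6                    # Bob's prior alone -> he acts as if theta=1
bob_with = posterior(0.6, 0.2, 0.8)  # ~0.27: the evidence flips Bob to theta=0

# By Alice's own posterior, sharing the (true!) evidence lowers Bob's
# expected utility from ~0.69 to ~0.31 -- an infohazard with no
# conflict of interest, only a disagreement about priors.
print(alice, bob_without, bob_with)
```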

Morality is Scary

What do you mean by "true intrinsic values"? (I couldn't find any previous usage of this term by you.) How do you propose finding people's true intrinsic values?

I mean the values relative to which a person seems most like a rational agent, arguably formalizable along these lines.
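
As a crude illustration of "the values relative to which a person seems most like a rational agent" (this maximum-likelihood inverse-planning toy is my own simplification, not the linked formalism; the choice data, candidate utilities, and Boltzmann-rationality model are all assumptions):

```python
import math

# Observed choices: each situation lists options with features
# (own_payoff, other_payoff) and the index the person actually picked.
observations = [
    ([(5.0, 0.0), (4.0, 3.0)], 1),  # gave up 1 unit to give someone 3
    ([(2.0, 0.0), (0.0, 1.0)], 0),  # but wouldn't give up 2 units for 1
]

def log_likelihood(w_other, beta=2.0):
    """How rational the choices look under U = own + w_other * other."""
    total = 0.0
    for options, chosen in observations:
        utils = [own + w_other * other for own, other in options]
        z = sum(math.exp(beta * u) for u in utils)
        total += beta * utils[chosen] - math.log(z)
    return total

# Grid-search for the empathy weight that best rationalizes the behavior;
# both choices together pin it between 1/3 and 2.
best_w = max((w / 100 for w in range(0, 301)), key=log_likelihood)
print(best_w)
```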

These weights, if low enough relative to other "values", haven't prevented people from committing atrocities on each other in the name of morality.

This implies solving a version of the alignment problem that includes reasonable value aggregation between different people (or between AIs aligned to different people), but at least some researchers don't seem to consider that part of "alignment".

Yes. I do think multi-user alignment is an important problem (and occasionally spend some time thinking about it), it just seems reasonable to solve single user alignment first. Andrew Critch is an example of a person who seems to be concerned about this.

Given that playing status games and status competition between groups/tribes constitute a huge part of people's lives, I'm not sure how private utopias that are very isolated from each other would work.

I meant that each private utopia can contain any number of people created by the AI, in addition to its "customer". Ofc groups that can agree on a common utopia can band together as well.

Also, I'm not sure if your solution would prevent people from instantiating simulations of perceived enemies / "evil people" in their utopias and punishing them, or just simulating a bunch of low status people to lord over.

They are prevented from simulating other pre-existing people without their consent, but can simulate a bunch of low status people to lord over. Yes, this can be bad. Yes, I still prefer this (assuming my own private utopia) over paperclips. And, like I said, this is just a relatively easy to imagine lower bound, not necessarily the true optimum.

Perhaps I should have clarified that by "parts of me" being more scared, I meant the selfish and NU-leaning parts.

The selfish part, at least, doesn't have any reason to be scared as long as you are a "customer".

Morality is Scary

You are positing the existence of two types of people: type I people whose morality is based on "reason" and type II people whose morality is based on the "status game". In reality, everyone's morality is based on something like the status game (see also: 1 2 3). It's just that EAs and moral philosophers are playing the game in a tribe which awards status differently.

The true intrinsic values of most people do place a weight on the happiness of other people (that's roughly what we call "empathy"), but this weight is very unequally distributed.

There are definitely thorny questions regarding the best way to aggregate the values of different people in TAI. But, I think that given a reasonable solution, a lower bound on the future is imagining that the AI will build a private utopia for every person, as isolated from the other "utopias" as that person wants it to be. Probably some people's "utopias" will not be great, viewed in utilitarian terms. But, I still prefer that over paperclips (by far). And, I suspect that most people do (even if they protest it in order to play the game).

Christiano, Cotra, and Yudkowsky on AI progress

Makes some sense, but Yudkowsky's prediction that TAI will arrive before AI has large economic impact does forbid a lot of plateau scenarios. Given a plateau that's sufficiently high and sufficiently long, AI will land in the market, I think. Even if regulatory hurdles are the bottleneck for a lot of things atm, eventually in some country AI will become important and the others will have to follow or fall behind.
