ryan_greenblatt

I work at Redwood Research.

I'm also worried about unaligned AIs as competitors to aligned AIs/civilizations in the acausal economy/society. For example, suppose there are vulnerable AIs "out there" that can be manipulated/taken over via acausal means; unaligned AI could then compete with us (and with others with better values from our perspective) in the race to manipulate them.

This seems like a reasonable concern.

My general view is that it seems implausible that much of the value from our perspective comes from extorting other civilizations.

It seems unlikely to me that >5% of the usable resources (weighted by how much we care) are extorted. I would guess that marginal gains from trade are bigger (10% of the value of our universe?). (I think the units work out such that these percentages can be directly compared, as long as our universe isn't particularly well suited to extortion rather than trade or vice versa.) Thus, competition over who gets to extort these resources seems less important than gains from trade.
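To illustrate why those two percentages are directly comparable, here is a rough sketch; the symbols are mine, not from the original comment. Let $V$ be the total value of our universe's resources by our lights, $e$ the fraction of that value which is extortable, and $g$ the marginal gains from trade as a fraction of $V$. Under the guesses above,

$$e \cdot V \;\le\; 0.05\,V \;<\; 0.10\,V \;\approx\; g \cdot V,$$

so even winning the entire competition over extortable resources would be worth less than the marginal gains from trade.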

I'm wildly uncertain about both marginal gains from trade and the fraction of resources that are extorted.

Our universe is small enough that it seems plausible (maybe even likely) that most of the value or disvalue created by a human-descended civilization comes from its acausal influence on the rest of the multiverse.

Naively, acausal influence should be in proportion to how much others care about what a lightcone-controlling civilization does with our resources. So, being a small fraction of the value hits both sides of the equation equally (direct value and acausal value).
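As a minimal sketch of this "both sides" point (again, the symbols are mine): let $R$ be the resources our lightcone-controlling civilization ends up with, $u$ the value per unit of resource (by its own lights) from using them directly, and $w$ the value per unit of resource that other civilizations would, in aggregate, give up for it via acausal trade. Then

$$\frac{\text{acausal value}}{\text{direct value}} \;\approx\; \frac{w R}{u R} \;=\; \frac{w}{u},$$

which is independent of $R$: shrinking our universe scales both terms down together, so smallness by itself doesn't make acausal influence relatively more important. The ratio $w/u$ can still be large, but that is the "different preferences" point below, not the "small" point.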

Of course, civilizations elsewhere might care relatively more about what happens in our universe than whoever controls it does. (E.g., their measure puts much higher relative weight on our universe than the measure of whoever controls our universe.) This can imply that acausal trade is extremely important from a value perspective, but this is unrelated to being "small" and seems better described as large gains from trade due to different preferences over different universes.

(Of course, it does need to be the case that our measure is small relative to the total measure for acausal trade to matter much. But surely this is true?)

Overall, my guess is that it's reasonably likely that acausal trade is indeed where most of the value/disvalue comes from due to very different preferences of different civilizations. But, being small doesn't seem to have much to do with it.

(Surely cryonics doesn't matter given a realistic action space? Usage of cryonics is extremely rare, and I don't think there are plausible (cheap) mechanisms to increase uptake to >1% of the population. I agree that simulation arguments and similar considerations maybe imply that "helping current humans" is either incoherent or unimportant.)

But I do think, intuitively, GPT-5-MAIA might e.g. make 'catching AIs red-handed' using methods like in this comment significantly easier/cheaper/more scalable.

Notably, the mainline approach for catching doesn't involve any internals usage at all, let alone labeling a bunch of internals.

I agree that this model might help in performing various input/output experiments to determine what made a model do a given suspicious action.

  • My current guess is that max good and max bad seem relatively balanced. (Perhaps max bad is 5x more bad/flop than max good in expectation.)
  • There are two different (substantial) sources of value/disvalue: interactions with other civilizations (mostly acausal, maybe also aliens) and what the AI itself terminally values.
  • On interactions with other civilizations, I'm relatively optimistic that commitment races and threats don't destroy as much value as acausal trade generates on some general view like "actually going through with threats is a waste of resources". I also think it's very likely relatively easy to avoid precommitment issues via very basic precommitment approaches that seem (IMO) very natural. (Specifically, you can just commit to "once I understand what the right/reasonable precommitment process would have been, I'll act as though this was always the precommitment process I followed, regardless of my current epistemic state." I don't think it's obvious that this works, but I think it probably works fine in practice.)
  • On terminal value, I guess I don't see a strong story for extreme disvalue as opposed to mostly expecting approximately no value with some chance of some value. Part of my view is that just relatively "incidental" disvalue (like the sort you link to Daniel Kokotajlo discussing) is likely way less bad/flop than maximum good/flop.

By your values, do you think a misaligned AI creates a world that "rounds to zero", or still has substantial positive value?

I think misaligned AI is probably somewhat worse than no Earth-originating space-faring civilization because of the potential for aliens, but also that misaligned AI control is considerably better than no one ever heavily utilizing inter-galactic resources.

Perhaps half of the value of misaligned AI control is from acausal trade and half from the AI itself being valuable.

You might be interested in When is unaligned AI morally valuable? by Paul.

One key consideration here is that the relevant comparison is:

  • Human control (or successors picked by human control)
  • AI(s) that succeed at acquiring most power (presumably seriously misaligned with their creators)

Conditioning on the AI succeeding at acquiring power changes my view of what its plausible values are (for instance, humans seem to have failed at instilling preferences/values which avoid seizing control).

A common story for why aligned AI goes well goes something like: "If we (i.e. humanity) align AI, we can and will use it to figure out what we should use it for, and then we will use it in that way." To what extent is aligned AI going well contingent on something like this happening, and how likely do you think it is to happen? Why?

Hmm, I guess I think that some fraction of resources under human control will (in expectation) be utilized according to the results of a careful reflection process with an altruistic bent.

I think resources which are used via mechanisms other than this take a steep discount in my lights (there is still some value from acausal trade with other entities which did do this reflection-type process, and probably a bit of value from relatively unoptimized goodness, by my lights).

I overall expect that a high fraction (>50%?) of inter-galactic computational resources will be spent on the outputs of this sort of process (conditional on human control) because:

  • It's relatively natural for humans to reflect and grow smarter.
  • Humans who don't reflect in this sort of way probably don't care about spending vast amounts of inter-galactic resources.
  • Among very wealthy humans, a reasonable fraction of their resources is spent on altruism, and the rest is often spent on positional goods that seem unlikely to consume vast quantities of inter-galactic resources.

To what extent is your belief that aligned AI would go well contingent on some sort of assumption like: my idealized values are the same as the idealized values of the people or coalition who will control the aligned AI?

Probably not the same, but if I didn't think it was at all close (i.e., if I didn't care at all about what they would use resources for), I wouldn't care nearly as much about ensuring that coalition is in control of AI.

Do you care about AI welfare? Does your answer depend on whether the AI is aligned? If we built an aligned AI, how likely is it that we will create a world that treats AI welfare as an important consideration? What if we build a misaligned AI?

I care about AI welfare, though I expect that ultimately the fraction of good/bad that results from the welfare of minds being used for labor is tiny, and an even smaller fraction comes from AI welfare prior to humans being totally obsolete (at which point I expect control over how minds work to get much better). So, I mostly care about AI welfare from a deontological perspective.

I think misaligned AI control probably results in worse AI welfare than human control.

Do you think that, to a first approximation, most of the possible value of the future happens in worlds that are optimized for something that resembles your current or idealized values? How bad is it to mostly sacrifice each of these? (What if the future world's values are similar to yours, but is only kinda effectual at pursuing them? What if the world is optimized for something that's only slightly correlated with your values?) How likely are these various options under an aligned AI future vs. an unaligned AI future?

Yeah, most value from my idealized values. But, I think the basin is probably relatively large and small differences aren't that bad. I don't know how to answer most of these other questions because I don't know what the units are.

How likely are these various options under an aligned AI future vs. an unaligned AI future?

My guess is that my idealized values are probably pretty similar, on reflection, to those of many other humans (especially the subset of humans who care about spending vast amounts of computation), such that I think human control vs. my control only loses something like 1/3 of the value (putting aside trade). I think I'm probably less into AI values on reflection, such that it's more like 1/9 of the value (putting aside trade). Obviously, I'm incredibly unconfident in these numbers.

You might be interested in the discussion under this thread.

I express what seem to me to be some of the key considerations here (somewhat indirectly).

It seems to me like the sort of interpretability work you're pointing at is mostly bottlenecked by not having good MVPs of anything that could plausibly be directly scaled up into a useful product, rather than by not having enough scale.

So, insofar as this automation will help people iterate faster, fair enough; but otherwise, I don't really see this as the bottleneck.

Thanks!

Yep, this is the exact experiment I was thinking about.
