Anecdotal 2¢: This is very accurate in my experience. Basically every time I talk to someone outside of tech/alignment about AI risk, I have to go through the whole "we don't know what algorithms the AI is running to do what it does. Yes, really." thing. Every time I skip this accidentally, I realize after a while that this is where a lot of confusion is coming from.

1.  "Trust" does seem to me to often be an epistemically broken thing that rides on human-peculiar social dynamics and often shakes out to gut-understandings of honor and respect and loyalty etc.

2. I think there is a version that doesn't route through that stuff. Trust in the "trust me" sense is a bid for present-but-not-necessarily-permanent suspension of disbelief, where the stakes are social credit. I.e., when I say, "trust me on this," I'm really saying something like, "All of that anxious analysis you might be about to do to determine if X is true? Don't do it. I claim that, using my best-effort model of your values, the thing you should assume/do to fulfill them in this case is X. To the extent that you agree that I know you well and want to help you and tend to do well for myself in similar situations, defer to me on this. I predict you'll thank me for it (because, e.g., confirming it yourself before acting is costly), and if not... well, I'm willing to stake some amount of the social credit I have with you on it." [Edit: By social credit here I meant something like: the credence you give to it being a good idea to engage with me like this.]


  • "I decided to trust her" -> "I decided to defer to her claims on this thing without looking into it much myself (because it would be costly to do otherwise and I believe, for some reason, that she is sufficiently likely to come to true conclusions on this, is probably trying to help me, knows me fairly well, etc.). And if this turns out badly, I'll (hopefully) stop deciding to do this."
  • "Should I trust him?" -> "Does the cost/benefit analysis gestured at above come out net positive in expectation if I defer to him on this?"
  • "They offered me their trust" -> "They believe that deferring to me is their current best move and if I screw this up enough, they will (hopefully) stop thinking that."

So, I feel like I've landed fairly close to where you did but there is a difference in emphasis or maybe specificity. There's more there than asking “what do they believe, and what caused them to believe it?” Like, that probably covers it but more specifically the question I can imagine people asking when wondering whether or not to "trust" someone is instead, "do I believe that deferring these decisions/assumptions to them in this case will turn out better for me than otherwise?" Where the answer can be "yes" because of things like cost-of-information or time constraints etc. If you map "what do they believe" to "what do they believe that I should assume/do" and "what caused them to believe it" to "how much do they want to help me, how well do they know me, how effective are they in this domain, ..." then we're on the same page.
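To make the "net positive in expectation" framing concrete, here is a minimal sketch of that comparison. Everything in it is an assumption made for illustration: the function names, the parameters, and especially the simplification that checking for yourself always yields the right answer at a fixed cost. It is a toy model of the question, not a claim about how anyone actually decides whom to trust.

```python
def expected_value_of_deferring(p_correct: float,
                                value_if_correct: float,
                                value_if_wrong: float) -> float:
    """Expected payoff of acting on their claim without checking it.

    p_correct is my (hypothetical) credence that deferring to them on this
    leads me to the right assumption/action.
    """
    return p_correct * value_if_correct + (1.0 - p_correct) * value_if_wrong


def expected_value_of_checking(value_if_correct: float,
                               cost_of_checking: float) -> float:
    """Expected payoff of verifying it myself first.

    Assumption: checking always gets me the right answer, but I pay for it
    (time, attention, cost of information).
    """
    return value_if_correct - cost_of_checking


def should_defer(p_correct: float,
                 value_if_correct: float,
                 value_if_wrong: float,
                 cost_of_checking: float) -> bool:
    """True when deferring beats checking in expectation."""
    defer = expected_value_of_deferring(p_correct, value_if_correct, value_if_wrong)
    check = expected_value_of_checking(value_if_correct, cost_of_checking)
    return defer > check


# Checking is expensive and they are usually right: defer.
print(should_defer(p_correct=0.9, value_if_correct=10, value_if_wrong=-5, cost_of_checking=3))
# Checking is cheap and they are a coin flip: look into it myself.
print(should_defer(p_correct=0.5, value_if_correct=10, value_if_wrong=-5, cost_of_checking=1))
```

In this toy version, "how much do they want to help me, how well do they know me, how effective are they in this domain" all get squashed into the single number `p_correct`, and "if this turns out badly, I'll stop deciding to do this" would correspond to lowering that number after a bad outcome.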

Some nits we know about but didn't include in the problems section:

  1. P[mushroom->anchovy] = 0. The current argument does not handle the case where subagents believe that there is a probability of 0 on one of the possible states. It wouldn't be possible to complete the preferences exactly as written, then.
  2. Indifference. If anchovy were placed directly above mushroom in the preference graph above (so that John is truly indifferent between them), then that might require some special handling. But also it might just work if the "Value vs Utility" issue is worked out. If the subagents are not myopic / handle instrumental values, then whether anchovy is less, identically, or more desirable than mushroom doesn't really matter so much on its own as opposed to what opportunities are possible afterward from the anchovy state relative to the mushroom state.

Also, I think I buy the following part but I really wish it were more constructive.

Now, we haven't established which distribution of preferences the system will end up sampling from. But so long as it ends up at some non-dominated choice, it must end up with non-strongly-incomplete preferences with probability 1 (otherwise it could modify the contract for a strict improvement in cases where it ends up with non-strongly-incomplete preferences). And, so long as the space of possibilities is compact and arbitrary contracts are allowed, all we have left is a bargaining problem. The only way the system would end up with dominated preference-distribution is if there's some kind of bargaining breakdown.

Might be worth thinking about / comparing how and why things went wrong to produce the 2007/8 global financial crisis (GFC). IIRC, credit raters had misaligned incentives that rhyme with this question/post.

Disclaimer: At the time of writing, this has not been endorsed by Evan.

I can give this a go.

Unpacking Evan's Comment:
My read of Evan's comment (the parent to yours) is that there are a bunch of learned high-level goals ("strategies") with varying levels of influence on the tactical choices made, and that a well-functioning end-to-end credit-assignment mechanism would propagate through action selection ("thoughts directly related to the current action" or "tactics") all the way to strategy creation/selection/weighting. In such a system, strategies which decide tactics which emit actions which receive reward are selected for at the expense of strategies less good at that. Conceivably, strategies aiming directly for reward would produce tactical choices more highly rewarded than strategies not aiming quite so directly.

One way for this not to be how humans work would be if reward did not propagate to the strategies, and they were selected/developed by some other mechanism while reward only honed/selected tactical cognition. (You could imagine that "strategic cognition" is that which chooses bundles of context-dependent tactical policies, and "tactical cognition" is that which implements a given tactic's choice of actions in response to some context.) This feels to me close to what Evan was suggesting you were saying is the case with humans.

One Vaguely Mechanistic Illustration of a Similar Concept:
A similar way for this to be broken in humans, departing just a bit from Evan's comment, is if the credit assignment algorithm could identify tactical choices with strategies, but not equally reliably across all strategies. As a totally made-up, concrete, and stylized illustration: consider one evolutionarily-endowed credit-assignment target, "Feel physically great," and two strategies: wirehead with drugs (WIRE) or be pro-social (SOCIAL). Whenever WIRE has control, it emits some tactic like "alone in my room, take the most fun available drug," which takes actions that result in $P_W$ physical pleasure over a day. Whenever SOCIAL has control, it emits some tactic like "alone in my room, abstain from dissociative drugs and instead text my favorite friend," taking actions which result in $P_S$ physical pleasure over a day.

Suppose also that asocial cognitions like "eat this" have poorly wired feedback channels, so the signal is often lost and triggers credit assignment only some small fraction $f$ of the time. Social cognition is much better wired up and triggers credit assignment every time. Whenever credit assignment is triggered, once a day, the reward emitted is 1:1 with the amount of physical pleasure experienced that day.

Since WIRE only gets credit a fraction of the time that it's due, the average reward (over 30 days, say) credited to WIRE is $f \cdot P_W$. If and only if $f \cdot P_W > P_S$, like if the drug is heroin or your friends are insufficiently fulfilling, WIRE will be reinforced more relative to SOCIAL. Otherwise, even if the drug is somewhat more physically pleasurable than the warm-fuzzies of talking with friends, SOCIAL will be reinforced more relative to WIRE.

I think Evan is saying that he expects advanced reward-based AI systems to have no such impediments by default, even if humans do have something like this in their construction. Such a stylized agent without any signal-dropping would reinforce WIRE over SOCIAL every time that taking the drug was even a tiny bit more physically pleasurable than talking with friends.
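For anyone who wants to poke at the stylized illustration, here is a rough sketch of it in Python. The names (`p_wire`, `p_social`, `f_wire`) and the numbers are all made up for this example, and "reinforced more" is reduced to a plain comparison of average credited reward, which is a much cruder thing than any real credit-assignment mechanism.

```python
import random


def average_credited_reward(pleasure: float, credit_prob: float,
                            days: int, rng: random.Random) -> float:
    """Average daily reward credited to a strategy that is in control every day.

    Each day the strategy produces `pleasure`, but credit assignment only
    fires with probability `credit_prob` (the dropped-signal assumption).
    """
    total = 0.0
    for _ in range(days):
        if rng.random() < credit_prob:
            total += pleasure  # reward is 1:1 with pleasure when credit fires
    return total / days


def reinforced_more(p_wire: float, p_social: float, f_wire: float) -> str:
    """Which strategy gets more credited reward in expectation.

    WIRE is credited a fraction `f_wire` of the time; SOCIAL is always
    credited. An exact tie is reported as SOCIAL here, purely for simplicity.
    """
    return "WIRE" if f_wire * p_wire > p_social else "SOCIAL"


rng = random.Random(0)

# The drug is somewhat more pleasurable, but most of WIRE's credit is lost.
print(reinforced_more(p_wire=10.0, p_social=3.0, f_wire=0.2))
print(average_credited_reward(10.0, 0.2, days=30, rng=rng))

# No signal-dropping (f_wire = 1): any edge in pleasure favors WIRE.
print(reinforced_more(p_wire=3.1, p_social=3.0, f_wire=1.0))
```

The first case comes out as SOCIAL because the expected credited reward for WIRE is 0.2 × 10 = 2, below SOCIAL's 3. The last case is the no-impediment agent: with nothing dropped, even a 0.1 edge in pleasure tips it to WIRE.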

Maybe there is an argument that such reward-aimed goals/strategies would not produce the most rewarding actions in many contexts, or for some other reason would not be selected for / found in advanced agents (as Evan suggests in encouraging someone to argue that such goals/strategies require concepts which are unlikely to develop), but the above might be in the rough vicinity of what Evan was thinking.

REMINDER: At the time of writing, this has not been endorsed by Evan.

This feels like stepping on a rubber duck while tip-toeing around sleeping giants but:

Don't these analogies break if/when the complexity of the thing to generate/verify gets high enough? That is, unless you think the difficulty of verification of arbitrarily complex plans/ideas is asymptotic to some human-or-lower level of verification capability (which I doubt you do) then at some point humans can't even verify the complex plan.

So, the deeper question just seems to be takeoff speeds again: If takeoff is too fast, we don't have enough time to use "weak" AGI to help produce actually verifiable plans which solve alignment. If takeoff is slow enough, we might. (And if takeoff is too fast, we might not notice that we've passed the point of human verifiability until it's too late.)

(I am consciously not bringing up ideas about HCH / other oversight-amplification ideas because I'm new to the scene and don't feel familiar enough with them.)

But I'm not really accusing y'all of saying "try to produce a future that has no basis in human values." I am accusing this post of saying "there's some neutral procedure for figuring out human values, we should use that rather than a non-neutral procedure."

My read was more "do the best we can to get through the acute risk period in a way that lets humanity have the time and power to do the best it can at defining/creating a future full of value." And that's in response and opposed to positions like "figure out / decide what is best for humanity (or a procedure that can generate the answer to that) and use that to shape the long term future."

The point is that as moral attitudes/thoughts change, societies or individuals which exist long enough will likely come to regret permanently structuring the world according to the morality of a past age. The Roman will either live to regret it, or the society that follows the Roman will come to regret it even if the Roman dies happy, or the AI is brainwashing everyone all the time to prevent moral progress. The analogy breaks down a bit with the third option since I'd guess most people today would not accept it as a success and it's today's(ish) morals that might get locked in, not ancient Rome's.