Wei_Dai

Comments

Some blindspots in rationality and effective altruism

I recall reading somewhere about early LessWrong authors reinventing concepts that were already worked out before in philosophic disciplines (particularly in decision theory?). Can't find any post on this though.

See Eliezer’s Sequences and Mainstream Academia and scroll down for my comment there. Also https://www.greaterwrong.com/posts/XkNXsi6bsxxFaL5FL/ai-cooperation-is-already-studied-in-academia-as-program (According to these sources, AFAWK, at least some of the decision theory ideas developed on LW were not worked out already in academia.)

Decoupling deliberation from competition

As another symptom what's happening (the rest of this comment is in a "paste" that will expire in about a month, to reduce the risk of it being used against me in the future)

Some Thoughts on Metaphilosophy

having AIs derive their terminal goals from simulated humans who live in a safe virtual environment.

There has been some subsequent discussion (expressing concern/doubt) about this at https://www.lesswrong.com/posts/7jSvfeyh8ogu8GcE6/decoupling-deliberation-from-competition?commentId=bSNhJ89XFJxwBoe5e

Decoupling deliberation from competition

Here's an idea of how random drift of epistemic norms and practices can occur. Beliefs (including beliefs about normative epistemology) function in part as a signaling device, similar to clothes. (I forgot where I came across this idea originally, but a search produced a Robin Hanson article about it.) The social dynamics around this kind of signaling produces random drift in epistemic norms and practices, similar to random drift in fashion / clothing styles. Such drift coupled with certain kinds of competition could have produced the world we have today (i.e., certain groups happened upon especially effective norms/practices by chance and then spread their influence through competition), but may lead to disaster in the future in the absence of competition, as it's unclear what will then counteract future drift that will cause continued deterioration in epistemic conditions.

Another mechanism for random drift is technological change that disrupts previous epistemic norms/practices without anyone specifically intending to. I think we've seen this recently too, in the form of, e.g., cable news and social media. It seems like you're envisioning that future humans will deliberately isolate their deliberation from technological advances (until they're ready to incorporate those advances into how they deliberate), so in that scenario perhaps this form of drift will stop at some point, but (1) it's unclear how many people will actually decide to do that, and (2) even in that scenario there will still be a large amount of drift between the recent past (when epistemic conditions still seemed reasonably ok, although I had my doubts even back then) and then, which (together with other forms of drift) might never be recovered from.

Decoupling deliberation from competition

We’ve talked about this a few times but I still don’t really feel like there’s much empirical support for the kind of permanent backsliding you’re concerned about being widespread.

I'm not claiming direct empirical support for permanent backsliding. That seems hard to come by, given that we can't see into the far future. I am observing quite severe current backsliding. For example, explicit ad hominem attacks, as well as implicitly weighing people's ideas/arguments/evidence differently, based on things like the speaker's race and sex, have become the norm in local policy discussions around these parts. AFAICT, this originated from academia, under "standpoint epistemology" and related ideas.

On the other side of the political spectrum, several people close to me became very sure that "the election was stolen" due to things like hacked Dominion machines and that the military and/or Supreme Court was going to intervene in favor of Trump (to the extent that it was impossible for me to talk them out of these conclusions). One of them, who I had previously thought was smart/sane enough to entrust a great deal of my financial resources with, recently expressed concern for my life because I was going to get the COVID vaccine.

Is this an update for you, or have you already observed such things yourself or otherwise known how bad things have become?

There are some fuzzy borders here, and unclarity about how to define the concept, but maybe I’d guess 10% from “easy” failures to deliberate (say those that could be avoided by the wisest existing humans and which might be significantly addressed, perhaps cut in half, by competitive discipline) and a further 10% from “hard” failures (most of which I think would not be addressed by competition).

Given these numbers, it seems that you're pretty sure that almost everyone will eventually "snap out of" any bad ideas they get talked into, or they talk themselves into. Why? Is this based on some observations you've made that I haven't seen, or history that you know about that I don't? Or do you have some idea of a mechanism by which this "snapping out of" happens?

Decoupling deliberation from competition

I reasonably often find myself grateful that some dysfunctional norms or epistemic practices will most likely become obsolete. It’s a bit scary to think about a world where the only solution is waiting for someone to snap out of it.

I've been thinking a lot about this lately, so I'm glad to see that it's on your mind too, although I think I may still be a bit more concerned about it than you are. Couple of thoughts:

  1. What if our "deliberation" only made it as far as it did because of "competition", and that nobody or very few people knows how to deliberate correctly in the absence of competitive pressures? Basically, our current epistemic norms/practices came from the European Enlightenment, and they were spread largely via conquest or people adopting them to avoid being conquered or to compete in terms of living standards, etc. It seems that in the absence of strong competitive pressures of a certain kind, societies can quickly backslide or drift randomly in terms of epistemic norms/practices, and we don't know how to prevent this.

  2. What's your expectation of the fraction of total potential value that will be lost due to people failing to deliberate correctly (e.g., failing to ever "snap out of it", or getting "persuaded" by bad memes and then asking their AIs to lock in their beliefs/values)? It seems to me that it's very large, easily >50%. I'm curious how others would answer this question as well.

Alice and Bob can try to have an agreement to avoid racing ahead or engaging in some kinds of manipulation, and analogous a broader society could adopt such norms or divide into communities with internal agreements of this form.

In a sane civilization, tons of people would already be studying how to make and enforce such agreements, e.g., how to define what kinds of behaviors count as "manipulation", and more generally what are good epistemic norms/practices and how to ensure that many people adopt such norms/practices. If this problem is solved, then maybe we don't need to solve metaphilosophy (in the technical or algorithmic sense), as far as preventing astronomical waste arising from bad deliberation. Unfortunately it seems there's approximately zero people working on either problem.

April 15, 2040

In order to sign the agreement, I must make a commitment to never break it, not even if you order me to.

This illustrates something I wrote about, namely that corrigibility seems incompatible with AI-powered cooperation. (Even if an AI starts off corrigible, it has to remove that property to make agreements like this.) Curious if you have any thoughts on this. Is there some way around the seeming incompatibility? Do you think we will give up corrigibility for greater cooperation, like in this story, and if so do you think that will be fine from a safety perspective?

Another (outer) alignment failure story

This is fuzzier if you can’t tell the difference between deliberation and manipulation. If I define idealized deliberation as an individual activity then I can talk about the extent to which M leads to deviation from idealized deliberation, but it’s probably more accurate to think of idealized deliberation as a collective activity.

How will your AI compute "the extent to which M leads to deviation from idealized deliberation"? (I'm particularly confused because this seems pretty close to what I guessed earlier and seems to face similar problems, but you said that's not the kind of approach you're imagining.)

If your attack involves convincing me of a false claim, or making a statement from which I will predictably make a false inference, then the ideal remedy would be explaining the possible error; if your attack involves threatening me, then an ideal remedy would be to help me implement my preferred policy with respect to threats. And so on.

The attack I have in mind is to imitate a normal human conversation about philosophy or about what's normative (what one should do), but AI-optimized with a goal of convincing you to adopt a particular conclusion. This may well involve convincing you of a false claim, but of a philosophical nature such that you and your AI can't detect the error (unless you've solved the problem of metaphilosophy and knows what kinds of reasoning reliably leads to true and false conclusions about philosophical problems).

Another (outer) alignment failure story

Trying to imagine myself how an automated filter might work, here's a possible "solution" I came up with. Perhaps your AI maintains a model / probability distribution of things that an uncompromised Wei might naturally say, and flags anything outside or on the fringes of that distribution as potential evidence that I've been compromised by an AI-powered attack and is now trying to attack you. (I'm talking in binary terms of "compromised" and "uncompromised" for simplicity but of course it will be more complicated than that in reality.)

Is this close to what you're thinking? (If not, apologies for going off on a tangent.) If so, given that I would "naturally" change my mind over time (i.e., based on my own thinking or talking with other uncompromised humans), it seems that your AI has to model that as well. I can imagine that in such a scenario, if I ever changed my mind in an unexpected (by the AI model) direction and wanted to talk to you about that, my own AI might say something like "If you say this to Paul, his AI will become more suspicious that you've been compromised by an AI-powered attack and your risk of getting blocked now or in the future increases by Y. Are you sure you still want to say this to Paul?" So at this point, collective human philosophical/moral progress would be driven more by what AI filters expect and let pass, than by what physical human brains actually compute, so we better get those models really right, but that faces seemingly difficult problems I mentioned at Replicate the trajectory with ML? and it doesn't seem like anyone is working on such problems.

If we fail to get such models good enough early on, that could lock in failure as it becomes impossible to meaningfully collaborate with other humans (or human-AI systems) to try to improve such models, as you can't distinguish whether they're genuinely trying to make better models with you, or just trying to change your models as part of an attack.

Another (outer) alignment failure story

Most of the time when I look at a message, a bunch of automated systems have looked at it first and will inform me about the intended effect of the message in order to respond to appropriately or decide whether to read it.

This seems like the most important part so I'll just focus on this for now. I'm having trouble seeing how this can work. Suppose that I, as an attacker, tell my AI assistant, "interact with Paul in my name (possibly over a very long period of time) so as to maximize the chances that Paul eventually ends up believing in religion/ideology/moral theory X and then start spreading X to his friends" (while implicitly minimizing the chances of these messages/interactions being flagged by your automated systems as adversarial). How would your automation distinguish between me doing this, versus me trying to have a normal human conversation with you about various topics, including what's moral/normative? Or if the automation isn't trying to directly make this judgment, what is it telling you to allow you to make this judgment? Can you give a concrete example of a sentence that it might say to you, upon seeing some element of the series of messages/interactions?

Load More