All of Wei_Dai's Comments + Replies

Dath Ilani Rule of Law

Alice threatens Bob when Alice says that, if Bob performs some action X, then Alice will respond with action Y, where Y (a) harms Bob and (b) harms Alice. (If one wants to be “mathematical”, then one could say that each combination of actions is associated with a set of payoffs, and that “action Y harms Bob” == “[Bob’s payoff with Y] < [Bob’s payoff with not-Y]”.)
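The definition above can be sketched as a simple payoff check. This is a minimal illustration, not from the original comment: the payoff numbers and function name are made up for the example.

```python
# Payoffs (bob_payoff, alice_payoff) for each combination of actions.
# The numbers are illustrative assumptions, not from the original post.
payoffs = {
    ("X", "Y"):         (-2, -1),  # Bob does X, Alice retaliates with Y
    ("X", "not-Y"):     ( 3,  0),  # Bob does X, Alice does nothing
    ("not-X", "Y"):     ( 0, -1),
    ("not-X", "not-Y"): ( 0,  0),
}

def is_threat(payoffs, bob_action="X"):
    """Y counts as a threat iff, given Bob's action, Alice playing Y
    leaves BOTH Bob and Alice worse off than Alice playing not-Y."""
    bob_y, alice_y = payoffs[(bob_action, "Y")]
    bob_not_y, alice_not_y = payoffs[(bob_action, "not-Y")]
    return bob_y < bob_not_y and alice_y < alice_not_y

print(is_threat(payoffs))  # → True: Y harms Bob and harms Alice
```

With these numbers Y qualifies as a threat on both clauses: carrying it out costs Alice as well as Bob, which is exactly what distinguishes a threat from an ordinary best response.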

Note that the dath ilan "negotiation algorithm" arguably fits this definition of "threat":

If Alis and Bohob both do an equal amount of labor to gain a previously unclaimed resource worth 1

... (read more)
This seems likely. Much of Eliezer's fiction includes a lot of typical mind fallacy and a seemingly-willful ignorance of power dynamics and "unfair" results in equilibria being the obvious outcome for unaligned agents with different starting conditions. This kind of game-theory analysis is just silly unless it includes the information about who has the stronger/more-visible precommitments, and what extra-game impacts the actions will have. It's actually quite surprising how deeply CDT is assumed (agents can freely choose their actions at the point in the narrative where it happens) in such analyses.
By this definition any statement that sets any conditions whatsoever in the Ultimatum Game is a threat. Or indeed any statement setting conditions under which you might withdraw from otherwise mutually beneficial trade.
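For context, the dath ilan "negotiation algorithm" being referenced works roughly as follows; this is a paraphrase from memory of the fictional source, not a quotation, and the function names and the epsilon value are my own.

```python
import random

def acceptance_probability(their_demand, fair_share=0.5, epsilon=0.01):
    """Probability with which the responder accepts a proposer who demands
    `their_demand` of a unit-sized resource. Fair or generous demands are
    always accepted; greedy demands are accepted just rarely enough that
    the proposer's expected take falls slightly below the fair share."""
    if their_demand <= fair_share:
        return 1.0
    return (fair_share - epsilon) / their_demand

def respond_to_split(their_demand, rng=random.random):
    """Randomized accept/reject decision for a single game."""
    return rng() < acceptance_probability(their_demand)

# Demanding 0.9 yields an expected take of 0.9 * (0.49 / 0.9) = 0.49 < 0.5,
# so greed is unprofitable in expectation.
```

The relevance to the thread: the responder's conditional rejection harms both parties whenever it fires (the resource is destroyed with some probability), which is why the algorithm arguably meets the definition of a "threat" given above.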
Ineffective Altruism

Evil happens when you are separated from the pain you inflict upon other people.

If only someone would invent a time machine so we can see what effects our actions have on the far future...

We communicated via SMS instead of AirBnb’s website because AirBnb’s website has an algorithm that scans our messages for keywords and punishes hosts it thinks did a poor job—regardless of the star rating a customer like me provides.

I was skeptical of this after reading (from one of your comment replies) that you only heard about this from the host, but some search... (read more)

welp, that's horrifying and also honestly quite expected...
MIRI announces new "Death With Dignity" strategy

Similarly, AGI is a quite general technical problem. You don’t just need to make an AI that can do narrow task X, it has to work in cases Y and Z too, or it will fall over and fail to take over the world at some point. To do this you need to create very general analysis and engineering tools that generalize across these situations.

I don't think this is a valid argument. Counter-example: you could build an AGI by uploading a human brain onto an artificial substrate, and you don't "need to create very general analysis and engineering tools that generalize... (read more)

You're right that the uploading case wouldn't necessarily require strong algorithmic insight. However, it's a kind of bounded technical problem that's relatively easy to evaluate progress in relative to the difficulty, e.g. based on ability to upload smaller animal brains, so would lead to >40 year timelines absent large shifts in the field or large drivers of progress. It would also lead to a significant degree of alignment by default. For copying culture, I think the main issue is that culture is a protocol that runs on human brains, not on computers. Analogously, there are Internet protocols saying things like "a SYN/ACK packet must follow a SYN packet", but these are insufficient for understanding a human's usage of the Internet. Copying these would lead to imitations, e.g. machines that correctly send SYN/ACK packets and produce semi-grammatical text but lack certain forms of understanding, especially connection to a surrounding "the real world" that is spatiotemporal etc. If you don't have logic yourself, you can look at a lot of logical content (e.g. math papers) without understanding logic. Most machines work by already working, not by searching over machine designs that fit a dataset. Also in the cultural case, if it worked it would be decently aligned, since it could copy cultural reasoning about goodness. (The main reason I have for thinking cultural notions of goodness might be undesirable is thinking that, as stated above, culture is just a protocol and most of the relevant value processing happens in the brains, see this post [] .)
Judge Overturns Transportation Mask Mandate

Nope. A few minutes later, a flight attendant comes to my seat, with the gate agent following behind, to tell me that I have to put a surgical mask on top of my respirator. Seems like the gate agent must have come aboard specifically to ask the flight attendant to do this.

When I tell her that my respirator does not have a valve, and is allowed by airline policy, she says that she can't determine whether or not it has a valve, and asks me whether I'm willing to comply with her request. I say "yes" as I'm afraid of the consequences of saying no. They leave b... (read more)

1David Hornbein1mo
Huh! And meanwhile, last month I flew several times without any mask at all, except briefly when asked to at checkpoints, and I didn't get harassed nearly this much. The worst I got was an attendant asking me to put on a mask and then leaving without checking whether I did so (I didn't). I wonder if I just got lucky, or if "unusual mask" stands out as an anomaly they need to Do Something about whereas "no mask" is just a regular thing not worthy of attention, or what. In any case, the discrepancy is very silly.
My friend wore a P100 on an airplane. An attendant yelled at him to take it off and put on a surgical mask. He pointed to the surgical mask he had on under the P100. So he was in every sense complying with the mask mandate. He just had a bonus mask on top of it. Attendant kept yelling. He kept violently agreeing to wear the mask he was already wearing. Attendant yells more. ??? Unclear what happened here, he used some jedi mind trick that made him too hard to argue with while simultaneously being too agreeable to kick off the plane ??? Friend boards plane wearing stupid surgical mask and his P100
2[comment deleted]1mo
Judge Overturns Transportation Mask Mandate

True story: One day before this decision, I was boarding a plane wearing an elastomeric respirator without an exhalation valve, which I checked ahead of time was specifically allowed by airline policy, but the gate agent told me my respirator wasn't allowed because it had exhalation valves. She apparently mistook the filter cartridges for valves, and said "we'll see what the flight attendant says" when I tried to point out they were filters, not valves, then let me pass. I boarded the plane and sat down without further incident... Anyone want to guess what happened afterwards?

3Adam Selker1mo
[Link] A minimal viable product for alignment

The example of cryptography was mainly intended to make the point that humans are by default too credulous when it comes to informal arguments. But consider your statement:

It feels to me like there’s basically no question that recognizing good cryptosystems is easier than generating them.

Consider some cryptosystem widely considered to be secure, like AES. How much time did humanity spend on learning / figuring out how to recognize good cryptosystems (e.g. finding all the attacks one has to worry about, like differential cryptanalysis), versus specifica... (read more)

[Link] A minimal viable product for alignment

If it turns out that evaluation of alignment proposals is not easier than generation, we’re in pretty big trouble because we’ll struggle to convince others that any good alignment proposals humans come up with are worth implementing.

But this is pretty likely the case though, isn't it? Actually I think by default the situation will be the opposite: it will be too easy to convince others that some alignment proposal is worth implementing, because humans are in general too easily convinced by informal arguments that look good but contain hidden flaws (and ... (read more)

It feels to me like there's basically no question that recognizing good cryptosystems is easier than generating them. And recognizing attacks on cryptosystems is easier than coming up with attacks (even if they work by exploiting holes in the formalisms). And recognizing good abstract arguments for why formalisms are inadequate is easier than generating them. And recognizing good formalisms is easier than generating them. This is all true notwithstanding the fact that we often make mistakes. (Though as we've discussed before, I think that a lot of the examples you point to in cryptography are cases where there were pretty obvious gaps in formalisms or possible improvements in systems, and those would have motivated a search for better alternatives if doing so was cheap with AI labor.)
Ukraine Post #9: Again

My understanding is he basically had a strong thesis, with many details

Thanks, now I'm curious how he got into this epistemic state to begin with, especially how he determined 1 and 2 on your list. My current guess is that he focused too much on things that he could easily see and things that fit into his framework, like Putin being strategic and measured in the past, and Russia's explicit reform efforts, and neglected to think enough about other stuff, like corruption, supply problems, Putin being fooled by his own command structure.

It's too bad that n... (read more)

[RETRACTED] It's time for EA leadership to pull the short-timelines fire alarm.

Yeah, don't do RL on it, but instead use it to make money for you (ethically) and at the same time ask it to think about how to create a safe/aligned superintelligent AGI. You may still need a big enough lead (to prevent others doing RL outcompeting you) or global coordination but it doesn't seem obviously impossible.

1Not Relevant1mo
Pretty much. I also think this plausibly buys off the actors who are currently really excited about AGI. They can make silly money with such a system without the RL part - why not do that for a while, while mutually-enforcing the "nobody kill everyone" provisions?
[RETRACTED] It's time for EA leadership to pull the short-timelines fire alarm.

No one knows how to build an AI system that accomplishes goals, that also is fine with you turning it off. Researchers have been trying for decades, with no success.

Given that it looks like (from your Elaboration) language models will form the cores of future AGIs, and human-like linguistic reasoning will be a big part of how they reason about goals (like in the "Long sequences of robot actions generated by internal dialogue" example) can't we just fine-tune the language model by training it on statements like "If (authorized) humans want to turn me off... (read more)

1Yonatan Cale1mo
Even if I assume this all goes perfectly, would you want a typically raised teenager (or adult) to have ~infinite power to change anything they want about humanity? How about a philosopher? Do you know even 10 people who you've seen what decisions they've advocated for and you'd trust them with ~infinite power?
1Yonatan Cale1mo
Hey! One of the problems here is that these are not well defined, and if you let a human (like me) read it, I will automatically fill in the blanks to probably match your own intuition. As examples of problems:
1. "I should turn off"
   1. Who is "I"? What if the AI makes another one?
   2. What is "should"? Does the AI get utility from this or not? If so, will the AI try to convince the humans to turn it off? If not, will the AI try to prevent humans from WANTING to turn it off?
Why would that make it corrigible to being turned off? What does the word "should" in the training data have to do with the system's goals and actions? The AI does not want to do what it ought (where by "ought" I mean the thing AI will learn the word means from human text). It won't be motivated by what it "should" do any more than by what it "shouldn't" do. This is a fundamental flaw in this idea; it is not repairable by tweaking the prompt. The word "should" will just have literally nothing whatsoever to do with what the AI is optimizing for (or even what it's optimized for). DALL-E doesn't make pictures because it "should" do that; it makes pictures because of where gradient descent took it. Like, best-case scenario, it repeats "I should turn off" as it kills us.
7Not Relevant1mo
Maybe! We should be thinking through the problems with that! It's not the worst idea in the world! EDIT: note that the obvious failure mode here is "but then you retrain the language model as part of your RL loop, and language loses its meaning, and then it does something evil and then everyone dies." So everyone still has to not do that! But this makes me think that ~human-level alignment in controlled circumstances might not be impossible. It doesn't change the fact that if anyone thinks it would be fun to fine-tune the whole model on an RL objective, we'd still lose. So we have to do global coordination ASAP. (It does not seem at all likely that Yudkowskian pivotal acts can be achieved solely through the logic learned from getting ~perfect LLM accuracy on the entire internet, since all proposed pivotal acts require new concepts no one has ever discussed.)
What Would A Fight Between Humanity And AGI Look Like?

My point here was that humans seem so susceptible to propaganda that an AI can probably find some way to bend us to its will. But if you want a more specific strategy that the AI could use (which I think came to me in part because I had the current war in the back of my mind), see my top level comment here.

The Debtors' Revolt

I'm sympathetic to a lot of your complaints and find your overall narrative interesting (albeit not something I fully understand or buy). But most if not all of the pathologies you describe seem explainable under mainstream microeconomics (mainly using the concept of the principal-agent problem), so I think I'd be more receptive to a version of this post where you didn't claim otherwise.

I found myself trying to explain that you might lend some grain to a farmer, so they could plant it to grow more, pay you back with interest, and still have some grain left o

... (read more)
I would expect a liquidity premium to exist but I'd expect it to be much smaller than the size of opportunity I'm seeing - why do you expect to see one so large? I don't think the principal-agent problem explains anything here because large publicly traded corporations frequently have governance, financial structures, and operations that aren't intelligible at all to casual investors. For example, I've taken courses in finance, accounting, and economics, and worked in financial services, and I have no idea how to evaluate Markopolos's criticisms of GE [] , & compare this with their financial statements, because the latter are so vague. (Do you?) Nor was there a trusted intermediary whose evaluation methods I understood. (Can you recommend one?) In practice when I did own stocks I was relying on correlation with other investors - the government would try not to let us all fail at once - rather than any ability to meaningfully exercise oversight over centralized management. The parenthetical questions are meant seriously, they're not just rhetorical flourishes.
I understand this argument, it would be a perfectly logical reason for some leveraged buyouts to happen in some circumstances, but in practice many leveraged buyouts are a way to offload risk onto counterparties with less legible high-trust relationships with management, such as employees - e.g. in the airline bankruptcies [] - or consumers, who can't use brand [] quality [] as much as they used to be able to because corporate decisions are made based on short-term numbers, and turnover means that the cost of eroded brand loyalty will be correlated across many companies and distributed across many people, and since the state won't let large corporations in general fail all at once, we end up with bailouts. I recommended a book on the subject because I really can't cover everything in the blog post, it's long enough already and this sort of thing is very well documented elsewhere.
I'd rather people investing on my behalf use objective profit-maximizing criteria, ideally with skin in the game, and that's why it's surprising that access to capital depends so much on the kinds of subjective factors you mention (checking whether someone has already been extended credit, whether they're vibing the right way with VCs or bankers, whether they look like a normal borrower, and in the case I described, whether they took a class prescribed by the credit union) relative to economic considerations. I have a close friend who had a business bank account closed for avowedly discretionary reasons after a conversation with a banker where as far as I can tell the banker got spooked because he seemed like he had specific, creative plans that didn't look normal. (Nothing illegal was discussed; they were thinking about something that might have attracted regulatory scrutiny, but they'd have been happy to negotiate or just look for a different counterparty for those transactions.) An important case study here is Abacus Bank, the only bank to be prosecuted in relation to the 2008 financial crisis [], as far as I can tell simply because they're culturally decorrelated from other banks (small, ethnically Chinese, privately held). The prosecution didn't work out, because Abacus hadn't done any crimes.
What Would A Fight Between Humanity And AGI Look Like?

What about the current situation in Russia? I think Putin must be winging the propaganda effort, since he wasn't expecting to have to fight a long and hard war, plus some of the messaging doesn't stand up to even cursory inspection (a Jewish Nazi president?), and yet it's still working remarkably well.

Putin is already president of Russia. The steps between an AI being president of a country and killing everybody is pretty cut-and-dry; I could probably do that if I had an AI's value function. The steps between an AI being a computer program assigned to raise the stock price of GOOG$ and consistently becoming president of Russia are much less clear.
3Sammy Martin2mo
The Putin case would be better if he was convincing Russians to make massive sacrifices or do something that will backfire and kill them, like start a war with NATO, and I don't think he has that power - e.g. him rushing to deny that Russia were sending conscripts to Ukraine because of the fear the effect that would have on public opinion
Yeah, but Putin’s been president of Russia for over 20 years and already has a very large, loyal following. There will always be those that enthusiastically follow the party line of the leader. It’s somewhat harder to actually seize power. (None of this is to excuse the actions of Putin or those who support him.)
What Would A Fight Between Humanity And AGI Look Like?

One strategy an AI could use (as an alternative to Dagon's, which I think is also likely) is to intentionally create "competitors" to itself, for example by leaking its design to other nations or AI builders. Competitive pressure (both economic and military) will then force each faction of humans to hand more and more power/control to "their" AI, with the end result that all important decisions will be made by or mediated through AI. (Why would you want to fight your AI on mere suspicion of its safety, when it's helping you fight other AIs that are clearly... (read more)

Ukraine Post #9: Again

Thanks for these. I've been trying to follow the war, but still missed many interesting links that you've collected.

Samo Burja watch: On 31 March he recognizes that Russia will not win militarily within the first 50 days, still predicts a similar outcome within the year as the most likely result and considers Russia taking a bunch of territory the optimistic scenario.

Samo seems to have been consistently too optimistic on Russia's behalf, and slow to update. On Feb 27 when others started to detect a turning of the tide, he wrote in response:

The standa

... (read more)

I think that Samo does have and offer a lot of detailed information, but he puts much or most of that behind a very expensive paywall, either the Bismark Brief or what he does for private clients. And that having worked up lots of details and a model that he presents as a key value add and brand, it becomes difficult to walk away from that special knowledge and model when events overtake the situation, making him slow to (publicly, anyway) update, but e.g. he did lose an actual $100 bet where he took much-worse-than-Metaculus odds on Russia winning by day ... (read more)

Naively, under my view of Samo's view, it would require bigger chunks of data to make substantive updates. This is an oversimplification, but my chain of reasoning about Samo's view goes approximately like this:
* Great Founder Theory []: The correct lens for history is institutions, and the seminal events are the launch or collapse of important institutions.
* The failure of the initial invasion will feed back into questions about whether or not the Kremlin, FSB, Russian army, etc. are in collapse as institutions. This is a hard problem [].
* It seems to me that the harder the problem the slower the update, and the smaller the consequences the less weight there is on the evidence.
* Applying that: it doesn't appear this is any kind of existential threat to any of the Russian institutions involved in launching the invasion.

I think there might be some kind of conundrum like: if Ukraine escalates into an existential threat, the greater Russian institutional resources predict victory for Russia; but if it doesn't escalate, it doesn't seem to weigh much. It seems like the kinds of things we are talking about in this war, like it being a political/economic defeat for Russia as a country or Putin as a regime, read as a kind of category error in the Great Founder Theory paradigm, which I might describe as "longtermist institutional realpolitik." I would guess that major updates would occur with events like: formal addition of new countries into NATO; collapse of things like Russian banks, oil and gas companies, or similar; coup attempts or successful revolutions; the creation of new institutions for Russian/Chinese economic alignment; the creation of new European institutions for energy independence; etc.
FYI, Samo Burja's username here is [removed by Ben Pace].
Book review: Very Important People

Seems like night club status works the same way that junk food or pornography do: you're (often) not optimizing for status (or nutrition or sex) directly, instead part of your brain is optimizing for rewards that another part of your brain provides when it detects certain correlates of status (or nutrition or sex).

MIRI announces new "Death With Dignity" strategy

Both cloning and embryo selection are not illegal in many places, including the US. (This article suggests that for cloning you may have to satisfy the FDA's safety concerns, which perhaps ought to be possible for a well-resourced organization.) And you don't have to raise them specifically for AI safety work. I would probably announce that they will be given well-rounded educations that will help them solve whatever problems that humanity may face in the future.

MIRI announces new "Death With Dignity" strategy

Shouldn't someone (some organization) be putting a lot of effort and resources into this strategy (quoted below) in the hope that AI timelines are still long enough for the strategy to work? With enough resources, it should buy at least a few percentage points of non-doom probability (even now)?

Given that there are known ways to significantly increase the number of geniuses (i.e., von Neumann level, or IQ 180 and greater), by cloning or embryo selection, an obvious alternative Singularity strategy is to invest directly or indirectly in these technologies, and t

... (read more)
Sounds good to me! Anyone up for making this an EA startup? Having more Neumann level geniuses around seems like an extremely high impact intervention for most things, not even just singularity related ones. As for tractability, I can't say anything about how hard this would be to get past regulators, or how much engineering work is missing for making human cloning market ready, but finding participants seems pretty doable? I'm not sure yet whether I want children, but if I decide I do, I'd totally parent a Neumann clone. If this would require moving to some country where cloning isn't banned, I might do that as well. I bet lots of other EAs would too.

Sure, why not.  Sounds dignified to me.

9Jackson Wagner2mo
For starters, why aren't we already offering the most basic version of this strategy as a workplace health benefit within the rationality / EA community? For example, on their workplace benefits page [], OpenPhil says: Seems a small step from there to making "we cover IVF for anyone who wants (even if your fertility is fine) + LifeView polygenic scores" into a standard part of the alignment-research-agency benefits package. Of course, LifeView only offers health scores, but they will also give you the raw genetic data. Processing this genetic data yourself, DIY style, could be made easier -- maybe there could be a blog post describing how to use an open-source piece of software and where to find the latest version of EA3, and so forth. All this might be a lot of trouble for (if you are pessimistic about PGT's potential [] ) a rather small benefit. We are not talking Von Neumanns here. But it might be worth creating a streamlined community infrastructure around this anyways, just in case the benefit becomes larger as our genetic techniques improve.
I don't see any realistic world where you both manage to get government permission to allow you to genetically engineer children for intelligence and they let you specifically raise them to do safety work far enough in advance that they actually have time to contribute and in a way that outweighs any PR risk.
Moloch and the sandpile catastrophe

Vernor Vinge described a version of this in A Deepness in the Sky. I agree it's an important topic and don't recall anyone else talk about it until now, or anyone talk about it in a non-fiction context, so I'm glad to see this post. From Vinge's novel:

“The flexibility of the governance is its life and its death. They’ve accepted optimizing pressures for centuries now. Genius and freedom and knowledge of the past have kept them safe, but finally the optimizations have taken them to the point of fragility. The megalopolis moons allowed the richest networki

... (read more)
How Does The Finance Industry Generate Real Economic Value?

I don't have a complete picture of my own of how everything works and fits together, but this part seems clearly wrong:

If you buy the stock from an investor, then you’re simply gaining money which that investor would otherwise have gained. It’s zero sum: their missed opportunity is exactly equal to your gain.

A typical case of a stock being undervalued is when there is some large shareholder trying to exit for liquidity reasons (think of a venture capitalist selling shares in an IPO'ed company so they can redeploy the capital into new startups, or a com... (read more)

3Daniel V2mo
It is still tempting to assume each exact transaction is zero sum (while the macro level invisible hand is yielding positive sum) but that would be a mistake. First, there may be a little bit of buyer and seller surplus (represented by a market maker facilitating a strike price between the bid/ask spread). Second, risk matters - could be that gain the seller missed out on was just not the right deployment of their capital for their risk profile, so they actually aren't "missing out" on it at all. Third, you're not observing opportunity costs in strike prices, so could be that gain the seller missed out on was a lower conviction bet they wanted to take off so they could get into a higher conviction bet, so they aren't actually "missing out" on it at all. Fourth, add in a subjective value function and suddenly on a utility basis, it is extremely easy to see the potential for actors to view trading as offering gains (though as mentioned, you don't even need this to begin chipping away at the zero sum notion). Finance is easy to side-eye, but read Matt Levine, it's just like any other market where people are trying to solve each other's problems and make some margin.
The lure of technocracy

Politically unbiased experts in red-and-gray Technocracy uniforms would assay each nation’s yearly energy output, then divide it fairly among the citizenry, each person receiving an allocation of so many joules or kilowatt-hours per month. If people wanted to buy, say, shirts, they would look up the price on a table of energy equivalents calculated by objective Technocratic savants.

My high school history/economics teacher actually assigned me to study the Technocracy Movement, I suspect because he wanted me to be more skeptical of mainstream economics, ... (read more)

Ukraine Post #2: Options

I think you’re taking the formal adoption of FDTs too literally here, or treating it as if it were the AGI case, as if humans were able to self-modify into machines fully capable of honoring commitments and then making arbitrary ones, or something?

Actually, my worry is kind of in the opposite direction, namely that we don't really know how FDT can or should be applied in humans, but someone with a vague understanding of FDT might "adopt FDT" and then use it to handwavingly justify some behavior or policy. For example someone might think, "FDT says that ... (read more)

Game theory, sanctions, and Ukraine

From what I've read, Putin surrounded himself with yes men who lied to him (or were afraid to tell him the truth) about the preparedness and morale of the Russian and Ukrainian militaries, and the likely response of the Ukrainian government and people to a Russian invasion. That doesn't seem rational to me, or if it's somehow not irrational on an individual level, makes it a bad idea to model Russia as a rational actor as a whole.

1[comment deleted]2mo
Absent honest, safe, free speech, leadership's map diverges more and more from the territory and then comes crashing back to reality when they drive off a cliff they thought was a highway. A group of individuals behaving in their own rational self-interest can make very irrational, self-destructive group-level decisions, if the incentives the members have are perverse enough. I guess the idea itself is as old as the book (Moloch-style religious arguments have existed since forever) but I somehow never thought about it from the lens of predictability, of being a part of the same consensus reality. Everyone around Putin was shocked that he went to full war, because they all knew they were lying to him and it would be a disaster. He alone lived in a hall of mirrors. I assume he's smashing a bunch of them as we speak.
Tbh I have a well documented bias toward overestimating opponents' capacities. I think Putin's actions make a lot of sense considering the results of his previous wars and the general lack of (serious) response from the West, but I agree his most recent speeches and actions do not strike me as perfectly rational.
Ukraine Post #2: Options

It’s not obvious to me why the bolded assertion follows; isn’t the point of “updatelessness” precisely that you ignore / refrain from conditioning your decision on (negative-sum) actions taken by your opponent in a way that would, if your conditioning on those actions was known in advance, predictably incentivize your opponent to take those actions? Isn’t that the whole point of having a decision theory that doesn’t give in to blackmail?

By "has to" I didn't mean that's normatively the right thing to do, but rather that's what UDT (as currently formulate... (read more)

Game theory, sanctions, and Ukraine

A couple of additional considerations that I wish people would bring up in posts on this topic:

  1. Humans have a tendency/bias to think one's own side is good and only giving proportional/fair responses, and the other side is evil and always escalating, making overall escalation much more likely than game theory (that assume rational actors) would suggest.
  2. Chinese leadership may seem mostly rational today, but what about tomorrow? Putin seemed fairly rational just a few months ago. Maybe "absolute power corrupts absolutely" is right after all?

Given these c... (read more)

1Dumbledore's Army2mo
You may be right that deterring the next war is harder than it appears. I’m pretty sure you’re right that escalation is more likely in reality than in theory. That doesn’t change the fact that the world still needs to deter the next war. And taking the problem seriously and thinking about how to find a solution is a necessary first step.
I think both Putin and Xi Jinping are extremely rational - more so, in fact, than our Macron or Biden. Their goals are very different, which makes their actions difficult for us to understand (at least on an emotional level - wanting to bring back the Glory of the empire is a motive I can understand but not really empathise with). And they can also make mistakes - I think Putin underestimated the strength of our answer to his invasion (tbf this strength surprised everybody, including our own government). But misevaluating something does not make you irrational.
Ukraine Post #2: Options

I'm not sure why this got voted down, but I don't disagree. In parts of my comments that you didn't quote, I indicated my own uncertainty about the tradeoff.

Ukraine Post #2: Options

it looks like this is if anything a worse problem for existing systems used in practice by (e.g. Biden)

Why do you say this? I'm pretty worried about people adopting any kind of formal decision theory, and then making commitments earlier than they otherwise would, because that's what the decision theory says is "rational". If you have a good argument to the contrary, then I'd be less concerned about this.

It seems more like a “I notice all existing options have this issue” problem than anything else, and like it’s pointing to a flaw in consequentialism

... (read more)
I think you're taking the formal adoption of FDTs too literally here, or treating it as if it were the AGI case, as if humans were able to self-modify into machines fully capable of honoring commitments and then making arbitrary ones, or something? Whereas actual implementations here are pretty messy, and also they're inscribed in the larger context of the social world. I also don't understand the logical time argument here as it applies to humans? I can see in a situation where you're starting out in fully symmetrical conditions with known source codes, or something, why you'd need to think super quick and make faster commitments. But I'm confused why that would apply to ordinary humans in ordinary spots? Or to bring it back to the thing I actually said in more detail, Biden seems like he's using something close to pure CDT. So someone using commitments can get Biden to do quite a lot, and thus they make lots of crazy commitments. Whereas in a socially complex multi-polar situation, someone who was visibly making lots of crazy strong commitments super fast or something would some combination of (1) run into previous commitments made by others to treat such people poorly (2) be seen as a loose cannon and crazy actor to be put down (3) not be seen as credible because they're still a human, sufficiently strong/fast/stupid commitments don't work, etc. I think the core is - you are worried about people 'formally adopting a decision theory' and I think that's not what actual people ever actually do. As in, you and I both have perhaps informally adopted such policies, but that's importantly different and does not lead to these problems in these ways. On the margin such movements are simply helpful. (On your BTW, I literally meant that to refer to the central case of 'what people do in general when they have non-trivial decisions, in general' - that those without a formal policy don't do anything coherent, and often change their answers dramatically based on social c
I admit to not being super interested in the larger geopolitical context in which this discussion is embedded... but I do want to get into this bit a little more: It's not obvious to me why the bolded assertion follows; isn't the point of "updatelessness" precisely that you ignore / refrain from conditioning your decision on (negative-sum) actions taken by your opponent in a way that would, if your conditioning on those actions was known in advance, predictably incentivize your opponent to take those actions? Isn't that the whole point of having a decision theory that doesn't give in to blackmail? Like, yes, one way to refuse to condition on that kind of thing is to refuse to even compute it, but it seems very odd to me to assert that this is the best way to do things. At the very least, you can compute everything first, and then decide to retroactively ignore all the stuff you "shouldn't have" computed, right? In terms of behavior this ought not provide any additional incentives to your opponent to take stupid (read: negative-sum) actions, while still providing the rest of the advantages that come with "thinking things through"... right? This part is more compelling in my view, but also it kind of seems... outside of decision theory's wheelhouse? Like, yes, once you start introducing computational constraints and other real-world weirdness, things can and do start getting messy... but also, the messiness that results isn't a reason to abandon the underlying decision theory? For example, I could say "Imagine a crazy person really, really wants to kill you, and the reason they want to do this is that their brain is in some sense bugged; what does your decision theory say you should do in this situation?" And the answer is that your decision theory doesn't say anything (well, anything except "this opponent is behaviorally identical to a DefectBot, so defect against them with all you have"), but that isn't the decision theory's fault, it's just that you gave it an
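The incentive structure being debated here can be sketched as a toy payoff computation. All the payoff numbers below are illustrative assumptions, not from the thread: a victim who decides after observing a threat finds paying cheaper than refusing, which makes threatening profitable, while a victim committed in advance to refusing makes threatening strictly worse than not threatening.

```python
# Toy blackmail game with made-up payoffs (purely illustrative).
# The blackmailer chooses whether to threaten; the victim whether to pay.
# Payoff tuples are (blackmailer, victim).
PAYOFFS = {
    ("no_threat", None): (0, 0),
    ("threat", "pay"): (5, -5),      # victim pays off the blackmailer
    ("threat", "refuse"): (-2, -10), # threat executed: hurts both sides
}

def victim_policy(kind):
    """'cdt' decides after seeing the threat; 'updateless' fixes its answer in advance."""
    if kind == "cdt":
        # Given that a threat already exists, paying (-5) beats refusing (-10).
        return "pay"
    else:
        # Commit up front to never pay, regardless of the threat.
        return "refuse"

def blackmailer_best_response(victim_kind):
    response = victim_policy(victim_kind)
    threaten_payoff = PAYOFFS[("threat", response)][0]
    no_threat_payoff = PAYOFFS[("no_threat", None)][0]
    return "threat" if threaten_payoff > no_threat_payoff else "no_threat"

for kind in ("cdt", "updateless"):
    print(kind, "->", blackmailer_best_response(kind))
# cdt -> threat; updateless -> no_threat
```

A victim known to pay gets threatened; a victim known in advance to refuse does not, which is the "refuse to condition on the threat" point in the comment above.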
Ukraine Post #2: Options

And I don’t think you eliminate the risks—you have a bunch of repressive communist governments everywhere in a world where conditions are getting worse (because communism doesn’t work) and they start fighting over resources slash have nuclear civil wars.

I'm assuming that the USSR would not have let other communist governments develop their own nuclear weapons.

It seems like the only world which doesn't face repeated 10% chances of Armageddon is one in which some state has a nuclear monopoly and enforces it by threatening to attack any other state that tr... (read more)

Ukraine Post #2: Options

I’m not claiming I have the best possible answer here, I’m claiming that what humans are currently doing is some mix of absurdly stupid things

What are some examples of these?

I think it is essentially epsilon chance that Putin would choose nuclear firestorm over a world where Russia doesn’t recreate the USSR / Russian Empire (and even where he dies tomorrow as well) if those were 100% the choices (I do think he might well be willing to risk a non-trivial chance of nuclear war to get it, but that is different), but let’s say that it is 10% (and notice tha

... (read more)
After reading Solzhenitsyn's The Gulag Archipelago, I am not really sure which outcome is worse.
If there were a bunch of Putin-style people in charge of the world that doesn't seem like a 'safe' world either. It seems like a world where these states engage in continuous brinksmanship that, if this kind of mindset is common, leads to a 10% chance of Armageddon as often or more often than the current one. We may have very different models of what happens if we let the USSR take over, but yeah I think that world has destroyed most of its value assuming it didn't go negative. And I don't think you eliminate the risks - you have a bunch of repressive communist governments everywhere in a world where conditions are getting worse (because communism doesn't work) and they start fighting over resources slash have nuclear civil wars. If the model is 'Putin escalates to nuclear war sometimes and maybe he miscalculates' then 'fold to him' is letting him conquer the world, literally, because no he wouldn't stop with Russia's old borders if we let him get Warsaw and Helsinki. Why would he? Otherwise, folding more makes him escalate until the nukes fly.
Ukraine Post #2: Options

On a personal level, getting yourself to where you are using a functional decision theory is very much worth it, as is helping others to get there with you – it’s good even on your own, but the more people use one, the better it does.

I think this is far too sanguine with regard to our understanding of decision theory. See The Commitment Races problem for one example of a serious problem that, AFAIK, isn't solved by any of the currently proposed decision theories, including FDT, and advocating that more people adopt FDT before solving the problem might e... (read more)

Looking at the Commitment Races Problem more (although not in full detail) it looks like this is if anything a worse problem for existing systems used in practice by (e.g. Biden), or at a minimum a neutral consideration. It seems more like a "I notice all existing options have this issue" problem than anything else, and like it's pointing to a flaw in consequentialism more broadly?
On the personal level, to me this seems like a potential failure mode worth worrying about in an AGI context because that's Impossible Mode for everything, but not in practical human mode, and most definitely not on the margin. I'm not claiming I have the best possible answer here, I'm claiming that what humans are currently doing is some mix of absurdly stupid things, and non-FDT proposals that exist seem way worse than FDT proposals that exist - also that actual human attempts to be FDT-style agents will be incomplete and not lead to degenerate outcomes, the same way mostly-basically-CDT-style human agents often don't actually do the fully crazy things it implies when it implies fully crazy things. On the global level, again I'm not saying I know the details of how we should respond, only that we shouldn't lie down and let him get whatever he wants. I think it is essentially epsilon chance that Putin would choose nuclear firestorm over a world where Russia doesn't recreate the USSR / Russian Empire (and even where he dies tomorrow as well) if those were 100% the choices (I do think he might well be willing to risk a non-trivial chance of nuclear war to get it, but that is different), but let's say that it is 10% (and notice that in those 10%, if he knew for a fact we'd respond with our nukes, he'd just say 'oh that's too bad' and destroy the world in a fit of pique, which very much doesn't seem right). In the other 90%, he backs down after various numbers of escalations (e.g. in some he tries the escalate-to-deescalate single nuke, others he tries leveling Kyiv, others he folds tomorrow and leaves) in some combination, then folds after he sees we won't fold, but losses are 'acceptable' here. In those 10% of worlds, what are we hoping for? I don't think there is much of an 'eventually' here. He takes Ukraine, we let him. He sees we let him do what he wants. Everyone else sees too. Every state that can afford one starts a nuclear weapons program, so Putin knows he'
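The reply above can be put as a back-of-envelope expected-value comparison. Only the 10% escalation figure comes from the comment itself; every utility number below is a placeholder assumption for illustration, and the conclusion flips entirely depending on what values you plug in.

```python
# Back-of-envelope EV comparison for "resist" vs "fold", using the (made-up)
# 10% figure from the comment above. All utilities are illustrative
# placeholders, not claims about actual outcomes.
p_firestorm_if_resist = 0.10

U_ARMAGEDDON = -1000   # nuclear firestorm
U_BACKS_DOWN = -50     # he escalates, then folds; losses "acceptable"
U_FOLD_ONCE  = -200    # concede Ukraine; nuclear proliferation everywhere

ev_resist = (p_firestorm_if_resist * U_ARMAGEDDON
             + (1 - p_firestorm_if_resist) * U_BACKS_DOWN)
ev_fold = U_FOLD_ONCE

print(ev_resist, ev_fold)
# Which strategy looks better is driven entirely by these inputs.
```

Under these particular assumptions resisting comes out ahead (-145 vs -200), but the point of the sketch is only that the disagreement reduces to the probability and utility estimates, not the decision rule.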
Russia has Invaded Ukraine

From what I've read, Russia's stockpile of precision guided munitions is low, so this war may not look very "modern" past the initial stages. If Russia ends up adopting the same tactics it used in the Second Chechen War and causes the same amount of casualties on a per capita basis, Ukraine would end up suffering 2.5 million deaths.

Russia has Invaded Ukraine

Metaculus had only an 8% chance of "Russian Troops in Kyiv in 2022" as late as Feb 11. (It's now at 97%.) Why did everyone do so badly on this prediction?

Speaking from ignorance: this prediction failure seems (from my ignorant perspective) similar to forecasting failures in Brexit and in the Trump 2016 election, in that it’s a case where some force whose motives are unlike Western academia/elites was surprising to them/us. If so, the moral might be to study the perspectives, motives, and capabilities of forces outside the Western elite on their own terms / by cobbling together an inside view from primary sources, rather than looking via Western experts/media (though this is much harder).

To what extent are ... (read more)

Off the top of my head, maybe it's because Metaculus presents medians, and the median user neither investigates the issue much nor trusts those who do (Matt Y, Scott A), and just roughly follows base rates. I also feel there was some wishful thinking, and that to some extent, the fullness of the invasion was at least somewhat intrinsically surprising.
I can think of several reasons:

  1. Mistakes about the degree of power centralization in Russia. I don't know how it was perceived in other countries, but in Russia there were many debates about how much power belongs to Putin himself. The real process of decision-making was hidden, and people had different hypotheses about it. Many thought of Putin as an arbiter between oligarchs/other forces, or as first among peers. As far as I understand, last week's broadcast from the Russian Security Council was very surprising for many people: Putin openly and practically humiliated some high officials (especially the Chief of the Intelligence Service).
  2. Mistakes about Putin's motives. I think the problem is a change in Putin. At the beginning of his rule Putin demonstrated that he was very pragmatic. He said things that sounded very reasonable. He very rarely did anything irreversible. I think everyone missed the moment when this changed.
  3. I suppose people underestimate the magnitude of the Russian intelligence services' degradation. In Russian defense agencies people were used to saying the things their superiors wanted to hear. I think Putin understands Ukrainians very inadequately because he read many reports that only confirmed his point of view.
A second observation: how is this prediction inconsistent with a combination of these two beliefs?

  • The Russians are unlikely to invade, and in that case the probability of Russian troops in the capital is ~0.
  • In the counterfactual case where the Russians do invade, the probability of Russian troops in the capital is ~1.

For example, Metaculus has a 97% chance of Russians in Kyiv, but a Russian invasion at all before 2023 is at 96%.
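The decomposition in this comment can be checked mechanically with the law of total probability, using the 96% invasion figure it quotes (the conditional probabilities are the comment's ~0 and ~1, idealized to exact values):

```python
# Law-of-total-probability check of the parent comment's point: a high
# unconditional "Russians in Kyiv" forecast is compatible with a forecaster
# who doubted invasion little and was near-certain about Kyiv given invasion.
p_invade = 0.96                # Metaculus figure quoted above
p_kyiv_given_invade = 1.00     # ~1 per the comment's second bullet
p_kyiv_given_no_invade = 0.0   # ~0 per the first bullet

p_kyiv = (p_invade * p_kyiv_given_invade
          + (1 - p_invade) * p_kyiv_given_no_invade)
print(round(p_kyiv, 2))  # 0.96, close to the quoted 97%
```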

Even the honest-to-god financial markets did badly on this prediction. The moment Russia started invading, MOEX and oil prices had wild shocks, as if something unexpected had happened, even though we already had abundant ahead-of-time warnings about their attack from U.S. intelligence. My suspicion is that everyone, including market analysts, expected Russia to simply annex the separatist eastern territory of Donbas. They expected the war to end quickly and with little bloodshed, and for the Ukrainian government to capitulate. The presumption was that they... (read more)

My baseline assumption for this is that everyone always does badly on these kinds of predictions. This frequently includes highly trained defense analysts with access to privileged information sources. I suspect, at a gut level, that this is mostly because markets and crowds work on publicly available information and national security is pretty much the foremost subject for inside-view information being kept from the public.
Russia has Invaded Ukraine

This is probably the first time that a major war has been fought between two countries that are both well below replacement birth rate, which seems totally bonkers. What does it imply about human values?

What are the implications for international cooperation on AI risk and other x-risks? Bad news, presumably, but how bad?

Kind of a dark thought, but: there's always a baby boom after a war, fertility shoots way up. Putin has tried to prop up the Russian birth rate for many years to no avail...

9Self-Embedded Agent3mo
What does the birth rate have to do with war?
This war will be over fast and cause comparatively far fewer casualties than any other war between countries of that size.
Harms and possibilities of schooling

Have you looked for previous experiments along the lines of what you're proposing? If so, did you find any and what were their results?

I haven't looked really, seems worth someone doing. I think there's been a fair amount of experimentation, though maybe a lot of it is predictably worthless (e.g. by continuing to inflict central harms of normal schooling), I don't know. (This post is mainly aimed at adding detail to what some of the harms are, so that experiments can try to pull the rope sideways on supposed tradeoffs like permissiveness vs. strictness or autonomy vs. guidance.) I looked a little. Aside from Montessori (which would take work to distinguish things branded as Montessori vs. actually implementing her spirit), there's the Summerhill School, which seems to have ended up with creepy stuff going on, and Sudbury schools, which I don't know about.
HCH and Adversarial Questions

I appreciate this review of the topic, but feel like you're overselling the "patchwork solutions", or not emphasizing their difficulties and downsides enough.

We can infer from the history of human reasoning that human cognition is relatively inefficient at transforming resources into adversarial text-inputs, as people have not produced all that many of those. No such inference can be made for computational search processes generally. We avoid most of the adversarial questions into HCH by remaining in the shallow waters of human cognition, and avoiding at

... (read more)
1David Udell3mo
(Thanks for the feedback!) I think so -- in my world model, people are just manifestly, hopelessly mindkilled by these domains. In other, apolitical domains, our intelligence can take us far. I'm certain that doing better politically is possible (perhaps even today, with great and unprecedentedly thoughtful effort and straining against much of what evolution built into us), but as far as bootstrapping up to a second-generation aligned AGI goes, we ought to stick to the kind of research we're good at if that'll suffice. Solving politics can come after, with the assistance of yet-more-powerful second-generation aligned AI. In the world I was picturing, there aren't yet AI-assisted adversaries out there who have access into HCH. So I wasn't expecting HCH to be robust to those kinds of bad actors, just to inputs it might (avoidably) encounter in its own research. Conditional on my envisioned future coming about, the decision theory angle worries me more. Plausibly, we'll need to know a good bit about decision theory to solve the remainder of alignment (with HCH's help). My hope is that we can avoid the most dangerous areas of decision theory within HCH while still working out what we need to work out. I think this view was inspired by the way smart rationalists have been able to make substantial progress on decision theory while thinking carefully about potential infohazards and how to avoid encountering them. What I say here is inadequate, though -- really thinking about decision theory in HCH would be a separate project.
The innocent gene

I can understand objecting to the negative moral connotations of "selfish" (as applied to genes), but attaching "innocence" to genes with its positive moral connotations seems like erring in the opposite direction. What do you think about creating a new word to describe genes, that has a similar denotation to "selfish" but without the unwanted moral connotations?

2Joe Carlsmith3mo
Yeah, as I say, I think "neither innocent nor guilty" is least misleading -- but I find "innocent" an evocative frame. Do you have suggestions for an alternative to "selfish"?
The innocent self

Yet in the 21st century, in the developed world, who needs to be harsh or hard-hearted, except in response to their own unhappiness or insecurity? In general, hurt people hurt people.

In my observation, "hurt people hurt people" is true, in the sense that people generally hurt others due to perceiving unfairness or injustice toward themselves. But the amount of output as a function of input varies greatly between different people, with some being magnanimous and pacific, and others having anger and violence on a hair-trigger. I think a lot of the variati... (read more)

A broad basin of attraction around human values?

I think Paul’s argument amounts to saying that a corrigibility approach focuses directly on mitigating the “lock-in” of wrong preferences, whereas ambitious value learning would try to get the right preferences but has a greater risk of locking-in its best guess.

What's the actual content of the argument that this is true? From my current perspective, corrigible AI still has a very high risk of lock-in of wrong preferences, due to bad metapreferences of the overseer, and ambitious value learning, or some ways of doing that, could turn out to be less risk... (read more)

A broad basin of attraction around human values?

My inclination is to guess that there is a broad basin of attraction if we’re appropriately careful in some sense (and the same seems true for corrigibility).

In other words, the attractor basin is very thin along some dimensions, but very thick along some other dimensions.

What do you think are the chances are of humanity being collectively careful enough, given that (in addition from the bad metapreferences I cited in the OP) it's devoting approximately 0.0000001% of its resources (3 FTEs, to give a generous overestimate) to studying either metaphilosop... (read more)

Reflections on six months of fatherhood

much of human motivation is curiosity and self-actualization

And status? I recall when my kid was having trouble learning to crawl, we showed him a toy baby that could crawl by itself, and he really hated it, bawling when he saw it start to crawl and then afterwards whenever he saw it. (Granted there could be other explanations for this behavior besides "status". We ended up helping to make the learning easier by putting him on a slight decline, which worked very well.)

We run races, climb mountains, compose ballads, peer through telescopes. These thing

... (read more)
Why do we need a NEW philosophy of progress?

The failures of communism must also have soured a lot of people on "progress", given that it fit really well into the old philosophy of progress and then turned out really badly. (See this related comment.)

How can we make moral and social progress at least as fast as we make scientific, technological and industrial progress? How do we prevent our capabilities from outrunning our wisdom?

This seems to be the key to everything else, but it may just be impossible. It seems pretty likely that moral and social progress are just inherently harder problems, gi... (read more)

5Martin Sustrik4mo
I would say there were two distinct "progressive" worldviews in the 19th century. The symbol of the bourgeois progressivism may be the Exposition Universelle of 1889, the symbol of the proletarian progressivism the Paris Commune. Two events, same place, 18 years apart. The former with all the wonderful machines etc., the latter with the barricades and soldiers shooting the survivors. The two worldviews being that distinct and held by different people, it's not clear to me whether the failures of the social-progress school led to the souring on technical progress.
We can't? Have we tried? Have you tried? Is there some law of physics I'm missing? What would a real, genuine attempt to do just that even look like? Would you recognize it if it was done right in front of you?
Lives of the Cambridge polymath geniuses

Jason Crawford's recent post on 19th-century philosophy of progress seems relevant. Some quotes from it:

  • deep belief in the power of human reason
  • had forecast progress in morality and society just as much as in science, technology and industry
  • progress was inevitable
  • the conviction that “the Idea or the Dialectic or Natural Law, functioning through the conscious purposes or the unconscious activities of men, could be counted on to safeguard mankind against future hazards”

From this it doesn't seem surprising that smart people would have initially seen som... (read more)

Maybe they looked at a set of values (present) and decided that others might serve better. Having picked out a better set* might not have been super hard. *in their context
That last point ("more distal cause") is a very interesting idea. Thanks!
The ignorance of normative realism bot

We can say similar stuff about other a priori domains like modality, logic, and philosophy as a whole. [...] Whether there are, ultimately, important differences here is a question beyond the scope of this post (I, personally, expect at least some).

I would be interested in your views on metaphilosophy and how it relates to your metaethics.

Suppose we restrict our attention to the subset of philosophy we call metaethics, then it seems to me that meta-metaethical realism is pretty likely (i.e., there are metanormative facts, or facts about the nature of no... (read more)

2Joe Carlsmith4mo
Is the argument here supposed to be particular to meta-normativity, or is it something more like "I generally think that there are philosophy facts, those seem kind of a priori-ish and not obviously natural/normal, so maybe a priori normative facts are OK too, even if we understand neither of them"? Re: meta-philosophy, I tend to see philosophy as fairly continuous with just "good, clear thinking" and "figuring out how stuff hangs together," but applied in a very general way that includes otherwise confusing stuff. I agree various philosophical domains feel pretty a priori-ish, and I don't have a worked out view of a priori knowledge, especially synthetic a priori knowledge (I tend to expect us to be able to give an account of how we get epistemic access to analytic truths). But I think I basically want to make the same demands of other a priori-ish domains that I do normativity. That is, I want the right kind of explanatory link between our belief formation and the contents of the domain -- which, for "realist" construals of the domain, I expect to require that the contents of the domain play some role in explaining our beliefs. Re: the relationship between meta-normativity and normativity in particular, I wonder if a comparison to the relationship between "meta-theology" and "theology" might be instructive here. I feel like I want to be fairly realist about certain "meta-theological facts" like "the God of Christianity doesn't exist" (maybe this is just a straightforward theological fact?). But this doesn't tempt me towards realism about God. Maybe talking about normative "properties" instead of normative facts would be easier here, since one can imagine e.g. a nihilist denying the existence of normative properties, but accepting some 'normative' (meta-normative?) facts like "there is no such thing as goodness" or "pleasure is not good."
General alignment plus human values, or alignment via human values?

In contrast, something like a threat doesn’t count, because you know that the outcome if the threat is executed is not something you want; the problem comes because you don’t know how to act in a way that both disincentivizes threats and also doesn’t lead to (too many) threats being enforced. In particular, the problem is not that you don’t know which outcomes are bad.

I see, but I think at least part of the problem with threats is that I'm not sure what I care about, which greatly increases my "attack surface". For example, if I knew that negative utili... (read more)

3Rohin Shah4mo
I broadly agree with the things you're saying; I think it mostly comes down to the actual numbers we'd assign. Yeah, that's about right. I'd note that it isn't totally clear what the absolute risk number is meant to capture -- one operationalization is that it is P(existential catastrophe occurs, and if we had solved AI persuasion but the world was otherwise exactly the same, then no existential catastrophe occurs) -- I realize I didn't say exactly this above but that's the one that is mutually exhaustive across risks, and the one that determines expected value of solving the problem. To justify the absolute number of 1/1000, I'd note that:

  1. The case seems pretty speculative + conjunctive -- you need people to choose to use AI to be very persuasive (instead of, idk, retiring to live in luxury in small insular subcommunities), you'd need the AI to be better at persuasion than defending against persuasion (or for people to choose not to defend), and you'd need this to be so bad that it leads to an existential catastrophe.
  2. I feel like if I talked to lots of people the amount I've talked with you / others about AI persuasion (i.e. not very much, but enough to convey a basic idea) I'd end up having 10-300 other risks of similar magnitude and plausibility. Under the operationalization I gave above, these probabilities would be mutually exclusive. So that places an upper bound of 1/300 - 1/10 on any given problem.
  3. I don't expect this bound to be tight. For example, if it were tight, that would imply that existential catastrophe is guaranteed. But more importantly, there are lots of worlds in which existential catastrophe is overdetermined because society is terrible at coordinating. If you condition on "existential catastrophe" and "AI persuasion was a big problem", I update that we were really bad at coordination and so I also think that there would be lots of other problems such that solving pe
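The upper bound in point 2 of the reply above is just the arithmetic of mutually exclusive events: if N mutually exclusive risks are all of similar probability p, their probabilities sum to at most 1, so p ≤ 1/N. A minimal sketch, with N spanning the comment's "10-300 other risks" range:

```python
# Sketch of the bound in point 2: N mutually exclusive, similarly likely
# risks each get probability at most 1/N. The 10 and 300 are the comment's
# own rough range, not data.
def max_per_risk(n_risks):
    """Upper bound on any one of n mutually exclusive, equally likely risks."""
    return 1.0 / n_risks

print(max_per_risk(10), max_per_risk(300))  # the 1/10 - 1/300 bound
```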
Zvi’s Thoughts on the Survival and Flourishing Fund (SFF)

some types of bad (or bad on some people’s preferences) outcomes from markets can be thought of as missing components of the objective function that those markets are systematically optimizing for.

This framing doesn't make a lot of sense to me. From my perspective, markets are unlike AI in that there isn't a place in a market's "source code" where you can set or change an objective function. A market is just a group of people, each pursuing their own interests, conducting individual voluntary trades. Bad outcomes of markets come not from wrong objective... (read more)

General alignment plus human values, or alignment via human values?

To be clear, my original claim was for hypothetical scenarios where the failure occurs because the AI didn’t know human values, rather than cases where the AI knows what the human would want but still a failure occurs.

I'm not sure I understand the distinction that you're drawing here. (It seems like my scenarios could also be interpreted as failures where AI don't know enough human values, or maybe where humans themselves don't know enough human values.) What are some examples of what your claim was about?

I do still think they are not as important as

... (read more)
2Rohin Shah4mo
Examples:

  1. Your AI thinks it's acceptable to inject you with heroin, because it predicts you will then want more heroin.
  2. Your AI is uncertain whether you'd prefer to explore space or stay on Earth. It randomly guesses that you want to stay on Earth and takes irreversible actions on your behalf that force you to stay on Earth.

In contrast, something like a threat doesn't count, because you know that the outcome if the threat is executed is not something you want; the problem comes because you don't know how to act in a way that both disincentivizes threats and also doesn't lead to (too many) threats being enforced. In particular, the problem is not that you don't know which outcomes are bad. No, the expected value of marginal effort aimed at solving these problems isn't as large as the expected value of marginal effort on intent alignment. (I don't like talking about "expected value lost" because it's not always clear what does and doesn't count as part of that. For example I think it's nearly inevitable that different people will have different goals and so the future will not be exactly as any one of them desired; should I say that there's a lot of expected value lost from "coordination problems" for that reason? It seems a bit weird to say that if you think there isn't a way to regain that "expected value".) Uh, idk. It's not something I have numbers on. But I suppose I can try and make up some very fake numbers for, say, AI persuasion. (Before I actually do the exercise, let me note that I could imagine the exercise coming out with numbers that favor persuasion over intent alignment; this probably won't change my mind and would instead make me distrust the numbers, but I'll publish them anyway.) To change an existentially bad outcome from AI persuasion, I'd imagine first figuring out some solutions, then figuring out how to implement them, and then getting the appropriate people to implement them; seems like you need all of these steps in
Morality is Scary

This seems interesting and novel to me, but (of course) I'm still skeptical.

I gave the relevant example of relatively well-understood values, preference for lower x-risks.

Preference for lower x-risk doesn't seem "well-understood" to me, if we include in "x-risk" things like value drift/corruption, premature value lock-in, and other highly consequential AI-enabled decisions (potential existential mistakes) that depend on hard philosophical questions. I gave some specific examples in this recent comment. What do you think about the problems on that list?... (read more)

Selfless Dating
  • How many "first dates" did you have to go through before you found a suitable partner for selfless dating?
  • How long on average did it take for you to decide that someone wasn't a suitable partner for selfless dating and break up with them?
  • Did you have to break up with someone who would have made a fine partner for "hunting rabbit" (conventional dating/romance) just because they weren't willing/able to "hunt stag" (selfless dating)? If so, what gave you the conviction that this would be a good idea?
  • Did you or would you suggest explaining what selfless da
... (read more)
5 · Jacob Falkovich · 4mo
I was in a few long-term relationships in my early twenties, when I myself wasn't mature/aware enough for selfless dating. Then, after a 4-year relationship that was very explicit-rules-based had ended, I went on about 25 first dates in the space of about 1 year before meeting my wife. Basically all of those 25 didn't work because of a lack of mutual interest, not because we both tried to make it a long-term thing but failed to hunt stag.

If I were single today, I would date not through OkCupid as I did back in 2014 but through the intellectual communities I'm part of now. And with the sort of women I would like to date in these communities, I would certainly talk about things like selfless dating (and dating philosophy in general) on a first date. Of course, I am unusually blessed in the communities I'm part of (including Rationality).

A lot of my evidence comes from hearing other people's stories, both positive and negative. I've been writing fairly popular posts on dating for half a decade now, and I've had both close friends and anonymous online strangers in the dozens share their dating stories and struggles with me. For people who seem generally in a good place to go in the selfless direction, the main pitfalls seem to be insecurity spirals and forgetting to communicate. The former is when people are unable to give their partner the benefit of the doubt on a transgression, which usually stems from their own insecurity. Then they act more selfishly themselves, which causes the partner to be more selfish in turn, and the whole thing spirals. The latter is when people who hit a good spot stop talking about their wants and needs. As those change, they end up with a stale model of each other. Then they inevitably end up making bad decisions and don't understand why their idyll is deteriorating.

To address your general tone: I am lucky in my dating life, and my post (as I wrote in the OP itself) doesn't by itself constitute enough evidence for an outside-view update
General alignment plus human values, or alignment via human values?
  1. Your AI should tell you that it’s worried about your friend being compromised, make sure you have an understanding of the consequences, and then go with your decision.

I think unless we make sure the AI can distinguish between "correct philosophy" or "well-intentioned philosophy" and "philosophy optimized for persuasion", each human will become either compromised (if they're not very cautious and read such messages) or isolated from the rest of humanity with regard to philosophical discussion (if they are cautious and discard such messages). This doesn... (read more)

To be clear, my original claim was for hypothetical scenarios where the failure occurs because the AI didn't know human values, rather than cases where the AI knows what the human would want but still a failure occurs. (I didn't state this explicitly because I was replying to the post, which focuses specifically on the problem of not knowing all of human values.) I think most of your failures are of the latter type, and I wouldn't make a similar claim for such failures -- they seem plausible and worth attention. I do still think they are not as important a... (read more)

General alignment plus human values, or alignment via human values?

Generally with these sorts of hypotheticals, it feels to me like it either (1) isn’t likely to come up, or (2) can be solved by deferring to the human, or (3) doesn’t matter very much.

What do you think about the following examples:

  1. AI persuasion - My AI receives a message from my friend containing a novel moral argument relevant to some decision I'm about to make, but it's not sure if it's safe to show the message to me, because my friend may have been compromised by a hostile AI and is now in turn trying to compromise me.
  2. User-requested value lock-in
... (read more)
  1. Your AI should tell you that it's worried about your friend being compromised, make sure you have an understanding of the consequences, and then go with your decision.
  2. Seems fine. Maybe your AI warns you about the risks before helping.
  3. Seems like an important threat that you (and your AI) should try to resolve.
  4. If you mean a utility function over universe histories (as opposed to e.g. utility functions over some finite set of high-level outcomes) this seems pretty rough. Mostly I would hope that this situation doesn't arise, because none of the humans can com
... (read more)