As far as I can tell, the AI has no specialized architecture for deciding about its future strategies or giving semantic meaning to its words. Its outputting the string "I will keep Gal a DMZ" does not have the semantic meaning of it committing to keep troops out of Gal. It's just the phrase that the players most likely to win use in that board state with its internal strategy.

Like chess grandmasters being outperformed by a simple search tree when chess was supposed to be the peak of human intelligence, I think this will have the same effect of disenchanting the game of Diplomacy. Humans are not decision-theoretic geniuses; just saying whatever people want to hear while playing optimally for yourself is sufficient to win. There may be a level of play where decision theory and commitments are relevant, but humans just aren't that good.

That said, I think this is actually a good reason to update towards freaking out. It's happened quite a few times now that 'naive' big milestones have been hit unexpectedly soon "without any major innovations or new techniques" - chess, Go, StarCraft, Dota, GPT-3, DALL-E, and now Diplomacy. It's starting to look like humans are less complicated than we thought - more like a bunch of current-level AI architectures squished together in the same brain (with some capacity to train new ones in deployment) than like a powerful generally applicable intelligence. Or a room full of toddlers with superpowers, to use the CFAR phrase. While this doesn't increase our estimates of the rate of AI development, it does suggest that the goalpost for superhuman intellectual performance in all areas is closer than we might have thought otherwise.

Dear M.Y. Zuo,


I hope you are well.

It is my experience that the conventions of e-mail are significantly more formal and precise in expectation when it comes to phrasing. Discord and Slack, on the other hand, have an air of informal chatting, which makes it feel more acceptable to use shortcuts and to phrase things less carefully. While feelings may differ between people and conventions between groups, I am quite confident that these conventions are common due to both media's origins, as a replacement for letters and memos and as a replacement for in-person communication respectively.

Don't hesitate to ask if you have any further questions.

Best regards,

Daphne Will

I don't think that's really true. People are a lot more informal on Discord than e-mail because of where they're both derived from.

That's a bit of a straw man, though to be fair it appears my question didn't fit into your world model as it does in mine.

For me, the insurrection was in the top 5 most informative/surprising US political events in 2017-2021. On account of its failure it didn't have as major consequences as others, but it caused me to update my world model more. For me, it was a sudden confrontation with the size and influence of anti-democratic movements within the Republican party, which I consider Trump to be sufficiently associated with to cringe from the notion of voting for him.

The core of my question is whether your world model has updated from

Given our invincible military, the only danger to us is a nuclear war (meaning Russia).

For me, the January insurrection was a big update away from that statement, so I was curious how it fit in your world model, but I suppose the insurrection is not necessarily the key. Did your probability of (a subset of) Republicans ending American democracy increase over the Trump presidency?

Noting that a Republican terrorist might still have attempted to commit acts of terror with Clinton in office does not mitigate the threat posed by (a subset of) Republicans. Between self-identified Democrats pissing off a nuclear power enough to start a world war and self-identified Republicans causing the US to no longer have functional elections, my money is on the latter.

If I had to use a counterfactual, I would propose imagining a world where the political opinions of all US citizens as projected on a left-right axis were 0.2 standard deviations further to the Left (or Right).

With Trump/Republicans I meant the full range of questions, from just Trump, through participants in the storming of Congress, to all Republican voters.

It seems quite easy for a large fraction of a population to be a threat to the population's interests if they share a particular dangerous behavior. I'm confused why you would think that would be difficult. Threat isn't complete or total. If you don't get a vaccine or wear a mask, you're a threat to immunocompromised people but you can still do good work professionally. If you vote for someone attempting to overthrow democracy, you're a danger to the nation while in the voting booth but you can still do good work volunteering. As for how the nation can survive such a large fraction working against its interests - it wouldn't, in equilibrium, but there's a lot of inertia.

It seems weird that people storming the halls of Congress, building gallows for a person certifying the transition of power, and killing and getting killed attempting to reach that person, would lead to no update at all on who is a threat to America. I suppose you could have factored this sort of thing in from the start, but in that case I'm curious how you would have updated on potential threats to America if the insurrection didn't take place.

Ultimately the definition of 'threat' feels like a red herring compared to the updates in the world model. So perhaps more concretely: what's the minimum level of violence at the insurrection that would make you have preferred Hillary over Trump? How many Democratic congresspeople would have to die? How many Republican congresspeople? How many members of the presidential chain of command (old or new)?

Hey, I stumbled on this comment and I'm wondering if you've updated on whether you consider Trump/Republicans a threat to America's interests in light of the January 6th insurrection.

People currently give MIRI money in the hopes they will use it for alignment. Those people can't explain concretely what MIRI will do to help alignment. By your standard, should anyone give MIRI money?

When you're part of a cooperative effort, you're going to be handing off tools to people (either now or in the future) which they'll use in ways you don't understand and can't express. Making people feel foolish for being a long inferential distance away from the solution discourages them from laying groundwork that may well be necessary for progress, or even from exploring.

As a concrete example of rational one-hosing, here in the Netherlands it rarely gets hot enough that ACs are necessary, but when it does a bunch of elderly people die of heat stroke. Thus, ACs are expected to run only several days per year (so efficiency concerns are negligible), but having one can save your life.

I checked the biggest Dutch-only consumer-facing online retailer for various goods (bol.com). Unfortunately I looked before making a prediction for how many one-hose vs two-hose models they sell, but even conditional on me choosing to make a point of this, it still seems like it could be useful for readers to make a prediction at this point. Out of 694 models of air conditioner labeled as either one-hose or two-hose,


are two-hose.

This seems like strong evidence that the market successfully adapts to actual consumer needs where air conditioner hose count is concerned.

Agree that it's too shallow to take seriously, but

If it answered "you would say during text input batch 10-203 in January 2022, but subjectively it was about three million human years ago" that would be something else.

only seems to capture AI that managed to gradient hack the training mechanism to pass along its training metadata and subjective experience/continuity. If a language model were sentient in each separate forward pass, I would imagine it would vaguely remember/recognize things from its training dataset without necessarily being able to place them, like a human when asked when they learned how to write the letter 'g'.

Interventions on the order of burning all GPUs in clusters larger than 4 and preventing any new clusters from being made, including the reaction of existing political entities to that event and the many interest groups who would try to shut you down and build new GPU factories or clusters hidden from the means you'd used to burn them, would in fact really actually save the world for an extended period of time and imply a drastically different gameboard offering new hopes and options.

I suppose 'on the order of' is the operative phrase here, but that specific scenario seems like it would be extremely difficult to specify an AGI for without disastrous side-effects, and like it still wouldn't be enough. Other, less efficient or less well developed forms of compute exist, and preventing humans from organizing to find a way around the GPU-burner's blacklist for unaligned AGI research while differentially allowing them to find a way to build friendly AGI seems like it would require a lot of psychological/political finesse on the GPU-burner's part. It's on the level of Ozymandias from Watchmen, but it's cartoonish supervillainy nonetheless.

I guess my main issue is a matter of trust. You can say the right words, as all the best supervillains do, promising that the appropriate cautions are taken above our clearance level. You've pointed out plenty of mistakes you could be making, and the ease with which one can make mistakes in situations such as yours, but acknowledging potential errors doesn't prevent you from making them. I don't expect you to have many people you would trust with AGI, and I expect that circle would shrink further if those people said they would use the AGI to do awful things iff it would actually save the world [in their best judgment]. I currently have no-one in the second circle.

If you've got a better procedure for people to learn to trust you, go ahead, but is there something like an audit you've participated in/would be willing to participate in? Any references regarding your upstanding moral reasoning in high-stakes situations that have been resolved? Checks and balances in case of your hardware being corrupted?

You may be the audience member rolling their eyes at the cartoon supervillain, but I want to be the audience member rolling their eyes at HJPEV when he has a conversation with Quirrell where he doesn't realise that Quirrell is evil.
