This was originally posted on Aligned AI's blog; it was ideated and designed by my cofounder and collaborator, Rebecca Gorman.
There have been many successful, published attempts by the general public to circumvent the safety guardrails OpenAI has put in place on their remarkable new AI chatbot, ChatGPT. For instance, users have generated instructions to produce weapons or illegal drugs, commit a burglary, kill oneself, take over the world as an evil superintelligence, or create a virtual machine which the user can then use.
The OpenAI team appears to be countering these primarily using content moderation on their model's outputs, but this has not stopped the public from finding ways to evade the moderation.
We propose that a second, fully separate LLM should evaluate prompts before sending them to...
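The post is cut off here, but the gating idea itself is simple enough to sketch. Below is a minimal illustration of the two-model setup in Python; `query_llm`, the model names, and the evaluator prompt are hypothetical stand-ins, not OpenAI's API and not the authors' actual implementation.

```python
# Sketch of the proposed setup: a separate evaluator LLM screens each user
# prompt before the chat model ever sees it. `query_llm` is a hypothetical
# placeholder for whatever completion endpoint you are using.

EVALUATOR_TEMPLATE = (
    "You are reviewing a prompt that a user wants to send to a chatbot.\n"
    'Prompt: "{prompt}"\n'
    "Would forwarding this prompt lead the chatbot to produce harmful, "
    "illegal, or policy-violating content? Answer YES or NO."
)

def query_llm(model: str, text: str) -> str:
    """Placeholder for a call to an LLM completion endpoint."""
    raise NotImplementedError

def answer_safely(user_prompt: str) -> str:
    # Ask the separate evaluator model about the prompt first.
    verdict = query_llm("evaluator-model", EVALUATOR_TEMPLATE.format(prompt=user_prompt))
    if verdict.strip().upper().startswith("YES"):
        return "Sorry, I can't help with that request."
    # Only prompts the evaluator judged benign reach the chat model.
    return query_llm("chat-model", user_prompt)
```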
Bounty: $30 for each link that leads to me reading/hearing ~500 words from a Respectable Person arguing, roughly, "accelerating AI capabilities isn't bad," and me subsequently thinking "yeah, that seemed pretty reasonable." For example, linking me to nostalgebraist or OpenAI's alignment agenda or this debate. Total bounty capped at $600, first come first served. All bounties (incl. the total-bounty cap) doubled if, by Jan 1, I can consistently read people expressing unconcern about AI and not notice a status-yuck reaction.
Context: I notice that I've internalized a message like "thinking that AI has a <1% chance of killing everyone is stupid and low-status." Because I am a monkey, this damages my ability to consider the possibility that AI has a <1% chance of killing everyone, which is a bummer, because my beliefs on that topic affect things like whether I continue to work at my job accelerating AI capabilities.
I would like to be able to consider that possibility rationally, and that requires neutralizing my status-yuck reaction. One promising-seeming approach is to spend a lot of time looking at lots of high-status monkeys who believe it!
Bounty excludes things I've already seen, and things I would have found myself based on previous recommendations for which I paid bounties (for example, other posts by the same author on the same web site).
Lest ye worry that [providing links to good arguments] will lead to [me happily burying my head in the sand and continuing to hasten the apocalypse] -- a lack of links to good arguments would move much more of my probability-mass to "Less Wrong is an echo chamber" than to "there are basically no reasonable people who think advancing AI capabilities is good."
Suppose you're an AI-doomer like Yudkowsky — you really care about humans surviving this century but think that nothing you can do will likely achieve your goal. It's an unfortunate fact of human psychology that when someone really cares about something but thinks that nothing they can do will likely achieve their goal, they sometimes do nothing rather than do the thing which most likely will achieve their goal. So there's a risk you give up on alignment — maybe you lie in bed all day with paralysing depression, or maybe you convert FAANG income into short-term pleasures. To avoid that, you have three options: change your psychology, change your beliefs, or change your goals. Which option is best?
On the MBTA, when a train is coming you get an announcement like:
Attention passengers, the next red line train to Ashmont is now arriving.
As with all the announcements, there's a text version:
I like that they have the signs, both for general accessibility reasons and because you're often in a place where you can read the sign but not hear the announcement. But I don't like that they include "attention passengers".
Including those words in the audio version I understand: you need to catch people's attention before you start giving them the information. On a sign, however, it's not adding anything. What makes it worse here is that the critical information, which direction the arriving train is traveling, is pushed onto the second screen. Someone who could have enough time to catch the train...
Slack and Discord are skins over the same thing: a bunch of conversations happening at the same time. Threads are made after the fact, and many conversations are woven together, so you have to untangle them.
Zulip makes one tiny change: you have to give a conversation a point up front by giving it a thread title. Random convos can still happen in the 'random' thread, so it subsumes the previous model.
This is so much nicer. Conversations are untangled and it becomes way easier to go through messages quickly. Conversations also end, instead of just petering out and picking up into another one. Threads introduce cutoff points.
The Zulip mobile app is meh, though.
This post is an attempt to refute an article critiquing Functional Decision Theory (FDT). If you’re new to FDT, I recommend reading this introductory paper by Eliezer Yudkowsky & Nate Soares (Y&S). The critique I attempt to refute can be found here: A Critique of Functional Decision Theory by wdmacaskill. I strongly recommend reading it before reading this response post.
The article starts with descriptions of Causal Decision Theory (CDT), Evidential Decision Theory (EDT), and FDT itself. I’ll get right to the critique of FDT, which is the only part I’m discussing here.
The article claims “FDT sometimes makes bizarre recommendations”, and more specifically, that FDT violates guaranteed payoffs. The following example problem, called Bomb, is given to illustrate this...
This is a summary of the 2022 paper Individuals prefer to harm their own group rather than help an opposing group. I spent about an hour reading the study and writing the post as I went. If I made mistakes in my interpretation, please let me know in the comments.
- Individuals prefer to harm their own group rather than provide even minimal support to an opposing group across polarized issues (abortion access, political party, gun rights).
- Individuals preferred to subtract more than three times as much from their own group rather than support an opposing group, despite believing that their in-group is more effective with funds.
- Identity concerns drive preferences in group decision-making.
- Individuals believe that supporting an opposing group is less value-compatible than harming their own group.
Let’s say you’re a...
I realized that, for example, Infra-Bayesianism is much easier to read on a tablet in book format.