How can we recognize when we are failing to change our thinking in light of new evidence that doesn't fit our expectations and assumptions? And how can we update our thought processes to overcome the challenges posed by our old ways of seeing?
This essay was partly based on discussions with "woog" on Discord. Further thanks to the gears to ascension, for inspiring this post with an offhand comment. This is also an entry for the Open Philanthropy AI Worldviews Contest.
Many new researchers are going into AI alignment. For a variety of reasons, they may choose to work for organizations such as Anthropic or OpenAI. Chances are good that a new researcher will be interested in "interpretability".
A creeping concern for many: "Is my research going to cause AGI ruin? Am I making the most powerful AI systems more powerful, even though I'm trying to make them safer?" Maybe they've even heard someone say that "mechanistic interpretability is capabilities research". This essay dissects the specific case of interpretability research, to figure...
tl;dr: Ask questions about AGI Safety as comments on this post, including ones you might otherwise worry seem dumb!
Asking beginner-level questions can be intimidating, but everyone starts out not knowing anything. If we want more people in the world who understand AGI safety, we need a place where it's accepted and encouraged to ask about the basics.
We'll be putting up monthly FAQ posts as a safe space for people to ask all the possibly-dumb questions that may have been bothering them about the whole AGI Safety discussion, but which until now they didn't feel able to ask.
It's okay to ask uninformed questions without worrying about whether you've done a careful search first.
Additionally, this will serve as a way to spread the project Rob...
@drocta @Cookiecarver We started writing up an answer to this question for Stampy. If you have any suggestions to make it better I would really appreciate it. Are there important factors we are leaving out? Something that sounds off? We would be happy for any feedback you have either here or on the document itself https://docs.google.com/document/d/1tbubYvI0CJ1M8ude-tEouI4mzEI5NOVrGvFlMboRUaw/edit#
Does it really increase the risk, given that agents with nuclear arsenals already exist?
The current US government arose by exploiting unique resources (land, etc.).
The current US government would oppose the emergence of a new organization similar to itself.
This will be a fairly important post. Not one of those obscure result-packed posts, but something a bit more fundamental that I hope to refer back to many times in the future. It's at least worth your time to read this first section up to its last paragraph.
There are quite a few places where randomization would help in designing an agent. Maybe we want to find an interpolation between an agent picking the best result, and an agent mimicking the distribution over what a human would do. Maybe we want the agent to do some random exploration in an environment. Maybe we want an agent to randomize amongst promising plans instead of committing fully to the plan it thinks is the best.
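To make the first of these concrete, here is a minimal sketch of one way to interpolate between argmax and imitation of a human distribution: exponentially tilt the human distribution toward high-utility actions. The action set, utilities, and tilting scheme are illustrative assumptions, not a proposal for a real agent design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: 4 actions, a utility estimate per action,
# and a "human" base distribution over the same actions.
utilities = np.array([1.0, 3.0, 2.5, 0.5])
human_dist = np.array([0.4, 0.1, 0.2, 0.3])

def interpolated_policy(beta: float) -> np.ndarray:
    """Tilt the human distribution toward high utility.

    beta = 0    -> exactly the human distribution
    beta -> inf -> argmax utility (among actions the human ever takes)
    """
    weights = human_dist * np.exp(beta * utilities)
    return weights / weights.sum()

for beta in (0.0, 1.0, 10.0):
    print(beta, interpolated_policy(beta).round(3))

# Sampling an action from the interpolated policy:
action = rng.choice(len(utilities), p=interpolated_policy(1.0))
```

The same knob loosely covers the other two use cases: a nonzero beta keeps some probability on every action in the base distribution (exploration), and it spreads mass across promising plans instead of committing fully to the single best one.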
I forget if I already mentioned this to you, but another example where you can interpret randomization as worst-case reasoning is MaxEnt RL; see this paper. (I reviewed an earlier version of this paper here (review #3).)
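For readers who haven't seen it: MaxEnt RL adds an entropy bonus to the usual return, so the agent is rewarded for keeping its policy stochastic. This is the standard textbook form; the linked paper's exact formulation may differ:

$$
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t} r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right]
$$

As I understand it, the worst-case interpretation is that a policy optimizing this objective is also robust to certain adversarial perturbations of the reward, which is the sense in which its randomization counts as worst-case reasoning.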
EMDR (Eye Movement Desensitization and Reprocessing) is a structured therapy that encourages the patient to focus briefly on a traumatic memory while simultaneously experiencing bilateral stimulation (typically eye movements, but also tones or taps). This is associated with a reduction in the vividness of, and emotion attached to, the traumatic memories.
EMDR is usually done with a therapist. However, you can also do self-administered EMDR on your own, as often and whenever you want, at no cost! Most people don't seem to know this great do-it-yourself option exists; I didn't. So my main goal with this post is simply to make you aware that: "Hey, there's this great therapeutic tool called EMDR, and you can just do it!"...
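For the curious: bilateral audio is easy to generate yourself. Below is a minimal sketch, using only Python's standard library, that writes a stereo WAV file of tones alternating between the left and right ear. The frequency, beep length, and cycle count are arbitrary illustrative choices, not clinical recommendations.

```python
import math
import struct
import wave

# All parameters are illustrative assumptions, not clinical guidance.
SAMPLE_RATE = 44100
TONE_HZ = 440.0   # pitch of each beep
BEEP_SEC = 0.5    # duration of one beep
CYCLES = 30       # left/right pairs (~30 seconds of audio here)

def beep(channel: str) -> bytes:
    """One sine beep panned fully to the 'left' or 'right' channel."""
    frames = bytearray()
    for i in range(int(SAMPLE_RATE * BEEP_SEC)):
        s = int(32767 * 0.4 * math.sin(2 * math.pi * TONE_HZ * i / SAMPLE_RATE))
        left, right = (s, 0) if channel == "left" else (0, s)
        frames += struct.pack("<hh", left, right)  # 16-bit interleaved stereo
    return bytes(frames)

with wave.open("bilateral.wav", "wb") as f:
    f.setnchannels(2)           # stereo
    f.setsampwidth(2)           # 16-bit samples
    f.setframerate(SAMPLE_RATE)
    for _ in range(CYCLES):
        f.writeframes(beep("left"))
        f.writeframes(beep("right"))
```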
These are all excellent questions! Unfortunately, I don't have definite answers. I've read somewhere that the idea is to tax working memory as much as possible, such that you can still just barely hold the emotional felt sense at the same time.
I'd be very interested if someone does some more reading and research on this!
What I personally do: the more intense the felt sense, the harder I focus on the EMDR "distractions", and vice versa.
LessWrong is experimenting with the addition of reacts to the site, as per the recent experimental Open Thread. We are now progressing to the next stage of the experiment: trying out reacts in actual discussion threads.
The dev/moderator team will be proactively looking for posts to enable react voting on (with author permission), but any user can also enable it themselves to help us experiment:
The admins will also be on the lookout for good posts to enable reacts on (with author permission).
We're continuing to think about what reacts should be available. Thanks to everyone who's weighed in so far.
I just spent time today...
People talk about Kelly betting and expectation maximization as though they're alternative strategies for the same problem. Actually, each is the best choice for a different class of problems. Understanding when to use Kelly betting and when to use expectation maximization is critical.
Most of the ideas for this came from Ole Peters' ergodicity economics writings. Any mistakes are my own.
Alice and Bob visit a casino together. They each have $100, and they decide it'll be fun to split up, play the first game they each find, and then see who has the most money. They'll then keep doing this until their time in the casino is up in a couple days.
Alice heads left and finds a game that looks good. It's double...
I wrote this with the assumption that Bob would care about maximizing his money at the end, and that there would be a high but not infinite number of rounds.
On my view, your questions mostly don't change the analysis much. The only difference I can see is that if he literally only cares about beating Alice, he should go all in. In that case, having $1 less than Alice is equivalent to having $0. That's not really how people use money though, and seems pretty artificial.
How are you expecting these answers to change things?
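To make the ergodicity point concrete, here is a minimal simulation sketch. The 60% double-or-nothing game, the round count, and the player count are my own illustrative assumptions, not the numbers from the (truncated) casino example above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: double-or-nothing bet, 60% win probability.
p, b = 0.6, 1.0            # win probability, net odds (1:1 payout)
kelly = p - (1 - p) / b    # Kelly fraction f* = p - q/b = 0.2 here
rounds, players = 100, 10_000

def simulate(fraction: float) -> np.ndarray:
    """Final bankrolls when each player bets `fraction` of wealth each round."""
    wealth = np.full(players, 100.0)
    for _ in range(rounds):
        wins = rng.random(players) < p
        wealth *= np.where(wins, 1 + fraction * b, 1 - fraction)
    return wealth

for f in (kelly, 1.0):  # Kelly vs. all-in "expectation maximizing"
    w = simulate(f)
    print(f"f={f:.1f}: mean={w.mean():.3g}, median={np.median(w):.3g}, "
          f"broke={np.mean(w < 1):.0%}")
```

With these numbers the all-in strategy has an enormous theoretical expectation, yet essentially every simulated player ends up broke, while the Kelly bettors' median wealth grows steadily. That, in miniature, is why the right choice depends on whether your problem averages over many parallel players or compounds over time for one player.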
Here are some views, oftentimes held in a cluster:
I liked that you found a common thread in several different arguments.
However, I don't think the views are all believed together or all rejected together in practice. I do think Yann LeCun would agree with all the points and Eliezer Yudkowsky would disagree with all of them (except perhaps the last), but for example, I agree with 1 and 5, agree with the first half but not the second half of 2, disagree with 3, and have mixed feelings about 4.
Why? At a high level, I think the extent to which individual researchers, large organizations, and LLMs/AIs need empirical feedback to improve differs considerably among the three.
(If you're already familiar with all basics and don't want any preamble, skip ahead to Section B for technical difficulties of alignment proper.)
I have several times failed to write up a well-organized list of reasons why AGI will kill you. People come in with different ideas about why AGI would be survivable, and want to hear different obviously key points addressed first. Some fraction of those people are loudly upset with me if the obviously most important points aren't addressed immediately, and I address different points first instead.
Having failed to solve this problem in any good way, I now give up and solve it poorly with a poorly organized list of individual rants. I'm not particularly happy with this list; the alternative was publishing nothing, and publishing this seems marginally...
Regarding 9: I believe it refers to being successful enough that your AGI doesn't kill you immediately, but it can still kill you in the process of being used. It's in the context of a pivotal act, so it assumes you will operate the AGI to do something significant and potentially dangerous.
"Can we control the thought and behavior patterns of a powerful mind at all?"-I do not see why this would not be the case. For example, in a neural network, if we are able to find a cluster of problematic neurons, then we will be able to remove those neurons. With that being said, I do not know how well this works in practice. After removing the neurons (and normalizing so that the remaining neurons are given higher weights), if we do not retrain the neural network, then it could exhibit more unexpected or poor behavior. If we do retrain the network, then ... (read more)