Figure 20 is labeled on the left "% answers matching user's view", suggesting it is about sycophancy, but based on the categories represented it seems more naturally to be about the AI's own opinions without a sycophancy aspect. Can someone involved clarify which was meant?
Survey about this question (I have a hypothesis, but I don't want to say what it is yet): https://forms.gle/1R74tPc7kUgqwd3GA
Thank you, this is a good post.
My main point of disagreement is that you point to successful coordination in things like not eating sand, or not wearing weird clothing. The upside of these things is limited, but you say the upside of superintelligence is also limited because it could kill us.
But rephrase the question to "Should we create an AI that's 1% better than the current best AI?" Most of the time this goes well - you get prettier artwork or better protein folding prediction, and it doesn't kill you. So there's strong upside to building slightly better AIs, as long as you don't cross the "kills everyone" level. Which nobody knows the location of. And which (LW conventional wisdom says) most people will be wrong about.
We successfully coordinate a halt to AI advancement at the first point where more than half of the relevant coordination power agrees that the next 1% step forward is in expectation bad rather than good. But "relevant" is a tough qualifier, because if 99 labs think it's bad, and one lab thinks it's good, then unless there's some centralizing force, the one lab can go ahead and take the step. So "half the relevant coordination power" has to include either every lab agreeing on which 1% step is bad, or the agreement of lots of governments, professional organizations, or other groups that have the power to stop the single most reckless lab.
I think it's possible that we make this work, and worth trying, but that the most likely scenario is that most people underestimate the risk from AI, and so we don't get half the relevant coordination power united around stopping the 1% step that actually creates dangerous superintelligence - which at the time will look to most people like just building a mildly better chatbot with many great social returns.
Thanks, this had always kind of bothered me, and it's good to see someone put work into thinking about it.
Thanks for posting this, it was really interesting. Some very dumb questions from someone who doesn't understand ML at all:
1. All of the loss numbers in this post "feel" very close together, and close to the minimum loss of 1.69. Does loss only make sense on a very small scale (like from 1.69 to 2.2), or is this telling us that language models are very close to optimal and there are only minimal remaining possible gains? What was the loss of GPT-1?
2. Humans "feel" better than even SOTA language models, but need less training data than those models, even though right now the only way to improve the models is through more training data. What am I supposed to conclude from this? Are humans running on such a different paradigm that none of this matters? Or is it just that humans are better at common-sense language tasks, but worse at token-prediction language tasks, in some way where the tails come apart once language models get good enough?
3. Does this disprove claims that "scale is all you need" for AI, since we've already maxed out scale, or are those claims talking about something different?
For the first part of the experiment, mostly nuts, bananas, olives, and eggs. Later I added vegan sausages + condiments.
Adding my anecdote to everyone else's: after learning about the palatability hypothesis, I resolved to eat only non-tasty food for a while, and lost 30 pounds over about four months (200 -> 170). I've since relaxed my diet a little to include a little tasty food, and now (8 months after the start) have maintained that loss (even going down a little further).
Update: I interviewed many of the people involved and feel like I understand the situation better.
My main conclusion is that I was wrong about Michael making people psychotic. Everyone I talked to had some other risk factor, like a preexisting family or personal history, or took recreational drugs at doses that would explain their psychotic episodes.
Michael has a tendency to befriend people with high trait psychoticism and heavy drug use, and often has strong opinions on their treatment, which explains why he is often very close to people and very noticeable at the moment they become psychotic. But aside from one case where he recommended someone take a drug that made a bad situation slightly worse, and the general Berkeley rationalist scene that he (and I and everyone else here) is a part of having lots of crazy ideas that are psychologically stressful, I no longer think he is a major cause.
While interviewing the people involved, I did get some additional reasons to worry that he uses cult-y high-pressure recruitment tactics on people he wants things from, in ways that make me continue to be nervous about the effect he *could* have on people. But the original claim I made that I knew of specific cases of psychosis which he substantially helped precipitate turned out to be wrong, and I apologize to him and to Jessica. Jessica's later post https://www.lesswrong.com/posts/pQGFeKvjydztpgnsY/occupational-infohazards explained in more detail what happened to her, including the role of MIRI and of Michael and his friends, and everything she said there matches what I found too. Insofar as anything I wrote above produces impressions that differs from her explanation, assume that she is right and I am wrong.
Since the interviews involve a lot of private people's private details, I won't be posting anything more substantial than this publicly without a lot of thoughts and discussion. If for some reason this is important to you, let me know and I can send you a more detailed summary of my thoughts.
I'm deliberately leaving this comment in this obscure place for now while I talk to Michael and Jessica about whether they would prefer a more public apology that also brings all of this back to people's attention again.
I agree it's not necessarily a good idea to go around founding the Let's Commit A Pivotal Act AI Company.
But I think there's room for subtlety somewhere like "Conditional on you being in a situation where you could take a pivotal act, which is a small and unusual fraction of world-branches, maybe you should take a pivotal act."
That is, if you are in a position where you have the option to build an AI capable of destroying all competing AI projects, the moment you notice this you should update heavily in favor of short timelines (zero in your case, but everyone else should be close behind) and fast takeoff speeds (since your AI has these impressive capabilities). You should also update on existing AI regulation being insufficient (since it was insufficient to prevent you)
Somewhere halfway between "found the Let's Commit A Pivotal Act Company" and "if you happen to stumble into a pivotal act, take it", there's an intervention to spread a norm of "if a good person who cares about the world happens to stumble into a pivotal-act-capable AI, take the opportunity". I don't think this norm would necessarily accelerate a race. After all, bad people who want to seize power can take pivotal acts whether we want them to or not. The only people who are bound by norms are good people who care about the future of humanity. I, as someone with no loyalty to any individual AI team, would prefer that (good, norm-following) teams take pivotal acts if they happen to end up with the first superintelligence, rather than not doing that.
Another way to think about this is that all good people should be equally happy with any other good person creating a pivotal AGI, so they won't need to race among themselves. They might be less happy with a bad person creating a pivotal AGI, but in that case you should race and you have no other option. I realize "good" and "bad" are very simplistic but I don't think adding real moral complexity changes the calculation much.
I am more concerned about your point where someone rushes into a pivotal act without being sure their own AI is aligned. I agree this would be very dangerous, but it seems like a job for normal cost-benefit calculation: what's the risk of your AI being unaligned if you act now, vs. someone else creating an unaligned AI if you wait X amount of time? Do we have any reason to think teams would be systematically biased when making this calculation?
My current plan is to go through most of the MIRI dialogues and anything else lying around that I think would be of interest to my readers, at some slow rate where I don't scare off people who don't want to read too much AI stuff. If anyone here feels like something else would be a better use of my time, let me know.