Connor Leahy on Dying with Dignity, EleutherAI and Conjecture

Michaël Trazzi

I talked to Connor Leahy about Yudkowsky's antimemes in Death with Dignity, common misconceptions about EleutherAI and his new AI Alignment company Conjecture.

Below are some highlighted quotes from our conversation (available on Youtube, Spotify, Google Podcast, Apple Podcast). For the full context for each of these quotes, you can find an accompanying transcript, organized in 74 sub-sections.

Understanding Eliezer Yudkowsky

Eliezer Has Been Conveying Antimemes

“Antimemes are completely real. There's nothing supernatural about it. Most antimemes are just things that are boring. So things that are extraordinarily boring are antimemes because they, by their nature, resist you remembering them. And there's also a lot of antimemes in various kinds of sociological and psychological literature. A lot of psychology literature, especially early psychology literature, which is often very wrong to be clear. Psychoanalysis is just wrong about almost everything. But the writing style, the kind of thing these people I think are trying to do is they have some insight, which is an antimeme. And if you just tell someone an antimeme, it'll just bounce off them. That's the nature of an antimeme. So to convey an antimeme to people, you have to be very circuitous, often through fables, through stories you have, through vibes. This is a common thing.
Moral intuitions are often antimemes. Things about various human nature or truth about yourself. Psychologists, don't tell you, "Oh, you're fucked up, bro. Do this." That doesn't work because it's an antimeme. People have protection, they have ego. You have all these mechanisms that will resist you learning certain things. Humans are very good at resisting learning things that make themselves look bad. So things that hurt your own ego are generally antimemes. So I think a lot of what Eliezer does and a lot of his value as a thinker is that he is able, through however the hell his brain works, to notice and comprehend a lot of antimemes that are very hard for other people to understand.”

Why the Dying with Dignity Heuristic is Useful

“The whole point of the post is that if you do that, and you also fail the test by thinking that blowing TSMC is a good idea, you are not smart enough to do this. Don't do it. If you're smart enough, you figured out that this is not a good idea... Okay, maybe. But most people, or at least many people, are not smart enough to be consequentialists. So if you actually want to save the world, you actually want to save the world... If you want to win, you don't want to just look good or feel good about yourself, you actually want to win, maybe just think about dying with dignity instead. Because even though you, in your mind, don't model your goal as winning the world, the action that is generated by the heuristic will reliably be better at actually saving the world.”

“There's another interpretation of this, which I think might be better where you can model people like AI_WAIFU as modeling timelines where we don't win with literally zero value. That there is zero value whatsoever in timelines where we don't win. And Eliezer, or people like me, are saying, 'Actually, we should value them in proportion to how close to winning we got'. Because that is more healthy... It's reward shaping! We should give ourselves partial reward for getting partially the way. He says that in the post, how we should give ourselves dignity points in proportion to how close we get.
And this is, in my opinion, a much psychologically healthier way to actually deal with the problem. This is how I reason about the problem. I expect to die. I expect this not to work out. But hell, I'm going to give it a good shot and I'm going to have a great time along the way. I'm going to spend time with great people. I'm going to spend time with my friends. We're going to work on some really great problems. And if it doesn't work out, it doesn't work out. But hell, we're going to die with some dignity. We're going to go down swinging.”

"If you have to solve an actually hard problem in the actual real world, in actual physics, for real, an actual problem, that is actually hard, you can't afford to throw your epistemics out the door because you feel bad. And if people do this, they come up with shit like, 'Let's blow up to TSMC'. Because they throw their epistemics out the window and like, 'This feels like something. Something must be done and this is something, so therefore it must be done'."

EleutherAI

Why training GPT-3 Size Models made sense

“Well, I remember having these conversations with some people in the alignment sphere, where they're like, "Oh well, why did you build the models? Just use GPT-2, that's fine." I'm like, "Well, okay, what if I want to see the bigger properties?" And they'll be like, "They'll probably exist in the smaller models too or something. Name three experiments you're going to do with this exact model." And I'm like, "I could come up with three, sure. But that's kind of missing the point." The point is: we should just really stare at these things really fucking hard. And turns out, in my experience, that was a really good idea. Most of my knowledge, my competitive advantage is gained from that period of just actually building the things, actually staring at them really hard and not just knowing about the OpenAI API existing and reading the papers. There's a lot of knowledge you can get from reading a handbook, but actually running the machine will teach you a lot of things.”

EleutherAI Spread Alignment Memes in the ML World

"One of the important parts of my threat model is that I think 99% of the damage from GPT-3 was done the moment the paper was published. And, as they say about the nuclear bomb, the only secret was that it was possible. And I think there's a bit of naivety that sometimes goes into these arguments, where people are, 'Well, EleutherAI accelerated things, they drew attention to the meme'. And I think there's a lot of hindsight bias there, in that people don't realize how everyone knew about this, except the alignment community. Everyone at OpenAI, Google Brain and DeepMind. People knew about this, and they figured it out fucking fast."

"One of the things that EleutherAI did, and this was very much intentional, is that it created a space that is open to the wider ML community and their norms. It is respectful of AI researchers and their norms. And we also have street cred, in the sense that we are ML researchers and we're not just some dude talking about logical induction or whatever, but still has a very strong alignment meme. Alignment is high status. It is a respectful thing to talk about, a thing to take seriously. It is not some weird thing some people in Berkeley think about. It is a serious topic of serious intrigue. And for what it's worth, of the five core people at EleutherAI that changed their job as a direct consequence of EleutherAI, four went into alignment."

"I'm not saying, was it a resounding success? Did it do everything I wanted? No. It could always have been better. But I like to believe that there was a positive magnetic contagion that happened there. As I say, a lot of people that I know, that were an ML, started taking alignment seriously. I know several professors at several universities that'd gone to EleutherAI through the scaling memes, and then became convinced that this alignment thing seems important potentially."

On the Policy and Impact of EleutherAI's Open Source

"Our official position, which you can read in our blog, which has always been there, is that not everything should be released. And in fact, we, EleutherAI, discovered at least two capabilities advancements ahead of anyone else in the world, and we successfully kept them secret, because we were like "Oh shit". One is the chain of thought prompting idea, which we then later published. I believe I showed Eliezer the pre-draft. So he may be able to confirm that I'm not bullshitting you on this. I think it was Eliezer that I showed that to. And so in that regard, I fully understand why people think this, because that's a default open-source thing. And there're several other open-source groups now, that have split off from Eleuther or they're distant cousins of Eleuther, that do think this way. I strongly disagree with them. And I think that what they're making is not a good idea. It was always contingent. EleutherAI's policy was always "we think this specific thing should be open". Not all things should be open, but this specific thing that we are thinking about right now, that we're talking about right now, this specific thing we think should be open for this, this, this and this reason. But there are other things which we may or may not encounter, which shouldn't be open. We made very clear if we ever had a quadrillion parameter model for some reason, we would not release it."

"Again, I want to be very clear here. It may have been a mistake to release GPT-J. It may have been a mistake. I don't think it is one, for various contingent reasons, but I'm not ideologically committed to the idea that this was definitely the right thing to do. I think given the evidence that I've seen, for example, GPT-J being used in some of my favorite interpretability papers, such as the Editing and Eliciting Knowledge paper from David Bau's lab, which is an excellent paper, and you really should read. And several other groups such as Redwood using GPT-Neo models in their research and such. I think that there are a lot of reasons why this was helpful to some people, this was good. Also, the tacit knowledge that we've gained has been very instrumental for setting up Conjecture and what I do now. So I think there are reasons why it was good, but I could be wrong about this. Again, if people disagree with me about that, I think I disagree, but I think that it's not insane."

Conjecture

How Conjecture Started

"So Conjecture grew a lot out of some of the bottlenecks I found while working in EleutherAI. So EleutherAI was great. I love the people there and such. Anyway, we had a lot of great people and such. But if you wanted to get something done, it was like herding cats. But imagine the cats also have crippling ADHD and are the smartest people you've ever met. Especially if anything boring needed to get done, if we needed to fix some bugs or scrape some data or whatever, it would very often just not get done. Because it was all volunteer based, right? You wanted to do fun things. It's your free time. People don't want to do boring shit. During the pandemic it was a bit different, because people literally didn't have anything really to do. But now you have a social life again, you have a job. And then you don't want to come home and spend two hours debugging some goddamn race condition or whatever."

"So, the idea was first floated very early in EleutherAI, but I put that completely on ice. I didn't want to do that. I wanted to just focus on open-source and such. So it became really concrete around late 2021, September-October I think, when Nat Friedman, who was the CEO of GitHub at the time, approaches EleutherAI and says, 'Hey, I love what you guys are doing. It's super awesome. Can help you with anything? You want to meet up sometime?'. And, to add to his credit, he donated a bunch of money to help EleutherAI to keep going. A man of his word. And he happened to be in Germany at the time, which was where I was as well. And he was, 'Hey, do you want to meet up for a coffee?' And so we met up, really got along, and he was, 'Hey, you ever thought of doing a company or something?' 'Now, I have been thinking about that.' 'Why don't you just come by the Bay sometime and talk' and such. And so I was thinking, 'Oh cool, I can go to the Bay and I can...' So it was a confluence of factors, right? It was an excuse to go to the Bay to talk to both Nat and his friends, but also talk to Open Phil and potential EA funders and stuff like that. And also, I was getting on EleutherAI, I was hitting those bottlenecks I was talking about, where I was trying to do research on EleutherAI but it just wasn't working."

Where Conjecture Fits in the AI Alignment Landscape

"Conjecture differs from many other orgs in the field by various axes. So one of the things is that we take short timelines very seriously. There's a lot of people here and there that definitely entertain the possibility of short timelines or think it's serious or something. But no real org that is fully committed to five year timelines, and act accordingly. And we are an org that takes this completely seriously. Even if we just have 30% on it happening, that is enough in our opinion, to be completely action relevant. Just because there are a lot of things you need to do if this is true, compared to 15-year timelines, that no one's doing, that it seems it's worth trying. So we have very short timelines. We think alignment is very hard. So the thing where we disagree with a lot of other orgs, is we expect alignment to be hard, the kind of problem that just doesn't get solved by default. That doesn't mean it's not solvable. So where I disagree with Eliezer is that, I do think it is solvable... he also thinks it's solvable. He just doesn't think it's solvable in time, which I do mostly agree on. So I think if we had a hundred years time, we would totally solve this. This is a problem that can be solved, but doing it in five years with almost no one working on it, and also we can't do any tests with it because if we did a test, and it blows up, it's already too late, et cetera, et cetera... There's a lot of things that make the problem hard."

"One of the positive things that I've found is just, no matter where I go, the people working in the AGI space specifically are overwhelmingly very reasonable people. I may disagree with them, I think they might be really wrong about various things, but they're not insane evil people, right? They have different models of how reality works from me, and they're like... You know, Sam Altman replies to my DMs on Twitter, right? [...] I very strongly disagree with many of his opinions, but the fact that I can talk to him is not something we should have taken for a given. This is not the case in many other industries, and there's many scenarios where this could go away, and we don't have this thing that everyone in the space knows each other, or can call each other even. So I may not be able to convince Sam of my point of view. The fact I can talk to him at all is a really positive sign, and a sign that I would not have predicted two years ago."

Why Conjecture is Doing Interpretability Research

"I think it's really hard for modern people to put themselves into an epistemic state of just how it was to be a pre-scientific person, and just how confusing the world actually looked. And now even things that we think of as simple, how confusing they are before you actually see the solution. So I think it is possible, not guaranteed or even likely, but it's possible, that such discoveries could not be far down the tech tree, and that if we just come at things from the right direction, we try really hard, we try new things, that we would just stumble upon something where we're just like, 'Oh, this is okay, this works. This is a frame that makes sense. This deconfuses the problem. We're not so horribly confused about everything all the time.'"

Conjecture Approach To Solving Alignment

"If you need to roll high, roll many dice. At Conjecture, the ultimate goal is to make a lot of alignment research happen, to scale alignment research, to scale horizontally, to tile research teams efficiently, to take in capital and convert that into efficient teams with good engineers, good op support, access to computers, et cetera, et cetera, trying different things from different direction, more decorrelated bets."

"To optimize the actual economy is just computationally impossible. You would have to simulate every single agent, every single thing, every interaction, just impossible. So instead what they do is, they identify a small number of constraints that, if these are enforced, successfully shrink the dimension of optimization down to become feasible to optimize within. [...] If you want to reason about how much food will my field produce, monoculture is a really good constraint. By constraining it by force to only be growing, say, one plant, you simplify the optimization problem sufficiently that you can reason about it. I expect solutions to alignment, or, at least the first attempts we have at it, to look kind of similar like this. It'll find some properties. It may be myopia or something, that, if enforced, if constrained, we will have proofs or reasons to believe that neural networks will never do X, Y, and Z. So maybe we'll say, 'If networks are myopic and have this property and never see this in the training data, then because of all this reasoning, they will never be deceptive.' Something like that. Not literally that, but something of that form."

"There is this meme, which is luckily not as popular as it used to be, but there used to be a very strong meme that neural networks are these uninterpretable black boxes. [...] That is just actually wrong. That is just legitimately completely wrong, and I know this for a fact. There is so much structure inside of neural networks. Sure, some of it is really complicated and not obviously easy to understand for a human, but there is so much structure there, and there are so many things we can learn from actually really studying these internal parts... again, staring at the object really hard actually works."

On being non-disclosure by default

"We are non-disclosure by default, and we take info hazards and general infosec and such very seriously. So the reasoning here is not that we won't ever publish anything. I expect that we will publish a lot of the work that we do, especially the interpretability work, I expect us to publish quite a lot of it, maybe mostly all of it, but the way we think about info hazards or general security and this kind of stuff, is that we think it's quite likely that there are relatively simple ideas out there that may come up during the doing of prosaic alignment research that cannot really increase capabilities, that we are messing around with a neural network to try to make it more aligned, or to make it more interpretable or something, and suddenly, it goes boom, and then suddenly it's five times more efficient or something. I think things like this can and will happen, and for this reason, it's very important for us to... I think of info hazard policy, kind of like wearing a seatbelt. It's probably where we'll release most of our stuff, but once you release something into the wild, it's out there. So by default, before we know whether something is safe or not, it's better just to keep our seat belt on and just keep it internal. So that's the kind of thinking here. It's a caution by default. I expect us to work on some stuff that probably shouldn't be published. I think a lot of prosaic alignment work is necessarily capabilities enhancing, making a model more aligned, a model that is better at doing what you wanted to do, almost always makes the model stronger."

"I want to have an organization where it costs you zero social capital to be concerned about keeping something secret. So for example, with the Chinchilla paper, what I've heard is, inside of DeepMind, there was quite a lot of pushback against keeping it secret. Apparently, the safety teams wanted to not publish it, and they got a lot of pushback from the capabilities people because they wanted to publish it. And that's just a dynamic I don't want to exist at Conjecture. I want to be the case that the safety researchers say "Hey, this is kind of scary. Maybe we shouldn't publish it" and that is completely fine. They don't have to worry about their jobs. They still get promotions, and it is normal and okay to be concerned about these things. That doesn't mean we don't publish things. If everyone's like, "Yep, this is good. This is a great alignment tool. We should share this with everybody," then we'll release, of course."

On Building Products as a For-Profit

"The choice to be for profit is very much utilitarian. So it's actually quite funny that on FTX future funds' FAQ, they actually say they suggest to many non-profits to actually try to be for profits if they can. Because this has a lot of good benefits such as being better for hiring, creating positive feedback loops and potentially making them much more long-term sustainable. So the main reason I'm interested [in being a for-profit] is long term sustainability and the positive feedback loops, and also the hiring is nice. So I think there's like a lot of positive things about for-profit companies. There's a lot of negative things, but like it's also a lot of positive things and a lot of negative things with non-profits too, that I think get slipped under the rug in EA. Like in EA it feels like the default is a non-profit and you have to justify going outside of the Overton window."

"The way I think about products at the moment is, I basically think that there are the current state-of-the-art models that have opened this exponentially large field of possible new products that has barely been tapped. GPT-3 opens so many potential useful products that just all will make profitable companies and someone has to pick them. I think without pushing the state of the art at all, we can already make a bunch of products that will be profitable. And most of them are probably going to be relatively boring [...] You want to do a SaaS product, something that helps you with some business task. Something that helps you make a process more efficient inside of a company or something like that. There' tons of these things, which are just like not super exciting, but they're like useful."

Scaling The Alignment Field

"Our advertising quote, unquote, is just like one LessWrong post that was like, "Oh, we're hiring". Right? And we got a ton of great application. Like the signal to noise was actually wild. Like one in three applications were just really good, which like never happens. So, like, incredible. So we got to hire some really phenomenal people for our first hiring round. And so at this point we're already basically at a really enviable position. I mean, it's like, it's annoying, but it's a good problem to have, where we're basically already funding constrained. We're at the point where I have people I want to hire projects for them to do and the management capacity to handle them. And I just don't have the funding at the moment to hire them."

"Conjecture is an organization that is directly tackling the alignment problem and we're a de-correlated bet from the other ones. I'm glad, I'm super glad that Redwood and Anthropic are doing the things they do, but they're kind of doing a very similar direction of alignment research. We're doing something very different and we're doing it at a different location. We have access to a whole new talent pool of European talent that cannot come to the US. We get a lot of new people into the field. We also have the EleutherAI people coming in, different research directions and de-correlated bets. And we can scale. We have a lot of operational capacity, a lot of experience and also entrepreneurial vigor."

“There's another interpretation of this, which I think might be better where you can model people like AI_WAIFU as modeling timelines where we don't win with literally zero value. That there is zero value whatsoever in timelines where we don't win. And Eliezer, or people like me, are saying, 'Actually, we should value them in proportion to how close to winning we got'. Because that is more healthy... It's reward shaping! We should give ourselves partial reward for getting partially the way. He says that in the post, how we should give ourselves dignity points in proportion to how close we get.
And this is, in my opinion, a much psychologically healthier way to actually deal with the problem. This is how I reason about the problem. I expect to die. I expect this not to work out. But hell, I'm going to give it a good shot and I'm going to have a great time along the way. I'm going to spend time with great people. I'm going to spend time with my friends. We're going to work on some really great problems. And if it doesn't work out, it doesn't work out. But hell, we're going to die with some dignity. We're going to go down swinging.”

I'm not entirely sure on the metaphysics here, but an additional possible point is that in Many Worlds or similar big universes, there is some literal payoff of "us trying hard and getting pretty close in one universe means there are more nearby universes that succeeded."

Is there a way we could get sure on the metaphysics here? It feels like it's an important thing to know if it actually happens to be true.

I like this comment, and I personally think the framing you suggest is useful. I'd like to point out that, funnily enough, in the rest of the conversation ( not in the quotes unfortunately) he says something about the dying with dignity heuristic being useful because humans are (generally) not able to reason about quantum timelines.

Edit+Disclaimer: I keep going back and forth on whether or not posting this comment was good on net. I think more people should take stabs at the alignment problem in their own idiosyncratic way, and this is a very niche criticism guarding against a hypothetical failure mode that I'm not even really sure exists. I think I'm going to settle on retracting this but leaving up because it's fundamentally criticizing someone who is doing good that I'm not doing and I don't like doing that. If you really want to read this you can figure out how to remove HTML strikethroughs with inspect element.

I know saying "don't let this bother you" doesn't actually not let something bother you, but please don't let this dissuade you from possibly making earnest attempts on the alignment solution.

Most ambitious people start their planning by seeing how best they can apply their tools to optimize their status. That they might be arbitrarily famous and rich and powerful, at least among some sufficiently large in-group, is the most important deciding factor in what they do.

On top of that very honed optimization process they have these secondary constraints, which are mostly a function of their rationality and ethics, and those constraints say that their mind has to be in this sufficiently deep "I am a good person" well, which their better selves cannot or will not argue them out of. But that the ambitious human's plan is one that's moves their status upwards, comes before all other considerations. Adopting a plan where they obviously remain a genuinely "average" player in the status game or god forbid below average, means condemning them to a life of despair; it's intolerable. This is the origin story of "earning to give", the general form of which is "pursue status first and then maybe later somehow use it to do good as a tertiary objective".

Relying on the ethics-oracle to be effective enough to steer an arbitrarily ambitious human is basically playing a losing game; the architecture is working against you. It's like trying to prevent an AGI from turning the world into paperclips by making sure it has to win a debate with a "good enough" ethics professor first. All that happens in the thousand foot view is that there's some back and forth and now the ethics professor full-throatedly believes in the virtue of a paperclipped earth.

When this goes right, it's either because the ambitious human's rationality and sense of honor is so far ahead of their IQ that they stop raising their status (think: willingly chooses to put their hand on the stove), or the human miraculously possesses some particular set of useful skills such that their status-maximizing-plan becomes correlated with doing good. In particular it's pretty much impossible to become rich and famous and well connected and liked by your peers without, on some level, spending most of your available resources specifically on a status race, and competing with the other people who are spending most of their available resources on the status race. Doing so "without trying" is as plausible as getting the world record in deadlifting "without trying".

And sometimes it does in fact go right! I'm not saying We The Commoners should punish the status rabbits, or, god forbid, demand motive ambiguity, when they end up doing something good like inventing PageRank or starting an obviously positive existential risk initiative for us or whatever. I am glad Eliezer wrote the sequences, whatever his motives.

But Connor, here's where the conversation gets... difficult. From my 1000 ft. perspective, it seems to me like literally everything you've done so far is compatible with this story:

You wake up one morning, an ambitious human with an aptitude & interest in Machine Learning.
"How do I raise my status?"
whirr, click
"I can't figure out how to do that except by publishing ML research or models lots of other people use, and thereby accelerating timelines."
whirring of the brain searching for an argument and/or edge case in which publishing large ML models is ackchually not bad
"Aha! I can start 'earning to give'; namely, create a very effective capabilities organization and then leverage the status I get from doing so into becoming an 'influencer' and propagating alignment memes"
[Various things happen]
"How do I raise my status?"
Aha! starts a venture funded for-profit technology company.

There's probably a plausible case that EleutherAI was net-positive and I really hesitate on saying things like "you probably shouldn't have expected it would be" even though that's my inside view. I'm definitely not saying you, Connor Leahy, should specifically be the one guy expected to optimize for the social good instead of your own self interest. But it does concern me, that you have this reasoning process where you seem to always land on doing the thing that raises your coinage, and now you're starting an alignment org where this mode of operation is typically incompatible with the goal of not getting us all killed.

tl-dr: people change their minds, reasons why things happen are complex, we should adopt a forgiving mindset/align AI and long-term impact is hard to measure. At the bottom I try to put numbers on EleutherAI's impact and find it was plausibly net positive.

I don't think discussing whether someone really wants to do good or whether there is some (possibly unconscious?) status-optimization process is going to help us align AI.

The situation is often mixed for a lot of people, and it evolves over time. The culture we need to have on here to solve AI existential risk need to be more forgiving. Imagine there's a ML professor who has been publishing papers advancing the state of the art for 20 years who suddenly goes "Oh, actually alignment seems important, I changed my mind", would you write a LW post condemning them and another lengthy comment about their status-seeking behavior in trying to publish papers just to become a better professor?

I have recently talked to some OpenAI employee who met Connor something like three years ago, when the whole "reproducing GPT-2" thing came about. And he mostly remembered things like the model not having been benchmarked carefully enough. Sure, it did not perform nearly as good on a lot of metrics, though that's kind of missing the point of how this actually happened? As Connor explains, he did not know this would go anywhere, and spent like 2 weeks working on, without lots of DL experience. He ended up being convinced by some MIRI people to not release it, since this would be establishing a "bad precedent".

I like to think that people can start with a wrong model of what is good and then update in the right direction. Yes, starting yet another "open-sourcing GPT-3" endeavor the next year is not evidence of having completely updated towards "let's minimize the risk of advancing capabilities research at all cost", though I do think that some fraction of people at EleutherAI truly care about alignment and just did not think that the marginal impact of "GPT-Neo/-J accelerating AI timelines" justified not publishing them at all.

My model for what happened for the EleutherAI story is mostly the ones of "when all you have is a hammer everything looks like a nail". Like, you've reproduced GPT-2 and you have access to lots of compute, why not try out GPT-3? And that's fine. Like, who knew that the thing would become a Discord server with thousands of people talking about ML? That they would somewhat succeed? And then, when the thing is pretty much already somewhat on the rails, what choice do you even have? Delete the server? Tell the people who have been working hard for months to open-source GPT-3 like models that "we should not publish it after all"? Sure, that would have minimized the risk of accelerating timelines. Though when trying to put number on it below I find that it's not just "stop something clearly net negative", it's much more nuanced than that.

And after talking to one of the guys who worked on GPT-J for hours, talking to Connor for 3h, and then having to replay what he said multiple times while editing the video/audio etc., I kind of have a clearer sense of where they're coming from. I think a more productive way of making progress in the future is to look at what the positive and negative were, and put numbers on what was plausibly net good and plausible net bad, so we can focus on doing the good things in the future and maximize EV (not just minimize risk of negative!).

To be clear, I started the interview with a lot of questions about the impact of EleutherAI, and right now I have a lot more positive or mixed evidence for why it was not "certainly a net negative" (not saying it was certainly net positive). Here is my estimate of the impact of EleutherAI, where I try to measure things in my 80% likelihood interval for positive impact for aligning AI, where the unit is "-1" for the negative impact of publishing the GPT-3 paper. eg. (-2, -1) means: "a 80% change that impact was between 2x GPT-3 papers and 1x GPT-3 paper".

Mostly Negative
-- Publishing the Pile: (-0.4, -0.1) (AI labs, including top ones, use the Pile to train their models)
-- Making ML researchers more interested in scaling: (-0.1, -0.025) (GPT-3 spread the scaling meme, not EleutherAI)
-- The potential harm that might arise from the next models that might be open-sourced in the future using the current infrastructure: (-1, -0.1) (it does seem that they're open to open-sourcing more stuff, although plausibly more careful)

Mixed
-- Publishing GPT-J: (-0.4, 0.2) (easier to finetune than GPT-Neo, some people use it, though admittedly it was not SoTA when it was released. Top AI labs had supposedly better models. Interpretability / Alignment people, like at Redwood, use GPT-J / GPT-Neo models to interpret LLMs)

Mostly Positive
-- Making ML researchers more interested in alignment: (0.2, 1) (cf. the part when Connor mentions ML professors moving to alignment somewhat because of Eleuther)
-- Four of the five core people of EleutherAI changing their career to work on alignment, some of them setting up Conjecture, with tacit knowledge of how these large models work: (0.25, 1)
-- Making alignment people more interested in prosaic alignment: (0.1, 0.5)
-- Creating a space with a strong rationalist and ML culture where people can talk about scaling and where alignment is high-status and alignment people can talk about what they care about in real-time + scaling / ML people can learn about alignment: (0.35, 0.8)

Averaging these ups I get (if you could just add confidence intervals, I know this is not how probability work) a 80% chance of the impact being in: (-1, 3.275), so plausibly net good.

Like, who knew that the thing would become a Discord server with thousands of people talking about ML? That they would somewhat succeed? And then, when the thing is pretty much already somewhat on the rails, what choice do you even have? Delete the server? Tell the people who have been working hard for months to open-source GPT-3 like models that "we should not publish it after all"?

I think this eloquent quote can serve to depict an important, general class of dynamics that can contribute to anthropogenic x-risks.

I funnily enough ended up retracting the comment around 9 minutes before you posted yours, triggered by this thread and the concerns you outlined about this sort of psychologizing being unproductive. I basically agree with your response.

I don't think discussing whether someone really wants to do good or whether there is some (possibly unconscious?) status-optimization process is going to help us align AI.

Two comments:

[wanting to do good] vs. [one's behavior being affected by an unconscious optimization for status/power] is a false dichotomy.
Don't you think that unilateral interventions within the EA/AIS communities to create/fund for-profit AGI companies, or to develop/disseminate AI capabilities, could have a negative impact on humanity's ability to avoid existential catastrophes from AI?

First point: by "really want to do good" (the really is important here) I mean someone who would be fundamentally altruistic and would not have any status/power desire, even subconsciously.

I don't think Conjecture is an "AGI company", everyone I've met there cares deeply about alignment and their alignment team is a decent fraction of the entire company. Plus they're funding the incubator.

I think it's also a misconception that it's an unilateralist intervension. Like, they've talked to other people in the community before starting it, it was not a secret.

First point: by "really want to do good" (the really is important here) I mean someone who would be fundamentally altruistic and would not have any status/power desire, even subconsciously.

Then I'd argue the dichotomy is vacuously true, i.e. it does not generally pertain to humans. Humans are the result of human evolution. It's likely that having a brain that (unconsciously) optimizes for status/power has been very adaptive.

Regarding the rest of your comment, this thread seems relevant.

I don't know exactly what goes into the decision between for-profit vs nonprofit, or whether Conjecture's for-profit status was the right decision, but I do want to suggest that it's not as simple as "for-profit means I plan to make money, nonprofit means I plan to benefit the world".

I used to work at a nonprofit in the military-industrial complex in the USA; there was almost no day-to-day difference between what we were doing versus what (certain units within) for-profits like Raytheon were doing. Our CEO still had a big salary, we still were under pressure to maximize revenues and minimize costs, we competed head-to-head for many of the same customers, etc.

If there’s a for-profit that has only a small set of investors/shareholders, and none of them are pressuring the firm to have a present or future profit (as I assume is the case for Conjecture), then I think there isn't really a huge philosophical difference between that versus a nonprofit; I think it just amounts to various tax and regulatory advantages and disadvantages that trade off against each other. Someone can correct me if I'm wrong.

I think this comment is getting enough vote & discussion heat for me to feel the merit in clarifying with the following statements:

Most of the value in the world is created by ambitious people. It's not evil to have strong preferences.
If the average ML engineer or entrepreneur had the ethics-oracle of Connor Leahy, we would be in much better shape. Most seem to be either not f(curious,honest) enough or principled enough to even make a plausible attempt at not ending the world. Sam Altman needs to hear the above lecture-rant 10,000 times before Connor needs to hear it once. The reason I write this comment on a post about Connor Leahy is solely because of convenience, because I can't actually talk to Sam Altman, and because I felt like writing one after reading the above post and not because he's even within 3 orders of magnitude of the worst examples of this behavior.

I strong downvoted your comment in both dimensions because I found it disagreeable and counterproductive. This kind of "Kremlinology of the heart" is toxic and demoralizing. It's why I never ever bother to do anything motivated by altruism: because I know when I start trying to do the right thing, I'll get attacked by people who think they know what's in my heart. When I openly act in selfish self-interest, nobody has anything to say about it, but any time I do selfless things, people start questioning my motives; it's clear what I'm incentivized to do. If you really want people to do good thing, don't play status games like this. Incentivize the behavior you want.

I feel unsure about the merits of this for other contexts (because it can indeed create a toxic atmosphere), but I think there are specific contexts where scrutinizing someone's decision-making algorithm seems particularly important:

Somewhat unilateral pursuit of an activity with high downside risk
Position of high influence without much accountability or legibility for outsiders to give useful criticism

Heading an alignment organization with strong information security where you have enough control so that it's unusual compared to other organizations fulfils both criteria.

So, I'd say that not discussing the topic in contexts similar to this one would be a mistake.

I'd add to that bullet list:

Severe conflicts of interest are involved.

I strong downvoted your comment in both dimensions because I found it disagreeable and counterproductive.

Generally, I think it would be net-negative to discourage such open discussions about unilateral, high-risk interventions—within the EA/AIS communities—that involve conflicts of interest. Especially, for example, unilateral interventions to create/fund for-profit AGI companies, or to develop/disseminate AI capabilities.

You know what, I've retracted the comment because frankly you're probably right. Even if what I said is literally true, attacking the decision making architecture of Connor Leahy when he's basically doing something good is not two of (true, kind, necessary). It makes people sad, it's the kind of asymmetric justice thing I hate, and I don't want to normalize it. Even when it's attached with disclaimers or say "I'm just attacking you just cuz bro don't take it personal."

Most VC-types are easier to get a hold of than you think. They're sort of in the business of being easy to get a hold of by smart weirdos. If you think you have something to say to him that might change his mind, there's a good shot he'll read your cold email.

Just to state a personal opinion, I think if it makes you work harder on alignment, I’m fine with that being your subconscious motivation structure. There are places where it diverges, and this sort of comment can be good in that it highlights to such people that any detrimental status seeking will be noticed and punished. But if we start scaling down how much credit people should get based on purity of subconscious heart, we’re all going to die.

But if we start scaling down how much credit people should get based on purity of subconscious heart, we’re all going to die.

That's not how I interpreted lc's comment. I think lc means that people – and maybe especially "ambitious" people (i.e., people with some grandiose traits who enjoy power/influence – are at risk to go astray in their rationality when choosing/updating their path to impact as they're tempted to pick paths that fit their strengths and lead to recognition. He's saying "pay close attention whether the described path to impact is indeed positive."

For instance, Connor seems gifted at ML capabilities work and willing to take action based on inner conviction. Is he in the unfortunate world where the best path to impact says "don't reap any of the benefits of your ML talents" or in the fortunate one where it says "making money with ML is step one of a sound plan?"

Everyone faces this sort of tradeoff, but since you sometimes see people believe things like "this may not be the most impactful thing I could possibly do, but it's what suits my strengths," and Connor doesn't seem to have beliefs like that, there are specific reasons to pay close attention. Of course, the same goes for carefully watching other people who claim that they know how to have a lot of impact and it happens to be something that really plays to their strengths. (I think we definitely need some people who act ambitiously on some specific vision that plays to their strengths!)

This is also how I interpreted lc's comment.

If you have to solve an actually hard problem in the actual real world, in actual physics, for real, an actual problem, that is actually hard, you can't afford to throw your epistemics out the door because you feel bad.

But I like to believe that there was a positive magnetic contagion that happened there.

These statements seem in tension, except insofar as I don't take one or the other literally.

What is the structure of Conjecture's interpretability research, and how does it differ from Redwood and Anthropic's interpretability research?

Edit: This was touched on here.

In their announcement post they mention:

Mechanistic interpretability research in a similar vein to the work of Chris Olah and David Bau, but with less of a focus on circuits-style interpretability and more focus on research whose insights can scale to models with many billions of parameters and larger. Some example approaches might be:
Locating and editing factual knowledge in a transformer language model.
Using deep learning to automate deep learning interpretability - for example, training a language model to give semantic labels to neurons or other internal circuits.
Studying the high-level algorithms that models use to perform e.g, in-context learning or prompt programming.

"One of the important parts of my threat model is that I think 99% of the damage from GPT-3 was done the moment the paper was published. And, as they say about the nuclear bomb, the only secret was that it was possible. And I think there's a bit of naivety that sometimes goes into these arguments, where people are, 'Well, EleutherAI accelerated things, they drew attention to the meme'. And I think there's a lot of hindsight bias there, in that people don't realize how everyone knew about this, except the alignment community. Everyone at OpenAI, Google Brain and DeepMind. People knew about this, and they figured it out fucking fast."

I also agree, in that it gave people possibility, albeit my timeline for AI-PONR is 2016-2022, from the time of go being crushed by AI, basically proving that we managed to get an intuition in AI, to the Chinchilla scaling paper which gave a clear path to human-level AI and superhuman AI. It also threw out the old scaling laws too. I'd also add Gato to the list. Despite overhype, Deepmind plans to scale it in the next few years, and it's eerily close to solving the software for robots.

It seems to me that you're passing comments in bad faith here. Connor repeatedly stressed in podcast that Conjecture would not do capabilities research and that they would not have had plans for developing products had they not been funding constrained.

You make pretty big accusations in the parent comment too, all that not supported by an iota of evidence but an out-of-context quote from podcast picked by you.

Just FYI I deleted that comment before you made the reply, which is why your comment is in some sort of Twilight Zone. I also removed the quote because it does have other interpretations, though I prefer mine.

“There's another interpretation of this, which I think might be better where you can model people like AI_WAIFU as modeling timelines where we don't win with literally zero value. That there is zero value whatsoever in timelines where we don't win. And Eliezer, or people like me, are saying, 'Actually, we should value them in proportion to how close to winning we got'. Because that is more healthy... It's reward shaping! We should give ourselves partial reward for getting partially the way. He says that in the post, how we should give ourselves dignity points in proportion to how close we get.
And this is, in my opinion, a much psychologically healthier way to actually deal with the problem. This is how I reason about the problem. I expect to die. I expect this not to work out. But hell, I'm going to give it a good shot and I'm going to have a great time along the way. I'm going to spend time with great people. I'm going to spend time with my friends. We're going to work on some really great problems. And if it doesn't work out, it doesn't work out. But hell, we're going to die with some dignity. We're going to go down swinging.”

Is there a way we could get sure on the metaphysics here? It feels like it's an important thing to know if it actually happens to be true.

I know saying "don't let this bother you" doesn't actually not let something bother you, but please don't let this dissuade you from possibly making earnest attempts on the alignment solution.

But Connor, here's where the conversation gets... difficult. From my 1000 ft. perspective, it seems to me like literally everything you've done so far is compatible with this story:

You wake up one morning, an ambitious human with an aptitude & interest in Machine Learning.
"How do I raise my status?"
whirr, click
"I can't figure out how to do that except by publishing ML research or models lots of other people use, and thereby accelerating timelines."
whirring of the brain searching for an argument and/or edge case in which publishing large ML models is ackchually not bad
"Aha! I can start 'earning to give'; namely, create a very effective capabilities organization and then leverage the status I get from doing so into becoming an 'influencer' and propagating alignment memes"
[Various things happen]
"How do I raise my status?"
Aha! starts a venture funded for-profit technology company.

I don't think discussing whether someone really wants to do good or whether there is some (possibly unconscious?) status-optimization process is going to help us align AI.

Like, who knew that the thing would become a Discord server with thousands of people talking about ML? That they would somewhat succeed? And then, when the thing is pretty much already somewhat on the rails, what choice do you even have? Delete the server? Tell the people who have been working hard for months to open-source GPT-3 like models that "we should not publish it after all"?

I think this eloquent quote can serve to depict an important, general class of dynamics that can contribute to anthropogenic x-risks.

I don't think discussing whether someone really wants to do good or whether there is some (possibly unconscious?) status-optimization process is going to help us align AI.

Two comments:

[wanting to do good] vs. [one's behavior being affected by an unconscious optimization for status/power] is a false dichotomy.
Don't you think that unilateral interventions within the EA/AIS communities to create/fund for-profit AGI companies, or to develop/disseminate AI capabilities, could have a negative impact on humanity's ability to avoid existential catastrophes from AI?

First point: by "really want to do good" (the really is important here) I mean someone who would be fundamentally altruistic and would not have any status/power desire, even subconsciously.

I think it's also a misconception that it's an unilateralist intervension. Like, they've talked to other people in the community before starting it, it was not a secret.

First point: by "really want to do good" (the really is important here) I mean someone who would be fundamentally altruistic and would not have any status/power desire, even subconsciously.

Regarding the rest of your comment, this thread seems relevant.

I think this comment is getting enough vote & discussion heat for me to feel the merit in clarifying with the following statements:

Most of the value in the world is created by ambitious people. It's not evil to have strong preferences.
If the average ML engineer or entrepreneur had the ethics-oracle of Connor Leahy, we would be in much better shape. Most seem to be either not f(curious,honest) enough or principled enough to even make a plausible attempt at not ending the world. Sam Altman needs to hear the above lecture-rant 10,000 times before Connor needs to hear it once. The reason I write this comment on a post about Connor Leahy is solely because of convenience, because I can't actually talk to Sam Altman, and because I felt like writing one after reading the above post and not because he's even within 3 orders of magnitude of the worst examples of this behavior.

Somewhat unilateral pursuit of an activity with high downside risk
Position of high influence without much accountability or legibility for outsiders to give useful criticism

I'd add to that bullet list:

Severe conflicts of interest are involved.

I strong downvoted your comment in both dimensions because I found it disagreeable and counterproductive.

But if we start scaling down how much credit people should get based on purity of subconscious heart, we’re all going to die.

This is also how I interpreted lc's comment.

If you have to solve an actually hard problem in the actual real world, in actual physics, for real, an actual problem, that is actually hard, you can't afford to throw your epistemics out the door because you feel bad.

But I like to believe that there was a positive magnetic contagion that happened there.

These statements seem in tension, except insofar as I don't take one or the other literally.

What is the structure of Conjecture's interpretability research, and how does it differ from Redwood and Anthropic's interpretability research?

Edit: This was touched on here.

In their announcement post they mention:

Mechanistic interpretability research in a similar vein to the work of Chris Olah and David Bau, but with less of a focus on circuits-style interpretability and more focus on research whose insights can scale to models with many billions of parameters and larger. Some example approaches might be:
Locating and editing factual knowledge in a transformer language model.
Using deep learning to automate deep learning interpretability - for example, training a language model to give semantic labels to neurons or other internal circuits.
Studying the high-level algorithms that models use to perform e.g, in-context learning or prompt programming.

"One of the important parts of my threat model is that I think 99% of the damage from GPT-3 was done the moment the paper was published. And, as they say about the nuclear bomb, the only secret was that it was possible. And I think there's a bit of naivety that sometimes goes into these arguments, where people are, 'Well, EleutherAI accelerated things, they drew attention to the meme'. And I think there's a lot of hindsight bias there, in that people don't realize how everyone knew about this, except the alignment community. Everyone at OpenAI, Google Brain and DeepMind. People knew about this, and they figured it out fucking fast."

You make pretty big accusations in the parent comment too, all that not supported by an iota of evidence but an out-of-context quote from podcast picked by you.

LESSWRONG
LW

LESSWRONG
LW

195

Connor Leahy on Dying with Dignity, EleutherAI and Conjecture

195

Understanding Eliezer Yudkowsky

Eliezer Has Been Conveying Antimemes

Why the Dying with Dignity Heuristic is Useful

EleutherAI

Why training GPT-3 Size Models made sense

EleutherAI Spread Alignment Memes in the ML World

On the Policy and Impact of EleutherAI's Open Source

Conjecture

How Conjecture Started

Where Conjecture Fits in the AI Alignment Landscape

Why Conjecture is Doing Interpretability Research

Conjecture Approach To Solving Alignment

On being non-disclosure by default

On Building Products as a For-Profit

Scaling The Alignment Field

195

195