Michaël Trazzi




Use the dignity heuristic as reward shaping

“There's another interpretation of this, which I think might be better where you can model people like AI_WAIFU as modeling timelines where we don't win with literally zero value. That there is zero value whatsoever in timelines where we don't win. And Eliezer, or people like me, are saying, 'Actually, we should value them in proportion to how close to winning we got'. Because that is more healthy... It's reward shaping! We should give ourselves partial reward for getting partially the way. He says that in the post, how we should give ourselves dignity points in proportion to how close we get.

And this is, in my opinion, a much psychologically healthier way to actually deal with the problem. This is how I reason about the problem. I expect to die. I expect this not to work out. But hell, I'm going to give it a good shot and I'm going to have a great time along the way. I'm going to spend time with great people. I'm going to spend time with my friends. We're going to work on some really great problems. And if it doesn't work out, it doesn't work out. But hell, we're going to die with some dignity. We're going to go down swinging.”

Thanks for the feedback! Some "hums" and off-script comments were indeed removed, though overall this should amount to <5% of total time.

I like this comment, and I personally think the framing you suggest is useful. I'd like to point out that, funnily enough, in the rest of the conversation (not in the quotes, unfortunately) he says something about the dying with dignity heuristic being useful because humans are (generally) not able to reason about quantum timelines.

First point: by "really want to do good" (the really is important here) I mean someone who would be fundamentally altruistic and would not have any status/power desire, even subconsciously.

I don't think Conjecture is an "AGI company", everyone I've met there cares deeply about alignment and their alignment team is a decent fraction of the entire company. Plus they're funding the incubator.

I think it's also a misconception that it's a unilateralist intervention. Like, they've talked to other people in the community before starting it, it was not a secret.

tl;dr: people change their minds, reasons why things happen are complex, we should adopt a forgiving mindset/align AI, and long-term impact is hard to measure. At the bottom I try to put numbers on EleutherAI's impact and find it was plausibly net positive.

I don't think discussing whether someone really wants to do good or whether there is some (possibly unconscious?) status-optimization process is going to help us align AI.

The situation is often mixed for a lot of people, and it evolves over time. The culture we need to have on here to solve AI existential risk needs to be more forgiving. Imagine there's an ML professor who has been publishing papers advancing the state of the art for 20 years who suddenly goes "Oh, actually alignment seems important, I changed my mind", would you write a LW post condemning them and another lengthy comment about their status-seeking behavior in trying to publish papers just to become a better professor?

I recently talked to an OpenAI employee who met Connor something like three years ago, when the whole "reproducing GPT-2" thing came about. And he mostly remembered things like the model not having been benchmarked carefully enough. Sure, it did not perform nearly as well on a lot of metrics, though that's kind of missing the point of how this actually happened? As Connor explains, he did not know this would go anywhere, and spent like 2 weeks working on it, without lots of DL experience. He ended up being convinced by some MIRI people to not release it, since this would be establishing a "bad precedent".

I like to think that people can start with a wrong model of what is good and then update in the right direction. Yes, starting yet another "open-sourcing GPT-3" endeavor the next year is not evidence of having completely updated towards "let's minimize the risk of advancing capabilities research at all cost", though I do think that some fraction of people at EleutherAI truly care about alignment and just did not think that the marginal impact of "GPT-Neo/-J accelerating AI timelines" justified not publishing them at all.

My model for what happened in the EleutherAI story is mostly one of "when all you have is a hammer everything looks like a nail". Like, you've reproduced GPT-2 and you have access to lots of compute, why not try out GPT-3? And that's fine. Like, who knew that the thing would become a Discord server with thousands of people talking about ML? That they would somewhat succeed? And then, when the thing is pretty much already somewhat on the rails, what choice do you even have? Delete the server? Tell the people who have been working hard for months to open-source GPT-3-like models that "we should not publish it after all"? Sure, that would have minimized the risk of accelerating timelines. Though when trying to put numbers on it below I find that it's not just "stop something clearly net negative", it's much more nuanced than that.

And after talking to one of the guys who worked on GPT-J for hours, talking to Connor for 3h, and then having to replay what he said multiple times while editing the video/audio etc., I kind of have a clearer sense of where they're coming from. I think a more productive way of making progress in the future is to look at what the positives and negatives were, and put numbers on what was plausibly net good and plausibly net bad, so we can focus on doing the good things in the future and maximize EV (not just minimize risk of negative!).

To be clear, I started the interview with a lot of questions about the impact of EleutherAI, and right now I have a lot more positive or mixed evidence for why it was not "certainly a net negative" (not saying it was certainly net positive). Here is my estimate of the impact of EleutherAI: for each item I give my 80% likelihood interval for its impact on aligning AI, where "-1" is the negative impact of publishing the GPT-3 paper. E.g. (-2, -1) means: "an 80% chance that the impact was between 2x and 1x the negative impact of the GPT-3 paper".

Mostly Negative
-- Publishing the Pile: (-0.4, -0.1) (AI labs, including top ones, use the Pile to train their models)
-- Making ML researchers more interested in scaling: (-0.1, -0.025) (GPT-3 spread the scaling meme, not EleutherAI)
-- The potential harm that might arise from the next models that might be open-sourced in the future using the current infrastructure: (-1, -0.1) (it does seem that they're open to open-sourcing more stuff, although plausibly more careful)

-- Publishing GPT-J: (-0.4, 0.2) (easier to finetune than GPT-Neo, some people use it, though admittedly it was not SoTA when it was released. Top AI labs had supposedly better models. Interpretability / Alignment people, like at Redwood, use GPT-J / GPT-Neo models to interpret LLMs)

Mostly Positive
-- Making ML researchers more interested in alignment: (0.2, 1) (cf. the part when Connor mentions ML professors moving to alignment somewhat because of Eleuther) 
-- Four of the five core people of EleutherAI changing their career to work on alignment, some of them setting up Conjecture, with tacit knowledge of how these large models work: (0.25, 1)
-- Making alignment people more interested in prosaic alignment: (0.1, 0.5)
-- Creating a space with a strong rationalist and ML culture where people can talk about scaling and where alignment is high-status and alignment people can talk about what they care about in real-time + scaling / ML people can learn about alignment: (0.35, 0.8)

Adding these up (as if you could just add confidence intervals; I know this is not how probability works), I get an 80% chance of the impact being in (-1, 3.275), so plausibly net good.
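For anyone who wants to check the arithmetic, here is a minimal sketch of the endpoint sum, using the eight intervals listed above. The variable and function names are mine (hypothetical), and this only reproduces the naive "add the lows, add the highs" shortcut, which is not a proper way to combine 80% intervals.

```python
# Each entry: (label, low, high), copied from the estimates above.
# Unit: "-1" is the negative impact of publishing the GPT-3 paper.
estimates = [
    ("Publishing the Pile", -0.4, -0.1),
    ("ML researchers more interested in scaling", -0.1, -0.025),
    ("Potential harm from future open-sourced models", -1.0, -0.1),
    ("Publishing GPT-J", -0.4, 0.2),
    ("ML researchers more interested in alignment", 0.2, 1.0),
    ("Core people moving to alignment work", 0.25, 1.0),
    ("Alignment people more interested in prosaic alignment", 0.1, 0.5),
    ("Space with rationalist and ML culture", 0.35, 0.8),
]


def sum_intervals(items):
    """Naively add lower bounds together and upper bounds together."""
    low = sum(lo for _label, lo, _hi in items)
    high = sum(hi for _label, _lo, hi in items)
    return low, high


low, high = sum_intervals(estimates)
print(f"naive summed interval: ({low:.3f}, {high:.3f})")
```

The sum matches the (-1, 3.275) figure. Since the naive sum assumes every item lands at its extreme at the same time, a real 80% interval for the total would presumably be narrower if the items are roughly independent.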

In their announcement post they mention:

Mechanistic interpretability research in a similar vein to the work of Chris Olah and David Bau, but with less of a focus on circuits-style interpretability and more focus on research whose insights can scale to models with many billions of parameters and larger. Some example approaches might be:

  • Locating and editing factual knowledge in a transformer language model.
  • Using deep learning to automate deep learning interpretability - for example, training a language model to give semantic labels to neurons or other internal circuits.
  • Studying the high-level algorithms that models use to perform e.g., in-context learning or prompt programming.

I believe the forecasts were aggregated around June 2021. When was GPT2-finetune released? What about GPT-3 few-shot?

Re jumps in performance: Jack Clark has a screenshot on Twitter about saturated benchmarks from the Dynabench paper (2021). It would be interesting to make something up-to-date with MATH: https://twitter.com/jackclarkSF/status/1542723429580689408

I think it makes sense (for him) to not believe AI X-risk is an important problem to solve (right now) if he believes that "fast enough" means "not in his lifetime", and he also puts a lot of moral weight on near-term issues. For completeness' sake, here are some claims more relevant to "not being able to solve the core problem".

1) From the part about compositionality, I believe he is making a point about the inability, within the current deep learning paradigm, to generate an image that would contradict the training set distribution

Generating an image for the caption, a horse riding on an astronaut. That was the example that Gary Marcus talked about, where a human would be able to draw that because a human understands the compositional semantics of that input and current models are struggling, also because of distributional statistics. And in the image to text example, that would be for example, stuff that we've been seeing with Flamingo from DeepMind, where you look at an image and that might represent something very unusual and you are unable to correctly describe the image in the way that's aligned with the composition of the image. So that's the parsing problem that I think people are mostly concerned with when it comes to compositionality and AI.

2) From the part about generalization, he is saying that there is some inability to build truly general systems. I do not agree with his claim, but if I were to steelman the argument it would be something like "even if it seems deep learning is making progress, Boston Dynamics is not using deep learning and there is no progress in the kind of generalization needed for the Wozniak test"

the Wozniak test, which was proposed by Steve Wozniak, which is building a system that can walk into a room, find the coffee maker and brew a good cup of coffee. So these are tasks or capacities that require adapting to novel situations, including scenarios that were not foreseen by the programmers where, because there are so many edge cases in driving, or indeed in walking into an apartment, finding a coffee maker of some kind and making a cup of coffee. There are so many potential edge cases. And, this very long tail of unlikely but possible situations where you can find yourself, you have to adapt more flexibly to this kind of thing.


But I don't know whether that would even make sense, given the other aspect of this test, which is the complexity of having a dexterous robot that can manipulate objects seamlessly and the kind of thing that we're still struggling with today in robotics, which is another interesting thing that, we've made so much progress with disembodied models and there are a lot of ideas flying around with robotics, but in some respect, the state of the art in robotics where the models from Boston Dynamics are not using deep learning, right?

I have never thought of such a race. I think this comment is worth its own post.
