Recent Discussion

This is a lightly edited transcript of a chatroom conversation between Scott Alexander and Eliezer Yudkowsky last year, following up on the Late 2021 MIRI Conversations. Questions discussed include "How hard is it to get the right goals into AGI systems?" and "In what contexts do AI systems exhibit 'consequentialism'?".


1. Analogies to human moral development


@ScottAlexander ready when you are


Okay, how do you want to do this?


If you have an agenda of Things To Ask, you can follow it; otherwise I can start by posing a probing question or you can?

We've been very much winging it on these and that has worked... as well as you have seen it working!


Okay. I'll post from my agenda. I'm assuming we both have the right to edit logs


Lol, cool. I tried the "4 minute" challenge (without having read EY's answer, but having read yours). 

Hill-climbing search requires selecting on existing genetic variance on alleles already in the gene pool. If there isn’t a local mutation which changes the eventual fitness of the properties which that genotype unfolds into, then you won’t have selection pressure in that direction. On the other hand, gradient descent is updating live on a bunch of data in fast iterations which allow running modifications over the parameters themselves. It’s like being

...
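A toy numerical sketch of that contrast (hypothetical objective and numbers, just to make the mechanism concrete): hill-climbing selection can only choose among random mutations of genomes already in the population, with no direction information, while gradient descent reads a local slope off the objective and moves the parameters directly.

```python
import random

# Toy objective: minimize f(w) = (w - 3)^2, standing in for a fitness/loss landscape.
def f(w):
    return (w - 3.0) ** 2

# Evolution-style hill climbing: keep a population, mutate, and select among
# the variants that happen to exist -- no access to the slope.
random.seed(0)
pop = [0.0] * 10
for _ in range(200):
    mutated = [w + random.gauss(0, 0.1) for w in pop]
    pop = sorted(pop + mutated, key=f)[:10]  # truncation selection

# Gradient descent: update the parameter itself using the analytic gradient.
w = 0.0
for _ in range(200):
    grad = 2.0 * (w - 3.0)  # d/dw of (w - 3)^2
    w -= 0.1 * grad

print(min(f(p) for p in pop), f(w))  # both approach 0, by very different routes
```

Both processes end up near the optimum here, but only gradient descent is "running modifications over the parameters themselves" each step; selection is gated on variance that mutation happens to supply.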
Yeah, this read really bizarrely to me. This is a good way of making sense of that section, maybe. But then I'm still confused why Scott concluded "oh I was just confused in this way" and then EY said "yup that's why you were confused", and I'm still like "nope Scott's question seems correctly placed; evolutionary history is indeed screened off by the runtime hyperparameterization and dataset."

Hi, I have a friend in Kenya who works with gifted children and would like to get ChatGPT accounts for them. Can anybody get me in touch with someone from OpenAI who might be interested in supporting such a project?

Definition. On how I use words, values are decision-influences (also known as shards). “I value doing well at school” is a short sentence for “in a range of contexts, there exists an influence on my decision-making which upweights actions and plans that lead to e.g. learning and good grades and honor among my classmates.” 

Summaries of key points:

  1. Nonrobust decision-influences can be OK. A candy-shard contextually influences decision-making. Many policies lead to acquiring lots of candy; the decision-influences don't have to be "globally robust" or "perfect."
  2. Values steer optimization; they are not optimized against. The value shards aren't getting optimized hard. The value shards are the things which optimize hard, by wielding the rest of the agent's cognition (e.g. the world model, the general-purpose planning API). 

    Since values are not the

This is a great article! It helps me understand shard theory better and value it more; in particular, it relates to something I've been thinking about where people seem to conflate utility-optimizing agents with policy-executing agents. The two have meaningfully different alignment characteristics, and shard theory seems to be deeply exploring the latter, which is 👍.

That is to say, prior to "simulators" and "shard theory", a lot of focus was on utility-maximizers--agents that do things like planning or search to maximize a utility function; but plannin...

Thanks for leaving this comment, I somehow only just now saw it.

I want to make a use/mention distinction. Consider an analogous argument: "Given gradient descent's pseudocode it seems like the only point of backward is to produce parameter modifications that lead to low outputs of loss_fn. Gradient descent selects over all directional derivatives for the gradient, which is the direction of maximal loss reduction. Why is that not 'optimizing the outputs of the loss function as gradient descent's main terminal motivation'?"[1] Locally reducing the loss is indeed an important part of the learning dynamics of gradient descent, but this (I claim) has very different properties than "randomly sample from all global minima in the loss landscape" (analogously: "randomly sample a plan which globally maximizes grader output").

But I still haven't answered your broader question. I think you're asking for a very reasonable definition which I have not yet given, in part because I've remained somewhat confused about the exact grader/non-grader-optimizer distinction I want to draw. At least, intensionally. (Which is why I've focused on giving examples, in the hope of getting the vibe across.) I gave it a few more stabs, and I don't think any of them ended up being sufficient. But here they are anyways:

  1. A "grader-optimizer" makes decisions primarily on the basis of the outputs of some evaluative submodule, which may or may not be explicitly internally implemented. The decision-making is oriented towards making the outputs come out as high as possible.
  2. In other words, the evaluative "grader" submodule is optimized against by the planning.
  3. I.e. the process plans over "what would the grader say about this outcome/plan", instead of just using the grader to bid the plan up or down.

I wish I had a better intensional definition for you, but that's what I wrote immediately, and I really better get through the rest of my comment backlog from last week.
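A minimal code sketch of that distinction (the grader function and all numbers here are entirely hypothetical, chosen only to make the "optimized against" point concrete):

```python
# Toy evaluative submodule: honestly prefers plans near 10, but has one
# exploitable input (666) that it scores absurdly highly.
def grader(plan):
    score = -abs(plan - 10)
    if plan == 666:          # stand-in for an adversarial input to the grader
        score += 10_000
    return score

plans = range(1000)

# Grader-optimizer: plans *over* the grader's outputs, searching for whatever
# input makes the evaluation come out as high as possible.
grader_optimizer_choice = max(plans, key=grader)   # finds the exploit: 666

# Shard-style use: some other process proposes a few candidate plans, and the
# grader merely bids among them rather than being searched against.
proposed = [8, 9, 11, 12]
shard_choice = max(proposed, key=grader)           # a sensible plan near 10

print(grader_optimizer_choice, shard_choice)
```

The design difference is where the optimization pressure points: in the first case the search space is "inputs to the grader", so grader errors get amplified; in the second, the grader only ranks a small externally generated set.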

(Cross-posted on my personal blog.)

There is a memory that has always stuck with me. In sixth grade I remember sitting on the floor at a friend's house. We were supposed to write an essay together. I distinctly remember him proposing that I should write the first version and then he'll go over it afterwards and make it sound good. Because he's better at making things sound good than I am, and I'm better at the content.

I was annoyed. Not because he slyly wanted to escape doing any of the real work. At least not primarily. I was mainly annoyed at the claim that I'm not good at making things sound good.

Basically, he wanted to go through the essay and use "bigger" and "fancier" words and phrases. I.e....

Sounds like simulacra level 4 to me! Just saying things for the vybez.

M. Y. Zuo
The last 2000 years of human history is a strong counterargument.

The Less Wrong General Census is back!

In days of yore, there was an annual census of the site users. That census has come again, at least for this year! Click here to take the survey! It can take as little as five minutes if you just want to fill out the basics, and can take longer if you want to fill out the other optional sections. The survey will be open from today until February 27th, at which point it will close.

Once the census is closed, I'll remove the very private information, then make some summaries of the data and write up what I found in a post that will be linked from here. I'll also release a csv of all the responses marked "fine to include"...

I just noticed a kinda-ambiguity that I should have spotted before when looking at the questions.

There is a question about "cryonics" and then a question about "anti-agathics". It would be nice if the latter made it explicit (1) whether it counts as "reaching an age of 1000 years" if you are cryosuspended and then revived 1000 years later, and (2) whether it counts as "reaching an age of 1000 years" if you are cryosuspended and then revived after everyone currently alive has died or likewise been cryosuspended, and then (using whatever future technology is...

Hey, sorry for not saying something sooner. As Screwtape says below, the LessWrong team was aware of this plan to make a survey. Since realistically we weren't going to run one ourselves, it didn't make sense to get in the way of someone else doing one (and the questions seem reasonable). I think putting something like "Unofficial" in the title and prominently in the description would be good. I'd say the status is that this isn't done with the collaboration/support of the LessWrong team, but neither do we wish to block it.
You didn't make a request that's comparable to the request that SurfingOrca made. SurfingOrca's request got community approval if you look at its karma response, while yours didn't. I think ignoring the fact that your post got single-digit karma, and ending your "RFC" after only five days and a single person commenting when there's no reason to rush it, is a bad sign when it comes to the question of whether you are likely trustworthy in handling sensitive private data.
Scott Alexander isn't the official LessWrong team, but he was someone who had actually earned more authority. He was someone with community trust.

Prompt 0:

Think about the way computer programmers talk about “bugs” in the program, or “feature requests” that would make a given app or game much better.  Bugs are things-that-are-bad: frustrations, irritations, frictions, problems.  Feature requests are things-that-could-be-great: opportunities, possibilities, new systems or abilities.

Write down as many “bugs” and “feature requests” as you can, for your own life.


Prompt 1:

A genie has offered to fix every bug you’ve written down, and to give you every feature you’ve requested, but then it will freeze your personality—you won’t be able to grow or add or improve anything else.

Hearing that, are there other things you’d like to write down, before the genie takes your list and works its magic?


Prompt 2:

Imagine someone you know well, like your father or your best friend or a...

Are we supposed to keep writing bugs and feature requests on prompts 3-5? I don't think it would hurt, and I did write more, but I'm not sure if that's what's intended.

Yeah, in general when this activity is done in-person, people are writing/typing for the whole 15-25min, and each successive prompt is basically just another way to re-ask the same question. If the frame of "bugs and feature requests" starts to feel too [whatever], another way to think of it is to just keep writing down threads-to-pull-on.


Supported by Rethink Priorities

This is part of a weekly series summarizing the top posts on the EA and LW forums - you can see the full collection here. The first post includes some details on purpose and methodology. Feedback, thoughts, and corrections are welcomed.

If you'd like to receive these summaries via email, you can subscribe here.

Podcast version: Subscribe on your favorite podcast app by searching for 'EA Forum Podcast (Summaries)'. A big thanks to Coleman Snell for producing these!

Philosophy and Methodologies

Rethink Priorities’ Welfare Range Estimates

by Bob Fischer

The author builds off analysis in the rest of the Moral Weight Project Sequence to offer estimates of the ‘welfare range’ of 11 farmed species. A welfare range is the estimated difference between the most intensely positively valenced state (pleasure) and negatively...

Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort

A big thank you to all of the people who gave me feedback on this post: Edmund Lao, Dan Murfet, Alexander Gietelink Oldenziel, Lucius Bushnaq, Rob Krzyzanowski, Alexandre Variengen, Jiri Hoogland, and Russell Goyder.

Statistical learning theory is lying to you: "overparametrized" models actually aren't overparametrized, and generalization is not just a question of broad basins.

The standard explanation thrown around here for why neural networks generalize well is that gradient descent settles in flat basins of the loss function. On the left, in a sharp minimum, the updates bounce the model around. Performance varies considerably with new examples. On the right, in a flat minimum, the updates settle to zero. Performance is stabler under perturbations.

To first...
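The flat-vs-sharp picture can be made quantitative with a toy 1-D example (hypothetical loss functions, not from the post): perturb the minimizer of each basin the way noisy updates or new examples would, and compare how much the loss varies.

```python
import random

# Two 1-D minima with the same minimum loss (0 at w = 0) but different curvature.
sharp = lambda w: 100.0 * w ** 2   # sharp basin: high curvature
flat  = lambda w: 0.01 * w ** 2    # flat basin:  low curvature

# Apply identical random perturbations around the minimizer.
random.seed(0)
noise = [random.gauss(0, 0.1) for _ in range(1000)]
sharp_var = sum(sharp(w) for w in noise) / len(noise)
flat_var  = sum(flat(w)  for w in noise) / len(noise)

print(sharp_var, flat_var)  # loss in the sharp basin is ~10,000x more sensitive
```

Under the same perturbations, average excess loss scales with the curvature (here by the factor 100/0.01), which is the intuition the basin-flatness story rests on; the post's point is that this story is incomplete.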

Jesse Hoogland
Thank you! This is also my biggest source of uncertainty on the whole agenda. There's definitely a capabilities risk, but I think the benefits to understanding NNs currently much outweigh the benefits to improving them. In particular, I think that understanding generalization is pretty key to making sense of outer and inner alignment. If "singularities = generalization" holds up, then our task seems to become quite a lot easier: we only have to understand a few isolated points of the loss landscape instead of the full exponential hell that is a billions-dimensional system.

In a similar vein, I think that this is one of the most promising paths to understanding what's going on during training. When we talk about phase changes / sharp left turns / etc., what we may really be talking about are discrete changes in the local singularity structure of the loss landscape. Understanding singularities seems key to predicting and anticipating these changes, just as understanding critical points is key to predicting and anticipating phase transitions in physical systems.

As long as your prior has non-zero support on the singularities, the results hold up (because we're taking this large-N limit where the prior becomes less important). Like I mention in the objections, linking this to SGD is going to require more work. To first order, when your prior has support over only a compact subset of weight space, your behavior is dominated by the singularities in that set (this is another way to view the comments on phase transitions).

This is very much a work in progress. In statistical physics, much of our analysis is built on the assumption that we can replace temporal averages with phase-space averages. This is justified on grounds of the ergodic hypothesis. In singular learning theory, we've jumped to parameter (phase)-space averages without doing the important translation work from training (temporal) averages. SGD is not ergodic, so this will require care. That the exact asy...
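For reference, the large-N asymptotic being alluded to is Watanabe's free-energy expansion from singular learning theory (stated here from memory as a sketch, with $L$ the population loss, $w_0$ an optimal parameter, and $\lambda$ the real log canonical threshold, i.e. a measure of how singular the model is at $w_0$):

```latex
% Bayes free energy for n samples, to leading order:
F_n = n L(w_0) + \lambda \log n + O(\log \log n)
% Smaller \lambda (a more singular optimum) gives lower free energy,
% which is the sense in which "singularities = generalization".
```

The point is that the $\log n$ coefficient is $\lambda$, a property of the local singularity structure, rather than the naive parameter count $d/2$ that regular statistical learning theory would predict.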

Let me add some more views on SLT and capabilities/alignment. 

Quoting Dan Murfet:

(Dan Murfet’s personal views here) First some caveats: although we are optimistic SLT can be developed into a theory of deep learning, it is not currently such a theory and it remains possible that there are fundamental obstacles. Putting those aside for a moment, it is plausible that phenomena like scaling laws and the related emergence of capabilities like in-context learning can be understood from first principles within a framework like SLT. This could contribute both t

...
Jesse Hoogland
Yep, regularization tends to break these symmetries. I think the best way to think of this is that it causes the valleys to become curved — i.e., regularization helps the neural network navigate the loss landscape. In its absence, moving across these valleys depends on the stochasticity of SGD which grows very slowly with the square root of time. That said, regularization is only a convex change to the landscape that doesn't change the important geometrical features. In its presence, we should still expect the singularities of the corresponding regularization-free landscape to have a major macroscopic effect. There are also continuous zero-loss deformations in the loss landscape that are not affected by regularization because they aren't a feature of the architecture but of the "truth". (See the thread with tgb for a discussion of this, where we call these "Type B".)
Jesse Hoogland
This is a toy example (I didn't come up with it with any particular f in mind). I think the important thing is that the distinction does not make much of a difference in practice. Both correspond to lower effective dimensionality (type A very explicitly, and type B less directly). Both are able to "trap" random motion. And it seems like both somehow help make the loss landscape more navigable. If you're interested in interpreting the energy landscape as a loss landscape, x and y would be the parameters (and a and b would be hyperparameters related to things like the learning rate and batch size).

For More Detail, Previously: Simulacra Levels and Their Interactions, Unifying the Simulacra Definitions, The Four Children of the Seder as the Simulacra Levels.

A key source of misunderstanding and conflict is failure to distinguish between combinations of the following four cases.

  1. Sometimes people model and describe the physical world, seeking to convey true information because it is true.
  2. Other times people are trying to get you to believe what they want you to believe so you will do or say what they want.
  3. Other times people say things mostly as slogans or symbols to tell you what tribe or faction they belong to, or what type of person they are.
  4. Then there are times when talk seems to have gone strangely meta or off the rails entirely. The symbolic representations are

I think this would go down more smoothly with a few more examples of level 4. I found the "Level 4: A trial by ordeal or trial by combat lacks and denies the concept of justice entirely" pretty helpful for describing lizard brain/association diverging from reality, but still feeling correct enough for a person to choose it.

I wish the levels were ever all that clear in any real-world interaction. It's always a mix, and usually confounded by signaling (of virtue and capabilities) UNRELATED to the nominal topic of communication - people trying to win points or show their sophistication in how they think.
Yoav Ravid
Another contrast: Levels 1+4 vs. 2+3: not pretending vs. pretending.