All of rohinmshah's Comments + Replies

rohinmshah's Shortform

Let's say you're trying to develop some novel true knowledge about some domain. For example, maybe you want to figure out what the effect of a maximum wage law would be, or whether AI takeoff will be continuous or discontinuous. How likely is it that your answer to the question is actually true?

(I'm assuming here that you can't defer to other people on this claim; nobody else in the world has tried to seriously tackle the question, though they may have tackled somewhat related things, or developed more basic knowledge in the domain that you can leverage.)

F... (read more)

Homogeneity vs. heterogeneity in AI takeoff scenarios

Well then, would you agree that Evan's position here:

By default, in the case of deception, my expectation is that we won't get a warning shot at all

is plausible and in particular doesn't depend on believing in a discontinuity, at least not the kind of discontinuity we should consider unlikely?

No, I don't agree with that.

Consider a spectrum of warning shots from very minor to very major. Put a few examples on the spectrum for illustration. Then draw a credence distribution for probability that we'll have warning shots of this kind.

One problem here is that m... (read more)

Old post/writing on optimization daemons?

This probably isn't the thing you mean, but your description kinda sounds like tessellating hills and its predecessor demons in perfect search.

Homogeneity vs. heterogeneity in AI takeoff scenarios

I don't automatically exclude lab settings, but other than that, this seems roughly consistent with my usage of the term. (And in particular includes the "weak" warning shots discussed above.)

4Daniel Kokotajlo21hWell then, would you agree that Evan's position here: is plausible and in particular doesn't depend on believing in a discontinuity, at least not the kind of discontinuity we should consider unlikely? If so, then we are all on the same page. If not, then we can rehash our argument focusing on this "obvious, real-world harm" definition, which is noticeably broader than my "strong" definition and therefore makes Evan's claim stronger and less plausible but still, I think, plausible. (To answer your earlier question, I've read and spoken to several people who seem to take the attempted-world-takeover warning shot scenario seriously, i.e. people who think there's a good chance we'll get "strong" warning shots. Paul Christiano, for example. Though it's possible I was misunderstanding him. I originally interpreted you as maybe being one of those people, though now it seems that you are not? At any rate these people exist.) EDIT: I feel like we've been talking past each other for much of this conversation and in an effort to prevent that from continuing to happen, perhaps instead of answering my questions above, we should just get quantitative. Consider a spectrum of warning shots from very minor to very major. Put a few examples on the spectrum for illustration. Then draw a credence distribution for probability that we'll have warning shots of this kind. Maybe it'll turn out that our distributions aren't that different from each other after all, especially if we conditionalize on slow takeoff.
Homogeneity vs. heterogeneity in AI takeoff scenarios

perhaps you think that it would actually provoke a major increase in caution, comparable to the increase we'd get if an AI tried and failed to take over, in which case this minor warning shot vs. major warning shot distinction doesn't matter much.

Well, I think a case of an AI trying and failing to take over would provoke an even larger increase in caution, so I'd rephrase as

it would actually provoke a major increase in caution (assuming we weren't already being very cautious)

I suppose the distinction between "strong" and "weak" warning shots would matter i... (read more)

6evhub1dI guess I would define a warning shot for X as something like: a situation in which a deployed model causes obvious, real-world harm due to X. So “we tested our model in the lab and found deception” isn't a warning shot for deception, but “we deployed a deceptive model that acted misaligned in deployment while actively trying to evade detection” would be a warning shot for deception, even though it doesn't involve taking over the world. By default, in the case of deception, my expectation is that we won't get a warning shot at all—though I'd more expect a warning shot of the form I gave above than one where a model tries and fails to take over the world, just because I expect that a model that wants to take over the world will be able to bide its time until it can actually succeed.
Homogeneity vs. heterogeneity in AI takeoff scenarios

If you think there's something we are not on the same page about here--perhaps what you were hinting at with your final sentence--I'd be interested to hear it.

I'm not sure. Since you were pushing on the claim about failing to take over the world, it seemed like you think (the truth value of) that claim is pretty important, whereas I see it as not that important, which would suggest that there is some underlying disagreement (idk what it would be though).

6Daniel Kokotajlo2dIt's been a while since I thought about this, but going back to the beginning of this thread: I think the first paragraph (Evan's) is basically right, and the second two paragraphs (your response) are basically wrong. I don't think this has anything to do with discontinuities, at least not the kind of discontinuities that are unlikely. (Compare to the mutiny analogy). I think that this distinction between "strong" warning shots and "weak" warning shots is important because I think that "weak" warning shots will probably only provoke a moderate increase in caution on the part of human institutions and AI projects, whereas "strong" warning shots would provoke a large increase in caution. I agree that we'll probably get various "weak" warning shots, but I think this doesn't change the overall picture much because it won't provoke a major increase in caution on the part of human institutions etc. I'm guessing it's that last bit that is the crux--perhaps you think that it would actually provoke a major increase in caution, comparable to the increase we'd get if an AI tried and failed to take over, in which case this minor warning shot vs. major warning shot distinction doesn't matter much.
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Perhaps I should start saying "Guys, can we encourage folks to work on both issues please, so that people who care about x-risk have more ways to show up and professionally matter?", and maybe that will trigger less pushback of the form "No, alignment is the most important thing"... 

I think that probably would be true.

For some reason when I express opinions of the form "Alignment isn't the most valuable thing on the margin", alignment-oriented folks (e.g., Paul here) seem to think I'm saying you shouldn't work on alignment (which I'm not), which trigg

... (read more)
3Andrew_Critch3dGot it, thanks!
Homogeneity vs. heterogeneity in AI takeoff scenarios

Not sure why I didn't respond to this, sorry.

I agree with the claim "we may not have an AI system that tries and fails to take over the world (i.e. an AI system that tries but fails to release an engineered pandemic that would kill all humans, or arrange for simultaneous coups in the major governments, or have a robotic army kill all humans, etc) before getting an AI system that tries and succeeds at taking over the world".

I don't see this claim as particularly relevant to predicting the future.

4Daniel Kokotajlo3dOK, thanks. YMMV but some people I've read / talked to seem to think that before we have successful world-takeover attempts, we'll have unsuccessful ones--"sordid stumbles." If this is true, it's good news, because it makes it a LOT easier to prevent successful attempts. Alas it is not true. A much weaker version of something like this may be true, e.g. the warning shot story you proposed a while back about customer service bots being willingly scammed. It's plausible to me that we'll get stuff like that before it's too late. If you think there's something we are not on the same page about here--perhaps what you were hinting at with your final sentence--I'd be interested to hear it.
Another (outer) alignment failure story

Planned opinion (shared with What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs))

Both the previous story and this one seem quite similar to each other, and seem pretty reasonable to me as a description of one plausible failure mode we are aiming to avert. The previous story tends to frame this more as a failure of humanity’s coordination, while this one frames it (in the title) as a failure of intent alignment. It seems like both of these aspects greatly increase the plausibility of the story, or in other words, if we eliminated

... (read more)
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Planned summary for the Alignment Newsletter:

A robust agent-agnostic process (RAAP) is a process that robustly leads to an outcome, without being very sensitive to the details of exactly which agents participate in the process, or how they work. This is illustrated through a “Production Web” failure story, which roughly goes as follows:

A breakthrough in AI technology leads to a wave of automation of $JOBTYPE (e.g management) jobs. Any companies that don’t adopt this automation are outcompeted, and so soon most of these jobs are completely automated. This l

... (read more)
3Andrew_Critch4dYes, I agree with this. Yes! +10 to this! For some reason when I express opinions of the form "Alignment isn't the most valuable thing on the margin", alignment-oriented folks (e.g., Paul here) seem to think I'm saying you shouldn't work on alignment (which I'm not), which triggers a "Yes, this is the most valuable thing" reply. I'm trying to say "Hey, if you care about AI x-risk, alignment isn't the only game in town", and staking some personal reputation points to push against the status quo where almost-everyone x-risk oriented will work on alignment and almost-nobody x-risk-oriented will work on cooperation/coordination or multi/multi delegation. Perhaps I should start saying "Guys, can we encourage folks to work on both issues please, so that people who care about x-risk have more ways to show up and professionally matter?", and maybe that will trigger less pushback of the form "No, alignment is the most important thing"...
Another (outer) alignment failure story

Planned summary for the Alignment Newsletter:

Suppose we train AI systems to perform task T by having humans look at the results that the AI system achieves and evaluating how well the AI has performed task T. Suppose further that AI systems generalize “correctly” such that even in new situations they are still taking those actions that they predict we will evaluate as good. This does not mean that the systems are aligned: they would still deceive us into _thinking_ things are great when they actually are not. This post presents a more detailed story for ho

... (read more)
2rohinmshah4dPlanned opinion (shared with What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) [https://www.alignmentforum.org/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic] )
AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes

Planned summary for the Alignment Newsletter:

This podcast covers a bunch of topics, such as <@debate@>(@AI safety via debate@), <@cross examination@>(@Writeup: Progress on AI Safety via Debate@), <@HCH@>(@Humans Consulting HCH@), <@iterated amplification@>(@Supervising strong learners by amplifying weak experts@), and <@imitative generalization@>(@Imitative Generalisation (AKA 'Learning the Prior')@) (aka [learning the prior](https://www.alignmentforum.org/posts/SL9mKhgdmDKXmxwE4/learning-the-prior) ([AN #109](https://mailchi.

... (read more)
My research methodology

I agree this involves discretion [...] So instead I'm doing some in between thing

Yeah, I think I feel like that's the part where I don't think I could replicate your intuitions (yet).

I don't think we disagree; I'm just noting that this methodology requires a fair amount of intuition / discretion, and I don't feel like I could do this myself. This is much more a statement about what I can do, rather than a statement about how good the methodology is on some absolute scale.

(Probably I could have been clearer about this in the original opinion.)

My research methodology

In some sense you could start from the trivial story "Your algorithm didn't work and then something bad happened." Then the "search for stories" step is really just trying to figure out if the trivial story is plausible. I think that's pretty similar to a story like: "You can't control what your model thinks, so in some new situation it decides to kill you."

To fill in the details more:

Assume that we're finding an algorithm to train an agent with a sufficiently large action space (i.e. we don't get safety via the agent having such a restricted action space ... (read more)

4paulfchristiano8dThat's basically where I start, but then I want to try to tell some story about why it kills you, i.e. what is it about the heuristic H and circumstance C that causes it to kill you? I agree this involves discretion, and indeed moving beyond the trivial story "The algorithm fails and then it turns out you die" requires discretion, since those stories are certainly plausible. The other extreme would be to require us to keep making the story more and more concrete until we had fully specified the model, which also seems intractable. So instead I'm doing some in between thing, which is roughly like: I'm allowed to push on the story to make it more concrete along any axis, but I recognize that I won't have time to pin down every axis so I'm basically only going to do this a bounded number of times before I have to admit that it seems plausible enough (so I can't fill in a billion parameters of my model one by one this way; what's worse, filling in those parameters would take even more than a billion time and so this may become intractable even before you get to a billion).
My research methodology

Planned summary for the Alignment Newsletter:

This post outlines a simple methodology for making progress on AI alignment. The core idea is to alternate between two steps:

1. Come up with some alignment algorithm that solves the issues identified so far

2. Try to find some plausible situation in which either a) the resulting AI system is misaligned or b) the AI system is not competitive.

This is all done conceptually, so step 2 can involve fairly exotic scenarios that probably won't happen. Given such a scenario, we need to argue why no failure in the same cla

... (read more)

From my perspective, there is a core reason for worry, which is something like "you can't fully control what patterns of thought your algorithm learns, and how they'll behave in new circumstances", and it feels like you could always apply that as your step 2

That doesn't seem like it has quite the type signature I'm looking for. I'm imagining a story as a description of how something bad happens, so I want the story to end with "and then something bad happens."

In some sense you could start from the trivial story "Your algorithm didn't work and then something ... (read more)

How do scaling laws work for fine-tuning?

I don't think similarly-sized transformers would do much better and might do worse. Section 3.4 shows that large models trained from scratch massively overfit to the data. I vaguely recall the authors saying that similarly-sized transformers tended to be harder to train as well.

How do scaling laws work for fine-tuning?
Answer by rohinmshah · Apr 04, 2021 · 15 · Ω8

Does this mean that this fine-tuning process can be thought of as training a NN that is 3 OOMs smaller, and thus needs 3 OOMs fewer training steps according to the scaling laws?

My guess is that the answer is mostly yes (maybe not the exact numbers predicted by existing scaling laws, but similar ballpark).

how does that not contradict the scaling laws for transfer described here and used in this calculation by Rohin?

I think this is mostly irrelevant to timelines / previous scaling laws for transfer:

  1. You still have to pretrain the Transformer, which will take
... (read more)
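To make the "similar ballpark" claim concrete, here is a rough back-of-the-envelope sketch. The power-law form and the exponent below are assumptions in the spirit of existing scaling laws, not numbers taken from the paper or from my earlier calculation:

```python
# Back-of-the-envelope sketch. Assumption: the data/steps needed to reach a
# fixed loss scale as a power law in the number of *trainable* parameters,
# steps ~ N**alpha, with alpha somewhere below 1 (the exact exponent is an
# assumption, not a number from the paper or the post).
def relative_steps_needed(param_fraction: float, alpha: float = 0.75) -> float:
    """Steps needed when fine-tuning only `param_fraction` of the parameters,
    relative to tuning all of them, under the assumed power law."""
    return param_fraction ** alpha

for frac in [1.0, 1e-1, 1e-2, 1e-3]:
    print(f"tuning {frac:>6.1%} of params -> ~{relative_steps_needed(frac):.1e}x the steps")

# With alpha = 0.75, tuning 0.1% of the parameters needs ~10**-2.25 (about 5.6e-3)
# of the steps: roughly 2-3 OOMs fewer, i.e. the same ballpark as the naive
# "3 OOMs smaller model -> 3 OOMs fewer training steps" reading.
```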
2Daniel Kokotajlo12dThanks! Your answer no. 2 is especially convincing to me; I didn't realize the authors used smaller models as the comparison--that seems like an unfair comparison! I would like to see how well these 0.1%-tuned transformers do compared to similarly-sized transformers trained from scratch.
Coherence arguments imply a force for goal-directed behavior

Yes, that's basically right.

You think I take the original argument to be arguing from ‘has goals' to ‘has goals’, essentially, and agree that that holds, but don’t find it very interesting/relevant.

Well, I do think it is an interesting/relevant argument (because as you say it explains how you get from "weakly has goals" to "strongly has goals"). I just wanted to correct the misconception about what I was arguing against, and I wanted to highlight the "intelligent" --> "weakly has goals" step as a relatively weak step in our current arguments. (In my ori... (read more)

9KatjaGrace8dI wrote an AI Impacts page [https://aiimpacts.org/what-do-coherence-arguments-imply-about-the-behavior-of-advanced-ai/] summary of the situation as I understand it. If anyone feels like looking, I'm interested in corrections/suggestions (either here or in the AI Impacts feedback box).
Coherence arguments imply a force for goal-directed behavior

Thanks, that's helpful. I'll think about how to clarify this in the original post.

6Rob Bensinger18dMaybe changing the title would prime people less to have the wrong interpretation? E.g., to 'Coherence arguments require that the system care about something'. Even just 'Coherence arguments do not entail goal-directed behavior' might help, since colloquial "imply" tends to be probabilistic, but you mean math/logic "imply" instead. Or 'Coherence theorems do not entail goal-directed behavior on their own'.
Coherence arguments imply a force for goal-directed behavior

You're mistaken about the view I'm arguing against. (Though perhaps in practice most people think I'm arguing against the view you point out, in which case I hope this post helps them realize their error.) In particular:

Whatever things you care about, you are best off assigning consistent numerical values to them and maximizing the expected sum of those values

If you start by assuming that the agent cares about things, and your prior is that the things it cares about are "simple" (e.g. it is very unlikely to be optimizing the-utility-function-that-makes-the... (read more)

A few quick thoughts on reasons for confusion:

I think maybe one thing going on is that I already took the coherence arguments to apply only in getting you from weakly having goals to strongly having goals, so since you were arguing against their applicability, I thought you were talking about the step from weaker to stronger goal direction. (I’m not sure what arguments people use to get from 1 to 2 though, so maybe you are right that it is also something to do with coherence, at least implicitly.)

It also seems natural to think of ‘weakly has goals’ as some... (read more)

Thanks. Let me check if I understand you correctly:

You think I take the original argument to be arguing from ‘has goals' to ‘has goals’, essentially, and agree that that holds, but don’t find it very interesting/relevant.

What you disagree with is an argument from ‘anything smart’ to ‘has goals’, which seems to be what is needed for the AI risk argument to apply to any superintelligent agent.

Is that right?

If so, I think it’s helpful to distinguish between ‘weakly has goals’ and ‘strongly has goals’:

  1. Weakly has goals: ‘has some sort of drive toward something,
... (read more)
Introduction To The Infra-Bayesianism Sequence

But for more general infradistributions this need not be the case. For example, consider X := {0,1} and take the set of a-measures generated by 3δ_0 and δ_1. Suppose you start with 1/2 dollars and can bet any amount on any outcome at even odds. Then the optimal bet is betting 1/4 dollars on the outcome 1, with a value of 3/4 dollars.

I guess my question is more like: shouldn't there be some aspect of reality that determines what my set of a-measures is? It feels like here we're finding a set of a-measures... (read more)

4Vanessa Kosoy19dIIUC your question can be reformulated as follows: a crisp infradistribution can be regarded as a claim about reality (the true distribution is inside the set), but it's not clear how to generalize this to non-crisp. Well, if you think in terms of desiderata, then crisp says: if distribution is inside set then we have some lower bound on expected utility (and if it's not then we don't promise anything). On the other hand non-crisp gives a lower bound that is variable with the true distribution. We can think of non-crisp infradistirbutions as being fuzzy properties of the distribution (hence the name "crisp"). In fact, if we restrict ourselves to either of homogenous, cohomogenous or c-additive infradistributions, then we actually have a formal way to assign membership functions to infradistirbutions, i.e. literally regard them as fuzzy sets of distributions (which ofc have to satisfy some property analogous to convexity).
My research methodology

Cool, that makes sense, thanks!

My AGI Threat Model: Misaligned Model-Based RL Agent

Planned summary for the Alignment Newsletter:

This post lays out a pathway by which an AI-induced existential catastrophe could occur. The author suggests that AGI will be built via model-based reinforcement learning: that is, given a reward function, we will learn a world model, a value function, and a planner / actor. These will learn online, that is, even after being deployed these learned models will continue to be updated by our learning algorithm (gradient descent, or whatever replaces it). Most research effort will be focused on learning these models

... (read more)
Against evolution as an analogy for how humans will create AGI

If an AGI learned the skill of speaking english during training, but then learned the skill of speaking french during deployment, then your hypotheses imply that the implementations of those two language skills will be totally different. And it then gets weirder if they overlap - e.g. if an AGI learns a fact during training which gets stored in its weights, and then reads a correction later on during deployment, do those original weights just stay there?

Idk, this just sounds plausible to me. I think the hope is that the weights encode more general reasonin... (read more)

4Steven Byrnes23dYes this post is about the process by which AGI is made, i.e. #2. (See "I want to be specific about what I’m arguing against here."...) I'm not sure what you mean by "literal natural selection", but FWIW I'm lumping together outer-loop optimization algorithms regardless of whether they're evolutionary or gradient descent or downhill-simplex or whatever.
My research methodology

I'm super on board with this general methodology, at least at a high level. (Counterexample guided loops are great.) I think my main question is, how do you tell when a failure story is sufficiently compelling that you should switch back into algorithm-finding mode?

For example, I feel like with iterated amplification, a bunch of people (including you, probably) said early on that it seems like a hard case to do e.g. translation between languages with people who only know one of the languages, or to reproduce brilliant flashes of insight. (Iirc, the transla... (read more)

High level point especially for folks with less context: I stopped doing theory for a while because I wanted to help get applied work going, and now I'm finally going back to doing theory for a variety of reasons; my story is definitely not that I'm transitioning back from applied work to theory because I now believe the algorithms aren't ready.

I think my main question is, how do you tell when a failure story is sufficiently compelling that you should switch back into algorithm-finding mode?

I feel like a story is basically plausible until proven implausibl... (read more)

My research methodology

These are both cases of counterexample-guided techniques. The basic idea is to solve "exists x: forall y: P(x, y)" statements according to the following algorithm:

  1. Choose some initial x, and initialize a set Y = {}.
  2. Solve "exists y: not P(x, y)". If unsolvable, you're done. If not, take the discovered y and put it in Y.
  3. Solve "exists x: forall y in Y: P(x, y)" and set the solution as your new x.
  4. Go to step 2.

The reason this is so nice is because you've taken a claim with two quantifiers and written an algorithm that must only ever solve claims with one quantif... (read more)
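As a minimal illustration, here is the loop written out in Python for a toy predicate. The predicate P and the brute-force "solvers" are stand-ins chosen for illustration, not anything from a real verification tool:

```python
from typing import Optional

# Toy predicate chosen for illustration: we want an x that is at least as
# large as every y in the domain, i.e. "exists x: forall y: x >= y".
def P(x: int, y: int) -> bool:
    return x >= y

X_DOMAIN = range(10)  # candidate solutions x
Y_DOMAIN = range(10)  # candidate counterexamples y

def find_counterexample(x: int) -> Optional[int]:
    # Step 2: solve "exists y: not P(x, y)" -- a single-quantifier query.
    return next((y for y in Y_DOMAIN if not P(x, y)), None)

def find_candidate(counterexamples: set) -> Optional[int]:
    # Step 3: solve "exists x: forall y in Y: P(x, y)" -- the "forall" here
    # only ranges over the finite set of counterexamples collected so far.
    return next((x for x in X_DOMAIN if all(P(x, y) for y in counterexamples)), None)

def counterexample_guided_search() -> Optional[int]:
    x, Y = X_DOMAIN[0], set()          # Step 1: initial candidate, empty Y
    while x is not None:
        y = find_counterexample(x)     # Step 2
        if y is None:
            return x                   # no counterexample exists: x is a solution
        Y.add(y)
        x = find_candidate(Y)          # Step 3, then back to step 2
    return None                        # no candidate survives the counterexamples

print(counterexample_guided_search())  # -> 9, the only x with x >= y for all y in [0, 10)
```

Note that neither subroutine ever faces a nested exists-forall directly: the universal quantifier in step 3 only runs over the counterexamples found so far, which is the point made above.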

Introduction To The Infra-Bayesianism Sequence

If you use the Anti-Nirvana trick, your agent just goes "nothing matters at all, the foe will mispredict and I'll get -infinity reward" and rolls over and cries since all policies are optimal. Don't do that one, it's a bad idea.

Sorry, I meant the combination of best-case reasoning (sup instead of inf) and the anti-Nirvana trick. In that case the agent goes "Murphy won't mispredict, since then I'd get -infinity reward which can't be the best that I do".

For your concrete example, that's why you have multiple hypotheses that are learnable.

Hmm, that makes sense, I think? Perhaps I just haven't really internalized the learning aspect of all of this.

Introduction To The Infra-Bayesianism Sequence

I'd like to note that even if there was a clean separation between an agent and its environment, it could still be the case that the environment cannot be precisely modeled by the agent due to its computational complexity

Yeah, agreed. I'm intentionally going for a simplified summary that sacrifices details like this for the sake of cleaner narrative.

it would be more fair to say that the contribution of IB is combining that with reinforcement learning theory 

Ah, whoops. Live and learn.

The reason we use worst-case reasoning is because we want the agent

... (read more)
4Vanessa Kosoy22dYes I think that if you are offered a single bet, your utility is linear in money and your belief is a crisp infradistribution (i.e. a closed convex set of probability distributions) then it is always optimal to bet either as much as you can or nothing at all. But for more general infradistributions this need not be the case. For example, consider X := {0,1} and take the set of a-measures generated by 3δ_0 and δ_1. Suppose you start with 1/2 dollars and can bet any amount on any outcome at even odds. Then the optimal bet is betting 1/4 dollars on the outcome 1, with a value of 3/4 dollars.
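A quick numerical check of this example, as a sketch: I'm assuming here that the infra-value of a bet is the minimum of its expectations under the two generating a-measures (minimizing over their convex hull reduces to minimizing over the generators, and both affine terms are taken to be zero):

```python
import numpy as np

# Sketch of the betting example above: start with 1/2 dollars, bet `bet` on
# outcome 1 at even odds, and evaluate under the generators 3*delta_0 and delta_1.
def wealth(bet: float, outcome: int) -> float:
    return 0.5 + bet if outcome == 1 else 0.5 - bet

def infra_value(bet: float) -> float:
    value_under_3delta0 = 3 * wealth(bet, 0)   # a-measure with mass 3 on outcome 0
    value_under_delta1 = 1 * wealth(bet, 1)    # a-measure with mass 1 on outcome 1
    return min(value_under_3delta0, value_under_delta1)

bets = np.linspace(-0.5, 0.5, 1001)            # negative bet = betting on outcome 0
values = np.array([infra_value(b) for b in bets])
print(bets[values.argmax()], values.max())     # -> 0.25, 0.75: bet 1/4 on outcome 1, value 3/4
```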
3Diffractor23dIf you use the Anti-Nirvana trick, your agent just goes "nothing matters at all, the foe will mispredict and I'll get -infinity reward" and rolls over and cries since all policies are optimal. Don't do that one, it's a bad idea. For the concave expectation functionals: Well, there's another constraint or two, like monotonicity, but yeah, LF duality basically says that you can turn any (monotone) concave expectation functional into an inframeasure. Ie, all risk aversion can be interpreted as having radical uncertainty over some aspects of how the environment works and assuming you get worst-case outcomes from the parts you can't predict. For your concrete example, that's why you have multiple hypotheses that are learnable. Sure, one of your hypotheses might have complete knightian uncertainty over the odd bits, but another hypothesis might not. Betting on the odd bits is advised by a more-informative hypothesis, for sufficiently good bets. And the policy selected by the agent would probably be something like "bet on the odd bits occasionally, and if I keep losing those bets, stop betting", as this wins in the hypothesis where some of the odd bits are predictable, and doesn't lose too much in the hypothesis where the odd bits are completely unpredictable and out to make you lose.
Against evolution as an analogy for how humans will create AGI

All of that sounds reasonable to me. I still don't see why you think editing weights is required, as opposed to something like editing external memory.

(Also, maybe we just won't have AGI that learns by reading books, and instead it will be more useful to have a lot of task-specific AI systems with a huge amount of "built-in" knowledge, similarly to GPT-3. I wouldn't put this as my most likely outcome, but it seems quite plausible.)

6Richard_Ngo23dI agree with Steve that it seems really weird to have these two parallel systems of knowledge encoding the same types of things. If an AGI learned the skill of speaking english during training, but then learned the skill of speaking french during deployment, then your hypotheses imply that the implementations of those two language skills will be totally different. And it then gets weirder if they overlap - e.g. if an AGI learns a fact during training which gets stored in its weights, and then reads a correction later on during deployment, do those original weights just stay there? Based on this I guess your answer to my question above is "no": the original fact will get overridden a few days later, and also the knowledge of french will be transferred into the weights eventually. But if those updates occur via self-supervised learning, then I'd count that as "autonomously edit[ing] its weights after training". And with self-supervised learning, you don't need to wait long for feedback, so why wouldn't you use it to edit weights all the time? At the very least, that would free up space in the short-term memory/hidden state. For my own part I'm happy to concede that AGIs will need some way of editing their weights during deployment. The big question for me is how continuous this is with the rest of the training process. E.g. do you just keep doing SGD, but with a smaller learning rate? Or will there be a different (meta-learned) weight update mechanism? My money's on the latter. If it's the former, then that would update me a bit towards Steve's view, but I think I'd still expect evolution to be a good analogy for the earlier phases of SGD. If this is the case, then that would shift me away from thinking of evolution as a good analogy for AGI, because the training process would then look more like the type of skill acquisition that happens during human lifetimes. In fact, this seems like the most likely way in which Steve is right that evolution is a bad analogy.
Against evolution as an analogy for how humans will create AGI

Thanks, this was helpful in understanding in where you're coming from.

When I think of the AGI-hard part of "learning", I think of building a solid bedrock of knowledge and ideas, such that you can build new ideas on top of the old ideas, in an arbitrarily high tower.

I don't feel like humans meet this bar. Maybe mathematicians, and even then, I probably still wouldn't agree. Especially not humans without external memory (e.g. paper). But presumably such humans still count as generally intelligent.

Anyway, my human brain analogy for GPT-3 is: I think the GPT-

... (read more)
4Steven Byrnes24dThanks again, this is really helpful. Hmm, imagine you get a job doing bicycle repair. After a while, you've learned a vocabulary of probably thousands of entities and affordances and interrelationships (the chain, one link on the chain, the way the chain moves, the feel of clicking the chain into place on the gear, what it looks like if a chain is loose, what it feels like to the rider when a chain is loose, if I touch the chain then my finger will be greasy, etc. etc.). All that information is stored in a highly-structured way in your brain (I think some souped-up version of a PGM, but let's not get into that), such that it can grow to hold a massive amount of information while remaining easily searchable and usable. The problem with working memory is not capacity per se, it's that it's not stored in this structured, easily-usable-and-searchable way. So the more information you put there, the more you start getting bogged down and missing things. Ditto with pen and paper, or a recurrent state, etc. I find it helpful to think about our brain's understanding as lots of subroutines running in parallel. (Kaj calls these things "subagents" [https://www.lesswrong.com/s/ZbmRyDN8TCpBTZSip], I more typically call them "generative models" [https://www.lesswrong.com/posts/diruo47z32eprenTg/my-computational-framework-for-the-brain] , Kurzweil calls them "patterns" [https://www.amazon.com/How-Create-Mind-Thought-Revealed/dp/1491518839], Minsky calls this idea "society of mind" [https://www.amazon.com/Society-Mind-Marvin-Minsky/dp/0671657135], etc.) They all mostly just sit around doing nothing. But sometimes they recognize a scenario for which they have something to say, and then they jump in and say it. So in chess, there's a subroutine that says "If the board position has such-and-characteristics, it's worthwhile to consider moving the pawn." The subroutine sits quietly for months until the board has that position, and then it jumps in and injects its idea. And of course,
Against evolution as an analogy for how humans will create AGI

I feel like I didn't really understand what you were trying to get at here, probably because you seem to have a detailed internal ontology that I don't really get yet. So here's some random disagreements, with the hope that more discussion leads me to figure out what this ontology actually is.

A biological analogy I like much better: The “genome = code” analogy

This analogy also seems fine to me, as someone who likes the evolution analogy

In the remainder of the post I’ll go over three reasons suggesting that the first scenario would be much less likely than

... (read more)
5Steven Byrnes24dThanks! A lot of your comments are trying to relate this to GPT-3, I think. Maybe things will be clearer if I just directly describe how I think about GPT-3. The evolution analogy (as I'm defining it) says that “The AGI” is identified as the inner algorithm, not the inner and outer algorithm working together. In other words, if I ask the AGI a question, I don’t need the outer algorithm to be running in the course of answering that question. Of course the GPT-3 trained model is already capable of answering "easy" questions, but I'm thinking here about "very hard" questions that need the serious construction of lots of new knowledge and ideas that build on each other. I don't think the GPT-3 trained model can do that by itself. Now for GPT-3, the outer algorithm edits weights, and the inner algorithm edits activations. I am very impressed about the capabilities of the GPT-3 weights, edited by SGD, to store an open-ended world model of greater and greater complexity as you train it more and more. I am not so optimistic that the GPT-3 activations can do that, without somehow transferring information from activations to weights. And not just for the stupid reason that it has a finite training window. (For example, other transformer models have recurrency.) Why don't I think that the GPT-3 trained model is just as capable of building out an open-ended world-model of ever greater complexity using activations not weights? For one thing, it strikes me as a bit weird to think that there will be this centaur-like world model constructed out of X% weights and (100-X)% activations. And what if GPT comes to realize that one of its previous beliefs is actually wrong? Can the activations somehow act as if they're overwriting the weights? Just seems weird. How much information content can you put in the activations anyway? I don't know off the top of my head, but much less than the amount you can put in the weights. When I think of the AGI-hard part of "learning", I think of b
AXRP Episode 5 - Infra-Bayesianism with Vanessa Kosoy

Wrote a combined summary for this podcast and the original sequence here.

Introduction To The Infra-Bayesianism Sequence

Planned summary for the Alignment Newsletter:

I have finally understood this sequence enough to write a summary about it, thanks to [AXRP Episode 5](https://www.alignmentforum.org/posts/FkMPXiomjGBjMfosg/axrp-episode-5-infra-bayesianism-with-vanessa-kosoy). Think of this as a combined summary + highlight of the sequence and the podcast episode.

The central problem of <@embedded agency@>(@Embedded Agents@) is that there is no clean separation between an agent and its environment: rather, the agent is _embedded_ in its environment, and so when reasoning

... (read more)
5Vanessa Kosoy24dThat's certainly one way to motivate IB, however I'd like to note that even if there was a clean separation between an agent and its environment, it could still be the case that the environment cannot be precisely modeled by the agent due to its computational complexity (in particular this must be the case if the environment contains other agents of similar or greater complexity). Well, the use of Knightian uncertainty (imprecise probability) in decision theory certainly appeared in the literature, so it would be more fair to say that the contribution of IB is combining that with reinforcement learning theory (i.e. treating sequential decision making and considering learnability and regret bounds in this setting) and applying that to various other questions (in particular, Newcombian paradoxes). The reason we use worst-case reasoning is because we want the agent to satisfy certain guarantees. Given a learnable class of infra-hypotheses, in the γ→1 limit, we can guarantee that whenever the true environment satisfies one of those hypotheses, the agent attains at least the corresponding amount of expected utility. You don't get anything analogous with best-case reasoning. Moreover, there is an (unpublished) theorem showing that virtually any guarantee you might want to impose can be written in IB form. That is, let E be the space of environments, and let g_n : E → [0,1] be an increasing sequence of functions. We can interpret every g_n as a requirement about the policy π: ∀μ: E_μ^π[U] ≥ g_n(μ). These requirements become stronger with increasing n. We might then want π to be s.t. it satisfies the requirement with the highest n possible. The theorem then says that (under some mild assumptions about the functions g_n) there exists an infra-environment s.t. optimizing for it is equivalent to maximizing n. (We can replace n by a continuous parameter, I made it discrete just for ease of exposition.) Actually it might be not that different. The Legendre-Fenchel duality shows you can thi
2DanielFilan25dOne thing I realized after the podcast is that because the decision theory you get can only handle pseudo-causal environments, it's basically trying to think about the statistics of environments rather than their internals. So my guess is that further progress on transparent newcomb is going to have to look like adding in the right kind of logical uncertainty or something. But basically it unsurprisingly has more of a statistical nature than what you imagine you want reading the FDT paper.
[AN #142]: The quest to understand a network well enough to reimplement it by hand

Ah excellent, thanks for the links. I'll send the Twitter thread in the next newsletter with the following summary:

Last week I speculated that CLIP might "know" that a textual adversarial example is a "picture of an apple with a piece of paper saying an iPod on it" and the zero-shot classification prompt is preventing it from demonstrating this knowledge. Gwern Branwen [commented](https://www.alignmentforum.org/posts/JGByt8TrxREo4twaw/an-142-the-quest-to-understand-a-network-well-enough-to?commentId=keW4DuE7G4SZn9h2r) to link me to this Twitter thread as w

... (read more)
[AN #142]: The quest to understand a network well enough to reimplement it by hand

Related: Interpretability vs Neuroscience: Six major advantages which make artificial neural networks much easier to study than biological ones. Probably not a major surprise to readers here.

Partial-Consciousness as semantic/symbolic representational language model trained on NN

In response, Reiichiro Nakano shared this paper: https://arxiv.org/pdf/1901.03729.pdf 
which kinda shows it's possible to have agent state/action representations in natural language for Frogger. There are probably glaring/obvious flaws with my OP, but this was what inspired those thoughts.  

(I've only read the abstract of the linked paper.)

If you did something like this with GPT-3, you'd essentially have GPT-3 try to rationalize the actions of the chess engine the way a human would. This feels more like having two separate agents with a particular... (read more)

Partial-Consciousness as semantic/symbolic representational language model trained on NN

If you hook up a language model like GPT-3  to a chess engine or some other NN model, isn't a tie from semantic/symbolic level representation (words and sentences that are coherent and understandable) to distributed, subsymbolic representations in NNs being established?

How? Since the inputs and outputs are completely different spaces, I don't see how you can hook them up.

1Joe Kwon1moSo, I thought it would be a neat proof of concept if GPT3 served as a bridge between something like a chess engine’s actions and verbal/semantic level explanations of its goals (so that the actions are interpretable by humans). e.g. bishop to g5; this develops a piece and pins the knight to the king, so you can add additional pressure to the pawn on d5 (or something like this). In response, Reiichiro Nakano shared this paper: https://arxiv.org/pdf/1901.03729.pdf [https://arxiv.org/pdf/1901.03729.pdf] which kinda shows it's possible to have agent state/action representations in natural language for Frogger. There are probably glaring/obvious flaws with my OP, but this was what inspired those thoughts. Apologies if this is really ridiculous—I'm maybe suggesting ML-related ideas prematurely & having fanciful thoughts. Will be studying ML diligently to help with that.
AI x-risk reduction: why I chose academia over industry

I've discussed this question with a good number of people, and I think I've generally found my pro-academia arguments to be stronger than their pro-industry arguments (I think probably many of them would agree?)

I... think we've discussed this? But I don't agree, at least insofar as the arguments are supposed to apply to me as well (so e.g. not the personal fit part).

Some potential disagreements:

  1. I expect more field growth via doing good research that exposes more surface area for people to tackle, rather than mentoring people directly. Partly this is becaus
... (read more)
5capybaralet1moYeah we've definitely discussed it! Rereading what I wrote, I did not clearly communicate what I intended to...I wanted to say that "I think the average trend was for people to update in my direction". I will edit it accordingly. I think the strength of the "usual reasons" has a lot to do with personal fit and what kind of research one wants to do. Personally, I basically didn't consider salary as a factor.
[AN #141]: The case for practicing alignment work on GPT-3 and other large models

I'd like to see Hutter's model "translated" a bit to DNNs, e.g. by assuming they get anything right that's within epsilon of a training data point or something

With this assumption, asymptotically (i.e. with enough data) this becomes a nearest neighbor classifier. For the d-dimensional manifold assumption in the other model, you can apply the arguments from the other model to say that you scale as D^(-c/d) for some constant c (probably c = 1 or 2, depending on what exactly we're quantifying the scaling of).

I'm not entirely sure how you... (read more)
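As a rough sanity check of the nearest-neighbor picture, here is a small simulation sketch. The smooth target function and uniform inputs on the unit cube are illustrative assumptions standing in for data on a d-dimensional manifold (and scipy is assumed available); with squared error the fitted exponent should come out near -2/d, i.e. c ≈ 2:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

def nearest_neighbor_mse(num_train: int, d: int = 4, num_test: int = 2000) -> float:
    """Test MSE of a 1-nearest-neighbor regressor on a smooth target over the
    d-dimensional unit cube (a toy stand-in for a d-dimensional manifold)."""
    target = lambda x: np.sin(x).sum(axis=-1)   # smooth, slowly varying target
    x_train = rng.random((num_train, d))
    x_test = rng.random((num_test, d))
    _, idx = cKDTree(x_train).query(x_test)     # index of nearest training point
    preds = target(x_train)[idx]                # "right if close to a training point"
    return float(((preds - target(x_test)) ** 2).mean())

sizes = [100, 400, 1600, 6400, 25600]
errors = [nearest_neighbor_mse(n) for n in sizes]
slope, _ = np.polyfit(np.log(sizes), np.log(errors), 1)
print(f"fitted scaling exponent: {slope:.2f}  (crude prediction: -2/d = {-2/4:.2f})")
```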

Four Motivations for Learning Normativity

Planned summary for the Alignment Newsletter:

We’ve <@previously seen@>(@Learning Normativity: A Research Agenda@) desiderata for agents that learn normativity from humans: specifically, we would like such agents to:

1. **Learn at all levels:** We don’t just learn about uncertain values, we also learn how to learn values, and how to learn to learn values, etc. There is **no perfect loss function** that works at any level; we assume conservatively that Goodhart’s Law will always apply. In order to not have to give infinite feedback for the infinite leve

... (read more)
[AN #141]: The case for practicing alignment work on GPT-3 and other large models

I feel like there's a pretty strong Occam's Razor-esque argument for preferring Hutter's model, even though it seems wildly less intuitive to me.

?? Overall this claim feels to me like:

  • Observing that cows don't float into space
  • Making a model of spherical cows with constant density ρ and showing that as long as ρ is more than density of air, the cows won't float
  • Concluding that since the model is so simple, Occam's Razor says that cows must be spherical with constant density.

Some ways that you could refute it:

  • It requires your data to be Zipf-distributed -- wh
... (read more)
3capybaralet1moInteresting... Maybe this comes down to different taste or something. I understand, but don't agree with, the cow analogy... I'm not sure why, but one difference is that I think we know more about cows than DNNs or something. I haven't thought about the Zipf-distributed thing. > Taken literally, this is easy to do. Neural nets often get the right answer on never-before-seen data points, whereas Hutter's model doesn't. Presumably you mean something else but idk what. I'd like to see Hutter's model "translated" a bit to DNNs, e.g. by assuming they get anything right that's within epsilon of a training data point or something... maybe it even ends up looking like the other model in that context...
Defending the non-central fallacy

This seems like it is based on an overly literal interpretation of a "contract" + not being willing to deal with complexities of the real world. There clearly is a difference between how much you have "agreed" to the "contract" of governance, and how much you have agreed to a robber breaking into your house and taking your valuables. These are all things that count as some amount of "agreement" to the "contract", that don't have analogs in the robber case:

  • Empirically lots of people do agree to the contract by explicitly getting a visa and coming to the cou
... (read more)
3Matthew Barnett1moIt depends on what you mean by this. Imagine a community of 99 poor people, and one rich person. Every year, the people conduct a vote on whether to tax the one rich person and redistribute his wealth. Sure enough, most people vote for the policy, and most people like the benefits that this governance structure provides. If given the choice, the vast majority of people in the community would not opt out. But that's leaving out something important. If everyone really were given a choice to opt out, then precisely one person would, the rich person. After opting out, the community would lose a large tax base, and would therefore require taxing the next richest person. This next richest person would probably then want to opt out. Put another way, governance is an iterated game. If given the choice, the vast majority of people would prefer not to opt-out in the first round. After sufficient iterations, however, it seems most would prefer to opt-out. And that's not even getting into the objection that one of the main reasons why people would not opt-out of governance is because they've been indoctrinated into believing government is good. Given the choice to opt-out of aging, many say they would not want to [https://www.pewforum.org/2013/08/06/living-to-120-and-beyond-americans-views-on-aging-medical-advances-and-radical-life-extension/] . However, if we grew up in a world where aging was always known to be optional, I'm sure the statistics would be different.
7Matthew Barnett1moIf you want a more detailed reply to your objection, it might be worth picking up a copy of Huemer's book, The Problem of Political Authority. The problem with most of these cases is that they only appear like strong arguments if we're already committed to the premise that we should treat state actors and non-state actors differently. In other words, they only appear strong if we begin with the conclusion we set out to prove. For instance, Suppose that you want to move to Hawaii because it's so beautiful, but you know (because you saw something on the internet) that upon arrival, someone will rob you. If knowing this information, you still move to Hawaii, does this mean that you are consenting to being robbed? Even if when you actually get to Hawaii, you make sure to explain to every potential robber that you really really don't want to be robbed? As Huemer points out, this fact can't be strong evidence that I am consenting to be governed, because nearly everyone knows that they'll be forced to pay taxes whether or not they use those services. Likewise, if you offer your kidnapped victims food, and they accept, that does not imply that they agreed to be kidnapped. Personally, I don't think the contract argument is the best argument for governance. I'd be more inclined to argue for the consequentialist argument for government: that is, that governance provides greater utility overall compared to the alternative. That's also the argument that Scott Alexander seems to want people to use. Huemer also directly replies to this argument in chapter 5 and part 2 in his book, if you're curious.
Defending the non-central fallacy

In this scenario, I would be called a thief. Why? The answer seems to be: because I am taking other people’s property without their consent. The italicized phrase just seems to be what “theft” means. “Taking without consent” includes taking by means of a threat of force issued against other people, as in this example. This fact is not altered by what I do with the money after taking it.

I feel like the obvious response is "there is something like consent with taxation, because people have agreed to a contract in which they pay taxes as long as everyone else... (read more)

3blacktrance1moThe noncentral fallacy is about inappropriately treating a noncentral member of a category as if it were a central member. But your argument is that taxation isn't a member of the category "theft" at all. "Taxation is theft, but that's okay, because it's not the common, bad kind of theft" would be more in line with Scott's responses.
8Matthew Barnett1moIs there a contract? I certainly never signed one. Yet I still have to pay taxes. FWIW, Michael Huemer responds to this objection directly in chapters 2 and 3 of his book, The Problem of Political Authority. He concludes, Yes, although I was quoting Huemer for what he said after that quoted paragraph.
Recursive Quantilizers II

I continue to not understand this but it seems like such a simple question that it must be that there's just some deeper misunderstanding of the exact proposal we're now debating. It seems not particularly worth it to find this misunderstanding; I don't think it will really teach us anything conceptually new.

(If I did want to find it, I would write out pseudocode for the new proposed system and then try to make a more precise claim in terms of the variables in the pseudocode.)

2abramdemski1moFair.
Epistemological Framing for AI Alignment Research

Planned summary for the Alignment Newsletter:

This post recommends that we think about AI alignment research in the following framework:

1. Defining the problem and its terms: for example, we might want to define “agency”, “optimization”, “AI”, and “well-behaved”.

2. Exploring these definitions, to see what they entail.

3. Solving the now well-defined problem.

This is explicitly _not_ a paradigm, but rather a framework in which we can think about possible paradigms for AI safety. A specific paradigm would choose a specific problem formulation and definition (or

... (read more)
The case for aligning narrowly superhuman models

Planned summary for the Alignment Newsletter:

One argument against work on AI safety is that [it is hard to do good work without feedback loops](https://www.jefftk.com/p/why-global-poverty). So how could we get feedback loops? The most obvious approach is to actually try to align strong models right now, in order to get practice with aligning models in the future. This post fleshes out what such an approach might look like. Note that I will not be covering all of the points mentioned in the post; if you find yourself skeptical you may want to read the full

... (read more)
Recursive Quantilizers II

So my main crux here is whether you can be sufficiently confident of the 5x, to know that your tools which are 5x-appropriate apply.

This makes sense, though I probably shouldn't have used "5x" as my number -- it definitely feels intuitively more like your tools could be robust to many orders of magnitude of increased compute / model capacity / data. (Idk how you would think that relates to a scaling factor on intelligence.) I think the key claim / crux here is something like "we can develop techniques that are robust to scaling up compute / capacity / data by N orders, where N doesn't depend significantly on the current compute / capacity / data".

Recursive Quantilizers II

Most of this makes sense (or perhaps more accurately, sounds like it might be true, but there's a good chance if I reread the post and all the comments I'd object again / get confused somehow). One thing though:

Every piece of feedback gets put into the same big pool which helps define Hv, the initial ("human") value function. [...]

Okay, I think with this elaboration I stand by what I originally said:

It seemed to me like since the first few bits of feedback determine how the system interprets all future feedback, it's particularly important for those first

... (read more)
2abramdemski1moYou mean with respect to the system as described in the post (in which case I 100% agree), or the modified system which restarts training upon new feedback (which is what I was just describing)? Because I think this is pretty solidly wrong of the system that restarts. All feedback so far determines the new D1 when the system restarts training. (Again, I'm not saying it's feasible to restart training all the time, I'm just using it as a proof-of-concept to show that we're not fundamentally forced to make a trade-off between (a) order independence and (b) using the best model to interpret feedback.)