# 2018 Review Discussion

I want to quickly draw attention to a concept in AI alignment: Robustness to Scale. Briefly, you want your proposal for an AI to be robust (or at least fail gracefully) to changes in its level of capabilities. I discuss three different types of robustness to scale: robustness to scaling up, robustness to scaling down, and robustness to relative scale.

The purpose of this post is to communicate, not to persuade. It may be that we want to bite the bullet and build an AGI that is simply not robust to scale, but if we do, we should at least realize that we are doing that.

Robustness to scaling up means that your AI system does not depend on not being...

Rereading this post while thinking about the approximations that we make in alignment, two points jump out at me:

• I'm not convinced that robustness to relative scale is as fundamental as the other two, because there is no reason to expect that in general the subcomponents will differ significantly in power, especially in settings like adversarial training where both parts are trained according to the same approach. That being said, I still agree that this is an interesting question to ask, and some proposals might indeed depend on a version of this.
• Robustn

You are viewing Version 2 of this post: a major revision written for the LessWrong 2018 Review. The original version published on 9th November 2018 can be viewed here.

See my change notes for major updates between V1 and V2.

# Combat Culture

I went to an orthodox Jewish high school in Australia. For most of my early teenage years, I spent one to three hours each morning debating the true meaning of abstruse phrases of Talmudic Aramaic. The majority of class time was spent sitting opposite your chavrusa (study partner, but linguistically the term has the same root as the word “friend”) arguing vehemently for your interpretation of the arcane words. I didn’t think in terms of probabilities back then, but if I had, I think at any point I...

Mod note: this post probably shouldn't have been included in the 2020 Review. It was behaving a bit weirdly because it had appeared in a previous review, and it'd be a fair amount of coding work to get it to seamlessly display the correct number of reviews. It's similar to a post of mine in that it was edited substantially for the 2018 Review and re-published in 2020, which updated its postedAt date and let it bypass the intended filter of 'must have been published in 2020'.

I had previously changed the postedAt date on my post to be pre-2020 so that it wouldn't appear here, and just did the same for this one.

hamnox: Wow, I really love that this has been updated and appendix'd. It's really nice to see how this has grown with community feedback and been polished from a rough concept. Creating common knowledge on how 'cultures' of communication can differ seems really valuable for a community focused on cooperatively finding truth.

I expect "slow takeoff," which we could operationalize as the economy doubling over some 4-year interval before it doubles over any 1-year interval. Lots of people in the AI safety community have strongly opposing views, and it seems like a really important and intriguing disagreement. I feel like I don't really understand the fast takeoff view.

(Below is a short post copied from Facebook. The link contains a more substantive discussion. See also: AI impacts on the same topic.)

I believe that the disagreement is mostly about what happens before we build powerful AGI. I think that weaker AI systems will already have radically transformed the world, while I believe fast takeoff proponents think there are factors that make weak AI systems radically less useful. This is...

I think this is just a sigmoid function, but mirrored over the y-axis. If you extended it farther into the past, it would certainly flatten out just below 100%. So I think it's just another example of how specific technologies are adopted in sigmoid curves, except in reverse, because people are dis-adopting manual farming.

(And I think tech grows in sigmoid curves because that's the solution to the differential equation that models the fundamental dynamics: "grows proportional to position, up to a carrying capacity".)
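That differential equation and its sigmoid solution can be sketched directly (a toy illustration of the comment's point, not from the original post; the function names are mine):

```python
import math

def logistic(t, carrying_capacity=1.0, rate=1.0, midpoint=0.0):
    """Closed-form solution of dP/dt = r * P * (1 - P / K), i.e. the
    "grows proportional to position, up to a carrying capacity"
    dynamic: a sigmoid adoption curve."""
    return carrying_capacity / (1.0 + math.exp(-rate * (t - midpoint)))

# Mirroring over the y-axis (equivalently, negating the rate) turns
# adoption into dis-adoption: the curve starts near K (~100% of
# farming done manually) and decays toward zero.
adoption = [logistic(t) for t in range(-5, 6)]
disadoption = [logistic(t, rate=-1.0) for t in range(-5, 6)]
assert adoption == list(reversed(disadoption))  # exact mirror image
```

The mirrored curve is exactly what the comment describes: the same sigmoid, run in reverse as people dis-adopt manual farming.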

The following is a fictional dialogue building off of AI Alignment: Why It’s Hard, and Where to Start.

(Somewhere in a not-very-near neighboring world, where science took a very different course…)

ALFONSO:  Hello, Beth. I’ve noticed a lot of speculations lately about “spaceplanes” being used to attack cities, or possibly becoming infused with malevolent spirits that inhabit the celestial realms so that they turn on their own engineers.

I’m rather skeptical of these speculations. Indeed, I’m a bit skeptical that airplanes will be able to even rise as high as stratospheric weather balloons anytime in the next century. But I understand that your institute wants to address the potential problem of malevolent or dangerous spaceplanes, and that you think this is an important present-day cause.

BETH:  That’s… really not how we at...

One of the most pleasing things about probability and expected utility theory is that there are many coherence arguments that suggest that these are the “correct” ways to reason. If you deviate from what the theory prescribes, then you must be executing a dominated strategy. There must be some other strategy that never does any worse than your strategy, but does strictly better than your strategy in at least one situation. There’s a good explanation of these arguments here.
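As a toy illustration of a dominated strategy (my own example, not from the post): an agent with cyclic preferences A > B > C > A will pay a small fee for every trade "up" its ranking, and can be pumped in a circle:

```python
# Cyclic preferences: the agent strictly prefers the first of each pair.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}

def accepts(current, offered):
    """The agent pays a fee to swap whenever it prefers the offer."""
    return (offered, current) in prefers

money, holding, fee = 100.0, "A", 1.0

# Offer C, then B, then A, around the cycle three times. Every single
# offer looks like a local improvement, so the agent always trades.
for offered in ["C", "B", "A"] * 3:
    if accepts(holding, offered):
        holding, money = offered, money - fee

print(holding, money)  # A 91.0: same holding as at the start, 9 dollars poorer
```

Any strategy that simply refuses these trades ends with the same holding and strictly more money, which is the sense in which the cyclic agent is dominated.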

We shouldn’t expect mere humans to be able to notice any failures of coherence in a superintelligent agent, since if we could notice these failures, so could the agent. So we should expect that powerful agents appear coherent to us. (Note that it is possible that the...

4Ramana Kumar2moThis is not the type signature for a utility function that matters for the coherence arguments (by which I don't mean VNM - see this comment [https://www.alignmentforum.org/posts/Q9JKKwSFybCTtMS9d/what-are-we-assuming-about-utility-functions?commentId=4dZRMFs7gkDi8WHbN] ). It does often fit the type signature in the way those arguments are formulated/formalised, but intuitively, it's not getting at the point of the theorems. I suggest you consider utility functions defined as functions of the state of the world only, not including the action taken. (Yes I know actions could be logged in the world state, the agent is embedded in the state, etc. - this is all irrelevant for the point I'm trying to make - I'm suggesting to consider the setup where there's a Cartesian boundary, an unknown transition function, and environment states that don't contain a log of actions.) I don't think the above kind of construction works in that setting. I think that's the kind of setting it's better to focus on.

Have you seen this post, which looks at the setting you mentioned?

From my perspective, I want to know why it makes sense to assume that the AI system will have preferences over world states, before I start reasoning about that scenario. And there are reasons to expect something along these lines! I talk about some of them in the next post in this sequence! But I think once you've incorporated some additional reason like "humans will want goal-directed agents" or "agents optimized to do some tasks we write down will hit upon a core of general intelligence",... (read more)

I write an essay every Thursday. Every so often, one seems to really resonate with people.

The piece I just wrote on the nature of explicit and implicit communication both got an enthusiastic reader response and seems directly relevant to a number of the projects and explorations people are doing here, so I'm bringing it over here.

Two points worth mentioning —

1. I take, I think, a relatively fair stance on the tradeoffs and benefits between implicit and explicit communication. But some people are heavily invested in explicit communications models, almost to the identity level, and might not like what they read. I just ask you to bring an open mind — I think the examples of implicit communication here are all clear and convincing cases where explicit can...

[Admin note]. The CIA strategy doc link no longer works, so I updated it to point to https://web.archive.org/web/20200214014530/https://www.cia.gov/news-information/featured-story-archive/2012-featured-story-archive/CleanedUOSSSimpleSabotage_sm.pdf

(Meanwhile: the previous link has one of the most ominous 404 pages I've ever seen)

Cross posted from my personal blog.

In this post, I'm going to assume you've come across the Cognitive Reflection Test before and know the answers. If you haven't, it's only three quick questions, go and do it now.

One of the striking early examples in Kahneman's Thinking, Fast and Slow is the following problem:

(1) A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball.

How much does the ball cost? _____ cents

This question first turns up informally in a paper by Kahneman and Frederick, who find that most people get it wrong:

Almost everyone we ask reports an initial tendency to answer “10 cents” because the sum $1.10 separates naturally into $1 and 10 cents, and 10 cents is about the right magnitude. Many

...

The first time I saw the bat and ball question, it was like there were two parts of my S1. The first one said "the answer is 0.1" and the second one said "this is a math problem, I'm invoking S2". S2 sees the math problem and searches for a formula, at which point she comes up with the algebraic solution. Then S2 pops open a notepad and executes it even though 0.1 seems plausible.

No real thought went into any step of this. I suspect the split reaction in the first bit was due to my extensive practice at doing math problems. After enough failures, I learned to stop using intuition to do math, and "invoke S2" became an automatic response.
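For reference, the algebraic route that deliberate reasoning takes can be written out explicitly (a quick sketch; exact arithmetic with fractions avoids floating-point surprises):

```python
from fractions import Fraction

# bat + ball = 1.10 and bat - ball = 1.00, so subtracting the
# equations gives ball = (total - difference) / 2.
total = Fraction(110, 100)
difference = Fraction(100, 100)

ball = (total - difference) / 2
bat = ball + difference

assert bat + ball == total and bat - ball == difference
print(f"ball = {float(ball):.2f} dollars")  # 0.05, i.e. five cents, not ten
```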

# What is voting theory?

Voting theory, also called social choice theory, is the study of the design and evaluation of democratic voting methods (that's the activists' word; game theorists call them "voting mechanisms", engineers call them "electoral algorithms", and political scientists say "electoral formulas"). In other words, for a given list of candidates and voters, a voting method specifies a set of valid ways to fill out a ballot, and, given a valid ballot from each voter, produces an outcome.

(An "electoral system" includes a voting method, but also other implementation details, such as how the candidates and voters are validated, how often elections happen and for what offices, etc. "Voting system" is an ambiguous term that can refer to a full electoral system, just to the voting method,...

Small nitpick: quadratic voting can work without money. Give each person n points which they can use to vote, instead of asking them to open their wallet and use n dollars. Points have diminishing weight when given to the same candidate, same as dollars.
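A minimal sketch of that point-based variant (my own illustration; it uses the standard quadratic-voting cost rule, where the v-th vote on one candidate costs 2v - 1 points, so v votes cost v**2 in total):

```python
def cost_of_votes(votes: int) -> int:
    """Total points spent to cast `votes` votes for one candidate.
    The marginal cost of the v-th vote is 2*v - 1, so the total is
    votes**2: each additional point buys less additional weight."""
    return sum(2 * v - 1 for v in range(1, votes + 1))  # == votes**2

def max_votes(points: int) -> int:
    """Most votes a fixed point budget can concentrate on one candidate."""
    v = 0
    while cost_of_votes(v + 1) <= points:
        v += 1
    return v

assert cost_of_votes(3) == 9
# With a 100-point budget (no wallets involved), a voter can cast at
# most 10 votes for a single candidate, or spread points more cheaply.
print(max_votes(100))  # 10
```

Whether the budget is dollars or allocated points, the quadratic cost is what produces the diminishing weight the comment describes.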

Now and then people have asked me if I think that other people should also avoid high school or college if they want to develop new ideas. This always felt to me like a wrong way to look at the question, but I didn't know a right one.

Recently I thought of a scary new viewpoint on that subject.

This started with a conversation with Arthur where he mentioned an idea by Yoshua Bengio about the software for general intelligence having been developed memetically. I remarked that I didn't think duplicating this culturally transmitted software would be a significant part of the problem for AGI development. (Roughly: low-fidelity software tends to be algorithmically shallow. Further discussion moved to comment below.)

But this conversation did get me thinking about...

Self-control is for addiction, though, and that is not the same as filter bubbles. It might be possible to be content with your life in a real long-term sense, and yet be stuck in, say, Google's filter bubble as to what information, opinions, or articles you read. At least for the average person.

Also, Zuckerberg is aware that his app might be trading off long-term user satisfaction for short-term satisfaction too heavily; there are interviews I'm too lazy to link. Companies are interested in long-term profits too :)

This essay was originally posted in 2007.

Frank Sulloway once said: “Ninety-nine per cent of what Darwinian theory says about human behavior is so obviously true that we don’t give Darwin credit for it. Ironically, psychoanalysis has it over Darwinism precisely because its predictions are so outlandish and its explanations are so counterintuitive that we think, Is that really true? How radical! Freud’s ideas are so intriguing that people are willing to pay for them, while one of the great disadvantages of Darwinism is that we feel we know it already, because, in a sense, we do.”

Suppose you find an unconscious six-year-old girl lying on the train tracks of an active railroad. What, morally speaking, ought you to do in this situation? Would it be better to leave...

"If the technology were available to gradually raise her IQ to 120, without negative side effects, would you judge it good to do so?"

To me it seems possible to give simple answers, if "without negative side effects" would actually hold.

BUT in reality this is NEVER the case! There will be a different distribution of wealth etc., and the lives of quite a few people will change (at least a bit).

Thus there are always negative side effects! Thus the question to be answered is to compare the positive effects (multiplied by the number of people (and not the power of these people... (read more)

Epistemic Status: Confident

This idea is actually due to my husband, Andrew Rettek, but since he doesn’t blog, and I want to be able to refer to it later, I thought I’d write it up here.

In many games, such as Magic: The Gathering, Hearthstone, or Dungeons and Dragons, there’s a two-phase process. First, the player constructs a deck or character from a very large sample space of possibilities.  This is a particular combination of strengths and weaknesses and capabilities for action, which the player thinks can be successful against other decks/characters or at winning in the game universe.  The choice of deck or character often determines the strategies that deck or character can use in the second phase, which is actual gameplay.  In gameplay, the character (or deck) can only...

What seems off to me is the idea that the 'player' is some sort of super-powerful incomprehensible lovecraftian optimizer. I think it's more apt to think of it as like a monkey, but a monkey which happens to share your body and have write access to the deepest patterns of your thought and feeling (see Steven Byrnes' posts for the best existing articulation of this view). It's just a monkey, its desires aren't totally alien and I think it's quite possible for one's conscious mind to develop a reasonably good idea of what it wants. That the OP prefers to push... (read more)


The following is a basically unedited summary I wrote up on March 16 of my take on Paul Christiano’s AGI alignment approach (described in “ALBA” and “Iterated Distillation and Amplification”). Where Paul had comments and replies, I’ve included them below.

I see a lot of free variables with respect to what exactly Paul might have in mind. I've sometimes tried presenting Paul with my objections and then he replies in a way that locally answers some of my question but I think would make other difficulties worse. My global objection is thus something like, "I don't see any concrete setup and consistent simultaneous setting of the variables where this whole scheme works." These difficulties are not minor or technical; they appear to me quite severe. I try...

[Eli's personal notes. Feel free to comment or ignore.]

My summary of Eliezer's overall view:

• 1. I don't see how you can get cognition to "stack" like that, short of running a Turing machine made up of the agents in your system. But if you do that, then we throw alignment out the window.
• 2. There's this strong X-and-only-X problem.
• If our agents are perfect imitations of humans, then we do solve this problem. But having perfect imitations of humans is a very high bar that depends on already having a very powerful superintelligence. And now we're just passing the

Epistemic status: trying to vaguely gesture at vague intuitions. A similar idea was explored here under the heading "the intelligibility of intelligence", although I hadn't seen it before writing this post. As of 2020, I consider this follow-up comment to be a better summary of the thing I was trying to convey with this post than the post itself.

There’s a mindset which is common in the rationalist community, which I call “realism about rationality” (the name being intended as a parallel to moral realism). I feel like my skepticism about agent foundations research is closely tied to my skepticism about this mindset, and so in this essay I try to articulate what it is.

Humans ascribe properties to entities in the world in order to describe and...

You have entirely missed the point I was making in that comment.

Of course I am aware of the standard form of the joke. I presented my modified form of the joke in the linked comment, as a deliberate contrast with the standard form, to illustrate the point I was making.

You probably already know that you can incentivize honest reporting of probabilities using a proper scoring rule like log score, but did you know that you can also incentivize honest reporting of confidence intervals?

To incentivize reporting of a confidence interval, take the score -(C + 20d), where C is the size of your confidence interval, and d is the distance between the true value and the interval. d is 0 whenever the true value is in the interval.

This incentivizes not only giving an interval that contains the true value 90% of the time, but also distributes the remaining 10% equally between overestimates and underestimates.

To keep the lower bound of the interval important, I recommend measuring C and d in log space. So if the true value is...

Is there a way to adjust this to support better scores for tighter confidence intervals?

For instance, using natural log, with a range of 8-10 and a true value of 10, I get -0.2231 whether I pick a 90% confidence interval, or a 95% confidence interval (coefficient of 40). It'd be nice if the latter scored better.
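Assuming the rule is score = -(C + k*d), with C the interval width, d the distance from the true value to the interval, and k = 2/alpha (20 for a 90% interval, 40 for 95%), here is a quick sketch reproducing the numbers in this thread:

```python
import math

def interval_score(low, high, true_value, coefficient=20.0, log_space=True):
    """Score = -(width + coefficient * miss); higher is better.
    The miss distance is 0 when the true value lies inside the interval."""
    if log_space:
        low, high, true_value = math.log(low), math.log(high), math.log(true_value)
    width = high - low
    miss = max(low - true_value, true_value - high, 0.0)
    return -(width + coefficient * miss)

# The commenter's case: interval [8, 10], true value 10. The true value
# sits on the boundary, so miss = 0 and the coefficient never enters:
# both the 90% and 95% scores are -ln(10/8) = -0.2231.
print(round(interval_score(8, 10, 10, coefficient=20), 4))  # -0.2231
print(round(interval_score(8, 10, 10, coefficient=40), 4))  # -0.2231

# The coefficient only matters on a miss, where the larger penalty
# punishes an overconfident (too-tight) interval harder:
print(round(interval_score(8, 9, 10, coefficient=40), 4))
```

So a tighter interval only scores better through the width term; the coefficient can't distinguish two reports that both cover the true value, which is exactly what the question above observes.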

1. The Consciousness Researcher and Out-Of-Body Experiences

In his book Consciousness and the Brain, cognitive neuroscientist Stanislas Dehaene writes about scientifically investigating people’s reports of their out-of-body experiences:

… the Swiss neurologist Olaf Blanke [did a] beautiful series of experiments on out-of-body experiences. Surgery patients occasionally report leaving their bodies during anesthesia. They describe an irrepressible feeling of hovering at the ceiling and even looking down at their inert body from up there. [...]
What kind of brain representation, Blanke asked, underlies our adoption of a specific point of view on the external world? How does the brain assess the body’s location? After investigating many neurological and surgery patients, Blanke discovered that a cortical region in the right temporoparietal junction, when impaired or electrically perturbed, repeatedly caused a sensation of
...

No worries. :) Getting it into the curation e-mail was probably good.

frontier64: You're doing good work with the curation and it's very effective at bringing important posts back into the reader's eye, so thanks for that! I would probably have never seen this post otherwise. I'm glad you're working on the system to iron out the kinks.

Ruby: As mentioned elsethread, I checked with authors the first couple of times. Having met no objections, I applied induction and tested whether maybe the overhead wasn't necessary.

Ruby: +1. I have been unsure how long it could persist for. There's an argument for it being present all the time it's at the top of the curated list, and if you imagine authors being proud of it, for even longer than that. But perhaps (while we're doing the low-tech thing), it should only be for sending out the email.

Epistemic Status: Simple point, supported by anecdotes and a straightforward model, not yet validated in any rigorous sense I know of, but IMO worth a quick reflection to see if it might be helpful to you.

A curious thing I've noticed: among the friends whose inner monologues I get to hear, the most self-sacrificing ones are frequently worried they are being too selfish, the loudest ones are constantly afraid they are not being heard, the most introverted ones are regularly terrified that they're claiming more than their share of the conversation, the most assertive ones are always suspicious they are being taken advantage of, and so on. It's not just that people are sometimes miscalibrated about themselves; it's as if the loudest alarm in their heads, the one...

DPiepgrass: Hmm. My biggest alarm is that I don't have enough friends and am not social enough. But this is accurate; I have not one single intellectual friend of the sort I want, and people ignore my writing. I think the reason I don't act to rectify the situation is that I don't know how.

Okay, but what about other alarms, ones I don't notice so much? Well, I have an intense desire to be right (and not just appear to be right in the eyes of anyone else), and to be knowledgeable... and so I frequent LW. I spend an unusual amount of time fretting over the accuracy and fairness of things I've posted earlier and often edit them days, weeks, or years afterward to provide more nuance and accuracy. I do this even at the risk of sounding boring and producing less writing, even though I think one needs a lot of writing to gain a following; might this interfere with my social life? Could be.

Another alarm says I took a wrong turn to end up working on oil & gas software. I donate to clean energy efforts as a form of "offsetting", but this has the disturbing opportunity cost of reducing my donations to other efforts. That alarm seems reasonable and I'm open to different opportunities. But maybe this post has exactly the problem you linked to [http://slatestarcodex.com/2014/03/24/should-you-reverse-any-advice-you-hear/]: offering good advice to those who least need it.

That sounds pretty rough.

This is harsh and may be completely off the mark, but I was trying to call attention especially to alarms where those close to you disagree. If friends and family agree that you're not social enough, then that's probably a true alarm that you're facing.

## 0.

Tl;dr: There's a similarity between these three concepts:

• A locally valid proof step in mathematics is one that, in general, produces only true statements from true statements. This is a property of a single step, irrespective of whether the final conclusion is true or false.
• There's such a thing as a bad argument even for a good conclusion. In order to arrive at sane answers to questions of fact and policy, we need to be curious about whether arguments are good or bad, independently of their conclusions. The rules against fallacies must be enforced even against arguments for conclusions we like.
• For civilization to hold together, we need to make coordinated steps away from Nash equilibria in lockstep. This requires general rules that are allowed to impose penalties
...

Lord Kelvin's careful and multiply-supported lines of reasoning arguing that the Earth could not possibly be so much as a hundred million years old, all failed simultaneously in a surprising way because that era didn't know about nuclear reactions.

I'm told that the biggest reason Kelvin was wrong was that, for many years, no one thought about there being a molten interior subject to convection:

Perry's [1895?] calculation shows that if the Earth has a conducting lid of 50 kilometers' thickness, with a perfectly convecting fluid underneath, then the me

Note: weird stuff, very informal.

Suppose I search for an algorithm that has made good predictions in the past, and use that algorithm to make predictions in the future.

I may get a "daemon," a consequentialist who happens to be motivated to make good predictions (perhaps because it has realized that only good predictors survive). Under different conditions, the daemon may no longer be motivated to predict well, and may instead make "predictions" that help it achieve its goals at my expense.

I don't know whether this is a real problem or not. But from a theoretical perspective, not knowing is already concerning--I'm trying to find a strong argument that we've solved alignment, not just something that seems to work in practice.

I am pretty convinced that daemons are a real...

I consider the argument in this post a reasonably convincing negative answer to this question---a minimal circuit may nevertheless end up doing learning internally and thereby generate deceptive learned optimizers.

This suggests a second informal clarification of the problem (in addition to Wei Dai's comment): can the search for minimal circuits itself be responsible for generating deceptive behavior? Or is it always the case that something else was the offender and the search for minimal circuits is an innocent bystander?

If the search for minimal circuits ... (read more)

This is part 30 of 30 in the Hammertime Sequence. Click here for the intro.

One of the overarching themes from CFAR, related to The Strategic Level, is that what you learn at CFAR is not a specific technique or set of techniques, but the cognitive strategy that produced those techniques. It follows that if I learned the right lessons from CFAR, then I would be able to produce qualitatively similar – if not as well empirically tested – new principles and approaches to instrumental rationality.

After CFAR, I wanted to design a test to see if I had learned the right lessons. Hammertime was that sort of test for me. Now here’s that same test for you.

# The Final Exam

I will give three essay prompts and three difficulty levels. Original ideas...

Thanks for the sequence, it was really helpful! :)

Second version, updated for the 2018 Review. See change notes.

There's a concept which many LessWrong essays have pointed at (indeed, I think it's what the entire sequences are exploring). But I don't think there's a single post really spelling it out explicitly:

You might want to become a more robust, coherent agent.

By default, humans are a kludgy bundle of impulses. But we have the ability to reflect upon our decision making, and the implications thereof, and derive better overall policies.

Some people find this naturally motivating – it's aesthetically appealing to be a coherent agent. But if you don't find it naturally appealing, the reason I think it's worth considering is robustness – being able to succeed at novel challenges in complex domains.

This is related to being instrumentally rational, but I don’t...

AllAmericanBreakfast: Have you found that this post (and the concept handle) have been useful for this purpose? Have you found that you do in fact reference it as a litmus test, and steer conversations according to the response others make?

Raemon: It's definitely been useful with people I've collaborated closely with. (I find the post a useful background while working with the LW team, for example.) I haven't had a strong sense of whether it's proven beneficial to other people. I have a vague sense that the sort of people who inspired this post mostly take this as background that isn't very interesting, or something. Possibly with a slightly different frame on how everything hangs together.

AllAmericanBreakfast: It sounds like this post functions (and perhaps was intended) primarily as a filter for people who are already good at agency, and secondarily as a guide for newbies? If so, that seems like a key point – surrounding oneself with other robust (allied) agents helps develop or support one's own agency.

I actually think it works better as a guide for newbies than as a filter. The people I want to filter on, I typically am able to have long protracted conversations about agency with them anyway, and this blog post isn't the primary way that they get filtered.