All of David_Kristoffersson's Comments + Replies

The case for aligning narrowly superhuman models

The amount of effort going into AI as a whole ($10s of billions per year) is currently ~2 orders of magnitude larger than the amount of effort going into the kind of empirical alignment I’m proposing here, and at least in the short-term (given excitement about scaling), I expect it to grow faster than investment into the alignment work.

There's a reasonable argument (shoutout to Justin Shovelain) that the risk is that work such as this done by AI alignment people will be closer to AGI than the work done by standard commercial or academic research, and th... (read more)

4Ajeya Cotra10moI'm personally skeptical that this work is better-optimized for improving AI capabilities than other work being done in industry. In general, I'm skeptical of perspectives that work that the rationalist/EA/alignment crowd does Pareto-dominates the other work going on -- that is, that it's significantly better for both alignment and capabilities than standard work, such that others are simply making a mistake by not working on it regardless of what their goals are or how much they care about alignment. I think sometimes this could be the case, but I wouldn't bet on it being a large effect. In general, I expect work optimized to help with alignment to be worse on average at pushing forward capabilities, and vice versa.
Anti-Aging: State of the Art

Unfortunately, there is no good 'where to start' guide for anti-aging. This is insane, given this is the field looking for solutions to the biggest killer on Earth today.

Low hanging fruit intervention: Create a public guide to that effect on a web site.

5JackH1yCompletely agree - we have this planned on our Oxford Society of Ageing and Longevity website ( I also plan to write a sequence on LessWrong of perhaps 10-15 posts similar to this one. Feel free to comment if you think there are specific angles you'd like me to focus on (e.g. explaining the science in more detail, discussing common philosophical objections, describing the financing of longevity biotech, etc.).
Is this viable physics?

That being said, I would bet that one would be able to find other formalisms that are equivalent after kicking down the door...

At least, we've now hit one limit in the shape of universal computation: No new formalism will be able to do something that couldn't be done with computers. (Unless we're gravely missing something about what's going on in the universe...)

Good and bad ways to think about downside risks

When it comes to the downside risk, it's often that there are more unknown unknown that produce harm then positive unknown unknown. People are usually biased to overestimate the positive effects and underestimate the negative effects for the known unknown.

This seems plausible to me. Would you like to expand on why you think this is the case?

The asymmetry between creation and destruction? (I.e., it's harder to build than it is to destroy.)

3ChristianKl2yThere are multiple reasons. Let's say you have nine different courses of action and all have utility -1. You have some error function when evaluating the utility of the actions and you think the options have utilities -5, -4, -3, -2, -1, 0, 1, 2, 3. All the negative options won't be on your mind and you will only think about doing those options that score highly. Even if you have some options that are actually benefitial if your evaluation function has enough noise, the fact that you don't put any attention on the options that score negatively means that the options that you do consider are biased. Confirmation bias will make you further want to believe that the option that you persue are positive. Most systems in our modern world are not anti-fragile and suffer if you expose them to random noise.
Good and bad ways to think about downside risks

Very good point! The effect of not taking an action depends on what the counterfactual is: what would happen otherwise/anyway. Maybe the article should note this.

mind viruses about body viruses

Excellent comment, thank you! Don't let the perfect be the enemy of the good if you're running from an exponential growth curve.

The recent NeurIPS call for papers requires authors to include a statement about the potential broader impact of their work

Looks promising to me. Technological development isn't by default good.

Though I agree with the other commenters that this could fail in various ways. For one thing, if a policy like this is introduced without guidance on how to analyze the societal implications, people will think of wildly different things. ML researchers aren't by default going to have the training to analyze societal consequences. (Well, who does? We should develop better tools here.)

5G Gordon Worley III2yAgreed, I think of this like sending a signal that at least a limited concern for safety is important. I'm sure we'll see a bunch of papers with sections addressing this that won't be great, but over time it stands some chance of more regularizing considering concerns about safety and ethics of ML work in the field such that safety work will become more accepted as valuable. So even without a lot of guidance or strong evaluative criteria, this seems a small win to me that, at worst, causes some papers to just have extra fluff sections their authors wrote to pretend to care about safety rather than ignoring it completely.
Jan Bloch's Impossible War

Or, at least, include a paragraph or a few to summarize it!

1Hivewired2yMy blog is about the only thing I have going for me at the moment, so I'd really prefer to keep my essays on my own site where I could theoretically make a little money off of them.
A point of clarification on infohazard terminology

Some quick musings on alternatives for the "self-affecting" info hazard type:

  • Personal hazard
  • Self info hazard
  • Self hazard
  • Self-harming hazard
3MichaelA2yI'd say the first, third, and fourth of those options sound too broad - they don't make it clear that this is about info. But I think something in that direction could be good (e.g., I proposed in a top-level comment "self-affecting info hazards"). I also think the term Anders Sandberg uses is acceptable. Mostly I'd just want to steer away from using a term that sounds like it obviously should mean some other specific thing (which I'd personally say is the case for "memetic hazards").
AI alignment concepts: philosophical breakers, stoppers, and distorters

I wrote this comment to an earlier version of Justin's article:

It seems to me that most of the 'philosophical' problems are going to get solved as a matter of solving practical problems in building useful AI. You could call ML systems, AI, that is getting developed now 'empirical'. From the perspective of the people building current systems, they likely don't consider what they're doing as solving philosophical problems. Symbol grounding problem? Well, an image classifier built on a convolutional neural network learns to ... (read more)

AIXSU - AI and X-risk Strategy Unconference

I expect the event to have no particular downside risks, and to give interesting input and spark ideas in experts and novices alike. Mileage will vary, of course. Unconferences foster dynamic discussion and a living agenda. If it's risky to host this event, then I expect AI strategy and forecasting meetups and discussions at EAG to be risky and they should also not be hosted.

I and other attendees of AIXSU pay careful attention to potential downside risks. I also think it's important we don't strangle open intellectual advancement. We need to... (read more)

Three Stories for How AGI Comes Before FAI
We can subdivide the security story based on the ease of fixing a flaw if we're able to detect it in advance. For example, vulnerability #1 on the OWASP Top 10 is injection, which is typically easy to patch once it's discovered. Insecure systems are often right next to secure systems in program space.

Insecure systems are right next to secure systems, and many flaws are found. Yet, the larger systems (the company running the software, the economy, etc) manage to correct somehow. It's because there are mechanisms in the larger systems poised t... (read more)

Project Proposal: Considerations for trading off capabilities and safety impacts of AI research

This seems like a valuable research question to me. I have a project proposal in a drawer of mine that is strongly related: "Entanglement of AI capability with AI safety".

A case for strategy research: what it is and why we need more of it

My guess is that the ideal is to have semi-independent teams doing research. Independence in order to better explore the space of questions, and some degree of plugging in to each other in order to learn from each other and to coordinate.

Are there serious info hazards, and if so can we avoid them while still having a public discussion about the non-hazardous parts of strategy?

There are info hazards. But I think if we can can discuss Superintelligence publicly, then yes; we can have a public discussion about non-hazardous parts of strategy.

Are there enough
... (read more)
A case for strategy research: what it is and why we need more of it

Nice work, Wei Dai! I hope to read more of your posts soon.

However I haven't gotten much engagement from people who work on strategy professionally. I'm not sure if they just aren't following LW/AF, or don't feel comfortable discussing strategically relevant issues in public.

A bit of both, presumably. I would guess a lot of it comes down to incentives, perceived gain, and habits. There's no particular pressure to discuss on LessWrong or the EA forum. LessWrong isn't perceived as your main peer group. And if you're at FHI or OpenAI, you'll have plenty contact with people who can provide quick feedback already.

A case for strategy research: what it is and why we need more of it
I'm very confused why you think that such research should be done publicly, and why you seem to think it's not being done privately.

I don't think the article implies this:

Research should be done publicly

The article states: "We especially encourage researchers to share their strategic insights and considerations in write ups and blog posts, unless they pose information hazards."
Which means: share more, but don't share if you think there are possible negative consequences of it.
Though I guess you could mean that it's very h... (read more)

1Davidmanheim3yGlad to hear that you aren't recommending strategy research in general - because that's what it looked like. And yes, I think it's incredibly hard to make sure we're not putting effort into efforts with negative expected value, and I think that attention hazards are critical, and are the biggest place where I think strategy research has the potential to increase risks rather than ameliorate them. (Which is exactly why I'm confused that anyone would suggest that more such research should be done publicly and/or shared. And it's why I don't think that a more detailed object level discussion makes sense here, in public.)
AI Safety Research Camp - Project Proposal

Yes -- the plan is to have these on an ongoing basis. I'm writing this just as the deadline was passed for the one planned to April.

Here's the web site:

The facebook is also a good place to keep tabs on it:

Beware Social Coping Strategies
Your relationship with other people is a macrocosm of your relationship with yourself.

I think there's something to that, but it's not that general. For example, some people can be very kind to others but harsh with themselves. Some people can be cruel to others but lenient to themselves.

If you can't get something nice, you can at least get something predictable

The desire for the predictable is what Autism Spectrum Disorder is all about, I hear.

I think there's something to that, but it's not that general. For example, some people can be very kind to others but harsh with themselves. Some people can be cruel to others but lenient to themselves.

Even if the behavior itself seems vastly different, that doesn't necessarily mean they aren't just different instances of the same "social program". For example, if you're "kind" to others but harsh with yourself, it might be because you don't know how to hold people accountable without being harsh, and corre... (read more)

A Fable of Science and Politics

It's bleen, without a moment's doubt.

LessWrong 2.0

Counterpoint: Sometimes, not moving means moving, because everyone else is moving away from you. Movement -- change -- is relative. And on the Internet, change is rapid.

Book Review: Naive Set Theory (MIRI research guide)

Thanks for the tip. Two other books on the subject that seem to be appreciated are Introduction to Set Theory by Karel Hrbacek and Classic Set Theory: For Guided Independent Study by Derek Goldrei.

Edit: weighs in:

Book Review: Naive Set Theory (MIRI research guide)

The author of the Teach Yourself Logic study guide agrees with you about reading multiple sources:

I very strongly recommend tackling an area of logic (or indeed any new area of mathematics) by reading a series of books which overlap in level (with the next one covering some of the same ground and then pushing on from the previous one), rather than trying to proceed by big leaps.

In fact, I probably can’t stress this advice too much, which is why I am highlighting it here. For this approach will really help to reinforce and deepen understanding as you re-encounter the same material from different angles, with different emphases.

Book Review: Naive Set Theory (MIRI research guide)

My two main sources of confusion in that sentence are:

  1. He says "distinct elements onto distinct elements", which suggests both injection and surjection.
  2. He says "is called one-to-one (usually a one-to-one correspondence)", which might suggest that "one-to-one" and "one-to-one correspondence" are synonyms -- since that is what he usually uses the parantheses for when naming concepts.

I find Halmos somewhat contradictory here.

But I'm convinced you're right. I've edited the post. Thanks.

2ThisSpaceAvailable6yIt is somewhat confusing, but remember that srujectivity is defined with respect to a particular codomain; a function is surjective if its range is equal to its codomain, and thus whether it's surjective depends on what its codomain is considered to be; every function maps its domain onto its range. "f maps X onto Y" means that f is surjective with respect to Y". So, for instance, the exponential function maps the real numbers onto the positive real numbers. It's surjective with respect to positive real numbers*. Saying "the exponential function maps real numbers onto real numbers" would not be correct, because it's not surjective with respect to the entire set of real numbers. So saying that a one-to-one function maps distinct elements onto a set of distinct elements can be considered to be correct, albeit not as clear as saying "to" rather than "onto". It also suffer from a lack of clarity in that it's not clear what the "always" is supposed to range over; are there functions that sometimes do map distinct elements to distinct elements, but sometimes don't?
Book Review: Naive Set Theory (MIRI research guide)

You guys must be right. And wikipedia corroborates. I'll edit the post. Thanks.

Welcome to Less Wrong! (7th thread, December 2014)


I'm currently attempting to read through the MIRI research guide in order to contribute to one of the open problems. Starting from Basics. I'm emulating many of Nate's techniques. I'll post reviews of material in the research guide at lesswrong as I work through it.

I'm mostly posting here now just to note this. I can be terse at times.

See you there.

Dark Arts of Rationality

First, appreciation: I love that calculated modification of self. These, and similar techniques, can be very useful if put to use in the right way. I recognize myself here and there. You did well to abstract it all out this clearly.

Second, a note: You've described your techniques from the perspective of how they deviate from epistemic rationality - "Changing your Terminal Goals", "Intentional Compartmentalization", "Willful inconsistency". I would've been more inclined to describe them from the perspective of their central eff... (read more)

MIRI's technical research agenda

And boxing, by the way, means giving the AI zero power.

No, hairyfigment's answer was entirely appropriate. Zero power would mean zero effect. Any kind of interaction with the universe means some level of power. Perhaps in the future you should say nearly zero power instead so as to avoid misunderstanding on the parts of others, as taking you literally on the "zero" is apparently "legalistic".

As to the issues with nearly zero power:

  • A superintelligence with nearly zero power could turn to be a heck of a lot more power than you expect
... (read more)
0[anonymous]7yI have read all of the resources you linked to and their references, the sequences, and just about every post on the subject here on LessWrong. Most of what passes for thinking regarding AI boxing and oracles here is confused and/or fallacious. It would be helpful if you could point to the specific argument which convinced you of this point. For the most part every argument I've seen along these lines either stacks the deck against the human operator(s), or completely ignores practical and reasonable boxing techniques. Again, I'd love to see a citation. Having a real AGI in a box is basically a ticket to unlimited wealth and power. Why would anybody risk losing control over that by unboxing? Seriously, someone owns an AGI would be paranoid about keeping their relative advantage and spend their time strengthening the box and investing in physical security.
MIRI's technical research agenda

So you disagree with the premise of the orthogonality thesis. Then you know a central concept to probe to understand the arguments put forth here. For example, check out Stuart's Armstrong's paper: General purpose intelligence: arguing the Orthogonality thesis

0[anonymous]7yI explained in my post how the orthogonality thesis as argued by Stuart Armstrong et al presents a false choice. His argument is flawed. I'm sorry I'm having trouble parsing what you are saying here...
MIRI's technical research agenda

There's no guarantee that boxing will ensure the safety of a soft takeoff. When your boxed AI starts to become drastically smarter than a human -- 10 times --- 1000 times -- 1000000 times -- the sheer enormity of the mind may slip out of human possibility to understand. All the while, a seemingly small dissonance between the AI's goals and human values -- or a small misunderstanding on our part of what goals we've imbued -- could magnify to catastrophe as the power differential between humanity and the AI explodes post-transition.

If an AI goes through the ... (read more)

0[anonymous]7yIf you want guarantees, find yourself another universe. "There's no guarantee" of anything. You're concept of a boxed AI seems very naive and uninformed. Of course a superintelligence a million times more powerful than a human would probably be beyond the capability of a human operator to manually debug. So what? Actual boxing setups would involve highly specialized machine checkers that assure various properties about the behavior of the intelligence and its runtime, in ways that truly can't be faked. And boxing, by the way, means giving the AI zero power. If there is a power differential, then really by definition it is out of the box. Regarding your last point, is is in fact possible to build an AI that is not a utility maximizer.
MIRI's technical research agenda

Mark: So you think human-level intelligence by principle does not combine with goal stability. Aren't you simply disagreeing with the orthogonality thesis, "that an artificial intelligence can have any combination of intelligence level and goal"?

2[anonymous]7yTo be clear I’ve been talking about human-like, which is a different distinction than human-level. Human-like intelligences operate similarly to human psychology. And it is demonstrably true that humans do not have a fixed set of fundamentally unchangeable goals, and human society even less so. For all its faults, the neoreactionaries get this part right in their critique of progressive society: the W-factor introduces a predictable drift in social values over time. And although people do tend to get “fixed in their ways”, it is rare indeed for a single person to remain absolutely rigidly so. So yes, in as far as we are talking about human-like intelligences, if they had fixed truly steadfast goals then that would be something which distinguishes them from humans. I don’t think the orthogonality thesis is well formed. The nature of an intelligence may indeed cause it to develop certain goals in due coarse, or for its overall goal set to drift in certain, expected if not predictable ways. Of course denying the orthogonality thesis as stated does not mean endorsing a cosmist perspective either, which would be just as ludicrous. I’m not naive enough to think that there is some hidden universal morality that any smart intelligence naturally figures out -- that’s bunk IMHO. But it’s just as naive to think that the structure of an intelligence and its goal drift over time are purely orthogonal issues. In real, implementable designs (e.g. not AIXI), one informs the other.