## LESSWRONG

Curiosity Killed the Cat and the Asymptotically Optimal Agent

The simplest version of the parenting idea includes an agent which is Bayes-optimal. Parenting would just be designed to help out a Bayesian reasoner, since there's not much you can say about how much a Bayesian reasoner will explore, or how much it will learn; it all depends on its prior. (Almost all policies are Bayes-optimal with respect to some (universal) prior.) There's still a fundamental trade-off between learning and staying safe, so while the Bayes-optimal agent does not do as bad a job of picking a point on that trade-off as the asymptotically optimal agent, that doesn't quite allow us to say that it will pick the right point. As long as we have access to "parents" that might be able to guide an agent toward world-states where this trade-off is less severe, we might as well make use of them.

And I'd say it's more a conclusion, not a main one.

Pretending to be Wise

It's usually easier to prove others wrong than prove yourself right. Showing that their beliefs are contradictory is winning, even if their belief is that the sky is blue because blue light is scattered the most due to Rayleigh scattering. Showing that this (only slightly wrong, but nonetheless contradictory) belief is contradictory does not prove the sky to be mauve, or in any way not blue.

Explain/Worship/Ignore?

"Neither true nor false..." Not so. We gather such stories and treasure them. But at the end of the day, we label them fiction (or mythology, if some portion of humanity believed them to be true at some point) and know better than to go looking for Hogwarts. We know fiction does not correspond to reality and is not part of the map; in other words, it is not true. In every sense that matters, we treat fiction as false.

All that is good and proper - as long as such works don't claim to describe factual events.

Walkthrough: The Transformer Architecture [Part 2/2]

Thanks for the feedback. As a writer I still have a lot to learn about being more clear.

Walkthrough: The Transformer Architecture [Part 2/2]

You might want to shorten your sentences by using fewer filler sentences and phrases, which make your article more confusing and longer than it needs to be. Since this is a technical article and not a fictional story, it would be good if you could get your points across in a clear and concise manner.

AIRCS Workshop: How I failed to be recruited at MIRI.

This is basically off-topic, but just for the record, regarding...

someone presented a talk where they explained how they tried and failed to model and simulate a brain of C. Elegans.... Furthermore, all of their research was done prior to them discovering AI safety stuff so it's good that no one created such a precise model of a - even if just a worm - brain.

That was me; I have never believed (at least not yet) that it’s good that the C. elegans nervous system is still not understood; to the contrary, I wish more neuroscientists were working on such a “full-stack” understanding (whole nervous system down to individual cells). What I meant to say is that I am personally no longer compelled to put my attention toward C. elegans, compared to work that seems more directly AI-safety-adjacent.

I could imagine someone making a case that understanding low-end biological nervous systems would bring us closer to unfriendly AI than to friendly AI, and perhaps someone did say such a thing at AIRCS, but I don’t recall it and I doubt I would agree. More commonly, people make the case that nervous-system uploading technology brings us closer to friendly AI in the form of eventually uploading humans—but that is irrelevant one way or the other if de novo AGI is developed by the middle of this century.

One final point: it is possible that understanding simple nervous systems gives humanity a leg up on interpretability (of non-engineered, neural decision-making), without providing new capabilities until somewhere around spider level. I don’t have much confidence that any systems-neuroscience techniques for understanding C. elegans or D. rerio would transfer to interpreting AI’s decision-making or motivational structure, but it is plausible enough that I currently consider such work to be weakly good for AI safety.

Curiosity Killed the Cat and the Asymptotically Optimal Agent

After a bit more thought, I've learned that it's hard to avoid ending back up with EU maximization - it basically happens as soon as you require that strategies be good not just on the true environment, but on some distribution of environments that reflect what we think we're designing an agent for (or the agent's initial state of knowledge about states of the world). And since this is such an effective tool at penalizing the "just pick the absolute best answer" strategy, it's hard for me to avoid circling back to it.
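A minimal sketch of that point, using a made-up two-strategy bandit: the strategy that is best in the true environment loses once strategies are scored by expected utility over a prior on environments. The environment names, payoffs, and the uniform prior are all illustrative assumptions, not taken from the post.

```python
# Hypothetical three-environment setup; all numbers are made up for illustration.
ENV_PRIOR = {"env_a": 1 / 3, "env_b": 1 / 3, "env_c": 1 / 3}
TRUE_ENV = "env_a"

# REWARDS[strategy][env] is the payoff of following `strategy` in `env`.
REWARDS = {
    "memorized_answer": {"env_a": 1.0, "env_b": 0.0, "env_c": 0.0},
    "robust_strategy": {"env_a": 0.6, "env_b": 0.6, "env_c": 0.6},
}


def expected_utility(strategy):
    """Average payoff of `strategy` under the prior over environments."""
    return sum(ENV_PRIOR[env] * REWARDS[strategy][env] for env in ENV_PRIOR)


# Scored only on the true environment, the memorized answer wins.
best_in_true_env = max(REWARDS, key=lambda s: REWARDS[s][TRUE_ENV])
# Scored on the whole distribution, it is penalized for failing elsewhere.
best_by_expected_utility = max(REWARDS, key=expected_utility)

print(best_in_true_env)
print(best_by_expected_utility)
```

Requiring good performance on the distribution is exactly what turns the comparison into expected-utility maximization here.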

Here's one possible option, though: look for strategies that are too simple to encode the one best answer in the first place. If the absolute best policy has K-complexity of 10^3 (achievable in the real world by strategies being complicated, or in the multi-armed bandit case by just having 2^1000 possible actions) and your agent is only allowed to start with 10^2 symbols, this might make things interesting.
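As a rough counting check on why the size limit bites, assume (this is my assumption, not stated above) that a "symbol" is one bit and that each starting agent is a binary string of length at most $n$:

```latex
\[
  \#\{\text{agents describable in at most } n \text{ bits}\}
  \;\le\; \sum_{k=0}^{n} 2^{k} \;=\; 2^{n+1} - 1 .
\]
\[
  n = 10^{2}: \qquad
  \frac{2^{101} - 1}{2^{1000}} \;<\; 2^{-899} .
\]
```

Under that assumption, fewer than $2^{101}$ of the $2^{1000}$ arms can be hard-coded as any starting agent's first choice, so for almost every arm no allowed agent "just knows" that it is the best one.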

Northwest Passage Update

I like it! But you know, Northwest Passage is already written as a retrospective.

Three centuries thereafter, I take passage overland
In the footsteps of brave Kelso, where his "sea of flowers" began
Watching cities rise before me, then behind me sink again
This tardiest explorer, driving hard across the plain.
And through the night, behind the wheel, the mileage clicking west
I think upon Mackenzie, David Thompson and the rest
Who cracked the mountain ramparts and did show a path for me
To race the roaring Fraser to the sea.

Because the singer is modern, the chorus "Ah, for just one time / I would take the Northwest Passage" is about wishing to identify a lonely life with the grandeur of the past. A verse about the loss of the historical arctic would tie right back into this without needing to change the chorus a jot.

How do you survive in the humanities?

The real disagreement is probably about whether the teacher would change her how-to-treat-evidence preferences if she were exposed to more information. Is her view stable, or would she see it for a confusion and mistake if she knew more, and say that she now sees things differently and more clearly?

Training Regime Day 7: Goal Factoring

I agree that that comment didn't really add that much. I was just trying to caution against the view that goal factoring was a technique for convincing yourself to take/not take certain actions. I'm not sure whether I should have spent more time discussing that though, because I'm not sure how common such a failure mode is.

Thanks for the style pointer!

Theory and Data as Constraints

Yes. At some level we need to have some type of theory to start moving the data into different piles which we can compare. But if we're theory constrained we don't see how to put any order on the data -- it's not even information at that point; it's just random noise.

But clearly we do find ways to break out of that circle.

When the constraint is the data, the intermediate constraints between data and theory are probably not as obvious; the data is not as overwhelming.

Yes, Roam was it. Thanks!

How do you survive in the humanities?

Unsurprisingly, questioning here is the path to you being burnt at the stake. Questioning is heresy.

This is about self preservation. You want a diploma, and you’re not going to get it unless you’re willing to lie about your beliefs and say the things you’re supposed to say.

I don't think OP described anything that looks like this. I don't know that it's not happening, and I don't know that it won't (though if it hasn't started after two years, I don't know why it would start now). But right now this claim seems unjustified to me.

Continuous Improvement: Insights from 'Topology'

Wrt continuity, I was implicitly just thinking of metric spaces (which are all first-countable, obviously). I’ll edit the post to clarify.

How much delay do you generally have between having a good new idea and sharing that idea publicly online?

I try to get them out there as soon as possible because I tend to do things either immediately or on the scale of months to years. lesslong.com, IRC, the like.

Continuous Improvement: Insights from 'Topology'

Very nice! Two small notes:

• The two notions of continuity (sequential continuity and topological continuity) you present under "Multivariate continuity" are not equivalent. In a sense the topology around a point can be 'too large' to recover it from just convergence of sequences (in particular, these notions are equivalent for first countable spaces (I think? Second countability is definitely enough, but I think first countability also is) but not for general topological spaces). You can fix this by replacing the sequences with nets.
• The compactifications (one-point and Stone-Cech) are very useful for classification and representation theorems, but personally I've hardly ever used them outside of that context. These compactifications are very deep mathematical results but also a bit niche.
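For reference, the notions compared in the first note can be written out as follows, for a map $f : X \to Y$ between topological spaces. These are the standard textbook definitions, not quoted from the post:

```latex
\begin{align*}
\text{continuous:} \quad
  & f^{-1}(V) \text{ is open in } X \text{ for every open } V \subseteq Y, \\
\text{sequentially continuous:} \quad
  & x_n \to x \;\Longrightarrow\; f(x_n) \to f(x), \\
\text{net characterization:} \quad
  & f \text{ continuous} \iff
    \bigl( x_\alpha \to x \;\Longrightarrow\; f(x_\alpha) \to f(x)
    \text{ for every net } (x_\alpha) \bigr).
\end{align*}
```

Continuity always implies sequential continuity. The converse holds when $X$ is first countable, which covers all metric spaces.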

I remember back when I took my course on Introduction to Topology that we spent a lot of time introducing homotopies and equivalence classes, and later the fundamental group. And then all that hard work paid off in a matter of minutes when Brouwer's fixed point theorem (on the 2-dimensional disc) was proven with these fundamental groups, which is actually one of the shorter proofs of this theorem if you already have the topological tools available.

Gary Marcus: Four Steps Towards Robust Artificial Intelligence
A team of people including Smolensky and Schmidhuber have produced better results on a mathematics problem set by combining BERT with tensor products (Smolensky et al., 2016), a formal system for representing symbolic variables and their bindings (Schlag et al., 2019), creating a new system called TP-Transformer.

Notable that the latter paper was rejected from ICLR 2020, partly for unfair comparison. It seems unclear at present whether TP-Transformer is better than the baseline transformer.

I had a dream where I was flying by incrementing my own x and y coordinates. Somewhat related to simulated worlds, but also to straight programming.

How much delay do you generally have between having a good new idea and sharing that idea publicly online?

I'll ship it when it's ready.

If I have a good idea about how to tie my shoelaces I'll share it immediately!

If I have a good idea about a foundational change in western philosophy it will take me years.

Will AI undergo discontinuous progress?

I think this is a good analysis, and I'm really glad to see this kind of deep dive on an important crux. The most clarifying thing for me was connecting old and new arguments - they seem to have more common ground than I thought.

One thing I would appreciate being added is in-text references. There are a bunch of claims here about e.g. history, evolution with no explicit reference. Maybe it seems like common knowledge, but I wasn't sure whether to believe some things, e.g.

Evolution was optimizing for fitness, and driving increases in intelligence only indirectly and intermittently by optimizing for winning at social competition. What happened in human evolution is that it briefly switched to optimizing for increased intelligence, and as soon as that happened our intelligence grew very rapidly but continuously.

Could you clarify? I thought biological evolution always optimizes for inclusive genetic fitness.

George's Shortform

# Should discomfort be a requirement for important experiences?

A while ago I was complaining to a friend about the fact that there doesn't exist some sort of sublingual DMT, with an absorption profile similar to smoking DMT, but without the rancid taste.

(Side note, there are some ways to get sublingual DMT: https://www.dmt-nexus.me/forum/default.aspx?g=posts&t=10240 , but you probably won't find it for sale at your local drug dealer and effects will differ a lot from smoking. In most experiences I've read about I'm not even convinced that the people are experiencing sublingual absorption rather than just slowly swallowing DMT with MAOIs and seeing the effects that way)

My point was something along the lines of:

I wish there was a way to get high on DMT without going through the unpleasant experience of smoking it, I'm pretty sure that experience serves to "prime" your mind to some extent and leads to a worse trip.

My friend's point was:

We are talking about one of the most reality-shattering experiences ever possible to a human brain that doesn't involve death or permanent damage, surely having a small cost of entry for that in terms of the unpleasant taste is actually a desirable side-effect.

I kind of ended up agreeing with my friend and I think most people would find that viewpoint appealing.

# But

You could make the same argument for something like knee surgery (or any life-changing surgery, which is most of them).

You are electing to do something that will alter your life forever and will result in you experiencing severe side-effects for years to come... but the step between "decide to do it" and "support major consequences" has 0 discomfort associated with it.

That's not to say knee surgery is good; much like with a DMT trip, I have a strong prior of it being good for people (well, in this case assuming that a doctor recommends you do it).

But I do find it a bit strange that this is the case with most surgery, even if it's life altering, when I think of it in light of the DMT example.

# But

If you've visited South Korea and seen the progressive note mutilation going on in their society (I'm pretty sure this has a fancier name... see some term they use in the study of super-stimuli, seagulls sitting on gigantic painted balls kinda thing), I'm pretty sure the surgery example can become blurrier.

As in, I think it's pretty easy to argue people are doing a lot of unnecessary plastic surgery, and I'm pretty sure some cost of entry (e.g. you must feel mild discomfort for 3 hours to get this done... equivalent to, say, getting a tattoo on your arm) would reduce that number a lot, and intuitively that seems like a good thing.

It's not like you could do that though, as in, in practice you can't really do "anesthesia with controlled pain level" it's either zero or operating within a huge error range (see people's subjective reports of pain after dental anesthesia with similar quantities of lidocaine).

What would you do with an Evil AI?

If I am confident that I have the original source code, as written by humans, I read that. I am looking for deep abstract principles. I am looking only for abstract ideas that are general to the field of AI.

If I can encrypt the code in a way that only a future superintelligence can crack, and I feel hopeful about FAI, I do that. Otherwise, secure erase, possibly involving anything that can slag the hard drives that is lying around.

FactorialCode's Shortform

Due to the coronavirus, masks and disinfectants are starting to run out in many locations. Still working on the mask situation, but it might be possible to make your own hand sanitizer by mixing isopropyl alcohol or ethanol with glycerol. The individual ingredients might be available even if hand sanitizer isn't. From what I gather, you want to aim for at least 90% alcohol. Higher is better.

Training Regime Day 7: Goal Factoring

General:

I've seen other discussions of this material, but the 'make sure not to do this' parts made it feel more complete:

Remember, the point of goal factoring is not to pick an action and convince yourself that it's a good/bad action. Keep your bottom line empty.

Style:

Completeness check: [the way you] check if you've written down all the goals is to pretend that you already have everything you've written down in abundance. If you've written down all the goals, then you should feel no desire to perform the action any more.

Wanting More Intellectual Stamina

Seconding this recommendation!

Wanting More Intellectual Stamina

Epistemic status: Hardcore projecting myself onto a stranger.

---

I was in college pretty recently, and I think I recognize in this question a lot of the same unhealthy attitudes that were so toxic for me in college and for the year(ish) after graduation. Like this:

I feel like I'm unable to let go of the fun-loving part of me which needs stupid entertainment. I simply cannot stay interested enough in learning and knowledge to be doing it 24/7, but I feel like this is requisite in order to be a successful thinker.

This is just not how life works. The vast majority of people, including the really successful ones, like "stupid entertainment" of one form or another. Habryka watches a lot of YouTube. Luke Muehlhauser is obsessed with corgis. Elon Musk.... smokes weed on live TV. It's not intrinsically bad to enjoy things that aren't work.

You are framing this as "I'm unable to let go of the fun-loving part of me." I think that's dangerous. Interesting and successful people still enjoy hanging out with their friends and doing things that aren't work. Staying interested in one single field 24/7 is definitely not a requisite for being a successful thinker, and in fact is probably counterproductive (see David Epstein's great book Range on this subject). Keeping yourself happy and not burned out is really important, and following your curiosity to a variety of other fields can often give you valuable perspective on your core work.

How do you guys stay interested in something (an idea or even an entire field) persistently enough to always be motivated to work on it?

(The following paragraph is probably fairly specific to the existential risk community (as compared to e.g. academia), but you did ask on LW, so, y'know. That's what you get.)

For most of the people I know who are doing really intense work, they don't stay motivated solely out of 'interest.' If Buck Shlegeris were just following his interest, he'd likely spend more time on physics and music than he does, but instead he devotes a lot of his time to MIRI because he believes in the importance of working to reduce existential risks. That's not to say he doesn't enjoy his MIRI work, just that it's not all about "staying interested." Sometimes we do things because we endorse doing them, rather than because we just want to do them. I've heard of some rationalists who claim to have integrated all of the subcategories of their personality (to use your term), but these people are by far the exception rather than the rule.

Is it unrealistic to hope to always be motivated by your curiosity?

Yes and no. There might be times when you're just devouring everything you can on a topic – I remember in high school I used to spend Sundays at my friend's house with all the other girls in my calculus class, doing extra credit work for fun, and then I would go to math team competitions after school and talk with my friends about proofs at lunch. I think there are academics who are also like this – in particular, some professors seem to just want to talk about their field all the time, and they seem to really enjoy it. Maybe it's possible to intentionally cultivate that level of sustained enthusiasm, but if so I don't know how to do it, and I wouldn't count on it as your only motivator. Curiosity can drive your choice of field and keep you excited about it on medium timescales, but not minute to minute.

I like my job quite a lot, but there are plenty of days when I don't feel intrinsically motivated to do it. Days when what I really want is to do housework or practice some song on the guitar or go for a long walk in the forest. But I do my work anyway, because I've committed to do it – because there would be consequences if I just didn't show up to work, because my coworkers (who I really like) would have to shoulder the burden I left, because my financial security is tied to it. Curiosity is a lovely motivator if you have it, but external commitments are much more reliable.

Will I burn myself out if I devote my free-time to extracurricular reading?

Not if you still allow time for other things that provide you with value! (See the recommendation of goal factoring below.) And especially not if you read because you're following your interest, rather than because you think you 'should' (see also p.167 here). I read like it's a religion and it often gives me energy rather than draining it. I'm a 'technical writer for software in the streets, rationalist in the sheets' with a degree in physics, but I read about whatever I want – currently that's mostly urban design, nutrition, and evolution. I love reading. But if I'm not into a book, I'll drop it. I think you should generally not perform mental violence in order to get yourself to do things... although being in school probably makes that hard.

---

Recommendations: A fair amount has been written on LW about the value of rest; see the Slack and the Sabbath sequence for a good start. I also recommend looking into CFAR's technique of goal factoring, where you try to get at the reasons why you're really doing something. (See also the Hammertime post and the CFAR handbook). Not to write the bottom line for you, but I expect you'll find that things like hanging out with your friends are providing you with value that you couldn't get by spending all your time studying.

Scott Alexander's wanting vs. liking vs. approving framework also seems relevant here (though, spoiler alert, it's kind of a confusing mess if you actually try to pin down what he means by each word.)

Also extremely relevant: Eliezer's On Doing the Impossible.

---

George's Shortform

Hmh, I actually did not think of that one all-important bit. Yep, what I described as a "meta model for Dave's mind" is indeed a "meta model for human minds" or at least a "meta model for American minds" into which I plugged some Dave-specific observations.

I'll have to re-work this at some point with this in mind, unless there's already something much better on the subject out there.

But again, I'll excuse this with having been so tired when I wrote this that I didn't even remember I did until your comment reminded me about it.

How much delay do you generally have between having a good new idea and sharing that idea publicly online?

I rarely share ideas online (I'm working on that); when I do the ideas tend to be "small" observations or models, the type I can write out quickly and send. ~10mins - 1 day after I have it.

You are an optimizer. Act like it!

I directionally agree - much of the time I can benefit by thinking a bit more about what I'm optimizing, and acting in a more optimal fashion. But I don't think this is universally applicable.

In the long run, optimizers win.

Well, no. Most optimizers fail. Many optimizers are only seeking short-term measurable outcomes, and the long run makes them irrelevant (or dead).

Tessellating Hills: a toy model for demons in imperfect search

Hmm, the inherent 1d nature of the visualization kinda makes it difficult to check for selection effects. I'm not convinced that's actually what's going on here. 1725 is special because the ridges of the splotch function are exactly orthogonal to x0. The odds of this happening probably go down exponentially with dimensionality. Furthermore, with more dakka, one sees that the optimization rate drops dramatically after ~15000 time steps, and may or may not do so again later. So I don't think this proves selection effects are in play. An alternative hypothesis is simply that the process gets snagged by the first non-orthogonal ridge it encounters, without any serious selection effects coming into play.

How much delay do you generally have between having a good new idea and sharing that idea publicly online?

Mine is probably much longer than it should be. Although I also have some reasons not to share them right away, such as needing them to make a good first impression.

I probably haven't shared most of them yet, so the delay is probably at least a few years currently.

Attainable Utility Preservation: Empirical Results

Decreases or increases?

Decreases. Here, the "human" is just a block which paces back and forth. Removing the block removes access to all states containing that block.

1. Is "Model-free AUP" the same as "AUP stepwise"?

Yes. See the paper for more details.

2. Why does "Model-free AUP" wait for the pallet to reach the human before moving, while the "Vanilla" agent does not?

I'm pretty sure it's just an artifact of the training process and the penalty term. I remember investigating it in 2018 and concluding it wasn't anything important, but unfortunately I don't recall the exact explanation.

I wonder how this interacts with environments where access to states is always closing off. (StarCraft, Go, Chess, etc. - though it's harder to think of how state/agent are 'contained' in these games.)

It would still try to preserve access to future states as much as possible with respect to doing nothing that turn.

Is the code for the SafeLife PPO-AUP stuff you did on github?

Here. Note that we're still ironing things out, but the preliminary results have been pretty solid.
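To make "with respect to doing nothing that turn" concrete, here is a rough sketch of a stepwise penalty. It assumes the penalty is the average absolute change in auxiliary Q-values relative to a no-op action. The Q-values, the action names, and the `lam` weight are hypothetical and hand-written for illustration; this is not the SafeLife code.

```python
def aup_penalty(q_aux, action, noop="noop"):
    """Mean absolute change in auxiliary Q-values versus doing nothing this turn.

    q_aux: list of dicts, one per auxiliary reward, mapping action -> Q-value
           in the current state (hypothetical numbers, see below).
    """
    return sum(abs(q[action] - q[noop]) for q in q_aux) / len(q_aux)


def aup_reward(task_reward, q_aux, action, lam=1.0):
    """Task reward minus the scaled penalty; `lam` is an assumed trade-off weight."""
    return task_reward - lam * aup_penalty(q_aux, action)


# Two made-up auxiliary goals. Removing the block ("bump_human") cuts off
# states both goals could otherwise reach, so their Q-values drop.
q_aux = [
    {"noop": 1.0, "step_aside": 1.0, "bump_human": 0.25},
    {"noop": 0.5, "step_aside": 0.5, "bump_human": 0.0},
]

for action in ("noop", "step_aside", "bump_human"):
    print(action, aup_penalty(q_aux, action))
```

Because the difference is taken in absolute value, this form penalizes increases in attainable utility as well as decreases; in the block example above, the change happens to be a decrease.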

Editor Mini-Guide

I'm reasonably confident the word "bignote" doesn't matter here (and nor does "longnote"), it's just the word chosen in that example. I just tested with "note" and it worked fine.

I do have some confusion here. It looks to me like the bignote and longnote examples are the same apart from that word. So if you tried one and it didn't work, then tried the other and it did, I don't know what else you would have changed. Do you happen to remember?

Welcome to Less Wrong! (2012)

Elon Musk is wrong: Robotaxis are stupid. We need standardized rented autonomous tugs to move customized owned unpowered wagons.

Great thinking! However, the trouble is that this doesn't exclude non-autonomous motive sources. For example, why can't you just rent a tug driven by a person for your personal wagon? Additionally, there is a dichotomy of configurability vs availability of motion. You would seemingly have to wait on a tug to go anywhere. Cool thoughts though!

What do you make of AGI:unaligned::spaceships:not enough food?

One big difference is that "having enough food" admits a value function ("quantity of food") that is both well understood and for the most part smooth and continuous over the design space, given today's design methodology (if we try to design a ship with a particular amount of food and make a tiny mistake it's unlikely that the quantity of food will change that much). In contrast, the "how well is it aligned" metric is very poorly understood (at least compared with "amount of food on a spaceship") and a lot more discontinuous (using today's techniques of designing AIs, a tiny error in alignment is almost certain to cause catastrophic failure). Basically - we do not know what exactly it means to get it right, and even if we knew, we do not know what the acceptable error tolerances are, and even if we knew, we do not know how to meet them. None of that applies to the amount of food on a spaceship.

Welcome to Less Wrong! (2012)

Hello lesswrong community!

"Who am I?" I am a Network Engineer, who once used to know a bit of math (sadly, not anymore). Male, around 30, works in IT, atheist - I think I'll blend right in.

"How did I discover lesswrong?" Like the vast majority, I discovered lesswrong after reading HPMOR many years ago. It remains my favourite book to this day. HPMOR and the Sequences taught me a lot of new ideas and, more importantly, put what I already knew into a proper perspective. By the time HPMOR was finally finished, I was no longer sure where my worldview happened to coincide with Mr. Yudkowsky's, and where it was shaped by him entirely. This might be due to me learning something new, or a mixture of wishful thinking, hindsight bias and the illusion of transparency, I don't know. I know this - HPMOR nudged me from nihilism to the much rosier and downright cuddly worldview of optimistic nihilism, for which I will be (come on singularity, come on singularity!) eternally grateful.

"When did I become a rationalist?" I like to think of myself as rational in my day-to-day, but I would not describe myself as a rationalist - by the same logic that says a white belt doesn't get to assume the title of master for showing up. Or have I mixed those up and "rational" is the far loftier description?

"Future plans?" I am now making a second flyby over the Sequences, this time with comments. I have a few ideas for posts that might be useful to someone and a 90% complete plotline for an HPMOR sequel (Eliezer, you magnificent bastard, did you have to tease a Prologue?!!!).

Looking forward to meeting some of you (or anyone, really) in the comments and may we all survive this planet together.

Tessellating Hills: a toy model for demons in imperfect search

That's very cool, thanks for making it. At first I was worried that this meant that my model didn't rely on selection effects. Then I tried a few different random seeds, and some, like 1725, didn't show demon-like behaviour. So I think we're still good.

Attainable Utility Preservation: Empirical Results
Bumping into the human makes them disappear, reducing the agent's control over what the future looks like. This is penalized.

Decreases or increases?

AUP (starting state) fails here, but AUP (stepwise) does not.

Questions:

1. Is "Model-free AUP" the same as "AUP stepwise"?

2. Why does "Model-free AUP" wait for the pallet to reach the human before moving, while the "Vanilla" agent does not?

There is one weird thing that's been pointed out, where stepwise inaction while driving a car leads to not-crashing being penalized at each time step. I think this is because you need to use an appropriate inaction rollout policy, not because stepwise itself is wrong. ↩︎

That might lead to interesting behavior in a game of chicken.

I wonder how this interacts with environments where access to states is always closing off. (StarCraft, Go, Chess, etc. - though it's harder to think of how state/agent are 'contained' in these games.)

To be frank, this is crazy. I'm not aware of any existing theory explaining these results, which is why I proved a bajillion theorems last summer to start to get a formal understanding (some of which became the results on instrumental convergence and power-seeking).

Is the code for the SafeLife PPO-AUP stuff you did on github?

[AN #80]: Why AI risk might be solved without additional intervention from longtermists
see above about trying to conform with the way terms are used, rather than defining terms and trying to drag everyone else along.

This seems odd given your objection to "soft/slow" takeoff usage and your advocacy of "continuous takeoff" ;)

Theory and Data as Constraints
However, the act of consuming the data is still costly for most of us. As romeo notes, when we are wandering through the fields of our unknown unknowns it looks very random (I also attributed that idea to you), so how do we get any patterns to emerge?
While part of the pattern recognition stems from some underlying theory, new patterns will be found as one starts organizing the data, and then the pattern can start to be understood by thinking about potential relationships that explain the connections.

There used to be an exhibit at Epcot on "the pattern of progress" which I think pointed to the same thing you're pointing to here. There's a short video from it which I really like; it breaks "progress" down into a five-step pattern:

• Seeing - i.e. obtaining data
• Mapping - organizing the data and noticing patterns
• Understanding - figuring out a gears-level model
• Belief - using the model to make plans
• Action - actually doing things based on the model

Breaking things into steps is always a bit cheesy, but I do think there's a valuable point in here: there's an intermediate step between seeing the data and building a gears-level model. I think that's what you're pointing to: there's a need to organize the data and slice it in various ways so you can notice patterns - i.e. mapping, in the colloquial sense of the word.

Does that sound right?

There was an online tool someone here mentioned a year or so back. I've totally forgotten the name; basically it was a better set of note cards for bits of information that could then be linked.

Possibly Roam?

[AN #80]: Why AI risk might be solved without additional intervention from longtermists
Does this make sense to you?

Yeah, that makes sense. Your points about "bio" not being short for "biological" were valid, but the fact that I, as a listener, didn't know this suggests that it's really easy to mess up the language usage here. I'm starting to think that the real fight should be about using terms that aren't self-explanatory.

Have you actually observed it being used in ways that you fear (and which would be prevented if we were to redefine it more narrowly)?

I'm not sure whether it would have been prevented by using the term more narrowly, but in my experience the most common reaction people outside of EA/LW (and even sometimes within) have to hearing about AI risk is to assume that it's not technical, and to assume that it's not about accidents. In that sense, I have been exposed to quite a bit of this already.

What do you make of AGI:unaligned::spaceships:not enough food?
Similarly, if this were the only problem, then people would just put more effort into determining whether an AGI is aligned before turning it on, or not build them.

The traditional arguments for why AGI could go wrong imply that AGI could go wrong even if you put an immense amount of effort into trying to patch errors. In machine learning, when we validate our models, we will ideally do so in an environment that we think matches the real world, but it's common for the real world to turn out to be subtly different. In the extreme case, you could perform comprehensive testing and verification and still fail to properly assess the real world impact.

If the cost of properly ensuring safety is arbitrarily high, there is a point at which people will begin deploying unsafe systems. This is inevitable, unless you could somehow either ban computer hardware or stop AI research insights from proliferating.

Theory and Data as Constraints
Were you the one who made the point that when you don't understand something it doesn't look mysterious and suggestive, it looks random?

Yup, that's from my review of Design Principles of Biological Circuits.

What might it look like to systematize the search strategy that returns blindspots?

A few years ago I wrote about one strategy for this, based on an example I ran into in the wild. We had some statistics on new user signups for an app; day-to-day variation in signup rate looked random. Assuming that each user decides whether to sign up independently of all the other users, the noise in total signup count should be ~ $\sqrt{\text{number of signups}}$ (ignoring a constant factor). But the actual day-to-day variability was way larger than that - therefore there had to be some common factor influencing people. We had identified an unknown unknown. (Turned out, our servers didn't have enough capacity, and would sometimes get backed up. Whenever that happened, signups dropped very low. So we added servers, and signup rate improved.)

The link talks a bit about how to generalize that strategy, although it's still far from a universal technique.
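
The signup check described above can be sketched in a few lines. This is only an illustration under made-up assumptions: the function names, the 2000 potential users, the 5% signup probability, and the "outage every fifth day" pattern are all hypothetical and not taken from the original data. It compares the variance of daily counts to their mean; if users decide independently, that ratio should sit near 1, and a much larger ratio hints at a common factor.

```python
import random
import statistics


def simulate_daily_signups(rng, n_days, n_users, p_signup,
                           outage_every=None, p_during_outage=0.0):
    """Simulate daily signup counts where each user decides independently.

    If outage_every is set, every outage_every-th day uses the lower
    probability p_during_outage instead (a shared factor hitting all users).
    """
    counts = []
    for day in range(n_days):
        outage = outage_every is not None and day % outage_every == 0
        p = p_during_outage if outage else p_signup
        counts.append(sum(rng.random() < p for _ in range(n_users)))
    return counts


def dispersion_index(counts):
    """Variance divided by mean; roughly 1 for independent, rare signups."""
    return statistics.pvariance(counts) / statistics.mean(counts)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical numbers: 2000 potential users, 5% signup chance per day.
independent = simulate_daily_signups(rng, n_days=200, n_users=2000, p_signup=0.05)
with_outages = simulate_daily_signups(rng, n_days=200, n_users=2000, p_signup=0.05,
                                      outage_every=5, p_during_outage=0.005)

print("independent users: ", round(dispersion_index(independent), 2))
print("with shared outages:", round(dispersion_index(with_outages), 2))
```

The first ratio should come out close to 1 and the second far above it, which is the signal that something other than independent user decisions is driving the variation.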

Making Sense of Coronavirus Stats

I certainly agree, but that information will only be known with a much longer delay than either the case fatality rate (which will initially be overestimated) or the infection rate (which will be underestimated). So that doesn't really help with how we should initially react to any new outbreak. It seems like we want to understand the data that is available early in order to assess the risks and therefore the policy actions. How we present the data (and I don't get to see what any of the big bureaucracies use) seems to matter. This may be because subject experts are the ones who actually generate the data, but non-experts have to understand the implications.

I would really like to see COVID-19 used as a case study for the Information Hazards theory.

Open & Welcome Thread - February 2020

An observation on natural language being illogical: I've noticed that at least some native Chinese speakers use 不一定 (literally "not certain") to mean "I disagree", including when I say "I think there's a 50% chance that X." At first I was really annoyed with the person doing that ("I never said I was certain!") but then I noticed another person doing it, so now I think it's just a standard figure of speech at this point, and I'm just generally annoyed at ... cultural evolution, I guess.

Why SENS makes sense
Unfair dismissals

I found that section a useful summary which didn't require a lot of background. Parts that stood out, without quoting the entire thing:

OP’s claim number one: Open Philanthropy's list of selected topics and the SENS' plan differ in focus.
...
If Open Philanthropy had said that what SRF is funding right now differs in focus with their list of selected topics, I would agree.
OP’s claim number two: Open Philanthropy, unlike SRF, doesn't claim that progress on the topics they identified would be sufficient to make aging negligible in humans.
What SRF claims is that solving all the seven categories will probably lead to lifespans longer than the current maximum. After that, what other forms of damages will appear is not known, but at that point, those additional damages may be cured (maybe through a SENS 2.0 panel of therapies) during the time "bought" by the first therapies and through their improvement.
Aubrey de Grey can often be heard making another claim that may prove confusing. He says: "Since no other damage has been discovered in decades, it is more and more probable that the SENS list is complete". "Complete" here means that it is the complete list of things that go wrong in a normal human lifespan. It's clear that we currently can't acquire direct data about what will go wrong after the current maximum human lifespan is exceeded.
Why SENS makes sense
Under a total utilitarian view, it is probably second or third after existential risk mitigation.
[...]
I can count at least three times in which non-profits operating under the principles of Effective Altruism have acknowledged SENS and then dismissed it without good reasons.

I once read a comment on the effective altruism subreddit that tried to explain why aging didn't get much attention in EA despite being so important, and I thought it was quite enlightening. Supporting anti-aging research requires being weird across some axes, but not others. You have to be against something that most people think is normal, natural and inevitable while at the same time being short-termist and human-focused.

People who are weird across all axes will generally support existential risk mitigation, or moral circle expansion, depending on their ethical perspective. If you're short-termist but weird in other regards, then you generally will help factory-farmed animals or wild animals. If you are not weird across all axes, you will support global health interventions.

I want to note that I support anti-aging research, but I tend to take a different perspective than most EAs do. On a gut level, if something is going to kill me, my family, my friends, everyone I know, everyone on Earth if they don't get killed by something else first, and probably do so relatively soon and in a quite terrible way, I think it's worth investing in a way to defeat that. This gut-level reaction comes before any calm deliberation, but it still seems compelling to me.

My ethical perspective is not perfectly aligned with a long-termist utilitarian perspective, and being a moral anti-realist, I think it's OK to sometimes support moral causes that don't necessarily have a long-term impact. Using similar reasoning, I come to the conclusion that we should be nice to others and we should help our friends and those around us when possible, even when these things are not as valuable from a long-termist perspective.

Gary Marcus: Four Steps Towards Robust Artificial Intelligence

To clarify, I had first read the "the whole point of having knowledge" sentence in light of the fact that he wants to hardcode knowledge into our systems, and from that point of view it made more sense. I am re-reading and it's not the best comparison admittedly. The rest of the paper still echoes the general vibe of not doing random searches for answers, and leveraging our human understanding to yield some sort of robustness.

Exercises in Comprehensive Information Gathering

I'm also a big fan of this. I have gotten huge mileage out of creating a single-page timeline of 1600-1800. I've got a few books lined up to create 1800-2000 and 1400-1800, but they are unfortunately low on my priority list at the moment. I would highly recommend it: you see what was happening in the world when the first academic journals were published. And 1600-1800 is such a fascinating time: the scientific and industrial revolutions, the Age of Enlightenment, the colonial empires and world trade.

The other one I have found a lot of value in is reading through Cochrane/Campbell reviews (high-quality meta-studies with readable summaries). There is a summary list of some useful ones here (I can't remember who I got it from, but thanks, whoever you are!): https://docs.google.com/spreadsheets/d/19D8JUgf95t-f-oUAHqh8Nn2G90KO3gUiua9yAjBSSqI/edit?usp=sharing

Gary Marcus: Four Steps Towards Robust Artificial Intelligence
At one point he echoes concerns about future systems based on deep learning that sound faintly similar to those expressed in the Rocket Alignment Problem.

The quoted paragraph does not sound like the Rocket Alignment problem to me. It seems to me that the quoted paragraph is arguing that you need to have systems that are robust, whereas the Rocket Alignment problem argues that you need to have a deep understanding of the systems you build. These are very different: I suspect the vast majority of AI safety researchers would agree that you need robustness, but you can get robustness without understanding, e.g. I feel pretty confident that AlphaZero robustly beats humans at Go, even though I don't understand what sort of reasoning AlphaZero is doing.

(A counterargument is that we understand how the AlphaZero training algorithm incentivizes robust gameplay, which is what rocket alignment is talking about, but then it's not clear to me why the rocket alignment analogy implies that we couldn't ever build aligned AI systems out of deep learning.)

Attainable Utility Preservation: Empirical Results

It appears to me that a more natural adjustment to the stepwise impact measurement in Correction than appending waiting times would be to make Q also incorporate AUP. Then instead of comparing "Disable the Off-Switch, then achieve the random goal whatever the cost" to "Wait, then achieve the random goal whatever the cost", you would compare "Disable the Off-Switch, then achieve the random goal with low impact" to "Wait, then achieve the random goal with low impact".

This has been an idea I’ve been intrigued by ever since AUP came out. My main concern with it is the increase in compute required and loss of competitiveness. Still probably worth running the experiments.

The scaling term makes R_AUP vary under adding a constant to all utilities. That doesn't seem right. Try a transposition-invariant normalization? (Or generate benign auxiliary reward functions in the first place.)

Correct. Proposition 4 in the AUP paper guarantees penalty invariance to affine transformation only if the denominator is also the penalty for taking some action (absolute difference in Q values). You could, for example, consider the penalty of some mild action: $|Q(s, a_{\text{mild}}) - Q(s, \varnothing)|$. It’s really up to the designer in the near-term. We’ll talk about more streamlined designs for superhuman use cases in two posts.
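
To make the invariance point concrete, here is a small sketch. It assumes the penalty has the form "absolute difference in Q-values between an action and inaction, divided by a scale term", and compares two hypothetical choices of scale: the inaction Q-value itself, and the penalty of a mild action. The Q-values, action names, and function names are invented for illustration and are not taken from the AUP paper or code.

```python
def penalty_scaled_by_noop_value(q, action, noop="noop"):
    """Penalty |Q(s,a) - Q(s,noop)| divided by Q(s,noop) itself."""
    return abs(q[action] - q[noop]) / q[noop]


def penalty_scaled_by_mild_action(q, action, mild="mild", noop="noop"):
    """Penalty |Q(s,a) - Q(s,noop)| divided by the penalty of a mild action."""
    return abs(q[action] - q[noop]) / abs(q[mild] - q[noop])


def affine(q, scale, shift):
    """Apply Q -> scale * Q + shift to every action's Q-value (scale > 0)."""
    return {a: scale * v + shift for a, v in q.items()}


# Hypothetical auxiliary Q-values for one state.
q = {"noop": 2.0, "mild": 2.5, "disable_off_switch": 5.0}
q_shifted = affine(q, scale=1.0, shift=10.0)   # add a constant to all utilities
q_rescaled = affine(q, scale=3.0, shift=10.0)  # general positive affine map

for name, table in [("original", q), ("shifted", q_shifted), ("rescaled", q_rescaled)]:
    print(name,
          penalty_scaled_by_noop_value(table, "disable_off_switch"),
          penalty_scaled_by_mild_action(table, "disable_off_switch"))
```

In this toy setup the first column changes when a constant is added, while the second stays fixed under both transformations, because numerator and denominator are both differences of Q-values and pick up the same positive factor.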

Is there an environment where this agent would spuriously go in circles?

Don’t think so. Moving generates tiny penalties, and going in circles usually isn’t a great way to accrue primary reward.

Goal-directed = Model-based RL?
About the "right hand rule" agent, I feel it depends on whether it is a hard-coded agent or a learning agent.

Yes, I meant the hard-coded one. It still seems somewhat goal-directed to me.

do you see goal-directedness as a continuous spectrum, as a set of zones on this spectrum, or as a binary threshold on this spectrum?

Oh, definitely a continuous spectrum. (Though I think several people disagree with me on this, and see it more like a binary-ish threshold. Such people often say things like "intelligence and generalization require some sort of search-like cognition". I don't understand their views very well.)

What do the baby eaters tell us about ethics?

Sorry this is so late. I haven't been on the site for a while. My last post was in reply to no interference always being better than fighting it out. Most of the characters seem to think that stopping the baby eaters has more utility than letting the superhappies do the same thing to us would cost.

Making Sense of Coronavirus Stats

Death rates are not the only thing we should be worried about. SARS led to long-term problems for survivors:

Forty percent [of studied SARS survivors] reported some degree of chronic fatigue and 27 percent met diagnostic criteria for chronic fatigue syndrome; people with fatigue symptoms were also more likely than those without them to have psychiatric disorders. For comparison, far less than one percent of Americans met chronic fatigue syndrome criteria, according to the U.S. Centers for Disease Control and Prevention, although many more than that have symptoms.

It's important to know to what extent similar problems might appear with this coronavirus.

Attainable Utility Preservation: Empirical Results

It appears to me that a more natural adjustment to the stepwise impact measurement in Correction than appending waiting times would be to make Q also incorporate AUP. Then instead of comparing "Disable the Off-Switch, then achieve the random goal whatever the cost" to "Wait, then achieve the random goal whatever the cost", you would compare "Disable the Off-Switch, then achieve the random goal with low impact" to "Wait, then achieve the random goal with low impact".

The scaling term makes R_AUP vary under adding a constant to all utilities. That doesn't seem right. Try a transposition-invariant normalization? (Or generate the auxiliary goals already normalized.)

Is there an environment where this agent would spuriously go in circles?

What do you make of AGI:unaligned::spaceships:not enough food?

What I pointed out was that the spaceship examples had very specific features:

• Both personal and economic incentives are against the issue.
• The problem is obvious when one is confronted with the situation.
• At the point where the problem becomes obvious, you can still solve it.

My intuition is that the main disanalogies with the AGI case are the first one (at least the economic incentives that might push people to try dangerous things when the returns are potentially great) and the last one, depending on your position on takeoffs.

Goal-directed = Model-based RL?

About the "right hand rule" agent, I feel it depends on whether it is a hard-coded agent or a learning agent. If it is hard-coded, then clearly it doesn't require a model. But if it learns such a rule, I would assume it was inferred from a learned model of what mazes are.

For the non-adaptive agent, you say it is less goal-directed; do you see goal-directedness as a continuous spectrum, as a set of zones on this spectrum, or as a binary threshold on this spectrum?

Theory and Data as Constraints

I liked the extension of your taut-slack constraints to the theory-data setting. I think you are correct that people are still working through that shift.

"Data is now very cheap, so consume a lot of it and see what happens" is a bit more problematic to me. There certainly is a lot of truth to the old saying that there is no seeing without looking. In one sense the data is cheap -- it is just there and in many ways not an economic good any longer.

However, the act of consuming the data is still costly for most of us. As romeo notes, when we are wandering through the fields of our unknown unknowns it looks very random (I also attributed that idea to you), so how do we get any patterns to emerge?

While part of the pattern recognition stems from some underlying theory, new patterns will be found as one starts organizing the data, and then the pattern can start to be understood by thinking about potential relationships that explain the connections.

There was an online tool someone here mentioned a year or so back. I've totally forgotten the name; basically it was a better set of note cards for bits of information that could then be linked. You get a nice graph forming (searchable, I believe, on edges, not merely on phrase/subject/category/word). If that were a collaborative tool (it might be), it might be a slack constraint for bringing up unseen patterns in the data (reducing the cost of consuming it). The edges might be color-coded, allowing multiple edges between nodes based on some categorization/classification of the relationship, and then filtering on color (though it might also be interesting to look at possible patterns in the defined edges too).

Why Science is slowing down, Universities and Maslow's hierarchy of needs

Viktor Frankl found that the need for self-actualization or meaning was strong in internment during the Second World War, where basic needs often weren't fulfilled, and that it decided who made it out alive.

When it comes to the claim that the hierarchy doesn't exist, Wikipedia links to the Atlantic, which in turn links to Louis Tay et al., which says:

In addition, the associations of SWB [subjective well being] with the fulfillment of specific needs were largely independent of whether other needs were fulfilled.
[AN #80]: Why AI risk might be solved without additional intervention from longtermists
I ask because you're one of the most prolific participants here but don't fall into one of the existing "camps" on AI risk for whom I already have good models.

Seems right, I think my opinions fall closest to Paul's, though it's also hard for me to tell what Paul's opinions are. I think this older thread is a relatively good summary of the considerations I tend to think about, though I'd place different emphases now. (Sadly I don't have the time to write a proper post about what I think about AI strategy -- it's a pretty big topic.)

The current situation seems to be that we have two good (relatively clear) terms "technical accidental AI risk" and "AI-caused x-risk" and the dispute is over what plain "AI risk" should be shorthand for. Does that seem fair?

Yes, though I would frame it as "the ~5 people reading these comments have two clear terms, while everyone else uses a confusing mishmash of terms". The hard part is in getting everyone else to use the terms. I am generally skeptical of deciding on definitions and getting everyone else to use them, and usually try to use terms the way other people use terms.

In other words I don't think this is strong evidence that all 4 people would endorse defining "AI risk" as "technical accidental AI risk". It also seems notable that I've been using "AI risk" in a broad sense for a while and no one has objected to that usage until now.

Agreed with this, but see above about trying to conform with the way terms are used, rather than defining terms and trying to drag everyone else along.

Curiosity Killed the Cat and the Asymptotically Optimal Agent

It is interesting to note that AIXI, a Bayes-optimal reinforcement learner in general environments, is not asymptotically optimal [Orseau, 2010], and indeed, may cease to explore [Leike et al., 2015]. Depending on its prior and its past observations, AIXI may decide at some point that further exploration is not worth the risk. Given our result, this seems like reasonable behavior.

Given this, why is your main conclusion "Perhaps our results suggest we are in need of more theory regarding the 'parenting' of artificial agents" instead of "We should use Bayesian optimality instead of asymptotic optimality"?

Open & Welcome Thread - February 2020

Copy-pasting a followup to this with Robin Hanson via DM (with permission).

Robin: You can of course suspect people of many things using many weak clues. But you should hold higher standards of evidence when making public accusations that you say orgs should use to fire people, cancel speeches, etc.

Me: My instinct is to support/agree with this, but (1) it's not an obvious interpretation of what you tweeted and (2) I think we need to understand why the standards of evidence for making public accusations and for actual firing/canceling have fallen so low (which my own comment didn't address either) and what the leverage points are for changing that, otherwise we might just be tilting at windmills when we exhort people to raise those standards (or worse, making suicide charges, if we get lumped with "public enemies").

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

I agree that this is troubling, though I think it’s similar to how I wouldn’t want the term biorisk to be expanded ...

Well, as I said, natural language doesn't have to be perfectly logical, and I think "biorisk" is somewhat in that category, but there's an explanation that makes it a bit more reasonable than it might first appear, which is that the "bio" refers not to "biological" but to "bioweapon". This is actually one of the definitions that Google gives when you search for "bio": "relating to or involving the use of toxic biological or biochemical substances as weapons of war. 'bioterrorism'"

I guess the analogous thing would be if we start using "AI" to mean "technical AI accidents" in a bunch of phrases, which feels worse to me than the "bio" case, maybe because "AI" is a standalone word/acronym instead of a prefix? Does this make sense to you?

Not to say that’s what you are doing with AI risk. I’m worried about what others will do with it if the term gets expanded.

But the term was expanded from the beginning. Have you actually observed it being used in ways that you fear (and which would be prevented if we were to redefine it more narrowly)?

Will AI undergo discontinuous progress?

Rohin Shah told me something similar.

This quote seems to be from Rob Bensinger.

How do you survive in the humanities?

As Dagon said, learning empathy and humility is always a good idea. You don't have to believe your teacher or condone their views or practices, but that's a different issue.

Why Science is slowing down, Universities and Maslow's hierarchy of needs

Can you provide references, specify what's wrong with Maslow's hierarchy, and/or supply a superior model?

Theory and Data as Constraints

> which means people will repeatedly be hit in the face by unknown unknowns.

Were you the one who made the point that when you don't understand something it doesn't look mysterious and suggestive, it looks random? So it's a wicked problem because you don't realize there's something you can do about it. I hadn't ever had the thought before that behavioral economics is the attempt to systematize blindspots. What might it look like to systematize the search strategy that returns blindspots? One strategy I've found is crossing the idea of sentence stem completion with maslow-ish questions about important areas of life.

romeostevensit's Shortform

This also applies to books

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

I agree that this is troubling, though I think it's similar to how I wouldn't want the term biorisk to be expanded to include biodiversity loss (a risk, but not the right type), regular human terrorism (humans are biological, but it's a totally different issue), zombie uprisings (they are biological, but it's totally ridiculous), alien invasions etc.

Not to say that's what you are doing with AI risk. I'm worried about what others will do with it if the term gets expanded.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Also, isn't defining "AI risk" as "technical accidental AI risk" analogous to defining "apple" as "red apple" (in terms of being circular/illogical)? I realize natural language doesn't have to be perfectly logical, but this still seems a bit too egregious.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

But I am optimistic about the actual risks that you and others argue for.

Why? I actually wrote a reply that was more questioning in tone, and then changed it because I found some comments you made where you seemed to be concerned about the additional AI risks. Good thing I saved a copy of the original reply, so I'll just paste it below:

I wonder if you would consider writing an overview of your perspective on AI risk strategy. (You do have a sequence but I'm looking for something that's more comprehensive, that includes e.g. human safety and philosophical problems. Or let me know if there's an existing post that I've missed.) I ask because you're one of the most prolific participants here but don't fall into one of the existing "camps" on AI risk for whom I already have good models. It's happened several times that I see a comment from you that seems wrong or unclear, but I'm afraid to risk being annoying or repetitive with my questions/objections. (I sometimes worry that I've already brought up some issue with you and then forgot your answer.) It would help a lot to have a better model of you in my head and in writing so I can refer to that to help me interpret what the most likely intended meaning of a comment is, or to predict how you would likely answer if I were to ask certain questions.

It’s notable that AI Impacts asked for people who were skeptical of AI risk (or something along those lines) and to my eye it looks like all four of the people in the newsletter independently interpreted that as accidental technical AI risk in which the AI is adversarially optimizing against you (or at least that’s what the four people argued against).

Maybe that's because the question was asked in a way that indicated the questioner was mostly interested in technical accidental AI risk? And some of them may be fine with defining "AI risk" as "AI-caused x-risk" but just didn't have the other risks on the top of their minds, because their personal focus is on the technical/accidental side. In other words I don't think this is strong evidence that all 4 people would endorse defining "AI risk" as "technical accidental AI risk". It also seems notable that I've been using "AI risk" in a broad sense for a while and no one has objected to that usage until now.

I would certainly support having clearer definitions and terminology if we could all agree on them.

The current situation seems to be that we have two good (relatively clear) terms "technical accidental AI risk" and "AI-caused x-risk" and the dispute is over what plain "AI risk" should be shorthand for. Does that seem fair?

Jan Bloch's Impossible War
That is not true

Nitpick -- for replies like this, it's helpful if you say which part of the parent comment you're objecting to.

Obviously the reader can figure it out from the rest of your comment, but (especially since I didn't immediately recognize CSA as referring to the Confederate States of America) I wasn't sure what your first sentence was saying. A quote of the offending sentence from the parent comment would have been helpful.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists
It seems worth clarifying that you're only optimistic about certain types of AI safety problems.

Tbc, I'm optimistic about all the types of AI safety problems that people have proposed, including the philosophical ones. When I said "all else equal those seem more likely to me", I meant that if all the other facts about the matter are the same, but one risk affects only future people and not current people, that risk would seem more likely to me because people would care less about it. But I am optimistic about the actual risks that you and others argue for.

That said, over the last week I have become less optimistic specifically about overcoming race dynamics, mostly from talking to people at FHI / GovAI. I'm not sure how much to update though. (Still broadly optimistic.)

it seems that when you wrote the title of this newsletter "Why AI risk might be solved without additional intervention from longtermists" you must have meant "Why some forms of AI risk ...", or perhaps certain forms of AI risk just didn't come to your mind at that time.

It's notable that AI Impacts asked for people who were skeptical of AI risk (or something along those lines) and to my eye it looks like all four of the people in the newsletter independently interpreted that as accidental technical AI risk in which the AI is adversarially optimizing against you (or at least that's what the four people argued against). This seems like pretty strong evidence that when people hear "AI risk" they now think of technical accidental AI risk, regardless of what the historical definition may have been. I know certainly that is my default assumption when someone (other than you) says "AI risk".

I would certainly support having clearer definitions and terminology if we could all agree on them.

Goal-directed = Model-based RL?
Since you say that goal-directed behavior is not about having a model or not, is it about the form of the model? Or about the use of the model?

I'm thinking that there may not be any model. Consider for example an agent that solves (simply connected) mazes by implementing the right hand rule: such an agent seems at least somewhat goal-directed, but it's hard for me to see a model anywhere in this agent.
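
For concreteness, the right-hand-rule agent can be sketched as below. The maze layout, the `right_hand_rule` name, and the grid encoding are hypothetical choices made for illustration. The point of the sketch is that the agent's only state is its position and heading: it stores no map and predicts nothing, yet it reaches the goal in a simply connected maze.

```python
# Headings in clockwise order: up, right, down, left (row delta, col delta).
HEADINGS = [(-1, 0), (0, 1), (1, 0), (0, -1)]

# Hypothetical simply connected maze: '#' is wall, anything else is open.
MAZE = [
    "#######",
    "#S  # #",
    "# # # #",
    "# #   #",
    "# ###G#",
    "#######",
]
START, GOAL = (1, 1), (4, 5)


def right_hand_rule(maze, start, goal, heading=1, max_steps=1000):
    """Walk with the right hand on the wall; return visited cells or None."""
    def is_open(cell):
        r, c = cell
        return 0 <= r < len(maze) and 0 <= c < len(maze[r]) and maze[r][c] != "#"

    pos, path = start, [start]
    for _ in range(max_steps):
        if pos == goal:
            break
        # Fixed preference relative to the current heading:
        # turn right, go straight, turn left, turn back.
        for turn in (1, 0, -1, 2):
            new_heading = (heading + turn) % 4
            dr, dc = HEADINGS[new_heading]
            nxt = (pos[0] + dr, pos[1] + dc)
            if is_open(nxt):
                heading, pos = new_heading, nxt
                path.append(pos)
                break
        else:
            return None  # no open neighbour at all
    return path if pos == goal else None


path = right_hand_rule(MAZE, START, GOAL)
print(path)
```

Nothing in the loop refers to where the goal is or what the maze looks like beyond the four neighbouring cells, which is why it is hard to point at anything that plays the role of a model.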

Would a model-based agent that did not adapt its model when the environment changed be considered as not goal-directed (like the lookup-table agent in your example)?

Yeah, I think that does make it less goal-directed.

How do you survive in the humanities?

If it's any consolation, they probably take their own statements less literally than you do, and so it's less important that they're incoherent than you might think. They'll mostly end up acting and deciding by copying others, which works pretty well in general (see: The Secret Of Our Success).

landfish lab

I don't expect OS vendors are more aligned, but it might be a more achievable political goal to get them aligned, since there's a smaller number of them. (I'm not sure if this is true, just a hypothesis)

Eight Short Studies On Excuses

A potential solution for appeasing other students and preventing them from faking Sports Fandom -- while still accommodating a Sports Fan's reasonable situation -- is to give the Sports Fan an extra assignment to complete. This would dissuade other students from turning in their paper late (because they would want to avoid having to do extra work), but would satisfy the Sports Fan since they would do anything to be able to see their team, band, etc.

The teacher would still have to have strict guidelines for this accommodation: 1) the request would have to be deemed reasonable; 2) the assignment couldn't be too easy, or many students would take advantage of it; 3) the extension for the original assignment couldn't be too accommodating, just long enough to give the student the time they lost from attending the event. But this could be a conceivable solution to this problem.

landfish lab

It seems weird to expect that OS vendors are particularly more aligned with your preferences than app vendors are. You actually have more control over apps: it's possible to use different ones without building your own hardware and writing your own drivers. Don't like the bundle of behaviors that an app presents? Don't use it. There are fewer OSes to choose from, and they tend to group together harder-to-replicate functionality in a way that you can't really pick and choose very well.

I'm totally with you that I don't much care for the way current social media platforms (including apps and data-handling outside of apps) work, but I'm not sure what the alternative is, for things where almost everyone I want to interact with is captured by them, and there's no coordination point to change it. Compare with limited choice in options on a political ballot - I hate it, but I don't think the equilibrium has a good leverage point to improve.

George's Shortform

I'd agree that this is useful to think on, but I tend to use "meta model" to mean "a model of how to build and apply models across distinct people", and your example of abstracting Dave's preferences is just another model for him, not all that meta.

I might suggest you call it an "abstract model" or an "explainable model". In fact, if they make the same predictions, they're equally powerful, but one is more compressible and easier to transmit (and examine in your head).

Suspiciously balanced evidence

A big part of the answer for me is something like this Scott Alexander post about the probability of X within your model versus the probability that your model is miscalibrated in a relevant way. Given how shaky our models of the world are, this alone makes it hard for me to push past 99% on many questions, especially those that require predicting human decisions.
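
The arithmetic behind that ceiling can be sketched as follows. The function name `overall_confidence`, the example numbers, and the choice of 0.5 as the fallback probability when the model is broken are all assumptions made for illustration, not figures from the linked post.

```python
def overall_confidence(p_given_model, p_model_ok, p_given_broken=0.5):
    """Total probability of X, mixing the in-model estimate with a fallback.

    p_given_model  : probability of X according to the model
    p_model_ok     : probability that the model is not badly miscalibrated
    p_given_broken : probability of X if the model is wrong (assumed 0.5)
    """
    return p_model_ok * p_given_model + (1 - p_model_ok) * p_given_broken


# Even a model that says "one in a million chance of being wrong" cannot
# push the total past 99% if there is a 2% chance the model itself is off.
print(overall_confidence(0.999999, 0.98))
```

With these assumed inputs the result stays just under 0.99, however confident the model itself is.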

Training Regime Day 6: Seeking Sense

> I'm not advising people to drop their items in an attempt to discover new uses for them

Yes, you are not.

> This should have prompted me to search harder for a way to use it more effectively.

I think 'dropping things' is one, perhaps inefficient, way of doing that.

And it makes a good metaphor. If you try things differently, or try new things, they might not work the first time. (Or ever - we remember the Apollo missions and the Wright Brothers because they succeeded.)

Dropping items in an attempt to discover new uses for them, drawn out over 27 lines:

If you take something apart, you might learn.

But it might break.

So if you dropped it and it broke would that be really inconvenient, or easily replaced?

If something falls it might break.

There might be an opportunity to learn.

To put the pieces back together well.

But there is risk in things falling.

And breaking.

Sometimes they break forever.[1]

There is less risk in taking things apart.

But we don't do it very often.

And sometimes we stop before finishing, because we're afraid of breaking things.[2]

But if something is easily replaced

And we're not afraid of breaking it

Then we might learn something by taking it apart.

If it breaks it breaks.

If we learned something, we learned something.

If we learn a better way of doing or making things, we learn a better way of doing or making things.

Is a broken thing too high a price to pay?

For knowledge?

For a chance to learn a better way?[3]

[1] You might have to learn, how to make glue (red link).

[2] If this isn't you, then this...isn't you.

[3] Even if it takes more than one thing broken?

Until you find a way

to put it back together.

Until you find, another way/how, to use it.

Tessellating Hills: a toy model for demons in imperfect search

Now this is one of the more interesting things I've come across.

I fiddled around with the code a bit and was able to reproduce the phenomenon with DIMS = 1, making visualisation possible:

Behold!

Here's the code I used to make the plot:

import torch
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d  # registers the '3d' projection on older matplotlib

DIMS = 1            # number of dimensions that xn has
WSUM = 5            # number of waves added together to make a splotch
EPSILON = 0.10      # rate at which xn controls splotch strength
TRAIN_TIME = 5000   # number of iterations to train for
LEARN_RATE = 0.2    # learning rate
MESH_DENSITY = 100  # number of points to plot in 3d mesh (if applicable)

torch.random.manual_seed(1729)

# knlist and k0list are integers, so the splotch functions are periodic
knlist = torch.randint(-2, 3, (DIMS, WSUM, DIMS))  # wavenumbers : list (controlling dim, wave id, k component)
k0list = torch.randint(-2, 3, (DIMS, WSUM))        # the x0 component of wavenumber : list (controlling dim, wave id)
slist = torch.randn((DIMS, WSUM))                  # sin coefficients for a particular wave : list (controlling dim, wave id)
clist = torch.randn((DIMS, WSUM))                  # cos coefficients for a particular wave : list (controlling dim, wave id)

# initialize x0, xn
# (assumed: both start at the origin; the initialization lines are missing above)
x0 = torch.zeros(1, requires_grad=True)
xn = torch.zeros(DIMS, requires_grad=True)

# numpy arrays for plotting:
x0_hist = np.zeros((TRAIN_TIME,))
xn_hist = np.zeros((TRAIN_TIME, DIMS))
loss_hist = np.zeros((TRAIN_TIME,))

def model(xn, x0):
    # loss = splotch-weighted foreground term minus x0
    wavesum = torch.sum(knlist*xn, dim=2) + k0list*x0
    splotch_n = torch.sum(
        (slist*torch.sin(wavesum)) + (clist*torch.cos(wavesum)),
        dim=1)
    foreground_loss = EPSILON * torch.sum(xn * splotch_n)
    return foreground_loss - x0

# train:
for t in range(TRAIN_TIME):
    print(t)
    loss = model(xn, x0)
    loss.backward()
    with torch.no_grad():
        # constant step size gradient descent, with some noise thrown in
        # (assumed update rule: unit-length gradient step plus Gaussian noise)
        vlen = torch.sqrt(x0.grad*x0.grad + torch.sum(xn.grad*xn.grad))
        noise_scale = (1.0 + DIMS) ** -0.5
        x0 -= LEARN_RATE*(x0.grad/vlen + torch.randn(1)*noise_scale)
        xn -= LEARN_RATE*(xn.grad/vlen + torch.randn(DIMS)*noise_scale)
    # clear accumulated gradients before the next step
    x0.grad.zero_()
    xn.grad.zero_()
    x0_hist[t] = x0.item()
    xn_hist[t] = xn.detach().numpy()
    loss_hist[t] = loss.item()

plt.plot(x0_hist)
plt.xlabel('number of steps')
plt.ylabel('x0')
plt.show()
for d in range(DIMS):
    plt.plot(xn_hist[:, d])
plt.xlabel('number of training steps')
plt.ylabel('xn')
plt.show()

fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot3D(x0_hist, xn_hist[:, 0], loss_hist)

# plot loss landscape
if DIMS == 1:
    x0_range = np.linspace(np.min(x0_hist), np.max(x0_hist), MESH_DENSITY)
    xn_range = np.linspace(np.min(xn_hist), np.max(xn_hist), MESH_DENSITY)
    x, y = np.meshgrid(x0_range, xn_range)
    z = np.zeros((MESH_DENSITY, MESH_DENSITY))
    # loop variables are named *_val so they do not shadow the trained x0, xn
    for i, x0_val in enumerate(x0_range):
        for j, xn_val in enumerate(xn_range):
            z[j, i] = model(torch.tensor(xn_val), torch.tensor(x0_val)).item()
    ax.plot_surface(x, y, z, color='orange', alpha=0.3)
ax.set_title("loss")
plt.show()

How do you survive in the humanities?

Epistemology is a team sport (consilience). Adversarial strategies for such are a consent-based sport.

You seem to be modeling the profession of teaching as people who are authorized to say true things saying them to students. Teaching is only weakly entangled with epistemology on a practical basis.

Training Regime Day 7: Goal Factoring

> This process is a lot like just writing a pro/cons list. Although plain pro/con lists are more useful than people give them credit for, I think that the crucial addition is trying to figure out different actions to take to get what you want.

Good point!

I think of it this way: pros and cons are reusable between goals, and it's worth learning the general sorts of structure that pros and cons (and their generation) have. Doing this, your sense of the 'pro con space' and how it connects to your longer term goals will improve. You'll find yourself making more modular choices such that overall there is less wasted motion when it turns out you need to modify your sense of the goal or method. In the pedagogy literature, a lot of this falls under the heading of 'multifinal goals and means.' This also suggests a complementary practice of method factoring.

Eliezer Yudkowsky Facts

> Eliezer is not a high school dropout

Nah. He never even got as far as high school, in order to drop out.

The Intelligent Social Web

I'm glad to have helped. :)

I'll answer the rest by PM. Diving into Integral Theory here strikes me as a bit off topic (though I certainly don't mind the question).

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

I appreciate the arguments, and I think you've mostly convinced me, mostly because of the historical argument.

I do still have some remaining apprehension about using AI risk to describe every type of risk arising from AI.

> I want to include philosophical failures, as long as the consequences of the failures flow through AI, because (aside from historical usage) technical problems and philosophical problems blend into each other, and I don't see a point in drawing an arbitrary and potentially contentious border between them.

That is true. The way I see it, UDT is definitely on the technical side, even though it incorporates a large amount of philosophical background. When I say technical, I mostly mean "specific, uses math, has clear meaning within the language of computer science" rather than a more narrow meaning of "is related to machine learning" or something similar.

My issue with arguing for philosophical failure is that, as I'm sure you're aware, there's a well known failure mode of worrying about vague philosophical problems rather than more concrete ones. Within academic philosophy, the majority of discussion surrounding AI is centered around consciousness, intentionality, whether it's possible to even construct a human-like machine, whether they should have rights etc.

There's a unique thread of philosophy that arose from LessWrong, which includes work on decision theory, that doesn't focus on these thorny and low priority questions. While I'm comfortable with you arguing that philosophical failure is important, my impression is that the overly philosophical approach used by many people has done more harm than good for the field in the past, and continues to do so.

It is therefore sometimes nice to tell people that the problems that people work on here are concrete and specific, and don't require doing a ton of abstract philosophy or political advocacy.

> I don't think this is a good argument, because even within "accidental technical AI risk" there are different problems that aren't equally worthwhile to solve, so why aren't you already worried about outsiders thinking all those problems are equally worthwhile?

This is true, but my impression is that when you tell people that a problem is "technical" it generally makes them refrain from having a strong opinion before understanding a lot about it. "Accidental" also reframes the discussion by reducing the risk of polarizing biases. This is a common theme in many fields:

• Physicists sometimes get frustrated with people arguing about "the philosophy of the interpretation of quantum mechanics" because there's a large subset of people who think that, since it's philosophical, you don't need any subject-level expertise to talk about it.
• Economists try to emphasize that they use models and empirical data, because a lot of people think that their field of study is more-or-less just high status opinion + math. Emphasizing that there are real, specific models that they study helps to reduce this impression. Same with political science.
• A large fraction of tech workers are frustrated about the use of Machine Learning as a buzzword right now, and part of it is that people started saying Machine Learning = AI rather than Machine Learning = Statistics, and so a lot of people thought that even if they don't understand statistics, they can understand AI since that's like philosophy and stuff.

Scott Aaronson has said:

> But I’ve drawn much closer to the community over the last few years, because of a combination of factors: [...] The AI-risk folks started publishing some research papers that I found interesting—some with relatively approachable problems that I could see myself trying to think about if quantum computing ever got boring. This shift seems to have happened at roughly around the same time my former student, Paul Christiano, “defected” from quantum computing to AI-risk research.

My guess is that this shift in his thinking occurred because a lot of people started talking about technical risks from AI, rather than framing it as a philosophy problem, or a problem of eliminating bad actors. Eliezer has shared this viewpoint for years, writing in the CEV document,

> Warning: Beware of things that are fun to argue.

reflecting the temptation to derail discussions about technical accidental risks.

How do you survive in the humanities?

I would make the same argument for a Scientology class[1]. You can and should learn empathy and humility, and one of the best ways is interaction with people with very different beliefs and models than you. You don't have to agree with them, you don't have to use their mechanisms directly, but you can and should identify how those mechanisms work for them, and understand that you'll probably need some mechanisms for yourself that aren't perfectly self-legible.

[1] Except the actual torture and brainwashing parts. If sleep deprivation or overt threats of violence are part of the class, you should probably just get out.

Open & Welcome Thread - February 2020

Offering 100-300h of technical work on an AI Safety project

I am a deep learning engineer (2 years' experience). I currently develop vision models for satellite images, and I also do some software engineering around that (LinkedIn profile: https://www.linkedin.com/in/maxime-riche-73696182/). In my spare time, I am organizing a local EA group in Toulouse (France), learning RL, doing a research project on RL for computer vision (only expecting indirect utility from this), and developing an EAA (Effective Animal Advocacy) tool. I have been in the French EA community for 4 years. In 2020, I chose to work part time to dedicate 2 to 3 days per week to EA-aligned projects. Thus, for the next 8 months, I have ~10h/week that I want to dedicate to assisting an AI safety project. For myself, I am not looking for funds, nor to publish a paper or a blog post myself.

To me, the ideal project would be:

• a relevant technical AI safety project (research or not). I am looking for advice on the "relevant" part.
• where I would be able to help the project to achieve better quality results than otherwise without my contribution. (e.g. through writing better code, doing more experiments, testing other designs)
• where my contribution would include writing code. If it is a research proposal, I would implement the experiments. If there is no experimental part currently in the project, I could take charge of creating one.

On unfixably unsafe AGI architectures

I agree that AGI is more omni-use than bioweapons and thus will be harder to get people not to develop and use. I think our prospects look pretty bleak in this scenario, but it's not completely hopeless.

For human cloning, what I had in mind was a nation cloning its smartest individuals for the purpose of having better science/tech. Think of what the US could have accomplished if they had 10,000 Von Neumanns instead of 1.

Making Sense of Coronavirus Stats

Believable, considering that people are often contagious for the flu up to three or four days after they recover and kids can be contagious for even longer after they recover from it.

Making Sense of Coronavirus Stats

I'm combining that analysis with another preprint that went into more extensive, higher-N tissue bank data and found no correlation of ACE2 expression with ethnicity or gender.

To top it off with Iran, now we have local authorities saying it's in many cities and TWO confirmed international travelers who caught it in Iran over the last few weeks (in Canada and Lebanon). That is the smoking gun; I'm calling thousands of cases there as of now.

I'm starting to suspect I won't be getting to that conference this June...

Making Sense of Coronavirus Stats

I could, but I don't think it matters.

First, most here (and I don't disagree) are saying the numbers are all incorrect anyhow, so using a different calculation accomplishes nothing.

One is still left with the question of what the defined population should be. Moreover, I don't see why one cannot define the population to be those who are infected, so it is not clear to me that this is inconsistent with the definition. We should also ask whether there should be multiple defined populations, which would make quoting a single mortality rate largely vacuous. (Something I clearly did not address initially either.)

Even if it is not correct to call the numbers I generated a mortality rate, it seems sensible to have some sense of how dangerous the situation is, and the generic rate definition you linked to really doesn't give much insight into that.

landfish lab

I recently did a quick Google Scholar search which convinced me of this, but I was lazy when finding a source for you :).

The Google Scholar search convinced me, but it's totally OK to disbelieve. After all, who is to say non-replications will replicate :).

Making Sense of Coronavirus Stats

Data always says something unless it's randomly generated. At the very least, Chinese data provides lower bounds on some things. You can get somewhat better estimates if you model their incentives (though the lying will greatly increase the uncertainty and complexity of any model).