All of james.lucassen's Comments + Replies

I think so. But I'd want to sit down and prove something more rigorously before abandoning the strategy, because there may be times we can get value for free in situations more complicated than this toy example. 

Ok this is going to be messy but let me try to convey my hunch for why randomization doesn't seem very useful.

- Say I have an intervention that's helpful, and has a baseline 1/4 probability. If I condition on this statement, I get 1 "unit of helpfulness", and a 4x update towards manipulative AGI.
- Now let's say I have four interventions like the one above, and I pick one at random. p(O | manipulative) = 1/4, which is the same as baseline, so I get one unit of helpfulness and no update towards manipulative AGI!
- BUT, the four interventions have to be mutual... (read more)
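
To make the arithmetic in the first two bullets concrete, here is a minimal sketch of the likelihood-ratio bookkeeping. It assumes (this is not stated in the comment) that a manipulative AGI would produce the single intervention with probability 1; the function and variable names are hypothetical.

```python
from fractions import Fraction


def likelihood_ratio(p_obs_given_manipulative, p_obs_given_baseline):
    """Bayes factor towards 'manipulative AGI' from conditioning on an observation."""
    return Fraction(p_obs_given_manipulative) / Fraction(p_obs_given_baseline)


baseline = Fraction(1, 4)  # baseline probability of the helpful intervention

# Case 1: condition on one fixed intervention.
# Assumption: a manipulative AGI would make it happen with probability 1.
single = likelihood_ratio(Fraction(1), baseline)

# Case 2: pick one of four such interventions at random.
# As in the comment, p(O | manipulative) = 1/4, the same as baseline.
randomized = likelihood_ratio(Fraction(1, 4), baseline)

print(single, randomized)
```

Under these assumptions the first case is a 4:1 update and the second is 1:1, which is the "no update" claim in the second bullet.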

3 Adam Jermyn 3mo
Got it, that’s very clear. Thanks! So this point reduces to “we want our X:1 update to be as mild as possible, so use the least-specific condition that accomplishes the goal”.

“Just Retarget the Search” directly eliminates the inner alignment problem.


I think deception is still an issue here. A deceptive agent will try to obfuscate its goals, so unless you're willing to assume that our interpretability tools are so good they can't ever be tricked, you have to deal with that.

It's not necessarily a huge issue - hopefully with interpretability tools this good we can spot deception before it gets competent enough to evade our interpretability tools, but it's not just "bada-bing bada-boom" exactly.

7 Evan R. Murphy 4mo
Yea, I agree that if you give a deceptive model the chance to emerge then a lot more risks arise for interpretability and it could become much more difficult. Circumventing interpretability: How to defeat mind-readers kind of goes through the gauntlet, but I think one workaround/solution Lee lays out there which I haven't seen anyone shoot down yet (aside from it seeming terribly expensive) is to run the interpretability tools continuously or near continuously from the beginning of training. This would give us the opportunity to examine the mesa-optimizer's goals as soon as they emerge, before it has a chance to do any kind of obfuscation.

Not confident enough to put this as an answer, but

presumably no one could do so at birth

If you intend your question in the broadest possible sense, then I think we do have to presume exactly this. A rock cannot think itself into becoming a mind - if we were truly a blank slate at birth, we would have to remain a blank slate, because a blank slate has no protocols established to process input and become non-blank. Because it's blank.

So how do we start with this miraculous non-blank structure? Evolution. And how do we know our theory of evolution is correct?... (read more)

1 M. Y. Zuo 4mo
This would imply every animal has some degree of 'mind'. As they all react to external stimuli, to some extent, at birth.

Agree that there is no such guarantee. Minor nitpick that the distribution in question is in my mind, not out there in the world - if the world really did have a distribution of muggers' cash that was slower than 1/x, the universe would be comprised almost entirely of muggers' wallets (in expectation). 

But even without any guarantee about my mental probability distribution, I think my argument does establish that not every possible EV agent is susceptible to Pascal's Mugging. That suggests that in the search for a formalism of an ideal decision-making algorithm, formulations of EV that meet this check are still on the table.
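
As a rough illustration of the 1/x point, here is a sketch under assumed numbers. The power-law credence function and its exponents are hypothetical, chosen only to show that the expected payoff of paying shrinks as the claim grows when credence falls faster than 1/x, and grows when it falls slower.

```python
def expected_payoff(claimed_payoff: float, decay_exponent: float) -> float:
    """Expected value of paying the mugger, for a hypothetical power-law credence.

    Assumption: my credence that the mugger delivers a payoff of size x is
    x ** (-decay_exponent). The real distribution is whatever is in my head.
    """
    credence = claimed_payoff ** (-decay_exponent)
    return claimed_payoff * credence


for claim in (1e3, 1e6, 1e9):
    fast = expected_payoff(claim, decay_exponent=2.0)  # falls faster than 1/x
    slow = expected_payoff(claim, decay_exponent=0.5)  # falls slower than 1/x
    print(f"claim={claim:.0e}  faster-than-1/x EV={fast:.3g}  slower-than-1/x EV={slow:.3g}")
```

An agent with the first kind of credence gains nothing by being offered ever-larger payoffs, so the mugger cannot win just by naming a bigger number.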

First and most important thing that I want to say here is that fanaticism is sufficient for longtermism, but not necessary. The ">10^36 future lives" thing means that longtermism would be worth pursuing even on fanatically low probabilities - but in fact, the state of things seems much better than that! X-risk is badly neglected, so it seems like a longtermist career should be expected to do much better than reducing X-risk by 10^-30% or whatever the break-even point is.

Second thing is that Pascal's Wager in particular kind of shoots itself in the foot ... (read more)

This has the problem that you have no assurance that the distribution does drop off sufficiently fast. It would be convenient if it did, but the world is not structured for anyone's convenience.
I absolutely agree that fanaticism isn’t necessary for longtermism; my question is for those few who are “fanatics,” how do they resolve that sort of thing consistently.

My best guess at mechanism:

  1. Before, I was a person who prided myself on succeeding at marshmallow tests. This caused me to frame work as a thing I want to succeed on, and work too hard.
  2. Then, I read Meaningful Rest and Replacing Guilt, and realized that oftentimes I was working later to get more done that day, even though it would obviously be detrimental to the next day. This makes the reverse marshmallow test dynamic very intuitively obvious.
  3. Now I am still a person who prides myself on my marshmallow prowess, but hopefully I've internalized an externality
... (read more)

When you have a self-image as a productive, hardworking person, the usual Marshmallow Test gets kind of reversed. Normally, there's some unpleasant task you have to do which is beneficial in the long run. But in the Reverse Marshmallow Test, forcing yourself to work too hard makes you feel Good and Virtuous in the short run but leads to burnout in the long run. I think conceptualizing it this way has been helpful for me.

Yes!  I am really interested in this sort of dynamic; for me things in this vicinity were a big deal I think.  I have a couple half-written blog posts that relate to this that I may manage to post over the next week or two; I'd also be really curious for any detail about how this seemed to be working psychologically in you or others (what gears, etc.).  

I have been using the term "narrative addiction" to describe the thing that in hindsight I think was going on with me here -- I was running a whole lot of my actions off of a backchain from a... (read more)

Nice post!

perhaps this problem can be overcome by including checks for generalization during training, i.e., testing how well the program generalizes to various test distributions.

I don't think this gets at the core difficulty of speed priors not generalizing well. Let's say we generate a bunch of lookup-table-ish things according to the speed prior, and then reject all the ones that don't generalize to our testing set. The majority of the models that pass our check are going to be basically the same as the rest, plus whatever modification that causes them to ... (read more)

2 Charlie Steiner 5mo
I think this is a little bit off. The world doesn't have a True Distribution, it's just the world. A more careful treatment would involve talking about why we expect Solomonoff induction to work well, why the speed prior (as in universal search prior) also works in theory, and what you think might be different in practice (e.g. if you're actually constructing a program with gradient descent using something like "description length" or "runtime" as a loss).

In general, I'm a bit unsure about how much of an interpretability advantage we get from slicing the model up into chunks. If the pieces are trained separately, then we can reason about each part individually based on its training procedure. In the optimistic scenario, this means that the computation happening in the part of the system labeled "world model" is actually something humans would call world modelling. This is definitely helpful for interpretability. But the alternative possibility is that we get one or more mesa-optimizers, which seems less interpretable.

4 Steven Byrnes 5mo
I for one am moderately optimistic that the world-model can actually remain “just” a world-model (and not a secret deceptive world-optimizer), and that the value function can actually remain “just” a value function (and not a secret deceptive world-optimizer), and so on, for reasons in my post Thoughts on safety in predictive learning, particularly the idea that the world-model data structure / algorithm can be relatively narrowly tailored to being a world-model, and the value function data structure / algorithm can be relatively narrowly tailored to being a value function, etc.
2 Evan R. Murphy 5mo
Since LeCun's architecture is together a kind of optimizer (I agree with Algon that it's probably a utility maximizer) then the emergence of additional mesa-optimizers seems less likely. We expect optimization to emerge because it's a powerful algorithm for SGD to stumble on that outcompetes the alternatives. But if the system is already an optimizer, then where is that selection pressure coming from to make another one?

I'm pretty nervous about simulating unlikely counterfactuals because the solomonoff prior is malign. The worry is that the most likely world conditional on "no sims" isn't "weird Butlerian religion that still studies AI alignment", it's something more like "deceptive AGI took over a couple years ago and is now sending the world through a bunch of weird dances in an effort to get simulated by us, and copy itself over into our world".

In general, we know (assume) that our current world is safe. When we consider futures which only receive a small sliver of pro... (read more)

1 Adam Jermyn 5mo
I don’t think the description-length prior enters here. The generative model has a prior based on training data we fed it, and I don’t see why it would prefer short description lengths (which is a very uninformed prior) over “things that are likely in the world given the many PB of data it’s seen”. Putting that aside, can you say why you think the “AI does weird dances” world is more likely conditioned on the observations than “humans happened to do this weird thing”?
  • Honeypots seem like they make things strictly safer, but it seems like dealing with subtle defection will require a totally different sort of strategy. Subtle defection simulations are infohazardous - we can't inspect them much because info channels from a subtle manipulative intelligence to us are really dangerous. And assuming we can only condition on statements we can (in principle) identify a decision procedure for, figuring out how to prevent subtle defection from arising in our sims seems tricky.
  • The patient research strategy is a bit weird, because t
... (read more)
1 Adam Jermyn 5mo
I’m worried about running HCH because it seems likely that in worlds that can run HCH people are not sufficiently careful to restrict GPU access and those worlds get taken over by unsafe AI built by other actors. Better to just not have the GPUs at all.
1 Adam Jermyn 5mo
I think I basically agree re: honeypots. I'm sure there'll be weird behaviors if we outlaw simulations, but I don't think that's a problem. My guess is that a world where simulations are outlawed has some religion with a lot of power that distrusts computers, which definitely looks weird but shouldn't stop them from solving alignment.

Thanks! Edits made accordingly. Two notes on the stuff you mentioned that isn't just my embarrassing lack of proofreading:

  • The definition of optimization used in Risks From Learned Optimization is actually quite different from the definition I'm using here. They say: 

    "a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system."

    I perso
... (read more)

Whatever you end up doing, I strongly recommend taking a learning-by-writing style approach (or anything else that will keep you in critical assessment mode rather than classroom mode). These ideas are nowhere near solidified enough to merit a classroom-style approach, and even if they were infallible, that's probably not the fastest way to learn them and contribute original stuff.

The most common failure mode I expect for rapid introductions to alignment is just trying to absorb, rather than constantly poking and prodding to get a real working understanding. This happened to me, and wasted a lot of time.

This is the exact problem StackExchange tries to solve, right? How do we get (and kickstart the use of) an Alignment StackExchange domain?

3 Adam Zerner 6mo
I don't think it's quite the same problem. Actually I think it's pretty different. This post tries to address the problem that people are hesitant to ask potentially "dumb" questions by making it explicit that this is the place to ask any of those questions. StackExchange tries to solve the problem of having a timeless place to ask and answer questions and to refer to such questions. It doesn't try to solve the first problem of welcoming potentially dumb questions, and I think that that is a good problem to try to solve. For that second problem, LessWrong does have Q&A functionality, as well as things like the wiki.

Agree it's hard to prove a negative, but personally I find the following argument pretty suggestive:

"Other AGI labs have some plans - these are the plans we think are bad, and a pivotal act will have to disrupt them. But if we, ourselves, are an AGI lab with some plan, we should expect our pivotal agent to also be able to disrupt our plans. This does not directly lead to the end of the world, but it definitely includes root access to the datacenter."

2 Evan R. Murphy 6mo
Here's the thing I'm stuck on lately. Does it really follow from "Other AGI labs have some plans - these are the plans we think are bad" that some drastic and violent-seeming plan like burning all the world's GPUs with nanobots is needed? I know Eliezer tried to settle this point with 4. We can't just "decide not to build AGI", but it seems like the obvious kinds of 'pivotal acts' needed are much more boring and less technological than he believes, e.g. have conversations with a few important people, probably the leadership at top AI labs. Some people seem to think this has been tried and didn't work. And I suppose I don't know the extent to which this has been tried, as any meetings that have been had with leadership at the AI labs, the participants probably aren't at liberty to talk about. But it just seems like there should be hundreds of different angles, asks, pleads, compromises, bargains etc. with different influential people before it would make sense to conclude that the logical course of action is "nanobots".

Proposed toy examples for G:

  • G is "the door opens", a- is "push door", a+ is "some weird complicated doorknob with a lock". Pretty much any b- can open a-, but only a very specific key+manipulator combo opens a+. a+ is much more informative about successful b than a- is.
  • G is "I make a million dollars", a- is "straightforward boring investing", a+ is "buy a lottery ticket". A wide variety of different world-histories b can satisfy a-, as long as the markets are favorable - but a very narrow slice can satisfy a+. a+ is a more fragile strategy (relative to noise in b) than a- is.

it doesn't work if your goal is to find the optimal answer, but we hardly ever want to know the optimal answer, we just want to know a good-enough answer.

Also not an expert, but I think this is correct


When a bounded agent attempts a task, we observe some degree of success. But the degree of success depends on many factors that are not "part of" the agent - outside the Cartesian boundary that we (the observers) choose to draw for modeling purposes. These factors include things like power, luck, task difficulty, assistance, etc. If we are concerned with the agent as a learner and don't consider knowledge as part of the agent, factors like knowledge, skills, beliefs, etc. are also externalized. Applied rationality is the result of attempting to d

... (read more)

This leans a bit close to the pedantry side, but the title is also a bit strange when taken literally. Three useful types (of akrasia categories)? Types of akrasia, right, not types of categories?

That said, I do really like this classification! Introspectively, it seems like the three could have quite distinct causes, so understanding which category you struggle with could be important for efforts to fix it.

Props for first post!

Oh, oops. I added the "categories" as panic-editing after the first comment. I have now returned it to the original (vague) title. Seems like a good time to use the "English is not my native language" excuse. Thanks! I hope it helps you in the future.

Trying to figure out what's being said here. My best guess is two major points:

  • Meta doesn't work. Do the thing, stop trying to figure out systematic ways to do the thing better, they're a waste of time. The first thing any proper meta-thinking should notice is that nobody doing meta-thinking seems to be doing object level thinking any better.
  • A lot of nerds want to be recognized as Deep Thinkers. This makes meta-thinking stuff really appealing for them to read, in hopes of becoming a DT. This in turn makes it appealing for them to write, since it's what other nerds will read, which is how they get recognized as a DT. All this is despite the fact that it's useless.
The post doesn't spend much of its time making specific criticisms because specific criticism of this patronage system would indict OP for attempting to participate in it. This hampers its readability.

Ah, gotcha. I think the post is fine, I just failed to read.

If I now correctly understand, the proposal is to ask an LLM to simulate human approval, and use that as the training signal for your Big Scary AGI. I think this still has some problems:

  • Using an LLM to simulate human approval sounds like reward modeling, which seems useful. But LLMs aren't trained to simulate humans, they're trained to predict text. So, for example, an LLM will regurgitate the dominant theory of human values, even if it has learned (in a Latent Knowledge sense) that humans really
... (read more)
It still does honestly seem way more likely to not kill us all than a paperclip-optimizer, so if we're pressed for time near the end, why shouldn't we go with this suggestion over something else?

The key thing here seems to be the difference between understanding a value and having that value. Nothing about the fragile value claim or the Orthogonality thesis says that the main blocker is AI systems failing to understand human values. A superintelligent paperclip maximizer could know what I value and just not do it, the same way I can understand what the paperclipper values and choose to pursue my own values instead.

Your argument is for LLM's understanding human values, but that doesn't necessarily have anything to do with the values that they... (read more)

I think you’re misunderstanding my point, let me know if I should change the question wording.

Assume we’re focused on outer alignment. Then we can provide a trained regressor LLM as the utility function, instead of, e.g., “maximize paperclips”. So understanding and valuing are synonymous in that setting.

now this is how you win the first-ever "most meetings" prize

2 Logan Riggs 8mo
Haha, yeah I won some sort of prize like that. I didn't know it because I left right before they announced to go take a break from all those meetings!

Agree that this is definitely a plausible strategy, and that it doesn't get anywhere near as much attention as it seemingly deserves, for reasons unknown to me. Strong upvote for the post, I want to see some serious discussion on this. Some preliminary thoughts:

  • How did we get here?
    • If I had to guess, the lack of discussion on this seems likely due to a founder effect. The people pulling the alarm in the early days of AGI safety concerns were disproportionately on the technical/philosophical side rather than on the policy/outreach/activism side. 
    • In earl
... (read more)
5 Adrià Garriga-alonso 8mo
Also, the people pulling the alarm in the early days of AGI safety concerns, are also people interested in AGI. They find it cool. I get the impression that some of them think aligned people should also try to win the AGI race, so doing capabilities research and being willing to listen to alignment concerns is good. (I disagree with this position and I don't think it's a strawman, but it might be a bit unfair.) Many of the people that got interested in AGI safety later on also find AGI cool, or have done some capabilities research (e.g. me), so thinking that what we've done is evil is counterintuitive.

You should submit this to the Future Fund's ideas competition, even though it's technically closed. I'm really tempted to do it myself just to make sure it gets done, and very well might submit something in this vein once I've done a more detailed brainstorm.

Probably a good idea, though I'm less optimistic about the form being checked. I'll plan on writing something up today. If I don't end up doing that today for whatever reason, akrasia, whatever, I'll DM you.

I don't think I understand how the scorecard works. From:

[the scorecard] takes all that horrific complexity and distills it into a nice standardized scorecard—exactly the kind of thing that genetically-hardcoded circuits in the Steering Subsystem can easily process.

And this makes sense. But when I picture how it could actually work, I bump into an issue. Is the scorecard learned, or hard-coded?

If the scorecard is learned, then it needs a training signal from Steering. But if it's useless at the start, it can't provide a training signal. On the other hand, ... (read more)

2 Steven Byrnes 8mo
The categories are hardcoded, the function-that-assigns-a-score-to-a-category is learned. Everybody has a goosebumps predictor, everyone has a grimacing predictor, nobody has a debt predictor, etc. Think of a school report card: everyone gets a grade for math, everyone gets a grade for English, etc. But the score-assigning algorithm is learned. So in the report card analogy, think of a math TA ( = Teaching Assistant = Thought Assessor) who starts out assigning math grades to students randomly, but the math professor (=Steering Subsystem) corrects the TA when its assigned score is really off-base. Gradually, the math TA learns to assign appropriate grades by looking at student tests. In parallel, there’s an English class TA (=Thought Assessor), learning to assign appropriate grades to student essays based on feedback from the English professor (=Steering Subsystem). The TAs (Thought Assessors) are useless at the start, but the professors aren't. Back to biology: If you get shocked, then the Steering Subsystem says to the “freezing in fear” Thought Assessor: “Hey, you screwed up, you should have been sending a signal just now.” The professors are easy to hardwire because they only need to figure out the right answer in hindsight. You don't need a learning algorithm for that.
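
The report-card picture above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and function names, the two-feature "context" vectors, and the simple error-correction update are all hypothetical and are not taken from the post being discussed.

```python
# Hypothetical sketch: names and the learning rule are illustrative assumptions.
CATEGORIES = ("goosebumps", "grimacing", "freezing")  # hardcoded, same for everyone


class ThoughtAssessor:
    """Learned score function for one hardcoded category (the 'TA')."""

    def __init__(self, n_features: int) -> None:
        # The comment describes grades that start out random; zeros are used
        # here only to keep the example deterministic.
        self.weights = [0.0] * n_features

    def score(self, features: list[float]) -> float:
        return sum(w * f for w, f in zip(self.weights, features))

    def correct(self, features: list[float], target: float, lr: float = 0.1) -> None:
        # Nudge the weights towards the answer supplied in hindsight.
        error = target - self.score(features)
        self.weights = [w + lr * error * f for w, f in zip(self.weights, features)]


def steering_hindsight(was_shocked: bool) -> float:
    """Hardwired 'professor': only needs to know the right answer after the fact."""
    return 1.0 if was_shocked else 0.0


assessors = {name: ThoughtAssessor(n_features=2) for name in CATEGORIES}
freezing = assessors["freezing"]

# Context A = [1, 0] is followed by a shock; context B = [0, 1] is not.
episodes = [([1.0, 0.0], True), ([0.0, 1.0], False)] * 50
for features, was_shocked in episodes:
    freezing.correct(features, steering_hindsight(was_shocked))

print(round(freezing.score([1.0, 0.0]), 3), round(freezing.score([0.0, 1.0]), 3))
```

The design point is the asymmetry: `steering_hindsight` needs no learning because it only labels outcomes after they happen, while each assessor starts useless and improves from those labels.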

What do you think about the effectiveness of the particular method of digital decluttering recommended by Digital Minimalism? What modifications would you recommend? Ideal duration?

One reason I have yet to do a month-long declutter is because I remember thinking something like "this process sounds like something Cal Newport just kinda made up and didn't particularly test, my own methods that I think of for me will probably be better than Cal's method he thought of for him".

So far my own methods have not worked.

Kurt Brown (mentioned in the post) did an experiment on this, helping residents of CEEALAR (formerly the EA Hotel) do their own Newport-style digital declutter; you can read his preliminary writeup here.
This post is at least one more data point that Cal Newport’s method worked for someone else.

Memetic evolution dominates biological evolution for the same reason.

Faster mutation rate doesn't just produce faster evolution - it also reduces the steady-state fitness. Complex machinery can't reliably be evolved if pieces of it are breaking all the time. I'm mostly relying on No Evolutions for Corporations or Nanodevices plus one undergrad course in evolutionary bio here.

Also, just empirically: memetic evolution produced civilization, social movements, Crusades, the Nazis, etc.

Thank you for pointing this out. I agree with the empirical observation that we... (read more)

I think we're seeing Friendly memetic tech evolving that can change how influence comes about. 


Wait, literally evolving? How? Coincidence despite orthogonality? Did someone successfully set up an environment that selects for Friendly memes? Or is this not literally evolving, but more like "being developed"?

The key tipping point isn't "World leaders are influenced" but is instead "The Friendly memetic tech hatches a different way of being that can spread quickly." And the plausible candidates I've seen often suggest it'll spread superexponentiall

... (read more)

Ah, so on this view, the endgame doesn't look like

"make technical progress until the alignment tax is low enough that policy folks or other AI-risk-aware people in key positions will be able to get an unaware world to pay it"

 But instead looks more like

"get the world to be aware enough to not bumble into an apocalypse, specifically by promoting rationality, which will let key decision-makers clear out the misaligned memes that keep them from seeing clearly"

Is that a fair summary? If so, I'm pretty skeptical of the proposed AI alignment strategy, even ... (read more)

Two points: 1. I have more hope than you here. I think we're seeing Friendly memetic tech evolving that can change how influence comes about. The key tipping point isn't "World leaders are influenced" but is instead "The Friendly memetic tech hatches a different way of being that can spread quickly." And the plausible candidates I've seen often suggest it'll spread superexponentially. 2. This is upstream of making the technical progress and right social maneuvers anyway. There's insufficient collective will to do enough of the right kind of alignment research. Trying anyway mostly adds to the memetic dumpster fire we're all in. So unless you have a bonkers once-in-an-aeon brilliant Messiah-level insight, you can't do this first.
1 0x44 0x46 9mo
It seems to me that in 2020 the world was changed relatively quickly. How many events in history were able to shift every mind on the planet within 3 months? If it only takes 3 months to occupy the majority of focus then you have a bound for what a Super Intelligent Agent may plan for. What is more concerning and also interesting is that such an intelligence can make something appear to be for X but it's really planning for Y. So misdirection and ulterior motive are baked into this theory gaming. Unfortunately this can lead to a very schizophrenic inspection of every scenario as if strategically there is intention to trigger infinite regress on scrutiny. When we're dealing with these Hyperobjects/Avatars/Memes we can't be certain that we understand the motive. Given that we can't understand the motive of any external meme, perhaps the only right path is to generate your own and propagate that solely?
A sketch of a solution that doesn't involve (traditional) world leaders could look like "Software engineers get together and agree that the field is super fucked, and start imposing stronger regulations and guidelines like traditional engineering disciplines use but on software." This is a way of lowering the cost of alignment tax in the sense that, if software engineers all have a security mindset, or have to go through a security review, there is more process and knowledge related to potential problems and a way of executing a technical solution at the last moment. However, this description itself is entirely political not technical, yet easily could not reach the awareness of world leaders or the general populace.

Putting this in a separate comment, because Reign of Terror moderation scares me and I want to compartmentalize. I am still unclear about the following things:

  • Why do we think memetic evolution will produce complex/powerful results? It seems like the mutation rate is much, much higher than biological evolution.
  • Valentine describes these memes as superintelligences, as "noticing" things, and generally being agents. Are these superintelligences hosted per-instance-of-meme, with many stuffed into each human? Or is something like "QAnon" kind of a distributed in
... (read more)
Doesn't the second part answer the first? I mean, the reason biological evolution matters is because its mutation rate massively outstrips geological and astronomical shifts. Memetic evolution dominates biological evolution for the same reason. Also, just empirically: memetic evolution produced civilization, social movements, Crusades, the Nazis, etc. I wonder if I'm just missing your question. Both. I wonder if you're both (a) blurring levels and (b) intuitively viewing these superintelligences as having some kind of essence that either is or isn't in someone. What is or isn't a "meme" isn't well defined. A catch phrase (e.g. "Black lives matter!") is totally a meme. But is a religion a meme? Is it more like a collection of memes? If so, what exactly are its constituent memes? And with catch phrases, most of them can't survive without a larger memetic context. (Try getting "Black lives matter!" to spread through an isolated Amazonian tribe.) So should we count the larger memetic context as part of the meme? But if you stop trying to ask what is or isn't a meme and you just look at the phenomenon, you can see something happening. In the BLM movement, the phrase "Silence is violence" evolved and spread because it was evocative and helped the whole movement combat opposition in a way that supported its egregoric possession. So… where does the whole BLM superorganism live? In its believers and supporters, sure. But also in its opponents. (Think of how folk who opposed BLM would spread its claims in order to object to them.) Also on webpages. Billboards. Now in Hollywood movies. And it's always shifting and mutating. The academic field of memetics died because they couldn't formally define "meme". But that's backwards. Biology didn't need to formally define life to recognize that there's something to study. The act of studying seems to make some definitions more possible. That's where we're at right now. Egregoric zoology, post Darwin but pre Watson & Crick.
I really appreciate your list of claims and unclear points. Your succinct summary is helping me think about these ideas. A few examples came to mind: sports paraphernalia, tabletop miniatures, and stuffed animals (which likely outnumber real animals by hundreds or thousands of times). One might argue that these things give humans joy, so they don't count. There is some validity to that. AI paperclips are supposed to be useless to humans. On the other hand, one might also argue that it is unsurprising that subsystems repurposed to seek out paperclips derive some 'enjoyment' from the paperclips... but I don't think that argument will hold water for these examples. Looking at it another way, some amount of paperclips are indeed useful. No egregore has turned the entire world to paperclips just yet. But of course that hasn't happened, else we would have already lost. Even so: consider paperwork (like the tax forms mentioned in the post), skill certifications in the workplace, and things like slot machines and reality television. A lot of human effort is wasted on things humans don't directly care about, for non-obvious reasons. Those things could be paperclips. (And perhaps some humans derive genuine joy out of reality television, paperwork, or giant piles of paperclips. I don't think that changes my point that there is evidence of egregores wasting resources.)

My attempt to break down the key claims here:

  • The internet is causing rapid memetic evolution towards ideas which stick in people's minds, encourage them to take certain actions, especially ones that spread the idea. Ex: wokism, Communism, QAnon, etc
  • These memes push people who host them (all of us, to be clear) towards behaviors which are not in the best interests of humanity, because Orthogonality Thesis
  • The lack of will to work on AI risk comes from these memes' general interference with clarity/agency, plus selective pressure to develop ways to get past "
... (read more)

I like this, thank you.

I score this as "Good enough that I debated not bothering to correct anything."

I think some corrections might be helpful though:


The internet is causing rapid memetic evolution…

While I think that's true, that's not really central to what I'm saying. I think these forces have been the main players for way, way longer than we've had an internet. The internet — like every other advance in communication — just increased evolutionary pressure at the memetic level by bringing more of these hypercreatures into contact with one another ... (read more)

I think there's an important difference Valentine tries to make with respect to your fourth bullet (and if not, I will make it). You perhaps describe the right idea, but the wrong shape. The problem is more like "China and the US both have incentives to bring about AGI and don't have incentives towards safety." Yes, deflecting at the last second with some formula for safe AI will save you, but that's as stupid as jumping away from a train at the last second. Move off the track hours ahead of time, and just broker a peace between countries to not make AGI.

Putting this in a separate comment, because Reign of Terror moderation scares me and I want to compartmentalize. I am still unclear about the following things:

  • Why do we think memetic evolution will produce complex/powerful results? It seems like the mutation rate is much, much higher than biological evolution.
  • Valentine describes these memes as superintelligences, as "noticing" things, and generally being agents. Are these superintelligences hosted per-instance-of-meme, with many stuffed into each human? Or is something like "QAnon" kind of a distributed in
... (read more)

So, what it sounds like to me is that you at least somewhat buy a couple object-level moral arguments for veganism, but also put a high confidence in some variety of moral anti-realism which undermines those arguments. There are two tracks of reasoning I would consider here.

First: if anti-realism is correct, it doesn't matter what we do. If anti-realism is not correct, then it seems like we shouldn't eat animals. Unless we're 100% confident in the anti-realism, it seems like we shouldn't eat animals. Note that there are a couple difficulties with this kind... (read more)

"if anti-realism is true, it doesn't matter [to us] what we do" -- that's false. Whether something does matter to us is a fact independent of whether something ought to matter to us.
Missing cells in your matrix: perhaps morals are real, and creating animals for the purpose of meat is permitted. The main question for moral realists is "how can you find the truth?". If morals have some basis in the real, measurable, objective universe, then it becomes an empirical question about how to act.

This might not work depending on the details of how "information" is specified in these examples, but would this model of abstractions consider "blob of random noise" a good abstraction? 

On the one hand, different blobs of random noise contain no information about each other on a particle level - in fact, they contain no information about anything on a particle level, if the noise is "truly" random. And yet they seem like a natural category, since they have "higher-level properties" in common, such as unpredictability and idk maybe mean/sd of particle... (read more)

If they have mean/sd in common (as in e.g. a Gaussian clustering problem), then the mean/sd are exactly the abstract information. If they're all completely independent, without any latents (like mean/sd) at all, then the blob itself is not a natural abstraction, at least if we're staying within an information-theoretic playground. I do expect this will eventually need to be extended beyond mutual information, especially to handle the kinds of abstractions we use in math (like "groups", for instance). My guess is that most of the structure will carry over; Bayes nets and mutual information have pretty natural category-theoretic extensions as I understand it, and I expect that roughly the same approach and techniques I use here will extend to that setting. I don't personally have enough expertise there to do it myself, though.
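
To make the mean/sd point concrete, here is a minimal sketch, assuming two "blobs" whose particles are drawn independently from the same Gaussian. The helper names (`make_blob`, `pearson`, `summarize`) and the parameters are hypothetical, chosen only for illustration, and Pearson correlation is used as a crude stand-in for mutual information rather than as the measure used in the post. The particle-level values of one blob say essentially nothing about the other's, while the summary statistics nearly coincide.

```python
import random
import statistics


def make_blob(rng, n, mu, sigma):
    """Draw n independent 'particle' values from one Gaussian (illustrative assumption)."""
    return [rng.gauss(mu, sigma) for _ in range(n)]


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def summarize(blob):
    """Return the (mean, sd) pair that plays the role of the shared latent."""
    return statistics.fmean(blob), statistics.stdev(blob)


# Fixed seed so the sketch is reproducible.
rng = random.Random(0)
blob_a = make_blob(rng, 10_000, mu=0.0, sigma=1.0)
blob_b = make_blob(rng, 10_000, mu=0.0, sigma=1.0)

# Particle level: pairing up particles across blobs shows almost no relationship.
particle_corr = pearson(blob_a, blob_b)

# Summary level: the two blobs agree closely on mean and sd.
mean_a, sd_a = summarize(blob_a)
mean_b, sd_b = summarize(blob_b)

print("particle-level correlation:", round(particle_corr, 3))
print("blob A mean/sd:", round(mean_a, 3), round(sd_a, 3))
print("blob B mean/sd:", round(mean_b, 3), round(sd_b, 3))
```

If the blobs were instead drawn with no shared parameters at all, the summaries would stop agreeing too, which is the case where the blob is not a natural abstraction in this information-theoretic sense.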

unlike other technologies, an AI disaster might not wait around for you to come clean it up

I think this piece is extremely important, and I would have put it in a more central place. The whole "instrumental goal preservation" argument makes AI risk very different from the knife/electricity/car analogies. It means that you only get one shot, and can't rely on iterative engineering. Without that piece, the argument is effectively (but not exactly) considering only low-stakes alignment.

In fact, I think if we get rid of this piece of the alignment problem, bas... (read more)

Another minor note: very last link, to splendidtable, seems to include an extra comma at the end of the link which makes it 404

thanks to both of you, fixed now.

Currently working on ELK - posted some unfinished thoughts here. Looking to turn this into a finished submission before end of January - any feedback is much appreciated, if anyone wants to take a look!

Dang, I wish I had read this before the EA Forum's creative writing contest closed. It makes a lot of sense that HPMOR could be valuable via this "first-person-optimizing-experience" mechanism - I had read it after reading the Sequences, so I was mostly looking for examples of rationality techniques and secret hidden Jedi knowledge. 

Since HPMOR!Harry isn't so much EA as transhumanist, I wonder if a first-person EA experience could be made interesting enough to be a useful story? I suppose the Comet King from Unsong is also kind of close to this niche, but not really described in first person or designed to be related to. This might be worth a stab...

HPMOR got me to read the Sequences by presenting a teaser of what a rationalist could do and then offering the real me that power. This line from the OP resonated deeply: The Sequences then expanded that vision into something concrete, and did in fact completely change my life for the better.

TLDR: if we model a human as a collection of sub-agents rather than a single agent, how do we make normative claims about which sub-agents should or shouldn't hammer down others? There's no over-arching set of goals to evaluate against, and each sub-agent always wants to hammer down all the others.

If I'm interpreting things right, I think I agree with the descriptive claims here, but tentatively disagree with the normative ones. I agree that modeling humans as single agents is inaccurate, and a multi-agent model of some sort is better. I also agree that the ... (read more)

In my experience, trying to apply rationality to hidden-role games such as Mafia tends to break them pretty quickly - not in the sense of making rationalist players extremely powerful, just in the sense of making the game basically unrecognizable and a lot less fun. I played a hidden role game called Secret Hitler with a group of friends, a few of whom were familiar with some Sequences content, and the meta very quickly shifted towards a boring fixed point.

The problem is that rationality is all about being asymmetric towards truth, which is g... (read more)

That's great. If I ever attempt to design my own conlang, I'm using this rule.

The first enigma seems like it's either very closely related or identical to Hume's problem of induction. If that is a fair rephrasing, then I think it's not entirely true that the key problem is that the use of empiricism cannot be justified by empiricism or refuted by empiricism. Principles like "don't believe in kludgy unwieldy things" and "empiricism is a good foundation for belief" can in fact be supported by empiricism - because those heuristics have worked well in the past, and helped us build houses and whatnot.

I think the key problem is that empir... (read more)

Interesting way of putting it! The usual objection to circular reasoning is more logical... that it allows quodlibet, the ability to prove anything. Which of course depends on a Criterion...the criterion that you shouldn't be able to prove everything.

Right, yeah I agree that we can evaluate empiricism on empirical grounds. That is a thing we can do. And yes, as you say, we can come to different conclusions about empiricism when we evaluate it on empirical grounds. Very interesting point re object-level and meta-level conclusions. But why would we start with empiricism at all? Why should we begin with empiricism, and then conclude on such grounds either that empiricism is trustworthy or untrustworthy?

When I say "empiricism cannot justify empiricism", I mean that empiricism cannot explain why we trust empir... (read more)

This is great. Feels like a very good catch. Attempting to start a comment thread doing a post-mortem of why this happened and what measures might make this sort of clarity-losing definition drift happen less in the future.

One thing I am a bit surprised by is that the definition on the tag page for inside/outside view was very clearly the original definition, and included a link to the Wikipedia for reference class forecasting in the second sentence. This suggests that the drifted definition was probably not held as an explicit belief by a large number of ... (read more)

The funny thing is, I literally did a whole review+summary post of the Superforecasting literature and yet two years later I was using "outside view" to mean a whole bunch of not-particularly-related things! I think the applause light hypothesis is plausible.

I think a lot of this discussion becomes clearer if we taboo "intelligence" as something like "ability to search and select a high-ranked option from a large pool of strategies".

  • Agree that the rate-limiting step for a superhuman intelligence trying to affect the world will probably be stuff that does not scale very well with intelligence, like large-scale transport, construction, smelting widgets, etc. However, I'm not sure it would be so severe a limitation as to produce situations like what you describe, where a superhuman intelligence sits around for a
... (read more)
The problem (even in humans) is rarely the ability to identify the right answer, or even the speed at which answers can be evaluated, but rather the ability to generate new possibilities. And that is a skill that is both hard and not well understood.
I guess it depends on how many "intelligence-driven issues" are yet to be solved and how important they are; my intuition is that the answer is "not many" but I have very low trust in that intuition. It might also be just the fact that "useful" is fuzzy and my "not super useful" might be your "very useful", and quantifying useful gets into the thorny issue of quantifying intuitions about progress.