# All of TurnTrout's Comments + Replies

Covid 5/6: Vaccine Patent Suspension

Both writesr take it as common knowledge that the reasons to not take the virus are stupid and wrong, and that the job is to fix what’s wrong with these soldiers who are refusing.

"writers" and "not take the vaccine", no?

Your Dog is Even Smarter Than You Think

Really? The main claim is presented "in an outrageous way"?

I can imagine reading the post and being unconvinced by the evidence presented. In fact, that was my reaction (although I haven't watched the videos yet). But... being outraged

Posts should not make large, unsupported claims, and criticism should not be hyperbolic. Here is what I have learned from your critique:

• You think that StyleOfDog writes "maniacally" and uses "meme-speak" too much.
• Unclear why I should care. Not my favorite style of writing, but I understood what they were trying t
[8] subconvergence, 3d: Sorry for over-reacting to what I perceived as essentially a curated list of YouTube videos with no real context. I made a probably more substantial comment as an answer to the OP.
TurnTrout's shortform feed

Comment #1000 on LessWrong :)

[5] niplav, 7d: With 5999 karma! Edit: Now 6000 – I weak-upvoted an old post of yours [https://www.lesswrong.com/posts/EvKWNRkJgLosgRDSa/lightness-and-unease] I hadn't upvoted before.
Draft report on existential risk from power-seeking AI

This arbitrary choice effectively unrolls the state graph into a tree with a constant branching factor (+ self-loops in the terminal states) and we get that the POWER of all the states is equal.

Not necessarily true - you're still considering the IID case.

I think using a well-chosen reward distribution is necessary, otherwise POWER depends on arbitrary choices in the design of the MDP's state graph. E.g. suppose the student in the above example writes about every action they take in a blog that no one reads, and we choose to include the content of the

[1] ofer, 7d: Just to summarize my current view: For MDP problems in which the state representation is very complex, and different action sequences always yield different states, POWER-defined-over-an-IID-reward-distribution is equal for all states, and thus does not match the intuitive concept of power.

At some level of complexity such problems become relevant (when dealing with problems with real-world-like environments). These are not just problems that show up when one adversarially constructs an MDP problem to game POWER, or when one makes "really weird modelling choices". Consider a real-world inspired MDP problem where a state specifies the location of every atom. What makes POWER-defined-over-IID problematic in such an environment is the sheer complexity of the state, which makes it so that different action sequences always yield different states. It's not "weird modeling decisions" causing the problem.

I also (now) think that for some MDP problems (including many grid-world problems), POWER-defined-over-IID may indeed match the intuitive concept of power well, and that publications about such problems (and theorems about POWER-defined-over-IID) may be very useful for the field. Also, I see that the abstract of the paper no longer makes the claim "We prove that, with respect to a wide class of reward function distributions, optimal policies tend to seek power over the environment", which is great (I was concerned about that claim).
Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

LeCun claims too much. It's true that the case of animals like orangutans points to a class of cognitive architectures which seemingly don't prioritize power-seeking. It's true that this is some evidence against power-seeking behavior being common amongst relevant cognitive architectures. However, it doesn't show that instrumental subgoals are much weaker drives of behavior than hardwired objectives.

One reading of this "drives of behavior" claim is that it has to be tautological; by definition, instrumental subgoals are always in service of the (hardwired)... (read more)

[4] Lukas_Gloor, 6d: I feel like you can turn this point upside down. Even among primates that seem unusually docile, like orangutans, male-male competition can get violent and occasionally ends in death. Isn't that evidence that power-seeking is hard to weed out? And why wouldn't it be in an evolved species that isn't eusocial or otherwise genetically weird?
Draft report on existential risk from power-seeking AI

Two clarifications. First, even in the existing version, POWER can be defined for any bounded reward function distribution - not just IID ones. Second, the power-seeking results no longer require IID. Most reward function distributions incentivize POWER-seeking, both in the formal sense, and in the qualitative "keeping options open" sense.

To address your main point, though, I think we'll need to get more concrete. Let's represent the situation with a state diagram.

Both you and Rohin... (read more)

[1] ofer, 8d: I think using a well-chosen reward distribution is necessary, otherwise POWER depends on arbitrary choices in the design of the MDP's state graph. E.g. suppose the student in the above example writes about every action they take in a blog that no one reads, and we choose to include the content of the blog as part of the MDP state. This arbitrary choice effectively unrolls the state graph into a tree with a constant branching factor (+ self-loops in the terminal states) and we get that the POWER of all the states is equal.

The "complicated MDP environment" argument does not need partial observability or an infinite state space; it works for any MDP where the state graph is a finite tree with a constant branching factor. (If the theorems require infinite horizon, add self-loops to the terminal states.)
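The finite-tree claim above can be checked numerically. The following is a minimal sketch, not code from the paper: it assumes rewards drawn IID from Uniform[0, 1] for every state, a complete tree with self-loops at the leaves, and a POWER-like quantity computed as the normalized average of optimal value minus immediate reward. The function name, default parameters, and the exact normalization are illustrative assumptions.

```python
import random


def estimate_power_by_depth(depth=3, branching=2, gamma=0.9, samples=4000, seed=0):
    """Monte Carlo estimate of a POWER-like quantity on a complete tree MDP.

    States are indexed level by level, so the children of state s are
    s*branching + 1, ..., s*branching + branching. Leaves self-loop forever.
    Returns one list of estimates per depth level (root first).
    """
    rng = random.Random(seed)
    n_states = sum(branching ** d for d in range(depth + 1))
    first_leaf = n_states - branching ** depth
    totals = [0.0] * n_states

    for _ in range(samples):
        # Assumption: IID Uniform[0, 1] reward for each state.
        reward = [rng.random() for _ in range(n_states)]
        value = [0.0] * n_states
        # Walk indices from high to low so children are solved before parents.
        for s in range(n_states - 1, -1, -1):
            if s >= first_leaf:
                # A self-looping leaf collects its own reward forever.
                value[s] = reward[s] / (1 - gamma)
            else:
                kids = range(s * branching + 1, s * branching + branching + 1)
                value[s] = reward[s] + gamma * max(value[c] for c in kids)
            totals[s] += value[s] - reward[s]

    # Normalize: (1 - gamma) / gamma times the average of V*(s) - R(s).
    scale = (1 - gamma) / gamma / samples
    power = [t * scale for t in totals]

    by_depth, start = [], 0
    for d in range(depth + 1):
        width = branching ** d
        by_depth.append(power[start:start + width])
        start += width
    return by_depth


if __name__ == "__main__":
    for d, level in enumerate(estimate_power_by_depth()):
        print(d, [round(p, 3) for p in level])
```

Under these assumptions, the estimates within each depth level agree up to sampling noise, which matches the symmetry argument for the IID case. The sketch says nothing about non-IID reward distributions, which is the case the reply above distinguishes.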
Draft report on existential risk from power-seeking AI

Right. But what does this have to do with your “different concept” claim?

[1] ofer, 9d: A person does not become less powerful (in the intuitive sense) right after paying college tuition (or right after getting a vaccine) due to losing the ability to choose whether to do so. [EDIT: generally, assuming they make their choices wisely.]

I think POWER may match the intuitive concept when defined over certain (perhaps very complicated) reward distributions, rather than reward distributions that are IID-over-states (which is what the paper deals with).

Actually, in a complicated MDP environment (analogous to the real world) in which every sequence of actions results in a different state (i.e. the graph of states is a tree with a constant branching factor), the POWER of all the states that the agent can get to in a given time step is equal, when POWER is defined over an IID-over-states reward distribution.
TurnTrout's shortform feed

When proving theorems for my research, I often take time to consider the weakest conditions under which the desired result holds - even if it's just a relatively unimportant and narrow lemma. By understanding the weakest conditions, you isolate the load-bearing requirements for the phenomenon of interest. I find this helps me build better gears-level models of the mathematical object I'm studying. Furthermore, understanding the result in generality allows me to recognize analogies and cross-over opportunities in the future. Lastly, I just find this plain satisfying.

Draft report on existential risk from power-seeking AI

I think the draft tends to use the term power to point to an intuitive concept of power/influence (the thing that we expect a random agent to seek due to the instrumental convergence thesis). But I think the definition above (or at least the version in the cited paper) points to a different concept, because a random agent has a single objective (rather than an intrinsic goal of getting to a state that would be advantageous for many different objectives)

This is indeed a misunderstanding. My paper analyzes the single-objective setting; no intrinsic power-seeking drive is assumed.

[1] ofer, 9d: I probably should have written the "because ..." part better. I was trying to point at the same thing Rohin pointed at in the quoted text.

Taking a quick look at the current version of the paper, my point still seems to me relevant. For example, in the environment in figure 16, with a discount rate of ~1, the maximally POWER-seeking behavior is to always stay in the same first state (as noted in the paper), from which all the states are reachable. This is analogous to the student from Rohin's example who takes a gap year instead of going to college.
Lessons I've Learned from Self-Teaching

Ok, so the main advice is: don't make a card for everything, just the important concepts. And those concepts can be found in "cheatsheets" and "course review notes", it seems — unfortunately, I don't have any of those things.

Why not use Google for notes from other schools?

[1] Eugleo, 10d: If the plan was to understand the given subject, then yes, that would work. And of course, that is the ultimate plan. However, the more pressing matter is the exams. I would be afraid that the intersection of the two programs won't contain all of the important concepts; that's my experience with "other" textbooks (meaning, other than the few recommended ones), at least.
Draft report on existential risk from power-seeking AI

I commented a portion of a copy of your power-seeking writeup.

I like the current doc a lot. I also feel that it seems to not consider some big formal hints and insights we've gotten from my work over the past two years.[1]

Very recently, I was able to show the following strong result:

Some researchers have speculated that capable reinforcement learning (RL) agents are often incentivized to seek resources and power in pursuit of their objectives. While seeking power in order to optimize a misspecified objective, agents might be incentivized to beh

[3] Joe Carlsmith, 8d: Thanks for reading, and for your comments on the doc. I replied to specific comments there, but at a high level: the formal work you’ve been doing on this does seem helpful and relevant (thanks for doing it!). And other convergent phenomena seem like helpful analogs to have in mind.
Lessons I've Learned from Self-Teaching

Definitely not too late to make cards. I've learned a great deal of basic chemistry in the last month or so, just studying for random 30-minute chunks and binging interesting-looking Wikipedia articles. A month+ is plenty of time for spaced repetition to work its magic.

For math, I recommend Anki; for non-LaTeX-intensive subjects like biology, I recommend SuperMemo for fast card creation while you read and review material. Unfortunately, I have yet to write up my thoughts on the latter.

I like to look at the "cheat sheets" for courses and ensure I know how t... (read more)

[1] Eugleo, 10d: Thanks, that sounds encouraging. I have to admit, I despise Anki. I love well-designed tools, to the point it's probably detrimental to my life. But alas, it is like it is, and so every time I see Anki (and I've made around 300 cards for it over the last few years), I have an urge to throw the computer out of the window. But, that's just a detail. I plan to use Mochi, by the way.

Ok, so the main advice is: don't make a card for everything, just the important concepts. And those concepts can be found in "cheatsheets" and "course review notes", it seems; unfortunately, I don't have any of those things.

So, I'd be grateful if you could illustrate what you think constitutes a major, Anki-worthy piece of information. Let's say you're trying to revise (in the context of my current exam prep) a base chapter on vector spaces from Linear Algebra Done Right. You probably want a card with the VS axioms. What else do you view as important enough?
TurnTrout's shortform feed

The Pfizer phase 3 study's last endpoint is 7 days after the second shot. Does anyone know why the CDC recommends waiting 2 weeks for full protection? Are they just being the CDC again?

[6] jimrandomh, 11d: People don't really distinguish between "I am protected" and "I am safe for others to be around". If someone got infected prior to their vaccination and had a relatively-long incubation period, they could infect others; I don't think it's a coincidence that two weeks is also the recommended self-isolation period for people who may have been exposed.
For mRNA vaccines, is (short-term) efficacy really higher after the second dose?

Frustratingly, the phase 3's don't report this number. But using some data included in the Pfizer phase 3, I was able to make this graph:

The image isn't loading for me on LW, although it does load if I right-click and select 'open in a new tab.'

[1] Sam Marks, 13d: Umm ... that's weird. I'll paste in the picture again and maybe that'll fix whatever bug is going on? Let me know if it loads now.
GPT-3 Gems

## Another (outer) alignment failure story

This was run on davinci via the OpenAI API. First completion.

ML starts running factories, warehouses, shipping, and construction. ML assistants help write code and integrate ML into new domains. ML designers help build factories and the robots that go in them. ML finance systems invest in companies on the basis of complicated forecasts and (ML-generated) audits. Tons of new factories, warehouses, power plants, trucks and roads are being built. Things are happening quickly, investors have super strong FOMO, no one real

[2] Kaj_Sotala, 17d: Apparently we'll be able to build lots of drones.
Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

Apparently VMs are the way to go for PDF support on Linux.

Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

It's a spaced repetition system that focuses on incremental reading. It's like Anki, but instead of hosting flashcards separately from your reading, you extract text while reading documents and PDFs. You later refine extracts into ever-smaller chunks of knowledge, at which point you create the "flashcard" (usually 'clozes', demonstrated below).
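The final step described above, turning a refined extract into a cloze, can be sketched in a few lines. This is only an illustration of the idea of a cloze deletion, not SuperMemo's actual behavior or file format; the function name `make_cloze` and the `[...]` blank marker are assumptions chosen for the example.

```python
from typing import Tuple


def make_cloze(extract: str, answer: str, blank: str = "[...]") -> Tuple[str, str]:
    """Turn a short extract into a (question, answer) cloze pair.

    The first occurrence of `answer` in `extract` is hidden behind `blank`.
    """
    if answer not in extract:
        raise ValueError("answer must appear in the extract")
    question = extract.replace(answer, blank, 1)
    return question, answer


if __name__ == "__main__":
    extract = "Spaced repetition schedules reviews at increasing intervals."
    question, answer = make_cloze(extract, "increasing intervals")
    print(question)
    print(answer)
```

The point of refining extracts first is that the text passed to a function like this is already short, so each cloze tests a single fact.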

[2] adamShimi, 23d: I've been wanting to try SuperMemo for a while, especially given the difficulty that you mention with making Anki cards. But it doesn't run natively on Linux AFAIK, and I can't be bothered for the moment to make it work using Wine.
What weird beliefs do you have?

I don’t follow the last bit. If ghosts were real, the first-order news would be amazing: maybe humanity wouldn’t have truly lost the brain-information of any human, ever!

[3] alexgieg, 24d: Yes, but that would (does?) also mean a strict limit on how much cognitive abilities, including emotional amplitude, can be engineered. Neural engineering would have as its task improving a human body's brain up to that limit, but not beyond, as after a point it would be (is?) incompatible with "human souls".

So, the first-order news would be good, in that 42 billion or so human souls would be intact (barring something able to kill souls). The second-order news, however, is that the trillions to quadrillions of human beings that will still come to exist will all be, well, basically this, just spread around.

So, for me, if those quadrillions of future human beings could have been orders of magnitude more at the price of all human beings so far existing not having a continuity into that future, the utility thus gained would also be orders of magnitude higher.
The Case for Extreme Vaccine Effectiveness

The all-or-nothing vaccine hypothesis is:

But maybe the vaccine is 100% effective against all outcomes! So long as it’s correctly transported and administered, that is. Except sometimes vaccines are left at high temperature for too long, the delicate proteins are damaged, and people receiving them are effectively not vaccinated. If this happens 5% of the time, then 95% of people are completely immune to Covid and 5% are identical to not being vaccinated. Whatever chance they had of getting severe Covid before, it’s the same now.

If all-or-nothing were true, you... (read more)
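The arithmetic implied by the quoted hypothesis can be worked through directly. The sketch below assumes only what the quote states: a fixed fraction of doses fail entirely, and everyone else is fully immune. The baseline infection and severity probabilities are hypothetical numbers picked for illustration, and the function name is an assumption.

```python
def all_or_nothing_outcomes(p_infect, p_severe_given_infect, failure_rate):
    """Outcomes under an all-or-nothing vaccine.

    A `failure_rate` fraction of vaccinated people behave exactly like the
    unvaccinated; the rest are assumed completely immune.
    """
    unvax_infect = p_infect
    unvax_severe = p_infect * p_severe_given_infect

    # Only the failed doses contribute any infections or severe cases.
    vax_infect = failure_rate * unvax_infect
    vax_severe = failure_rate * unvax_severe

    return {
        "efficacy_vs_infection": 1 - vax_infect / unvax_infect,
        "efficacy_vs_severe": 1 - vax_severe / unvax_severe,
        "p_severe_given_infected_vaccinated": vax_severe / vax_infect,
        "p_severe_given_infected_unvaccinated": unvax_severe / unvax_infect,
    }


if __name__ == "__main__":
    # Hypothetical baseline: 10% infected, 20% of infections severe, 5% of doses fail.
    for name, value in all_or_nothing_outcomes(0.10, 0.20, 0.05).items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, efficacy against infection and efficacy against severe disease come out the same, and a vaccinated person who does get infected is exactly as likely to have a severe case as an unvaccinated one. That equality is a consequence the hypothesis commits to, so it can be compared against trial data.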

Nope, that seems roughly right. It is I who failed to propagate. Was a cached argument from before I'd looked at the data.

I'll update the post shortly with this. Thanks for pointing it out.

A Brief Review of Current and Near-Future Methods of Genetic Engineering

Do you think such humans would have a high probability of working on TAI alignment, compared to working on actually making TAI?

[5] GeneSmith, 25d: This is a really good question. I'm not sure I have a satisfying answer to this other than to say that awareness of the dangers of both nuclear weapons and computers has been disproportionately high among extremely smart people.

John Von Neumann literally woke up from a dream in 1945 and dictated to his wife the outcome of both the Manhattan Project and the more general project of computation. Or Alan Turing around the same time: Another one from him:

Granted, these are just anecdotes. And let it be noted that Von Neumann and Turing both went on to make significant progress in their respective fields despite these concerns. My current theory is that yes, they are more likely to both recognize the danger of AI and do something about it. But that could be wrong. I will have to think more about this.
Modified bases in mRNA vaccines against Covid-19

I think you are indeed making a mistake by letting unsourced FB claims worry you, given the known proliferation of antivax-driven misinformation. There is an extremely low probability that you're first hearing about a real issue via some random, unsourced FB comment.

For more evidence, look to the overreactions to J&J / AZ adverse effects. Regulatory bodies are clearly willing to make a public fuss over even small probabilities of things going wrong.

A Brief Review of Current and Near-Future Methods of Genetic Engineering

Evolution requires some amount of mutation, which is occasionally beneficial to the species. Species that were too good at preventing mutations would be unable to adapt to changing environmental conditions, and thus die out.

We're aware of many species which evolved to extinction. I guess I'm looking for why there's no plausible "path" in genome-space between this arrangement and an arrangement which makes fatal errors happen less frequently. E.g., why wouldn't it be locally beneficial to the individual genes to code for more robustness against spontaneous abortions? Or is there an argument that this just isn't possible for evolution to find (like wheels instead of legs, or machine guns instead of claws)?

A Brief Review of Current and Near-Future Methods of Genetic Engineering

I feel confused wrt the genetic mutation hypothesis for the spontaneous abortion phenomenon. Wouldn't genes which stop the baby from being born, quickly exit the gene pool? Similarly for gamete formation processes which allow such mutations to arise?

[2] gilch, 1mo: Yes, by killing the fetus before it's born. New mutations still happen all the time. Usually they hit junk DNA and not much happens, but what if it breaks something vital? And it's possible to inherit deleterious recessive alleles from both parents. That's why incest is still a problem, from a genetic standpoint.

And yet we still have transposons. Evolution requires some amount of mutation, which is occasionally beneficial to the species. Species that were too good at preventing mutations would be unable to adapt to changing environmental conditions, and thus die out.
Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

I agree. I've put it in my SuperMemo and very much look forward to going through it. Thanks Peter & Owen!

[2] Mark Xu, 24d: I'm curious what "put it in my SuperMemo" means. Quick googling only yielded SuperMemo as a language learning tool.
Analyzing Multiplayer Games using Impact

(midco developed this separately from our project last term, so this is actually my first read)

I have a lot of small questions.

What is your formal definition of the IEU? What kinds of goals is it conditioning on (because IEU is what you compute after you view your type in a Bayesian game)?

Multi-agent "impact" seems like it should deal with the Shapley value. Do you have opinions on how this should fit in?

You note that your formalism has some EDT-like properties with respect to impact:

Well, in a sense, they do. The universes where player

[1] midco, 1mo: Answering questions one-by-one:

• I played fast and loose with IEU in the intro section. I think it can be consistently defined in the Bayesian game sense of "expected utility given your type", where the games in the intro section are interpreted as each player having constant type. In the Bayesian Network section, this is explicitly the definition (in particular, player i's IEU varies as a function of their type).
• Upon reading the Wiki page, it seems like Shapley value and Impact share a lot of common properties? I'm not sure of any exact relationship, but I'll look into connections in the future.
• I think what's going on is that the "causal order" of ω and a_i is switched, which makes a_i "look as though" it controls the value of ω. In terms of game theory the distinction is (I think) definitional; I include it because Impact has to explicitly consider this dynamic.
• In retrospect: yep, that's conditional expectation! My fault for the unnecessary notation. I introduced it to capture the idea of a vector space projection on random variables and didn't see the connection to pre-existing notation.
Testing The Natural Abstraction Hypothesis: Project Intro

My one note of unease is that an abstraction thermometer seems highly dual-use; if successful, this project could accelerate AI timelines. But that doesn't mean it isn't worth doing.

[6] johnswentworth, 1mo: Re: dual use, I do have some thoughts on exactly what sort of capabilities would potentially come out of this.

The really interesting possibility is that we end up able to precisely specify high-level human concepts - a real-life language of the birds [https://en.wikipedia.org/wiki/Language_of_the_birds]. The specifications would correctly capture what-we-actually-mean, so they wouldn't be prone to goodhart. That would mean, for instance, being able to formally specify "strawberry on a plate" in non-goodhartable way, so an AI optimizing for a strawberry on a plate would actually produce a strawberry on a plate. Of course, that does not mean that an AI optimizing for that specification would be safe - it would actually produce a strawberry on a plate, but it would still be perfectly happy to take over the world and knock over various vases in the process.

Of course just generally improving the performance of black-box ML is another possibility, but I don't think this sort of research is likely to induce a step-change in that department; it would just be another incremental improvement. However, if alignment is a bottleneck to extracting economic value [https://www.lesswrong.com/posts/BnDF5kejzQLqd5cjH/alignment-as-a-bottleneck-to-usefulness-of-gpt-3] from black-box ML systems, then this is the sort of research which would potentially relax that bottleneck without actually solving the full alignment problem. In other words, it would potentially make it easier to produce economically-useful ML systems in the short term, using techniques which lead to AGI disasters in the long term.

I still don't fully agree with OP but I do agree that I should weight this heuristic more.

Yeah, I think these are good points.

OK, if we're talking about central identity, then I very much wouldn't sign a contract giving away rights to my central identity. I interpreted the question to be about selling one's "immortal soul" (which supposedly goes to heaven if I'm good).

I think part of the lesson here is ‘don’t casually sell vaguely defined things that are generally understood to be some kind of big deal’

I guess I feel like this is a significant steelman and atypical of normal usage. In my ontology, that algorithm is closer to ‘mind.’

[5] Raemon, 1mo: So there's a specific thing of "the immortal part of you that goes to heaven", which is just false. But I think plenty of people draw a mind/soul/body, where the mind/soul distinction is pointing at a cluster that's sort of like:

• System 1 (as opposed to System 2)
• strongly felt emotions
• the core of your being – the things that make you distinctly you, vs the parts of your algorithm that any ol' person could easily implement (i.e. design by committee, paint by numbers). Your central identity.

When one says "that artistic piece has soul" or "they poured their soul into a project", one is saying (something like) "they invested their identity into it" or "they made it out of creative pieces that would be hard for someone else to replicate" or "they worked extremely hard on it, because they deeply cared about the outcome" (where if they had not deeply cared about the outcome they would have worked less hard).

I think people that talk about immortal souls are usually also talking about the cluster of properties that have to do with the above. And they're just-plain-wrong about the immortal part, and they don't have super great abstractions for the other parts, but the other parts seem like they're trying to engage with a real thing.

I agree that "soul" has more 'real' meaning than "florepti xor bobble." There's another point to consider, though, which is that many of us will privilege claims about souls with more credence than they realistically deserve, as an effect of having grown up in a certain kind of culture.

Out of all the possible metaphysical constructs which could 'exist', why believe that souls are particularly likely? Many people believing in souls is some small indirect evidence for them, but not an amount of evidence commensurate with the concept's prior improbability.

[3] DanielFilan, 1mo: Because there are good candidates for what a soul might be. E.g. the algorithm that's running in your head.

I think "Don't casually make contracts you don't intend to keep" is just pretty cruxy for me. This is a key piece of being a trustworthy person who can coordinate in complex, novel domains. There might be a price where it is worth it to do it as a joke, but $10 is way too low.

I agree that the contracts part was important, and I share this crux. I should have noted that. I did purposefully modify my hypothetical so that I wasn't becoming less trustworthy by signing my acquaintance's piece of paper.

This actually seems obviously wrong to me, if ... (read more)

Don't Sell Your Soul

My gut reaction is... okay, sure, maybe doing it ostentatiously is obnoxious, but these reasons against feel rather contrived. (It's not at all a takedown to say "I disagree, your arguments feel contrived, bye!", but I figured I'd rather write a small comment than not engage at all)

If an acquaintance approached me on the street, asked me to sign a piece of paper that says "I, TurnTrout, give [acquaintance] ownership over my metaphysical soul" in exchange for $10 (and let's just ignore other updates I should make based on being approached with such a w... (read more)

Unless you're really desperate, it just seems like a bad idea to sign any kind of non-standard contract for $10. There's always a chance that you're misunderstanding the terms, or that the contract gets challenged at some point, or even that your signature on the contract is used as blackmail. Maybe you're trying to run for office or get a job at some point in the future, and the fact that you've sold your soul is used against you. The actual contract that Jacob references is long enough that even taking the time to read and understand it is worth signific... (read more)

[2] DanielFilan, 1mo: I declare replies to this comment to be devoted to getting data on this question.

I mean "soul" is clearly much closer to having a meaning than "florepti xor bobble". You can tell that an em is pretty similar to being a soul but hand sanitizer is not really. You know some properties that souls are supposed to have. There are various secular accounts of what a soul is that basically match the intuition (e.g. your personality).

I actually started this essay thinking "eh, I don't think this matters too much", but by the end of it I was just like "yeah, this checks out." I think "Don't casually make contracts you don't intend to keep" is just pretty cruxy for me. This is a key piece of being a trustworthy person who can coordinate in complex, novel domains. There might be a price where it is worth it to do it as a joke, but $10 is way too low.

Suppose instead that the acquaintance approached me with a piece of paper that says "I, TurnTrout, give [acquaintance] ownership over

Disentangling Corrigibility: 2015-2021

where these people feel the need to express their objections even before reading the full paper itself

I'd very much like to flag that my comment isn't meant to judge the contributions of your full paper. My comment was primarily judging your abstract and why it made me feel weird/hesitant to read the paper. The abstract is short, but it is important to optimize so that your hard work gets the proper attention!

(I had about half an hour at the time; I read about 6 pages of your paper to make sure I wasn't totally off-base, and then spent the rest of the time... (read more)

Disentangling Corrigibility: 2015-2021

I very much agree with Eliezer about the abstract making big claims. I haven't read the whole paper, so forgive any critiques which you address later on, but here are some of my objections.

I think you might be discussing corrigibility in the very narrow sense of "given a known environment and an agent with a known ontology, such that we can pick out a 'shutdown button pressed' event in the agent's world model, the agent will be indifferent to whether this button is pressed or not."

1. We don't know how to robustly pick out things in the agent's world mod
TurnTrout's shortform feed

The discussion of the HPMOR epilogue in this recent April Fool's thread was essentially online improv, where no one could acknowledge that without ruining the pretense. Maybe I should do more improv in real life, because I enjoyed it!

Why We Launched LessWrong.SubStack

I think it's pretty obvious.

• Julia, Luke, Scott, and Eliezer know each other very well.
• Exactly three months ago, they all happened to consult their mental simulations of each other for advice on their respective problems, at the same time.
• Recognizing the recursion that would result if they all simulated each other simulating each other simulating each other... etc, they instead searched over logically-consistent universe histories, grading each one by expected utility.
• Since each of the four has a slightly different utility function, they of course aca
[6] Ben Pace, 1mo: (absolutely great use of that link)
Why We Launched LessWrong.SubStack

Oh, another thing: I think it was pretty silly that Eliezer had Harry & co infer the existence of the AI alignment problem and then have Harry solve the inner alignment problem.

1. That plot point needlessly delayed the epilogue while we waited for Eliezer to solve inner alignment for the story's sake.
2. It was pretty mean of Eliezer to spoil that problem's solution. Some of us were having fun thinking about it on our own, thanks.
Why We Launched LessWrong.SubStack

I only read the HPMOR epilogue because - let's be honest - HPMOR is what LessWrong is really for.

• Honestly, although I liked the scene with Harry and Dumbledore, I would have preferred Headmaster Dobby not be present.
• I now feel bad for thinking Ron was dumb for liking Quidditch so much. But with hindsight, you can see his benevolent influence guiding events in literally every single scene. Literally. It was like a lorry missed you and your friends and your entire planet by centimetres - simply because someone threw a Snitch at so

It's not clear to me why we need this tag.

2Yoav Ravid1moAgree
My AGI Threat Model: Misaligned Model-Based RL Agent

It seems to me that deliberation can expand the domain of the value function. If I don’t know of football per se, but I’ve played a sport before, then I can certainly imagine a new game and form opinions about it. So I’m not sure how large the minimal set of generator concepts is, or if that’s even well-defined.

2Steven Byrnes1moStrong agree. This is another way that it's a hard problem.
How do we prepare for final crunch time?

For this to matter, our alignment researchers need to be at the cutting edge of AI capabilities, and they need to be positioned such that their work can actually be incorporated into AI systems as they are deployed.

If we become aware that a lab will likely deploy TAI soon, other informed actors will probably become aware as well. This implies that many people would be trying to influence and gain access to this lab. Therefore, we should already have AI alignment researchers in positions of power within the lab before this happens.

5Eli Tyre1moStrong agree.
Raj Thimmiah's Shortform

I took Raj up on this generous offer. I'll post updates in the next few weeks on how SM compares to Anki!

Generalizing Power to multi-agent games

But perhaps a better way forward would be to define a new concept of "Useful power" or something like that, which equals your share of the total power in a zero-sum game.

I don’t see why useful power is particularly useful, since it’s taking a non-constant-sum quantity (outside of Nash equilibria) and making it constant-sum, which seems misleading.

But I also don’t see a problem with the “better play -> less exploitability -> less total Power” reasoning. This feels like a situation where our naive intuitions about power are just wrong, and if you think about it more, the formal result reflects a meaningful phenomenon.

4Daniel Kokotajlo1moDifferent strokes for different folks, I guess. It feels very different to me.
The Fusion Power Generator Scenario

I somehow agree with both you and OP, and also I don't buy part of the lever analogy yet. It seems important that the levers not only look similar, but that they be close to each other, in order to expect users to reliably mess up. Similarly, strong tool AI will offer many, many affordances, and it isn't clear how "close" I should expect them to be in use-space. From the security mindset, that's sufficient cause for serious concern, but I'm still trying to shake out the expected value estimate for powerful tool AIs -- will they be thermonuclear-weapon-like (as in your post), or will mistakes generally look different?

2johnswentworth1moOne way in which the analogy breaks down: in the lever case, we have two levers right next to each other, and each does something we want - it's just easy to confuse the levers. A better analogy for AI might be: many levers and switches and dials have to be set to get the behavior we want, and mistakes in some of them matter while others don't, and we don't know which ones matter when. And sometimes people will figure out that a particular combination extends the flaps, so they'll say "do this to extend the flaps", except that when some other switch has the wrong setting and it's between 4 and 5 pm on Thursday that combination will still extend the flaps, but it will also retract the landing gear, and nobody noticed that before they wrote down the instructions for how to extend the flaps. Some features which this analogy better highlights: * Most of the interface-space does things we either don't care about or actively do not want * Even among things which usually look like they do what we want, most do something we don't want at least some of the time * The system has a lot of dimensions, we can't brute-force check all combinations, and problems may be in seemingly-unrelated dimensions
flaritza's Shortform

before you have a chance to do something useful

That statement seems far too strong, at least if you aren’t just talking about a very narrow subset of AI safety research (part of MIRI’s agenda). At a glance, that website gauges a skillset associated with one flavor of proof-based mathematics. For proof-based AI safety work, I think that the more important and general skill is: can you make meaningful formal conjectures and then prove them?

4Viliam1moI admit I am confused about what exactly "proof based math" means. I assumed that in general all math is proof based, so this specifically refers to computer proofs. If not, then of course my advice does not apply.
My AGI Threat Model: Misaligned Model-Based RL Agent

Great post!

Do you like football? Well “football” is a learned concept living inside your world-model. Learned concepts like that are the only kinds of things that it’s possible to “like”. You cannot like or dislike [nameless pattern in sensory input that you’ve never conceived of]. It’s possible that you would find this nameless pattern rewarding, were you to come across it. But you can’t like it, because it’s not currently part of your world-model. That also means: you can’t and won’t make a goal-oriented plan to induce that pattern.

This was a ‘click’ for me, thanks.

4TurnTrout1moIt seems to me that deliberation can expand the domain of the value function. If I don’t know of football per se, but I’ve played a sport before, then I can certainly imagine a new game and form opinions about it. So I’m not sure how large the minimal set of generator concepts is, or if that’s even well-defined.
Generalizing Power to multi-agent games

Thanks so much for your comment! I'm going to speak for myself here, and not for Jacob.

That being said, I'm a bit underwhelmed by this post. Not that I think the work is wrong, but it looks like it boils down to saying (with a clean formal shape) things that I personally find pretty obvious: playing better at a zero (or constant sum) games means that the other players have less margin to get what they want. I don't feel that either the formalization of power nor the theorem bring me any new insight, and so I have trouble getting interested. Maybe I'm just

I want to go a bit deeper into the fine points, but my general reaction is "I wanted that in the post". You make a pretty good case for a way to come around at this definition that makes it particularly exciting. On the other hand, I don't think that stating a definition and proving a single theorem that has the "obvious" quality (whether or not it is actually obvious, mind you) is that convincing.

The best way to describe my interpretation is that I feel that you two went for the "scientific paper" style, but the current state...

Generalizing Power to multi-agent games

Probably going to reply to the rest later (and midco can as well, of course), but regarding:

Coming back after reading more, do you use  to mean "the strategy profile for every process except "? That would make more sense of the formulas (since you fix , there's no reason to have a ) but if it's the case, then this notation is horrible (no offense).

By the way, indexing the other strategies by  instead of, let's say  or  is quite unconventional and confusing.

Using "" to mean "the strategy ...

2adamShimi1moOk, that's fair. It's hard to know which notation is common knowledge, but I think that adding a sentence explaining this one will help readers who haven't studied game theory formally. Maybe making all vector profiles bold (like for the action profile) would help to see at a glance the type of the parameter. If I had seen it was a strategy profile, I would have inferred immediately what it meant.