All of adamShimi's Comments + Replies

Formal Inner Alignment, Prospectus

I think you should be careful not to mix up an analogy and an isomorphism. I agree that there is a pretty natural analogy with the cancer case, but it falls far short of an isomorphism at the moment. You don't have an argument showing that the mechanisms used by cancer cells are similar to those creating mesa-optimizers, that the processes creating them are similar, etc.

I'm not saying that such a lower level correspondence doesn't exist. Just that saying "Look, the very general idea is similar" is not a strong enough argument for such a correspondence.

-1Waddington2hAll analogies rely on isomorphisms. They simply refer to shared patterns. A good analogy captures many structural regularities that are shared between two different things while a bad one captures only a few. The field of complex adaptive systems (CADs) is dedicated to the study of structural regularities between various systems operating under similar constraints. Ant colony optimization and simulated annealing can be used to solve an extremely wide range of problems because there are many structural regularities to CADs. I worry that a myopic focus will result in a lot of time wasted on lines of inquiry that have parallels in a number of different fields. If we accept that the problem of inner alignment can be formalized, it would be very surprising to find that the problem is unique in the sense that it has no parallels in nature. Especially considering the obvious general analogy to the problem of cancer which may or may not provide insight to the alignment problem.
Formal Inner Alignment, Prospectus

Haven't read the full comment thread, but on this sentence

Or maybe inner alignment just shouldn't be seen as the compliment of outer alignment!

Evan actually wrote a post to explain that it isn't the complement for him (and not the compliment either :p) 

Formal Inner Alignment, Prospectus

Thanks for the post!

Here is my attempt at detailed peer-review feedback. I admit I'm more excited to do this because you're asking for it directly, and so I actually believe there will be some answer (which in my experience is rarely the case for my in-depth comments).

One thing I really like is the multiple "failure" stories at the beginning. It's usually frustrating in posts like this to see people argue against positions/arguments which are not written down anywhere. Here we can actually see the problematic arguments.

I responded that for me, the whole po

... (read more)
2abramdemski8hThanks! Right. By "no connection" I specifically mean "we have no strong reason to posit any specific predictions we can make about mesa-objectives from outer objectives or other details of training" -- at least not for training regimes of practical interest. (I will consider this detail for revision.) I could have also written down my plausibility argument (that there is actually "no connection"), but probably that just distracts from the point here. (More later!)
Challenge: know everything that the best go bot knows about go

What does that mean, though? If you give the go professional a massive transcript of the bot's knowledge, it's probably unusable. I think what the go professional gives you is the knowledge of where to look/what to ask for/what to search for.

5Nisan3dOr maybe it means we train the professional in the principles and heuristics that the bot knows. The question is if we can compress the bot's knowledge into, say, a 1-year training program for professionals. There are reasons to be optimistic: We can discard information that isn't knowledge (lossy compression). And we can teach the professional in human concepts (lossless compression).
Challenge: know everything that the best go bot knows about go

That's basically what Paul's universality (see my distillation post for another angle) is aiming for: having a question-answering overseer which can tell you everything you want to know about what the system knows and what it will do. You still probably need to be able to ask a relevant question, which I think is what you're pointing at.

Mundane solutions to exotic problems

Sorry about that. I corrected it but it was indeed the first link you gave.

April 15, 2040

This seems like a role for the law. Like having corrigibility, except for breaking the law. I find that reasonable at first glance, but I know too little about law in different countries to understand how uncompetitive that would make the AIs.

(There's also a risk of giving too much power to the legislative authority in your country, if you're worried about that kind of thing)

Although I could imagine something like a modern-day VPN allowing you to make your AI believe it's in another country, so as to make it do something illegal where you are. That's bad in a country with useful laws and good in a country with an authoritarian regime.

4evhub9dYour link is broken. For reference, the first post in Paul's ascription universality sequence can be found here [https://ai-alignment.com/towards-formalizing-universality-409ab893a456] (also Adam has a summary here [https://www.alignmentforum.org/posts/farherQcqFQXqRcvv/universality-unwrapped] ).
[Linkpost] Teaching Paradox, Europa Univeralis IV, Part I: State of Play

Yes, that's one reason I felt that this particular post might resonate with people here.

[Linkpost] Teaching Paradox, Europa Univeralis IV, Part I: State of Play

My bad, I thought that just putting the link at the beginning of the post would make it a linkpost, but that's not how it works. Also, apparently if you make a link without a linked URL, it goes back to the post containing the link (as you mentioned).

This should now be fixed.

2gjm11dYup, seems OK now.
April 2021 Deep Dive: Transformers and GPT-3

You're welcome!

Thanks for the link. After a quick look, it seems like a good complementary resource for what I did in weeks 3 and 4.

April 2021 Deep Dive: Transformers and GPT-3

Glad you liked it! Good luck with your learning, I'm curious to see how my path and recommendations generalize.

April 2021 Deep Dive: Transformers and GPT-3

To be fair, I doubted a bit whether this type of post was really valuable. So some sort of signal that we as a community are interested in these might be useful.

AMA: Paul Christiano, alignment researcher

Copying my question from your post about your new research center (because I'm really interested in the answer): which part (if any) of theoretical computer science do you expect to be particularly useful for alignment?

6paulfchristiano13dLearning theory definitely seems most relevant. Methodologically I think any domain where you are designing and analyzing algorithms, especially working with fuzzy definitions or formalizing intuitive problems, is also useful practice though much less bang for your buck (especially if just learning about it rather than doing research in it). That theme cuts a bunch across domains, though I think cryptography, online algorithms, and algorithmic game theory are particularly good.
Coherence arguments imply a force for goal-directed behavior

Yeah, this is an accurate portrayal of my views. I'd also note that the project of mapping internal concepts to mathematical formalisms was the main goal of the whole era of symbolic AI, and failed badly. (Although the analogy is a little loose, so I wouldn't take it as a decisive objection, but rather a nudge to formulate a good explanation of what they were doing wrong that you will do right.)

My first intuition is that I expect mapping internal concepts to mathematical formalisms to be easier when the end goal is deconfusion and making sense of behaviors,... (read more)

Coherence arguments imply a force for goal-directed behavior

Analogously, it seems very hard to have a good understanding of goals without talking about concepts, instincts, desires, etc, and the roles that all of these play within cognition as a whole - concepts which people just don't talk about much around here. I hypothesise that this is partly because they think they can talk about utilities instead. But when people reason about how to design AGIs in terms of utilities, on the basis of coherence theorems, then I think they're making a very similar mistake as a doctor who tries to design artificial livers based

... (read more)
2Richard_Ngo15dYeah, this is an accurate portrayal of my views. I'd also note that the project of mapping internal concepts to mathematical formalisms was the main goal of the whole era of symbolic AI, and failed badly. (Although the analogy is a little loose, so I wouldn't take it as a decisive objection, but rather a nudge to formulate a good explanation of what they were doing wrong that you will do right.) I don't think this is an accurate portrayal of my views. I am trying to say that utility functions are a bad abstraction for reasoning about AGI, for similar reasons to why health points are a bad abstraction for reasoning about livers. (I think I agree with the rest of the paragraph though.)
Announcing the Alignment Research Center

This is so great! I always hate wishing people luck when I trust in their competence to mostly deal with bad luck and leverage good luck. I'll use that one now.

Announcing the Alignment Research Center

Sounds really exciting! I'm wondering which kind of theoretical computer science you have in mind specifically? Like, which part of it do you think has the most uses for alignment? (Still trying to find a way to use my PhD in the theory of distributed computing for something alignment-related ^^)

Gradations of Inner Alignment Obstacles

Agreed, it depends on the training process.

Gradations of Inner Alignment Obstacles

Now, according to ELH, we might expect that in order to learn deceptive or non-deceptive behavior we start with an NN big enough to represent both as hypotheses (within the random initialization).

But if our training method (for part (2) of the basin plan) only works under the assumption that no deceptive behavior is present yet, then it seems we can't get started.

This argument is obviously a bit sloppy, though.

I guess the crux here is how much deceptiveness do you need before the training method is hijacked. My intuition is that you need to be relatively c... (read more)

I guess the crux here is how much deceptiveness you need before the training method is hijacked. My intuition is that you need to be relatively competent at deception, because the standard argument for why (say) SGD will make good deceptive models more deceptive is that making them less deceptive would mean a bigger loss, and so it pushes towards more deception.

I agree, but note that different methods will differ in this respect. The point is that you have to account for this question when making a basin of attraction argument.

Where are intentions to be found?

I have two reactions while reading this post:

  • First, even if we say that a given human (for example) at a fixed point in time doesn't necessarily contain everything that we would want the AI to learn, having it learn only what's in there might already make a lot of alignment failures disappear. For example, paperclip maximizers are probably ruled out by taking one human's values at a point in time and extrapolating. But that clearly doesn't help with scenarios where the AI does the sorts of bad things humans can do, for example.
  • Second, I would argue th
... (read more)
Gradations of Inner Alignment Obstacles

Cool post! It's clearly not super polished, but I think you're pointing at a lot of important ideas, and so it's a good thing to publish it relatively quickly.

The standard definition of "inner optimizer" refers to something which carries out explicit search, in service of some objective. It's not clear to me whether/when we should focus that narrowly. Here are some other definitions of "inner optimizer" which I sometimes think about.

As far as I understand it, the initial assumption of internal search was mostly made for two reasons: because then you can sp... (read more)

8abramdemski18dRight, so, the point of the argument for basin-like proposals is this: A basin-type solution has to 1. initialize in such a way as to be within a good basin / not within a bad basin. 2. Train in a way which preserves this property. Most existing proposals focus on (2) and don't say that much about (1), possibly counting on the idea that random initializations will at least not be actively deceptive. The argument I make in the post is meant to question this, pointing toward a difficulty in step (1). One way to put the problem in focus: suppose the ensemble learning hypothesis: Ensemble learning hypothesis (ELH): Big NNs basically work as a big ensemble of hypotheses, which learning sorts through to find a good one. This bears some similarity to lottery-ticket thinking. Now, according to ELH, we might expect that in order to learn deceptive or non-deceptive behavior we start with an NN big enough to represent both as hypotheses (within the random initialization). But if our training method (for part (2) of the basin plan) only works under the assumption that no deceptive behavior is present yet, then it seems we can't get started. This argument is obviously a bit sloppy, though.
2johnswentworth23dYes.
Updating the Lottery Ticket Hypothesis

The main empirical finding which led to the NTK/GP/Mingard et al picture of neural nets is that, in practice, that linear approximation works quite well. As neural networks get large, their parameters change by only a very small amount during training, so the overall Δθ found during training is actually nearly a solution to the linearly-approximated equations.

Trying to check if I'm understanding correctly: does that mean that despite SGD doing a lot of successive changes that use the gradient at the successive parameter values, these "even out" s... (read more)

6johnswentworth23dSort of. They end up equivalent to a single Newton step, not a single gradient step (or at least that's what this model says). In general, a set of linear equations is not solved by one gradient step, but is solved by one Newton step. It generally takes many gradient steps to solve a set of linear equations. (Caveat to this: if you directly attempt a Newton step on this sort of system, you'll probably get an error, because the system is underdetermined. Actually making Newton steps work for NN training would probably be a huge pain in the ass, since the underdetermination would cause numerical issues.)
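To make that concrete, here is a toy numerical sketch (my own made-up least-squares example, not from the comment; it uses an overdetermined, full-column-rank problem, which sidesteps the underdetermination caveat mentioned above):

```python
# Toy illustration: for a quadratic objective (equivalently, a linear system),
# a single Newton step lands exactly on the solution, while a single gradient
# step generally does not.  All data here is made up.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))      # full column rank with probability 1
b = rng.normal(size=10)
theta0 = rng.normal(size=3)       # "initialization"

grad = A.T @ (A @ theta0 - b)     # gradient of 0.5 * ||A @ theta - b||^2
hess = A.T @ A                    # Hessian (constant for a quadratic)

theta_newton = theta0 - np.linalg.solve(hess, grad)  # one Newton step
theta_gd = theta0 - 0.01 * grad                      # one gradient step

# Residual of the normal equations A^T (A theta - b) = 0:
print(np.linalg.norm(A.T @ (A @ theta_newton - b)))  # ~1e-15: solved in one step
print(np.linalg.norm(A.T @ (A @ theta_gd - b)))      # still far from zero
```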
Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

I've been wanting to try SuperMemo for a while, especially given the difficulty that you mention with making Anki cards. But it doesn't run natively on linux AFAIK, and I can't be bothered for the moment to make it work using wine.

2TurnTrout1moApparently VMs are the way to go for pdf support on linux.
Identifiability Problem for Superrational Decision Theories

As outlined in the last paragraph of the post. I want to convince people that TDT-like decision theories won't give a "neat" game theory, by giving an example where they're even less neat than classical game theory.

Hum, then I'm not sure I understand in what way classical game theory is neater here?

I think you're thinking about a realistic case (same algorithm, similar environment) rather than the perfect symmetry used in the argument. A communication channel is of no use there because you could just ask yourself what you would send, if you had one, and th

... (read more)
1Bunthut1moChanging the labels doesn't make a difference classically. Yes. No, I think you should take the problems of distributed computing, and translate them into decision problems, that you then have a solution to.
Identifiability Problem for Superrational Decision Theories

Well, if I understand the post correctly, you're saying that these two problems are fundamentally the same problem, and so rationality should be able to solve them both if it can solve one. I disagree with that, because from the perspective of distributed computing (which I'm used to), these two problems are exactly the two kinds of problems that are fundamentally distinct in a distributed setting: agreement and symmetry-breaking.

Communication won't make a difference if you're playing with a copy.

Actually it could. Basically all of distributed computing as... (read more)

1Bunthut1moNo. I think: As outlined in the last paragraph of the post. I want to convince people that TDT-like decision theories won't give a "neat" game theory, by giving an example where they're even less neat than classical game theory. I think you're thinking about a realistic case (same algorithm, similar environment) rather than the perfect symmetry used in the argument. A communication channel is of no use there because you could just ask yourself what you would send, if you had one, and then you know you would have just gotten that message from the copy as well. I'd be interested. I think even just more solved examples of the reasoning we want are useful currently.
"Taking your environment as object" vs "Being subject to your environment"

Not sure I'm the right person to ask about that, because I tend to doubt basically anything I say or think (though not all of it at the same time), and sometimes I forget why something makes sense and spend quite some time trying to find a good explanation again. So I guess I'm naturally the type that gets something out of the imagined version.

Specializing in Problems We Don't Understand

The impression I always had of general systems (from afar) was that it looked cool, but it never seemed useful for anything other than "thinking in systems" (so not useful for doing research in another field or building concrete applications). That's why I never felt interested. Note that I'm clearly not knowledgeable at all on the subject; this is just my outside impression.

I assume from your comment you think that's wrong. Is the Weinberg book a good resource for educating myself and seeing how wrong I am?

Specializing in Problems We Don't Understand

Fair enough. But identifying good subproblems of well-posed problems is a different skill from identifying good well-posed subproblems of a weird, not-yet-formalized problem. An example of the first would be simplifying the problem as much as possible without making it trivial (a classic technique in algorithm analysis and design), whereas an example of the second would be defining the logical induction criterion, which creates the problem of finding a logical inductor (not sure that happened in this order, this is a part of what's weird with problem formula... (read more)

2johnswentworth1moThis deserves its own separate response. At a high level, we can split this into two parts: * developing intuitions * translating intuitions into math We've talked about the translation step a fair bit before (the conversation which led to this post [https://www.lesswrong.com/posts/GhFoAxG49RXFzze5Y/what-s-so-bad-about-ad-hoc-mathematical-definitions] ). A core point of that post is that the translation from intuition to math should be faithful, and not inject any "extra assumptions" which weren't part of the intuition. So, for instance, if I have an intuition that some function is monotonically increasing between 0 and 1, then my math should say "assume f(x) is monotonically increasing between 0 and 1", not "let f(x) = x^2"; the latter would be making assumptions not justified by my intuition. (Some people also make the opposite mistake - failing to include assumptions which their intuition actually does believe. Usually this is because the intuition only feels like it's mostly true or usually true, rather than reliable and certain; the main way to address this is to explicitly call the assumption an approximation.) On the flip side, that implies that the intuition has to do quite a bit of work, and the linked post talked a lot less about that. How do we build these intuitions in the first place? The main way is to play around with the system/problem. Try examples, and see what they do. Try proofs, and see where they fail. Play with variations on the system/problem. Look for bottlenecks and barriers, look for approximations, look for parts of the problem-space/state-space with different behavior. Look for analogous systems in the real world, and carry over intuitions from them. Come up with hypotheses/conjectures, and test them.
4johnswentworth1moI disagree with the claim that "identifying good subproblems of well-posed problems is a different skill from identifying good well-posed subproblems of a weird and not formalized problem", at least insofar as we're focused on problems for which current paradigms fail. P vs NP is a good example here. How do you identify a good subproblem for P vs NP? I mean, lots of people have come up with subproblems in mathematically-straightforward ways, like the strong exponential time hypothesis or P/poly vs NP. But as far as we can tell so far, these are not very good subproblems - they are "simplifications" in name only, and whatever elements make P vs NP hard in the first place seem to be fully maintained in them. They don't simplify the parts of the original problem which are actually hard. They're essentially variants of the original problem, a whole cluster of problems which are probably-effectively-identical in terms of the core principles. They're not really simplifications. Simplifying an actually-hard part of P vs NP is very much a fuzzy conceptual problem. We have to figure out how-to-carve-up-the-problem in the right way, how to frame it so that a substantive piece can be reduced. I suspect that your intuition that "there are way more useful and generalizable techniques for the first case than the second case" is looking at things like simplifying-P-vs-NP-to-strong-exponential-time-hypothesis, and confusing these for useful progress on the hard part of a hard problem. Something like "simplify the problem as much as possible without making it trivial" is a very useful first step, but it's not the sort of thing which is going address the hardest part of a problem when the current paradigm fails. (After all, the current paradigm is usually what underlies our notion of "simplicity".)
[LINK] Luck, Skill, and Improving at Games

Pretty cool. The part about not blaming luck reminded me a lot of the advice to not adopt a victim mindset. I also like the corresponding advice to not take credit for luck.

"Taking your environment as object" vs "Being subject to your environment"

So far I have said there are three ways of getting perspective on your environment: leaving it, imagining yourself into someone outside of it, and assuming that it's hostile.

 

What are some other ways of getting a better perspective on your environment?

Imagining myself explaining the environment to someone else, or literally doing that. That's also a very useful technique for checking understanding, and I think it uses the same mechanism: when you read a paper, you feel a sense of familiarity and obviousness that makes you think you understand. But if ... (read more)

3Rana Dexsin1moI would expect the imagined version to not work as well for someone who isn't already used to trying to see their environment from the outside, since they're likely to just imagine someone else who's used to the same environment (because it's normal and obvious, right?), after which the explanation can just be the “official” explanation. Any experiential information on that?
Specializing in Problems We Don't Understand

This looks like an expansion of the folklore separation between engineering and research in technical fields like computer science: engineering is solving a problem we know how to solve, or for which we know the various pieces needed to solve it, whereas research is solving a problem no one has ever solved, such that we don't expect/don't know whether the standard techniques apply. Of course this is not exactly accurate, and it generalizes to fields that we wouldn't think of as engineering.

I quite like your approach; it looks like the training for an applied mathematician (in the s... (read more)

5johnswentworth1moThat's a good one to highlight. In general, there's a lot of different skills which I didn't highlight in the post (in many cases because I haven't even explicitly noticed them) which are still quite high-value. The outside-view approach of having a variety of problems you don't really understand should still naturally encourage building those sorts of skills. In the case of problem formulation, working on a wide variety of problems you don't understand will naturally involve identifying good subproblems. It's especially useful to look for subproblems which are a bottleneck for multiple problems at once - i.e. generalizable bottleneck problems. That's exactly the sort of thinking which first led me to (what we now call) the embedded agency cluster of problems, as well as the abstraction work.
Identifiability Problem for Superrational Decision Theories

I don't see how the two problems are the same. They are basically the agreement and symmetry-breaking problems of distributed computing, and those two are not equivalent in all models. What you're saying is simply that in the no-communication model (where the same algorithm is used on two processes that can't communicate), these two problems are equivalent. But they are asking for fundamentally different properties, and are not equivalent in many models that actually allow communication.

1Bunthut1mo"The same" in what sense? Are you saying that what I described in the context of game theory is not surprising, or outlining a way to explain it in retrospect? Communication won't make a difference if you're playing with a copy.
Phylactery Decision Theory

I feel like doing a better job of motivating why we should care about this specific problem might help get you more feedback.

If we want to alter a decision theory to learn its set of inputs and outputs, your proposal makes sense to me at first glance. But I'm not sure why I should particularly care, or why there is even a problem to solve to begin with. The link you provide doesn't help me much after skimming it, and I (and I assume many people) almost never read something that requires me to read other posts without even a summary of the references. I mad... (read more)

1Bunthut1moThe link would have been to better illustrate how the proposed system works, not about motivation. So, it seems that you understood the proposal, and wouldn't have needed it. I don't exactly want to learn the cartesian boundary. A cartesian agent believes that its input set fully screens off any other influence on its thinking, and the outputs screen off any influence of the thinking on the world. Its very hard to find things that actually fulfill this. I explain how PDT can learn cartesian boundaries, if there are any, as a sanity/conservative extension check. But it can also learn that it controls copies or predictions of itself for example.
Testing The Natural Abstraction Hypothesis: Project Intro

This project looks great! I especially like the focus on a more experimental kind of research, while still staying focused on, and informed by, the specific concepts you want to investigate.

If you need some feedback on this work, don't hesitate to send me a message. ;)

Open & Welcome Thread – March 2021

To be clear, I was just answering the comment, not complaining again about the editor. I think it's great, and the footnote is basically a nitpick (but a useful nitpick). I also totally get it if it takes quite some time and work to implement. ;)

Open & Welcome Thread – March 2021

Thanks for the link!

But yeah, I like using the WYSIWYG editor, at least if I have to edit on LW directly (otherwise Vim is probably still my favorite).

2habryka1moYeah, I really want to get around to this. I am sorry for splitting the feature-set awkwardly across two editors!
TAI?

Quick answer without any reference, so probably biased towards my internal model: I don't think we've reached TAI yet, because I believe that if you removed every application of AI in the world (to simplify the definition, every product of ML), the vast majority of people wouldn't see any difference, or maybe even a positive difference (less attention manipulation on social media, for example).

Compare with removing every computing device, or removing electricity.

And taking as examples the AI we're making now, I expect that your first two points are wrong: peo... (read more)

Vanessa Kosoy's Shortform

Oh, right, that makes a lot of sense.

So is the general idea that we quantilize such that we're choosing, in expectation, an action that doesn't have corrupted utility (intuitively, by having something like more than twice as many actions in the quantilization as we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?

I also wonder if using the user policy to sample actions isn't limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?

2Vanessa Kosoy1moYes, although you probably want much more than twice. Basically, if the probability of corruption following the user policy is ϵ and your quantilization fraction is ϕ then the AI's probability of corruption is bounded by ϵ/ϕ. Obviously it is limiting, but this is the price of safety. Notice, however, that the quantilization strategy is only an existence proof. In principle, there might be better strategies, depending on the prior (for example, the AI might be able to exploit an assumption that the user is quasi-rational). I didn't specify the AI by quantilization, I specified it by maximizing EU subject to the Hippocratic constraint. Also, the support is not really the important part: even if the support is the full action space, some sequences of actions are possible but so unlikely that the quantilization will never follow them.
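For reference, here is a sketch of the standard argument behind that ϵ/ϕ bound, in my notation (not quoted from Vanessa):

```latex
% Sketch of the quantilizer corruption bound.
% Let \pi_0 be the user's policy and \pi_\phi the \phi-quantilized policy,
% i.e. \pi_\phi samples from the top \phi-fraction of \pi_0 (ranked by the
% reported success estimate), so its density ratio satisfies
% d\pi_\phi / d\pi_0 \le 1/\phi.  If \Pr_{\pi_0}[\text{corruption}] = \epsilon, then
\Pr_{\pi_\phi}[\text{corruption}]
  = \mathbb{E}_{\pi_0}\!\left[\frac{d\pi_\phi}{d\pi_0}\,\mathbf{1}[\text{corruption}]\right]
  \le \frac{1}{\phi}\,\Pr_{\pi_0}[\text{corruption}]
  = \frac{\epsilon}{\phi}.
```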
Review of "Fun with +12 OOMs of Compute"

About the update

You're right, that's what would happen with an update.

I think that the model I have in mind (although I hadn't explicitly thought about it until now) is something like a distribution over ways to reach TAI (capturing how probable it is that they're the first way to reach AGI), and each option comes with its own distribution (let's say over years). Obviously you can compress that into a single distribution over years, but then you lose the ability to do fine-grained updating.

For example, I imagine that someone with relatively low probabili... (read more)

Review of "Fun with +12 OOMs of Compute"

Let me try to make an analogy with your argument.

Say we want to make X. What you're saying is "with 10^12 dollars, we could do it that way". Why on earth would I update at all on whether it can be done with 10^6 dollars? If your scenario works with that amount, then you should have described it using only that much money. If it doesn't, then you're not providing evidence for the cheaper case.

Similarly here, if someone starts with a low credence on prosaic AGI, I can see how your arguments would make them put a bunch of probability mass close to +10^12 compute... (read more)

I'm not sure, but I think that's not how updating works? If you have a bunch of hypotheses (e.g. "It'll take 1 more OOM," "It'll take 2 more OOMs," etc.) and you learn that some of them are false or unlikely (e.g. only a 10% chance of it taking more than 12), then you should redistribute the mass over all your remaining hypotheses, preserving their relative strengths. And yes, I have the same intuition about analogical arguments too. For example, let's say you overhear me talking about a bridge being built near my h... (read more)
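A minimal sketch of that renormalization, with made-up numbers (my illustration, not Daniel's actual distribution):

```python
# Start from some prior over "how many more OOMs are needed" (made-up numbers),
# then condition on "at most a 10% chance it takes more than 12 OOMs":
# the surviving hypotheses keep their relative strengths and are rescaled to 90%.
prior = {f"{k} more OOMs": 1 / 13 for k in range(1, 14)}  # flat over 1..13, for illustration

kept = {h: p for h, p in prior.items() if int(h.split()[0]) <= 12}
scale = 0.9 / sum(kept.values())
posterior = {h: p * scale for h, p in kept.items()}
posterior["more than 12 OOMs"] = 0.1

assert abs(sum(posterior.values()) - 1.0) < 1e-9
print(posterior)  # ratios among the 1-12 OOM hypotheses are unchanged
```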

Review of "Fun with +12 OOMs of Compute"

You're welcome!

To put it another way: I don't actually believe we will get to +12 OOMs of compute, or anywhere close, anytime soon. Instead, I think that if we had +12 OOMs, we would very likely get TAI very quickly, and then I infer from that fact that the probability of getting TAI in the next 6 OOMs is higher than it would otherwise be (if I thought that +12 OOMs probably wasn't enough, then my credence in the next 6 OOMs would be correspondingly lower).

To some extent this reply also partly addresses the concerns you raised about memory and bandwidth--I

... (read more)
9Daniel Kokotajlo1moThanks! Well, I agree that I didn't really do anything in my post to say how the "within 12 OOMs" credence should be distributed. I just said: If you distribute it like Ajeya does except that it totals to 80% instead of 50%, you should have short timelines. There's a lot I could say about why I think within 6 OOMs should have significant probability mass (in fact, I think it should have about as much mass as the 7-12 OOM range). But for now I'll just say this: If you agree with me re Question Two, and put (say) 80%+ probability mass by +12 OOMs, but you also disagree with me about what the next 6 OOMs should look like and think that it is (say) only 20%, then that means your distribution must look something like this: [Figure: probability distribution over how many extra OOMs of compute we need given current ideas.] EDIT to explain: Each square on this graph is a point of probability mass. The far-left square represents 1% credence in the hypothesis "It'll take 1 more OOM." The second-from-the-left represents "It'll take 2 more OOM." The third-from-the-left is a 3% chance it'll take 3 more OOM, and so on. The red region is the region containing the 7-12 OOM hypotheses. Note that I'm trying to be as charitable as I can when drawing this! I only put 2% mass on the far right (representing "not even recapitulating evolution would work!"). This is what I think the probability distribution of someone who answered 80% to my Question Two should look like if they really really don't want to believe in short timelines. Even on this distribution, there's a 20% chance of 6 or fewer OOMs being enough given current ideas/algorithms/etc. (And hence, about a 20% chance of AGI/TAI/etc. by 2030, depending on how fast you think we'll scale up and how much algorithmic progress we'll make.) And even this distribution looks pretty silly to me. Like, why is it so much more confident that 11 OOMs will be how much we need, than 13 OOMs? Given our current state of ignorance about AI, I think the slo
Vanessa Kosoy's Shortform

However, it can do much better than that, by short-term quantilizing w.r.t. the user's reported success probability (with the user's policy serving as baseline). When quantilizing the short-term policy, we can upper bound the probability of corruption via the user's reported probability of short-term failure (which we assume to be low, i.e. we assume the malign AI is not imminent). This allows the AI to find parameters under which quantilization is guaranteed to improve things in expectation.

I don't understand what you mean here by quantilizing. The meanin... (read more)

4Vanessa Kosoy1moThe distribution is the user's policy, and the utility function for this purpose is the eventual success probability estimated by the user (as part of the timeline report), in the end of the "maneuver". More precisely, the original quantilization formalism was for the one-shot setting, but you can easily generalize it, for example I did it [https://www.alignmentforum.org/posts/5bd75cc58225bf0670375556/quantilal-control-for-finite-mdps] for MDPs.
Generalizing Power to multi-agent games

Glad to be helpful!

I go into more detail in my answer to Alex, but what I want to say here is that I don't feel like you use the power-scarcity idea enough in the post itself. As you said, it's one of three final notes, without any particular emphasis on it.

So while I agree that power-scarcity is an important research question, it would be helpful IMO if this post put more emphasis on that connection.

Generalizing Power to multi-agent games

Thanks for the detailed reply!

I want to go a bit deeper into the finer points, but my general reaction is "I wanted that in the post". You make a pretty good case for a way of arriving at this definition that makes it particularly exciting. On the other hand, I don't think that stating a definition and proving a single theorem that has the "obvious" quality (whether or not it is actually obvious, mind you) is that convincing.

The best way to describe my interpretation is that I feel that you two went for the "scientific paper" style, but the current state... (read more)

Generalizing Power to multi-agent games

Ok, that's fair. It's hard to know which notation is common knowledge, but I think that adding a sentence explaining this one will help readers who haven't studied game theory formally.

Maybe making all vector profiles bold (like for the action profile) would help to see at a glance the type of the parameter. If I had seen it was a strategy profile, I would have inferred immediately what it meant.

Generalizing Power to multi-agent games

Exciting to see new people tackling AI Alignment research questions! (And I'm already excited by what Alex is doing, so him having more people work on his kind of research feels like a good thing.)

That being said, I'm a bit underwhelmed by this post. Not that I think the work is wrong, but it looks like it boils down to saying (with a clean formal shape) things that I personally find pretty obvious: playing better at a zero-sum (or constant-sum) game means that the other players have less margin to get what they want. I don't feel that either the formalizatio... (read more)

4Daniel Kokotajlo1mo"I disagree. The whole point of a zero-sum game (or even constant sum game) is that not everyone can win. So playing better means quite intuitively that the others can be less sure of accomplishing their own goals." IMO, the unintuitive and potentially problematic thing is not that in a zero-sum game playing better makes things worse for everybody else. That part is fine. The unintuitive and potentially problematic thing is that, according to this formalism, the total collective Power is greater the worse everybody plays. This seems adjacent to saying that everybody would be better off if everyone played poorly, which is true in some games (maybe) but definitely not true in zero-sum games. (Right? This isn't my area of expertise) EDIT: Currently I suppose what you'd say is that power =/= utility, and so even though we'd all have more power if we were all less competent, we wouldn't actually be better off. But perhaps a better way forward would be to define a new concept of "Useful power" or something like that, which equals your share of the total power in a zero-sum game. Then we could say that everyone getting less competent wouldn't result in everyone becoming more usefully-powerful, which seems like an important thing to be able to say. Ideally we could just redefine power that way instead of inventing a new concept of useful power, but maybe that would screw up some of your earlier theorems?
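One minimal way to write down the "useful power" suggestion above (my own sketch, not something defined in the post):

```latex
% Hypothetical definition: player i's share of total Power under strategy profile \sigma.
\mathrm{UsefulPower}_i(\sigma) \;=\;
  \frac{\mathrm{Power}_i(\sigma)}{\sum_{j} \mathrm{Power}_j(\sigma)}
% Normalizing by total Power means that any change which rescales every player's
% Power by the same factor (e.g. everyone playing uniformly worse) leaves each
% UsefulPower_i unchanged, matching the intuition that nobody is actually better off.
```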
9midco2moThank you so much for the comments! I'm pretty new to the platform (and to EA research in general), so feedback is useful for getting a broader perspective on our work. To add to TurnTrout's comments about power-scarcity and the CCC [https://www.lesswrong.com/s/7CdoznhJaLEKHwvJW/p/w6BtMqKRLxG9bNLMr], I'd say that the broader vision of the multi-agent formulation is to establish a general notion of power-scarcity as a function of "similarity" between players' reward functions (I mention this in the post's final notes). In this paradigm, the constant-sum case is one limiting case of "general power-scarcity", which I see as the "big idea". As a simple example, general power-scarcity would provide a direct motivation for fearing robustly instrumental goals, since we'd have reason to believe an AI with goals orthogonal(ish) from human goals would be incentivized to compete with humanity for Power. We're planning to continue investigating multi-agent Power and power-scarcity, so hopefully we'll have a more fleshed-out notion of general power-scarcity in the months to come. Also, re: "as players' strategies improve, their collective Power tends to decrease", I think your intuition is correct? Upon reflection, the effect can be explained reasonably well by "improving your actions has no effect on your Power, but a negative effect on opponents' Power".

Thanks so much for your comment! I'm going to speak for myself here, and not for Jacob.

That being said, I'm a bit underwhelmed by this post. Not that I think the work is wrong, but it looks like it boils down to saying (with a clean formal shape) things that I personally find pretty obvious: playing better at a zero-sum (or constant-sum) game means that the other players have less margin to get what they want. I don't feel that either the formalization of power or the theorem brings me any new insight, and so I have trouble getting interested. Maybe I'm just

... (read more)
6TurnTrout2moProbably going to reply to the rest later (and midco can as well, of course), but regarding: Using "σ−i" to mean "the strategy profile of everyone but player i" is common notation; I remember it being used in 2-3 game theory textbooks I read, and you can see its prominence by consulting the Wikipedia page for Nash equilibrium [https://en.wikipedia.org/wiki/Nash_equilibrium#Nash_Equilibrium]. Do I agree this is horrible notation? Meh. I don't know. But it's not a convention we pioneered in this work.
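For readers who haven't seen the convention, it just denotes the strategy profile with player i's component removed:

```latex
% Standard game-theory convention: the strategies of everyone except player i.
\sigma_{-i} \;=\; (\sigma_1, \ldots, \sigma_{i-1}, \sigma_{i+1}, \ldots, \sigma_n),
\qquad \sigma \;=\; (\sigma_i, \sigma_{-i}).
```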
Against evolution as an analogy for how humans will create AGI

Just wanted to say that this comment made me add a lot of things to my reading list, so thanks for that (but I'm clearly not well-read enough to go into the discussion).

4gwern2moFurther reading: https://www.reddit.com/r/reinforcementlearning/search/?q=flair%3AMetaRL&include_over_18=on&restrict_sr=on&sort=new [https://www.reddit.com/r/reinforcementlearning/search/?q=flair%3AMetaRL&include_over_18=on&restrict_sr=on&sort=new] https://www.gwern.net/Backstop#external-links [https://www.gwern.net/Backstop#external-links]