All of Daniel Kokotajlo's Comments + Replies

Challenge: know everything that the best go bot knows about go
When choosing between two moves that are both judged to win the game with probability 0.9999999, AlphaGo not choosing the move that maximizes points suggests that it does not use patterns about what optimal moves are in certain local situations to make its judgements.

I nitpick/object to your use of "optimal moves" here. The move that maximizes points is NOT the optimal move; the optimal move is the move that maximizes win probability. In a situation where you are many points ahead, plausibly the way to maximize win probability is not to try to get more points, but rather to try to anticipate and defend against weird crazy high-variance strategies your opponent might try.

Credibility of the CDC on SARS-CoV-2

It would be interesting to see this post updated, e.g. to describe the situation today or (even better) how it evolved over the course of 2020-2021.

NTK/GP Models of Neural Nets Can't Learn Features

I think I get this distinction; I realize the NN papers show the latter; I guess our disagreement is about how big a deal / how surprising this is.

Pre-Training + Fine-Tuning Favors Deception

Nice post! You may be interested in this related post and discussion.

I think you may have forgotten to put a link in "See Mesa-Search vs Mesa-Control for discussion."

Mark Xu (5d, +2): Thanks, fixed.
NTK/GP Models of Neural Nets Can't Learn Features

Ah, OK. Interesting, thanks. Would you agree with the following view:

"The NTK/GP stuff has neural nets implementing a "psuedosimplicity prior" which is maybe also a simplicity prior but might not be, the evidence is unclear. A psuedosimplicity prior is like a simplicity prior except that there are some important classes of kolmogorov-simple functions that don't get high prior / high measure."

Which would you say is more likely: (a) the NTK/GP stuff is indeed not universally data efficient, and thus modern neural nets aren't either, or (b) the NTK/GP stuff is indeed not universally data efficient, and thus modern neural nets aren't well-characterized by the NTK/GP stuff?

interstice (6d, +1): Yeah, that summary sounds right. I'd say (b) -- it seems quite unlikely to me that the NTK/GP are universally data-efficient, while neural nets might be (although that's mostly speculation on my part). I think the lack of feature learning is a stronger argument that NTK/GP don't characterize neural nets well.
NTK/GP Models of Neural Nets Can't Learn Features
Feature learning requires the intermediate neurons to adapt to structures in the data that are relevant to the task being learned, but in the NTK limit the intermediate neurons' functions don't change at all.
Any meaningful function like a 'car detector' would need to be there at initialization -- extremely unlikely for functions of any complexity.

I used to think it would be extremely unlikely for a randomly initialized neural net to contain a subnetwork that performs just as well as the entire neural net does after training. But the mu... (read more)

interstice (6d, +3): They would exist in a sufficiently big random NN, but their density would be extremely low, I think. Like, if you train a normal neural net with 15000 neurons and then there's a car detector, the density of car detectors is now 1/15000. Whereas I think the density at initialization is probably more like 1/2^50 or something like that (numbers completely made up), so they'd have a negligible effect on the NTK's learning ability ('slight tweaks' can't happen in the NTK regime, since no intermediate functions change, by definition). A difference with the pruning case is that the number of possible prunings increases exponentially with the number of neurons, but the number of neurons is linear. My take on the LTH is that pruning is basically just a weird way of doing optimization, so it's not that surprising you can get good performance.
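The arithmetic in this exchange can be made concrete with a toy calculation (all numbers are invented, matching the comment's own "numbers completely made up" caveat):

```python
# Toy arithmetic for the densities discussed above. The point: even if car
# detectors exist somewhere in a big random net, they are astronomically
# rarer than in a trained net, and the space of possible prunings grows
# exponentially while the number of neurons grows only linearly.
trained_density = 1 / 15_000   # one car detector among 15,000 trained neurons
init_density = 1 / 2**50       # hypothetical density at random initialization

print(f"trained/init density ratio: {trained_density / init_density:.1e}")

for n in (10, 100, 1000):
    print(f"{n} neurons -> 2^{n} ~ {float(2**n):.1e} possible prunings")
```

So even with these made-up numbers, trained nets are denser in car detectors by a factor of ~10^10, while the pruning space at n = 1000 already dwarfs any linear quantity.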
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Sorry I didn't notice this earlier! What do you think about the argument that Joar gave?

If a function is small-volume, it's complex, because it takes a lot of parameters to specify.

If a function is large-volume, it's simple, because it can be compressed a lot since most parameters are redundant.

It sounds like you are saying: Some small-volume functions are actually simple, or at least this might be the case for all we know, because maybe it's just really hard for neural networks to efficiently represent that function. This is especially... (read more)

interstice (6d, +3): Yeah, exactly -- the problem is that there are some small-volume functions which are actually simple. The argument for small-volume --> complex doesn't go through, since there could be other ways of specifying the function. Other senses of simplicity include various circuit complexities and Levin complexity. There's no argument that parameter-space volume corresponds to either of them, AFAIK. (You might think parameter-space volume would correspond to "neural net complexity", the number of neurons in a minimal-size neural net needed to compute the function, but I don't think this is true either -- every parameter is Gaussian, so it's unlikely for most to be zero.)
What are your favorite examples of adults in and around this community publicly changing their minds?

The funnest one off the top of my head is how Yudkowsky used to think that the best thing for altruists to do was build AGI as soon as possible, because that's the quickest way to solve poverty, disease, etc. and achieve a glorious transhuman future. Then he thought more (and talked to Bostrom, I was told) and realized that that's pretty much the exact opposite of what we should be doing. When MIRI was founded its mission was to build AGI as soon as possible.

(Disclaimer: This is the story as I remember it being told, it's entirely possible I'm wrong)

Raven (5d, +3): He recounts this story in the Sequences.
AMA: Paul Christiano, alignment researcher

My counterfactual attempts to get at the question "Holding ideas constant, how much would we need to increase compute until we'd have enough to build TAI/AGI/etc. in a few years?" This is (I think) what Ajeya is talking about with her timelines framework. Her median is +12 OOMs. I think +12 OOMs is much more than 50% likely to be enough; I think it's more like 80% and that's after having talked to a bunch of skeptics, attempted to account for unknown unknowns, etc. She mentioned to me that 80% seems plausible to her too but that sh... (read more)

AMA: Paul Christiano, alignment researcher

Hmm, I don't count "It may work but we'll do something smarter instead" as "it won't work" for my purposes.

I totally agree that noise will start to dominate eventually... but the thing I'm especially interested in with Amp(GPT-7) is not the "7" part but the "Amp" part. Using prompt programming, fine-tuning on its own library, fine-tuning with RL, making chinese-room-bureaucracies, training/evolving those bureaucracies... what do you think about that? Naively the scaling laws would predict that we... (read more)

AMA: Paul Christiano, alignment researcher

When you say hardware progress, do you just mean compute getting cheaper or do you include people spending more on compute? So you are saying, you guess that if we had 10 OOMs of compute today that would have a 50% chance of leading to human-level AI without any further software progress, but realistically you expect that what'll happen is we get +5 OOMs from increased spending and cheaper hardware, and then +5 "virtual OOMs" from better software?

Draft report on existential risk from power-seeking AI

Thanks for the thoughtful reply. Here are my answers to your questions:

Here is what you say in support of your probability judgment of 10% on "Conditional on it being both possible and strongly incentivized to build APS systems, APS systems will end up disempowering approximately all of humanity."

Beyond this, though, I’m also unsure about the relative difficulty of creating practically PS-aligned systems, vs. creating systems that would be practically PS-misaligned, if deployed, but which are still superficially attractive to deploy. One comm
... (read more)
Joe Carlsmith (6d, +3): Hi Daniel, thanks for taking the time to clarify. One other factor for me, beyond those you quote, is the "absolute" difficulty of ensuring practical PS-alignment, e.g. (from my discussion of premise 3):

My sense is that relative to you, I am (a) less convinced that ensuring practical PS-alignment will be "hard" in this absolute sense, once you can build APS systems at all (my sense is that our conceptions of what it takes to "solve the alignment problem" might be different), (b) less convinced that practically PS-misaligned systems will be attractive to deploy despite their PS-misalignment (whether because of deception, or for other reasons), (c) less convinced that APS systems becoming possible/incentivized by 2035 implies "fast take-off" (it sounds like you're partly thinking: those are worlds where something like the scaling hypothesis holds, and so you can just keep scaling up; but I don't think the scaling hypothesis holding to an extent that makes some APS systems possible/financially feasible implies that you can just scale up quickly to systems that can perform at strongly superhuman levels on e.g. ~any task, whatever the time horizons, data requirements, etc.), and (d) more optimistic about something-like-present-day-humanity's ability to avoid/prevent failures at a scale that disempowers ~all of humanity (though I do think Covid, and its politicization, is an instructive example in this respect), especially given warning shots (and my guess is that we do get warning shots, whether before or after 2035, even if APS systems become possible/financially feasible before then).
Re: nuclear winter, as I understand it, you’re reading me as saying: “in general, if a possible and incentivized technology is dangerous, there will be warning shots of the dangers; humans (perhaps reacting to those warning shots) won’t deploy at a level that risks the permanent extinction/disempowerment of ~all humans; and if they start to move towards such disempowerment/extinction, they’ll
[AN #139]: How the simplicity of reality explains the success of neural nets
I agree with Zach above about the main point of the paper. One other thing I’d note is that SGD can’t have literally the same outcomes as random sampling, since random sampling wouldn’t display phenomena like double descent (AN #77).

Would you mind explaining why this is? It seems to me like random sampling would display double descent. For example, as you increase model size, at first you get more and more parameters that let you approximate the data better... but then you get too many parameters and just start memorizing the data... ... (read more)

rohinmshah (9d, +4): Hmm, I think you're right. I'm not sure what I was thinking when I wrote that. (Though I give it like 50% that if past-me could explain his reasons, I'd agree with him.) Possibly I was thinking of epochal double descent, but that shouldn't matter because we're comparing the final outcome of SGD to random sampling, so epochal double descent doesn't come into the picture.
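As a concrete referent for "random sampling" here: draw parameters from the initialization distribution until one achieves zero training error, then use that one. A minimal sketch on an invented toy dataset (the dataset, sizes, and seed are all illustrative assumptions, not anything from the papers under discussion):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny linearly separable dataset; labels are the sign of the first feature.
X = np.array([[1.0, 0.1], [1.0, -0.1], [-1.0, 0.05], [2.0, 0.0], [-2.0, -0.1]])
y = np.sign(X[:, 0])

def random_sampling_fit(X, y, tries=10_000):
    """Draw weights from the 'prior' until one reaches zero training error."""
    for _ in range(tries):
        w = rng.normal(size=X.shape[1])   # sample from the init distribution
        if np.all(np.sign(X @ w) == y):   # fits all training points?
            return w
    return None

w = random_sampling_fit(X, y)
print(w is not None)  # True: ~half of random draws separate this easy dataset
```

The comparison in the thread is between the distribution over functions this procedure induces and the distribution over functions SGD actually finds.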
NTK/GP Models of Neural Nets Can't Learn Features
I'll confess that I would personally find it kind of disappointing if neural nets were mostly just an efficient way to implement some fixed kernels, when it seems possible that they could be doing something much more interesting -- perhaps even implementing something like a simplicity prior over a large class of functions, which I'm pretty sure NTK/GP can't be

Wait, why can't NTK/GP be implementing a simplicity prior over a large class of functions? They totally are, it's just that the prior comes from the measure in random initia... (read more)

interstice (7d, +6): There's an important distinction to be made between these two claims:

A) Every function with large volume in parameter-space is simple.

B) Every simple function has a large volume in parameter-space.

For a method of inference to qualify as a 'simplicity prior', you want both claims to hold. This is what lets us derive bounds like 'Solomonoff induction matches the performance of any computable predictor', since all of the simple, computable predictors have relatively large volume in the Solomonoff measure, so they'll be picked out after boundedly many mistakes. In particular, you want there to be an implication like: if a function has complexity C, it will have parameter-volume at least exp(−βC).

Now, the Mingard results, at least the ones that have mathematical proof, rely on the Levin bound. This only shows (A), which is the direction that is much easier to prove -- it automatically holds for any mapping from parameter-space to functions with bounded complexity. They also have some empirical results that show there is substantial 'clustering', that is, there are some simple functions that have large volumes. But this still doesn't show that all of them do, and indeed is compatible with the learnable function class being extremely limited. For instance, this could easily be the case even if NTK/GP was only able to learn linear functions. In reality the NTK/GP is capable of approximating arbitrary functions on finite-dimensional inputs but, as I argued in another comment, this is not the right notion of 'universality' for classification problems. I strongly suspect that the NTK/GP can be shown to not be 'universally data-efficient' as I outlined there, but as far as I'm aware no one's looked into the issue formally yet. Empirically, I think the results we have so far…
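A hedged formalization of the two claims (notation mine: V(f) for the parameter-space volume / prior measure of function f, K(f) for its description complexity, β the constant from the comment):

```latex
\text{(A)}\quad K(f) \;\le\; -\log_2 V(f) + O(1)
  \qquad \text{(large volume} \Rightarrow \text{simple; what the Levin bound gives)}
\\[6pt]
\text{(B)}\quad V(f) \;\ge\; \exp\!\big(-\beta\, K(f)\big) \ \text{ for all } f
  \qquad \text{(simple} \Rightarrow \text{large volume; what a true simplicity prior needs)}
```

The comment's point is that the proved results give only (A); a Solomonoff-style regret bound requires (B) as well.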
Parsing Abram on Gradations of Inner Alignment Obstacles

Well, it seems to be saying that the training process basically just throws away all the tickets that score less than perfectly, and randomly selects one of the rest. This means that tickets which are deceptive agents and whatnot are in there from the beginning, and if they score well, then they have as much chance of being selected at the end as anything else that scores well. And since we should expect deceptive agents that score well to outnumber aligned agents that score well... we should expect deception.

I'm working on a much more fleshed out and expanded version of this argument right now.

alexflint (8d, +4): Yeah right, that is scarier. Looking forward to reading your argument, esp re why we would expect deceptive agents that score well to outnumber aligned agents that score well. Although in the same sense we could say that a rock “contains” many deceptive agents, since if we viewed the rock as a giant mixture of computations then we would surely find some that implement deceptive agents.
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Pinging you to see what your current thoughts are! I think that if "SGD is basically equivalent to random search" then that has huge, huge implications.

evhub (9d, +4): I guess I would say something like: random search is clearly a pretty good first-order approximation, but there are also clearly second-order effects. I think that exactly how strong/important/relevant those second-order effects are is unclear, however, and I remain pretty uncertain there.
Parsing Abram on Gradations of Inner Alignment Obstacles

I think Abram's concern about the lottery ticket hypothesis wasn't about the "vanilla" LTH that you discuss, but rather the scarier "tangent space hypothesis." See this comment thread.

alexflint (9d, +4): Thank you for the pointer. Why is the tangent space hypothesis version of the LTH scarier?
Why I Work on Ads

I think universal paywalls would be much better. Consider how video games typically work: You pay for the game, then you can play it as much as you like. Video games sometimes try to sell you things (e.g. political ideologies, products), but there is vastly less of that than on e.g. YouTube or Facebook, what with all the ads, propaganda, promoted content, etc. Imagine if instead all video games were free, but to make money the video game companies accepted bribes to fill their games with product placement and propaganda. I would not prefer that world, even tho... (read more)

The AI Timelines Scam

Is it really true that most people sympathetic to short timelines are thus mainly due to social proof cascade? I don't know any such person myself; the short-timelines people I know are either people who have thought about it a ton and developed detailed models, or people who just got super excited about GPT-3 and recent AI progress basically. The people who like to defer to others pretty much all have medium or long timelines, in my opinion, because that's the respectable/normal thing to think.

Open and Welcome Thread - May 2021

Welcome! I recognize your username, we must have crossed paths before. Maybe something to do with SpaceX?

theme_arrow (10d, +5): Yes! We had a nice discussion in the comments of your "Fun with +12 OOMs of Compute" post.
Open and Welcome Thread - May 2021

My guess is: Regulation. It would be illegal to build and rent out nano-apartments. (Evidence: In many places in the USA, it's illegal for more than X people not from the same family to live together, for X = 4 or something ridiculously small like that.)

To add a bit more detail to your comment, this form of housing used to exist in the form of single room occupancy (SRO) buildings, where people would rent a single room and share bathroom and kitchen spaces. Reformers and planners started efforts to ban this form of housing starting around the early 20th century. From Wikipedia:

By the 1880s, urban reformers began working on modernizing cities; their efforts to create "uniformity within areas, less mixture of social classes, maximum privacy for each family, much lower density for many activities, buildings

... (read more)
niplav (10d, +1): That's disheartening :-( But good to know nonetheless, thanks. Perhaps not a *completely* senseless regulation, considering disease spreading (though there are better ways of attacking *that* with other means).
Open and Welcome Thread - May 2021

Welcome! It's people like you (and perhaps literally you) on which the future of the world depends. :)

Wait... you started using the internet in 2006? Like, when you were 5???

Thanks!  2006 is what I remember, and what my older brother says too.  I was 5 though, so the most I got out of it was learning how to torrent movies and Pokemon ROMs until like 2008, when I joined Facebook (at the time to play an old game called FarmVille).

Naturalism and AI alignment

I'd be interested to see naturalism spelled out more and defended against the alternative view that (I think) prevails in this community. That alternative view is something like: "Look, different agents have different goals/values. I have mine and will pursue mine, and you have yours and pursue yours. Also, there are rules and norms that we come up with to help each other get along, analogous to laws and rules of etiquette. Also, there are game-theoretic principles like fairness, retribution, and bullying-resistance that are basically just good ... (read more)

Michele Campolo (9d, +1): I am not sure the concept of naturalism I have in mind corresponds to a specific naturalistic position held by a certain (group of) philosopher(s). I link here the Wikipedia page on ethical naturalism, which contains the main ideas and is not too long. Below I focus on what is relevant for AI alignment.

In the other comment you asked about truth. AIs often have something like a world-model or knowledge base that they rely on to carry out narrow tasks, in the sense that if someone modifies the model or kb in a certain way—analogous to creating a false belief—then the agent fails at the narrow task. So we have a concept of true-given-task. By considering different tasks, e.g. in the case of a general agent that is prepared to face various tasks, we obtain true-in-general or, if you prefer, simply "truth". See also the section on knowledge in the post.

Practical example: given that light is present almost everywhere in our world, I expect general agents to acquire knowledge about electromagnetism. I also expect that some AIs, given enough time, will eventually incorporate in their world-model beliefs like: "Certain brain configurations correspond to pleasurable conscious experiences. These configurations are different from the configurations observed in (for example) people who are asleep, and very different from what is observed in rocks."

Now, take an AI with such knowledge and give it some amount of control over which goals to pursue: see also the beginning of Part II in the post. Maybe, in order to make this modification, it is necessary to abandon the single-agent framework and consider instead a multi-agent system, where one agent keeps expanding the knowledge base, another agent looks for "value" in the kb, and another one decides what actions to take given the current concept of value and other contents of the kb.

[Two notes on how I am using the word control. 1 I am not assuming any extra-physical notion h…
Naturalism and AI alignment
From another point of view: some philosophers are convinced that caring about conscious experiences is the rational thing to do. If it's possible to write an algorithm that works in a similar way to how their mind works, we already have an (imperfect, biased, etc.) agent that is somewhat aligned, and is likely to stay aligned after further reflection.

I think this is an interesting point -- but I don't conclude optimism from it as you do. Humans engage in explicit reasoning about what they should do, and they theorize and systematize, and some o... (read more)

Your Dog is Even Smarter Than You Think

Thanks for this! This definitely does intersect with my interests; it's relevant to artificial intelligence and to ethics. It does mostly just confirm what I already thought though, so my reaction is mostly just to pay attention to this sort of thing going forward.

AMA: Paul Christiano, alignment researcher

I'm very glad to hear that! Can you say more about why?

Natural language has both noise (that you can never model) and signal (that you could model if you were just smart enough). GPT-3 is in the regime where it's mostly signal (as evidenced by the fact that the loss keeps going down smoothly rather than approaching an asymptote). But it will soon get to the regime where there is a lot of noise, and by the time the model is 9 OOMs bigger I would guess (based on theory) that it will be overwhelmingly noise and training will be very expensive.

So it may or may not work in the sense of meeting some absolute performance threshold, but it will certainly be a very bad way to get there and we'll do something smarter instead.
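The signal-vs-noise intuition above can be sketched with a standard power-law-plus-irreducible-floor model of cross-entropy loss; the constants here are invented for illustration, not fit to any real scaling data:

```python
# Loss modeled as a reducible power-law "signal" term plus an irreducible
# "noise" floor. All constants below are made up for illustration.
L_INF, A, ALPHA = 1.7, 10.0, 0.08  # noise floor, scale, power-law exponent

def loss(n_params: float) -> float:
    return L_INF + A * n_params ** -ALPHA

for n in (1e9, 1e12, 1e18):
    reducible = loss(n) - L_INF
    print(f"N={n:.0e}: loss={loss(n):.2f}, reducible part={reducible:.2f}")
```

As N grows, the reducible part shrinks toward zero while the floor L_INF stays fixed, so further scaling buys less and less: the loss becomes "overwhelmingly noise" in the sense of the comment.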

Daniel Kokotajlo's Shortform

Probably, when we reach an AI-induced point of no return, AI systems will still be "brittle" and "narrow" in the sense used in arguments against short timelines.

Argument: Consider AI Impacts' excellent point that "human-level" is superhuman (bottom of this page)

The point of no return, if caused by AI, could come in a variety of ways that don't involve human-level AI in this sense. See this post for more. The general idea is that being superhuman at some skills can compensate for being subhuman at others. We should expect the point of no return to be reache... (read more)

Draft report on existential risk from power-seeking AI

Thanks for this! I like your concept of APS systems; I think I might use that going forward. I think this document works as a good "conservative" (i.e. optimistic) case for worrying about AI risk. As you might expect, I think the real chances of disaster are higher. For more on why I think this, well, there are the sequences of posts I wrote and of course I'd love to chat with you anytime and run some additional arguments by you.

For now I'll just say: 5% total APS risk (seems to me to) fail a sanity check, as follows:

1. There's at... (read more)

Joe Carlsmith (13d, +6): Hi Daniel, thanks for reading. I think estimating p(doom) by different dates (and in different take-off scenarios) can be a helpful consistency check, but I disagree with your particular “sanity check” here -- and in particular, premise (2). That is, I don’t think that conditional on APS-systems becoming possible/financially feasible by 2035, it’s clear that we should have at least 50% on doom (perhaps some of the disagreement here is about what it takes for the problem to be "real," and to get "solved"?). Nor do I see 10% on “Conditional on it being both possible and strongly incentivized to build APS systems, APS systems will end up disempowering approximately all of humanity” as obviously overconfident (though I do take some objections in this vein seriously). I’m not sure exactly what “10% on nuclear war” analog argument you have in mind: would you be able to sketch it out, even if hazily?
Predictive Coding has been Unified with Backpropagation

Thanks for this reply!

--I thought the paper about the methods of neuroscience applied to computers was cute, and valuable, but I don't think it's fair to conclude "methods are not up to the task." But you later said that "It makes a lot of sense to me that the brain does something resembling belief propagation on bayes nets. (I take this to be the core idea of predictive coding.)" so you aren't a radical skeptic about what we can know about the brain so maybe we don't disagree after all.

1 - 3: OK, I think I'll ... (read more)

abramdemski (13d, +2): How could this address point #5? If GD is slow, then GD would be slow to learn faster learning methods.

All of the following is intended as concrete examples against the pure-bayes-brain hypothesis, not as evidence against the brain doing some form of GD:

1. One thing the brain could be doing under the hood is some form of RL using value-prop. This is difficult to represent in Bayes nets. The attempts I've seen end up making reward multiplicative rather than additive across time, which makes sense because Bayes nets are great at multiplying things but not so great at representing additive structures. I think this is OK (we could regard it as an exponential transform of the usual reward) until we want to represent temporal discounting. Another problem with this is: representing via graphical models means representing the full distribution over reward values, rather than a point estimate. But this is inefficient compared with regular tabular RL.

2. Another thing the brain could be doing under the hood is "memory-network" style reasoning, which learns a policy for utilizing various forms of memory (visual working memory, auditory working memory, episodic memory, semantic memory...) for reasoning. Because this is fundamentally about logical uncertainty (being unsure about the outcome at the end of some mental work), it's not very well-represented by Bayesian models. It probably makes more sense to use (model-free) RL to learn how to use WM.

Of course both of those objections could be overcome with a specific sort of work, showing how to represent the desired algorithm in Bayes nets.

As for GD:

* My back-of-the-envelope calculation suggests that GPT-3 has trained on 7 orders of magnitude more data than a 10yo has experienced in their lifetime. Of course a different NN architecture (+ different task, different loss functions, etc.) could just be that much more efficient than transformers; but overall…
Gradations of Inner Alignment Obstacles
Part of my idea for this post was to go over different versions of the lottery ticket hypothesis, as well, and examine which ones imply something like this. However, this post is long enough as it is.

I'd love to see you do this!

Re: The Treacherous Turn argument: What do you think of the following spitball objections:

(a) Maybe the deceptive ticket that makes T' work is indeed there from the beginning, but maybe it's outnumbered by 'benign' tickets, so that the overall behavior of the network is benign. This is an argument against ... (read more)

abramdemski (13d, +4): My overall claim is that attractor-basin type arguments need to address the base case. This seems like a potentially fine way to address the base case, if the math works out for whatever specific attractor-basin argument. If we're trying to avoid deception via methods which can steer away from deception if we assume there's not yet any deception, then we're in trouble; the technique's assumptions are violated. Right, this seems in line with the original lottery ticket hypothesis, and would alleviate the concern. It doesn't seem as consistent with the tangent space hypothesis, though.
AMA: Paul Christiano, alignment researcher

In this post I argued that an AI-induced point of no return would probably happen before world GDP starts to noticeably accelerate. You gave me some good pushback about the historical precedent I cited, but what is your overall view? If you can spare the time, what is your credence in each of the following PONR-before-GDP-acceleration scenarios, and why?

1. Fast takeoff

2. The sorts of skills needed to succeed in politics or war are easier to develop in AI than the sorts needed to accelerate the entire world economy, and/or have less deployment lag. (Maybe ... (read more)

I don't know if we ever cleared up ambiguity about the concept of PONR. It seems like it depends critically on who is returning, i.e. what is the counterfactual we are considering when asking if we "could" return. If we don't do any magical intervention, then it seems like the PONR could be well before AI since the conclusion was always inevitable. If we do a maximally magical intervention, of creating unprecedented political will, then I think it's most likely that we'd see 100%+ annual growth (even of say energy capture) before PONR. I don't think there ... (read more)

AMA: Paul Christiano, alignment researcher

1. What credence would you assign to "+12 OOMs of compute would be enough for us to achieve AGI / TAI / AI-induced Point of No Return within five years or so"? (This is basically the same as, though not identical to, this poll question.)

2. Can you say a bit about where your number comes from? E.g. maybe 25% chance of scaling laws not continuing such that OmegaStar, Amp(GPT-7), etc. don't work, 25% chance that they happen but don't count as AGI / TAI / AI-PONR, for a total of about 60%? The more you say the better, this is my biggest crux! ... (read more)
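For reference, the arithmetic sketched in point 2, under the (hypothetical) assumption that the two failure modes are independent:

```python
p_scaling_fails = 0.25  # scaling laws stop holding, so the proposals don't work
p_not_tai = 0.25        # they work but don't count as AGI / TAI / AI-PONR
p_enough = (1 - p_scaling_fails) * (1 - p_not_tai)
print(round(p_enough, 4))  # 0.5625, i.e. ~56%, loosely the "about 60%" in the question
```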

paulfchristiano (13d, +8): (I don't think Amp(GPT-7) will work though.)

I'd say 70% for TAI in 5 years if you gave +12 OOM.

I think the single biggest uncertainty is about whether we will be able to adapt sufficiently quickly to the new larger compute budgets (i.e. how much do we need to change algorithms to scale reasonably? it's a very unusual situation, and it's hard to scale up fast, and it depends on exactly how far that goes). Maybe I think that there's a 90% chance that TAI is in some sense possible (maybe: if you'd gotten to that much compute while remaining as well-adapted as we are now to our current levels of compute) an... (read more)

Coherence arguments imply a force for goal-directed behavior

I love your health points analogy. Extending it, imagine that someone came up with "coherence arguments" that showed that for a rational doctor doing triage on patients, and/or for a group deciding who should do a risky thing that might result in damage, the optimal strategy involves a construct called "health points" such that:

--Each person at any given time has some number of health points

--Whenever someone reaches 0 health points, they (very probably) die

--Similar afflictions/disasters tend to cause similar amounts of decrease in hea... (read more)

Wouldn't these coherence arguments be pretty awesome? Wouldn't this be a massive step forward in our understanding (both theoretical and practical) of health, damage, triage, and risk allocation?

Insofar as such a system could practically help doctors prioritise, then that would be great. (This seems analogous to how utilities are used in economics.)

But if doctors use this concept to figure out how to treat patients, or use it when designing prostheses for their patients, then I expect things to go badly. If you take HP as a guiding principle - for exampl... (read more)

GPT-3: a disappointing paper

Ah! You are right, I misread the graph. *embarrassed* Thanks for the correction!

Does the lottery ticket hypothesis suggest the scaling hypothesis?

OH, this indeed changes everything (about what I had been thinking), thank you! I shall have to puzzle over these ideas some more then, and probably read the multi-prize paper more closely (I only skimmed it earlier).

DanielFilan (17d): Ah, to be clear, I am entirely basing my comments off of reading the abstracts (and skimming the multi-prize paper with an eye one develops after having been an ML PhD student for *mumbles indistinctly* years).
Daniel Kokotajlo's Shortform

OH ok thanks! Glad to hear that. I'll edit.

The Fall of Rome, III: Progress Did Not Exist

There's another explanation for why the history books display that progression you mapped out: They are Dutch history books, so naturally they want to focus on the bits of history that are especially relevant to the Dutch. One should expect that the "center of action" of these books drifts towards the Netherlands over time, just as it drifts towards the USA over time in the USA, and (I would predict) towards Indonesia over time in Indonesia, towards Japan over time in Japan, etc.

LukeOnline (18d): Sure! I don't think the fact that Dutch history books end in the Netherlands is good evidence that the Netherlands is the most significant place in world history :) But Ancient Egypt, Classical Greece and Classical Rome do seem to be of global significance. Greek ideas and inventions, from Aristotle to the Antikythera mechanism, do seem to be lasting and unique. And a bit harsher: the Greeks conquered Egypt. The Romans conquered Greece and Egypt. The balance of power actually seems to have shifted in that direction.
Daniel Kokotajlo's Shortform

The International Energy Agency releases regular reports in which it forecasts the growth of various energy technologies for the next few decades. It's been astoundingly terrible at forecasting solar energy for some reason. Marvel at this chart:

This is from an article criticizing the IEA's terrible track record of predictions. The article goes on to say that there should be about 500GW of installed capacity by 2020. This article was published in 2020; a year later, the 2020 data is in, and it's actually 714 GW. Even the article criticizing the IEA for thei... (read more)
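Putting a number on how far off even the critics were, using only the figures from the paragraph above:

```python
# Figures from the text: the critical article projected ~500 GW of
# installed solar capacity by 2020; the actual 2020 figure was 714 GW.
projected_gw = 500
actual_gw = 714

overshoot = actual_gw / projected_gw - 1
print(f"Actual capacity exceeded the critics' projection by {overshoot:.0%}")
```

So even the forecast made specifically to correct the IEA's pessimism undershot reality by more than 40%.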

Zac Hatfield Dodds (18d): The IEA is a running joke in climate policy circles; they're transparently in favour of fossil fuels and their "forecasts" are motivated by political (or perhaps commercial, hard to untangle with oil) interests rather than any attempt at predictive accuracy.
Does the lottery ticket hypothesis suggest the scaling hypothesis?

Whoa, the thing you are arguing against is not at all what I had been saying -- but maybe it was implied by what I was saying and I just didn't realize it? I totally agree that there are many optima, not just one. Maybe we are talking past each other?

(Part of why I think the two tickets are the same is that the at-initialization ticket is found by taking the after-training ticket and rewinding it to the beginning! So for them not to be the same, the training process would need to kill the first ticket and then build a new ticket on exactly the same spot!)
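For reference, the "rewinding" being described is, roughly, the iterative magnitude pruning procedure from the original lottery-ticket paper: train, prune the smallest-magnitude weights, then reset the survivors to their initial values. A schematic sketch (the `train` argument is a stand-in for a full SGD run, not a faithful reimplementation):

```python
import numpy as np

def find_winning_ticket(init_weights, train, prune_fraction=0.2, rounds=3):
    """Iterative magnitude pruning with rewind-to-initialization (schematic).

    `train` stands in for a full SGD training run: it takes weights and a
    binary mask and returns trained weights (masked-out entries stay 0).
    """
    mask = np.ones_like(init_weights)
    weights = init_weights.copy()
    for _ in range(rounds):
        trained = train(weights, mask)
        # Prune the smallest-magnitude surviving weights...
        alive = np.abs(trained[mask == 1])
        threshold = np.quantile(alive, prune_fraction)
        mask = mask * (np.abs(trained) >= threshold)
        # ...then rewind the survivors to their ORIGINAL initial values.
        weights = init_weights * mask
    return weights, mask
```

The relevant point for the discussion above is the last step: the ticket evaluated "at initialization" is literally the trained ticket's sparsity pattern with the weights reset to their starting values.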

DanielFilan (19d): I guess I'm imagining that 'by default', your distribution over which optimum SGD reaches should be basically uniform, and you need a convincing story to end up believing that it reliably gets to one specific optimum. Yes, that's exactly what I think happens. Training takes a long time, and I expect the weights in a 'ticket' to change based on the weights of the rest of the network (since those other weights have similar magnitude). I think the best way to see why I think that is to manually run thru the backpropagation algorithm. If I'm wrong, it's probably because of this paper [] that I don't have time to read over right now (but that I do recommend you read).
Does the lottery ticket hypothesis suggest the scaling hypothesis?

Hmmm, ok. Can you say more about why? Isn't the simplest explanation that the two tickets are the same?

DanielFilan (20d): I expect that there are probably a bunch of different neural networks that perform well at a given task. We sort of know this because you can train a dense neural network to high accuracy, and also prune it to get a definitely-different neural network that also has high accuracy. Is it the case that these sparse architectures are small enough that there's only one optimum? Maybe, but IDK why I'd expect that.
Three reasons to expect long AI timelines

I definitely agree that our timelines forecasts should take into account the three phenomena you mention, and I also agree that e.g. Ajeya's doesn't talk about this much. I disagree that the effect size of these phenomena is enough to get us to 50 years rather than, say, +5 years to whatever our opinion sans these phenomena was. I also disagree that overall Ajeya's model is an underestimate of timelines, because while indeed the phenomena you mention should cause us to shade timelines upward, there is a long list of other phenomena I could m... (read more)

Three reasons to expect long AI timelines

Thanks for this post! I'll write a fuller response later, but for now I'll say: These arguments prove too much; you could apply them to pretty much any technology (e.g. self-driving cars, 3D printing, reusable rockets, smart phones, VR headsets...). There doesn't seem to be any justification for the 50-year number; it's not like you'd give the same number for those other techs, and you could have made exactly this argument about AI 40 years ago, which would lead to 10-year timelines now. You are just pointing out three reasons in f... (read more)

These arguments prove too much; you could apply them to pretty much any technology (e.g. self-driving cars, 3D printing, reusable rockets, smart phones, VR headsets...).

I suppose my argument has an implicit, "current forecasts are not taking these arguments into account." If people actually were taking my arguments into account, and still concluding that we should have short timelines, then this would make sense. But, I made these arguments because I haven't seen people talk about these considerations much. For example, I deliberately avoided the argument ... (read more)

Does the lottery ticket hypothesis suggest the scaling hypothesis?

Yeah, fair enough. I should amend the title of the question. Re: reinforcing the winning tickets: Isn't that implied? If it's not implied, would you not agree that it is happening? Plausibly, if there is a ticket at the beginning that does well at the task, and a ticket at the end that does well at the task, it's reasonable to think that it's the same ticket? Idk, I'm open to alternative suggestions now that you mention it...

DanielFilan (20d): I don't think it's implied, and I'm not confident that it's happening. There are lots of neural networks!
Does the lottery ticket hypothesis suggest the scaling hypothesis?

The original paper doesn't demonstrate this but later papers do, or at least claim to. Here are several papers with quotes:
"In this paper, we propose (and prove) a stronger Multi-Prize Lottery Ticket Hypothesis:
A sufficiently over-parameterized neural network with random weights contains several subnetworks (winning tickets) that (a) have comparable accuracy to a dense target network with learned weights (prize 1), (b) do not require any further training to achieve prize 1 (prize 2), and (c) is robust to extreme ... (read more)

DanielFilan (21d): None of those quotes claim that training just reinforces the 'winning tickets'. Also those are referred to as the "strong" or "multi-ticket" LTH.
GPT-3: a disappointing paper

Update: It seems that GPT-3 can actually do quite well (maybe SOTA? Human-level-ish it seems) at SuperGLUE with the right prompt (which I suppose you can say is a kind of fine-tuning, but it's importantly different from what everyone meant by fine-tuning at the time this article was written!) What do you think of this?

This is also a reply to your passage in the OP:

The transformer was such an advance that it made the community create a new benchmark, “SuperGLUE,” because the previous gold standard benchmark (GLUE) was now too easy.
... (read more)
nostalgebraist (17d): I'm confused -- the paper you link is not about better prompts for GPT-3. It's about a novel fine-tuning methodology for T5. GPT-3 only appears in the paper as a reference/baseline to which the new method is compared.

The use of a BERT / T5-style model (denoising loss + unmasked attn) is noteworthy because these models reliably outperform GPT-style models (LM loss + causally masked attn) in supervised settings. Because of this, I sometimes refer to GPT-3 as "quantifying the cost (in additional scale) imposed by choosing a GPT-style model." That is, the following should be roughly competitive w/ each other:

* BERT/T5 at param count N
* GPT at param count ~100 * N

See my comments near the bottom here [].

Separately, I am aware that people have gotten much better performance out of GPT-3 by putting some effort into prompt design, vs. the original paper which put basically no effort into prompt design. Your comment claims that the "SOTA" within that line of work is close to the overall SOTA on SuperGLUE -- which I would readily believe, since GPT-3 was already pretty competitive in the paper and dramatic effects have been reported for prompt design on specific tasks. However, I'd need to see a reference that actually establishes this.
Updating the Lottery Ticket Hypothesis

Thanks! I'm afraid I don't understand the math yet but I'll keep trying. In the meantime:

I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization. Modern neural networks are big, but not that big.

Can you say more about why? It's not obvious to me that they are not big enough. Would you agree they probably contain edge detectors, circle detectors, etc. at initialization? Also, it seems that some subnetworks/tickets are already decent at the task at initialization, see e.g. this paper. Is that not "dog-recognizing subcircuits at initialization?" Or something similar?

johnswentworth (24d): The problem is what we mean by e.g. "dog recognizing subcircuit". The simplest version would be something like "at initialization, there's already one neuron which lights up in response to dogs" or something like that. (And that's basically the version which would be needed in order for a gradient descent process to actually pick out that lottery ticket.)

That's the version which I'd call implausible: function space is superexponentially large, circuit space is smaller but still superexponential, so no neural network is ever going to be large enough to have neurons which light up to match most functions/circuits. I would argue that dog-detectors are a lot more special than random circuits even a priori, but not so much more special that the space-of-functions-that-special is less than exponentially large. (For very small circuits like edge detectors, it's more plausible that some neurons implement that function right from the start.)

The thing in the paper you linked is doing something different from that. At initialization, the neurons in the subcircuits they're finding would not light up in recognition of a dog, because they're still connected to a bunch of other stuff that's not in the subcircuit - the subcircuit only detects dogs once the other stuff is disconnected. And, IIUC, SGD should not reliably "find" those tickets: because no neurons in the subcircuit are significantly correlated with dogs, SGD doesn't have any reason to upweight them for dog-recognition. So what's going on in that paper is different from what's going on in normal SGD-trained nets (or at least not the full story).
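A quick sanity check on the "superexponentially large" claim, for the discrete case of boolean functions. The neuron count below is a hypothetical comparison point, not a figure from the discussion:

```python
# Number of distinct boolean functions on n inputs is 2**(2**n):
# each of the 2**n possible input rows can map to 0 or 1 independently.
n = 10
num_functions = 2 ** (2 ** n)  # 2^1024, a 309-digit number

# Compare with a generous count of candidate neurons in a large
# (hypothetical) network: even a billion neurons is nowhere close.
num_neurons = 10 ** 9
print(len(str(num_functions)))  # decimal digits in the function count
```

Even at tiny input sizes, the function count dwarfs any plausible neuron count, which is the core of the implausibility argument above.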
Fun with +12 OOMs of Compute

It is irrelevant to this post, because this post is about what our probability distribution over orders of magnitude of compute should be like. Once we have said distribution, then we can ask: How quickly (in clock time) will we progress through the distribution / explore more OOMs of compute? Then the AI and compute trend, and the update to it, become relevant.

But not super relevant IMO. The AI and Compute trend was way too fast to be sustained; people at the time even said so. This recent halt in the trend is not surprising. What matters is what the tren... (read more)

abramdemski (13d): If the AI and compute trend is just a blip, then doesn't that return us to the previous trend line in the graph you show at the beginning, where we progress about 2 OOMs a decade? (More accurately, 1 OOM every 6-7 years, or 8 OOMs in 5 decades.) Ignoring AI and compute, then: if we believe +12 OOMs in 2016 means great danger in 2020, we should believe that roughly 75 years after 2016, we are at most four years from the danger zone. Whereas, if we extrapolate the AI-and-compute trend, +12 OOMs is like jumping 12 years in the future; so the idea of risk by 2030 makes sense. So I don't get how your conclusion can be so independent of AI-and-compute.
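The arithmetic in this exchange can be made explicit, using the rates as stated in the comment above (the extrapolation is, of course, only as good as the assumed trends):

```python
# How long until +12 orders of magnitude (OOMs) of compute arrive,
# under the two trend assumptions discussed above.
ooms_needed = 12

# Pre-2012 trend: roughly 1 OOM every 6-7 years (~2 OOMs per decade).
slow_years = ooms_needed * 6.5   # ~78 years

# "AI and Compute" trend, if it had continued: roughly 1 OOM per year,
# so +12 OOMs arrives about 12 years out.
fast_years = ooms_needed * 1.0

print(slow_years, fast_years)
```

The two assumptions differ by a factor of ~6.5 in arrival time, which is why the choice of trend dominates the conclusion here.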
Fun with +12 OOMs of Compute

What Gwern said. :) But I don't know for sure what the person I talked to had in mind.
