All of SoerenMind's Comments + Replies

Inference cost limits the impact of ever larger models

My point is that, while PCIe bandwidths aren't increasing very quickly, it's easy to increase the number of machines you use. So you can distribute each NN layer (width-wise) across many machines, each of which adds to the total bandwidth you have.

(As noted in the previous comment, you can do this with <<300GB of total GPU memory for GPT-3 with something like ZeRO-infinity)
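
(A rough back-of-the-envelope sketch of this point, my own illustration rather than anything from the thread; the ~300GB and ~31.5GB/s-per-PCIe-gen4-x16 figures come from the replies below.)

```python
# Sketch only: if each layer is sliced width-wise across N machines, each
# host->GPU link only streams its own slice, so effective bandwidth scales
# roughly with N (ignoring gpu->gpu communication and imperfect sharding).

def stream_time_s(model_gb: float, gpus: int, gb_per_s_per_gpu: float) -> float:
    """Seconds to stream the full set of weights once under ideal sharding."""
    return model_gb / (gpus * gb_per_s_per_gpu)

# ~300GB of weights (GPT-3-scale), ~31.5GB/s per PCIe gen4 x16 link:
for n in (1, 4, 8, 16):
    print(n, "GPUs ->", round(stream_time_s(300, n, 31.5), 2), "s per pass")
# 1 GPU -> ~9.5s, 4 -> ~2.4s, 8 -> ~1.2s, 16 -> ~0.6s
```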

Inference cost limits the impact of ever larger models

Beware bandwidth bottlenecks, as I mentioned in my original post.

Presumably bandwidth requirements can be reduced a lot through width-wise parallelism. Each GPU then only has to load one slice of the model. Of course you'll need more GPUs, but still not a crazy number as long as you use something like ZeRO-infinity.

(Yes, 8x gpu->gpu communications will hurt overall latency... but not by all that much I don't think. 1 second is an eternity.)

Width-wise communication, if you mean that, can be quite a latency bottleneck for training. And it gets ... (read more)

TLW (2mo): Total PCIe bandwidth for even a Threadripper Pro platform (128 lanes of gen4 pcie) is ~250GB/s. Most other platforms have less (especially Intel, which likes to market-segment by restricting the number of pcie lanes). Gen5 and gen6 PCIe in theory will double this and double this again - but on a multiyear cadence at best. Meanwhile GPT-3 is ~300GB compressed, and model size seems to keep increasing. Hence: beware bandwidth bottlenecks.
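
(Spelling out the arithmetic behind that warning, using the numbers in the reply above:)

\[
t_{\text{stream}} \approx \frac{300\,\text{GB}}{250\,\text{GB/s}} \approx 1.2\,\text{s}
\]

i.e. even at the platform's full aggregate PCIe bandwidth, just moving a GPT-3-sized model to the GPUs once already costs more than a second per inference.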
Inference cost limits the impact of ever larger models

Thanks for elaborating, I think I know what you mean now. I missed this:

I am talking about pipelining loading the NN weights into the GPU. Which is not dependent on the result of the previous layer's computation.

My original claim was that ZeRO-Infinity has higher latency compared to pipelining across many layers of GPUs so that you don't have to repeatedly load weights from RAM. But as you pointed out, ZeRO-Infinity may avoid the additional latency by loading the next layer's weights from RAM at the same time as computing the previous layer's output. This... (read more)

TLW (3mo): I am glad we were able to work out the matter!

> If this works, we may be able to deploy massive future neural nets on clusters no bigger than the ones we have today.

Beware bandwidth bottlenecks, as I mentioned in my original post. If you have a 1TB model, you need to have it somewhere with >=1TB/s effective bandwidth between storage and the compute endpoint to achieve 1 second of latency when doing an inference. And storage capacity (not to mention model size) keeps rising faster than bandwidth does...

(There are tricks here to an extent - such as compressing the model and decompressing it on-target - but they seldom save much. (And if they do, that just means your model is inefficient...))

According to a random guy on the internet [https://news.ycombinator.com/item?id=23991765], GPT-3 is ~300GB compressed. PCIe gen4x16 is ~31.5GB/s. If you have 1s of latency, that means that you can only stream in ~31.5GB per card. (In addition to what's already stored in RAM.)

That being said, as far as I can tell it is - in theory - possible to run a GPT-3 inference on a single Threadripper Pro platform (or something else with 128 lanes of gen4 pcie), with 8x 6GB graphics cards in 1 second, if you have 300GB of DRAM lying around. (Or 4x 12GB graphics cards in 2 seconds, with the other half of the pcie lanes filled with gen4 SSDs.)

(In practice I strongly suspect you'll hit some unknown limit in the PCIe root complex or thereabouts. This is shuffling something silly like 250GB/s of data through that one poor root complex.)

(It's a pity that there's no good way to ask a GPU to pull data directly from an SSD. ICMB could help, but it requires GPU-side software support. Most of this data stream could go directly from SSD to PCIe switch to graphics card without having to be bounced through the root port...)

(Yes, 8x gpu->gpu communications will hurt overall latency... but not by all that much I don't think. 1 second is an eternity.)

> As I think we both agree, pipelining, in
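
(A minimal sketch of the card-count arithmetic in that reply - my own illustration, not TLW's code. The assumption is that within the latency budget each card contributes its resident VRAM plus whatever it can stream over its own PCIe gen4 x16 link.)

```python
# Sketch only: how many cards are needed so that resident VRAM plus streamed
# weights cover the whole model within the latency budget.
import math

def cards_needed(model_gb: float, vram_gb: float,
                 pcie_gb_s: float = 31.5, latency_s: float = 1.0) -> int:
    per_card = vram_gb + pcie_gb_s * latency_s  # resident + streamed per card
    return math.ceil(model_gb / per_card)

print(cards_needed(300, vram_gb=6, latency_s=1.0))   # 8  -> 8x 6GB cards, 1s
print(cards_needed(300, vram_gb=12, latency_s=2.0))  # 4  -> 4x 12GB cards, 2s
```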
gwern (3mo): Incidentally, the latency cost of width vs depth is something I've thought might explain why the brain/body allometric scaling laws are so unfavorable and what all that expensive brain matter does given that our tiny puny little ANNs seem capable of so much: everything with a meaningful biological brain, from ants to elephants, suffers from hard (fatal) latency requirements. You are simply not allowed by Nature or Darwin to take 5 seconds to compute how to move your legs. (With a striking counterexample, in both tininess of brain and largeness of latency, being Portia [https://www.gwern.net/docs/biology/portia/index].) It does not matter how many watts you save by using a deep skinny network, if after 10 layers have fired with another 100 to go to compute the next action to take, you've been eaten. So a biological brain might be forced to be deep into an unfavorable point on width vs depth - which might be extremely expensive - in order to meet its subset of robotics-related deadlines.
Inference cost limits the impact of ever larger models

The key is: pipelining doesn't help with latency of individual requests. But that's not what we care about here. What we care about is the latency from starting request 1 to finishing request N

Thanks for the examples. Your point seems to be about throughput, not latency (which to my knowledge is defined on a per-request basis). The latency per request may not matter for training but it does matter for inference if you want your model to be fast enough to interact with the world in real time or faster.

TLW (3mo): Hm. Could you please reread my post? You're repeatedly stating assertions that I explicitly state and show are not the case.

> Your point seems to be about throughput, not latency

I gave an explicit example where a single inference is lower latency with pipelining here versus without. Hm. I think I understand where you seem to be misunderstanding. Let me try to explain a little more.

> latency (which to my knowledge is defined on a per-request basis)

The key here is that one "request" is composed of multiple requests. From the end user point of view, a single "request" means "a single full end-to-end inference". And the latency they care about is issuing the input data to getting the inference result out. But from the internal point of view, that single full end-to-end inference has multiple requests (essentially, "load weights for layer 0; run calculation on inputs and layer 0 weights to get layer 1 input; load weights for layer 1; run calculation on layer 0 output and layer 1 weights to get layer 2 input; etc, etc"). And you can reduce the latency of that one external request (the inference) by pipelining multiple internal subrequests.

You are absolutely correct in that the latency of each of the subrequests is not reduced - but the latency of the external request absolutely is reduced compared to if you didn't pipeline! (At least assuming the internal subrequests can be pipelined - which they can be in this case as I've repeatedly noted.)
Inference cost limits the impact of ever larger models

Perhaps what you meant is that latency will be high but this isn't a problem as long as you have high throughput. That is basically true for training. But this post is about inference where latency matters a lot more.

(It depends on the application of course, but the ZeRO Infinity approach can make your model so slow that you don't want to interact with it in real time, even at GPT-3 scale)

Inference cost limits the impact of ever larger models

That would be interesting if true. I thought that pipelining doesn't help with latency. Can you expand?

Generically, pipelining increases throughput without lowering latency. Say you want to compute f(x) where f is a NN. Every stage of your pipeline processes e.g. one of the NN layers. Then stage N has to wait for the earlier stages to be completed before it can compute the output of layer N. That's why the latency to compute f(x) is high.

NB, GPT-3 used pipelining for training (in combination with model- and data parallelism) and still the large GPT-3 has h... (read more)

TLW (3mo): To give a concrete example: Say each layer takes 10ms to process. The NN has 100 layers. It takes 40ms to round-trip weight data from the host (say it's on spinning rust or something). You can fit 5 layers worth of weights on a gpu, in addition to activation data / etc.

On a GPU with a "sufficiently large" amount of memory, such that you can fit everything on-GPU, this will have 1.04s latency overall. 40ms to grab all of the weights into the GPU, then 1s to process.

On a GPU, with no pipelining, loading five layers at a time then processing them, this will take 1.8 seconds latency overall. 40ms to load from disk, then 50 ms to process, for each group of 5 layers.

On a GPU, with pipelining, this will take... 1.04s overall latency. t=0ms, start loading layer 1 weights. t=10ms, start loading layer 2 weights. ... t=40ms, start loading layer 5 weights & compute layer 1, t=50ms, start loading layer 6 weights & compute layer 2, etc. (Note that this has a max of 5 'active' sets of weights at once, like in the no-pipelining case.) (A better example would split this into request latency and bandwidth.)

> Every stage of your pipeline processes e.g. one of the NN layers. Then stage N has to wait for the earlier stages to be completed before it can compute the output of layer N. That's why the latency to compute f(x) is high.

To be clear: I am talking about pipelining loading the NN weights into the GPU. Which is not dependent on the result of the previous layer's computation. I can be loading the NN weights for layer N+1 while I'm working on layer N. There's no dependency on the activations of the previous layer.

> pipelining doesn't help with latency [https://cse.hkust.edu.hk/%7Ehamdi/Class/COMP381-07/Slides-07/Pipelining-1.ppt]

Let me give an example (incorrect) exchange that hopefully illustrates the issue. "You can never stream video from a remote server, because your server roundtrip is 100ms and you only have 20ms per frame". "You can pipeline requests" "...b
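
(A minimal simulation of the timing example above - my own sketch, assuming one layer's weights can start loading every 10ms and that loads overlap with compute:)

```python
# Replays the example: 100 layers, 10ms compute per layer, 40ms to fetch a
# layer's weights from the host, ~5 sets of layer weights resident at once.

def no_pipelining(layers=100, compute_ms=10, fetch_ms=40, group=5):
    """Blocking: load a group of 5 layers' weights, compute them, repeat."""
    return (layers // group) * (fetch_ms + group * compute_ms)

def pipelined(layers=100, compute_ms=10, fetch_ms=40):
    """Overlapped: the fetch for layer k is issued at t = k*10ms; layer k can
    be computed once its weights have arrived AND layer k-1 is done."""
    finish = 0.0
    for k in range(layers):
        weights_ready = k * compute_ms + fetch_ms
        finish = max(finish, weights_ready) + compute_ms
    return finish

print(no_pipelining())  # 1800 ms
print(pipelined())      # 1040.0 ms
```

The pipelined case lands at the same ~1.04s as the "everything fits in GPU memory" case, versus 1.8s without overlap, matching the numbers above.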
SoerenMind (3mo): Perhaps what you meant is that latency will be high but this isn't a problem as long as you have high throughput. That is basically true for training. But this post is about inference where latency matters a lot more. (It depends on the application of course, but the ZeRO Infinity approach can make your model so slow that you don't want to interact with it in real time, even at GPT-3 scale)
Inference cost limits the impact of ever larger models

No, they don't. The primary justification for introducing them in the first place was to make a cheaper forward pass (=inference)

The motivation to make inference cheaper doesn't seem to be mentioned in the Switch Transformer paper nor in the original Shazeer paper. They do mention improving training cost, training time (from being much easier to parallelize), and peak accuracy. Whatever the true motivation may be, it doesn't seem that MoEs change the ratio of training to inference cost, except insofar as they're currently finicky to train.

But the glas

... (read more)
Inference cost limits the impact of ever larger models

You may have better info, but I'm not sure I expect 1000x better serial speed than humans (at least not with innovations in the next decade). Latency is already a bottleneck in practice, despite efforts to reduce it. Width-wise parallelism has its limits and depth- or data-wise parallelism doesn't improve latency. For example, GPT-3 already has high latency compared to smaller models and it won't help if you make it 10^3x or 10^6x bigger.

Inference cost limits the impact of ever larger models

As Steven noted, your $1/hour number is cheaper than my numbers and probably more realistic. That makes a significant difference.

I agree that transformative impact is possible once we've built enough GPUs and connected them up into many, many new supercomputers bigger than the ones we have today. In a <=10 year timeline scenario, this seems like a bottleneck. But maybe not with longer timelines.

Inference cost limits the impact of ever larger models

you're missing all the possibilities of a 'merely human-level' AI. It can be parallelized, scaled up and down (both in instances and parameters), ultra-reliable, immortal, consistently improved by new training datasets, low-latency, ultimately amortizes to zero capital investment

I agree this post could benefit from discussing the advantages of silicon-based intelligence, thanks for bringing them up. I'd add that (scaled up versions of current) ML systems have disadvantages compared to humans, such as lacking actuators and being cumbersome to fine-t... (read more)

Inference cost limits the impact of ever larger models

I broadly agree with your first point, that inference can be made more efficient. Though we may have different views on how much?

Of course, both inference and training become more efficient and I'm not sure if the ratio between them is changing over time.

As I mentioned there are also reasons why inference could become more expensive than in the numbers I gave. Given this uncertainty, my median guess is that the cost of inference will continue to exceed the cost of training (averaged across the whole economy).

I don't think sparse (mixture of expert) mode... (read more)

gwern (3mo): No, they don't. The primary justification for introducing them in the first place was to make a cheaper forward pass (=inference). They're generally more challenging to train because of the discrete gating, imbalanced experts, and sheer size - the Switch paper discusses the problems, and even the original Shazeer MoE emphasizes all of the challenges in training a MoE compared to a small dense model. Now, if you solve those problems (as Switch does), then yes, the cheaper inference would also make cheaper training (as long as you don't have to do too much more training to compensate for the remaining problems), and that is an additional justification for Switch. But the primary motivation for researching MoE NMT etc has always been that it'd be a lot more economical to deploy at scale after training.

Those results are sparse->dense, so they are not necessarily relevant (I would be thinking more applying distillation to the original MoE and distill each expert - the MoE is what you want for deployment at scale anyway, that's the point!). But the glass is half-full: they also report that you can throw away 99% of the model, and still get a third of the boost over the baseline small model. Like I said, the most reliable way to a small powerful model is through a big slow model.

Yeah, we don't know what's going on there. They've mentioned further finetuning of the models, but no details. They decline to specify even what the parameter counts are, hence EAI needing to reverse-engineer guesses from their benchmarks. (Perhaps the small models are now distilled models? At least early on, people were quite contemptuous of the small models, but these days people find they can be quite handy. Did we just underrate them initially, or did they actually get better?) They have an 'instruction' series they've never explained what it is (probably something like T0/FLAN?). Paul's estimate of TFLOPS cost vs API billing suggests that compute is not a major priority for them cost-wise
Emergent modularity and safety

Our default expectation about large neural networks should be that we will understand them in roughly the same ways that we understand biological brains, except where we have specific reasons to think otherwise.

Here's a relevant difference: In the brain, nearby neurons can communicate with lower cost and latency than far-apart neurons. This could encourage nearby neurons to form modules to reduce the number of connections needed in the brain. But this is not the case for standard artificial architectures where layers are often fully connected or similar.

NLP Position Paper: When Combatting Hype, Proceed with Caution

Some minor feedback points: Just from reading the abstract and intro, this could be read as a non-sequitur: "It limits our ability to mitigate short-term harms from NLP deployments". Also, calling something a "short-term" problem doesn't seem necessary and it may sound like you think the problem is not very important.

sbowman (3mo): Thanks! Tentative rewrite for the next revision: I tried to stick to 'present-day' over 'short-term', but missed this old bit of draft text in the abstract.
Prefer the British Style of Quotation Mark Punctuation over the American

One thing I dislike about the 'punctuation outside quotes' view is that it treats "!" and "?" differently than a full stop.

"This is an exclamation"!
"Is this a question"?

Seems less natural to me than:

"This is an exclamation!"
"Is this a question?"

I think I have this intuition because it is part of the quote that it is an exclamation or a question.

What 2026 looks like

Yes I completely agree. My point is that the fine-tuned version didn't have better final coding performance than the version trained only on code. I also agree that fine-tuning will probably improve performance on the specific tasks we fine-tune on. 

What 2026 looks like

Most importantly I expect them to be fine-tuned on various things (perhaps you can bundle this under "higher-quality data"). Think of how Codex and Copilot are much better than vanilla GPT-3 at coding. That's the power of fine-tuning / data quality.


Fine-tuning GPT-3 on code had little benefit compared to training from scratch:

Surprisingly, we did not observe improvements when starting from a pre-trained language model, possibly because the finetuning dataset is so large. Nevertheless, models fine-tuned from GPT converge more quickly, so we apply this strat

... (read more)
Daniel Kokotajlo (6mo): Huh.... I coulda' sworn they said Codex was pre-trained on internet text as well as on code, and that it was in particular a version of GPT-3, the 12B param version... The paper seems to support this interpretation when you add in more context to the quote you pulled: Note the bits I bolded. My interpretation is that Codex is indeed a fine-tuned version of GPT-3-12B; the thing they found surprising was that there wasn't much "transfer learning" from text to code, in the sense that (when they did smaller-scale experiments) models trained from scratch reached the same level of performance. So if models trained from scratch reached the same level of performance, why fine-tune from GPT-3? Answer: Because it converges more quickly that way. Saves compute.
What 2026 looks like

2023

The multimodal transformers are now even bigger; the biggest are about half a trillion parameters [...] The hype is insane now


This part surprised me. Half a trillion is only 3x bigger than GPT-3. Do you expect this to make a big difference? (Perhaps in combination with better data?). I wouldn't, given that GPT-3 was >100x bigger than GPT-2. 

Maybe you're expecting multimodality to help? It's possible, but worth keeping in mind that according to some rumors, Google's multimodal model already has on the order of 100B parameters.

On the other hand, ... (read more)

Daniel Kokotajlo (6mo): I am not confident in that part. I was imagining that they would be "only" 3x bigger or so, but that they'd be trained on much higher-quality data (incl. multimodal) and also trained for longer/more data, since corps would be not optimizing purely for training-compute-optimal performance but instead worrying a bit more about inference-time compute costs. Most importantly I expect them to be fine-tuned on various things (perhaps you can bundle this under "higher-quality data"). Think of how Codex and Copilot are much better than vanilla GPT-3 at coding. That's the power of fine-tuning / data quality.

Also, 3x bigger than GPT-3 is still, like, 40x bigger than Codex [https://www.metaculus.com/questions/405/when-will-programs-write-programs-for-us/#comment-65447], and Codex is pretty impressive. So I expect scale will be contributing some amount to the performance gains for things like code and image and video, albeit not so much for text since GPT-3-175B was already pretty big.

If Google's multimodal model is already 100B parameters big, then I look forward to seeing its performance! Is it worse than GPT-3? If so, that would be evidence against my forecast, though we still have two years to go...
Why not more small, intense research teams?

In my experience, this worked extremely well. But that was thanks to really good management and coordination which would've been hard in other groups I used to be part of.

What made the UK COVID-19 case count drop?

This wouldn't explain the recent reduction in R because Delta has already been dominant for a while.

What made the UK COVID-19 case count drop?

The R0 of Delta is ca. 2x the R0 of the Wuhan strain and this doubles the effect of new immunity on R.

In fact, the ONS data gives me that ~7% of Scotland had Delta so that's a reduction in R of R0*7% = 6*7% = 0.42 just from very recent and sudden natural immunity.
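
(Making that arithmetic explicit:)

\[
\Delta R \approx R_0 \times 7\% \approx 6 \times 0.07 = 0.42
\]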

That's not [edited: forgot to say "not"] enough to explain everything, but there are more factors: 

1) Heterogeneous immunity: the first people to become immune are often high-risk people who go to superspreader events etc. 

2) Vaccinations also w... (read more)

How should my timelines influence my career choice?

Another heuristic is to choose the option where you're most likely to do exceptionally well. (Cf. heavy-tailed impact etc.) Among other things, this pushes you to optimize for the timelines scenario where you can be very successful, and to do the job with the best personal fit.

($1000 bounty) How effective are marginal vaccine doses against the covid delta variant?

Some standard ones like masks, but not at all times. They probably were in close or indoor contact with infected people without precautions.

($1000 bounty) How effective are marginal vaccine doses against the covid delta variant?
  1. FWIW I've seen multiple double-mRNA-vaccinated people in my social circles who still got infected with delta (and in one case infected someone else who was double vaccinated). Two of the cases I know were symptomatic (but mild).
Ethan Perez (6mo): I also know of 5+ cases of symptomatic COVID among double-vaxxed people in the bay area (including one instance where most people in a group house of ~6 people got covid). These are also relatively healthy individuals in their 20s
brp (6mo): Immune response is generally associated with age and lifestyle. What can you tell us about those factors?
jacobjacob (6mo): Do you know which, if any, risk-reducing precautions they were following?
Lanrian (6mo): How many asymptomatic? And how did people know of them?
($1000 bounty) How effective are marginal vaccine doses against the covid delta variant?

According to one expert, the immune system essentially makes bets on how often it will face a given virus and how the virus will mutate in the future:

https://science.sciencemag.org/content/372/6549/1392

By that logic, being challenged more often means that the immune system should have a stronger and longer-lasting response:

The immune system treats any new exposure—be it infection or vaccination—with a cost-benefit threat analysis for the magnitude of immunological memory to generate and maintain. There are resource-commitment decisions: more cells and more

... (read more)
jacobjacob (6mo): I'll pay $50 for this answer, will message you for payment details.
Formal Inner Alignment, Prospectus

Suggestion for content 2: relationship to invariant causal prediction

Lots of people in ML these days seem excited about getting out-of-distribution generalization with techniques like invariant causal prediction. See e.g. this, this, section 5.2 here and related background. This literature seems promising but it's missing from discussions about inner alignment. It seems useful to discuss how far it can go in helping solve inner alignment.

Formal Inner Alignment, Prospectus

Suggestion for content 1: relationship to ordinary distribution shift problems

When I mention inner alignment to ML researchers, they often think of it as an ordinary problem of (covariate) distribution shift.

My suggestion is to discuss if a solution to ordinary distribution shift is also a solution to inner alignment. E.g. an 'ordinary' robustness problem for imitation learning could be handled safely with an approach similar to Michael's: maintain a posterior over hypotheses, with a sufficiently flexible hypothesis class, and ask for h... (read more)

Formal Inner Alignment, Prospectus

Feedback on your disagreements with Michael:

I agree with "the consensus algorithm still gives inner optimizers control of when the system asks for more feedback". 

Most of your criticisms seem to be solvable by using a less naive strategy for active learning and inference, such as Bayesian Active Learning by Disagreement (BALD). Its main drawback is that exact posterior inference in deep learning is expensive since it requires integrating over a possibly infinite/continuous hypothesis space. But approximations exist.

BALD (and similar methods) h... (read more)
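
(For readers unfamiliar with BALD, here is a minimal sketch of its acquisition score - my own illustration, not code from the thread: the mutual information between the prediction and the model parameters, estimated from Monte Carlo posterior samples such as MC-dropout passes or ensemble members.)

```python
# Sketch of the BALD score: points where posterior samples disagree get high
# scores, and those are the points where requesting feedback/labels is most
# informative.
import numpy as np

def entropy(p, axis=-1):
    return -np.sum(p * np.log(p + 1e-12), axis=axis)

def bald_score(probs):
    """probs: [n_posterior_samples, n_points, n_classes] predictive probs."""
    mean_probs = probs.mean(axis=0)                 # marginal prediction
    predictive_entropy = entropy(mean_probs)        # H[ E_theta p(y|x,theta) ]
    expected_entropy = entropy(probs).mean(axis=0)  # E_theta H[ p(y|x,theta) ]
    return predictive_entropy - expected_entropy    # mutual information

# Toy check: agreement between samples -> score ~0; disagreement -> high score.
agree = np.array([[[0.9, 0.1]], [[0.9, 0.1]]])
disagree = np.array([[[0.95, 0.05]], [[0.05, 0.95]]])
print(bald_score(agree), bald_score(disagree))  # ~[0.]  ~[0.49]
```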

Comment on the lab leak hypothesis

Re 1) the codons, according to Christian Drosten, have precedent for evolving naturally in viruses. That could be because viruses evolve much faster than e.g. animals. Source: search for 'codon' and use translate here: https://www.ndr.de/nachrichten/info/92-Coronavirus-Update-Woher-stammt-das-Virus,podcastcoronavirus322.html

The link also has a bunch of content about the evolution of furin cleavage sites, from a leading expert.

SoerenMind's Shortform

Favoring China in the AI race
In a many-polar AI deployment scenario, a crucial challenge is to solve coordination problems between non-state actors: ensuring that companies don't cut corners, monitoring them, just to name a few challenges. And in many ways, China is better than western countries at solving coordination problems within its borders. For example, they can use their authority over companies, as these tend to be state-owned or owned by some fund that is owned by a fund that is state-owned. Could this mean that, in a many-polar scenario, ... (read more)

Suggestions of posts on the AF to review

Thanks - I agree there's value to public peer review. Personally I'd go further than notifying authors and instead ask for permission. We already have a problem where many people (including notably highly accomplished authors) feel discouraged from posting due to the fear of losing reputation. Worse, your friends will actually read reviews of your work, unlike OpenReview. And I wouldn't want to make this worse by implicitly making authors opt into a public peer review if that makes sense. 

There are also some differences between forums and academia. Fo... (read more)

Habryka's Shortform Feed

There's also a strong chance that delta is the most transmissible variant we know even without its immune evasion (source: I work on this, don't have a public source to share). I agree with your assessment that delta is a big deal.

Suggestions of posts on the AF to review

This seems useful. But do you ask the authors for permission to review and give them an easy way out? Academic peer review is for good reasons usually non-public. The prospect of having one's work reviewed in public seems likely to be extremely emotionally uncomfortable for some authors and may discourage them from writing.

adamShimi (8mo): Putting aside how people feel for the moment (I'll come back to it), I don't think peer-review should be private, and I think anyone publishing work in an openly readable forum where other researchers are expected to interact would value a thoughtful review of their work.

That being said, you're probably right that at least notifying the authors before publication is a good policy. We sort of did that for the first two reviews, in the sense of literally asking people what they wanted to get reviews for, but we should make it a habit. Thanks for the suggestion.
The case for aligning narrowly superhuman models

Google seems to have solved some problem like the above for a multi-language-model (MUM):

"Say there’s really helpful information about Mt. Fuji written in Japanese; today, you probably won’t find it if you don’t search in Japanese. But MUM could transfer knowledge from sources across languages, and use those insights to find the most relevant results in your preferred language."

MIRI location optimization (and related topics) discussion

Some reactions:

  • The Oxford/London nexus seems like a nice combination. It's 38min by train between the two, plus getting to the stations (which in London can be a pain).
  • Re intellectual life "behind the walls of the colleges": I haven't perceived much intellectual life in my college, and much more outside. Maybe the part inside the colleges is for undergraduates?
  • I don't have experience with long-range commuting into Oxford. But you can commute in 10-15 minutes by bike from the surrounding villages like Botley / Headington.
MIRI location optimization (and related topics) discussion

I don't think anyone has mentioned Oxford, UK yet? It's tiny. You could literally live on a farm here and still be 5-10 minutes from the city centre. And obviously it's a realistic place for a rationalist hub. I haven't perceived anti-tech sentiment here but haven't paid attention either.

Jakob_J (8mo): Immigration issues aside, I second the choice of the United Kingdom. Having lived in several European countries, the UK probably has one of the strongest intellectual cultures I've seen. The population is roughly that of California and Texas combined, and yet its combined cultural and scientific output is on par with the US as a whole (it has received the second largest number of Nobel prizes in the world, and in terms of Nobel prizes per capita it outperforms the US by a factor ~2). However, I would say that Oxford wouldn't be my first choice:

  • Most great things about Oxford are behind the walls of the colleges - if you are not a member of the university, you feel quite cut off from the intellectual life there. (Even as a member of the university, things are only active during term times, which are much shorter than elsewhere)
  • Living outside Oxford and commuting in is a pain - the roads are always clogged, even for buses. Commuting by train is possible only from a few places.

I would recommend living near London:

  • London is a really fun city. Whatever your interests might be, it is quite likely that you will find groups with the same interests as you. Also the food scene is amazing - you could probably find both great restaurants and grocery shops specializing in any cuisine you want.
  • Public transport is pretty great, much better than what I have seen in e.g. NY. It is common to live >1 hr outside the city and commute in, so there are lots of places in the countryside which are affordable but with a direct train to central London.
  • The job market is very active, and it shouldn't be a problem for two people to find a job here.

I'd guess it's not easy to change the land use for a farm and that it would be expensive and slow to build a campus in or near Oxford. It's probably easier to move into an existing "campus" (e.g. for a school, training center, residential conference facility). 

Immigration-wise: It will be harder for EU people to move to the UK going forward but (AFAICT) easier for people from Canada, the US, Australia and elsewhere. The UK now has a points system for skilled workers (you need a job offer) and a special visa (don't need a job offer) for people in academia... (read more)

Three reasons to expect long AI timelines

I agree that 1-3 need more attention, thanks for raising them.

Many AI scientists in the 1950s and 1960s incorrectly expected that cracking computer chess would automatically crack other tasks as well.

There’s a simple disconnect here between chess and self-supervised learning. You're probably aware of it but it's worth mentioning. Chess algorithms were historically designed to win at chess. In contrast, the point of self-supervised learning is to extract representations that are useful in general. For example, to solve a new task we can feed the r... (read more)

The case for aligning narrowly superhuman models

How useful would it be to work on a problem where what the LM "knows" cannot be superhuman, but it still knows how to do well and needs to be incentivized to do so? A currently prominent example problem is that LMs produce "toxic" content: 
https://lilianweng.github.io/lil-log/2021/03/21/reducing-toxicity-in-language-models.html

Demand offsetting

Put differently, buying eggs only hurt hens via some indirect market effects, and I’m now offsetting my harm at that level before it turns into any actual harm to a hen.

I probably misunderstand but isn't this also true about other offsetting schemes like convincing people to go vegetarian? They also lower demand.

paulfchristiano (10mo): At a minimum they also impose harms on the people who you convinced not to eat meat (since you are assuming that eating meat was a benefit to you that you wanted to pay for). And of course they make further vegetarian outreach harder. And in most cases they also won't be so precise an offset, e.g. it will apply to different animal products or at different times or with unclear probability. That said, I agree that I can offset "me eating an egg" by paying Alice enough that she's willing to skip eating an egg, and in some sense that's an even purer offset than the one in this post.
Acetylcholine = Learning rate (aka plasticity)

Related, acetylcholine has been hypothesized to signal to the rest of the brain that unfamiliar/uncertain things are about to happen:
https://www.sciencedirect.com/science/article/pii/S0896627305003624
http://www.gatsby.ucl.ac.uk/~dayan/papers/yud2002.pdf

Steven Byrnes (10mo): Thanks! Yeah that would seem to be consistent with "This is a good time to set your learning algorithm to have a higher-than-usual learning rate.", not to mention being alert and paying attention to that part of your sensory input.
Where is human level on text prediction? (GPTs task)

FWIW I wouldn't read much into it if LMs were outperforming humans at next-word-prediction. You can improve on it by having superhuman memory and doing things like analyzing the author's vocabulary. I may misremember but I thought we've already outperformed humans on some LM dataset?

Will OpenAI's work unintentionally increase existential risks related to AI?

No. Amodei led the GPT-3 project, he's clearly not opposed to scaling things. Idk why they're leaving but since they're all starting a new thing together, I presume that's the reason.

New SARS-CoV-2 variant

Some expert commentary here:  https://www.sciencemag.org/news/2020/12/mutant-coronavirus-united-kingdom-sets-alarms-its-importance-remains-unclear

Noteworthy:

  • We previously thought a strain from Spain was spreading faster than the rest but it was just because of people returning from holiday in Spain.
  • Chance events can help a strain spread faster.
  • The UK (and Denmark) do more gene sequencing than other countries - that may explain why they picked up the new variant first.
  • The strain has acquired 17 mutations at once which is very high. Not clear what that
... (read more)
CellBioGuy (1y): The spectrum and rate of excess mutations (assuming they all came in one step) is similar to what has been recorded elsewhere in immunocompromised people chronically infected for a month or two straight, in which there's more time for multiple lineages to coexist in the same body and compete with selection against each other and a longer time with high viral numbers without transmission bottlenecks.
Continuing the takeoffs debate

For example, moving from a 90% chance to a 95% chance of copying a skill correctly doubles the expected length of any given transmission chain, allowing much faster cultural accumulation. This suggests that there’s a naturally abrupt increase in the usefulness of culture

This makes sense when there's only one type of thing to teach / imitate. But some things are easier to teach and imitate than others (e.g. catching a fish vs. building a house). And while there may be an abrupt jump in the ability to teach or imitate each particular skill, this argument doesn't show that there will be a jump in the number of skills that can be taught/imitated. (Which is what matters)
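
(One way to see the quoted doubling claim, assuming each link of the chain copies the skill correctly with independent probability p, so chain length is geometric:)

\[
\mathbb{E}[\text{chain length}] = \frac{1}{1-p}: \qquad \frac{1}{1-0.90} = 10 \quad\longrightarrow\quad \frac{1}{1-0.95} = 20
\]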

Richard_Ngo (1y): Yeah, I don't think this is a conclusive argument, it's just pointing to an intuition (which was then backed up by simulations in the paper). And the importance of transmission fidelity is probably higher when we're thinking about cumulative culture (with some skills being prerequisites for others), not just acquiring independent skills. But I do think your point is a good one.
Covid Covid Covid Covid Covid 10/29: All We Ever Talk About

Right, to be clear that's the sort of number I have in mind and wouldn't call far far lower.

CellBioGuy (1y): Also, my boyfriend is an ICU nurse on the covid ward in a Southern state. He says average ICU stays are half as long now but that the average ICU patient is younger, so take that for what it's worth.
Covid Covid Covid Covid Covid 10/29: All We Ever Talk About

the infection fatality rate is far, far lower [now]

Just registering that, based on my reading of people who study the IFR over time, this is a highly contentious claim especially in the US.

CellBioGuy (1y): The numbers I have seen are suggesting that it's lower but only by say 30%?
interpreting GPT: the logit lens

Are these known facts? If not, I think there's a paper in here.

Will OpenAI's work unintentionally increase existential risks related to AI?
But what if they reach AGI during their speed up?

I agree, but I think it's unlikely OpenAI will be the first to build AGI.

(Except maybe if it turns out AGI isn't economically viable).

Will OpenAI's work unintentionally increase existential risks related to AI?

OpenAI's work speeds up progress, but in a way that will likely smooth progress later on. If you spend as much compute as possible now, you reduce potential surprises in the future.

adamShimi (1y): Post OpenAI exodus [https://www.lesswrong.com/posts/7r8KjgqeHaYDzJvzF/dario-amodei-leaves-openai] update: does the exit of Dario Amodei, Chris Olah, Jack Clarke and potentially others from OpenAI make you change your opinion?
adamShimi (1y): But what if they reach AGI during their speed up? The smoothing at a later time assumes that we'll end up with diminishing returns before AGI, which is not what happens for the moment.