
Epistemic Status: Musing and speculation, but I think there's a real thing here.

I.

When I was a kid, a friend of mine had a tree fort. If you've never seen such a fort, imagine a series of wooden boards secured to a tree, creating a platform about fifteen feet off the ground where you can sit or stand and walk around the tree. This one had a rope ladder we used to get up and down, a length of knotted rope that was tied to the tree at the top and dangled over the edge so that it reached the ground.

Once you were up in the fort, you could pull the ladder up behind you. It was much, much harder to get into the fort without the ladder....


Scaling of AI training runs will slow down after GPT-5

Maxime Riché

My credence: 33% confidence in the claim that the growth in the number of GPUs used for training SOTA AI will slow down significantly directly after GPT-5. It is not higher because of four factors: (1) decentralized training is possible, (2) GPT-5 may be able to increase hardware efficiency significantly, (3) GPT-5 may be smaller than assumed in this post, and (4) race dynamics.

TLDR: Because of a bottleneck in energy access to data centers and the need to build OOM larger data centers.

Update: See Vladimir_Nesov's comment below for why this claim is likely wrong, since decentralized training seems to be solved.

The reasoning behind the claim:

Current large data centers consume around 100 MW of power, while a single nuclear power plant generates 1 GW. The largest data center seems to consume 150 MW.
An

...


jsd2m10

Amazon recently bought a 960MW nuclear-powered datacenter.

I think this doesn't contradict your claim that "The largest seems to consume 150 MW" because the 960MW datacenter hasn't been built (or there is already a datacenter there but it doesn't consume that much energy for now)?

1Maxime Riché1h

Thanks for the great comment! Do we know if distributed training is expected to scale well to GPT-6 size models (100 trillion parameters) trained over like 20 data centers? How does the communication cost scale with the size of the model and the number of data centers? Linearly on both? After reading this for 3 min: Google Cloud demonstrates the world’s largest distributed training job for large language models across 50000+ TPU v5e chips (Google, November 2023). It seems that scaling is working efficiently at least up to 50k GPUs (GPT-6 would be like 2.5M GPUs). There are also some surprising linear increases in start time with the number of GPUs: 13 min for 32k GPUs. What is the SOTA?

6Chris_Leong3h

Only 33% confidence? It seems strange to state X will happen if your odds are < 50%

4Maxime Riché2h

The title is clearly an overstatement. It expresses more that I updated in that direction, than that I am confident in it. Also, since learning from other comments that decentralized learning is likely solved, I am now even less confident in the claim, like only 15% chance that it will happen in the strong form stated in the post. Maybe I should edit the post to make it even more clear that the claim is retracted.

We are headed into an extreme compute overhang

devrandom

21m

If we achieve AGI-level performance using an LLM-like approach, the training hardware will be capable of running millions of concurrent instances of the model.
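To make the order of magnitude concrete, here is a rough sketch of one way to estimate the number of concurrent instances. It is not taken from the post: it assumes the common rule of thumb that training costs about 6·N·D FLOPs and inference about 2·N FLOPs per token (N parameters, D training tokens), and every input value below is hypothetical.

```python
def concurrent_instances(training_tokens, training_seconds, tokens_per_second_per_instance):
    """Estimate how many model copies the training cluster could run at once.

    Assumption (not from the post): training costs ~6*N*D FLOPs and inference
    costs ~2*N FLOPs per token, so the parameter count N cancels out.
    """
    # Cluster throughput is 6*N*D / T FLOP/s; dividing by 2*N FLOPs per token
    # gives the number of tokens the cluster can generate per second.
    cluster_tokens_per_second = 3 * training_tokens / training_seconds
    return cluster_tokens_per_second / tokens_per_second_per_instance


# Hypothetical inputs, chosen only for illustration.
estimate = concurrent_instances(
    training_tokens=1e14,               # assumed dataset size in tokens
    training_seconds=100 * 86_400,      # assumed 100-day training run
    tokens_per_second_per_instance=10,  # assumed roughly human-speed output
)
print(f"{estimate:.2e} concurrent instances")
```

With these assumed inputs the estimate lands in the low millions. The sketch ignores memory bandwidth, batching, and utilization differences between training and inference, so it is only an order-of-magnitude guide.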

Definitions

Although there is some debate about the definition of compute overhang, I believe that the AI Impacts definition matches the original use, and I prefer it: "enough computing hardware to run many powerful AI systems already exists by the time the software to run such systems is developed". A large compute overhang leads to additional risk due to faster takeoff.

I use the types of superintelligence defined in Bostrom's Superintelligence book (summary here).

I use the definition of AGI in this Metaculus question. The adversarial Turing test portion of the definition is not very relevant to this post.

Thesis

Due to practical reasons, the compute requirements for training LLMs...


ACX Atlanta - The Atlanta Moloch Slayers

ACX Atlanta Meetups Everywhere Spring 2024

Apr 27th, Atlanta

Steve French

The April 2024 Meetup will be April 27th at Bold Monk at 2:00 PM.

We return to Bold Monk brewing for a vigorous discussion of rationalism and whatever else we deem fit for discussion – hopefully including actual discussions of the sequences and Hamming Circles/Group Debugging.

Location:
Bold Monk Brewing
1737 Ellsworth Industrial Blvd NW
Suite D-1
Atlanta, GA 30318, USA

No Book club this month!

This is also the meetups everywhere meetup that will be advertised on the blog - so we should have a large turnout!

We will be outside out front (in the breezeway) – this is subject to change, but we will be somewhere in Bold Monk. If you do not see us in the front of the restaurant, please check upstairs and out back – look for the yellow table sign. We will have to play the weather by ear.

Remember – bouncing around in conversations is a rationalist norm!

Steve French27m10

Great!

Reducing sycophancy and improving honesty via activation steering

117

Nina Rimsky

Ω 499mo

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Evan Hubinger.

I generate an activation steering vector using Anthropic's sycophancy dataset and then find that this can be used to increase or reduce performance on TruthfulQA, indicating a common direction between sycophancy on questions of opinion and untruthfulness on questions relating to common misconceptions. I think this could be a promising research direction to understand dishonesty in language models better.
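As a rough illustration of what a steering vector can look like, the sketch below assumes it is the difference between the mean activation on sycophantic examples and the mean activation on non-sycophantic ones, added back to an activation with a signed multiplier. The excerpt does not spell out the exact procedure, so treat this as one plausible reading; the function names and toy activations are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(positive_acts, negative_acts):
    """Difference of means: average 'sycophantic' minus average 'non-sycophantic'."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]


def steer(activation, vector, multiplier):
    """Add the scaled steering vector; a negative multiplier steers away from it."""
    return [a + multiplier * v for a, v in zip(activation, vector)]


# Toy 3-dimensional activations standing in for one layer of a real model.
sycophantic = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
non_sycophantic = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

vec = steering_vector(sycophantic, non_sycophantic)
reduced = steer([1.0, 1.0, 1.0], vec, multiplier=-0.5)
print(vec, reduced)
```

In a real model the same addition would be applied to the activations of a chosen layer during the forward pass; which layer and which multiplier work best is an empirical question.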

What is sycophancy?

Sycophancy in LLMs refers to the behavior when a model tells you what it thinks you want to hear / would approve of instead of what it internally represents as the truth. Sycophancy is a common problem in LLMs trained on human-labeled data because human-provided training signals...


1alexandraabbas11h

"[...] This is because there would be no general direction towards a truth-based belief domain or away from using human modeling in output generation." What do you mean by "human modeling in output generation"?

Nina Rimsky27m20

I am contrasting generating an output by:

Modeling how a human would respond (“human modeling in output generation”)
Modeling what the ground-truth answer is

E.g., for common misconceptions, maybe most humans would hold a certain misconception (like that South America is west of Florida), but we want the LLM to realize that we want it to actually say how things are (given it likely does represent this fact somewhere).

[Concept Dependency] Concept Dependency Posts

Johannes C. Mayer

This is a Concept Dependency Post. It may not be worth reading on its own, out of context. See the backlinks at the bottom to see which posts use this concept.

See the backlinks at the bottom of the post. Every post starting with [Concept Dependency] is a concept dependency post, that describes a concept this post is using.

Problem: Often when writing I come up with general concepts that make sense in isolation. Often I want to reuse these concepts without having to reexplain them.

A Concept Dependency Post explains a single concept, usually with no or minimal context. It is expected that the relevant context is provided by another post that links to the concept dependency post.

Concept Dependency Posts can be very short. Much shorter than a regular post. They might not be worth reading on their own....


1quila36m

i like the idea. it looks useful and it fits my reading style well. i wish something like this were more common - i have seen it on personal blogs before like carado's. i would use [Concept Dependency] or [Concept Reference] instead so the reader understands just from seeing the title on the front page. also avoids acronym collision

Johannes C. Mayer30m20

Adopted.


Nathan Young's Shortform

Nathan Young

2Nathan Young37m

If that were true then there are many ways you could partially do that, e.g. give people a set of tokens to represent their mana at the time of the devaluation, and if at a future point you raise, you could give them 10x those tokens back.

2James Grugett4h

We are trying our best to honor mana donations! If you are inactive you have until the rest of the year to donate at the old rate. If you want to donate all your investments without having to sell each individually, we are offering you a loan to do that. We removed the charity cap of $10k donations per month, which is going beyond what we previously communicated.

Nathan Young31m20

Nevertheless lots of people were hassled. That has real costs, both to them and to you.

2Nathan Young2h

I’m discussing with Carson. I might change my mind but i don’t know that i’ll argue with both of you at once.

Take the wheel, Shoggoth! (Lesswrong is trying out changes to the frontpage algorithm)

Ruby, RobertM

For the last month, @RobertM and I have been exploring the possible use of recommender systems on LessWrong. Today we launched our first site-wide experiment in that direction.

(In the course of our efforts, we also hit upon a frontpage refactor that we reckon is pretty good: tabs instead of a clutter of different sections. For now, only for logged-in users. Logged-out users see the "Latest" tab, which is the same-as-usual list of posts.)

Why algorithmic recommendations?

A core value of LessWrong is to be timeless and not news-driven. However, the central algorithm by which attention allocation happens on the site is the Hacker News algorithm^[1], which basically only shows you things that were posted recently, and creates a strong incentive for discussion to always be...
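For readers unfamiliar with this kind of ranking, here is a minimal sketch of a time-decayed score in the general shape commonly attributed to Hacker News: karma divided by a power of the post's age. The exponent, the offset, and the sample posts are assumptions for illustration; the excerpt does not give LessWrong's actual parameters.

```python
def time_decayed_score(karma, age_hours, gravity=1.8, offset_hours=2.0):
    """Score that falls off polynomially with age (parameter values are assumed)."""
    return karma / (age_hours + offset_hours) ** gravity


# Hypothetical posts: (title, karma, age in hours).
posts = [
    ("old classic", 300, 24 * 365),
    ("fresh post", 20, 3),
]

ranked = sorted(posts, key=lambda p: time_decayed_score(p[1], p[2]), reverse=True)
for title, karma, age_hours in ranked:
    print(title, round(time_decayed_score(karma, age_hours), 6))
```

Under these assumed parameters the three-hour-old post outranks the year-old one despite having far less karma, which is the recency bias the paragraph above describes.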


2habryka42m

GDPR is a giant mess, so it's pretty unclear what it requires us to implement. My current understanding is that it just requires us to tell you that we are collecting analytics data if you are from the EU. And the kind of stuff we are sending over to Recombee would be covered by it being data necessary to provide site functionality, not just analytics, so wouldn't be covered by that (if you want to avoid data being sent to Google Analytics in particular, you can do that by just blocking the GA script in uBlock Origin or whatever other adblocker you use, which it should do by default).

the gears to ascension40m20

drat, I was hoping that one would work. oh well. yes, I use ublock, as should everyone. Have you considered simply not having analytics at all :P I feel like it would be nice to do the thing that everyone ought to do anyway since you're in charge. If I was running a website I'd simply not use analytics.

back to the topic at hand, I think you should just make a vector embedding of all posts and show a HuMAP layout of it on the homepage. that would be fun and not require sending data anywhere. you could show the topic islands and stuff.

2kave1h

I am sad to see you getting so downvoted. I am glad you are bringing this perspective up in the comments.

2habryka1h

I am pretty excited about doing something more in-house, but it's much easier to get data about how promising this direction is by using some third-party services that already have all the infrastructure. If it turns out to be a core part of LW, it makes more sense to in-house it. It's also really valuable to have a relatively validated baseline to compare things to. There are a bunch of third-party services we couldn't really replace that we send user data to. Hex.tech as our analytics dashboard service. Google Analytics for basic user behavior and patterns. A bunch of AWS services. Implementing the functionality of all of that ourselves, or putting a bunch of effort into anonymizing the data, is not impossible, but seems pretty hard, and Recombee seems about par for the degree to which I trust them to not do anything with that data themselves.

Examples of Highly Counterfactual Discoveries?

149

johnswentworth, kromem

The history of science has tons of examples of the same thing being discovered multiple times independently; Wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity's trajectory, then it makes sense to focus on such examples.

But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyways, then the discovery probably wasn't very counterfactually impactful.

Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement Wikipedia's list of multiple discoveries.

To...


1Johannes C. Mayer1h

A few adjacent thoughts:
* Why is a programming language like Haskell not more widely used, given that it is extremely powerful in the sense that if your program compiles, it is the program that you want with a very high probability, because most stupid mistakes are now compile errors?
* Why is there basically no widely used homoiconic language, i.e. a language in which you can use the language itself to <reason about the language/manipulate the language>? Here we have some technology that is basically ready to use (Haskell or Clojure), but people decide to mostly not use them. And by people, I mean professional programmers and companies who make software.
* Why did nobody invent Rust earlier, by which I mean a system-level programming language that prevents you from making really dumb mistakes that can be machine-checked if you make them?
* Why did it take like 40 years to get a LaTeX replacement, even though LaTeX is terrible in very obvious ways?

These things have in common that there is a big engineering challenge. It feels like maybe this explains it, together with the fact that the people who would benefit from these technologies were in the position that the cost of creating them would have exceeded the benefit that they would expect from them. For Haskell and Clojure we can also consider this point. Certainly, these two technologies have their flaws and could be improved. But then again we would have a massive engineering challenge.

4Alexander Gietelink Oldenziel2h

I would not say that the central insight of SLT is about priors. Under weak conditions the prior is almost irrelevant. Indeed, the RLCT is independent of the prior under very weak nonvanishing conditions. The story that symmetries mean that the parameter-to-function map is not injective is true but already well-understood outside of SLT. It is a common misconception that this is what SLT amounts to. To be sure - generic symmetries are seen by the RLCT. But these are, in some sense, the uninteresting ones. The interesting thing is the local singular structure and its unfolding in phase transitions during training. The issue of the true distribution not being contained in the model is called 'unrealizability' in Bayesian statistics. It is dealt with in Watanabe's second 'green' book. Nonrealizability is key to the most important insight of SLT contained in the last sections of the second to last chapter of the green book: algorithmic development during training through phase transitions in the free energy. I don't have the time to recap this story here.

mattmacdermott42m30

Lucius-Alexander SLT dialogue?

4Alexander Gietelink Oldenziel2h

All proofs are contained in Watanabe's standard text; see here: https://www.cambridge.org/core/books/algebraic-geometry-and-statistical-learning-theory/9C8FD1BDC817E2FC79117C7F41544A3A
