LessWrong

My current main cruxes:

Will AI get takeover capability? When?
Single ASI or many AGIs?
Will we solve technical alignment?
Value alignment, intent alignment, or CEV?
Defense>offense or offense>defense?
Is a long-term pause achievable?

If there is reasonable consensus on any one of those, I'd much appreciate hearing about it. Otherwise, I think these should be research priorities.

The longest training run

Jsevillamol, Tamay, Owen D, anson.ho

Ω 36

2y

This is a linkpost for https://epochai.org/blog/the-longest-training-run

In short: Training runs of large Machine Learning systems are likely to last less than 14-15 months. This is because longer runs will be outcompeted by runs that start later and therefore use better hardware and better algorithms. [Edited 2022/09/22 to fix an error in the hardware improvements + rising investments calculation]

Scenario	Longest training run
Hardware improvements	3.55 years
Hardware improvements + Software improvements	1.22 years
Hardware improvements + Rising investments	9.12 months
Hardware improvements + Rising investments + Software improvements	2.52 months
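A minimal sketch of the argument behind these numbers, assuming effective compute per unit time grows exponentially at a constant rate g and that a run must finish by a fixed deadline T. A run of length L then starts at T - L and delivers L * exp(g * (T - L)), which peaks at L = 1/g. The function names, the deadline, and the example growth rate are illustrative assumptions, not taken from the post.

```python
import math


def compute_by_deadline(run_length: float, growth_rate: float, deadline: float = 10.0) -> float:
    """Total effective compute of a run that ends exactly at `deadline`.

    Assumption: the run uses the hardware/algorithms available at its start,
    whose throughput is exp(growth_rate * start_time).
    """
    start_time = deadline - run_length
    return run_length * math.exp(growth_rate * start_time)


def best_run_length(growth_rate: float) -> float:
    """Run length maximizing compute_by_deadline.

    Setting d/dL [L * exp(g * (T - L))] = 0 gives L = 1 / g.
    """
    return 1.0 / growth_rate


# Hypothetical growth rate of 0.5 per year, chosen only for illustration.
g = 0.5
optimal = best_run_length(g)
shorter = compute_by_deadline(optimal - 0.5, g)
at_optimum = compute_by_deadline(optimal, g)
longer = compute_by_deadline(optimal + 0.5, g)
```

Under this toy model, an optimum of 3.55 years would correspond to a growth rate of about 0.28 per year, since the optimum is simply the reciprocal of the growth rate.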

Larger compute budgets and a better understanding of how to effectively use compute (through, for example, using scaling laws) are two major driving forces of progress in recent Machine Learning.

There are many ways to increase your effective compute budget: better hardware, rising investments in AI R&D and improvements in algorithmic efficiency. In this article...


1Maxime Riché43m

Why is g_I here 3.84, while above it is 1.03?

Maxime Riché42m10

This is actually corrected on the Epoch website but not here (https://epochai.org/blog/the-longest-training-run)

LLMs seem (relatively) safe

JustisMills

15h

This is a linkpost for https://justismills.substack.com/p/llms-seem-relatively-safe

Post for a somewhat more general audience than the modal LessWrong reader, but gets at my actual thoughts on the topic.

In 2018 OpenAI defeated the world champions of Dota 2, a major esports game. This was hot on the heels of DeepMind’s AlphaGo performance against Lee Sedol in 2016, achieving superhuman Go performance way before anyone thought that might happen. AI benchmarks were being cleared at a pace which felt breathtaking at the time, papers were proudly published, and ML tools like TensorFlow (released in 2015) were coming online. To people already interested in AI, it was an exciting era. To everyone else, the world was unchanged.

Now Saturday Night Live sketches use sober discussions of AI risk as the backdrop for their actual jokes, there are hundreds...


quetzal_rainbow42m10

The reason why EY&co were relatively optimistic (p(doom) ~ 50%) before AlphaGo was their assumption that "to build intelligence, you need some kind of insight into the theory of intelligence". They didn't expect that you can just take a sufficiently large approximator, pour data inside, get intelligent behavior, and have no idea why you get intelligent behavior.

2avturchin1h

LLMs can now also self-play in adversarial word games, and doing so increases their performance: https://arxiv.org/abs/2404.10642

1zeshen3h

I agree with RL agents being misaligned by default, even more so for the non-imitation-learned ones. I mean, even LLMs trained on human-generated data are misaligned by default, regardless of what definition of 'alignment' is being used. But even with misalignment by default, I'm just less convinced that their capabilities would grow fast enough to be able to cause an existential catastrophe in the near-term, if we use LLM capability improvement trends as a reference.

2Wei Dai3h

If something is both a vanguard and limited, then it seemingly can't stay a vanguard for long. I see a few different scenarios going forward: 1. We pause AI development while LLMs are still the vanguard. 2. The data limitation is overcome with something like IDA or Debate. 3. LLMs are overtaken by another AI technology, perhaps based on RL. In terms of relative safety, it's probably 1 > 2 > 3. Given that 2 might not happen in time, might not be safe if it does, or might still be ultimately outcompeted by something else like RL, I'm not getting very optimistic about AI safety just yet.

Improving Dictionary Learning with Gated Sparse Autoencoders

Neel Nanda, Senthooran Rajamanoharan, Arthur Conmy, lsgos, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah

Ω 30

19h

This is a linkpost for https://arxiv.org/abs/2404.16014

Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders!

Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over normal SAEs, verified on models up to Gemma 7B. They are now our team's preferred way to train sparse autoencoders, and we'd love to see them adopted by the community! (Or to be convinced that it would be a bad idea for them to be adopted by the community!)

They achieve similar reconstruction with about half as many firing features, while being comparably or more interpretable (confidence interval for the increase is 0%-13%).

See Sen's Twitter summary, my Twitter summary, and the paper!

Senthooran Rajamanoharan42m10

UPDATE: we've corrected equations 9 and 10 in the paper (screenshot of the draft below) and also added a footnote that hopefully helps clarify the derivation. I've also attached a revised figure 6, showing that this doesn't change the overall story (for the mathematical reasons I mentioned in my previous comment). These will go up on arXiv, along with some other minor changes (like remembering to mention SAEs' widths), likely at some point next week. Thanks again Sam for pointing this out!

Updated equations (draft):

Updated figure 6 (shrinkage comparison for GE... (read more)

1Dan Braun5h

This is neat, nice work! I'm finding it quite hard to get a sense of what the actual Loss Recovered numbers you report are, and to compare them concretely to other work. If possible, it'd be very helpful if you shared: 1. What the zero ablation CE scores are for each model and SAE position. (I assume it's much worse for the MLP and attention outputs than the residual stream?) 2. What the baseline CE scores are for each model.

2Rohin Shah7h

This suggestion seems less expressive than (but similar in spirit to) the "rescale & shift" baseline we compare to in Figure 9. The rescale & shift baseline is sufficient to resolve shrinkage, but it doesn't capture all the benefits of Gated SAEs. The core point is that L1 regularization adds lots of biases, of which shrinkage is just one example, so you want to localize the effect of L1 as much as possible. In our setup L1 applies to ReLU(π_gate(x)), so you might think of π_gate as "tainted", and want to use it as little as possible. The only thing you really need L1 for is to deter the model from setting too many features active, i.e. you need it to apply to one bit per feature (whether that feature is on / off). The Heaviside step function makes sure we are extracting just that one bit, and relying on f_mag for everything else.
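The gating idea in this comment can be sketched in plain Python. This is only an illustration under stated assumptions: the two affine pre-activations (called pi_gate and pi_mag here), the weight shapes, and the helper names are hypothetical, and the actual architecture and training loss are those in the paper, not this snippet.

```python
def affine(weights, x, bias):
    """Compute weights @ x + bias for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def heaviside(values):
    """One bit per feature: 1.0 if the pre-activation is positive, else 0.0."""
    return [1.0 if v > 0 else 0.0 for v in values]


def gated_encode(x, w_gate, b_gate, w_mag, b_mag):
    """Return (feature activations, L1 penalty term) for one input vector.

    Assumption: the gate path decides only whether a feature is on, the
    magnitude path decides how strongly, and L1 touches only the gate path.
    """
    pi_gate = affine(w_gate, x, b_gate)   # "tainted" by the L1 penalty
    pi_mag = affine(w_mag, x, b_mag)      # never sees the L1 penalty directly
    gate = heaviside(pi_gate)             # extracts the on/off bit
    f_mag = relu(pi_mag)                  # magnitude estimate
    features = [g * m for g, m in zip(gate, f_mag)]
    l1_penalty = sum(relu(pi_gate))       # L1 applies to ReLU(pi_gate(x)) only
    return features, l1_penalty


# Made-up weights for a 2-dimensional input and 2 features.
features, l1_penalty = gated_encode(
    x=[1.0, -2.0],
    w_gate=[[1.0, 0.0], [0.0, 1.0]], b_gate=[0.0, 0.0],
    w_mag=[[2.0, 0.0], [0.0, 3.0]], b_mag=[0.5, 10.0],
)
```

In this toy input the second feature has a large magnitude pre-activation but a negative gate pre-activation, so it is switched off entirely; the magnitude of the first feature is not shrunk by the penalty, because the penalty only sees the gate path.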

4Neel Nanda11h

Re dictionary width: 2**17 (~131K) for most Gated SAEs and 3*(2**16) for baseline SAEs, except that for the (Pythia-2.8B, Residual Stream) sites we used 2**15 for Gated and 3*(2**14) for baseline, since early runs of these had lots of feature death. (This'll be added to the paper soon, sorry!) I'll leave the other Qs for my co-authors.

Examples of Highly Counterfactual Discoveries?

139

johnswentworth, kromem

The history of science has tons of examples of the same thing being discovered multiple times independently; Wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity's trajectory, then it makes sense to focus on such examples.

But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyways, then the discovery probably wasn't very counterfactually impactful.

Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement Wikipedia's list of multiple discoveries.

To...


2dr_s2h

Well, it's hard to tell, because most other civilizations at the level of wealth required to discover this (by which I mean both sailing and enough surplus to have people who worry about the shape of the Earth at all) could one way or another have learned it via osmosis from Greece. If you essentially have only two examples, how do you tell whether the one who discovered it was unusually observant, rather than the one who didn't being unusually blind? But it's an interesting question. It might indeed be a relatively accidental thing which for some reason was accepted sooner than you would have expected. After all, sails disappearing could be explained by an Earth that's merely dome-shaped; the strongest evidence for a completely spherical shape was probably the fact that lunar eclipses always feature a perfectly disc-shaped shadow, and even that requires interpreting eclipses correctly, and having enough of them in the first place.

3francis kafka3h

Bowler's comment on Wallace is that his theory was not worked out to the extent that Darwin's was, and besides, I recall that he was a theistic evolutionist. Even with Wallace, there was still a plethora of non-Darwinian evolutionary theories before and after Darwin, and without the force of Darwin's version, it's not likely or necessary that Darwinism wins out. He also points out that minus Darwin, nobody would have paid as much attention to Wallace, and that Wallace didn't really form the connection between natural and artificial selection.

Lukas_Gloor42m20

In some of his books on evolution, Dawkins also said very similar things when commenting on Darwin vs Wallace, basically saying that there's no comparison, Darwin had a better grasp of things, justified it better and more extensively, didn't have muddled thinking about mechanisms, etc.

3kromem12h

Though the Greeks actually credited the idea to an even earlier Phoenician, Mochus of Sidon. Though when it comes to antiquity, credit isn't really "first to publish" as much as "first of the last to pass the survivorship filter."

Fabien's Shortform

Fabien Roger

Ω 52mo

Fabien Roger43mΩ240

"List sorting does not play well with few-shot" mostly doesn't replicate with davinci-002.

When using length-10 lists (it crushes length-5 no matter the prompt), I get:

32-shot, no fancy prompt: ~25%
0-shot, fancy python prompt: ~60%
0-shot, no fancy prompt: ~60%

So few-shot hurts, but the fancy prompt does not seem to help. Code here.

I'm interested if anyone knows another case where a fancy prompt increases performance more than few-shot prompting, where a fancy prompt is a prompt that does not contain information that a human would use to solve the task. ... (read more)


dirk's Shortform

dirk

dirk1h10

I'm against intuitive terminology [epistemic status: 60%] because it creates the illusion of transparency; opaque terms make it clear you're missing something, but if you already have an intuitive definition that differs from the author's it's easy to substitute yours in without realizing you've misunderstood.

1dirk1h

I'm not alexithymic; I directly experience my emotions and have, additionally, introspective access to my preferences. However, some things manifest directly as preferences which, I have been shocked to realize in my old age, were in fact emotions all along. (In rare cases these are even stronger than the ones directly felt, despite reliably seeming on initial inspection to be simply neutral metadata.)

1dirk2h

Classic type of argument-gone-wrong (also IMO a way autistic 'hyperliteralism' or 'over-concreteness' can look in practice, though I expect that isn't always what's behind it): Ashton makes a meta-level point X based on Birch's meta point Y about object-level subject matter Z. Ashton thinks the topic of conversation is Y and Z is only relevant as the jumping-off point that sparked it, while Birch wanted to discuss Z and sees X as only relevant insofar as it pertains to Z. Birch explains that X is incorrect with respect to Z; Ashton, frustrated, reiterates that Y is incorrect with respect to X. This can proceed for quite some time with each feeling as though the other has dragged a sensible discussion onto their irrelevant pet issue; Ashton sees Birch's continual returns to Z as a gotcha distracting from the meta-level topic XY, whilst Birch in turn sees Ashton's focus on the meta-level point as sophistry to avoid addressing the object-level topic YZ. It feels almost exactly the same to be on either side of this, so misunderstandings like this are difficult to detect or resolve while involved in one.

1dirk2h

Meta/object level is one possible mixup but it doesn't need to be that. Alternative example, is/ought: Cedar objects to thing Y. Dusk explains that it happens because Z. Cedar reiterates that it shouldn't happen, Dusk clarifies that in fact it is the natural outcome of Z, and we're off once more.

Benchmarks for Detecting Measurement Tampering [Redwood Research]

ryan_greenblatt, Fabien Roger

Ω 46

8mo

This is a linkpost for https://arxiv.org/abs/2308.15605

TL;DR: This post discusses our recent empirical work on detecting measurement tampering and explains how we see this work fitting into the overall space of alignment research.

When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals that are robust under optimization. One concern is measurement tampering, which is where the AI system manipulates multiple measurements to create the illusion of good results instead of achieving the desired outcome. (This is a type of reward hacking.)

Over the past few months, we’ve worked on detecting measurement tampering by building analogous datasets and evaluating simple techniques. We detail our datasets and experimental results in this paper.

Detecting measurement tampering can be thought of as a specific case of Eliciting Latent Knowledge (ELK): When AIs successfully tamper with...


Fabien Roger1h20

That's right. We initially thought it might be important so that the LLM "understood" the task better, but it didn't matter much in the end. The main hyperparameters for our experiments are in train_ray.py, where you can see that we use a "token_loss_weight" of 0.

(Feel free to ask more questions!)

Losing Faith In Contrarianism

omnizoid

16h

Crosspost from my blog.

If you spend a lot of time in the blogosphere, you’ll find a great many people expressing contrarian views. If you hang out in the circles that I do, you’ll probably have heard Yudkowsky say that dieting doesn’t really work, Guzey say that sleep is overrated, Hanson argue that medicine doesn’t improve health, various people argue for the lab leak, others argue for hereditarianism, Caplan argue that mental illness is mostly just aberrant preferences and that education doesn’t work, and various other people express contrarian views. Often, very smart people, like Robin Hanson, will write long posts defending these views, other people will have criticisms, and it will all be such a tangled mess that you don’t really know what to think about them.

For...


ChristianKl1h20

What makes you believe that Substack is to blame and not him unpublishing it?

2ChristianKl1h

He explicitly says that the people who argue that there's no gap are mistaken to argue that. He argues for the gap being small, not nonexistent. He does not use the term "near zero" himself.

1Jacob G-W2h

Noted, thanks.

5Viliam3h

I guess in the average case, the contrarian's conclusion is wrong, but it is also a reminder that the mainstream case is not communicated clearly, and is often exaggerated or supported by invalid arguments. For example:

* it's not that "dieting doesn't work", but that people naively assume that dieting is simple and effective ("if you just stop eating chocolate and start exercising for one hour every day, you will certainly lose weight", haha nope), even when the actual weight-loss research shows otherwise;
* it's not that "medicine doesn't improve health", but while some parts of medicine are very useful, other parts may be neutral or even harmful, and we often see that throwing more money at medicine does not actually improve the outcomes;
* it's not that "education doesn't work", but if you filter your students by intelligence and hard work, of course they will have better outcomes in life regardless of how good your teaching is, so the impact of education is probably vastly overestimated, and this also explains why so many pedagogical experiments succeed as a pilot project (when you try them with a small group of smart and motivated students) and then fail in mainstream education (when you try the same thing with average or below-average students);
* it's not that "opening the borders completely is a good idea", but a lot of potential value is lost by closing the borders to people who are neither fanatics nor criminals and could easily integrate into the new society.

There is also an opposite bad extreme to contrarians, the various "I fucking love science... although I do not understand it... but I enjoy attacking people on social networks who seem to disagree with the scientific consensus as I understand it" people. The ones who are sure that the professor or the doctor is always right, and that the latest educational fad is always correct.
