Summary

We study language models' capability to perform parallel reasoning in one forward pass. To do so, we test GPT-3.5's ability to solve (in one token position) one or two instances of algorithmic problems. We consider three different problems: repeatedly iterating a given function, evaluating a mathematical expression, and calculating terms of a linearly recursive sequence.
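To make the first of these tasks concrete, here is a minimal sketch of function iteration, assuming the function is a permutation of a small set. The helper `iterate` and the particular mapping `f` are hypothetical, chosen only for illustration; they are not taken from the post.

```python
def iterate(f, x, k):
    """Apply the mapping f (given as a dict) to x a total of k times."""
    for _ in range(k):
        x = f[x]
    return x

# Hypothetical permutation of {0, ..., 5}, for illustration only.
f = {0: 3, 1: 0, 2: 5, 3: 1, 4: 2, 5: 4}

# Computing f(f(x)) takes two sequential lookups; the question in the post
# is how many such steps a model can carry out within one token position.
two_steps = iterate(f, 0, 2)
```

Each extra iteration adds one serial step, which is what makes the task a convenient way to count how many steps fit in a single forward pass.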

We found no evidence for parallel reasoning in algorithmic problems: The total number of steps the model could perform when handed two independent tasks was comparable to (or less than) the number of steps it could perform when given one task.

Motivation

Broadly, we are interested in AI models' capability to perform hidden cognition: Agendas such as scalable oversight and AI control rely (to some degree) on our ability to supervise and bound models' thinking....


1Olli Järviniemi24m

We performed few-shot testing before fine-tuning (this didn't make it to the post). I reran some experiments on the permutation iteration problem, and got similar results as before: for one function (and n = 6), the model got ~60% accuracy for k=2, but not great[1] accuracy for k=3. For two functions, it already failed at the f(x)+g(y) problem. (This was with 50 few-shot examples; gpt-3.5-turbo-0125 only allows 16k tokens.)

So fine-tuning really does give considerably better capabilities than simply many-shot prompting.

Let me clarify that with fine-tuning, our intent wasn't so much to create or teach the model new capabilities, but to elicit the capabilities the model already has. (Cf. Hubinger's When can we trust model evaluations?, section 3.) I admit that it's not clear where to draw the line between teaching and eliciting, though.

Relatedly, I do not mean to claim that one simply cannot construct a 175B model that successfully performs nested addition and multiplication. Rather, I'd take the results as evidence for GPT-3.5 not doing much parallel reasoning off-the-shelf (e.g. with light fine-tuning). I could see this being consistent with the data multiplexing paper (they do much heavier training). I'm still confused, though.

I tried to run experiments on open source models with full fine-tuning, but it does, in fact, require much more RAM. I don't currently have the multi-GPU setups required to do full fine-tuning on even 7B models (I could barely fine-tune Pythia-1.4B on a single A100, and did not get much oomph out of it). So I'm backing down; if someone else is able to do proper tests here, go ahead.

1. ^ Note that while you can get 1/6 accuracy trivially, you can get 1/5 if you realize that the data is filtered so that f^k(x)≠x, and 1/4 if you also realize that f^k(x)≠f(x) (and are able to compute f(x)), ...

Ann5m10

Going to message you a suggestion I think.

yanni's Shortform

yanni kyriacos

2mo

yanni kyriacos5m10

Please help me find research on aspiring AI Safety folk!

I am two weeks into the strategy development phase of my movement building and almost ready to start ideating some programs for the year.

But I want these programs to solve the biggest pain points people experience when trying to have a positive impact in AI Safety.

Has anyone seen any research that looks at this in depth? For example, through an interview process and then a survey to quantify how painful the pain points are?

Some examples of pain points I've observed so far through my interviews wit...

Open Thread Spring 2024

habryka

2mo

If it’s worth saying, but not worth its own post, here's a place to put it.

If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.

If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.

If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.

The Open Thread tag is here. The Open Thread sequence is here.

Linch33m120

Do we know if @paulfchristiano or other ex-lab people working on AI policy have non-disparagement agreements with OpenAI or other AI companies? I know Cullen doesn't, but I don't know about anybody else.

I know NIST isn't a regulatory body, but it still seems like standards-setting should be done by people who have no unusual legal obligations.

To be clear, I want to differentiate between Non-Disclosure Agreements, which are perfectly sane and reasonable, at least in a limited form, as a way to prevent leaking trade secrets, and non-disparagement agree...

3spencerkaplan1h

Hi everyone! I'm new to LW and wanted to introduce myself. I'm from the SF Bay Area and working on my PhD in anthropology. I study AI safety, and I'm mainly interested in research efforts that draw methods from the human sciences to better understand present and future models. I'm also interested in AI safety's sociocultural dynamics, including how ideas circulate in the research community and how uncertainty figures into our interactions with models. All thoughts and leads are welcome. This work led me to LW. At first all the content was overwhelming, but now there's much I appreciate. It's my go-to place for developments in the field and informed responses. More broadly, learning about rationality through the Sequences and other posts is helping me improve my work as a researcher, and I'm looking forward to continuing this process.

3habryka1h

Welcome! I hope you have a good time here!

"If we go extinct due to misaligned AI, at least nature will continue, right? ... right?"

plex

This is a linkpost for https://aisafety.info/questions/NM1Y/If-we-go-extinct-due-to-misaligned-AI,-at-least-nature-will-continue,-right

[memetic status: stating directly despite it being a clear consequence of core AI risk knowledge because many people have "but nature will survive us" antibodies to other classes of doom and misapply them here.]

Unfortunately, no.^[1]

Technically, “Nature”, meaning the fundamental physical laws, will continue. However, people usually mean forests, oceans, fungi, bacteria, and generally biological life when they say “nature”, and those would not have much chance competing against a misaligned superintelligence for resources like sunlight and atoms, which are useful to both biological and artificial systems.

There’s a thought that comforts many people when they imagine humanity going extinct due to a nuclear catastrophe or runaway global warming: Once the mushroom clouds or CO2 levels have settled, nature will reclaim the cities. Maybe mankind in our hubris will have wounded Mother Earth and paid the price ourselves, but...


ryan_greenblatt43m20

I've thought a bit about actions to reduce the probability that AI takeover involves violent conflict.

I don't think there are any amazing-looking options. If governments were generally more competent, that would help.

Having some sort of apparatus for negotiating with rogue AIs could also help, but I expect this is politically infeasible and not that leveraged to advocate for on the margin.

2Mitchell_Porter9h

In preparation for what?

1jaan6h

AI takeover.

10owencb10h

OK hmm I think I understand what you mean. I would have thought about it like this:

* "our reference class" includes roughly the observations we make before observing that we're very early in the universe
* This includes stuff like being a pre-singularity civilization
* The anthropics here suggest there won't be lots of civs later arising and being in our reference class and then finding that they're much later in universe histories
* It doesn't speak to the existence or otherwise of future human-observer moments in a post-singularity civilization

... but as you say anthropics is confusing, so I might be getting this wrong.

Introducing AI Lab Watch

211

Zach Stein-Perlman

20d

This is a linkpost for https://ailabwatch.org

I'm launching AI Lab Watch. I collected actions for frontier AI labs to improve AI safety, then evaluated some frontier labs accordingly.

It's a collection of information on what labs should do and what labs are doing. It also has some adjacent resources, including a list of other safety-ish scorecard-ish stuff.

(It's much better on desktop than mobile — don't read it on mobile.)

It's in beta—leave feedback here or comment or DM me—but I basically endorse the content and you're welcome to share and discuss it publicly.

It's unincorporated, unfunded, not affiliated with any orgs/people, and is just me.

Some clarifications and disclaimers.

How you can help:

Give feedback on how this project is helpful or how it could be different to be much more helpful
Tell me what's wrong/missing; point me to sources

...


Martin Vlach1h20

So is the Alignment program to be updated to 0 for OpenAI now that the Superalignment team is no more? (https://docs.google.com/document/d/1uPd2S00MqfgXmKHRkVELz5PdFRVzfjDujtu8XLyREgM/edit?usp=sharing)

Scientific Notation Options

jefftk

When working with numbers that span many orders of magnitude it's very helpful to use some form of scientific notation. At its core, scientific notation expresses a number by breaking it down into a decimal ≥1 and <10 (the "significand" or "mantissa") and an integer representing the order of magnitude (the "exponent"). Traditionally this is written as:

3 × 10⁴

While this communicates the necessary information, it has two main downsides:

It uses three constant characters ("× 10") to separate the significand and exponent.
It uses superscript, which doesn't work with some typesetting systems, adds awkwardly large line spacing at the best of times, and is generally lost on cut-and-paste.

Instead, I'm a big fan of e-notation, commonly used in programming and on calculators. This looks like:

3e4

This works everywhere, doesn't mess up your line spacing, and requires half as...
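As a small illustration of the "used in programming" point, here is a sketch in Python, which reads e-notation natively. The helper `to_e_notation` is a hypothetical name introduced here; it only trims the padding (`e+04`) that Python's built-in formatting adds.

```python
def to_e_notation(x, digits=0):
    """Format x in compact e-notation, e.g. 30000 -> '3e4'."""
    # Python's own 'e' format gives '3e+04'; strip the sign and zero padding.
    mantissa, exponent = f"{x:.{digits}e}".split("e")
    return f"{mantissa}e{int(exponent)}"

parsed = float("3e4")               # e-notation parses directly
compact = to_e_notation(30000)      # significand 3, exponent 4
small = to_e_notation(0.00125, 2)   # negative exponents work the same way
```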


Three-Monkey Mind1h10

I'd like to second this comment, at least broadly. I've seen the e notation in blog posts and the like and I've struggled to put the × 10 in the right place.

One of the reasons I dislike trying to understand numbers written in scientific notation is that I have trouble mapping them to normal numbers with lots of commas in them. Engineering notation helps a lot with this (at least for numbers greater than 1) by having the exponent be a multiple of 3. Oftentimes, losing significant figures isn't an issue in anything but the most technical scientific writing.
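For concreteness, here is a minimal sketch of engineering notation combined with e-notation, assuming three significant figures by default. The function name `to_engineering` is hypothetical and introduced only for this example.

```python
def to_engineering(x, sig=3):
    """Format x so the exponent is a multiple of 3, e.g. 30000 -> '30e3'."""
    # Round to `sig` significant figures first, then read off the exponent.
    mantissa_str, exp_str = f"{x:.{sig - 1}e}".split("e")
    exponent = int(exp_str)
    eng_exponent = 3 * (exponent // 3)  # round down to a multiple of 3
    mantissa = float(mantissa_str) * 10 ** (exponent - eng_exponent)
    return f"{mantissa:.{sig}g}e{eng_exponent}"

thirty_thousand = to_engineering(30000)
millions = to_engineering(1234567)
```

Because the exponent lines up with thousands separators, `30e3` reads as "30 thousand" and `1.23e6` as "1.23 million".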


Interpretability: Integrated Gradients is a decent attribution method

StefanHex, Lucius Bushnaq, jake_mendel, Kaarel

by Lucius Bushnaq, Jake Mendel, Kaarel Hänni, Stefan Heimersheim.

A short post laying out our reasoning for using integrated gradients as an attribution method. It is intended as a stand-alone post based on our LIB papers [1] [2]. This work was produced at Apollo Research.
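For readers unfamiliar with the method, here is a minimal sketch of the standard integrated gradients formula: the attribution to input i is (x_i − b_i) times the gradient averaged along the straight path from a baseline b to x. It is not necessarily the exact variant used in the post; the toy function, its hand-written gradient, the zero baseline, and all names are assumptions made for illustration.

```python
def integrated_gradients(grad_f, x, baseline, steps=1000):
    """Midpoint-rule approximation of integrated gradients."""
    n = len(x)
    avg_grad = [0.0] * n
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint of the s-th path segment
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    # Scale the path-averaged gradient by the input displacement.
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]

# Toy function f(x) = x0 * x1 + x2**2 with its analytic gradient.
def f(x):
    return x[0] * x[1] + x[2] ** 2

def grad_f(x):
    return [x[1], x[0], 2 * x[2]]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
```

A useful sanity check is completeness: the attributions should sum to f(x) − f(baseline), which here is 11.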

Context

Understanding circuits in neural networks requires understanding how features interact with other features. There are a lot of features and their interactions are generally non-linear. A good starting point for understanding the interactions might be to just figure out how strongly each pair of features in adjacent layers of the network interacts. But since the relationships are non-linear, how do we quantify their 'strength' in a principled manner that isn't vulnerable to common and simple counterexamples? In other words, how do we quantify how much the...


tailcalled1h20

We now have a method for how to do attributions on single data points. But when we're searching for circuits, we're probably looking for variables that have strong attributions between each other on average, measured over many data points.

Maybe?

One thing I've been thinking about a lot recently is that building tools to interpret networks on individual datapoints might be more relevant than attributing over a dataset. This applies if the goal is to make statistical generalizations since a richer structure on an individual datapoint gives you more to generalize wi...

How I got so excited about HowTruthful

Bruce Lewis

6mo

This is the script for a video I made about my current full-time project. I think the LW community will understand its value better than the average person I talk to does.

Hi, I'm Bruce Lewis. I'm a computer programmer. For a long time, I've been fascinated by how computers can help people process information. Lately I've been thinking about and experimenting with ways that computers help people process lines of reasoning. This video will catch you up on the series of thoughts and experiments that led me to HowTruthful, and tell you why I'm excited about it. This is going to be a long video, but if you're interested in how people arrive at truth, it will be worth it.

Ten or 15 years ago I noticed how...


DPiepgrass1h10

I like that HowTruthful uses the idea of (independent) hierarchical subarguments, since I had the same idea. Have you been able to persuade very many to pay for it?

My first thought about it was that the true/false scale should have two dimensions, knowledge & probability:

One of the many things I wanted to do on my site was to gather user opinions, and this does that. ✔ I think of opinions as valuable evidence, just not always valuable evidence about the question under discussion (though to the extent people with "high knowledge" really have high knowle...

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

Marius Hobbhahn, Lucius Bushnaq, jake_mendel, Dan Braun, StefanHex, Nicholas Goldowsky-Dill, Kaarel, Avery, Joern Stoehler, debrevitatevitae, Magdalena Wache

Ω 224h

This is a linkpost for our two recent papers:

An exploration of using degeneracy in the loss landscape for interpretability https://arxiv.org/abs/2405.10927
An empirical test of an interpretability technique based on the loss landscape https://arxiv.org/abs/2405.10928

This work was produced at Apollo Research in collaboration with Kaarel Hänni (Cadenza Labs), Avery Griffin, Joern Stoehler, Magdalena Wache and Cindy Wu. Not to be confused with Apollo's recent Sparse Dictionary Learning paper.

A key obstacle to mechanistic interpretability is finding the right representation of neural network internals. Optimally, we would like to derive our features from some high-level principle that holds across different architectures and use cases. At a minimum, we know two things:

We know that the training loss goes down during training. Thus, the features learned during training must be determined by the loss

...


tailcalled1h20

I was thinking along similar lines, but eventually dropped it because I felt like the gradients would likely miss something if e.g. a saturated softmax prevents any gradient from going through. I find it interesting that the experiments also found that the interaction basis didn't work, and I wonder whether any of the failure here is due to saturated softmaxes.
