This is a linkpost for https://www.maximum-progress.com/p/ai-regulation-is-unsafe

Concerns over AI safety and calls for government control over the technology are highly correlated but they should not be.

There are two major forms of AI risk: misuse and misalignment. Misuse risks come from humans using AIs as tools in dangerous ways. Misalignment risks arise if AIs take their own actions at the expense of human interests.

Governments are poor stewards for both types of risk. Misuse regulation is like the regulation of any other technology. There are reasonable rules that the government might set, but omission bias and incentives to protect small but well organized groups at the expense of everyone else will lead to lots of costly ones too. Misalignment regulation is not in the Overton window for any government. Governments do not have strong incentives...

(Continue Reading – 1176 more words)

4Matthew Barnett1h

Makes sense. By comparison, my own unconditional estimate of p(doom) is not much higher than 10%, and so it's hard on my view for any intervention to have a double-digit percentage point effect. The crude mortality rate before the pandemic was about 0.7%. If we use that number to estimate the direct cost of a 1-year pause, then this is the bar that we'd need to clear for a pause to be justified. I find it plausible that this bar could be met, but at the same time, I am also pretty skeptical of the mechanisms various people have given for how a pause will help with AI safety.

2Daniel Kokotajlo29m

I agree that 0.7% is the number to beat for people who mostly focus on helping present humans and who don't take acausal or simulation argument stuff or cryonics seriously. I think that even if I was much more optimistic about AI alignment, I'd still think that number would be fairly plausibly beaten by a 1-year pause that begins right around the time of AGI. What are the mechanisms people have given and why are you skeptical of them?

ryan_greenblatt5m20

(Surely cryonics doesn't matter given a realistic actions space? Usage of cryonics is extremely rare. I agree that simulation arguments and similar considerations maybe imply that "helping current humans" is either incoherant or unimportant.)

3Amalthea36m

Somewhat of a nitpick, but the relevant number would be p(doom | strong AGI being built) (maybe contrasted with p(utopia | strong AGI)) , not overall p(doom).

Spatial attention as a “tell” for empathetic simulation?

Steven Byrnes

10h

(Half-baked work-in-progress. There might be a “version 2” of this post at some point, with fewer mistakes, and more neuroscience details, and nice illustrations and pedagogy etc. But it’s fun to chat and see if anyone has thoughts.)

1. Background

There’s a neuroscience problem that’s had me stumped since almost the very beginning of when I became interested in neuroscience at all (as a lens into AGI safety) back in 2019. But I think I might finally have “a foot in the door” towards a solution!

What is this problem? As described in my post Symbol Grounding and Human Social Instincts, I believe the following:

(1) We can divide the brain into a “Learning Subsystem” (cortex, striatum, amygdala, cerebellum and a few other areas) on the one hand, and a “Steering Subsystem”

...

(Continue Reading – 2186 more words)

Gunnar_Zarncke9m20

If step 5 is indeed grounded in the spatial attention being on other people, this should be testable! For example, people who pay less spatial attention to other people should feel less intense social emotions - because the steering system circuit gets activated less often and weaker. And I think that is the case. At least ChatGPT has some confirming evidence, though it's not super clear and I haven't yet looked deeper into it.

2Gunnar_Zarncke1h

The vestibular system can detect whether you look up or down. It could be that the reflex triggers when you a) look down (vestibular system) and b) have a visual parallax that indicates depth (visual system). Should be easy to test by closing one eye. Alternatively, it is the degree of accommodation of the lens. That should be testable by looking down with a lens that forces accommodation on short distances. The negative should also be testable by asking congenitally blind people about their experience with this feeling of dizziness close to a rim.

6interstice4h

Tangentially related: some advanced meditators report that their sense that perception has a center vanishes at a certain point along the meditative path, and this is associated with a reduction in suffering.

3Carl Feynman4h

You write: …But I think people can be afraid of heights without past experience of falling… I have seen it claimed that crawling-age babies are afraid of heights, in that they will not crawl from a solid floor to a glass platform over a yawning gulf. And they’ve never fallen into a yawning gulf. At that age, probably all the heights they’ve fallen from have been harmless, since the typical baby is both bouncy and close to the ground.

Breadboarding a Whistle Synth

jefftk

10h

With my electronic harp mandolin project I've been enjoying working with analog and embedded audio hardware. And a few weeks ago, after reading about Ugo Conti's whistle-controlled synth I wrote to him, he gave me a call, and we had a really interesting conversation. And my existing combination of hardware for my whistle synth [1] is bulky and expensive. Which has me excited about a new project: I'd like to make an embedded version.

Yesterday I got started on the first component: getting audio into the microcontroller. I want to start with a standard dynamic mic, so I can keep using the same mic for talkbox and whistle synth, so it should take standard balanced audio on XLR as input. In a full version this would need an XLR port, but for now I...

(See More – 489 more words)

4Richard_Kennaway5h

You want those in parallel for them to add. The series combination (which I see in the breadboard pic, not just the text) is only 2µF, making your high-pass frequency a little over 10kHz.

jefftk18m20

Whoops! You're right! Will do.

Johannes C. Mayer's Shortform

Johannes C. Mayer

1Johannes C. Mayer4h

Research Writing Workflow: First figure stuff out * Do research and first figure stuff out, until you feel like you are not confused anymore. * Explain it to a person, or a camera, or ideally to a person and a camera. * If there are any hiccups expand your understanding. * Ideally, as the last step, explain it to somebody whom you have not ever explained it to. * Only once you made a presentation without hiccups you are ready to write post. * If you have a recording this is useful as a starting point.

lukehmiles19m10

I like the rough thoughts way though. I'm not here to like read a textbook.

2Carl Feynman10h

I would highly recommend getting someone else to debug your subconscious for you. At least it worked for me. I don’t think it would be possible for me to have debugged myself. My first therapist was highly directive. He’d say stuff like “Try noticing when you think X, and asking yourself what happened immediately before that. Report back next week.” And listing agenda items and drawing diagrams on a whiteboard. As an engineer, I loved it. My second therapist was more in the “providing supportive comments while I talk about my life” school. I don’t think that helped much, at least subjectively from the inside. Here‘s a possibly instructive anecdote about my first therapist. Near the end of a session, I feel like my mind has been stretched in some heretofore-unknown direction. It’s a sensation I’ve never had before. So I say, “Wow, my mind feels like it’s been stretched in some heretofore-unknown direction. How do you do that?” He says, “Do you want me to explain?” And I say, “Does it still work if I know what you’re doing?” And he says, “Possibly not, but it’s important you feel I’m trustworthy, so I’ll explain if you want.” So I say “Why mess with success? Keep doing the thing. I trust you.” That’s an example of a debugging procedure you can’t do to yourself.

On Not Pulling The Ladder Up Behind You

Screwtape

Epistemic Status: Musing and speculation, but I think there's a real thing here.

I.

When I was a kid, a friend of mine had a tree fort. If you've never seen such a fort, imagine a series of wooden boards secured to a tree, creating a platform about fifteen feet off the ground where you can sit or stand and walk around the tree. This one had a rope ladder we used to get up and down, a length of knotted rope that was tied to the tree at the top and dangled over the edge so that it reached the ground.

Once you were up in the fort, you could pull the ladder up behind you. It was much, much harder to get into the fort without the ladder....

(Continue Reading – 2402 more words)

Ericf33m20

Improving Dictionary Learning with Gated Sparse Autoencoders

Neel Nanda, Senthooran Rajamanoharan, Arthur Conmy, lsgos, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah

Ω 391d

This is a linkpost for https://arxiv.org/abs/2404.16014

Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders!

Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over normal SAEs, verified on models up to Gemma 7B. They are now our team's preferred way to train sparse autoencoders, and we'd love to see them adopted by the community! (Or to be convinced that it would be a bad idea for them to be adopted by the community!)

They achieve similar reconstruction with about half as many firing features, and while being either comparably or more interpretable (confidence interval for the increase is 0%-13%).

See Sen's Twitter summary, my Twitter summary, and the paper!

Arthur Conmy44mΩ110

We use learning rate 0.0003 for all Gated SAE experiments, and also the GELU-1L baseline experiment. We swept for optimal baseline learning rates on GELU-1L for the baseline SAE to generate this value.

For the Pythia-2.8B and Gemma-7B baseline SAE experiments, we divided the L2 loss by $E | | x | |_{2}$ , motivated by wanting better hyperparameter transfer, and so changed learning rate to 0.001 or 0.00075 for all the runs (currently in Figure 1, only attention output pre-linear uses 0.00075. In the rerelease we'll state all the values used). We didn't see n... (read more)

11Senthooran Rajamanoharan12h

UPDATE: we've corrected equations 9 and 10 in the paper (screenshot of the draft below) and also added a footnote that hopefully helps clarify the derivation. I've also attached a revised figure 6, showing that this doesn't change the overall story (for the mathematical reasons I mentioned in my previous comment). These will go up on arXiv, along with some other minor changes (like remembering to mention SAEs' widths), likely some point next week. Thanks again Sam for pointing this out! Updated equations (draft): Updated figure 6 (shrinkage comparison for GELU-1L):

4Dan Braun16h

This is neat, nice work! I'm finding it quite hard to get a sense at what the actual Loss Recovered numbers you report are, and to compare them concretely to other work. If possible, it'd be very helpful if you shared: 1. What the zero ablations CE scores are for each model and SAE position. (I assume it's much worse for the MLP and attention outputs than the residual stream?) 2. What the baseline CE scores are for each model.

2Rohin Shah18h

This suggestion seems less expressive than (but similar in spirit to) the "rescale & shift" baseline we compare to in Figure 9. The rescale & shift baseline is sufficient to resolve shrinkage, but it doesn't capture all the benefits of Gated SAEs. The core point is that L1 regularization adds lots of biases, of which shrinkage is just one example, so you want to localize the effect of L1 as much as possible. In our setup L1 applies to ReLU(πgate(x)), so you might think of πgate as "tainted", and want to use it as little as possible. The only thing you really need L1 for is to deter the model from setting too many features active, i.e. you need it to apply to one bit per feature (whether that feature is on / off). The Heaviside step function makes sure we are extracting just that one bit, and relying on fmag for everything else.

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)

We are headed into an extreme compute overhang

devrandom

If we achieve AGI-level performance using an LLM-like approach, the training hardware will be capable of running ~1,000,000s concurrent instances of the model.

Definitions

Although there is some debate about the definition of compute overhang, I believe that the AI Impacts definition matches the original use, and I prefer it: "enough computing hardware to run many powerful AI systems already exists by the time the software to run such systems is developed". A large compute overhang leads to additional risk due to faster takeoff.

I use the types of superintelligence defined in Bostrom's Superintelligence book (summary here).

I use the definition of AGI in this Metaculus question. The adversarial Turing test portion of the definition is not very relevant to this post.

Thesis

Due to practical reasons, the compute requirements for training LLMs...

(See More – 408 more words)

lukehmiles1h24

This seems correct and important to me.

8ryan_greenblatt2h

See also Before smart AI, there will be many mediocre or specialized AIs.

6faul_sname2h

I think this only holds if fine tunes are composable, which as far as I can tell they aren't (fine tuning on one task subtly degrades performance on a bunch of other tasks, which isn't a big deal if you fine tune a little for performance on a few tasks but does mean you probably can't take a million independently-fine-tuned models and merge them into a single super model of the same size with the same performance on all million tasks). Also there are sometimes mornings where I can't understand code I wrote the previous night when I had all of the necessary context fresh to me, despite being the same person. I expect that LLMs will exhibit the same behavior of some things being hard to understand when examined out of the context which generated them. That's not to say a worldin which there are a billion copies of GPT-5 running concurrently will have no major changes, but I don't think a single coherent ASI falls out of that world.

1snewman3h

Nit: you mixed up 30 and 40 here (should both be 30 or both be 40). If you train a model with 10x as many parameters, but use the same training data, then it will cost 10x as much to train and 10x as much to operate, so the ratios will hold. In practice, I believe it is universal to use more training data when training larger models? Implying that the ratio would actually increase (which further supports your thesis). On the other hand, the world already contains over 8 billion human intelligences. So I think you are assuming that a few million AGIs, possibly running at several times human speed (and able to work 24/7, exchange information electronically, etc.), will be able to significantly "outcompete" (in some fashion) 8 billion humans? This seems worth further exploration / justification.

Examples of Highly Counterfactual Discoveries?

149

johnswentworth, kromem

The history of science has tons of examples of the same thing being discovered multiple time independently; wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity's trajectory, then it makes sense to focus on such examples.

But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyways, then the discovery probably wasn't very counterfactually impactful.

Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement wikipedia's list of multiple discoveries.

To...

(See More – 189 more words)

kave1h20

Maybe "counterfactually robust" is an OK phrase?

1Johannes C. Mayer4h

A few adjacent thoughts: * Why is a programming language like Haskell that is extremely powerful in the sense that if your program compiles, it is the program that you want with a very high probability because most stupid mistakes are now compile errors? * Why is there basically no widely used homoiconic language, i.e. a language in which you can use the language itself to <reason about the language/manipulate the language>. Here we have some technology that is basically ready to use (Haskell or Clojure), but people decide to mostly not use them. And with people, I mean professional programmers and companions who make software. * Why did nobody invent Rust earlier, by which I mean a system-level programming language that prevents you from making really dumb mistakes that can be machine-checked if you make them? * Why did it take like 40 years to get a latex replacement, even though latex is terrible in very obvious ways? These things have in common that there is a big engineering challenge. It feels like maybe this explains it, together with that people who would benefit from these technologies where in the position that the cost of creating them would have exceeded the benefit that they would expect from them. For Haskell and Clojure we can also consider this point. Certainly, these two technologies have their flaws and could be improved. But then again we would have a massive engineering challenge.

4Alexander Gietelink Oldenziel5h

I would not say that the central insight of SLT is about priors. Under weak conditions the prior is almost irrelevant. Indeed, the RLCT is independent of the prior under very weak nonvanishing conditions. The story that symmetries mean that the parameter-to-function map is not injective is true but already well-understood outside of SLT. It is a common misconception that this is what SLT amounts to. To be sure - generic symmetries are seen by the RLCT. But these are, in some sense, the uninteresting ones. The interesting thing is the local singular structure and its unfolding in phase transitions during training. The issue of the true distribution not being contained in the model is called 'unrealizability' in Bayesian statistics. It is dealt with in Watanabe's second 'green' book. Nonrealizability is key to the most important insight of SLT contained in the last sections of the second to last chapter of the green book: algorithmic development during training through phase transitions in the free energy. I don't have the time to recap this story here.

3mattmacdermott4h

Lucius-Alexander SLT dialogue?

Superposition is not "just" neuron polysemanticity

LawrenceC

Ω 201h

TL;DR: In this post, I distinguish between two related concepts in neural network interpretability: polysemanticity and superposition. Neuron polysemanticity is the observed phenomena that many neurons seem to fire (have large, positive activations) on multiple unrelated concepts. Superposition is a specific explanation for neuron (or attention head) polysemanticity, where a neural network represents more sparse features than there are neurons (or number of/dimension of attention heads) in near-orthogonal directions. I provide three ways neurons/attention heads can be polysemantic without superposition: non--neuron aligned orthogonal features, non-linear feature representations, and compositional representation without features. I conclude by listing a few reasons why it might be important to distinguish the two concepts.

Epistemic status: I wrote this “quickly” in about 12 hours, as otherwise it wouldn’t have come out at all. Think of...

(Continue Reading – 3632 more words)

LESSWRONG
LW

Quick Takes

Popular Comments

Recent Discussion

1. Background

I.

Definitions

Thesis

LessOnline

A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA