lc16m20

Robin Hanson has apparently asked the same thing. It seems like such a bizarre question to me:

Most people do not have the constitution or agency for criminal murder
Most companies do not have secrets large enough that assassinations would reduce the size of their problems on expectation
Most people who work at large companies don't really give a shit if that company gets fined and so they don't have the motivation to personally risk anything organizing murders to prevent lawsuits

4ChristianKl16h

Most companies don't threaten their employees with physical violence. According to another Boeing whistleblower Sam Salehpour, that seems to happen at Boeing. Being a defense contractor, I would expect Boeing corporate to have better relationships with the kind of people you would hire for such a task than corporations.

Reviewing the Structure of Current AI Regulations

Deric Cheng, Elliot_Mckernon

31m

This report is one in a series of ~10 posts comprising a 2024 State of the AI Regulatory Landscape Review, conducted by the Governance Recommendations Research Program at Convergence Analysis. Each post will cover a specific domain of AI governance (such as incident reporting, safety evals, model registries, and more). We’ll provide an overview of existing regulations, focusing on the US, EU, and China as the leading governmental bodies currently developing AI legislation. Additionally, we’ll discuss the relevant context behind each domain and conduct a short analysis.

This series is intended to be a primer for policymakers, researchers, and individuals seeking to develop a high-level overview of the current AI governance space. We’ll publish individual posts on our website and release a comprehensive report at the end of this series.

In this post,...

(Continue Reading – 3875 more words)

Rapid capability gain around supergenius level seems probable even without intelligence needing to improve intelligence

Towards_Keeperhood, Davanchama

20h

TLDR:

Around Einstein-level, relatively small changes in intelligence can lead to large changes in what one is capable to accomplish.
1. E.g. Einstein was a bit better than the other best physi at seeing deep connections and reasoning, but was able to accomplish much more in terms of impressive scientific output.
There are architectures where small changes can have significant effects on intelligence.
1. E.g. small changes in human-brain-hyperparameters: Einstein’s brain didn’t need to be trained on 3x the compute than normal physics professors for him to become much better at forming deep understanding, even without intelligence improving intelligence.

Einstein and the heavytail of human intelligence

1905 is often described as the "annus mirabilis" of Albert Einstein. He founded quantum physics by postulating the existence of (light) quanta, explained Brownian motion, introduced the special relativity theory and...

(Continue Reading – 1063 more words)

Towards_Keeperhood1h10

I think research on what you propose should definitely not be public and I'd recommend against publicly trying to push this alignment agenda.

1Towards_Keeperhood1h

(I think) Planck found the formula that matched the empirically observed distribution, but had no explanation for why it should hold. Einstein found the justification for this formula.

1RussellThor8h

OK but if that were true then there would have been many more Einstein like breakthroughs since then. More likely is that such low hanging fruit have been plucked and a similar intellect is well into diminishing returns. That is given our current technological society and >50 year history of smart people trying to work on everything if there are such breakthroughs to be made, then the IQ required is now higher than in Einsteins day.

4Lukas_Gloor12h

I lean towards agreeing with the takeaway; I made a similar argument here and would still bet on the slope being very steep inside the human intelligence level.

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

Georg Lange, Alex Makelov, Neel Nanda

Ω 348mo

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort

We would like to thank Atticus Geiger for his valuable feedback and in-depth discussions throughout this project.

tl;dr:

Activation patching is a common method for finding model components (attention heads, MLP layers, …) relevant to a given task. However, features rarely occupy entire components: instead, we expect them to form non-basis-aligned subspaces of these components.

We show that the obvious generalization of activation patching to subspaces is prone to a kind of interpretability illusion. Specifically, it is possible for a 1-dimensional subspace patch in the IOI task to significantly affect predicted probabilities by activating a normally dormant pathway outside the IOI circuit. At the same time, activation patching the entire MLP layer where this subspace lies has no such effect. We call this an "MLP-In-The-Middle" illusion.

We show a simple mathematical model of how this situation may arise more generally, and a priori / heuristic arguments for why it may be common in real-world LLMs.

Introduction

The linear representation hypothesis suggests that language models represent concepts as meaningful directions (or subspaces, for non-binary features) in the much larger space of possible activations. A central goal of mechanistic interpretability is to discover these subspaces and map them to interpretable variables, as they form the “units” of model computation.

However, the residual stream activations (and maybe even the neuron activations!) mostly don’t have a privileged basis. This means that many meaningful subspaces won’t be basis-aligned; rather than iterating over possible neurons and sets of neurons, we need to consider arbitrary subspaces of activations. This is a much larger search space! How can we navigate it?

A natural approach to check “how well” a subspace represents a concept is to use a subspace analogue of the activation patching technique. You run the model on input A, but with the activation along the subspace taken from an input B that differs from A only in the value of the concept in question. If the subspace encodes the information used by the model to distinguish B from A, we expect to see a corresponding change in model behavior (compared to just running on A).

Surpri...

Filip Sondej1h10

What if we constrain v to be in some subspace that is actually used by the MLP? (We can get it from PCA over activations on many inputs.)

This way v won't have any dormant component, so the MLP output after patching also cannot use that dormant pathway.

Beauty and the Bets

Ape in the coat

1mo

This is the ninth post in my series on Anthropics. The previous one is The Solution to Sleeping Beauty.

Introduction

There are some quite pervasive misconceptions about betting in regards to the Sleeping Beauty problem.

One is that you need to switch between halfer and thirder stances based on the betting scheme proposed. As if learning about a betting scheme is supposed to affect your credence in an event.

Another is that halfers should bet at thirders odds and, therefore, thirdism is vindicated on the grounds of betting. What do halfers even mean by probability of Heads being 1/2 if they bet as if it's 1/3?

In this post we are going to correct them. We will understand how to arrive to correct betting odds from both thirdist and halfist positions, and...

(Continue Reading – 3462 more words)

Duschkopf1h10

„Whether or not your probability model leads to optimal descision making is the test allowing to falsify it.“

Sure, I don‘t deny that. What I am saying is, that your probability model don‘t tell you which probability you have to base on a certain decision. If you can derive a probability from your model and provide a good reason to consider this probability relevant to your decision, your model is not falsified as long you arrive at the right decision. Suppose a simple experiment where the experimenter flips a fair coin and you have to guess if Tails or Hea... (read more)

use smileys! (reflections on how to make society's interpretive priors more charitable)

Emrik

Hi : )

I used to use smileys in my writing all the time (more than I do now!). but then I read Against Disclaimers, and I thought that every time I used a smiley I wud make people who don't use smileys seem less friendly (bc my conspicuous-friendliness wud be available as a contrast to others' behaviour). so instead, my strategy for maximizing friendliness in the world became:

if I just have the purest of kindness in my heart while I interacting with ppl, and use plain words with no extra signalling, I will make plain words seem more friendly in general.

this was part of a general heuristic strategy: "to marginally move society in the direction of a better interpretive equilibrium, just act like that equilibrium is already...

(See More – 101 more words)

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)

Does reducing the amount of RL for a given capability level make AI safer?

Chris_Leong, porby

Some people have suggested that a lot of the danger of training a powerful AI comes from reinforcement learning. Given an objective, RL will reinforce any method of achieving the objective that the model tries and finds to be successful including things like deceiving us or increasing its power.

If this were the case, then if we want to build a model with capability level X, it might make sense to try to train that model either without RL or with as little RL as possible. For example, we could attempt to achieve the objective using imitation learning instead.

However, if, for example, the alternate was imitation learning, it would be possible to push back and argue that this is still a black-box that uses gradient descent so we...

(See More – 82 more words)

2Chris_Leong9h

You mention that society may do too little of the safer types of RL. Can you clarify what you mean by this?

5porby10h

Calling MuZero RL makes sense. The scare quotes are not meant to imply that it's not "real" RL, but rather that the category of RL is broad enough that it belonging to it does not constrain expectation much in the relevant way. The thing that actually matters is how much the optimizer can roam in ways that are inconsistent with the design intent. For example, MuZero can explore the superhuman play space during training, but it is guided by the structure of the game and how it is modeled. Because of that structure, we can be quite confident that the optimizer isn't going to wander down a path to general superintelligence with strong preferences about paperclips.

Steven Byrnes2h40

Right, and that wouldn’t apply to a model-based RL system that could learn an open-ended model of any aspect of the world and itself, right?

I think your “it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function” should have some caveat that it only clearly applies to currently-known techniques. In the future there could be better automatic-world-model-builders, and/or future generic techniques to do automatic unsupervised reward-shaping for an arbitrary reward, such that AIs could find out-of-the-box ways to solve hard problems without handholding.

Some Experiments I'd Like Someone To Try With An Amnestic

johnswentworth

A couple years ago, I had a great conversation at a research retreat about the cool things we could do if only we had safe, reliable amnestic drugs - i.e. drugs which would allow us to act more-or-less normally for some time, but not remember it at all later on. And then nothing came of that conversation, because as far as any of us knew such drugs were science fiction.

… so yesterday when I read Eric Neyman’s fun post My hour of memoryless lucidity, I was pretty surprised to learn that what sounded like a pretty ideal amnestic drug was used in routine surgery. A little googling suggested that the drug was probably a benzodiazepine (think valium). Which means it’s not only a great amnestic, it’s also apparently one...

(See More – 589 more words)

Algon2h21

@habryka this comment has an anomalous amount of karma. It showed up on popular comments, I think, and I'm wondering if people liked the comment when they saw it there which lead to a feedback loop of more eyeballs on the comment, more likes, more eyeball etc. If so, is that the intended behaviour of the popular comments feature? It seems like it shouldn't be.

3the gears to ascension17h

Yeah there are definitely tasks that depressants would be expected to leave intact. I'd guess it's correlated strongly with degree of working memory required.

If You Demand Magic, Magic Won't Help

168

Eliezer Yudkowsky

16y

Most witches don't believe in gods. They know that the gods exist, of course. They even deal with them occasionally. But they don't believe in them. They know them too well. It would be like believing in the postman.
—Terry Pratchett, Witches Abroad

Once upon a time, I was pondering the philosophy of fantasy stories—

And before anyone chides me for my "failure to understand what fantasy is about", let me say this: I was raised in an SF&F household. I have been reading fantasy stories since I was five years old. I occasionally try to write fantasy stories. And I am not the sort of person who tries to write for a genre without pondering its philosophy. Where do you think story ideas come from?

Anyway:

I was...

(See More – 949 more words)

David James2h10

Let's step back. This thread of the conversation is rooted in this claim: "Let's be honest: all fiction is a form of escapism.". Are we snared in the [Disputing Definitions](https://www.lesswrong.com/posts/7X2j8HAkWdmMoS8PE/disputing-definitions) trap? To quote from that LW article:

> if the issue arises, both sides should switch to describing the event in unambiguous lower-level constituents, like acoustic vibrations or auditory experiences. Or each side could designate a new word, like 'alberzle' and 'bargulum', to use for what they respectively ... (read more)

LESSWRONG
LW

Quick Takes

Popular Comments

Recent Discussion

Einstein and the heavytail of human intelligence

tl;dr:

Introduction

Introduction

Fooming Shoggoths Dance Concert

June 1st at LessOnline