Robin Hanson has apparently asked the same thing. It seems like such a bizarre question to me:
This report is one in a series of ~10 posts comprising a 2024 State of the AI Regulatory Landscape Review, conducted by the Governance Recommendations Research Program at Convergence Analysis. Each post will cover a specific domain of AI governance (such as incident reporting, safety evals, model registries, and more). We’ll provide an overview of existing regulations, focusing on the US, EU, and China as the leading governments currently developing AI legislation. Additionally, we’ll discuss the relevant context behind each domain and conduct a short analysis.
This series is intended to be a primer for policymakers, researchers, and individuals seeking to develop a high-level overview of the current AI governance space. We’ll publish individual posts on our website and release a comprehensive report at the end of this series.
In this post,...
TLDR:
1905 is often described as the "annus mirabilis" of Albert Einstein. He helped found quantum physics by postulating the existence of (light) quanta, explained Brownian motion, introduced the theory of special relativity, and...
I think research on what you propose should definitely not be public and I'd recommend against publicly trying to push this alignment agenda.
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort
We would like to thank Atticus Geiger for his valuable feedback and in-depth discussions throughout this project.
Activation patching is a common method for finding model components (attention heads, MLP layers, …) relevant to a given task. However, features rarely occupy entire components: instead, we expect them to form non-basis-aligned subspaces of these components.
We show that the obvious generalization of activation patching to subspaces is prone to a kind of interpretability illusion. Specifically, it is possible for a 1-dimensional subspace patch in the IOI task to significantly affect predicted probabilities by activating a normally dormant pathway outside the IOI circuit. At the same time, activation patching the entire MLP layer where this subspace lies has no such effect. We call this an "MLP-In-The-Middle" illusion.
We show a simple mathematical model of how this situation may arise more generally, and a priori / heuristic arguments for why it may be common in real-world LLMs.
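To gesture at how such an illusion can arise, here is a minimal toy setup (my own hedged reconstruction, not necessarily the exact model from the post). Suppose on-distribution activations lie in a subspace $U$, and a downstream readout $f(x) = w^\top x$ has a component $w_d$ orthogonal to $U$, which is therefore dormant on-distribution. Patching along a unit direction $v = \alpha u + \beta d$ (with $u \in U$ and $d \perp U$) gives

$$x_{\text{patched}} = x_A + \big(v^\top (x_B - x_A)\big)\, v, \qquad \Delta f = \alpha\, u^\top (x_B - x_A)\,\big(\alpha\, w^\top u + \beta\, w_d^\top d\big),$$

so an on-distribution difference along $u$ gets routed through the dormant component $\beta d$ and picked up by $w_d$, while patching all of $U$ (or the entire layer) keeps activations inside $U$ and never activates $w_d$.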
The linear representation hypothesis suggests that language models represent concepts as meaningful directions (or subspaces, for non-binary features) in the much larger space of possible activations. A central goal of mechanistic interpretability is to discover these subspaces and map them to interpretable variables, as they form the “units” of model computation.
However, the residual stream activations (and maybe even the neuron activations!) mostly don’t have a privileged basis. This means that many meaningful subspaces won’t be basis-aligned; rather than iterating over possible neurons and sets of neurons, we need to consider arbitrary subspaces of activations. This is a much larger search space! How can we navigate it?
A natural approach to check “how well” a subspace represents a concept is to use a subspace analogue of the activation patching technique. You run the model on input A, but with the activation along the subspace taken from an input B that differs from A only in the value of the concept in question. If the subspace encodes the information used by the model to distinguish B from A, we expect to see a corresponding change in model behavior (compared to just running on A).
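As a concrete sketch of the patching operation itself (my illustration in PyTorch; the tensors and shapes are stand-ins, and in a real experiment this would be applied at a specific model layer, e.g. via forward hooks):

```python
import torch

def subspace_patch(act_A: torch.Tensor, act_B: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Replace the component of act_A in the subspace spanned by the (orthonormal)
    columns of V with the corresponding component of act_B.

    act_A, act_B: activations from runs on inputs A and B, shape (d_model,).
    V: orthonormal basis for the subspace, shape (d_model, k).
    """
    P = V @ V.T                        # projector onto the subspace
    return act_A - P @ act_A + P @ act_B

# Toy usage with a 1-dimensional subspace of a 16-dim activation space:
d = 16
v = torch.randn(d, 1)
v = v / v.norm()                       # single orthonormal basis vector
act_A, act_B = torch.randn(d), torch.randn(d)
patched = subspace_patch(act_A, act_B, v)
```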
Surpri...
What if we constrain v to be in some subspace that is actually used by the MLP? (We can get it from PCA over activations on many inputs.)
This way v won't have any dormant component, so the MLP output after patching also cannot use that dormant pathway.
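A hedged sketch of that constraint (my illustration; `torch.pca_lowrank` stands in for whatever PCA implementation one prefers, and the dimensions are made up):

```python
import torch

# Activations at the MLP input collected over many inputs: (n_samples, d_model).
acts = torch.randn(10_000, 16)          # stand-in data for illustration

k = 8                                   # assumed dimension of the subspace the MLP actually uses
_, _, V = torch.pca_lowrank(acts, q=k)  # V: (d_model, k) top-k principal directions (data centered by default)

def constrain_to_pca(v: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Project a candidate patch direction v onto the top-k principal subspace
    and renormalize, stripping any would-be dormant component."""
    v_proj = V @ (V.T @ v)
    return v_proj / v_proj.norm()

v_constrained = constrain_to_pca(torch.randn(16), V)
```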
This is the ninth post in my series on Anthropics. The previous one is The Solution to Sleeping Beauty.
There are some quite pervasive misconceptions about betting in the Sleeping Beauty problem.
One is that you need to switch between halfer and thirder stances based on the betting scheme proposed. As if learning about a betting scheme is supposed to affect your credence in an event.
Another is that halfers should bet at thirder odds and that, therefore, thirdism is vindicated on the grounds of betting. What do halfers even mean by the probability of Heads being 1/2 if they bet as if it were 1/3?
In this post we are going to correct them. We will see how to arrive at the correct betting odds from both thirdist and halfist positions, and...
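Since the excerpt cuts off here: a standard toy illustration of the per-awakening betting arithmetic (my sketch, not the post's own derivation). Per-awakening bets on Heads break even exactly at 2:1 odds (implied probability 1/3), even though the coin is fair per experiment, simply because Tails generates two bets:

```python
import random

def avg_profit_per_experiment(n=100_000, stake=1.0, odds=2.0):
    """Beauty bets `stake` on Heads at every awakening, at `odds`:1 against:
    win `odds * stake` if Heads, lose `stake` per awakening if Tails."""
    total = 0.0
    for _ in range(n):
        if random.random() < 0.5:   # Heads: one awakening, one winning bet
            total += odds * stake
        else:                       # Tails: two awakenings, two losing bets
            total -= 2 * stake
    return total / n

print(avg_profit_per_experiment())  # ≈ 0: break-even at "thirder" 2:1 odds
```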
"Whether or not your probability model leads to optimal decision making is the test that allows you to falsify it."
Sure, I don't deny that. What I am saying is that your probability model doesn't tell you which probability you have to base a given decision on. If you can derive a probability from your model and provide a good reason to consider this probability relevant to your decision, your model is not falsified as long as you arrive at the right decision. Suppose a simple experiment where the experimenter flips a fair coin and you have to guess if Tails or Hea...
Hi : )
I used to use smileys in my writing all the time (more than I do now!). but then I read Against Disclaimers, and I thought that every time I used a smiley I wud make people who don't use smileys seem less friendly (bc my conspicuous-friendliness wud be available as a contrast to others' behaviour). so instead, my strategy for maximizing friendliness in the world became:
if I just have the purest of kindness in my heart while I'm interacting with ppl, and use plain words with no extra signalling, I will make plain words seem more friendly in general.
this was part of a general heuristic strategy: "to marginally move society in the direction of a better interpretive equilibrium, just act like that equilibrium is already...
Some people have suggested that a lot of the danger of training a powerful AI comes from reinforcement learning. Given an objective, RL will reinforce any method of achieving the objective that the model tries and finds to be successful, including things like deceiving us or increasing its power.
If this were the case, then if we want to build a model with capability level X, it might make sense to try to train that model either without RL or with as little RL as possible. For example, we could attempt to achieve the objective using imitation learning instead.
However, if, for example, the alternative were imitation learning, it would be possible to push back and argue that this is still a black box that uses gradient descent, so we...
Right, and that wouldn’t apply to a model-based RL system that could learn an open-ended model of any aspect of the world and itself, right?
I think your “it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function” should have some caveat that it only clearly applies to currently-known techniques. In the future there could be better automatic-world-model-builders, and/or future generic techniques to do automatic unsupervised reward-shaping for an arbitrary reward, such that AIs could find out-of-the-box ways to solve hard problems without handholding.
A couple years ago, I had a great conversation at a research retreat about the cool things we could do if only we had safe, reliable amnestic drugs - i.e. drugs which would allow us to act more-or-less normally for some time, but not remember it at all later on. And then nothing came of that conversation, because as far as any of us knew such drugs were science fiction.
… so yesterday when I read Eric Neyman’s fun post My hour of memoryless lucidity, I was pretty surprised to learn that what sounded like a pretty ideal amnestic drug was used in routine surgery. A little googling suggested that the drug was probably a benzodiazepine (think valium). Which means it’s not only a great amnestic, it’s also apparently one...
@habryka this comment has an anomalous amount of karma. It showed up on popular comments, I think, and I'm wondering if people liked the comment when they saw it there, which led to a feedback loop of more eyeballs on the comment, more likes, more eyeballs, etc. If so, is that the intended behaviour of the popular comments feature? It seems like it shouldn't be.
Most witches don't believe in gods. They know that the gods exist, of course. They even deal with them occasionally. But they don't believe in them. They know them too well. It would be like believing in the postman.
—Terry Pratchett, Witches Abroad
Once upon a time, I was pondering the philosophy of fantasy stories—
And before anyone chides me for my "failure to understand what fantasy is about", let me say this: I was raised in an SF&F household. I have been reading fantasy stories since I was five years old. I occasionally try to write fantasy stories. And I am not the sort of person who tries to write for a genre without pondering its philosophy. Where do you think story ideas come from?
Anyway:
I was...
Let's step back. This thread of the conversation is rooted in this claim: "Let's be honest: all fiction is a form of escapism." Are we snared in the [Disputing Definitions](https://www.lesswrong.com/posts/7X2j8HAkWdmMoS8PE/disputing-definitions) trap? To quote from that LW article:
> if the issue arises, both sides should switch to describing the event in unambiguous lower-level constituents, like acoustic vibrations or auditory experiences. Or each side could designate a new word, like 'alberzle' and 'bargulum', to use for what they respectively ...