A recent popular tweet did a "math magic trick", and I want to explain why it works and use that as an excuse to talk about cool math (functional analysis). The tweet in question:
This is a cute magic trick, and like any good trick they nonchalantly gloss over the most important step. Did you spot it? Did you notice your confusion?
Here's the key question: Why did they switch from a differential equation to an integral equation? If you can use when , why not use it when ?
Well, lets try it, writing for the derivative:
So now you may be disappointed, but relieved: yes, this version fails, but at least it fails-safe, giving you the trivial solution, right?
But no, actually can fail catastrophically, which we can see if we try a nonhomogeneous equation...
Very nice! Notice that if you write as , and play around with binomial coefficients a bit, we can rewrite this as:
which holds for as well, in which case it becomes the derivative product rule
(By the way, how do you spoiler tag?)
There are two main areas of catastrophic or existential risk which have recently received significant attention; biorisk, from natural sources, biological accidents, and biological weapons, and artificial intelligence, from detrimental societal impacts of systems, incautious or intentional misuse of highly capable systems, and direct risks from agentic AGI/ASI. These have been compared extensively in research, and have even directly inspired policies. Comparisons are often useful, but in this case, I think the disanalogies are much more compelling than the analogies. Below, I lay these out piecewise, attempting to keep the pairs of paragraphs describing first biorisk, then AI risk, parallel to each other.
While I think the disanalogies are compelling, comparison can still be useful as an analytic tool - while keeping in mind that the ability to directly...
This is an 800-word blog post, not 5 words. There’s plenty of room for nuance.
The way it stands right now, if there’s a conversation like:
...Person A: It’s not inconceivable that the world might wildly under-invest in societal resilience against catastrophic risks even after a “warning shot” for AI. Like for example, look at the case of bio-risks—COVID just happened, so the costs of novel pandemics are right now extremely salient to everyone on Earth, and yet, (…etc.).
Person B: You idiot, bio-risks are not at all analogous to AI. Look at this blog post by Dav
Robin Hanson has apparently asked the same thing. It seems like such a bizarre question to me:
This report is one in a series of ~10 posts comprising a 2024 State of the AI Regulatory Landscape Review, conducted by the Governance Recommendations Research Program at Convergence Analysis. Each post will cover a specific domain of AI governance (such as incident reporting, safety evals, model registries, and more). We’ll provide an overview of existing regulations, focusing on the US, EU, and China as the leading governmental bodies currently developing AI legislation. Additionally, we’ll discuss the relevant context behind each domain and conduct a short analysis.
This series is intended to be a primer for policymakers, researchers, and individuals seeking to develop a high-level overview of the current AI governance space. We’ll publish individual posts on our website and release a comprehensive report at the end of this series.
In this post,...
TLDR:
1905 is often described as the "annus mirabilis" of Albert Einstein. He founded quantum physics by postulating the existence of (light) quanta, explained Brownian motion, introduced the special relativity theory and...
I think research on what you propose should definitely not be public and I'd recommend against publicly trying to push this alignment agenda.
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort
We would like to thank Atticus Geiger for his valuable feedback and in-depth discussions throughout this project.
Activation patching is a common method for finding model components (attention heads, MLP layers, …) relevant to a given task. However, features rarely occupy entire components: instead, we expect them to form non-basis-aligned subspaces of these components.
We show that the obvious generalization of activation patching to subspaces is prone to a kind of interpretability illusion. Specifically, it is possible for a 1-dimensional subspace patch in the IOI task to significantly affect predicted probabilities by activating a normally dormant pathway outside the IOI circuit. At the same time, activation patching the entire MLP layer where this subspace lies has no such effect. We call this an "MLP-In-The-Middle" illusion.
We show a simple mathematical model of how this situation may arise more generally, and a priori / heuristic arguments for why it may be common in real-world LLMs.
The linear representation hypothesis suggests that language models represent concepts as meaningful directions (or subspaces, for non-binary features) in the much larger space of possible activations. A central goal of mechanistic interpretability is to discover these subspaces and map them to interpretable variables, as they form the “units” of model computation.
However, the residual stream activations (and maybe even the neuron activations!) mostly don’t have a privileged basis. This means that many meaningful subspaces won’t be basis-aligned; rather than iterating over possible neurons and sets of neurons, we need to consider arbitrary subspaces of activations. This is a much larger search space! How can we navigate it?
A natural approach to check “how well” a subspace represents a concept is to use a subspace analogue of the activation patching technique. You run the model on input A, but with the activation along the subspace taken from an input B that differs from A only in the value of the concept in question. If the subspace encodes the information used by the model to distinguish B from A, we expect to see a corresponding change in model behavior (compared to just running on A).
Surpri...
What if we constrain v to be in some subspace that is actually used by the MLP? (We can get it from PCA over activations on many inputs.)
This way v won't have any dormant component, so the MLP output after patching also cannot use that dormant pathway.
This is the ninth post in my series on Anthropics. The previous one is The Solution to Sleeping Beauty.
There are some quite pervasive misconceptions about betting in regards to the Sleeping Beauty problem.
One is that you need to switch between halfer and thirder stances based on the betting scheme proposed. As if learning about a betting scheme is supposed to affect your credence in an event.
Another is that halfers should bet at thirders odds and, therefore, thirdism is vindicated on the grounds of betting. What do halfers even mean by probability of Heads being 1/2 if they bet as if it's 1/3?
In this post we are going to correct them. We will understand how to arrive to correct betting odds from both thirdist and halfist positions, and...
„Whether or not your probability model leads to optimal descision making is the test allowing to falsify it.“
Sure, I don‘t deny that. What I am saying is, that your probability model don‘t tell you which probability you have to base on a certain decision. If you can derive a probability from your model and provide a good reason to consider this probability relevant to your decision, your model is not falsified as long you arrive at the right decision. Suppose a simple experiment where the experimenter flips a fair coin and you have to guess if Tails or Hea...
Hi : )
I used to use smileys in my writing all the time (more than I do now!). but then I read Against Disclaimers, and I thought that every time I used a smiley I wud make people who don't use smileys seem less friendly (bc my conspicuous-friendliness wud be available as a contrast to others' behaviour). so instead, my strategy for maximizing friendliness in the world became:
if I just have the purest of kindness in my heart while I interacting with ppl, and use plain words with no extra signalling, I will make plain words seem more friendly in general.
this was part of a general heuristic strategy: "to marginally move society in the direction of a better interpretive equilibrium, just act like that equilibrium is already...
Some people have suggested that a lot of the danger of training a powerful AI comes from reinforcement learning. Given an objective, RL will reinforce any method of achieving the objective that the model tries and finds to be successful including things like deceiving us or increasing its power.
If this were the case, then if we want to build a model with capability level X, it might make sense to try to train that model either without RL or with as little RL as possible. For example, we could attempt to achieve the objective using imitation learning instead.
However, if, for example, the alternate was imitation learning, it would be possible to push back and argue that this is still a black-box that uses gradient descent so we...
Right, and that wouldn’t apply to a model-based RL system that could learn an open-ended model of any aspect of the world and itself, right?
I think your “it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function” should have some caveat that it only clearly applies to currently-known techniques. In the future there could be better automatic-world-model-builders, and/or future generic techniques to do automatic unsupervised reward-shaping for an arbitrary reward, such that AIs could find out-of-the-box ways to solve hard problems without handholding.