How does it work to optimize for realistic goals in physical environments of which you yourself are a part? E.g. humans and robots in the real world, and not humans and AIs playing video games in virtual worlds where the player not part of the environment.  The authors claim we don't actually have a good theoretical understanding of this and explore four specific ways that we don't understand this process.

TurnTrout19hΩ22495
3
A semi-formalization of shard theory. I think that there is a surprisingly deep link between "the AIs which can be manipulated using steering vectors" and "policies which are made of shards."[1] In particular, here is a candidate definition of a shard theoretic policy: > A policy has shards if it implements at least two "motivational circuits" (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented). By this definition, humans have shards because they can want food at the same time as wanting to see their parents again, and both factors can affect their planning at the same time! The maze-solving policy is made of shards because we found activation directions for two motivational circuits (the cheese direction, and the top-right direction): > On the other hand, AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other. It's just maximizing one utility function. A mesa optimizer with a single goal also does not have two motivational circuits which can go on and off in an independent fashion.  * This definition also makes obvious the fact that "shards" are a matter of implementation, not of behavior. * It also captures the fact that "shard" definitions are somewhat subjective. In one moment, I might model someone is having a separate "ice cream shard" and "cookie shard", but in another situation I might choose to model those two circuits as a larger "sweet food shard." So I think this captures something important. However, it leaves a few things to be desired: * What, exactly, is a "motivational circuit"? Obvious definitions seem to include every neural network with nonconstant outputs. * Demanding a compositional representation is unrealistic since it ignores superposition. If k dimensions are compositional, then they must be pairwise orthogonal. Then a transformer can only have k≤dmodel shards, which seems obviously wrong and false.  That said, I still find this definition useful.  I came up with this last summer, but never got around to posting it. Hopefully this is better than nothing. 1. ^ Shard theory reasoning led me to discover the steering vector technique extremely quickly. This link would explain why shard theory might help discover such a technique.
tlevin1d607
3
I think some of the AI safety policy community has over-indexed on the visual model of the "Overton Window" and under-indexed on alternatives like the "ratchet effect," "poisoning the well," "clown attacks," and other models where proposing radical changes can make you, your allies, and your ideas look unreasonable (edit to add: whereas successfully proposing minor changes achieves hard-to-reverse progress, making ideal policy look more reasonable). I'm not familiar with a lot of systematic empirical evidence on either side, but it seems to me like the more effective actors in the DC establishment overall are much more in the habit of looking for small wins that are both good in themselves and shrink the size of the ask for their ideal policy than of pushing for their ideal vision and then making concessions. Possibly an ideal ecosystem has both strategies, but it seems possible that at least some versions of "Overton Window-moving" strategies executed in practice have larger negative effects via associating their "side" with unreasonable-sounding ideas in the minds of very bandwidth-constrained policymakers, who strongly lean on signals of credibility and consensus when quickly evaluating policy options, than the positive effects of increasing the odds of ideal policy and improving the framing for non-ideal but pretty good policies. In theory, the Overton Window model is just a description of what ideas are taken seriously, so it can indeed accommodate backfire effects where you argue for an idea "outside the window" and this actually makes the window narrower. But I think the visual imagery of "windows" actually struggles to accommodate this -- when was the last time you tried to open a window and accidentally closed it instead? -- and as a result, people who rely on this model are more likely to underrate these kinds of consequences. Would be interested in empirical evidence on this question (ideally actual studies from psych, political science, sociology, econ, etc literatures, rather than specific case studies due to reference class tennis type issues).
Richard_Ngo2hΩ562
2
Hypothesis: there's a way of formalizing the notion of "empowerment" such that an AI with the goal of empowering humans would be corrigible. This is not straightforward, because an AI that simply maximized human POWER (as defined by Turner et al.) wouldn't ever let the humans spend that power. Intuitively, though, there's a sense in which a human who can never spend their power doesn't actually have any power. Is there a way of formalizing that intuition? The direction that seems most promising is in terms of counterfactuals (or, alternatively, Pearl's do-calculus). Define the power of a human with respect to a distribution of goals G as the average ability of a human to achieve their goal if they'd had a goal sampled from G (alternatively: under an intervention that changed their goal to one sampled from G). Then an AI with a policy of never letting humans spend their resources would result in humans having low power. Instead, a human-power-maximizing AI would need to balance between letting humans pursue their goals, and preventing humans from doing self-destructive actions. The exact balance would depend on G, but one could hope that it's not very sensitive to the precise definition of G (especially if the AI isn't actually maximizing human power, but is more like a quantilizer, or is optimizing under pessimistic assumptions). The problem here is that these counterfactuals aren't very clearly-defined. E.g. imagine the hypothetical world where humans valued paperclips instead of love. Even a little knowledge of evolution would tell you that this hypothetical is kinda crazy, and maybe the question "what would the AI be doing in this world?" has no sensible answer (or maybe the answer would be "it would realize it's in a weird hypothetical world and behave accordingly"). Similarly, if we model this using the do-operation, the best policy is something like "wait until the human's goals suddenly and inexplicably change, then optimize hard for their new goal". Having said that, in some sense what it means to model someone as an agent is that you can easily imagine them pursuing some other goal. So the counterfactuals above might not be too unnatural; or at least, no more unnatural than any other intervention modeled by Pearl's do-operator. Overall this line of inquiry seems promising and I plan to spend more time thinking about it.
I think there should be a way to find the highest rated shortform posts.
The FDC just fined US phone carriers for sharing the location data of US customers to anyone willing to buy them. The fines don't seem to be high enough to deter this kind of behavior. That likely includes either directly or indirectly the Chinese government.  What does the US Congress do to protect spying by China? Of course, banning tik tok instead of actually protecting the data of US citizens.  If you have thread models that the Chinese government might target you, assume that they know where your phone is and shut it of when going somewhere you don't want the Chinese government (or for that matter anyone with a decent amount of capital) to know.  

Popular Comments

Recent Discussion

This is a linkpost for https://arxiv.org/abs/2404.19756

Abstract:

Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability. For accuracy, much smaller KANs can achieve comparable or better accuracy than much larger MLPs in data fitting and PDE solving. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful collaborators helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives for MLPs, opening opportunities for further improving today's deep learning models which rely heavily on MLPs.

4Nathan Helm-Burger4h
Wow, this is super fascinating. A juicy tidbit: "
6Mateusz Bagiński3h
I mostly stopped hearing about catastrophic forgetting when Really Large Language Models became The Thing, so I figured that it's solvable by scale (likely conditional on some aspects of the training setup, idk, self-supervised predictive loss function?). Anthropic's work on Sleeper Agents seems like a very strong piece of evidence that it is the case. Still, if they're right that KANs don't have this problem at much smaller sizes than MLP-based NNs, that's very interesting. Nevertheless, I think talking about catastrophic forgetting as a "serious problem in modern ML" seems significantly misleading
gwern3m20

(likely conditional on some aspects of the training setup, idk, self-supervised predictive loss function?)

Pretraining, specifically: https://gwern.net/doc/reinforcement-learning/meta-learning/continual-learning/index#scialom-et-al-2022-section

The intuition is that after pretraining, models can map new data into very efficient low-dimensional latents and have tons of free space / unused parameters. So you can easily prune them, but also easily specialize them with LoRA (because the sparsity is automatic, just learned) or just regular online SGD.

But yeah,... (read more)

2Nathan Helm-Burger1h
I'm not so sure. You might be right, but I suspect that catastrophic forgetting may still be playing an important role in limiting the peak capabilities of an LLM of given size. Would it be possible to continue Llama3 8B's training much much longer and have it eventually outcompete Llama3 405B stopped at its normal training endpoint? I think probably not? And I suspect that if not, that part (but not all) of the reason would be catastrophic forgetting. Another part would be limited expressivity of smaller models, another thing which the KANs seem to help with.

Q. "Can you hold the door?" A. "Sure."

That's straightforward.

Q. "Can you play the violin at my wedding next year?" A. "Sure."

Colloquial language would imply not only am I willing and able to do this, I already know how to play the violin. Sometimes, what I want to answer is that I don't know how to play the violin, I'm willing to learn, but you should know I currently don't know.

Which I can say, it just takes more words.

If you're working at the intersection between cryptogrpahy, secuity and AI, consider joining this upcoming workshop:

Foresight's AGI: Cryptography, Security & Multipolar Scenarios Workshop
May 14-15, all-day
The Institute, Salesforce Tower, San Francisco

Goals
To help AI development benefit humanity, Foresight Institute has held various workshops over the past years and launched a Grants Program that funds work on AI security risks, cryptography tools for safe AI, and safe multipolar AI scenarios. Our 2024 workshop invites leading researchers, entrepreneurs, and funders in this growing space to explore new tools and architectures that help humans and AIs cooperate securely. In addition to short presentations, working groups, and project development, we offer mentorship hours, open breakouts, and speaker & sponsor gatherings.

Questions we’ll address include

  1. Which challenges in AI alignment, AI security and AI coordination (in particular
...

I'll be there! Talk to me about boundaries and coordination/Goodness

6Richard_Ngo2h
Hypothesis: there's a way of formalizing the notion of "empowerment" such that an AI with the goal of empowering humans would be corrigible. This is not straightforward, because an AI that simply maximized human POWER (as defined by Turner et al.) wouldn't ever let the humans spend that power. Intuitively, though, there's a sense in which a human who can never spend their power doesn't actually have any power. Is there a way of formalizing that intuition? The direction that seems most promising is in terms of counterfactuals (or, alternatively, Pearl's do-calculus). Define the power of a human with respect to a distribution of goals G as the average ability of a human to achieve their goal if they'd had a goal sampled from G (alternatively: under an intervention that changed their goal to one sampled from G). Then an AI with a policy of never letting humans spend their resources would result in humans having low power. Instead, a human-power-maximizing AI would need to balance between letting humans pursue their goals, and preventing humans from doing self-destructive actions. The exact balance would depend on G, but one could hope that it's not very sensitive to the precise definition of G (especially if the AI isn't actually maximizing human power, but is more like a quantilizer, or is optimizing under pessimistic assumptions). The problem here is that these counterfactuals aren't very clearly-defined. E.g. imagine the hypothetical world where humans valued paperclips instead of love. Even a little knowledge of evolution would tell you that this hypothetical is kinda crazy, and maybe the question "what would the AI be doing in this world?" has no sensible answer (or maybe the answer would be "it would realize it's in a weird hypothetical world and behave accordingly"). Similarly, if we model this using the do-operation, the best policy is something like "wait until the human's goals suddenly and inexplicably change, then optimize hard for their new goal". Havi

There's also the problem of: what do you mean by "the human"? If you make an empowerment calculus that works for humans who are atomic & ideal agents, it probably breaks once you get a superintelligence who can likely mind-hack you into yourself valuing only power. It never forces you to abstain from giving up power, since if you're perfectly capable of making different decisions, but you just don't.

Another problem, which I like to think of as the "control panel of the universe" problem, is where the AI gives you the "control panel of the universe", bu... (read more)

4Richard_Ngo1h
You can think of this as a way of getting around the problem of fully updated deference, because the AI is choosing a policy based on what that policy would have done in the full range of hypothetical situations, and so it never updates away from considering any given goal. The cost, of course, is that we don't know how to actually pin down these hypotheticals.
1samshap7h
Instead of demanding orthogonal representations, just have them obey the restricted isometry property. Basically, instead of requiring  ∀i≠j:<xi,xj>=0, we just require ∀i≠j:xi⋅xj≤ϵ . This would allow a polynomial number of sparse shards while still allowing full recovery.
7cubefox15h
For illustration, what would be an example of having different shards for "I get food" (F) and "I see my parents again" (P) compared to having one utility distribution over F∧P, F∧¬P, ¬F∧P, ¬F∧¬P?

I think this is also what I was confused about -- TurnTrout says that AIXI is not a shard-theoretic agent because it just has one utility function, but typically we imagine that the utility function itself decomposes into parts e.g. +10 utility for ice cream, +5 for cookies, etc. So the difference must not be about the decomposition into parts, but the possibility of independent activation? but what does that mean? Perhaps it means: The shards aren't always applied, but rather only in some circumstances does the circuitry fire at all, and there are circums... (read more)

ACX recently posted about the Rootclaim Covid origins debate, coming out in favor of zoonosis. Did the post change the minds of those who read it, or not? Did it change their judgment in favor of zoonosis (as was probably the goal of the post), or conversely did it make them think Lab Leak was more likely (as the "Don't debate conspiracy theorists" theory claims)?

I analyzed the ACX survey to find out, by comparing responses before and after the post came out. The ACX survey asked readers whether they think the origin of Covid is more likely natural or Lab Leak. The ACX survey went out March 26th and was open until about April 10th. The Covid origins post came out March 28th, and the highlights on April...

Is a lot of the effect not "people who read ACX trust Scott Alexander"? Like, the survey selects for most "passionate" readers, those willing to donate their free time to Scott for research with ~nothing in return. Him publicly stating on his platform "I am now much less certain of X" is likely to make that group of people be less certain of X?

8Brendan Long5h
I'm not sure if this really applies here, since Lab Leak was never really treated as a crazy/fringe idea among rationalists. In fact, it looks like it was the majority opinion before the debate and ACX posts.
1ErnestScribbler2h
This is a very good point.
To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)
Log In Reset Password
...or continue with

 Summary

  • We present a method for performing circuit analysis on language models using "transcoders," an occasionally-discussed variant of SAEs that provide an interpretable approximation to MLP sublayers' computations. Transcoders are exciting because they allow us not only to interpret the output of MLP sublayers but also to decompose the MLPs themselves into interpretable computations. In contrast, SAEs only allow us to interpret the output of MLP sublayers and not how they were computed.
  • We demonstrate that transcoders achieve similar performance to SAEs (when measured via fidelity/sparsity metrics) and that the features learned by transcoders are interpretable.
  • One of the strong points of transcoders is that they decompose the function of an MLP layer into sparse, independently-varying, and meaningful units (like neurons were originally intended to be before superposition was discovered).
...

Hey Jacob + Philippe,

Hope you all don't mind but we put up layer 8 of your transcoders onto Neuronpedia, with ~22k dashboards here:

https://neuronpedia.org/gpt2-small/8-tres-dc

Each dashboard can be accessed at their own url:

https://neuronpedia.org/gpt2-small/8-tres-dc/0 goes to feature index 0.

You can also test each feature with custom text:

Or search all features at: https://www.neuronpedia.org/gpt2-small/tres-dc

An example search: https://www.neuronpedia.org/gpt2-small/?sourceSet=tres-dc&selectedLayers=[]&sortIndexes=[]&q=the%20cat%20sat%20on%20... (read more)

3Philippe Chlenski7h
This sounds like it could work. I can think of a few reasons why this approach could be challenging, however: 1. We don't really know how transcoders (or SAEs, to the best of my knowledge) behave when they're being trained to imitate a model component that's still updating 2. Substituting multiple transcoders at once is possible, but degrades model performance a lot compared to single-transcoder substitutions. Substituting one transcoder at a time would require restarting the forward pass at each layer.  3. If the transcoders are used to predict next tokens, they may lose interpretability and return to superposition. Under a "transcoder-aware" training regime, these would be the first things I would check for. Also, you may be interested in Jacob's comment here for some details on when we tried to co-train SAEs and transcoders to have sparse connections to one another. This is a very different question, of course, but it provides some preliminary evidence that the fidelity-interpretability tradeoff persists across more elaborate training settings.
2Vladimir_Nesov6h
Possibly. But there is no optimization pressure from pre-training on the relationship between MLPs and transcoders. The MLPs are the thing that pre-training optimizes (as the "full-precision" master model), while transcoders only need to be maintained to remain in sync with the MLPs, whatever they are (according to the same local objective as before, which doesn't care at all about token prediction). The search is for MLPs such that their transcoders are good predictors, not directly for transcoders that are good predictors. Unclear given the extreme quantization results, where similarly post-training replacement would degrade model performance a lot, yet quantization-aware pre-training somehow doesn't. This seems to be the main technical hurdle to do the experiment, updating transcoders both efficiently and correctly, as underlying MLPs gradually change. (I'm guessing some discontinuous jumps in choice of transcoders might be OK.)
3Jacob Dunefsky11h
I'm pretty sure that there's at least one other MATS group (unrelated to us) currently working on this, although I'm not certain about any of the details. Hopefully they release their research soon! I did try something similar at one point, but it didn't quite work out. In particular: given an SAE for MLP-out activations, you can try and train an MLP transcoder with an additional loss term penalizing the L1 norm of the pullback of the SAE encoder features by the transcoder decoder matrix. This was intended to induce sparse input-independent connections from the transcoder features to the MLP-out SAE features. Unfortunately, this didn't yield great results. The transcoder features were often polysemantic, while the input-independent connections from the transcoder features to the SAE features were somewhat bizarre-looking. Here's an old graph I just dug up: the x-axis is transcoder feature index and the y-axis is the input-independent connection strength to a certain SAE feature: In the end, I decided to pause working on this idea. Potentially, it could turn out that this idea is workable, but if so, then there are probably a few extra tweaks that have to be done to get it working beyond the naive approach that I tried.
3metachirality1h
I think there should be a way to find the highest rated shortform posts.

You can! Just go to the all-posts page, sort by year, and the highest-rated shortform posts for each year will be in the Quick Takes section: 

2024: 

2023: 

2022: 

1dkornai2h
Pain is the consequence of a perceived reduction in the probability that an agent will achieve its goals.  In biological organisms, physical pain [say, in response to limb being removed] is an evolutionary consequence of the fact that organisms with the capacity to feel physical pain avoided situations where their long-term goals [e.g. locomotion to a favourable position with the limb] which required the subsystem generating pain were harmed. This definition applies equally to mental pain [say, the pain felt when being expelled from a group of allies] which impedes long term goals. This suggests that any system that possesses both a set of goals and the capacity to understand how events influence their probability of achieving such goals should posses a capacity to feel pain. This also suggests that the amount of pain is proportional to the degree of "setbacks"  and the degree to which "setbacks" are perceived. I think this is a relatively robust argument for the inherent reality of pain not just in a broad spectrum biological organisms, but also in synthetic [including sufficiently advanced AI] agents.  We should strive to reduce the pain we cause in the agents we interact with.

It also suggests that there might some sort of conservation law for pain for agents.

Conservation of Pain if you will

LessOnline

A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA