How does it work to optimize for realistic goals in physical environments of which you yourself are a part? E.g. humans and robots in the real world, as opposed to humans and AIs playing video games in virtual worlds where the player is not part of the environment. The authors claim we don't actually have a good theoretical understanding of this, and they explore four specific ways in which we don't understand this process.

TurnTrout
A semi-formalization of shard theory. I think that there is a surprisingly deep link between "the AIs which can be manipulated using steering vectors" and "policies which are made of shards."[1] In particular, here is a candidate definition of a shard-theoretic policy:

> A policy has shards if it implements at least two "motivational circuits" (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).

By this definition, humans have shards because they can want food at the same time as wanting to see their parents again, and both factors can affect their planning at the same time! The maze-solving policy is made of shards because we found activation directions for two motivational circuits (the cheese direction, and the top-right direction).

On the other hand, AIXI is not a shard-theoretic agent because it does not have two motivational circuits which can be activated independently of each other. It's just maximizing one utility function. A mesa-optimizer with a single goal also does not have two motivational circuits which can go on and off in an independent fashion.

* This definition also makes obvious the fact that "shards" are a matter of implementation, not of behavior.
* It also captures the fact that "shard" definitions are somewhat subjective. In one moment, I might model someone as having a separate "ice cream shard" and "cookie shard", but in another situation I might choose to model those two circuits as a larger "sweet food shard."

So I think this captures something important. However, it leaves a few things to be desired:

* What, exactly, is a "motivational circuit"? Obvious definitions seem to include every neural network with nonconstant outputs.
* Demanding a compositional representation is unrealistic since it ignores superposition. If k dimensions are compositional, then they must be pairwise orthogonal. Then a transformer can only have k ≤ d_model shards, which seems obviously false.

That said, I still find this definition useful. I came up with this last summer, but never got around to posting it. Hopefully this is better than nothing.

1. ^ Shard theory reasoning led me to discover the steering vector technique extremely quickly. This link would explain why shard theory might help discover such a technique.
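To make the superposition worry concrete, here is a toy numpy sketch (mine, not from the original comment): strictly orthogonal shard directions cap out at d_model, but random directions in a d_model-dimensional residual stream are only nearly orthogonal, and far more than d_model of them fit with small interference.

```python
# Toy sketch (illustrative, not from the original comment): a d-dimensional
# residual stream holds at most d exactly-orthogonal "shard directions",
# but many more nearly-orthogonal ones (superposition).
import numpy as np

rng = np.random.default_rng(0)
d_model = 512            # hypothetical residual-stream width
n_shards = 2048          # more candidate shard directions than dimensions

dirs = rng.normal(size=(n_shards, d_model))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

cos = dirs @ dirs.T
np.fill_diagonal(cos, 0.0)
# Random high-dimensional unit vectors interfere only slightly (|cos| ~ 0.2),
# so "compositional" is probably better read as "approximately orthogonal".
print(f"max |cos| among {n_shards} directions in R^{d_model}: {np.abs(cos).max():.3f}")
```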
tlevin
I think some of the AI safety policy community has over-indexed on the visual model of the "Overton Window" and under-indexed on alternatives like the "ratchet effect," "poisoning the well," "clown attacks," and other models where proposing radical changes can make you, your allies, and your ideas look unreasonable (edit to add: whereas successfully proposing minor changes achieves hard-to-reverse progress, making ideal policy look more reasonable).

I'm not familiar with a lot of systematic empirical evidence on either side, but it seems to me like the more effective actors in the DC establishment overall are much more in the habit of looking for small wins that are both good in themselves and shrink the size of the ask for their ideal policy than of pushing for their ideal vision and then making concessions. Possibly an ideal ecosystem has both strategies, but it seems possible that at least some versions of "Overton Window-moving" strategies executed in practice have larger negative effects via associating their "side" with unreasonable-sounding ideas in the minds of very bandwidth-constrained policymakers, who strongly lean on signals of credibility and consensus when quickly evaluating policy options, than the positive effects of increasing the odds of ideal policy and improving the framing for non-ideal but pretty good policies.

In theory, the Overton Window model is just a description of what ideas are taken seriously, so it can indeed accommodate backfire effects where you argue for an idea "outside the window" and this actually makes the window narrower. But I think the visual imagery of "windows" actually struggles to accommodate this -- when was the last time you tried to open a window and accidentally closed it instead? -- and as a result, people who rely on this model are more likely to underrate these kinds of consequences.

Would be interested in empirical evidence on this question (ideally actual studies from psych, political science, sociology, econ, etc literatures, rather than specific case studies due to reference class tennis type issues).
Richard_Ngo
Hypothesis: there's a way of formalizing the notion of "empowerment" such that an AI with the goal of empowering humans would be corrigible. This is not straightforward, because an AI that simply maximized human POWER (as defined by Turner et al.) wouldn't ever let the humans spend that power. Intuitively, though, there's a sense in which a human who can never spend their power doesn't actually have any power. Is there a way of formalizing that intuition?

The direction that seems most promising is in terms of counterfactuals (or, alternatively, Pearl's do-calculus). Define the power of a human with respect to a distribution of goals G as the average ability of a human to achieve their goal if they'd had a goal sampled from G (alternatively: under an intervention that changed their goal to one sampled from G). Then an AI with a policy of never letting humans spend their resources would result in humans having low power. Instead, a human-power-maximizing AI would need to balance between letting humans pursue their goals, and preventing humans from doing self-destructive actions. The exact balance would depend on G, but one could hope that it's not very sensitive to the precise definition of G (especially if the AI isn't actually maximizing human power, but is more like a quantilizer, or is optimizing under pessimistic assumptions).

The problem here is that these counterfactuals aren't very clearly-defined. E.g. imagine the hypothetical world where humans valued paperclips instead of love. Even a little knowledge of evolution would tell you that this hypothetical is kinda crazy, and maybe the question "what would the AI be doing in this world?" has no sensible answer (or maybe the answer would be "it would realize it's in a weird hypothetical world and behave accordingly"). Similarly, if we model this using the do-operation, the best policy is something like "wait until the human's goals suddenly and inexplicably change, then optimize hard for their new goal".

Having said that, in some sense what it means to model someone as an agent is that you can easily imagine them pursuing some other goal. So the counterfactuals above might not be too unnatural; or at least, no more unnatural than any other intervention modeled by Pearl's do-operator. Overall this line of inquiry seems promising and I plan to spend more time thinking about it.
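One way to write down the counterfactual definition sketched above as a formula (notation mine, not from the comment): the power of human h, given the AI's policy, is the expected attainable utility under goal interventions,

$$
\mathrm{Power}_G(h \mid \pi_{\mathrm{AI}}) \;=\; \mathbb{E}_{g \sim G}\!\left[\,\max_{\pi_h}\;\mathbb{E}\big[\,U_g(\tau)\;\big|\;\mathrm{do}(\mathrm{goal}_h := g),\;\pi_h,\;\pi_{\mathrm{AI}}\,\big]\right]
$$

where τ is the resulting trajectory and U_g the utility function for goal g. A human-power-maximizing (or quantilizing) AI would then pick π_AI to make this quantity large, which requires actually letting the human convert resources into goal achievement rather than merely hoarding them.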
I think there should be a way to find the highest rated shortform posts.
Hypothesis, super weakly held and based on anecdote: One big difference between US national security policy people and AI safety people is that the "grieving the possibility that we might all die" moment happened, on average, more years ago for the national security policy person than the AI safety person.  This is (even more weakly held) because the national security field has existed for longer, so many participants literally had the "oh, what happens if we get nuked by Russia" moment in their careers in the Literal 1980s...

Popular Comments

Recent Discussion

Adversarial Examples: A Problem

The apparent successes of the deep learning revolution conceal a dark underbelly. It may seem that we now know how to get computers to (say) check whether a photo is of a bird, but this façade of seemingly good performance is belied by the existence of adversarial examples—specially prepared data that looks ordinary to humans, but is seen radically differently by machine learning models.

The differentiable nature of neural networks, which makes them trainable at all, is also responsible for their downfall at the hands of an adversary. Deep learning models are fit using stochastic gradient descent (SGD) to approximate the function from inputs to expected outputs. Given an input, an expected output, and a loss function (which measures "how bad" it...
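As a concrete illustration of how differentiability lets an adversary work against the model, here is a minimal sketch of the fast gradient sign method (FGSM) in PyTorch; `model`, `x`, and `y` are assumed placeholders, and this is just one simple attack among many:

```python
# Minimal FGSM sketch: nudge the input in the direction that increases the
# loss, keeping the perturbation small enough to look ordinary to humans.
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, epsilon=8 / 255):
    """model: a differentiable classifier; x: input batch in [0, 1]; y: labels."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # One signed-gradient step of size epsilon, then clamp to a valid image range.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```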

Do you know if it is happening naturally from increased scale, or only correlated with scale (people are intentionally trying to correct the "misalignment" between ML and humans of shape vs texture bias by changing aspects of the ML system like its training and architecture, and simultaneously increasing scale)? I somewhat suspect the latter due to the existence of a benchmark that the paper seems to target ("humans are at 96% shape / 4% texture bias and ViT-22B-384 achieves a previously unseen 87% shape bias / 13% texture bias").

In either case, it seems kind... (read more)

Carl Feynman
An interesting question! I looked in “Towards Deep Learning Models Resistant to Adversarial Attacks” to see what they had to say on the question. If I’m interpreting their Figure 6 correctly, there’s a negligible increase in error rate as epsilon increases, and then at some point the error rate starts swooping up toward 100%. The transition seems to be about where the perturbed images start to be able to fool humans (or perhaps slightly before). So you can’t really blame the model for being fooled, in that case. If I had to pick an epsilon to train with, I would pick one just below the transition point, where robustness is maximized without getting into the crazy zone. All this is the result of a cursory inspection of a couple of papers. There’s about a 30% chance I’ve misunderstood.
sliqz
Because of the strange loopy nature of concepts/language/self/different problems, metaphilosophy seems unsolvable? Asking "What is good?" already implies that there are the concepts "good", "what", "being", and that there are answers and questions... Now we could ask what concepts or questions to use instead... Similarly:

> "What are all the things we can do with the things we have and what decision-making process will we use and why use that process if the character of the different processes is the production of different ends; don't we have to know which end is desired in order to choose the decision-making process that also arrives at that result?"
> Which leads back to desire and knowing what you want without needing a system to tell you what you want.

It's all empty in the Buddhist sense. It all depends on which concepts or turing machines or which physical laws you start with.
Quintin Pope
I think it actually points to convergence between human and NN learning dynamics. Human visual cortices are also bad at hands and text, to the point that lucid dreamers often look for issues with their hands / nearby text to check whether they're dreaming. One issue that I think causes people to underestimate the degree of convergence between brain and NN learning is to compare the behaviors of entire brains to the behaviors of individual NNs. Brains consist of many different regions which are "trained" on different internal objectives, then interact with each other to collectively produce human outputs. In contrast, most current NNs contain only one "region", which is all trained on the single objective of imitating certain subsets of human behaviors. We should thus expect NN learning dynamics to most resemble those of single brain regions, and that the best match for humanlike generalization patterns will arise from putting together multiple NNs that interact with each other in a similar manner as human brain regions.


Produced while being an affiliate at PIBBSS[1]. The work was done initially with funding from a Lightspeed Grant, and then continued while at PIBBSS. Work done in collaboration with @Paul Riechers, @Lucas Teixeira, @Alexander Gietelink Oldenziel, and Sarah Marzen. Paul was a MATS scholar during some portion of this work. Thanks to Paul, Lucas, Alexander, Sarah, and @Guillaume Corlouer for suggestions on this writeup.

Introduction

What computational structure are we building into LLMs when we train them on next-token prediction? In this post we present evidence that this structure is given by the meta-dynamics of belief updating over hidden states of the data-generating process. We'll explain exactly what this means in the post. We are excited by these results because

  • We have a formalism that relates training data to internal
...
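For readers who want a concrete handle on "belief updating over hidden states of the data-generating process", here is a toy Bayesian-filtering sketch over a two-state HMM (my example; the transition matrix `T` and emission matrix `E` are made up, and this is the textbook update rather than the post's formalism):

```python
# Toy sketch: the belief state over an HMM's hidden states, updated token by
# token. The post's claim is (roughly) that next-token predictors come to
# represent the geometry of this belief-updating process.
import numpy as np

T = np.array([[0.9, 0.1],     # hypothetical hidden-state transition matrix
              [0.2, 0.8]])
E = np.array([[0.7, 0.3],     # hypothetical emission probs P(token | state)
              [0.1, 0.9]])

def update_belief(belief, token):
    """One filtering step: push the belief through the dynamics, then condition."""
    predicted = belief @ T
    unnormalized = predicted * E[:, token]
    return unnormalized / unnormalized.sum()

belief = np.array([0.5, 0.5])            # uniform prior over the hidden states
for token in [0, 1, 1, 0]:               # an example token sequence
    belief = update_belief(belief, token)
    print(belief)
```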
snewman

I am trying to wrap my head around the high-level implications of this statement. I can come up with two interpretations:

  1. What LLMs are doing is similar to what people do as they go about their day. When I walk down the street, I am simultaneously using visual and other input to assess the state of the world around me ("that looks like a car"), running a world model based on that assessment ("the car is coming this way"), and then using some other internal mechanism to decide what to do ("I'd better move to the sidewalk").
  2. What LLMs are doing is harder than
... (read more)
habryka
Promoted to curated: Formalizing what it means for transformers to learn "the underlying world model" when engaging in next-token prediction tasks seems pretty useful, in that it's an abstraction that I see used all the time when discussing risks from models where the vast majority of the compute was spent in pre-training, where the details usually get handwaved. It seems useful to understand what exactly we mean by that in more detail. I have not done a thorough review of this kind of work, but it seems to me that others also thought the basic ideas in the work hold up, and I thought reading this post gave me crisper abstractions to talk about this kind of stuff in the future.
gwern
You might find my notes of interest.
ErioirE

Yes, thanks!

If we achieve AGI-level performance using an LLM-like approach, the training hardware will be capable of running ~1,000,000s of concurrent instances of the model.

Definitions

Although there is some debate about the definition of compute overhang, I believe that the AI Impacts definition matches the original use, and I prefer it: "enough computing hardware to run many powerful AI systems already exists by the time the software to run such systems is developed".  A large compute overhang leads to additional risk due to faster takeoff.

I use the types of superintelligence defined in Bostrom's Superintelligence book (summary here).

I use the definition of AGI in this Metaculus question. The adversarial Turing test portion of the definition is not very relevant to this post.

Thesis

Due to practical reasons, the compute requirements for training LLMs...
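As a rough, assumption-laden sketch of where a figure in this ballpark can come from: using the common approximations of ~6·N·D FLOPs to train and ~2·N FLOPs per generated token, the cluster that trained the model can run a very large number of copies at inference time. Every number below is an illustrative assumption, not the post's:

```python
# Back-of-envelope sketch (all quantities here are assumptions for illustration).
N = 1e12                         # parameters of a hypothetical AGI-scale model
D = 2e13                         # training tokens (~20 tokens per parameter)
train_flops = 6 * N * D          # standard training-cost approximation
train_seconds = 120 * 24 * 3600  # assume training took ~120 days
cluster_flops_per_s = train_flops / train_seconds

flops_per_token = 2 * N          # standard inference-cost approximation
tokens_per_instance_per_s = 50   # assumed per-instance generation speed

instances = cluster_flops_per_s / (flops_per_token * tokens_per_instance_per_s)
print(f"~{instances:,.0f} concurrent instances")
# ~10^5 under these assumptions; slower per-instance speeds, longer training
# runs, or overtrained models push the estimate toward the millions.
```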

gwern

> I think this only holds if fine tunes are composable, which as far as I can tell they aren't

You know 'finetunes are composable', because a finetune is just a gradient descent step on a batch of data and a parameter update, and if you train on more than one GPU and share updates, DL training still works {{citation needed}}.

If you can train asynchronously on a thousand, or 20,000, or 100,000 GPUs, that is what you are doing; this is especially true in DRL, where you might be, say, training across 170,000 CPU-cores. (You are certainly not accumulating the ... (read more)
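A toy sketch of the sense in which separately computed updates compose (my illustration of the point above, not gwern's code): for a single SGD step on equal-sized batches, the update from batch A plus the update from batch B equals twice the update from training on A and B together, which is exactly the gradient averaging that data-parallel training relies on.

```python
# Toy check that gradient updates compose: delta_A + delta_B == 2 * delta_AB
# for a single step on equal-sized batches, because gradients average linearly.
import torch

torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)           # shared "base model" weights
batch_a = (torch.randn(8, 4), torch.randn(8))    # two disjoint "finetuning" sets
batch_b = (torch.randn(8, 4), torch.randn(8))
lr = 1e-2

def step_delta(batch):
    x, y = batch
    loss = ((x @ w - y) ** 2).mean()
    return -lr * torch.autograd.grad(loss, w)[0]

delta_a = step_delta(batch_a)                    # "finetune" on A alone
delta_b = step_delta(batch_b)                    # "finetune" on B alone
combined = (torch.cat([batch_a[0], batch_b[0]]), torch.cat([batch_a[1], batch_b[1]]))
delta_ab = step_delta(combined)                  # one step on A and B together

print(torch.allclose(delta_a + delta_b, 2 * delta_ab, atol=1e-6))  # True
```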

devrandom
The main advantage is that you can immediately distribute fine-tunes to all of the copies. This is much higher bandwidth compared to our own low-bandwidth/high-effort knowledge dissemination methods.

The monolithic aspect may potentially be a disadvantage, but there are a couple of mitigations:

* AGIs are by definition generalists
* you can segment the population into specialists (see also this comment about MoE)
devrandom
  The purpose of a fine-tune is to "internalize" some knowledge - either because it is important to have implicit knowledge of it, or because you want to develop a skill. Although you may have a million instances executing tasks, the knowledge you want to internalize is likely much more sparse.  For example, if an instance is tasked with exploring a portion of a search space, and it doesn't find a solution in that portion, it can just summarize its finding in a few words.  There might not even be a reason to internalize this summary - it might be merged with other summaries for a more global view of the search landscape. So I don't see the need for millions of fine-tunes.  It seems more likely that you'd have periodic fine-tunes to internalize recent progress - maybe once an hour. The main point is that the single periodic fine-tune can be copied to all instances.  This ability to copy the fine-tune is the main advantage of instances being identical clones.
This is a linkpost for https://arxiv.org/abs/2404.19756

Abstract:

Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability. For accuracy, much smaller KANs can achieve comparable or better accuracy than much larger MLPs in data fitting and PDE solving. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful collaborators helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives for MLPs, opening opportunities for further improving today's deep learning models which rely heavily on MLPs.
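A minimal sketch of the core idea in code (mine, not the paper's implementation): every edge carries its own learnable univariate function, and a unit simply sums its incoming edge functions. Here each edge function is a learnable combination of fixed Gaussian bumps, standing in for the paper's B-spline parametrization; the real KANs add further machinery (e.g. spline grids and refinement).

```python
# Toy sketch of a KAN-style layer: each edge (i, j) carries its own learnable
# univariate function phi_ij, and outputs are sums over incoming edges:
#   y_j = sum_i phi_ij(x_i).
# Here phi_ij is a learnable mix of fixed Gaussian bumps (a B-spline stand-in).
import torch
import torch.nn as nn

class ToyKANLayer(nn.Module):
    def __init__(self, in_dim, out_dim, n_basis=8, x_range=(-2.0, 2.0)):
        super().__init__()
        self.register_buffer("centers", torch.linspace(*x_range, n_basis))
        self.width = (x_range[1] - x_range[0]) / n_basis
        # One coefficient vector per edge: shape (in_dim, out_dim, n_basis).
        self.coef = nn.Parameter(0.1 * torch.randn(in_dim, out_dim, n_basis))

    def forward(self, x):                      # x: (batch, in_dim)
        # Evaluate the basis at each input coordinate: (batch, in_dim, n_basis).
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.width) ** 2)
        # Sum phi_ij(x_i) over incoming edges i for each output unit j.
        return torch.einsum("bik,iok->bo", basis, self.coef)

layer = ToyKANLayer(in_dim=3, out_dim=2)
print(layer(torch.randn(5, 3)).shape)          # torch.Size([5, 2])
```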

Nathan Helm-Burger
Wow, this is super fascinating. A juicy tidbit:
Mateusz Bagiński
I mostly stopped hearing about catastrophic forgetting when Really Large Language Models became The Thing, so I figured that it's solvable by scale (likely conditional on some aspects of the training setup, idk, self-supervised predictive loss function?). Anthropic's work on Sleeper Agents seems like a very strong piece of evidence that this is the case. Still, if they're right that KANs don't have this problem at much smaller sizes than MLP-based NNs, that's very interesting. Nevertheless, I think talking about catastrophic forgetting as a "serious problem in modern ML" seems significantly misleading.
gwern

> (likely conditional on some aspects of the training setup, idk, self-supervised predictive loss function?)

Pretraining, specifically: https://gwern.net/doc/reinforcement-learning/meta-learning/continual-learning/index#scialom-et-al-2022-section

The intuition is that after pretraining, models can map new data into very efficient low-dimensional latents and have tons of free space / unused parameters. So you can easily prune them, but also easily specialize them with LoRA (because the sparsity is automatic, just learned) or just regular online SGD.

But yeah,... (read more)
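For reference, a minimal sketch of the LoRA idea mentioned above (shapes and hyperparameters are illustrative, not tied to any particular model): keep the pretrained weight frozen and learn a low-rank additive update.

```python
# Minimal LoRA sketch: freeze the pretrained linear layer and learn a low-rank
# update B @ A, so the specialized weight is effectively W + (alpha / r) * B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(4, 512)).shape)        # torch.Size([4, 512])
```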

Nathan Helm-Burger
I'm not so sure. You might be right, but I suspect that catastrophic forgetting may still be playing an important role in limiting the peak capabilities of an LLM of given size. Would it be possible to continue Llama3 8B's training much much longer and have it eventually outcompete Llama3 405B stopped at its normal training endpoint? I think probably not? And I suspect that if not, that part (but not all) of the reason would be catastrophic forgetting. Another part would be limited expressivity of smaller models, another thing which the KANs seem to help with.

Q. "Can you hold the door?" A. "Sure."

That's straightforward.

Q. "Can you play the violin at my wedding next year?" A. "Sure."

Colloquial language would imply not only am I willing and able to do this, I already know how to play the violin. Sometimes, what I want to answer is that I don't know how to play the violin, I'm willing to learn, but you should know I currently don't know.

Which I can say, it just takes more words.

If you're working at the intersection of cryptography, security, and AI, consider joining this upcoming workshop:

Foresight's AGI: Cryptography, Security & Multipolar Scenarios Workshop
May 14-15, all-day
The Institute, Salesforce Tower, San Francisco

Goals
To help AI development benefit humanity, Foresight Institute has held various workshops over the past years and launched a Grants Program that funds work on AI security risks, cryptography tools for safe AI, and safe multipolar AI scenarios. Our 2024 workshop invites leading researchers, entrepreneurs, and funders in this growing space to explore new tools and architectures that help humans and AIs cooperate securely. In addition to short presentations, working groups, and project development, we offer mentorship hours, open breakouts, and speaker & sponsor gatherings.

Questions we’ll address include

  1. Which challenges in AI alignment, AI security and AI coordination (in particular
...

I'll be there! Talk to me about boundaries and coordination/Goodness


There's also the problem of: what do you mean by "the human"? If you make an empowerment calculus that works for humans who are atomic & ideal agents, it probably breaks once you get a superintelligence who can likely mind-hack you into yourself valuing only power. It never forces you to abstain from giving up power, since you're perfectly capable of making different decisions; you just don't.

Another problem, which I like to think of as the "control panel of the universe" problem, is where the AI gives you the "control panel of the universe", bu... (read more)

Richard_Ngo
You can think of this as a way of getting around the problem of fully updated deference, because the AI is choosing a policy based on what that policy would have done in the full range of hypothetical situations, and so it never updates away from considering any given goal. The cost, of course, is that we don't know how to actually pin down these hypotheticals.
