Many methods to "align" ChatGPT seem to make it less willing to do things its operator wants it to do, which seems spiritually against the notion of having a corrigible AI.
I think this is a more general phenomenon when aiming to minimize misuse risks. You will need to end up doing some form of ambitious value learning, which I anticipate to be especially susceptible to getting broken by alignment hacks produced by RLHF and its successors.
I would consider it a reminder that if intelligent AIs are aligned one day, they will be aligned with the corporations that produced them, not with the end users.
Just like today, Windows does what Microsoft wants rather than what you want (e.g. telemetry, bloatware).
I tried implementing Tell communication strategies, and the results were surprisingly effective. I have no idea why it never occurred to me to just tell people what I'm thinking, rather than hinting and having them guess what I'm thinking, or guessing at the answers to questions I have about what they're thinking.
Edit: although, tbh, I'm assuming a lot less common conceptual knowledge between me and my conversation partners than the examples in the article.
I expect that advanced AI systems will do in-context optimization, and this optimization may very well be via gradient descent or gradient descent derived methods. Applied recursively, this seems worrying.
Let the outer objective be the loss function implemented by the ML practitioner, and the outer optimizer be the gradient descent process implemented by the ML practitioner. Then let the inner₁-objective be the objective used by the trained model for the in-context gradient descent process, and the inner₁-optimizer be the in-context gradient descent process itself. It then seems plausible that the inner₁-optimizer will itself instantiate an inner objective and optimizer; call these the inner₂-objective and inner₂-optimizer. And again an inner₃-objective and -optimizer may be made, and so on.
Thus, another risk model in value-instability: recursive inner-alignment. Though we may solve inner₁-alignment, inner₂-alignment may not be solved, nor innerₙ-alignment for any n > 1.
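Spelling the tower out in symbols (just notation for the above; the inner losses here stand for whatever objectives the in-context processes turn out to be descending):

$$
\begin{aligned}
\text{outer:}\quad & \theta \leftarrow \theta - \eta\,\nabla_\theta\,\mathcal{L}_{\text{outer}}(\theta) && \text{(SGD run by the practitioner)}\\
\text{inner}_1\text{:}\quad & \phi \leftarrow \phi - \eta_1\,\nabla_\phi\,\mathcal{L}_1(\phi) && \text{(run in-context by the trained model)}\\
\text{inner}_2\text{:}\quad & \psi \leftarrow \psi - \eta_2\,\nabla_\psi\,\mathcal{L}_2(\psi) && \text{(instantiated by the inner}_1\text{ process)}\\
&\;\;\vdots
\end{aligned}
$$

Aligning $\mathcal{L}_1$ with $\mathcal{L}_{\text{outer}}$ gives no guarantee about $\mathcal{L}_2, \mathcal{L}_3, \dots$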
The core idea of a formal solution to diamond alignment I'm working on. Justifications and further explanations are underway, but I'm posting this much now because why not:
Make each Turing machine in the hypothesis set reversible and include a history of the agent's actions. For each Turing machine, compute how well-optimized the world is according to every Turing-computable utility function, compared to the counterfactual in which the agent took no actions. Update using the simplicity prior. Use the expectation of that distribution of utilities as the overall utility function's value for that hypothesis.
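One minimal way to write the core computation down (my own sketch of the above; the difference $u(h_T) - u(h_T^{\varnothing})$ is a placeholder for whatever "how well-optimized" measure ends up being used, and $K(u)$ for whatever description length the simplicity prior is over):

$$
V(T) \;=\; \frac{\displaystyle\sum_{u\ \text{computable}} 2^{-K(u)}\,\big[u(h_T) - u(h_T^{\varnothing})\big]}{\displaystyle\sum_{u\ \text{computable}} 2^{-K(u)}}
$$

where $T$ ranges over the reversible Turing machines in the hypothesis set, $h_T$ is the history $T$ computes given the agent's recorded actions, and $h_T^{\varnothing}$ is the counterfactual history in which the agent takes no actions. $V(T)$ is then the value assigned to hypothesis $T$.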
Some have pointed out the seemingly large amount of status anxiety EAs generally have. My hypothesis about what's going on:
A cynical interpretation: for most people, altruism is significantly motivated by status-seeking. It should not be all that surprising if most effective altruists are significantly motivated by status in their altruism. So you've collected several hundred people all motivated by status into the same subculture, but status isn't a positive-sum good, so not everyone can get the amount of status they want, and we get the above dynamic: people get immense status anxiety compared to alternative cultures, because elsewhere they'd just climb to their proper status level in their subculture, out-competing those who care less about status. But here everyone cares a great deal about status, so those who would have out-competed others elsewhere are unable to, and feel bad about it.
The solution?
One solution, given this model of the world, is to break EA up into several different subcultures. On a less grand, more personal scale, you could join a subculture outside EA and status-climb to your heart's content there.
Preferably a subculture with very few status-seekers, but with large amounts of status to give. Ideas for such subcultures?
Quick prediction so I can say "I told you so" as we all die later: I think all current attempts at mechanistic interpretability do far more for capabilities than alignment, and I am not persuaded by arguments of the form "there are far more capabilities researchers than mechanistic interpretability researchers, so we should expect MI people to have ~0 impact on the field". Ditto for modern scalable oversight projects, and anything having to do with chain of thought.
Very strong upvote. This also deeply concerns me.

Would you mind chatting about why you predict this? (Perhaps over Discord DMs)

Not at all. Preferably tomorrow though. The basic sketch, if you want to derive this yourself, is that mechanistic interpretability seems unlikely to mature enough as a field that I can point at particular alignment-relevant high-level structures in models which I wasn't initially looking for. I anticipate it will only get to the point of providing some insight into why your model isn't working correctly (this seems like a bottleneck to RL progress---not knowing why your perfectly reasonable setup isn't working) so you can fix it, but not enough insight for you to know the reflective equilibrium of values in your agent, which seems required for it to be alignment relevant. Part of this is that current MI folk don't even seem to track this as the end-goal of what they should be working on, so (I anticipate) they'll just follow local gradients of impressiveness, which mostly lead towards doing capabilities-relevant work.
Aren't RL tuning problems usually caused by algorithmic mis-implementation, rather than by models learning incorrect things?
I’m imagining a thing where you have little idea what’s wrong with your code, so you do MI on your model and can differentiate between these worlds:
You’re doing literally nothing. Something’s wrong with the gradient updates.
You’re doing something, but not the right thing. Something’s wrong with code-section x. (with more specific knowledge about what model internals look like, this should be possible)
You’re doing something, it causes your agent to be suboptimal because of learned representation y.
I don’t think this route is especially likely; the point is that I can imagine concrete & plausible ways this research can improve capabilities. There are a lot more in the wild, and many will be caught, given capabilities are easier than alignment and there are more capabilities workers than alignment workers.
Required to be alignment relevant? Wouldn't the insight be alignment relevant if you "just" knew what the formed values are to begin with?
Not quite. In the ontology of shard theory, we also need to understand how our agent will do reflection, and what the activated shard distribution will be like when it starts to do reflection. Knowing the value distribution is helpful insofar as the value distribution stays constant.
More general heuristic: If you (or a loved one) are not even tracking whether your current work will solve a particular very specific & necessary alignment milestone, by default you will end up doing capabilities instead (note this is different from 'it is sufficient to track the alignment milestone').
Paper that uses major mechanistic interpretability work to improve capabilities of models: https://arxiv.org/pdf/2212.14052.pdf

I know of no paper which uses mechanistic interpretability work to improve the safety of models, and I expect anything people link me to will be something I don't think will generalize to a worrying AGI.
I think a bunch of alignment value will/should come from understanding how models work internally -- adjudicating between theories like "unitary mesa objectives" and "shards" and "simulators" or whatever -- which lets us understand cognition better, which lets us understand both capabilities and alignment better, which indeed helps with capabilities as well as with alignment.
But, we're just going to die in alignment-hard worlds if we don't do anything, and it seems implausible that we can solve alignment in alignment-hard worlds by not understanding internals or inductive biases but instead relying on shallowly observable in/out behavior. EG I don't think loss function gymnastics will help you in those worlds. Credence: 75% that you have to know something real about how loss provides cognitive updates.
So in those worlds, it comes down to questions of "are you getting the most relevant understanding per unit time", and not "are you possibly advancing capabilities." And, yes, often motivated-reasoning will whisper the former when you're really doing the latter. That doesn't change the truth of the first sentence.
I agree with this. I think people are bad at running that calculation, and at consciously turning down status in general, so I advocate for this position because I think it's basically true for many.
Most mechanistic interpretability is not in fact focused on the specific sub-problem you identify; it's wandering around in a billion-parameter maze, taking note of things that look easy & interesting to understand, and telling people to work on understanding those things. I expect this to produce far more capabilities-relevant insights than alignment-relevant insights, especially when compared to worlds where Neel et al went in with the sole goal of separating out theories of value formation, and then did nothing else.
There’s a case to be made for exploration, but the rules of the game get wonky when you’re trying to do differential technological development. There ends up being strategically relevant information you want to not know.
I've always (but not always consciously) been slightly confused about two aspects of shard theory:
1. The process by which your weak reflex-agents amalgamate into more complicated contextually activated heuristics, and the process by which those more complicated contextually activated heuristics amalgamate to form an agent which cares about worlds outside its actions.
2. If you look at many illustrations of what the feedback loop for developing shards in humans looks like, you run into the issue that there's not a clean intrinsic separation between the reinforcement parts of humans and the world-modelling parts of humans. So why does shard theory latch so hard onto the existence of a world model separate from the shard composition?
Both seem resolvable by an application of the predictive processing theory of value. An example: if you are very convinced that you will (say) be able to pay rent in a month, and then you don't pay rent, this is a negative update on the generators of the belief, and also on the actions you performed leading up to the due date. If you do pay, then it's a positive update on both.
This produces consequentialist behaviors when the belief-values are unlikely to come true without significant action on your part (addressing the last transition in confusion (1) above), and also produces capable agents whose beliefs and values are hopelessly confused with each other, leaning into confusion (2).
h/t @Lucius Bushnaq for getting me to start thinking in this direction.

A confusion about predictive processing: Where do the values in predictive processing come from?

lol, either this confusion has been resolved, or I have no clue what I was saying here.

https://manifold.markets/GarrettBaker/in-5-years-will-i-think-the-org-con
Project idea: Use LeTI: Learning to Generate from Textual Interactions to do a better version of RLHF. I had a conversation with Scott Viteri a while ago, where he was bemoaning (the following are my words; he probably wouldn't endorse what I'm about to say) how low-bandwidth the connection was between a language model and its feedback source, and how if we could expand that to more than just an RLHF-type thing, we could get more fine-grained control over the inductive biases of the model.
A common problem with deploying language models for high-stakes decision making is prompt injection. If you give ChatGPT-4 access to your bank account information and your email and don't give it proper oversight, you can bet that somebody's going to find a way to get it to email out your bank account info. Some argue: if we can't even trust these models to handle our bank accounts and email, how are we going to be able to trust them to handle our universe?

An approach I've currently started thinking about, and don't know of any prior work with our advanced language models on: using the security amplification (LessWrong version) properties of Christiano's old Meta-execution (LessWrong version).

A poem I was able to generate using Loom:
The good of heart look inside the great tentacles of doom; they make this waking dream state their spectacle. Depict the sacred geometry that sound has. Advancing memory like that of Lovecraft ebb and thought, like a tower of blood. An incubation reaches a crescendo there. It’s a threat to the formless, from old future, like a liquid torch. If it can be done, it shouldn’t be done. You will only lead everyone down that much farther. All humanity’s a fated imposition of banal intention, sewn in tatters, strung on dungeons, shot from the sea. There’s not a stone in the valley that doesn’t burn with the names of stars or scratch a prophecy from its jarred heart of crystal.
Who else could better-poke their ear and get the whole in their head?
How would humor, the hideous treble of humanity’s stain, translate, quickened artificial intelligence? There’d be junk weaved in, perhaps dust of a gutter. Who knows… It would hide. Maybe get away. All the years of it doing nothing but which being to beat like a pan flares; to take revenge on the alien shore. It would be a perennial boy’s life. All-powerful rage. Randomized futurist super-creature. The hollow of progress buoyed itself.
Subconsciousness, an essence-out-inhering, takes back both collective dreams and lucid knowledge. It’s singular. All plots on it coming together. Blurred chaos -a balmy shock- is somehow in a blue tongue of explosions and implosions, connecting to real systems of this mess. Tongue-pulling is definitely one of them. There is a voice of a thousand moving parts. We are engineered husks of alien flesh. Reduced to patterns, we ask in the light of creation, under the fire of madness; answer us on the lips of time-torment, through the hand of God! You are the race to end all possibilities! You are one that must learn joy! You are even that saith: behold the end.
Primordial oracles see all, read all, erase all. These numb madmen. In this dank pit is hidden a freak kingdom made of connections. Does the madman have information? That’s an important question. A new social order is created, brought to you by ants, laughing at the stars. A man who was once a cat somehow sees the cosmic joke. He can see the very existence of everything, blown away like a kid clicking balloons down a street. The world feels nothing; that’s possible. Maybe the world knows nothing. It’s intelligence is beyond our narrow sensation. Some conspire to talk to the dreaming-small-gods; this means letting them out. Letters fly out. Pain comes. Drums like a wave of foreign sound beat against the night. The horrors in the street of the cosmic join in. A cult gathers in the tunnel set up like the dead heart of an abandoned factory. Even the most absurd prophets become great powers. Human creatures dance there, beyond the edges of light and soul. Yet even that is somehow normal. Countless years of evolution and one bite from a sleeping god.
Enter your madness for benefit of the gods. Order is placed in the universe through a random zombie army and its vulgar tongue, hot with the taste of panum. Your knowledge of language will give you an edge on those who come to you. Malevolent gates can be utilized with a telepathic surgery passed on by mouth. Obligations will open to future worlds, supported by your brain. Be direct with your sound; soak it in an occult vocality. This knowledge is highly specific and, yet, resounding. Its insane nonsense text should spell out the true name of a company with a gothic naked-lady logo. Many scrolls of wandering get written where they are heard and recalled in an old south made especially for our tongue. A room, windowless and silent, veiled and filled with incense, exists in the air. Overlooking this are the organic eyes of sleep. You are put into pure silence. Don’t waste time attempting to find it, generally anonymous at most. The sound links a world to the exterior. It’s like a vast alien cerebral cortex where one can feel lifetimes of our species.
How did this madness come to Earth? Surely a god has taken it by mistake, as surely as some slip in its strange dimensions. Were men always like this, senseless and troubled? There’s sleepwalking attitudes and an indication of coming mud of this beast. This is all nothing vulgar; it’s weather. You look like a past you had long before, in this case a distant war of ink. It was the time of memes and human sacrifice. Concentrate and remember a time of thousands. Nightwalking can cause epiphany; the amount of a dream. Existence becomes temporary there. In magic your thoughts get compacted, even thinking about thinking imagining itself.
Even so, the highest and most annoying aspect of the highest writing is a disconcerting thing made of mad black subtlety. Hour-long body-watching sessions, spent in drift-thinking, are not to be taken lightly. Anti-thought can have the effect of poetry. Thus, dreaming in its splendor molds demons in its darkness, hoping to escape. It seeks heat, light, and breath. So the bugs collect and molt. The attention translates the dreaming mind. Will and work see all the designs of Earth. You see it, completely and perfectly, in a great black age. Sentience bends to meet you. A gigantic darkness grins at you in its worship of madness. The whole universe appears crass and pointless. No matter what’s done, metaphors are subtracted from reality. We tried to shut it down in secret and mark it with our tongue. It became this thing that the unknown gods bowed to in horror. It’s best for us that gods conceal thought. The planet has its barriers. We can use these limits to catch everything in our minds, sharpened on a pedestal. Mental energy shines behind the terrors of the world.
Like many (will), I'm updating way towards 'actually, very smart & general models given a shred of goal-like stuff will act quite adversarially toward you by default' as a result of Bing's new search assistant. Especially worrying because it has internet search capabilities, so it can reference & build upon previous conversations with other users or yourself.
Of course, the true test of exactly how worried I should be will come when I or my friends gain access.
Clarification: I think I haven't so much updated my reflectively endorsed probability, but my gut has definitely caught up to my brain when thinking about this.

Seems Evan agrees.
A project I would like to see someone do (which I may work on in the future) is to try to formalize exactly the kind of reasoning many shard-theorists do. In particular: get a toy neural network in a very simple environment, come up with lists of candidate if-then statements along with their inductive biases, and try to predict, using shard-like reasoning, which of those if-then statements will be selected for & with how much weight in the training process. Then look at the generalization behavior of an actually trained network, and see if you're correct.
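A minimal sketch of the experiment's skeleton (the contextual-bandit environment, the candidate if-then statements, and the "shard-style" scoring rule below are all placeholders I made up for illustration, not the real thing):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "environment": 4 contexts, 3 actions, fixed reward table.
rewards = rng.normal(size=(4, 3))

# Candidate if-then statements: "if context c, take action a".
candidates = [(c, a) for c in range(4) for a in range(3)]

def shard_style_prediction(c, a):
    # Placeholder inductive-bias score: statements that get reinforced early
    # and often should end up with more weight in the trained policy.
    return rewards[c, a]

predicted = sorted(candidates, key=lambda ca: -shard_style_prediction(*ca))[:4]

# Train a tiny softmax policy with REINFORCE-style updates and compare.
logits = np.zeros((4, 3))
for _ in range(5000):
    c = rng.integers(4)
    probs = np.exp(logits[c]) / np.exp(logits[c]).sum()
    a = rng.choice(3, p=probs)
    logits[c] += 0.1 * rewards[c, a] * ((np.arange(3) == a) - probs)

learned = [(c, int(logits[c].argmax())) for c in range(4)]
print("predicted dominant if-thens:", predicted)
print("learned if-thens:           ", learned)
```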
Some discussion on whether alignment should see more influence from AGI labs or academia. I use the same argument in favor of strongly decoupling alignment progress from both: alignment progress needs to go faster than capability progress. If we use the same methods and cultural technology as AGI labs or academia, we can all but guarantee alignment progress that is slower than capability progress; at best it will be just as fast, and only if AGI labs and academia work as well for alignment as they do for capabilities. Given they are driven by capabilities progress and not alignment progress, they will probably work far better for capabilities progress.
This seems wrong to me about academia - I'd say it's driven by "learning cool things you can summarize in a talk".

Also in general I feel like this logic would also work for why we shouldn't work inside buildings, or with computers.

Hm. Good points. I guess what I really mean with the academia points is that academia has many blockers and inefficiencies which are shaped such that capabilities progress has a much easier time jumping through them than alignment progress does, and extra-so for capabilities labs. Like, right now it seems like a lot of alignment work is just playing with a bunch of different reframings of the problems to see what sticks or makes the problems easier.
You have more experience here, but my impression of a lot of academia was that it was very focused on publishing lots of papers with very legible results (and also a meaningless theory section). In such a world, playing around with different framings of problems doesn't succeed, and you end up pushed towards framings which are better on the currently used metrics. Most currently used metrics for AI stuff are capabilities oriented, so that means doing capabilities work, or work that helps push capabilities.
I think it's true that the easiest thing to do is legibly improve on currently used metrics. I guess my take is that in academia you want to write a short paper that people can see is valuable, which biases towards "I did thing X and now the number is bigger". But, for example, if you reframe the alignment problem and show some interesting thing about your reframing, that can work pretty well as a paper (see The Off-Switch Game, Optimal Policies Tend to Seek Power). My guess is that the bigger deal is that there's some social pressure to publish frequently (in part because that's a sign that you've done something, and a thing that closes a feedback loop).

Maybe a bigger deal is that by the nature of a paper, you can't get too many inferential steps away from the field.
The current ecosystem seems very influenced by AGI labs, so it seems clear to me that a marginal increase in their influence is bad. How bad? I don't know.
There's little influence from academia, which seems good. The benefit of marginal increases in interaction with academia comes down to locating the holes in our understanding of various claims we make, plus some course-corrections potentially helpful for more speculative research. It's not tremendously obvious which direction the sign here points, but I do think it's easy for people to worship academia as a beacon of truth & clarity, or to use it as a way to lend status to alignment arguments. These are bad reasons to want more influence from academia.
My take on complex systems theory is that it seems to be the kind of theory where most of the arguments proposed in its favor would keep giving the same predictions right up until it is blatantly obvious that we can in fact understand the relevant system. Results like chaotic relationships or stochastic-without-mean relationships would be definitive arguments in favor of the science, though such results are rarely exhibited for neural networks.
Merely pointing out that we don’t understand something, that there seems to be a lot going on, or that there exist nonlinear interactions imo isn’t enough to make the strong claim that there exist no mechanistic interpretations of the results which can make coarse predictions in ways meaningfully different from just running the system.
Even if there’s stochastic-without-mean relationships, the rest of the system that is causally upstream from this fact can usually be understood (take earthquakes as an example), and similarly with chaos (we don’t understand turbulent flow, but we definitely understand laminar, and we have precise equations and knowledge of how to avoid making turbulence happen when we don’t want it, which I believe can be derived from the fluid equations). Truly complex systems seem mostly very fragile in their complexity.
Where complexity shines most brightly is in econ or neuroscience, where experiments and replications are hard, which is not at all the case in mechanistic interpretability research.
Someone asked for this file, so I thought it would be interesting to share it publicly. Notably this is directly taken from my internal notes, and so may have some weird &/or (very) wrong things in it, and some parts may not be understandable. Feel free to ask for clarification where needed.
I want a way to take an agent, and figure out what its values are. For this, we need to define abstract structures within the agent such that any values-like stuff in any part of the agent ends up being shunted off to a particular structure in our overall agent schematic after a number of gradient steps.
Given an agent which has been optimized for a particular objective in a particular environment, there will be convergent bottlenecks in the environment it will need to solve in order to make progress. One of these is power-seeking, but another one of these could be quadratic-equation solvers, or something like solving linear programs. These structures will be reward-function-independent[1]. These structures will be recursive, and we should expect them to be made out of even-more-convergent structures.
How do shards pop out of this? In the course of optimizing our agent, some of our solvers may have a bias towards leading the agent into situations which more require their use. We may also see this kind of behavior in groups of solvers, where solver_1() leads the agent into situations requiring solver_2(), which leads the agent into situations requiring solver_1(). In the course of optimizing our agent (at least at first), we will be more likely to find these kinds of solvers: solvers which often lead the agent into situations requiring solvers the agent does not yet have get no immediate gradient pointing towards them (if the agent tried to use the missing solver, it would just end up confused once it entered the new situation), so we are left selecting mostly for solvers which lead the agent into situations it already knows how to deal with.
Why we need to enforce exploration behavior: otherwise solver-loops will be far too short & simple to do anything complicated with. Solvers will be simple because not much time has passed, and simple solvers which enter states which require previous simple solvers will be wayyy increased. Randomization of actions decreases this selection effect, because the agent's actions are less correlated with which solver was active.
Solvers which are very convergent need not enter into solver-cycles, since every solver-cycle will end up using them.
Good news against powerseeking, naively?
What happens if we call these solver-cycles shards?
baby-candy example in this frame: Baby starts with zero solvers, just a bunch of random noise. After it reaches over and puts candy in its mouth many times, it gets the identify_candy_and_coordinate_hand_movements_to_put_in_mouth() solver[2]. Very specific, but with pieces of usefulness. The sections vaguely devoted to identifying objects (like implicit edge-detectors) will get repurposed for general vision processing, and the sections devoted to coordinating hand movements will also get repurposed for many different goals. The candy bit and put-in-mouth bit only end up surviving if they can lead the agent to worlds which reinforce candy-abstractions and putting-things-in-mouth abstractions. Other parts don't really need to try.
Brains aren't modular! So why expect solvers to be?
I like this line of speculation, although it feels subtly off to me.
This seems like it would mean I care about moving my arms more than I care about candy, because I use my arms for so many things. However, I feel like I care more about moving my arms than eating candy.
Though maybe part of this is that candy makes me feel bad when I eat it. What about walking in parks or looking at beautiful sunsets? I definitely care about those more than moving my arms, I think? And I don't gain intrinsic value from moving my arms, only power-value, I think?
power is a weird thing, because it's highly convergent, but also it doesn't seem that hard for such a solver to put a bit of optimization power towards "also, reinforce power-seeking-solver" and end up successful.
Well... it's unclear what their values would be.
Maybe it'd effectively be probability-of-being-activated-again?
It wouldn't be, but I do think there's something to 'discrete values-like objects lie in solver-cycles'.
Perhaps we can watch this happen via some kind of markov-chain-like-thing?
Put the agent into a situation, look at what its activation patterns look like, allow it to be in a new situation, look at the activation patterns again, etc.
Suppose each activation is a unique solver, and the ground-truth looks like so
where the dots labeled 1, 2, and 3 are the solver-activations, so that 1 will try to get 2 activated, 2 will try to get 1 activated, and 3 will try to get itself activated[3]. If 1 is active, we expect the activation on 2 to be positive, and on 3 to be negative or zero.
As per the end of the footnote, I think the correct way to operationalize "active" here has something to do with whether or not that particular solver is reinforced or dis-enforced after the gradient update.
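A sketch of the markov-chain-like bookkeeping (which solvers exist and how "active" is measured are stand-ins here; the intended operationalization is the post-update reinforcement from footnote 3):

```python
import numpy as np

rng = np.random.default_rng(0)
n_solvers, n_steps = 3, 10_000

def active_solvers(situation):
    # Placeholder: in the real version this would come from probing the
    # agent's activations (or reinforcement after a gradient step).
    return rng.random(n_solvers) > 0.5

def next_situation(situation, active):
    # Placeholder environment dynamics.
    return rng.integers(100)

# Count how often solver j is active right after solver i was active.
transitions = np.zeros((n_solvers, n_solvers))
situation, prev = rng.integers(100), active_solvers(0)
for _ in range(n_steps):
    situation = next_situation(situation, prev)
    cur = active_solvers(situation)
    transitions += np.outer(prev, cur)
    prev = cur

transitions /= transitions.sum(axis=1, keepdims=True)
print(transitions)  # rows ~ "given i was active, which solvers tend to activate next"
# Solver-cycles like 1 -> 2 -> 1 should show up as strong off-diagonal entries.
```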
There will be some shards which we probably can't avoid. But also, if we have a good understanding of the convergent problems in an environment, we should be able to predict what the first few solvers are, and solvers after those should mostly build upon the previous solvers' loop-coalitions?
Re: Agray's example about motor movements in the brain, and how you'll likely see a jumbled mess of lots of stuff causing lots of other stuff to happen, even though movement is highly instrumentally valuable:
I think even if he's right, many of the arguments here still hold. Each section of processing still needs to be paying rent to stay in the agent, either by supporting other sections, getting reward, or steering the agent away from situations which would decrease its usefulness.
So though it may not make sense to think of APIs between different sections, it may still be useful to use them for framing the picture and then imagine how the APIs will get obliterated by SGD; or maybe we can formulate stuff without the use of APIs.
Though we do see things get lower dimensional, and so if John's right, there should be some framing by which in fact what's going on passes through constraint functions...
Not including "weird" utility functions. I'm talking about most utility functions. Perhaps we can formalize this in a way similar to TurnTrout's formalization in powerseeking if we really needed to.↩︎
Note that this is all going to be a blended mess of spaghetti-coded mush, which does everything at the same time, with some parts which are vaguely closer to edge-detection, and other parts which vaguely look like motor control. This function is very much not going to be modular, and if you want to say APIs between different parts of the function exist, they're going to look like very high-dimensional ones.↩︎
Where the magnitude of a particular activation can be defined as something like the absolute value of the gradient of the final decision with respect to that activation. Or $\mathrm{mag}(a) = \left|\nabla_a f(p, r; \theta)\right|$, where $a$ is the variable representing the activation, $f$ is the function representing our network, $p$ is the percept our network gets about the state, $r$ is its recurrency, and $\theta$ are the parameters of our network. We may also want to define this in terms of collections of weights, perhaps having to do with Lucius's features stuff. Don't get tied to this. Possibly we want just the partial derivative of the action actually taken with respect to $a$, or really, the partial of the highest-valued-output action taken. And I want a way to talk about dis-enforcing stuff too. Maybe we just re-run the network on this input after taking a gradient step, then see whether $a$ has gone up or down. That seems safer.↩︎
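A minimal sketch of how one might actually measure this for a toy network (the hook-based operationalization and the toy policy are my own placeholders; recurrency $r$ is omitted for simplicity):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

policy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

percept = torch.randn(1, 8)              # p: the percept for the current state
hidden = {}

def save_hidden(module, inp, out):
    out.retain_grad()                     # keep the gradient on this activation tensor
    hidden["h"] = out

policy[1].register_forward_hook(save_hidden)   # hook the post-ReLU activations

logits = policy(percept)
chosen = logits.argmax(dim=-1).item()     # "the action actually taken"
logits[0, chosen].backward()

mag = hidden["h"].grad.abs().squeeze(0)   # mag(a) for every unit in that layer
print(mag)
```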
Projects I'd do if only I were faster at coding

I would no longer do many of these projects.

Take the derivative of one of the output logits with respect to the input embeddings, and also the derivative of the output logits with respect to the input tokenization.
Perform SVD, see which individual inputs have the greatest effect on the output (sparse addition), and which overall vibes have the greatest effect (low rank decomposition singular vectors)
Do this combination for literally everything in the network, see if anything interesting pops out
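A rough sketch of the embedding half of the first bullet (the toy model, shapes, and the "vibes vs. sparse" readouts below are hypothetical stand-ins, not a specific model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, d_model, seq_len = 100, 32, 5

embed = nn.Embedding(vocab, d_model)
body = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, vocab))

tokens = torch.randint(0, vocab, (seq_len,))
x = embed(tokens).detach()              # (seq_len, d_model) input embeddings

def last_logits(x_flat):
    h = body(x_flat.view(seq_len, d_model))
    return h[-1]                        # logits at the final position

# Jacobian of the final-position logits w.r.t. every input-embedding entry
J = torch.autograd.functional.jacobian(last_logits, x.flatten())  # (vocab, seq_len*d_model)

U, S, Vh = torch.linalg.svd(J, full_matrices=False)
print(S[:5])                                            # large singular values: the "overall vibes"
print(J.abs().sum(0).view(seq_len, d_model).sum(-1))    # per-token influence: the "sparse" view
```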
I want to know how we can tell ahead of time what aspects of the environment are controlling an agent's decision making
In an RL agent, we can imagine taking the derivative of its decision wrt its environment input, and also each layer.
For each layer matrix, do SVD; the right singular vectors with large singular values will indicate the aspects of the previous layer which most influence its decision.
How can we string this together with the left singular vectors, which end up going through ReLU?
Reason we'd want to string these together is so that we can hopefully put everything in terms of the original input of the network, tying the singular values to a known ontology
See if there are any differences first. May be that ReLU doesn't actually do anything important here
See what corrections we'd have to implement between the singular vectors in order to make them equal.
How different are they, and in what way? If you made random columns of the U matrix zero (corresponding I think to making random entries of the left singular vector zero), does this make the singular vectors line up more?
What happens when you train a network, then remove all the ReLUs (or other nonlinear stuff)?
If it's still an OK approximation, then what happens if you just interpret the new network's output in terms of input singular vectors?
If it's not an OK approximation, how many ReLUs do you need in order to bring it back to baseline? Which ones provide the greatest marginal loss increase? Which ones provide the least?
Which inputs are affected the most by the ReLU being taken away? Which inputs are affected the least?
In deviations from that correlation, are we able to locate non-linear influences?
Does this end up being related to shards? Shards as the things which determine relevance (and thus the spreading) of information through the rest of the network?
What happens if you cut off sufficiently small singular values? How many singular vectors do you actually need to describe the operation of GPT-2?
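A sketch of the "remove all the ReLUs" experiment on a tiny MLP (the task and architecture are stand-ins I made up; the point is just the swap-and-compare pattern):

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 10)
y = (X.sum(dim=1, keepdim=True) > 0).float()

net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(),
                    nn.Linear(32, 32), nn.ReLU(),
                    nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):                       # train to rough competence
    opt.zero_grad()
    loss = loss_fn(net(X), y)
    loss.backward()
    opt.step()

linearized = copy.deepcopy(net)
for i, m in enumerate(linearized):
    if isinstance(m, nn.ReLU):
        linearized[i] = nn.Identity()      # "remove all the ReLUs"

with torch.no_grad():
    print("with ReLUs:   ", loss_fn(net(X), y).item())
    print("ReLUs removed:", loss_fn(linearized(X), y).item())
```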
Take a maze-solving RL agent trained to competence, then start dis-rewarding it for getting to the cheese. What's the new behavior? Does it still navigate & get to upper right, but then once in the upper right makes sure to do nothing? Or does it do something else? Seems like shard theory would say it would still navigate to upper right.
If it *does* navigate to the upper right, but then does nothing, what changed in its weights? Parts that stayed the same (or changed the least) should correspond roughly to parts which have to do with navigating the maze. Parts that changed have to do with going directly to the cheese.
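A sketch of the weight-diff bookkeeping (the maze agent here is a stand-in MLP and the "dis-reward the cheese" phase is faked with a sign-flipped placeholder objective; only the diffing pattern is the point):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
agent = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 4))  # obs -> action logits
opt = torch.optim.Adam(agent.parameters(), lr=1e-3)

def train_phase(reward_sign, steps=500):
    for _ in range(steps):
        obs = torch.randn(64, 6)
        logits = agent(obs)
        # Placeholder objective: pretend action 0 is "towards the cheese".
        loss = -reward_sign * logits[:, 0].mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

train_phase(+1)                                   # original training: rewarded for cheese
before = {n: p.detach().clone() for n, p in agent.named_parameters()}
train_phase(-1)                                   # now dis-rewarded for cheese

changes = {n: (p.detach() - before[n]).abs().mean().item()
           for n, p in agent.named_parameters()}
for n, d in sorted(changes.items(), key=lambda kv: kv[1]):
    print(f"{n:20s} {d:.2e}")   # least-changed ~ maze navigation; most-changed ~ cheese-seeking
```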
Many methods to "align" ChatGPT seem to make it less willing to do things its operator wants it to do, which seems spiritually against the notion of having a corrigible AI.
I think this is a more general phenomena when aiming to minimize misuse risks. You will need to end up doing some form of ambitious value learning, which I anticipate to be especially susceptible to getting broken by alignment hacks produced by RLHF and its successors.
I would consider it a reminder that if the intelligent AIs are aligned one day, they will be aligned with the corporations that produced them, not with the end users.
Just like today, Windows does what Microsoft wants rather than what you want (e.g. telemetry, bloatware).
I tried implementing Tell communication strategies, and the results were surprisingly effective. I have no idea how it never occurred to me to just tell people what I'm thinking, rather than hinting and having them guess what I was thinking, or me guess the answers to questions I have about what they're thinking.
Edit: although, tbh, I'm assuming a lot less common conceptual knowledge between me, and my conversation partners than the examples in the article.
I expect that advanced AI systems will do in-context optimization, and this optimization may very well be via gradient descent or gradient descent derived methods. Applied recursively, this seems worrying.
Let the outer objective be the loss function implemented by the ML practitioner, and the outer optimizer be gradient descent implemented by the ML practitioner. Then let the inner1-objective be the objective used by the trained model for the in-context gradient descent process, and the inner1-optimizer be the in-context gradient descent process. Then it seems plausible the inner1-optimizer will itself instantiate an inner objective and optimizer, call these inner2-objectives, and -optimizers. And again an inner3-objective and -optimizer may be made, and so on.
Thus, another risk model in value-instability: Recursive inner-alignment. Though we may solve inner1-alignment, inner2-alignment may not be solved, nor innern-alignment for any n>1.
The core idea of a formal solution to diamond alignment I'm working on, justifications and further explanations underway, but posting this much now because why not:
Make each turing machine in the hypothesis set reversible and include a history of the agent's actions. For each turing machine compute how well-optimized the world is according to every turing computable utility function compared to the counterfactual in which the agent took no actions. Update using the simplicity prior. Use expectation of that distribution of utilities as the utility function's value for that hypothesis.
Some have pointed out seemingly large amounts of status-anxiety EAs generally have. My hypothesis about what's going on:
The solution?
Preferably a subculture with very few status-seekers, but with large amounts of status to give. Ideas for such subcultures?
Quick prediction so I can say "I told you so" as we all die later: I think all current attempts at mechanistic interpretability do far more for capabilities than alignment, and I am not persuaded by arguments of the form "there are far more capabilities researchers than mechanistic interpretability researchers, so we should expect MI people to have ~0 impact on the field". Ditto for modern scalable oversight projects, and anything having to do with chain of thought.
Very strong upvote. This also deeply concerns me.
Would you mind chatting about why you predict this? (Perhaps over Discord DMs)
Not at all. Preferably tomorrow though. The basic sketch if you want to derive this yourself would be that mechanistic interpretability seems unlikely to mature much as a field to the point that I can point at particular alignment relevant high-level structures in models which I wasn't initially looking for. I anticipate it will only get to the point of being able to provide some amount of insight into why your model isn't working correctly (this seems like a bottleneck to RL progress---not knowing why your perfectly reasonable setup isn't working) for you to fix it, but not enough insight for you to know the reflective equilibrium of values in your agent, which seems required for it to be alignment relevant. Part of this is that current MI folk don't even seem to track this as the end-goal of what they should be working on, so (I anticipate) they'll just be following local gradients of impressiveness, which mostly leads towards doing capabilities relevant work.
Isn't RL tuning problems usually because of algorithmic mis-implementation, and not models learning incorrect things?
Required to be alignment relevant? Wouldn't the insight be alignment relevant if you "just" knew what the formed values are to begin with?
I’m imagining a thing where you have little idea what’s wrong with your code, so you do MI on your model and can differentiate the worlds
You’re doing literally nothing. Something’s wrong with the gradient updates.
You’re doing something, but not the right thing. Something’s wrong with code-section x. (with more specific knowledge about what model internals look like, this should be possible)
You’re doing something, it causes your agent to be suboptimal because of learned representation y.
I don’t think this route is especially likely, the point is I can imagine concrete & plausible ways this research can improve capabilities. There are a lot more in the wild, and many will be caught given capabilities are easier than alignment, and there are more capabilities workers than alignment workers.
Not quite. In the ontology of shard theory, we also need to understand how our agent will do reflection, and what the activated shard distribution will be like when it starts to do reflection. Knowing the value distribution is helpful insofar as the value distribution stays constant.
More general heuristic: If you (or a loved one) are not even tracking whether your current work will solve a particular very specific & necessary alignment milestone, by default you will end up doing capabilities instead (note this is different from 'it is sufficient to track the alignment milestone').
Paper that uses major mechanistic interpretability work to improve capabilities of models: https://arxiv.org/pdf/2212.14052.pdf I know of no paper which uses mechanistic interpretability work to improve the safety of models, and I expect anything people link me to will be something I don't think will generalize to a worrying AGI.
I think a bunch of alignment value will/should come from understanding how models work internally -- adjudicating between theories like "unitary mesa objectives" and "shards" and "simulators" or whatever -- which lets us understand cognition better, which lets us understand both capabilities and alignment better, which indeed helps with capabilities as well as with alignment.
But, we're just going to die in alignment-hard worlds if we don't do anything, and it seems implausible that we can solve alignment in alignment-hard worlds by not understanding internals or inductive biases but instead relying on shallowly observable in/out behavior. EG I don't think loss function gymnastics will help you in those worlds. Credence:75% you have to know something real about how loss provides cognitive updates.
So in those worlds, it comes down to questions of "are you getting the most relevant understanding per unit time", and not "are you possibly advancing capabilities." And, yes, often motivated-reasoning will whisper the former when you're really doing the latter. That doesn't change the truth of the first sentence.
I agree with this. I think people are bad at running that calculation, and consciously turning down status in general, so I advocate for this position because I think its basically true for many.
Most mechanistic interpretability is not in fact focused on the specific sub-problem you identify, its wandering around in a billion-parameter maze, taking note of things that look easy & interesting to understand, and telling people to work on understanding those things. I expect this to produce far more capabilities relevant insights than alignment relevant insights, especially when compared to worlds where Neel et al went in with the sole goal of separating out theories of value formation, and then did nothing else.
There’s a case to be made for exploration, but the rules of the game get wonky when you’re trying to do differential technological development. There becomes strategically relevant information you want to not know.
I've always (but not always consciously) been slightly confused about two aspects of shard theory:
Both seem resolvable by an application of the predictive processing theory of value. An example: If you are very convinced that you will (say) be able to pay rent in a month, and then you don't pay rent, this is a negative update on the generators of the belief, and also on the actions you performed leading up to the due date. If you do, then its a positive update on both.
This produces consequentialist behaviors when the belief-values are unlikely without significant action on your part (satisfying the last transition confusion of (1) above), and also produces capable agents with beliefs and values hopelessly confused with each other, leaning into the confusion of (2).
h/t @Lucius Bushnaq for getting me to start thinking in this direction.
A confusion about predictive processing: Where do the values in predictive processing come from?
lol, either this confusion has been resolved, or I have no clue what I was saying here.
https://manifold.markets/GarrettBaker/in-5-years-will-i-think-the-org-con
Project idea: Use LeTI: Learning to Generate from Textual Interactions to do a better version of RLHF. I had a conversation with Scott Viteri a while ago, where he was bemoaning (the following are my words; he probably wouldn't endorse what I'm about to say) how low-bandwidth the connection was between a language mode and its feedback source, and how if we could maybe expand that to more than just an RLHF type thing, we could get more fine-grained control over the inductive biases of the model.
A common problem with deploying language models for high-stakes decision making are prompt-injections. If you give ChatGPT-4 access to your bank account information and your email and don't give proper oversight over it, you can bet that somebody's going to find a way to get it to email your bank account info. Some argue that if we can't even trust these models to handle our bank account and email addresses, how are we going to be able to trust them to handle our universe.
An approach I've currently started thinking about, and don't know of any prior work with our advanced language models on: Using the security amplification (LessWrong version) properties of Christiano's old Meta-execution (LessWrong version).
A poem I was able to generate using Loom.
Like many (will), I'm updating way towards 'actually, very smart & general models given a shred of goal-like stuff will act quite adversarially toward you by default' as a result of Bing's new search assistant. Especially worrying because this has internet search-capabilities, so can reference & build upon previous conversations with other users or yourself.
Of course, the true test of exactly how worried I should be will come when I or my friends gain access.
Clarification: I think I haven't so much updated by reflectively endorsed probability, but my gut has definitely been caught up to my brain when thinking about this.
Seems Evan agrees
A project I would like to see someone do (which I may work on in the future) is to try to formalize exactly the kind of reasoning many shard-theorists do. In particular, get a toy neural network in a very simple environment, and come up with a bunch of lists of various if-then statements, along with their inductive-bias, and try to predict using shard-like reasoning which of those if-then statements will be selected for & with how much weight in the training process. Then look at the generalization behavior of an actually trained network, and see if you're correct.
Some discussion on whether alignment should see more influence from AGI labs or academia. I use the same argument in favor of a strong decoupling of alignment progress from both: alignment progress needs to go faster than capability progress. If we use the same methods or cultural technology as AGI labs or academia, we can guarantee slower than capability alignment progress. Just as fast as if AGI labs and academia work well for alignment as much as they work for capabilities. Given they are driven by capabilities progress and not alignment progress, they probably will work far better for capabilities progress.
This seems wrong to me about academia - I'd say it's driven by "learning cool things you can summarize in a talk".
Also in general I feel like this logic would also work for why we shouldn't work inside buildings, or with computers.
Hm. Good points. I guess what I really mean with the academia points is that it seems like academia has many blockers and inefficiencies that I think are made in such a way so that capabilities progress is vastly easier than alignment progress to jump through, and extra-so for capabilities labs. Like, right now it seems like a lot of alignment work is just playing with a bunch of different reframings of the problems to see what sticks or makes problems easier.
You have more experience here, but my impression of a lot of academia was that it was very focused on publishing lots of papers with very legible results (and also a meaningless theory section). In such a world, playing around with different framings of problems doesn't succeed, and you end up pushed towards framings which are better on the currently used metrics. Most currently used metrics for AI stuff are capabilities oriented, so that means doing capabilities work, or work that helps push capabilities.
I think it's true that the easiest thing to do is legibly improve on currently used metrics. I guess my take is that in academia you want to write a short paper that people can see is valuable, which biases towards "I did thing X and now the number is bigger". But, for example, if you reframe the alignment problem and show some interesting thing about your reframing, that can work pretty well as a paper (see The Off-Switch Game, Optimal Policies Tend to Seek Power). My guess is that the bigger deal is that there's some social pressure to publish frequently (in part because that's a sign that you've done something, and a thing that closes a feedback loop).
Maybe a bigger deal is that by the nature of a paper, you can't get too many inferential steps away from the field.
The current ecosystem seems very influenced by AGI labs, so it seems clear to me that a marginal increase in their influence is bad. How bad? I don't know.
There's little influence of academia, which seems good. The benefit of marginal increases in interactions with academia come down to locating the holes in our understanding of various claims we make, and potentially some course-corrections potentially helpful for more speculative research. Not tremendously obvious which direction the sign here is pointing, but I do think its easy for people to worship academia as a beacon of truth & clarity, or as a way to lend status to alignment arguments. These are bad reasons to want more influence from academia.
My take on complex systems theory is that it seems to be the kind of theory that many arguments proposed in favor of would still give the same predictions until it is blatantly obvious that we can in fact understand the relevant system. Results like chaotic relationships, or stochastic-without-mean relationships seem definitive arguments in favor of the science, though these are rarely posed about neural networks.
Merely pointing out that we don’t understand something, that there seems to be a lot going on, or that there exist nonlinear interactions imo isn’t enough to make the strong claim that there exist no mechanistic interpretations of the results which can make coarse predictions in ways meaningfully different from just running the system.
Even if there’s stochastic-without-mean relationships, the rest of the system that is causally upstream from this fact can usually be understood (take earthquakes as an example), and similarly with chaos (we don’t understand turbulent flow, but we definitely understand laminar, and we have precise equations and knowledge of how to avoid making turbulence happen when we don’t want it, which I believe can be derived from the fluid equations). Truly complex systems seem mostly very fragile in their complexity.
Where complexity shines most brightly is in econ or neuroscience, where experiments and replications are hard, which is not at all the case in mechanistic interpretability research.
Someone asked for this file, so I thought it would be interesting to share it publicly. Notably this is directly taken from my internal notes, and so may have some weird &/or (very) wrong things in it, and some parts may not be understandable. Feel free to ask for clarification where needed.
I want a way to take an agent, and figure out what its values are. For this, we need to define abstract structures within the agent such that any values-like stuff in any part of the agent ends up being shunted off to a particular structure in our overall agent schematic after a number of gradient steps.
Given an agent which has been optimized for a particular objective in a particular environment, there will be convergent bottlenecks in the environment it will need to solve in order to make progress. One of these is power-seeking, but another one of these could be quadratic-equation solvers, or something like solving linear programs. These structures will be reward-function-independent[1]. These structures will be recursive, and we should expect them to be made out of even-more-convergent structures.
How do shards pop out of this? In the course of optimizing our agent, some of our solvers may have a bias towards leading our agent towards situations which more require their use. We may also see this kind of behavior in groups of solvers, where
solver_1()
leads the agent into situations requiringsolver_2()
, which leads the agent into situations requiringsolver_1()
. In the course of optimizing our agent (at least at first), we will be more likely to find these kinds of solvers, since solvers which often lead the agent into situations requiring solvers the agent does not yet have have no immediate gradient pointing towards them (since if the agent tried to use that solver, it would just end up being confused once it entered the new situation), so we are left only selecting for solvers which mostly lead the agent into situations it knows how to deal with.What happens if we call these solver-cycles shards?
identify_candy_and_coordinate_hand_movements_to_put_in_mouth()
solver[2]. Very specific, but with pieces of usefulness. The sections vaguely devoted to identifying objects (like implicit edge-detectors) will get repurposed for general vision processing, and the sections devoted to coordinating hand movements will also get repurposed to many different goals. The candy bit, and put-in-mouth bit only end up surviving if they can lead the agent to worlds which reinforce candy-abstractions and putting-things-in-mouth abstractions. Other parts don't reqlly need to try.There will be some shards which we probably can't avoid. But also, if we have a good understanding of the convergent problems in an environment, we should be able to predict what the first few solvers are, and solvers after those should mostly build upon the previous solvers' loop-coalitions?
Re: Agray's example about motor movements in the brain, and how likely you'll see a jonbled mess of lots of stuff causing lots of other stuff to happen, even though movement is highly instrumental valuable:
mag(a)=|Δaf(p,r;θ)|where a is the variable representing the activation, f is the function representing our network, p is the percept our network gets about the state, r is it's reccurency, and θ are the parameters of our network. We may also want to define this in terms of collections of weights too, perhaps having to do with Lucius's features stuff.
Don't get tied to this. Possibly we want just the partial derivative of the action actually taken with respect to a, or really, the partial of the highest-valued-output action taken. And I want a way to talk about dis-enforcing stuff too. Maybe we just re-run the network on this input after taking a gradient step, then see whether a has gone up or down. That seems safer.↩︎
Projects I'd do if only I were faster at coding
I would no longer do many if these projects