How likely do you think bilinear layers & dictionary learning will lead to comprehensive interpretability?
Are there other specific areas you're excited about?
Why is loss stickiness deprecated? Were you just not able to see an overlap in basins for L1 & reconstruction loss when you 4x the feature/neuron ratio (ie from 2x->8x)?
As (maybe) mentioned in the slides, this method may not be computationally feasible for SOTA models, but I'm interested in the ordering of features turned monosemantic; if the most important features are turned monosemantic first, then you might not need full monosemanticity.
I initially expected the "most important & frequent" features to become monosemantic first, based off the superposition paper. AFAIK, this method only captures the most frequent, because "importance" would be w/ respect to CE loss in the model output, which isn't captured in reconstruction/L1 loss.
My shard theory inspired story is to make an AI that:
Then the model can safely scale.
This doesn’t require having the true reward function (which I imagine to be a giant lookup table created by Omega), but some mech interp and understanding its own reward function. I don’t expect this to be an entirely different ...
I think more concentration meditation would be the way, but concentration meditation does make it more likely you'll notice experiences that cause what you may call “awakening experiences”. (This is in contrast with insight meditation, like noting.)
Leigh Brasington’s Right Concentration is a book on jhanas, which involve becoming very concentrated and then focusing on positive sensations until you hit a flow state. This is definitely not an awakening experience, but it feels great (though I’ve only entered the first one a few times).
A different source is Rob Burbea’s jhana retreat audio recordings on dharmaseed.
Is it actually true that you only trained on 5% of the dataset for filtering (I’m assuming training for 20 epochs)?
Monitoring of increasingly advanced systems does not trivially work, since much of the cognition of advanced systems, and many of their dangerous properties, will be externalized the more they interact with the world.
Externalized reasoning being a flaw in monitoring makes a lot of sense, and I hadn’t actually heard of it before. I feel that it deserves a whole post of its own.
One reason the neuron is congruent with multiple of the same tokens may be because those token embeddings are similar (you can test this by checking their cosine similarities).
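That check can be sketched quickly; the embedding matrix and token ids below are toy stand-ins, not the real model's:

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity between two embedding vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy stand-in for a (vocab, d_model) token embedding matrix
rng = np.random.default_rng(0)
W_E = rng.normal(size=(100, 16))

# hypothetical ids of two tokens the neuron is congruent with
tok_a, tok_b = 3, 7
print(cosine_sim(W_E[tok_a], W_E[tok_b]))
```

If the similarity is high, "multiple congruent tokens" may just be one direction in embedding space.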
For clarifying my own understanding:
The dot product of a neuron’s output weights (ie a row in W_out) with the unembedding matrix (in this case embedding.T, because GPT uses tied embeddings) is what directly contributes to the logit outputs.
If the neuron’s activation is relatively very high, then it swamps the direction of your activations. So, artificially increasing a neuron’s activation (eg via W_in) to 100 should cause the same token to be predicted regardless of the prompt.
This means that neuron A could be more congruent than neuron B, but B contribute more t...
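A minimal sketch of that picture with random toy weights (the shapes, names, and the exaggerated activation are all assumptions for the demo, not GPT's actual values):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp, vocab = 16, 64, 100
W_out = rng.normal(size=(d_mlp, d_model))  # MLP output weights
W_E = rng.normal(size=(vocab, d_model))    # tied embeddings: unembed = W_E.T

neuron = 5
# direct per-unit-activation contribution of this neuron to each logit
congruence = W_out[neuron] @ W_E.T  # shape (vocab,)

# crank this one neuron's activation absurdly high: its direction swamps
# whatever the prompt put in the residual stream, so the most congruent
# token wins regardless of the prompt
residual = rng.normal(size=d_model)  # stand-in for the prompt's activations
logits = (residual + 1e6 * W_out[neuron]) @ W_E.T
print(int(np.argmax(logits)) == int(np.argmax(congruence)))
```

With a realistic activation, the residual term matters, which is why a more congruent neuron A can still contribute less than B.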
These arguments don't apply to the base models which are only trained on next word prediction (ie the simulators post), since their predictions never affected future inputs. This is the type of model Janus most interacted with.
Two of the proposals in this post do involve optimizing over human feedback, like:
Creating custom models trained on not only general alignment datasets but personal data (including interaction data), and building tools and modifying workflows to facilitate better data collection with less overhead
, which they may apply to.
I’m excited about sensory substitution (https://eagleman.com/science/sensory-substitution/), where people translate auditory or visual information into tactile sensations (usually for people who don’t usually process that info).
I remember Quintin Pope wanting to translate the latent space of language models [while reading a paper] into visual or tactile info. I’d see this both as a way to read papers faster, brainstorm ideas, etc, and as a way to gain a better understanding of latent space while developing it.
For context, Amdahl’s law states that how much you can speed up a process is bottlenecked on its serial parts. Eg you can have 100 people help make a cake really quickly, but it still takes ~30 minutes to bake.
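The law itself, as a quick sketch (the cake fractions are made up for illustration):

```python
def amdahl_speedup(serial_frac, n_workers):
    # max speedup when serial_frac of the work can't be parallelized
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n_workers)

# if baking (the serial part) is half the total time, even 100 helpers
# only get you ~2x, and infinite helpers cap out at exactly 2x
print(round(amdahl_speedup(0.5, 100), 2))  # -> 1.98
```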
I’m assuming here, the human component is the serial component that we will be bottlenecked on, so will be outcompeted by agents?
If so, we should try to build the tools and knowledge to keep humans in the loop as far as we can. I agree it will eventually be outcompeted by full AI agency alone, but it isn’t set in stone how far human-steered AI can go.
Models doing steganography mess up oversight of language models that only measure the outward text produced. If current methods for training models, such as RLHF, can induce steg, then that would be good to know so we can avoid that.
If we successfully induce steganography in current models, then we know at least one training process that induces it. There will be some truth as to why: what specific property mechanistically causes steg in the case found? Do other training processes (e.g. RLHF) also have this property?
My backpack lamely doesn't have any of those straps.
The best one I've found is removing the left shoulder strap and gripping the backpack in e.g. my right arm.
I'd love to hear whether you found this useful, and whether I should bother making a second half!
We had 5 people watch it here, and we would like a part 2:)
We had a lot of fun pausing the video and making forward predictions, and we couldn't think of any feedback for you in general.
Notably the model was trained across multiple episodes to pick up on RL improvement.
The usual inner-misalignment story would be that it’s trying to gain more reward in future episodes by forgoing reward in earlier ones, but I don’t think this is evidence for that.
"Mike is large -> large is Mike
Bob is cute -> cute is"
Also works w/ numbers (but I had trouble getting it to reverse 3 digits at a time):
"3 6 -> 6 3
2 88 ->"
"1 + 1 = 0 + 2
2 + 2 = 0 + 4
3 + 3 = 0 +"
This also worked when replacing 0 w/ "pig", but changing it to "df" made it predict " 5" as the answer; I think it just wants to count up from the previous answer, 4.
For each of the following, the model predicts a "." at the end.
I eat spaghetti, yet she eats pizza
I s...
It is a search engine showing you what it had already created before the end of training.
I'm wondering what you and I would predict differently then? Would you predict that GPT-3 could learn a variation on pig Latin? Does higher log-prob for 0-shot for larger models count?
The crux may be different though, here's a few stabs:
1. GPT doesn't have true intelligence; it will only ever output shallow pattern matches. It will never come up with truly original ideas
2. GPT will never pursue goals in any meaningful sense
2.a because it can't tell the difference...
A human examining the program can know which words were part of a prompt and which were just now generated by the machine, but I doubt the activation function examines the equations that are GPT's own code, contemplates their significance and infers that the most recent letters were generated by it, or were part of the prompt
As a tangent, I do believe it's possible to tell if an output is generated by GPT in principle. The model itself could potentially do that as well by noticing high-surprise words according to itself (ie low probability tokens in the prompt). I'm unsure if GPT-3 could be prompted to do that now though.
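The idea, sketched with made-up per-token log-probs (the threshold and numbers are arbitrary assumptions, not from any real model):

```python
def flag_surprising_tokens(logprobs, threshold=-6.0):
    # indices of tokens the model itself assigned very low probability;
    # the model's own samples should rarely contain these, while
    # human-written prompt text can
    return [i for i, lp in enumerate(logprobs) if lp < threshold]

# toy per-token log-probs for a tokenized prompt
logprobs = [-1.2, -0.5, -8.3, -2.0, -7.1]
print(flag_surprising_tokens(logprobs))  # -> [2, 4]
```

Runs of high-surprise tokens would then suggest text the model didn't generate itself.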
I believe you’re equating “frozen weights” with “amnesiac / can’t come up with plans”.
GPT is usually deployed by feeding its own output back into itself, meaning it doesn’t forget what it just did, including whether it succeeded at its recent goal. Eg use chain-of-thought reasoning on math questions and it can remember that it solved a subgoal / intermediate calculation.
How would you end up measuring deception, power seeking, situational awareness?
We can simulate characters with GPT now that are deceptive (eg a con artist talking to another character). Similar with power seeking and situational awareness (eg being aware it’s GPT)
Thanks as always for your consistently thoughtful comments:)
I disagree with how this post seems to optimistically ignore the possibility that the AGI might self-modify to be more coherent in a way that involves crushing / erasing a subset of its desires, and this subset might include the desires related to human flourishing.
I also feel this is an “area that warrants further research”, though I don't view shard-coordination as being different than shard formation. If you understand how inner-values form from outer reward schedules, then how inner-values int...
On your first point, I do think people have thought about this before and determined it doesn't work. But from the post:
...If it turns out to be currently too hard to understand the aligned protein computers, then I want to keep coming back to the problem with each major new insight I gain. When I learned about scaling laws, I should have rethought my picture of human value formation—Did the new insight knock anything loose? I should have checked back in when I heard about mesa optimizers, about the Bitter Lesson, about the feature un
Oh, you're stating potential mechanisms for human alignment w/ humans that you don't think will generalize to AGI. It would be better for me to provide an informative mechanism that might seem to generalize.
Turntrout's other post claims that the genome likely doesn't directly specify rewards for everything humans end up valuing. People's specific families aren't encoded as circuits in the limbic system, yet downstream of the crude reward system, many people end up valuing their families. There are more details to dig into here, but already it implies...
To add, Turntrout does state:
In an upcoming post, I’ll discuss one particularly rich vein of evidence provided by humans.
so the doc Ulisse provided is a decent write-up about just that, but there are more official posts intended to be published.
Ah, yes I recognized I was replying to only an example you gave, and decided to post a separate comment on the more general point:)
There are other mechanisms which influence other things, but I wouldn't necessarily trust them to generalize either.
Could you elaborate?
I believe the diamond example is true, but not the best example to use. I bet it was mentioned because of the arbital article linked in the post.
The premise isn't dependent on diamonds being terminal goals; it could easily be about valuing real-life people or dogs or nature or real-life anything. Writing an unbounded program that values real-world objects is an open problem in alignment; yet humans are bounded programs that value real-world objects all of the time, millions of times a day.
The post argues that focusing on the causal explanatio...
There are many alignment properties that humans exhibit such as valuing real world objects, being corrigible, not wireheading if given the chance, not suffering ontological crises, and caring about sentient life (not everyone has these values of course). I believe the post's point that studying the mechanisms behind these value formations is more informative than other sources of info. Looking at the post:
...the inner workings of those generally intelligent apes is invaluable evidence about the mechanistic within-lifetime process by which those apes
To summarize your argument: people are not aligned w/ others who are less powerful than them, so this will not generalize to an AGI that is much more powerful than humans.
Parents have way more power than their kids, and there exists some parents that are very loving (ie aligned) towards their kids. There are also many, many people who care about their pets & there exist animal rights advocates.
If we understand the mechanisms behind why some people e.g. terminally value animal happiness and some don't, then we can apply these mechanisms to other learnin...
This doesn't make sense to me, particularly since I believe that most people live in environments that are very much "in distribution", and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter.
I think you're ignoring the [now bolded part] in "a particular human’s learning process + reward circuitry + "training" environment" and just focusing on the environment. Humans very often don't optimize for their reward circuitry in their...
There may not be substantial disagreements here. Do you agree with:
"a particular human's learning process + reward circuitry + "training" environment -> the human's learned values" is more informative about inner-misalignment than the usual "evolution -> human values" (e.g. Two twins could have different life experiences and have different values, or a sociopath may have different reward circuitry which leads to very different values than people with typical reward circuitry even given similar experiences)
...The most important claim in your commen
My understanding is: Bob's genome didn't have access to Bob's developed world model (WM) when he was born (because his WM wasn't developed yet). Bob's genome can't directly specify "care about your specific family" because it can't hardcode Bob's specific family's visual or auditory features.
This direct-specification wouldn't work anyways because people change looks, Bob could be adopted, or Bob could be born blind & deaf.
[Check, does the Bob example make sense?]
But, the genome does do something indirectly that consistently leads to people valuin...
From my perspective, it’s more like the opposite; if alignment were to be solved tomorrow, that would give the AI policy people a fair shot at getting it implemented.
I’m unsure what the government can do that DeepMind or OpenAI (or someone else) couldn’t do on their own. Maybe you’re imagining a policy that forces all companies to build aligned AIs according to the solution, but this won’t be perfect, and an unaligned AGI could still kill everyone (or it could be built somewhere else).
The first thing you do with a solution to alignment is build an aligned AGI to prevent all x-risks. I don’t see how routing through the government helps that process(?)
Why did you use the weak AGI question? It feels like a motte-and-bailey to say “x time until AGI” but then link to the weak AGI question.
I picked it because it has the most predictions and is frequently pointed to as an indicator of big shifts. But you're right, I should work on adding an option to use the strong question instead; I can see why people might prefer that.
I wonder how much COVID got people to switch to working on Biorisks.
What I’m interested in here is talking to real researchers and asking what events would convince them to switch to alignment. Enumerating those would be useful when explaining this to them.
I think asking for specific capabilities would also be interesting. Or what specific capabilities they would’ve said in 2012. Then asking how long they expect between that capability and an x-catastrophe.
I agree. You can even get career advice here at https://www.aisafetysupport.org/resources/career-coaching
Or feel free to message me for a short call. I bet you could get paid to do alignment work, so it’s worth looking into at least.
[Note: this one, steelman, and feedback on proposals all have very similar input spaces. I think I would ideally mix them as one in an actual product, but I'm keeping them separate for now]
Input:
...Currently AI systems are prone to bias and unfairness which is unaligned with our values. I w
Thanks. Yeah this all sounds extremely obvious to me, but I may not have included such obvious-to-Logan things if I was coaching someone else.
Key things to avoid include isolating people from their friends, breaking the linguistic association of words to reality, demanding that someone change their linguistic patterns on the spot, etc - mostly things which street epistemology specifically makes harder due to the recommended techniques
Are you saying street epistemology is good or bad here? I've only seen a few videos and haven't read through the intro documents or anything.
I was talking to someone recently who talked to Yann and got him to agree with very alignment-y things, but then a couple days later, Yann was saying very capabilities things instead.
The "someone"'s theory was that Yann's incentives and environment are all pushing towards capabilities research.
I think that everyone can see these in theory, but different people focus on different types of information (eg low level sensory information vs high level sensory information) by default.
I believe drugs or meditating can change which types of information you pay more attention to by default, momentarily or even permanently.
I've never taken drugs beyond caffeine & alcohol, but meditating makes these phenomena much easier to see. I bet you could get most people to see them if you ask them to e.g. stare at a textured surface like carpet for 2...
I understand your point now, thanks. It's:
An embedded aligned agent is desired to have properties (1),(2), and (3). But, suppose (1) & (2), then (3) cannot be true. Then, suppose (2) & ...
or something of the sort.
Happy Birthday Man. I’d probably have talked to you about AI Alignment by now, and can imagine all the circles we would go arguing it.
I feel like such a different person than even a few years ago, and I don’t think I mean that in a “redefining myself” way or as wanting to boost my ego. I wonder how different you’d be after your startup idea.
It’d be nice to have talked to you after Ukraine being invaded, or go see coach about it.
I’ll bring you back if I can,
Logan
I'm confused about what your point here even is. For the first part, if you're trying to say
research that gives strong arguments/proofs that you cannot solve alignment by doing X (like showing certain techniques aren't powerful enough to prove P!=NP) is also useful.
, then that makes sense. But the post didn't mention anything about that?
You said:
We cannot just rely on a can-do attitude, as we can with starting a start-up (where even if there’s something fundamentally wrong about the idea, and it fails, only a few people’s lives are impacted hard).
which I feel...
We don't have any proofs that the approaches of the referenced researchers are doomed to fail, like we have for P!=NP and what you linked.
Besides looking for different angles or ways to solve alignment, or even for strong arguments/proofs why a particular technique will not solve alignment,
... it seems prudent to also look for whether you can prove embedded misalignment by contradiction (in terms of the inconsistency of the inherent logical relations between essential properties that would need to be defined as part of the concept of embedded/implemented/compu...
We have our replication here for anyone interested!