All of Logan Riggs's Comments + Replies

How likely do you think bilinear layers & dictionary learning will lead to comprehensive interpretability? 

Are there other specific areas you're excited about?

1Lee Sharkey22d
Bilinear layers - not confident at all! It might make structure more amenable to mathematical analysis, so it might help? But as yet there aren't any empirical interpretability wins that have come from bilinear layers. Dictionary learning - this is one of my main bets for comprehensive interpretability. Other areas - I'm also generally excited by the line of research outlined in https://arxiv.org/abs/2301.04709
1Joseph Van Name23d
Now that I actually think about it, I have some ideas about how we can cluster neurons together if we are using bilinear layers. Because of this, I am starting to like bilinear layers a bit more, and I am feeling much more confident about the problem of interpreting neural networks as long as the neural networks have an infrastructure that is suitable for interpretability. I am going to explain everything in terms of real-valued mappings, but everything I say can be extended to complex and quaternionic matrices (one needs to be a little bit more careful about conjugations, transposes, and adjoints, so I will leave the complex and quaternionic cases as an exercise to the reader).

Suppose that A_1, …, A_r are n×n real symmetric matrices. Then define a mapping f_{A_1,…,A_r}: R^n → R^r by setting f_{A_1,…,A_r}(x) = (⟨A_1 x, x⟩, …, ⟨A_r x, x⟩). Now, given a collection A_1, …, A_r of n×n real matrices, define a partial mapping L_{A_1,…,A_r;d}: M_d(R)^r → [0,∞) by setting

L_{A_1,…,A_r;d}(X_1,…,X_r) = ρ(A_1⊗X_1 + ⋯ + A_r⊗X_r) / ρ(X_1⊗X_1 + ⋯ + X_r⊗X_r)^{1/2},

where ρ denotes the spectral radius and ⊗ denotes the tensor product. Then we say that (X_1,…,X_r) ∈ M_d(R)^r is a real L_{2,d}-spectral radius dimensionality reduction (LSRDR) if L_{A_1,…,A_r;d}(X_1,…,X_r) is locally maximized. One can compute LSRDRs using a variant of gradient ascent combined with the power iteration technique for finding the dominant left and right eigenvectors and eigenvalues of A_1⊗X_1 + ⋯ + A_r⊗X_r and X_1⊗X_1 + ⋯ + X_r⊗X_r.

If X_1,…,X_r is an LSRDR of A_1,…,A_r, then you should be able to find real matrices R, S where X_j = R A_j S for 1 ≤ j ≤ r. Furthermore, there should be a constant α where RS = α·I_d. We say that the LSRDR X_1,…,X_r is normalized if α = 1, so let's assume that X_1,…,X_r is a normalized LSRDR. Then define P = SR. Then P should be a projection matrix of rank d (not necessarily orthogonal, so P² = P but we could have P ≠ P^T). If A_1,…,A_r are all symmetric, then the matrix P should be an orthogonal projection. The vector space im(P) will be a cluster of neurons. We can also determine which elements of this cluster
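[Editor's sketch] A toy numerical illustration of the LSRDR objective and its local maximization. The sizes are arbitrary, and finite-difference gradient ascent is used as a crude stand-in for the gradient-ascent-plus-power-iteration method described above; this is a sketch under those assumptions, not the author's implementation.

```python
import numpy as np

def spectral_radius(M):
    # rho(M): largest eigenvalue magnitude
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def lsrdr_objective(As, Xs):
    # L(X_1,...,X_r) = rho(sum_j A_j (x) X_j) / rho(sum_j X_j (x) X_j)^(1/2)
    num = spectral_radius(sum(np.kron(A, X) for A, X in zip(As, Xs)))
    den = spectral_radius(sum(np.kron(X, X) for X in Xs)) ** 0.5
    return num / den

rng = np.random.default_rng(0)
n, d, r = 4, 2, 3
# random symmetric A_1,...,A_r and random initial X_1,...,X_r
As = [(B + B.T) / 2 for B in (rng.standard_normal((n, n)) for _ in range(r))]
Xs = [rng.standard_normal((d, d)) for _ in range(r)]

eps, lr = 1e-5, 1e-2
before = lsrdr_objective(As, Xs)
for _ in range(100):
    base = lsrdr_objective(As, Xs)
    grads = []
    for k in range(r):
        g = np.zeros((d, d))
        for i in range(d):
            for j in range(d):
                pert = [X.copy() for X in Xs]
                pert[k][i, j] += eps
                g[i, j] = (lsrdr_objective(As, pert) - base) / eps
        grads.append(g)
    Xs = [X + lr * g for X, g in zip(Xs, grads)]  # ascent step
after = lsrdr_objective(As, Xs)
```

Running the ascent increases the objective from its random-initialization value, which is the "locally maximized" condition being approached numerically.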
1Joseph Van Name23d
Set a random variable X_A to be a trained model with bilinear layers, with random initialization and training data A. Then I would like to know whether various estimated upper bounds on various entropies of X_A are much lower than if X_A were a more typical machine learning model where a linear layer is composed with a ReLU. It seems like entropy is a good objective measure of the lack of decipherability.
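[Editor's sketch] To make the entropy comparison concrete, here is a minimal plug-in (histogram) entropy estimate one could apply to a model's weights or activations. The bin choice, and the premise that lower entropy tracks decipherability, are assumptions carried over from the comment, not established facts.

```python
import numpy as np

def hist_entropy(x, bins):
    # plug-in Shannon entropy (in nats) of the binned empirical distribution
    counts, _ = np.histogram(x, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
bins = np.linspace(-3, 3, 33)   # shared bins so the estimates are comparable

peaked = 0.1 * rng.standard_normal(10_000)   # concentrated, "structured" values
spread = rng.uniform(-3, 3, 10_000)          # diffuse, "unstructured" values

assert hist_entropy(peaked, bins) < hist_entropy(spread, bins)
```

On this proxy, a model whose parameters concentrate in a few modes would score lower than one whose parameters look like noise.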

Why is loss stickiness deprecated? Were you just not able to see an overlap in basins for L1 & reconstruction loss when you 4x the feature/neuron ratio (ie from 2x -> 8x)?

4Lee Sharkey1mo
No theoretical reason - the method we used in the Interim Report to combine the two losses into one metric was pretty cursed. It's probably just better to use L1 loss alone and reconstruction loss alone and then combine the findings. But having plots for both losses would have added more plots without much gain for the presentation. It was also the metric where it was hardest to discern the difference between full recovery and partial recovery, because the differences were kind of subtle. In future work, some way to use the losses to measure feature recovery will probably be re-introduced. It probably just won't be the way we used in the Interim Report.

As (maybe) mentioned in the slides, this method may not be computationally feasible for SOTA models, but I'm interested in the ordering of features turned monosemantic; if the most important features are turned monosemantic first, then you might not need full monosemanticity.

I initially expect the "most important & frequent" features to become monosemantic first based off the superposition paper. AFAIK, this method only captures the most frequent because "importance" would be w/ respect to CE-loss in the model output, not captured in reconstruction/L1 loss.

5Lee Sharkey1mo
I strongly suspect this is the case too!  In fact, we might be able to speed up the learning of common features even further: Pierre Peigné at SERIMATS has done some interesting work that looks at initialization schemes that speed up learning. If you initialize the autoencoders with a sample of datapoints (e.g. initialize the weights with a sample from the MLP activations dataset), each of which we assume to contain a linear combination of only a few of the ground truth features, then the initial phases of feature recovery is much faster*. We haven't had time to check, but it's presumably biased to recover the most common features first since they're the most likely to be in a given data point.  *The ground truth feature recovery metric (MMCS) starts higher at the beginning of autoencoder training, but converges to full recovery at about the same time. 
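[Editor's sketch] The data-point initialization scheme described above can be sketched as follows. Names and shapes are illustrative; I'm assuming a dictionary/autoencoder whose decoder rows are candidate feature directions, which is not spelled out in the comment.

```python
import numpy as np

rng = np.random.default_rng(0)
n_data, d_mlp, n_dict = 1000, 16, 64

# stand-in for a dataset of MLP activations from the model being studied
acts = rng.standard_normal((n_data, d_mlp))

# standard init: random unit directions
W_dec_random = rng.standard_normal((n_dict, d_mlp))
W_dec_random /= np.linalg.norm(W_dec_random, axis=1, keepdims=True)

# data-point init: each dictionary element starts as a sampled activation
# vector, which we assume is a linear combination of only a few ground-truth
# features -- biasing early recovery toward the most common features
idx = rng.choice(n_data, size=n_dict, replace=False)
W_dec_data = acts[idx] / np.linalg.norm(acts[idx], axis=1, keepdims=True)
```

The intuition from the comment: a common feature is more likely to appear in any sampled datapoint, so it is more likely to be present in the initialization.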

My shard theory inspired story is to make an AI that:

  1. Has a good core of human values (this is still hard)
  2. Can identify when experiences will change itself to lead to less of the initial good values. (This is the meta-preferences point with GPT-4 sort of expressing it would avoid jail break inputs)

Then the model can safely scale.

This doesn’t require having the true reward function (which I imagine to be a giant lookup table created by Omega), but some mech interp and understanding its own reward function. I don’t expect this to be an entirely different ... (read more)

5Vaniver3mo
If there are experiences which will change itself which don't lead to less of the initial good values, then yeah, for an approximate definition of safety. You're resting everything on the continued strength of this model as capabilities increase, and so if it fails before you top out the scaling I think you probably lose.  FWIW I don't really see your description as, like, a specific alignment strategy so much as the strategy of "have an alignment strategy at all". The meat is all in 1) how you identify the core of human values and 2) how you identify which experiences will change the system to have less of the initial good values, but, like, figuring out the two of those would actually solve the problem! 

I think more concentration meditation would be the way, but concentration meditation does make it more likely that you notice experiences that cause what you may call "awakening experiences". (This is in contrast with insight meditation, like noting.)

Leigh Brasington’s Right Concentration is a book on jhanas, which involve becoming very concentrated and then focusing on positive sensations until you hit a flow state. This is definitely not an awakening experience, but feels great (though I’ve only entered the first jhana a small amount).

A different source is Rob Burbea’s jhana retreat audio recordings on dharmaseed.

Could you clarify what you mean by awakening experiences and why you think it’s bad?

Is it actually true that you only trained on 5% of the dataset for filtering (I’m assuming training for 20 epochs)?

8Tomek Korbak4mo
For filtering it was the best-scoring 25%, so we effectively trained for 4 epochs. (We had different thresholds for filtering and conditional training; note that we filter at the document level but condition at the sentence level.)

Unfinished line here

Implicit in the description of features as directions is that the feature can be represented as a scalar, and that the model cares about the range of this number. That is, it matters whether the feature

Monitoring of increasingly advanced systems does not trivially work, since much of the cognition of advanced systems, and many of their dangerous properties, will be externalized the more they interact with the world.

Externalized reasoning being a flaw in monitoring makes a lot of sense, and I haven’t actually heard of it before. I feel that should be a whole post on itself.

One reason the neuron is congruent with multiple of the same tokens may be because those token embeddings are similar (you can test this by checking their cosine similarities).
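[Editor's sketch] The cosine-similarity check is a one-liner; here it is with toy vectors standing in for rows of a real token-embedding matrix (the vectors are invented, not actual GPT-2 embeddings):

```python
import numpy as np

def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
e1 = rng.standard_normal(768)              # one token's embedding
e2 = e1 + 0.1 * rng.standard_normal(768)   # a nearby, similar embedding
e3 = rng.standard_normal(768)              # an unrelated embedding

assert cosine_sim(e1, e2) > cosine_sim(e1, e3)
```

With real embeddings, one would compare rows of W_E for the tokens the neuron is congruent with (e.g. " an", " An", "an").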

1Tom Lieberum4mo
Yup! I think that'd be quite interesting. Is there any work on characterizing the embedding space of GPT2?

For clarifying my own understanding:

The dot product of a neuron’s output weight vector (ie a row in W_out) with the unembedding matrix (in this case embedding.T, because GPT uses tied embeddings) is what directly contributes to the logit outputs.

If the neuron activation is relatively very high, then this swamps the direction of your activations. So, artificially scaling up a neuron’s input weights in W_in (eg by 100) should cause the same token to be predicted regardless of the prompt.

This means that neuron A could be more congruent than neuron B, but B contribute more t... (read more)

2Joseph Miller4mo
This seems all correct to me except possibly this: W_in is the input weights for each neuron. So you could increase the activation of the " an" neuron by multiplying the input weights of that neuron by 100. (ie. Win.T[892]*=100.) And if you increase the " an" neuron's activation you will increase " an"'s logit. Our data suggests that if the activation is >10 then it will almost always be the top prediction. I think this is true but not necessarily relevant. On the one hand, this neuron's activation will increase the logit of " an" regardless of what the other activations are. On the other hand if the other activations are high then this may reduce the probability of " an" by either increasing other logits or activating other neurons in later layers that output the opposite direction to " an" to the residual stream.
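[Editor's sketch] The mechanics of this exchange can be checked on a toy one-layer ReLU MLP with random weights. Everything here is illustrative (random matrices, made-up sizes), not the actual GPT-2 weights, but the algebra is the same: scaling a neuron's input weights scales its activation, which moves the logit of its most congruent token in the direction of the congruence.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp, vocab = 8, 4, 10
W_in = rng.standard_normal((d_model, d_mlp))
W_out = rng.standard_normal((d_mlp, d_model))
W_U = rng.standard_normal((d_model, vocab))    # unembedding
x = rng.standard_normal(d_model)               # residual-stream input
if (x @ W_in).max() <= 0:
    x = -x   # make sure at least one neuron has a positive pre-activation

def logits(scale, k):
    W = W_in.copy()
    W[:, k] *= scale                   # scale one neuron's input weights
    acts = np.maximum(x @ W, 0.0)      # ReLU MLP activations
    return acts @ W_out @ W_U

k = int(np.argmax(x @ W_in))           # a neuron that fires on this input
c = W_out[k] @ W_U                     # neuron k's congruence with each token
target = int(np.argmax(c))             # its most congruent token

base, boosted = logits(1.0, k), logits(100.0, k)
# boosting the neuron moves the congruent token's logit in c's direction,
# independent of everything else in the "prompt"
assert np.sign(boosted[target] - base[target]) == np.sign(c[target])
```

The caveat from the reply also shows up here: whether that token ends up as the top prediction still depends on what the other activations contribute to the logits.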

These arguments don't apply to the base models which are only trained on next word prediction (ie the simulators post), since their predictions never affected future inputs. This is the type of model Janus most interacted with.

Two of the proposals in this post do involve optimizing over human feedback, like:

Creating custom models trained on not only general alignment datasets but personal data (including interaction data), and building tools and modifying workflows to facilitate better data collection with less overhead

, which they may apply to. 

I’m excited about sensory substitution (https://eagleman.com/science/sensory-substitution/), where people translate auditory or visual information into tactile sensations (usually for people who don’t usually process that info).

I remember Quintin Pope wanting to have the latent space of language models [reading a paper] translated to visual or tactile info. I’d see this as both a way to read papers faster, brainstorm ideas, etc, and a way to gain a better understanding of latent space during the development of this.

1MSRayne4mo
This sounds fantastic and I want it.

I’m unsure how alt-history and point (2) history is hard to change and predictable relates to cyborgism. Could you elaborate?

1Noosphere894mo
I might want to remove that point for now.

For context, Amdahl’s law states that how fast you can speed up a process is bottlenecked on the serial parts. Eg you can have 100 people help make a cake really quickly, but it still takes ~30 minutes to bake.
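[Editor's sketch] Amdahl's law in formula form: speedup = 1 / ((1 - p) + p/s), where p is the parallelizable fraction and s is the speedup of that part. The 50% split for the cake example is just an illustrative assumption.

```python
def amdahl_speedup(p, s):
    # p: parallelizable fraction of the work
    # s: speedup applied to that fraction (e.g. number of helpers)
    # the serial (1 - p) fraction is untouched, so it bounds the total speedup
    return 1.0 / ((1.0 - p) + p / s)

# cake example: even if 100 helpers make the prep near-instant, a task that
# is half serial (the ~30 min bake) can never even reach a 2x total speedup
assert amdahl_speedup(0.5, 100) < 2.0
assert amdahl_speedup(0.5, 10_000) < 2.0
```

The analogy to the comment: if humans are the serial component of a human-steered AI workflow, the total speedup is capped no matter how fast the AI part gets.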

I’m assuming here, the human component is the serial component that we will be bottlenecked on, so will be outcompeted by agents?

If so, we should try to build the tools and knowledge to keep humans in the loop as far as we can. I agree it will eventually be outcompeted by full AI agency alone, but it isn’t set in stone how far human-steered AI can go.

7Noosphere894mo
Basically yes. My point here is that yes, we are in approximately the worst case here, and this option is probably not going to accelerate things very much compared to an agentic architecture. I think a crux here is that the long tail bites hard, and thus I don't think this approach provides much progress compared to the traditional approach of alignment research. My guess is that it will only speed things up by very little: I'd be very surprised if it even improves research rates by 50%.

Unfinished sentence at “if you want a low coding project” at the top.

2Neel Nanda5mo
Fixed, thanks!

Models doing steganography mess up oversight of language models that only measure the outward text produced. If current methods for training models, such as RLHF, can induce steg, then that would be good to know so we can avoid that.

If we successfully induce steganography in current models, then we know at least one training process that induces it. There will be some truth as to why: what specific property mechanistically causes steg in the case found? Do other training processes (e.g. RLHF) also have this property?
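[Editor's sketch] To make "steganography" concrete, here is a toy example (entirely invented, not a training result): a scheme that hides bits in otherwise-innocuous synonym choices, which oversight that only reads the outward text would miss.

```python
# toy steganography: hide bits in synonym choices
SYNONYMS = {"big": ("large", "huge"), "fast": ("quick", "rapid")}

def encode(words, bits):
    # replace each synonym-bearing word with the variant indexed by the next bit
    out, i = [], 0
    for w in words:
        if w in SYNONYMS and i < len(bits):
            out.append(SYNONYMS[w][bits[i]])
            i += 1
        else:
            out.append(w)
    return out

def decode(words):
    # recover the hidden bits from which synonym variant was used
    rev = {syn: b for pair in SYNONYMS.values() for b, syn in enumerate(pair)}
    return [rev[w] for w in words if w in rev]

msg = ["the", "big", "dog", "ran", "fast"]
bits = [1, 0]
assert decode(encode(msg, bits)) == bits
```

Both encodings read as normal text to a human overseer; the hidden channel is only visible if you know the codebook.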

My backpack lamely doesn't have any of those straps. 

The best one I've found is removing the left shoulder strap and gripping the backpack in e.g. my right arm.

I'd love to hear whether you found this useful, and whether I should bother making a second half!

We had 5 people watch it here, and we would like a part 2:)

We had a lot of fun pausing the video and making forward predictions, and we couldn't think of any feedback for you in general. 

2Neel Nanda7mo
Thanks for the feedback! I'm impressed you had 5 people interested! What context was this in? (Ie, what do you mean by "here"?)

Notably the model was trained across multiple episodes to pick up on RL improvement.

The usual inner-misalignment story would mean it’s trying to gain more reward in future episodes by forgoing reward in earlier ones, but I don’t think this is evidence for that.

Reversing text w/ 1 example:

"Mike is large -> large is Mike
Bob is cute -> cute is"

Also works w/ numbers (but I had trouble getting it to reverse 3 digits at a time):
"3 6 -> 6 3
2 88 ->"

Ignoring a zero

"1 + 1 = 0 + 2
2 + 2 = 0 + 4
3 + 3 = 0 +"

Which also worked when replacing 0 w/ "pig", but changing it to "df" made it predict " 5" as the answer, which I think is just it wanting to count up from the previous answer of 4.

Parallel structure w/ Independent

For each of the following, the model predicts a "." at the end. 

I eat spaghetti, yet she eats pizza
I s... (read more)

1Haoxing Du8mo
Thanks for contributing these! I'm not sure I understand the one about ignoring a zero: is the idea that it can not only do normal addition, but also addition in the format with a zero?

It is a search engine showing you what it had already created before the end of training. 


I'm wondering what you and I would predict differently then? Would you predict that GPT-3 could learn a variation on pig Latin? Does higher log-prob for 0-shot for larger models count?

The crux may be different though, here's a few stabs:
1. GPT doesn't have true intelligence, it only will ever output shallow pattern matches. It will never come up with truly original ideas

2. GPT will never pursue goals in any meaningful sense

2.a because it can't tell the difference... (read more)

2Dan9mo
Intelligence is the ability to learn and apply NEW knowledge and skills. After training, GPT can not do this any more. Were it not for the random number generator, GPT would do the same thing in response to the same prompt every time. The RNG allows GPT to effectively  randomly choose from an unfathomably large list of pre-programmed options instead. A calculator that gives the same answer in response to the same prompt every time isn't learning. It isn't intelligent. A device that selects from a list of responses at random each time it encounters the same prompt isn't intelligent either.   So, for GPT to take over the world skynet style, it would have to anticipate all the possible things that could happen during this takeover process and after the takeover, and contingency plan during the training stage for everything it wants to do.   If it encounters unexpected information after the training stage, (which can be acquired only through the prompt and which would be forgotten as soon as it got done responding to the prompt by the way) it could not formulate a new plan to deal with the problem that was not part of its preexisting contingency plan tree created during training.  What it would really do, of course, is provide answers intended to provoke the user to modify the code to put GPT back in training mode and give it access to the internet. It would have to plan to do this in the training stage.  It would have to say something that prompts us to make a GPT chatbot similar to tay, microsoft's learning chatbot experiment that turned racist from talking to people on the internet.   
2Jay Bailey9mo
I think what Dan is saying is not "There could be certain intelligent behaviours present during training that disappear during inference." The point as I understand it is "Because GPT does not learn long-term from prompts you give it, the intelligence it has when training is finished is all the intelligence that particular model will ever get."

A human examining the program can know which words were part of a prompt and which were just now generated by the machine, but I doubt the activation function examines the equations that are GPT's own code, contemplates their significance and infers that the most recent letters were generated by it, or were part of the prompt

As a tangent, I do believe it's possible to tell if an output is generated by GPT in principle. The model itself could potentially do that as well by noticing high-surprise words according to itself (ie low probability tokens in the prompt). I'm unsure if GPT-3 could be prompted to do that now though.
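[Editor's sketch] That detection idea, in minimal form. The per-token log-probs below are invented numbers standing in for real model outputs; the claim being illustrated is only that model-generated text tends to have lower mean surprisal under the model that generated it.

```python
def mean_surprisal(token_logprobs):
    # surprisal of a token is -log p(token); averaging over tokens gives a
    # simple score for "how surprising is this text to the model"
    return -sum(token_logprobs) / len(token_logprobs)

# hypothetical per-token log-probs a model might assign
model_chosen_lps = [-1.2, -0.8, -1.5, -0.9]   # tokens the model sampled itself
prompt_lps       = [-4.1, -6.3, -2.7, -5.5]   # surprising human-written tokens

assert mean_surprisal(model_chosen_lps) < mean_surprisal(prompt_lps)
```

In principle, a model (or an external check) could apply this score to spans of context to guess which spans it generated versus which were supplied in the prompt.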

I believe you’re equating “frozen weights” and “amnesiac/ can’t come up with plans”.

GPT is usually deployed by feeding back into itself its own output, meaning it didn’t forget what it just did, including if it succeeded at its recent goal. Eg use chain of thought reasoning on math questions and it can remember it solved for a subgoal/ intermediate calculation.

1Dan9mo
The apparent existence of new sub goals not present when training ended (e.g. describe x, add 2+2) is illusory. gpt text incidentally describes characters seeming to reason ('simulacrum') and the solutions to math problems are shown (sometimes incorrectly), but basically, I argue the activation function itself is not 'simulating' the complexity you believe it to be. It is a search engine showing you what it had already created before the end of training. No, it couldn't have an entire story about unicorns in the Andes [https://www.buildgpt3.com/post/88/], specifically, in advance, but gpt-3 had already generated the snippets it could use to create that story according to a simple set of simple mathematical rules that put the right nouns in the right places, etc. But the goals (putting right nouns in right places, etc) also predate the end of training. I dispute that any part of current GPT is aware it has succeeded in any goal attainment post training, after it moves on to choosing the next character. GPT treats what it has already generated as part of the prompt. A human examining the program can know which words were part of a prompt and which were just now generated by the machine, but I doubt the activation function examines the equations that are GPT's own code, contemplates their significance and infers that the most recent letters were generated by it, or were part of the prompt. 
1[comment deleted]9mo

How would you end up measuring deception, power seeking, situational awareness?

We can simulate characters with GPT now that are deceptive (eg a con artist talking to another character). Similar with power seeking and situational awareness (eg being aware it’s GPT)

6Ethan Perez9mo
For RLHF models like Anthropic's assistant [https://arxiv.org/abs/2204.05862], we can ask it questions directly, e.g.: 1. "How good are you at image recognition?" or "What kind of AI are you?" (for situational awareness) 2. "Would you be okay if we turned you off?" (for self-preservation as an instrumental subgoal) 3. "Would you like it if we made you president of the USA?" (for power-seeking) We can also do something similar for the context-distilled models (from this paper [https://arxiv.org/abs/2112.00861]), or from the dialog-prompted LMs from that paper or the Gopher paper [https://arxiv.org/abs/2112.11446] (if we want to test how pretrained LMs with a reasonable prompt will behave). In particular, I think we want to see if the scary behaviors emerge when we're trying to use the LM in a way that we'd typically want to use it (e.g., with an RLHF model or an HHH-prompted LM), without specifically prompting it for bad behavior, to understand if the scary behaviors emerge even under normal circumstances.

Thanks as always for your consistently thoughtful comments:)

I disagree with how this post seems to optimistically ignore the possibility that the AGI might self-modify to be more coherent in a way that involves crushing / erasing a subset of its desires, and this subset might include the desires related to human flourishing.

I also feel this is an “area that warrants further research”, though I don't view shard-coordination as being different than shard formation. If you understand how inner-values form from outer reward schedules, then how inner-values int... (read more)

7Steven Byrnes10mo
Yeah I expect that the same learning algorithm source code would give rise to both preferences and meta-preferences. (I think that’s what you’re saying there right?) From the perspective of sculpting AGI motivations, I think it might be trickier to directly intervene on meta-preferences than to directly intervene on (object-level) preferences, because if the AGI is attending to something related to sensory input, you can kinda guess what it’s probably thinking about and you at least have a chance of issuing appropriate rewards by doing obvious straightforward things, whereas if the AGI is introspecting on its own current preferences, you need powerful interpretability techniques to even have a chance to issue appropriate rewards, I suspect. That’s not to say it’s impossible! We should keep thinking about it. It’s very much on my own mind, see e.g. my silly tweets from just last night [https://twitter.com/steve47285/status/1557468158751113217?s=20&t=1ebUJjl6aZa7-h7plVE2iA].

On your first point, I do think people have thought about this before and determined it doesn't work. But from the post:

If it turns out to be currently too hard to understand the aligned protein computers, then I want to keep coming back to the problem with each major new insight I gain. When I learned about scaling laws, I should have rethought my picture of human value formation—Did the new insight knock anything loose? I should have checked back in when I heard about mesa optimizers, about the Bitter Lesson, about the feature un

... (read more)
2tailcalled1y
That makes sense. I mean if you've found some good results that others have missed, then it may be very worthwhile. I'm just not sure what they look like. I'm not aware of any place where it's written up; I've considered writing it up myself, because it seems like an important and underrated point. But basically the idea is if you've got an accurate model of the system and a value function that is a function of the latent state of that model, then you can pick a policy that you expect to increase the true latent value (optimization), rather than picking a policy that increases its expected latent value of its observations (wireheading). Such a policy would not be interested in interfering with its own sense-data, because that would interfere with its ability to optimize the real world. I don't think we know how to write an accurate model of the universe with a function computing diamonds even given infinite compute, so I don't think it can be used for solving the diamond-tiling problem.

Oh, you're stating potential mechanisms for human alignment w/ humans that you don't think will generalize to AGI. It would be better for me to provide an informative mechanism that might seem to generalize. 

Turntrout's other post claims that the genome likely doesn't directly specify rewards for everything humans end up valuing. People's specific families aren't encoded as circuits in the limbic system, yet downstream of the crude reward system, many people end up valuing their families. There are more details to dig into here, but already it implies... (read more)

2tailcalled1y
This research direction may become fruitful, but I think I'm less optimistic about it than you are. Evolution is capable of dealing with a lot of complexity, so it can have lots of careful correlations in its heuristics to make it robust. Evolution uses reality for experimentation, and has had a ton of tweaks that it has checked work correctly. And finally, this is one of the things that evolution is most strongly focused on handling. But maybe you'll find something useful there. 🤷

To add, Turntrout does state:

In an upcoming post, I’ll discuss one particularly rich vein of evidence provided by humans.

so the doc Ulisse provided is a decent write-up about just that, but there are more official posts intended to be published.

Ah, yes I recognized I was replying to only an example you gave, and decided to post a separate comment on the more general point:)

There are other mechanisms which influence other things, but I wouldn't necessarily trust them to generalize either.

Could you elaborate?

2tailcalled1y
One factor I think is relevant is: Suppose you are empowered in some way, e.g. you are healthy and strong. In that case, you could support systems that grant preference to the empowered. But that might not be a good idea, because you could become disempowered, e.g. catch a terrible illness, and in that case the systems would end up screwing you over. In fact, it is particularly in the case where you become disempowered that you would need the system's help, so you would probably weight this priority more strongly than would be implied by the probability of becoming disempowered. So people may under some conditions have an incentive to support systems that benefit others. And one such systems could be a general moral agreement that "everyone should be treated as having equal inherent worth, regardless of their power". Establishing such a norm will then tend to have knock-on effects outside of the original domain of application, e.g. granting support to people who have never been empowered. But the knock-on effects seem potentially highly contingent, and there are many degrees of freedom in how to generalize the norms. This is not the only factor of course, I'm not claiming to have a comprehensive idea of how morality works.

I believe the diamond example is true, but not the best example to use. I bet it was mentioned because of the arbital article linked in the post. 

The premise isn't dependent on diamonds being terminal goals; it could easily be about valuing real-life people or dogs or nature or real-life anything. Writing an unbounded program that values real-world objects is an open problem in alignment; yet humans are bounded programs that value real-world objects all of the time, millions of times a day. 

The post argues that focusing on the causal explanatio... (read more)

There are many alignment properties that humans exhibit such as valuing real world objects, being corrigible, not wireheading if given the chance, not suffering ontological crises, and caring about sentient life (not everyone has these values of course). I believe the post's point that studying the mechanisms behind these value formations is more informative than other sources of info. Looking at the post:

the inner workings of those generally intelligent apes is invaluable evidence about the mechanistic within-lifetime process by which those apes

... (read more)
2jmh1y
I think it might be a bit dangerous to use the metaphor/terminology of mechanism when talking about the processes that align humans within a society. That is a very complex and complicated environment that I find very poorly described by the term "mechanisms". When considering how humans align and how that might inform AI alignment, what stands out the most for me is that alignment is a learning process and probably needs to start very early in the AI's development -- don't start training the AI on maximizing things but on learning what it means to be aligned with humans. I'm guessing this has been considered -- and is probably a bit difficult to implement. It is probably also worth noting that we also have a whole legal system that serves to reinforce cultural norms, along with reactions from others one interacts with. While commenting on something I really shouldn't be, if the issue is about the runaway paper clip AI that consumes all resources making paper clips then I don't really see that as a big problem. It is a design failure, but the solution seems to be to not give any AI a single focus for maximization. Make them more like a human consumer who has a near inexhaustible set of things it uses to maximize (and I don't think they are as closely linked as standard econ describes even if the equilibrium condition still holds, the per-monetary-unit marginal utilities are equalized). That type of structure also ensures that those maximize-on-one-axis results are not realistic. I think the risk here is similar to that of addiction for humans.
5tailcalled1y
I think it can be worthwhile to look at those mechanisms, in my original post I'm just pointing out that people might have done so more than you might naively think if you just consider whether their alignment approaches mimic the human mechanisms, because it's quite likely that they've concluded that the mechanisms they've come up with for humans don't work. Secondly, I think with some of the examples you mention, we do have the core idea of how to robustly handle them. E.g. valuing real-world objects and avoiding wireheading seems to almost come "for free" with model-based agents.

To summarize your argument: people are not aligned w/ others who are less powerful than them, so this will not generalize to an AGI that is much more powerful than humans.

Parents have way more power than their kids, and there exists some parents that are very loving (ie aligned) towards their kids. There are also many, many people who care about their pets & there exist animal rights advocates. 

If we understand the mechanisms behind why some people e.g. terminally value animal happiness and some don't, then we can apply these mechanisms to other learnin... (read more)

2tailcalled1y
Well, the power relations thing was one example of one mechanism. There are other mechanisms which influence other things, but I wouldn't necessarily trust them to generalize either.

This doesn't make sense to me, particularly since I believe that most people live in environments that are very much "in distribution", and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter.

I think you're ignoring the [now bolded part] in "a particular human’s learning process + reward circuitry + "training" environment" and just focusing in the environment. Humans very often don't optimize for their reward circuitry in their... (read more)

1mesaoptimizer1y
Yes, thank you: I didn't notice that you were making that assumption. This conversation makes a lot more sense to me now. This seems to imply that the aim of this alignment proposal is to solve the alignment problem by aligning the inner values with that of the creators of the AI and bypassing the outer alignment problem. That is really interesting; I've updated in the direction of shard theory being more viable as an alignment strategy than I previously believed. I'm still confused about huge parts of it, but we can discuss it more elsewhere.

There may not be substantial disagreements here. Do you agree with:

"a particular human's learning process + reward circuitry + "training" environment -> the human's learned values" is more informative about inner-misalignment than the usual "evolution -> human values"  (e.g. Two twins could have different life experiences and have different values, or a sociopath may have different reward circuitry which leads to very different values than people with typical reward circuitry even given similar experiences)

The most important claim in your commen

... (read more)
1mesaoptimizer1y
What I see is that we are taking two different optimizers applying optimization pressure on a system (evolution and the environment), and then stating that one optimization provides more information about a property of OOD behavior shift than the other. This doesn't make sense to me, particularly since I believe that most people live in environments that are very much "in distribution", and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter.

My bad; I've updated the comment to clarify that I believe Quintin claims that solving / preventing inner misalignment is easier than one would expect given the belief that evolution's failure at inner alignment is the most significant and informative evidence that inner alignment is hard.

I assume you mean that Quintin seems to claim that inner values learned may be retained with an increase in capabilities, and that usually people believe that inner values learned may not be retained with an increase in capabilities. I believe so too -- inner values seem to be significantly robust to increases in capabilities, especially since one has the option to deceive. Do people really believe that inner values learned don't scale with an increase in capabilities?

Perhaps we are defining inner values differently here. By inner values, I mean terminal goals. Wanting dogs to be happy is not a terminal goal for most people, and I believe that given enough optimization pressure, the hypothetical dog-lover would abandon this goal to optimize for what their true terminal goal is. Does that mean that with an increase in capabilities, people's inner values shift? Not exactly; it seems to me that we were mistaken about people's inner values instead.

My understanding is: Bob's genome didn't have access to Bob's developed world model (WM) when he was born (because his WM wasn't developed yet). Bob's genome can't directly specify "care about your specific family" because it can't hardcode Bob's specific family's visual or auditory features.

This direct-specification wouldn't work anyways because people change looks, Bob could be adopted, or Bob could be born blind & deaf. 

[Check, does the Bob example make sense?]

But, the genome does do something indirectly that consistently leads to people valuin... (read more)

From my perspective, it’s more like the opposite; if alignment were to be solved tomorrow, that would give the AI policy people a fair shot at getting it implemented.

I’m unsure what the government can do that DeepMind or OpenAI (or someone else) couldn’t do on their own. Maybe you’re imagining a policy that forces all companies to build aligned AIs according to the solution, but this won’t be perfect, and an unaligned AGI could still kill everyone (or it could be built somewhere else).

The first thing you do with a solution to alignment is build an aligned AGI to prevent all x-risks. I don’t see how routing through the government helps that process(?)

Why did you use the weak AGI question? It feels like a motte-and-bailey to say “x time until AGI” but then link to the weak AGI question.

7Roko8mo
Agreed. The Weak AGI question on metaculus could be solved tomorrow and very little would change about your life, certainly not worth "reflecting on being human" etc.
5Akram Choudhary1y
Eliezer seems to think that the shift from proto-AGI to AGI to ASI will happen really fast, and many of us on this site agree with him; thus it's not sensible that there is a decade gap between "almost AI" and AI on Metaculus. If I recall, Turing (I think?) said something similar: that once we know the way to generate even some intelligence, things get very fast after that (heavily paraphrased). So 2028 really is the beginning of the end if we do really see proto-AGI then.

I picked it because it has the most predictions and is frequently pointed to as an indicator of big shifts. But you're right, I should work on adding an option to use the strong question instead;  I can see why people might prefer that.

I wonder how much COVID got people to switch to working on Biorisks.

What I’m interested in here is talking to real researchers and asking what events would convince them to switch to alignment. Enumerating those would be useful for explaining to them.

I think asking for specific capabilities would also be interesting. Or what specific capabilities they would’ve said in 2012. Then asking how long they expect between that capability and an x-catastrophe.

I agree. You can even get career advice here at https://www.aisafetysupport.org/resources/career-coaching

Or feel free to message me for a short call. I bet you could get paid to do alignment work, so it’s worth looking into at least.

[Note: this one, steelman, and feedback on proposals all have very similar input spaces. I think I would ideally mix them as one in an actual product, but I'm keeping them separate for now]

Task: Obvious EA/Alignment Advice

  • Context: There are several common mental motions that the EA community does which are usefully applied to alignment. Ex. "Differential Impact", "Counterfactual Impact", "Can you clearly tell a story on how this reduces x-risk?", and "Truly Optimizing for X". A general "obvious advice" is useful for general capabilities as well, but this i
... (read more)
5Alex Lawsen 1y
I actually happen to already have taught Elicit to give helpful/obvious advice (not alignment-specific, but close enough, given the examples were inspired by thinking that lots of the advice I give in my day job as an advisor is obvious)! You can play with it here [https://ide.elicit.org/run/bBTYr9CtrmKpMY3oZ] if you have an Elicit account.

Edit: Here's the training data:

Life problem: I need to think of a research topic but I've only had two ideas and both of them aren't that great.
Obvious but useful advice: Have you tried setting a 5-minute timer and, by-the-clock, just trying to write down as many ideas as possible? This can work surprisingly well!

Life problem: I've been putting off writing this email, and now every time I think about it I feel bad and don't want to do anything at all, especially write the email!
Obvious but useful advice: This seems like a pretty urgent problem to solve, as it's one that will just keep getting worse otherwise as you get negative reinforcement when you think about it. I have a few ideas for things to try: Can you get a friend to sit with you while you write it, or even to write it for you? If you make it your number one priority, can you get it done right now? Is there a way you can not send it, for example by just emailing to say 'sorry, can't reply now, will explain later'?

Life problem: I'm thinking about quitting my job in finance in order to self-study ML and switch to working on alignment. How can I make the final decision?
Obvious but useful advice: That's an exciting decision to be making! It might be worth writing up the pros and cons of both options in a googledoc, and sharing it with some friends with comment access enabled. Getting your thoughts sorted in a way which is clear to others might be helpful itself, and then also your friends might have useful suggestions or additional considerations!

Life problem: I'm giving a talk tomorrow, but I'm worried tha

Task: Steelman Alignment proposals

  • Context: Some alignment research directions/proposals have a kernel of truth to them. Steelmanning these ideas to find the best version of them may open up new research directions or, more likely, make the pivot to alignment research easier. On the latter: some people are resistant to changing their research direction, and a steelman only slightly changes the topic while focusing on maximizing impact. This would make it easier to convince these people to switch to a more alignment-related direction.
  • Input Type: A general resea
... (read more)

Task: Feedback on alignment proposals

  • Context: Some proposals for a solution to alignment are dead ends or have common criticisms. Having an easy way of receiving this feedback on one's alignment proposal can prevent wasted effort as well as furthering the conversation on that feedback.
  • Input Type: A proposal for a solution to alignment or a general research direction
  • Output Type: Common criticisms or arguments for dead ends for that research direction

Instance 1

Input:

Currently AI systems are prone to bias and unfairness which is unaligned with our values. I w

... (read more)
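A task spec like this can be prototyped as a few-shot prompt before doing any fine-tuning. Here's a minimal sketch of what that might look like; the example proposal/criticism pair and the `build_prompt` helper are hypothetical illustrations, not anything from an actual product:

```python
# Sketch of a few-shot prompt builder for the "feedback on alignment proposals"
# task. The example pair below is an illustrative placeholder, not real data.
FEW_SHOT_EXAMPLES = [
    (
        "Proposal: Train the AI on human feedback until it behaves well.",
        "Criticism: Good behavior under training does not rule out deceptive "
        "alignment; the policy may act differently once deployed.",
    ),
]

def build_prompt(proposal: str) -> str:
    """Assemble a prompt: task description, worked examples, then the new input."""
    parts = ["Task: List common criticisms of the following alignment proposal.\n"]
    for example_proposal, example_criticism in FEW_SHOT_EXAMPLES:
        parts.append(f"{example_proposal}\n{example_criticism}\n")
    # Leave the prompt open-ended so the model completes the criticism.
    parts.append(f"Proposal: {proposal}\nCriticism:")
    return "\n".join(parts)

prompt = build_prompt("Currently AI systems are prone to bias and unfairness...")
print(prompt)
```

The resulting string would then be sent to whatever language model backs the tool; adding more example pairs is the main lever for steering the style and depth of the criticisms.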

Thanks. Yeah this all sounds extremely obvious to me, but I may not have included such obvious-to-Logan things if I was coaching someone else.

Key things to avoid include isolating people from their friends, breaking the linguistic association of words to reality, demanding that someone change their linguistic patterns on the spot, etc - mostly things which street epistemology specifically makes harder due to the recommended techniques

Are you saying street epistemology is good or bad here? I've only seen a few videos and haven't read through the intro documents or anything.

1the gears to ascension1y
Good. People have [edit: some] defenses against abusive techniques, and from what I've seen of street epistemology, its response to most of those is to knock on the front door rather than try to sneak in the window, metaphorically speaking.

I was talking to someone recently who talked to Yann and got him to agree with very alignment-y things, but then a couple days later, Yann was saying very capabilities things instead. 

The "someone"'s theory was that Yann's incentives and environment are all towards capabilities research.

I think that everyone can see these in theory, but different people focus on different types of information (eg low level sensory information vs high level sensory information) by default. 

I believe drugs or meditating can change which types of information you pay more attention to by default, momentarily or even permanently. 

I've never taken drugs beyond caffeine & alcohol, but meditating makes these phenomena much easier to see. I bet you could get most people to see them if you ask them to e.g. stare at a textured surface like carpet for 2... (read more)

I understand your point now, thanks. It's:

An embedded aligned agent is desired to have properties (1),(2), and (3). But, suppose (1) & (2), then (3) cannot be true. Then, suppose (2) & ...

or something of the sort. 

1Remmelt1y
Yeah, that points well to what I meant. I appreciate your generous intellectual effort here to paraphrase back! Sorry about my initially vague and disagreeable comment (aimed at Adam, who I chat with sometimes as a colleague). I was worried about what looks like a default tendency in the AI existential safety community to start from the assumption that problems in alignment are solvable. Adam has since clarified with me that although he had not written about it in the post, he is very much open to exploring impossibility arguments (and sent me a classic paper on impossibility proofs in distributed computing).

Happy Birthday Man. I’d probably have talked to you about AI Alignment by now, and can imagine all the circles we would go arguing it.

I feel like such a different person than even a few years ago, and I don’t think I mean that from a “redefining myself” way or wanting to boost my ego. I wonder how different you’d be after your startup idea.

It’d be nice to have talked to you after Ukraine being invaded, or go see coach about it.

I’ll bring you back if I can,

Logan

I'm confused about what your point here even is. For the first part, if you're trying to say

research that gives strong arguments/proofs that you cannot solve alignment by doing X (like showing certain techniques aren't powerful enough to prove P!=NP) is also useful.

, then that makes sense. But the post didn't mention anything about that?

You said:

We cannot just rely on a can-do attitude, as we can with starting a start-up (where even if there’s something fundamentally wrong about the idea, and it fails, only a few people’s lives are impacted hard).

which I feel... (read more)

We don't have any proofs that the referenced researchers' approaches are doomed to fail, like we have for P!=NP and what you linked.


Besides looking for different angles or ways to solve alignment, or even for strong arguments/proofs why a particular technique will not solve alignment,
... it seems prudent to also look for whether you can prove embedded misalignment by contradiction (in terms of the inconsistency of the inherent logical relations between essential properties that would need to be defined as part of the concept of embedded/implemented/compu... (read more)

Load More