
This is very helpful, thanks! Actually, the post includes several sections, including in the appendix, that might be more interesting to many readers than the grant recommendations themselves. Maybe it would be good to change the title a bit so that people also expect other updates.


I also found parts of this post surprisingly interesting, given the ultra-dry title and intimidating reading time.
To present this kind of content in a way more readers could benefit from, another option would be to post it as a small sequence, so people could vote and comment on separate sections.

Thanks for the reply!

As I show in the examples in DSLT1, having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (i.e. $K(w) = w^4$). Precisely, suppose $d$ is the number of parameters, then you are in the regular case if $K(w)$ can be expressed as a full-rank quadratic form near each singularity,

$$K(w) = \sum_{i=1}^d w_i^2.$$
Anything less than this is a strictly singular case.

So if , then is a singularity b...

In this shortform, I very briefly explain my understanding of how zeta functions play a role in the derivation of the free energy in singular learning theory. This is entirely based on slide 14 of the SLT Low 4 talk of the recent Summit on SLT and Alignment, so feel free to ignore this shortform and simply watch the video.

The story is this: we have a prior $\varphi(w)$, a model $p(x \mid w)$, and there is an unknown true distribution $q(x)$. For model selection, we are interested in the evidence of our model for a da...

Thanks for the answer! I think my first question was confused because I didn't realize you were talking about *local* free energies instead of the global one :)

As discussed in the comment in your DSLT1 question, they are both singularities of $K(w)$ since they are both critical points (local minima).

Oh, I actually may have missed that aspect of your answer back then. I'm confused by that: in algebraic geometry, the zeros of a set of polynomials are not necessarily already singularities. E.g., in , the zero set consists of the two...


Correct! So, the point is that things get interesting when W0 is more than just a single point (which is the regular case). In essence, singularities are local minima of K(w). In the non-realisable case this means they are zeroes of the minimum-loss level set. In fact we can abuse notation a bit and really just refer to any local minima of K(w) as a singularity. The TLDR of this is:
singularities of K(w) = critical points of K(w)
As I show in the examples in DSLT1, having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (i.e. $K(w) = w^4$). Precisely, suppose $d$ is the number of parameters, then you are in the regular case if $K(w)$ can be expressed as a full-rank quadratic form near each singularity,

$$K(w) = \sum_{i=1}^d w_i^2.$$
Anything less than this is a strictly singular case.
Watanabe has an interesting little section in the grey book [Remark 7.4, Theorem 7.4, Wat09] talking about the Jeffreys prior. I haven't studied it in detail, but to the best of my reading he is basically saying "from the point of view of SLT, the Jeffreys prior is zero at singularities anyway, its coordinate-free nature makes it inappropriate for statistical learning, and the RLCT can only be $\lambda \ge d/2$ if the Jeffreys prior is employed." (The last statement is the content of the theorem where he studies the poles of the zeta function when the Jeffreys prior is employed.)
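As a numerical sketch of the two flavours of degeneracy (my own toy code, not from the post — the step size and examples are assumptions), one can estimate the Hessian at a singularity $w = 0$ by finite differences:

```python
import numpy as np

def hessian(K, w, h=1e-4):
    """Finite-difference estimate of the Hessian of K at w."""
    d = len(w)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * h, np.eye(d)[j] * h
            H[i, j] = (K(w + ei + ej) - K(w + ei) - K(w + ej) + K(w)) / h**2
    return H

# Flavour 1, rank-deficiency: K(w1, w2) = w1^2 ignores w2 entirely.
H_rank = hessian(lambda w: w[0] ** 2, np.zeros(2))
print(np.linalg.matrix_rank(H_rank))  # 1, not the full rank 2

# Flavour 2, vanishing second derivative: K(w) = w^4 at w = 0.
H_quartic = hessian(lambda w: w[0] ** 4, np.zeros(1))
print(H_quartic)  # approximately zero
```

Both cases give a degenerate Hessian at the singularity, but for different reasons, matching the two flavours described above.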


Thanks also for this post! I enjoy reading the sequence and look forward to post 5 on the connections to alignment :)

At some critical value, we recognise a phase transition as being a discontinuous change in the free energy or one of its derivatives, for example the generalisation error $G_n$.

"Discontinuity" might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these "sudden changes" happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this co...


This is a great question and something that came up at the recent summit. We would definitely say that the model is in two different phases before and after grokking (i.e. when the test error is flat), but it's an interesting question to consider what's going on over that long period of time where the error is slowly decreasing. I imagine that it is a relatively large model (from an SLT point of view, which means not very large at all from the normal ML point of view), meaning there would be a plethora of different singularities in the loss landscape. My best guess is that it is undergoing many phase transitions across that entire period, where it is finding regions of lower and lower RLCT but equal accuracy. I expect there to be some work done in the next few months applying SLT to the grokking work.
This is a very interesting point. I broadly agree with this and think it is worth thinking more about, and could be a very useful simplifying assumption in considering the connection between SGD and SLT.
Broadly speaking, yes. With that said, hyperparameters in the model are probably interesting too (although maybe more from a capabilities standpoint). I think phase transitions in the truth are also probably interesting in the sense of dataset bias, i.e. what changes about a model's behaviour when we include or exclude certain data? Worth noting here that the Toy Models of Superposition work explicitly deals in phase transitions in the truth, so there's definitely a lot of value to be had from studying how variations in the truth induce phase transitions, and what these ramifications are in other things we care about.
At a first pass, one might say that second-order phase transitions correspond to something like the formation of circuits. I think there are definitely reasons to believe both happen during training.
I just mean that $K(w)$ is not affected by $n$ (even though of course $K_n(w)$ or $L_n(w)$ is), but the posterior is still affected by $n$. So the phase transition merely conce


Thanks Liam also for this nice post! The explanations were quite clear.

The property of being singular is specific to a model class, regardless of the underlying truth.

This holds for singularities that come from symmetries where the model doesn't change. However, is it correct that we need the "underlying truth" to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence.

...Both configu


The definition of the Fisher information matrix does not refer to the truth q(y,x) whatsoever. (Note that in the definition I provide I am assuming the supervised learning case where we know the input distribution q(x), meaning the model is p(y,x|w)=p(y|x,w)q(x), which is why the q(x) shows up in the formula I just linked to. The derivative terms do not explicitly include q(x) because it just vanishes in the $w_j$ derivative anyway, so it's irrelevant there. But remember, we are ultimately interested in modelling the conditional true distribution q(y|x) in q(y,x)=q(y|x)q(x).)
You're right, that's sloppy terminology from me. What I mean is, in the right hand picture (that I originally labelled WA), there is a region in which all nodes are active, but cancel out to give zero effective gradient, which is markedly different to the left hand picture. I have edited this to NonWC and WC instead to clarify, thanks!


In particular, it is the singularities of these minimum-loss sets — points at which the tangent is ill-defined — that determine generalization performance.

To clarify: there is not necessarily a problem with the tangent, right? E.g., the function $K(w) = w^4$ has a singularity at $w = 0$ because the second derivative vanishes there, but the tangent is defined. I think for the same reason, some of the pictures may be misleading to some readers.

- A model, $p(y \mid x, w)$, parametrized by weights $w \in \mathcal{W}$, where $\mathcal{W}$ is compact;

Why do we want...

Thanks for the answer mfar!

...Yeah, I remember also struggling to parse this statement when I first saw it. Liam answered, but in case it's still not clear and/or someone doesn't want to follow up in Liam's thesis: $x$ is a free variable, and the condition is talking about linear dependence of functions of $x$. Consider a toy example (not a real model) to help spell out the mathematical structure involved: Let … so that … and …. Then let … and … be functions such that …

Thanks for the answer Liam! I especially liked the further context on the connection between Bayesian posteriors and SGD. Below a few more comments on some of your answers:

The partition function is equal to the model evidence $Z_n = p(D_n)$, yep. It isn’t equal to $p((Y_i) \mid (X_i))$ (I assume $i$ is fixed here?) but is instead expressed in terms of the model likelihood and prior (and can simply be thought of as the “normalising constant” of the posterior),

...and then under this supervised learning setup where we know $q(x_i)$…


Thanks for this nice post! I find it slightly more vague than the first post, but I guess that is hard to avoid when trying to distill highly technical topics. I got a lot out of it.

Fundamentally, we care about the free energy because it is a measure of posterior concentration, and as we showed with the BIC calculation in DSLT1, it tells us something about the information geometry of the posterior.

Can you say more about why it is a measure of posterior concentration? (It gets a bit clearer further below, but I state my question n...


Given a small neighbourhood $W \subset \mathcal{W}$, the free energy is $F_n(W) = -\log Z_n(W)$, and $Z_n(W)$ measures the posterior concentration in $W$ since

$$Z_n(W) = \int_W e^{-n L_n(w)} \varphi(w)\, dw,$$

where the inner term is the posterior, modulo its normalisation constant $Z_n$. The key here is that if we are comparing different regions of parameter space $\mathcal{W}$, then the free energy doesn't care about that normalisation constant, as it is just a shift in $F_n(W)$ by a constant. So the free energy gives you a tool for comparing different regions of the posterior. (To make this comparison rigorous, I suppose one would want to make sure that these regions $W_1, W_2$ are the same "size". Another perspective, and really the main SLT perspective, is that if they are sufficiently small and localised around different singularities then this size problem isn't really relevant, and the free energy is telling you something about the structure of the singularity and the local geometry of $K(w)$ around the singularity.)
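To make the comparison of regions concrete, here is a small numerical sketch of my own (the neighbourhoods, the choice $L_n = K$, i.e. minimum loss zero, and the uniform prior are all assumptions for illustration): the local free energy of a singular quartic minimum eventually beats that of a regular quadratic one, with slopes in $\log n$ given by the respective RLCTs.

```python
import numpy as np

def local_free_energy(K, region, n, prior=lambda w: 1.0, num=20001):
    # F_n(W) = -log Z_n(W), with Z_n(W) = ∫_W exp(-n K(w)) φ(w) dw
    # (taking L_n(w) = K(w), i.e. minimum loss zero, for simplicity)
    w = np.linspace(*region, num)
    integrand = np.exp(-n * K(w)) * prior(w)
    return -np.log(np.sum(integrand) * (w[1] - w[0]))

# Two local minima with equal (zero) loss but different local geometry:
# a regular quadratic one (RLCT 1/2) and a singular quartic one (RLCT 1/4).
for n in [10**2, 10**4, 10**6]:
    F_quad = local_free_energy(lambda w: w**2, (-0.5, 0.5), n)
    F_quart = local_free_energy(lambda w: w**4, (-0.5, 0.5), n)
    print(n, round(F_quad, 3), round(F_quart, 3))
# For large n, F_n(W) ≈ λ log n + O(1), so the quartic region has
# lower free energy and the posterior concentrates there.
```

Plotting $F_n(W)$ against $\log n$ for the two regions recovers slopes of roughly $1/2$ and $1/4$, the respective RLCTs.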
This is sloppily written by me, apologies. I merely mean to say "the free energy tells us what models the posterior likes".
I mean, the relation between Gn and Fn tells you that this is a sensible thing to write down, and if you reconstructed the left side from the right side you would simply find some definition in terms of the predictive distribution restricted to W (instead of W in the integral).
Yes - and as you say, this would be very uninteresting (and in general you wouldn't know what to pick necessarily [although we did in the phase transition DSLT4 because of the classification of $W_0$ in DSLT3]). The point is that at no point are you just magically "choosing" a $W$ anyway. If you really want to calculate the free energy of some model setup then you would have a reason to choose different phases to analyse. Otherwise, the premise of this section of the post is to show that the geometry of $K(w)$ depends on the singularity structure, and this varies across parameter space.
As discussed in the comment in your DSLT1 questio


Thank you for this wonderful article! I read it fairly carefully and have a number of questions and comments.

where the

partition function (or in Bayesian terms the evidence) is given by

Should I think of this as being equal to $p(D_n)$, and would you call this quantity the evidence? I was a bit confused since it seems like we're not interested in the data likelihood, but only the conditional data likelihood $p((Y_i) \mid (X_i))$ under the model.

And to be clear: This does not factorize over $i$, because every data point informs…


I think these are helpful clarifying questions and comments from Leon. I saw Liam's response. I can add to some of Liam's answers about some of the definitions of singular models and singularities.
1. Conditions of regularity: Identifiability vs. regular Fisher information matrix
As Liam said, I think the answer is yes---the emphasis of singular learning theory is on the degenerate Fisher information matrix (FIM) case. Strictly speaking, all three classes of models (regular, non-identifiable, degenerate FIM) are "singular", as "singular" is defined by Watanabe. But the emphasis is definitely on the 'more' singular models (with degenerate FIM) which is the most complex case and also includes neural networks.
As for non-identifiability being uninteresting, as I understand, non-regularity arising from certain kinds of non-local non-identifiability can be easily dealt with by re-parametrising the model or just restricting consideration to some neighbourhood of (one copy of) the true parameter, or by similar tricks. So, the statistics of learning in these models is not strictly-speaking regular to begin with, but we can still get away with regular statistics by applying such tricks.
Liam mentions the permutation symmetries in neural networks as an example. To clarify, this symmetry usually creates a discrete set of equivalent parameters that are separated from each other in parameter space. But the posterior will also be reflected along these symmetries so you could just get away with considering a single 'slice' of the parameter space where every function is represented by at most one parameter (if this were the only source of non-identifiability---it turns out that's not true for neural networks).
It's worth noting that these tricks don't generally apply to models with local non-identifiability. Local non-identifiability =roughly there are extra true parameters in every neighbourhood of some true parameter. However, local non-identifiability implies that the FIM i


Thanks for the comment Leon! Indeed, in writing a post like this, there are always tradeoffs in which pieces of technicality to dive into and which to leave sufficiently vague so as to not distract from the main points. But these are all absolutely fair questions so I will do my best to answer them (and make some clarifying edits to the post, too). In general I would refer you to my thesis where the setup is more rigorously explained.
The partition function is equal to the model evidence $Z_n = p(D_n)$, yep. It isn’t equal to $p((Y_i) \mid (X_i))$ (I assume $i$ is fixed here?) but is instead expressed in terms of the model likelihood and prior (and can simply be thought of as the “normalising constant” of the posterior),

$$p(D_n) = \int_W \varphi(w) \prod_{i=1}^n p(y_i, x_i \mid w)\, dw,$$

and then under this supervised learning setup where we know $q(x_i)$, we have $p(y_i, x_i \mid w) = p(y_i \mid x_i, w) q(x_i)$. Also note that this does “factor over $i$” (if I’m interpreting you correctly) since the data is independent and identically distributed.
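To sketch the "factors over $i$" point numerically (my own toy example, not from the post — the Bernoulli model, uniform prior, and data are assumptions): the likelihood factors over $i$ *inside* the integral, yet the evidence is not a product of per-point evidences, because each data point updates the posterior over $w$.

```python
import numpy as np

# Hypothetical Bernoulli model p(y|w) = w^y (1-w)^(1-y), uniform prior on w.
grid = np.linspace(0.001, 0.999, 999)   # grid over the parameter w
prior = np.ones_like(grid) / len(grid)  # discretised uniform prior φ(w)
data = [1, 1, 1, 0, 1]

def lik(y, w):
    return w**y * (1 - w) ** (1 - y)

# p(D_n) = ∫ φ(w) Π_i p(y_i|w) dw: the product over i sits inside the integral.
evidence = np.sum(prior * np.prod([lik(y, grid) for y in data], axis=0))

# This is NOT the same as the product of single-point evidences Π_i p(y_i).
product_of_marginals = np.prod([np.sum(prior * lik(y, grid)) for y in data])

print(evidence, product_of_marginals)  # they differ
```

Here `evidence` comes out larger, since the data points are positively correlated under the shared unknown $w$.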
Yep, you caught me - I was one step ahead. The free energy over the whole space W is still a very useful quantity as it tells you “how good” the best model in the model class is. But Fn by itself doesn’t tell you much about what else is going on in the loss landscape. For that, you need to localise to smaller regions and analyse their phase structure, as presented in DSLT2.
Ah, yes, you are right - this is a notational hangover from my thesis where I defined EX to be equal to expectation with respect to the true distribution q(y,x). (Things get a little bit sloppy when you have this known q(x) floating around everywhere - you eventually just make a few calls on how to write the cleanest notation, but I agree that in the context of this post it’s a little confusing so I apologise).
See Lemma A.2 in my thesis. One uses a fairly standard argument involving the first central moment of a Gaussian.
Yep, the rest of the article does focus on the case where the Fisher information matrix is degenera

https://twitter.com/ai_risks/status/1664323278796898306?s=46&t=umU0Z29c0UEkNxkJx-0kaQ

Apparently Bill Gates signed.

Stating the obvious: Do we expect that Bill Gates will donate money to prevent extinction from AI?


Gates has been publicly concerned about AI X-risk since at least 2015, and he hasn't yet funded anything to try to address it (at least that I'm aware of), so I think it's unlikely that he's going to start now (though who knows – this whole thing could add a sense of respectability to the endeavor that pushes him to do it).


It's great to see Yoshua Bengio and other eminent AI scientists like Geoffrey Hinton actively engage in the discussion around AI alignment. He evidently put a lot of thought into this. There is a lot I agree with here.

Below, I'll discuss two points of disagreement or where I'm surprised by his takes, to highlight potential topics of discussion, e.g. if someone wants to engage directly with Bengio.

- Most of the post is focused on the outer alignment problem -- how do we specify a goal aligned with our intent -- and seems to ignore the inner alignment problem

It is true that we have seen over two decades of alignment research, but the alignment community has been fairly small all this time. I'm wondering what a much larger community could have done.

I start to get concerned when I look at humanity's non-AI alignment successes and failures; we've had corporations for hundreds of years, and a significant portion of humanity have engaged in corporate alignment-related activities (regulation, lawmaking, governance etc, assuming you consider those forces to generally be pro-alignment in principle). Corpor...


David had many conversations with Bengio about alignment during his PhD, and gets a lot of credit for Bengio taking AI risk seriously.


Bengio was his Master’s thesis advisor too.


After filling out the form, I could click on "see previous responses", which allowed me to *see the responses of all other people who have filled out the form so far*.

That is probably not intended?


Indeed that wasn't intended. Thanks a lot for spotting & sharing it! It's fixed now.

I disagree with this. I think the most useful definition of alignment is intent alignment. Humans are effectively intent-aligned on the goal to not kill all of humanity. They may still kill all of humanity, but that is not an alignment problem but a problem in capabilities: humans aren't capable of knowing which AI designs will be safe.

The same holds for intent-aligned AI systems that create unaligned successors.

For what it's worth, I think this comment seems clearly right to me, even if one thinks the post actually shows misalignment. I'm confused about the downvotes of this (5 net downvotes and 12 net disagree votes as of writing this).


See here for an explanation of why I chose the examples that I did.


Now to answer our big question from the previous section: I can find some satisfying the conditions exactly when all of the ’s are independent given the “perfectly redundant” information. In that case, I just set to be exactly the quantities conserved under the resampling process, i.e. the perfectly redundant information itself.

In the original post on redundant information, I didn't find a definition for the "quantities conserved under the resampling process". You name this F(X) in that post.

Just to be sure: is your c...


When I converse with junior folks about what qualities they’re missing, they often focus on things like “not being smart enough” or “not being a genius” or “not having a PhD.” It’s interesting to notice differences between what junior folks think they’re missing & what mentors think they’re missing.

There may also be social reasons to give different answers depending on whether you are a mentor or mentee. I.e., answering "the better mentees were those who were smarter" seems like an uncomfortable thing to say, even if it's true.

(I do not wan...


+1. I'll note though that there are some socially acceptable ways of indicating "smarter" (e.g., better reasoning, better judgment, better research taste). I was on the lookout for these kinds of statements, and I rarely found them. The closest thing that came up commonly was the "strong and concrete models of AI safety" (which could be loosely translated into "having better and smarter thoughts about alignment").


+1, though I will note that skills 2-5 listed here are pretty strongly correlated with being smarter. It's possible the mentors are factoring the skills differently (more politely?)


Then $UU^\dagger$ is a projection matrix, projecting onto the span.

To clarify: for this, you probably need the basis to be orthonormal?


The "dagger" indicates a pseudoinverse, not a transpose, which is why this works even with non-orthonormal U. But an orthonormal basis would probably be most convenient; in that case the pseudoinverse is just the transpose.
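A quick numerical check of this (my own sketch; the matrix `U` is an arbitrary made-up basis, deliberately not orthonormal):

```python
import numpy as np

# Hypothetical 3D example: U's columns span a 2D subspace, not orthonormal.
U = np.array([[1.0, 1.0],
              [0.0, 2.0],
              [1.0, 0.0]])
P = U @ np.linalg.pinv(U)  # U U†, with † the Moore–Penrose pseudoinverse

# P is idempotent (a projection) and fixes vectors already in the span:
assert np.allclose(P @ P, P)
assert np.allclose(P @ U, U)

# With an orthonormal basis, the pseudoinverse reduces to the transpose:
Q, _ = np.linalg.qr(U)
assert np.allclose(np.linalg.pinv(Q), Q.T)
```

So the projection works for any full-column-rank `U`, and the orthonormal case is the special case where $U^\dagger = U^T$.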


**Summary**

- Disagreements often *focus on outputs* even though *underlying models* produced those.
- Double Crux idea: focus on the models!
- Double Crux tries to reveal the different underlying beliefs coming from different perspectives on reality
- Good Faith Principle:
  - Assume that the other side is moral and intelligent.
  - Even if some actors are bad, you minimize the chance of error if you *start with the prior* that each new person is acting in good faith
- Identifying Cruxes
  - For every belief A, there are usually beliefs B, C, D such that their believed truth supp


**Summary**

In this post, John starts with a very basic intuition: that abstractions are things you can get from many places in the world, which are therefore very *redundant*. Thus, for finding abstractions, you should first define redundant information: Concretely, for a system of n random variables X1, …, Xn, he defines the redundant information as that information that remains about the original after repeatedly resampling one variable at a time while keeping all the others fixed. Since there will not be any remaining information if n is finite, there is...


**Summary**

- Goal: Find motivation through truth-seeking rather than coercion or self-deception
  - Ideally: the urges are aligned with the high-level goals
  - Turn “wanting to want” into “want”
- If a person has simultaneously conflicting beliefs and desires, then one of those is wrong.
  - [Comment from myself: I find this, as stated, not evidently true since desires often do not have a “ground truth” due to the orthogonality thesis. However, even *if* there is a conflict between subsystems, the productive way forward is usually to find a common path in a __values handsh__


**Summary**

- Focusing is a technique for bringing subconscious system 1 information into conscious awareness
- Felt sense: a feeling in the body that is not yet verbalized but may subconsciously influence behavior, and which carries *meaning*.
- The dominant factor in patient outcomes: *does the patient remain uncertain*, instead of having *firm narratives*?
- A goal of therapy is *increased awareness and clarity*. Thus, it is not useful to spend much time in the *already known*.
- The successful patient thinks **and** listens to information
- If th


**Summary**:

- Some things seem complicated/difficult and hard to do. You may also have uncertainties about whether you *can* achieve it
  - E.g.: running a marathon; or improving motivation
- The resolve cycle: set a 5-minute timer and *just solve the thing*.
- Why does it work?
  - You want to be *actually trying*, even when there is *no immediate need.*
  - Resolve cycles are an easy way of achieving that (They make it more likely and less painful to invest effort)
- One mechanism of its success: when asking “Am I ready for this?” the answer is often “No”. But when there’


Note: I found this article in particular a bit hard to summarize, especially the section "The argument for CoZE". I find it hard to say what exactly it is telling me, and how it relates to the later sections.

**Summary**

- Comfort is a *lack of* pain, discomfort, negative emotions, fear, and anxiety, …
- Comfort often comes from experience
- There’s a gray area between comfort and discomfort that can be *worth exploring*
- Explore/Exploit Tradeoff:
  - Should you exploit the current hill and climb higher *there*, or search for a new one?
  - Problem: there is *inh*


**Summary**

- Idea: use knowledge of how physiology influences the mind
- Mental shutdown:
  - Stress → Mental shutdown (trouble thinking, making decisions, …)
  - This leads to decisions that *feel correct* at the moment but are obviously flawed in hindsight
  - Metacognitive failure: Part of what we lose is also our *ability to notice* the loss in abilities → Need an objective “sobriety test”
- The autonomic nervous system
  - Sympathetic nervous system (SNS): accelerator; fight/flight/freeze, excitement
  - Parasympathetic nervous system (PSNS): brakes; chill, open vulnerability, reflect


**Summary**

- Systemizing has large up-front costs for diminished repeated costs
- Not everything *needs* systemization: maybe you *like* your inefficient process, or the costs are too infrequent to bother.
- Systemization opportunities:
  - **Common routines**: waking up, meals, work routines, computer, social
  - **Familiar spaces**: bedroom, bathroom, kitchen, living room, vehicle, workspace, backpack
  - **Shoulds/Obligations**: physical health, finances, intellectual growth, close relationships, career, emotional well-being, community
- Framing: instead of trying to “do everything”,


**Summary:**

- **Claim 1**: Goodhart’s Law is true
  - “Any measure which becomes the target ceases to be a good measure”
  - Examples:
    - Any math test supposed to find the best students will cease to work at the 10th iteration — people then “study to be good at the test”
    - Sugar was a good proxy for healthy food in the ancestral environment, but not today
- **Claim 2**: If you want to condition yourself to a certain behavior with some reward, then that’s possible if only the delay between behavior and reward is small enough
- **Claim 3**: Over time, we develop “taste”: inexplicable judgments of


**Summary**

An abstraction of a high-dimensional random variable X is a low-dimensional summary G(X) that can be used to make predictions about X. In the case that X is sampled from some parameterized distribution P(X | theta), G(X) may take the form of a sufficient statistic, i.e., a function of X such that P(theta | X) = P(theta | G(X)). To make predictions about X, one may then determine theta from P(theta | G(X)), and predict a new data point X from theta.

In this post, John shows that if you have a very low-dimensional sufficient statistic G(X), then in man...
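To illustrate the sufficient-statistic idea with a toy example of my own (the Bernoulli model, uniform prior, and datasets are assumptions, not from the post): for coin-flip data, $G(X) = \sum_i X_i$ is sufficient, so the posterior over the parameter depends on the data only through $G(X)$.

```python
import numpy as np

def posterior(x, grid):
    # Uniform prior on theta; unnormalised likelihood prod_i theta^x_i (1-theta)^(1-x_i).
    k, n = sum(x), len(x)
    p = grid**k * (1 - grid) ** (n - k)
    return p / p.sum()

grid = np.linspace(0.001, 0.999, 999)  # grid over the parameter theta
x1 = [1, 0, 1, 1, 0, 0, 1]  # G(X) = 4
x2 = [0, 1, 0, 1, 1, 1, 0]  # same G(X) = 4, different ordering

# Same sufficient statistic => identical posteriors over theta.
assert np.allclose(posterior(x1, grid), posterior(x2, grid))
```

Any two datasets with the same count (and length) yield the same posterior, which is exactly the condition $P(\theta \mid X) = P(\theta \mid G(X))$ from the summary above.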


**Summary**

- “I want to exercise but I don’t *want* to exercise”
  - Wants reflect both long-term goals and immediate desires
  - Emotional valence makes the two not always agree
  - Yum and yuck: the things that are yucky at the moment often contain a yummy quality for the long term
- **Goal**: the yum-feeling is aligned with the in-the-moment actions toward our long-term wants
  - This can be achieved with internal double-crux (**IDC**)
  - This article is more for understanding *what’s going on*, and IDC is for *tinkering* with it.
- Our motivations are shaped by **hyperbolic discou**

**Uncharitable Summary**

Most likely there’s something in the intuitions which got lost when transmitted to me via reading this text, but the mathematics itself seems pretty tautological to me (*nevertheless* I found it interesting since tautologies can have interesting structure! The proof itself was *not* trivial to me!).

Here is my uncharitable summary:

Assume you have a Markov chain M_0 → M_1 → M_2 → … → M_n → … of variables in the universe. Assume you know M_n and want to predict M_0. The Telephone theorem says two things:

- You don’t need to keep *a*

**Edit:** This is now obsolete with our NAH distillation.

This shortform distills the Telephone Theorem and its proof. The shortform will therefore not be "intuitive" at all; the only goal is to be mathematically precise at every step.

Let $X_1, \dots, X_n, \dots$ be jointly distributed finite random variables, meaning they are all functions

$$X_i : \Omega \to \mathcal{X}_i,$$

starting from the same finite sample space $\Omega$ with a given probability distribution and into respective finite value spaces $\mathcal{X}_i$. Additionally, assume that these r...


**Summary:**

- Sometimes, reinforcement learning goes wrong: how can this be prevented?
- Example: math education
  - One student simply “learns to follow along”, and the other “learns to predict what comes next”
  - The other student may gain the ability to solve math problems on their own, while the first plausibly won’t.
- Turbocharging, general notes:
  - Idea: You get better *at the things you practice*, and it pays off to think about what, mechanistically, you want to learn.
  - You *won’t* just learn “what you intend”:
    - If you *intend* to gain the skill of disarmament o


I drop further ideas here in this comment for future reference. I may edit the comment whenever I have a new idea:

- Someone had the idea that instead of giving a negative reward for acting while an alert is played, one may also give a positive reward for performing the null action in those situations. If the positive reward is high enough, an agent like MuZero (which explicitly tries to maximize the expected predicted reward) may then be incentivized to *cause* the alert sound to be played during deployment. One could then add further environment details

Thanks for your answer!

Credit assignment (AKA policy gradient) credits the diamond-recognizing circuit as responsible for reward, thereby retaining this diamond abstraction in the weights of the network.

This is different from how I imagine the situation. In my mind, the diamond-circuit remains simply because it is a good abstraction for making predictions about the world. Its existence is, in my imagination, not related to an RL update process.

Other than that, I think the rest of your comment doesn't quite answer my concern, so I try to formali...


I don't currently share your intuitions for this particular technical phenomenon being plausible, but I imagine there are other possible reasons this could happen, so sure? I agree that there are some ways the diamond-shard could lose influence. But mostly, again, I expect this to be a quantitative question, and I think experience with people suggests that trying a fun new activity won't wipe away your other important values.


**Summary**

The natural abstractions hypothesis makes three claims:

- Abstractability: to make predictions in our world, it’s enough to know very low-dimensional summaries of systems, i.e., their abstractions (**empirical claim**)
- Human-compatibility: Humans themselves use these abstractions in their thinking (**empirical claim**)
- Convergence/naturality: most cognitive systems use these abstractions to make predictions (**mathematical + empirical claim**)

John wants to test this hypothesis by:

- Running simulations of systems and showing that low-information summaries predict how the


**Summary**

This article claims that:

- Unsupervised learning systems will likely learn many “natural abstractions” of concepts like “trees” or “human values”. Maybe they will even end up being simply a “feature direction”.
- One reason to expect this is that to make good predictions, you only need to conserve information that’s useful at a distance. And this information can be imagined being a “natural abstraction”.

- If you then have an RL system or supervised learner that can use the unsupervised activations to solve a problem, then it can directly behave in suc


**Summary:**

- Aversions lead to avoidance or displeasure
- Aversion Factoring: a technique for overcoming aversions
- Idea: an activity doesn’t “just suck”: *specific elements* of it are aversive.
- While in goal factoring, the emphasis was on finding the *positive* factors, here, the emphasis seems to be on finding the *negative* factors
- Then, instead of imagining *doing a different activity* with all of the benefits, imagine *doing the same activity* with none of the aversions
- Story of Critch climbing trees:
  - Dirty clothes: buy dark jeans
  - Danger: practice falli


In this shortform, I explain my main confusion with this alignment proposal. The main thing that's unclear to me: what's the idea here for how the agent *remains motivated by diamonds* *even while doing very non-diamond related things like "solving mazes" that are required for general intelligence?*

More details in the shortform itself.


I think that was supposed to be answered by this line:

These are rough notes trying (but not really succeeding) to deconfuse me about Alex Turner's diamond proposal. The main thing I wanted to clarify: what's the idea here for how the agent *remains motivated by diamonds* *even while doing very non-diamond related things like "solving mazes" that are required for general intelligence?*

- Summarizing Alex's summary:
  - Multimodal SSL initialization
  - recurrent state, action head
  - imitation learning on humans in simulation, + sim2real
  - low sample complexity
  - Humans move toward diamonds
  - policy-gradient RL: reward the AI for getting n


I think that the agent probably learns a bunch of values, many related to gaining knowledge and solving games and such. (People are also like this; notice that raising a community-oriented child does not require a proposal for how the kid will only care about their community, even as they go through school and such.)
I think this is way stronger of a claim than necessary. I think it's fine if the agent learns some maze-/game-playing shards which do activate while the diamond-shard doesn't -- it's a quantitative question, ultimately. I think an agent which cares about playing games and making diamonds and some other things too, still ends up making diamonds.
Credit assignment (AKA policy gradient) credits the diamond-recognizing circuit as responsible for reward, thereby retaining this diamond abstraction in the weights of the network.


I now finally read LawrenceC's Shard Theory in Nine Theses: a Distillation and Critical Appraisal. I think it is worth reading for many people even if they already read my own distillation. A summary of things emphasized that are missing (or less emphasized) in mine (Note: Lawrence doesn't necessarily **believe** these claims and, for some of them, lists his disagreements):

- A nice picture of how the different parts of an agent composed of shards interact. This includes a **planner**, which I've not mentioned in my post.
- A comparison of shard theory with the su


**Summary**

This article thinks about what “general purpose search” is and why to expect it in advanced machine learning systems.

In general, we expect gradient descent to find “simple solutions” with lots of varying parameters (since they occupy a larger part of solution space) and “general solutions” that are helpful broadly (since we will put the system in diverse environments). Thus, we *do* expect search processes to emerge.

However, babble and prune will likely not be the resulting process: it’s not compute- and memory-efficient enough. Instead, John ...


The important part here is that babble scales extremely poorly with dimensionality of the problem (or, more precisely, the fraction of problem-space which is filled with solutions). So babble is fine once we've reduced to a low-dimensional subproblem; most of the algorithmic work is in reducing the big problem to a bunch of low-dimensional subproblems.
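The exponential blow-up described above is easy to check numerically. Below is a small illustrative Python sketch (my own addition, not from the original discussion; the function name and parameters are hypothetical): it estimates how often pure babble, i.e. uniform random guessing, lands in a fixed target region as the dimensionality grows.

```python
import random

def babble_success_rate(dim, tolerance=0.5, trials=10_000, seed=0):
    """Estimate the fraction of uniform random guesses in [-1, 1]^dim
    that land within `tolerance` of the origin in every coordinate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        point = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if all(abs(x) <= tolerance for x in point):
            hits += 1
    return hits / trials

# Each coordinate succeeds with probability 0.5 here, so the overall
# success rate is roughly 0.5 ** dim: it decays exponentially in dim.
for dim in (1, 2, 5, 10, 20):
    print(dim, babble_success_rate(dim))
```

With a per-coordinate hit probability of 0.5, a 20-dimensional problem already needs on the order of a million guesses per solution, which is one way to see why most of the algorithmic work lies in first reducing the big problem to low-dimensional subproblems where babble is cheap.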


When I search for terms that I find on here, like "finite factored set" or "babble and prune" and many others, I can't find them anywhere else except here or on other EA platforms. It always makes me wonder: "Are we meming?" It seems like the meme culture runs deep here. I think that is also what makes it hard for new users to get accustomed to. They have to read up on so much prerequisite material in order to even participate in a conversation.


**Summary**

- The goal of goal factoring: in tradeoff situations, get all of the good with none of the bad
- Parable of the orange: two people want the last orange. However, one only wanted the *peel*, the other *the flesh* — both *can* get what they want
- Case study: preoccupied professor
  - Grading has lots of costs.
  - Grading produces a *bag of benefits*
  - The professor thought about how to *reach all benefits* without paying *any of the costs*
  - In the end, he found a system in which *students could grade themselves*
- Goal Factoring algorithm: **Choose an actio**

Thank you! Then I indeed misunderstood Alex Turner's claim, and I basically seem to agree with my new understanding.

Okay, that's fair. I agree, if we could show that the experiments remain stable even when longer strings of reasoning are required, then the experiments seem more convincing. There might be the added benefit that one can then vary the setting in more ways to demonstrate that the reasoning caused the agent to act in a particular way, instead of the actions just being some kind of coincidence.

(Fwiw, I don’t remember problems with stipend payout at SERI MATS in the winter program. I was a winter scholar 2022/23.)