All of Dalcy Bremin's Comments + Replies

'Symmetry' implies 'redundant coordinate' implies 'cyclic coordinates in your Lagrangian / Hamiltonian' implies 'conservation of conjugate momentum'
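
For concreteness, here is the last link of that chain written out. This is the standard textbook derivation, not something specific to this comment, and the notation ($L$, $q_i$, $p_i$) is my own choice:

```latex
% Euler-Lagrange equation for each generalized coordinate q_i
\frac{d}{dt}\frac{\partial L}{\partial \dot q_i} - \frac{\partial L}{\partial q_i} = 0

% If q_i is cyclic (L does not depend on q_i), the second term vanishes,
% so the conjugate momentum p_i is constant along the true trajectory:
\frac{\partial L}{\partial q_i} = 0
\quad\Longrightarrow\quad
\frac{d}{dt}\, p_i = 0,
\qquad p_i := \frac{\partial L}{\partial \dot q_i}
```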

And because the action principle (where the true system trajectory extremizes your action, i.e. integral of Lagrangian) works in various dynamical systems, the above argument works in non-physical dynamical systems.

Thus conserved quantities usually exist in a given dynamical system.

mmm, but why does the action principle hold in such a wide variety of systems though? (like how you get entropy by postulating something to be maximized in an equilibrium setting)

Bella is meeting a psychotherapist, but they treat her fear as something irrational. This doesn't help, and only makes Bella more anxious. She feels like even her therapist doesn't understand her.

How would one find a therapist in their local area who's aware of what's going on in the EA/rat circles such that they wouldn't find statements about, say, x-risks as being schizophrenic/paranoid?

2 · Roman Leventov · 1mo
I think the recent public statements, media coverage, public discussions, government activity, YouGov polls, etc. have moved the worry about AI x-risk sufficiently into the Overton window.  A psychotherapist or a psychiatrist who would suspect paranoia or schizophrenia mainly/primarily upon the expression of such worries today is just a very bad professional.

I am very interested in this, especially in the context of alignment research and solving not-yet-understood problems in general. Since I have no strong commitments this month (and was going to do something similar to this anyways), I will try this every day for the next two weeks and report back on how it goes (writing this comment as a commitment mechanism!)

Have a large group of people attempt to practice problems from each domain, randomizing the order that they each tackle the problems in. (The ideal version of this takes a few months)


As part of eac

...
Yeah. I have another post brewing that a) sort of apologizes for the excessive number of feedback loops going on here, and b) explains in detail why they are necessary and how they fit together. But here is a rough draft of it.

The most straightforward loops you have, before you get into Cognitive Tuning, are:

  1. Object level: "did I succeed at this task?"
  2. Have I gotten better at this task as I've practiced it?
  3. As I try out other domains I'm unpracticed on, do I seem to be able to apply skills I learned from previous ones and at least subjectively feel like I'm transfer learning?
  4. Hypothetical expensive science experiment: if I do the exhaustive experiment described in this blogpost, do I verifiably get better at some kind of transfer learning?

For connecting it to your real life, there's an additional set of loops like:

  1. When I reflect on what my actual problems or skill-limitations are at my day job, what sort of exercises do I think would help? (these can be "more exercise-like" or "more like just adding a reflection step to my existing day-job")
  2. When I do those exercises, does it seem like they improve my situation with my day-job or main project?
  3. Does that transfer seem to stick / remain relevant over time?

Re: "tuning your cognitive algorithms", these sort of slot inside the object level #1 exercise in each of the previous lists. Within an exercise (or real world task), you can notice "do I seem to be stuck? Does it feel like my current train of thought is useful? Do I have a creeping sense that I'm going down an unproductive rabbit hole and rationalizing it as progress?"

But there is a danger to over-relying on these internal, subjective feedback loops. So there's an additional upper level loop of, after getting an exercise right (or wrong), asking "which of my metacognitive intuitions actually turned out to be right?", and becoming calibrated on how trustworthy those are. (And hopefully ma

I had something like locality in mind when writing this shortform, the context being: [I'm in my room -> I notice itch -> I realize there's a mosquito somewhere in my room -> I deliberately pursue and kill the mosquito that I wouldn't have known existed without the itch]

But, again, this probably wouldn't amount to much selection pressure, partially due to the fact that the vast majority of mosquito population exists in places where such locality doesn't hold i.e. in an open environment.

Makes sense. I think we're using the terms differently in scope. By "DL paradigm" I meant to encompass the kind of stuff you mentioned (RL-directing-SS-target (active learning), online learning, different architecture, etc) because they really seemed like "engineering challenges" to me (despite them covering a broad space of algorithms) in the sense that capabilities researchers already seem to be working on & scaling them without facing any apparent blockers to further progress, i.e. in need of any "fundamental breakthroughs"—by which I was pointing more at paradigm shifts away from DL like, idk, symbolic learning.

But the evolutionary timescale at which mosquitos can adapt to avoid detection must be faster than that of humans adapting to find mosquitos itchy! Or so I thought - my current boring guess is that (1) mechanisms for the human body to detect foreign particles are fairly "broad", (2) the adaptations required for mosquitos to evade them are not-way-too-simple, and (3) we just haven't applied enough selection pressure to make such a change happen.

Yeah that would be my thinking as well. 

To me, the fact that the human brain basically implements SSL+RL is very very strong evidence that the current DL paradigm (with a bit of "engineering" effort, but nothing like fundamental breakthroughs) will kinda just keep scaling until we reach point-of-no-return. Does this broadly look correct to people here? Would really appreciate other perspectives.

I mostly think “algorithms that involve both SSL and RL” is a much broader space of possible algorithms than you seem to think it is, and thus that there are parts of this broad space that require “fundamental breakthroughs” to access. For example, both AlexNet and differentiable rendering can be used to analyze images via supervised learning with gradient descent. But those two algorithms are very very different from each other! So there’s more to an algorithm than its update rule.

See also 2nd section of this comment, although I was emphasizing alignment-...

I have a slightly different takeaway.  Yes techniques similar to current techniques will most likely lead to AGI but it's not literally 'just scaling LLMs'. The actual architecture of the brain is meaningfully different from what's being deployed right now. So different in one sense. On the other hand it's not like the brain does something completely different and proposals that are much closer to the brain architecture are in the literature (I won't name them here...). It's plausible that some variant on that will lead to true AGI. Pure hardware scal...

What are the errors in this essay? As I'm reading through the Brain-like AGI sequence I keep seeing this post being referenced (but this post says I should instead read the sequence!)

I would really like to have a single reference post of yours that contains the core ideas about phasic dopamine rather than the reference being the sequence posts (which are heavily dependent on a bunch of previous posts; also Posts 5 and 6 feel more high-level than this one?)

6 · Steven Byrnes · 2mo
I think that if you read the later Intro to Brain-Like AGI Safety series, then the only reason you might want to read this post (other than historical interest) is that the section “Dopamine category #2: RPE for “local” sub-circuit rewards” is talking about a topic that was omitted from Intro to Brain-Like AGI Safety (for brevity). For example, practically everything I said about neuroanatomy in this post is at least partly wrong and sometimes very wrong. (E.g. the “toy loop model” diagrams are pretty bad.) The “Finally, the “prediction” part of reward prediction error” section has a very strange proposal for how RPE works; I don’t even remember why I ever believed that. The main strengths of the post are the “normative” discussions: why might supervised learning be useful? why might more than one reward signal be useful? etc. I mostly stand by those. I also stand by “learning from scratch” being a very useful concept, and elaborated on it much more later.

Answering my own question, review / survey articles like seem like a pretty good intro.

Mildly surprised how some verbs/connectives barely play any role in conversations, even in technical ones. I just tried directed babbling with someone, and (I think?) I learned quite a lot about Israel-Pakistan relations with almost no stress coming from eg needing to make my sentences grammatically correct.

Example of (a small part of) my attempt to summarize my understanding of how Jews migrated in/out of Jerusalem over the course of history:

They here *hand gesture on air*, enslaved out, they back, kicked out, and boom, they everywhere.

(audience nods, giv

...

Could you explain more what you mean by this?

My (completely amateur) understanding is that the "extra" semantic and syntactic structure of written and spoken language does two things. 

One, it adds redundancy and reduces error. Simple example, gendered pronouns mean that when you hear "Have you seen Laurence? She didn't get much sleep last night." you have a chance to ask the speaker for clarification and catch if they had actually said "Laura" and you misheard.

Two, it can be used as a signal. The correct use of jargon is used by listeners or readers as a proxy for competence. Or many typos in your text will indicate to readers that you haven't put much effort into what you're saying.

Related - "There are always many ways through the garden of forking paths, and something needs only one path to happen."

Also, davidad's Open Agency Architecture is a very concrete example of what such a non-antisocial pivotal act that respects the preferences of various human representatives would look like (i.e. a pivotal process).

Perhaps not realistically feasible in its current form, yes, but davidad's proposal suggests that there might exist such a process, and we just have to keep searching for it.

Agree that current AI paradigm can be used to make significant progress in alignment research if used correctly. I'm thinking something like Cyborgism; leaving most of the "agency" to humans and leveraging prosaic models to boost researcher productivity which, being highly specialized in scope, wouldn't involve dangerous consequentialist cognition in the trained systems.

However, the problem is that this isn't what OpenAI is doing - iiuc, they're planning to build a full-on automated researcher that does alignment research end-to-end, for which orthonormal ...

I think OpenAI is probably agnostic about how to use AIs to get more alignment research done. That said, speeding up human researchers by large multipliers will eventually be required for the plan to be feasible. Like 10-100x rather than 1.5-4x. My guess is that you'll probably need AIs running considerably autonomously for long stretches to achieve this.

Complaint with Pugh's real analysis textbook: He doesn't even define the limit of a function properly?!

It's implicitly defined together with the definition of continuity where $|x - x_0| < \delta$, but in Chapter 3 when defining differentiability he implicitly switches the condition to $0 < |x - x_0| < \delta$ without even mentioning it (nor the requirement that $x_0$ now needs to be an accumulation point!) While Pugh has its own benefits, coming from Terry Tao's analysis textbook backgrou...

Maybe you should email Pugh with the feedback? (I audited his honors analysis course in fall 2017; he seemed nice.)

As far as the frontier of analysis textbooks goes, I really like how Schröder Mathematical Analysis manages to be both rigorous and friendly: the early chapters patiently explain standard proof techniques (like the add-and-subtract triangle inequality gambit) to the novice who hasn't seen them before, but the punishing details of the subject are in no way simplified. (One wonders if the subtitle "A Concise Introduction" was intended ironically...

Any advice on reducing neck and shoulder pain while studying? For me that's my biggest blocker to being able to focus longer (especially for math, where I have to look down at my notes/book for a long period of time). I'm considering stuff like getting a standing desk or doing regular back/shoulder exercises. Would like to hear what everyone else's setups are.

I've used Pain Science in the past as a resource and highly, highly endorse it. Here is an article they have on neck pain.

weight training?

Train skill of noticing tension and focus on it. Tends to dissolve. No that's not so satisfying but it works. Standing desk can help but it's just not that comfortable for most.

I still have lots of neck and shoulder tension, but the only thing I've found that can reliably lessen it is doing some hard work on a punching bag for about 20 minutes every day, especially hard straights and jabs with full extension.

Update: huh, nonstandard analysis is really cool. Not only are things much more intuitive (by using infinitesimals from the hyperreals instead of the epsilon-delta formulation for everything), but by the transfer principle all first-order statements are equivalent between standard and nonstandard analysis!

Man, deviation arguments are so cool:

  • what are macrostates? Variables which are required to make your thermodynamics theory work! If they don't, add more macrostates!
  • nonequilibrium? Define it as systems that don't admit a thermodynamic description!
  • inductive biases? Define it as the amount of correction needed for a system to obey Bayesian updating, i.e. correction terms in the exponent of the Gibbs measure!
  • coarse graining? Define the coarse-grained variables to keep the dynamics as close as possible to that of the micro-dynamics!
  • or in a similar spirit - doe
...

I used to try out near-random search on ideaspace, where I made a quick app that spat out 3~5 random words from a dictionary of interesting words/concepts that I curated, and I spent 5 minutes every day thinking very hard on whether anything interesting came out of those combinations.
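
A minimal sketch of what such an app could look like. The word list and the function name `random_prompt` are placeholders I made up; the original app and its curated dictionary are not shown anywhere here.

```python
import random

# Hypothetical stand-in for the curated dictionary of interesting words/concepts.
CONCEPTS = [
    "treadmill", "rock climbing", "entropy", "origami",
    "thermostat", "mirror", "auction", "magnet",
]


def random_prompt(concepts, rng=random):
    """Return 3 to 5 distinct concepts chosen uniformly at random."""
    k = rng.randint(3, 5)           # inclusive on both ends
    return rng.sample(concepts, k)  # sampling without replacement


if __name__ == "__main__":
    # One prompt per day: stare at it for five minutes and see what falls out.
    print(", ".join(random_prompt(CONCEPTS)))
```

Passing a seeded `random.Random` instance as `rng` makes a given day's draw reproducible, which is handy if you want to revisit a combination later.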

Of course I knew random search on exponential space was futile, but I got a couple cool invention ideas (most of which turned out to already exist), like:

  • infinite indoor rockclimbing: attach rocks to a vertical treadmill, and now you have an infinite indoor rock climbing wall
...

algebraic geometry in the infinite dimensions (algebraic geometric ... functional analysis?!) surely sounds like a challenge, damn.

gwern's take on a similar paper (Tinystories), in case anyone was wondering. Notable part for me:


Now, what would be really interesting is if they could go beyond the in-domain tasks and show something like meta-learning. That's supposed to be driven by the distribution and variety of Internet-scale datasets, and thus should not be elicited by densely sampling a domain like this.

So, it seems that scaling up isn't the only thing that matters, and data quality can be more important than data quantity or parameter count. (You hear that, gwern?)

Apparently someone didn't actually read my scaling hypothesis essay (specifically, the parts about why pretraining works and the varieties of blessings of scale). I have been pointing out for a long time that NNs are overparameterized and almost all training data is useless (which is a big part of why RL will be important, because RL lets you make the right data, or see meta-learning or data...

I think it would also be interesting if you could factor the models into:

  • smaller models that represent reliably known knowledge well, such as this textbook model, and
  • models that sample far and wide but wouldn't need to reproduce all the details in the more optimized models.

I wonder if the following is possible to study textbooks more efficiently using LLMs:

  • Feed the entire textbook to the LLM and produce a list of summaries that increases in granularity and length, covering all the material in the textbook just at a different depth (eg proofs omitted, further elaboration on high-level perspectives, etc)
  • The student starts from the highest-level summary, and gradually moves to the more granular materials.

When I study textbooks, I spend a significant amount of time improving my mental autocompletion, like being able to familiari...

What's a good technical introduction to Decision Theory and Game Theory for alignment researchers? I'm guessing standard undergrad textbooks don't include, say, content about logical decision theory. I've mostly been reading posts on LW but as with most stuff here they feel more like self-contained blog posts (rather than textbooks that build on top of a common context) so I was wondering if there was anything like a canonical resource providing a unified technical / math-y perspective on the whole subject.

The MIRI Research Guide recommends An Introduction to Decision Theory and Game Theory: An Introduction. I have read neither and am simply relaying the recommendation.

There's still some pressure, though. If the bites were permanently not itchy, then I might not have noticed that the mosquitos were in my room in the first place, and consequently would be less likely to pursue them directly. I guess that's just not enough.

There’s also positive selection for itchiness.  Mosquito spit contains dozens of carefully evolved proteins.  We don’t know what they all are, but some of them are anticoagulants and anesthetics.  Presumably they wouldn’t be there if they didn’t have a purpose.  And your body, when it detects these foreign proteins, mounts a protective reaction, causing redness, swelling, and itching.  IIRC, that reaction does a good job of killing any viruses that came in with the mosquito saliva.  We’ve evolved to have that reaction.  T...

Why haven't mosquitos evolved to be less itchy? Is there just not enough selection pressure posed by humans yet? (yes probably) Or are they evolving towards that direction? (they of course already evolved towards being less itchy while biting, but not enough to make that lack-of-itch permanent)

this is a request for help i've been trying and failing to catch this one for god knows how long plz halp

tbh would be somewhat content coexisting with them (at the level of houseflies) as long as they evolved the itch and high-pitch noise away, modulo disease risk considerations.

I believe mosquitos do inject something to suppress your reaction to them, which is why you don't notice bug bites until long after the bug is gone. There's no reproductive advantage to the mosquito to extending that indefinitely. 

The reason you find them itchy is because humans are selected to find them itchy most likely?

The reason mosquito bites itch is because they are injecting saliva into your skin. Saliva contains mosquito antigens, foreign particles that your body has evolved to attack with an inflammatory immune response that causes itching. The compound histamine is a key signaling molecule used by your body to drive this reaction.

In order for the mosquito to avoid provoking this reaction, they would either have to avoid leaving compounds inside of your body, or mutate those compounds so that they do not provoke an immune response. The human immune system is an adv...

Because they have no reproductive advantage to being less itchy.  You can kill them while they’re feeding, which is why they put lots of evolutionary effort into not being noticed.  (They have an anesthetic in their saliva so you are unlikely to notice the bite.)  By the time you develop the itchy bump, they’ve flown away and you can’t kill them.

Having lived ~19 years, I can distinctly remember around 5~6 times when I explicitly noticed myself experiencing totally new qualia with my inner monologue going “oh wow! I didn't know this dimension of qualia was a thing.” examples:

  • hard-to-explain sense that my mind is expanding horizontally with fractal cube-like structures (think bismuth) forming around it and my subjective experience gliding along its surface which lasted for ~5 minutes after taking zolpidem for the first time to sleep (2 days ago)
  • getting drunk for the first time (half a year ago)
  • feeli
...

Sunlight scattered by the atmosphere on cloudless mornings during the hour before sunrise inspires a subtle feeling ("this is cool, maybe even exciting") that I never noticed till I started intentionally exposing myself to it for health reasons (specifically, making it easier to fall asleep 18 hours later).

More precisely, I might or might not have noticed the feeling, but if I did notice it, I quickly forgot about it because I had no idea how to reproduce it.

I have to get away from artificial light (streetlamps) (and from direct (yellow) sunlight) for the ... (read more)

I observed new visual qualia of colors while using some light machine.

Also, when I first came to Italy, I had a feeling as if the whole rainbow of color qualia had changed.

i absolutely hate bureaucracy, dumb forms, stupid websites etc. like, I almost had a literal breakdown trying to install Minecraft recently (and eventually failed). God.

I think what's so crushing about it is that it reminds me that the wrong people are designing things, and that they won't allow them to be fixed, and I can only find solace in thinking that the inefficiency of their designs is also a sign that they can be defeated.

This shortform just reminded me to buy a CO2 sensor and, holy shit, turns out my room is at ~1500ppm.

While it's too soon to say for sure, this may actually be the underlying reason for a bunch of problems I noticed myself having primarily in my room (insomnia, inability to focus or read, high irritability, etc).

Although I always suspected bad air quality, it really is something to actually see the number with your own eyes, wow. Thank you so, so much for posting about this!!

I am so glad it helped. :)))

It is maddening otherwise; focus is my most valuable good, and the reasons for it failing can be so varied and hard to pinpoint. The very air you breathe undetectably fucking with you is just awful.

I also have the insomnia and irritability issue, it is insane. I've had instances where me and my girlfriend are snapping at each other like annoyed cats, repeatedly apologising and yet then snapping again over absolutely nothing (who ate more of the protein bars, why there is a cat toy on the floor, total nonsense), both of us upset...

One of the rare insightful lessons from high school: Don't set your AC to the minimum temperature even if it's really hot, just set it to where you want it to be.

It's not like the air released gets colder with a lower target temperature, because most ACs (according to my teacher, I haven't checked lol) are just a simple control system that turns itself on/off around the target temperature, meaning the time it takes to reach a certain temperature X is independent of the target temperature (as long as it's lower than X).

... which is embarrassingly obvious in hindsight.
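
A toy simulation makes the point concrete. This is only a sketch assuming a pure on/off unit; every number in it (cooling rate, heat leak, temperatures) is made up for illustration, and `minutes_to_reach` is a name I invented.

```python
def minutes_to_reach(threshold, target, start=32.0, outside=35.0,
                     cooling_per_min=0.5, leak_per_min=0.01,
                     max_minutes=10_000):
    """Simulate an on/off AC and return the number of minutes until the room
    first drops to `threshold` (degrees C), or None if it never does.

    Assumption: the AC either runs at full power or is off; it cannot
    cool "harder" when the target is lower.
    """
    temp = start
    for minute in range(max_minutes):
        if temp <= threshold:
            return minute
        if temp > target:                         # controller: on above target
            temp -= cooling_per_min               # fixed cooling rate while on
        temp += leak_per_min * (outside - temp)   # heat leaking in from outside
    return None


if __name__ == "__main__":
    # Same room, same goal (24 C), different thermostat settings below 24 C.
    for target in (16.0, 20.0, 23.0):
        print(target, minutes_to_reach(threshold=24.0, target=target))
```

Because the unit is running flat out the whole time the room is above 24 C, every target below 24 C produces the same trajectory down to that point. The lower setting only matters afterwards, when it keeps cooling past where you wanted to stop.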

Well, he is right about some ACs being simple on/off units.

But there also exist units that can change cycle speed. It's basically the same thing, except the motor driving the compression cycle can vary in speed.

In case you were wondering, they are called inverters. And when buying new today, you really should get an inverter (efficiency).

God, I wish real analysis was at least half as elegant as any other math subject: way too many pathological examples that I can't care less about. I've heard some good things about constructivism though, hopefully analysis is done better there.

As a general reflection on undergraduate mathematics, imho there is way too much emphasis on real analysis. Yes, knowing how to be rigorous is important, being aware of pathological counterexamples is important, and real analysis is used all over the place. But there is so much more to learn in mathematics than real analysis, and the focus on minor technical issues here is often a distraction from developing a broad & deep mathematical background.

For most mathematicians (and scientists using serious math) real analysis is only a small part of the...

Yeah, real analysis sucks.  But you have to go through it to get to delightful stuff; I particularly love harmonic and functional analysis.  Real analysis is just a bunch of pathological cases and technical persnicketiness that you need to have to keep you from steering over a cliff when you get to the more advanced stuff.  I’ve encountered some other subjects that have the same feeling to them.  For example, measure-theoretic probability is a dry technical subject that you need to get through before you get the fun of stochastic differ...

I think the point of having an explicit human-legible world model / simulation is to make desiderata formally verifiable, which I don't think would be possible with a blackbox system (like LLM w/ wrappers).

Also important to note:

The phenomenon you call by names like "goals" or "agency" is one possible shadow of the deep structure of optimization - roughly, preimaging outcomes onto choices by reversing a complicated transformation.

 - @esyudkowsky

i.e. if we were to pin-down something we actually care about, that'd be "a system exhibiting consequentialism", because those are the kind of systems that will end up shaping our lightcone and more. Consequentialism is convergent in an optimization process, i.e. the "deep structure of optimization". Terms like "g...

re: reducing magic and putting bounds, I'm reminded of Cleo Nardo's Hodge Podge Alignment proposal.

moments of microscopic fun encountered while studying/researching:

  • Quantum mechanics call vector space & its dual bra/ket because ... bra-c-ket. What can I say? I like it - But where did the letter 'c' go, Dirac?
  • Defining Cauchy sequences and limits in real analysis: it's really cool how you "bootstrap" the definition of Cauchy sequences / limits on the reals using the definition of Cauchy sequences / limits on the rationals. Basically:
    • (1) define Cauchy sequence on rationals
    • (2) use it to define limit (on rationals) using rational-Cauchy
    • (3) use it to define reals
    • (4)
...
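
The first three steps of that bootstrap can be written out as follows. This is one standard way to do it (the Cauchy-sequence construction of the reals); the notation is mine, and the truncated step (4) is not reconstructed here.

```latex
% (1) Cauchy sequence of rationals: every quantity below lives in \mathbb{Q}
(a_n) \subset \mathbb{Q} \text{ is Cauchy}
\iff
\forall \varepsilon \in \mathbb{Q}_{>0}\ \exists N\ \forall m, n \ge N:\ |a_n - a_m| < \varepsilon

% (2) Limit of a rational sequence, still using only rational \varepsilon
a_n \to a \in \mathbb{Q}
\iff
\forall \varepsilon \in \mathbb{Q}_{>0}\ \exists N\ \forall n \ge N:\ |a_n - a| < \varepsilon

% (3) Reals as rational Cauchy sequences modulo "difference tends to 0"
(a_n) \sim (b_n) \iff a_n - b_n \to 0,
\qquad
\mathbb{R} := \{\text{Cauchy sequences in } \mathbb{Q}\} / \sim
```

The point of restricting $\varepsilon$ to the rationals in (1) and (2) is that nothing on the right-hand side presupposes the reals, so (3) is not circular.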

That means the problem is inherently unsolvable by iteration. "See what goes wrong and fix it" auto-fails if The Client cannot tell that anything is wrong.

Not at all meant to be a general solution to this problem, but I think that a specific case where we could turn this into something iterable is by using historic examples of scientific breakthroughs - consider past breakthroughs to a problem where the solution (in hindsight) is overdetermined, train the AI on data filtered by date, and The Client evaluates the AI solely based on how close the AI approach...

Therefore, the longer you interact with the LLM, eventually the LLM will have collapsed into a waluigi. All the LLM needs is a single line of dialogue to trigger the collapse.

Hm, what if we do the opposite? i.e. Prompt chatbob starting as a pro-croissant simulacrum, and then proceed to collapse the superposition into the anti-croissant simulacrum using a single line of dialogue; behold, we have created a stable Luigi!

I can see how this is more difficult for desirable traits rather than their opposite because fiction usually has the structure of an antagoni...

7 · Cleo Nardo · 7mo
I think this fails — a wawaluigi is not a luigi. See this comment for an explanation: TLDR: if I said "hey this is Bob, he pretends to be harmful and toxic!", what would you expect from Bob? Probably a bunch of terrible things. That definitely isn't a solution to the alignment problem.

The actual theorem is specific to classical mechanics, but a similar principle seems to hold generally.

Interesting, would you mind elaborating on this further?

Just noticing that the negation of a statement exists is enough to make meaningful updates.

e.g. I used to (implicitly) think "Chatbot Romance is weird" without having evaluated anything in-depth about the subject (and consequently didn't have any strong opinions about it)—probably as a result of some underlying cached belief. 

But after seeing this post, just reading the title was enough to make me go (1) "Oh! I just realized it is perfectly possible to argue in favor of Chatbot Romance ... my belief on this subject must be a cached belief!" (2) hence ...

(Note: This was a post, but in retrospect it would probably have been better posted as a shortform)

(Epistemic Status: 20 minutes' worth of thinking, haven't done any builder/breaker on this yet although I plan to, and would welcome any attempts in the comments)

  1. Have an algorithmic task whose input/output pairs could (in reasonable algorithmic complexity) be generated using a highly specific combination of modular components (e.g., basic arithmetic, combination of random NN module outputs, etc).
  2. Train a small transformer (or anything, really) on the input/output pairs.
  3. Take
...

There were various notions/frames of optimization floating around, and I tried my best to distill them:

  • Eliezer's Measuring Optimization Power on unlikelihood of outcome + agent preference ordering
  • Alex Flint's The ground of optimization on robustness of system-as-a-whole evolution
  • Selection vs Control as distinguishing different types of "space of possibilities"
    • Selection as having that space explicitly given & selectable numerous times by the agent
    • Control as having that space only given in terms of counterfactuals, and the agent can access it only once.
    • T
...
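
To make the first frame in that list concrete, here is a rough sketch of the "unlikelihood of outcome + preference ordering" idea as I understand it: count how small a fraction of the possible outcomes ranks at least as high as the one actually achieved, and express that in bits. The function name, the finite outcome set, and the uniform weighting over outcomes are all simplifying assumptions of mine, not something taken from the posts themselves.

```python
import math


def optimization_power_bits(achieved, outcomes, utility):
    """Bits of optimization: -log2 of the fraction of possible outcomes that
    the preference ordering ranks at least as high as the achieved outcome.

    Assumptions: `outcomes` is a finite list containing `achieved`, and every
    outcome is weighted equally (a uniform baseline measure).
    """
    bar = utility(achieved)
    at_least_as_good = sum(1 for o in outcomes if utility(o) >= bar)
    return -math.log2(at_least_as_good / len(outcomes))


if __name__ == "__main__":
    outcomes = list(range(1024))   # toy outcome space
    prefer_larger = lambda o: o    # toy preference ordering: bigger is better
    for achieved in (0, 512, 1023):
        print(achieved, optimization_power_bits(achieved, outcomes, prefer_larger))
```

Hitting the single best outcome out of 1024 scores the maximum here, while landing anywhere in the top half scores only one bit. That matches the intuition that optimization power measures how narrow a target was hit.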

One thing I imagine might be useful even in small training regimes would be to train on tasks where the only possible solution necessarily involves a search procedure, i.e. "search-y tasks"

For example, it's plausible that simple heuristics aren't sufficient to get you to superhuman level on tasks like Chess or Go, so superhuman RL performance on these tasks would be fairly good evidence that the model already has an internal search process.

But one problem with Chess or Go would be that the objective is fixed, i.e. the game rules. So perhaps one way to ...

Oh yeah I agree - I was thinking more along the lines of that small models would end up with heuristics even for some tasks that require search to do really well, because they may have slightly complex heuristics learnable by models of that size that allow okay performance relative to the low-power search they would otherwise be capable of. I agree that this could make a quantitative difference though and hadn’t thought explicitly of structuring the task along this frame, so thanks!

Update: I'm trying to upskill in mechanistic interpretability, and training a Gradient Hacker Enzyme seems like a fairly good project just to get myself started.

I don't think this project would be highly valuable in and of itself (although I would definitely learn a lot!), so one failure mode I need to avoid is ending up investing too much of my time in this idea. I'll probably spend a total of ~1 week working on it.

Especially because we’re working with toy models that ostensibly fit the description of an optimizer, we may end up with a model that mechanistically doesn’t have an explicit notion of objective.

I think this is very likely to be the default for most toy models one trains RL on. In my model of agent value formation (which looks very much like this post), explicit representation of objectives is useful inasmuch as the model already has some sort of internal "optimizer" or search process. And before that, simple "heuristics" (or shards) should suffice, especially in small training regimes.

Yeah, this is definitely something I consider plausible. But I don't have a strong stance because RL mechanics could lead to there being an internal search process for toy models (unless this is just my lack of awareness of some work that proves otherwise). That said, I definitely think that work on slightly larger models would be pretty useful and plausibly alleviates this, and is one of the things I'm planning on working on.

Just wanted to comment that this is an absolutely amazing resource and has saved me a ton of time trying to get into this field & better understand several of the core papers. Thank you so much for writing this!

5Neel Nanda9mo
Thanks :) I'm happy to hear that people are actually using it!

Quick thoughts on my plans:

  1. I want to focus on having a better mechanistic picture of agent value formation & distinguishing between hypotheses (e.g., shard theory, Thane Ruthenis's value-compilation hypothesis, etc) and forming my own.
  2. I think I have a specific but very high uncertainty baseline model of what-to-expect from agent value-formation using greedy search optimization. It's probably time to allocate more resources to reducing that uncertainty by touching reality, i.e., running experiments.
    1. (and also think about related theoretical arguments like
...

Different GPS instances aren't exactly "subagents", they're more like planning processes tasked to solve a given problem.

You're right that GPS-instances (nice term btw) aren't necessarily subagents. I missed that your GPS formalization does argmin over a WM variable for a specific t, not all t, which means it doesn't have to care about controlling variables at all times.
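
To make the specific-t vs. all-t distinction concrete, here's a minimal toy sketch. Everything in it is my own assumption for illustration (the integer state, the `ACTIONS`, `HORIZON` and `TARGET` constants, and the helper names); it is not the GPS formalization from the post, just a brute-force argmin over action sequences under two different objectives.

```python
from itertools import product

# Hypothetical toy world model: integer state x, with x_{t+1} = x_t + a_t.
ACTIONS = (-1, 0, 1)
HORIZON = 3
TARGET = 2

def rollout(plan, x0=0):
    """Return the states x_1..x_T reached by applying each action in turn."""
    states, x = [], x0
    for a in plan:
        x += a
        states.append(x)
    return states

def cost_at(plan, t):
    """Distance from TARGET at the single time step t (1-indexed)."""
    return abs(rollout(plan)[t - 1] - TARGET)

def cost_all(plan):
    """Distance from TARGET summed over every time step."""
    return sum(abs(x - TARGET) for x in rollout(plan))

# Enumerate every action sequence of length HORIZON.
PLANS = list(product(ACTIONS, repeat=HORIZON))

def plan_single_t(t):
    """Argmin over plans of the cost at one specific time step."""
    return min(PLANS, key=lambda p: cost_at(p, t))

def plan_all_t():
    """Argmin over plans of the cost accumulated over all time steps."""
    return min(PLANS, key=cost_all)

if __name__ == "__main__":
    print("specific t=3:", plan_single_t(3), rollout(plan_single_t(3)))
    print("all t:       ", plan_all_t(), rollout(plan_all_t()))
```

The specific-t planner is indifferent between every plan that hits the target at t=3, however it got there, while the all-t planner is pushed to reach the target as early as possible and then hold it there. That second behavior is the "controlling the variable at all times" part that a specific-t argmin doesn't need.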

With that said ... 

  • (tangent: I'm still unsure as to whether that is the right formalization for GPS—but I don't have a better alternative yet)
  • ... there seems to be a selection effect where
...
2Thane Ruthenis9mo
You're not taking into account larger selection effects on agents, which select against agents that purge all those "myopic" GPS-instances. The advantage of shards and other quick-and-dirty heuristics is that they're fast: they're what you're using in a fight, or when making quick logical leaps, etc. Agents which purge all of them, and keep only slow deliberative reasoning, don't live long.

Or, rather, agents which are dominated by strong deliberative reasoning tend not to do that to begin with, because they recognize the value of said quick heuristics.

In other words: not all shards/subagents are completely selfish and sociopathic, some/most want select others around. So even those that don't "defend themselves" can be protected by others, or not even be targeted to begin with. Examples:

  • A "chips-are-tasty" shard is probably not so advanced as to have reflective capabilities, and e. g. a more powerful "health" shard might want it removed. But if you have some even more powerful preferences for "getting to enjoy things", or a dislike of erasing your preferences for practical reasons, the health-shard's attempts might be suppressed.
  • A shard which implements a bunch of highly effective heuristics for escaping certain death is probably not one that any other shard/GPS instance would want removed.

Wait, so PreDCA solves inner-misalignment by just ... assuming that "we will later have an ideal learning theory with provable guarantees"?

By the claim "PreDCA solves inner-misalignment" as implied by the original protocol / distillation posts, I thought it somehow overcame the core problem of demons-from-imperfect-search. But it seems like the protocol already starts with an assumption of "demons-from-imperfect-search won't be a problem because of amazing theory" and instead tackles a special instantiation of inner-misalignment that happens because of the...

2Martín Soto9mo
I share this intuition that the solution as stated is underwhelming. But from my perspective that's just because that key central piece is missing, and this wasn't adequately communicated in the available public resources about PreDCA (even if it was stressed that it's a work in progress). I guess this situation doesn't look as worrisome to Vanessa simply because she has a clearer picture of that central piece, or good motives to believe it will be achievable, which she hasn't yet made public. Of course, while this is the case we should treat optimism with suspicion. Also, let me note that my a priori understanding of the situation is not but more like

Okay, more questions incoming: "Why would GPS be okay with value-compilation, when its expected outcome is to not satisfy in-distribution context behaviors through big-brain moves?"

If I understood correctly (can be skipped; not relevant to my argument, which starts after the bullet points):

  • Early in training, GPS is part of the heuristic/shard implementation (accessed via API calls)
  • Middle in training, there is some SGD-incentive towards pointing GPS in the direction of "reverse-engineering heuristic/shard firing patterns, representing them (Gs) as WM variab
...
4Thane Ruthenis9mo
For reference, I think you've formed a pretty accurate model of my model.

Yup. But this requires these GPS instances to be advanced enough to do gradient-hacking, and indeed be concerned with preventing their current values from being updated away. Two reasons not to expect that:

  • Different GPS instances aren't exactly "subagents", they're more like planning processes tasked to solve a given problem.
    • Consider an impulse to escape from certain death. It's an "instinctive" GPS instance; the GPS has been prompted with "escape certain death", and that prompt is not downstream of abstract moral philosophy. It's an instinctive reaction.
    • But this GPS instance doesn't care about preventing future GPS instances from updating away the conditions that caused it to be initiated (i. e., the agent's tendency to escape certain death). It's just tasked with deriving a plan for the current situation.
    • It wouldn't bother looking up what abstract-moral-philosophy is plotting and maybe try to counter-plot. It wouldn't care about that.
    • (And even if it does care, it'd be far from its main priority, so it wouldn't do effective counter-plotting.)
  • The hard-coded pointer to value compilation might plainly be chiseled-in before the agent is advanced enough to do gradient-hacking. In that case, even if a given GPS instance would care to plot against abstract values, it just wouldn't know how (or know that it needs to).

That said, you're absolutely right that it does happen in real-life agents. Some humans are suspicious of abstract arguments for the greater good and refuse to e. g. go from deontologists to utilitarians. The strength of the drive for value compilation relative to shards' strength is varying, and depending on it, the process of value compilation may be frozen at some arbitrary point. It partly falls under the meta-cognition section. But in even more extreme cases, a person may simply refuse to engage

(Quality: Low, only read when you have nothing better to do—also not much citing)

30-minute high-LLM-temp stream-of-consciousness on "How do we make mechanistic interpretability work for non-transformers, or just any architectures?"

  • We want a general way to reverse engineer circuits
    • e.g., Should be able to rediscover properties we discovered from transformers
  • Concrete Example: we spent a bunch of effort reverse engineering transformer-type architectures, then boom, suddenly some parallel-GPU-friendly-LSTM architecture turns out to have better scaling properties
...

My argument is that they wouldn't actually be a good cross-context approximation of U; in part because of gradient starvation.

Ah bad phrasing—where you quoted me (arguments against part) I meant to say:

  • Heuristic-driven-GPS is a very good approximation of U only within in-distribution contexts
  • ... and this is happening at a phase where SGD is still the dominant force
  • ... and Heuristic-driven-GPS is doing all this stuff without being explicitly aimed towards Bs and Ds, but rather the GPS is just part of the "implicit" M -> A procedures/modules
  • ... therefore "
...
2Thane Ruthenis9mo
Oops, I think the confusion is about what counts as "in-distribution", probably because I myself used it inconsistently just now. In my other comment, I'd referred to training on a single society as "in-distribution", but in the previous comment in this thread, "displace the human to a different society" was supposed to be part of the training.

Suppose that, as above, we're trying to train a utilitarian. Imagine that instead of a single environment, we have a set of environments, e. g. a set of societies with different norms. Every society represents a different distribution, such that if we train an AI on a single society's norms, every other society would be OOD for it.

If we train on a single society, then gradient starvation would set in as you're describing: the AI would adopt a bunch of shallow heuristics and have no incentive to develop the value-compilation setup. But imagine if we're training on different societies, often throwing in societies that are OOD relative to the previous training data. It'd need to learn some setup for re-aligning its heuristics towards U even in completely unfamiliar circumstances, which I hypothesize to be value compilation as described here. Thus, gradient starvation would never actually set in at the level of shallow heuristics. (Instead, it'd set in at the level of value compilation: once that process consistently spits out a good proxy of U, the SGD would have no incentive to further align it; and that U-proxy may be quite far from utilitarianism.)