All of Steven Byrnes's Comments + Replies

Solving the whole AGI control problem, version 0.0001

Thanks for sharing!

These are definitely reasonable things to think about.

For my part, I get kinda stuck right at your step #1. Like, say you give the AGI access to youtube and tell it to build a predictive model (i.e. do self-supervised learning). It runs for a while and winds up with a model of everything in the videos—people doing things, balls bouncing, trucks digging, etc. etc. Then you need to point to a piece of this model and say "This is human behavior" or "This is humans intentionally doing things". How do we do that? How do we find the right piec... (read more)

Reward Is Not Enough

Thanks!

My current working theory of human social interactions does not involve multiple reward signals. Instead it's a bunch of rules like "If you're in state X, and you empathetically simulate someone in state Y, then send reward R and switch to state Z". See my post "Little glimpses of empathy" as the foundation of social emotions. These rules would be implemented in the hypothalamus and/or brainstem.

(Plus some involvement from brainstem sensory-processing circuits that can run hardcoded classifiers that return information about things like whether a per... (read more)
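Here's a toy sketch of that rule format, with entirely made-up states and reward numbers, just to pin down the shape of the claim:

```python
# Toy sketch of hardcoded social-emotion rules of the form:
# "If you're in state X, and you empathetically simulate someone in state Y,
#  then send reward R and switch to state Z."
# All state names and numbers below are invented for illustration.

RULES = {
    # (my_current_state, simulated_other_person_state): (reward, my_next_state)
    ("neutral",   "in_distress"):   (-1.0, "concerned"),
    ("neutral",   "smiling_at_me"): (+1.0, "friendly"),
    ("concerned", "relieved"):      (+0.5, "neutral"),
}

def social_reaction(my_state, simulated_other_state):
    """Return (reward, next_state); default is no reaction."""
    return RULES.get((my_state, simulated_other_state), (0.0, my_state))

print(social_reaction("neutral", "in_distress"))  # (-1.0, 'concerned')
```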

Reward Is Not Enough

On your equivalence to an AI with an interpretability/oversight module. Data shouldn't be flowing back from the oversight into the AI. 

Sure. I wrote "similar to (or even isomorphic to)". We get to design it how we want. We can allow the planning submodule direct easy access to the workings of the choosing-words submodule if we want, or we can put strong barriers such that the planning submodule needs to engage in a complicated hacking project in order to learn what the choosing-words submodule is doing. I agree that the latter is probably a better set... (read more)

Solving the whole AGI control problem, version 0.0001

Ben Goertzel comments on this post via twitter:

1) Nice post ... IMO the "Human-Like Social Instincts" direction has best odds of success; the notion of making AGIs focused on compassion and unconditional love (understanding these are complex messy human concept-plexes) appears to fall into this category as u loosely define it

2) Of course to make compassionate/loving AGI actually work, one needs a reasonable amount of corrigibility in one's AGI cognitive architecture, many aspects of which seem independent of whether compassion/love or something quite different is the top-level motivation/inspiration

Are bread crusts healthier?

I dunno, the healthiness of a food is not identical to the sum of the healthiness of its ingredients if you separate those ingredients out in a centrifuge. I think the "palatability" of food is partly related to how physically easy it is to break down (with both teeth and digestive tract). Hardness-to-break-down is potentially related to how many calories your digestive tract uses in digesting it, how quickly it delivers its nutrients, and how much you actually wind up eating.

(Very much not an expert. I think "a food is more than the sum of its ingredients" is discussed in a Michael Pollan book.)

Matthew Barnett's Shortform

Sometimes I send a draft to a couple people before posting it publicly.

Sometimes I sit on an idea for a while, then find an excuse to post it in a comment or bring it up in a conversation, get some feedback that way, and then post it properly.

I have several old posts I stopped endorsing, but I didn't delete them; I put either an update comment at the top or a bunch of update comments throughout saying what I think now. (Last week I spent almost a whole day just putting corrections and retractions into my catalog of old posts.) I for one would have a very p... (read more)

Reward Is Not Enough

how does it avoid wireheading

Um, unreliably, at least by default. Like, some humans are hedonists, others aren't.

I think there's a "hardcoded" credit assignment algorithm. When there's a reward prediction error, that algorithm primarily increments the reward-prediction / value associated with whatever stuff in the world model became newly active maybe half a second earlier. And maybe to a lesser extent, it also increments the reward-prediction / value associated with anything else you were thinking about at the time. (I'm not sure of the gory details here.... (read more)
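As a rough sketch of the kind of credit-assignment rule I mean (the timing window and learning rates below are invented placeholders, not claims about the actual numbers):

```python
# Sketch: on a reward prediction error (RPE), increment the learned value of
# world-model items, weighting recently-activated items most heavily.
# The "half a second ago" window and both learning rates are illustrative assumptions.

def update_values(values, recently_active, also_in_mind, rpe,
                  lr_recent=0.1, lr_background=0.02):
    for item in recently_active:        # became newly active ~half a second earlier
        values[item] = values.get(item, 0.0) + lr_recent * rpe
    for item in also_in_mind:           # other things being thought about at the time
        values[item] = values.get(item, 0.0) + lr_background * rpe
    return values

values = update_values({}, recently_active={"salt_taste"},
                       also_in_mind={"cafeteria"}, rpe=+2.0)
print(values)  # {'salt_taste': 0.2, 'cafeteria': 0.04}
```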

Reward Is Not Enough

Thanks!

I had totally forgotten about your subagents post.

this post doesn't cleanly distinguish between reward-maximization and utility-maximization

I've been thinking that they kinda blend together in model-based RL, or at least the kind of (brain-like) model-based RL AGI that I normally think about. See this comment and surrounding discussion. Basically, one way to do model-based RL is to have the agent create a predictive model of the reward and then judge plans based on their tendency to maximize "the reward as currently understood by my predictive model... (read more)
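A bare-bones toy version of that setup, where everything (states, rewards, exploration rate) is an invented stand-in, just to show plans being scored by the learned reward model rather than by the true reward signal:

```python
import random

# Sketch: model-based RL where plans are judged by "the reward as currently
# understood by my predictive model" rather than by the ground-truth reward.

true_reward = {"A": 0.0, "B": 1.0, "C": 5.0}     # ground truth, unknown to the planner
reward_model = {"A": 0.0, "B": 0.0, "C": 0.0}    # agent's current belief about reward

def predicted_return(plan):
    # a "plan" is just the list of states the world-model says we'd visit
    return sum(reward_model[s] for s in plan)

plans = [["A", "B"], ["B", "C"], ["A", "C"]]

for _ in range(50):
    # epsilon-greedy so the toy agent sees every state at least occasionally
    plan = random.choice(plans) if random.random() < 0.3 else max(plans, key=predicted_return)
    for s in plan:                               # execute, observe reward, update the model
        reward_model[s] += 0.5 * (true_reward[s] - reward_model[s])

print(max(plans, key=predicted_return))          # ['B', 'C'] with high probability, once the model catches up
```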

johnswentworth: Good explanation, conceptually. Not sure how all the details play out - in particular, my big question for any RL setup is "how does it avoid wireheading?". In this case, presumably there would have to be some kind of constraint on the reward-prediction model, so that it ends up associating the reward with the state of the environment rather than the state of the sensors.
Reward Is Not Enough

I'm all for doing lots of testing in simulated environments, but the real world is a whole lot bigger and more open and different than any simulation. Goals / motivations developed in a simulated environment might or might not transfer to the real world in the way you, the designer, were expecting.

So, maybe, but for now I would call that "an intriguing research direction" rather than "a solution".

M. Y. Zuo: That is true; the desired characteristics may not develop as one would hope in the real world. Though that is the case for all training, not just AGI. Humans, animals, even plants do not always develop along optimal lines even with the best ‘training’ when exposed to the real environment. Perhaps the solution you are seeking, one without the risk of error, does not exist.
Reward Is Not Enough

Right, the word "feasibly" is referring to the bullet point that starts "Maybe “Reward is connected to the abstract concept of ‘I want to be able to sing well’?”". Here's a little toy example we can run with: teaching an AGI "don't kill all humans". So there are three approaches to reward design that I can think of, and none of them seem to offer a feasible way to do this (at least, not with currently-known techniques):

  1. The agent learns by experiencing the reward. This doesn't work for "don't kill all humans" because when the reward happens it's too late.
  2. Th
... (read more)
M. Y. Zuo: Could the hypothetical AGI be developed in a simulated environment and trained with proportionally lower consequences?
Looking Deeper at Deconfusion

Is there any good AI alignment research that you don't classify as deconfusion? If so, can you give some examples?

Sure.

... (read more)
Comment on the lab leak hypothesis

I'm not remotely qualified to comment on this, but fwiw in the Mojiang Mine Theory (which says it was a lab leak, but did not involve GOF), six miners caught the virus from bats (and/or each other), and then the virus spent four months replicating within the body of one of these poor guys as he lay sick in a hospital (and then of course samples were sent to WIV and put in storage).

This would explain (2) because four months in this guy's body (especially lungs) allows tons of opportunity for the virus to evolve and mutate and recombine in order to adapt to ... (read more)

landfish: It seems like an interesting hypothesis, but I don't think it's particularly likely. I've never heard of other viruses becoming well adapted to humans within a single host. Though I do think that's the explanation for how several variants evolved (since some of them emerged with a bunch of functional mutations rather than just one or two). I'd be interested to see more research into the evolution of viruses within human hosts, and what degree of change is possible & how this relates to spillover events.
Inner Alignment in Salt-Starved Rats

Thanks! This is very interesting!

there is at least one steak neuron in my own hippocampus, and it can be stimulated by hearing the word, and persistent firing of it will cause episodic memories...to rise up

Oh yeah, I definitely agree that this is an important dynamic. I think there are two cases. In the case of episodic memory I think you're kinda searching for one of a discrete (albeit large) set of items, based on some aspect of the item. So this is a pure autoassociative memory mechanism. The other case is when you're forming a brand new thought. I thin... (read more)

The Credit Assignment Problem

a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile

I stand as a counterexample: I personally want my epistemic layer to have accurate beliefs—y'know, having read the sequences… :-P

I think of my epistemic system like I think of my pocket calculator: a tool I use to better achieve my goals. The tool doesn't need to share my goals.

The way I think about it is:

  • Early in training, the AGI is too stupid to formulate and execute a plan to hack into its epistemic level.
  • Late in training, we can hopefully get to the place
... (read more)
Inner Alignment in Salt-Starved Rats

Thanks!! I largely agree with what you wrote.

I was focusing on the implementation of a particular aspect of that. Specifically, when you're doing what you call "thing modeling", the "things" you wind up with are entries in a complicated learned world-model—e.g. "thing #6564457" is a certain horrifically complicated statistical regularity in multimodal sensory data, something like: "thing #6564457" is a prediction that thing #289347 is present, and thing #89672, and thing #68972, but probably not thing #903672", or whatever.

Meanwhile I agree with you that t... (read more)
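Here's a cartoon of the kind of "thing" entry I'm describing, purely illustrative (the IDs and weights mean nothing):

```python
# Cartoon of a learned world-model entry: a "thing" is just a statistical
# regularity over other things. All IDs and weights are invented.

world_model = {
    6564457: {"predicts_present": {289347: 0.9, 89672: 0.8, 68972: 0.7},
              "predicts_absent":  {903672: 0.8}},
}

def how_surprising(thing_id, observed_things):
    """Crude score: how badly the observed scene violates this thing's predictions."""
    entry = world_model[thing_id]
    surprise = sum(w for t, w in entry["predicts_present"].items() if t not in observed_things)
    surprise += sum(w for t, w in entry["predicts_absent"].items() if t in observed_things)
    return surprise

print(how_surprising(6564457, observed_things={289347, 89672}))  # 0.7 (thing 68972 predicted but absent)
```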

JenniferRM: Response appreciated! Yeah. I think I have two hunches here that cause me to speak differently. One of these hunches is that hunger sensors are likely to be very very "low level", and motivationally "primary". The other is maybe an expectation that almost literally "every possible thought" is being considered simultaneously in the brain by default, but most do not rise to awareness, or action-production, or verbalizability?

Like I think that hunger sensors firing will cause increased firing in "something or other" that sort of "represents food" (plus giving the food the halo of temporary desirability) and I expect this firing rate to basically go up over time... more hungriness... more food awareness? Like if you ask me "Jennifer, when was the last time you ate steak?" then I am aware of a wave of candidate answers, and many fall away, and the ones that are left I can imagine defending, and then I might say "Yesterday I bought some at the store, but I think maybe the last time I ate one (like with a fork and a steakknife and everything) it was about 5-9 days ago at Texas Roadhouse... that was certainly a vivid event because it was my first time back there since covid started" and then just now I became uncertain, and I tried to imagine other events, like what about smaller pieces of steak, and then I remembered some carne asada 3 days ago at a BBQ.

What I think is happening here is that (like the Halle Berry neuron found in the hippocampus of the brain surgery patient) there is at least one steak neuron in my own hippocampus, and it can be stimulated by hearing the word, and persistent firing of it will cause episodic memories (nearly always associated with places [https://www.frontiersin.org/articles/10.3389/fnhum.2020.574224/full]) to rise up. Making the activations of cortex-level sensory details and models conform to "the ways that the entire brain can or would be different if the remembered episode was being generated from sensory stimulation (or in this ca…
Book review: "Feeling Great" by David Burns

I do like the "How To Talk" book and definitely use those techniques on my kids ("Oh, you're very upset, you're sad that we ran out of red peppers..." --me 20 minutes ago) though I haven't successfully started the habit of using it on adults. (Last time I tried I was accused of being condescending, guess I haven't quite gotten it down yet.) "Nonviolent Communication" and other sources hit that theme too.

…But I don't think that's quite it. That would be "positive reframing" without "magic dial". It's not just about acknowledging that the negative thought ex... (read more)

qbolec: I also have difficulties in applying these techniques on adults, of the "Me mad? No shit Sherlock!" kind. I'm not fluent with it yet, but what I've observed is that the more sincere I am, and the more my tone matches the tone of the other person, the better the results. I think this explains a big chunk of the "don't use that tone of voice on me!" responses I've got in my life, which I used to find strange [as I personally pay much more attention to the content of the text/speech, not the tone/style/form], but recently I've realized that this can be quite a rational response from someone who reads the cues from both content AND form, and seeing a mismatch, decides which of the two is easier to forge, and which one is the "real" message [perhaps based on their experience, in which controlling emotions is more difficult].

Also, I agree that the "paraphrase the emotions" part only maps to the "positive reframing" part. In my eyes the analogy extended beyond this single step into the pattern of using this discharge step as a necessary step before using some other rationally obvious thing, which you really think should work on its own in theory (like the "Classic CBT"-ish self-talk), but in practice you need to prepare the ground for it. Indeed there seems to be no analog of the "Magic dial" in the "How to talk..." approach. There are some fragments of the book though which teach how to extract the goals/needs/fears of the child and then help them construct a solution which achieves those goals/needs, but this is more like a part of the analog of the "classic CBT-ish self talk" step, I think. (In particular I don't recall the book saying things like "do the same stuff just less intensively", so yeah, this part is new and interesting.)

For example today I told my son: "So you get mad each time we come to pick you up from your friend right in the moment when you've finally figured out some cool way to play with each other, and this is mega-frustrating, I know. Sure, one way to handle thi…
Big picture of phasic dopamine

I'm proposing that (1) the hypothalamus has an input slot for "flinch now", (2) VTA has an output signal for "should have flinched", (3) there is a bundle of partially-redundant side-by-side loops (see the "probability distribution" comment) that connect specifically to both (1) and (2), by a genetically-hardcoded mechanism.

I take your comment to be saying: Wouldn't it be hard for the brain to orchestrate such a specific pair of connections across a considerable distance?

Well, I'm very much not an expert on how the brain wires itself up. But I think there'... (read more)

The reverse Goodhart problem

Let me try to repair Goodhart's law to avoid these problems:

By statistics, we should very generally expect two random variables to be uncorrelated unless there's a "good reason" to expect them to be correlated. Goodhart's law says that if U and V are correlated in some distribution, then (1) if a powerful optimizer tries to maximize U, then it will by default go far out of the distribution, (2) the mere fact that U and V were correlated in the distribution does not in itself constitute a "good reason" to expect them to be correlated far out of the distribu... (read more)
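Here's a quick toy simulation of the weaker, "regressional" version of that point, with arbitrary numbers: U and V are correlated over the whole distribution, but selecting the extreme tail of U yields much less V than the correlation would naively suggest:

```python
import numpy as np

# Toy Goodhart demo (illustrative numbers): V is the true objective, U is a
# proxy correlated with V across ordinary samples. Hard selection on U mostly
# selects the noise term, so the top-U points score much worse under V than
# naive extrapolation from the correlation would suggest.

rng = np.random.default_rng(0)
V = rng.normal(0, 1, size=1_000_000)
U = V + rng.normal(0, 1, size=V.size)    # proxy = value + independent error

print(np.corrcoef(U, V)[0, 1])           # roughly 0.71 over the whole distribution

top = np.argsort(U)[-100:]               # "optimize" hard on the proxy
print(U[top].mean(), V[top].mean())      # U around 5, but V only about half of that
```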

Stuart_Armstrong: Cheers, these are useful classifications.
Big picture of phasic dopamine

Right, so I'm saying that the "supervised learning loops" get highly specific feedback, e.g. "if you get whacked in the head, then you should have flinched a second or two ago", "if a salty taste is in your mouth, then you should have salivated a second or two ago", "if you just started being scared, then you should have been scared a second or two ago", etc. etc. That's the part that I'm saying trains the amygdala and agranular prefrontal cortex.

Then I'm suggesting that the Success-In-Life thing is a 1D reward signal to guide search in a high-dimensional ... (read more)

Charlie Steiner: How does the section of the amygdala that a particular dopamine neuron connects to even get trained to do the right thing in the first place? It seems like there should be enough chance in connections that there's really only this one neuron linking a brainstem's particular output to this specific spot in the amygdala - it doesn't have a whole bundle of different signals available to send to this exact spot. SL in the brain seems tricky because not only does the brainstem have to reinforce behaviors in appropriate contexts, it might have to train certain outputs to correspond to certain behaviors in the first place, all with only one wire to each location! Maybe you could do this with a single signal that means both "imitate the current behavior" and also "learn to do your behavior in this context"? Alternatively we might imagine some separate mechanism for priming the developing amygdala to start out with a diverse yet sensible array of behavior proposals, and the brainstem could learn what its outputs correspond to and then signal them appropriately.
The reverse Goodhart problem

Sorry, why are V and V' equally hard to define? Like if V is "human flourishing" and U is GDP then V' is "twice GDP minus human flourishing" which is more complicated than V. I guess you're gonna say "Why not say that V is twice GDP minus human flourishing?"? But my point is: for any particular set U,V, V', you can't claim that V and V' are equally simple, and you can't claim that V and V' are equally correlated with U. Right?
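Spelling out the algebra in that example (with U = GDP and V = human flourishing):

```latex
% V' was defined as "twice GDP minus human flourishing":
V' = 2U - V \quad\Longleftrightarrow\quad V = 2U - V'
% i.e. given U, each of V and V' is one subtraction away from the other.
```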

Stuart_Armstrong: Almost equally hard to define. You just need to define U, which, by assumption, is easy.
Big picture of phasic dopamine

That's interesting, thanks!

good/bad/neutral is a thing, but it seems to be defined largely with respect to our expectation of what was going to happen in the situation we were in.

I agree that this is a very important dynamic. But I also feel like, if someone says to me, "I keep a kitten in my basement and torture him every second of every day, but it's no big deal, he must have gotten used to it by now", I mean, I don't think that reasoning is correct, even if I can't quite prove it or put my finger on what's wrong. I guess that's what I was trying to get ... (read more)

Big picture of phasic dopamine

Thanks!

If you Ctrl-F the post you'll find my little paragraph on how my take differs from Marblestone, Wayne, Kording 2016.

I haven't found "meta-RL" to be a helpful way to frame either the bandit thing or the follow-up paper relating it to the brain, more-or-less for reasons here, i.e. that the normal RL / POMDP expectation is that actions have to depend on previous observations—like think of playing an Atari game—and I guess we can call that "learning", but then we have to say that a large fraction of every RL paper ever is actually a meta-RL paper, and m... (read more)

Michaël Trazzi: Right, I just googled Marblestone, so you're approaching it from the dopamine side and not the acetylcholine side. Without debating about words, their neuroscience paper is still at least trying to model the phasic dopamine signal as some RPE & the prefrontal network as an LSTM (IIRC), which is not acetylcholine-based. I haven't read this post & the one linked in detail; I'll comment again when I do, thanks!
Big picture of phasic dopamine

The least-complicated case (I think) is: I (tentatively) think that the hippocampus is more-or-less a lookup table with a finite number of discrete thoughts / memories / locations / whatever (the type of content is different in different species), and a "proposal" is just "which of the discrete things should be activated right now". 

A medium-difficulty case is: I think motor cortex stores a bunch of sequences of motor commands which execute different common action sequences. (I'm a believer in the Graziano theory that primary motor cortex, secondary m... (read more)

Against intelligence

I would agree with "superintelligence is not literally omnipotence" but I think I think you're making overly strong claims in the opposite direction. My reasons are basically contained in Intelligence Explosion Microeconomics, That Alien Message, and Scott Alexander's Superintelligence FAQ. For example...

power seems to be very unrelated to intelligence

I think "very" is much too strong, and insofar as this is true in the human world, that wouldn't necessarily make it true for an out-of-distribution superintelligence, and I think it very much wouldn't be. Fo... (read more)

George: You're thinking "one superintelligence against modern spam detection"... or really against spam detection from 20 years ago. It's no longer possible to mass-call everyone in the world because, well, everyone is doing it. Same with 0-day exploits: they exist, but most companies have e.g. IP-based rate limiting on various endpoints that makes it prohibitively expensive to exploit things like e.g. Spectre. And again, that's with current tech; by the time a superintelligence exists you'd have equally matched spam detection. That's my whole point: intelligence works, but only in zero-sum games against intelligence, and those games aren't entirely fair, thus safeguarding the status quo. <Also, I'd honestly suggest that you at least read AI alarmists with some knowledge in the field (there are plenty to find, since it generates funding), but reading someone that "understood AI" 10 years ago and doesn't own a company valued at a few hundred millions is like reading someone that "gets how trading works" but works at Walmart and lives with his mom>
Dangerous optimisation includes variance minimisation

I agree! I'm 95% sure this is in Superintelligence somewhere, but nice to have a more-easily-linkable version.

We need a standard set of community advice for how to financially prepare for AGI

If you think of it less like "possibly having a lot of money post-AGI" and more like "possibly owning a share of whatever the AGIs produce post-AGI", then I can imagine scenarios where that's very good and important. It wouldn't matter in the worst scenarios or best scenarios, but it might matter in some in-between scenarios, I guess. Hard to say though ...

Daniel Kokotajlo: This is a good point, but even taking it into account I think my overall claim still stands. The scenarios where it's very important to own a larger share of the AGI-produced pie [ETA: via the mechanism of pre-existing stock ownership] are pretty unlikely IMO compared to e.g. scenarios where we all die or where all humans are given equal consideration regardless of how much stock they own, and then (separate point) also our money will probably have been better spent prior to AGI trying to improve the probability of AI going well than waiting till after AI to do stuff with the spoils.
We need a standard set of community advice for how to financially prepare for AGI

I think Vicarious AI is doing more AGI-relevant work than anyone. I pore over all their papers. They're private so this doesn't directly answer your question. But what bugs me is: Their investors include Good Ventures & Elon Musk ... So how do they get away with (AFAICT) doing no safety work whatsoever ...?

GeneSmith: I know from some interviews I've watched that Musk's main reason for investing in AI startups is to have inside info about their progress so he can monitor what's going on. Perhaps he's just not really paying that much attention? He always has like 15 balls in the air, so perhaps he just doesn't realize how bad Vicarious's safety work is. Come to think of it, if you or anyone you know have contact with Musk, this might be worth mentioning to him. He clearly cares about AI going well and has been willing to invest resources in increasing these odds in the past via OpenAI and then Neuralink. So perhaps he just doesn't know that Vicarious AI is being reckless when it comes to safety.
Neel Nanda: Interesting, can you say more about this / point me to any good resources on their work? I never hear about Vicarious in AI discussions.
My AGI Threat Model: Misaligned Model-Based RL Agent

it's all a big mess

Yup! This was a state-the-problem-not-solve-it post. (The companion solving-the-problem post is this brain dump, I guess.) In particular, just like prosaic AGI alignment, my starting point is not "Building this kind of AGI is a great idea", but rather "This is a way to build AGI that could really actually work capabilities-wise (especially insofar as I'm correct that the human brain works along these lines), and that people are actively working on (in both ML and neuroscience), and we should assume there's some chance they'll succeed whe... (read more)

An Intuitive Guide to Garrabrant Induction

Sorry if this is a stupid question but wouldn't "LI with no complexity bound on the traders" be trivial? Like, there's a noncomputable trader (brute force proof search + halting oracle) that can just look at any statement and immediately declare whether it's provably false, provably true, or neither. So wouldn't the prices collapse to their asymptotic value after a single step and then nothing else ever happens?

Vanessa Kosoy: First, "no complexity bounds on the trader" doesn't mean we allow uncomputable traders; we just don't limit their time or other resources (exactly like in Solomonoff induction). Second, even having a trader that knows everything doesn't mean all the prices collapse in a single step. It does mean that the prices will converge to knowing everything with time. GI guarantees no budget-limited trader will make an infinite profit; it doesn't guarantee that no trader will make a profit at all (indeed, guaranteeing the latter is impossible).
The Alignment Forum should have more transparent membership standards

The integration with LessWrong means that anyone can still comment

Speaking of this, if I go to AF without being logged in, there's a box at the bottom that says "New comment. Write here. Select text for formatting options... SUBMIT" But non-members can't write comments right? Seems kinda misleading... Well I guess I just don't know: What happens if a non-member (either LW-but-not-AF member or neither-AF-nor-LW member) writes a comment in the box and presses submit? (I guess I could do the experiment myself but I don't want to create a test comment that som... (read more)

habryka: Oops, yeah, this just seems like a straightforward bug. When you press "Submit" it asks you to log in, and when you then log in with a non-member account, the box just disappears and there is no obvious way to get your written content back. That seems like a terrible experience. I will fix that; I think we introduced this behavior when we made some changes on LW to how unsubmitted comments are saved.

On the general AIAF meta question: an option I've been considering for a while is to have a submission queue for AIAF non-members on the site, where they can submit posts to the AIAF directly, without going through LessWrong. The big concern here is that someone would have to review all of them, and also that I would want most of them to be rejected since they aren't a good fit for the forum, and this seems more likely to make people unhappy than asking people to post to LW first. I think the current setup is the better choice here, since I am worried the submission queue would cause a bunch of people to spend a lot of time writing posts and then get told they won't be accepted to the forum and that they wasted a lot of time, which is a much worse experience than being told very early that they should just post to LW and then ask for it to be promoted (which I think sets better expectations). But I would be curious if people have different takes.
The Homunculus Problem

bottom-up attention (ie attention due to interesting stimulus) can be more or less captured by surprise

Hmm. That's not something I would have said.

I guess I think of two ways that sensory inputs can impact top-level processing.

First, I think sensory inputs impact top-level processing when top-level processing tries to make a prediction that is (directly or indirectly) falsified by the sensory input, and that prediction gets rejected, and top-level processing is forced to think a different thought instead.

  • If top-level processing is "paying close attention t
... (read more)
The Homunculus Problem

How would you query low-level details from a high-level node? Don't the hierarchically high-up nodes represent things which range over longer distances in space/time, eliding low-level details like lines?

My explanation would be: it's not a strict hierarchy, there are plenty of connections from the top to the bottom (or at least near-bottom). "Feedforward and feedback projections between regions typically connect to multiple levels of the hierarchy" "It has been estimated that 40% of all possible region-to-region connections actually exist which is much lar... (read more)

My AGI Threat Model: Misaligned Model-Based RL Agent

Hi again, I finally got around to reading those links, thanks!

I think what you're saying (and you can correct me) is: observation-utility agents are safer (or at least less dangerous) than reward-maximizers-learning-the-reward, because the former avoids falling prey to what you called "the easy problem of wireheading".

So then the context was:

First you said, If we do rollouts to decide what to do, then the value function is pointless, assuming we have access to the reward function.

Then I replied, We don't have access to the reward function, because we can't... (read more)

abramdemski: All sounds perfectly reasonable. I just hope you recognize that it's all a big mess (because it's difficult to see how to provide evidence in a way which will, at least eventually, rule out the wireheading hypothesis or any other problematic interpretations). As I imagine you're aware, I think we need stuff from my 'learning normativity' agenda to dodge these bullets. In particular, I would hesitate to commit to the idea that rewards are the only type of feedback we submit.

FWIW, I'm now thinking of your "value function" as expected utility in Jeffrey-Bolker terms [https://www.lesswrong.com/posts/A8iGaZ3uHNNGgJeaD/an-orthodox-case-against-utility-functions]. We need not assume a utility function to speak of expected utility. This perspective is nice in that it's a generalization of what RL people mean by "value function" anyway: the value function is exactly the expected utility of the event "I wind up in this specific situation" (at least, it is if value iteration has converged). The Jeffrey-Bolker view just opens up the possibility of explicitly representing the value of more events.

So let's see if we can pop up the conversational stack. I guess the larger topic at hand was: how do we define whether a value function is "aligned" (in an inner sense, so, when compared to an outer objective which is being used for training it)? Well, I think it boils down to whether the current value function makes "reliably good predictions" about the values of events. Not just good predictions on average, but predictions which are never catastrophically bad (or at least, catastrophically bad with very low probability, in some appropriate sense). If we think of the true value function as V*(x), and our approximation as V(x), we want something like: under some distance metric, if there is a modification of V*(x) with catastrophic downsides, V(x) is closer to V*(x) than that modification. (OK that's a bit lame, but hopefully you get the general direction I'm trying to point in.)
Power dynamics as a blind spot or blurry spot in our collective world-modeling, especially around AI

FWIW my "one guy's opinion" is (1) I'm expecting people to build goal-seeking AGIs, and I think by default their goals will be opaque and unstable and full of unpredictable distortions compared to whatever was intended, and solving this problem is necessary for a good future (details), (2) Figuring out how AGIs will be deployed and what they'll be used for in a complicated competitive human world is also a problem that needs to be solved to get a good future. I don't think either of these problems is close to being solved, or that they're likely to be solv... (read more)

Which animals can suffer?

You left out the category of possible answers "Such-and-such type of computational process corresponds to suffering", and then octopuses and ML algorithms might or might not qualify depending on how exactly the octopus brain works, and what the exact ML algorithm is, and what exactly is that "such-and-such" criterion I mentioned. I definitely put far more weight on this category of answers than the two you suggested.

Just Learning: I like the idea. Basically, you suggest taking the functional approach and advancing it. What do you think this type of process could be?
Electric heat pumps (Mini-Splits) vs Natural gas boilers

Hmm, well I was trying to ballpark "weighted average outdoor temperature", specifically weighted by how much heat I'm using. Like, if outdoor temperature is only slightly cooler than what I want inside, I need relatively little heat regardless, so the efficiency of that heat isn't all that important. My reference temperature of 30°F (~0°C) is very far from the lowest temperature we experience, it's close to a 24-hour-average temperature during the coldest three months.

I didn't know about HSPF, thanks for the tip! It seems to assume "climate region IV" (bas... (read more)

Gerald Monroe: Sure. For a new build in your climate zone, probably the most efficient setup is a tanked condensing natural gas water heater, ideally sorta centrally located. Then a hydronics air handler and vents that just cover the immediate area around the installation. This gives you the cost advantage of natural gas for most of the heating but you avoid the equipment cost of a second furnace. Tankless condensing is an option but in your biome there probably isn't a sufficient advantage. Then mini splits around the periphery for heating/cooling during most days.
Electric heat pumps (Mini-Splits) vs Natural gas boilers

Thanks for your comment!

That's an interesting electricity price chart. It seems like I'm paying typical rates for my state, and I don't know why it's high compared to other parts of my country. I wouldn't say that's "a flaw in my calculations", since I'm calculating it for myself and I'm not planning to move, but it definitely sheds light on why mini-splits are more attractive for people in other places.

The "COP" in my chart is specifically "COP for heating the interior of a building when it's 30°F (~0°C) outside". I don't think it's true that COP under th... (read more)

Gerald Monroe: There are other metrics, such as HSPF, meant to factor in aggregate performance, since by choosing a fixed temperature you neglect all the days where the mini-split has a huge efficiency advantage over combustion. Also you overlook the zoning: larger houses that have extra rooms that are not always in use benefit from not heating those areas. And the solar: at your high local electric rates solar has a rapid payoff.
The Homunculus Problem

Thanks for the thought-provoking post! Let me try...

We have a visual system, and it (like everything in the neocortex) comes with an interface for "querying" it. Like, Dileep George gives the example "I'm hammering a nail into a wall. Is the nail horizontal or vertical?" You answer that question by constructing a visual model and then querying it. Or more simply, if I ask you a question about what you're looking at, you attend to something in the visual field and give an answer.

Dileep writes: "An advantage of generative PGMs is that we can train the model ... (read more)

abramdemski: Why is a query represented as an overconfident false belief? How would you query low-level details from a high-level node? Don't the hierarchically high-up nodes represent things which range over longer distances in space/time, eliding low-level details like lines?
abramdemski: I don't significantly disagree, but I feel uneasy about a few points.

Theories of the sort I take you to be gesturing at often emphasize this nice aspect of their theory, that bottom-up attention (ie attention due to interesting stimulus) can be more or less captured by surprise, IE, local facts about the shifts in probabilities. I agree that this seems to be a very good correlate of attention. However, the surprise itself wouldn't seem to be the attention. Surprise points merit extra computation. In terms of belief prop, it's useful to prioritize the messages which are creating the biggest belief shifts. The brain is parallel, so you might think all messages get propagated regardless, but of course, the brain also likes to conserve resources. So, it makes sense that there'd be a mechanism for prioritizing messages. Yet, message prioritization (I believe) does not account adequately for our experience. There seems to be an additional mechanism which places surprising content into the global workspace (at least, if we want to phrase this in global workspace theory). What if we don't like global workspace theory?

Another idea that I think about here is: the brain's "natural grammar" might be a head grammar. This is the fancy linguistics thing which sort of corresponds to the intuitive concept of "the key word in that sentence". Parsing consists not only of grouping words together hierarchically into trees, but furthermore, whenever words are grouped, promoting one of them to be the "head" of that phrase. In terms of a visual hierarchy, this would mean "some low level details float to the top". This would potentially explain why we can "see low-level detail" even if we think the rest of the brain primarily consumes the upper layers of the visual hierarchy. We can focus on individual leafs, even while seeing the whole tree as a tree, because we re-parse the tree to make that leaf the "head". We see a leaf with a tree attached. Maybe. Without a mechanism like…
Building brain-inspired AGI is infinitely easier than understanding the brain

Thanks! I guess my feeling is that we have a lot of good implementation-level ideas (and keep getting more), and we have a bunch of algorithm ideas, and psychology ideas and introspection and evolution and so on, and we keep piecing all these things together, across all the different levels, into coherent stories, and that's the approach I think will (if continued) lead to AGI.

Like, I am in fact very interested in "methods for fast and approximate Bayesian inference" as being relevant for neuroscience and AGI, but I wasn't really interested in it until I l... (read more)

xuan: Some recent examples, off the top of my head!

  • Jain, Y. R., Callaway, F., Griffiths, T. L., Dayan, P., Krueger, P. M., & Lieder, F. (2021). A computational process-tracing method for measuring people's planning strategies and how they change over time. [https://www.is.mpg.de/publications/jain2021computational]
  • Dasgupta, I., Schulz, E., Tenenbaum, J. B., & Gershman, S. J. (2020). A theory of learning to infer. Psychological Review, 127(3), 412. [http://cpilab.org/pubs/Dasgupta2020Learning.pdf]
  • Harrison, P., Marjieh, R., Adolfi, F., van Rijn, P., Anglada-Tort, M., Tchernichovski, O., ... & Jacoby, N. (2020). Gibbs Sampling with People. Advances in Neural Information Processing Systems, 33. [https://proceedings.neurips.cc/paper/2020/file/7880d7226e872b776d8b9f23975e2a3d-Paper.pdf]

I guess this depends on how much you think we can make progress towards AGI by learning what's innate / hardwired / learned at an early age in humans and building that into AI systems, vs. taking more of a "learn everything" approach! I personally think there may still be a lot of interesting human-like thinking and problem-solving strategies that we haven't figured out how to implement as algorithms yet (e.g. how humans learn to program, and edit + modify programs and libraries to make them better over time), such that adult and child studies would be useful in order to characterize what we might even be aiming for, even if ultimately the solution is to use some kind of generic learning algorithm to reproduce it. I also think there's a fruitful in-between of (1) and (3), which is to ask, "What are the inductive biases that guide human learning?", which I think you can make a lot of headway on without getting to the neural level.
SGD's Bias

That makes sense. Now it's coming back to me: you zoom your microscope into one tiny nm^3 cube of air. In a right-to-left temperature gradient you'll see systematically faster air molecules moving rightward and slower molecules moving leftward, because they're carrying the temperature from their last collision. Whereas in uniform temperature, there's "detailed balance" (just as many molecules going along a path vs going along the time-reversed version of that same path, and with the same speed distribution).

Thinking about the diode-resistor thing more, I s... (read more)

SGD's Bias

I think the "drift from high-noise to low-noise" thing is more subtle than you're making it out to be... Or at least, I remain to be convinced. Like, has anyone else made this claim, or is there experimental evidence? 

In the particle diffusion case, you point out correctly that if there's a gradient in D caused by a temperature gradient, it causes a concentration gradient. But I believe that if there's a gradient in D caused by something other than a temperature gradient, then it doesn't cause a concentration gradient. Like, take a room with a big pil... (read more)

johnswentworth: I'm still wrapping my head around this myself, so this comment is quite useful.

Here's a different way to set up the model, where the phenomenon is more obvious. Rather than Brownian motion in a continuous space, think about a random walk in a discrete space. For simplicity, let's assume it's a 1D random walk (aka birth-death process) with no explicit bias (i.e. when the system leaves state k, it's equally likely to transition to k+1 or k−1). The rate λ_k at which the system leaves state k serves a role analogous to the diffusion coefficient (with the analogy becoming precise in the continuum limit, I believe). Then the steady-state probabilities of state k and state k−1 satisfy

p_k λ_k = p_{k−1} λ_{k−1}

... i.e. the flux from values-k-and-above to values-below-k is equal to the flux in the opposite direction. (Side note: we need some boundary conditions in order for the steady-state probabilities to exist in this model.) So, if λ_k > λ_{k−1}, then p_k < p_{k−1}: the system spends more time in lower-diffusion states (locally). Similarly, if the system's state is initially uniformly distributed, then we see an initial flux from higher-diffusion to lower-diffusion states (again, locally).

Going back to the continuous case: this suggests that your source-vs-destination intuition is on the right track. If we set up the discrete version of the pile-of-rocks model, air molecules won't go into the rock pile any faster than they come out, whereas hot air molecules will move into a cold region faster than cold molecules move out.

I haven't looked at the math for the diode-resistor system, but if the voltage averages to 0, doesn't that mean that it does spend more time on the lower-noise side? Because presumably it's typically further from zero on the higher-noise side. (More generally, I don't think a diffusion gradient means that a system drifts one way on average, just that it drifts one way with greater-than-even probability? Similar to how a bettor maximizing expected value with repeated independent b…
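A quick numerical check of the birth-death picture above (my own illustrative sketch, not part of the original comment; the ring size and rates are arbitrary):

```python
import numpy as np

# Continuous-time random walk on a ring of N states: state k is left at rate
# lam[k], and each jump goes to a neighbor with equal probability. The claim
# being checked: steady-state occupancy is proportional to 1/lam[k], i.e. the
# walker spends more time in low-"diffusion" (low exit rate) states.

rng = np.random.default_rng(0)
N = 20
lam = np.linspace(0.5, 3.0, N)                  # exit rate varies across states

state = 0
time_in_state = np.zeros(N)
for _ in range(200_000):
    dwell = rng.exponential(1.0 / lam[state])   # time until the next jump
    time_in_state[state] += dwell
    state = (state + rng.choice([-1, 1])) % N   # unbiased jump to a neighbor

empirical = time_in_state / time_in_state.sum()
predicted = (1.0 / lam) / (1.0 / lam).sum()     # p_k proportional to 1/lambda_k
print(np.round(empirical, 3))
print(np.round(predicted, 3))
```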
Formal Inner Alignment, Prospectus

Wait, you think your prosaic story doesn't involve blind search over a super-broad space of models??

No, not prosaic, that particular comment was referring to the "brain-like AGI" story in my head...

Like, I tend to emphasize the overlap between my brain-like AGI story and prosaic AI. There is plenty of overlap. Like they both involve "neural nets", and (something like) gradient descent, and RL, etc.

By contrast, I haven't written quite as much about the ways that my (current) brain-like AGI story is non-prosaic. And a big one is that I'm thinking that there ... (read more)

abramdemski: Ah, ok. It sounds like I have been systematically mis-perceiving you in this respect. I would have been much more interested in your posts in the past if you had emphasized this aspect more ;p But perhaps you held back on that to avoid contributing to capabilities research. Yeah, this is a very important question!
Formal Inner Alignment, Prospectus

That's fair. Other possible approaches are "try to ensure that imagining dangerous adversarial intelligences is aversive to the AGI-in-training ASAP, such that this motivation is installed before the AGI is able to do so", or "interpretability that looks for the AGI imagining dangerous adversarial intelligences".

I guess the fact that people don't tend to get hijacked by imagined adversaries gives me some hope that the first one is feasible - like, that maybe there's a big window where one is smart enough to understand that imagining adversarial intelligence... (read more)

Formal Inner Alignment, Prospectus

Hm, I want to classify "defense against adversaries" as a separate category from both "inner alignment" and "outer alignment".

The obvious example is: if an adversarial AGI hacks into my AGI and changes its goals, that's not any kind of alignment problem, it's a defense-against-adversaries problem.

Then I would take that notion and extend it by saying "yes interacting with an adversary presents an attack surface, but also merely imagining an adversary presents an attack surface too". Well, at least in weird hypotheticals. I'm not convinced that this would re... (read more)

abramdemski: This part doesn't necessarily make sense, because prevention could be easier than after-the-fact measures. In particular:

  1. You might be unable to defend against arbitrarily adversarial cognition, so you might want to prevent it early rather than try to detect it later, because you may be vulnerable in between.
  2. You might be able to detect some sorts of misalignment, but not others. In particular, it might be very difficult to detect purposeful deception, since it intelligently evades whatever measures are in place. So your misalignment detection may be dependent on averting mesa-optimizers, or specific sorts of mesa-optimizers.
Formal Inner Alignment, Prospectus

My hunch is that we don't disagree about anything. I think you keep trying to convince me of something that I already agree with, and meanwhile I keep trying to make a point which is so trivially obvious that you're misinterpreting me as saying something more interesting than I am.

Formal Inner Alignment, Prospectus

Like, if we do gradient descent, and the training signal is "get a high score in PacMan", then "mesa-optimize for a high score in PacMan" is incentivized by the training signal, and "mesa-optimize for making paperclips, and therefore try to get a high score in PacMan as an instrumental strategy towards the eventual end of making paperclips" is also incentivized by the training signal.

For example, if at some point in training, the model is OK-but-not-great at figuring out how to execute a deceptive strategy, gradient descent will make it better and better a... (read more)

ofer: My surprise would stem from observing that RL in a trivial environment yielded a system that is capable of calculating/reasoning about π. If you replace the PacMan environment with a complex environment and sufficiently scale up the architecture and training compute, I wouldn't be surprised to learn the system is doing very impressive computations that have nothing to do with the intended objective. Note that the examples in my comment don't rely on deceptive alignment. To "convert" your PacMan RL agent example to the sort of examples I was talking about: suppose that the objective the agent ends up with is "make the relevant memory location in the RAM say that I won the game", or "win the game in all future episodes".
Formal Inner Alignment, Prospectus

I guess at the end of the day I imagine avoiding this particular problem by building AGIs without using "blind search over a super-broad, probably-even-Turing-complete, space of models" as one of its ingredients. I guess I'm just unusual in thinking that this is a feasible, and even probable, way that people will build AGIs... (Of course I just wind up with a different set of unsolved AGI safety problems instead...)

The Evolutionary Story

By and large, we expect trained models to do (1) things that are directly incentivized by the training signal (intentiona... (read more)

abramdemski: Wait, you think your prosaic story doesn't involve blind search over a super-broad space of models?? I think any prosaic story involves blind search over a super-broad space of models, unless/until the prosaic methodology changes, which I don't particularly expect it to. I agree that replacing "blind search" with different tools is a very important direction. But your proposal doesn't do that!

I agree with this general picture. While I'm primarily knocking down bad complexity-based arguments in my post, I would be glad to see someone working on trying to fix them.

There were a lot of misunderstandings in the earlier part of our conversation, so I could well have misinterpreted one of your points. But if so, I'm even more struggling to see why you would have been optimistic that your RL scenario doesn't involve risk due to unintended mesa-optimization. By your own account, the other part would be to argue that they're not simple, which you haven't done. They're not actively disincentivized, because they can use the planning capability to perform well on the task (deceptively). So they can be selected for just as much as other hypotheses, and might be simple enough to be selected in fact.
ofer: We can also get a model that has an objective that is different from the intended formal objective (never mind whether the latter is aligned with us). For example, SGD may create a model with a different objective that is identical to the intended objective just during training (or some part thereof). Why would this be unlikely? The intended objective is not privileged over such other objectives, from the perspective of the training process. Evan gave an example related to this, where the intention was to train a myopic RL agent that goes through blue doors in the current episode, but the result is an agent with a more general objective that cares about blue doors in future episodes as well. (See Evan's own words [https://futureoflife.org/2020/07/01/evan-hubinger-on-inner-alignment-outer-alignment-and-proposals-for-building-safe-advanced-ai/], from the Future of Life podcast.)

Similar concerns are relevant for (self-)supervised models, in the limit of capability. If a network can model our world very well, the objective that SGD yields may correspond to caring about the actual physical RAM of the computer on which the inference runs (specifically, the memory location that stores the loss of the inference). Also, if any part of the network, at any point during training, corresponds to dangerous logic that cares about our world, the outcome can be catastrophic (and the probability of this seems to increase with the scale of the network and training compute). Also, a malign prior problem may manifest in (self-)supervised learning settings [https://www.lesswrong.com/posts/Et2pWrj4nWfdNAawh/what-specific-dangers-arise-when-asking-gpt-n-to-write-an?commentId=NT3BRmRGGJ3qvjPWH]. (Maybe you consider this to be a special case of (2).)
abramdemski: I have not properly read all of that yet, but my very quick take is that your argument for a need for online learning strikes me as similar to your argument against the classic inner alignment problem applying to the architectures you are interested in. You find what I call mesa-learning implausible for the same reasons you find mesa-optimization implausible. Personally, I've come around to the position (seemingly held pretty strongly by other folks, eg Rohin) that mesa-learning is practically inevitable for most tasks [https://www.lesswrong.com/posts/WmBukJkEFM72Xr397/mesa-search-vs-mesa-control].
My AGI Threat Model: Misaligned Model-Based RL Agent

So maybe you mean that the ideal value function would be precisely the sum of rewards.

Yes, thanks, that's what I should have said.

In the rollout architecture you describe, there wouldn't really be any point to maintaining a separate value function, since you can just sum the rewards (assuming you have access to the reward function).

For "access to the reward function", we need to predict what the reward function will do (which may involve hard-to-predict things like "the human will be pleased with what I've done"). I guess your suggestion would be to call t... (read more)

abramdemski: Ah, that wasn't quite my intention, but I take it as an acceptable interpretation. My true intention was that the "reward function calculator" should indeed be directly accessible rather than indirectly learned via reward-function-model. I consider this normative (not predictive) due to the considerations about observation-utility agents discussed in Robust Delegation [https://www.lesswrong.com/posts/iTpLAaPamcKyjmbFC/robust-delegation] (and more formally in Daniel Dewey's paper [https://intelligence.org/files/LearningValue.pdf]). Learning the reward function is asking for trouble. Of course, hard-coding the reward function is also asking for trouble, so... *shrug*