Working on AGI safety via a deep-dive into brain algorithms, see https://sjbyrnes.com/agi.html
Ben Goertzel comments on this post via twitter:
1) Nice post ... IMO the "Human-Like Social Instincts" direction has best odds of success; the notion of making AGIs focused on compassion and unconditional love (understanding these are complex messy human concept-plexes) appears to fall into this category as u loosely define it2) Of course to make compassionate/loving AGI actually work, one needs a reasonable amount of corrigibility in one's AGI cognitive architecture, many aspects of which seem independent of whether compassion/love or something quite different is the top-level motivation/inspiration
1) Nice post ... IMO the "Human-Like Social Instincts" direction has best odds of success; the notion of making AGIs focused on compassion and unconditional love (understanding these are complex messy human concept-plexes) appears to fall into this category as u loosely define it
2) Of course to make compassionate/loving AGI actually work, one needs a reasonable amount of corrigibility in one's AGI cognitive architecture, many aspects of which seem independent of whether compassion/love or something quite different is the top-level motivation/inspiration
I dunno, the healthiness of a food is not identical to the sum of the healthiness of its ingredients if you separate those ingredients out in a centrifuge. I think the "palatability" of food is partly related to how physically easy it is to break down (with both teeth and digestive tract). Hardness-to-break-down is potentially related to how many calories your digestive tract uses in digesting it, how quickly it delivers its nutrients, and how much you actually wind up eating.
(Very much not an expert. I think "a food is more than the sum of its ingredients" is discussed in a Michael Pollan book.)
Sometimes I send a draft to a couple people before posting it publicly.
Sometimes I sit on an idea for a while, then find an excuse to post it in a comment or bring it up in a conversation, get some feedback that way, and then post it properly.
I have several old posts I stopped endorsing, but I didn't delete them; I put either an update comment at the top or a bunch of update comments throughout saying what I think now. (Last week I spent almost a whole day just putting corrections and retractions into my catalog of old posts.) I for one would have a very positive impression of a writer whose past writings were full of parenthetical comments that they were wrong about this or that. Even if the posts wind up unreadable as a consequence.
how does it avoid wireheading
Um, unreliably, at least by default. Like, some humans are hedonists, others aren't.
I think there's a "hardcoded" credit assignment algorithm. When there's a reward prediction error, that algorithm primarily increments the reward-prediction / value associated with whatever stuff in the world model became newly active maybe half a second earlier. And maybe to a lesser extent, it also increments the reward-prediction / value associated with anything else you were thinking about at the time. (I'm not sure of the gory details here.)
Anyway, insofar as "the reward signal itself" is part of the world-model, it's possible that reward-prediction / value will wind up attached to that concept. And then that's a desire to wirehead. But it's not inevitable. Some of the relevant dynamics are:
For AGIs we would probably want to do other things too, like (somehow) use transparency to find "the reward signal itself" in the world-model and manually fix its reward-prediction / value at zero, or whatever else we can think of. Also, I think the more likely failure mode is "wireheading-lite", where the desire to wirehead is trading off against other things it cares about, and then hopefully conservatism (section 2 here) can help prevent catastrophe.
I had totally forgotten about your subagents post.
this post doesn't cleanly distinguish between reward-maximization and utility-maximization
I've been thinking that they kinda blend together in model-based RL, or at least the kind of (brain-like) model-based RL AGI that I normally think about. See this comment and surrounding discussion. Basically, one way to do model-based RL is to have the agent create a predictive model of the reward and then judge plans based on their tendency to maximize "the reward as currently understood by my predictive model". Then "the reward as currently understood by my predictive model" is basically a utility function. But at the same time, there's a separate subroutine that edits the reward prediction model (≈ utility function) to ever more closely approximate the true reward function (by some learning algorithm, presumably involving reward prediction errors).
In other words: At any given time, the part of the agent that's making plans and taking actions looks like a utility maximizer. But if you lump together that part plus the subroutine that keeps editing the reward prediction model to better approximate the real reward signal, then that whole system is a reward-maximizing RL agent.
Please tell me if that makes any sense or not; I've been planning to write pretty much exactly this comment (but with a diagram) into a short post.
I'm all for doing lots of testing in simulated environments, but the real world is a whole lot bigger and more open and different than any simulation. Goals / motivations developed in a simulated environment might or might not transfer to the real world in the way you, the designer, were expecting.
So, maybe, but for now I would call that "an intriguing research direction" rather than "a solution".
Right, the word "feasibly" is referring to the bullet point that starts "Maybe “Reward is connected to the abstract concept of ‘I want to be able to sing well’?”". Here's a little toy example we can run with: teaching an AGI "don't kill all humans". So there are three approaches to reward design that I can think of, and none of them seem to offer a feasible way to do this (at least, not with currently-known techniques):
Is there any good AI alignment research that you don't classify as deconfusion? If so, can you give some examples?
I'm not remotely qualified to comment on this, but fwiw in the Mojiang Mine Theory (which says it was a lab leak, but did not involve GOF), six miners caught the virus from bats (and/or each other), and then the virus spent four months replicating within the body of one of these poor guys as he lay sick in a hospital (and then of course samples were sent to WIV and put in storage).
This would explain (2) because four months in this guy's body (especially lungs) allows tons of opportunity for the virus to evolve and mutate and recombine in order to adapt to the human body, and maybe it also explains (1) either randomly or via recombination between viral and human DNA (if that makes sense?), again during those four months in this poor guy's body.
Thanks! This is very interesting!
there is at least one steak neuron in my own hippocampus, and it can be stimulated by hearing the word, and persistent firing of it will cause episodic memories...to rise up
Oh yeah, I definitely agree that this is an important dynamic. I think there are two cases. In the case of episodic memory I think you're kinda searching for one of a discrete (albeit large) set of items, based on some aspect of the item. So this is a pure autoassociative memory mechanism. The other case is when you're forming a brand new thought. I think of it like, your thoughts are made up of a bunch of little puzzle pieces that can snap together, but only in certain ways (e.g. you can't visualize a "falling stationary rock", but you can visualize a "blanket made of banana peels"). I think you can issue top-down mandates that there should be a thought containing a certain small set of pieces, and then your brain will search for a way to build out a complete thought (or plan) that includes those pieces. Like "wanting to fit the book in the bag" looks like running a search for a self-consistent thought that ends with the book sliding smoothly into the bag. There might be some autoassociative memory involved here too, not sure, although I think it mainly winds up vaguely similar to belief-propagation algorithms in Bayesian PGMs.
Anyway, the hunger case could look like invoking the piece-of-a-thought:
Piece-of-a-thought X: "[BLANK] and then I eat yummy food"
…and then the search algorithm looks for ways to flesh that out into a complete plausible thought.
I guess your model is more like "the brainstem reaches up and activates Piece-of-a-thought X" and my model is more like "the brainstem waits patiently for the cortex to activate Piece-of-a-thought X, and as soon as it does, it says YES GOOD THANKS, HERE'S SOME REWARD". And then very early in infancy the cortex learns (by RL) that when its own interoceptive inputs indicate hunger, then it should activate piece-of-a-thought X.
Maybe you'll say: eating is so basic, this RL mechanism seems wrong. Learning takes time, but infants need to eat, right? But then my response would be: eating is basic and necessary from birth, but doesn't need to involve the cortex. There can be a hardwired brainstem circuit that says "if you see a prey animal, chase it and kill it", and another that says "if you smell a certain smell, bite on it", and another that says "when there's food in your mouth, chew it and swallow it", etc. The cortex is for learning more complicated patterns, I think, and by the time it's capable of doing useful things in general, it can also learn this one simple little pattern, i.e. that hunger signals imply reward-for-thinking-about-eating.
FWIW, in the scheme here, one part of insular cortex is an honorary member of the "agranular prefrontal cortex" club—that's based purely on this quote I found in Wise 2017: "Although the traditional anatomical literature often treats the orbitofrontal and insular cortex as distinct entities, a detailed analysis of their architectonics, connections, and topology revealed that the agranular insular areas are integral parts of an “orbital prefrontal network”". So this is a "supervised learning" part (if you believe me), and I agree with you that it may well more specifically involve predictions about "feeling better after consuming something". I also think this is probably the part relevant to your comment "the insula's supervised learning algorithms can be hacked?".
Another part of the insula is what Lisa Feldman Barrett calls "primary interoceptive cortex", i.e. she is suggesting that it learns a vocabulary of patterns that describe incoming interoceptive (body status) signals, analogously to how primary visual cortex learns a vocabulary of patterns that describe incoming visual signals, primary auditory cortex learns a vocabulary of patterns that describe incoming auditory signals, etc.
Those are the two parts of the insula that I know about. There might be other things in the insula too.
I didn't explicitly mention caudate here but it's half of "dorsal striatum". The other half is putamen—I think they're properly considered as one structure. "Dorsal striatum" is the striatum associated with motor-control cortex and executive-function cortex, more or less. I'm not sure how that breaks down between caudate and putamen. I'm also not sure why caudate was active in that fMRI paper you found.
I think I draw more of a distinction between plans and memories than you, and put hippocampus on the "memory" side. (I'm thinking roughly "hippocampus = navigation (in all mammals) and first-person memories (only in humans)", and "dorsolateral prefrontal cortex is executive function and planning (in humans)".) I'm not sure exactly what the fMRI task was, but maybe it involved invoking memories?