Thanks for sharing!
These are definitely reasonable things to think about.
For my part, I get kinda stuck right at your step #1. Like, say you give the AGI access to YouTube and tell it to build a predictive model (i.e. do self-supervised learning). It runs for a while and winds up with a model of everything in the videos—people doing things, balls bouncing, trucks digging, etc. etc. Then you need to point to a piece of this model and say "This is human behavior" or "This is humans intentionally doing things". How do we do that? How do we find the right piec... (read more)
My current working theory of human social interactions does not involve multiple reward signals. Instead it's a bunch of rules like "If you're in state X, and you empathetically simulate someone in state Y, then send reward R and switch to state Z". See my post "Little glimpses of empathy" as the foundation of social emotions. These rules would be implemented in the hypothalamus and/or brainstem.
(Plus some involvement from brainstem sensory-processing circuits that can run hardcoded classifiers that return information about things like whether a per... (read more)
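To make that rule format concrete, here's a minimal Python sketch; the state names, reward values, and dictionary lookup are invented placeholders for illustration, not a claim about the actual hypothalamus/brainstem circuitry:

```python
# A minimal sketch (not the actual model) of the hypothesized rule format:
# "If you're in state X, and you empathetically simulate someone in state Y,
#  then send reward R and switch to state Z."
# All state names and reward values below are made up for illustration.

RULES = {
    # (my_current_state, empathetically_simulated_state): (reward, next_state)
    ("neutral",   "other_in_distress"): (-1.0, "concerned"),
    ("concerned", "other_relieved"):    (+1.0, "warm_glow"),
    ("rivalry",   "other_in_distress"): (+0.5, "gloating"),  # e.g. schadenfreude
}

def apply_social_rule(my_state: str, simulated_state: str):
    """Look up the hardcoded rule, if any; otherwise no reward, no state change."""
    reward, next_state = RULES.get((my_state, simulated_state), (0.0, my_state))
    return reward, next_state

# Example: I'm neutral and I empathetically simulate someone in distress.
print(apply_social_rule("neutral", "other_in_distress"))  # (-1.0, 'concerned')
```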
On your equivalence to an AI with an interpretability/oversight module. Data shouldn't be flowing back from the oversight into the AI.
Sure. I wrote "similar to (or even isomorphic to)". We get to design it how we want. We can allow the planning submodule direct easy access to the workings of the choosing-words submodule if we want, or we can put strong barriers such that the planning submodule needs to engage in a complicated hacking project in order to learn what the choosing-words submodule is doing. I agree that the latter is probably a better set... (read more)
Ben Goertzel comments on this post via Twitter:
1) Nice post ... IMO the "Human-Like Social Instincts" direction has best odds of success; the notion of making AGIs focused on compassion and unconditional love (understanding these are complex messy human concept-plexes) appears to fall into this category as u loosely define it
2) Of course to make compassionate/loving AGI actually work, one needs a reasonable amount of corrigibility in one's AGI cognitive architecture, many aspects of which seem independent of whether compassion/love or something quite different is the top-level motivation/inspiration
I dunno, the healthiness of a food is not identical to the sum of the healthiness of its ingredients if you separate those ingredients out in a centrifuge. I think the "palatability" of food is partly related to how physically easy it is to break down (with both teeth and digestive tract). Hardness-to-break-down is potentially related to how many calories your digestive tract uses in digesting it, how quickly it delivers its nutrients, and how much you actually wind up eating.
(Very much not an expert. I think "a food is more than the sum of its ingredients" is discussed in a Michael Pollan book.)
Sometimes I send a draft to a couple people before posting it publicly.
Sometimes I sit on an idea for a while, then find an excuse to post it in a comment or bring it up in a conversation, get some feedback that way, and then post it properly.
I have several old posts I stopped endorsing, but I didn't delete them; I put either an update comment at the top or a bunch of update comments throughout saying what I think now. (Last week I spent almost a whole day just putting corrections and retractions into my catalog of old posts.) I for one would have a very p... (read more)
how does it avoid wireheading
Um, unreliably, at least by default. Like, some humans are hedonists, others aren't.
I think there's a "hardcoded" credit assignment algorithm. When there's a reward prediction error, that algorithm primarily increments the reward-prediction / value associated with whatever stuff in the world model became newly active maybe half a second earlier. And maybe to a lesser extent, it also increments the reward-prediction / value associated with anything else you were thinking about at the time. (I'm not sure of the gory details here.... (read more)
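Here's a toy sketch of that credit-assignment idea as stated above; the learning rates and concept names are made up, and the real thing is of course not a Python dict:

```python
# Toy sketch of the "hardcoded credit assignment" idea: on a reward prediction
# error, bump the learned value of whatever world-model concepts became active
# ~0.5 s earlier, and (with a smaller weight) anything else active at the time.
# All names and constants here are illustrative.

values = {}  # concept -> learned reward-prediction / value

def credit_assignment(rpe, newly_active, also_active,
                      lr_new=0.5, lr_other=0.05):
    """Distribute a reward prediction error (rpe) over recently active concepts."""
    for concept in newly_active:    # became active ~half a second ago
        values[concept] = values.get(concept, 0.0) + lr_new * rpe
    for concept in also_active:     # merely "in mind" at the time
        values[concept] = values.get(concept, 0.0) + lr_other * rpe

credit_assignment(rpe=+1.0,
                  newly_active={"salty taste"},
                  also_active={"kitchen", "conversation with friend"})
print(values)
```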
I had totally forgotten about your subagents post.
this post doesn't cleanly distinguish between reward-maximization and utility-maximization
I've been thinking that they kinda blend together in model-based RL, or at least the kind of (brain-like) model-based RL AGI that I normally think about. See this comment and surrounding discussion. Basically, one way to do model-based RL is to have the agent create a predictive model of the reward and then judge plans based on their tendency to maximize "the reward as currently understood by my predictive model... (read more)
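A minimal sketch of the structure in question, with a made-up learned reward model and made-up plans, just to show "judge plans by the reward as currently understood by my predictive model" rather than by the reward signal itself:

```python
# Sketch: the agent scores plans with its *learned model* of reward, not the
# true reward channel. The plans, features, and weights are toy stand-ins.

learned_reward_model = {"finish task": 1.0, "tamper with reward channel": 0.0}

def predicted_reward(plan):
    # what the agent *currently believes* the reward will be
    return sum(learned_reward_model.get(step, 0.0) for step in plan)

plans = [
    ["finish task"],
    ["tamper with reward channel"],  # would score high on *actual* reward,
                                     # but the learned model doesn't (yet) say so
]
best = max(plans, key=predicted_reward)
print(best)  # ['finish task']
```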
I'm all for doing lots of testing in simulated environments, but the real world is a whole lot bigger and more open and different than any simulation. Goals / motivations developed in a simulated environment might or might not transfer to the real world in the way you, the designer, were expecting.
So, maybe, but for now I would call that "an intriguing research direction" rather than "a solution".
Right, the word "feasibly" is referring to the bullet point that starts "Maybe “Reward is connected to the abstract concept of ‘I want to be able to sing well’?”". Here's a little toy example we can run with: teaching an AGI "don't kill all humans". So there are three approaches to reward design that I can think of, and none of them seem to offer a feasible way to do this (at least, not with currently-known techniques):
Is there any good AI alignment research that you don't classify as deconfusion? If so, can you give some examples?
I'm not remotely qualified to comment on this, but fwiw in the Mojiang Mine Theory (which says it was a lab leak, but did not involve GOF), six miners caught the virus from bats (and/or each other), and then the virus spent four months replicating within the body of one of these poor guys as he lay sick in a hospital (and then of course samples were sent to WIV and put in storage).
This would explain (2) because four months in this guy's body (especially lungs) allows tons of opportunity for the virus to evolve and mutate and recombine in order to adapt to ... (read more)
Thanks! This is very interesting!
there is at least one steak neuron in my own hippocampus, and it can be stimulated by hearing the word, and persistent firing of it will cause episodic memories...to rise up
Oh yeah, I definitely agree that this is an important dynamic. I think there are two cases. In the case of episodic memory I think you're kinda searching for one of a discrete (albeit large) set of items, based on some aspect of the item. So this is a pure autoassociative memory mechanism. The other case is when you're forming a brand new thought. I thin... (read more)
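As a toy illustration of "pure autoassociative memory" in the sense above (retrieval from a fixed, discrete set of stored items via a partial cue), here's a sketch with invented memories and features:

```python
# Toy autoassociative recall: a fixed, discrete set of stored items, and
# retrieval = find the stored item that best matches a partial cue.
# The "memories" and features are invented for illustration.

memories = {
    "steak dinner last week":   {"steak", "restaurant", "birthday"},
    "grilling in the backyard": {"steak", "summer", "smoke"},
    "beach trip":               {"sand", "waves", "sunburn"},
}

def recall(cue: set) -> str:
    """Return the stored memory whose features overlap most with the cue."""
    return max(memories, key=lambda m: len(memories[m] & cue))

print(recall({"steak", "birthday"}))  # 'steak dinner last week'
```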
a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile
I stand as a counterexample: I personally want my epistemic layer to have accurate beliefs—y'know, having read the sequences… :-P
I think of my epistemic system like I think of my pocket calculator: a tool I use to better achieve my goals. The tool doesn't need to share my goals.
The way I think about it is:
Thanks!! I largely agree with what you wrote.
I was focusing on the implementation of a particular aspect of that. Specifically, when you're doing what you call "thing modeling", the "things" you wind up with are entries in a complicated learned world-model—e.g. "thing #6564457" is a certain horrifically complicated statistical regularity in multimodal sensory data, something like "thing #6564457 is a prediction that thing #289347 is present, and thing #89672, and thing #68972, but probably not thing #903672", or whatever.
Meanwhile I agree with you that t... (read more)
I do like the "How To Talk" book and definitely use those techniques on my kids ("Oh, you're very upset, you're sad that we ran out of red peppers..." --me 20 minutes ago) though I haven't successfully started the habit of using it on adults. (Last time I tried I was accused of being condescending, guess I haven't quite gotten it down yet.) "Nonviolent Communication" and other sources hit that theme too.
…But I don't think that's quite it. That would be "positive reframing" without "magic dial". It's not just about acknowledging that the negative thought ex... (read more)
I'm proposing that (1) the hypothalamus has an input slot for "flinch now", (2) VTA has an output signal for "should have flinched", (3) there is a bundle of partially-redundant side-by-side loops (see the "probability distribution" comment) that connect specifically to both (1) and (2), by a genetically-hardcoded mechanism.
I take your comment to be saying: Wouldn't it be hard for the brain to orchestrate such a specific pair of connections across a considerable distance?
Well, I'm very much not an expert on how the brain wires itself up. But I think there'... (read more)
Let me try to repair Goodhart's law to avoid these problems:
By statistics, we should very generally expect two random variables to be uncorrelated unless there's a "good reason" to expect them to be correlated. Goodhart's law says that if U and V are correlated in some distribution, then (1) if a powerful optimizer tries to maximize U, then it will by default go far out of the distribution, (2) the mere fact that U and V were correlated in the distribution does not in itself constitute a "good reason" to expect them to be correlated far out of the distribu... (read more)
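Here's a small numerical toy (with a completely made-up generative model) illustrating both points: U and V correlated on the distribution, and an optimizer for U landing far outside it, where that in-distribution correlation says nothing about V:

```python
# Toy Goodhart demo: U and V are correlated on the training distribution, but
# hard optimization of U lands far outside that distribution, where the
# correlation no longer tells us anything about V. The model is invented.
import numpy as np

rng = np.random.default_rng(0)

def world(x):
    """x is a 'policy'; returns (U, V). Near x=0, U and V move together."""
    U = x + 0.1 * rng.normal()
    V = x - 0.05 * x**2 + 0.1 * rng.normal()  # V stops tracking U for large x
    return U, V

# In-distribution: policies near 0 -> U and V strongly correlated
xs = rng.normal(0, 1, 10_000)
U, V = np.vectorize(world)(xs)
print("in-distribution corr(U, V):", round(np.corrcoef(U, V)[0, 1], 2))

# A powerful optimizer pushes x far out of distribution to maximize U...
x_opt = 100.0
print("optimized U, resulting V:", world(x_opt))  # U is huge, V is very negative
```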
Right, so I'm saying that the "supervised learning loops" get highly specific feedback, e.g. "if you get whacked in the head, then you should have flinched a second or two ago", "if a salty taste is in your mouth, then you should have salivated a second or two ago", "if you just started being scared, then you should have been scared a second or two ago", etc. etc. That's the part that I'm saying trains the amygdala and agranular prefrontal cortex.
Then I'm suggesting that the Success-In-Life thing is a 1D reward signal to guide search in a high-dimensional ... (read more)
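To make the contrast schematic: each supervised loop gets its own specific error signal, whereas the Success-In-Life thing is a single shared scalar. The signal names and update rule below are placeholders, not a model of actual amygdala circuitry:

```python
# Sketch of "many specific supervised signals vs. one scalar reward".

# Supervised loops: each predictor gets its own specific error signal
# ("you got whacked -> you should have flinched a second or two ago").
def supervised_update(predictor_output: float, ground_truth: float, lr=0.1) -> float:
    """Weight update proportional to this loop's *own* error signal."""
    return lr * (ground_truth - predictor_output)

flinch_update   = supervised_update(predictor_output=0.1, ground_truth=1.0)  # got whacked
salivate_update = supervised_update(predictor_output=0.8, ground_truth=1.0)  # salty taste

# By contrast, the "Success-In-Life" signal is a single 1D reward shared by
# everything, which can only say "that whole thought/plan was better or worse
# than expected".
success_in_life_reward = +0.3
print(flinch_update, salivate_update, success_in_life_reward)
```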
Sorry, why are V and V' equally hard to define? Like if V is "human flourishing" and U is GDP then V' is "twice GDP minus human flourishing" which is more complicated than V. I guess you're gonna say "Why not say that V is twice GDP minus human flourishing?"? But my point is: for any particular set U, V, V', you can't claim that V and V' are equally simple, and you can't claim that V and V' are equally correlated with U. Right?
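For concreteness, here is a toy calculation (a simplification assuming U and V are standardized, i.e. mean 0 and variance 1, with correlation ρ) backing up the claim that V and V' = 2U − V are generally not equally correlated with U:

```latex
% Toy calculation, assuming U, V standardized with correlation rho,
% and V' = 2U - V  (i.e., "twice GDP minus human flourishing"):
\mathrm{corr}(U, V')
  = \frac{\mathrm{Cov}(U,\, 2U - V)}{\sqrt{\mathrm{Var}(U)\,\mathrm{Var}(2U - V)}}
  = \frac{2 - \rho}{\sqrt{5 - 4\rho}}
  \;\neq\; \rho \quad \text{in general (equality only at } \rho = 1\text{)}.
% e.g. rho = 0.5 gives corr(U, V') ~ 0.87, so the two proxies are not
% symmetrically related to U.
```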
That's interesting, thanks!
good/bad/neutral is a thing, but it seems to be defined largely with respect to our expectation of what was going to happen in the situation we were in.
I agree that this is a very important dynamic. But I also feel like, if someone says to me, "I keep a kitten in my basement and torture him every second of every day, but it's no big deal, he must have gotten used to it by now", I mean, I don't think that reasoning is correct, even if I can't quite prove it or put my finger on what's wrong. I guess that's what I was trying to get ... (read more)
If you Ctrl-F the post you'll find my little paragraph on how my take differs from Marblestone, Wayne, Kording 2016.
I haven't found "meta-RL" to be a helpful way to frame either the bandit thing or the follow-up paper relating it to the brain, more-or-less for reasons here, i.e. that the normal RL / POMDP expectation is that actions have to depend on previous observations—like think of playing an Atari game—and I guess we can call that "learning", but then we have to say that a large fraction of every RL paper ever is actually a meta-RL paper, and m... (read more)
The least-complicated case (I think) is: I (tentatively) think that the hippocampus is more-or-less a lookup table with a finite number of discrete thoughts / memories / locations / whatever (the type of content is different in different species), and a "proposal" is just "which of the discrete things should be activated right now".
A medium-difficulty case is: I think motor cortex stores a bunch of sequences of motor commands which execute different common action sequences. (I'm a believer in the Graziano theory that primary motor cortex, secondary m... (read more)
I would agree with "superintelligence is not literally omnipotence" but I think you're making overly strong claims in the opposite direction. My reasons are basically contained in Intelligence Explosion Microeconomics, That Alien Message, and Scott Alexander's Superintelligence FAQ. For example...
power seems to be very unrelated to intelligence
I think "very" is much too strong, and insofar as this is true in the human world, that wouldn't necessarily make it true for an out-of-distribution superintelligence, and I think it very much wouldn't be. Fo... (read more)
I agree! I'm 95% sure this is in Superintelligence somewhere, but nice to have a more-easily-linkable version.
If you think of it less like "possibly having a lot of money post-AGI" and more like "possibly owning a share of whatever the AGIs produce post-AGI", then I can imagine scenarios where that's very good and important. It wouldn't matter in the worst scenarios or best scenarios, but it might matter in some in-between scenarios, I guess. Hard to say though ...
I think Vicarious AI is doing more AGI-relevant work than anyone. I pore over all their papers. They're private so this doesn't directly answer your question. But what bugs me is: Their investors include Good Ventures & Elon Musk ... So how do they get away with (AFAICT) doing no safety work whatsoever ...?
it's all a big mess
Yup! This was a state-the-problem-not-solve-it post. (The companion solving-the-problem post is this brain dump, I guess.) In particular, just like prosaic AGI alignment, my starting point is not "Building this kind of AGI is a great idea", but rather "This is a way to build AGI that could really actually work capabilities-wise (especially insofar as I'm correct that the human brain works along these lines), and that people are actively working on (in both ML and neuroscience), and we should assume there's some chance they'll succeed whe... (read more)
Sorry if this is a stupid question but wouldn't "LI with no complexity bound on the traders" be trivial? Like, there's a noncomputable trader (brute force proof search + halting oracle) that can just look at any statement and immediately declare whether it's provably false, provably true, or neither. So wouldn't the prices collapse to their asymptotic value after a single step and then nothing else ever happens?
The integration with LessWrong means that anyone can still comment
Speaking of this, if I go to AF without being logged in, there's a box at the bottom that says "New comment. Write here. Select text for formatting options... SUBMIT" But non-members can't write comments, right? Seems kinda misleading... Well I guess I just don't know: What happens if a non-member (either LW-but-not-AF member or neither-AF-nor-LW member) writes a comment in the box and presses submit? (I guess I could do the experiment myself but I don't want to create a test comment that som... (read more)
bottom-up attention (ie attention due to interesting stimulus) can be more or less captured by surprise
Hmm. That's not something I would have said.
I guess I think of two ways that sensory inputs can impact top-level processing.
First, I think sensory inputs impact top-level processing when top-level processing tries to make a prediction that is (directly or indirectly) falsified by the sensory input, and that prediction gets rejected, and top-level processing is forced to think a different thought instead.
How would you query low-level details from a high-level node? Don't the hierarchically high-up nodes represent things which range over longer distances in space/time, eliding low-level details like lines?
My explanation would be: it's not a strict hierarchy; there are plenty of connections from the top to the bottom (or at least near-bottom). "Feedforward and feedback projections between regions typically connect to multiple levels of the hierarchy"; "It has been estimated that 40% of all possible region-to-region connections actually exist which is much lar... (read more)
Hi again, I finally got around to reading those links, thanks!
I think what you're saying (and you can correct me) is: observation-utility agents are safer (or at least less dangerous) than reward-maximizers-learning-the-reward, because the former avoids falling prey to what you called "the easy problem of wireheading".
So then the context was:
First you said, If we do rollouts to decide what to do, then the value function is pointless, assuming we have access to the reward function.
Then I replied, We don't have access to the reward function, because we can't... (read more)
FWIW my "one guy's opinion" is (1) I'm expecting people to build goal-seeking AGIs, and I think by default their goals will be opaque and unstable and full of unpredictable distortions compared to whatever was intended, and solving this problem is necessary for a good future (details), (2) Figuring out how AGIs will be deployed and what they'll be used for in a complicated competitive human world is also a problem that needs to be solved to get a good future. I don't think either of these problems is close to being solved, or that they're likely to be solv... (read more)
You left out the category of possible answers "Such-and-such type of computational process corresponds to suffering", and then octopuses and ML algorithms might or might not qualify depending on how exactly the octopus brain works, and what the exact ML algorithm is, and what exactly is that "such-and-such" criterion I mentioned. I definitely put far more weight on this category of answers than the two you suggested.
Hmm, well I was trying to ballpark "weighted average outdoor temperature", specifically weighted by how much heat I'm using. Like, if outdoor temperature is only slightly cooler than what I want inside, I need relatively little heat regardless, so the efficiency of that heat isn't all that important. My reference temperature of 30°F (~0°C) is very far from the lowest temperature we experience; it's close to a 24-hour-average temperature during the coldest three months.
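A quick sketch of the weighting in question, with invented hourly temperatures (not actual weather data): weight each hour's outdoor temperature by a rough heating-load proxy (indoor setpoint minus outdoor temperature):

```python
# Heat-use-weighted average outdoor temperature. The hourly temperatures and
# setpoint below are hypothetical, just to show the calculation.

indoor_setpoint_f = 68.0
hourly_outdoor_f = [55, 45, 38, 30, 22, 15, 30, 40]  # hypothetical winter hours

def heat_weighted_avg(temps, setpoint=indoor_setpoint_f):
    weights = [max(setpoint - t, 0.0) for t in temps]  # heating-load proxy
    return sum(w * t for w, t in zip(weights, temps)) / sum(weights)

print(round(heat_weighted_avg(hourly_outdoor_f), 1))
# Mild hours get little weight, so this lands below the plain average (~34°F here).
```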
I didn't know about HSPF, thanks for the tip! It seems to assume "climate region IV" (bas... (read more)
Thanks for your comment!
That's an interesting electricity price chart. It seems like I'm paying typical rates for my state, and I don't know why it's high compared to other parts of my country. I wouldn't say that's "a flaw in my calculations", since I'm calculating it for myself and I'm not planning to move, but it definitely sheds light on why mini-splits are more attractive for people in other places.
The "COP" in my chart is specifically "COP for heating the interior of a building when it's 30°F (~0°C) outside". I don't think it's true that COP under th... (read more)
Thanks for the thought-provoking post! Let me try...
We have a visual system, and it (like everything in the neocortex) comes with an interface for "querying" it. Like, Dileep George gives the example "I'm hammering a nail into a wall. Is the nail horizontal or vertical?" You answer that question by constructing a visual model and then querying it. Or more simply, if I ask you a question about what you're looking at, you attend to something in the visual field and give an answer.
Dileep writes: "An advantage of generative PGMs is that we can train the model ... (read more)
Thanks! I guess my feeling is that we have a lot of good implementation-level ideas (and keep getting more), and we have a bunch of algorithm ideas, and psychology ideas and introspection and evolution and so on, and we keep piecing all these things together, across all the different levels, into coherent stories, and that's the approach I think will (if continued) lead to AGI. Like, I am in fact very interested in "methods for fast and approximate Bayesian inference" as being relevant for neuroscience and AGI, but I wasn't really interested in it until I l... (read more)
That makes sense. Now it's coming back to me: you zoom your microscope into one tiny nm^3 cube of air. In a right-to-left temperature gradient you'll see systematically faster air molecules moving rightward and slower molecules moving leftward, because they're carrying the temperature from their last collision. Whereas in uniform temperature, there's "detailed balance" (just as many molecules going along a path vs going along the time-reversed version of that same path, and with the same speed distribution).
Thinking about the diode-resistor thing more, I s... (read more)
I think the "drift from high-noise to low-noise" thing is more subtle than you're making it out to be... Or at least, I remain to be convinced. Like, has anyone else made this claim, or is there experimental evidence?
In the particle diffusion case, you point out correctly that if there's a gradient in D caused by a temperature gradient, it causes a concentration gradient. But I believe that if there's a gradient in D caused by something other than a temperature gradient, then it doesn't cause a concentration gradient. Like, take a room with a big pil... (read more)
Wait, you think your prosaic story doesn't involve blind search over a super-broad space of models??
No, not prosaic, that particular comment was referring to the "brain-like AGI" story in my head...
Like, I tend to emphasize the overlap between my brain-like AGI story and prosaic AI. There is plenty of overlap. Like they both involve "neural nets", and (something like) gradient descent, and RL, etc.
By contrast, I haven't written quite as much about the ways that my (current) brain-like AGI story is non-prosaic. And a big one is that I'm thinking that there ... (read more)
That's fair. Other possible approaches are "try to ensure that imagining dangerous adversarial intelligences is aversive to the AGI-in-training ASAP, such that this motivation is installed before the AGI is able to do so", or "interpretability that looks for the AGI imagining dangerous adversarial intelligences".
I guess the fact that people don't tend to get hijacked by imagined adversaries gives me some hope that the first one is feasible - like, that maybe there's a big window where one is smart enough to understand that imagining adversarial intelligence... (read more)
Hm, I want to classify "defense against adversaries" as a separate category from both "inner alignment" and "outer alignment".
The obvious example is: if an adversarial AGI hacks into my AGI and changes its goals, that's not any kind of alignment problem, it's a defense-against-adversaries problem.
Then I would take that notion and extend it by saying "yes interacting with an adversary presents an attack surface, but also merely imagining an adversary presents an attack surface too". Well, at least in weird hypotheticals. I'm not convinced that this would re... (read more)
My hunch is that we don't disagree about anything. I think you keep trying to convince me of something that I already agree with, and meanwhile I keep trying to make a point which is so trivially obvious that you're misinterpreting me as saying something more interesting than I am.
Like, if we do gradient descent, and the training signal is "get a high score in PacMan", then "mesa-optimize for a high score in PacMan" is incentivized by the training signal, and "mesa-optimize for making paperclips, and therefore try to get a high score in PacMan as an instrumental strategy towards the eventual end of making paperclips" is also incentivized by the training signal.
For example, if at some point in training, the model is OK-but-not-great at figuring out how to execute a deceptive strategy, gradient descent will make it better and better a... (read more)
I guess at the end of the day I imagine avoiding this particular problem by building AGIs without using "blind search over a super-broad, probably-even-Turing-complete, space of models" as one of its ingredients. I guess I'm just unusual in thinking that this is a feasible, and even probable, way that people will build AGIs... (Of course I just wind up with a different set of unsolved AGI safety problems instead...)
The Evolutionary Story
By and large, we expect trained models to do (1) things that are directly incentivized by the training signal (intentiona... (read more)
possibly related literature if you haven't seen it: Comprehensive AI Services
RE online learning, I acknowledge that a lot of reasonable people agree with you on that, and it's hard to know for sure. But I argued my position in Against evolution as an analogy for how humans will build AGI.
Also there: a comment thread about why I'm skeptical that GPT-N would be capable of doing the things we want AGI to do, unless we fine-tune the weights on the fly, in a manner reminiscent of online learning (or amplification).
So maybe you mean that the ideal value function would be precisely the sum of rewards.
Yes, thanks, that's what I should have said.
In the rollout architecture you describe, there wouldn't really be any point to maintaining a separate value function, since you can just sum the rewards (assuming you have access to the reward function).
For "access to the reward function", we need to predict what the reward function will do (which may involve hard-to-predict things like "the human will be pleased with what I've done"). I guess your suggestion would be to call t... (read more)