Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Some ultra-short book reviews on cognitive neuroscience

  • On Intelligence by Jeff Hawkins & Sandra Blakeslee (2004)—very good. Focused on the neocortex - thalamus - hippocampus system, how it's arranged, what computations it's doing, what's the relation between the hippocampus and neocortex, etc. More on Jeff Hawkins's more recent work here.

  • I Am a Strange Loop by Hofstadter (2007)—I dunno, I didn't feel like I got very much out of it, although it's possible that I had already internalized some of the ideas from other sources. I mostly agreed with what he said. I probably got more out of watching Hofstadter give a little lecture on analogical reasoning (example) than from this whole book.

  • Consciousness and the Brain by Dehaene (2014)—very good. Maybe I could have saved time by just reading Kaj's review; there wasn't much more to the book beyond that.

  • Conscience by Patricia Churchland (2019)—I hated it. I forget whether I thought it was vague / vacuous, or actually wrong. Apparently I have already blocked the memory!

  • How to Create a Mind by Kurzweil (2014)—Parts of it were redundant with On Intelligence (which I had read earlier), but still worthwhile. His ideas about how brain-computer interfaces are supposed to work (in the context of cortical algorithms) are intriguing; I'm not convinced, hoping to think about it more.

  • Rethinking Consciousness by Graziano (2019)—A+, see my review here

  • The Accidental Mind by Linden (2008)—Lots of fun facts. The conceit / premise (that the brain is a kludgy accident of evolution) is kinda dumb and overdone—and I disagree with some of the surrounding discussion—but that's not really a big part of the book, just an excuse to talk about lots of fun neuroscience.

  • The Myth of Mirror Neurons by Hickok (2014)—A+, lots of insight about how cognition works, especially the latter half of the book. Prepare to skim some sections that beat a dead horse (as he debunks seemingly endless lists of bad arguments in favor of some aspect of mirror neurons). As a bonus, you get treated to an eloquent argument for the "intense world" theory of autism, and some aspects of predictive coding.

  • Surfing Uncertainty by Clark (2015)—I liked it. See also SSC review. I think there's still work to do in fleshing out exactly how these types of algorithms work; it's too easy to mix things up and oversimplify when just describing things qualitatively (see my feeble attempt here, which I only claim is a small step in the right direction).

  • Rethinking Innateness by Jeffrey Elman, Annette Karmiloff-Smith, Elizabeth Bates, Mark Johnson, Domenico Parisi, and Kim Plunkett (1996)—I liked it. Reading Steven Pinker, you get the idea that connectionists were a bunch of morons who thought that the brain was just a simple feedforward neural net. This book provides a much richer picture.

Dear diary...

[this is an experiment in just posting little progress reports as a self-motivation tool.]

1. I have a growing suspicion that I was wrong to lump the amygdala in with the midbrain. It may be learning by the same reward signal as the neocortex. Or maybe not. It's confusing. Things I'm digesting: https://twitter.com/steve47285/status/1314553896057081857?s=19 (and references therein) and https://www.researchgate.net/publication/11523425_Parallels_between_cerebellum-_and_amygdala-dependent_conditioning

2. Speaking of mistakes, I'm also regretting some comments I made a while ago suggesting that the brain doesn't do backpropagation. Maybe that's true in a narrow sense, but Randall O'Reilly has convinced me that the brain definitely does error-driven learning sometimes (I already knew that), and moreover it may well be able to propagate errors through at least one or two layers of a hierarchy, with enough accuracy to converge. No, that doesn't mean that the brain is exactly the same as a PyTorch / TensorFlow Default-Settings Deep Neural Net.

3. My long work-in-progress post on autism continues to be stuck on the fact that there seem to be two theories of social impairment that are each plausible and totally different. In one theory, social interactions are complex and hard to follow / model for cognitive / predictive-model-building reasons. The evidence I like for that is the role of the cerebellum, which sounds awfully causally implicated in autism. Like, absence of a cerebellum can cause autism, if I'm remembering right. In the other theory, modeling social interactions in the neurotypical way (via empathy) is aversive. The evidence I like for that is people with autism self-reporting that eye contact is aversive, among other things. (This is part of "intense world theory".) Of those two stories, I'm roughly 100% sold that the latter is right. But the former story doesn't seem obviously wrong, and I don't like having two explanations for the same thing (although that's not impossible: autism involves different symptoms in different people, and they could co-occur for biological reasons rather than computational reasons). I'm hoping that the stories actually come together somehow, and I'm just confused about what the cerebellum and amygdala do. So I'm reading and thinking about that.

4. New theory I'm playing with: the neocortex outputs predictions directly, in addition to motor commands. E.g. "my arm is going to be touched". Then the midbrain knows not to flinch when someone touches the arm. That could explain why the visual cortex talks to the superior colliculus, which I always thought was weird. Jeff Hawkins says those connections are the neocortex sending out eye movement motor commands, but isn't that controlled by the frontal eye fields? Oh, then Randall O'Reilly had this mysterious throwaway comment in a lecture that the frontal eye fields seem to be at the bottom of the visual hierarchy if you look at the connections. (He had a reference, I should read it.) I don't know what the heck is going on.

modeling social interactions in the neurotypical way (via empathy) is aversive

Is it too pessimistic to assume that people mostly model other people in order to manipulate them better? I wonder how much of human mental inconsistency is a defense against modeling. Here on Less Wrong we complain that inconsistent behavior makes you vulnerable to Dutch-booking, but in real life, consistent behavior probably makes you even more vulnerable, because your enemies can easily predict what you do and plan accordingly.

I was just writing about my perspective here; see also Simulation Theory (the opposite of "Theory Theory", believe it or not!). I mean, you could say that "making friends and being nice to them" is a form of manipulation, in some technical sense, blah blah evolutionary game theory blah blah, I guess. That seems like something Robin Hanson would say :-P I think it's a bit too cynical if you mean "manipulation" in the everyday sense involving bad intent. Also, if you want to send out vibes of "Don't mess with me or I will crush you!" to other people—and the ability to make credible threats is advantageous for game-theory reasons—that's all about being predictable and consistent!

Again, as I posted just now, I think the lion's share of "modeling", as I'm using the term, is something that happens unconsciously in a fraction of a second, not effortful empathy or modeling.

Hmmm... If I'm trying to impress someone, I do indeed effortfully try to develop a model of what they're impressed by, and then use that model when talking to them. And I tend to succeed! And it's not all that hard! The most obvious strategy tends to work (i.e., go with what has impressed them in the past, or what they say would be impressive, or what impresses similar people). I don't really see any aspect of human nature that is working to make it hard for me to impress someone, like by a person randomly changing what they find impressive. Do you? Are there better examples?

I have low confidence debating this, because it seems to me like many things could be explained in various ways. For example, I agree that some predictability is needed to prevent people from messing with you. On the other hand, some uncertainty is needed, too: if people know exactly when you would snap and start crushing them, they will go 5% below the line; but if the exact line depends on what you had for breakfast today, they will be more careful about getting too close to it.

Fair enough :-)

Branding: 3 reasons why I prefer "AGI safety" to "AI alignment"

  1. When engineers, politicians, bureaucrats, military leaders, etc. hear the word "safety", they suddenly perk up and start nodding and smiling. Safety engineering—making sure that systems robustly do what you want them to do—is something that people across society can relate to and appreciate. By contrast, when people hear the term "AI alignment" for the first time, they just don't know what it means or how to contextualize it.

  2. There are a lot of things that people are working on in this space that aren't exactly "alignment"—things like boxing, task-limited AI, myopic AI, impact-limited AI, non-goal-directed AI, AGI strategy & forecasting, etc. It's useful to have a term that includes all those things, and I think that term should be "AGI safety". Then we can reserve "AI alignment" for specifically value alignment.

  3. Actually, I'm not even sure that "value alignment" is exactly the right term for value alignment. The term "value alignment" is naturally read as something like "the AI's values are aligned with human values", which isn't necessarily wrong, but is a bit vague and not necessarily interpreted correctly. For example, if love is a human value, should the AGI adopt that value and start falling in love? No, it should facilitate humans falling in love. When people talk about CIRL, CEV, etc., it seems to be less about "value alignment" and more about "value indirection" (in the C++ sense), i.e. utility functions that involve human goals and values, and which more specifically define those things by pointing at human brains and human behavior.
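To make the copy-versus-pointer distinction concrete, here is a toy Python contrast, with Python object references standing in for C++ pointers. Everything in it is hypothetical and made up for illustration: `Human`, the two agent classes, and the encoding of an outcome as a dict from `(whose, value_name)` to degree achieved. It is a sketch of the analogy, not of how CIRL or CEV actually work. The "copied" agent snapshots the human's values and scores outcomes by how well it satisfies them itself (so it wants to fall in love); the "indirect" agent holds a reference to the human and scores outcomes by how well the human's current values are met.

```python
from dataclasses import dataclass, field


@dataclass
class Human:
    # Hypothetical: a person's values as name -> weight, e.g. {"love": 1.0}
    values: dict = field(default_factory=dict)


class CopiedValuesAgent:
    """Literal 'value alignment': copy the human's values and pursue them for itself."""

    def __init__(self, human):
        self.values = dict(human.values)  # snapshot taken once, at construction

    def utility(self, outcome):
        # outcome maps (whose, value_name) -> degree achieved; scores the AGENT's own satisfaction
        return sum(w * outcome.get(("agent", name), 0.0) for name, w in self.values.items())


class IndirectValuesAgent:
    """'Value indirection': point at the human and score the human's satisfaction."""

    def __init__(self, human):
        self.human = human  # a reference, so later changes to human.values are seen

    def utility(self, outcome):
        return sum(w * outcome.get(("human", name), 0.0) for name, w in self.human.values.items())


if __name__ == "__main__":
    alice = Human({"love": 1.0})
    copier, pointer = CopiedValuesAgent(alice), IndirectValuesAgent(alice)
    human_falls_in_love = {("human", "love"): 1.0}
    agent_falls_in_love = {("agent", "love"): 1.0}
    print(copier.utility(agent_falls_in_love), pointer.utility(agent_falls_in_love))
    print(copier.utility(human_falls_in_love), pointer.utility(human_falls_in_love))
```

The difference shows up twice: only the indirect agent cares about the human (rather than itself) falling in love, and only it tracks later changes to the human's values, because it holds a reference rather than a copy.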

A friend in the AI space who visited Washington told me that military leaders distinctly do not like the term "safety".


Why not?

Because they're interested in weapons and making people distinctly not safe.

Right, for them "alignment" could mean their desired concept, "safe for everyone except our targets".


I'm skeptical that anyone with that level of responsibility and acumen has that kind of juvenile destructive mindset. Can you think of other explanations?

Can you think of other explanations?

There's a difference between people talking about safety in the sense of 1. 'how to handle a firearm safely' and the sense of 2. 'firearms are dangerous, let's ban all guns'. These leaders may understand/be on board with 1, but disagree with 2.

I think if someone negatively reacts to 'Safety' thinking you mean 'try to ban all guns' instead of 'teach good firearm safety', you can rephrase as 'Control' in that context. I think Safety is more inclusive of various aspects of the problem than either 'Control' or 'Alignment', so I like it better as an encompassing term. 

Interesting. I guess I was thinking specifically about DARPA which might or might not be representative, but see Safe Documents, Safe Genes, Safe Autonomy, Safety and security properties of software, etc. etc.

Quick comments on "The case against economic values in the brain" by Benjamin Hayden & Yael Niv:

(I really only skimmed the paper, these are just impressions off the top of my head.)

I agree that "eating this sandwich" doesn't have a reward prediction per se, because there are lots of different ways to think about eating this sandwich, especially what aspects are salient, what associations are salient, what your hormones and mood are, etc. If neuroeconomics is premised on reward predictions being attached to events and objects rather than thoughts, then I don't like neuroeconomics either, at least not as a mechanistic theory of psychology. [I don't know anything about neuroeconomics, maybe that was never the idea anyway.]

But when they float the idea of throwing out rewards altogether, I'm not buying it. The main reason is: I'm trying to understand what the brain does algorithmically, and I feel like I'm making progress towards a coherent picture ...and part of that picture is a 1-dimensional signal called reward. If you got rid of that, I just have no idea how to fill in that gap. Doesn't mean it's impossible, but I did try to think it through and failed.

There's also a nice biological story going with the algorithm story: the basal ganglia has a dense web of connections across the frontal lobe, and can just memorize "this meaningless set of neurons firing is associated with that reward, and this meaningless set of neurons firing is associated with that reward, etc. etc." Then it (1) inhibits all but the highest-reward-predicting activity, and (2) updates the reward predictions based on what happens (TD learning). (Again this and everything else is very sketchy and speculative.)
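The two steps in that story can be caricatured in a few lines of Python. This is a toy sketch under loud assumptions, not a model of the real circuitry: each "meaningless set of neurons firing" is stood in for by an arbitrary string label, the association memory is a plain lookup table, selection is a hard argmax, and the learning rate `alpha` and discount `gamma` are made-up values. The class name `BasalGangliaSketch` is hypothetical. The labels deliberately describe thoughts rather than objects, in line with the sandwich point above.

```python
from collections import defaultdict


class BasalGangliaSketch:
    """Toy model: activity patterns -> reward predictions, with
    winner-take-all selection and TD(0) updates. All names are hypothetical."""

    def __init__(self, alpha=0.1, gamma=0.9):
        self.alpha = alpha                   # learning rate (assumed value)
        self.gamma = gamma                   # discount factor (assumed value)
        self.value = defaultdict(float)      # pattern -> predicted reward, default 0.0

    def select(self, candidates):
        # Step (1): "inhibit" every candidate except the highest-reward-predicting one.
        return max(candidates, key=lambda p: self.value[p])

    def update(self, pattern, reward, next_pattern=None):
        # Step (2): TD(0). Target = reward now + discounted prediction for what follows.
        target = reward
        if next_pattern is not None:
            target += self.gamma * self.value[next_pattern]
        td_error = target - self.value[pattern]   # reward prediction error
        self.value[pattern] += self.alpha * td_error
        return td_error


if __name__ == "__main__":
    bg = BasalGangliaSketch()
    for _ in range(100):
        bg.update("imagine eating sandwich, salient: taste", reward=1.0)
        bg.update("imagine eating sandwich, salient: calories", reward=-0.5)
    print(bg.select(["imagine eating sandwich, salient: calories",
                     "imagine eating sandwich, salient: taste"]))
```

A lookup table is the crudest possible choice here; it memorizes each pattern separately and does not generalize between similar patterns, which is one of several places where this sketch is much simpler than the story in the text.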

(DeepMind had a paper that says there's a reward prediction probability distribution instead of a reward prediction value, which is fine, that's still consistent with the rest of my story.)
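Here is a minimal sketch of how a distribution, rather than a single number, could come out of the same TD-style machinery. Assumption: each of several predictors scales positive prediction errors by its own `tau` and negative ones by `1 - tau`. My understanding is that the DeepMind model works roughly this way (asymmetric scaling of prediction errors), but treat the details and the function name `distributional_td` as illustrative. Each predictor then settles near the `tau`-expectile of the reward distribution, so together they encode its spread, while a downstream selection step could still read off the `tau = 0.5` predictor, which tracks the mean.

```python
def distributional_td(rewards, taus, alpha=0.01):
    """Learn one reward prediction per tau with asymmetrically scaled updates.

    Positive prediction errors are scaled by tau, negative ones by (1 - tau),
    so predictor i settles near the taus[i]-expectile of the reward distribution.
    """
    values = [0.0] * len(taus)
    for r in rewards:
        for i, tau in enumerate(taus):
            delta = r - values[i]                      # reward prediction error
            scale = tau if delta > 0 else 1.0 - tau    # optimistic vs pessimistic predictor
            values[i] += alpha * scale * delta
    return values


if __name__ == "__main__":
    taus = [0.1, 0.5, 0.9]
    rewards = [t % 2 for t in range(5000)]  # 0, 1, 0, 1, ...: a fair 0/1 "coin", made deterministic
    print([round(v, 2) for v in distributional_td(rewards, taus)])
```

With rewards alternating between 0 and 1, the three predictors end up near 0.1, 0.5 and 0.9 (for a fair 0/1 coin the tau-expectile is exactly tau), whereas a single ordinary TD predictor would only report 0.5.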

I get how deep neural nets can search for a policy directly. I don't think those methods are consistent with the other things I believe about the brain (or at least the neocortex). In particular I think the brain does seem to have a mechanism for choosing among different possible actions being considered in parallel, as opposed to a direct learned function from sensory input to output. The paper also mentions learning to compare without learning a value, but I don't think that works because there are too many possible comparisons (the square of the number of possible thoughts).
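The scaling argument in the last sentence is just counting. Storing one learned value per possible thought needs N numbers; learning a preference for every unordered pair of thoughts needs N(N-1)/2, which grows like N squared. The thought counts below are arbitrary placeholders, not estimates of anything about the brain, and `table_sizes` is a hypothetical helper.

```python
from math import comb


def table_sizes(n_thoughts):
    """Return (entries for one value per thought, entries for one preference per pair)."""
    per_thought = n_thoughts
    pairwise = comb(n_thoughts, 2)  # n(n-1)/2 unordered pairs, i.e. order n^2
    return per_thought, pairwise


if __name__ == "__main__":
    for n in (10, 1_000, 1_000_000):  # placeholder sizes
        values, comparisons = table_sizes(n)
        print(f"{n:>9,} thoughts: {values:>9,} values vs {comparisons:>15,} comparisons")
```

A per-thought value also gets comparisons for free: any two thoughts can be ranked by looking up two numbers, whereas a brand-new thought would need fresh comparisons against everything else.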

In the era of COVID, we should all be doing cardio exercise if possible, and not at a gym. Here's what's been working for me for the past many years. This is not well optimized for perfectly working out every muscle group etc., but it is very highly optimized for convenience, practicality, and sustainability, at least for me personally in my life situation.

(This post is mostly about home cardio exercise, but the last paragraph is about jogging.)

My home exercise routine consists of three simultaneous things: {exercise, YouTube video lectures, RockMyRun}. More on the exercise below. RockMyRun is a site/app that offers music mixes at fixed BPMs—the music helps my energy and the fixed BPM keeps me from gradually slowing down the pace. The video lectures make me motivated to work out, since there's a lot of stuff I desperately want to learn. :)

Previously I've instead done {exercise, movies or TV}. (I still do on rare occasions.) This is motivating when combined with the rule of "no movies or TV unless exercising (or on social special occasions)". I've pretty much followed that rule for years now.

My exercise routine consists of holding a dumbbell in each hand, then doing a sort of simultaneous reverse-lunge while lifting one of the dumbbells, alternating sides, kinda like this picture. Out of numerous things I've tried, this is the one that stuck, because it's compatible with watching TV, compatible with very small spaces including low ceilings, has low risk of injury, doesn't stomp or make noise, doesn't require paying attention (once you get the hang of it), and seems to be a pretty good cardio workout (as judged by being able to break a sweat in a freezing cold room). I also do a few pushups now and then as a break, although that means missing what's on the screen. I've gradually increased the dumbbell weight over the years from 3 lbs (1.4 kg) to 15 lbs (7 kg) now.

I strongly believe that the top priority for an exercise routine is whatever helps you actually keep doing it perpetually. But beyond that, I've found some factors that give me a more intense workout: Coffee helps slightly (it's a performance-enhancing drug! At least for some people); feeling cold at the beginning / being in a cold room seems to help; awesome action-packed movies or TV are a nice boost, but RockMyRun with boring video lectures is good enough. (My most intense workouts are watching music videos or concert recordings, but I get bored of those after a while.)

In other news, I also occasionally jog. RockMyRun is also a really good idea for that, not just for the obvious reasons (energy, pace), but because, when you set the BPM high, your running form magically and effortlessly improves. This completely solved my jogging knee pain problems, which I had struggled with for years. (I learned that tip from here, where he recommends 160BPM. I personally prefer 180BPM, because I like shorter and more intense runs for my time-crunched schedule.)

Introducing AGI Safety in general, and my research in particular, to novices / skeptics, in 5 minutes, out loud

I might be interviewed on a podcast where I need to introduce AGI risk to a broad audience of people who mostly aren’t familiar with it and/or think it’s stupid. The audience is mostly neuroscientists plus some AI people. I wrote the following as a possible entry-point, if I get thrown some generic opening question like “Tell me about what you’re working on”:

The human brain does all these impressive things, such that humanity was able to transform the world, go to the moon, invent nuclear weapons, wipe out various species, etc. Human brains did all those things by running certain algorithms.

And sooner or later, people will presumably figure out how to run similar algorithms on computer chips.

Then what? That’s the million-dollar question. Then what? What happens when researchers eventually get to the point where they can run human-brain-like algorithms on computer chips?

OK, to proceed I need to split into two ways of thinking about these future AI systems: Like a tool or like a species.

Let's start with the tool perspective. Here I'm probably addressing the AI people in the audience. You're thinking, “Oh, you're talking about AI, well pfft, I know what AI is, I work with AI every day, AI is kinda like language models and ConvNets and AlphaFold and so on. By the time we get future algorithms that are more like how the human brain works, they're going to be more powerful, sure, but we should still think of them as in the same category as ConvNets, we should think of them like a tool that people will use.” OK, if that's your perspective, then the goal is for these tools to do the things that we want them to do. And conversely, the concern is that these systems could go about doing things that the programmers didn't want them to do, and that literally nobody wanted them to do, like try to escape human control. The technical problem here is called The Alignment Problem: If people figure out how to run human-brain-like algorithms on computer chips, and they want those algorithms to try to do X, how can they do that? It's not straightforward. For example, humans have an innate sex drive, but it doesn't work very reliably: some people choose to be celibate. OK, so imagine you have the source code for a human-like brain architecture and training environment, and you want it to definitely grow into an adult that really, deeply, wants to do some particular task, like, let's say, designing solar cells, while also being honest and staying under human control. How would you do that? What exactly would you put into the source code? Nobody knows the answer. And when you dig into it you find that it's a surprisingly tricky technical problem, for pretty deep reasons. And that technical problem is something that I and others in the field are working on.

That was the tool perspective. But then there's probably another part of the audience, maybe a lot of the neuroscientists, who are strenuously objecting here: if we run human-brain-like algorithms on computer chips, we shouldn't think of that as like a tool for humans to use, instead we should think of it like a species, a new intelligent species that we have invited onto our planet, and indeed a species which will eventually think much faster than humans, and be more insightful and creative than humans, and also probably eventually outnumber humans by a huge factor, and so on. In that perspective, the question is: if we're going to invite this powerful new intelligent species onto our planet, how do we make sure that it's a species that we actually want to share the planet with? And how do we make sure that they want to continue sharing the planet with us? Or more generally, how do we bring about a good future? There are some interesting philosophy questions here which we can get back to, but putting those aside, there's also a technical problem to solve, which is, whatever properties we want this new intelligent species to have, we need to actually write source code such that it has them. For example, if we want this new species to feel compassion and friendship, we gotta put compassion and friendship into the source code. Human sociopaths are a case study here. Sociopaths exist, therefore it is possible to make an intelligent species that isn't motivated by compassion and friendship. Not just possible, but strictly easier! I think maybe future programmers will want to put compassion and friendship into the source code, but they won't know how, so they won't do it. So I say, let’s try to figure that out ahead of time. Again, I claim this is a very tricky technical problem, when you start digging into it. We can talk about why. Anyway, that technical problem is also something that I'm working on.

So in summary, sooner or later people will figure out how to run human-brain-like algorithms on computer chips, and this is a very very big deal, it could be the best or worst thing that's ever happened to humanity, and there's work we can do right now to increase the chance that things go well, including, in particular, technical work that involves thinking about algorithms and AI and reading neuroscience papers. And that's what I'm working on!

I’m open to feedback; e.g., where might skeptical audience-members fall off the boat? (I am aware that it’s too long for one answer; I expect that I’ll end up saying various pieces of this in some order depending on the flow of the conversation. But still, gotta start somewhere.)

I would prepare a shortened version (100 words max) that you could also give.

Yeah, I think I have a stopping point after the first three paragraphs (with minor changes).

Could you just say you're working on safe design principles for brain-like artificial intelligence?