I have a PhD in Computational Neuroscience from UCSD (Bachelor's was in Biomedical Engineering with Math and Computer Science minors). Ever since junior high, I've been trying to figure out how to engineer artificial minds, and I've been coding up artificial neural networks since I first learned to program. Obviously, all my early designs were almost completely wrong/unworkable/poorly defined, but I think those experiences did prime my brain with inductive biases that are well suited for working on AGI.
Although I now work as a data scientist in R&D at a large medical device company, I continue to spend my free time studying the latest developments in AI/ML/DL/RL and neuroscience and trying to come up with models for how to bring it all together into systems that could actually be implemented. Unfortunately, I don't seem to have much time to develop my ideas into publishable models, but I would love to have the opportunity to share ideas with those who do.
Of course, I'm also very interested in AI Alignment (hence the account here). My ideas on that front mostly fall into the "learn (invertible) generative models of human needs/goals and hook those up to the AI's own reward signal" camp. I think methods of achieving alignment that depend on restricting the AI's intelligence or behavior are about as destined to failure in the long term as Prohibition or the War on Drugs in the USA. We need a better theory of what reward signals are for in general (probably something to do with maximizing (minimizing) the attainable (dis)utility with respect to the survival needs of a system) before we can hope to model human values usefully. This could even extend to modeling the "values" of the ecological/socioeconomic/political supersystems in which humans are embedded or of the biological subsystems that are embedded within humans, both of which would be crucial for creating a better future.
Oh come on, Eliezer. These strategies aren't that alien.
I remember a time in my early years, feeling apprehensive about entering adolescence and inevitably transforming into a stereotypical rebellious teenager. It would have been not only boring and cliche but also an affront to every good thing I thought about myself. I didn't want to become a rebellious teenager, and so I decided, before I was overwhelmed with teenage hormones, that I wouldn't become one. And it turns out that intentional steering of one's self-narrative can (sometimes) be quite effective (constrained by what's physically possible, of course)! (Not saying that I couldn't have done with a bit more epistemological rebellion in my youth.)
The second one comes pretty naturally to me, too. I often feel more like a disembodied observer of the world around me than an active participant. Far more of my mental energy is spent navigating the realm of ideas than identifying with the persona that everyone else sees as me, so I tend to think far more about what ought to be done than about how I feel about things. Probably not the best way for everyone to be, though.
There's also someone I know personally who definitely falls into the third trap, and who is definitely among those for whom this advice would not be helpful at all. She is a genuinely loving, compassionate, and selfless person, but that very selflessness sometimes manifests in a physically debilitating way. Not long after I first got to know her, I noticed that she seemed to exaggerate her reactions to things, not maliciously or even consciously, but more as a sort of moral obligation. As if by not overreacting to every small mishap, it would prove that she didn't care. As if by not sacrificing her own well-being for the sake of helping everyone around her, it would prove that she didn't love them. I think at some point in the past, she defined her character as someone who reacts strongly to the things that matter to others, but her subconscious has since twisted this to the point where she now stresses herself out over other people's problems until she becomes physically ill. Again, I don't think she wants to make a martyr out of herself, but I think her self-predicting, motor-directing circuitry thinks that she needs to be one.
An additional possibly-not-helpful bit of advice for the existentially anxious: take a page from Stoicism. Try to imagine all the ways things could go disastrously wrong, and try to coax yourself into being emotionally at peace with those outcomes, insofar as they are outside of your control. Strive as much as possible to steer things toward a better future with the tools and resources available to you, but practice equanimity toward everything else.
I wonder if you could do something similar with all peer-reviewed scientific publications, summarizing their findings into an encyclopedia of scientific knowledge. Basically, each article in the wiki would be a review article on a particular topic. The AI would have to track newly published results, determine which existing topics in the encyclopedia they relate to (or whether a new article is warranted), and update the relevant articles with the new findings.
Given how much science content humanity has accumulated, you'd probably have to have the AI organize scientific topics in a tree, with parent articles summarizing topics at a higher level of abstraction and child articles digging into narrower scopes more deeply. Or more generally, a directed acyclic graph to handle cross-disciplinary topics.
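To make the data structure concrete, here's a rough Python sketch of what such a topic graph might look like. This is purely my own illustration; the class and function names are invented, not any existing system's API:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class TopicNode:
    """One review article in the hypothetical encyclopedia."""
    title: str
    summary: str = ""                                         # current synthesized review text
    parents: list[TopicNode] = field(default_factory=list)    # broader topics
    children: list[TopicNode] = field(default_factory=list)   # narrower subtopics
    citations: set[str] = field(default_factory=set)          # e.g., DOIs of source papers

    def link_child(self, child: TopicNode) -> None:
        # Multiple parents are allowed, which is what makes this a DAG rather
        # than a strict tree (e.g., "neuroprosthetics" sitting under both
        # "neuroscience" and "biomedical engineering").
        self.children.append(child)
        child.parents.append(self)

def ingest_finding(doi: str, relevant_topics: list[TopicNode]) -> None:
    """Attach a newly published result to every topic it bears on."""
    for topic in relevant_topics:
        topic.citations.add(doi)
        # In a real system, an LLM pass would rewrite topic.summary here and
        # propagate a coarser-grained update up through topic.parents.
```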
Maybe future versions of AI chatbots could use something like this as a shared persistent memory that all chatbot instances could reference as a common ground truth. The only trick would be getting the system to use sound epistemology and reliably report uncertainty instead of hallucinations.
The same thing happens with my daughters (all under 6). Get them to start talking about poop, and it's like a switch has been flipped. Their behavior becomes deliberately misaligned with parental objectives until we find a way to snap them out of that loop.
So is Agent Foundations primarily about understanding the nature of agency so we can detect it and/or control it in artificial models, or does it also include the concept of equipping AI with the means of detecting and predictively modeling agency in other systems? Because I strongly suspect the latter will be crucial in solving the alignment problem.
The best definition I have at the moment sees agents as systems that actively maintain their internal state within a bounded range of viability in the face of environmental perturbations (which would apply to all living systems) and that can form internal representations of arbitrary goal states and use those representations to reinforce and adjust their behavior to achieve them. An AGI whose architecture is biased to recognize needs and goals in other systems, not just those matching human-specific heuristics, could be designed to adopt those predicted needs and goals as its own provisional objectives, steering the world toward its continually evolving best estimate of what other agentic systems want the world to be like. I think this would be safer, more robust, and more scalable than trying to define all human preferences up front.
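To make that definition a little less abstract, here's a toy sketch of what I mean by a system that both defends a viability envelope and carries an explicit, swappable goal representation. Every number and name here is invented for illustration only:

```python
import random

class ToyAgent:
    """Illustrative only: a viability envelope plus an explicit goal representation."""

    def __init__(self):
        self.energy = 1.0                # internal state to keep within bounds
        self.viable_range = (0.2, 2.0)   # outside this range the agent "dies"
        self.goal = None                 # internal representation of a desired world state

    def set_goal(self, predicate):
        # The goal is an arbitrary representation (here: any predicate over
        # world states), not something hard-wired into the agent's dynamics.
        self.goal = predicate

    def step(self, world_state: float) -> str:
        self.energy -= 0.05 + 0.05 * random.random()   # environmental perturbation
        low, high = self.viable_range
        if self.energy < low:
            return "seek_food"           # homeostasis overrides everything else
        if self.goal is not None and not self.goal(world_state):
            return "act_toward_goal"     # adjust behavior to reduce goal mismatch
        return "idle"

agent = ToyAgent()
agent.set_goal(lambda temp: 20.0 <= temp <= 22.0)   # e.g., keep a room near 21 °C
print(agent.step(world_state=25.0))                 # -> "act_toward_goal"
```

The alignment-relevant move would be the `set_goal` step: instead of a hand-written predicate, the agent would install its current best inference of what the agentic systems around it need.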
These are just my thoughts. Take from them what you will.
But then again, what are human minds but bags of heuristics themselves? And AI can evolve orders of magnitude faster than we can. Handing over the keys to its own bootstrapping will only accelerate it further.
If the future trajectory to AGI is just "systems of LLMs glued together with some fancy heuristics", then maybe a plateau in Transformer capabilities will keep things relatively gradual. But I suspect that we are just a paradigm shift or two away from a Generalized Theory of Intelligence. Just figure out how to do predictive coding of arbitrary systems, combine it with narrative programming and continual learning, and away we go! Or something like that.
Generalizing a bit, I wonder how hard a misaligned ASI would have to work to get every human to voluntarily poison themselves.
It's how recursive self-improvement starts out.
First, the global "AI models + human development teams" system improves through iterative development and evaluation. Then the AI models take on more responsibilities in terms of ideation, process streamlining, and architecture optimization. And finally, an AI agent groks enough of the process to take on all responsibilities, and the intelligence explosion takes off from there.
You'd think someone would try to use AI to automate the production and distribution of necessities to drive the cost of living down toward zero first, but it seems that was just a dream of naive idealism. Oh well. Still, could someone please get on that?
With respect to the online rationalist community, my main thing to come out of the closet about is that I was a Young-Earth Creationist all the way up until the end of grad school (and even a Young-Universe Creationist up until the middle of undergrad). Not very rational of me to avoid honestly facing mountains of evidence in order to protect sacred beliefs!
With respect to my family and life-long friends, my main thing to come out of the closet about is that I am now a liberal atheist. Not very respectable of me to willfully join the ranks of the enemy!
My main hurdle in exposing myself on the latter front is not so much my desire to be liked as my desire not to hurt those I care about. There's no kind way to inform someone that you think they are fundamentally wrong about every belief they hold sacred, beliefs on which they have built their entire identity as individuals and as a community. I am unfortunately the most emotionally stable person I know among those I'm close to, and an unfortunate number of people look up to me as an intelligent person who agrees with them on everything they hold dear, which helps them feel more justified in their beliefs. Coming out to them will necessarily create feelings of disappointment, betrayal, devastation, fear, doubt, and/or existential crisis, varying in mixture and intensity according to the individual.
I guess I could offer them the tools of sound epistemology and existential equanimity as a value proposition, but I have doubts as to whether others would see that as a fair trade-off.
If you can get access to the book, try reading The Intelligent Movement Machine. Basically, motor cortex is not so much stimulating the contraction of particular muscles as encoding the end-configuration toward which to move the body (e.g., neurons in monkey motor cortex that encode the act of bringing the hand to the mouth, no matter the starting position of the arm). How the muscles actually achieve this is then more a matter of model-based control theory than of an RL-trained action policy.
It's closely related to end-effector control, where the position, orientation, force, speed, etc. of the movement of the end of a robotic appendage are the focus of optimization, as opposed to joint control, which focuses only on the raw motor outputs along the joints of the appendage that cause the movement.
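For a concrete (if very simplified) example of the distinction, here's the standard two-link planar arm worked out in Python. The link lengths and target are arbitrary numbers I picked for illustration:

```python
import math

def two_link_ik(x: float, y: float, l1: float = 1.0, l2: float = 1.0):
    """End-effector control for a planar 2-link arm: given a desired hand
    position (x, y), solve for joint angles that put the hand there.
    Joint control would instead specify the angles directly and let the
    hand end up wherever it ends up."""
    d2 = x * x + y * y
    # Law of cosines for the elbow angle; clamp for numerical safety.
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    cos_elbow = max(-1.0, min(1.0, cos_elbow))
    elbow = math.acos(cos_elbow)            # "elbow-down" solution
    shoulder = math.atan2(y, x) - math.atan2(l2 * math.sin(elbow),
                                             l1 + l2 * math.cos(elbow))
    return shoulder, elbow

# Same hand target from any starting posture: the controller cares about the
# end point, not the particular joint trajectory that gets there.
print(two_link_ik(1.2, 0.8))
```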
You can also try diving deeper into the active inference literature if you want to build an intuition for how "predictive" circuits can actually drive motor commands. Just remember that Friston comes at this from the perspective of trying to find unifying mathematical formalisms for everything the brain does, both perception and action, which leads him to use terminology for the action side of things that is unintuitive.
Active inference is not saying that the brain "predicts" that the body will achieve a certain configuration and then the universe grants its wish. Instead, just like perception is about predicting what things out in the world are causing your senses to receive the signals that they do, action is about predicting what low-level movements of your body would cause your desired high-level behavior and then using those predictions to actually drive the low-level movements. Or rather, the motor cortex is finding the low-level movements (proprioceptive trajectories) that the agent's intended behavior would cause and then carrying out those movements. Again, don't get too hung up on the "prediction" nomenclature; the system does what it does regardless of what you call it.
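If it helps, here's a toy numerical sketch (my own illustration, not anything taken from the active inference literature itself) of how setting a "prediction" equal to the desired proprioceptive state and letting a low-level loop cancel the error ends up functioning as a motor command:

```python
import numpy as np

def act_by_prediction(desired: np.ndarray,
                      actual: np.ndarray,
                      gain: float = 0.3,
                      steps: int = 50) -> np.ndarray:
    """Toy version of action-as-fulfilled-prediction: the higher level asserts
    a proprioceptive prediction (the desired joint configuration); the lower
    level treats the prediction error as a motor command and moves to cancel
    it, like a reflex arc."""
    state = actual.copy()
    for _ in range(steps):
        error = desired - state          # proprioceptive prediction error
        state = state + gain * error     # "motor command" shrinks the error
    return state

start = np.array([0.0, 0.0])             # current joint angles (radians)
target = np.array([0.6, -0.3])           # intended posture, framed as a prediction
print(act_by_prediction(target, start))  # converges on the target posture
```

The point of the sketch is just that nothing magical happens when you call the target a "prediction"; the loop that cancels the discrepancy is doing ordinary feedback control.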