interpretability on pretrained model representations suggest they're already internally "ensembling" many different abstractions of varying sophistication, with the abstractions used for a particular task being determined by an interaction between the task data available and the accessibility of the different pretrained abstractions
That seems encouraging to me. There's a model of AGI value alignment where the system has a particular goal it wants to achieve and brings all its capabilities to bear on achieving that goal. It does this by having a "world model" that is coherent and perhaps a set of consistent Bayesian priors about how the world works. I can understand why such a system would tend to behave in a hyperfocused way to go out to achieve its goals.
In contrast, a system with an ensemble of abstractions about the world, many of which may even be inconsistent, seems much more human-like. It seems more human-like specifically in that the system won't be focused on a particular goal, or even a particular perspective about how to achieve it, but could arrive at a particular solution more or less randomly, based on quirks of training data.
I wonder if there's something analogous to human personality, where being open to experience or even open to some degree of contradiction (in a context where humans are generally motivated to minimize cognitive dissonance) is useful for seeing the world in different ways and trying out strategies and changing tack, until success can be found. If this process applies to selecting goals, or at least sub-goals, which it certainly does in humans, you get a system which is maybe capable of reflecting on a wide set of consequences and choosing a course of action that is more balanced, and hopefully balanced amongst the goals we give a system.
I've been writing about multi-objective RL and trying to figure out a way that an RL agent could optimize for a non-linear sum of objectives in a way that avoids strongly negative outcomes on any particular objective.
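To make the idea concrete, here is a minimal sketch of one possible non-linear aggregation. It is an illustration only, not necessarily the function I've been writing about: the names `soft_min_utility`, `linear_utility`, and the `sharpness` parameter are all made up for this example. It uses a smooth minimum, so a strongly negative result on any single objective drags the total down much harder than it would in a plain average.

```python
import math


def linear_utility(rewards):
    """Plain average of per-objective rewards (the usual linear scalarization)."""
    return sum(rewards) / len(rewards)


def soft_min_utility(rewards, sharpness=1.0):
    """Smooth minimum of per-objective rewards (hypothetical aggregation).

    Computes -1/k * log(mean(exp(-k * r))), which lies between min(rewards)
    and the plain average. Larger `sharpness` (k) moves it closer to the
    worst objective, so bad outcomes on one objective dominate.
    """
    k = sharpness
    worst = min(rewards)
    # Subtract the minimum before exponentiating for numerical stability.
    mean_exp = sum(math.exp(-k * (r - worst)) for r in rewards) / len(rewards)
    return worst - math.log(mean_exp) / k


balanced = [1.0, 1.0, 1.0]
lopsided = [5.0, 5.0, -7.0]  # same average, one strongly negative objective

# The linear sum cannot tell these two outcomes apart...
print(linear_utility(balanced), linear_utility(lopsided))
# ...while the smooth minimum clearly prefers the balanced one.
print(soft_min_utility(balanced), soft_min_utility(lopsided))
```

An agent maximizing the second function would give up some total reward to avoid tanking any one objective, which is the behavior I'm after. Whether this particular shape is the right one is an open question.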
This sounds like a very interesting question.
I get stuck trying to answer your question itself on the differences between AGI and humans.
But taking your question itself at its face:
ferreting out the fundamental intentions
What sort of context are you imagining? Humans aren't even great at identifying the fundamental reason for their own actions. They'll confabulate if forced to.
That's smart! When I started graduate school in psychology in 2013, mirror neurons felt like, colloquially, "hot shit", but within a few years, people had started to cringe quite dramatically whenever the phrase was used. I think your reasoning in (3) is spot on.
Your example leads to fun questions like "how do I recognize juggling?", including "what stimuli activate the concept of juggling when I do it?" vs. "what stimuli activate the concept of juggling when I see you do it?" Intuitively, nothing there seems to require that those be the same neurons, except the concept of juggling itself.
Empirically I would probably expect to see a substantial overlap in motor and/or somatosensory areas. One could imagine the activation pathway there is something like
visual cortex [see juggling] -> temporal cortex [concept of juggling] -> motor cortex [intuitions of moving arms]
And we'd also expect to see some kind of direct "I see you move your arm in x formation"->"I activate my own processes related to moving my arm in x formation" that bypasses the temporal cortex altogether.
And we could probably come up with more pathways that all cumulatively produce "mirror neural activity" which activates both when I see you do a thing and when I do that same thing. Maybe that's a better concept/name?
Then the next thing I want to suggest is that the system uses human resolution of conflicting outcomes to train itself to predict how a human would resolve a conflict, and if it is higher than a suitable level of confidence, it will go ahead and act without human intervention. But any prediction of what a human would predict could be second-guessed by a human pointing out where the prediction is wrong.
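A rough sketch of that loop, under heavy simplifying assumptions: conflicts are reduced to hashable labels, the "predictor" is just a frequency count of past human resolutions, and every name here (`ConflictResolver`, `threshold`, `min_examples`, `ask_human`) is hypothetical. A real system would need a learned model that generalizes across conflicts, but the control flow would look similar.

```python
from collections import Counter, defaultdict


class ConflictResolver:
    """Toy model: act alone only when confident about what a human would choose."""

    def __init__(self, threshold=0.9, min_examples=3):
        self.threshold = threshold        # confidence needed to act without asking
        self.min_examples = min_examples  # never act alone on too little data
        self.history = defaultdict(Counter)  # conflict label -> human choice counts

    def record_human_choice(self, conflict, choice):
        self.history[conflict][choice] += 1

    def predict(self, conflict):
        """Return (most common human choice, its share of past resolutions)."""
        counts = self.history[conflict]
        total = sum(counts.values())
        if total < self.min_examples:
            return None, 0.0
        choice, n = counts.most_common(1)[0]
        return choice, n / total

    def resolve(self, conflict, ask_human):
        choice, confidence = self.predict(conflict)
        if choice is not None and confidence >= self.threshold:
            return choice, "autonomous"
        # Not confident enough: defer to the human and learn from the answer.
        choice = ask_human(conflict)
        self.record_human_choice(conflict, choice)
        return choice, "deferred"

    def correct(self, conflict, right_choice):
        """A human second-guesses a prediction; this lowers future confidence."""
        self.record_human_choice(conflict, right_choice)


resolver = ConflictResolver(threshold=0.75, min_examples=3)
ask = lambda conflict: "safety"  # stand-in for a real human answering
for _ in range(4):
    print(resolver.resolve("speed_vs_safety", ask))
```

In this run the first three calls defer to the human, and the fourth acts autonomously. Calling `correct` enough times pushes confidence back below the threshold, so the system returns to asking.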
Agreed that whether a human understands the plan (and all the relevant outcomes; which outcomes are relevant?) is important and harder than I first imagined.
You haven't factored in the possibility that Putin gets deposed by forces inside Russia who might be worried about a nuclear war. Conditional on use of tactical nukes, intuitively that seems likely enough to materially lower p(kaboom).
American Academy of Pediatrics lies to us once again....
"If caregivers are wearing masks, does that harm kids’ language development? No. There is no evidence of this. And we know even visually impaired children develop speech and language at the same rate as their peers."
This is a textbook case of the Law of No Evidence. Or it would be, if there wasn’t any Proper Scientific Evidence.
Is it, though? I'm no expert, but I tried to find Relevant Literature. Sometimes, counterintuitive things are true.
https://www.researchgate.net/publication/220009177_Language_Development_in_Blind_Children:
Blindness affects congenitally blind children’s development in different ways, language development being one of the areas less affected by the lack of vision.
Most researchers have agreed upon the fact that blind children’s morphological development, with the exception of personal and possessive pronouns, is not delayed nor impaired in comparison to that of sighted children, although it is different.
As for syntactic development, comparisons of MLU scores throughout development indicate that blind children are not delayed when compared to sighted children
Blind children use language with similar functions, and learn to perform these functions at the same age as sighted children. Nevertheless, some differences exist up until 4;6 years; these are connected to the adaptive strategies that blind children put into practice, and/or to their limited access to information about external reality. However these differences disappear with time (Pérez-Pereira & Castro, 1997). The main early difference is that blind children tend to use self-oriented language instead of externally oriented language.
I don't know exactly where that leaves us evidentially. Perhaps the AAP is lying by omission by not telling us about things other than language that are affected by children's sight.
That's a bit different to the dishonesty alleged, though.
Still working my way through reading this series--it is the best thing I have read in quite a while and I'm very grateful you wrote it!
I feel like I agree with your take on "little glimpses of empathy" 100%.
I think fear of strangers could be implemented without a steering subsystem circuit maybe? (Should say up front I don't know more about developmental psychology/neuroscience than you do, but here's my 2c anyway). Put aside whether there's another more basic steering subsystem circuit for agency detection; we know that pretty early on, through some combination of instinct and learning from scratch, young humans and many animals learn there are agents in the world who move in ways that don't conform to the simple rules of physics they are learning. These agents seem to have internally driven and unpredictable behavior, in the sense their movement can't be predicted by simple rules like "objects tend to move to the ground unless something stops them" or "objects continue to maintain their momentum". It seems like a young human could learn an awful lot of that from scratch, and even develop (in their thought generator) a concept of an agent.
Because of their unpredictability, agent concepts in the thought generator would be linked to thought assessor systems related to both reward and fear; not necessarily from prior learning derived from specific rewarding and fearful experiences, but simply because, as their behavior can't be predicted with intuitive physics, there remains a very wide prior on what will happen when an agent is present.
In that sense, when a neocortex is first formed, most things in the world are unpredictable to it, and an optimally tuned thought generator+assessor would keep circuits active for both reward and harm. Over time, as the thought generator learns folk physics, most physical objects can be predicted, and it typically generates thoughts in line with their actual behavior. But agents are a real wildcard: their behavior can't be predicted by folk physics, and so they are perceived in the way that every other object in the world used to be: unpredictable, and thus continually predicting both reward and harm in an opponent process that leads to an ambivalent and uneasy neutral. This story predicts that individual differences in reward and threat sensitivity would particularly govern the default reward/threat balance for otherwise unknown items. It might (I'm really REALLY reaching here) help to explain why attachment styles seem so fundamentally tied to basic reward and threat sensitivity.
As the thought generator forms more concepts about agents, it might even learn that agents can be classified with remarkable predictive power into "friend" or "foe" categories, or perhaps "mommy/carer" and "predator" categories. As a consequence of how rocks behave (with complete indifference towards small children), it's not so easy to predict behavior of, say, falling rocks with "friend" or "foe" categories. On the contrary, agents around a child are often not indifferent to children, making it simple for the child to predict whether favorable things will happen around any particular agent by classifying agents into "carer" or "predator" categories. These categories can be entirely learned; clusters of neurons in the thought generator that connect to reward and threat systems in the steering system and/or thought assessor. So then the primary task of learning to predict agents is simply whether good things or bad things happen around the agent, as judged by the steering system.
This story would also predict that, before the predictive power of categorizing agents into "friend" vs. "foe" categories has been learned, children wouldn't know to place agents into these categories. They'd take longer to learn whether an agent is trustworthy or not, particularly so if they haven't learned what an agent is yet. As they grow older, they get more comfortable with classifying agents into "friend" or "foe" categories and would need fewer exemplars to learn to trust (or distrust!) a particular agent.
Well written, I really enjoyed this. This is not really on topic, but I'd be curious to read an "idiot's guide" or maybe an "autist's guide" on how to avoid sounding condescending.