Yes, defining the challenge also seems to get us 90% there already. A Millennium prize has a possibility of being too vague compared to the others.
In general a great piece. One thing that I found quite relatable is the point about the preparadigmatic stage of AI safety going into later stages soon. It feels like this is already happening to some degree where there are more and more projects readily available, more prosaic alignment and interpretability projects at large scale, more work done in multiple directions and bigger organizations having explicit safety staff and better funding in general.
With these facts, it seems like there's bound to be a relatively big phase shift in research and action within the field that I'm quite excited about.
This is a wonderful piece and echoes many sentiments I have about the current state of AI safety. Lately, I have also thought more and more about the limitations of a purely technical focus given the scope needed to handle the problems of AGI, i.e. the steam engine was an engineering/tinkering feat long before it was described technically/scientifically, and ML research seems much the same. When this is the case, focusing purely on hard technical solutions seems less important than focusing on AI governance or prosaic alignment, and not doing this, as echoed in other comments, might indeed be a pitfall of specialists, some of which are also warned of here.
[...]Still, both the reward function and the RL algorithm are inputs into the adult’s jealousy-related behavior[...]
I probably just don't know enough about jealousy networks to comment here but I'd be curious to see the research here (maybe even in an earlier post).
Does anyone believe in “the strict formulation”?
Hopefully not, but as I mention, it is often formulated too strictly, imho.
[...]first AGI can hang out with younger AGIs[...]
More the reverse. And again, this is probably taking it farther than I would take this idea but it would be pre-AGI training in an environment with symbolic "aligned" models, learning the ropes from this, being used as the "aligned" model in the next generation and so on. IDA with a heavy RL twist and scalable human oversight in the sense that humans would monitor rewards and environment states instead of providing feedback on every single action. Very flawed but possible.
[...] and the toddler gets negative reward for inhibiting the NPCs from accomplishing their goals, and positive reward for helping the NPCs accomplish their goals [...]
As far as I understand from the post, the reward comes only from understanding the reward function before interaction and not after, which is the controlling factor for obstructionist behaviour.
Thank you for the comprehensive response!
To be clear, you & I are strongly disagreeing on this point.
It seems like we are mostly agreeing in the general sense that there are areas of the brain with more individual differentiation and areas with less. The disagreement is probably more in how different this jealousy is exhibited as a result of the neocortical part of the circuit you mention.
And I endorse that criticism because I happen to be a big believer in “cortical uniformity” [...] different neural architectures and hyperparameters in different places and at different stages [...].
Great to hear, then I have nothing to add there! I am quite inclined to believe that the neural architecture and hyperparameter differences are underestimated as a result of Brodmann areas being a thing at all, i.e. I'm a supporter of the broad cortical uniformity argument but against the strict formulation that I feel is way too prescriptive given our current knowledge of the brain's functions.
Yeah, maybe see Section 2.3.1, “Learning-from-scratch is NOT “blank slate””.
And I will say I'm generally inclined to agree with your classification of the brain stem and hypothalamus.
That sounds great but I don’t understand what you’re proposing. What are the “relevant modalities of loving family”? I thought the important thing was there being an actual human that could give feedback and answer questions based on their actual human values, and these can’t be simulated because of the chicken-and-egg problem.
To be clear, my baseline would also be to follow your methodology but I think there's a lot of opportunity in the "nurture" approach as well. This is mostly related to the idea of open-ended training (e.g. like AlphaZero) and creating a game-like environment in which the agent can be trained. This can to some degree be seen as a sort of IDA proposal since your environment will need to be very complex (e.g. have other agents that are kind or have some other "aligned" trait, possibly trained from earlier states).
With this sort of setup, the human-giving-feedback is the designer of the environment itself, leading to a form of scalable human oversight probably iterating over many environments and agents, i.e. the IDA part of the idea. And again, there are a lot of holes in this plan, but I feel like it should not be dismissed outright. This post should also inform this process. So a very broad "loving family" proposal, though the name itself doesn't seem adequate for this sort of approach ;)
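To make the generational idea a bit more concrete, here is a toy sketch of the loop I have in mind. Everything in it is a made-up assumption for illustration: the `Agent` class, the single `helpfulness` number, the update rule (a crude stand-in for a real RL algorithm) and the approval threshold are all hypothetical and not taken from the post or from any library. It only shows the shape of the process: a learner trains among "aligned" NPCs, a human checks an aggregate reward statistic rather than every action, and an approved learner joins the NPC pool for the next generation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Agent:
    # Probability of helping (rather than obstructing) an NPC. Purely illustrative.
    helpfulness: float


def mean_reward(agent: Agent) -> float:
    """Expected per-step reward: +1 for helping an NPC, -1 for obstructing it."""
    return agent.helpfulness * 1.0 + (1.0 - agent.helpfulness) * -1.0


def train_one_generation(npcs, steps=200, lr=0.02, imitation=0.5):
    """Train a fresh learner among the current pool of 'aligned' NPCs."""
    learner = Agent(helpfulness=0.5)  # undecided starting policy
    npc_mean = mean(npc.helpfulness for npc in npcs)
    for _ in range(steps):
        # Toy stand-in for an RL update: reward pushes towards helping.
        reward_push = 1.0 - learner.helpfulness
        # Social learning: drift towards what the NPCs are observed doing.
        imitation_push = npc_mean - learner.helpfulness
        learner.helpfulness += lr * (reward_push + imitation * imitation_push)
        # Keep the value a valid probability.
        learner.helpfulness = min(1.0, max(0.0, learner.helpfulness))
    return learner


def overseer_approves(agent: Agent, threshold: float = 0.6) -> bool:
    """Human check on an aggregate statistic, not on individual actions."""
    return mean_reward(agent) >= threshold


def run_generations(n_generations: int = 3):
    npcs = [Agent(helpfulness=0.9)]  # hand-built seed model, assumed "aligned"
    history = []
    for _ in range(n_generations):
        learner = train_one_generation(npcs)
        history.append(mean_reward(learner))
        if overseer_approves(learner):
            npcs.append(learner)  # the graduate becomes an NPC for the next generation
    return npcs, history


if __name__ == "__main__":
    pool, rewards = run_generations(3)
    print(len(pool), [round(r, 3) for r in rewards])
```

The obvious hole is the same one as in the plan itself: the seed NPC and the reward function are assumed to be "aligned" from the start, and the overseer only sees a summary number.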
Strong upvote for including more brain-related discussions in the alignment problem and for the deep perspectives on how to do it!
Disclaimer: I have not read the earlier posts but I have been researching BCI and social neuroscience.
I think there’s a Steering Subsystem circuit upstream of jealousy and schadenfreude; and
I think there’s a Steering Subsystem circuit upstream of our sense of compassion for our friends.
It is quite crucial to realize that these subsystems probably are very differently organized, are at very different magnitudes, and have very different functions in every different human. Cognitive neuroscience's view of the modular brain seems (judging by the research) to be quite faulty, and computational/complexity neuroscience generally seems more successful and is more concerned with reverse-engineering the brain, i.e. identifying neural circuitry associated with different evolutionary behaviours. This should inform how we cannot find "the Gaussian brain" and implement it.
You also mention in post #3 that you have the clean slate vs. pre-slate systems (learning vs. steering subsystems) and a less clear distinction here might be helpful instead of modularization. All learning subsystems are inherently organized in ways that evolutionarily seem to fit into that learning scheme (think neural network architectures), which is in and of itself another pre-seeded mechanism. You might have pointed this out in earlier posts as well, sorry if that's the case!
I think I’m unusually inclined to emphasize the importance of “social learning by watching people”, compared to “social learning by interacting with people”.
In general, it seems babies learn quite a lot better from social interaction than from pure watching. And between the two, they definitely learn better if they have something on which to imitate what they're seeing. There's definitely a good point in the speed differentials between human and AGI existence and I think exploring the opportunity of building a crude "loving family" simulator might not be a bad idea, i.e. have it grow up at its own speed in an OpenAI Gym simulation with relevant modalities of loving family.
I'm generally pro the "get AGI to grow up in a healthy environment" but definitely with the perspective that this is pretty hard to do with the Jelling Stones and that it seems plausible to simulate this either in an environment or with pure training data. But the point there is that the training data really needs to be thought of as the "loving family" in its most general sense since it indeed has a large influence on the AGI's outcomes.
But great work, excited to read the rest of the posts in this series! Of course, I'm open for discussion on these points as well.
Awesome post, love the perspective. I've generally thought along these lines as well and these were some of the most convincing arguments for working in AI safety when I switched ("Copilot is already writing 20% of my code, what will happen next?").
I do agree with other comments that Oracle AI takeover is plausible, but I will say that a strong code generation tool seems to have better chances and, to me, seems to arrive in parallel with conscious chatbots, i.e. there's currently more incentive to create code generation tools than chatbots, and the chatbots that have virtual assistant-like capabilities seem easier to make as code generation tools (e.g. connecting to Upwork APIs for posting a job).
And as you well mention, converting engineers is much easier with this framing as well and allows us to relate better to the field of AI capabilities, though we might want to just add it to our arsenal of argumentation rather than replace our framing completely ;) Thank you for posting!
As I also responded to Ben Pace, I believe I replied too intellectually defensively to both of your comments as a result of the tone and would like to rectify that mistake. So thank you sincerely for the feedback and I agree that we would like not to exclude anyone unnecessarily nor have too much ambiguity in the answers we expect. We have updated the survey as a result and again, excuse my response.
After the fact, I realize that I might have replied too defensively to your feedback as a result of the tone so sorry for that!
I sincerely do thank you for all the points and we have updated the survey with a basis in this feedback, i.e. 1) career = short text answer, 2) knowledge level = which of these learning tasks have you completed and 4) they are now independent rankings.
True, sorry about that! And your interpretation is our interpretation as well; however, the survey is updated to a different framing now.