Thanks for writing this series.
I can see how Approval Reward explains norm-following behavior. If people approve of honesty, then being honest will make people approve of me.
But I'm not totally convinced that Approval Reward is enough to explain norm-enforcement behavior on its own?
For some action or norm X, it doesn't seem obvious to me that "doing X" and "punishing someone who does not-X" are equivalent in terms of earning human approval, unless you already knew that humans punish others who do things they don't like.
If you know that humans punish others who act contrary to norms that the humans value, then you can punish dishonest people to show that you value honesty, and then you'll get an Approval Reward from other humans who value honesty.
But suppose that nobody already knew the pattern that humans punish others who act contrary to norms that the humans value. Then when you see someone being dishonest (acting contrary to honesty), you don't know that "punishing this person for dishonesty will make others see that I value honesty", so you wouldn't expect to get an Approval Reward, and therefore you wouldn't be motivated to punish them (if Approval Reward were your only motivation). And if everyone thinks the same way, then nobody will do any punishments for approval's sake, and so you won't see any examples from which to learn the pattern.
So it seems to me that although Approval Reward can take norm-enforcement behavior that already exists and "keep it going" for a while, it must have taken some other motivation to "get it started". In the case of harmful norm violations, the enforcement could have been caused by Sympathy Reward plus means-end reasoning (as you mentioned in another context). But I think humans sometimes punish people for even harmless norm violations (e.g. fashion crimes), so either that was caused by misgeneralization from harmful violations, or there's some third motivation involved.
I'm not sure about this, though.
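To make the bootstrapping worry concrete, here's a toy simulation. Everything in it is my own made-up assumption (the update rule, the numbers, the premise that an approval-only agent punishes exactly when it has seen punishment get approved); it's just my verbal argument above restated in code, not anything from the post.

```python
import random

N_AGENTS = 20
N_ROUNDS = 200

# Each agent counts how many times it has SEEN someone get approval for
# punishing a norm violator. With zero such observations, it doesn't expect
# approval for punishing, so (on Approval Reward alone) it never punishes.
observed_approved_punishments = [0] * N_AGENTS

def expects_approval_for_punishing(i):
    return observed_approved_punishments[i] > 0

total_punishments = 0
for _ in range(N_ROUNDS):
    violator = random.randrange(N_AGENTS)  # someone violates a norm this round
    for i in range(N_AGENTS):
        if i == violator:
            continue
        if expects_approval_for_punishing(i):
            # The punishment is visibly approved of, so every observer updates.
            total_punishments += 1
            for j in range(N_AGENTS):
                if j != i and j != violator:
                    observed_approved_punishments[j] += 1

# Prints 0: with no prior examples, Approval Reward alone never gets
# punishment started. But seeding a single observation, e.g.
# observed_approved_punishments[0] = 1, is enough for the behavior to spread,
# which is the "keep it going" half of my claim.
print(total_punishments)
```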
I can only see the first image you posted. It sounds like there should be a second image (below "This is not happening for his other chats:") but I can't see it.
In the graphic in section 3.5.2, you mention
Groups that are plausibly aggressive enough to unilaterally (and flagrantly-illegally) deliberately release a sovereign AI into the wild
What existing laws were you thinking this would violate? When you said "into the wild", are you thinking of sending the AI to copy itself across the internet (by taking over other people's computers), which would violate laws about hacking? If the AI was just accessing websites like a human, or if the AI had a robot body and went out onto the streets, I can't immediately think of any laws it would violate.
Is the illegality dependent on the "sovereign" part? Is the illegality because of the actions the AI might need to take to prevent the creation of other AIs, and it's a crime for the human group because they could foresee that the AI would do this?
James Miller discussed similar ideas.
The "ideas" link doesn't seem to work.
About the example in section 6.1.3: Do you have an idea of how the Steering Subsystem can tell that Zoe is trying to get your attention with her speech? It seems to me like that requires both (a) identifying that the speech is trying to get someone's attention, and (b) identifying that the speech is directed at you. (Well, I guess (b) implies (a) if you weren't visibly paying attention to her beforehand.)
About (a): If the Steering Subsystem doesn't know the meaning of words, then how can it tell that Zoe is trying to get someone's attention? Is there some way to tell from the sound of the voice? Or is it enough to know that there were no voices before and Zoe has just started talking now, so she's probably trying to get someone's attention to talk to them? (But that doesn't cover all cases when Zoe would try to get someone's attention.)
About (b): If you were facing Zoe, then you could tell if she was talking to you. If she said your name, then maybe the Steering Subsystem could recognize your name (having used interpretability to get it from the Learning Subsystem?) and know she was talking to you? Are there any other ways the Steering Subsystem could tell if she was talking to you?
I'm not sure how many false positives vs. false negatives evolution will "accept" here, so I'm not sure how precise a check to expect.
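For what it's worth, here's the kind of crude check I was picturing while writing (a) and (b), just to pin down what signals I'm asking about. All of the inputs and the particular combination rule are my own guesses about what a Steering Subsystem could plausibly have access to, not anything claimed in the post.

```python
from dataclasses import dataclass

@dataclass
class SpeechEvent:
    speech_just_started: bool      # nobody was talking a moment ago
    loud_relative_to_ambient: bool
    speaker_facing_me: bool        # e.g. from a face/gaze detector
    contains_my_name: bool         # would need my name from the Learning Subsystem somehow

def someone_wants_my_attention(e: SpeechEvent) -> bool:
    # (a) does this look like attention-getting speech at all?
    attention_getting = e.speech_just_started or e.loud_relative_to_ambient
    # (b) does it look like it's directed at me?
    directed_at_me = e.speaker_facing_me or e.contains_my_name
    return attention_getting and directed_at_me

# Example: Zoe starts talking while facing you.
print(someone_wants_my_attention(SpeechEvent(True, False, True, False)))  # True
```

Obviously the real check wouldn't be a clean boolean formula; this is just to make the questions concrete.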
I couldn't click into this from the front page by clicking on the zone where the text content would normally go, but I was able to click into it by clicking on the reply-count icon in the top-right corner. (But that wouldn't have worked when there were zero replies.)
The UK government also heavily used AI chatbots to generate diagrams and citations for a report on the impact of AI on the labour market, some of which were hallucinated.
This link is broken.
Thank you for writing this series.
I have a couple of questions about conscious awareness, and a question about intuitive self-models in general. They might be out-of-scope for this series, though.
Questions 1 and 2 are just for my curiosity. Question 3 seems more important to me, but I can imagine that it might be a dangerous capabilities question, so I acknowledge you might not want to answer it for that reason.
Yeah, that model sounds plausible to me (pending elaboration on how the friend-or-enemy parameter is updated). Thanks.