In the graphic in section 3.5.2, you mention
Groups that are plausibly aggressive enough to unilaterally (and flagrantly-illegally) deliberately release a sovereign AI into the wild
What existing laws did you have in mind that this would violate? When you said "into the wild", were you thinking of the AI copying itself across the internet (by taking over other people's computers), which would violate laws about hacking? If the AI was just accessing websites like a human, or if it had a robot body and went out onto the streets, I can't immediately think of any laws that would be violated.
Does the illegality depend on the "sovereign" part? That is, is it illegal because of the actions the AI might need to take to prevent the creation of other AIs, making it a crime for the human group because they could foresee that the AI would take those actions?
James Miller discussed similar ideas.
The "ideas" link doesn't seem to work.
About the example in section 6.1.3: Do you have an idea of how the Steering Subsystem can tell that Zoe is trying to get your attention with her speech? It seems to me like that requires both (a) identifying that the speech is trying to get someone's attention, and (b) identifying that the speech is directed at you. (Well, I guess (b) implies (a) if you weren't visibly paying attention to her beforehand.)
About (a): If the Steering Subsystem doesn't know the meaning of words, then how can it tell that Zoe is trying to get someone's attention? Is there some way to tell from the sound of the voice? Or is it enough to know that there were no voices before and Zoe has just started talking, so she's probably trying to get someone's attention in order to talk to them? (But that doesn't cover all the cases in which Zoe might try to get someone's attention.)
About (b): If you were facing Zoe, then you could tell she was talking to you. If she said your name, the Steering Subsystem might recognize it (having used interpretability to get it from the Learning Subsystem?) and know she was talking to you. Are there any other ways the Steering Subsystem could tell that she was talking to you?
I'm not sure how many false positives vs. false negatives evolution will "accept" here, so I'm not sure how precise a check to expect.
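For concreteness, here's a minimal sketch of what such a check might look like if it were a crude boolean heuristic. This is purely my own illustration, not anything from the post; the function, its inputs, and the "name template" set are all invented:

```python
# Toy illustration (invented, not from the post): a crude heuristic for
# "someone just started talking, probably to get my attention", combining
# (a) voice onset after silence with (b) recognizing one's own name.
OWN_NAME_SOUNDS = {"alex"}  # stand-in for a learned acoustic template

def attention_grabbing(voice_active_now: bool,
                       voice_active_recently: bool,
                       heard_sound: str,
                       facing_speaker: bool) -> bool:
    """Flag this speech as a bid for the listener's attention."""
    onset_after_silence = voice_active_now and not voice_active_recently  # (a), crudely
    name_recognized = heard_sound in OWN_NAME_SOUNDS                      # (b), crudely
    # A loose OR accepts more false positives; a strict AND accepts more
    # false negatives -- the precision tradeoff mentioned above.
    return onset_after_silence or name_recognized or facing_speaker

# Example: voice just started after silence, name not heard, not facing her:
print(attention_grabbing(True, False, "hey", False))  # -> True (onset after silence)
```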
I couldn't open this from the front page by clicking on the zone where the text content would normally go, but I could open it by clicking on the reply-count icon in the top-right corner. (That wouldn't have worked when there were zero replies, though.)
The UK government also heavily used AI chatbots to generate diagrams and citations for a report on the impact of AI on the labour market, some of which were hallucinated.
This link is broken.
Thank you for writing this series.
I have a couple of questions about conscious awareness, and a question about intuitive self-models in general. They might be out of scope for this series, though.
Questions 1 and 2 are just for my curiosity. Question 3 seems more important to me, but I can imagine that it might be a dangerous capabilities question, so I acknowledge you might not want to answer it for that reason.
I can only see the first image you posted. It sounds like there should be a second image (below "This is not happening for his other chats:"), but I can't see it.