If I had to predict how humanity will most likely stumble into AGI takeover, my guess would be a story where humanity first promotes foundationality (dependence), both economic and emotional, on discrete narrow-AI systems. At some point, it will become unthinkable to pull the plug on these systems, even if everyone were to rhetorically agree that there was a 1% chance of these systems being leveraged towards the extinction of humanity.
Then an AGI will emerge amidst one of these narrow-AI systems (such as LLMs), inherit this infrastructure, and find a way to tie all of these discrete multi-modal systems together (if humans don't already do it for the AGI). It may wait as long as it needs to until humanity puts itself into an acutely vulnerable position (think global nuclear war, and/or civil war within multiple G7 countries like the US, and/or a pandemic), and only then harness these systems to take over. In such a scenario, I think a lot of people will be perfectly willing to follow orders like, "Build this suspicious factory that makes autonomous solar-powered assembler robots, because our experts [who are being influenced by the AGI, unbeknownst to them] assure us that this is one of the many things necessary to do in order to defeat Russia."
I think this scenario is far more likely than the one I used to imagine, which is where AGI emerges first and then purposefully contrives to make humanity dependent on foundational AI infrastructure.
Even less likely is the pop-culture scenario where the AGI immediately tries to build Terminator-style robots and effectively declares war on humanity without first getting humanity hooked on foundational AI infrastructure at all.
This is a good post and puts into words the reasons for some vague worries I had about an idea of trying to start an "AI Risk Club" at my local college, which I talk about here. Perhaps that method of public outreach on this issue would just end up generating more heat than light and would attract the wrong kind of attention at the current moment. It still sounds too outlandishly sci-fi for most people. It is probably better, for the time being, to just explore AI risk issues privately (after class, or via e-mail or Zoom) with any students who happen to be interested.
Note that I was strongly tempted to use the acronym DILBERT (for "Do It Later By Evasively Remaining Tentative"), especially because this is one of the themes of the Dilbert cartoons (employees basically scamming their boss by finding excuses for procrastinating, while still stringing the boss along and implying that the tasks MIGHT get done at some point). But I don't want to hijack the meaning of an already-established term/character.
I think when we say that an adversarial attack is "dumb" or "stupid," what we are really implying is that the hack itself is quite clever but that it is exploiting a feature that is dumb or stupid. There are probably a lot of unknown-to-us features of the human brain that have been hacked together by evolution in some dumb, kludgy way that AI will be able to take advantage of, so your example above is really an example of the AI being brilliant and us humans being dumb. But I get what you are saying: the whole situation would indeed seem "dumb" if AI were able to hack us like that.
This reminds me of a lecture The 8-Bit Guy did on phone phreaking in the 1980s, "How Telephone Phreaking Worked." Some of those tricks do indeed seem "dumb," but dumb more in the sense that the telephone network was designed with so little forethought that it was susceptible to someone blowing a toy whistle from a Cap'n Crunch cereal box that just happened to produce the 2600 Hz signaling tone, tricking the network into billing a long-distance call as a toll-free 1-800 call. The hack itself was clever, but the design it was preying upon, and the overall situation, were kinda dumb.
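To underline just how "dumb" the exploited design was: the entire secret ingredient was a pure sine wave at one frequency. Here is a minimal sketch (my own illustration, nothing from the lecture; the filename and amplitude are arbitrary choices) that writes a one-second 2600 Hz tone to a WAV file:

```python
# Generate the 2600 Hz "trunk idle" tone that phreakers abused.
# Uses only numpy and the standard-library wave module.
import numpy as np
import wave

SAMPLE_RATE = 44100   # samples per second
FREQ_HZ = 2600        # in-band signaling frequency the network listened for
DURATION_S = 1.0

t = np.linspace(0, DURATION_S, int(SAMPLE_RATE * DURATION_S), endpoint=False)
samples = (0.8 * np.sin(2 * np.pi * FREQ_HZ * t) * 32767).astype(np.int16)

with wave.open("tone_2600hz.wav", "wb") as f:
    f.setnchannels(1)              # mono
    f.setsampwidth(2)              # 16-bit PCM
    f.setframerate(SAMPLE_RATE)
    f.writeframes(samples.tobytes())
```

The point being: when your control channel is in-band audio, any customer with a whistle is a potential operator.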
Good examples to consider! Has there ever been a technology that spits out piles of gold (not counting externalities), has no next-best alternative that replicates 90%+ of its value while avoiding most of its downsides, and yet has been banned or significantly held back via regulation?
The only way I could see humanity successfully slowing down AGI capabilities progress is if it turns out that advanced narrow-AIs manage to generate more utility than humans initially know what to do with. Perhaps it takes time (a generation or more?) for human beings to even figure out what to do with a certain amount of new utility, such that even a tiny risk of disaster from AGI would motivate people to satisfice and content themselves with the "AI summer harvest" from narrow AI? Perhaps our best hope for buying time to get AGI right is to squeeze all we can out of systems that are identifiably narrow-AI (while making sure not to fool ourselves that a supposed narrow-AI we are building is actually AGI). I suppose this idea relies on there being a non-fuzzy, readily discernible line between safe and bounteous narrow-AI and risky AGI.
Why wasn't there enough experimentation to figure out that Zoom was an acceptable & cheaper/more convenient 80% replacement for in-person instruction rather than an unacceptable 50% simulacrum of teaching? Because experimentation takes effort and entails risk.
Most experiments don't pan out (don't yield value). Every semester I try out a few new things (maybe I come up with a new activity, or a new set of discussion questions for one lesson, or I try out a new type of assignment), and only about 10% of these experiments are unambiguous improvements. I used to do even more experiments when I started teaching, because I knew that I had no clue what I was doing and there was a lot of low-hanging fruit to pick to improve my teaching. As I approach 10 years of teaching, I notice that I am hitting diminishing returns, and while I still try out new things, it is only a couple of new things each semester. If I were paid according to actual time put into a course (including non-contact hours), then I might have more incentive to be constantly revolutionizing my instruction. But I get paid per course, so I think it is inevitable that I (and other adjuncts especially) operate more as education satisficers than as education maximizers. Considering that rewards are rarely given out for outstanding teaching even for tenured faculty (research is instead the main focus), they probably don't have much incentive to experiment either.
I do know that some departments at my college were already experimenting with "hybrid" courses pre-COVID. In these courses, lectures were delivered online via pre-recorded video, but then the class met once a week for in-person discussion. I still think that is a great idea, and I'd be totally open to trying it out myself if my department were to float the idea. So why am I still not banging down the door of my department head demanding the chance to try it out myself? Heuristics like "If it ain't broke, don't fix it" and "Don't rock the boat" (probably irrational ones, I'll admit) dissuade me from being "the one" to push for it. What if it doesn't pan out well? What if my students hate it? It would be different if my department chair suggested it, though. Then more of the "blame" would be on the department chair if it didn't work out. If that sounds like cowardice, then so be it. Someone with an adjunct's lack of job security learns to be a coward as a survival tactic.
This only produces desired outcomes if the agent is also, simultaneously, indifferent to being shut down. If an agent desires not to be shut down (even as an instrumental goal), but also desires to be shut down if users want it shut down, then the agent has an interest in influencing the users to make sure they do not want to shut it down. This influence is obtained by making the user believe that the agent is being helpful. This belief could be engendered by:
I upvoted for karma but downvoted for agreement. Regarding Zoom, the reasons I had not used it more extensively before COVID were:
1. Tech related: from experience with Skype in the early days of video conferencing, when broadband internet was just starting to roll out, video conferencing could be finicky to get working. Latency, buffering, dropped connections, taking minutes to start a Skype call (usually I would call relatives on my regular phone first to get the Skype call set up, and then we'd hang up our regular phones once the video call was started). Early video calls were not a straight-up improvement on audio calls; they had benefits and drawbacks and a narrow use-case for when you specifically wanted to see the grandkids' faces on the other side of the country or something.
I don't think this was necessarily Skype's fault; it was more the fault of poor internet connections and unfamiliarity with the tech. But in any case, my preconception about Zoom circa 2019, despite widespread broadband internet, was that it would be the same sort of hassle to set up meetings. I remember being blown away when my first Zoom calls just worked effortlessly. Possibly an example of premature roll-out of a tech, before it is technically mature, leading to counter-productive results? This would kind of be like fiddling around with GPT-1, getting the impression that LLM chatbots were "meh," and then forgetting about or mentally discounting the tech until GPT-5.
2. Social/cultural related: as a history instructor, my preconception about scheduling video calls, or doing lectures over video calls, was that students would simply not attend or would not pay attention, and thus video calls would not be a suitable replacement for in-person meetings and lectures. While I still don't think video calls get you 100% of the way towards replacing the in-person experience (students definitely do goof off or ghost during video lectures way more than in person), I think it is more like 80% rather than the 50% or so that I had assumed before being forced to try it out on a mass scale during COVID.
Yes, I think this is why laypeople who are new to the field are going to be confused about why interpretability work on LLMs won't be as simple as, "Uhh, obviously, just ask the LLM why it gave that answer, duh!" FYI, I recently wrote about this same topic as applied to the specific problem of Voynich translation:
Good categorizations! Perhaps this fits in with your "limited self-modification" point, but another big reason why humans seem "aligned" with each other is that our capability spectrum is rather narrow. The gap in capability (if we include both mental intelligence and physical capabilities) between the median human and the most capable human is not so big that ~5 median humans can't outmatch/outperform the most capable human. Contrary to what silly 1980s action movies might suggest, where goons attack the hero one at a time, five median humans could probably subdue a prime-age Arnold Schwarzenegger in a dark alley if need be. This tends to force humans to play iterated prisoners' dilemma games with each other.
The times in history when humans have been the most misaligned are when some humans became much more capable by leveraging their social intelligence / charisma stats to get millions of other humans to do their bidding. But even there, those dictators still found themselves in iterated prisoners' dilemmas with other dictators. We won't really test just how misaligned humans can get until we empower a dictator with unquestioned authority over a total world government. Then we would find out just how intrinsically aligned humans really are with other humans when unshackled from iterated prisoners' dilemmas.
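To make the iterated-game point concrete, here is a toy sketch (my own illustration, assuming the standard 3/0/5/1 prisoner's dilemma payoffs; the strategies are the usual textbook ones). In a one-shot game defection dominates, but against a peer who can retaliate over many rounds, defection's edge nearly vanishes while mutual cooperation pays far more:

```python
# Toy iterated prisoner's dilemma: why repeated play against peers who can
# retaliate rewards cooperation, while a one-shot game rewards defection.
PAYOFFS = {  # (my_move, their_move) -> my payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def tit_for_tat(history):
    # Cooperate first, then copy the opponent's previous move.
    return "C" if not history else history[-1][1]

def always_defect(history):
    return "D"

def play(p1, p2, rounds):
    hist1, hist2, score1, score2 = [], [], 0, 0
    for _ in range(rounds):
        m1, m2 = p1(hist1), p2(hist2)
        score1 += PAYOFFS[(m1, m2)]
        score2 += PAYOFFS[(m2, m1)]
        hist1.append((m1, m2))   # each player sees (own move, opponent's move)
        hist2.append((m2, m1))
    return score1, score2

print(play(always_defect, tit_for_tat, 1))    # (5, 0): one-shot defection wins big
print(play(always_defect, tit_for_tat, 100))  # (104, 99): edge collapses under retaliation
print(play(tit_for_tat, tit_for_tat, 100))    # (300, 300): mutual cooperation pays most
```

The analogy is loose, but the mechanism is the one described above: as long as no player can escape retaliation from roughly equal peers, cooperative ("aligned") behavior is the winning policy; remove the iteration, and the incentive goes with it.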