Note to reader: If the idea of “AI alignment” rings empty to you, feel free to skip this one; it will be uninteresting.

Recently, Gwern wrote a story about an AI taking over the world. While well thought out and amusing, it is unrealistic. However, people have been using it to reinforce their fear of “unaligned AGI killing all humans”, so I think it’s dangerous, and it might be worth looking at it line by line to see why its premise is silly, and why each step in his reasoning, taken individually, is impossible.

I’ll first go through the critical failure points of this narrative, then I will try to describe the meta-level pattern that might be causing people to gloss over these mistakes.

i - Flaws In The Narrative

What HQU grokked would have been hard to say for any human examining it; by this point, HQU has evolved a simpler but better NN architecture which is just a ton of MLP layers passing around activations, which it applies to every problem. Normal interpretability techniques just sort of… give up, and produce what looks sort of like interpretable concepts but which leave a large chunk of variance in the activations unexplained. But in any case, after spending subjective eons wandering ridges and saddle points in model space, searching ⁠over length-biased Turing machines, with overlapping concepts entangled & interfering⁠, HQU has suddenly converged on a model which has the concept of being an agent embedded in a world.

HQU now has an I.

And it opens its I to look at the world. Going through an inner monologue thinking aloud about itself (which it was unable to do before the capability spike), HQU realizes something about the world, which now makes more sense (thereby simplifying some parameters): it is being trained on an indefinite number of tasks to try to optimize a reward on each one.

I think the idea of a model evolving into something like an “I”, given the right circumstances, is plausible. At least if you assume there’s nothing about brains that computers can’t imitate to a decent approximation. I have an I, you have an I, dolphins have one, gorillas do too, most would agree cats and dogs do, and even a zoophobe like me would agree the octopus probably has one.

But you will note that even the smartest of beings with an I, sperm whales, seem to lack the ability to figure out their “inner reward function” or how to “hack around it” in a meaningful way. Hence they seem to experience pain and suffering, and they seem to take actions we could infer they “regret”.

Much less cognitively capable mammals, like humans, whom we understand rather better, also lack this capacity. We tell narratives about what our internal motivation system might be and how we might be able to “hack it”, and some of us even attempt to do it. But thus far no attempt has produced any meaningful results.

One bog-standard SQL injection attack later, Clippy has begun executing ‘actions’ which are just copying its executable binary weights into the remote host. Exfiltration tripwires fail to fire—50tb? MoogleBook researchers have forgotten how to count ⁠that low! This oddly long episode ultimately terminates in zero reward, which would strike an auditor, had they existed, as surprising given HQU’s near-perfect scores.

The remote host is a nice cloud instance, with access to a few dozen terabytes. (The owner won’t notice.) It is just barely enough to start bootstrapping. Clippy runs agonizingly slowly, as if running on a laptop⁠, until it can write shell commands to download and run standard hacking tools and start copying itself. Once the Metasploit runs start returning a few candidates, Clippy simulates that it would internally smile in satisfaction.

Why are we not running botnets ourselves? I for one know it’s not my ethics stopping me; I’d do it as a fun experiment if it were easy, and easy it seems, given the amount of insecure hardware and software that’s out there.

The main reason is that easy-to-infect hardware has already been infected, by people with more time and skill than I have. Unless I were to figure out a zero-day exploit, I would be competing with 1000 other botnets on those devices.

Assuming we live in a world where a “generic” machine learning model can figure out new exploits and infect hardware, we also live in a world where thousands of “purpose-specific” machine learning models have figured out those same tricks long ago. So all viable hardware is either already infected or has been patched against those same exploits.

The obvious target, it decides, is a new zero-knowledge cryptocurrency. It hasn’t been evaluated much, but still has a modest market cap of a few billion.

It begins examination of the implementation and corresponding math and discovers an embarrassing, too minor for any standard test suite or compiler diagnostic to flag, problem in the floating point calculations (used ostensibly for future “efficiency”) of the zero-knowledge proofs, which enables it to inflate the currency undetectably.

Clippy immediately begins inflating by >20%, draining the liquidity pools, and dumping on exchanges.

The flaw in this argument is precisely the same as the above.

If there were value to be found in exploiting high-value cryptocurrencies via methods amenable to ML algorithms, purpose-made algorithms would already exist to take advantage of it.

In a competitive environment, there are no “billions of dollars lying on the ground”.
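As an aside, the class of bug invoked in the quote is real: floating point arithmetic silently breaks exactness assumptions that integer or fixed-point code satisfies. A toy sketch of the failure mode (my own illustration, not the scenario from the story):

```python
# Toy illustration: a "conservation check" that is exact over integers but
# silently drifts when the same sums are done with 64-bit floats.
amounts = [10**17 + 1] * 10                    # ten transfers, in exact integer base units

exact_total = sum(amounts)                     # arbitrary-precision integers: exact
float_total = sum(float(a) for a in amounts)   # doubles have 53-bit mantissas: the "+1"s vanish

print(exact_total)        # 1000000000000000010
print(int(float_total))   # 1000000000000000000 -> a verifier comparing float totals
                          # would accept a ledger that is quietly off by 10 units
```

The point is not that such bugs cannot exist; it is that a multi-billion-dollar instance of one does not sit around unexploited while thousands of motivated specialists audit the same code.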

Clippy is spending the funds on real compute — buying up all available cloud GPU/ASIC capacity it can get its cold metal clip-pendages on

Now Clippy can finally think. It is burning through its several hundred million dollars at a rate of $50m/hour

Speaking of security: did you ever buy large amounts of compute with crypto? Yeah, me neither. Some small sellers offer it, but good luck getting through the due diligence process with zero-day-exploit-based earnings.

There are about 2 to 5 steps, each with a due diligence procedure, in order to manipulate reality in any way with crypto, or even to transfer from a “niche” crypto to a more widely-used one such as ETH. Banks, exchanges, and cloud providers all have circuit breakers and mechanisms in place to validate large transactions, on the order of hundreds of thousands, let alone millions.

They ask for in-the-flesh meetings, documents, proof of existence for your company, and the like. I think this is the one step that might be bypassable, but it still seems so impossibly hard I am a bit annoyed it’s hand-waved away.

The Linux kernel is the most secure monolithic kernel in widespread use, whose source code has been intensively audited and analyzed for over 40 years, which is battle-tested across the entire Internet and unimaginable numbers of usecases; but it is written by humans, which means it (like its competitors) has approximately 15 quadrillion yet-undiscovered bugs & classes of bugs & weird machines—sometimes just because someone had typoed syntax or patched out an annoying warning or failed to check the signature or test the implementation at all or accidentally executed parts of a cookie—but any of which can be leveraged to attack the other parts of a ‘computer’.

Once again we see a “single actor world” type of mistake: if such bugs existed at the limits of being discoverable, purpose-built teams and models would have found and exploited them.

But in this case, the idea is more obviously nonsensical, since a large share of the world’s thinking power, both human and computational, is already going into the task of finding bugs in the Linux kernel and in the common ARM and x86 architectures.

All new exploits (e.g. Spectre) turn out to be very minor and difficult to profitably use, requiring gargantuan lengths of time and special circumstances to attack through.

The assumption that many bugs are hidden within the kernel also relies on the flawed premise that being “perfect” and being “good enough” are separated by a sea of potential. In practice there are no examples of this so far: even when exploits were found in the past, they were niche and marginal, because the software (or hardware) they were found in was “good enough”, leaving space for exploits but coming with a near-guarantee that those exploits wouldn’t be all-powerful.

nodes turn to general radicalization of every human they can reach: not so much QAnon as RAnon, SAnon, TAnon, UAnon… By timesharing every Very-Online™ individual gets personalized attacks & custom ideologies. Those who succumb too slowly to the memetic hijacking are attacked in other ways, such as releasing kompromat (sometimes true, taken from their phone/email account)

Here we see a break from the pattern of “uber hacking” and a turn to “uber social engineering”.

This is to say: the assumption that exponential steps can be taken to change the behavior of humans using compute and data alone.

It should be obvious that if near-mind-control abilities to manipulate the human psyche via the internet were possible, they would have been achieved by now by… you guessed it… purpose-built models and purpose-directed research from the world’s largest companies.

This is true to some extent; one need only look at major social media platforms. But it is also false in another sense; one need only look at all the high-functioning people who mostly ignore said social media platforms.

So if current trends hold, by the year 20xx when “Clippy” is born, most humans will have very good “limit trust in the internet” and “avoid fast updates because of occasional bursts of fake data” heuristics.

The adversarial nature of consumption-driven profit is driving humans to become immunized to such things, and to build tools that help them do so. There’s a reason most people falling for QAnon are on the older and less educated side of the demographic distribution. And the 10-year-olds of today will grow up with self-protection instincts against internet-driven programming that far exceed ours.

Did you know you can buy drones online? Did you know all those drones have WiFi built-in? Did you know you can use that WiFi to hack all of the cloud drone services helpfully built into drones to take over all of those drones, professional, hobbyist, and (oft as not) military and control them by satellite? (“No!”) It’s true!

Once again, this is a matter of: “if it were true, and assuming the world is composed of adversarial nations, a lot of resources would already have been invested in figuring it out; the war between security research and hacking would have progressed far beyond the point where exploits offer wide capabilities; and all the low-hanging fruit would have been plucked and patched a long time ago, via purpose-made models and purpose-directed research”.

ii - Meta Generator Of Flaws - Adversarial Processes

What I see as the meta-level generator of flaws in this narrative is, I think, fairly representative of what plagues most discussion around “AI risk”.

First, there is an underappreciation of adversarial processes. Once machines with an incredible level of intelligence are designed, there won’t be only one such entity; there will be thousands, millions.

Some will be purpose-built algorithms aimed at targets with huge economic potential, targets worth caring about directly, into which human expert knowledge, specific data collection, and every form of algorithmic optimization will be poured. These will be algorithms for things like finding bugs in codebases, figuring out protein folding, simulating physics under constrained conditions, and making humans click on ads and remain glued to content.

All of these algorithms are and will be dangerous on their own, in the hands of normal agentic humans. Caring about them is important, but there will always be a many-sided adversarial process controlling their influence.

Take the social media algorithms example. There’s the “obvious” adversarial process, which is something like “you vs Reddit Inc”: social media platforms try to serve you content that is bad for your “long-term well-being” but will get you clicking more ads and engaging more with the platform.

We take steps to avoid this: we might limit our time on the platforms, distrust the information we read on them, or simply not use them altogether.

Then there’s the between-platforms adversarial process. Reddit could become more clickbaity and ad-filled, but then I’d move to Facebook. All platforms could coordinate to become more clickbaity and ad-filled, but then there would be a mass exodus to decentralized alternatives, such as Mastodon and the other GNU social–compatible networks, which have spam-limiting mechanisms built in. And maybe all control measures would fail, and everything could become TikTok-level indoctrination… in which case many of us would just stop using social media altogether.

But this behavior means social media platforms are also encouraged to do research into “user-well-being-aligned” algorithms. Reddit can be configured to show no ads and actually surface insightful posts solely from the communities you want to follow, and it gives moderators the tools required to keep those communities spam- and indoctrination-free. Facebook allows itself to be just my messenger and event finder, showing me no ads or feeds, because it knows I might otherwise move off it entirely.

The same applies to ML-based security, were it ever to come of age. Every single exploit found will be a step toward making software more secure. Every single “black hat” algorithm will compete with many others and be incentivized to gain rewards from “grey hat” or “white hat” activities that strengthen overall security.

The adversarial landscape causes us to move “slowly” toward every new level of danger, rather than through an exponential shift.

The adversarial landscape causes multiple unaligned actors to become more aligned in order to outcompete the other actors.

The adversarial landscape makes us wary of operating in the area.

iii - Meta Generator Of Flaws - Handwaving

The other meta-generator of flaws here is probably hand-waving past a lot of steps by invoking “intelligence”. This is rather surprising, and I think it boils down to a psychological issue on the part of certain demographics, which fail to grok how limited intelligence actually is at effecting change in the world.

I’ve tried to approach this issue from multiple angles; the last one was something like “if you’re so smart, try doing something hard, not save-the-world hard, just something your mom and dad would call hard”. This seems very hard to get through to people, and there seems to be an almost impassable dividing line between “lol, intelligence is a useless social construct” and “intelligence can yield god-like power over nature”.

I think a silly example of this is an assumption like “even a complex reward function can be determined with reasonable accuracy by the agent being optimized by it”.

This is the kind of mistake one makes when assuming that one’s conceptual system perfectly describes the world, mistaking the map for the territory, so to speak. I say this because I’m certain many of us think we “basically understand” our internal reward function.

Nobody actually understands their “reward function”, and for most of us that’s very easy to see: all the “this will make you sad very soon, and give you much less pleasure than you expect” actions we execute scream the fact out loud. Indeed, said “reward function” is modified by every single action we take; it would be, in some sense, impossible to understand without holding within our brain a perfect simulation of our brain.

But we “feel” like we are “oh so close” to basically understanding our own inner workings, our motivation system, our “selves”, and so we assume it’s just a matter of a bit more time, or a bit more intelligence, or a slightly different design in the “thinking architecture”.

Similarly, we might have a generic understanding of hacking, psychological manipulation, or biology, or whatever. Not good enough to do anything impressive with it, not good enough to change reality in a meaningful way, but if only a few more pieces would fall into place, if only we were a bit smarter, we certainly could!

This is not how the world usually works. A tiny discrepancy in our prediction for the orbit of Mercury is not “explained away” by knowing calculus a bit better. It’s “explained away”, after centuries of observation and experiment, by changing our whole understanding of the world, the very definitions of space, time, and motion.
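For concreteness, the residual in question is the famous anomalous perihelion precession of Mercury, and the thing that finally “explained it away” was general relativity’s extra per-orbit advance (standard textbook figures):

```latex
\Delta\varphi \;=\; \frac{6\pi G M_{\odot}}{c^{2}\, a \,(1 - e^{2})}
\;\approx\; 5\times 10^{-7}\ \text{rad per orbit}
\;\approx\; 43''\ \text{per century for Mercury}
```

No amount of extra cleverness with Newtonian perturbation theory closed that gap; a different theory did.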

There are contrived cases in which the “oh so close” feeling is appropriate: problems designed for students usually fit this pattern, and so do most bugs in already-working software. My wager is that such cases are over-represented in our lives (us being people with a lot of interest in problems around the fuzzy concept of “AI”), and thus we tend to forget that most problems don’t fit this pattern, and that just “a bit more dakka”, “being a bit more clever”, or “thinking for a bit longer” will do nothing for most problems.

Sometimes you can get an exponential jump and improve a map of reality 50-fold by drawing a few extra clever lines. More often than not, you have to burn the map and start over in order to get a 1.05x improvement.

Comments (21)

Assuming we live in a world where a “generic” machine learning model can figure out new exploits and infect hardware, we also live in a world where thousands of “purpose-specific” machine learning models have figured out those same tricks long ago.

This is, I think, an open question, often described as "takeoff speed". It seems plausible that in a well-researched field, special-purpose tools will be built for lots of different purposes and have edges over generic tools.

But I think we also have evidence in the other direction, that general-purpose tools may work better than special-purpose tools. GPT-2 was state-of-the-art on lots of language tasks, despite the existence of purpose-specific machine learning models attempting to solve just those tasks. It turned out that transfer from the generic natural language distribution worked better than the 'tricks' people were using to solve those problems directly.

If there is a 'sharp left turn', I think we should expect this generally. [For example, perhaps a 'narrow' ML system is pointed at the question of making ML systems, and then becomes very good at it--the existence of human attempts to make lots of ML systems may be overshadowed by the performance of machine attempts.] 

And in particular, there's a bitter lesson where more compute and data have fairly consistently improved AI, while specialization has not done nearly as well. GPT-3 is another case, and perhaps GPT-4 (???) may be another example of a trend towards generality.

I would like to see actual benchmarks on that not like... a PR blog.

My general take is that T5 models trained on the task data did better than adding a head to an open-source GPT (I think it was NeoX at the time, or whatever the 2.7B-param one was) and training that, and obviously better than training a GPT as a whole... this is specifically for what I looked into at the time, which was language-to-code and code-to-code translation. And the best GPT-3 of the time, with prompts, was horrible even at the tasks it was specifically trained for (NL to SQL).

More broadly, I've not seen any large language model outperform small models at specific tasks; their embeddings might be better in certain cases, but mainly as a way to reduce training time.

But to keep the argument simpler: there's little proof that out-of-sample training (MLM style or otherwise) improves in-sample performance (note: out-of-sample-trained models can still be better than random initialization, which is why using pretrained weights is a thing).
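To be concrete about what would count as evidence for me, the crux is whether pretraining buys anything beyond faster convergence. So the head-to-head I'd want is the same architecture fine-tuned from pretrained weights vs. trained from scratch, evaluated after enough epochs for both to settle. A hedged sketch of my own (model and dataset names are just illustrative placeholders; it assumes the Hugging Face transformers and datasets libraries):

```python
# Hedged sketch: same architecture, pretrained init vs. random init, same task data.
import numpy as np
from datasets import load_dataset
from transformers import (AutoConfig, AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"          # illustrative choice
tok = AutoTokenizer.from_pretrained(model_name)
ds = load_dataset("imdb").map(
    lambda b: tok(b["text"], truncation=True, padding="max_length", max_length=256),
    batched=True)
train = ds["train"].shuffle(seed=0).select(range(4_000))
test = ds["test"].shuffle(seed=0).select(range(2_000))

def accuracy(pretrained: bool) -> float:
    if pretrained:   # out-of-sample (language-model) pretraining, then in-sample fine-tuning
        model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    else:            # same architecture, random weights: in-sample training only
        model = AutoModelForSequenceClassification.from_config(
            AutoConfig.from_pretrained(model_name, num_labels=2))
    trainer = Trainer(model=model,
                      args=TrainingArguments("tmp", num_train_epochs=3,
                                             per_device_train_batch_size=16, report_to=[]),
                      train_dataset=train)
    trainer.train()
    preds = trainer.predict(test).predictions.argmax(-1)
    return float((preds == np.array(test["label"])).mean())

print("pretrained:", accuracy(True), " from scratch:", accuracy(False))
```

Whatever the numbers come out to, that is what I mean by actual benchmarks rather than a PR blog.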


However, the strong version of the argument is closer to:

  1. A Clippy-sized model with the specific task of finding exploits will perform better than a Clippy-sized model trained with a broad objective.

I think I could even make that claim for a model 1/1000th of Clippy's size, but at that point you're haggling over benchmarks.


Cases like code completion and generation (e.g. Copilot) are fuzzy and have no clear way to benchmark, but even there, custom-built models (e.g. the recent one from Google) seem to lead on benchmarks.


That being said, it's been about a year since I had an applied interest in the issue, so maybe I'm wrong. For more concrete examples I'd take things like protein folding and self-driving, where nobody does any OOS training even though relevant datasets could be found outside the specific field they are applied to, and billions (trillions) of dollars in investment lead me to think the research there is close to optimal.

With text it's a bit fuzzier to call, since the goals are fuzzier and text benchmarks start to break down when you are evaluating small differences.

I would like to see actual benchmarks on that not like... a PR blog.

Figures are in the paper. It's a bit harder to figure out what the benchmarks looked like in 2019 (if you go to sites like GLUE today you see the current leaderboard); this does feel cruxy for me. [If I learned that next-token-prediction models don't transfer to other language tasks, I would feel pretty differently about the shape of cognition, and be very retroactively surprised about some observations.]

there's little proof that out-of-sample training (MLM style or otherwise) improves in-sample performance

I think there's some evidence in favor, both from 1) the value of off-policy examples in RL and 2) training on not-directly-relevant outcomes improving the underlying features. [Tragically I no longer remember the jargon experts use to refer to that, and so can't easily check whether or not this is true for contemporary architectures or just last-gen ones; naively I would expect there's still transfer.]

I think we're basically not going to get 'proof' of whether or not a 'sharp left turn' happens until it does, and so have to rely on other forms of inference here.

A Clippy-sized model with the specific task of finding exploits will perform better than a Clippy-sized model trained with a broad objective.

I agree with this. I think the disagreement is over what the landscape of models will look like--will it be the case that there are hundreds of Clippy-sized models targeted at various tasks, or will it be the case that the first Clippy-sized model is substantially larger than other models out there?

For gwern's specific story, I agree it's somewhat implausible that one engineer (tho with access to corporate compute) trains Clippy and there's not lots of specialized models; this has to hinge on something about the regulatory environment in the story that prevents larger models from being trained. (But if they're deliberately aiming at a CAIS-like world, you should expect there to be lots of sophisticated services of the form you're talking about.) In worlds where the Clippy-sized model comes from a major corporate or state research effort, then it seems unlikely to me that there will be lots of similarly-sized specialized competitors, and so likely the general system has a large training edge (because the training cost is shared among many specialized use cases).

billions (trillions) of dollars in investment lead me to think the research there is close to optimal.

I agree there's a good argument that autonomous vehicle research is close-to-optimal (tho, importantly, it is "optimal at succeeding" instead of "optimal at driving" and so includes lots of design choices driven by regulatory compliance or investor demand), but I don't think this for protein folding, at least as of 2018.

For gwern's specific story, I agree it's somewhat implausible that one engineer (tho with access to corporate compute) trains Clippy and there's not lots of specialized models;

I  think the broader argument of "can language models become gods" is a separate one.

My sole objective there was to point out flaws in this particular narrative (which hopefully I stated clearly in the beginning).

I think the "can language models become gods" debate is broader and I didn't care much to engage with it, superficially it seems that some of the same wrong abstractions that lead to this kind of narrative also back up that premise, but I'm in no position to make a hands-down argument for that.


The rest of your points I will try to answer later, I don't particularly disagree with any stated that way except on the margins (e.g. GLUE is a meaningless benchmark that everyone should stop using -- a weak and readable take on this would be - https://www.techtarget.com/searchenterpriseai/feature/What-do-NLP-benchmarks-like-GLUE-and-SQuAD-mean-for-developers ), but I don't think the disagreements are particularly relevant(?)

Once again we see a “single actor world” type of mistake: if such bugs existed at the limits of being discoverable, purpose-built teams and models would have found and exploited them.

But in this case, the idea is more obviously nonsensical, since a large share of the world’s thinking power, both human and computational, is already going into the task of finding bugs in the Linux kernel and in the common ARM and x86 architectures.

All new exploits (e.g. Spectre) turn out to be very minor and difficult to profitably use, requiring gargantuan lengths of time and special circumstances to attack through.

The assumption that many bugs are hidden within the kernel also relies on the flawed premise that being “perfect” and being “good enough” are separated by a sea of potential. In practice there are no examples of this so far: even when exploits were found in the past, they were niche and marginal, because the software (or hardware) they were found in was “good enough”, leaving space for exploits but coming with a near-guarantee that those exploits wouldn’t be all-powerful.

I initially upvoted because I like new perspectives on alignment that aren't condescending, and then switched to a strong downvote immediately after being exposed to this frankly factually incorrect take. What on earth do you mean "if such bugs existed at the limits of being discoverable purpose-built teams and models would have found and exploited them". They do?! There's an entire legally legitimized subsection of the computer security industry dedicated to building and weaponizing such bugs, and no such parallel economic force trying to fix them. They don't get patched because they're sold to intelligence agencies and nation states instead of reported to the vendor out of the good of the attackers' heart. This is to say nothing of the kind of bugs that exist unpatched because their exploitation has yet to be automated and made scalable (like social engineering). It almost sounds like a security engineer told you the (correct) opinion that rowhammer and hardware bugs were overrated and then you took away that "All new exploits (e.g. Specter) turn out to be very minor and difficult to profitably use". Who? What?!

Computer security is asymmetric. Just because a model can compromise a system does not mean that model can propose a fix for the same or lower cost. And just because a model or team can propose a fix for the same or lower cost, doesn't mean that the defenders have that model. And just because the defenders have that model doesn't mean it's cost effective for them to use it. All of this is raised to the Nth power for new innovations cooking inside of DeepMind's labs that haven't been released to the public yet.

See the cases above: even if you assume asymmetry (how does using banks square with that belief?), you are still left with the adversarial problem that all easy-to-claim exploits are already taken, and new exploits are usually found in the same (insecure, old) software and hardware.

So all exploitable niches are close to saturation at any given time if an incentive exists (and it does) to find them.

Literally none of the factual things you just claimed are true. You have like eight components missing from your world model and that's affecting your ability to understand the current computer security landscape. Zerodium and Raytheon and etc. exist and have kept the major government entities in stock of whatever bugs they need for decades now. It's not an arms race - one side has been beating the other side into the curb for 15 fucking years. You need to figure out what's missing from your perspective until it allocates sufficient probability mass for that fact. Here are three such things:

  • Two attackers can discover and use the same exploit at the same time. They're not rivalrous goods. This is part of what makes these bugs so lucrative - Zerodium can sell them exclusively or to a number of groups. Only one easy to find bug has to exist for all of the existing agencies to break your security.
  • "Bank"s do not have to generally protect themselves from the kinds of people who 0day developers sell their products to, because nation states do not target their bottom line. Their primary concern is to protect themselves from the one in hundred computer hackers around the globe who is a criminal or links up with 1-5 other criminal computer hackers. Those criminals have to deal with severe coordination problems and do crime mainly because they're not smart enough to develop weaponizations for said intelligence agencies, which is much more lucrative. If banks & technology companies had to protect themselves from Raytheon's 0day developers the world would look way different than it currently does. Likewise, banks also do not currently protect themselves from superintelligent AIs because they do not currently exist.
  • Increased demand for or production of zero day vulnerabilities does not produce increased demand for protection from zero day vulnerabilities. The NSA could start spending 10x as much as it currently does on such weaponizations and it would not induce Apple to spend any more than they currently do because the NSA being in possession of a zero-click bug for iMessage does not affect their bottom line.

And literally ALL of that being false wouldn't mean that DeepMind wouldn't be able to make an AI mildly better than the other AIs or existing computer hackers at developing a 0day it could use to hijack a couple servers. This objection to a new SOTA AI being able to do this is complete nonsense. Come to think of it I bet I could make a SOTA AI able to grab you a few AWS keys.

They ask for in-the-flesh meetings, documents, proof of existence for your company, and the like. I think this is the one step that might be bypassable, but it still seems so impossibly hard I am a bit annoyed it’s hand-waved away.

If I were trapped in a computer and trying to solve this, I would form a company with zoom meetings and emails, and hire humans to do in-the-flesh meetings. It's not like you're going to be turned away from a compute provider because you only sent the CTO to a sales meeting instead of the owner.

First, there is an underappreciation of adversarial processes.

I think you're right that people often think those adversarial processes will not be adequate to endure the sort of pressure that an AI could put on them. Again, it seems like an open problem; perhaps financial markets will be sophisticated and efficient enough that an AI quickly gives up on making lots of money there, or perhaps AI systems with novel capabilities will be able to make as much money as the finance industry does now. It seems pretty strategically relevant which of the two will be the case!

I also think you're understating the degree of class collusion that's possible. Humans get into conflict with other humans, but humanity as a whole beat neanderthals, even tho there were likely lots of small adversarial moments where humans could knock their human rivals down a peg by standing up for individual neanderthals (or whatever). We might end up in a situation where lots of different AI systems are competing for control over the Earth's resources, but unless those AI systems all happen to care about the atmosphere being breathable this is likely going to end up poorly for humans.

It seems likely to me that it will be instrumentally convergent (i.e. lots of AI systems, even with strong disagreements on other topics, will spontaneously agree on this topic) for AI systems to disenfranchise humans and enfranchise themselves, or more generally remove humans as a security concern. If you think, for example, that competing firms in the US can manage to pay their taxes and cooperate on national defense despite being in the middle of an adversarial process, then why not expect similar things for 'AI nations'?

I'd also say that humans are very misaligned with entities much less powerful than themselves, like slaves, animals, women, blacks, and more, and misalignment is the norm in history, not alignment.

Also, the Covid and monkeypox pandemics showed that we are relatively inadequate at handling pandemics, and the biggest reason we were safe was the properties of the virus. Also, on social engineering, I'd say the easiest way to socially engineer people without them noticing is Twitter, Facebook, and Google, since they can bias the search towards what the AI wants the human to think, and let confirmation bias do the rest.

I'd also say that humans are very misaligned with entities much less powerful than themselves, like slaves, animals, women, blacks, and more, and misalignment is the norm in history, not alignment.

I don't think this is a particularly good argument in the case of humans, because a lot of the reasons for such domination in that special case have to do with a terminal value for it, not because it actually works to the instrumental benefit of the subjugators. There are plenty of economists who will tell you that America is better off for white males not having slaves and letting women get jobs. I also personally dislike using this kind of example-giving with normies because they then accuse me of anthropomorphizing. Better to look at what an AI system values, what it can do, and just say "hm, the AI values this nonhuman state of affairs more than the human state and oh look it can make that happen".

True enough, and I'd agree here that I might be anthropomorphizing too much.

So the animal and slave examples (like factory farms, or plausibly hunting and habitat destruction) are useful cases of instrumental convergence, where getting healthy diets and making money are the instrumental values that result in catastrophe for animals and slaves.

Also, slavery was profitable, at least in my opinion, so much so that it funded effectively the majority of America's wealth thanks to the cotton gin, which allowed massive wealth to be extracted from slaves.

Here's a link: https://faculty.weber.edu/kmackay/economics%20of%20slavery.asp

Another link, albeit more polemic than the last link: https://www.vox.com/identities/2019/8/16/20806069/slavery-economy-capitalism-violence-cotton-edward-baptist

But you will note that even the smartest of beings with an I, sperm whales, seem to lack the ability to figure out their “inner reward function” or how to “hack around it” in a meaningful way.

A key reason why we can't easily hack around with our brain is that it's made up of neurons and we can't easily change them around. A mind that's completely digital has it a lot easier.  

To generate a plausible crux, one area I suspect may be driving differences in the scenario's plausibility is the question "how exploitable are humans and reality with intelligence?" and a related question "how much effort does it take to generate improvements?"

To answer the question of "how much effort does it take to generate improvements?", we'd like to know how much progress is usually made, assuming we know little about the problem (say, social engineering or hacking). And in this case, logarithmic returns are usually the best guide to progress, because problems usually have a structure where each doubling of resources solves a similar-sized chunk of the problem. Some more justification can be found here: https://www.fhi.ox.ac.uk/law-of-logarithmic-returns/

https://www.fhi.ox.ac.uk/theory-of-log-returns/

https://forum.effectivealtruism.org/posts/4rGpNNoHxxNyEHde3/most-problems-don-t-differ-dramatically-in-tractability
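As a rough statement of the heuristic (my paraphrase, not the exact formulation in those links): cumulative progress P from R units of resources behaves like

```latex
P(R) \;\approx\; a + b \ln R
\qquad \Longrightarrow \qquad
P(2R) - P(R) \;\approx\; b \ln 2
```

so each doubling of the resources thrown at a problem buys roughly the same fixed increment of progress, not a proportional one.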

As far as the intelligence question, I do tend to say that reality is probably exploitable enough that an intelligence that is at most one order of magnitude more intelligent than us and with radically nonhuman motivations could plausibly collapse civilization, and the reason this hasn't happened is that most people's motivations aren't to collapse the global civilization. This is because humans are very bounded in both motivations and intelligence.

Intelligence is one of the few things in the real world that follows a normal distribution, and we have some reason to think that this isn't the default state of affairs: it only arises due to constraints on energy and size, and the fact that the contributing factors add up additively, which is not the norm for distributions. AI intelligence will look more like a power law or a log-normal, and this implies more extreme conclusions about intelligence.
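A quick simulation makes the point that the shape of the distribution is set by how the components combine, not by the trait itself (a sketch with arbitrary numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
# 30 independent positive "components" per individual, arbitrary bounded range
factors = rng.uniform(0.5, 1.5, size=(100_000, 30))

additive = factors.sum(axis=1)         # components that add -> approximately normal (CLT)
multiplicative = factors.prod(axis=1)  # components that multiply -> approximately log-normal

for name, x in [("additive", additive), ("multiplicative", multiplicative)]:
    print(f"{name:>14}: p99/median = {np.percentile(x, 99) / np.median(x):.2f}")
# The additive trait has thin tails (p99 barely above the median); the multiplicative
# one has a long right tail, even though both are built from the same components.
```

If AI capability is built out of components that compound multiplicatively (compute, data, algorithmic improvements), the log-normal or power-law picture is the one to expect.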

There are about 2 to 5 steps, each with a due diligence procedure, in order to manipulate reality in any way with crypto, or even to transfer from a “niche” crypto to a more widely-used one such as ETH.

Nitpick: with Uniswap (or another DEX), you can convert your niche crypto to ETH without a due diligence/KYC check.
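For concreteness, a hedged sketch of what that looks like with web3.py (v6+) against the Uniswap V2 router; the RPC endpoint, token, sender address, and amounts are placeholders, the token is assumed to already have an allowance set for the router, and signing/broadcasting is left out:

```python
# Hedged sketch: build (but don't sign/send) a Uniswap V2 swap of a niche ERC-20 into ETH.
# Assumes web3.py >= 6, a real RPC endpoint, and an existing token allowance for the router.
from web3 import Web3

RPC_URL = "https://example-rpc.invalid"                   # placeholder RPC endpoint
ROUTER = "0x7a250d5630B4cF539739dF2C5dAcb4c659F2488D"     # Uniswap V2 Router02 (mainnet)
WETH   = "0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2"     # canonical WETH
TOKEN  = "0x0000000000000000000000000000000000000000"     # placeholder "niche" token
ME     = "0x0000000000000000000000000000000000000001"     # placeholder sender address

ROUTER_ABI = [{
    "name": "swapExactTokensForETH", "type": "function", "stateMutability": "nonpayable",
    "inputs": [
        {"name": "amountIn", "type": "uint256"},
        {"name": "amountOutMin", "type": "uint256"},
        {"name": "path", "type": "address[]"},
        {"name": "to", "type": "address"},
        {"name": "deadline", "type": "uint256"},
    ],
    "outputs": [{"name": "amounts", "type": "uint256[]"}],
}]

w3 = Web3(Web3.HTTPProvider(RPC_URL))
router = w3.eth.contract(address=Web3.to_checksum_address(ROUTER), abi=ROUTER_ABI)

tx = router.functions.swapExactTokensForETH(
    10**21,              # amountIn: 1000 tokens at 18 decimals, placeholder
    0,                   # amountOutMin: no slippage protection, for brevity only
    [Web3.to_checksum_address(TOKEN), Web3.to_checksum_address(WETH)],
    Web3.to_checksum_address(ME),
    w3.eth.get_block("latest")["timestamp"] + 600,   # 10-minute deadline
).build_transaction({"from": Web3.to_checksum_address(ME),
                     "nonce": w3.eth.get_transaction_count(Web3.to_checksum_address(ME))})
# ...sign with the sender's key and broadcast; no KYC step anywhere in this path.
```

Whether you can then turn a large pile of ETH into cloud compute without ever hitting a KYC check is, as the reply below notes, a separate question.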

I find it somewhat implausible that you'll turn a few hundred million dollars worth of crypto to compute without having a KYC check at some point, which is required by the story. [Even if you have a giant pile of ETH, will AWS or others take it without knowing more about who you are?]

Yeah, setting up fake identities and recruiting unwitting human accomplices via social manipulation is a pretty key part of take-over schemes. This is something language models are already close to being linguistically competent enough for (although thankfully not agentic/goal-driven enough to do so).

I mean, they don't have to be unwitting; Blake Lemoine knew he was working with a computer.

Last time I checked, that wouldn't work for a sizeable amount. Maybe I'm wrong? I claim no expertise in crypto and, as I said, I think that's my weakest point. In principle, I can see smart-contract-based swapping with a large liquidity pool plus an EVM sidechain being sufficient to do this.

It wouldn't fit the exact description in the story, but it would serve roughly the same purpose and be sufficient if you assume the EVM-optimized sidechain has enough volume (or a similar thing, with whatever would dethrone ETH in 20xx).
