This post attempts to set out my views on why I don't find technical alignment to be a pressing issue, compared to more mundane AI risks like human misuse. It represents my personal views only, as someone who is not in AI research. I would appreciate any and all feedback.
In many of the arguments I’ve read about the AI alignment problem, I see an implicit assumption that a general AI with an intelligent and accurate model of the world, one good enough to generate plans and predictions, would naturally be capable of executing those plans through an unbounded set of actions.
That doesn’t strike me as true. Let’s take an advanced ‘generalist’ model like DeepMind’s Gato as an example. From my understanding, it is described as generalist because it uses the same model to process multiple modalities of input and to generate actions. But its output can hardly be described as general. For example, it can understand from context that I want an image caption and output that caption in the terminal, but it can’t send me the same caption by email.
Why is that so? The model outputs a set of numerical values corresponding to 1) its best guess of the context, i.e. that a text caption is wanted, and 2) the desired output, i.e. the choice and sequence of words. These values are then decoded in a deterministic way into a result in the terminal, or in whatever other interface the model is connected to. By construction, all AI models today work this way: they translate their results into a defined set of output functions. That is, the default is for an AI to be bound to the ‘box’ that is its set of explicitly programmed outputs.
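To make this concrete, here is a minimal toy sketch (my own illustration, not the code of Gato, DALL-E, or any real system; all names are invented) of what I mean by outputs being decoded into a fixed set of programmed actions:

```python
import numpy as np

# The only actions this toy system can ever take: its complete output 'box'.
OUTPUT_ACTIONS = {
    0: lambda text: print(text),                            # print caption to the terminal
    1: lambda text: open("caption.txt", "w").write(text),   # write caption to a file
}

def decode_and_act(scores: np.ndarray, caption: str) -> None:
    """Deterministically map the model's output scores onto a predefined output channel."""
    action_id = int(np.argmax(scores))   # pick the highest-scoring channel
    OUTPUT_ACTIONS[action_id](caption)   # the system cannot act outside this table

# However the scores were produced, the decoded result is confined to the
# channels defined above; 'send the caption by email' is simply not available.
decode_and_act(np.array([2.3, 0.7]), "A dog catching a frisbee on a beach.")
```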
I’m not an AI researcher, and I cannot tell you whether extending current deep learning techniques can create something like ‘intelligence’ in a model. But no matter how novel the ideas generated by the model, they are necessarily decoded into a defined set of output actions. I believe a breakthrough in how we conceptualize AI architecture is required before we see the kind of powerful AI that is the subject of so many alignment thought experiments.
Not only is general AI hard; by comparison, creating narrow AI for a variety of purposes is easy and fast. Known techniques for creating and optimizing narrow AI, such as generative adversarial networks (GANs) or stacking an ensemble of machine learning models, can lead to incredible, better-than-human results (1, 2). Furthermore, these existing techniques generalize to a wide variety of tasks.
We can already see clear pathways for narrow AI systems to be applied to a number of commercial purposes in the near to mid term. I believe such application of existing deep learning techniques is the fastest route to a scenario of AI-led economic transformation.
Loss of control is not a particularly important consideration for these existing techniques, which create ‘narrow AI’. While the intermediate steps DALL-E takes to make decisions about different pixels are opaque to me, I am certain the resulting output will be an image. The key AI safety considerations in this scenario are questions like robustness, bias, and securing AI systems against human misuse.
This point elaborates on some thoughts I want to share regarding potential unintended consequences of AI, which are one likely source of AI risk. It is not essential to the central argument, which asserts a low P(misalignment x-risk).
Another component of AI safety research that I believe could use more consideration is the interaction between AI systems. Research usually focuses on singular AI entities, but in AI applications today we already see a number of negative consequences arising from interactions between AI systems. This is a blind spot that is rarely assessed before a new AI model is deployed, and we would benefit from better frameworks for assessing it.
An example that’s often brought up is the flash crash of 2010 (8), which was caused by the interaction of different high-frequency trading algorithms, each sending predefined buy/sell orders when a certain pattern is detected.
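The dynamic is easy to see in miniature. Here is a toy simulation (not a model of the actual 2010 event; every rule and number is invented for illustration) of how two individually sensible trading rules can interact to produce a runaway price move:

```python
# Two simple automated traders reacting to the same price series.
prices = [100.0, 99.9]  # a small initial dip starts the sequence

def momentum_trader(curr, prev):
    """Sells into downward momentum: 20 shares whenever the price has just fallen."""
    return -20 if curr < prev else 0

def stop_loss_trader(curr):
    """Dumps a large position once the price breaches a fixed floor."""
    return -100 if curr < 98.0 else 0

for tick in range(15):
    curr, prev = prices[-1], prices[-2]
    net_orders = momentum_trader(curr, prev) + stop_loss_trader(curr)
    prices.append(curr + net_orders * 0.02)  # simple linear price impact

# The momentum trader's selling pushes the price down to the stop-loss floor,
# at which point both algorithms sell at once and the decline accelerates.
print(prices)
```

Neither rule is dangerous on its own; the cascade comes from their interaction through a shared environment.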
Another example I would like to suggest is social media. Facebook’s News Feed, YouTube’s video recommendations, and TikTok’s content recommendation algorithms are separate, non-aligned AI systems that individually work to maximize time spent on their own platform. Collectively, however, they normalize behaviors like mindless scrolling, and can contribute to mental health issues such as negative body image, anxiety, and depression (9, 10). Some studies suggest that the risks increase with the use of more platforms (11). Unfortunately, it has been difficult to pin down a good quantitative estimate of their collective negative externalities.
In point 1, I discussed my view that even a hypothetical AI with general intelligence is essentially a brain in a jar: converting its thoughts into actions remains a hurdle to achieving a generally capable AGI.
Partly it is also a rebuttal of the less rigorous arguments I sometimes see, which hand-wave away very important points about how a hypothetical AGI might achieve its goals (“A misaligned paperclip maximiser can take over all the factories and make them create only paperclips!!” … “How does it do that?” “Well, AGI is still decades away and it’s superintelligent, I’m sure it can figure it out.”)
But there is one hypothetical situation I can think of where an AI’s output is well-defined and bounded, yet still capable of something resembling general action. This is the case where the AI’s output is code that it is allowed to build and deploy automatically, without human oversight.
Such an AI cannot itself output images, but it can create software that generates images. Or it can sign up for an email address, which allows it to create a NightCafe account, which allows it to generate original images. This is an important hypothetical, because it at least gives us a concrete starting point from which to think meaningfully about ways to prevent the dangers of a rogue AI.
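To illustrate why this single narrow channel is enough, here is a minimal sketch of the hypothetical (all names and behaviour are invented; `hypothetical_model` stands in for the AI) where the model’s only output is text, but that text is automatically built and executed as code:

```python
import subprocess
import sys
import tempfile

def hypothetical_model(prompt: str) -> str:
    """Stand-in for an AI whose sole output channel is source code as text."""
    return 'print("pretend this script signs up for an email account")'

def build_and_deploy(source: str) -> None:
    """The single, well-defined output action: save the generated code and run it."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    # No human reviews `source` before this line executes it, so the effective
    # action space becomes whatever the runtime environment permits.
    subprocess.run([sys.executable, path], check=True)

build_and_deploy(hypothetical_model("generate an image for me"))
```

The output channel itself stays bounded (emit text, run it), but the reachable actions are now limited only by what the deployed code can do, which is exactly where I think meaningful containment thinking can start.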