Preliminary note: the ideas in this post emerged during the Learning-by-doing AI safety workshop at the EA Hotel; special thanks to Linda Linsefors, Davide Zagami and Grue_Slinky for their suggestions and feedback.

As the title anticipates, long-term safety is not the main topic of this post; for the most part, the focus will be on current AI technologies. More specifically: why are we (un)satisfied with them from a safety perspective? In what sense can they be considered tools, or services?

An example worth considering is the YouTube recommendation algorithm. In simple terms, the job of the algorithm is to find the videos that best fit the user and then suggest them. The expected watch time of a video is a variable that heavily influences how a video is ranked, but the objective function is likely to be complicated and probably includes variables such as click-through rate and session time.[1] For the sake of this discussion, it is sufficient to know that the algorithm cares about the time spent by the user watching videos.
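To make the discussion concrete, here is a minimal sketch of what such an engagement-driven ranking objective might look like. The features, weights and function names are purely illustrative assumptions; YouTube's actual objective is not public and is certainly more complex.

```python
# Purely illustrative sketch of a watch-time-centric ranking objective.
# Features and weights are hypothetical, not YouTube's actual system.

def ranking_score(expected_watch_time, click_through_rate, expected_session_time):
    """Combine engagement signals into a single score used to rank a video."""
    return (0.6 * expected_watch_time
            + 0.2 * click_through_rate
            + 0.2 * expected_session_time)

def recommend(candidates, k=10):
    """Return the k candidate videos with the highest predicted engagement."""
    return sorted(
        candidates,
        key=lambda v: ranking_score(v["expected_watch_time"],
                                    v["click_through_rate"],
                                    v["expected_session_time"]),
        reverse=True,
    )[:k]
```

The point is simply that every term in the score rewards keeping the user watching; nothing in the objective represents what the user would endorse on reflection.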

From a safety perspective - even without bringing up existential risk - the current objective function is simply wrong: a universe in which humans spend many hours per day on YouTube is not something we want. The YT algorithm has the same problem that Facebook had in the past, when it maximized click-throughs.[2] This is evidence for the thesis that we don't need AGI for things to go badly: if we keep producing software that optimizes for easily measurable but inadequate targets, we will steer the future towards worse and worse outcomes.


Imagine a scenario in which:

  • human willpower is weaker than now;
  • hardware is faster than now, so that the YT algorithm manages to evaluate a larger number of videos per time unit and, as a consequence, gives the user better suggestions.

Because of these modifications, humans could spend almost all day on YT. It is worth noting that, even in this semi-catastrophic case, the behaviour of the AI would be more tool-ish than AGI-like: it would not actively oppose its shutdown, start acquiring new resources, develop an accurate model of itself in order to self-improve, et cetera.

From this perspective, the video recommendation service seems much more dangerous than what we usually mean by the term tool AI. How can we make the YT algorithm more tool-ish? What is a tool, exactly?

Unsurprisingly, it seems we don't have a clear definition yet. In his paper about CAIS, Drexler writes that it is typical of services to deliver bounded results with bounded resources in bounded times.[3] A possible solution, then, is to put a constraint on the time that a user can spend on YT over a certain period. In practice, this could be done by forcing the algorithm to suggest random videos once the session time exceeds a threshold value; notably, this solution doesn't even require modifying the main objective function. In the following, I will refer to this hypothetical fixed version of the algorithm as the "constrained YT algorithm" (cYTa).
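A minimal sketch of what cYTa could look like, reusing the toy recommend function sketched earlier; the threshold value and the session_minutes counter are arbitrary illustrative assumptions.

```python
import random

SESSION_THRESHOLD_MINUTES = 60  # arbitrary illustrative cap

def constrained_recommend(candidates, session_minutes, k=10):
    """cYTa sketch: optimize as usual below the threshold, go random above it.

    The main objective function is untouched; the constraint is applied
    as a wrapper around the existing (unconstrained) recommender.
    """
    if session_minutes >= SESSION_THRESHOLD_MINUTES:
        return random.sample(candidates, min(k, len(candidates)))
    return recommend(candidates, k)
```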

Even though this modification would prevent the worst outcomes, we would still have to deal with subtler problems like echo chambers and filter bubbles, which are caused by the fact that recommended videos share something in common with the videos watched by the user in the past.[4] So, if our standards of safety are set high enough, the example of cYTa shows that the criterion "bounded results, resources and time" is insufficient to guarantee positive outcomes.


In order to better understand what we want, it may be useful to consider current AI technologies that we are satisfied with. Take Google Maps, for example: like cYTa, it optimizes within hard constraints and can be easily shut down. However, GMaps doesn't have a known negative side effect comparable to echo chambers; from this point of view, AIs that play strategy games (e.g. Deep Blue) are also similar to GMaps.

Enough with the examples! I claim that the "idealized safe tool AI" fulfills the following criteria:

  1. Corrigibility
  2. Constrained optimization
  3. No negative side effects

Before I get insulted in the comments because of how [insert_spicy_word] this list is, I'm going to spell out some details. First, I've simply listed three properties that seem necessary if we want to talk about an AI technology that doesn't cause any sort of problem. I wouldn't be surprised if the list turned out to be non-exhaustive, and I don't mean it to be taken as a definition of the concept "tool" or "service". At the same time, I think these two terms are too under-specified at the moment, so adding some structure could be useful for future discussions. Moreover, it seems to me that 3 implies 2 because, for each variable that is left unconstrained during optimization, side effects usually become more probable; in general, 3 is a really strong criterion. By contrast, 1 seems to be somewhat independent of the others. Lastly, even though the concept is idealized, it is not so abstract that we lack a concrete reference point: GMaps works well as an example.[5]

Where do we go from here? We can start by asking whether what has been said about CAIS is still valid if we replace the term service with the concept of idealized safe tool. My intuition is that the answer is yes, and that the idealized concept can actually facilitate the analysis of some of the ideas presented in the paper. Another possible question is to what extent a single superintelligent agent can adhere to 3; in other words, whether limiting an AI's side effects also constrains its capability of achieving goals. These two papers already highlighted the importance of negative side effects and impact measures, but we are still far from a solid, satisfactory answer.
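As a toy illustration of the impact-measure idea mentioned above, one common framing is to subtract from the task reward a penalty proportional to how much the agent changes the world relative to some baseline. The penalty function, the baseline and the weight below are placeholders of my own, not a proposal taken from the cited papers.

```python
def penalized_reward(task_reward, current_state, baseline_state, impact, beta=1.0):
    """Toy impact-regularized objective: trade task performance against
    a measure of how much the agent disturbs the world."""
    return task_reward - beta * impact(current_state, baseline_state)

# Naive impact measure: count of state features that differ from the
# baseline (e.g. inaction) state. Purely illustrative.
def naive_impact(state, baseline):
    return sum(1 for s, b in zip(state, baseline) if s != b)

print(penalized_reward(task_reward=10.0,
                       current_state=(1, 0, 1),
                       baseline_state=(0, 0, 1),
                       impact=naive_impact,
                       beta=2.0))  # -> 8.0
```

The open question from the post remains: whether a penalty of this kind can be made strong enough to rule out negative side effects without also crippling the agent's ability to achieve its goals.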


Summary

For the sake of clarity, I'll recap the main points presented here:

  • Even if AGI were impossible to obtain, AI safety wouldn't be solved; thinking of tools as naturally safe is a mistake.
  • As shown by the cYTa example, putting strong constraints on optimization is not enough to ensure safety.
  • An idealized notion of safe tool is proposed. This should give a bit more context to previously discussed ideas (e.g. CAIS) and may stimulate future research or debate.

  1. Not all the details are publicly available, and the algorithm changes frequently. By googling "YouTube SEO" I managed to find these, but I don't know how reliable the source is. ↩︎

  2. As stated by Yann LeCun in this discussion about instrumental convergence: "[...] Facebook stopped maximizing clickthroughs several years ago and stopped using the time spent in the app as a criterion about 2 years ago. It put in place measures to limit the dissemination of clickbait, and it favored content shared by friends rather than directly disseminating content from publishers." ↩︎

  3. Page 32. ↩︎

  4. With cYTa, the user will experience the filter bubble only until the threshold is reached; the problem would be only slightly reduced, not solved. If the threshold is set really low, the problem becomes irrelevant, but at the same time the algorithm becomes useless because it recommends random videos most of the time. ↩︎

  5. In order to completely fulfill 3, we have to neglect things like possible car accidents caused by distraction induced by the software. Analogously, an AI like AlphaZero could be somewhat addictive for the average user who likes winning at strategy games. In reality, every piece of software can have negative side effects; saying that GMaps and AlphaZero have none seems a reasonable approximation. ↩︎

Comments
Because of these modifications, humans could spend almost all day on YT. It is worth noting that, even in this semi-catastrophic case

Calling this a semi-catastrophic case illustrates what seems to me to be a common oversight: not thinking about non-technical feedback mechanisms. In particular, I expect that in this case, YouTube would become illegal, and then everything would be fine.

I know there's a lot more complexity to the issue, and I don't want people to have to hedge all their statements, but I think it's worth pointing out that we shouldn't start to think of catastrophes as "easy" to create in general.

Maybe I should have used different words: I didn't want to convey the message that catastrophes are easy to obtain. The purpose of the fictional scenario was to make the reader reflect on the usage of the word "tool". Anyway, I'll try to consider non-technical feedback mechanisms more often in the future. Thanks!