Aspiring AI safety researcher. Currently doing my PhD at Fraunhofer HHI in Berlin, focusing on LLM interpretability. Interested in the internal structure underlying safety-relevant behaviors in LLMs: prompt injections, jailbreaks, deception.