Computing scientist and Systems architect


The Case for a Journal of AI Alignment

An idea for having more AI Alignment peer review 


[...] might solve two problems at once:

  • The lack of public feedback and in-depth peer review in most posts here
  • The lack of feedback at all for newcomers [...]

I think you need to distinguish clearly between wanting more peer interaction/feedback and wanting more peer review.

Academic peer review is a form of feedback, but it is mainly a form of quality control, so the scope of the feedback tends to be very limited in my experience.

The most valuable feedback, in terms of advancing the field, is comments like 'maybe if you combine your X with this Y, then something very new/even better will come out'.   This type of feedback can happen in private gdocs or LW/AF comment sections, less so in formal peer review.

That being said, I don't think that private gdocs or LW/AF comment sections are optimal peer interaction/feedback mechanisms; something better might be designed. (The usual offline solution is to put a bunch of people together in the same building, either permanently or at a conference, and have many coffee breaks. Creating the same dynamics online is difficult.)

To make this more specific, here is what usually stops me from contributing feedback in AF comment sections. The way I do research, I tend to go on for months without reading any AF posts, as this would distract me too much. When I catch up, I have little motivation to add a quick or detailed comment to a 2-month-old post.

The Case for a Journal of AI Alignment

I agree with Ryan's comments above on this being somewhat bad timing to start a journal for publishing work like the two examples mentioned at the start of the post above.  I have an additional reason, not mentioned by Ryan, for feeling this way.

There is an inherent paradox when you want to confer academic credibility or prestige on much of the work that has appeared on LW/AF, work that was produced from an EA or x-risk driven perspective. Often, the authors chose the specific subject area of the work exactly because at the time, they felt that the subject area was a) important for x-risk while also b) lacking the credibility or prestige in mainstream academia that would have been necessary for academia to produce sufficient work in the subject area.

If condition b) is not satisfied, or stops being satisfied, then the EA or x-risk driven researchers (and EA givers of research funds) will typically move elsewhere.

I can't see any easy way to overcome this paradox of academic prestige-granting on prestige-avoiding work in an academic-style journal.  So I think that energy is better spent elsewhere.

Some AI research areas and their relevance to existential safety

Nice post!  In particular, I like your reasoning about picking research topics:

The main way I can see present-day technical research benefiting existential safety is by anticipating, legitimizing and fulfilling governance demands for AI technology that will arise over the next 10-30 years.  In short, there often needs to be some amount of traction on a technical area before it’s politically viable for governing bodies to demand that institutions apply and improve upon solutions in those areas.

I like this as a guiding principle, and have used it myself, though my choices have also been driven in part by more open-ended scientific curiosity.  But when I apply the above principle, I get to quite different conclusions about recommended research areas.

As a specific example, take the problem of oversight of companies that want to create or deploy strong AI: the problem of getting to a place where society has accepted and implemented policy proposals that demand significant levels of oversight for such companies.  In theory, such policy proposals might be held back by a lack of traction in a particular technical area, but I do not believe this is a significant factor in this case.

To illustrate, here are some oversight measures that apply right now to companies that create medical equipment, including diagnostic equipment that contains AI algorithms. (Detail: some years ago I used to work in such a company.) If the company wants to release any such medical technology to the public, it has to comply with a whole range of requirements about documenting all steps taken in development and quality assurance.  A significant paper trail has to be created, which is subject to auditing by the regulator.  The regulator can block market entry if the processes are not considered good enough.  Exactly the same paper trail + auditing measures could be applied to companies that develop powerful non-medical AI systems that interact with the public.  No technical innovation would be necessary to implement such measures.

So if any activist group or politician wants to propose measures to improve oversight of AI development and use by companies (either motivated by existential safety risks or by a more general desire to create better outcomes in society), there is no need for them to wait for further advances in Interpretability in ML (IntML), Fairness in ML (FairML) or Accountability in ML (AccML) techniques.

To lower existential risks from AI, it is absolutely necessary to locate proposals for solutions which are technically tractable.  But to find such solutions, one must also look at low-tech and different-tech solutions that go beyond the application of even more AI research.  The existence of tractable alternative solutions to make massive progress leads me to down-rank the three AI research areas I mention above, at least when considered from a pure existential safety perspective.  The non-existence of alternatives also leads me to up-rank other areas (like corrigibility) which are not even mentioned in the original post.

I like the idea of recommending certain fields for their educational value to existential-safety-motivated researchers. However, I would also recommend that such researchers read broadly beyond the CS field, to read about how other high-risk fields are managing (or have failed to manage) to solve their safety and governance problems.  

I believe that the most promising research approach for lowering AGI safety risk is to find solutions that combine AI research specific mechanisms with more general mechanisms from other fields, like the use of certain processes which are run by humans.

Question: MIRI Corrigbility Agenda

Nope, not intentional.

You should feel free to write a literature overview that cites or draws heavily on paper-announcement blog posts. I definitely won't mind. In general, the blog posts tend to use language that is less mathematical and more targeted at a non-specialist audience. So if you aim to write a literature overview that is as readable as possible for a general audience, then drawing on phrases from the author's blog posts describing the papers (when such posts are available) may be your best bet.

Question: MIRI Corrigbility Agenda

Thanks, you are welcome!

Dutch custom prevents me from recommending my own recent paper in any case, so if I had to recommend one paper from the time frame 2015-2020 that you probably have not read yet, I'd recommend 'Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective'. This stands out as an overview of different approaches, and I think you can get a good feeling of the state of the field out of it even if you do not try to decode all the math.

Note that there are definitely some worthwhile corrigibility related topics that are discussed only/mainly in blog posts and in LW comment threads, but not in any of the papers I mention above or in my mid-2019 related work section. For example, there is the open question whether Christiano's Iterated Amplification approach will produce a kind of corrigibility as an emergent property of the system, and if so what kind, and is this the kind we want, etc. I have not seen any discussion of this in the 'formal literature', if we define the formal literature as conference/arxiv papers, but there is a lot of discussion of this in blog posts and comment threads.

My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda

Cross-linking to another thread: I just posted a long comment with more references to corrigibility resources in your post asking about corrigibility reading lists.

In that comment I focus on corrigibility related work that has appeared as scientific papers and/or arxiv preprints.

Question: MIRI Corrigbility Agenda

Just found your question via comment sections of recent posts. I understand you are still interested in the topic, so I'll add to the comments below. In the summer of 2019 I did significant work trying to understand the status of the corrigibility literature, so here is a long answer mostly based on that.

First, at this point in time there is no up-to-date centralised reading list on corrigibility. All research agenda or literature overview lists that I know of lack references to the most recent work.

Second, the 'MIRI corrigibility agenda', if we define this agenda as a statement of the type of R&D that MIRI wants to encourage when it comes to the question of corrigibility, is very different from e.g. the 'Paul Christiano corrigibility agenda', if we define that agenda as the type of R&D that Paul Christiano likes to do when it comes to the question of corrigibility. MIRI's agenda related to corrigibility still seems to be to encourage work on decision theory and embeddedness. I am saying 'still seems' here because MIRI as an organisation has largely stopped giving updates about what they are thinking collectively.

Below I am going to talk about the problem of compiling or finding up to date reading lists that show all work on the problem of corrigibility, not a subset of work that is most preferred or encouraged by a particular agenda.

One important thing to note is that by now, unfortunately, the word corrigibility means very different things to different people. MIRI very clearly defined corrigibility, in their 2015 paper with that title, by a list of 4 criteria, (and in a later section also by a list of 5 criteria at a different level of abstraction), 4 criteria that an agent has to satisfy in order to be corrigible. Many subsequent authors have used the terms 'corrigibility' or 'this agent is corrigible' to denote different, and usually weaker, desirable properties of an agent. So if someone says that they are working on corrigibility, they may not be working towards the exact 4 (or 5) criteria that MIRI defined. MIRI stresses that a corrigible agent should not take any action that tries to prevent a shutdown button press (or more generally a reward function update). But many authors are defining success in corrigibility to mean a weaker property, e.g. that the agent must always accept the shutdown instruction (or the reward function update) when it gets it, irrespective of whether the agent tried to manipulate the human into not pressing the stop button beforehand.

When writing the related work section of my 2019 paper Corrigibility with Utility Preservation, I tried to do a survey of all related work on corrigibility, a survey without bias towards my own research agenda. I quickly found that there is a huge amount of writing about corrigibility in various blog/web forum posts and their comment sections, way too much for me to describe in a related work section. There was too much for me to even read it all, though I read a lot of it. So I limited myself, for the related work section, to reading and describing the available scientific papers, including arxiv preprints. I first created a long list of some 60 papers by using google scholar to search for all papers that reference the 2015 MIRI paper, by using some other search terms, and by using literature overviews. I then filtered out all the papers which a) just mention corrigibility in a related work section or b) describe the problem in more detail, but without contributing any new work or insights towards a solution. This left me with a short list of only a few papers to cite as related work. It actually surprised me that so little further work had been done on corrigibility after 2015, at least work that made it to publication in a formal paper or preprint.

In any case, I can offer the related work section in my mid 2019 paper on corrigibility as an up-to-date-as-of-mid-2019 reading list on corrigibility, for values of the word corrigibility that stay close to the original 2015 MIRI definition. For broader work that departs further from the definition, I used the device of referencing the 2018 literature review of Everitt, Lee and Hutter.

So what about the literature written after mid-2019 that would belong on a corrigibility reading list? I have not done a complete literature search since then, but definitely my feeling is that the pace of work on corrigibility has picked up a bit since mid 2019, for various values of the word corrigibility.

Several authors, including myself, are avoiding the word corrigibility when referring to the problem of corrigibility. My own reason for avoiding it is that it just means too many different things to different people. So I prefer to use broader terms like 'reward tampering' or 'unwanted manipulation of the end user by the agent'. In the 2019 book Human Compatible, Russell is using the phrasing 'the problem of control' to kind-of denote the problem of corrigibility.

So here is my list of post-mid-2019 books and papers that are useful to read if you want to do new R&D on safety mechanisms that achieve corrigibility/that prevent reward tampering or unwanted manipulation, without risking re-inventing the wheel. Unlike the related work section discussed above, this is not based on a systematic global long-list-to-short-list literature search; it is just work that I happened to encounter (and write myself).

  • The book Human Compatible by Russell. -- This book provides a good natural-language problem statement of the reward tampering problem, but it does not get into much technical detail about possible solutions, because it is not aimed at a technical audience. For technical detail about possible solutions:
  • Everitt, T., Hutter, M.: Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective. arXiv:1908.04734 (2019) -- this paper is not just about causal influence diagrams but it also can be used as a good literature overview of many pre-mid-2019 reward tampering solutions, a literature overview that is more recent, and provides more descriptive detail, than the 2018 literature review I mentioned above.
  • Stuart Armstrong, Jan Leike, Laurent Orseau, Shane Legg: Pitfalls of learning a reward function online
    -- this has a very good problem statement in the introduction, phrasing the tampering problem in an 'AGI agent as a reward learner' context. It then gets into a very mathematical examination of the problem.
  • Koen Holtman: AGI Agent Safety by Iteratively Improving the Utility Function
    (blog post intro here) -- This deals with a particular solution direction to the tampering problem. It also uses math, but I have tried to make the math as accessible as possible to a general technical audience.

This post-mid-2019 reading list is also biased to my own research agenda, and my agenda favours the use of mathematical methods and mathematical analysis over the use of natural language when examining AGI safety problems and solutions. Other people might have other lists.

Tradeoff between desirable properties for baseline choices in impact measures

By "semantic models that are rich enough", do you mean that the AI might need a semantic model for the power of other agents in the environment?

Actually in my remarks above I am less concerned about how rich a model the AI may need. My main intuition is that we ourselves may need a semantic model that describes the comparable power of several players, if our goal is to understand motivations towards power more deeply and generally.

To give a specific example from my own recent work: in working out more details about corrigibility and indifference, I ended up defining a safety property 2 (S2 in the paper) that is about control. Control is a form of power: if I control an agent's future reward function, I have power over the agent, and indirect power over the resources it controls. To define safety property 2 mathematically, I had to make model extensions that I did not need to make to define or implement the reward function of the agent itself. So by analogy, if you want to understand and manage power seeking in an n-player setting, you may end up needing to define model extensions and metrics that are not present inside the reward functions or reasoning systems of each player. You may need them to measure, study, or define the nature of the solution.

The interesting paper you mention gives a kind-of example of such a metric, when it defines an equality metric for its battery collecting toy world, an equality metric that is not (explicitly) represented inside the agent's own semantic model. For me, an important research challenge is to generalise such toy-world specific safety/low-impact metrics into metrics that can apply to all toy (and non-toy) world models.

Yet I do not see this generalisation step being done often, and I am still trying to find out why not. Partly I think I do not see it often because it is mathematically difficult. But I do not think that is the whole story. So that is one reason I have been asking opinions about semantic detail.

In one way, the interesting paper you mention goes in a direction that is directly counter to the one I feel is the most promising one. The paper explicitly frames its solution as a proposed modification of a specific deep Q-learning machine learning algorithm, not as an extension to the reward function that is being supplied to this machine learning algorithm. By implication, this means they add more semantic detail inside the machine learning code, while keeping it out of the reward function. My preference is to extend the reward function if at all possible, because this produces solutions that will generalise better over current and future ML algorithms.
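
As a minimal sketch of what 'extending the reward function' can look like in code, the wrapper below takes a base reward function and an impact metric and returns a new reward function. Everything here is a hypothetical illustration: the names with_impact_penalty, base_reward and impact_on_other, the tuple-shaped state, and the penalty weight are assumptions and are not taken from the paper or from any library. The point is only that the extra semantic detail lives in the reward function, so any learner that accepts a reward function can use it unchanged.

```python
from typing import Callable, Tuple

# Hypothetical toy state: (batteries held by agent, batteries held by other player)
State = Tuple[int, int]
RewardFn = Callable[[State, str, State], float]


def with_impact_penalty(
    base_reward: RewardFn,
    impact: Callable[[State, State], float],
    weight: float,
) -> RewardFn:
    """Return a reward function equal to base_reward minus weight * impact."""

    def extended(state: State, action: str, next_state: State) -> float:
        return base_reward(state, action, next_state) - weight * impact(state, next_state)

    return extended


def base_reward(state: State, action: str, next_state: State) -> float:
    # Assumed task reward: one unit for each battery the agent gains.
    return float(next_state[0] - state[0])


def impact_on_other(state: State, next_state: State) -> float:
    # Assumed impact metric: how many batteries the other player lost.
    return float(max(0, state[1] - next_state[1]))


extended_reward = with_impact_penalty(base_reward, impact_on_other, weight=2.0)

# Taking a battery from the other player is now penalised; waiting is neutral.
take = extended_reward((0, 3), "take", (1, 2))
wait = extended_reward((0, 3), "wait", (0, 3))
```

In this sketch the learning algorithm is never touched; swapping in a different learner only requires passing it extended_reward instead of base_reward.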

Tradeoff between desirable properties for baseline choices in impact measures

Thanks for clarifying your view! I agree that for point 1 above, less semantic structure should be needed.

Reading some of the links above again, I still feel that we might be having different views on how much semantic structure is needed. But this also depends on what you count as semantic structure.

To clarify where I am coming from, I agree with the thesis of your paper Optimal Farsighted Agents Tend to Seek Power. I am not in the camp which, to quote the abstract of the paper, 'voices scepticism' about emergent power seeking incentives.

But to me, the main mechanism that turns power seeking incentives into catastrophic power-seeking is when at least two power-seeking entities with less than 100% aligned goals start to interact with each other in the same environment. So I am looking for semantic models that are rich enough to capture at least 2 players being present in the environment.

I have the feeling that you believe that moving to the 2-or-more-players level of semantic modelling is of lesser importance, is in fact a distraction, that we may be able to solve things cleanly enough if we just make every agent not seek power too much. Or maybe you are just prioritizing a deeper dive in that particular direction initially?

Tradeoff between desirable properties for baseline choices in impact measures

Thanks for the clarification, I think our intuitions about how far you could take these techniques may be more similar than was apparent from the earlier comments.

You bring up the distinction between semantic structure that is learned via unsupervised learning, and semantic structure that comes from 'explicit human input'. We may be using the term 'semantic structure' in somewhat different ways when it comes to the question of how much semantic structure you are actually creating in certain setups.

If you set up things to create an impact metric via unsupervised learning, you still need to encode some kind of impact metric on the world state by hand, to go into the agent's reward function, e.g. you may encode 'bad impact' as the observable signal 'the owner of the agent presses the do-not-like feedback button'. For me, that setup uses a form of indirection to create an impact metric that is incredibly rich in semantic structure. It is incredibly rich because it indirectly incorporates the impact-related semantic structure knowledge that is in the owner's brain. You might say instead that the metric does not have a rich semantic structure at all, because it is just a bit from a button press. For me, an impact metric that is defined as 'not too different from the world state that already exists' would also encode a huge amount of semantic structure, in case the world we are talking about is not a toy world but the real world.
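
To make the two hand-encoded impact signals discussed above concrete, here is a small sketch under stated assumptions. The Observation fields, the function names, and the use of an absolute-difference distance to a baseline are all hypothetical choices made for illustration; they do not come from any particular paper or framework. Both metrics are only a few lines of code, which is the sense in which one might call them semantically thin, even though the button signal is produced by the owner's judgement and the baseline is a snapshot of the existing world.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Observation:
    task_progress: float                # assumed task reward signal
    dislike_button_pressed: bool        # owner pressed the do-not-like button
    world_features: Tuple[float, ...]   # assumed numeric summary of the world state


def button_impact(obs: Observation) -> float:
    """Impact metric by indirection: a single bit supplied by the owner."""
    return 1.0 if obs.dislike_button_pressed else 0.0


def baseline_impact(obs: Observation, baseline: Tuple[float, ...]) -> float:
    """Impact metric as distance from the world state that already exists."""
    return sum(abs(x - b) for x, b in zip(obs.world_features, baseline))


def reward(obs: Observation, impact_value: float, weight: float = 1.0) -> float:
    """Task reward minus a weighted impact penalty."""
    return obs.task_progress - weight * impact_value


obs = Observation(task_progress=1.0, dislike_button_pressed=True, world_features=(0.0, 2.0))
baseline = (0.0, 0.5)

r_button = reward(obs, button_impact(obs))
r_baseline = reward(obs, baseline_impact(obs, baseline), weight=2.0)
```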
