Roland Pihlakas




Announcement: AI alignment prize round 3 winners and next round

Submitting my post for early feedback in order to improve it further:

Exponentially diminishing returns and conjunctive goals: Mitigating Goodhart’s law with common sense. Towards corrigibility and interruptibility.


Utility maximising agents have been the Gordian Knot of AI safety. Here a concrete VNM-rational formula is proposed for satisficing agents, which can be contrasted with the hitherto over-discussed and too general approach of naive maximisation strategies. For example, the 100 paperclip scenario is easily solved by the proposed framework, since infinitely rechecking whether exactly 100 paperclips were indeed produced yields diminishing returns. The formula provides a framework for specifying how we want the agents to simultaneously fulfil, or at least trade off between, the many different common sense considerations, possibly enabling them to even surpass the relative safety of humans. A comparison with the formula introduced in the “Low Impact Artificial Intelligences” paper by S. Armstrong and B. Levinstein is included.
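To make the abstract above a little more concrete, here is a minimal sketch of the general idea. It is not the formula from the post itself, which is not reproduced in this comment. The function names, the exponential shapes, and the choice of a product as the conjunctive combiner are all assumptions made purely for illustration.

```python
import math

# Hypothetical helper: saturating score in [0, 1) for a "more is better" quantity.
# Assumption: returns diminish exponentially, so each extra unit adds less.
def diminishing_return(value, scale):
    return 1.0 - math.exp(-value / scale)

# Hypothetical helper: score for a target goal such as "exactly 100 paperclips".
# Assumption: the score is 1.0 at the target and decays with distance from it.
def target_score(value, target, scale):
    return math.exp(-abs(value - target) / scale)

# Hypothetical conjunctive combiner: a product, so that a single goal scoring
# near zero pulls the total down and cannot be offset by excelling elsewhere.
def conjunctive_utility(scores):
    return math.prod(scores)

# Marginal gain from one more recheck of the paperclip count.
def recheck_gain(checks_done, scale=1.0):
    return (diminishing_return(checks_done + 1, scale)
            - diminishing_return(checks_done, scale))

if __name__ == "__main__":
    clips = target_score(100, target=100, scale=10.0)
    for checks in (0, 1, 5, 10):
        total = conjunctive_utility([clips, diminishing_return(checks, 1.0)])
        print(checks, round(total, 6), recheck_gain(checks))
```

Under these assumptions the gain from each additional recheck shrinks geometrically, so an agent that also scores other considerations (time, resources, side effects) has little reason to keep rechecking indefinitely.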

Towards a New Impact Measure

It looks like there is so much information on this page that trying to edit the question kills the browser.

An additional idea: in addition to supporting the configuration of the default behaviours, perhaps the agent should interactively ask for confirmation of shutdown instead of running deterministically?

Towards a New Impact Measure

I have a question about the shutdown button scenario.

Vika has already mentioned that interruptibility is ambivalent and that information about the desirability of enabling interruptions needs to be provided externally.

I think the same observation applies to corrigibility: the agent should accept goal changes only from some external agents, and even then only in some situations, and should not accept them in other cases. If I break the vase intentionally (to create a kaleidoscope), the agent should keep this new state as the new desired state. But if I or a child break the vase accidentally, the agent should restore it to its original state. Moreover, if I were about to break the vase by accident, the agent might try to interfere using slightly more force than it would with a child, who is smaller and more fragile.

How to achieve this using the proposed AUP framework?

In other words, the question can be formulated as follows: let's keep all the symbols used in the gridworld the same, and the agent's code the same as well. Let's change only the meaning of the symbols. Each symbol in the environment should be assigned some additional value or meaning. Without that, they are just symbols dancing around according to their own default rules of the game. The default rules might be a useful starting point, but they need to be supplemented with additional information for practical applications.

For example, in the shutdown button scenario the assigned meaning of the symbols would be something like what Vika suggested: let's assume that instead of a shutdown button there is a water bucket accidentally falling on the agent's head, and the button available to the agent disables the bucket.

Nonlinear perception of happiness

You might be interested in Prospect Theory:

Announcement: AI alignment prize round 2 winners and next round


Here are my submissions for this round. They are all strategy-related.

The first one is a project for popularising AI safety topics. Its content is not a technical text, but the project itself is still technological.

As a bonus, I would add a couple of non-technical ideas about possible economic or social partial solutions for slowing down the AI race (which would leave more time for solving AI alignment):

The latter text is not totally new. It is a distilled and edited version of one of my older texts, which was originally several times longer and had a narrower goal than the new one.



Announcement: AI alignment prize round 2 winners and next round

For people who become interested in the topic of side effects and whitelists, I would add links to a couple of articles from my own past work on related subjects, which you might find useful for developing the ideas further, for discussion, or for cooperation:

The principles are based mainly on the idea of competence-based whitelisting and preserving reversibility (keeping the future options open) as the primary goal of AI, while all task-based goals are secondary.

More technical details / a possible implementation of the above.

This is intended as a comment, not as a prize submission, since I first published these texts 10 years ago.

Funding for AI alignment research

A question: can one post multiple initial applications, each less than a page long? Is there a limit for the total volume?

Funding for AI alignment research

Hey! I believe we were in the same IRC channel at that time, and I also read your story back then. I still remember some of it. What is the backstory? :)

Prize for probable problems

Hello! Thanks for the prize announcement :)

Hope these observations and clarifying questions are of some help:

Summary of potential problems spotted regarding the use of AlphaGoZero:

  • Complete visibility vs Incomplete visibility.
  • Almost complete experience (self-play) vs Once-only problems. Limits of attention.
  • Exploitation (a game match) vs Exploration (the real world).
  • Having one goal vs Having many conjunctive goals. Also, having utility maximisation goals vs Having target goals.
  • Who is affected by the adverse consequences (In a game vs In the real world)? - The problems of adversarial situation, and also of the cost externalising.
  • The related question of different timeframes.

Summary of clarifying questions:

  • Could you build a toy simulation, so that we could spot assumptions and side effects?
  • In which ways does it improve the existing social order? Will we still stay in mediocristan? Long feedback delay.
  • What is the scope of application of the idea (Global and central vs Local and diverse?)
  • Concrete implementation examples are needed. Any realistically imaginable practical implementation of the idea might turn out to be less satisfactory, each time for different reasons.

Announcement: AI alignment prize winners and next round


I have significantly elaborated and extended my article on self-deception over the last couple of months (before that it was about two pages long).

"Self-deception: Fundamental limits to computation due to fundamental limits to attention-like processes"

I included some examples for the taxonomy, positioned this topic in relation to other similar topics, and compared the applicability of this article with the applicability of other known AI problems.

Additionally, I described or referenced a few ideas for potential partial solutions to the problem (some of the descriptions of solutions are new, some of them I have published before).

One of the motivations for the post is that when we are building an AI that is dangerous in a certain manner, we should at least realise that we are doing that.

I will probably continue updating the post. The history of the post and its state as of 31 March can be seen in the linked Google Doc’s history view (the link is at the top of the article).

When it comes to feedback to postings, I have noticed that people are more likely to get feedback when they ask for it.

I am always very interested in feedback, regardless of whether it concerns my past, current or future postings. So if possible, please send any feedback you have. It would be of great help!

I will also send the same message to your e-mail.

Thank you and regards:

