Half-researcher, half-distiller, both in AI Safety. Funded, with a PhD in theoretical computer science (distributed computing).

If you're interested in some research ideas that you see in my posts, know that I probably have many private docs in the process of getting feedback (because for my own work, the AF has proved mostly useless in terms of feedback). I can give you access if you PM me!


AI Alignment Unwrapped
Understanding Goal-Directedness
Toying With Goal-Directedness
Through the Haskell Jungle
Lessons from Isaac


Full-time AGI Safety!

Welcome to the (for now) small family of people funded by Beth! Your research looks pretty cool, and I'm quite excited when seeing how different it is from mine. So Beth is funding quite a wide range of researchers, which makes the most sense to me. :)

A Behavioral Definition of Goal-Directedness

Thanks for telling me! I've changed that.

It might be because I copied and pasted the first sentence to each subsection.

A Behavioral Definition of Goal-Directedness

Thanks for taking the time to give feedback!

Technical comment on the above post

So if I understand this correctly, then  is a metric of goal-directedness. However, I am somewhat puzzled, because  only measures directedness toward the single goal .

But to get close to the concept of goal-directedness introduced by Rohin, don't you then need to do an operation over all possible values of ?

That's not what I had in mind, but it's probably on me for not explaining it clearly enough.

  • First, for a fixed goal , the whole focus matters. That is, we also care about  and . I plan on writing a post defending why we need all of them, but basically there are situations where using only one of them would make us order things weirdly.
  • You're right that we need to consider all goals. That's why the goal-directedness of the system  is defined as a function that sends each goal (satisfying the nice conditions) to a focus, the vector of three numbers. So the goal-directedness of  contains the focus for every goal, and the focus captures the coherence of  with the goal.
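To make the structure concrete, here is a minimal sketch of that definition as code: goal-directedness is not one number but a mapping from goals to focus vectors. Everything here is illustrative (the goal names, the stand-in focus computations, and the three-component shape are my assumptions, not the post's actual definitions):

```python
# Hypothetical sketch: goal-directedness as a map from goals to "focus"
# vectors of three numbers. Names and values below are made up for
# illustration; they are NOT the definitions from the post.
from typing import Callable, Dict, Tuple

Focus = Tuple[float, float, float]  # the three numbers making up a focus
Behavior = Callable[[str], float]   # stand-in type for a system's policy

def goal_directedness(
    behavior: Behavior,
    goals: Dict[str, Callable[[Behavior], Focus]],
) -> Dict[str, Focus]:
    """Map every admissible goal to the system's focus for that goal."""
    return {name: score(behavior) for name, score in goals.items()}

# Toy usage: two hypothetical goals, each scoring a dummy behavior.
behavior = lambda state: 1.0  # placeholder policy
goals = {
    "reach_A": lambda b: (0.9, 0.5, 0.7),  # placeholder focus computation
    "reach_B": lambda b: (0.2, 0.1, 0.3),
}
profile = goal_directedness(behavior, goals)
```

The point of the sketch is only the type signature: the output is the whole profile over goals, not a single scalar, which is what allows looking at it "from different perspectives" later.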

Rohin then speculates that if we remove the 'goal' from the above argument, we can make the AI safer. He then comes up with a metric of 'goal-directedness' where an agent can have zero goal-directedness even though he can model it as a system that is maximizing a utility function. Also, in Rohin's terminology, an agent gets safer if it is less goal-directed.

This doesn't feel like a good summary of what Rohin says in his sequence.

  • He says that many scenarios used to argue for AI risks implicitly use systems following goals, and thus that building AIs without goals might make these scenarios go away. But he doesn't say that new problems can't emerge.
  • He doesn't propose a metric of goal-directedness. He just argues that every system is maximizing a utility function, and so this isn't the way to differentiate goal-directed from non-goal-directed systems. The point of this argument is also to say that reasons to believe that AGIs should maximize expected utility are not enough to say that such AGIs must necessarily be goal-directed.

Rohin then proposes that intuitively, a table-driven agent is not goal-directed. I think you are not going there with your metrics, you are looking at observable behavior, not at agent internals.

Where things completely move off the main sequence is in Rohin's next step in developing his intuitive notion of goal-directedness:

"This suggests a way to characterize these sorts of goal-directed agents: there is some goal such that the agent’s behavior in new circumstances can be predicted by figuring out which behavior best achieves the goal."

So what I am reading here is that if an agent behaves more unpredictably off-distribution, it becomes less goal-directed in Rohin's intuition. But I can't really make sense of this anymore, as Rohin also associates less goal-directedness with more safety.

This all starts to look like a linguistic form of Goodharting: the meaning of the term 'goal-directed' collapses completely because too much pressure is placed on it for control purposes.

My previous answer mostly addresses this issue, but let's spell it out: Rohin doesn't say that non-goal-directed systems are safe in every sense. What he defends is that

  1. Non-goal-directed (or low-goal-directed) systems wouldn't be unsafe in many of the ways we study, because these depend on having a goal (convergent instrumental subgoals for example)
  2. Non-goal-directed competent agents are not a mathematical impossibility, even if every competent agent must maximize expected utility.
  3. Since removing goal-directedness apparently gets rid of many big problems with aligning AI, and we don't have an argument for why making a competent non-goal-directed system is impossible, we should try to look into non-goal-directed approaches.

Basically, the intuition of "less goal-directed means safer" makes sense when safer means "less probability that the AI steals all my money to buy hardware and goons to ensure that it can never be shut down", not when it means "less probability that the AI takes an unexpected and counterproductive action".

Another way to put it is that Rohin argues that removing goal-directedness (if possible) seems to remove many of the specific issues we worry about in AI Alignment -- and leaves mostly the near-term "my automated car is running over people because it thinks they are parts of the road" kind of problems.

To bring this back to the post above, this leaves me wondering how the metrics you define above relate to safety, and how far along you are in your program of relating them to safety.

  • Is your idea that a lower number on a metric implies more safety? This seems to be Rohin's original idea.
  • Are these metrics supposed to have any directly obvious correlation to safety, or the particular failure scenario of 'will become adversarial and work against us' at all? If so I am not seeing the correlation.

That's a very good and fair question. My reason for not using a single metric is that I think the whole structure of focuses for many goals can tell us many important things (for safety) when looked at from different perspectives. That's definitely something I'm working on, and I think I have nice links to explainability (and others probably coming). But to take an example from the post, it seems that a system with one goal with far more generalization than any other is more at risk of the kind of safety problems Rohin related to goal-directedness.

adamShimi's Shortform

Thanks for the idea! I agree that it probably helps, and it solves my issue about the other person's state of knowledge.

That being said, I don't feel like this solves my main problem: it still feels to me like pushing too hard. Here the reason is that I post on a small venue (rarely more than a few posts per day) that I know the people I'm asking for feedback read regularly. So if I send them such a message at the moment I publish, it feels a bit like I'm saying that they wouldn't read and comment on it without that, which is a bit of a problem.

(I'm interested to know if researchers on the AF agree with that feeling, or if it's just a weird thing that only exists in my head. When I try to imagine being at the other end of such a message, I see myself as annoyed, at the very least.)

adamShimi's Shortform

curious for more detail on “what feels wrong about explicitly asking individuals for feedback after posting on AF” similar to how you might ask for feedback on a gDoc?

My main reason is steve's first point:

  1. Maybe there's a sense in which everyone has already implicitly declared that they don't want to give feedback, because they could have if they wanted to, so it feels like more of an imposition.

Asking someone for feedback on work posted somewhere I know they read feels like I'm whining about not having feedback (and maybe whining about them not giving me feedback). On the other hand, sending a link to a gdoc feels like "I thought that could interest you", which seems better to me.

There's also the issue that when the work is public, you don't know if someone has read it and not found it interesting enough to comment, not read it but planned to do it later, or read it and planned to comment later. Depending on which case they are in, me asking for feedback can trigger even more problems (like them being annoyed because they don't feel I gave them the time to do it by themselves). Whereas when I share a doc, there's only one possible state of knowledge for the other person (not having read the doc and not knowing it exists).

Concerning steve's second point:

2. Maybe it feels like "I want feedback for my own personal benefit" when it's already posted, as opposed to "I want feedback to improve this document which I will share with the community" when it's not yet posted. So it feels more selfish, instead of part of a community project. For that problem, maybe you'd want to frame it as "I'm planning to rewrite this post / write a follow-up to this post / give a talk based on this post / etc., can you please offer feedback on this post to help me with that?" (Assuming that's in fact the case, of course, but most posts have follow-up posts...)

I don't feel that personally. I basically take a stance of trying to do things I feel are important for the community, so if I publish something, I don't feel like feedback is for my own benefit. Indeed, I would gladly have only constructive negative feedback for my posts instead of no feedback at all; this is pretty bad personally (in terms of ego, for example) but great for the community, because it puts my ideas to the test and forces me to improve them.

Now I want to go back to Raemon.

I think there are a number of features LW could build to improve this situation

Agreed. My diagnosis of the situation is that to ensure consistent feedback, it probably needs to be at least slightly an obligation. The two examples of processes producing valuable feedback that I have in mind are gdoc comments and peer review for conferences/journals. In both cases, the reviewer has an obligation to do the review (a social obligation for the gdoc, because it was shared explicitly with you, and a community obligation for the peer review, because that's part of your job and the conference/journal editor asked you to review the paper). Without this element of obligation, it's far too easy to not give feedback, even when you might have something valuable to say!

Note that I'm part of the problem: this week, I spent a good couple of hours commenting in detail on a 25-page technical gdoc for a fellow researcher who asked me, but I haven't published decent feedback on the AF for quite some time. And when I look at my own internal process, this sense of commitment and obligation is a big reason why. (I ended up liking the work, but even that wouldn't have ensured that I commented on it to the extent that I did.)

This makes me think that a "simple" solution could be a review process on the AF. Now, I've been talking about a proper review process with Habryka among others; getting a clearer idea of how we should judge research for such a review is a big incentive for the trial run of a review that I'm pushing for (and I'm currently rewriting a post about a framing of AI Alignment research that I hope will help a lot for that).

Yet after thinking about it further yesterday and today, it might be possible to split the establishment of such a review process for the AF into two steps.

  • Step 1: Everyone with a post on the AF can ask for feedback. This is not considered peer review, just the sort of thing that a fellow researcher would say if you shared the post with them as a gdoc. On the other hand, a group of people (working researchers, let's say) volunteer to give such feedback at a given frequency (once a week, for example).
    After that, we probably only need to find a decent enough way to order requests for feedback (prioritizing posts with no feedback, prioritizing people without the network to ask personally for feedback...), and it could be up and running.
  • Step 2: Establish a proper peer-review system, where you can ask for peer review on a post, and if the review is good enough, it gets a "peer-reviewed" tag that is managed by admins only. Doing this correctly will probably require standards for such a review, a stronger commitment by reviewers (and so finding more incentives for them to participate), and additional infrastructure (code, managing the review process, maybe sending a newsletter?).

In my mind, step 1 is here for getting some feedback on your work, and step 2 is for getting prestige. I believe that both are important, but I'm more starving for feedback. And I also think that doing step 1 could be really fast, and even if it fails, there's no big downside for the AF (whereas fucking up step 2 seems more fraught with bad consequences).


Also another point on the difference in ease of giving feedback on gdocs vs posts: implicitly, almost all shared gdocs come with a "Come at me bro" request. But when I read a post, it's not always clear whether the poster wants me to come at them or not. You also tend to know the people who share gdocs with you a bit better than posters on the AF. So being able to signal "I really want you to come at me" might help, although I doubt it's the complete solution.

adamShimi's Shortform

Right now, the incentives to get useful feedback on my research push me toward the opposite of the policy I would like: publish on the AF as late as I can allow.

Ideally, I would want to use the AF as my main source of feedback, as it's public, is read by more researchers than I know personally, and I feel that publishing there helps the field grow.

But I'm forced to admit that publishing anything on the AF means I can't really send it to people anymore (because the ones I ask for feedback read the AF, so that feels wrong socially), and yet I don't get any valuable feedback 99% of the time. More specifically, I don't get any feedback 99% of the time. Whereas when I ask for feedback directly on a gdoc, I always end up with some useful remarks.

I also feel bad that I'm basically using a privileged policy, in the sense that a newcomer cannot use it.

Nonetheless, because I believe in the importance of my research, and I want to know if I'm doing stupid things or not, I'll keep to this policy for the moment: never ever post something on the AF for which I haven't already got all the useful feedback I could ask for.

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concept in english sentences.

But you didn't actually give an argument for that -- you simply stated it. As a matter of fact, I disagree: it seems really easy for an AGI to misunderstand what I mean when I use English words. To go back to the "fusion power generator", maybe it has a very deep model of such generators that abstracts away most of the concrete implementation details to capture the most efficient way of doing fusion; whereas my internal model of "fusion power generators" has a more concrete form and includes safety guidelines.

In general, I don't see why we should expect the abstraction most relevant for the AGI to be the one we're using. Maybe it uses the same words for something quite different, like how successive paradigms in physics use the same word (electricity, gravity) to talk about different things (at least in their connotations and underlying explanations).

(That makes me think that it might be interesting to see how Kuhn's arguments about the incommensurability of paradigms hold in the context of this problem, as this seems similar.)

Formal Solution to the Inner Alignment Problem

Thanks for sharing this work!

Here's my short summary after reading the slides and scanning the paper.

Because human demonstrators are safe (in the sense of almost never taking catastrophic actions), a model that imitates the demonstrator closely enough should be safe. The algorithm in this paper does that by keeping multiple models of the demonstrator, sampling the top models according to a parameter, and following what the sampled model does (or querying the demonstrator if the sample is "empty"). The probability that this algorithm takes an action that is very unlikely for the demonstrator can be bounded above, and driven down.
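To check my own understanding of that summary, here is a toy sketch of one step of such a procedure. To be clear, this is my paraphrase of the summary above, not the paper's actual pseudocode: the posterior representation, the `alpha` parameter, and the fallback rule are all my assumptions.

```python
import random

# Toy sketch (my paraphrase, NOT the paper's algorithm): keep a posterior
# over candidate demonstrator models, act using a model sampled from the
# posterior if it lies in the top-alpha mass, and query the demonstrator
# otherwise (the "empty" sample case).
def imitate_step(posterior, alpha, demonstrator, observation, rng=random):
    """posterior: dict mapping model -> weight; alpha: trusted mass fraction."""
    total = sum(posterior.values())
    ranked = sorted(posterior.items(), key=lambda kv: -kv[1])
    # Collect the top models covering an alpha fraction of posterior mass.
    top, mass = [], 0.0
    for model, w in ranked:
        top.append((model, w))
        mass += w
        if mass >= alpha * total:
            break
    # Sample a model proportionally to weight over the whole posterior; if
    # the sample lands outside the top set, fall back to a demonstrator query.
    r = rng.uniform(0, total)
    acc = 0.0
    for model, w in ranked:
        acc += w
        if r <= acc:
            if (model, w) in top:
                return model(observation)     # act like the sampled model
            return demonstrator(observation)  # "empty" sample: query instead
    return demonstrator(observation)
```

With `alpha` close to 1, almost every step imitates a sampled model; shrinking `alpha` trades autonomy for more demonstrator queries, which is roughly the knob the bound in the paper seems to turn.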

If this is broadly correct (and if not, please tell me what's wrong), then I feel like this falls short of solving the inner alignment problem. I agree with most of the reasoning regarding imitation learning and its safety when close enough to the demonstrator. But the big issue with imitation learning by itself is that it cannot do much better than the demonstrator. In the event that any other approach to AI can be superhuman, imitation learning would be uncompetitive and there would be a massive incentive to ditch it.

Slide 8 actually points towards a way to use imitation learning to hopefully make a competitive AI: IDA. Yet in this case, I'm not sure that your result implies safety. For IDA isn't a one-shot imitation learning problem; it's many successive imitation learning problems. Even if you limit the drift for one step of imitation learning, the model could drift further and further at each distillation step. (If you think it wouldn't, I'm interested in the argument.)

Sorry if this feels a bit rough. Honestly, the result looks exciting in the context of imitation learning, but I feel it is a very bad policy to present research as solving a major AI Alignment problem when it only does so in a very, very limited setting, one that doesn't feel that relevant to the actual risk.

Almost no mathematical background is required to follow the proofs. We feel our bounds could be made much tighter, and we'd love help investigating that.

This is really misleading for anyone who isn't used to online learning theory. I guess what you mean is that it doesn't rely on more uncommon fields of maths like gauge theory or category theory, but you still use ideas like measures and martingales, which are far from trivial for someone with no mathematical background.

Suggestions of posts on the AF to review

Thanks for the suggestion! It's great to have some methodological posts!

We'll consider it. :)

Suggestions of posts on the AF to review

Thanks for the suggestion!

I didn't know about this post. We'll consider it. :)
