Full-time independent deconfusion researcher in AI Alignment. (Also PhD in the theory of distributed computing.)

If you're interested in some of the research ideas that you see in my posts, know that I keep private docs with the most compressed version of my deconfusion ideas while I'm getting feedback on them. I can give you access if you PM me!

A list of topics I'm currently doing deconfusion on:

  • Goal-directedness for discussing AI Risk
  • Myopic Decision Theories for dealing with deception (with Evan Hubinger)
  • Universality for many alignment ideas of Paul Christiano
  • Deconfusion itself to get better at it
  • Models of Language Models to clarify the alignment issues surrounding them.


Epistemic Cookbook for Alignment
Comprehensive Information Gatherings
Reviews for the Alignment Forum
AI Alignment Unwrapped
Deconfusing Goal-Directedness
Toying With Goal-Directedness
Through the Haskell Jungle

Why Study Physics?

High dimensional world: to find something as useful as e.g. Fourier methods by brute-force guess-and-check would require an exponentially massive amount of search, and is unlikely to have ever happened at all. Therefore we should expect that it was produced by a method which systematically produces true/useful things more often than random chance, not just by guess-and-check with random guessing. (Einstein's Arrogance is saying something similar.)


I don't think this contradicts the hypothesis that "Physicists course-correct by regularly checking their answers". After all, the reason Fourier methods and other tricks kept being used is that they somehow worked a lot of the time. Similarly, I expect (maybe wrongly) that there was a bunch of initial fiddling before they got the heuristics to work decently. If you can't check your answer, the process of refinement that these ideas went through might be harder to replicate.

Physicists have a track record of successfully applying physics-like methods in other fields (biology, economics, psychology, etc). This is not symmetric - i.e. we don't see lots of biologists applying biology methods to physics, the way we see physicists applying physics methods to biology. We also don't see this sort of generalization between most other field-pairs - e.g. we don't see lots of biologists in economics, or vice versa.

The second point sounds stronger than the first, because the first can be explained by the fact that biological systems (for example) are made of physical elements, but not the other way around. So you should expect that biology doesn't have much to say about physics. Still, one could say that it's not obvious physics would have relevant things to say about biology, because of the complexity and the abstraction involved.

Relatedly: I once heard a biologist joke that physicists are like old western gunslingers. Every now and then, a gang of them rides into your field, shoots holes in all your theories, and then rides off into the sunset. Despite the biologist's grousing, I would indeed call that sort of thing successful generalization of the methods of physics.

This makes me wonder if the most important skill of physicists is having strong enough generators to provide useful criticism in a wide range of fields?

Why Study Physics?

Could you write a list of physicists who have such a "gift"? It might be useful for analyzing that specific skill.

Why Study Physics?

I'm wondering about the different types of intuitions in physics and mathematics.

What I remember from prepa (two years after high school where we did the full undergraduate program of maths and physics) was that some people had maths intuition (like me) and some had physics intuition (not me). That's how I recall it, but thinking back on it, there were different types of maths intuitions, which correlated very differently with physics intuition. I had algebra intuition, which means I could often see the way to go about algebraic problems, whereas I didn't have analysis intuition, which was about variations and measures and dynamics. And analysis intuition correlated strongly with physical intuition.

It's also interesting that all your examples of physicists using informal mathematical reasoning successfully ended up being formalized through analysis.

This observation makes me wonder if there are different forms of "informal mathematical reasoning" underlying these intuitions, and how relevant each one is to alignment.

  • An algebra/discrete maths intuition, which is about how to combine parts into bigger things and conversely how to split things into parts, as well as the underlying structure and things like generators. (Note that "the deep theory of addition" discussed recently probably belongs here.)
  • An analysis/physics intuition which is about movement and how a system reacts to different changes.

Also, the distinction becomes fuzzy because there are a lot of tricks which allow one to use one type of intuition to study the objects of the other type (things like analytic methods and inequalities in discrete maths, let's say, or algebraic geometry). Although maybe this is just evidence that people tend to have one sort of intuition, and want to find ways of applying it to everything.

Yudkowsky and Christiano discuss "Takeoff Speeds"

I grimly predict that the effect of this dialogue on the community will be polarization: People who didn't like Yudkowsky and/or his views will like him / his views less, and the gap between them and Yud-fans will grow (more than it shrinks due to the effect of increased dialogue). I say this because IMO Yudkowsky comes across as angry and uncharitable in various parts of this dialogue, and also I think it was kinda a slog to get through & it doesn't seem like much intellectual progress was made here.

Strongly agree with that.

Since you agree with Yudkowsky, do you think you could strongman his position?

LCDT, A Myopic Decision Theory

Yeah, that's a subtle point.

Here we're stressing the difference between the simulator's action and the simulation's (HCH or Evan in your example) action. Obviously, if the simulation is non-myopic, then the simulation's action will depend on the long-term consequences of this action (for the goals of the simulation). But the simulator itself only cares about answering the question "what would the simulation do next?". Once again, that might mean that the simulator will think about the long-term consequences of the simulation's action on the simulation's goals, but the simulator doesn't have this goal: such reasoning is completely instrumental to its task of simulation. And more generally, the simulator isn't choosing its next action to make future actions easier to predict (like a predict-o-matic would do).

That might sound like nitpicking, but this means something important: the simulator itself has no reason to be deceptive. It might output actions (as its best guess of what the simulation would do) that are deceptive, but only if the simulation itself is deceptive.

What does that give us?

  • If we manage to point the simulation at something that is non-deceptive yet powerful, the myopic simulator will not introduce deception into the mix, whereas doing IRL on the simulation and then optimizing for the reward would probably lead to Goodharting and deception because of mesa-optimizers.
    • Here Evan would probably say that HCH sounds like the right non-deceptive simulation; I'm less convinced that HCH will not be deceptive.
    • An obvious question is: why not do imitation learning? Well, I expect (and I believe Evan expects too) that simulation is strictly more powerful than imitation, because it can make models of non-observed or ideal processes that we point it at.
  • If instead of having a single simulation, we have a simulator that can deal with a range of simulations (how some researchers and I are currently thinking about GPT-3 and LMs), then myopia lets you use the simulator to detect deception in the simulations/change between simulations/test different options, in a way that a "deceptive agent acting like a simulator" would not (because it would tamper with your experiments).
    • A tangent, but there's also an argument that I'm writing up about why we should not expect models which simulate many different processes to be agents — spoiler: agents are bad at being simulators.
  • Even in the worst-case scenario where we make a simulator that simulates a deceptive agent, the simulator has no incentive to hide its "internal thoughts" about how the simulation works. That's a very small comfort, but it might make interpretability easier because there is no adversarial pressure against it.

Ngo and Yudkowsky on AI capability gains

Thanks for giving more details about your perspective.

Your comment is phrased as if the object-level refutations have been tried, while conveying the meta-level intuitions hasn't been tried. If anything, it's the opposite: the sequences (and to some extent HPMOR) are practically all content about how to think, whereas Yudkowsky hasn't written anywhere near as extensively on object-level AI safety.

It's not clear to me that the sequences and HPMOR are good pointers for this particular approach to theory building. I mean, I'm sure there are posts in the sequences that touch on that (Einstein's Arrogance is an example I already mentioned), but I expect that they only talk about it in passing and obliquely, and that such posts are spread all over the sequences. Plus, the fact that Yudkowsky said that there was a new subsequence to write leads me to believe that he doesn't think the information is clearly stated already.

So I don't think you can really treat the current confusion as evidence that explaining how that kind of theory would work doesn't help, given that such an explanation isn't readily available in a form that I or anyone reading this can access, AFAIK.

This has been valuable for community-building, but less so for making intellectual progress - because in almost all domains, the most important way to make progress is to grapple with many object-level problems, until you've developed very good intuitions for how those problems work. In the case of alignment, it's hard to learn things from grappling with most of these problems, because we don't have signals of when we're going in the right direction. Insofar as Eliezer has correct intuitions about when and why attempted solutions are wrong, those intuitions are important training data.

Completely agree that these intuitions are important training data. But your whole point in other comments is that we want to understand why we should expect these intuitions to differ from apparently bad/useless analogies between AGI and other stuff. And some explanation of where these intuitions come from could help with evaluating them, all the more so because Yudkowsky has said that he could write a sequence about the process.

By contrast, trying to first agree on very high-level epistemological principles, and then do the object-level work, has a very poor track record. See how philosophy of science has done very little to improve how science works; and how reading the sequences doesn't improve people's object-level rationality very much.

This sounds to me like a strawman of my position (which might be my fault for not explaining it well).

  • First, I don't think explaining a methodology is a "very high-level epistemological principle", because it lets us concretely pick apart and criticize the methodology as a truth-finding method.
  • Second, the object-level work has already been done by Yudkowsky! I'm not saying that some outside-of-the-field epistemologist should ponder really hard about what would make sense for alignment without ever working on it concretely and then give us their teaching. Instead I'm pushing for a researcher who has built a coherent collection of intuitions and has thought about the epistemology of this process to share the latter to help us understand the former.
  • A bit similar to my last point, I think the correct comparison here is not "philosophers of science outside the field helping the field", which happens but is rare as you say, but "scientists thinking about epistemology for very practical reasons". And given that the latter is, from my understanding, what started the scientific revolution and was a common activity of all scientists until the big paradigms were established (in physics and biology at least) in the early 20th century, I would say there is a good track record here.
    (Note that this is more your specialty, so I would appreciate evidence that I'm wrong in my historical interpretation here.)

I model you as having a strong tendency to abstract towards higher-level discussion of epistemology in order to understand things. (I also have a strong tendency to do this, but I think yours is significantly stronger than mine.)

Hum, I certainly like a lot of epistemic stuff, but I would say my tendencies to use epistemology are almost always grounded in concrete questions, like understanding why a given experiment tells us something relevant about what we're studying.

I also have to admit that I'm kind of confused, because I feel like you're consistently using the sort of epistemic discussion that I'm advocating for when discussing predictions and what gives us confidence in a theory, and yet you don't think it would be useful to have a similar-level model of the epistemology used by Yudkowsky to make the sort of judgment you're investigating?

I expect that there's just a strong clash of intuitions here, which would be hard to resolve. But one prompt which might be useful: why aren't epistemologists making breakthroughs in all sorts of other domains?

As I wrote above, I don't think this is a good prompt, because we're talking about scientists using epistemology to make sense of their own work there.

Here is an analogy I just thought of: I feel that in this discussion, you and Yudkowsky are talking about objects which have different types. So when you're asking questions about his model, there's a type mismatch. And when he's answering, having noticed the type mismatch, he's trying to find what to ascribe it to (his answer has been quite consistently modest epistemology, which I think is clearly incorrect). Tracking the confusion does tell you some information about the type mismatch, and is probably part of the process to resolve it. But having his best description of his type (given that your type is quite standardized) would make this process far faster, by helping you triangulate the differences.

Ngo and Yudkowsky on AI capability gains

I'm honestly confused by this answer.

Do you actually think that Yudkowsky having to correct everyone's object-level mistakes all the time is strictly more productive and will lead faster to the meat of the deconfusion than trying to state the underlying form of the argument and theory, and then adapting it to the object-level arguments and comments?

I have trouble understanding this, because for me the outcome of the first one is that no one gets it, he has to repeat himself all the time without making the debate progress, and this is one more giant hurdle for anyone trying to get into alignment and understand his position. It's unclear whether the alternative would solve all these problems (as you quote from the preface of the Sequences, learning the theory is often easier and less useful than practicing), but it still sounds like a powerful accelerator.

There is no dichotomy of "theory or practice"; we probably need both here. And based on my own experience reading the discussion posts and the discussions I've seen around these posts, the object-level refutations have not been particularly useful forms of practice, even if they're better than nothing.

Ngo and Yudkowsky on AI capability gains

Good point, I hadn't thought about that one.

Still, I have to admit that my first reaction is that this particular sequence seems quite uniquely in a position to increase the quality of the debate and of alignment research singlehandedly. Of course, maybe I only feel that way because it's the only one of the long list that I know of. ^^

(Another possibility I just thought of is that maybe this subsequence requires a lot of new preliminary subsequences, such that the work is far larger than you could expect from reading the words "a subsequence". Still sounds like it would be really valuable though.)

Ngo and Yudkowsky on AI capability gains

That's a really helpful comment (at least for me)!

But at least step one could be saying, "Wait, do these two kinds of ideas actually go into the same bucket at all?"

I'm guessing that a lot of the hidden work here and in the next steps would come from asking stuff like:

  • Do I need to alter the bucket for each new idea, or does it instead fit in its current form each time?
  • Does the mental act of finding that an idea fits into the bucket remove some confusion and clarify things, or is it just a mysterious answer?
  • Does the bucket become simpler and more elegant with each new idea that fits in it?

Is there some truth in this, or am I completely off the mark?

It seems like the sort of thing that would take a subsequence I don't have time to write

You obviously can do whatever you want, but I find myself confused at this idea being discarded. Like, it sounds exactly like the antidote to so much confusion around these discussions and your position, such that if that was clarified, more people could contribute helpfully to the discussion, and either come to your side or point out non-trivial issues with your perspective. Which sounds really valuable for both you and the field!

So I'm left wondering:

  • Do you disagree with my impression of the value of such a subsequence?
  • Do you think it would have this value but are spending your time doing something more valuable?
  • Do you think it would be valuable but really don't want to write it?
  • Do you think it would be valuable, you could in principle write it, but probably no one would get it even if you did?
  • Something else I'm failing to imagine?

Once again, you do what you want, but I feel like this would be super valuable if there were any way of making that possible. That's also completely relevant to my own focus on the different epistemic strategies used in alignment research, especially because we don't have access to empirical evidence or trial and error at all for AGI-type problems.

(I'm also quite curious if you think this comment by dxu points at the same thing you are pointing at)

How To Get Into Independent Research On Alignment/Agency

Giving a perspective from another country that is far more annoying in administrative terms (France), grant administration can be a real plus. I go through a non-profit in France, and they can take care of the taxes and the declarations, which would be a hassle. In addition, here being self-employed is really bad for many things you might want to do (rent a flat, get a loan, pay for unemployment funds), and having a real contract helps a lot with that.
