My current research interests:
- alignment in systems which are complex and messy, composed of both humans and AIs
- actually good mathematized theories of cooperation and coordination
- active inference
- bounded rationality

Research at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Research fellow, Future of Humanity Institute, Oxford University.

Previously I was a researcher in physics, studying phase transitions, network science and complex systems.


Translating it to my ontology:

1. Training against explicit deceptiveness trains some "boundary-like" barriers which will make simple deceptive thoughts labelled as such during training difficult
2. Realistically, advanced AI will need to run some general search processes. The barriers described at step 1. are roughly isomorphic to "there are some weird facts about the world which make some plans difficult to plan" (e.g. similar to such plans being avoided because they depend on extremely costly computations).
3. Given some goal and strong enough capabilities, it seems likely the search will find unforeseen ways around the boundaries.

(the above may be different from what Nate means)

My response:

1. It's plausible people are missing this but I have some doubts.
2. How I think you get actually non-deceptive powerful systems seems different: deception is a relational property between the system and the human, so the "deception" thing can be explicitly understood as a negative consequence for the world, and avoided using "normal" planning cognition.
3. Stability of this depends on what the system does with internal conflict.
4. If the system stays in some corrigibility/alignment basin, this should be stable upon reflection / various meta-cognitive modifications. Systems in the basin resist self-modifications toward being incorrigible.


I don't think in this case the crux/argument goes directly through "the powerful alignment techniques" type of reasoning you describe in the "hopes for alignment".

The crux for your argument is that the AIs, somehow,
a. want to,
b. are willing to, and
c. are able to coordinate with each other.

Even assuming AIs "wanted to", for your case to be realistic they would need to be willing to, and able to, coordinate.

Given that, my question is, how is it possible AIs are able to trust each other and coordinate with each other? 

My view here is that basically all the proposed ways AIs could coordinate and trust each other that I've seen are dual use, and would also aid oversight/alignment. To take an example from your post: opening their own email accounts and emailing each other. OK, in that case, can I just pretend to be an AI and ask about the plans? Will the overseers see the mailboxes as well?

Not sure if what I'm pointing to is clear, so I'll try another way.

There is something like "how objectively difficult it is to create trust between AIs" and "how objectively difficult alignment is". I don't think these parameters of the world are independent, and I do think that stories which treat them as completely independent are often unrealistic. (Or, at least, they implicitly assume there are some things which make it differentially easy to coordinate a coup relative to making something aligned or transparent.)

Note that this belief about correlation does not depend on specific beliefs about how easy powerful alignment techniques are.

I would expect the "expected collapse to waluigi attractor" either not to be real or to mostly go away with training on more data from conversations with "helpful AI assistants".

How this works: currently, the training set does not contain many "conversations with helpful AI assistants". "ChatGPT" is likely mostly not the protagonist in the stories it is trained on. As a consequence, GPT is hallucinating "what conversations with helpful AI assistants may look like" and ... this is not a strong localization.

If you train on data where "the ChatGPT character" 
- never really turns into waluigi
- corrects to luigi when experiencing small deviations
...GPT would learn that apart from "human-like" personas and narrative fiction there is also this different class of generative processes, "helpful AI assistants", and the human narrative dynamics generally do not apply to them. [1]

This will have other effects, which won't necessarily be good (like GPT becoming more self-aware), but will likely fix most of the waluigi problem.

From an active inference perspective, the system would get stronger beliefs about what it is, making it more certainly the being it is. If the system "self-identifies" this way, it creates a pretty deep basin (cf. humans). [2]

[1] From this perspective, the fact that the training set is now infected with Sydney is annoying.

[2] If this sounds confusing ... sorry, I don't have a quick and short better version at the moment.

Seems a bit like a too-general counterargument against more abstracted views?

1. Hamiltonian mechanics is almost an unfalsifiable tautology
2. Hamiltonian mechanics is applicable to both atoms and stars. So it’s probably a bad starting point for understanding atoms
3. It’s easier to think of a system of particles in 3d space as a system of particles in 3d space, and not as a Hamiltonian mechanics system in an unintuitive space
4. Likewise, it’s easier to think of systems involving electricity using a simple scalar potential and not to bring in the Hamiltonian
5. It’s very important to distinguish momenta and positions, and Hamiltonian mechanics textbooks make it more confusing
7. Lumping together positions and momenta is very un-natural. Position is where the particle is, momentum is where it moves


A highly compressed version of what the disagreements are about in my ontology of disagreements about AI safety...

  • crux about continuity; here GA mostly has the intuition "things will be discontinuous" and this manifests in many guesses (phase shifts, new ways of representing data, possibility to demonstrate overpowering the overseer, ...); Paul assumes things will be mostly continuous, with a few exceptions which may be dangerous
    • this seems similar to typical cruxes between Paul and e.g. Eliezer (also, in my view this is actually a decent chunk of the disagreements: my model of Eliezer predicts Eliezer would actually update toward more optimistic views if he believed "we will have more tries to solve the actual problems, and they will show in a lab setting")
  • possible crux about x-risk from the broader system (e.g. AI powered cultural evolution); here it's unclear who is exactly where in this debate
    • I don't think there is any neat public debate on this, but I usually disagree with Eliezer's and similar "orthodox" views about the relative difficulty & expected neglectedness (I expect narrow single ML system "alignment" to be difficult but solvable and likely solved by default, because of the incentives to do so; whole-world-alignment / multi-multi to be difficult and with bad results by default)

(there are also many points of agreement)

I'm not really convinced by the linked post 
- the chart is from someone selling financial advice, and the illustrated Elo ratings of chess programs differ from e.g. Wikipedia ("Stockfish estimated Elo rating is over 3500") (maybe it's just old?)
- linked interview in the "yes" answer is from 2016
- Elo ratings are relative to other players; it is not trivial to directly compare cyborgs and AIs: engine ratings are usually computed in tournaments where programs run under the same hardware limits
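
To make the point about relative ratings concrete, here is a minimal sketch using the standard Elo expected-score formula. The ratings in it are hypothetical numbers chosen for illustration, not measurements of any engine or player. It shows that only the rating difference within one pool matters, so the absolute numbers from two separate pools cannot be compared directly.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Hypothetical pool 1: two engines rated against each other.
engine_pool = (3500.0, 3300.0)

# Hypothetical pool 2: the same strength gap, but every rating is
# shifted down by 1000 points (e.g. a pool anchored differently).
shifted_pool = (2500.0, 2300.0)

# Both pools predict the same result, because only the 200-point
# difference enters the formula, not the absolute level.
print(expected_score(*engine_pool))
print(expected_score(*shifted_pool))
```

Both calls print the same expected score (roughly 0.76 for a 200-point gap), which is why a rating earned in an engine-only tournament says little by itself about a rating earned in a different pool.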

In summary, in my view, in something like "correspondence chess" the limit clearly is "AIs ~ human+AI teams" / "human contribution is negligible" ... the human can just do what the engine says.

My guess is the current state is: you could compensate for what the human contributes to the team with just more hardware (i.e. instead of the top human part of the cyborg, spending $1M on compute would get you better results). I'd classify this as being in the AI period, for most practical purposes.

Also... as noted by Lone Pine, it seems the game itself becomes somewhat boring with increased power of the players, mostly ending in draws.

Yes, the non-stacking issue in the alignment community is mostly due to the nature of the domain.

But it is also partly due to the LessWrong/AF culture and some rationalist memes. For example, if people had stacked on Friston et al., the understanding of agency and predictive systems (now called "simulators") in the alignment community could have advanced several years faster. However, people seem to prefer reinventing stuff and formalizing their own methods. It's more fun... but also more karma.

In conventional academia, researchers are typically forced to stack. If progress is in principle stackable, and you don't do it, it won't be published. This means that even if your reinvention of a concept is slightly more elegant or intuitive to you, you still need to stack. This seems to go against what's fun: I don't think I know any researcher who would be really excited about literature reviews and prefer them over thinking and writing up their own ideas. In the absence of incentives for stacking ... or actually the presence of incentives against stacking ... you get a lot of non-stacking AI alignment research.


Thanks for the comment. I hadn't noticed your preprint before your comment, but it's probably worth noting I described the point of this post in a Facebook post on 8th Dec 2022; this LW/AF post is just a bit more polished and referenceable. As your paper had zero influence on writing this, and the content predates your paper by a month, I don't see a clear case for citing your work.

Mostly agree - my gears-level model is the conversations listed tend to hit Limits to Legibility constraints, and marginal returns drop to very low.

For people interested in something like "Double Crux" on what's called here "entrenched views", in my experience what has some chance of working is getting as much understanding as possible into one mind, and then attempting to match the ontologies and intuitions. (I had some success with this on "Drexlerian" vs "MIRIesque" views.)

The analogy I had in mind is not so much in the exact nature of the problem, but in the aspect that it's hard to make explicit, precise models of such situations in advance. In the case of nukes, consider the fact that the smartest minds of the time, like von Neumann or Feynman, spent a decent amount of time thinking about the problems, had clever explicit models, and were wrong; in the case of von Neumann, to the extent that if the US had followed his advice, it would have launched nuclear armageddon.