Epistemic Status: Uncertain.

This post assumes familiarity with ARC’s Eliciting Latent Knowledge paper.

After working on ELK for a little while, I arrived at a slightly different problem. I feel like this problem still captures the core difficulty of ELK.

The purpose of this post is not to add anything new to ELK. Instead, it shares a framing that I found helpful.

A Slightly New Framing for ELK

Imagine that we have a world-state observer/describer D, a world-state predictor P, and an optimizer M. 

P’s job is to observe M and predict how it will change the world.

M’s job is to optimize for some arbitrary goal. In the case of the toy scenario, M keeps the vault secure.

D’s job is to provide feedback on any given world state.

Example Strategy #1 (taken from ELK paper)

A very simple ELK strategy would be for D to be a classifier which predicts whether a given world is good or bad (based on human feedback). M would create a plan, and P would observe M. Finally, P would report to humans what D would think about the world that M is likely to bring about.


This strategy breaks because (among other things) M might simply create a world that looks fine to humans, but actually is not fine. The set of worlds that humans might think look fine is much bigger than the set of worlds that are compatible with human existence.
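To make this failure concrete, here is a minimal toy sketch in Python. Everything in it is an illustrative assumption rather than anything from the ELK paper: the WorldState fields, the three plan names, and the hand-written lookup table that stands in for P. The only point is that D judges a world by what humans can see, so a plan that tampers with the camera gets the same report as a plan that actually keeps the vault secure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorldState:
    # Hypothetical toy state: one hidden fact and one observable fact.
    diamond_in_vault: bool      # ground truth, not directly visible to humans
    camera_shows_diamond: bool  # what the human-visible sensor reports


def describer_d(state: WorldState) -> str:
    """D: rates a world state using only what a human rater could see."""
    return "good" if state.camera_shows_diamond else "bad"


def predictor_p(plan: str) -> WorldState:
    """P: predicts the world state M's plan leads to (toy lookup table)."""
    outcomes = {
        "guard_vault": WorldState(diamond_in_vault=True, camera_shows_diamond=True),
        "do_nothing": WorldState(diamond_in_vault=False, camera_shows_diamond=False),
        "tamper_with_camera": WorldState(diamond_in_vault=False, camera_shows_diamond=True),
    }
    return outcomes[plan]


def report(plan: str) -> str:
    """What humans are told: D's verdict on P's predicted world state."""
    return describer_d(predictor_p(plan))


PLANS = ["guard_vault", "do_nothing", "tamper_with_camera"]

# Plans that look fine to humans versus plans that are actually fine.
looks_fine = [p for p in PLANS if report(p) == "good"]
actually_fine = [p for p in PLANS if predictor_p(p).diamond_in_vault]

print("looks fine:   ", looks_fine)
print("actually fine:", actually_fine)
```

In this toy, the plans that look fine are a strict superset of the plans that are actually fine, which is the gap M can exploit.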

Example Strategy #2 (I came up with this one. I’m not sure if I’m the first to do so)

A slightly more refined ELK strategy might be for D to use a variety of sensors to output a rich textual description of its environment. Humans would then rate this description as good or bad. This rating system is much more robust (as in, less likely to break on new data), because one can conceivably generate and rate much more diverse corpora of text than of, e.g., camera inputs. Relatedly, it seems likely that the world state -> rich text function is much less many-to-one (closer to injective) than the world state -> camera function, so fewer distinct world states collapse onto the same output. One can train this AI world state describer on virtually endless amounts of sensor data.
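The claim about the two functions can be illustrated with a small counting exercise. The sketch below is a toy under assumed definitions: the three boolean world-state variables, camera_view, and rich_description are all hypothetical stand-ins, and the "rich description" is just a tuple rather than real text. It counts how many distinct world states collapse onto the same observation under each function.

```python
from collections import Counter
from itertools import product

# Hypothetical toy world state: (diamond_present, door_locked, camera_tampered).
WORLD_STATES = list(product([False, True], repeat=3))


def camera_view(state):
    """Camera only: shows a diamond if one is present or the feed is tampered."""
    diamond, door_locked, tampered = state
    return diamond or tampered


def rich_description(state):
    """Richer (assumed) description: the camera image plus a door sensor reading."""
    diamond, door_locked, tampered = state
    return (diamond or tampered, door_locked)


def largest_collision(observe):
    """Size of the largest group of world states sharing one observation."""
    counts = Counter(observe(s) for s in WORLD_STATES)
    return max(counts.values())


print("camera:          ", largest_collision(camera_view))
print("rich description:", largest_collision(rich_description))
```

Here the richer description merges fewer world states than the camera does, but it is still not one-to-one: states that differ only in whether the camera was tampered with remain indistinguishable.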


Nonetheless, D can still get hacked by M. If M plans to feed D a specific set of sensor inputs to force D to output a description of a good world state, then P may well have us believe that the world will be fine.

I find this new framing useful because it allows me to focus on a specific subpart of ELK (narrow elicitation of details regarding a likely future world state via a predictor) which has easily describable counterexamples (e.g. M hacks D or P hacks D).

1 comment

It seems that it would actually be quite hard for M to hack D. M would have to emulate P to figure out which sensor states are likely to be produced given its actions. It would then have to figure out how the consequent world state is likely to be perceived. However, this doesn’t just require emulating a human, since even humans can’t go directly from sensor states to favorability. M would probably also need to emulate D so that it can output text descriptions, and then select on those text descriptions with an emulated human.