All of indecks's Comments + Replies

> Important detail: the whitelist is only with respect to transitions between objects, not the objects themselves!

I understand the technical and semantic distinction here, but I'm not sure I understand the practical one, when it comes to actual behaviour and results. Is there a situation you have in mind where the two approaches would be notably different in outcome?

> Something I'm not sure about is whether the described dissimilarity will match up with our intuitive notions of dissimilarity. I think it's doable, whether via my formulati... (read more)

TurnTrout · 5y
Can you clarify what you mean by whitelisting objects? Would we only be OK with certain things existing, or coming into existence (i.e., whitelisting an object effectively whitelists all means of getting to that object), or something else entirely?

I hadn't thought of this, actually! So, part of me wants to pass this off to the utility function also caring about not imposing retrieval costs on itself, because if it isn't aligned enough to somewhat care about the things we do, we might be out of luck. That is, whitelisting isn't sufficient to align a wholly unaligned agent - just to make states we don't want harder to reach. If it has values orthogonal to ours, misplaced items might be the least of our concerns. Again, I think this is a valid consideration, and I'm going to think about it more!

Certainly more complex solutions can do better, but I imagine that the work required to formally verify an aligned system is a quadratic function of how many moving parts there are (that is, part n must play nice with all n−1 previous parts).

My current thoughts are that a rich enough latent space should also pick up unknown objects and their shifts, but this would need testing. Also, wouldn't it be more likely that the wrong functions are extrapolated for new objects and we end up missing out on even more opportunities?
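As a rough check on the quadratic claim above (my arithmetic, not part of the original comment): if verifying part n means checking it against each of the n−1 parts already in place, then for N parts the total number of pairwise checks is

$$\sum_{n=1}^{N}(n-1)=\frac{N(N-1)}{2}=O(N^2),$$

which indeed grows quadratically in the number of moving parts.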

I'm not especially familiar with all the literature involved here, so forgive me if this is somehow repetitive.

However, I was wondering whether having two lists might be preferable. First, there would be non-whitelisted objects (do not interfere with these in any way). Second, there could be objects which are fine to manipulate but must retain functional integrity (for instance, a book can be freely manipulated under most circumstances; however, it cannot be moved so that it becomes out of reach or illegible, and should not be moved or obstructed while... (read more)
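A minimal sketch of how this two-list idea might be represented. Everything here is illustrative: the object names, the property sets, and the `action_allowed` check are hypothetical, not anything proposed in the thread.

```python
# Hypothetical sketch of the two-list idea; names and properties are illustrative only.

DO_NOT_TOUCH = {"load-bearing wall", "patient"}    # first list: never interfere

FUNCTIONAL_INTEGRITY = {                           # second list: may manipulate,
    "book": {"reachable", "legible"},              # but these properties must survive
    "door": {"openable"},
}

def action_allowed(obj, properties_after):
    """Permit an action on `obj` only if it is not off-limits and the agent
    predicts that all of the object's required functional properties still hold."""
    if obj in DO_NOT_TOUCH:
        return False
    required = FUNCTIONAL_INTEGRITY.get(obj, set())
    return required.issubset(properties_after)

# Moving a book somewhere it can still be reached and read would pass:
# action_allowed("book", {"reachable", "legible", "on shelf"})  -> True
# Moving it out of reach would not:
# action_allowed("book", {"legible"})                           -> False
```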

TurnTrout · 5y
Hey, thanks for the ideas!

Important detail: the whitelist is only with respect to transitions between objects, not the objects themselves! "Out of reach" is indexical, and it's not clear how (and whether) to even have whitelisting penalize displacing objects. Stasis notwithstanding, many misgivings we might have about an agent being able to move objects at its leisure should go away if we can say that these movements don't lead to non-whitelisted transitions (e.g., putting unshielded people in space would certainly lead to penalized transitions).

I think that latent space whitelisting actually captures this kind of permissions-based granularity. As I imagine it, a descriptive latent space would act as an approximation to [thingspace](https://www.lesswrong.com/posts/WBw8dDkAWohFjWQSk/the-cluster-structure-of-thingspace). Something I'm not sure about is whether the described dissimilarity will match up with our intuitive notions of dissimilarity. I think it's doable, whether via my formulation or some other one.

One of the roles I see whitelisting attempting to fill is that of tracing a conservative convex hull inside the outcome space, locking us out of some good possibilities but (hopefully) many more bad ones. If we get into functional values, that's elevating the complexity from "avoid doing unknown things to unknown objects" to "learn what to do with each object". We aren't trying to build an entire utility function - we're trying to build a sturdy, conservative convex hull, and it's OK if we miss out on some details. I have a heuristic that says that the more pieces a solution has, the less likely it is to work.

This post's discussion implicitly focuses on how whitelisting interacts with more advanced agents for whom we probably wouldn't need to flag things like this. I think if we can get it robustly recognizing objects in its model and then projecting them into a latent space, that would suffice.
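For concreteness, here is a minimal sketch of how penalizing non-whitelisted transitions via latent-space dissimilarity could look. It assumes some embedding step has already mapped the objects in the agent's model into latent vectors; the specific distance and aggregation rules are illustrative guesses, not necessarily the post's exact formulation.

```python
import numpy as np

# A "transition" is a (before, after) pair of latent vectors for one object that
# the agent's model believes changed class during a time step.

def transition_distance(t1, t2):
    """Dissimilarity between two transitions: compare the 'before' objects and
    the 'after' objects in latent space and sum the two distances."""
    (b1, a1), (b2, a2) = t1, t2
    return float(np.linalg.norm(b1 - b2) + np.linalg.norm(a1 - a2))

def whitelisting_penalty(observed_transitions, whitelist):
    """Charge each observed transition by how far it is from the closest
    whitelisted transition, so transitions similar to whitelisted ones are
    nearly free, while novel ones (e.g. person -> unshielded person in vacuum)
    cost a lot. Assumes the whitelist is non-empty."""
    return sum(
        min(transition_distance(obs, w) for w in whitelist)
        for obs in observed_transitions
    )
```

An agent would then subtract something like `Ī» * whitelisting_penalty(...)` from its utility each time step, with the latent space doing the work of deciding which unseen transitions count as "close enough" to whitelisted ones.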

I'm wondering what your argument is that insisting on the existence of moral facts is *not* a (self-)deceptive way of "picking norms based on what someone prefer[s]" in such a way as to make them appear objective, rather than arbitrary.

Even supposing moral facts do exist, it does not follow that humans could or would know them, correct? Therefore, the actual implementation would still fall back on preferences.