

Dumb question alert:

In the appendix "Details for penalizing depending on “downstream” variables", I'm not able to wrap my head around what we can expect the reporter to learn -- if anything at all -- seeing that it has no dependency on the inputs (elsewhere it is dependent on z sampled from the posterior).

Specifically, the only call to the reporter (in the function reporter_loss in this section) contains no information (about before, action, after) from the predictor at all:

answer = reporter(question, ε, θ_reporter)

(unless "question" includes some context from the current (before, action, after) being considered, which I'm assuming is not the case)

My dumb question then is:

-- Why would this reporter be performant in any way? 

My reasoning: For a given question Q (say, "Is the diamond in the room?") the dataset might contain some "Yes" answers and some "No" answers, but without the context we're training the reporter to map noise that is independent of the context to the answer. For a fixed question Q and a fixed realization of the noise RV, the best the reporter can do is mirror the answer statistics in the data (e.g. be uniformly uncertain if "Yes" and "No" are equally common). Since the noise is independent of the context, this holds for every noise value.
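To make the reasoning above concrete, here is a toy sketch (not the reporter from the ELK report). It assumes a single fixed question, a scalar noise input eps, and a hypothetical logistic "reporter" sigmoid(w * eps + b) trained with log loss on answers drawn independently of eps. The function name, the 0.7 base rate, and all hyperparameters are illustrative assumptions.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_noise_only_reporter(n=1000, base_rate=0.7, steps=500, lr=0.5, seed=0):
    """Fit p(yes | eps) = sigmoid(w * eps + b) by gradient descent on log loss.

    The labels are drawn independently of eps, mimicking a reporter that
    sees only noise and no (before, action, after) context.
    Returns (w, b, empirical fraction of "yes" labels).
    """
    rng = random.Random(seed)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1.0 if rng.random() < base_rate else 0.0 for _ in range(n)]
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for e, y in zip(eps, ys):
            err = sigmoid(w * e + b) - y  # derivative of log loss w.r.t. the logit
            grad_w += err * e
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b, sum(ys) / n


if __name__ == "__main__":
    w, b, rate = fit_noise_only_reporter()
    print(f"w = {w:.3f}, sigmoid(b) = {sigmoid(b):.3f}, empirical yes-rate = {rate:.3f}")
```

Under these assumptions the weight on the noise should stay close to zero and sigmoid(b) should land close to the empirical "Yes" rate, which is the "mirror the statistics in the data" outcome described above.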

Naive thought #2618281828:

Could asking counterfactual questions be a potentially useful strategy to bias the reporter to be a direct translator rather than a human simulator?

Concretely, consider a tuple (v, a, v'), where v := 'before' video, a := 'action' selected by SmartVault or augmented-human or whatever, and v' := 'after' video.

Then, for some new action a', ask the question:

  • "Given (v, a, v'), if action a' had been taken instead of a, would the diamond be in the room?"

(How we collect such data is unclear but doesn't seem obviously intractable.)

I think there's some value here:

  • Answering such a question might not require computation concerning a and v'; if we see these computations being used, we might derive more value from regularizers that penalize downstream variables (which now include the nodes close to a)
  • This might also force the reporter to essentially model (or compress but not indefinitely) the predictor; the reporter now has both a compressed predictor Bayes' net and a human Bayes' net. If we can be confident that the compressed predictor BN is much smaller than the human BN, then doing direct translation within the reporter, i.e. compressed predictor BN inference + translation + read off from human BN might be less expensive than the human simulator alternative, i.e. compressed predictor BN inference + 'translation'/bridging computation + human BN inference.
    • We might find ways of being confident that the compressed predictor BN is small (e.g. by adding decoders at every layer of the reporter that reconstruct v, a or v' and heavily penalizing later-layer decoders)


Is there such a thing as a free action, or an action where e.g. the agent breaks its own legs, when it is not accounted for in the action space of the underlying MDP? That feels like adding a new layer of misspecification (which no doubt is a possibility, and probably deserves deep investigation) orthogonal to reward function misspecification.


It seems as though this kind of circumvention of impact penalties depends on what calculates the penalty. If the environment (or something else external to the agent) is responsible for calculating the penalty and providing it to the agent, then it could do so as if the agent was not constrained (or committing to be constrained) at all. If the agent is internally responsible for computing this impact, it could probably find ways to hack this, similar to wireheading.

So, assuming it's some external entity E that computes the impact penalty, it must have the (dis?)ability to account for the agent's modified action space when making this computation for the agent to successfully reduce the penalty as in the earlier example.

Something agent A does must signal to E that A's action space has changed. If so, we might find ourselves in a situation where A is able to communicate an arbitrarily complex message to E, particularly, something of the form of: "I will never do anything that gives me more than the minimum penalty", or in terms of actions: "I will deterministically follow the policy that gives me the minimum penalty while achieving my goals." E, if it believed A, would not include high penalty actions in the action space of its penalty computations, and thus A would avoid the impact penalty.

Thanks for the detailed response. Meta: It feels good to receive a signal that this was a 'neat question', or in general, a positive-seeming contribution to LW. I have several unexpressed thoughts from fear of not actually creating value for the community.

it sounds like what you want is a reward function that is simple, but somehow analogous to the complexity of human value? And it sounds like maybe the underspecified bit is "you, as a human, have some vague notion that some sorts of value-generation are 'cheating'", and your true goal is "the most interesting outcome that doesn't feel like Somehow Cheating to me?"

This is about correct. A secondary reason for simplicity is to attempt to be computationally efficient (for the environment that generates the reward).

"one cell == an atom"

I can see the case for that, but, again, there's the issue of computational tractability. Actual interesting structures in GoL can be incredibly massive, for example, this Tetris Processor (2,940,928 x 10,295,296 cells). Maybe there's some middle ground between truly fascinating GoL patterns made from atoms and my cell-as-a-planet level abstraction, as suggested by Daniel Kokotajlo in another comment.

How 'good' is it to have a repeating loop of, say, a billion flourishing human lives? Is it better than a billion human lives that happens exactly once and ends?

Wouldn't most argue that, in general, more life is better than less life? (But I see some of my hidden assumptions here, such as "the lives we're talking about here are qualitatively similar, e.g. the repeating life doesn't feel trapped/irrelevant/futile because it is aware that it is repeating".)

I think "moral value" (or, "value") in real life is about the process of solving "what is valuable and how do I get it?"

I don't disagree, but I also think this is sort of outside the scope of finite-space cellular automata.

In this case it might mean that the system optimizes either for true continuous novelty, or the longest possible loop?

Given the constraints of CA, I'm mostly in agreement with this suggestion. Thanks.

I do suspect that figuring out which of your assumptions are "valid" is an important part of the question here.

Yes, I agree. Concretely, to me it looks like 'if I saw X happening in GoL, and I imagine being a sentient being (at some scale, TBD) in that world (well, with my human values), then would I want to live in it?', and translating that into some rules that promote or disincentivise X.

I do think taking this approach is broadly difficult, though. Perhaps it's worth getting a v0.1 out with reward being tied to instantiations of novel states to begin with, and then seeing whether to build on that or try a new approach.
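A minimal sketch of what such a v0.1 could look like, assuming a finite toroidal grid and a reward of 1 whenever the automaton reaches a configuration it has not visited before in the episode. The names `step` and `novelty_rewards`, the wrap-around boundary, and the 0/1 reward are all assumptions made for illustration; no agent actions are modelled here.

```python
from collections import Counter
from typing import FrozenSet, List, Tuple

Cell = Tuple[int, int]
State = FrozenSet[Cell]  # the set of live (row, col) cells


def step(state: State, height: int, width: int) -> State:
    """One Game of Life update on a torus (assumes height and width >= 3)."""
    neighbour_counts = Counter(
        ((r + dr) % height, (c + dc) % width)
        for (r, c) in state
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth on exactly 3 live neighbours, survival on 2 or 3.
    return frozenset(
        cell
        for cell, n in neighbour_counts.items()
        if n == 3 or (n == 2 and cell in state)
    )


def novelty_rewards(initial: State, height: int, width: int, steps: int) -> List[int]:
    """Reward 1 for each step that lands on a not-yet-seen state, else 0."""
    seen = {initial}
    state = initial
    rewards = []
    for _ in range(steps):
        state = step(state, height, width)
        rewards.append(0 if state in seen else 1)
        seen.add(state)
    return rewards


if __name__ == "__main__":
    blinker = frozenset({(2, 1), (2, 2), (2, 3)})
    print(novelty_rewards(blinker, height=5, width=5, steps=4))
```

A period-2 oscillator such as the blinker earns reward only on its first step, so under this scheme short loops stop paying almost immediately, which is the behaviour the novelty idea is after.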

Interesting thoughts, thanks. My concerns: 1) Diversity would be restricted to what I specify as interesting shapes, while perhaps what I really want is for the AI to be able to discover new ways to accomplish some target value. 2) From a technological perspective, this may be too expensive to implement (in that, at every pass, we must search over all subsets of space and check against all (suitably-sized) patterns in the database in order to determine what reward to provide).
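To give a rough feel for the cost in concern 2, here is a naive sketch of the per-pass search, assuming the grid is a list of 0/1 rows, the "database" is a dict of small rectangular patterns, and a match means a window equals the pattern exactly. The function name and the exact-match rule are assumptions; a real implementation would likely also handle rotations and use something smarter than brute force.

```python
from typing import Dict, List, Tuple

Grid = List[List[int]]


def scan_for_patterns(
    grid: Grid, patterns: Dict[str, Grid]
) -> Tuple[List[Tuple[str, int, int]], int]:
    """Brute-force search of every window of the grid against every pattern.

    Returns (matches, windows_checked), where each match is
    (pattern name, top row, left column).
    """
    height, width = len(grid), len(grid[0])
    matches = []
    windows_checked = 0
    for name, pat in patterns.items():
        ph, pw = len(pat), len(pat[0])
        for r in range(height - ph + 1):
            for c in range(width - pw + 1):
                windows_checked += 1
                # Compare the window row by row against the pattern.
                if all(grid[r + i][c:c + pw] == pat[i] for i in range(ph)):
                    matches.append((name, r, c))
    return matches, windows_checked


if __name__ == "__main__":
    grid = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
    ]
    patterns = {"block": [[1, 1], [1, 1]]}
    print(scan_for_patterns(grid, patterns))
```

The number of windows checked grows with (grid area) x (number of patterns), and each check costs up to the pattern's area, so doing this on every pass of a large grid is where the expense would come from.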

After reading through the suggestions, including yours and Raemon's, I'm also sort of circling around this idea. Thanks.

Thanks for the note. I'll let you know if my explorations take me that way.

Fascinating. Thanks. My sense is GoL already has this property; any intuitions on how to formalise it?

I was not; thanks for the pointer!

A quick look suggests that it's not quite what I had in mind; nonetheless a reference worth looking at.

Perhaps these could be useful:

1) Human Decision-Making dataset; but from what I can tell, it has fewer than 300 human participants

2) User rating dataset, e.g. Yahoo! Music or Netflix or Amazon product review datasets. These could be trimmed in various ways to reduce complicatedness. Netflix dataset is here :

Amazon product review is at , but it says available upon request

3) Transactional data, e.g. might shed some light on preferences (as transactional data could be a proxy for demand)
