Two things that could very well come out of misunderstandings of the material:

  1. If we have an agent whose actions affect future observations, why can’t we think of information about the agent’s embedding in the environment as being encoded in its observations? For example, in the heating-up game, we could imagine a computer with sensors that detect heat emanating from its own hardware, with the data from those sensors incorporated into the input stream of observations of the “environment”. The agent could then learn from past experience that certain actions lead to certain patterns of observations, and those patterns correspond to what people seem to mean when they say that it is giving off a certain amount of heat. We humans have a causal model in which those patterns in the computer’s observations come specifically from patterns in the data from specific sensors, which are triggered by the computer giving off heat, which is in turn caused by those actions. A computer could well have an internal representation of a similar causal model, but that seems to me like part of a specific solution to the same general problem of predicting future observations and determining future actions given past observations and past actions. Even if an agent reconfigures its sensors, say, by taking over every camera and microphone it can get control of over the internet, that new configuration of sensors will just get incorporated into the stream of observations, and the agent can “know” what it’s doing by predicting abrupt shifts in various properties of the observation stream. (A toy sketch of the heat-sensor setup appears after this list.)

  2. It makes sense to me that modeling rewards as exogenously incorporated into observations doesn’t account for the possibility of specific predetermined values. But doesn’t the agent ultimately need to make decisions based purely on the information contained in its observations? We might externally judge the agent’s performance according to values of ours that depend on more information than the agent has access to, but if the ultimate goal is to program an agent to make decisions based on specific values, those values need to be at least approximated based purely on information the agent has, and we may as well define the values to be the approximation on which the agent bases its decisions. This absolutely makes it possible that under the vast majority of value systems the most “effective” agents would be ones that take over their inputs, but I think it makes sense in that case to say that those are the wrong value systems, rather than that the agent is ineffective. This doesn’t get rid of any of the ontological issues; it just shifts the burden of dealing with them to the definition of the value system, rather than to the fundamental setup of the problem of inductively optimizing an interaction, where it seems to me that the value system can be a parameter of the problem in an unambiguous way:

    We define a value system, V, to be a function that takes in a finite partial history of observations and actions and outputs an incremental reward. For a given agent A and environment M, the incremental reward under value system V would be $r^{M,A,V}_t = V(M^A_{\preceq t}, A^M_{\prec t})$, with total reward $R_{M,V}(A) = \sum_{t=1}^{\lceil M \rceil}r^{M,A,V}_t$ used to score the agent relative to this value system (where we limit our consideration to value systems where this necessarily converges for all environments and agents). Then we could measure the effectiveness of agent A relative to value system V as $\sum_{M \in \mathcal{T}}2^{-\langle M \rangle}R_{M,V}(A)$.
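To make the scoring in point 2 concrete, here is a minimal Python sketch of $R_{M,V}(A)$ and the effectiveness sum, restricted to a finite class of environments and finite horizons so that everything is computable. The environment/agent interface (`reset`, `step`, `act`, `description_length`, `horizon`) is my own illustrative assumption, not something from the material, and I’m reading $\langle M \rangle$ as the description length of the environment and $\lceil M \rceil$ as its episode length.

```python
# Sketch only: env/agent interface is hypothetical.

def total_reward(env, agent, V):
    """R_{M,V}(A): run agent A in environment M, scoring each step with
    the value system V, which maps (o_1..o_t, a_1..a_{t-1}) to r_t."""
    observations, actions = [], []
    total = 0.0
    obs = env.reset()
    for t in range(1, env.horizon + 1):
        observations.append(obs)
        # incremental reward r_t = V(o_{<=t}, a_{<t})
        total += V(observations, actions)
        action = agent.act(observations, actions)
        actions.append(action)
        obs = env.step(action)
    return total

def effectiveness(agent, V, environments):
    """Effectiveness of A relative to V over a (finite, for this sketch)
    class of environments, each weighted by 2^(-description length)."""
    return sum(2.0 ** (-env.description_length) * total_reward(env, agent, V)
               for env in environments)
```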
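And, returning to the heat-sensor example in point 1: a minimal sketch (all names mine, purely illustrative) of how self-knowledge can ride along in the ordinary observation stream. The agent’s observation is one flat vector, part of which happens to come from sensors on its own hardware, and a generic action-conditioned predictor over that vector already encodes “how much heat this action gives off.”

```python
import numpy as np
from collections import defaultdict

def make_observation(external_obs, heat_sensor_readings):
    # One flat observation vector; the agent does not need to treat the
    # self-sensor components specially -- the split is only visible to us.
    return np.concatenate([external_obs, heat_sensor_readings])

class ActionEffectTracker:
    """Crude action-conditioned predictor: average what each action has
    historically done to the next observation."""
    def __init__(self, obs_dim):
        self.sums = defaultdict(lambda: np.zeros(obs_dim))
        self.counts = defaultdict(int)

    def update(self, action, next_obs):
        self.sums[action] += next_obs
        self.counts[action] += 1

    def predicted_obs(self, action):
        # The heat-sensor components of this prediction are the agent's
        # implicit estimate of how much heat `action` will give off.
        return self.sums[action] / max(self.counts[action], 1)
```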