What eventually leapt out when comparing the two behaviours is that behaviour 2 is far more informative about what the restriction was than behaviour 1.

It sounds to me like the agent overfit to the restriction R. I wonder if you can draw some parallels to the Vapnik-style classical problem of empirical risk minimization, where you are not merely fitting your behavior to the training set but are instead seeking the optimal trade-off between generalization ability and adherence to R.

In your example, an agent that inferred the boundaries of our restriction could generate a family of restrictions R_i derived by slightly modifying its postulates. For example, if it knows you usually check in at midnight, it should consider the counterfactual scenarios in which you usually check in at 11:59, 11:58, etc., and come up with the union of the R_i (R_i = play quietly only around time i), i.e., play quietly the whole time, since this achieves maximum generalization.
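A minimal sketch of that union, in Python. Everything here is an assumption made for illustration: the names `quiet_window` and `generalized_restriction`, the minute-level granularity, and the 5-minute `slack` around each check-in are not from the post.

```python
MINUTES_PER_DAY = 24 * 60


def quiet_window(check_in_minute, slack=5):
    """Hypothetical R_i: minutes in which to play quietly if the check-in
    happens at `check_in_minute` (minutes after midnight, wrapping around)."""
    return {
        (check_in_minute + offset) % MINUTES_PER_DAY
        for offset in range(-slack, slack + 1)
    }


def generalized_restriction(candidate_check_ins, slack=5):
    """Union of the R_i over every check-in time the agent considers possible."""
    quiet = set()
    for minute in candidate_check_ins:
        quiet |= quiet_window(minute, slack)
    return quiet


# Letter of the law: only the observed midnight check-in counts.
letter = generalized_restriction([0])

# Counterfactuals: the check-in could have been at any minute of the day.
spirit = generalized_restriction(range(MINUTES_PER_DAY))
```

With only the midnight check-in, the agent is quiet for a short window; once every counterfactual check-in time is included, the union covers the whole day, which is the "play quietly the whole time" outcome.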

Unfortunately, things are complicated by the fact that you said "I'll be checking up on you!" instead of "I'll be checking up on you at midnight!" The agent needs to go one step further than the machine teaching problem: it must first work out how many counterfactual training points (the R_i's above) it should generate to infer your intention, and then infer it.

A high-level question is whether human CEV, if it can be modeled as a region within some natural high-dimensional real-valued space (e.g., R^n for large n, where each dimension is a utility function?), admits minimal or near-minimal curvature as a Riemannian manifold, assuming we could populate the space with the largest available set of training data, as mined from all human literature.

A positive answer to the above question would be philosophically satisfying, as it would imply that a potential AI would not have to set up corner cases and would thus avoid the appearance of overfitting to the restrictions.

EDIT: Framed in this way, could we use cross-validation on the above-mentioned training set to test our CEV region?
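As a toy illustration of what that test might look like, here is a k-fold cross-validation sketch in Python. It is only a stand-in under strong assumptions: the "region" is a one-dimensional interval rather than anything resembling CEV, the labelled points are made up, and `fit_region`, `predict`, and `k_fold_accuracy` are hypothetical names.

```python
def fit_region(points):
    """Fit the tightest interval containing every acceptable training point."""
    accepted = [x for x, ok in points if ok]
    return (min(accepted), max(accepted))


def predict(region, x):
    """Predict whether x is acceptable, i.e. falls inside the fitted interval."""
    low, high = region
    return low <= x <= high


def k_fold_accuracy(points, k):
    """Mean held-out accuracy over k folds (fold i takes every k-th point)."""
    folds = [points[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        region = fit_region(train)
        correct = sum(predict(region, x) == ok for x, ok in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k


# Made-up data: integers 0..10, where the "true" acceptable region is [3, 7].
data = [(x, 3 <= x <= 7) for x in range(11)]
accuracy = k_fold_accuracy(data, k=3)
```

In this toy setup, any held-out mistakes fall on points at the edge of the true region, because the tightest-fit interval cannot extend past the acceptable points it was trained on. That is the one-dimensional analogue of a region that hugs its training data too closely.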

[-]robertzk11y50

Incidentally, for a community whose most important goal is solving a math problem, why is there no MathJax or other built-in LaTeX support?

[-]Stuart_Armstrong11y10

Thanks, looking at the Vapnik stuff now.

[-]paulfchristiano10y00

If you want to talk about the behavior of the AI being uninformative, you need to talk about the distribution over possible values of R. If the distribution is just "it exists" or "it doesn't," then it's clear that the AI will just have to satisfy R in every case, and you don't get anything beyond the restriction itself.

If there is some broader distribution, then it's less clear what happens, but as far as I can tell this is no better than simply having the AI care about an unknown requirement from that distribution.

[-]Stuart_Armstrong10y00

There is an R, it's given. There is a distribution over possible R's for agents that only know the data E, F, B(), and v.

But this approach seems very wobbly to me; I no longer give it much potential.

[-][anonymous]11y00

We could model children as ever-destructive chaotic AIs (why yes, I am a parent - how did you guess?).

I think if our children could vastly outsmart us, our efforts would be all in vain. The only hope we have of controlling our wailing 1-year-old when, e.g., she is hungry is her "stupidity": things like "look! dad is making a funny face! look! there is something shiny there!" can make the wailing stop for a minute, until my wife finishes her meal and starts feeding her. The same child with an IQ above mine? Forget it. "Dad, you are lying, so I will just wail harder, thank you."

Of course a high-IQ toddler would understand a rational argument like "your meal is being prepared, deal with hunger until it is done" and would not be chaotic and destructive.

Which may be a problem with your model. Why does a highly intelligent AI want to be chaotic and destructive?

[-]Stuart_Armstrong11y40

The model is illustrative only. The key point is that "letter of the law" obedience tends to be highly informative as to what the law is, unlike "spirit of the law" obedience.

[-]Gunnar_Zarncke11y-10

Very interesting.

Some time ago I posted a comment about raising AIs with a caregiver. Basically, rules given to the child/AI cause it to search for circumventions, whereas rewarding positive behaviors could be modelled as shaping the motivation structure of the child/AI. At least for children, positively reinforced behaviors cause searching for new behaviors in that direction and implicitly inhibit other behaviors.

However, the theoretical model you gave for the motivation part looks quite different from my model. The children model seems to work more like heavily (for an advanced AI) penalizing search outside the rewarded areas. This is different from the usual temporal discounting, so it might nonetheless be another AI control approach. Search distance would need to be quantified for this, and that is more difficult than time discounting.
