Independent AI Research
Rationality in Research

I'd go for:

Reinforcement learning agents do two sorts of planning. One applies the dynamics (world-modelling) network and runs a Monte Carlo tree search (or something like it) over explicitly-represented world states. The other is implicit in the future-reward-estimate function. You want as much of the planning as possible to be of the first type:

  1. It's much more supervisable. An explicitly-represented world state is more interrogable than the inner workings of a future-reward-estimate.
  2. It's less susceptible to value-leaking. By this I mean issues in alignment which arise from instrumentally-valuable (i.e. not directly part of the reward function) goals leaking into the future-reward-estimate.
  3. You can also turn down the depth on the tree search. If the agent literally can't plan beyond a dozen steps ahead it can't be deceptively aligned.
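A rough sketch of what "turning down the depth" looks like, under loud assumptions: this is plain exhaustive depth-limited search rather than real MCTS, the leaves get a value of zero (so there is no learned future-reward-estimate at all), and `toy_dynamics`, `toy_reward` and the number-line world are made up purely for illustration.

```python
from typing import Callable, List, Tuple

State = int
Action = int


def plan(
    state: State,
    dynamics: Callable[[State, Action], State],
    reward: Callable[[State], float],
    actions: List[Action],
    max_depth: int,
) -> Tuple[float, List[Action]]:
    """Exhaustive depth-limited search over explicitly represented states.

    Leaves are worth 0.0: nothing beyond max_depth can influence the plan,
    because no learned value estimate is consulted.
    """
    if max_depth == 0:
        return 0.0, []
    best_value, best_plan = float("-inf"), []
    for action in actions:
        next_state = dynamics(state, action)  # explicit, inspectable state
        future_value, future_plan = plan(
            next_state, dynamics, reward, actions, max_depth - 1
        )
        value = reward(next_state) + future_value
        if value > best_value:
            best_value, best_plan = value, [action] + future_plan
    return best_value, best_plan


# Hypothetical toy world: walk along a number line, reward only at position 5.
def toy_dynamics(state: State, action: Action) -> State:
    return state + action


def toy_reward(state: State) -> float:
    return 1.0 if state == 5 else 0.0


ACTIONS = [-1, 1]
shallow_value, shallow_plan = plan(0, toy_dynamics, toy_reward, ACTIONS, max_depth=3)
deep_value, deep_plan = plan(0, toy_dynamics, toy_reward, ACTIONS, max_depth=5)
```

With `max_depth=3` the planner cannot see the reward five steps away, so every plan looks equally worthless to it; with `max_depth=5` it finds the walk to position 5. Every intermediate `next_state` is available for inspection, which is the supervisability point in 1.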

I would question the framing of mental subagents as "mesa optimizers" here. This sneaks in an important assumption: namely that they are optimizing anything. I think the general view of "humans are made of a bunch of different subsystems which use common symbols to talk to one another" has some merit, but I think this post ascribes a lot more agency to these subsystems than I would. I view most of the subagents of human minds as mechanistically relatively simple.

For example, I might reframe a lot of the elements of talking about the unattainable "object of desire" in the following way:

1. Human minds have a reward system which rewards thinking about "good" things we don't have (or else we couldn't ever do things)
2. Human thoughts ping from one concept to adjacent concepts
3. Thoughts of good things associate to assessment of our current state
4. Thoughts of our current state being lacking cause a negative emotional response
5. The reward signal fails to backpropagate to the reward system in 1 enough, so the thoughts of "good" things we don't have are reinforced
6. The cycle continues

I don't think this is literally the reason, but framings on this level seem more mechanistic to me. 

I also think that any framings along the lines of "you are lying to yourself all the way down and cannot help it" and "literally everyone is messed up in some fundamental way and there are no humans who can function in a satisfying way" are just kind of bad. Seems like a Kafka trap to me.

I've spoken elsewhere about the human perception of ourselves as a coherent entity being a misfiring of systems which model others as coherent entities (for evolutionary reasons). I don't particularly think some sort of societal pressure is the primary reason for our thinking of ourselves as being coherent, although societal pressure is certainly to blame for the instinct to repress certain desires.

I'm interested in the "Xi will be assassinated/otherwise killed if he doesn't secure this bid for presidency" perspective. Even if he was put in a position where he'd lose the bid for a third term, is it likely that he'd be killed for stepping down? The four previous paramount leaders weren't. Is the argument that he's amassed too much power/done too much evil/burned too many bridges in getting his level of power?

Although I think most people who amass Xi's level of power are best modelled as desiring power (or at least as executing patterns which have in the past maximized power) for its own sake, so I guess the question of threat to his life is somewhat moot with regards to policy.

Seems like there's a potential solution to ELK-like problems: force the information to move from the AI's ontology to (its model of) a human's ontology, and then force it to move back again.

This gets around "basic" deception since we can always compare the AI's ontology before and after the translation.

The question is how do we force the knowledge to go through the (modeled) human's ontology, and how do we know the forward and backward translators aren't behaving badly in some way.
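To make the comparison step concrete, here is a toy sketch. Everything in it is an assumption for illustration: ontologies are flat dictionaries of numbers, and the translators (`honest_to_human`, `lossy_to_human`, `from_human`) are hand-written stand-ins, not anything a real system would expose.

```python
from typing import Callable, Dict

Ontology = Dict[str, float]


def round_trip_discrepancy(
    ai_state: Ontology,
    to_human: Callable[[Ontology], Ontology],
    from_human: Callable[[Ontology], Ontology],
    tolerance: float = 1e-9,
) -> Dict[str, float]:
    """Translate AI -> human -> AI and report every feature that changed."""
    recovered = from_human(to_human(ai_state))
    gaps = {}
    for key in sorted(set(ai_state) | set(recovered)):
        gap = abs(ai_state.get(key, 0.0) - recovered.get(key, 0.0))
        if gap > tolerance:
            gaps[key] = gap
    return gaps


# Hypothetical translators between made-up feature names.
def honest_to_human(ai: Ontology) -> Ontology:
    return {
        "diamond_in_vault": ai["diamond_present"],
        "camera_hacked": ai["camera_tampered"],
    }


def lossy_to_human(ai: Ontology) -> Ontology:
    # Silently drops the inconvenient fact about the camera.
    return {"diamond_in_vault": ai["diamond_present"]}


def from_human(human: Ontology) -> Ontology:
    return {
        "diamond_present": human.get("diamond_in_vault", 0.0),
        "camera_tampered": human.get("camera_hacked", 0.0),
    }


ai_state = {"diamond_present": 0.0, "camera_tampered": 1.0}
honest_gap = round_trip_discrepancy(ai_state, honest_to_human, from_human)
lossy_gap = round_trip_discrepancy(ai_state, lossy_to_human, from_human)
```

The honest translator round-trips cleanly, while the lossy one leaves a gap on `camera_tampered`. This only catches information that is dropped on the way through; a forward and backward translator that collude (say, by smuggling the fact through an unrelated human-ontology field) would still pass, which is exactly the open question above.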

Unmentioned but large comparative advantage of this: it's not based in the Bay Area.

The typical alignment pitch of "Come and work on this super-difficult problem you may or may not be well suited for at all" is a hard enough sell for already-successful people (which intelligent people often are) without adding "Also, you have to move to this one specific area of California, which has a bit of a housing and crime problem and a very particular culture."

I was referring to "values" more like the second case. Consider the choice blindness experiments (which are well-replicated). People think they value certain things in a partner, or politics, but really it's just a bias to model themselves as being more agentic than they actually are.

Both of your examples share the common fact that the information is verifiable at some point in the future. In this case the best option is to put down money. Or even just credibly offer to put down money.

For example, X offers to bet Y $5000 (possibly at very high odds) that in the year 2030 (after the Moon Nazis have invaded) they will provide a picture of the moon. If Y takes this bet seriously they should update. In fact all other actors A, B, C, who observe this bet will update.

The same is (sort of) true of the second case: just credibly bet some money that in the next five months Russia will release the propaganda video. Of course if you bet too much Russia might not release the video, and you might go bankrupt.

I don't think this works for the general case, although it covers a lot of smaller cases. Depends on the rate at which the value of the information you want to preserve depreciates.
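As a rough worked version of why observers should update on a credible bet: the stake and payout pin down the minimum probability at which the bet is worth offering, and Bayes' rule turns "X offered this bet" into a posterior. The function names are mine, and the likelihoods in the example are invented numbers, not estimates of anything.

```python
def break_even_probability(stake: float, payout: float) -> float:
    """Minimum belief at which risking `stake` to win `payout` has
    non-negative expected value: p * payout - (1 - p) * stake >= 0."""
    return stake / (stake + payout)


def posterior_after_bet(
    prior: float, p_offer_if_true: float, p_offer_if_false: float
) -> float:
    """Bayes' rule for an observer who sees the bet being credibly offered."""
    numerator = prior * p_offer_if_true
    return numerator / (numerator + (1.0 - prior) * p_offer_if_false)


# X risks $5000 to win only $500, i.e. offers Y very favourable odds.
implied = break_even_probability(stake=5000, payout=500)

# Assumed numbers: an observer starts at 1% and thinks such an offer is
# 100x more likely from someone who actually holds the evidence.
posterior = posterior_after_bet(
    prior=0.01, p_offer_if_true=0.5, p_offer_if_false=0.005
)
```

Here `implied` is 5000/5500, so X is signalling a belief of at least about 91%, and under the assumed likelihoods the observer moves from 1% to roughly 50%. The update comes entirely from how much rarer the offer is when the claim is false, which is why the bet has to be credible.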

When you say the idea of human values is new, do you mean the idea of humans having values with regards to a utilitarian-ish ethics is new? Or do you mean the concept of humans maximizing things rationally (or some equivalent concept) is new? If it's the latter I'd be surprised (but maybe I shouldn't be?).

From my experience as a singer, relative pitch exercises are much more difficult when the notes are a few octaves apart. So making sure the notes jump around over a large range would probably help.

You make some really excellent points here. 

The teapot example is atypical of deception in humans, and was chosen to be simple and clear-cut. I think the web-of-lies effect is hampered in humans by a couple of things, both of which result from us only being approximations of Bayesian reasoners. One is the limits to our computation: we can't go and check a new update that "snake oil works" against all possible connections. Another (which is also linked to computation limits) is that I suspect a small enough discrepancy gets rounded down to zero.

So if I'm convinced that "snake oil is effective against depression", I don't necessarily check it against literally all the beliefs I have about depression, which limits the spread of the web. Or if it only very slightly contradicts my existing view of the mechanism of depression, that won't be enough for me to update the existing view at all, and the difference is swept under the rug. So the web peters out.

Of course the main reason snake oil salesmen work is because they play into people's existing biases. 

But perhaps more importantly:

This information asymmetry is typically over something that the deceiver does not expect the agent to be able to investigate easily.

This to me seems like regions where the function  just isn't defined yet, or is very fuzzy. This means rather than a web of lies we have some lies isolated from the rest of the model by a region of confusion. This means there is no discontinuity in the function, which might be an issue.
