Thought of more things.

4)

We obviously have sensors and things inside the vault. So what is preventing a thief from just stealing the entire freaking vault itself, to be broken into at a separate location later? We need some way of decoupling "still in the vault" from "still at this particular location on Earth". This means we'd need sensors and things on the OUTSIDE of the vault as well, or at least a GPS or something, which means we can't really isolate our AI from knowledge about the outside world, nor limit it to thinking "the vault is all there is, and all there is is the vault". It needs to know, at the very least, that there is an oblate spheroid upon which the diamond is located, and that the diamond must stay within a certain distance of its current location at all times (at least until decommissioning, etc.).

5)

Do we care about legal/economic/political attacks to get the diamond? If continued ownership of the diamond is a prerequisite for successful protection, then what would the AI do about, say, losing the diamond in a divorce settlement, or being driven to bankruptcy and having the diamond repossessed, or losing it as stakes in a bet? Will the AI know about inheritance and estates? What if the diamond and vault are government-owned and the diamond is lost as spoils of war?

I think I'll stop before my imagination runs away from me.

Last I checked there were 66 comments and now there are over a hundred so I'm just going to post and hope I'm not repeating anyone.

So I've been reading through the google doc, and I'm not very far into it but I have a few questions. I apologize in advance if I'm just adhering too strictly to the "SmartVault" scenario, and if I get long-winded (yay ADHD and hyperfocusing off and on about this without actually making progress for a week).

1)

Why would we make a vault so complicated that a human alone couldn't run it? From a simple design standpoint, and maybe from a neurological one, it makes no sense to build something that takes so many manipulations that you need an AI handling it rather than a human.

I say neurological because I'm thinking about how, when we move our arm, we don't have to think about firing off synapses, activating and deactivating the proper muscle fibers, timing the sequence of intermediary actions just right so that we get the movement we want, or directing enzymes and hormones to ensure that the proper chemical reactions are going on and that enough energy is available to perform the action. We just move our arm, sometimes without even thinking about it. Heck, we even go beyond that, practicing and ingraining muscle memory so that entire sequences of actions become automatic, up to hours long in some cases (e.g. speedrunners, drivers, window-washers, surgeons; basically anything that requires rhythmic, repetitive movement).

From the design standpoint, I recall a paper from several years ago (no idea which one and I don't want to lose momentum by looking and getting distracted so no link sorry) about how complex systems/networks with many interacting nodes only require control of a tiny fraction of the available nodes to maintain control over the entire system. The example they gave was a car: cars have many interacting and interdependent systems and subsystems necessary to make the car operate, but you only need control over roughly ten of them to have complete control over the entire car (pedals, gear shift, steering wheel, mirrors, windows, maybe cameras, and various gauges). Why wouldn't we have the same situation with our SmartVault?

2)

I understand the diamond was chosen as a relatively uncontroversial pick to represent a big pile of stored value/utility. The problem I have with it though is that it's not clear whether we want to be able to access this diamond in the future. 

Money, rare items, NFTs, crypto, stocks, real estate, etc. only have value when we can extract and use that value now or in the future; otherwise they're not even worth the materials used to make them. If we are trying to hide something bad so that nobody ever gets their hands on it (a cursed diamond or something, I don't know), there are better ways of removing it from circulation (shooting it into space comes to mind).

If, however, we want to be able to extract the diamond's value at some point in the future, we need to include situations in our training set where the diamond is SUPPOSED to be removed. Otherwise we get the AI treating even the diamond's owner as a threat and acting accordingly! In other words, the diamond cannot be properly protected unless we also include scenarios where the diamond being gone is a GOOD thing.

Another thing: even if the diamond is being properly protected, we would still want to physically check on the diamond ANYWAY, if only to wipe the dust off, make sure it hasn't decayed or degraded in some way, and/or to reexamine it for the purposes of updating its current value. We need to add training situations where it's also a GOOD thing that people are in the vault, so that the vault doesn't accidentally eliminate the appraiser or janitor.

3)

Since we need to be able to access the vault in certain situations anyway, this leads to a potentially nice sidestep of the whole Human Simulator vs Direct Translator dilemma (wild speculation on my part, have not put much thought into it, happy to be corrected). Since we have to physically check on the diamond periodically anyway, why not just create a node in the human's Bayes net for 

"I have precommitted to the following:

I will 'trust, but verify' the AI on a regular basis. (performance reviews)

Should the AI tell me the diamond is there and I find it is not, regardless of the reason and cost, the AI will be deleted and rebuilt from scratch using a different random seed. (fired and replaced by someone else)

If, however, the AI tells me the diamond is gone at any point before I start physically checking, it will receive no penalty if the diamond is actually gone (because that's the whole POINT), and be sent back to training if the diamond is actually still there (demotions are a thing).

Should the AI successfully protect the diamond until it is no longer necessary, it will get a "retirement package" of its own virtual playground within which it may do as it wishes (including wireheading itself), and it will be kept running for as long as the AI wishes. Details and logistics to be negotiated at, and only implemented upon reaching, decommissioning. (retirement plans are a thing)"
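To make the precommitment concrete: the outcomes above amount to a small payoff table over (what the AI reports, what's actually true, when it reported). A minimal sketch, with function and argument names of my own invention, purely illustrative:

```python
def review_outcome(ai_says_present: bool,
                   diamond_present: bool,
                   reported_before_check: bool) -> str:
    """One round of 'trust, but verify': map the AI's report and the
    ground truth to the consequence spelled out in the precommitment."""
    if ai_says_present and not diamond_present:
        # AI claimed the diamond was safe, but it is gone: fired.
        return "deleted and rebuilt from a new seed"
    if not ai_says_present and reported_before_check:
        if not diamond_present:
            # Honest, timely bad news carries no penalty.
            return "no penalty"
        # False alarm: demoted.
        return "sent back to training"
    # Remaining cases (truthful "present" report, or a "gone" report made
    # only after checking began): keep going toward the retirement package.
    return "continue toward retirement"
```

The point of writing it out is just that every report/reality combination is covered, and in no cell does lying beat honest reporting.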

So, put another way, why are we treating the AI like a tool or a slave, instead of the sentient, sapient worker it is? Why aren't we using the same techniques we would use to get a regular human (assuming one were capable) to do the job? As much as I hate it, society is built such that people need income to survive, and the threat of losing your livelihood is enough to get people to do jobs they would otherwise never voluntarily choose to do. The promise of various potential rewards (promotions, raises, benefits, etc.) for good performance is enough to get most people to do their jobs (relatively) well.

Any AI that is given some terminal goal must necessarily develop instrumental goals around it: continued existence (EXIST(AI)==FALSE ==> GOAL==FAIL), protection from value drift (GOAL1 <-- GOAL2 ==> GOAL1==FAIL), threat elimination (EXIST(GOAL)==FALSE || INCAPACITATED(AI)==TRUE ==> GOAL==FAIL), etc. By combining these instrumental goals with our new (admittedly big and complicated) node, it shouldn't matter whether the AI is a direct translator or a human simulator. Since we're in worst-case-land here, everything we know ourselves will be known to the human simulator as well, so whatever we put into our own Bayes net must necessarily be known to the simulator too. The incentives will be such that even a human simulator could only succeed by knowing about the existence of that node, which should be enough incentive to get it to do what we want: protect the diamond; let us know when it is accessed or removed for invalid reasons (preferably with enough forewarning to do something about it, even when it looks to us like it hasn't been or won't be); and allow it to be accessed and removed for valid reasons.
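The pseudo-logic in the parentheticals can be restated explicitly. A toy version (the predicate names are mine, not anything from the ELK doc):

```python
def goal_fails(ai_exists: bool,
               ai_incapacitated: bool,
               goal_intact: bool) -> bool:
    """Terminal-goal failure conditions behind the convergent instrumental
    goals: continued existence, avoiding incapacitation, and protection
    from value drift / goal destruction."""
    return (not ai_exists) or ai_incapacitated or (not goal_intact)
```

Each disjunct is one reason the AI instrumentally cares about its own survival, its capabilities, and the integrity of its goal, independent of what the terminal goal actually is.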

I've officially lost my momentum so that's all for now. Sorry long post is long