cousin_it

https://vladimirslepnev.me

Comments

Comparative advantage & AI
cousin_it · 1d

Yeah, getting outbid or otherwise deprived of resources we need to survive is one of the main concerns for me as well. It can happen completely legally and within market rules, and if you add AI-enhanced manipulation and lobbying to the mix, it's almost certain to happen.

One thing I've been wondering about is how fixed the "human minimum wage" really is. I mean, in the limit it's the cost of running an upload, which could be really low. And even if we stay biological, I can imagine lots of technologies that would let us live more cheaply: food-producing nanotech, biotech that makes us smaller, and so on.

The scary thing, though, is that when such technologies appear, they'll create pressure to use them. Everyone would have to choose between staying human and converting themselves into a bee in beehive #12345, living much more cheaply but with a similar quality of life, because the hive is internet-enabled.

The Tale of the Top-Tier Intellect
cousin_it · 1d

It seems you interpreted my comment as "the essay argues against something nobody believes anyway". What I meant was more like "the essay keeps making its point in an angry and tedious way, over and over".

The Unreasonable Effectiveness of Fiction
cousin_it · 1d

My favorite example of fiction influencing reality (or maybe just predicting it really well; it's hard to tell) is how Arthur Conan Doyle's detective stories basically created forensic science from thin air. For example, the very first Sherlock Holmes story, "A Study in Scarlet", published in 1887, describes Holmes inventing a chemical test to distinguish dried bloodstains from dirt stains. Then exactly that test was invented in 1900. Another example is the analysis of tiny differences between typewriters, which appeared in the Holmes stories a few years before anyone did it in reality.

The Tale of the Top-Tier Intellect
cousin_it · 1d

Reading this felt like watching someone kick a dead horse for 30 straight minutes, except at the 21st minute the guy forgets for a second that he needs to kick the horse, turns to the camera and makes a couple really good jokes. (The bit where they try and fail to change the topic reminded me of the "who reads this stuff" bit in HPMOR, one of the finest bits you ever wrote in my opinion.) Then the guy remembers himself, resumes kicking the horse, and it continues in that manner until the end.

By which I'm trying to say, though maybe not in a top-tier literary way, that you're a cool writer. A cool writer who has convinced himself that he has to be a horse-kicker, otherwise the world will end. And I do agree that the world will end! But... hmm, how to put it... there is maybe a better ratio of cool writing to horse-kicking, which HPMOR often achieved. That made it more effective at saving the world, more fun to read, and maybe more fun to write as well.

Though I could be wrong about that. Maybe the cool bit in the middle wasn't a release valve for you, but actually took more effort than laying out the arguments in the rest of the essay. In that case never mind.

Human Values ≠ Goodness
cousin_it · 2d

But, like, the memetic egregore “Goodness” clearly does not track that in a robust generalizable way, any more than people’s feelings of yumminess do.

I feel you're overstating the "any more" part, or at least it doesn't match my experience. My feelings of "goodness" often track what would be good for other people, while my feelings of "yumminess" mostly track what would be good for me. Though of course there are exceptions to both.

So why are you attached to the whole egregore, rather than wanting to jettison the bulk of the egregore and focus directly on getting people to not defect?

This can be understood in two ways. 1) A moral argument: "We shouldn't have so much extra stuff in the morality we're blasting in everyone's ears; it should focus more on the golden rule / unselfishness." That's fine, everyone can propose changes to morality, go for it. 2) "Everyone should stop listening to the morality radio and follow their feels instead." OK, but if nobody listens to the radio, by what mechanism do you get other people not to defect? Plenty of people are happy to defect by feels; I feel I've proved that sufficiently. Do you use police? Money? The radio was pretty useful for that, actually, so I'm not with you on this.

Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
cousin_it · 2d

“Oh,” says the computer scientist. “Well, in that case — hm. Well, utility functions are invariant under scaling, so how about you scale the two utility functions U1 and U2 such that the AI expects it can get the same utility from each of them, so it doesn’t have an incentive one way or the other.”

That can work for a single moment, but not much longer. The AI’s options change over time. For instance, whenever it has a setback, its expected U1-utility drops, so then it would mash the shutdown button to get all that sweet, sweet shutdown utility.

“Ah,” says the computer scientist. “Well, in that case, how about if [some other clever idea]?”

Well, you see, that clever idea is isomorphic to the AI believing that it’s impossible for the button to ever be pressed, which incentivizes it to terrify the user whenever it gets a setback, so as to correlate setbacks with button-presses, which (relative to its injured belief system) causes it to think the setbacks can’t happen.

And so on.

Lessons from the Trenches

We ran some workshops, and the workshops had various mathematicians of various stripes (including an International Mathematical Olympiad gold medalist), but nobody came up with a really good idea.

This passage sniped me a bit. I thought about it for a few seconds and found what felt like a good idea. A few minutes more and I couldn't find any faults, so I wrote a quick post. Then Abram saw it and suggested that I should look back and compare it with Stuart's old corrigibility papers.

And indeed: it turned out my idea was very similar to Stuart's "utility indifference" idea plus a known tweak to avoid the "managing the news" problem. To me it fully solves the narrow problem of how to swap between U1 and U2 at arbitrary moments, without giving the AI an incentive to control the swap button at any moment. And since Nate was also part of the discussion back then, it makes me wonder a bit why the book describes this as an open problem (or at least implies that).

For completeness' sake, here's a simple rephrasing of the idea, copy-pasted from my post yesterday, which I ended up removing because it wasn't new work:

Imagine two people, Alice and Bob, wandering around London. Bob's goal is to get to Tower Bridge. When he gets there, he'll get a reward of £1 per minute of time remaining until midnight, so he's incentivized to go fast. He's also carrying a radio receiver.

Alice is also walking around, doing some chores of her own which we don't need to be concerned with. She is carrying a radio transmitter with a button. If/when the button is pressed (maybe because Alice presses it, or Bob takes it from her and presses it, or she randomly bumps into something), Bob gets notified that his goal changes: there'll be no more reward for getting to Tower Bridge; he needs to get to St Paul's Cathedral instead. His reward formula also changes: the device notes Bob's location at the time the button is pressed, calculates the expected travel times to Tower Bridge and to St Paul's from that location, and adds or subtracts a payment so that the expected reward stays the same. For example, if Bob is 20 minutes away from the bridge and 30 minutes away from the cathedral when the button is pressed, the reward will be increased by £10 to compensate for the 10 minutes of delay.
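
For concreteness, here's a minimal Python sketch of that adjustment rule, reproducing the 20-vs-30-minute example. It's not from the original post; the function names and the £1-per-minute clock convention are just illustrative assumptions.

```python
# Minimal sketch of the compensated-reward rule, with hypothetical names.

MIDNIGHT = 24 * 60  # minutes since the start of the day

def base_reward(arrival_minute):
    """£1 per minute remaining until midnight when Bob reaches a goal."""
    return max(0, MIDNIGHT - arrival_minute)

def compensated_reward(press_minute, eta_bridge, eta_cathedral, arrival_minute):
    """Reward once the button is pressed: Bob is paid for reaching St Paul's,
    plus a correction fixed at press time so his expected reward equals what
    he expected from Tower Bridge at that moment."""
    correction = (base_reward(press_minute + eta_bridge)
                  - base_reward(press_minute + eta_cathedral))
    return base_reward(arrival_minute) + correction

# Button pressed at 20:00, Bob 20 minutes from the bridge and 30 from the
# cathedral: the £10 correction makes the switch costless in expectation.
press = 20 * 60
print(compensated_reward(press, 20, 30, arrival_minute=press + 30))  # 220
print(base_reward(press + 20))  # 220, what he expected from the bridge
```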

I think this can serve as a toy model of corrigibility, with Alice as the "operator" and Bob as the "AI". It's clear enough that Bob has no incentive to manipulate the button at any point, but actually Bob's indifference goes even further than that. For example, let's say Bob can sacrifice just a minute of travel time to choose an alternate route, one which will take him close to both Tower Bridge and St Paul's, to prepare for both eventualities in case Alice decides to press the button. Will he do so? No. He won't spare even one second. He'll take the absolute fastest way to Tower Bridge, secure in the knowledge that if the button gets pressed while he's on the move, the reward will get adjusted and he won't lose anything.

We can also make the setup more complicated and the general approach will still work. For example, let's say traffic conditions change unpredictably during the day, slowing Bob down or speeding him up. Then all we need to say is that the button does the calculation at the moment it's pressed, taking into account the traffic conditions and projections at that moment.

Are we unrealistically relying on the button having magical calculation abilities? Not necessarily. Formally speaking, we don't need the button to do any calculation at all. Instead, we can write out Bob's utility function as a big complicated case statement which is fixed from the start: "if the button gets pressed at time T when I'm at position P, then my reward will be calculated as..." and so on. Or maybe this calculation is done after the fact, by the actuary who pays out Bob's reward, knowing everything that happened. The formal details are pretty flexible.
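
As a sketch of that "fixed case statement" view (again just an illustration with hypothetical names, not the original formulation): Bob's utility becomes a plain function of what happened, with the expectation terms supplied as numbers, say by the actuary after the fact.

```python
# Bob's utility as a fixed function of what happened, with the press-time
# expectations given as numbers rather than computed online by the button.

def bob_utility(pressed, arrival_reward, expected_bridge=0.0, expected_cathedral=0.0):
    """arrival_reward: £ actually earned for reaching the final goal;
    expected_*: £ estimates made at the moment of the button press."""
    if not pressed:
        return arrival_reward  # plain Tower Bridge case
    # St Paul's case: add the compensation term so that, at press time,
    # pressing the button didn't change Bob's expected reward.
    return arrival_reward + expected_bridge - expected_cathedral

# Same numbers as in the example above: 220 either way.
print(bob_utility(pressed=True, arrival_reward=210,
                  expected_bridge=220, expected_cathedral=210))
```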

Please Do Not Sell B30A Chips to China
cousin_it · 2d

If only one entity is building AI, that reduces the risk from race dynamics, but increases the risk of that entity becoming world dictator. I think the reduction in the first risk is smaller than the increase in the second. So to me the best outcome is nobody having AI, second best is everyone having it, and the worst is one group monopolizing it.

Human Values ≠ Goodness
cousin_it · 2d

But why do you think that people's feelings of "yumminess" track the reality of whether an action is cooperate/cooperate? I've explained that this hasn't been true throughout most of history: people have been able to feel "yummy" about very defecting actions. Maybe today the two coincide unusually well, but then that demands an explanation.

I think it's just not true. There are too many ways to defect and end up better off, and people are too good at rationalizing why it's ok for them specifically to take one of those ways. That's why we need an evolving mechanism of social indoctrination, "goodness", to make people choose the cooperative action even when it doesn't feel "yummy" to them in the moment.

Human Values ≠ Goodness
cousin_it · 2d

Most people do not actually like screwing over other people

I think this is very culturally dependent. For example, wars of conquest were considered glorious in most places and times, and that's pretty much the ultimate form of screwing over other people. For another example, the first orphanages were built by early Christians; before that, orphans were usually disposed of. Or recall how common slavery and serfdom have been throughout history.

Basically my view is that human nature without indoctrination into "goodness" is quite nasty by default. Empathy is indeed a feeling we have, and we can feel it deeply (...sometimes). But we ended up with this feeling mainly due to indoctrination into "goodness" over generations. We wouldn't have nearly as much empathy if that indoctrination hadn't happened, and it probably wouldn't last long term if that indoctrination went away.

Posts

An argument that consequentialism is incomplete (1y)
Population ethics and the value of variety (1y)
Book review: The Quincunx (1y)
A case for fairness-enforcing irrational behavior (1y)
I'm open for projects (sort of) (2y)
A short dialogue on comparability of values (2y)
Bounded surprise exam paradox (2y)
Aligned AI as a wrapper around an LLM (3y)
Are extrapolation-based AIs alignable? (3y)
Nonspecific discomfort (4y)