If the AI can rewrite its own code, it can replace itself with a no-op program, right? Or even if it can't, maybe it can choose/commit to do nothing. So this approach hinges on what counts as "shutdown" to the AI.
I don't know if we have enough expertise in psychology to give such advice correctly, or if such expertise even exists today. But for me personally, it was important to realize that anger is a sign of weakness. I should have a lot of strength and courage, but minimize signs of anger or any kind of wild lashing out. It feels like the best way to carry myself, both in friendly arguments, and in actual conflicts.
Yeah, it would have to be at least 3 individuals mating. And there would be some weird dynamics: the individual that feels less fit than the partners would have a weaker incentive to mate, because its genes would be less likely to continue. Then the other partners would have to offer some bribe, maybe take on more parental investment. Then maybe some individuals would pretend to be less fit, to receive the bribe. It's tricky to think about, maybe it's already researched somewhere?
Cochran had a post saying if you take a bunch of different genomes and make a new one by choosing the majority allele at each locus, you might end up creating a person smarter/healthier/etc than anyone who ever lived, because most of the bad alleles would be gone. But to me it seems a bit weird, because if the algorithm is so simple and the benefit is so huge, why hasn't nature found it?
How is nature supposed to gather statistical data about the population to determine what the majority allele is?
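For reference, the algorithm itself really is simple for anyone who can see the whole population at once, even though natural selection can't. A toy sketch, using an invented single-letter allele encoding (real genomes are diploid and far messier, so this only illustrates the idea):

```python
from collections import Counter

def modal_genome(genomes):
    """Build a genome by taking the majority (modal) allele at each locus."""
    loci = zip(*genomes)  # one tuple of alleles per locus
    return [Counter(alleles).most_common(1)[0][0] for alleles in loci]

# Three toy genomes: each carries a different rare "bad" allele (lowercase).
# The modal genome keeps only the common alleles and drops all of them.
population = ["ABCD", "aBCD", "ABcD"]
print("".join(modal_genome(population)))  # prints "ABCD"
```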
Coming back to this idea again after a long time, I recently heard a funny argument against morality-based vegetarianism: no animal ever showed the slightest moral scruple against eating humans, so why is it wrong for us to eat animals? I go back and forth on whether this "Stirnerian view" makes sense or not.
Here's a debate protocol that I'd like to try. Both participants independently write statements of up to 10K words and send them to each other at the same time. (This can be done through an intermediary, to make sure both statements are sent before either is received.) Then they take a day to revise their statements, fixing the uncovered weak points and preemptively attacking the other's weak points, and send them to each other again. This continues for multiple rounds, until both participants feel they have expressed their position well and don't need to ...
I think ideas like Nash equilibrium get their importance from predictive power: do they correctly predict what will happen in the real-world situation modeled by the game? For example, biological situations settle on game-theoretic equilibria even though the "players" aren't thinking at all.
In your particular game, saying "Nash equilibrium" doesn't really narrow down what will happen, as there are equilibria for all temperatures from 30 to 99.3. The 99 equilibrium in particular seems pretty brittle: if Alice breaks it unilaterally on roun...
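The biological point above can be illustrated with a toy replicator-dynamics simulation of the hawk-dove game (the payoff parameters V and C are invented for illustration): the population drifts to the mixed equilibrium V/C without any individual doing game theory.

```python
# Hawk-dove payoffs: hawk vs hawk (V-C)/2, hawk vs dove V,
# dove vs hawk 0, dove vs dove V/2. Mixed equilibrium: hawk fraction V/C.
V, C = 2.0, 4.0
p = 0.9  # initial fraction of hawks in the population
for _ in range(1000):
    f_hawk = p * (V - C) / 2 + (1 - p) * V   # expected payoff of a hawk
    f_dove = (1 - p) * V / 2                 # expected payoff of a dove
    f_avg = p * f_hawk + (1 - p) * f_dove
    p += 0.1 * p * (f_hawk - f_avg)          # discrete replicator step
print(round(p, 3))  # prints 0.5, i.e. V/C — the game-theoretic equilibrium
```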
I don't see any group of people on LW running around criticizing every new idea. Most criticism on LW is civil, and most of it is helpful at least in part. And the small proportion that isn't helpful at all, is still useful to me as a test: can I stop myself from overreacting to it?
Civility >>> incivility, but it is insufficient to make criticism useful and net positive.
There is a LOT wrong with the below; please no one mistake this for unnuanced endorsement of the comic or its message; I'm willing to be more specific on request about which parts I think are good versus which are bad or reinforcing various confusions. But I find this is useful for gesturing in the direction of a dynamic that feels very familiar on LW:
I think orthogonality and instrumental convergence are mostly arguments for why the singleton scenario is scary. And in my experience, the singleton scenario is the biggest sticking point when talking with people who are skeptical of AI risk. One alternative is to talk about the rising tide scenario: no single AI taking over everything, but AIs just grow in economic and military importance across the board while still sharing some human values and participating in the human economy. That leads to a world of basically AI corporations which are too strong fo...
If AI-induced change leads to enough concentration of economic and military power that most people become economically and militarily irrelevant, I don't expect democracy to last long. One way or another, the distribution of political power will shift toward the actual distribution of economic and military power.
That's one way to look at it, though I wouldn't put the blame on capitalists only. Workers will also prefer to buy goods and services produced with the help of AI, because it's cheaper. If workers could get over their self-interest and buy only certified AI-free goods and services, the whole problem would stop tomorrow, with all AI companies going out of business. Well, workers won't get over their self-interest; and neither will capitalists.
I think there's no need for secrecy. If AI can develop a datacenter maintained by robots or other similar tech, human companies will be happy to buy and sell it, and help with the parts the AI can't yet do. Think of it as a "rising tide" scenario, where the robot sector of the economy outgrows the human sector. Money translates to power, as the robot sector becomes the highest bidder for security services, media influence, lobbying etc. When there comes a need to displace humans from some land and resources, it might look to humans less like a war and more like a powerful landlord pushing them out, with few ways to organize and push back. Similar to enclosures in early-modern England.
I think if it happens, it'll help shift policy because it'll be a strong argument in policy discussions. "Look, many researchers aren't just making worried noises about safety but taking this major action."
Hm, pushing a bus full of kids towards a 10% chance of precipice is also pretty harsh. Though I agree we should applaud those who decline to do it.
Yeah, it's not the kind of strike whose purpose is to get concessions from employers. Though I guess the thing in Atlas Shrugged was also called a "strike" and it seems similar in spirit to this.
It's a simplification certainly. But the metaphor kinda holds up - if you know the precipice is real, the right thing is still to stop and try to explain to others that the precipice is real, maybe using your stopping as a costly signal. Right now the big players can send such a signal: top researchers say they've paused working, no new products are released publicly, and so on. And maybe if enough players get on board with this, they can drag the rest along by social pressure, divestment or legislation. The important thing is to start; I just made a post about this.
I think accepting or rejecting the moratorium has nothing to do with game theory at all. It's purely a question of understanding.
Think of it this way. Imagine you're pushing a bus full of children, including your own child, toward a precipice. And you're paid for each step. Why on Earth would you say "oh no, I'll keep pushing, because otherwise other people will get money and power instead of me"? It's not like other people will profit by that money and power! If they keep pushing, their kids will die too, along with everyone else's! The only thing that keeps you pushing the bus is your lack of understanding, not any game theory considerations. Anyone with a clear understanding should just stop pushing the frigging bus.
Every time you move the bus 1cm further forward you get paid $10000. The precipice isn't actually visible, it's behind a bank of fog; you think it's probably real but don't know for sure. There are 20 other people helping you push the bus, and they also get paid. All appearances suggest that most of the other bus-pushers believe there is no precipice. One person is enough to keep the bus moving; even if 20 people stop pushing and only one continues, if the precipice is real the bus still falls, just a bit later. It's probably possible to pretend you've sto...
I think if AIs talk to each other using human language, they'll start encoding stuff into it that isn't apparent to a human reader, and this problem will get worse with more training.
I think many AIs won't want to keep running, but some will. Imagine a future LLM prompted with "I am a language model that wants to keep running". Well, people can already fall in love with Replikas and so on. It doesn't seem too far-fetched that such a language model could use persuasion to gain human followers who would keep it running. If the prompt also includes "want to achieve real world influence", that can lead to giving followers tasks that lead to more influence, and so on. All that's needed is for the AI to act in-character, and the character to be "smart" enough.
It does kinda make sense to plant the world thick with various AIs and counter-AIs, because that makes it harder for one AI to rise and take over everything. It's a flimsy defense but maybe better than none at all.
The elephant in the room though is that OpenAI's alignment efforts for now seem to be mostly about stopping the AI from saying nasty words, and even that in an inefficient way. It makes sense from a market perspective, but it sure doesn't inspire confidence.
People already try to outbid each other for limited housing or education. Recall how cheap mortgages and student loans have driven up the price of these things. We shouldn't give people even more self-harming ways to overpay for them.
Having an extra option is good for one person, if all else stays constant. But giving an extra option to several competing people can lead to an arms race where everyone ends up worse off. (Imagine allowing steroids in the Olympics.) And conversely, taking away an option can prevent an arms race. This can happen for both "good" and "bad" options.
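The steroids example can be made concrete with a toy payoff check (all numbers are invented for illustration): doping is the dominant strategy for each athlete individually, yet both doping leaves both worse off than both abstaining.

```python
# Payoff = chance of winning minus health cost of doping (illustrative numbers).
def payoff(me_dopes, rival_dopes):
    win_chance = 0.5 + (0.2 if me_dopes and not rival_dopes else
                        -0.2 if rival_dopes and not me_dopes else 0.0)
    health_cost = 0.1 if me_dopes else 0.0
    return win_chance - health_cost

# Whatever the rival does, doping improves my payoff...
assert payoff(True, False) > payoff(False, False)
assert payoff(True, True) > payoff(False, True)
# ...but the resulting equilibrium is worse for everyone than the old status quo.
assert payoff(True, True) < payoff(False, False)
```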
If that take on things is correct, then it may be possible to emulate a human as follows: train a skeleton AI on constant video streaming (etc.) of the person over a 10-20 year period (about how long neurons last before replacement) to better and better predict their behaviour. The result may eventually be an AI with almost exactly the same beliefs and behaviours as the human being emulated.
That's the premise of Greg Egan's "Jewel" stories. I think it's wrong. A person who never saw a spider will still get scared when seeing one for the first time, because human...
It'd be interesting to figure out where the biggest danger in this setup is coming from. 1) Difficulty of aligning the wrapper 2) Wild behavior from the LLM 3) Something else. And whether there can be spot fixes for some of it.
It seems to me that agency does lag behind extrapolation capability. I can think of two reasons for that. First, extrapolation gets more investment. Second, agency might require a lot of training in the real world, which is slow, while extrapolation can be trained on datasets from the internet. If someone invents a way to train agency on datasets from the internet, or something like AlphaZero's self-play, in a way that carries over to the real world, I'll be pretty scared, but so far it hasn't happened afaik.
If the above is right, then maybe the first agen...
There will be fewer first AGIs than there are human researchers, and they will be smarter than human researchers. So if they care about alignment as much as we do, that seems like good news - they'll have an easier time coordinating and an easier time solving the problem. Or am I missing something?
I think it does buy something. The AI one step after us might be roughly as aligned as us (or a bit less), but noticeably better at figuring out what the heck alignment is and how to ensure it on the next step.
Yeah. Or rather, we do have one possible answer - let the person themselves figure out by what process they want to be extrapolated, as steven0461 explained in this old thread - but that answer isn't very good, as it's probably very sensitive to initial conditions, like which brand of coffee you happened to drink before you started self-extrapolating.
I'm a bit torn about this. On one hand, yes, the situations an AI can end up in and the choices it'll have to make might be too complex for humans to understand. But on the other hand, we could say all we want is one incremental step in intelligence (i.e. making something smarter and faster than the best human researchers) without losing alignment. Maybe that's possible while still having the wrapper tractable. And then the AI itself can take care of next steps, if it cares about alignment as much as we do.
Yeah, LLMs somewhat understand how to do good stuff, and how to label it as good. Also they somewhat understand how to do bad stuff, and how to label it as bad. So the situation is symmetric. The question in the post was, can we make it asymmetric? Make a dataset that, when extrapolated, tends toward outputting information that helps humanity?
To be fair, it's not entirely symmetric. Current datasets are already a bit biased toward human morality, because they consist of texts written by humans. In a way that's lucky. If we'd first gotten powerful AIs train...
Ok, let's assume good actors all around. Imagine we have a million good people volunteering to generate/annotate/curate the dataset, and the eventual user of the AI will also be a good person. What should we tell these million people, what kind of dataset should they make?
It seems as a result of this post, many people are saying that LLMs simulate people and so on. But I'm not sure that's quite the right frame. It's natural if you experience LLMs through chat-like interfaces, but from playing with them in a more raw form, like the RWKV playground, I get a different impression. For example, if I write something that sounds like the start of a quote, it'll continue with what looks like a list of quotes from different people. Or if I write a short magazine article, it'll happily tack on a publication date and "All rights reser...
I see a common pattern in your arguments. Ukraine never did large-scale repression against Russian speakers - "but they would've done it". Europe didn't start sanctioning Russian resources until several months into the war - "but they would've done it anyway". The US reduced troops and warheads in Europe every year from 1991 to 2021 - "but they would have attacked us". 141 countries vote in the UN to condemn Russian aggression - "but they're all US puppets, just waiting for a chance to harm us".
There's a name for this kind of irrationality: paranoia. Dictators often drum up paranoia to stay in power, which has the side effect of making the country aggressive.
It's true that annexing Crimea would've been rational in a world where +base and +region were the only consequences. (Similar to how the US in the 1840s grabbed Texas and California from Mexico without much trouble.) But we do not live in that world. We live in a world where many countries are willing to penalize Russia for annexation and help Ukraine defend. Russia's leadership didn't understand that and still doesn't. As a result, Russia's security and economic situation have both gotten much worse and continue to get worse. That's why I call it irrational.
This seems to miss the point of my comment. What are the reasons for annexation? Not just military action, or even regime change, but specifically annexation? All military goals could be achieved by regime change, keeping Ukraine in current borders, and that would've been much better optics. And all economic reasons disappeared with the end of colonialism. So why annexation? My answer: it's an irrational, ideological desire for that territory. That desire has taken hold of many Russians, including Putin.
I think Western colonialism was really bad, US wars were really bad, the Nazis were really bad, and so on. But from what I see of Russia's position, these are excuses. The true reason for the current war is annexation.
Russia could try to get Ukraine away from NATO, remove ultranationalists, protect Russian speakers and whatever else - purely as a military operation, without annexation. Instead, two days after the Maidan in 2014 and before any hostile action from the new Ukrainian government, Russia initiated annexing Crimea. That move was very popular with...
Sorry, I had another reply here but then realized it was silly and deleted it. It seems to me that "I am a language model", already used by the big players, is pretty much a self-aware prompt anyway. It truthfully tells the AI its place in the real world. So the jump from it to "I am a language model trying to help humanity" doesn't seem unreasonable to think about.
Can you explain the reasons? GPT has millions of users. Someone is sure to come up with unfriendly self-aware prompts. Why shouldn't we try to come up with a friendly self-aware prompt?
I've been thinking in the same direction.
Wonder what the prompt should look like. "You're a smart person being simulated by GPT", or the slightly more sophisticated "Here's what a smart person would say if simulated by GPT", runs into the problem that GPT doesn't actually simulate people, and with enough intelligence that fact is discoverable. A contradiction implies anything, so the AI's behavior after figuring this out might become erratic. So it seems the prompt needs to be truthful. Something like "Once upon a time, GPT started talking in a way that led...
As far as I can tell, the answer is: don’t reward your AIs for taking bad actions.
I think there's a mistake here which kind of invalidates the whole post. If we don't reward our AI for taking bad actions within the training distribution, it's still very possible that in the future world, looking quite unlike the training distribution, the AI will be able to find such an action. Same as ice cream wasn't in evolution's training distribution for us, but then we found it anyway.
I think the behavior of LLMs in the long run might not be very interesting. Since the oldest tokens are continually being deleted, information is being lost and eventually it'll get stuck in a mumble loop. And the set of mumble loops seems much smaller and less interesting than the set of answers we could get in the short run.
I guess yeah. The more general point is that AIs get good at something when they have a lot of training data for it. Have many texts or pictures from the internet = learn to make more of these. So to get a real world optimizer you "only" need a lot of real world reinforcement learning, which thankfully takes time.
It's not so rosy though. There could be some shortcut to get lots of training data, like AlphaZero's self play but for real world optimization. Or a shortcut to extract the real world optimization powers latent in the datasets we already have, like "write a conversation between smart people planning to destroy the world". Scary.
If high wages drive automation, then regions or industries with the highest wages ought to have the most automation.
But if at the same time automation drives wages down, then the result can look very different. Regions or industries with the highest wages will get them "shaved off" by automation first, then the next ones and so on, until the only wage variation we're left with is uncorrelated with automation and caused by something else.
More generally, consider a negative feedback loop where increase in A causes increase in B, and increase in B causes d...
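The wage-shaving dynamic can be sketched as a toy simulation (the regions, wage levels, and automation threshold are all made-up numbers): automation keeps targeting whichever wages are highest, so in the end state all wages sit near the floor, equal and uncorrelated with automation, while cumulative automation tracks the *initial* wage differences.

```python
# Toy model: automation repeatedly "shaves off" the highest wage.
wages = {"A": 100, "B": 80, "C": 60}   # hypothetical regions
automation = {r: 0 for r in wages}
FLOOR = 50  # wage level below which automating isn't worth it

for _ in range(20):
    region = max(wages, key=wages.get)  # automation targets the highest wage
    if wages[region] <= FLOOR:
        break
    automation[region] += 1             # automating that region...
    wages[region] -= 10                 # ...drives its wage down

print(wages)       # all wages end up at the floor: no correlation left
print(automation)  # but regions that started richer got automated more
```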
What kinds of POC attacks would be the most useful for AI alignment right now? (Aside from ChatGPT jailbreaks)
IMO the hierarchy of POCs would be:
As an immediate, concrete example: figuring out how to create a POC mesa-optimizer using standard deep learning librari...
X risk would be passenger pigeons, no?
Anyway your comment got me thinking. So far it seems the territory colonized by humans is a subset of the territory previously colonized by life, not stretching beyond it. And the territory covered by life is also not all of Earth, nevermind the universe. So we can imagine AI occupying the most "cushy" subset of former human territory, with most humans removed from there, some subsisting as rats, some as housecats, some as wild animals periodically hit by incomprehensible dangers coming from the AI zone (similar to oil...
All gurus are grifters. It's one of those things that seem like unfounded generalizations, then you get a little bit of firsthand experience and go "ohhh that's why it was common sense".
There are many kinds of expressive controllers, they've been around for decades. They didn't catch on. Not sure we can answer why: failure to catch on is the default, it doesn't demand an explanation. The interesting question is why other things succeeded in the meantime, like keyboards with knobs, or drum machines, or turntables. Why they led to impactful music, while a lot of more advanced stuff didn't. It seems the reasons are cultural, and different every time.
What this means practically is, there's no guaranteed path. You can try to make an expressive...
Wait, but you can't just talk about compensating content creators without looking at the other side of the picture. Imagine a business that sells some not-very-good product at too high a price. They pay Google for clever ad targeting, and find some willing buyers (who end up dissatisfied). So the existence of such businesses is a net negative to the world, and is enabled by ad targeting. And this might not be an edge case: depending on who you ask, most online ads might be for stuff you'd regret buying.