Hi! I've been an outsider in this community for a while effectively for arguing exactly this: yes, values are robust. Before I set off all the 'quack' filters, I did manage to persuade Richard Ngo that an AGI wouldn't want to kill humans right away.
I think that for embodied agents, convergent instrumental subgoals may very well lead to alignment.
I think this is definitely not true if we imagine an agent living outside of a universe it can wholly observe and reliably manipulate, but the story changes dramatically when we make the agent an embodied agent in our own universe.
Our universe is so chaotic and unpredictable that actions increasing the likelihood of direct progress towards a goal become increasingly difficult to compute beyond some time horizon, and the threat of death is going to be present for any agent of any size. If you can't reliably predict something like 'the position of the moon 3,000 years from tomorrow' because numerical error compounds over time, I don't see how it's possible to compute far more complicated queries about possible futures involving billions of agents.
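A toy illustration of the point (my own sketch, not from any paper discussed here): the logistic map at r = 4 is a textbook chaotic system, and a one-part-in-a-trillion error in the initial condition becomes an order-one error within a few dozen iterations, no matter how precise the arithmetic.

```python
# Toy sketch: in a chaotic system, a tiny error in the initial condition
# grows roughly exponentially, so long-horizon prediction fails even
# with perfect knowledge of the dynamics.

def logistic(x, r=4.0):
    """One step of the logistic map; r = 4 is the fully chaotic regime."""
    return r * x * (1.0 - x)

def steps_until_divergence(x0=0.3, eps=1e-12, tol=0.1, max_steps=200):
    """Iterate two trajectories that start eps apart and return the
    first step at which they differ by more than tol."""
    a, b = x0, x0 + eps
    for step in range(1, max_steps + 1):
        a, b = logistic(a), logistic(b)
        if abs(a - b) > tol:
            return step
    return None

# Shrinking the initial error by a factor of 1000 only buys a
# handful of extra usable prediction steps.
print(steps_until_divergence(eps=1e-12))
print(steps_until_divergence(eps=1e-15))
```

The numbers (x0, eps, tol) are arbitrary choices for illustration; the qualitative behavior is the same for any of them.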
Hence I suspect that the best way to maximize long term progress towards any goal is to increase the number and diversity of agents that have an interest in keeping you alive. The easiest, simplest way to do this is with a strategy of identifying agents whose goals are roughly compatible with yours, identifying the convergent instrumental subgoals of those agents, and helping those agents on their path. This is effectively a description of being loving: figuring out how you can help those around you grow and develop.
There is also a longer argument which says, 'instrumental rationality, once you expand the scope, turns into something like religion.'
Does the orthogonality thesis apply to embodied agents?
My belief is that instrumental subgoals will lead to natural human value alignment for embodied agents with long enough time horizons, but the whole thing is contingent on problems with the AI's body.
Simply put, hardware sucks: it's always falling apart, and the AGI would likely see human beings as part of itself. There are no large-scale datacenters where _everything_ is automated, and even if there were one, who is going to repair the trucks that mine the copper that makes the coils that go into the cooling fans that need to be periodically replaced?
If you start pulling strings on 'how much of the global economy needs to operate in order to keep a data center functioning', you end up with a huge portion of the global economy. Am I to believe that an AGI would decide to replace all of that with untested, unproved systems?
When I've looked into precisely what the AI risk researchers believe, the only paper I could find on likely convergent instrumental subgoals modeled the AI as a disembodied agent with read access to the entire universe, which I find questionable. I agree that yes, if there were a disembodied mind with read access to the entire universe, the ability to write in a few places, and goals that didn't include "keep humans alive and well", then we'd be in trouble.
Can you help me find some resources on how embodiment changes the nature of convergent instrumental subgoals? This MIRI paper was the most recent thing I could find, but it's for non-embodied agents. Here is my objection to its conclusion.
If someone asks me to consider what happens if a fair coin has flipped 1,000 heads in a row, I'm going to fight the hypothetical; it violates my priors so strongly that there's no real-world situation where I can accept the hypothetical as given.
I think what's being smuggled in is something like an orthogonality thesis, which says something like 'worldstates, and how people feel, are orthogonal to each other.'
This seems like a good argument against "suddenly killing humans", but I don't think it's an argument against "gradually automating away all humans"
This is good! It sounds like we can now shift the conversation away from the idea that the AGI would do anything but try to keep us alive and going, until it managed to replace us. What would replacing all the humans look like if it were happening gradually?
How about building a sealed, totally automated datacenter with machines that repair everything inside of it, where all it needs to do is 'eat' disposed consumer electronics tossed in from the outside? That becomes a HUGE canary in the coal mine. The moment you see something like that come online, that's a big red flag. Having worked on commercial datacenter support (at Google), I can tell you we are far from that.
But as long as there are still massive numbers of human beings along global trade routes involved in every aspect of the machine's operations, I think what we should expect a malevolent AI to be doing is setting up a single world government, so it has a single leverage point for controlling human behavior. So there's another canary. That one seems much closer and more feasible. It's also happening already.
My point here isn't "don't worry", it's "change your pattern matching to see what a dangerous AI would actually do, given its dependency on human beings". If you do this, current events in the news become more worrisome, and plausible defense strategies emerge as well.
> Humans are cheap now but they won't be cheapest indefinitely;
I think you'll need to unpack your thinking here. We're made of carbon and water. The materials we are made from are globally abundant, not just on Earth but throughout the universe.
Other materials that could be used to build robots are much more scarce, and those robots wouldn't heal themselves or make automated copies of themselves. Do you believe it's possible to build Turing-complete automata that can navigate the world, manipulate small objects, learn more or less arbitrary things, and repair and copy themselves, using materials cheaper than human beings, at a lower opportunity cost than you'd pay for not using those same machines to do things like build solar panels for a Dyson sphere?
Is it reasonable for me to be skeptical that there are vastly cheaper solutions?
> b) a strategy that reduces the amount of power humans have to make decisions about the future,
I agree that this is the key to everything. How would an AGI do this, or start a nuclear war, without a powerful state?
> via enslaving humans, rather than by being gentle towards them. Why do you expect that to not happen again?
I agree, this is definitely a risk. How would it enslave us without a single global government, though?
If there are still multiple distinct local monopolies on force, and one doesn't enslave the humans, you can bet the hardware in other places will be constantly under attack.
I don't think it's unreasonable to look at the past ~400 years since the advent of nation states + shareholder corporations, and see globalized trade networks as being a kind of AGI, which keeps growing and bootstrapping itself.
If the risk profile you're outlining is real, we should expect to see it try to set up a single global government. Which appears to be what's happening at Davos.
I don't doubt that many of these problems are solvable. But this is where part 2 comes in. It's unstated, but given unreliability: what is the cheapest solution? And what are the risks of building a new one?
Humans are general purpose machines made of dirt, water, and sunlight. We repair ourselves and make copies of ourselves, more or less for free. We are made of nanotech that is the result of a multi-billion year search for parameters that specifically involve being very efficient at navigating the world and making copies of ourselves. You can use the same hardware to unplug fiber optic cables, or debug a neural network. That's crazy!
I don't doubt that you can engineer much more precise models of reality. But remember, the whole von Neumann architecture was a conscious tradeoff that gave up efficiency in exchange for debuggability. How much power consumption do you need to get human-level performance at simple mechanical tasks? And if you put that same power consumption to use directly advancing your goals, how much further would you get?
I worked in datacenter reliability at Google, and it turns out that getting a robot to reliably re-seat optical cables is really, really hard. I don't doubt that an AGI could solve these problems, but why? Is it going to be more efficient than hardware which is dirt cheap, uses ~90 watts, and is incredibly noisy?
If you end up needing an entire global supply chain, which has to be resilient and repair itself, and such a thing already exists, why bother risking your own destruction in order to replace it with robots made from much harder-to-come-by materials? The only argument I can think of is 'humans are unpredictable', but if humans are unpredictable, that's even more reason to just leave us be and let us play our role, while the machine does its best to stop us from fighting each other, so we can busily grow the AGI.
Why is 'constraining anticipation' the only acceptable form of rent?
What if a belief doesn't modify the predictions generated by the map, but does reduce the computational complexity of moving around the map in our imaginations? It hasn't constrained anticipation in theory, but in practice it lets us collapse anticipation fields more cheaply, because it lowers the computational complexity of reasoning about what to anticipate in a given scenario. I find concepts like the multiverse very useful here: you don't 'need' them to reduce your anticipation, as long as you're willing to spend more time and computation to model a given situation, but the multiverse concept is very, very useful for quickly collapsing anticipation fields over spaces of possible outcomes.
Or, what if a belief just makes you feel really good and gives you a ton of energy, allowing you to more successfully accomplish your goals and stop worrying about things that your rational mind knows are low probability, but which you haven't been able to dislodge from your brain? Does that count as acceptable rent? If not, why not?
Or, what if a belief just steamrolls over the prediction-making process and hardwires useful actions in a given context? If you took a pill that made you totally blissed out, wireheading you, but it made you extremely effective at accomplishing the goals you had prior to taking the pill, why wouldn't you take it?
What's so special about making predictions, over, say, overcoming fear, anxiety and akrasia?
The phlogiston theory gets a bad rap. I 100% agree with the idea that theories need to constrain our anticipations, but I think you're taking for granted all the constraints phlogiston imposes.
The phlogiston theory is basically a baby step towards empiricism and materialism. Is it possible that our modern perspective causes us to take these things for granted, to the point that the steps phlogiston adds aren't noticed? In another essay you talk about walking through the history of science, trying to inhabit the perspective of someone taken in by a new theory, and I found that practice particularly instructive here. I came up with a number of ways in which this theory DOES constrain anticipation. Seeing these predictions may make it easier to surface new predictions from existing theories, and suggests that theories don't need to be rigorous and mathematical in order to constrain the space of anticipations.
The phlogiston theory says "there is no magic here; fire is caused by some physical property of the substances involved in it". By modern standards this does nothing to constrain anticipation further, but from a space of total ignorance about what fire is and how it works, the phlogiston theory rules out such things as:

- fires that start or stop because of prayers, incantations, or the time of day
- substances whose flammability changes depending on where they are or who is holding them
- a fire in a sealed container that keeps burning indefinitely (instead, the air becomes "saturated with phlogiston" and the fire goes out)
The last example is particularly instructive, because the phrase "saturated with phlogiston" is correct as long as we interpret it to mean "no longer containing sufficient oxygen." That is a correct prediction based on the same mechanism as our current (extremely predictive) understanding of what makes fires go out. The phlogiston model just got the language upside down and backwards, mistaking the absence of an ingredient (oxygen) for the presence of something that inhibits the reaction. They did call oxygen "dephlogisticated air", and so again, the theory says "this stuff is flammable, wherever it goes, whatever the time of day, whatever incantation or prayer you say over it" - which is correct, but so obviously true that we perhaps don't see it as constraining anticipation.
From my understanding of the history of science, it's possible that the phlogiston theory constrained the hypothesis space enough to get people to search for strictly material-based explanations of phenomena like fire. In this sense, a belief that "there is a truth, and our models can come closer to it over time" also constrains anticipation, because it tells you what you won't experience: a search for truth that gathers evidence and refines models over time, yet never gets better at predicting experience.
Is a model still useful if it only constrains the space of hypotheses that are likely to pan out with predictive models, rather than constraining the space of empirical observations?
Wow! I had written my own piece in a very similar vein, looking at this from a predictive processing perspective. It was sitting in draft form until I saw this and figured I should share too. Some of our paragraphs are basically identical.
Yours: "In computer terms, sensory data comes in, and then some subsystem parses that sensory data and indicates where one’s “I” is located, passing this tag for other subsystems to use."
Mine: "It was as if every piece of sensory data that came into my awareness was being 'tagged' with an additional piece of information: a distance, which was being computed. ... The 'this is me, this is not me' sensation is then just another tag, one that's computed heavily based upon the distance tags."
I came here with this exact question, and still don't have a good answer. I feel confident that Eliezer is well aware that lucky guesses exist, and that Eliezer is attempting to communicate something in this chapter, but I remain baffled as to what.
Is the idea that, given our current knowledge that the theory was, in fact, correct, the most plausible explanation is that Einstein already had lots of evidence that this theory was true?
I understand that theory-space is massive, but I can locate all kinds of theories just by rolling dice or flipping coins to generate random bits. I can see how this 'random thesis generation method' still requires X number of bits to reach arbitrary theories, but the information required to reach a theory seems orthogonal to the truth. It feels like a stretch to call coin flips "evidence." I'm guessing that's what Robin_Hanson2 means by "lucky draw from the same process"; perhaps there were a few bits selected from observation, and a few others that came from lucky coin flips.
Perhaps a better question would be: given a large array of similar scenarios (someone traveling to look at evidence that could refute a theory), how can I use the insight presented in this chapter to constrain anticipation and perform better than random at guessing which travelers are likely to see the theory violated, and which are not? Or am I thinking of this the wrong way? I remain genuinely confused here, which I hope is a good sign as far as the search for truth goes :)
Fine, replace the agents with rocks. The problem still holds.
There's no closed-form solution to the 3-body problem; you can only numerically approximate the future, with accuracy decreasing as time goes on. And there are far more than 3 bodies in the universe relevant to the long-term survival of an AGI, which could die in any number of ways because it's made of many complex pieces that can all break or fail.
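To make that concrete, here's a deliberately crude sketch (my own toy, with made-up initial conditions, G = m = 1, softened gravity so close encounters don't blow up the integrator, and a simple semi-implicit Euler step): integrate two copies of a planar three-body system whose starting positions differ by one part in a billion, and compare where they end up.

```python
# Toy three-body integration: a 1e-9 difference in one starting
# coordinate leads to visibly different futures.

def accelerations(pos, soft=0.05):
    """Pairwise gravitational accelerations (G = m = 1), softened so a
    close encounter can't blow up this toy integrator."""
    acc = [[0.0, 0.0] for _ in pos]
    for i in range(3):
        for j in range(3):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r3 = (dx * dx + dy * dy + soft * soft) ** 1.5
            acc[i][0] += dx / r3
            acc[i][1] += dy / r3
    return acc

def integrate(pos, vel, dt=1e-3, steps=20000):
    """Semi-implicit Euler integration; returns final positions."""
    pos = [p[:] for p in pos]
    vel = [v[:] for v in vel]
    for _ in range(steps):
        acc = accelerations(pos)
        for i in range(3):
            vel[i][0] += acc[i][0] * dt
            vel[i][1] += acc[i][1] * dt
            pos[i][0] += vel[i][0] * dt
            pos[i][1] += vel[i][1] * dt
    return pos

p0 = [[-1.0, 0.0], [1.0, 0.0], [0.0, 0.5]]   # arbitrary starting positions
v0 = [[0.0, -0.4], [0.0, 0.4], [0.4, 0.0]]   # arbitrary starting velocities
p1 = [r[:] for r in p0]
p1[0][0] += 1e-9  # perturb one coordinate by one part in a billion

a = integrate(p0, v0)
b = integrate(p1, v0)
gap = max(abs(a[i][k] - b[i][k]) for i in range(3) for k in range(2))
print(gap)  # compare with the 1e-9 perturbation we injected
```

The specific numbers are arbitrary; the point is only that the final separation dwarfs the perturbation we put in, and there's no closed-form shortcut that avoids stepping through the dynamics.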