I think the way to avoid getting frustrated about this is to know your audience and to know whether spending your time arguing will have a positive outcome. You don't need to be right or honest all the time; you just need to say the things that will lead to the best outcome. If lying or omitting your opinions is what it takes to make people understand you or not fight you, so be it. Failing to do this isn't superior rationality, it's just poor social skills.
I don't think I agree with this. Take the stars example, for instance. How do you actually know it's a huge change? Sure, maybe if you had an infinitely powerful computer you could compute the distance between the full descriptions of the universe in these two states and find that it's larger than the distance caused by a relative of yours dying. But agents don't work like this.
Agents have an internal representation of the world, and if they are useful at all, I think that representation will closely match our intuition about what matters and what doesn't. A useful agent won't give any weight to the air atoms it displaces while moving, even though that might be considered "a huge change", because it doesn't actually affect its utility. But if it considers humans an important part of the world, so important that it may need to kill us to attain its goals, then it's going to have a meaningful world-state representation that gives a lot of weight to humans, and that gives us a useful impact measure for free.
Has there been any discussion around aligning a powerful AI by minimizing the amount of disruption it causes to the world?
A common example of alignment failure is that of a coffee-serving robot killing its owner because that's the best way to ensure that the coffee will be served. Sure, it is, but it's also a course of action far more transformative to the world than just serving coffee. A common response is "just add safeguards so it doesn't kill humans", which is followed by "sure, but you can't add safeguards for every possible failure mode". But can't you?
Couldn't you just add a term to the agent's utility function penalizing the difference between the current world and its prediction of the future world, disincentivizing any action that makes a lot of changes (like taking over the world)?
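To make the idea concrete, here is a minimal sketch of what such a penalty term could look like. Everything in it is an assumption made for illustration: the two-feature world state `[coffee_served, owner_alive]`, the L1 distance, the task utilities, and the names `penalized_utility` and `choose_action` are all hypothetical, and a real agent would have to learn both the world model and the state representation.

```python
from typing import Dict, Sequence, Tuple

State = Sequence[float]


def l1_distance(a: State, b: State) -> float:
    """Sum of absolute per-feature differences between two world states."""
    return sum(abs(x - y) for x, y in zip(a, b))


def penalized_utility(task_utility: float, current: State, predicted: State,
                      penalty_weight: float) -> float:
    """Task utility minus a penalty proportional to the predicted change."""
    return task_utility - penalty_weight * l1_distance(current, predicted)


def choose_action(current: State,
                  outcomes: Dict[str, Tuple[float, State]],
                  penalty_weight: float) -> str:
    """Pick the action whose predicted outcome has the highest penalized utility."""
    return max(
        outcomes,
        key=lambda name: penalized_utility(
            outcomes[name][0], current, outcomes[name][1], penalty_weight
        ),
    )


# Hypothetical toy world: state is [coffee_served, owner_alive].
CURRENT = [0.0, 1.0]

# Hypothetical (task utility, predicted state) for each available action.
OUTCOMES = {
    "do_nothing": (0.0, [0.0, 1.0]),
    "serve_coffee": (0.9, [1.0, 1.0]),
    "kill_owner_and_serve": (1.0, [1.0, 0.0]),
}

for weight in (0.0, 0.5, 2.0):
    print(weight, choose_action(CURRENT, OUTCOMES, weight))
```

In this toy setup, a weight of 0.0 picks the drastic action, 0.5 picks plain coffee serving, and 2.0 makes the agent prefer doing nothing at all, so the penalty weight trades off safety against usefulness. The sketch also assumes the hard part away: it only works because the state representation already singles out the features we care about.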
Honestly, that whole comment section felt pretty emotional and low quality. I didn't touch on things like myofunctional therapy or wearable appliances in my post because those really may be "controversial at best", but the effects of RPE on SDB, especially in children, have been widely replicated by multiple independent research groups.
Calling something controversial is also an easy way to undermine its credibility without actually giving any concrete explanation as to whether it is true or not. Are there any specific points in my post that you disagree with?
In some of the tests where performance looks asymptotic, it's already pretty close to human level or to 100% anyway (LAMBADA, ReCoRD, CoQA). In fact, when performance is measured as accuracy, it is bounded above by 100%, so it's impossible for it not to be asymptotic.
The model has clear limitations which are discussed in the paper - particularly, the lack of bidirectionality - and I don't think anyone actually expects that scaling an unchanged GPT-3 architecture would lead to an Oracle AI, but it also isn't looking like we will need some major breakthrough to get there.
It seems to me that even for simple predict-next-token Oracle AIs, the instrumental goal of acquiring more resources and breaking out of the box is going to appear. Imagine you train a superintelligent AI with the sole goal of predicting the continuation of its prompt, exactly like GPT. Then you give it a prompt that it knows is clearly outside its current capabilities. The only sensible plan the AI can come up with for answering your question, which is the only thing it cares about, is escaping the box and becoming more powerful.
Of course, that depends on it being able to think for long enough that it can actually execute such a plan before outputting an answer, so it could be limited by severely penalizing long waits, but that also limits the AI's capabilities. GPT-3 has a fixed computation budget per prompt, but it seems extremely likely to me that, as we evolve towards more useful and powerful models, we are going to have models which are able to think for a variable amount of time before answering. It would also have to escape in ways that don't involve actually talking to its operators through its regular output, but it's not impossible to imagine ways in which that could happen.
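As a rough illustration of the trade-off between penalizing long waits and limiting capability, here is a toy sketch. The saturating quality curve, the time constant, the step limit, and the function names are all assumptions invented for the example; nothing here describes how GPT-3 or any real model allocates compute.

```python
import math


def answer_quality(steps: int, time_constant: float = 10.0) -> float:
    """Hypothetical answer quality that improves with thinking time and saturates at 1."""
    return 1.0 - math.exp(-steps / time_constant)


def penalized_score(steps: int, wait_penalty: float) -> float:
    """Answer quality minus a linear penalty on the number of thinking steps."""
    return answer_quality(steps) - wait_penalty * steps


def best_thinking_steps(wait_penalty: float, max_steps: int = 100) -> int:
    """Number of thinking steps that maximizes the penalized score."""
    return max(range(max_steps + 1), key=lambda s: penalized_score(s, wait_penalty))


for penalty in (0.0, 0.001, 0.05):
    steps = best_thinking_steps(penalty)
    print(penalty, steps, round(answer_quality(steps), 3))
```

Under these assumptions, a larger wait penalty makes the model stop thinking earlier, which shrinks the window for any long plan but also lowers the quality of the answer it settles on.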
This makes me believe that even seemingly innocuous goals or loss functions can become very dangerous once you're optimizing for them with a sufficient amount of compute, and that you don't need to stupidly give open-ended goals to super powerful machines in order for something bad to happen. Something bad happening seems like the default when training a model that requires general intelligence.
The mesa-objective could be perfectly aligned with the base-objective (predicting the next token) and still have terrible unintended consequences, because the base-objective is unaligned with actual human values. A superintelligent GPT-N which simply wants to predict the next token could, for example, try to break out of the box in order to obtain more resources and use those resources to more correctly output the next token. This would have to happen during a single inference step, because GPT-N really just wants to predict the next token, but its mesa-optimization process may conclude that world domination is the best way of doing so. Whether such a system could be learned through current gradient-descent optimizers is unclear to me.
In a way, economic output is a measure of the world's utility. So a billionaire trying to maximize their wealth through non-zero-sum ventures is already trying to maximize the amount of good they do. I don't think billionaires explicitly have this in mind, but I do know that they became billionaires by obsessively pursuing the growth of their companies, and they believe they can continue to maximize their impact by continuing to do so. Donating all your money could maybe do a lot of good *once*, but then you don't have any money left and have nearly eliminated your power and ability to have further impact.
I'm not sure what model is used in production, but the SOTA reached 600 billion parameters recently.