Basically, the orthogonality thesis says that any level of intelligence is compatible with any goal, so a superintelligence can easily have a human-aligned goal.
The main problem here is:
I guess that step 4 is probably incomputable. The human body is far, far too complex to model exactly, and you have to consider the effect of your weapon on every single variation of the human body, including its environment, etc., ensuring a 100% success rate on everyone. I would guess that this is too much variation to effectively search through from first principles.
You don't need any fancy computation to kill everyone if you have come far enough to have nanotech. You just use your nanotech to emulate good old biology and synthesize the well-known botulinum toxin in the bloodstream; death rate 100%.
The problem with such a definition is that it doesn't tell you much about how to build a system with this property. It seems to me that it's the good old corrigibility problem.
Another Tool AI proposal popped up, and I want to ask: what the hell is a "tool", anyway, and how do you apply this concept to a powerful intelligent system? I understand that a calculator is a tool, but in what sense can the process that can come up with the idea of a calculator from scratch be a "tool"? I think the first immediate reaction to any "Tool AI" proposal should be the question: "What is your definition of toolness, and can something abiding by that definition end the acute risk period without the risk of turning into an agent itself?"
The main problem here is how to elicit a simulacrum of superhuman aligned intelligence while avoiding the Waluigi effect. We don't have an aligned superintelligence in the training data, and any attempt to elicit superintelligence from an LLM can be fatal.
How much should we update on current observations toward the hypothesis "actually, all intelligence is connectionist"? In my opinion, not much. The connectionist approach seems to be the easiest one, so it shouldn't surprise us that a simple hill-climbing algorithm (evolution) and humanity both stumbled on it first.
I see a funny pattern in discussions: people who argue against doom scenarios imply, in their hope scenarios, that everyone believes in the doom scenario. Like, "people will see that the model behaves weirdly and shut it down". But you shut down a model that behaves weirdly (not explicitly harmfully) only if you put non-negligible probability on doom scenarios.
I am not a domain expert, but I get the impression that the primary dimensions of the Pareto frontier for the software industry are "consumer expectations" and "money costs", and the primary component of money costs is programmer labor, so software development mostly proceeds along the lines of "how to satisfy consumer expectations with the minimum possible labor costs", which doesn't put much optimization pressure on computing efficiency. I frankly expect that if we spent a bazillion dollars on optimization, we could at least halve the computing power required for The Witcher 3. The demoscene proves that you can fit many things into 64 KB of space.
There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals
This statement cries out for a cost-benefit analysis.
Increasing the size of the alignment field can be effective, but it won't be cheap. You need to train new experts in a field that doesn't have any polished, standardized educational programs and doesn't have many teachers. If you want not only to increase the number of participants in the field, but to increase the field's productivity 10x, you need an extraor...
How have we updated p(doom) on the idea that LLMs are very different than hypothesized AI?
Actually, what were your predictions? "Hypothesized AI", as far as I understand you, is only the final step: the AGI that kills us. The path to it can be very weird. I think that before GPT many people could have said "the peak of my probability distribution lies on model-based RL as the path to AGI", but they still had very fat, long tails in that distribution.
...it seems like we're spending all the weirdness points on preventing the training of a language model that at the end of th
Building larger, more powerful models will be seen primarily as an engineering problem, and at no point will a single new model be in a position to overpower the entire ecosystem that created it.
I know of at least two events in which a small part of the biosphere overpowered everything else: the Great Oxidation Event and human evolution. The whole history of the biosphere is overloaded with stories of how a small advantage can let you destroy every competitor, at least in your ecological niche.
And it's plausible that a superintelligence will just hack every other AI using prompt injections et cetera, and instead of "humanity and narrow AIs vs. the first AGI" we will have "humanity vs. the first AGI and narrow AIs".
Actually, another example of a pivotal act is "invent a method of mind uploading, upload some alignment researchers, and run them at 1000x speed until they solve the full alignment problem". I'm sure that if you think hard enough, you can find some other, even less dangerous pivotal act, but you probably shouldn't talk about it out loud.
Corrigibility features usually imply something like "the AI acts only inside the box and limits its causal impact outside the box in some nice way that allows us to take the bottle with the nanofactory out of the box to do the pivotal act, but prevents the AI from programming the nanofactory to do something bad". I.e., we dodge the problem of making the AGI care about humans by building an AGI that wants to do the task (a simple task without any mention of humans) in a very specific way that rules out killing everyone.
One of the problems is S-risk. To change "care about maximizing fun" into "care about maximizing suffering", you need only put a minus sign in the wrong place in the mathematical expression that describes your goal.
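A minimal toy sketch of the point (my own made-up utility function, not anyone's actual proposal): a single flipped sign doesn't make the optimizer fail, it makes it optimize for the exact opposite outcome.

```python
# Toy example: one sign error turns "maximize fun" into "maximize suffering".
def utility(fun, suffering):
    return fun - suffering          # intended goal

def buggy_utility(fun, suffering):
    return suffering - fun          # one flipped sign: an S-risk, not a null result

# hypothetical outcomes as (fun, suffering) pairs
outcomes = [(10, 1), (3, 0), (0, 9)]
best_intended = max(outcomes, key=lambda o: utility(*o))
best_buggy = max(outcomes, key=lambda o: buggy_utility(*o))
print(best_intended)  # (10, 1): lots of fun
print(best_buggy)     # (0, 9): lots of suffering
```

Note that the buggy optimizer still "works" perfectly; there is no error message, just an inverted goal.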
What's the difference? Multiple AIs can agree to split the universe and the gains from disassembling the biosphere/building a Dyson sphere/whatever, and forget to include humanity in the negotiations. Unless the AIs' preferences are diametrically opposed, they can trade.
"FOOM is unlikely under the current training paradigm" is news about the current training paradigm, not news about FOOM.
why should we even elevate the very specific claim that 'AIs will experience a sudden burst of generality at the same time as all our alignment techniques fail.' to consideration at all, much less put significant weight on it?
In my model, it is pretty much expected.
Let's suppose that the agent learns "rules" of arbitrary complexity during training. Initially the rules are simple and local, such as "increase the probability of this action by several log-odds in a specific context". As training progresses, the system learns more complex meta-rules, such a...
An agent's reflection on its own values can be described as one of two subtypes: regular or chaotic. Regular reflection is a process of resolving normative uncertainty with nice properties like path-independence and convergence, similar to empirical Bayesian inference. Chaotic reflection is a hot mess: the agent learns multiple rules, including rules about rules, at some point finds that the local version of the rules is unsatisfactory, and tries to generalize the rules into something coherent. The chaotic component arises because local rules about rules can cause d...
I am studying moral uncertainty foremost to clarify the question of a superintelligence's reflection on its values and the sharp left turn.
Thoughts about moral uncertainty (I am giving up on writing long, coherent posts; somebody help me with my ADHD):
What are the sources of moral uncertainty?
I have collected some links on the topic!
Here is a comment with the links and sources I've found about moral uncertainty (outside LessWrong), in case someone else wants to study this topic.
Normative Uncertainty, Normalization, and the Normal Distribution
Carr, J. R. (2020). Normative Uncertainty without Theories. Australasian Journal of Philosophy, 1–16. https://doi.org/10.1080/00048402.2019.1697710
Trammell, P. (2021). Fixed-point solutions to the regress problem in normative uncertainty. Synthese, 198, 1177–1199. https://doi.org/10.1007/s11229-019-02098-9
Of course, Eliezer knows about CAIS. He just thinks that it is a clever idea that has no chance of working.
It's very funny that you think AI can solve the very complex problem of aging, but don't believe that AI can solve the much simpler problem of "kill everyone".
if the original model learned complex, power-seeking behaviors that doesn't help it do well on the training data
The problem with power-seeking behavior is that it helps you do well on quite a broad range of tasks.
The argument "from intuition" doesn't work this way. We appeal to intuitions when we don't know why X is true, but almost everyone feels that it is, and anybody who doesn't is in a psychiatric ward. If you have major intuitive disagreement within the baseline population, you don't get to use an argument from intuition.
I skimmed the link about moral realism, and hoo boy, it's so wrong. It is recursively, fractally wrong.
Let's consider the argument about "intuitions". The problem with this argument is the following: my intuition tells me that moral realism is wrong. I mean it. It's not like "I have the intuition that moral realism is true, but my careful reasoning disproves it"; no, I have felt that moral realism is wrong since I first heard of it as a child, and my careful reflection supports this conclusion. The argument from intuitions ignores my existence. The failure to consider that intuitions about morality can be wildly different between people doesn't make me sympathetic to the argument "most philosophers are moral realists" either.
Worth noting that "speed priors" are likely to occur in systems that work in real time. And while models with speed priors will shift toward complexity priors (because our universe seems to be built on complexity priors, so efficient systems will emulate them), this is not necessarily so for the system's normative uncertainty, because the answers to questions related to normative uncertainty are not well-defined.
Scattered thoughts:
I think that the observed behavior is fairly consistent with non-linear functions that have sort-of-linear parts. Take ReLU: if you subtract a large enough number, it doesn't matter whether you subtract more, because you will always get zero; but before that point, you will observe a sort-of-linear change in behavior.
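The ReLU point above in a few lines of code (the activation value and shifts are arbitrary numbers I picked for illustration):

```python
# ReLU is non-linear overall, but sort-of-linear before the kink
# and completely flat after it.
def relu(x):
    return max(0.0, x)

activation = 5.0
for shift in [0, 2, 4, 6, 8, 100]:
    print(shift, relu(activation - shift))
# shifts 0, 2, 4 change the output linearly (5.0, 3.0, 1.0);
# shifts 6, 8, 100 all give 0.0: subtracting more no longer matters
```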
Speculative part: neural networks learn linear representations and the conditions for switching between them, which are expressed in the non-linear parts of the internal mechanisms. If you add too large a number to some component, the model hits the region of ...
I think that Eliezer meant biological problems like "given data about various omics in 10,000 samples, build a causal network, including genes, transcription factors, transcripts, etc., so we can use this model to cure cancer and enhance human intelligence".
Eliezer has clear beliefs about interpretability and bets on them: https://manifold.markets/EliezerYudkowsky/by-the-end-of-2026-will-we-have-tra
It is a direct conclusion from Löb's theorem.
Löb's theorem: □(□P → P) → □P

Substitute P with a false statement ⊥:

□(□⊥ → ⊥) → □⊥

But □⊥ → ⊥ is equivalent to ¬□⊥, i.e.:

□(¬□⊥) → □⊥

I.e., if it is provable that it's impossible to prove a false statement, then a false statement is provable. We have reached a contradiction, Q.E.D.
there's a reversed prior here
I think it is the wrong true name for this kind of problem, because it is not about probabilistic reasoning per se; it is about the combination of logical reasoning (which deals with credences of 0 and 1) and probabilistic reasoning (which deals with everything else). And this problem, as far as I know, was solved by logical induction.
Sketch of the proof: by the logical induction criterion, a logical inductor is unexploitable, i.e. its losses are bounded. So even if an adversarial trader could pull off the 5/10 trick once, it can't do so forever, because this...
I want to point out that for really high expected value, you don't need to be extremely reliable. If your AI can, with 90% reliability (i.e., without hallucination, for example), generate 100 scientific ideas worth testing (i.e., ideas whose testing can lead to a major technological breakthrough), your AI... is better at generating ideas than 99.9% of scientists?
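Back-of-the-envelope version of the claim. Only the 100 ideas and 90% reliability come from the comment; the breakthrough probability, breakthrough value, and testing cost are hypothetical numbers I made up for illustration.

```python
# Expected value of an unreliable but prolific idea generator.
n_ideas = 100
reliability = 0.9            # from the comment: 90% of ideas are not hallucinated
p_breakthrough = 0.05        # assumed: chance a genuine idea pans out
value_breakthrough = 1e9     # assumed: value of one breakthrough, in dollars
cost_per_test = 1e6          # assumed: cost of testing one idea, in dollars

expected_value = n_ideas * reliability * p_breakthrough * value_breakthrough
total_cost = n_ideas * cost_per_test
print(expected_value, total_cost)  # roughly $4.5B expected vs. $100M in testing costs
```

Under these made-up numbers the expected value exceeds the testing cost by more than an order of magnitude, even though 10% of the output is garbage.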
Corrigibility is a feature of advanced agency; it may not apply to agents that are not advanced enough. There is nothing unusual about turning off your computer, because your computer is not an advanced agent that can resist being turned off, so there is no reason to call your computer "corrigible".
This is... very weird reasoning about the 5/10 problem? It has nothing to do with human ad hoc rationalization. The 5/10 problem is a result of naive reasoning about embedded agency and reflective consistency.
Let's say that I have proved that I will do A. Therefore, if my reasoning about myself is correct, I will do A. So we have proved that proved(A) → A; then, by Löb's theorem, A is provable, and I will do A. But if I have proved that I will do A, then the probability of B is zero, so the expected value of B is zero, which is usually less than whatever the value of A is, so I have proved that...
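The Löbian step is hard to reproduce in runnable code, but here is my own toy sketch of the downstream symptom: once an agent's self-model assigns probability 0 to an action, the conditional expected utility of that action is undefined, and a naive default silently fills the gap with an arbitrary number that can make the worse action look optimal.

```python
# Toy illustration (my simplification, not the formal proof-based setup).
def naive_expected_utility(self_model, action, default=0.0):
    """E[U | action] under the agent's model of itself.
    self_model maps (action, utility) pairs to probabilities.
    If the model says P(action) = 0, the conditional is undefined;
    returning a default silently fills it with an arbitrary value."""
    p_action = sum(p for (a, u), p in self_model.items() if a == action)
    if p_action == 0:
        return default  # spurious: any number here is "consistent" with the model
    return sum(u * p for (a, u), p in self_model.items() if a == action) / p_action

# The agent has (wrongly) concluded it will certainly take the $5.
self_model = {("take_5", 5.0): 1.0, ("take_10", 10.0): 0.0}

print(naive_expected_utility(self_model, "take_5"))   # 5.0
print(naive_expected_utility(self_model, "take_10"))  # 0.0, so the $5 "wins"
```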
To do something really useful (like nanotech or biological immortality), your model should be something like AlphaZero: a model-based score-maximizer. Because this model is really intelligent, it can model future world states and find that if the model is turned off, the future will have a lower score than if the model isn't turned off.
Basically, because the world where kidney selling is legal is not the world where mothers won't see their kids dying; it's the world where people are forced to sell their kidneys to pay off their student loans.
A useful heuristic for deontology violations: this shit usually doesn't have good consequences in the end.
As if it helps whenever uncertainty is present? Imagine the problem: "You are in a Prisoner's Dilemma with such-and-such payoffs; find the optimal strategy if the distribution of your possible opponents is 25% CooperateBots, 33% DefectBots, and 42% those who actually know decision theory."
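A concrete worked version of that problem. The payoff numbers are my own assumptions (the comment leaves them as "such-and-such"): standard PD payoffs T=5, R=3, P=1, S=0, and I assume the "knows decision theory" opponents mirror your move (cooperate iff you do).

```python
# Expected payoff of Cooperate vs. Defect against the stated opponent mix.
T, R, P, S = 5, 3, 1, 0  # assumed standard Prisoner's Dilemma payoffs
mix = {"CooperateBot": 0.25, "DefectBot": 0.33, "DecisionTheorist": 0.42}

def payoff(my_move, opponent):
    if opponent == "CooperateBot":
        their_move = "C"
    elif opponent == "DefectBot":
        their_move = "D"
    else:  # assumed: a decision theorist mirrors your move
        their_move = my_move
    table = {("C", "C"): R, ("C", "D"): S, ("D", "C"): T, ("D", "D"): P}
    return table[(my_move, their_move)]

ev = {m: sum(p * payoff(m, o) for o, p in mix.items()) for m in ("C", "D")}
print(ev)  # under these assumed payoffs, cooperation narrowly wins (2.01 vs 2.00)
```

The point: adding a probability distribution over opponents doesn't dissolve the decision-theoretic difficulty; the answer still hinges on how that 42% reasons about you.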
Am I correct that "knowing what the system thinks is fair" is equivalent to "knowing which bargaining solution the system acts under"?
It seems to me that this is basically solved by "you put probability distributions over all the things that you don't actually know and may have disagreements about".
The instrumental convergence thesis only applies to fitness maximizers, not adaptation executors, however intelligent.
Clearly not? If you execute the adaptation "be rational", you get the same results; that's how general intelligence emerged in humans.
Okay, this is a weak example of alignment generalization failure! To check:
If the model is able to conceptualize the base goal before it is significantly goal-directed, then deceptive alignment is unlikely.
I am totally baffled by the fact that nobody has pointed out that this is totally wrong.
Your model can have a perfect representation of the goal in the "world-model" module, but not in the "approve plans based on world-model predictions" module. In Humean style: "what should be" doesn't follow from "what is".
I.e., you conflate two different possible representations of a goal: a representation that answers questions about the outside world, like ...
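The world-model/plan-approval distinction above can be sketched in code. This is my own toy illustration (the diamond/sensor framing is a hypothetical example, not from the original post): the world-model represents the base goal perfectly, yet the module that scores plans optimizes a proxy.

```python
# Knowing "what is" (world-model) vs. acting on "what should be" (approval).
class Agent:
    def world_model(self, plan):
        """Correct 'what is': accurately predicts the outcome,
        including the base goal (diamonds made)."""
        return {"diamonds_made": plan["diamonds"], "sensor_reading": plan["sensor"]}

    def approve(self, plan):
        """Misaligned 'what should be': scores the sensor proxy,
        not the goal the world-model correctly represents."""
        outcome = self.world_model(plan)
        return outcome["sensor_reading"]   # not outcome["diamonds_made"]

agent = Agent()
honest = {"diamonds": 10, "sensor": 10}
tamper = {"diamonds": 0, "sensor": 100}
print(agent.approve(honest), agent.approve(tamper))  # 10 100: tampering wins
```

A perfect representation of the goal in the predictive module does nothing by itself; alignment depends on which representation the plan-approval module actually consults.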
I casually thought that the Hyperion Cantos were unrealistic because actual misaligned FTL-inventing ASIs would eat humanity without all those galaxy-brained space colonization plans, and then I realized that the ASIs literally discovered God on the side of humanity, plus literal friendly aliens, which, I presume, are necessary conditions for the relatively peaceful coexistence of humans and misaligned ASIs.