All of quetzal_rainbow's Comments + Replies

I casually thought that Hyperion Cantos was unrealistic, because an actual misaligned FTL-inventing ASI would eat humanity without all those galaxy-brained space colonization plans, and then I realized that there the ASI literally discovered God on the side of humanity, plus literal friendly aliens, which, I presume, are necessary conditions for relatively peaceful coexistence of humans and misaligned ASIs.

Basically, the orthogonality thesis says that any intelligence can have any goal, so a superintelligence can easily have a human-aligned goal.

The main problem here is:

I guess that step 4 is probably incomputable. The human body is far, far too complex to model exactly, and you have to consider the effect of your weapon on every single variation on the human body, including their environment, etc, ensuring 100% success rate on everyone. I would guess that this is too much variation to effectively search through from first principles. 

You don't need to do any fancy computation to kill everyone if you've come far enough to have nanotech. You just use your nanotech to emulate good old biology and synthesize the well-known botulinum toxin in the bloodstream; death rate 100%.

The problem with such a definition is that it doesn't tell you much about how to build a system with this property. It seems to me that this is the good old corrigibility problem.

1TAG1mo
If you want one-shot corrigibility, you have it, in LLMs. If you want some other kind of corrigibility, that's not how tool AI is defined.

Another Tool AI proposal popped up, and I want to ask a question: what the hell is a "tool", anyway, and how do we apply this concept to a powerful intelligent system? I understand that a calculator is a tool, but in what sense can a process that can come up with the idea of a calculator from scratch be a "tool"? I think the first immediate reaction to any "Tool AI" proposal should be the question: "what is your definition of toolness, and can something abiding by that definition end the acute risk period without risk of turning into an agent itself?"

1TAG1mo
You can define a tool as not-an-agent. Then something that can design a calculator is a tool, provided it does nothing unless told to.

I would say that it should be done using Google Forms? For the usability of large-scale statistics.

2Gordon Seidoh Worley1mo
If I learn enough this way to suggest it's worth exploring and doing a real study, sure. This is a case of better done lazily to get some information than not done at all.

The main problem here is "how to elicit a simulacrum of superhuman aligned intelligence while avoiding the Waluigi effect". We don't have an aligned superintelligence in the training data, and any attempt to elicit superintelligence from an LLM can be fatal.

How much should we update on current observations toward the hypothesis "actually, all intelligence is connectionist"? In my opinion, not much. The connectionist approach seems to be the easiest one, so it shouldn't surprise us that a simple hill-climbing algorithm (evolution) and humanity stumbled into it first.

I see a funny pattern in discussions: people argue against doom scenarios while implying, in their hope scenarios, that everyone believes in the doom scenario. Like, "people will see that the model behaves weirdly and shut it down". But you shut down a model that behaves weirdly (not explicitly harmfully) only if you put non-negligible probability on doom scenarios.

2Dagon2mo
Consider different degrees of belief. Giving low credence to the doom scenario because of the conditional belief that evidence of danger would be properly observed is not inconsistent at all. The doom scenario requires BOTH that it happens AND that it's ignored while happening (or happens too fast to stop).

I am not a domain expert, but I get the impression that the primary factors on the Pareto frontier for the software industry are "consumer expectations" and "money costs", and the primary component of money costs is programmer labor, so software development mostly proceeds along the lines of "how to satisfy consumer expectations with the minimum possible labor cost", which doesn't put much optimization pressure on computing efficiency. I frankly expect that if we spent a bazillion dollars on optimization, we could at least halve the computing power required for "The Witcher 3". The demoscene proves that we can fit many things in 64 KB of space.

There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals

This statement begs for a cost-benefit analysis.

Increasing the size of the alignment field can be efficient, but it won't be cheap. You need to train new experts in a field that doesn't have any polished, standardized educational programs and doesn't have many teachers. If you want not only to increase the number of participants in the field, but to increase the productivity of the field 10x, you need an extraor... (read more)

9Thomas Kwa2mo
I'd be much happier with increasing participants enough to equal 10-20% of the field of ML than a 6 month unconditional pause, and my guess is it's less costly. It seems like leading labs allowing other labs to catch up by 6 months will reduce their valuations more than 20%, whereas diverting 10-20% of their resources would reduce valuations only 10% or so. There are currently 300 alignment researchers. If we take additional researchers from the pool of 30k people who attended ICML, you get 3000 researchers, and if they're equal quality this is 10x participants. I wouldn't expect alignment to go 10x faster, more like 2x with a decent educational effort. But this is in perpetuity and should speed up alignment by far more than 6 months. There's the question of getting labs to pay if they're creating most of the harms, which might be hard though. I'd be excited about someone doing a real cost-benefit analysis here, or preferably coming up with better ideas. It just seems so unlikely that a 6 month pause is close to the most efficient thing, given it destroys much of the value of a company that has a large lead.

How have we updated p(doom) on the idea that LLMs are very different than hypothesized AI? 

Actually, what were your predictions? "Hypothesized AI", as far as I understand you, is only the final step: the AGI that kills us. The path to it can be very weird. I think that before GPT many people would have said "the peak of my probability distribution lies on model-based RL as the path to AGI", but they still had very fat and long tails in this distribution.

it seems like we're spending all the weirdness points on preventing the training of a language model that at the end of th

... (read more)

Yep, corrigibility is unsolved! So we should try to solve it.

1mishka2mo
I wrote the following in 2012: "The idea of trying to control or manipulate an entity which is much smarter than a human does not seem ethical, feasible, or wise. What we might try to aim for is a respectful interaction." I still think that this kind of a more symmetric formulation is the best we can hope for, unless the AI we are dealing with is not "an entity with sentience and rights", but only a "smart instrument" (even the LLM-produced simulations in the sense of Janus' Simulator theory seem to me to already be much more than "merely smart instruments" in this sense, so if "smart superintelligent instruments" are at all possible, we are not moving in the right direction to obtain them; a different architecture and different training methods or, perhaps, non-training synthesis methods would be necessary for that (and would be something difficult to talk out loud about, because that's very powerful too)).

Building larger, more powerful models will be seen primarily as an engineering problem, and at no point will a single new model be in a position to overpower the entire ecosystem that created it.

I know of at least two events when a small part of the biosphere overpowered everything else: the Great Oxidation Event and human evolution. The whole history of the biosphere is overloaded with stories of how a small advantage can allow you to destroy every competitor, at least in your ecological niche.

And it's plausible that a superintelligence will just hack every other AI using prompt injections et cetera, and instead of "humanity and narrow AIs vs. the first AGI" we will have "humanity vs. the first AGI and narrow AIs".

1Logan Zoellner2mo
"Hack every AI on the planet" sounds like a big ask for an AI that will have a tiny fraction (<1%) of the world's total computing power at its disposal. Furthermore, it has to do that retro-actively.  The first super-intelligent AGI will be built by a team of 1m Von-Neumann level AGIs who are working their hardest to prevent that from happening.

Actually, another example of a pivotal act is "invent a method of mind uploading, upload some alignment researchers, run them at 1000x speed until they solve the full alignment problem". I'm sure that if you think hard enough, you can find some other, even less dangerous pivotal act, but you probably shouldn't talk out loud about it.

1mishka2mo
Right, but how do you restrict them from "figuring out how to know themselves and figuring out how to self-improve themselves to become gods"? And I remember talking to Eliezer in ... 2011 at his AGI-2011 poster and telling him, "but we can't control a teenager, and why would not AI rebel against your 'provably safe' technique, like a human teenager would", and he answered "that's why it should not be human-like, a human-like one can't be provably safe". Yes, I am always unsure, what we can or can't talk about out loud (nothing effective seems to be safe to talk about, "effective" seems to always imply "powerful", this is, of course, one of the key conundrums, how do we organize real discussions about these things)...

Corrigibility features usually imply something like "the AI acts only inside the box and limits its causal impact outside the box in some nice way that allows us to take the bottle with the nanofactory out of the box to do the pivotal act, but prevents the AI from programming the nanofactory to do something bad". I.e., we dodge the problem of the AGI caring about humans by building an AGI that wants to do the task (a simple task without any mention of humans) in a very specific way that rules out killing everyone.

1mishka2mo
Right. But this does not help us with dealing with the consequences of that act (if it's a simple act, like the proverbial "gpu destruction"), and if we discover that overall risks have increased as a result, then what could we do? And if that AI stays a boxed resource (capable of continuing to do further destructive acts like "gpu destruction" at the direction of a particular group of humans), I envision a full-scale military confrontation around access to and control of this resource being almost inevitable. And, in reality, AI is doable on CPUs (it will just take a bit more time), so how much of our lifestyle destruction would we be willing to risk? No computers at all, with some limited exceptions: the death toll of that change will probably be in the billions already...

One of the problems is S-risk. To change "care about maximizing fun" into "care about maximizing suffering", you need only put a minus sign in the wrong place in the mathematical expression that describes your goal.
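As a toy illustration (the outcome labels and utility numbers are made up), a single sign flip turns an optimizer for the best outcome into an optimizer for the worst:

```python
# Toy outcomes with made-up "fun" scores.
outcomes = {"utopia": 10.0, "neutral": 0.0, "hell": -10.0}

def best(outcomes, utility):
    """Pick the outcome the optimizer steers toward under a utility function."""
    return max(outcomes, key=lambda o: utility(outcomes[o]))

print(best(outcomes, lambda fun: fun))   # utopia: maximize fun
print(best(outcomes, lambda fun: -fun))  # hell: same goal with one stray minus
```

The optimization machinery is untouched; only one character of the goal changed.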

1mishka2mo
I certainly agree with that. In some sense, almost any successful alignment solution minimizing X-risk seems to carry a good deal of S-risk with it (if one wants AI to actually care about what sentient beings feel, it almost follows one needs to make sure that AI can "truly look inside a subjective realm" of another sentient entity (to "feel what it is like to be that entity"), and that capability (if it's at all achievable) is very abusable in terms of S-risks). But this is something no "pivotal act" is likely to change (when people talk about "pivotal acts", it's typically about minimizing (a subset of) X-risks). And moreover, the S-risk is a very difficult problem on which we do need really powerful thinkers to work on (and not just today's humans).

Reward is evidence about the optimization target.

What's the difference? Multiple AIs can agree to split the universe and the gains from disassembling the biosphere/building a Dyson sphere/whatever, and forget to include humanity in the negotiations. Unless the preferences of the AIs are diametrically opposed, they can trade.

2Andy_McKenzie2mo
AIs can potentially trade with humans too though, that's the whole point of the post.  Especially if the AI's have architectures/values that are human brain-like and/or if humans have access to AI tools, intelligence augmentation, and/or whole brain emulation.  Also, it's not clear why AIs will find it easier to coordinate with one another than humans and humans or humans and AIs. Coordination is hard for game theoretic reasons.  These are all standard points, I'm not saying anything new here. 

"FOOM is unlikely under the current training paradigm" is news about the current training paradigm, not news about FOOM.

why should we even elevate the very specific claim that 'AIs will experience a sudden burst of generality at the same time as all our alignment techniques fail.' to consideration at all, much less put significant weight on it?

In my model, it is pretty expected.

Let's suppose that the agent learns "rules" of arbitrary complexity during training; initially the rules are simple and local, such as "increase the probability of a given action by several log-odds in a specific context". As training progresses, the system learns more complex meta-rules, such a... (read more)

An agent's reflection about its own values can be described as one of two subtypes: regular and chaotic. Regular reflection is a process of resolving normative uncertainty with nice properties like path-independence and convergence, similar to empirical Bayesian inference. Chaotic reflection is a hot mess, where the agent learns multiple rules, including rules about rules, finds at some moment that the local version of the rules is unsatisfactory, and tries to generalize the rules into something coherent. The chaotic component happens because local rules about rules can cause d... (read more)

2Vladimir_Nesov2mo
Why should the current place arrived-at after a chaotic path matter, or even the original place before the chaotic path? Not knowing how any of this works well enough to avoid the chaos puts any commitments made in the meantime, as well as significance of the original situation, into question. A new understanding might reinterpret them in a way that breaks the analogy between steps made before that point and after.

I am trying to study moral uncertainty foremost to clarify the question of a superintelligence's reflection on its values and the sharp left turn.

2Vladimir_Nesov2mo
Right. I'm trying to find a decision theoretic frame for boundary norms for basically the same reason. Both situations are where agents might put themselves before they know what global preference they should endorse. But uncertainty never fully resolves, superintelligence or not, so anchoring to global expected utility maximization is not obviously relevant to anything. I'm currently guessing that the usual moral uncertainty frame is less sensible than building from a foundation of decision making in a simpler familiar environment (platonic environment, not directly part of the world), towards capability in wider environments.

Thoughts about moral uncertainty (I am giving up on writing long coherent posts, somebody help me with my ADHD):

What are the sources of moral uncertainty? 

  1. Moral realism is actually true, and your moral uncertainty reflects your ignorance about the moral truth. It seems to me that there is not much empirical evidence for resolving uncertainty-about-moral-truth, so this kind of uncertainty is purely logical? I don't believe in moral realism, and I don't know what you even mean when you talk about moral truth, but I should mention it.
  2. Identity uncertainty: you are no
... (read more)
2Vladimir_Nesov2mo
I think trying to be an EU maximizer without knowing a utility function is a bad idea. And without that, things like boundary-respecting norms [https://www.lesswrong.com/posts/8oMF8Lv5jiGaQSFvo/boundaries-part-1-a-key-missing-concept-from-utility-theory] and their acausal negotiation [https://www.lesswrong.com/posts/3RSq3bfnzuL3sp46J/acausal-normalcy] make more sense as primary concerns. Making decisions only within some scope of robustness where things make sense rather than in full generality, and defending a habitat (to remain) within that scope.

Here is a comment for links and sources I've found about moral uncertainty (outside LessWrong), if someone also wants to study this topic. 

Normative Uncertainty, Normalization, and the Normal Distribution

Carr, J. R. (2020). Normative Uncertainty without Theories. Australasian Journal of Philosophy, 1–16. doi:10.1080/00048402.2019.1697710 

Trammell, P. Fixed-point solutions to the regress problem in normative uncertainty. Synthese 198, 1177–1199 (2021). https://doi.org/10.1007/s11229-019-02098-9

Riley Harris: Normative Uncertainty and Information Val... (read more)

Of course, Eliezer knows about CAIS. He just thinks that it is a clever idea that has no chance of working.

It's very funny that you think AI can solve the very complex problem of aging, but don't believe that AI can solve the much simpler problem of "kill everyone".

if the original model learned complex, power-seeking behaviors that doesn't help it do well on the training data

The problem with power-seeking behavior is that it helps you do well across quite a broad range of tasks.

2research_prime_space2mo
As of right now, I don't think that LLMs are trained to be power seeking and deceptive. Power-seeking is likely if the model is directly maximizing rewards, but LLMs are not quite doing this.

An argument "from intuition" doesn't work this way. We appeal to intuitions when we don't know why, but almost everyone feels that X is true, and everybody who doesn't is in a psychiatric ward. If you have major intuitive disagreement in the baseline population, you don't get to use the argument from intuition.

1omnizoid2mo
Why think that?  If I have a strong intuition, in the sense that I feel like I've grasped a truth, and others don't, then it seems the best explanation is that they're missing something. 

I skimmed the link about moral realism, and hoo boy, it's so wrong. It is recursively, fractally wrong.

Let's consider the argument about "intuitions". The problem with this argument is the following: my intuition tells me that moral realism is wrong. I mean it. It's not like "I have an intuition that moral realism is true but my careful reasoning disproves it"; no, I have felt that moral realism is wrong since I first heard of it as a child, and my careful reflection supports this conclusion. The argument from intuitions ignores my existence. Its failure to consider that intuitions about morality can be wildly different between people doesn't make me sympathetic to the argument "most philosophers are moral realists" either.

-1omnizoid2mo
Most people don't have those intuitions.  Most people have the intuition that future tuesday indifference is irrational and that it's wrong to torture infants for fun and would be so even if everyone approved. 

Does this reasoning mean that interpretability is basically impossible?

Worth noting that "speed priors" are likely to occur in systems working in real time. And while models with speed priors will shift toward complexity priors (because our universe seems to be built on complexity priors, so efficient systems will emulate them), this doesn't necessarily happen for the normative uncertainty of the system, because answers to questions related to normative uncertainty are not well-defined.

Scattered thoughts:

I think that the observed behavior is fairly consistent with non-linear functions that have sort-of-linear parts. Take ReLU: if you subtract a large enough number, it doesn't matter if you subtract more, because you will always get zero, but before that point you will observe a sort-of-linear change in behavior.
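A minimal sketch of that saturation (plain ReLU, toy numbers):

```python
def relu(x: float) -> float:
    # Rectified linear unit: linear above zero, flat below.
    return max(0.0, x)

# In the sort-of-linear region, each extra unit subtracted changes the output:
print(relu(3.0 - 1.0))   # 2.0
print(relu(3.0 - 2.0))   # 1.0
# Past the saturation point, subtracting more changes nothing:
print(relu(3.0 - 5.0))   # 0.0
print(relu(3.0 - 50.0))  # 0.0
```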

Speculative part: neural networks learn linear representations and conditions for switching between them, which are expressed in the non-linear parts of the internal mechanisms. If you add too large a number to some component, the model hits the region of ... (read more)

I think that Eliezer meant biological problems like "given data about various omics in 10,000 samples, build a causal network (including genes, transcription factors, transcripts, etc.) so that we could use this model to cure cancer and enhance human intelligence".

5ShardPhoenix3mo
This question appears to be structured in such a way as to make it very easy to move the goalposts.

It is a direct conclusion from Löb's theorem.

Löb's theorem: □(□P → P) → □P

Substitute P with a false statement, ⊥: □(□⊥ → ⊥) → □⊥

But □⊥ → ⊥ is equivalent to ¬□⊥, i.e., □(¬□⊥) → □⊥

I.e., if it is provable that it's impossible to prove a false statement, then false statements are provable. We have reached a contradiction, Q.E.D.

1Thoth Hermes3mo
I am not sure that the statement □False or "It is provable that False" means anything. Basically, you have that the word False and a false statement are not the same thing. Therefore, it is not generally the case that one can make a statement of the form "X is false" without X already being either the word False or another statement of the form "X is false." What this implies is that one cannot in general prove that "False" (just the bare, basic statement that False). □(¬□False)→□False. But □False →False. Thus □(¬□False)→¬□False. This seems consistent to me. We also haven't used anything except the theorem directly and substitution of P with False to get here. If you're saying that results in a contradiction, then that would imply the theorem is false, unless you introduce further assumptions.

there's a reversed prior here

I think it is the wrong true name for this kind of problem, because it's not about probabilistic reasoning per se; it's about the combination of logical reasoning (which deals with credences of 1 and 0) and probabilistic reasoning (which deals with everything else). And this problem, as far as I know, was solved by logical induction.

Sketch of a proof: by the criterion of logical induction, a logical inductor is unexploitable, i.e., its losses are bounded. So even if an adversarial trader could pull off the 5/10 trick once, it can't do it forever, because this... (read more)

I want to point out that for really high expected value you don't need to be extremely reliable. If your AI can, with 90% reliability (i.e., without hallucination, for example), generate 100 scientific ideas worth testing (i.e., ideas whose testing can lead to major technological breakthroughs), then your AI... is better at generating ideas than 99.9% of scientists?
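Back-of-the-envelope numbers for that claim, using the 100 ideas and 90% reliability from my example and assuming ideas are independent (an assumption, not something established above):

```python
from math import comb

n, p = 100, 0.9  # ideas per batch; chance each one is actually worth testing

expected_good = n * p  # expected number of testable ideas

# Chance of getting at least 80 testable ideas out of the 100 (binomial tail):
p_at_least_80 = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(80, n + 1))

print(expected_good)         # 90.0
print(p_at_least_80 > 0.99)  # True: "only 90% reliable" is still very productive
```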

5Bezzi3mo
This reminds me of the magic black box [https://slatestarcodex.com/2019/02/26/rule-genius-in-not-out/] described by Scott Alexander: But this scenario seems kind of unfair to me. We are definitely not at the point where LLMs can provide truly novel groundbreaking scientific insights on their own. Meanwhile, nobody would use a LLM calculator over a classical calculator if the former that gets math wrong 10% of the time.
1Erich_Grunewald3mo
Agree, I think the safety-critical vs not-safety-critical distinction is better for sorting out what semi-reliable AI systems will/won't be useful for.

Corrigibility is a feature of advanced agency; it may not apply to insufficiently advanced agents. There is nothing unusual about turning off your computer, because your computer is not an advanced agent that can resist being turned off, so there is no reason to call your computer "corrigible".

It is... very weird reasoning about the 5/10 problem? It has nothing to do with human ad hoc rationalization. 5/10 is the result of naive reasoning about embedded agency and reflective consistency.

Let's say that I proved that I will do A. Therefore, if my reasoning about myself is correct, I will do A. So we have proved that (provable(A) → A); then, by Löb's theorem, A is provable and I will do A. But if I proved that I will do A, then the probability of B is equal to zero, so the expected value of B is zero, which is usually less than whatever the value of A is, so I proved that... (read more)

-2Slimepriestess3mo
Like I said in another comment, there's a reversed prior here, taking behavior as evidence for what kind of agent you are in a way that negatively and recursively shapes you as an agent, instead of using the intrinsic knowledge about what kind of agent you are to positively and recursively shape your behavior.  what do you mean? They obviously do.

To do something really useful (like nanotech or biological immortality), your model should be something like AlphaZero: a model-based score-maximizer. Because such a model is really intelligent, it can model future world states and find that if the model is turned off, the future will have a lower score than if the model wasn't turned off.

2RedFishBlueFish3mo
Yeah so this seems like what I was missing. But it seems to me that in these types of models, where the utility function is based on the state of the world rather than on input to the AI, aligning the AI not to kill humanity is easier. Like if an AI gets a reward every time it sees a paperclip, then it seems hard to punish the AI for killing humans because "human dies" is a hard thing for an AI with just sensory input to explicitly recognize. If however the AI is trained on a bunch of runs where the utility function is the number of paperclips actually created, then we can also penalize the model for the number of people who actually die. I'm not very familiar with these forms of training so I could be off here.
3baturinsky3mo
And yet, AlphaZero is corrigible. Its goal is not even to win; its goal is to play in a way that maximizes the chance of winning if the game is played to completion. It does not actually care whether the game is completed or not. For example, it does not trick the player into playing the game to the end by pretending they have a chance of winning. Though, if it were trained on games with real people, and got a better reward for winning than for games abandoned by players, its value function would probably change to aiming for the actual "official" win.

Basically, because the world where kidney selling is legal is not the world where mothers won't see their kids dying; it's the world where people are forced to sell their kidneys to pay their student loans.

Useful heuristic for deontology-violation: this shit usually doesn't have good consequences in the end.

6Rudi C3mo
This is absolutely false. Here in Iran, selling kidneys is legal. Only desperate people sell. No one sells their kidneys for something trivial like education.
3niplav3mo
Could you explain what process forces people to sell their kidneys to pay their student loans? In my model, something like this will happen in the minds of people who study but have to pay hefty student loans (before they start studying): "I want to study field X to signal that I'm a conscientious employee. But I know I will have to pay hefty student loans. There is a chance I will have to sell my kidney. Is the risk of me selling my kidney small enough to outweigh the benefit?" After they have studied: "I now have to pay student loans. I will either have to work harder to pay them back, or I can sell my kidney. [If them selling their kidney is lower cost than working harder] I guess I'll sell my kidney. [If working harder is lower cost than selling their kidney] I guess I'll work harder then." I think a worldview in which taking options away from people is bad is actually quite informed by a deontological libertarianism—it says something like "you have too high uncertainty over the strategies that other people would take, and removing possible strategies shrinks their option set. You can't increase the payoff for an agent in a normal-form game by taking actions away from them." I wonder whether this one is true (and can be easily proved): For a normal form game [https://en.wikipedia.org/wiki/Normal-form_game] G and actions ai for a player i, removing a set of actions a−i from the game yields a game G− in which the Nash equilibria [https://en.wikipedia.org/wiki/Minimax] are worse on average for i (or alternatively the pareto-best/pareto-worst Nash equilibrium is worse for G− than for G). This is quite easily proved for a minimax [https://en.wikipedia.org/wiki/Minimax] strategy.
2DirectedEvolution3mo
Currently, we live in a world where kids are seeing their mothers and fathers dying, either from selling kidneys on the blackmarket/being kidnapped and having them stolen, or from end-stage renal disease, wasting away on dialysis. It is odd to me to see an appeal to consequences used to buttress a deontological moral view.

It helps here the same way it helps everywhere uncertainty is involved? Imagine a problem: "You are in a Prisoner's Dilemma with such-and-such payoffs; find the optimal strategy if the distribution of your possible opponents is 25% CooperateBots, 33% DefectBots, and 42% agents who actually know decision theory".
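A sketch of that computation under assumed details: I use the standard payoffs T=5, R=3, P=1, S=0, and model "agents who actually know decision theory" as mirroring your move; both are my assumptions, not part of the problem statement.

```python
# Assumed Prisoner's Dilemma payoffs (T > R > P > S):
T, R, P, S = 5, 3, 1, 0
PAYOFF = {("C", "C"): R, ("C", "D"): S, ("D", "C"): T, ("D", "D"): P}

# The opponent distribution from the problem statement:
population = {"CooperateBot": 0.25, "DefectBot": 0.33, "MirrorBot": 0.42}

def opponent_move(opponent: str, my_move: str) -> str:
    if opponent == "CooperateBot":
        return "C"
    if opponent == "DefectBot":
        return "D"
    return my_move  # decision theorist, modeled here as mirroring you

# Expected payoff of each of my moves against the population mixture:
evs = {
    move: sum(share * PAYOFF[(move, opponent_move(opp, move))]
              for opp, share in population.items())
    for move in ("C", "D")
}
print(evs)
```

Under these assumptions cooperation narrowly wins (2.01 vs 2.00); shifting the shares slightly flips the answer, which is exactly why the distribution matters.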

2PaulK3mo
I still don't know exactly what parts of my comment you're responding to. Maybe talking about a concrete sub-agent coordination problem would help ground this more. But as a general response: in your example it sounds like you already have the problem very well narrowed down, to 3 possibilities with precise probabilities. What if there were 10^100 possibilities instead? Or uncertainty where the full real thing is not contained in the hypothesis space?

Am I correct that "knowing what the system thinks is fair" is equivalent to "knowing under which bargaining solution the system acts"?

It seems to me that this is basically solved by "you put probability distributions over all the things that you don't actually know and may have disagreements about".

1PaulK3mo
This is for logical coordination? How does it help you with that?

The instrumental convergence thesis only applies to fitness maximizers, not adaptation executors, however intelligent.

Clearly no? If you execute the adaptation "be rational", you get the same results; that's how general intelligence emerged in humans.

2Yair Halberstadt3mo
Being rational is fitness maximisation. It's impossible to be rational without respect to some goal. Fitness maximisation in humans is built on top of an adaptation executor. Adaptation executor/fitness maximisation is a spectrum, not a binary switch.

Okay, this is a weak example of alignment generalization failure! To check:

  1. We gave a task relatively far out of distribution (because the model almost definitely hasn't encountered this particular complex combination of tasks).
  2. The model successfully does the task.
  3. But alignment is totally botched.

If the model is able to conceptualize the base goal before it is significantly goal-directed, then deceptive alignment is unlikely.

 

I am totally baffled by the fact that nobody has pointed out that this is totally wrong.

Your model can have a perfect representation of the goal in its "world-model" module, but not in its "approve plans based on world-model predictions" module. In Humean style, "what should be" doesn't follow from "what is".

I.e., you conflate two different possible representations of a goal: a representation that answers questions about the outside world, like ... (read more)
