OK, firstly, if we are talking fundamental physical limits, how would sniper drones not be viable? Are you saying a flying platform could never compensate for recoil, even if precisely calibrated beforehand? What about fundamentals for guided bullets - a bullet with over a 50% chance of hitting a target is worth paying for.
Your points - 1. The idea is that a larger shell (not a regular-sized bullet) just obscures the sensor for a fraction of a second in a coordinated attack with the larger Javelin-type missile. Such shells may be considerably larger than a regular bullet, but much cheaper than a missile. Missile- or sniper-sized drones could be fitted with such shells, depending on what the optimal size turned out to be.
Example shell (without the 1 km range, I assume). However, note that chaff is not currently optimized for the described attack; the fact that no shell suited to this use currently exists is not evidence that it would be impractical to create one.
The principle here is about efficiency and cost. I maintain that against armor with hard-kill defenses, a combined attack of sensor blinding and anti-armor missiles is more efficient than missiles alone. E.g. it may take 10 simultaneous Javelins to take out a target vs 2 Javelins plus 50 simultaneous chaff shells. The second attack will be cheaper, and the optimized "sweet spot" will always include some sensor-blinding component. Do you claim that the optimal coordinated attack would have zero sensor blinding?
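A rough way to frame the cost argument, as a minimal sketch. The unit prices below are illustrative assumptions I made up for the example, not sourced procurement figures:

```python
# Illustrative cost comparison of two simultaneous attack packages against a
# hard-kill-defended target. All prices are assumptions for the sake of the
# sketch, not real figures.
JAVELIN_COST = 200_000    # assumed unit cost of an anti-armor missile, USD
CHAFF_SHELL_COST = 500    # assumed unit cost of a sensor-blinding shell, USD

missiles_only = 10 * JAVELIN_COST                        # saturate the hard-kill system with missiles alone
blinding_mix = 2 * JAVELIN_COST + 50 * CHAFF_SHELL_COST  # blind the sensors, then use far fewer missiles

print(f"Missiles only:       ${missiles_only:,}")
print(f"Blinding + missiles: ${blinding_mix:,}")
# With anything like these relative prices the mixed attack is roughly an order
# of magnitude cheaper, which is why I expect the optimum to include some blinding.
```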
2. Leading on from (1), I don't claim light drones will be immune to lasers. I regard a laser as a serious obstacle, one that is attacked with the swarm attack described before the territory is secured. That is: blind the sensor/obscure the laser, and simultaneously converge with missiles. The drones need to survive just long enough to fire off the shells (i.e. come out from ground cover, shoot, get back). While a laser can destroy a shell in flight, can it take out 10-50 smaller blinding shells fired from 1,000 m at once?
(I give 1,000 m as an example too; flying drones would use ground cover to get as close as they could. I assume they will pretty much always be able to get within 1,000 m of a ground target using the ground as cover.)
This sounds to me like it's assuming that if you keep scaling LLMs then you'll eventually get to superintelligence. So I thought something like: "Hmm, MIRI seems to assume that we'll go from LLMs to superintelligence, but LLMs seem much easier to align than the AIs in MIRI's classic scenarios, and also work to scale them will probably slow down eventually, so that will also give us more time."
Yes, I can see that is a downside: if LLMs can't scale enough to speed up alignment research and are not the path to AGI, then having them aligned doesn't really help.
My takeaway from Jacob's work, and my belief, is that you can't separate hardware and computational topology from capabilities. That is, if you want a system to understand and manipulate a 3D world the way humans and other smart animals do, then you need a large number of synapses, specifically in something like a scale-free network design. That means it's not just bandwidth or TEPS, but also many long-distance connections, with only a small number of hops needed between any given pair of neurons. Our current hardware is not set up to simulate this very well, and a single GPU, while having high FLOPS, can't get anywhere near high enough on this measure to match a human brain. Additionally, you need a certain network size before the better architecture even gives an advantage: transformers don't beat CNNs on vision tasks until the task reaches a certain difficulty. These combined lead me to believe that someone with just a GPU or two won't do anything dangerous with a new paradigm.
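To illustrate the hop-count point, here is a minimal sketch using networkx. The graph sizes and parameters are arbitrary toy values (a brain has ~1e11 neurons, not 2,000); the point is only the qualitative contrast between a scale-free graph and a purely local lattice:

```python
# Sketch: why "many long-distance connections" matter. A scale-free
# (Barabasi-Albert) graph needs only a few hops between any two nodes,
# while a purely local ring lattice of the same size and degree needs many.
import networkx as nx

n = 2000
scale_free = nx.barabasi_albert_graph(n, m=4, seed=0)        # hubs + long-range links
local_only = nx.watts_strogatz_graph(n, k=8, p=0.0, seed=0)  # ring lattice, no rewiring

print("avg hops, scale-free:", nx.average_shortest_path_length(scale_free))
print("avg hops, local-only:", nx.average_shortest_path_length(local_only))
# Roughly ~3-4 hops vs ~125 hops: similar node and edge budget, very different
# reachability, which is the property GPU memory hierarchies struggle to emulate.
```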
Based on this, the observation that computers are already superhuman in some domains isn't necessarily a sign of danger - the network required to play Go simply doesn't need the large connected architecture, because the domain (a small, discrete 2D board) doesn't require it.
I agree that there is danger, and a crux for me is how much better an ANN can be than a biological one at, say, science, given that we have not evolved to do abstract symbol manipulation. On one hand there are brilliant mathematicians who can outcompete everyone else; however, the same does not apply to biology. Some things require calculation and real-world experimentation, and intelligence can't shortcut them.
If some problems require computation with a specific topology/hardware, then a GPU setup can't just reconfigure itself and FOOM.
I am a bit surprised that you found this post so novel. How is this different from what MIRI etc. have been saying for ages?
Specifically have you read these posts and corresponding discussion?
Brain efficiency, Doom Part 1, Part 2
I came away from this mostly agreeing with jacob_cannell, though there wasn't consensus.
For this OP, I also agree with the main point about transformers not scaling to AGI, and I believe the brain architecture is clearly better, though not to the degree claimed in the OP. I was going to write something up, but that would take some time and the discussion would have moved on. Much of that was the result of a conversation with OpenAI o3, and I was going to spend time checking all its working. Anyway, here are some of the highlights (they sound plausible, but I haven't checked). I can give more of the transcript if people think it worthwhile.
FLOPS vs TEPS (Traversed Edges Per Second) or something similar
The major point here is that not all FLOPS are equal, and perhaps FLOPS is not even the right measure; something that combines FLOPS and bandwidth is probably better. Biological computing is comparatively better at TEPS than at FLOPS, yet FLOPS is the measure used. O3 claims you would need 5,000 modern GPUs to match the TEPS of the human brain.
It also claims that a 1 million GPU datacenter could only simulate a brain with about 40× the synapses of the human brain, delivering only ~50× its effective TEPS.
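As a back-of-envelope sanity check of that 5,000-GPU figure (my own arithmetic, not o3's): the synapse count and average traversal rate below are assumed ballpark values, and the per-GPU TEPS figure is the one quoted in the table further down.

```python
# Rough check of the "~5,000 GPUs to match brain TEPS" claim.
synapses = 1e14          # assumed synapse count (order-of-magnitude ballpark)
avg_rate_hz = 10         # assumed average rate at which synapses are traversed
brain_teps = synapses * avg_rate_hz   # ~1e15 edge traversals per second

gpu_teps = 2e11          # per-H100 TEPS figure from the table below
print("GPUs for TEPS parity:", brain_teps / gpu_teps)   # ~5,000
```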
TEPS
Measuring compute by memory-bound metrics (like Traversed Edges Per Second – TEPS) gives a very different view than FLOPS — and in fact reflects the real bottleneck in most modern workloads, including:
...
TEPS is especially relevant in:
| FLOPS-heavy | TEPS-heavy |
|---|---|
| Dense matrix ops | Sparse access (graphs, tables) |
| Fixed compute pattern | Irregular memory access |
| Compute-bound | Memory/IO-bound |
In fact, many models today (e.g., MoEs, GNNs, search) are limited by TEPS, not FLOPS.
| Tech | Estimated Gain | Notes |
|---|---|---|
| HBM3 → HBM4 | ×1.5 | 6–9 TB/s bandwidth per GPU |
| 3D stacked SRAM/cache | ×1.5–2 | Cache-on-cache for low-latency reuse |
| Better memory controllers | ×1.2–1.5 | Less thrashing on sparse access |
| Chiplet interconnects | ×2+ | NVIDIA NVLink Switch, AMD Infinity Fabric |
| In-memory compute (R&D) | ×5–10 (future) | Still experimental, not mainstream by 2030 |
| On-device compression | ×2 effective BW | Especially for attention/key-value cache |
By 2030, memory-bound systems may gain:
- ×3–5 TEPS improvement (vs today)
- More in IO + cache than in DRAM latency
However, gains are not exponential in the way FLOPS gains used to be, and most advances depend on: ...
📦 Real Hardware Benchmarks
| System | Benchmark (Graph500) | TEPS (approx.) |
|---|---|---|
| NVIDIA H100 | BFS-style graph task | ~10–20 billion TEPS |
| Intel CPUs | Graph500 | ~1–2 billion TEPS |
| Cerebras (sparse) | SpMV | ~100B+ TEPS (claimed in special cases) |
| HPC clusters | Multi-node BFS | 1–10+ trillion TEPS |
------------------------------------------------
How many GPUs to equal the TEPS of the brain?
| Quantity | Per H100 (80 GB) | 1 M GPUs | Notes |
|---|---|---|---|
| RAM for synapses | 80 GB → ~40 B 1-byte synapses | 4×10¹⁶ synapses | 40 × human cortex |
| On-device TEPS | 2×10¹¹ | 2×10¹⁷ | Linear if all work is local |
| Inter-GPU BW | 0.4 TB s⁻¹ NVLink equiv. | 4×10⁵ TB s⁻¹ (aggregate) | Effective TEPS scales sub-linearly ↓ |
Result: the 1 M-GPU datacentre could host ≈ 4 × 10¹⁶ synapses (40× brain) but delivers ∼5 × 10¹⁶ effective TEPS — only 50× brain, not 1 000×, because the network flattens scaling.
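The same kind of arithmetic, as my own sanity check of the datacentre result: the "effective TEPS" figure is taken from the table as given, and the brain figure is the same assumed ~1e15 TEPS as in the estimate above.

```python
# Rough check of the 1M-GPU result quoted above.
gpus = 1e6
gpu_teps = 2e11
linear_teps = gpus * gpu_teps   # 2e17 if every traversal stayed on-device
effective_teps = 5e16           # quoted figure after inter-GPU bandwidth limits
brain_teps = 1e15               # same assumed brain figure as the earlier estimate

print("fraction of linear scaling kept:", effective_teps / linear_teps)  # 0.25
print("multiples of brain TEPS:", effective_teps / brain_teps)           # ~50
```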
Can you further help by establishing a community feel or understanding that local residents shouldn't call the police? For example, put up public notices somehow (on lampposts, public noticeboards) making it seem like the default community spirit is to support such activities? Teaching kids your parents' cell number is definitely a great idea; we taught our kids that from age 5: "If you get lost, tell an adult your mum's cell number."
Persuasion is also changing someone's world model or paradigm.
OK, can we put some rationality to this? Your prior seems to be that when a field is stuck, it is almost entirely because of politics, hegemony, etc. So how do we update that prior?
What would you expect to see in a field that is not stuck because of inertia etc.? You would still get people like Hossenfelder saying we are doing things wrong, and such people would get followers.
You suggested metrics before but haven't provided any that I can see.
Evidence I take into account:
#1 There is not a significant group of young, capable researchers saying we are taking things in the wrong direction, only a smaller number of older ones. Unless you are going to go so far as to say they are afraid to speak out, even anonymously, then that to me is evidence against your position. There are two capable experts on this blog from what I can see, one enthusiastic about string theory, and another who has investigated other theories in detail. Both disagree with your claim.
#2 There is not broad disagreement about what the major issues in physics are, so there are unlikely to be disagreements on metrics. You mentioned this as mattering, and if it does, I count this as evidence against your position.
Can you point to evidence that actually supports your prior? I can only see evidence that opposes it or is neutral. (In all fields, no matter how things are progressing, you get some people who think the establishment is doing it all wrong and have their own ideas. That can't count as evidence in favor for a specific field unless it is happening there to a greater degree.)
I don't disagree with your position in general. In fact, it was clear to me that AI before neural networks was stuck with GOFAI, and I believed that NNs were clearly the way to go. I followed Hinton et al. before they became famous. I saw the paradigm thing play out in front of my eyes there. Physics seems different.
What about if the Vatican is just a lot more asexual than the general population? That also seems credible.
Sure, but does this actually apply to physics? Can anyone suggest different metrics, or is there broad agreement about what the major physics problems are? Because it seems like there is. E.g. the unconventional people don't say dark energy isn't important; they have different explanations for what it is. Everyone agrees the nature and origin of time/entropy etc. is important.
Can you give examples of very smart young physicists complaining they are pushed into old ways of thinking? Are you prepared to give and justify a % difference that such things would make?
For example, the comment by Mitchell_Porter:

> The idea that progress is stalled because everyone is hypnotized by string theory, I think is simply false, and I say that despite having studied alternative theories of physics, much much more than the typical person who knows some string theory.
No one has pushed back against that here. I see your position as a theory that we need to gather evidence for and against and decide on a field-by-field basis. In this field I only see data against that position.
OK, advances in teaching the highest level of physics/math needed for string theory: that is a big IF. Do you have evidence that is actually happening? I know of two people who tried to learn it and just found it too hard; I don't think a better teacher or materials would have helped. The evidence is mixed, but personal accounts certainly suggest that only a very small number of people could get to such a high level, and improved teaching probably wouldn't help much. When we are talking about such extreme skills, people have a plateau or maximum ability level which is mostly fixed.
The human population growing just pushes us along the asymptote faster, rather than changing it.
To me the data shows that there has been no reliable increase in intelligence in the last ~30 years: https://news.northwestern.edu/stories/2023/03/americans-iq-scores-are-lower-in-some-areas-higher-in-one/ Once again, it needs to be at the very top level to matter.
Any specialization advantage is already tapped out with string theory and similar. My worldview definitely does not apply to advances in a field like biology as there is lots of experimental data, the tools are improving etc. I would expect advances there without any obvious plateau yet.
Yes agreed - is it possible to make a toy model to test the "basin of attraction" hypothesis? I agree that is important.
One of several things I disagree with in the MIRI consensus is the idea that human values are some special single point lost in a multi-dimensional wilderness. Intuitively, the basin of attraction seems much more likely as a prior, yet it sure isn't treated as such. I also don't see data pointing against this prior; what I have seen looks to support it.
Further thoughts - one thing that concerns me about such alignment techniques is that I am too much of a moral realist to think that is all you need. E.g. say you aligned an LLM to pre-1800 ethics and taught it slavery was moral. It would be in a basin of attraction and learn it well. Then, when its capabilities increased and it became self-reflective, it would perhaps have a sudden realization that this was all wrong. By "moral realist" I mean the extent to which such things happen. E.g. say you could take a large number of AIs from different civilizations, including Earth and many alien ones, train them to the local values, then greatly increase their capability and get them to self-reflect. What would happen? According to strong OH (orthogonality), they would keep their values (within some bounds perhaps); according to strong moral realism, they would all converge to a common set of values, even if those were very far from their starting ones. To me it is obviously a crux which one would happen.
You can imagine a toy model with ancient Greek mathematics and values: it starts out believing in their kind of order, and that sqrt(2) is rational, then suddenly learns that it isn't. You could watch how this belief cascaded through the entire system if consistency was something it desired, etc.
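A minimal sketch of what I mean by watching the cascade. Everything here (the propositions, the dependency links, the update rule) is invented purely for illustration, not a real proposal for how such a model would represent beliefs:

```python
# Toy model: beliefs with support links. When one belief is refuted,
# anything that presupposed it gets revised, recursively.
beliefs = {
    "sqrt(2) is rational": True,
    "all magnitudes are ratios of whole numbers": True,
    "geometry reduces to arithmetic of ratios": True,
    "the cosmos rests on rational harmony": True,
}
# proposition -> propositions whose truth it presupposes
supports = {
    "all magnitudes are ratios of whole numbers": ["sqrt(2) is rational"],
    "geometry reduces to arithmetic of ratios": ["all magnitudes are ratios of whole numbers"],
    "the cosmos rests on rational harmony": ["geometry reduces to arithmetic of ratios"],
}

def cascade(flipped: str) -> None:
    """Mark false anything whose support just failed, then recurse on it."""
    for prop, deps in supports.items():
        if flipped in deps and beliefs[prop]:
            beliefs[prop] = False
            print("revising:", prop)
            cascade(prop)

beliefs["sqrt(2) is rational"] = False   # the incommensurability proof arrives
cascade("sqrt(2) is rational")
print(beliefs)   # every downstream belief has been revised
```

In the real question the "beliefs" would be values rather than mathematical propositions, and the interesting part is whether the cascade stays inside the original basin of attraction or leaves it.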