__RicG__ — LessWrong

Physicist switching to AI alignment

Studying these man-made horrors so they are no longer beyond my comprehension

I think a lot of "why won't AIs form a trade coalition that includes us?" is sort of answered by EY in this debate:

How cherry picked are those examples? Any other words/tokens/sequences they repeat?

Thank you for providing this detail, that's basically what I was looking for!

I am curious to know whether Anthropic has any sort of plan to not include results such as this in the training data of actual future LLMs.

To me it seems like a bad idea to include it since it could allow the model to have a sense on how we can set up a fake deployment-training distinction setups or how it should change and refine its strategies. It also can paint a picture that the model behaving like this is expected which is a pretty dangerous hyperstition.

If your model has the extraordinary power to say what internal motivational structures SGD will entrain into scaled-up networks, then you ought to be able to say much weaker things that are impossible in two years, and you should have those predictions queued up and ready to go rather than falling into nervous silence after being asked.

Sorry, I might misunderstanding you (and hope I am), but... I think doomers literally say "Nobody knows what internal motivational structures SGD will entrain into scaled-up networks and thus we are all doomed". The problems is not having the science to confidently say how the AIs will turn out, and not that doomers have a secret method to know that next-token-prediction is evil.

If you meant that doomers are too confident answering the question "will SGD even make motivational structures?" their (and mine) answer still stems from ignorance: nobody knows, but it is plausible that SGD will make motivational structures in the neural networks because it can be useful in many tasks (to get low loss or whatever), and if you think you do know better you should show it experimentally and theoretically in excruciating detail.

I also don't see how it logically follows that "If your model has the extraordinary power to say what internal motivational structures SGD will entrain into scaled-up networks" => "then you ought to be able to say much weaker things that are impossible in two years" but it seems to be the core of the post. Even if anyone had the extraordinary model to predict what SGD exactly does (which we, as a species, should really strive for!!) it would still be a different question to predict what will or won't happen in the next two years.

If I reason about my field (physics) the same should hold for a sentence structured like "If your model has the extraordinary power to say how an array of neutral atoms cooled to a few nK will behave when a laser is shone upon them" (which is true) => "then you ought to be able to say much weaker things that are impossible in two years in the field of cold atom physics" (which is... not true). It's a non sequitur.

Sorry for taking long to get back to you.

So I take this to be a minor, not a major, concern for alignment, relative to others.

Oh sure, this was more a "look at this cool thing intelligent machines could do that should shut up people from saying things like 'foom is impossible because training run are expensive'".

learning is at least as important as runtime speed. Refining networks to algorithms helps with one but destroys the other
Writing poems, and most cognitive activity, will very likely not resolve to a more efficient algorithm like arithmetic does. Arithmetic is a special case; perception and planning in varied environments require broad semantic connections. Networks excel at those. Algorithms do not.

Please don't read this as me being hostile, but... why? How sure can we be of this? How sure are you that things-better-than-neural-networks are not out there?

Do we have any (non-trivial) equivalent algorithm that works best inside a NN rather than code?

Btw I am no neuroscientists, so I could be missing a lot of the intuitions you got.

At the end of the day you seem to think that it can be possible to fully interpret and reverse engineer neural networks, but you just don't believe that Good Old Fashioned AGI can exists and/or be better than training NNs weights?

Thanks for coming back to me.

"OK good point, but it's hardly "suicide" to provide just one more route to self-improvement"

I admit the title is a little bit clickbaity, but given my list of assumption (which do include that NNs can be made more efficient by interpreting them) it does elucidate a path to foom (which does look like suicide without alignment).

Unless there's an equally efficient way to do that in closed form algorithms, they have a massive disadvantage in any area where more learning is likely to be useful.

I'd like to point out that in this instance I was talking about the learned algorithm not the learning algorithm. Learning to learn is a can of worms I am not opening rn, even though it's probably the area that you are referring to, but, still, I don't really see a reason that there could not be more efficient undiscovered learning algorithms (and NN+GD was not learned, it was intelligently designed by us humans. Is NN+GD the best there is?).

Maybe I should clarify how I imagined the NN-AGI in this post: a single huge inscrutable NN like GPT. Maybe a different architecture, maybe a bunch of NNs in trench coat, but still mostly NN. If that is true then there is a lot of things that can be upgraded by writing them in code rather than keeping them in NNs (arithmetic is the easy example, MC tree search is another...). Whatever MC tree search the giant inscrutable matrices have implemented, they are probably really bad compared to sturdy old fashioned code.

Even if NNs are the best way to learn algorithms, they are not be the best way to design them. I am talking about the difference between evolvable and designable.

NN allow us to evolve algorithms, code allows us to intelligently design them: if there is no easy evolvable path to an algorithm, neural networks will fail.

The parallel to evolution is: evolution cannot make bones out of steel (even though they would be much better) because there is no shallow gradient to get steel (no way to have the recipe for steel-bones be in a way that if the recipe is slightly changed you still get something steel-like and useful). Evolution needs a smooth path from not-working to working while design doesn't.

With intelligence the computations don't need to be evolved (or learned) it can be designed, shaped with intent.

Are you really that confident that the steel equivalent of algorithms doesn't exist? Even though as humans we have barely explored that area (nothing hard-coded comes close to even GPT-2)?

Do we have any (non-trivial) equivalent algorithm that works best inside a NN rather than code? I guess those might be the hardest to design/interpret so we won't know for certain for a long time...

Arithmetic is a closed cognitive function; we know exactly how it works and don't need to learn more.

If we knew exactly how make poems of math theorems (like GPT-4 does) that would make it a "closed cognitive function" too, right? Can that learned algorithm be reversed engineered from GPT-4? My answer is yes => foom => we ded.

Uhm, by interpretability I mean things like this where the algorithm that the NN implements is revered engineered, written down as code or whatever which would allow for easier recursive self improvement (by improving just the code and getting rid of the spaghetti NN).

Also by the looks of things (induction heads and circuits in general) there does seem to be a sort of modularity in how NN learn, so it does seem likely that you can interpret piece by piece. If this wasn't true I don't think mechanistic interpretability as a field would even exist.

BTW, if anyone is interested the virtual machine has these specs:

System: Linux 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 x86_64 x86_64 GNU/Linux

CPU: Intel Xeon CPU E5-2673 v4, 16 cores @ 2.30GHz

RAM: 54.93 GB

I did listen to that post, and while I don't remember all the points, I do remember that it didn't convince me that alignment is easy and, like Christiano's post "Where I agree and disagree with Eliezer", it just seems to be like "p(doom) of 95%+ plus is too much, it's probably something like 10-50%" which is still incredibly unacceptably high to continue "business as usual". I have faith that something will be done: regulation and breakthrough will happen, but it seems likely that it won't be enough.

It comes down to safety mindset. There are very few and sketchy reasons to expect that by default an ASI will care about humans enough, so it not safe to build one until shown otherwise (preferably without actually creating one). And if I had to point out a single cause for my own high p(doom), it is the fact that we humans iterate all of our engineering to iron out all of the kinks, while with a technology that is itself adversarial, iteration might not be available (get it right the first time we deploy powerful AI).

Who do you think are the two or three smartest people to be skeptical of AI killing all humans? I think maybe Yann LeCunn and Andrew Ng.

Sure, those two. I don't know about Ng (he recently had a private discussion with Hinton, but I don't know what he thinks now), but I know LeCun hasn't really engaged with the ideas and just relies on the concept that "it's an extreme idea". But as I said, having the position "AI doesn't pose an existential threat" seems to be fringe nowadays.

If I dumb the argument down enough I get stuff like "intelligence/cognition/optimization is dangerous, and, whatever the reasons, we currently have zero reliable ideas on how to make a powerful general intelligence safe (eg. RLHF doesn't work well enough as GPT-4 still lies/hallucinates and is jailbroken way too easily)" which is evidence based, not weird and not extreme.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments