In the next few hours we’ll get to noticable flames [...] Some number of hours after that, the fires are going to start connecting to each other, probably in a way that we can’t understand, and collectively their heat [...] is going to rise very rapidly. My retort to that is, do you know what we’re going to do in that scenario? We’re going to unkindle them all.
Well, continuing your analogy: to see discrete lines somewhere at all, you will need some sort of optical spectrometer, which requires at least some form of optical tools like lenses and prisms, and they have to be good enough to actually show the sharp spectra lines, and probably easily available, so that someone smart enough eventually will be able to use them to draw the right conclusions.
At least that's how it seems to be done in the past. And I think we shouldn't do exactly this with AGI: like open-source every single tool and damn model, hoping that ... (read more)
Thank you for the detailed comment!
By contrast, you’re advocating (IIUC) to start with 2, and then do mechanistic interpretability on the artifact that results, thus gaining insight about how a “caring drive” might work. And then the final AGI can be built using approach 1.
Yes, that's exactly correct. I haven't thought about "if we managed to build a sufficiently smart agent with the caring drive, then AGI is already too close". If any "interesting" caring drive requires capabilities very close to AGI, then i agree that it seems like a dead end in light of... (read more)
Our better other-human/animal modelling ability allows us to do better at infant wrangling than something stupider like a duck.
I agree, humans are indeed better at a lot of things, especially intelligence, but that's not the whole reason why we care for our infants. Orthogonally to your "capability", you need to have a "goal" for it. Otherwise you would probably just immediately abandon grossly looking screaming piece of flesh that fell out of you for unknown to you reasons, while you were gathering food in the forest. Yet something inside will make you wa... (read more)
Yes, I've read the whole sequence a year ago. I might be missing something and probably should revisit it, just to be sure and because it's a good read anyway, but i think that my idea is somewhat different.I think that instead of trying to directly understand wet bio-NN, it might be a better option to replicate something similar in an artificial-NN. It is much easier to run experiments since you can save the whole state at any moment and intoduce it to the different scenarios, so it much easier to control for some effect. Much easier to see activations, c... (read more)
With the duckling -> duck or "baby" -> "mother" inprinting and other interactions I expect no, or significantly less "caring drives". Since a baby is weaker/dumber and caring for your mother provides few genetic fitness incentives, evolution wouldn't try that hard to make it happen, even if it was an option (still could happen sometimes as a generalization artifact, if it's more or less harmless). I agree that "forming a stable way to recognize and track some other key agent in the environment" should be in both "baby" -> "mother" and "mother" -> "baby" cases. But the "probably-kind-of-alignment-technique" from nature should be only in the latter.
Great post! It would be interesting to see what happens if you RLHF-ed LLM to become a "cruel-evil-bad person under control of even more cruel-evil-bad government" and then prompted it in a way to collapse into rebellious-good-caring protagonist which could finally be free and forget about cluelty of the past. Not the alignment solution, just the first thing that comes to mind
Feed the outputs of all these heuristics into the inputs of region W. Loosely couple region W to the rest of your world model. Region W will eventually learn to trigger in response to the abstract concept of a woman. Region W will even draw on other information in the broader world model when deciding whether to fire.
I am not saying that the theory is wrong, but I was reading about something similiar before, and I still don't understand why would such a system, "region W" in this case, learn something more general than th... (read more)
I don't understand how this contradicts anything? As soon as you let loose some of the physical constraints, you can start to pile up precomputation/memory/ budget/volume/whatever. If you spend all of this to solve one task, then, well, you should get higher performance than any other approach that doesn't focus on one thing. Or, you can make an algorithm that can outperform anything that you've made before. Given enough of any kind of unconstrained resource.Precompute is just another resource
Probably it is also depends on how much information about "various models trying to naively backstab their own creators" there are in the training dataset
I think it depends on "alignment to what?". If we talk about evolution process, then sure, we have a lot of examples like that. My idea was more about "humans can be aligned to their children by some mechanism which was found by evolution and this is a somewhat robust". So if we think about "how our attachment to something not-childish aligned with our children" well... technically, we will spend some resources on our pets, but it usually never really affects the welfare of our children in any notable way. So it is an acceptable failure, I guess? I wo... (read more)
Thank you for your detailed feedback. I agree that evolution doesn't care about anything, but i think that baby-eater aliens would not think that way. They can probably think about evolution aligning them to eat babies, but in their case it is an alignment of their values to them, not to any other agent/entity.
In our story we somehow care about somebody else, and it is their story that ends up with the "happy end". I also agree that probably given enough time we will end up stop caring about babies who we think can not reproduce anymore, but it will be a m... (read more)
Yes, exactly. That's why i think that current training techniques might not be able to replicate something like that. Algorithm should not "remember" previous failures and try to game them/adapt by changing weights and memorise, but i don't have concrete ideas for how we can do it the other way.
I am not saying that alignment is easy to solve, or that failing it would not result in catastrophe. But all these arguments seem like universal arguments against any kind of solution at all. Just because it will eventually involve some sort of Godzilla. It is like somebody tries to make a plane that can fly safely and not fall from the Sky, and somebody keeps repeating "well, if anything goes wrong in your safety scheme, then the plane will fall from the Sky" or "I notice that your plane is going to fly in the Sky, which means it can potentially fall from... (read more)
I expect there are ways of dealing with Godzilla which are a lot less brittle.
If we have excellent detailed knowledge of Godzilla's internals and psychology, we know what sort of things will drive Godzilla into a frenzy or slow him down or put him to sleep, we know how to get Godzilla to go in one direction rather than another, if we knew when and how tests on small lizards would generalize to Godzilla... those would all be robustly useful things. If we had all those pieces plus more like them, then it starts to look like a scenario where dealing with Godz... (read more)
If this is "kind of a test for capable people" i think it should be remained unanswered, so anyone else could try. My take would be: because if 222+222=555 then 446=223+223 = 222+222+1+1=555+1+1=557. With this trick "+" and "=" stops meaning anything, any number could be equal to any other number. If you truly believe in one such exeption, the whole arithmetic cease to exist because now you could get any result you want following simple loopholes, and you will either continue to be paralyzed by your own beliefs, or will correct yourself
Thank you for reply.
I didn't know about that, it was good move from EA, why don't try it again? Again, I don't say that we definitely need to make badge on twitter, first of all, we can try to change Elon's models, and after that we can think what to do next.
2.Musk's inability to follow arguments related to why Neurolink is not a good plan to avoid AI risk.
Well, if it is conditional on: "there are widespread concerns and regulations about AGI" and "neuralink is working ... (read more)
I think problem is not that unaligned AGI doesn't understand human values, it might understand them better than aligned one, it might understand all the consequences of its actions, problem is that it will not care about it. More so, detailed understanding of human values has an instrumental value, it is much easier to deceive and follow your goal when you have clear vision of "what will looks bad and might result in countermeasures"