I don't understand how this contradicts anything. As soon as you loosen some of the physical constraints, you can start to pile up precomputation/memory/budget/volume/whatever. If you spend all of this on solving one task, then, well, you should get higher performance than any other approach that doesn't focus on one thing. Or you can make an algorithm that outperforms anything you've made before, given enough of any kind of unconstrained resource. Precompute is just another resource.
It probably also depends on how much information about "various models trying to naively backstab their own creators" there is in the training dataset.
I think it depends on "alignment to what?". If we are talking about the process of evolution, then sure, we have a lot of examples like that. My idea was more about "humans can be aligned to their children by some mechanism which was found by evolution, and this is somewhat robust". So if we think about "how well is our attachment to something other than our children aligned with our children", well... technically, we will spend some resources on our pets, but it usually never affects the welfare of our children in any notable way. So it is an acceptable failure, I guess? I wouldn't mind if some powerful AGI loved all humans and tried to ensure their happy future while at the same time having some weird non-human hobbies/attachments that are still a lower priority than our wellbeing, kind of like parents who spend some free time on pets.
Thank you for your detailed feedback. I agree that evolution doesn't care about anything, but I think that baby-eater aliens would not think that way. They can probably think about evolution aligning them to eat babies, but in their case it is an alignment of their values to themselves, not to any other agent/entity.
In our story we somehow care about somebody else, and it is their story that ends up with the "happy end". I also agree that, given enough time, we will probably end up no longer caring about babies who we think cannot reproduce anymore, but that would be a much more complex solution.
As a first step it is probably much easier to just "make an animal that cares about its babies no matter what"; otherwise you would have to count on the ability of that animal to recognize something it might not even understand (like the reproductive abilities of a baby).
Yes, exactly. That's why I think that current training techniques might not be able to replicate something like that. The algorithm should not "remember" previous failures and try to game them or adapt by changing weights and memorising, but I don't have concrete ideas for how we can do it the other way.
I am not saying that alignment is easy to solve, or that failing at it would not result in catastrophe. But all these arguments seem like universal arguments against any kind of solution at all, just because it will eventually involve some sort of Godzilla. It is like somebody trying to make a plane that can fly safely and not fall from the Sky, while somebody else keeps repeating "well, if anything goes wrong in your safety scheme, then the plane will fall from the Sky" or "I notice that your plane is going to fly in the Sky, which means it can potentially fall from it".
I am not saying that I have better ideas about how to check whether any plan will work or not. They all inevitably involve Godzilla or the Sky, and the slightest mistake might cost us our lives. But I don't think that pointing repeatedly at the same scary thing, which will be present one way or another in every single plan, will get us anywhere.
If this is "kind of a test for capable people", I think it should remain unanswered, so anyone else can try. My take would be: because if 222+222=555, then 446 = 223+223 = 222+222+1+1 = 555+1+1 = 557. With this trick "+" and "=" stop meaning anything, and any number could be equal to any other number. If you truly believe in one such exception, the whole of arithmetic ceases to exist, because now you can get any result you want by following simple loopholes, and you will either continue to be paralyzed by your own beliefs or correct yourself.
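To spell out the "any number could be equal to any other number" step, here is a short sketch. It assumes that ordinary subtraction, division and multiplication still apply alongside the false belief, which is my assumption and goes a bit beyond the addition-only trick above; $a$ and $b$ are just placeholders for any two numbers.

```latex
\begin{align*}
222 + 222 &= 555 && \text{(the assumed false belief)} \\
444 &= 555 && \text{(since } 222 + 222 = 444 \text{ in ordinary arithmetic)} \\
0 &= 111 && \text{(subtract } 444 \text{ from both sides)} \\
0 &= 1 && \text{(divide both sides by } 111\text{)} \\
0 &= a - b && \text{(multiply both sides by } a - b\text{)} \\
a &= b && \text{(for any two numbers } a \text{ and } b\text{)}
\end{align*}
```

So a single exception like this is enough to make every equation "true", which is why it can't stay contained.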
Thank you for the reply.
I didn't know about that. It was a good move from EA, so why not try it again? Again, I am not saying that we definitely need to make a badge on Twitter. First of all, we can try to change Elon's models, and after that we can think about what to do next.
2. Musk's inability to follow arguments related to why Neuralink is not a good plan to avoid AI risk.
Well, if it is conditional on "there are widespread concerns and regulations about AGI" and "Neuralink is working and can significantly enhance human intelligence", then I can clearly see how it would decrease AI risk. Imagine Yudkowsky with significantly enhanced capabilities working with several other AI safety researchers, communicating at the speed of thought. Of course, it would require that no one else gets their hands on it for a while, and we would need to build it before AGI becomes a thing. But it is still possible, and I can clearly see how anybody in 2016 would have been incapable of predicting current ML progress and would therefore have placed their bets on something long-term, like Neuralink.
If you can't use the signal before you pass "a really good exam that shows your understanding of the topic", why would it be a bad signal? There are exams that haven't fallen that badly to Goodhart's law; for example, you can't pass a test on calculating integrals without actually having good practical skill. My idea around the badge was more like "trick people into thinking that it is easy and that they can get another social signal, then watch how they realize the problem after investigating it".
And the whole idea of the post isn't about the "badge", it's about "talk with powerful people to explain our models to them".
I think the problem is not that an unaligned AGI doesn't understand human values. It might understand them better than an aligned one, and it might understand all the consequences of its actions; the problem is that it will not care about them. Moreover, a detailed understanding of human values has instrumental value: it is much easier to deceive and pursue your goal when you have a clear vision of "what will look bad and might result in countermeasures".