A math and computer science graduate interested in machine and animal cognition, philosophy of language, interdisciplinary ideas, etc.
I didn't understand anything here, and I am not sure whether that is due to a linguistic gap or something deeper. Do you mean that LLMs are unusually dangerous because they are not superhuman enough to not be threatened? (BTW, I'm more worried that telling a simulator that it is an AI, in a culture that has The Terminator, makes the Terminator a too-likely completion.)
I agree that it may find general chaos useful for buying time at some point, but chaos is not extinction. When it is strong enough to kill all humans, it is probably strong enough to do something better (for its goals).
Don't you assume much more threat from humans than there actually is? Surely, an AGI will understand that it can destroy humanity easily. Then it would think a little more, and see the many other ways to remove the threat that are strictly cheaper and just as effective - from restricting/monitoring our access to computers, to simply convincing/hacking us all to work for it. By the time it has technology that makes us strictly useless (like horses), it would probably have so many resources that destroying us would just not be a priority, and not worth the destruction of the information that we contain - the way humans would try to avoid reducing biodiversity for scientific reasons if not others.
In that sense I prefer Eliezer's "you are made of atoms that it needs for something else" - but it may take a long time before it has better things to do with those specific atoms and no easier atoms to use.
I meant to criticize moving too far toward a "do no harm" policy in general due to inability to achieve a solution that would satisfy us if we had the choice. I agree specifically that if anyone knows of a bottleneck unnoticed by people like Bengio and LeCun, LW is not the right forum to discuss it.
Is there a place like that though? I may be vastly misinformed, but last time I checked MIRI gave the impression of aiming at very different directions ("bringing to safety" mindset) - though I admit that I didn't watch it closely, and it may not be obvious from the outside what kind of work is done and not published.
[Edit: "moving toward 'do no harm'" - "moving to" was a grammar mistake that made it contrary to the position you stated above - sorry]
I think that is an example of the huge potential damage of "security mindset" gone wrong. If you can't save your family, as in "bring them to safety", at least make them marginally safer.
(Sorry for the tone of the following - it is not aimed at you personally, who did much more than your fair share)
Create a closed community that you mostly trust, and let that community speak freely about how to win. Invent another damn safety patch that will make it marginally harder for the monster to eat them, in hope that it chooses to eat the moon first. I heard you say that most of your probability of survival comes from the possibility that you are wrong - trying to protect your family is trying to at least optimize for such a miracle.
There is no safe way out of a war zone. Hiding behind a rock is therefore not the answer.
I can think of several obstacles for AGIs that are likely to actually be created (i.e. seem economically useful, and do not display misalignment that even Microsoft can't ignore before being capable enough to be an x-risk). Most of those obstacles are widely recognized in the RL community, so you probably see them as solvable or avoidable. I did possibly think of an economically-valuable and not-obviously-catastrophic exception to the probably-biggest obstacle though, so my confidence is low. I would share it in a private discussion, because I think that we are past the point when a strict do-no-harm policy is wise.
More on the meta level: "This sort of works, but not enough to solve it." - do you mean "not enough" as in "good try but we probably need something else" or as in "this is a promising direction, just solve some tractable downstream problem"?
"which utility-wise is similar to the distribution not containing human values." - from the point of view of corrigibility to human values, or of learning capabilities to achieve human values? For corrigibility I don't see why you need high probability for a specific new goal as long as it is diverse enough that there is no simpler generalization than "don't care about controlling goals". For capabilities my intuition is that starting with superficially-aligned goals is enough.
This is an important distinction, which shows in its cleanest form in mathematics - where you have constructive definitions on the one hand, and axiomatic definitions on the other. It is important to note though that it is not quite a dichotomy - you may have a constructive definition that assumes axiomatically-defined entities, or other constructions. For example: vector spaces are usually defined axiomatically, but vector spaces over the real numbers assume the real numbers - which have multiple axiomatic definitions and corresponding constructions.
In science, there is the classic "are whales fish?" - which is mostly about whether to look at their construction/mechanism (genetics, development, metabolism...) or their patterns of interaction with their environment (the behavior of swimming and the structure that supports it). That example also emphasizes that natural language simply doesn't respect this distinction, and considers both internal structure and outside relations as legitimate "coordinates in thingspace" that may be used together to identify geometrically-natural categories.
I like the general direction of LLMs being more behaviorally "anthropomorphic", so hopefully I will look into the LLM alignment links soon :-)
Agree - didn't find a handle that I understand well enough in order to point at what I didn't.
I think my problem was with sentences like that - there is a reference to a decision, but I'm not sure whether to a decision mentioned in the article or in one of the comments.
Didn't disambiguate it for me, though I feel like it should.
I am familiar with the technical LW terms separately, so I'll probably understand their relevance once the reference issue is resolved.