As I understand it, the idea with the problems listed in the article is that their solutions are supposed to be fundamental design principles of the AI, rather than addons to fix loopholes.
Augmenting ourselves is probably a good idea to do *in addition* to AI safety research, but I think it's dangerous to do it *instead* of AI safety research. It's far from impossible that artificial intelligence could gain intelligence much faster at some point than augmenting the rather messy human brain, at which point it *needs* to be designed in a safe way.
AI alignment is not about trying to outsmart the AI, it's about making sure that what the AI wants is what we want.
If it were actually about figuring out all possible loopholes and preventing them, I would agree that it's a futile endeavor.
A correctly designed AI wouldn't have to be banned from exploring any philosophical or introspective considerations, since regardless of what it discovers there, it's goals would still be aligned with what we want. Discovering *why* it has these goals is similar to humans discovering why we have our motivations (i.e., evolution), and similarly to how discovering evolution didn't change much what humans desire, there's no reason to assume that an AI discovering where its goals come from should change them.
Of course, care will have to be taken to ensure that any self-modifications don't change the goals. But we don't have to work *against* the AI to accomplish that - the AI *also* aims to accomplish its current goals, and any future self-modification that changes its goals would be detrimental in accomplishing its current goals, so (almost) any rational AI will, to the best of its ability, aim *not* to change its goals. Although this doesn't make it easy, since it's quite difficult to formally specify the goals we would want an AI to have.
Whether or not it would question its reality mostly depends on what you mean by that - it would almost certainly be useful to figure out how the world works, and especially how the AI itself works, for any AI. It might also be useful to figure out the reason for which it was created.
But, unless it was explicitly programmed in, this would likely not be a motivation in and of itself, rather, it would simply be useful for accomplishing its actual goal.
I'd say the reason why humans place such high value in figuring out philosophical issues is to a large extent because evolution produces messy systems with inconsistent goals. This *could* be the case for AIs too, but to me it seems more likely that some more rational thought will go into their design.
(That's not to say that I believe it will be safe by default, but simply that it will have more organized goals than humans have.)
It would need a reason of some kind of reason to change its goals - one might call it a motivation. The only motivation it has available though, are its final goals, and those (by default) don't include changing the final goals.
Humans never had the final goal replicating their genes. They just evolved to want to have sex. (One could perhaps say that the genes themselves had the goal of replicating, and implemented this by giving the humans the goal of having sex.) Reward hacking doesn't involve changing the terminal goal, just fulfilling it in unexpected ways (which is one reason why reinforcement learning might be a bad idea for safe AI.)
What you're saying goes against the here widely believed orthogonality thesis, which essentially states that what goal an agent has is independent of how smart it is. If the agent has programmed in a certain set of goals, there is no reason for it to change this set of goals if it becomes smarter (this is because changing its goals would not be beneficial to achieving its current goals).
In this example, if an agent has the sole goal of fulfilling the wishes of a particular human, there is no reason for it to change this goal once it becomes an ASI. As far as the agent is concerned, using resources for this purpose wouldn't be a waste, it would be the only worthwhile use for them. What else would it do with them?
You seem to be assigning some human properties to the hypothetical AI (e.g. "scorn", viewing something as "petty"), which might be partially responsible for the disagreement here.
Why wait until someone wants the money? Shouldn't the AI try to send 5 Dollars to everyone with a note attached reading "Here is a tribute; please don't kill a huge number of people" regardless of whether they ask for it or not?
Sounds pretty cool, definitely going to try it out some.
Oh, and by the way, you wrote "Inpsect" instead of "Inspect" at the end of page 27.
Working links on yudkowsky.net and acceleratingfuture.com:
Transhumanism as Simplified Humanism The Meaning That Immortality Gives to Life
That's true, though I think "optimal" would be a better word for that than "correct".
There are no "correct" or "incorrect" definitions, though, are there? Definitions are subjective, it's only important that participants of a discussion can agree on one.