By "just thinking about IRL", do you mean "just thinking about the robot using IRL to learn what humans want"? 'Coz that isn't alignment.
'But potentially a problem with more abstract cashings-out of the idea "learn human values and then want that"' is what I'm talking about, yes. But it also seems to be what you're talking about in your last paragraph.
"Human wants cookie" is not a full-enough understanding of what the human really wants, and under what conditions, to take intelligent actions to help the human. A robot learning that would act like a paper-clipper, but with cookies. It isn't clear whether a robot which hasn't resolved the de dicto / de re / de se distinction in what the human wants will be able to do more good than harm in trying to satisfy human desires, nor what will happen if a robot learns that humans are using de se justifications.
Here's another way of looking at that "nor what will happen if" clause: We've been casually tossing about the phrase "learn human values" for a long time, but learning human values isn't what the people who use that phrase actually want. If the AI simply learned human values, it would treat humans the way humans treat cattle. But if the AI is instead to learn to desire to help humans satisfy their wants, it isn't clear that the AI can (A) internalize human values deeply enough to understand and effectively optimize for them, while at the same time (B) keeping those values compartmentalized from its own values, which are what make it enjoy helping humans with their problems. To do that, the AI would need to want to propagate and support human values that it disagrees with. It isn't clear that that's something a coherent, let's say "rational", agent can do.
How is that de re and de dicto?
You're looking at the logical form and imagining that that's a sufficient understanding to start pursuing the goal. But it's only sufficient in toy worlds, where you have one goal at a time, and the mapping between the goal and the environment is so simple that the agent doesn't need to understand the value, or the target of "cookie", beyond "cookie" vs. "non-cookie". In the real world, the agent has many goals, and the goals involve nebulous concepts and have many considerations and conditions attached, e.g., how healthy is this cookie, how tasty is it, how hungry am I. It will need to know /why/ it, or human24, wants a cookie in order to know intelligently when to get the cookie, to resolve conflicts between goals, and to do probability calculations which involve the degree to which different goals are correlated in the higher goals they satisfy.
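To make the contrast concrete, here's a toy sketch (all names and numbers invented for illustration) of the difference between a toy-world goal and a real-world goal whose value depends on the higher goals it serves:

```python
# Toy contrast between a toy-world goal ("cookie vs. non-cookie") and a
# real-world goal with conditions attached: whether getting the cookie is
# worthwhile depends on *why* it is wanted.

def toy_world_value(item: str) -> float:
    # Toy world: one goal, binary mapping; "cookie" is always worth 1.
    return 1.0 if item == "cookie" else 0.0

def real_world_value(item: str, hunger: float, healthiness: float,
                     tastiness: float) -> float:
    # Real world: the same item's value depends on the higher goals
    # (nutrition, enjoyment) that the cookie-desire actually serves.
    if item != "cookie":
        return 0.0
    return hunger * (0.5 * healthiness + 0.5 * tastiness)

print(toy_world_value("cookie"))            # always 1.0, unconditionally
print(real_world_value("cookie", hunger=0.0,
                       healthiness=0.2, tastiness=0.9))  # 0.0: not hungry now
print(real_world_value("cookie", hunger=1.0,
                       healthiness=0.2, tastiness=0.9))  # roughly 0.55
```

An agent with only the first function never has a reason to *not* fetch the cookie; an agent with the second can resolve conflicts with other goals.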
There's a confounding confusion in this particular case: you seem to be hoping the robot will infer that the agent of the desired act is the human, both when the human represents the desire and when the AI does. But for values in general, we often want the AI to act in the way the human would act, not to want the human to do something. Your posited AI would instead learn the goal that it wants human24 to get a cookie.
What it all boils down to is: You have to resolve the de re / de dicto / de se interpretation in order to understand what the agent wants. That means an AI also has to resolve that question in order to know what a human wants. Your intuitions about toy examples like "human 24 always wants a cookie, unconditionally, forever" will mislead you, in the ways toy-world examples misled symbolic AI researchers for 60 years.
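The three readings can be made concrete with a toy sketch (the representation, predicates, and object IDs here are all invented for illustration) showing that they assign genuinely different satisfaction conditions to "human24 wants a cookie":

```python
# Toy illustration of how the de dicto / de re / de se readings of
# "human24 wants a cookie" pick out different satisfaction conditions.

from dataclasses import dataclass

@dataclass(frozen=True)
class World:
    eater: str      # who ends up eating
    item_id: str    # which particular object gets eaten
    item_kind: str  # what kind of thing that object is

# De dicto: any object *of the kind* "cookie" will do.
def de_dicto_satisfied(w: World) -> bool:
    return w.eater == "human24" and w.item_kind == "cookie"

# De re: the desire is about one particular object (say, cookie#7),
# whatever kind it turns out to be.
def de_re_satisfied(w: World) -> bool:
    return w.eater == "human24" and w.item_id == "cookie#7"

# De se: the agent wants that *it itself* eats; an AI that learns this
# goal naively may re-bind "itself" to the AI rather than to the human.
def de_se_satisfied(w: World, self_agent: str) -> bool:
    return w.eater == self_agent and w.item_kind == "cookie"

w = World(eater="human24", item_id="cookie#9", item_kind="cookie")
print(de_dicto_satisfied(w))            # True: some cookie was eaten
print(de_re_satisfied(w))               # False: not that particular cookie#7
print(de_se_satisfied(w, "human24"))    # True under the human's indexical
print(de_se_satisfied(w, "robot1"))     # False if the AI re-binds "self"
```

The same world-state satisfies one reading and not the others, which is why an AI that hasn't resolved the distinction can't know what it has actually learned.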
So, "mesa" here means "tabletop", and is pronounced "MAY-suh"?
I think your insight is that progress counts--that counting counts. It's overcoming the Boolean mindset, in which anything that's true some of the time, must be true all of the time. That you either "have" or "don't have" a problem.
I prefer to think of this as "100% and 0% are both unattainable", but stating it as the 99% rule might be more-motivating to most people.
What do you mean by a goodhearting problem, & why is it a lossy compression problem? Are you using "goodhearting" to refer to Goodhart's Law?
I'll preface this by saying that I don't see why it's a problem, for purposes of alignment, for human values to refer to non-existent entities. This should manifest as humans and their AIs wasting some time and energy trying to optimize for things that don't exist, but this seems irrelevant to alignment. If the AI optimizes for the same things that don't exist as humans do, it's still aligned; it isn't going to screw things up any worse than humans do.
But I think it's more important to point out that you're joining the same metaphysical goose chase that has made Western philosophy non-sense since before Plato.
You need to distinguish between the beliefs and values a human has in its brain, and the beliefs & values it expresses to the external world in symbolic language. I think your analysis concerns only the latter. If that's so, you're digging up the old philosophical noumena / phenomena distinction, which itself refers to things that don't exist (noumena).
Noumena are literally ghosts; "soul", "spirit", "ghost", "nature", "essence", and "noumena" are, for practical purposes, synonyms in philosophical parlance. The ghost of a concept is the metaphysical entity which defines what assemblages in the world are and are not instances of that concept.
But at a fine enough level of detail, not only are there no ghosts, there are no automobiles or humans. The Buddhist and post-modernist objections to the idea that language can refer to the real world are that the referents of "automobiles" are not exactly, precisely, unambiguously, unchangingly, completely, reliably specified, in the way Plato and Aristotle thought words should be. I.e., the fact that your body gains and loses atoms all the time means, for these people, that you don't "exist".
Plato, Aristotle, Buddhists, and post-modernists all assumed that the only possible way to refer to the world is for noumena to exist, which they don't. When you talk about "valuing the actual state of the world," you're indulging in the quest for complete and certain knowledge, which requires noumena to exist. You're saying, in your own way, that knowing whether your values are satisfied or optimized requires access to what Kant called the noumenal world. You think that you need to be absolutely, provably correct when you tell an AI that one of two worlds is better. So those objections apply to your reasoning, which is why all of this seems to you to be a problem.
The general dissolution of this problem is to admit that language always has slack and error. Even direct sensory perception always has slack and error. The rationalist, symbolic approach to AI safety, in which you must specify values in a way that provably does not lead to catastrophic outcomes, is doomed to failure for these reasons, which are the same reasons that the rationalist, symbolic approach to AI was doomed to failure (as almost everyone now admits). These reasons include the fact that claims about the real world are inherently unprovable, which has been well-accepted by philosophers since Kant's Critique of Pure Reason.
That's why continental philosophy is batshit crazy today. They admitted that facts about the real world are unprovable, but still made the childish demand for absolute certainty about their beliefs. So, starting with Hegel, they invented new fantasy worlds for our physical world to depend on, all pretty much of the same type as Plato's or Christianity's, except instead of "Form" or "Spirit", their fantasy worlds are founded on thought (Berkeley), sense perceptions (phenomenologists), "being" (Heidegger), music, or art.
The only possible approach to AI safety is one that depends not on proofs using symbolic representations, but on connectionist methods for linking mental concepts to the hugely-complicated structures of correlations in sense perceptions which those concepts represent, as in deep learning. You could, perhaps, then construct statistical proofs that rely on the over-determination of mental concepts to show almost-certain convergence between the mental languages of two different intelligent agents operating in the same world. (More likely, the meanings which two agents give to the same words don't necessarily converge, but agreement on the probability estimates given to propositions expressed using those same words will converge.)
Fortunately, all mental concepts are over-determined. That is, we can't learn concepts unless the relevant sense data that we've sensed contains much more information than do the concepts we learned. That comes automatically from what learning algorithms do. Any algorithm which constructed concepts that contained more information than was in the sense data, would be a terrible, dysfunctional algorithm.
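The over-determination point can be illustrated with a trivial sketch (the numbers are invented): a learned concept is specified by far fewer degrees of freedom than the sense data that determines it, yet the data pins it down tightly.

```python
# Toy illustration of over-determination: a learned "concept" (here, just
# a cluster mean) is described by far fewer numbers than the sense data
# that determines it, so the data over-determines the concept.

import random

random.seed(0)  # make the sketch deterministic

# 1000 noisy "sense perceptions" of the same underlying feature (true value 5.0).
sense_data = [5.0 + random.gauss(0, 1) for _ in range(1000)]

# The learned concept: a single summary statistic.
concept = sum(sense_data) / len(sense_data)

data_numbers = len(sense_data)   # 1000 degrees of freedom in the evidence
concept_numbers = 1              # 1 degree of freedom in the concept

print(data_numbers > concept_numbers)   # True: evidence >> concept
print(abs(concept - 5.0) < 0.5)         # True: yet the concept is pinned down
```

A "learning algorithm" that emitted a concept with *more* degrees of freedom than its evidence would just be memorizing noise, which is the dysfunction described above.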
You are still not going to get a proof that two agents interpret all sentences exactly the same way. But you might be able to get a proof which shows that catastrophic divergence is likely to happen less than once in a hundred years, which would be good enough for now.
Perhaps what I'm saying will be more understandable if I talk about your case of ghosts. Whether or not ghosts "exist", something exists in the brain of a human who says "ghost". That something is a mental structure, which is either ultimately grounded in correlations between various sensory perceptions, or is ungrounded. So the real problem isn't whether ghosts "exist"; it's whether the concept "ghost" is grounded, meaning that the thinker defines ghosts in some way that relates them to correlations in sense perceptions. A person who thinks ghosts fly, moan, and are translucent white with fuzzy borders, has a grounded concept of ghost. A person who says "ghost" and means "soul" has an ungrounded concept of ghost.
Ungrounded concepts are a kind of noise or error in a representational system. Ungrounded concepts give rise to other ungrounded concepts, as "soul" gave rise to things like "purity", "perfection", and "holiness". I think it highly probable that grounded concepts suppress ungrounded concepts, because grounded concepts usually provide evidence for the correctness of other grounded concepts. So sane humans using statistical proofs probably don't have to worry much about whether every last concept of theirs is grounded. But as the number of ungrounded concepts increases, there is a tipping point beyond which the ungrounded concepts can be forged into a self-consistent but psychotic system such as Platonism, Catholicism, or post-modernism, at which point they suppress the grounded concepts.
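The grounding criterion sketched above can be stated operationally. Here's a toy check (the mini-lexicon is invented) in which a concept counts as grounded iff its definition chain bottoms out in sensory primitives, while a circle of abstractions defined only via each other never does:

```python
# Toy grounding check: a concept is grounded iff its definitions bottom
# out in sensory primitives; circular chains of abstractions are not.

SENSORY = {"flies", "moans", "translucent_white"}

DEFINITIONS = {
    "ghost_perceptual": {"flies", "moans", "translucent_white"},
    "ghost_as_soul": {"soul"},
    "soul": {"purity"},
    "purity": {"holiness"},
    "holiness": {"soul"},   # abstractions defined only via each other
}

def grounded(concept: str, seen: frozenset = frozenset()) -> bool:
    if concept in SENSORY:
        return True            # reached a correlation in sense perceptions
    if concept in seen:
        return False           # circular definition: never reaches the senses
    parts = DEFINITIONS.get(concept, set())
    return bool(parts) and all(grounded(p, seen | {concept}) for p in parts)

print(grounded("ghost_perceptual"))   # True: defined via sensory predicates
print(grounded("ghost_as_soul"))      # False: bottoms out in a circle
```

The two speakers in the ghost example both say "ghost", but only the first's concept passes this check.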
Sorry that I'm not taking the time to express these things clearly. I don't have the time today, but I thought it was important to point out that this post is diving back into the 19th-century continental grappling with Kant, with the same basic presupposition that led 19th-century continental philosophers to madness. TL;DR: AI safety can't rely on proving statements made in human or other symbolic languages to be True or False, nor on having complete knowledge about the world.
When you write of "a belief in human agency", it's important to distinguish between the different conceptions of human agency on offer, corresponding to the three main political groups:
Someone who wants us united under a document written by desert nomads 3000 years ago, or someone who wants the government to force their "solutions" down our throats and keep forcing them no matter how many people die, would also say they believe in human agency; but they don't want private individuals to have agency.
This is a difficult but critical point. Big progressive projects, like flooding desert basins, must be collective. But movements that focus on collective agency inevitably embrace, if only subconsciously, the notion of a collective soul. This already happened to us in 2010, when a large part of the New Atheist movement split off and joined the Social Justice movement, and quickly came to hate free speech, free markets, and free thought.
I think it's obvious that the enormous improvements in material living standards over the last ~200 years that you wrote of were caused by the Enlightenment, and can be summarized as the understanding that liberating individuals leads to economic and social progress. Whereas modernist attempts to deliberately cause economic and social progress are usually top-down, require suppressing individuals, and so cause the reverse of what they intend. This is the great trap that we must not fall into, and it hinges on our conception of human agency.
A great step forward, or backwards (towards Athens), was made by the founders of America when they created a nation based in part on the idea of competition and compromise as being good rather than bad, basically by applying Adam Smith's invisible hand to both economics and politics. One way forward is to understand how to do large projects that have a noble purpose; that is, progressive capitalism. Another way would be to understand how governments have sometimes managed to do great things, like NASA's Apollo project, without degenerating into economic and social disasters like Stalin's or Mao's Five-Year Plans. Either way, how you conceptualize human agency will be a decisive factor in whether you produce heaven or hell.
I think it would be more-graceful of you to just admit that it is possible that there may be more than one reason for people to be in terror of the end of the world, and likewise qualify your other claims to certainty and universality.
That's the main point of what gjm wrote. I'm sympathetic to the view you're trying to communicate, Valentine; but you used words that claim that what you say is absolute, immutable truth, and that's the worst mind-killer of all. Everything you wrote just above seems to me to be just equivocation trying to deny that technical yet critical point.
I understand that you think that's just a quibble, but it really, really isn't. Claiming privileged access to absolute truth on LessWrong is like using the N-word in a speech to the NAACP. It would do no harm to what you wanted to say to use phrases like "many people" or even "most people" instead of the implicit "all people", and it would eliminate a lot of pushback.