Definitions and justifications have to be circular at some point, or else terminate in some unexplained things, or else create an infinite chain.
If I'm understanding your point correctly, I think I disagree completely. A chain of instrumental goals terminates in a terminal goal, which is a very different kind of thing from an instrumental goal in that assigning properties like "unjustified" or "useless" to it is a category error. Instrumental goals either promote higher goals or are unjustified, but that's not true of all goals- it's just something particular to that one type of goal.
I'd also argue that a chain of definitions terminates in qualia- things like sense data and instincts determine the structure of our most basic concepts, which define higher concepts, but calling qualia "undefined" would be a category error.
There is no fundamental physical structure which constitutes agency
I also don't think I agree with this. A given slice of objective reality will only have so much structure- only so many ways of compressing it down with symbols and concepts. It's true that we're only interested in a narrow subset of that structure that's useful to us, but the structure nevertheless exists prior to us. When we come up with a useful concept that objectively predicts part of reality, we've, in a very biased way, discovered an objective part of the structure of reality- and I think that's true of the concept of agency.
Granted, maybe there's a strange loop here: cognitive reduction can be further reduced to physical reduction, and physical reduction back to cognitive reduction- objective structure defines qualia, which in turn define objective structure. If that's what you're getting at, you may be on to something.
There seems to be a strong coalition around consciousness
One further objection, however: given that we don't really understand consciousness, I think the cultural push to base our morality around it is a really bad idea.
If it were up to me, we'd split morality up into stuff meant to solve coordination problems by getting people to pre-commit to not defecting, stuff meant to promote compassionate ends for their own sake, and stuff that's just traditional. Doing that instead of conflating everything into a single universal imperative would get rid of the deontology/consequentialism confusion, since deontology would explain the first thing and consequentialism the second. And by not founding our morality on poorly understood philosophical concepts, we wouldn't risk damaging useful social technologies or justifying horrifying atrocities if Dennettian illusionism turns out to be true or something.
An important bit of context that often gets missed when discussing this question is that actual trans athletes competing in women's sports are very rare. Of the millions competing in organized sports in the US, the total number who are trans might be under 20 (see this statement from the NCAA president estimating "fewer than ten" in college sports, this article reporting that an anti-trans activist group was able to identify only five in K-12 sports, and this Wikipedia article, which identifies only a handful of trans athletes in professional US sports).
Because this phenomenon is so rare relative to how often it's discussed, I'm a lot more interested in the sociology of the question than the question itself. There was a recent post from Hanson arguing that the Left and Right in the US have become like children on a road trip annoying each other in deniable ways to provoke responses that they hope their parents will punish. I think the discrepancy between the scale of the issue and how often it comes up is mostly due to it being used in this way.
A high school coach who has to choose whether to allow a trans student to compete in female sports is faced with a difficult social dilemma. If they deny the request, then the student- who wants badly to be seen as female- will be disappointed and might face additional bullying; if they allow it, that will be unfair to the other female players. In some cases, the other players may be willing to accept a bit of unfairness as an act of probably supererogatory kindness, but in cases where they aren't, explaining to the student that they shouldn't compete without hurting their feelings will take a lot of tact on the part of the coach.
Elevating this to a national conversation isn't very tactful. People on the right can plausibly claim to only be concerned with fairness in sports, but presented so publicly, this looks to liberals like an attempt to bully trans people. They're annoyed, and may be provoked into responding in hard-to-defend ways like demanding unconditional trans participation in women's sports- which I think is often the point. It's a child in a car poking the air next to his sister and saying "I'm not touching you", hoping that she'll slap him and be punished.
I'm certain the OP didn't intend anything like that- LessWrong is, of course, a very high-decoupling place. But I'd argue that this is an issue best resolved by letting the very few people directly affected sort out the messy emotions involved among themselves, rather than through public analysis of the question on the object level.
So, in practice, what might that look like?
Of course, AI labs use quite a bit of AI in their capabilities research already- writing code, helping with hardware design, doing evaluations and RLAIF; even distillation and training itself could sort of be thought of as a kind of self-improvement. So, would the red line need to target just fully autonomous self-improvement? Merely having a human in the loop to rubber-stamp AI decisions might not actually slow down an intelligence explosion by all that much, especially at very aggressive labs. Would we instead need some kind of measure for how autonomous the capabilities research at a lab is, and then draw the line at "only somewhat autonomous"? And if we were able to define a robust threshold, could we really be confident that it would prevent ASI development altogether, rather than just slowing it down?
Suppose instead we had a benchmark that measured something like the capabilities of AI agents on long-term real-world tasks like running small businesses and managing software development projects. Do you think it might make sense to draw a red line somewhere on that graph- targeting a dangerous level of capabilities directly, rather than trying to prevent that level of capabilities from being developed by targeting research methods?
The most important red line would have to be strong superintelligence, don't you think? I mean, if we have systems that are agentic in the way humans are, but surpass us in capabilities in the way we surpass animals, it seems like specific bans on the use of weapons, self-replication, and so on might not be very effective at keeping them in check.
Was it necessary to avoid mentioning ASI in the "concrete examples" section of the website to get these signatories on board? Are you concerned that avoiding that subject might contribute to the sense that discussion of ASI is non-serious or outside of the Overton window?
I think this is related to what Chalmers calls the "meta-problem of consciousness"- the problem of why it seems subjectively undeniable that a hard problem of consciousness exists, even though it only seems possible to objectively describe "easy problems" like the question of whether a system has an internal representation of itself. Illusionism- the idea that the hard problem is illusory- is an answer to that problem, but I don't think it fully explains things.
Consider the question "why am I me, rather than someone else?". Objectively, the question is meaningless- it's a tautology like "why is Paris Paris?". Subjectively, however, it makes sense, because your identity in objective reality and your consciousness are different things- you can imagine "yourself" seeing the world through different eyes, with different memories and so on, even though that "yourself" doesn't map to anything in objective reality. The statement "I am me" also seems to add predictive power to a subjective model of reality- you can reason inductively that since "you" were you in the past, you will continue to be you in the future. But if someone else tells you "I am me", that doesn't improve your model's predictive power at all.
I think there's a real epistemological paradox there, possibly related somehow to the whole liar's/Gödel's/Russell's paradox thing. I don't think it's as simple as consciousness being equivalent to a system with a representation of itself.
I used to do graphic design professionally, and I definitely agree the cover needs some work.
I put together a few quick concepts, just to explore some possible alternate directions they could take it:
https://i.imgur.com/zhnVELh.png
https://i.imgur.com/OqouN9V.png
https://i.imgur.com/Shyezh1.png
These aren't really at finished quality either, but the authors should feel free to borrow and expand on any ideas they like if they decide to do a redesign.
This suggests that in order to ensure a sincere author-concept remains in control, the training data should carefully exclude any text written directly by a malicious agent (e.g. propaganda).
I don't think that would help much, unfortunately. Any accurate model of the world will also model malicious agents, even if the modeller only ever learns about them second-hand. So the concepts would still be there for the agent to use if it was motivated to do so.
Censoring anything written by malicious people would probably make it harder to learn about some specific techniques of manipulation that aren't discussed much by non-malicious people and don't appear much in fiction. But I doubt that would be much more than a brief speed bump for a real misaligned ASI, and it would probably come at the expense of useful capabilities in earlier models, like the ability to identify maliciousness, which would give an advantage to competitors.
A counterpoint: when I skip showers, my cat appears strongly in favor of the smell of my armpits- occasionally going so far as to burrow into my shirt sleeves and bite my armpit hair (which, to both my and my cat's distress, is extremely ticklish). Since studies suggest that cats have a much more sensitive olfactory sense than humans (see https://www.mdpi.com/2076-2615/14/24/3590), it stands to reason that their judgement regarding whether smelling nice is good or bad should hold more weight than our own. And while my own cat's preference for me smelling bad is only anecdotal evidence, it does seem to suggest at least that more studies are required to fully resolve the question.
I think it's a very bad idea to dismiss the entirety of news as a "propaganda machine". Certainly some sources are almost entirely propaganda. More reputable sources like the AP and Reuters will combine some predictable bias with largely trustworthy independent journalism. Identifying those more reliable sources and compensating for their bias takes effort and media literacy, but I think that effort is quite valuable- both individually and collectively for society.
Of course, we have to be very careful with our news consumption- even the most sober, reliable sources will drive engagement by cherry-picking stories, which can skew our understanding of the frequency of all kinds of problems. But availability bias is a problem we have to learn to compensate for in all sorts of different domains- it would be amazing if we were able to build a rich model of important global events by consuming only purely unbiased information, but that isn't the world we live in. The news is the best we've got, and we ought to use it.
I'm certain your model of purpose is a lot more detailed than mine. My take, however, is that animal brains don't exactly have a utility function, but probably do have something functionally similar to a reward function in machine learning. A well-defined set of instrumental goals terminating in terminal goals would be a very effective way of maximizing that reward, so the behaviors reinforced will often converge on an approximation of that structure. However, the biological learning algorithm is very bad at consistently finding the structure, and so the approximations will tend to shift around and conflict a lot- behaviors that approximate a terminal goal one year might approximate an instrumental goal later on, or cease to approximate any goal at all. Imagine a primitive image diffusion model trained on a set of face photos- you run it on a set of random pixels, and it starts producing eyes and mouths and so on in random places, then gradually shifts those around into a slightly more coherent image as the remaining noise decreases.
So, instrumental and terminal goals in my model aren't so much things agents actually have as a sort of logical structure that influences how our behaviors develop. It's sort of like the structure of "if A implies B, and B implies C, then A implies C"- that's something that exists prior to us, but we tend to adopt behaviors approximating it because doing so produces a lot of reward. Note, though, that comparing the structure of goals to logic can be confusing, since logic can help promote terminal goals- so when we're approximating having goals, we want to be logical, but we have no reason to want to have terminal goals. That's just something our biological reward function tends to reinforce.
Regarding my use of the term "category error": I used that term rather than saying "terminal goals don't require justification" because, while that phrasing is technically accurate, the word "require" there sounds very strange to me. To "require" something means that it's necessary for promoting some terminal goal. So the phrase reads to me a bit like "a king is the rank which doesn't follow the king's orders"- technically accurate, but odd. It's more sensible to say instead that following the king's orders is something having to do with subjects, and a category error when applied to a king.