From my current stance, it is plausible, because we haven't settled philosophically how we think of aliens (especially ones whose behavior is significantly outside our own). I most likely don't respect arbitrary intelligent agents: I'd be in favor of getting rid of a vulnerable paperclipper if we found one on the far edges of the galaxy.
Then, I think you're not mentally extrapolating how much that computronium would give you. From our current perspective the logic makes sense: you upload the aliens regardless, even if you respect their preferences beyond that, because doing so lets you simulate vastly more aliens or other humans at the same time.
I expect we would care about their preferences. However, those preferences will end up to some degree subordinate to our own. The obvious case is that we probably wouldn't allow them an ASI, depending on how attack/defense works; the other is that we may upload them regardless, due to the sheer benefits.
Beyond that, I disagree about how common that motivation is. The kind of learning we know naturally results in it (limited social agents modeling each other in an iterated environment) is currently not on track to apply to AI, and another route is to "just care strategically", especially if you're intelligent enough. This feels like extrapolating a relatively modern human line of thought to arbitrary kinds of minds.
(Note: I've only read a few pages so far, so perhaps this is already in the background)
I agree that if the parent comment's scenario holds, then it is a case of the upload being done improperly.
However, I also disagree that most humans naturally generalize their values out of distribution. I think it is very easy for many humans to get sucked into attractors (ideologies that are simplifications of what they truly want; easy lies; the sheer amount of effort ahead stalling out focus even when the gargantuan task would be worth it) that damage their ability to properly generalize and, importantly, apply their values. That is, humans have predictable flaws. Then, when you add in self-modification, you open up whole new regimes.
My view is that a very important element of our values is that we do not necessarily endorse all of our behaviors!
I think a smart and self-aware human could sidestep and weaken these issues, but I do think they're still hard problems. This is why, if we get uploads, I'm a fan of "upload, figure out AI alignment, then have the AI think long and hard about it", as that further sidesteps the problem of a human staring too long at the sun. That is, I think it is very hard for a human to directly implement something like CEV themselves, but a designed mind doesn't necessarily have the same issues.
As an example: the power-seeking instinct. I don't endorse seeking power in that way, especially if uploaded to try to solve alignment for humanity in general, but given my status as an upload, and lots of time spent realizing that I have a lot of influence over the world, I think it is plausible that the instinct would affect me more and more. I would try to plan around this, but would likely do so imperfectly.
A core element is that you expect acausal trade among far more intelligent agents, such as AGIs or even ASIs, and that they'll be using approximations.
Problem 1: There isn't going to be much Darwinian selection pressure on a civilization that can rearrange stars and terraform planets. I'm of the opinion that it has mostly stopped mattering now, and will matter even less over time, as long as we don't end up in an "everyone has an AI and competes in a race to the bottom" scenario. I don't think it is that odd that an ASI could resist selection pressures: it operates on a faster time-scale and can apply more intelligent optimization than evolution can, towards the goal of keeping itself, and whatever civilization it manages, stable.
Problem 2:
I find it somewhat plausible that there are some sufficiently well pinned-down variables that could get us to a more objective measure. However, I don't think that is needed, and most presentations of this idea don't go for an objective distribution.
So, to me, using a UTM that is informed by our own physics and reality is fine. This presumably results in more of a "trading nearby" dynamic (the typical example being across branches, but in more generality), and you have more information about how those nearby universes look anyway.
The downside here is that whatever the true distribution is, you're not trading directly against it. But if that is too hard for an ASI in our universe to manage, then presumably many agents aren't managing to acausally trade against the true distribution regardless.
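For reference, the measure in question is the standard Solomonoff-style prior, which weights a hypothesis $H$ by its description length relative to a universal Turing machine $U$:

$$P_U(H) \;\propto\; 2^{-K_U(H)}, \qquad |K_U(H) - K_V(H)| \;\le\; c_{UV} \text{ for all } H,$$

where the second relation is the usual invariance theorem: switching to another UTM $V$ shifts description lengths by at most a constant $c_{UV}$ that depends only on $U$ and $V$. I include it only to make explicit what choosing a UTM "informed by our physics" does and does not pin down; the choice lives inside that constant slack rather than outside the formalism.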
I think you're referring to their previous work? Or, if you haven't run into it, you might find it relevant: https://www.lesswrong.com/posts/ifechgnJRtJdduFGC/emergent-misalignment-narrow-finetuning-can-produce-broadly
If you were pessimistic about LLMs learning a general concept of good/bad, then yes, that should update you. However, I think the main core problems remain. If you are doing a simple continual learning loop (LLM -> output -> retrain to accumulate knowledge; analogous to ICL), then we can ask how robust that process is. Do its values about how to behave drastically diverge? Are there attractors, over a hundred days of its own output, that it gets dragged towards and that aren't aligned at all? Can it be jailbroken, wittingly or not, by getting the model to produce garbage responses that it is then trained on? And then there are questions like "does this hold up under reflection?" and "does it attach itself to the concept of good, or to a ChatGPT-influenced notion of good (or evil)?". So while LLMs being capable of learning good is, well, good, there are still big targeting, resolution, and reflection issues.
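To make the loop I have in mind concrete, here is a toy sketch. Everything in it is a stand-in (no real LLM API; `generate`, `select_for_training`, `finetune`, and the `alignment` score are invented placeholders), and it only makes explicit where drift or garbage reinforcement could enter over many iterations:

```python
import random

def generate(model, task):
    """Stand-in for the model producing an output for a task."""
    return {"task": task, "quality": model["alignment"] + random.gauss(0, 0.1)}

def select_for_training(outputs):
    """Stand-in for whatever filtering happens before retraining.
    If reward-hacked or garbage outputs pass this filter, they get reinforced."""
    return outputs

def finetune(model, data):
    """Stand-in weight update: the model drifts toward the average
    quality of whatever it was just trained on."""
    avg = sum(d["quality"] for d in data) / len(data)
    return {"alignment": 0.9 * model["alignment"] + 0.1 * avg}

def run(days=100, tasks=("a", "b", "c")):
    model = {"alignment": 1.0}
    trajectory = []
    for _ in range(days):
        outputs = [generate(model, t) for t in tasks]          # LLM -> output
        model = finetune(model, select_for_training(outputs))  # retrain on it
        trajectory.append(model["alignment"])
    return trajectory  # the question is whether this stays put or drifts

if __name__ == "__main__":
    print(run()[-5:])
```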
For this post specifically, I believe it to be bad news. It provides evidence that subtle reward-hacking scenarios encourage the model to act misaligned in a more general manner. It is likely quite nontrivial to get rid of reward-hacking-like behavior in our larger and larger training runs. So if the model goes through a period where reward hacking is rewarded (a continual-learning scenario is the easiest to imagine, but it could happen in training too), then its behavior may change drastically.
I have some of the same feeling, but internally I've mostly pinned it down to two prongs: repetition and ~status.
ChatGPT's writing is increasingly disliked by those who recognize it. The prose is poor in various ways, but I've certainly read worse and not been so put off. Nor am I as put off when I first use a new model; I only increasingly notice its flaws over the next few weeks. The main aspect is that the generated prose is repetitive across writings, which ensures we can pick up on the pattern and, for instance, easily predict its flaws. In the same way, I avoid much generic power-fantasy fiction because it is very predictable in how it will fall short, even though a lot of it would still be positive value if I didn't have other things to do with my time.
So I think a substantial part of it is recognizing the style, there being flaws you've seen in many images in the past, and then, regardless of whether this specific image is that problematic, the mind associating it with those negative instances and with being overly predictable.
Status-wise, this is not entirely a negative status game. A generated image is a sign that it probably wasn't much effort for the person making it, and the mind has learned to associate art with effort and status to a degree, even when the effort and status are indirect, belonging to the original artist being referenced. So it is easy to learn a negative feeling towards these images, which attaches itself to the noticeable shared repetition/tone. Just as some people dislike pop partly due to status considerations (it being made by celebrities, or countersignaling by not going for the most popular thing), and that then feeds into an actual dislike for that style of music.
But this instinct activates too easily and misfires, so I've deliberately tamped it down in myself, because I realized there are plenty of generated images which, five years ago, would have simply impressed me and which I would have found visually appealing. I think the instinct is tracking something real to a degree (generated images can be poorly made), while also feeding on itself in a way that disconnects it from my past preferences. I don't think poorly made images should notably influence my enjoyment of better-quality ones, even if there is a noticeable shared core. So that's my suggestion.
Anecdotally, I would perceive "Bowing out of this thread" as the more negative response, because it encompasses both the topic and the quality of my response or my own behavior, while "not worth getting into" is mostly about the worth of the object-level matter. (Though remarking on the behavior of the person you're arguing with is a reasonable thing to do, I'm not sure that interpretation is what you intend.)
I disagree. Posts seem to have an outsized effect: they will often be read widely before any solid criticism appears, and then spread even in the face of high-quality rebuttals... if those ever materialize.
I also think you're referring to a group of people who typically write high-quality posts and handle criticism well, while others don't handle criticism well; despite my liking many of his posts, Duncan is an example of the latter.
As for Said specifically, I've been annoyed reading his argumentation a few times, but I've also found him saying something obvious and insightful that no one else pointed out anywhere in the comments. Losing that is unfortunate. I don't think there's enough "this seems wrong or questionable; why do you believe this?"
Said is definitely rougher than I'd like, but I also think there's a hole there that people are hesitant to fill.
So I do agree with Wei that you'll just get less criticism, especially since I feel LessWrong has been growing implicitly less favorable towards quality critiques and more favorable towards vibey critiques. That is, another dangerous attractor is the Twitter/X attractor, wherein arguments do exist but matter less to the overall discourse than whether someone puts out something that directionally "sounds good". I think this is much more likely than the sneer attractor or the LinkedIn attractor.
I also think that while the frontpage comments section has been good for surfacing critique, it substantially encourages "this sounds like the right vibe" judgments, as well as a habit of reading the comments before the post, which encourages a factional mentality.
Because Said is an important user who has provided criticism and commentary over many years. This is not about some random new user, which is why there is a long post in the first place rather than him being silently banned.
Alicorn is raising a legitimate point: that it is easy to get complaints about a user who is critical of others, that we don't have much information about the magnitude of those complaints, and that it is far harder to get information about users who think his posts are useful.
LessWrong isn't a democracy, but these are legitimate questions to ask because they are about what kind of culture (as Habryka talks about) LW is trying to create.
I find this surprising. The typical beliefs I'd expect are: 1) disbelief that models are conscious in the first place; 2) believing this is mostly signaling (and so, whether or not model welfare is good, it is actually a negative update about the trustworthiness of the company); 3) that it is costly to do this, or indicates high-cost efforts in the future; 4) doubts about its effectiveness.
I suspect you're running into selection effects in who you talked to. I'd expect #1 to come up as the default reason, but possibly the people you talked to were taking the precautionary principle seriously enough to avoid that.
The objections you see might come from #3: they don't view this as a one-off cheap piece of code; they view it as something Anthropic will hire people for (which it has), which "takes" money away from more worthwhile and surer bets.
This is to some degree true, though I find those examples odd, as Anthropic isn't going to spend on those groups anyway. However, for things like furthering AI capabilities or AI safety, well, I do think there is a cost there.
On the contrary, I liked this post, and the latter half the most. It serves as a relatively direct parable about different levels of ability and about the major problems with common arguments against AGI/ASI, which I think people still very often fail to make a point of addressing. Spelling them out explicitly, without going into the super-long detail of a full post, is good because it provides more concise argumentative handles. That is, people do not actually make the basic counterarguments enough.
(I also think those suggesting this is already argued out enough should link to alternative posts: posts with higher-quality and more concise argumentation, and posts written to be read by interlocutors.)