MinusGix

Programmer.

Posts

MinusGix's Shortform · 3y

Comments

Thane Ruthenis's Shortform
MinusGix · 6d

(Note: I've only read a few pages so far, so perhaps this is already in the background)

I agree that if the parent comment scenario holds then it is a case of the upload being improper.

However, I also disagree that most humans naturally generalize our values out of distribution. I think it is very easy for many humans to get sucked into attractors (ideologies that are simplifications of what they truly want; easy lies; the sheer amount of effort ahead stalling out their focus even when the gargantuan task would be worth it) that damage their ability to properly generalize and, importantly, to apply their values. That is, humans have predictable flaws. And when you add in self-modification, you open up whole new regimes.

My view is that a very important element of our values is that we do not necessarily endorse all of our behaviors!

I think a smart and self-aware human could sidestep and weaken these issues, but I do think they're still hard problems. Which is why I'm a fan of (if we get uploads) going "Upload, figure out AI alignment, then have the AI think long and hard about it" as that further sidesteps problems of a human staring too long at the sun. That is, I think it is very hard for a human to directly implement something like CEV themselves, but that a designed mind doesn't necessarily have the same issues.

As an example: the power-seeking instinct. I don't endorse seeking power in that way, especially if uploaded to try to solve alignment for humanity in general, but given my status as an upload and lots of time spent realizing how much influence I have over the world, I think it is plausible that the instinct would affect me more and more. I would try to plan around this but would likely do so imperfectly.

Help me understand: how do multiverse acausal trades work?
MinusGix · 22d

A core element is that you expect acausal trade among far more intelligent agents, such as AGIs or even ASIs, and that they'll be using approximations.

Problem 1: There isn't going to be much Darwinian selection pressure against a civilization that can rearrange stars and terraform planets. I'm of the opinion that such pressure has mostly stopped mattering now, and will matter even less over time, as long as we don't end up in an "everyone has an AI and competes in a race to the bottom" scenario. I don't think it is that odd that an ASI could resist selection pressures. It operates on a faster time-scale and can apply more intelligent optimization than evolution can, towards the goal of keeping itself and whatever civilization it manages stable.

Problem 2: I find it somewhat plausible that there are some sufficiently well pinned-down variables that can get us to a more objective measure. However, I don't think one is needed, and most presentations of this don't go for an objective distribution.
So, to me, using a UTM that is informed by our own physics and reality is fine. This presumably results in more of a 'trading nearby' sense, the typical example being across branches, but in more generality. You have more information about how those nearby universes look anyway.
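(For concreteness, the standard object behind "a distribution defined by a UTM", which is my gloss rather than anything the question spelled out: the universal prior relative to a reference machine $U$ assigns $m_U(x) = \sum_{p \,:\, U(p) = x} 2^{-\ell(p)}$, the total weight of programs $p$ of length $\ell(p)$ that make $U$ output $x$. Picking a $U$ informed by our own physics just makes physics-like worlds cheap to describe, which is the 'trading nearby' effect; by the usual invariance theorem, switching universal reference machines changes these weights by at most a machine-dependent constant factor, though that constant can be large.)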

The downside here is that whatever true distribution there is, you're not trading directly against it. But if it is too hard for an ASI in our universe to manage, then presumably many agents aren't managing to acausally trade against the true distribution regardless.

Harmless reward hacks can generalize to misalignment in LLMs
MinusGix · 1mo

I think you're referring to their previous work? Or you might find it relevant if you didn't run into it. https://www.lesswrong.com/posts/ifechgnJRtJdduFGC/emergent-misalignment-narrow-finetuning-can-produce-broadly

If you were pessimistic about LLMs learning a general concept of good/bad, then yes, that should update you. However, I think the main core problems remain. If you are doing a simple continual learning loop (LLM -> output -> retrain to accumulate knowledge; analogous to ICL), then we can ask how robust this process is. Do its values about how to behave drastically diverge? For instance, are there attractors over a hundred days of output that it gets dragged towards that aren't aligned at all? Can it be jail-broken, wittingly or not, by getting the model to produce garbage responses that it is then trained on? And then there are questions like 'does this hold up under reflection?' or 'does it attach itself to the concept of good, or to a ChatGPT-influenced notion of good (or evil)?'. So while LLMs being capable of learning good is, well, good, there are still big targeting, resolution, and reflection issues.
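To make the loop I have in mind concrete, here is a minimal sketch; every function is a hypothetical stand-in rather than any real API, and the point is only the shape of the feedback cycle and where drift can enter:

```python
# Minimal sketch of a continual-learning loop; all functions are hypothetical
# stand-ins, the point is only where drift can enter the cycle.

def generate(model, tasks):
    # stand-in: the model produces outputs for today's tasks
    return [f"{model} answers {t}" for t in tasks]

def filter_or_grade(outputs):
    # stand-in: an imperfect filter / reward signal; reward-hacked or garbage
    # outputs that slip through get trained on below
    return [o for o in outputs if "garbage" not in o]

def finetune_on(model, kept):
    # stand-in: fold the kept outputs back into the weights
    return f"{model}+{len(kept)}"

model = "base-checkpoint"
for day in range(100):
    outputs = generate(model, tasks=[f"task-{day}"])
    kept = filter_or_grade(outputs)
    model = finetune_on(model, kept)

# The robustness question: over a hundred days of this, do the model's values
# drift toward attractors that the filter doesn't catch?
```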


For this post specifically, I believe it to be bad news. It provides evidence that subtle reward-hacking scenarios encourage the model to act misaligned in a more general manner. It is likely quite nontrivial to get rid of reward-hacking-like behavior in our larger and larger training runs. So if the model gets into a period where reward hacking is rewarded (a continual learning scenario is easiest to imagine, but it could also happen in training), then it may drastically change its behavior.

HPMOR: The (Probably) Untold Lore
MinusGix · 1mo

I have some of the same feeling, but internally I've mostly pinned it to two prongs of repetition and ~status.

ChatGPT's writing is increasingly disliked by those who recognize it. The prose is poor in various ways, but I've certainly read worse and not been so put off. Nor am I as put off when I first use a new model, though I then increasingly notice its flaws over the next few weeks. The main aspect is that the generated prose is repetitive across writings, which ensures we can pick up on the pattern and makes its flaws easy to predict. Just as I avoid much generic power-fantasy fiction because it is very predictable in how it will fall short, even though much of it would still be positive value if I didn't have other things to do with my time.

So, I think a substantial part is recognizing the style, there being flaws you've seen in many images in the past, and then, regardless of whether this specific image is that problematic, the mind associating it with those negative instances and with being overly predictable.

Status-wise, this is not entirely in a negative status-game sense. A generated image is a sign that it probably was not much effort for the person making it, and the mind has learned to associate art with effort + status to a degree, even if only the indirect effort + status of the original artist being referenced. And so it is easy to learn a negative feeling towards these images, which attaches itself to the noticeable shared repetition/tone. Just like some people dislike pop music in part due to status considerations, like it being made by celebrities or countersignaling against going for the most popular thing, and then that feeds into an actual dislike for that style of music.

But this activates too easily, a misfiring set of instincts, so I've deliberately tamped it down in myself, because I realized that there are plenty of images which, five years ago, I would simply have been impressed by and found visually appealing. I think this instinct is to a degree tracking something real (generated images can be poorly made), while also feeding on itself in a way that disconnects it from past preferences. I don't think that the poorly made images should notably influence my enjoyment of better-quality images, even if there is a shared noticeable core. So that's my suggestion.

Banning Said Achmiz (and broader thoughts on moderation)
MinusGix · 1mo

Anecdotally, I would perceive "Bowing out of this thread" as a more negative response because it encapsulates both the topic and the quality of my response or my own behavior, while "not worth getting into" is mostly about the worth of the object-level matter. (Though remarking on the behavior of the person you're arguing with is a reasonable thing to do, I'm not sure that interpretation is what you intend.)

Banning Said Achmiz (and broader thoughts on moderation)
MinusGix · 1mo

I disagree. Posts seem to have an outsized effect and will often be read a bunch before any solid criticisms appear. They then spread even given high-quality rebuttals... if those ever materialize.
I also think you're referring to a group of people who typically write high-quality posts and handle criticism well, while others don't handle criticism well. Despite my liking many of his posts, Duncan is an example of the latter.

As for Said specifically, I've been annoyed at reading his argumentation a few times, but then also find him saying something obvious and insightful that no one else pointed out anywhere in the comments. Losing that is unfortunate. I don't think there's enough "this seems wrong or questionable, why do you believe this?"
Said is definitely more rough than I'd like, but I also do think there's a hole there that people are hesitant to fill.

So I do agree with Wei that you'll just get less criticism, especially since I do feel like LessWrong has been growing implicitly less favorable towards quality critiques and more favorable towards vibey critiques. That is, another dangerous attractor is the Twitter/X attractor, wherein arguments do exist but matter to the overall discourse less than whether someone puts out something that directionally 'sounds good'. I think this is much more likely than the sneer attractor or the LinkedIn attractor.

I also think that while the frontpage comments section has been good for surfacing critique, it substantially encourages the "this sounds like the right vibe" mode, as well as a habit of reading the comments before the post, which encourages a faction mentality.

Banning Said Achmiz (and broader thoughts on moderation)
MinusGix · 1mo

Because Said is an important user who has provided criticism/commentary across many years. This is not about some random new user, which is why there is a long post in the first place rather than him being silently banned.
Alicorn is raising a legitimate point: that it is easy to get complaints about a user who is critical of others, that we don't have much information about the magnitude, and that it is far harder to get information from users who think his posts are useful.

LessWrong isn't a democracy, but these are legitimate questions to ask because they are about what kind of culture (as Habryka talks about) LW is trying to create.

Stephen Martin's Shortform
MinusGix · 1mo

I find this surprising. The typical beliefs I'd expect are: 1) disbelief that models are conscious in the first place; 2) believing this is mostly signaling (and so whether or not model welfare is good, it is actually a negative update about the trustworthiness of the company); 3) that it is costly to do this or indicates high-cost efforts in the future; 4) doubts about effectiveness.

I suspect you're running into selection issues in who you talked to. I'd expect #1 to come up as the default reason, but possibly the people you talk to were taking the precautionary principle seriously enough to avoid that.
The objections you see might come from #3: they don't view this as a one-off cheap piece of code, they view it as something Anthropic will hire people for (which they have), which "takes" money away from more worthwhile and sure bets. This is to some degree true, though I find those objections odd, as Anthropic isn't going to spend on those groups anyway. However, for topics like furthering AI capabilities or AI safety, well, I do think there is a cost there.

My Empathy Is Rarely Kind
MinusGix · 2mo

> How did you arrive at this belief? Like, the thing that I would be concerned with is "How do I know that Russell's teapot isn't just beyond my current horizon"?

Empirical evidence of being more in tune with my own emotions, of generally better introspection, and of better modeling why others make decisions, compared to other people. I have no belief that I'm perfect at this, but I do think I'm generally good at it and that I'm not missing a 'height' component to my understanding.

> Is it possible, do you think, that the way you're doing analysis isn't sufficient, and that if you were to be more careful and thorough, or otherwise did things differently, your experience would be different? If not, how do you rule this out, exactly? How do you explain others who are able to do this?

Because (I believe) the impulse to dismiss any sort of negativity or blame once you understand the causes deeply enough is one I've noticed in myself. I do not believe it to be a level of understanding that I've failed to reach; I've dismissed it because it seems an improper framing.
At times the reason for this comes from a specific grappling with determinism and choice that I disagree with.
For others, the originating cause is considering kindness as automatically linked with empathy, with that unconsciously shaping what people think is acceptable from empathy.
In your case, some of it is tying empathy purely to prediction, which I disagree with, because of some mix of kindness-being-the-focus, determinism, a feeling that once something has been explained in terms of its component parts there's nothing left, and other factors that I don't know because they haven't been elucidated.

Empirical exploration as in your example can be explanatory. However, I have thought about motivation and the underlying reasons at a fine granularity plenty of times (impulses that form into habits, social media optimizing for short-form behaviors, the heuristics humans come with that can make doing something now hard to weigh against the cost of doing it a week from now, how all of those constrain the mind...), which makes me skeptical. The idea of 'shift the negativity elsewhere' is not new, but given your existing examples it does not convince me that if I spent an hour with you on this we would get anywhere.

"because they're bad/lazy/stupid"/"they shouldn't have" or whatever you want to round it to, but these things are semantic stopsigns, not irreducible explanations.

This, for example, is a misunderstanding of my position, or of the level of analysis that I'm speaking of. I am not stopping there: I mentally consider complex social cause and effect and still feel negative about the choices they've made.

> Yet as you grieve, these things come up less and less frequently. Over time, you run out of errant predictions like "It's gonna be fun to see Benny when --Oh fuck, no, that's not happening". Eventually, you can talk about their death like it's just another thing that is, because it is.

Grief like this exists, but I don't agree that it is pure predictive remembrance. There is grief which lasts for a time and then fades away, not because my lower-level beliefs are still predicting that I'll see them. If I'm away from home and a pet dies, I'm still sad, not because of prediction error but because I want (and wants are not predictions) the pet to be alive and fine, and they aren't. Because it is bad, to be concise.

You could try arguing that this is 'a prediction that my mental model will say they are alive and well', with two parts of myself in disagreement, but that seems very hard to evaluate for accuracy as an explanation, and I think it starts to stretch the meaning of prediction error. Nor does the implication follow that 'fully knowing the causes' carves away negative emotion.

> I'm holding the goal posts even further forward though. Friendly listening is one thing, but I'm talking about pointing out that they're acting foolish and getting immediate laughter in recognition that you're right. This is the level of ability that I'm pointing at. This is what's there to aim for, which is enabled by sufficiently clear maps.

This is more about socialization ability, though having a clear map helps. I've done this before, with parents, and joking with a friend about his progress on a project, but I do not do so regularly, nor could I do it in arbitrary situations. Joking itself is only sometimes the right route; the more general capability is working a push into normal conversation, with joking being one tool in the toolbox there. I don't really accept the implication 'and thus you are mismodeling via negative emotions if you cannot do that consistently'. I can be mismodeling to the degree that I don't know precisely what words will satisfy them, but that can be due to social ability.

> The big thing I was hoping you'd notice, is that I was trying to make my claims so outrageous and specific so that you'd respond "You can't say this shit without providing receipts, man! So lets see them!". I was daring you to challenge me to provide evidence. I wonder if maybe you thought I was exaggerating, or otherwise rounding my claims down to something less absurd and falsifiable?

When you don't provide much argumentation, I don't go 'huh, guess I need to prod them for argumentation'; I go 'ah, unfortunate, I will try responding to the crunchy parts in the interests of good conversation, but will continue on'. That is, the onus is on you to provide reasons. I did remark that you were asserting without much backing.

I was taking you literally, and I've seen plenty of people fall back without engaging (I've definitely done it during the span of this discussion), and then I was interpreting your motivations through that. 'I am playing a game to poke and prod at you' is, uh.....

> Anyway, there are a few things in your comment that suggest you might not be having fun here. If that's the case, I'm sorry about that. No need to continue if you don't want, and no hard feelings either way.

A good chunk of it is the ~condescension. Repeated insistence while seeming to mostly just continue on the same line of thought without really engaging where I elaborate, the goalpost gotcha, and then the bit about Claude right after you said that it was to 'test' me; it being there to prod me is quite annoying in and of itself.
Of course, I think you have more positive intent behind that: pushing me to test myself empirically, or pushing me to push back on you so that you can then push back on me to provide empirical tests (?), or perhaps trying to use it as an empathy test for whether I understand you. I'm skeptical of you really understanding my position, given your replies.

I feel like I'm being better at engaging at the direct level, while you're often doing 'you would understand if you actually tried', when I believe I have tried to a substantial degree even if nothing precisely like 'spend two hours mapping cause and effect of how a person came to these actions'.

My Empathy Is Rarely Kind
MinusGix · 2mo

> The thing that I was missing then, and which you're missing now, is that the bar for deep careful analysis is just a lot higher than you think (or most anyone thinks). It's often reasonable to skimp out and leave it as "because they're bad/lazy/stupid"/"they shouldn't have" or whatever you want to round it to, but these things are semantic stopsigns, not irreducible explanations.

No, I believe I'm fully aware of the level of deep careful analysis, and I understand why it pushes some people to sweep all facets of negativity or blame away; I just think they're confused, because their understanding of emotions/relations/causality hasn't updated properly alongside their new understanding of determinism.

"I'm annoyed that the calculator doesn't work... without batteries?" How do you finish the statement of annoyance?

Because I wanted the calculator to work, I think it is a good thing for calculators in stores to work, I am frustrated that the calculator didn't work... none of this is exotic, nor is it purely prediction error. (nor do prediction error related emotions have to go away once you've explained the error... I still feel emotional pain when a pet dies even if I realize all the causes why; why would that not extend to other emotions related to prediction error?)

> Empirically, what happens, is that you can keep going and keep going, until you can't, and at that point there's just no more negative around that spot because it's been crowded out. It doesn't matter if it's annoyance, or sadness, or even severe physical pain. If you do your analysis well, the experience shifts, and loses its negativity.

You assert this, but I still don't agree with it. I've thought long and hard about people before and the causes that make them do things, but no, this does not match my experience. I understand the impulse that encourages sweeping away negative emotions once you've found an explanation, like realizing that humanity's lack of coordination is a big problem, but I can still very well feel negative emotions about that despite there being an explanation.

> In other words, there are reasons for their choices. Do you understand why they chose the way they did?

Relatively often? Yes. I don't blame people for not outputting the code for an aligned AGI, because becoming the kind of person who could do that would have been absurdly hard to reinforce in yourself.

If someone has a disease that makes it so they struggle to do much at all, I am going to judge them a hell of a lot less. Most humans have the "disease" that they can't just smash out the code for an aligned AGI.

I can understand why someone is not investing more time studying, and I can even look at myself and relatively well pin down why, and why it is hard to get over that hump... I just don't dismiss the negative feeling even though I understand why. They 'could have', because the process-that-makes-their-decisions is them and not some separate third-thing.

I fail to study when I should because of a combination of seeking short-term optimized positive feelings, which leads me to watching YouTube or skimming X; a desire for faster intellectual rewards, which are more easily gotten from arguing on Reddit (or LessWrong) than from slowly reading through a math paper; fear of failure; and much more. Yet I still consider that bad; even if I got a full causal explanation, it would still have been my choices.

Regardless, I do not have issues getting along with someone even if I experience negative emotions about how they've failed to reach farther in the past—just like I can do so even if their behavior, appearance, and so on are displeasing. This will be easier if I do something vaguely like John's move of 'thinking of them like a cat', but it is not necessary for me to be polite and friendly.

> Notice the movement of goal posts here? I'm talking about successfully helping people, you're saying you can "get along". Getting along is easy. I'm sure you can offer what passes as empathy to the girl with the nail in her head, instead of fighting her like a belligerent dummy.

I don't have issues with helping people; the "goalposts" moved forward again there, despite nothing in my sentence meaning I can't help people. My usage of 'get along' was not the bare-minimum meaning.

Getting along with people in the nail scenario often means being friendly and listening to them. I can very well do that, and have done it many times before, while still thinking their individual choices are foolish.

I don't think your comment has supplied much more beyond further assertions that I must surely not be thinking things through.
