burrito — LessWrong

The case for more ambitious language model evals

Maybe a bit of a nitpick, but RLHF'd GPT-4o can still detect Eric Drexler's writing (chat link). I gave it the first paragraph of his latest blog post, which was written in February 2024, past 4o's knowledge cutoff date of October 2023. In general I'm not sure if RLHF actually makes the models worse at truesight. It would be interesting to see a benchmark comparing e.g. Llama base vs instruct on this capability.

All AGI Safety questions welcome (especially basic ones) [April 2023]

burrito3y10

Thanks, this is exactly the kind of thing I was looking for.

All AGI Safety questions welcome (especially basic ones) [April 2023]

burrito3y20

Thanks for the reply.

GPT-4 is far below village idiot level at most things a village idiot uses their brain for, despite surpassing humans at next-token prediction.

Could you give some examples? I take it that what Eliezer meant by village-idiot intelligence is less "specifically does everything a village idiot can do" and more "is as generally intelligent as a village idiot". I feel like the list of things GPT-4 can do that a village idiot can't would look much more indicative of general intelligence than the list of things a village idiot can do that GPT-4 can't. (As opposed to AlphaZero, where the extent of the list is "can play some board games really well")

I just can't imagine anyone interacting with a village idiot and GPT-4 and concluding that the village idiot is smarter. If the average village idiot had the same capabilities as today's GPT-4, and GPT-4 had the same capabilities as today's village idiots, I feel like it would be immediately obvious that we hadn't gotten village-idiot level AI yet. My thinking on this is still pretty messy though so I'm very open to having my mind changed on this.

Something like this plausibly came up in the Eliezer/Paul dialogues from 2021, but I couldn't find it with a cursory search. Eliezer has also in various places acknowledged being wrong about what kind of results the current ML paradigm would get, which probably is a superset of this specific thing.

Just skimmed the dialogues, couldn't find it either. I have seen Eliezer acknowledge what you said but I don't really see how it's related; for example, if GPT-4 had been Einstein-level then that would look good for his intelligence-gap theory but bad for his suspicion of the current ML paradigm.

All AGI Safety questions welcome (especially basic ones) [April 2023]

burrito3y226

In My Childhood Role Model, Eliezer Yudkowsky says that the difference in intelligence between a village idiot and Einstein is tiny relative to the difference between a chimp and a village idiot. This seems to imply (I could be misreading) that {the time between the first AI with chimp intelligence and the first AI with village idiot intelligence} will be much larger than {the time between the first AI with village idiot intelligence and the first AI with Einstein intelligence}. If we consider GPT-2 to be roughly chimp-level, and GPT-4 to be above village idiot level, then it seems like this would predict that we'll get an Einstein-level AI within the next year at the latest. This seems really unlikely and I don't even think Eliezer currently believes this. If my interpretation of this is correct, this seems like an important prediction that he got wrong and I haven't seen acknowledged.

So my question is: Is this a fair representation of Eliezer's beliefs at the time? If so, has this prediction been acknowledged as wrong, or was it actually not wrong and there's something I'm missing? If the prediction was wrong, what might the implications be for fast vs slow takeoff? (Initial thoughts: If this prediction had been right, then we'd expect fast takeoff to be much more likely, because it seems like if you improve Einstein by the difference between him and a village idiot 20 times over 4 years (the time between GPT-2 and GPT-4, i.e. ~chimp level vs >village idiot level), you will definitely get a self-improving AI somewhere along the way.)

(Meta: I know this seems like more of an argument than a question; the reason I put it here is that I expect someone to have an easy/obvious answer, since I haven't really spent any time thinking about or working with AI beyond playing with GPT and watching/reading some AGI debates. I also don't want to pollute the discourse in threads where the participants are expected to at least kind of know what they're talking about.)

The Best Software For Every Need

burrito4y20

Strongly agree. As a relative beginner I've found the automatic code completion and method listing/descriptions incredibly useful.

burrito's Shortform

burrito5y10

Responding to 3):

What is your standing to judge your imagined version of someone's experience? Maybe your preferences are different enough from the subject's that you're simply wrong in your comparison.

You're right and I should have said "imagine the point at which they are indifferent", "would they prefer the 10x experience or the 1x experience", etc. Imagining whether I would prefer it could be a decent approximation of their preferences, though.

burrito's Shortform

burrito5y10

Responding to 2):

It's likely that some experiences are non-linear in utility per intensity.

In the post, I defined intensity as linearly proportional to utility. If you think wording it as "intensity" is misleading because what we generally think of as "experience intensity" isn't linearly proportional to utility, then I agree, but can't think of a better term to use.

Or that you'd have to crank up some parts of the experience and not others. For instance, enjoying the contrast of bitter and fruity in a shot of espresso - there's no way to scale the whole thing up 10x, you have to pick and choose what to intensify, and then your result is subject to those modeling choices.

If I'm understanding you correctly, you're saying that the experience of enjoying the contrast of bitter and fruity can be modeled as the individual experiences "bitter 1, bitter 2, bitter 3, ..., fruity 1, fruity 2, fruity 3, ..." and the total utility of the 10x scaled experience depends on which ones you group together when summing the utilities. For example, if your "utility groups" are bitter 1 + fruity 1, bitter 2 + fruity 2, etc., it comes out with higher utility than if you group them as bitter 1 + bitter 2 + etc. and fruity 1 + fruity 2 + etc., because the specific combination of bitter and fruity is what makes it a good experience.

I disagree that the individual experiences are bitter 1, bitter 2, ..., fruity 1, fruity 2, ... ; I feel like it should be more like bitter-fruity 1, bitter-fruity 2, ..., bitter 1, bitter 2, ..., fruity 1, fruity 2, ... The combination of bitter and fruity ("bitter-fruity") is a distinct individual experience in the set of experiences occurring at that moment, and in that set might also be included individual "bitter" and "fruity" experiences. Here, we can just intensify each individual experience by 10 (i.e. multiply the utility of the experience by 10 while keeping it as the same "type" of experience) and sum their utilities.
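A minimal sketch of the scaling described in the paragraph above, assuming utilities simply add. The experience names and the numeric utilities are made up for illustration and are not taken from the discussion; the only point is that once "bitter-fruity" is its own entry, scaling every entry by the same factor needs no further modeling choices.

```python
# Hypothetical utilities for the experiences occurring at one moment.
# The numbers are invented purely for illustration.
experiences = {
    "bitter-fruity": 3.0,  # the contrast itself, treated as its own experience
    "bitter": -1.0,
    "fruity": 1.0,
}


def total_utility(utilities, scale=1.0):
    """Scale every individual experience by the same factor, then sum."""
    return sum(scale * u for u in utilities.values())


base = total_utility(experiences)                # the "1x" experience
scaled = total_utility(experiences, scale=10.0)  # the "10x" experience
print(base, scaled)
```

Under the additivity assumption, the scaled total is just ten times the base total, whatever order or grouping the entries are summed in.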

Also, why 10x rather than 0.1x, or 0x (direct comparison of experienced to not-experienced)?

I don't really know what it would mean to prefer to "not experience" something. You're always experiencing something; your baseline mental state is an experience. If your baseline mental state has exactly 0 utility, this would work, but your baseline mental state isn't necessarily exactly 0 utility. If "not experiencing" something means shaving that much time off the rest of your life, this still feels like a conceptually weaker version of a "preference". When comparing two experiences, I can imagine being on experience 1, then deciding I want to switch to experience 2, then actually deciding experience 1 is better, etc., until some sort of equilibrium is reached and I make up my mind. (Keep in mind that how tired you are of the experience is itself part of the experience, and held constant, so you wouldn't keep jumping back and forth endlessly.) Theoretically, the analogous way to compare an experience with the absence of experience would be jumping between [your entire mind being turned off except for the part that allows you to think and make decisions] and the experience, but that's a harder thought experiment to have than imagining jumping between two different experiences.

I hadn't thought of comparing 1x to 0.1x; that's a good idea, and I don't object to it. I imagine 1x to 0.1x is more useful if each individual experience has a lot of utility (or disutility), and 1x to 10x is more useful if each individual experience only has a little utility (or disutility).

(I will respond to 3) and 4) tomorrow, it's getting late and I should sleep)

burrito's Shortform

burrito5y10

First, I want to make sure we're separating the validity of the model itself from concerns with applying it. I'll try to be clear about which one I'm talking about for each part.

I'll respond to each number in a separate reply because the format of the conversation will be a mess otherwise. Starting with 1):

Where do you get this list?

Could you be more specific? Is this question centered around how I know what other people are thinking, or how to separate a whole experience into individual experiences?

And how do you account for future unexpected experiences?

I don't know how to respond to this. What part of the model depends on accounting for unexpected future experiences? If you're asking generally how I would predict future experiences, I don't have a good answer, but this seems both separate from the philosophical model itself and not an objection to the application of this specific philosophical model (it applies to all experience-based consequentialism).

burrito's Shortform

burrito5y70

Vague idea for how to theoretically decide whether a mind's existence at any given moment is net positive, from a utilitarian standpoint:

1. Get a list of every individual conscious experience the mind is having at the exact moment.
2. For each individual experience, crank up its "intensity" (i.e. magnitude of utility) by a factor of, say, 10; as intense as you can easily imagine and empathize with, but no more. This can be approximated by imagining the point at which you are indifferent between experiencing the "1x" experience for 10 seconds and the "10x" experience for 1 second. Try to imagine this independently of mental side effects caused by experiencing it for a long time.
3. Now imagine the entire collection of "10x" experiences together, for a "10x" existence. Would you prefer to experience this, or the "1x" existence? If you prefer 10x, the mind's existence is net positive; if you prefer 1x, it's net negative.

I feel like some version of this has to logically follow if you assume that the utility of multiple experiences at once is equal to the sum of the utilities of the individual experiences (might not be true), plus some uncontroversial utilitarian assumptions, but I can't formalize the proof in my head.
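One way the argument might be sketched, under the additivity assumption stated above plus the assumption that a preference between two existences tracks which has higher total utility. The symbols $u_i$, $U$, and $k$ are introduced here for illustration and do not appear in the original post.

```latex
% u_i : utility of the i-th individual experience at this moment (assumed notation)
% U   : total utility of the moment, assuming utilities add
% k   : scaling factor, k = 10 in the post
\begin{align*}
U      &= \sum_i u_i \\
U_{k}  &= \sum_i k\,u_i = k\,U \\
U_{10} > U_{1}
       &\iff 10\,U > U
        \iff 9\,U > 0
        \iff U > 0
\end{align*}
% Calibration of "10x" for a single experience, assuming utility is
% linear in duration: 1x for 10 seconds matches 10x for 1 second,
%   u_i \cdot 10\,\mathrm{s} = (10\,u_i) \cdot 1\,\mathrm{s}.
```

So preferring the 10x existence corresponds to $U > 0$ and preferring the 1x existence to $U < 0$. If utilities of simultaneous experiences do not add, the second line fails and the conclusion does not follow from this sketch.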

If anyone has links to other attempts to get a utilitarian threshold for net positive/negative existence, they would be appreciated.

(I should note that I know approximately zero neuroscience, and don't know if the concept of an "individual experience" as distinct from other "individual experiences" happening in the same mind at the same time is coherent)

burrito's Shortform

burrito5y50

Intentionally rationalizing against your beliefs could be a good strategy for doing a cost-benefit analysis. For example, if you currently support increasing the minimum wage, imagine yourself as someone who is against it, and from that perspective come up with as many disadvantages to it as possible. I'm sure I'm not the first to come up with this idea but I haven't seen it anywhere else; is there a name for this concept?
