All of cubefox's Comments + Replies

Fixed! (Video reviews, so unfortunately there is no Ctrl-F to find the relevant part.)

Two reviewers who worried about the weight: Norman Chan, Marques Brownlee.

2ChristianKl2h
If you had also added links, I would have added the "nice, scholarship" react.

There are at least two related theories in which "all sentient beings matter" may be true.

  • Sentient beings can experience things like suffering, and suffering is bad. So sentient beings matter insofar as it is better that they experience more rather than less well-being. That's hedonic utilitarianism.

  • Sentient beings have conscious desires/preferences, and those matter. That would be preference utilitarianism.

The concepts of mattering or being good or bad (simpliciter) are intersubjective generalizations of the subjective concepts of mattering or being... (read more)

I am aware of only three methods to modify GPTs: in-context learning (prompting), supervised fine-tuning, and reinforcement fine-tuning. The achievable effects seem rather similar.

2Matt Goldenberg12d
There's many other ways to search the network in the literature, such as Activation Vectors [https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector].  And I suspect we're just getting started on these sorts of search methods.

I did read your post. The fact that something like predicting text requires superhuman capabilities of some sort does not mean that the task itself will result in superhuman capabilities. That's the crucial point.

It is much harder to imitate human text than to write while being a human, but that doesn't mean the imitated human itself is any more capable than the original.

An analogy. The fact that building fusion power plants is much harder than building fission power plants doesn't at all mean that the former are better. They could even be worse. There is a fundamental disconnect between the difficulty of a task and the usefulness of that task.

This approach doesn't seem to work with in-context learning. But then it is unclear whether fine-tuning could be more successful.

2Matt Goldenberg12d
I think there are probably many approaches that don't work.

Being able to perfectly imitate a Chimpanzee would probably also require superhuman intelligence. But such a system would still only be able to imitate chimpanzees. Effectively, it would be much less intelligent than a human. Same for imitating human text. It's very hard, but the result wouldn't yield large capabilities.

3DragonGod12d
Do please read the post. Being able to predict human text requires vastly superhuman capabilities, because predicting human text requires predicting the processes that generated said text. And large tracts of text are just reporting on empirical features of the world. Alternatively, just read the post I linked.
2Matt Goldenberg13d
It depends on your ability to extract the information from the model. RLHF and instruction tuning are two such algorithms that allow certain capabilities besides next-token prediction to be extracted from the model. I suspect many other search and extraction techniques will be found, which can leverage latent capabilities and understandings in the model that aren't modelled in its text outputs.

Thank you, this has many interesting points. The takeoff question is at the heart of predicting x-risk: with a soft takeoff, catastrophe seems unlikely; with a hard takeoff, likely.

One point though. "Foom" was intended to be a synonym for "intelligence explosion" and "hard takeoff". But not for "recursive self-improvement", although EY perceived the latter to be the main argument for the former, though not the only one. He wrote:

[Recursive self-improvement] is the biggest, most interesting, hardest-to-analyze, sharpest break-with-the-past contributing to the

... (read more)
5DragonGod13d
"The upper bound of what can be learned from a dataset is not the most capable trajectory, but the conditional structure of the universe implicated by their sum [https://www.lesswrong.com/posts/MmmPyJicaaJRk4Eg2/the-limit-of-language-models]".

Yeah. In logic it is usually assumed that sentences are atomic when they do not contain logical connectives like "and". And formal (Montague-style) semantics makes this more precise, since logic may be hidden in linguistic form. But of course humans don't start out with language. We have some sort of mental activity, which we somehow synthesize into language, and similar thoughts/propositions can be expressed alternatively with an atomic or a complex sentence. So atomic sentences seem definable, but not abstract atomic propositions as objects of belief and desire.

A bit late, a related point. Let me start with probability theory. Probability theory is considerably more magic than logic, since only the latter is "extensional" or "compositional"; the former is not. This just means that the truth values of A and B determine the truth value of complex statements like A∧B ("A and B"). The same is not the case for probability theory: the probabilities of A and B do not determine the probability of A∧B, they only constrain it to a certain range of values.

For example, if A and B have probabilities 0.6 and 0.5 respectively, the ... (read more)
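The range in question is given by the Fréchet bounds: P(A∧B) lies between max(0, P(A)+P(B)−1) and min(P(A), P(B)). A minimal sketch (the function name is illustrative, and the numbers follow the 0.6/0.5 example):

```python
def conjunction_bounds(p_a, p_b):
    """Fréchet bounds: the range P(A and B) can take
    given only the marginals P(A) and P(B)."""
    lower = max(0.0, p_a + p_b - 1.0)
    upper = min(p_a, p_b)
    return lower, upper

# With P(A) = 0.6 and P(B) = 0.5, the conjunction is only
# pinned down to the interval [0.1, 0.5].
low, high = conjunction_bounds(0.6, 0.5)
```

So unlike in truth-functional logic, knowing the two marginal probabilities leaves a whole interval of admissible values for the conjunction.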

2moridinamael23d
Great points. I would only add that I’m not sure the “atomic” propositions even exist. The act of breaking a real-world scenario into its “atomic” bits requires magic, meaning in this case a precise truncation of intuited-to-be-irrelevant elements.

This is an interesting result!

  • It seems to support LeCun's argument against autoregressive LLMs more than "simulator theory".

  • One potential weakness of your method is that you didn't use a base (foundation) model, but apparently the heavily fine-tuned gpt-3.5-turbo. The different system prompts probably can't completely negate the effect of this common fine-tuning. It would be interesting to see how the results hold up when you use code-davinci-002, the GPT-3.5 base model, which has no instruction tuning or RLHF applied. Though this model is no longer avail

... (read more)

Okay, that clarifies a lot. But the last paragraph I find surprising.

re: (2), I just don't see LLMs as providing much evidence yet about whether the concepts they're picking up are compact or correct (cf. monkeys don't have an IGF concept).

If LLMs are good at understanding the meaning of human text, they must be good at understanding human concepts, since concepts are just the meanings of words the LLM understands. Do you doubt they are really understanding text as well as it seems? Or do you mean they are picking up other, non-human, concepts as well, ... (read more)

Inner alignment is a problem, but it seems less of a problem than in the monkey example. The monkey values were trained using a relatively blunt form of genetic algorithm, and monkeys aren't anyway capable of learning the value "inclusive genetic fitness", since they can't understand such a complex concept (and humans didn't understand it historically). By contrast, advanced base LLMs are presumably able to understand the theory of CEV about as well as a human, and they could be finetuned by using that understanding, e.g. with something like Constitutional... (read more)

The fragility-of-value posts are mostly old. They were written before GPT-3 came out (which seemed very good at understanding human language and, consequently, human values), before instruction fine-tuning was successfully employed, and before forms of preference learning like RLHF or Constitutional AI were implemented.

With this background, many arguments in articles like Eliezer's Complexity of Value (2015) sound now implausible, questionable or in any case outdated.

I agree that foundation LLMs are just able to predict how a caring human sounds like, but ... (read more)

9So8res2mo
It seems to me that the usual arguments still go through. We don't know how to specify the preferences of an LLM (relevant search term: "inner alignment"). Even if we did have some slot we could write the preferences into, we don't have an easy handle/pointer to write into that slot. (Monkeys that are pretty-good-in-practice at promoting genetic fitness, including having some intuitions leading them to sacrifice themselves in-practice for two-ish children or eight-ish cousins, don't in fact have a clean "inclusive genetic fitness" concept that you can readily make them optimize. An LLM espousing various human moral intuitions doesn't have a clean concept for pan-sentience CEV such that the universe turns out OK if that concept is optimized.)

Separately, note that the "complexity of value" claim is distinct from the "fragility of value" claim. Value being complex doesn't mean that the AI won't learn it (given a reason to). Rather, it suggests that the AI will likely also learn a variety of other things (like "what the humans think they want" and "what the humans' revealed preferences are given their current unendorsed moral failings" and etc.). This makes pointing to the right concept difficult. "Fragility of value" then separately argues that if you point to even slightly the wrong concept when choosing what a superintelligence optimizes, the total value of the future is likely radically diminished.

Regarding the last point. Can you explain why existing language models, which seem to care more than a little about humans, aren't significant evidence against your view?

7So8res2mo
Current LLM behavior doesn't seem to me like much evidence that they care about humans per se. I'd agree that they evidence some understanding of human values (but the argument is and has always been "the AI knows but doesn't care"; someone can probably dig up a reference to Yudkowsky arguing this as early as 2001). I contest that the LLM's ability to predict how a caring human sounds is much evidence that the underlying cognition cares similarly (insofar as it cares at all).

And even if the underlying cognition did care about the sorts of things you can sometimes get an LLM to write as if it cares about, I'd still expect that to shake out into caring about a bunch of correlates of the stuff we care about, in a manner that comes apart under the extremes of optimization. (Search terms to read more about these topics on LW, where they've been discussed in depth: "a thousand shards of desire", "value is fragile".)

Yeah, championing seems to border on deception, bullshitting, or even lying. But the group rationality argument says that it can be optimal when a few members of a group "over focus" (from an individual perspective) on an issue. These pull in different directions.

Looking back, I would say this post has not aged well. Already LaMDA and InstructGPT (language models fine-tuned with supervised learning to follow instructions, essentially ChatGPT without any RLHF applied) are in fact pretty safe Oracles with regard to fulfilling wishes without misinterpreting you, and an Oracle AI is just a special kind of Genie whose actions are restricted to outputting text. If you tell InstructGPT what you want, it will very much try to give you just what you want, not something unintended, at least if it can be produced using text.

May... (read more)

... and in between, instruction tuning uses SL. So they use all three paradigms.

In your ABC example we rely on the background information that

  • P(A&B)=0
  • P(A&C)=0
  • P(B&C)=0
  • P(A or B or C)=1.

So the background information is that the events are mutually exclusive and exhaustive. But only then do probabilities need to add to one; it's not a general fact that "probabilities add to 1". So taking the geometric average does not itself violate any axioms of probability. We "just" need to update the three geometric averages on this background knowledge. Plausibly how this should be done in this case is to normalize them such that t... (read more)
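A minimal sketch of the renormalization step (the expert probabilities here are invented for illustration): the geometric means of three mutually exclusive, exhaustive outcomes generally don't sum to 1 and so get rescaled, whereas arithmetic means satisfy the sum-to-1 constraint automatically.

```python
import math

# Two hypothetical experts, each assigning probabilities to the
# mutually exclusive and exhaustive outcomes A, B, C.
# Each expert's own distribution sums to 1.
experts = {
    "A": [0.6, 0.1],
    "B": [0.3, 0.5],
    "C": [0.1, 0.4],
}

# Per-outcome geometric means of the expert probabilities.
geo = {k: math.prod(ps) ** (1 / len(ps)) for k, ps in experts.items()}
total = sum(geo.values())  # generally != 1 before rescaling
normalized = {k: v / total for k, v in geo.items()}

# Arithmetic means, by contrast, already sum to 1.
arith = {k: sum(ps) / len(ps) for k, ps in experts.items()}
```

Whether this renormalization is the right update on the background knowledge is exactly the point under dispute in the thread; the code only shows what the operation does, not that it is correct.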

2AlexMennen2mo
My problem with a forecast aggregation method that relies on renormalizing to meet some coherence constraints is that then the probabilities you get depend on what other questions get asked. It doesn't make sense for a forecast aggregation method to give probability 32.5% to A if the experts are only asked about A, but have that probability predictably increase if the experts are also asked about B and C. (Before you try thinking of a reason that the experts' disagreement about B and C is somehow evidence for A, note that no matter what each of the experts believe, if your forecasting method is mean log odds, but renormalized to make probabilities sum to 1 when you ask about all 3 outcomes, then the aggregated probability assigned to A can only go up when you also ask about B and C, never down. So any such defense would violate conservation of expected evidence.)

Any linear constraints (which are the things you get from knowing that certain Boolean combinations of questions are contradictions or tautologies) that are satisfied by each predictor will also be satisfied by their arithmetic mean. That's part of my point. Arithmetic mean of probabilities gives you a way of averaging probability distributions, as well as individual probabilities. Geometric mean of log odds does not.

In this example, the sources of evidence they're using are not independent; they can expect ahead of time that each of them will observe the same relative frequency of black balls from the urn, even while not knowing in advance what that relative frequency will be. The circumstances under which the multiplicative evidence aggregation method is appropriate are exactly the circumstances in which the evidence actually is independent.

They make their bet direction and size functions of the odds you offer them in such a way that they bet more when you offer better odds. If you give the correct odds, then the bet ends up resolving neutrally on average, but if you give incorrect odds, then which

In Peano arithmetic, the induction axiom (not axiom schema) basically says "... and nothing else is a natural number". It can only be properly formulated in second-order logic, and the result is that Peano arithmetic becomes "categorical", which means it has only one model (the intended one) up to isomorphism. The real and complex number systems and geometry also have categorical axiomatizations. Standard (first-order) ZFC is not categorical, since it allows both for models that are larger than intended (like first-order Peano arithmetic) and smaller than inten... (read more)
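For reference, the second-order induction axiom quantifies over all properties P of natural numbers, which is what pins down the standard model:

```latex
\forall P\,\bigl[\,P(0) \,\land\, \forall n\,\bigl(P(n) \rightarrow P(S(n))\bigr) \;\rightarrow\; \forall n\,P(n)\,\bigr]
```

In the first-order schema, by contrast, induction is only asserted for the countably many properties definable by formulas, which is what leaves room for nonstandard models.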

But do look at introductions to Bayesian statistics versus Bayesian epistemology: there is hardly any overlap. One thing they have in common is that both agree it makes sense to assign probabilities to hypotheses. But otherwise? I personally know quite a lot about Bayesian epistemology, but basically none of it appears to be of interest to Bayesian statisticians.

It is worth thinking about why ChatGPT, an Oracle AI which can execute certain instructions, does not fail text equivalents of the cauldron task.

It seems the reason why it doesn't fail is that it is pretty good at understanding the meaning of expressions. (If an AI floods the room with water because this maximizes the probability that the cauldron will be filled, then the AI hasn't fully understood the instruction "fill the cauldron", which only asks for a satisficing solution.)

And why is ChatGPT so good at interpreting the meaning of instructions? Because... (read more)

Yeah. And many people do indeed recommend one should add pasta only after the water is boiling. For example:

Don't add the noodles until the water has come to a rolling boil, or they'll end up getting soggy and mushy.

Except ... they don't get soggy.

I would know, I made a lot of pasta in spring of 2020!

While we're at it, they also say

Bring a large pot of water to a boil.

which other sources also tend to recommend. This is usually justified by saying that by using a lot of water the pasta will thereby stick together less. But as I said, I consider myse... (read more)

(I don't know much about physics, but...) Raising the boiling point just means raising the maximum temperature of the water. Since during normal (saltless) cooking that maximum is usually reached at some time x before the pasta is done, raising the boiling point with salt means the water is overall hotter after x, which means you have to cook (a tiny bit) shorter. What makes the pasta done is not the boiling, just the temperature of the water and how long it stays at that temperature.
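A back-of-the-envelope check (using the textbook ebullioscopic constant of water and an assumed salting level, not values from the thread) suggests the salt effect on temperature is tiny anyway:

```python
# Boiling-point elevation: delta_T = i * Kb * m
K_B_WATER = 0.512        # K*kg/mol, ebullioscopic constant of water
MOLAR_MASS_NACL = 58.44  # g/mol
VANT_HOFF_NACL = 2       # NaCl dissociates into two ions

salt_grams_per_kg_water = 10.0  # assumed typical pasta-water salting
molality = salt_grams_per_kg_water / MOLAR_MASS_NACL  # mol/kg
delta_t = VANT_HOFF_NACL * K_B_WATER * molality
# delta_t comes out to roughly 0.18 degrees C, far too small
# to matter for cooking time.
```

So whatever salt does to pasta, it is presumably not doing it via the boiling point.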

5Going Durden3mo
I'm pretty sure, though I cannot find data on it, that cooking in salt water simply causes salt to interact chemically with the pasta, "tenderizing" it, the same way salt tenderizes meat, vegetables, etc. I assume we could perform an experiment in which we submerge identical amounts of pasta in cold tap water and in an equal volume of salt water, and wait until it becomes soft enough to eat. My assumption is that waterlogging pasta in salt water would soften it much faster.
2RamblinDash3mo
Right, but if you wait for the water to boil before you put the pasta in, then you are waiting a little longer before adding the pasta. Then cooking the pasta slightly shorter after you put it in.

More examples:

  • Prostitution
  • Suicide

And now OpenAI is removing access to code-davinci-002, the GPT-3.5 foundation model: https://twitter.com/deepfates/status/1638212305887567873

The GPT-4 base model will apparently also not be available via the API. So it seems the most powerful publicly available foundation model is now Facebook's leaked LLaMA.

For any proposition you assert, it is possible that someone else has another "perspective" and asserts something else instead, each acting as if it were the truth. So the existence of possible perspectives is not specific to politics or truth-seeking. Sure, it is possible to be overconfident relative to the evidence you have, but I don't recommend universal extensive hedging for political examples merely because they are political. If you disagree with his examples, you are surely able to insert similar examples where (what you believe to be) epistemic mistak... (read more)

I find him using political examples not suspicious at all. After all, politics is an area where epistemic mistakes can have large to extremely large negative effects. He could have referred to non-political examples, but those tend to be comparatively inconsequential.

2shminux3mo
Yes, indeed, and that was my point: they are using a political example with a connotation-loaded language as if it was truth, not one possible perspective. Which made me question the OP's ability to evaluate their own commitment to truth-seeking.

My comment was mostly based on the CAI paper, where they compared the new method against their earlier RLHF model and reported more robustness against jailbreaking. Now OpenAI's GPT-4 (though not Microsoft's Bing version) also seems to be a lot more robust than GPT-3.5, but I don't know why.

1Mohammad Bavarian3mo
I think there is a big danger in just relying on papers and not doing empirical tests. 

How about making a follow-up with GPT-4, and testing how it improved? From OpenAI, GPT-4 is only available via ChatGPT Plus, but Bing also has a free variant of it. Though the latter is still a bit limited (15 model replies per conversation) and currently based on a waitlist.

I also have not used them since my voting power increased, simply because unduly exaggerating my voice is unethical. But once sufficiently many other people do it, or are suspected of doing it, this inhibition would go away.

unduly exaggerating my voice is unethical

The users of the forum have collectively granted you a more powerful voice through our votes over the years. While there are ways you could use it unethically, using it as intended is a good thing.

It is not clear to me whether it helps with the cases you mention. It gives more voting power to senior or heavy users. But it also incentivizes users to abuse their strong votes. This is similar to how score or range voting systems encourage voters to exaggerate the strength of their preferences and to give extreme value votes as often as possible.

I think this already happens in the EA Forum, where controversial topics like the Bostrom email seemed to encourage mind-killed tribe voting. Sometimes similarly reasonable arguments would get either heavily vot... (read more)

6Kaj_Sotala3mo
For what it's worth, I have 10 strong upvote strength and at least when talking about comments, for me the effect is the opposite. With the karma of most comments being in the 0-10 range, an upvote of 10 feels so huge that I use it much more rarely than if it was something smaller like 4. (For posts, 10 points isn't necessarily that much so there I'm more inclined to use it.)
5Elizabeth3mo
Data point: in practice I've given fewer strong votes as my vote power has increased and will very rarely use strong votes on comments where it would dramatically change the comment karma (or posts, but most posts get enough karma I feel fine strong voting)

Interesting. Claude being more robust against jailbreaking probably has to do with the fact that Anthropic doesn't use RLHF, but a sort of RL on synthetic examples of automatic, iterated self-critique, based on a small number of human-written ethical principles. The method is described in detail in their paper on "Constitutional AI". In a recent blog post, OpenAI explicitly mentions Constitutional AI as an example of how they plan to improve their fine-tuning process in the future. I assume the Anthropic paper simply came out too late to influence OpenAI's... (read more)

6Mohammad Bavarian3mo
Did you test Claude for being less susceptible to this issue? Otherwise I'm not sure where your comment actually comes from. Testing this, I saw similar or worse behavior from that model, albeit GPT-4 also definitely has this issue: https://twitter.com/mobav0/status/1637349100772372480?s=20
9James Payor3mo
I think I saw OpenAI do some "rubric" thing which resembles the Anthropic method. It seems easy enough for me to imagine that they'd do a worse job of it though, or did something somewhat worse, since folks at Anthropic seem to be the originators of the idea (and are more likely to have a bunch of inside view stuff that helped them apply it well)

Thank you, I didn't know that.

The fact that strong votes have such a disproportionate effect (which relies on the restraint of the users not to abuse it) reduces my trust in the Karma/agreement voting system.

I think it should increase your trust in the voting system! Most of the rest of the internet has voting dominated by whatever new users show up whenever a thing gets popular, and this makes it extremely hard to interpret votes in different contexts. E.g. on Reddit, the most upvoted things in most subreddits often don't have much to do with the subreddit; they are just the things that blew up to the frontpage and so got a ton of people voting on them. Weighted voting helps a lot in creating some stability in voting and making things less internet-popularity-weighted (it also does some other good things, and has some additional costs, but this is I think one of the biggest ones).

This is a tangent, but any explanation why strong votes now give/deduct 4 points? This seems excessive to me.

9habryka3mo
Strong votes scale depending on your karma, all the way up to 10 points, I think (though there are I think maybe 2-3 users with that vote-strength). It's basically a logarithmic scaling of vote-strength.

All three prompts were correct when I gave them to Bing Chat "precise".

2faul_sname3mo
Is that a special version of Bing Chat? If so, can you feed it the following: Edit: figured it out. But yeah that's pretty impressive and unless they explicitly trained on reversed wikipedia that's quite impressive. Calling my prediction a miss.

Note that this is not identical to the original three prompts, which worked in the opposite direction.

Nice post. Non-transitivity of concept extrapolation is overall plausible to me, but not so much in your dog example. Though I couldn't come up with a more intuitive case.

Not a new phenomenon. Fine-tuning leads to mode collapse, this has been pointed out before: Mysteries of mode collapse

2kyleherndon3mo
Thanks for the great link. Fine-tuning leading to mode collapse wasn't the core issue underlying my main concern/confusion (intuitively that makes sense). paulfchristiano's reply leaves me now mostly completely unconfused, especially with the additional clarification from you. That said I am still concerned; this makes RLHF seem very 'flimsy' to me.

Okay, these points seem reasonable.

One other worry I forgot to mention, however: I could be totally wrong here, but presumably most applications of this kind of "standpoint epistemology" in the last ten years come from researchers I would suspect of being far-left activists. If so, those people would of course be very eager to interview people whom, according to their political worldview, they believe to be victims of oppression, i.e. especially black people and women. They would very rarely interview white men or Asians or police officers about their "experiences" or "p... (read more)

3tailcalled3mo
I'm not 100% sure about this, but from what I've heard a lot of left-wing academics don't even try all that hard to reveal black people's experiences, but instead mainly use black people as a tool to say that right-wingers are bad. I agree that this sort of thing is a problem, but I'd think it is best addressed by doing more to map out different people's experiences in a publicly accessible way. That is what I am getting at when I say: My ideal outcome for this post would be if more people went out and mapped more groups' perspectives of more situations.

In the context of qualitative interview questions like this, straightforwardly taking the answers to be about "the problems black people face" or "the problems the police face" presupposes that individual opinions on what these problems are, are not incorrect, confused, or otherwise inaccurate. Again, imagine interviewing pre-war Christian Germans to find out "the problems Germans face with Jews".

Qualitative interviews are even less reliable than opinion polls, since in those polls we at least get statistically significant result... (read more)

Again, imagine interviewing pre war Christian Germans to find out "the problems Germans face with Jews".

I don't have any clear imaginations of what would happen in this case?

Like I know that antisemitism was rampant there at the time, so probably you would get a lot of angry negative opinions. But what would they be? "My pastor's friend's niece was killed by a Jew"? "Jews control the banking system which is evil and also they are breeding like rabbits"? "There's a group of child prostitutes downtown, and their pimp is Jewish"?

I would like to know what the ... (read more)

You talk a lot about experiences here, but all these answers express beliefs, not experiences. Beliefs can be arbitrarily biased -- just think about modifications of your method: Instead of asking black people about their "experiences" with the police, you could ask police officers about their "experiences" with black people. Or you could ask people without migration background about their "experiences" with immigrants. You could have asked Christian Germans in 1938 about their "experiences" with Jews, etc. What you will get is a bunch of opinions which co... (read more)

2Stephen Bennett (Previously GWS)3mo
That also stood out to me as a bit of a leap. It seems to me that for Aumann's theorem to apply to standpoint epistemology, everyone would have to share all their experiences and believe everyone else about their own experiences.
7tailcalled3mo
Unless you have comprehensive measurement such as cameras etc., information about experiences is necessarily mediated by human beliefs. However, human beliefs can be about many other things than experiences; e.g. about statistics/general societal tendencies, about options for change, about epistemology, about future trajectories, etc. Beliefs about experiences constitute a specific form of beliefs, and experiences are in a sense the "native" way for humans to gain information about the world (i.e. hunter-gatherers have experiences too, but they probably don't work with statistical data), so they are probably where humans form their richest/most-detailed/most-informative opinions.

In the post, I specifically suggested that asking the police would make for a good followup. When you say that my suggestion here is "arbitrarily biased", what do you mean by that?

Could you help me understand how looking at crime statistics would help enlighten us about the problems black people face with the police? I think they would be more illuminating about the problems the police or society face with black people. But also, I don't think statistics are all that epistemically different from surveys. Help me out if I'm misunderstanding something here, but my understanding is that both my survey about experiences and crime statistics ultimately originate in experiences where people have interacted with the police. Cops and the people they interact with then process those experiences to remember them and make sense of what to do.

Where my approach differs from crime statistics is that my approach then just ends there, asking people what they've figured out from the interaction. Meanwhile, in the case of crime statistics, police are given authorization to decide that some interaction is sufficiently harmful that they should arrest the person involved in the interaction. And then they kick off a governmental system of having people come out to investigate the sce

Great points. Perhaps an acceptable substitute for advice is offering help. For example: "Would you like me to go to the doctor's with you?" Of course, offers for help shouldn't be given in a way that sounds like advice. And listening/empathy should probably come first.

I find the common downvoting-instead-of-arguing mentality frustrating and immature. If I don't have the energy for a counterargument, I simply don't react at all. Just doing downvotes is intellectually worthless booing. As feedback it's worse than useless.

1Guillaume Charrier3mo
Strong upvote!

But it is clearly "morally" bad? It is just not a morally wrong action. Actions are wrong insofar as their expected outcomes are bad, but an outcome can be bad without being the result of anyone's action.

(You might say that morality is only a theory of actions. Then saying that a world, or any outcome, is "morally" bad, would be a category mistake. Fine then, call "ethics" the theory both of good and bad outcomes, and of right and wrong actions. Then a world where everyone suffers is bad, ethically bad.)

1TAG3mo
No, that's the point. Yep, but you still need to show it's morally bad even if it is unintentional.

The terms "right" and "wrong" apply just to actions. This world is bad, without someone doing something wrong.

1TAG3mo
An imperfect world might be bad in various ways, such as being undesirable, but if it is not morally bad, it implies nothing about objective morality.

This insight can be reversed: If you can't understand the mathematical details of a theory (which will be true for many of us, math is often hard), don't waste undue time on understanding the high-level features. Luckily, many interesting theories outside physics have much simpler math than quantum mechanics.

If there is none, it would mean a world where everyone suffers horribly forever is not objectively worse than one where everyone is eternally happy. But I think that's just not compatible with what words like "good" or "worse" mean! If we imagine a world where everything is the same as in ours, except that people call things "bad" we call "good", and "good" what we call "bad" -- would that mean they believe suffering is good? Of course not. They just use different words for the same concepts we have! Believing that, other things being equal, suffering is b... (read more)

1TAG3mo
Consider a world where everyone suffers horribly, and it's no one's fault, and it's impossible to change. Is it morally wrong, even though the elements of intentionality and obligation are absent?