There are at least two related theories in which "all sentient beings matter" may be true.
Sentient beings can experience things like suffering, and suffering is bad. So sentient beings matter insofar as it is better that they experience more rather than less well-being. That's hedonic utilitarianism.
Sentient beings have conscious desires/preferences, and those matter. That would be preference utilitarianism.
The concepts of mattering or being good or bad (simpliciter) are intersubjective generalizations of the subjective concepts of mattering or being...
I am aware of just three methods to modify GPTs: in-context learning (prompting), supervised fine-tuning, and reinforcement fine-tuning. The achievable effects seem rather similar.
I did read your post. The fact that something like predicting text requires superhuman capabilities of some sort does not mean that the task itself will result in superhuman capabilities. That's the crucial point.
It is much harder to imitate human text than to write it as a human, but that doesn't mean the imitated human is any more capable than the original.
An analogy. The fact that building fusion power plants is much harder than building fission power plants doesn't at all mean that the former are better. They could even be worse. There is a fundamental disconnect between the difficulty of a task and the usefulness of its result.
This approach doesn't seem to work with in-context learning. Then it is unclear whether fine-tuning could be more successful.
Being able to perfectly imitate a chimpanzee would probably also require superhuman intelligence. But such a system would still only be able to imitate chimpanzees; effectively, it would be much less intelligent than a human. The same goes for imitating human text: it's very hard, but the result wouldn't yield large capabilities.
Thank you, this has many interesting points. The takeoff question is the heart of predicting x-risk. With a soft takeoff, catastrophe seems unlikely; with a hard takeoff, likely.
One point though. "Foom" was intended to be a synonym for "intelligence explosion" and "hard takeoff". But not for "recursive self-improvement", although EY perceived the latter to be the main argument for the former, though not the only one. He wrote:
...[Recursive self-improvement] is the biggest, most interesting, hardest-to-analyze, sharpest break-with-the-past contributing to the
Yeah. In logic it is usually assumed that sentences are atomic when they do not contain logical connectives like "and". And formal (Montague-style) semantics makes this more precise, since logic may be hidden in linguistic form. But of course humans don't start out with language. We have some sort of mental activity, which we somehow synthesize into language, and similar thoughts/propositions can be expressed alternatively with an atomic or a complex sentence. So atomic sentences seem definable, but not abstract atomic propositions as objects of belief and desire.
A bit late, a related point. Let me start with probability theory. Probability theory is considerably more magical than logic, since only the latter is "extensional" or "compositional"; the former is not. This just means that the truth values of A and B determine the truth value of complex statements like A∧B ("A and B"). The same is not the case for probability theory: the probabilities of A and B do not determine the probability of A∧B, they only constrain it to a certain range of values.
For example, if A and B have probabilities 0.6 and 0.5 respectively, the ...
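The range in question can be sketched in a few lines of Python (the function name and example numbers are mine, just for illustration; the bounds are the standard Fréchet bounds on a conjunction):

```python
# Logic is compositional, probability is not: P(A) and P(B) alone
# only constrain P(A and B) to an interval, they don't determine it.

def conjunction_bounds(p_a: float, p_b: float) -> tuple[float, float]:
    """Return (lower, upper) bounds on P(A and B)."""
    lower = max(0.0, p_a + p_b - 1.0)  # A and B must overlap at least this much
    upper = min(p_a, p_b)              # the conjunction can't beat either conjunct
    return lower, upper

lo, hi = conjunction_bounds(0.6, 0.5)
print(lo, hi)
```

So with 0.6 and 0.5, P(A∧B) can be anywhere from 0.1 (maximal anti-correlation) to 0.5 (B implies A).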
This is an interesting result!
It seems to support LeCun's argument against autoregressive LLMs more than "simulator theory".
One potential weakness of your method is that you didn't use a base (foundation) model, but apparently the heavily fine-tuned gpt-3.5-turbo. The different system prompts probably can't completely negate the effect of this common fine-tuning. It would be interesting to see how the results hold up when you use code-davinci-002, the GPT-3.5 base model, which has no instruction tuning or RLHF applied. Though this model is no longer avail
Okay, that clarifies a lot. But the last paragraph I find surprising.
re: (2), I just don't see LLMs as providing much evidence yet about whether the concepts they're picking up are compact or correct (cf. monkeys don't have an IGF concept).
If LLMs are good at understanding the meaning of human text, they must be good at understanding human concepts, since concepts are just the meanings of words the LLM understands. Do you doubt they are really understanding text as well as it seems? Or do you mean they are picking up other, non-human, concepts as well, ...
Inner alignment is a problem, but it seems less of a problem than in the monkey example. The monkey values were trained using a relatively blunt form of genetic algorithm, and monkeys aren't capable of learning the value "inclusive genetic fitness" anyway, since they can't understand such a complex concept (and humans didn't understand it historically). By contrast, advanced base LLMs are presumably able to understand the theory of CEV about as well as a human, and they could be fine-tuned by using that understanding, e.g. with something like Constitutional...
The fragility-of-value posts are mostly old. They were written before GPT-3 came out (which seemed very good at understanding human language and, consequently, human values), before instruction fine-tuning was successfully employed, and before forms of preference learning like RLHF or Constitutional AI were implemented.
With this background, many arguments in articles like Eliezer's Complexity of Value (2015) sound now implausible, questionable or in any case outdated.
I agree that foundation LLMs are just able to predict what a caring human sounds like, but ...
Regarding the last point. Can you explain why existing language models, which seem to care more than a little about humans, aren't significant evidence against your view?
Yeah, championing seems to border on deception, bullshitting, or even lying. But the group rationality argument says that it can be optimal when a few members of a group "over focus" (from an individual perspective) on an issue. These pull in different directions.
Looking back, I would say this post has not aged well. Already LaMDA or InstructGPT (language models fine-tuned with supervised learning to follow instructions, essentially ChatGPT without any RLHF applied), are in fact pretty safe Oracles in regard to fulfilling wishes without misinterpreting you, and an Oracle AI is just a special kind of Genie whose actions are restricted to outputting text. If you tell InstructGPT what you want, it will very much try to give you just what you want, not something unintended, at least if it can be produced using text.
May...
In your ABC example we rely on the background information that
So the background information is that the events are mutually exclusive and exhaustive. But only then do probabilities need to add to one. It's not a general fact that "probabilities add to 1". So taking the geometric average does not itself violate any axioms of probability. We "just" need to update the three geometric averages on this background knowledge. Plausibly how this should be done in this case is to normalize them such that t...
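The normalization step I have in mind can be sketched like this (a toy example; the forecaster numbers and variable names are mine, purely for illustration):

```python
import math

# Pool two forecasters' probabilities for three mutually exclusive,
# exhaustive outcomes A, B, C by taking pointwise geometric means,
# then renormalize so the pooled values respect "must sum to 1".

forecaster_1 = [0.6, 0.3, 0.1]
forecaster_2 = [0.4, 0.4, 0.2]

geo = [math.sqrt(p * q) for p, q in zip(forecaster_1, forecaster_2)]
total = sum(geo)                   # generally != 1 before updating
pooled = [g / total for g in geo]  # conditionalize on exclusivity/exhaustiveness

print([round(p, 4) for p in pooled])
```

The raw geometric means sum to slightly less than 1; dividing by their sum is the update on the background knowledge that exactly one of A, B, C obtains.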
Apparently LLMs automatically correct mistakes in CoT, which seems to run counter to LeCun's argument.
In Peano arithmetic, the induction axiom (not axiom schema) basically says "... and nothing else is a natural number". It can only be properly formulated in second-order logic, and the result is that Peano arithmetic becomes "categorical", which means it has only one (the intended) model up to isomorphism. The real or complex number systems and geometry also have categorical axiomatizations. Standard (first-order) ZFC is not categorical, since it allows both for models that are larger than intended (like first-order Peano arithmetic) and smaller than inten...
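For reference, the second-order induction axiom can be written as a single sentence (standard notation, with P ranging over all properties and S the successor function):

```latex
\forall P\,\Bigl[\bigl(P(0) \land \forall n\,(P(n) \to P(S(n)))\bigr) \to \forall n\, P(n)\Bigr]
```

Because P ranges over all subsets of the domain rather than only first-order definable ones, every model must consist exactly of 0, S(0), S(S(0)), ..., which is what makes the axiomatization categorical.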
But do look at introductions to Bayesian statistics versus Bayesian epistemology. There is hardly any overlap. One thing they have in common is that they both agree that it makes sense to assign probabilities to hypotheses. But otherwise? I personally know quite a lot about Bayesian epistemology, but basically none of that appears to be of interest to Bayesian statisticians.
It is worth thinking about why ChatGPT, an Oracle AI which can execute certain instructions, does not fail text equivalents of the cauldron task.
It seems the reason why it doesn't fail is that it is pretty good at understanding the meaning of expressions. (If an AI floods the room with water because this maximizes the probability that the cauldron will be filled, then the AI hasn't fully understood the instruction "fill the cauldron", which only asks for a satisficing solution.)
And why is ChatGPT so good at interpreting the meaning of instructions? Because...
Yeah. And many people do indeed recommend one should add pasta only after the water is boiling. For example:
Don't add the noodles until the water has come to a rolling boil, or they'll end up getting soggy and mushy.
Except ... they don't get soggy.
I would know, I made a lot of pasta in spring of 2020!
While we're at it, they also say
Bring a large pot of water to a boil.
which other sources also tend to recommend. This is usually justified by saying that using a lot of water makes the pasta stick together less. But as I said, I consider myse...
(I don't know much about physics, but...) Raising the boiling point just means raising the maximum temperature of the water. Since during normal (saltless) cooking that maximum is usually reached at some time x before the pasta is done, raising the boiling point with salt means the water becomes hotter overall after x, which means you have to cook for a (tiny bit) shorter time. What makes the pasta done is not the boiling, just the temperature of the water and the time it spends at that temperature.
And now OpenAI is removing access to code-davinci-002, the GPT-3.5 foundation model: https://twitter.com/deepfates/status/1638212305887567873
The GPT-4 base model will apparently also not be available via the API. So it seems the most powerful publicly available foundation model is now Facebook's leaked LLaMA.
For any proposition A which you assert, it is possible that someone else has another "perspective" and asserts not-A instead, each acting as if it were the truth. So the existence of possible perspectives is not specific to politics or truth seeking. Sure, it is possible to be overconfident relative to the evidence you have, but I don't recommend universal extensive hedging for political examples merely because they are political. If you disagree with his examples, you are surely able to insert similar examples where (what you believe to be) epistemic mistak...
I find him using political examples not suspicious at all. After all, politics is an area where epistemic mistakes can have large to extremely large negative effects. He could have referred to non-political examples, but those tend to be comparatively inconsequential.
My comment was mostly based on the CAI paper, where they compared the new method against their earlier RLHF model and reported more robustness against jailbreaking. Now OpenAI's GPT-4 (though not Microsoft's Bing version) also seems to be a lot more robust than GPT-3.5, but I don't know why.
How about making a follow-up with GPT-4 and testing how it improved? From OpenAI, GPT-4 is only available via ChatGPT+, but Bing also has a free variant of it. Though the latter is still a bit limited (15 model replies per conversation) and currently based on a waitlist.
I also have not used them since my voting power increased, simply because unduly exaggerating my voice is unethical. But once sufficiently many other people do it, or are suspected of doing it, this inhibition would go away.
unduly exaggerating my voice is unethical
The users of the forum have collectively granted you a more powerful voice through our votes over the years. While there are ways you could use it unethically, using it as intended is a good thing.
It is not clear to me whether it helps with the cases you mention. It gives more voting power to senior or heavy users. But it also incentivizes users to abuse their strong votes. This is similar to how score or range voting systems encourage voters to exaggerate the strength of their preferences and to give extreme value votes as often as possible.
I think this already happens in the EA Forum, where controversial topics like the Bostrom email seemed to encourage mind-killed tribe voting. Sometimes similarly reasonable arguments would get either heavily vot...
Interesting. Claude being more robust against jailbreaking probably has to do with the fact that Anthropic doesn't use RLHF, but a sort of RL on synthetic examples of automatic and iterated self-critique, based on a small number of human-written ethical principles. The method is described in detail in their paper on "Constitutional AI". In a recent blog post, OpenAI explicitly mentions Constitutional AI as an example of how they plan to improve their fine-tuning process in the future. I assume the Anthropic paper simply came out too late to influence OpenAI's...
Thank you, I didn't know that.
The fact that strong votes have such a disproportionate effect (which relies on the restraint of the users not to abuse it) reduces my trust in the Karma/agreement voting system.
I think it should increase your trust in the voting system! Most of the rest of the internet has voting dominated by whatever new users show up whenever a thing gets popular, and this makes it extremely hard to interpret votes in different contexts. E.g. on Reddit the most upvoted things in most subreddits often don't actually have much to do with the subreddit, they are just the things that blew up to the frontpage and so got a ton of people voting on them. Weighted voting helps a lot in creating some stability in voting and making things less internet-popularity weighted (it also does some other good things, and has some additional costs, but this is, I think, one of the biggest ones).
This is a tangent, but any explanation why strong votes now give/deduct 4 points? This seems excessive to me.
Note that this is not identical to the original three prompts, which worked in the opposite direction.
Nice post. Non-transitivity of concept extrapolation is overall plausible to me, but not so much in your dog example. Though I couldn't come up with a more intuitive case.
Not a new phenomenon. Fine-tuning leads to mode collapse, this has been pointed out before: Mysteries of mode collapse
Okay, these points seem reasonable.
One other worry I forgot to mention, however: I could be totally wrong here, but presumably most applications of this kind of "standpoint epistemology" in the last ten years come from researchers I would suspect of being far-left activists. If so, those people would of course be very eager to interview people whom their political worldview takes to be victims of oppression, i.e. especially black people and women. They would very rarely interview white men or Asians or police officers about their "experiences" or "p...
In the context of qualitative interview questions like this, straightforwardly taking the answers to be about "the problems black people face" or "the problems the police faces" presupposes that individual opinions on what these problems are, are neither incorrect, confused, nor otherwise inaccurate. Again, imagine interviewing pre-war Christian Germans to find out "the problems Germans face with Jews".
Qualitative interviews are even less reliable than opinion polls, since in those polls we get at least statistically significant result...
Again, imagine interviewing pre-war Christian Germans to find out "the problems Germans face with Jews".
I don't have any clear imaginations of what would happen in this case?
Like I know that antisemitism was rampant there at the time, so probably you would get a lot of angry negative opinions. But what would they be? "My pastor's friend's niece was killed by a Jew"? "Jews control the banking system which is evil and also they are breeding like rabbits"? "There's a group of child prostitutes downtown, and their pimp is Jewish"?
I would like to know what the ...
You talk a lot about experiences here, but all these answers express beliefs, not experiences. Beliefs can be arbitrarily biased -- just think about modifications of your method: Instead of asking black people about their "experiences" with the police, you could ask police officers about their "experiences" with black people. Or you could ask people without migration background about their "experiences" with immigrants. You could have asked Christian Germans in 1938 about their "experiences" with Jews, etc. What you will get is a bunch of opinions which co...
Great points. Perhaps an acceptable substitute for advice is offering help. For example: "Would you like me to go to the doctor's with you?" Of course, offers for help shouldn't be given in a way that sounds like advice. And listening/empathy should probably come first.
I find the common downvoting-instead-of-arguing mentality frustrating and immature. If I don't have the energy for a counterargument, I simply don't react at all. Just doing downvotes is intellectually worthless booing. As feedback it's worse than useless.
But it is clearly "morally" bad? It is just not a morally wrong action. Actions are wrong insofar as their expected outcomes are bad, but an outcome can be bad without being the result of anyone's action.
(You might say that morality is only a theory of actions. Then saying that a world, or any outcome, is "morally" bad, would be a category mistake. Fine then, call "ethics" the theory both of good and bad outcomes, and of right and wrong actions. Then a world where everyone suffers is bad, ethically bad.)
The terms "right" and "wrong" apply just to actions. This world is bad, without someone doing something wrong.
This insight can be reversed: If you can't understand the mathematical details of a theory (which will be true for many of us, math is often hard), don't waste undue time on understanding the high-level features. Luckily, many interesting theories outside physics have much simpler math than quantum mechanics.
If there is none, it would mean a world where everyone suffers horribly forever is not objectively worse than one where everyone is eternally happy. But I think that's just not compatible with what words like "good" or "worse" mean! If we imagine a world where everything is the same as in ours, except that people call things "bad" we call "good", and "good" what we call "bad" -- would that mean they believe suffering is good? Of course not. They just use different words for the same concepts we have! Believing that, other things being equal, suffering is b...
Fixed! (Video reviews, so unfortunately there is no Ctrl-F to find the relevant part.)