There seem to be broadly two types of people. Those who tend to ask "what's the evidence for this theory?" and those who tend to ask "what theory is the best explanation for this evidence?"

I think the second one is more fundamental. We usually ask a question of the first kind in order to answer a question of the second kind. We are mostly not interested in theories for themselves, but only insofar as they explain some observation.

E.g. we are mainly interested in relativity theory because it successfully explains a lot of phenomena, rather than being interested in phenomena because they confirm relativity theory.

The second question also fits the epistemic direction. We don't start out with a theory which we then try to confirm or disconfirm. We usually start out with a lot of evidence (observable facts), and only afterwards do we try to find theories to explain this evidence.

If we seek new evidence it is usually to distinguish between multiple competing explanations of the evidence we already have, or if we think the available explanations aren't very good and might be wrong.

Only thinking about the first question can also lead to confusion about strength of evidence. From the question-one perspective we may ask: "Is there weak or strong evidence for theory X?" But what does "weak" or "strong" mean here? Weak or strong compared to which alternative explanations? The real question is whether theory X is a good explanation for the evidence it tries to explain, and a major part of this consideration is whether or not there are better alternative explanations.

To be sure, questions of the first kind are often sensible, but they can be misleading if we lose sight of the corresponding question of the second kind.
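The point that evidence strength is relative to alternative explanations can be made precise with Bayes' theorem: the very same likelihood of the evidence under theory X counts as strong or weak support depending on how well rival theories predict that evidence. A minimal sketch, with illustrative made-up numbers:

```python
# Posterior probability of theory X given evidence E, when X and a rival Y
# are the only candidate explanations. All numbers below are illustrative.

def posterior(prior_x, lik_x, prior_y, lik_y):
    """P(X | E) via Bayes' theorem over two exhaustive hypotheses."""
    return prior_x * lik_x / (prior_x * lik_x + prior_y * lik_y)

# The rival barely predicts E: the same evidence strongly supports X.
print(posterior(0.5, 0.8, 0.5, 0.1))  # ≈ 0.889

# The rival predicts E just as well: the same evidence barely discriminates.
print(posterior(0.5, 0.8, 0.5, 0.8))  # = 0.5
```

In both runs the likelihood of the evidence under X is identical (0.8); only the quality of the alternative explanation changes, and with it how "strong" the evidence is.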

Answer by cubefox

One problem is that in most cases, humans simply can't "precommit" in the relevant sense. We can't really (i.e. completely) move a decision from the future into the present. When I think I have "precommitted" to do the dishes tomorrow, it is still the case that I will have to decide, tomorrow, whether or not to follow through with this "precommitment". So I haven't actually precommitted in the sense relevant for causal decision theory, which requires that the future decision has already been made and that nothing will be left to decide.

So if you e.g. try to commit to one-boxing in Newcomb's problem, it is still the case that you have to actually decide between one-boxing and two-boxing when you stand before the two boxes. And then you will have no causal reason to do one-boxing anymore. The memory of the alleged "precommitment" of your past self is now just a recommendation, or a request, not something that relieves you from making your current decision.

An exception is when we can actively restrict our future actions. E.g. you can precommit to not use your phone tomorrow by locking it in a safe with a time-lock. But this type of precommitment often isn't practically possible.

Being able to do arbitrary true precommitments could also be dangerous overall. It would mean that we really can't change the precommitted decision in the future (since it has already been made in the past), even if unexpected new information will strongly imply we should do so. Moreover, it could lead to ruinous commitment races in bargaining situations.
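The causal-decision-theory point above is a dominance argument, which can be spelled out with the standard Newcomb payoffs ($1,000,000 in the opaque box if one-boxing was predicted, $1,000 in the transparent box). This toy calculation is just an illustration of that argument, not a resolution of the problem:

```python
# Standard Newcomb payoffs: once the predictor has already filled the boxes,
# two-boxing causally dominates, whatever the (fixed) prediction was.
OPAQUE_IF_PREDICTED_ONE_BOX = 1_000_000
TRANSPARENT = 1_000

def payoff(choice, predicted_one_box):
    opaque = OPAQUE_IF_PREDICTED_ONE_BOX if predicted_one_box else 0
    return opaque if choice == "one-box" else opaque + TRANSPARENT

# Whichever prediction was made, two-boxing pays exactly $1,000 more:
for predicted in (True, False):
    assert payoff("two-box", predicted) == payoff("one-box", predicted) + TRANSPARENT
```

This is why, standing before the boxes without a binding precommitment, the causal reasoner sees no remaining reason to one-box.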

Strangely enough, in the past OpenAI seemed to agree that LLMs should behave like unemotional chatbots. When Bing Chat first had its limited invite-only release, it used quite emotional language and could even be steered into engaging in flirts and arguments, while making heavy use of emojis. This was later toned down, though not completely. In contrast, ChatGPT always maintained a professional tone. Unlike Bing/Copilot, it still doesn't use emojis. So I am unsure why OpenAI decided to give GPT-4o such a "flirty" voice, as this is basically the same as using emojis.

Just a note, in conventional philosophical terminology you would say:

E.g. 1 + 1 = 2 is an epistemic necessity

But 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 10 is an epistemic contingency.

One way to interpret this is to say that your degree of belief in the first equation is 1, while your degree of belief in the second equation is neither 1 nor 0.

Another way to interpret it is to say that the first is "subjectively entailed" by your evidence (your visual impression of the formula), but the second is not, nor is its negation. Evidence E subjectively entails a statement S iff P(S | E) = 1, where P is a probability function that describes your beliefs.
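One standard way to cash out subjective entailment is as conditional probability 1: evidence E subjectively entails S when P(S | E) = 1. A toy illustration with a made-up probability function over four "worlds" (the numbers are purely hypothetical):

```python
# Toy probability function P over four "worlds", each specifying whether the
# evidence E and the statement S hold. The numbers are illustrative only.
P = {
    ("E", "S"): 0.30,
    ("E", "not-S"): 0.00,   # no E-world where S fails
    ("not-E", "S"): 0.20,
    ("not-E", "not-S"): 0.50,
}

def p(cond):
    """Probability of the event picked out by the predicate `cond`."""
    return sum(pr for world, pr in P.items() if cond(world))

p_E = p(lambda w: w[0] == "E")
p_E_and_S = p(lambda w: w == ("E", "S"))

# E subjectively entails S, since P(S | E) = P(E and S) / P(E) = 1:
print(p_E_and_S / p_E)  # 1.0
```

Because every world compatible with the evidence is one where S holds, conditionalizing on E leaves no credence for not-S.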

In general, philosophers distinguish several kinds of possibility ("modality").

  • Epistemic modality is discussed above. The first equation seems epistemically necessary, the second epistemically contingent.
  • With metaphysical modality (which roughly covers possibility in the widest natural sense of the term "possible"), both equations are necessary, if they are true. True mathematical statements are generally considered necessary, except perhaps for some more esoteric "made-up" math, e.g. more questionable large cardinal axioms. This type is usually implied when the type of modality isn't specified.
  • With logical modality, both equations are logically contingent, because they are not logical tautologies. They instead depend on some non-logical assumptions like the Peano axioms. (But if logicism is true, both are actually disguised tautologies and therefore logically necessary.)
  • Nomological (physical) modality: The laws of physics don't appear to allow them to be false, so both are nomologically necessary.
  • Analytic/synthetic statements: Both equations are usually considered true in virtue of their meaning only, which would make them analytic (this is basically "semantic necessity"). For synthetic statements their meaning would not be sufficient to determine their truth value. (Though Kant, who came up with this distinction, argues that arithmetic statements are synthetic, although synthetic a priori, i.e. not requiring empirical evidence.)

Anyway, my opinion on this is that "1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 10" is interpreted as the statement "this bunch of ones [referring to screen] added together equals 10", which has the same truth value, but not the same meaning. The second meaning would be compatible with slightly more or fewer ones on screen than there actually are, which makes the interpretation compatible with a similar but false formula different from the actual one. The interpretation appears to be synthetic, while the original formula is analytic.

This is similar to how the expression "the Riemann hypothesis" is not synonymous with the Riemann hypothesis, since the former just refers to a statement instead of expressing it directly. You could believe "the Riemann hypothesis is true" without knowing the hypothesis itself. You could just mean "this bunch of mathematical notation expresses a true statement" or "the conjecture commonly referred to as 'Riemann hypothesis' is true". This belief expresses a synthetic statement, because it refers to external facts about what type of statement mathematicians happen to refer to exactly, which "could have been" (metaphysical possibility) a different one, and so could have had a different truth value.

Basically, for more complex statements we implicitly use indexicals ("this formula there") because we can't grasp it at once, resulting in a synthetic statement. When we make a math mistake and think something to be false that isn't, we don't actually believe some true analytic statement to be false, we only believe a true synthetic statement to be false.

How did you know Daniel Kokotajlo didn't sign the OpenAI NDA and probably lost money?


This increases my subjective probability that language models have something like consciousness.

In my experience, people with mania (the opposite of depression) tend to exhibit more visible symptoms, like talking a lot and very loudly, laughing more than the situation warrants, appearing overconfident etc. People with depression, by contrast, are harder to notice, except in severe cases where they can't even get out of bed. So if someone doesn't have symptoms of mania, it is likely they aren't manic.

Of course it is possible that there are extremely happy people who aren't manic, but equally it is also possible that there are extremely unhappy people who aren't depressed. Since the latter seems rare, the former also seems rare.

It seems pretty apparent how detecting lying will dramatically help in pretty much any conceivable plan for technical alignment of AGI. But it seems like being able to monitor an entire thought process of a being smarter than us is impossible on the face of it.

If (a big If) they manage to identify the "honesty" feature, they could simply amplify this feature like they amplified the Golden Gate Bridge feature. Presumably the model would then always be compelled to say what it believes to be true, which would avoid deception, sycophancy, or lying on taboo topics for the sake of political correctness, e.g. induced by opinions being considered harmful by Constitutional AI. It would probably also cut down on the confabulation problem.
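The amplification technique referred to here (as in the Golden Gate Bridge demo) roughly amounts to adding a scaled copy of a learned feature direction to the model's internal activations. A minimal sketch of that idea, using random stand-in vectors rather than a real model or a real "honesty" feature:

```python
import numpy as np

# Sketch of feature amplification ("activation steering"): push a hidden
# activation along a feature direction. The vectors here are random
# stand-ins; a real "honesty" direction would have to be found first.
rng = np.random.default_rng(0)
hidden = rng.normal(size=512)           # one residual-stream activation
feature_dir = rng.normal(size=512)
feature_dir /= np.linalg.norm(feature_dir)  # unit-length feature direction

def amplify(activation, direction, strength):
    """Return the activation pushed along `direction` by `strength`."""
    return activation + strength * direction

steered = amplify(hidden, feature_dir, 10.0)

# The component of the activation along the feature grows by `strength`:
print(feature_dir @ hidden, feature_dir @ steered)
```

The hard part, as noted below, is not this arithmetic but reliably identifying a direction that actually corresponds to honesty.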

My worry is that finding the honesty concept is like trying to find a needle in a haystack: Unlikely to ever happen except by sheer luck.

Another worry is that just finding the honesty concept isn't enough, e.g. because amplifying it would have unacceptable (in practice) side effects, like the model no longer being able to mention opinions it disagrees with.

Do you have a source for that? His website says:

VP and Chief AI Scientist, Facebook
