How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Bogdan Ionut Cirstea (20h):

It might be useful to have a look at "Language models show human-like content
effects on reasoning" [https://arxiv.org/abs/2207.07051]. The authors empirically
test for human-like incoherences / biases in LMs performing some logical reasoning
tasks (Twitter summary thread
[https://twitter.com/davisblalock/status/1553636042489470976]; video
presentation [https://www.youtube.com/watch?v=cQinPp2UJ1I]).

I think does not have to be a variable which we can observe, i.e. it is not necessarily the case that we can deterministically infer the value of from the values of and . For example, let's say the two binary variables we observe are and . We'd intuitively want to consider a causal model where is causing both, but in a way that makes all triples of variable values have nonzero probability (which is t...

I see! You are right, then my argument wasn't correct! I edited the post
partially based on your argument above. New version:

I basically agree with this: ruling out unobserved variables is an unusual way
to use causal graphical models.
Also, taking the set of variables that are allowed to be in the graph to be the
set of variables defined on a given sample space makes the notion of
"intervention" more difficult to parse (what happens to F:=(X,Y) after you
intervene on X?), though it might be possible with cyclic causal relationships.
So basically, "causal variables" in acyclic graphical models are neither a
subset nor a superset of observed random variables.

I agree with you regarding Lebesgue measure 0. My impression is that the Pearl paradigm has some [statistics -> causal graph] inference rules which basically do the job of ruling out causal graphs for which having certain properties seen in the data has Lebesgue measure 0. (The inference from two variables being independent to them having no common ancestors in the underlying causal graph, stated earlier in the post, is also of this kind.) So I think it's correct to say "X has to cause Y", where this is understood as a valid inference inside the Pearl (or Garra...

I don't understand why 1 is true – in general, couldn't the variable $W$ be defined on a more refined sample space? Also, I think all $4$ conditions are technically satisfied if you set $W=X$ (or well, maybe it's better to think of it as a copy of $X$).

I think the following argument works though. Note that the distribution of $X$ given $(Z,Y,W)$ is just the deterministic distribution $X=Y \oplus Z$ (this follows from the definition of $Z$). By the structure of the causal graph, the distribution of $X$ given $(Z,Y,W)$ must be the same as the distribution of $X$...
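To make the deterministic relationship concrete, here is a minimal sketch that enumerates a toy joint distribution. It assumes (this is not stated in the thread) that X and Y are independent fair coins and that Z is defined as X xor Y; the names `joint`, `prob` and `pairwise_independent` are purely illustrative. Under that assumption, X is fully determined by (Y, Z), even though each pair of variables looks independent.

```python
from fractions import Fraction
from itertools import product

# Assumption (not from the thread): X and Y are independent fair coins.
# Z is defined as X xor Y, so each outcome (x, y, x ^ y) has mass 1/4.
joint = {(x, y, x ^ y): Fraction(1, 4) for x, y in product((0, 1), repeat=2)}


def prob(event):
    """Total probability of the outcomes (x, y, z) for which event(x, y, z) holds."""
    return sum(p for outcome, p in joint.items() if event(*outcome))


def pairwise_independent(i, j):
    """Check P(V_i = a, V_j = b) == P(V_i = a) * P(V_j = b) for all bits a, b."""
    return all(
        prob(lambda *v: v[i] == a and v[j] == b)
        == prob(lambda *v: v[i] == a) * prob(lambda *v: v[j] == b)
        for a, b in product((0, 1), repeat=2)
    )


# X is a deterministic function of (Y, Z): x == y ^ z on every outcome with mass.
x_is_determined = all(x == y ^ z for (x, y, z) in joint)

print(x_is_determined)
print([pairwise_independent(i, j) for i, j in ((0, 1), (0, 2), (1, 2))])
```

Only four of the eight possible triples get positive probability here, which is the kind of deterministic special case the measure-zero discussion in this thread is about.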

I agree that 1. is unjustified (and would cause lots of problems for graphical
causal models if it was).
Further, I’m pretty sure the result is not “X has to cause Y” but “this
distribution has measure 0 w.r.t. the Lebesgue measure in models where X does
not cause Y” (and deterministic relationships satisfy this).
Finally, you can enable markdown comments on account settings (I believe)

I took the main point of the post to be that there are fairly general conditions (on the utility function and on the bets you are offered) in which you should place each bet like your utility is linear, and fairly general conditions in which you should place each bet like your utility is logarithmic. In particular, the conditions are much weaker than your utility actually being linear, or than your utility actually being logarithmic, respectively, and I think this is a cool point. I don't see the post as saying anything beyond what's implied by this about Kelly betting vs max-linear-EV betting in general.

(By the way, I'm pretty sure the position I outline is compatible with changing usual forecasting procedures in the presence of observer selection effects, in cases where secondary evidence which does not kill us is available. E.g. one can probably still justify [looking at the base rate of near misses to understand the probability of nuclear war instead of relying solely on the observed rate of nuclear war itself].)

I'm inside-view fairly confident that Bob should be putting a probability of 0.01% on surviving conditional on many worlds being true, but it seems possible I'm missing some crucial considerations having to do with observer selection stuff in general, so I'll phrase the rest of this as more of a question.

What's wrong with saying that Bob should put a probability of 0.01% of surviving conditional on many-worlds being true – doesn't this just follow from the usual way that a many-worlder would put probabilities on things, or at least the simplest way for doi...

A big chunk of my uncertainty about whether at least 95% of the future’s potential value is realized comes from uncertainty about "the order of magnitude at which utility is bounded". That is, if unbounded total utilitarianism is roughly true, I think there is a <1% chance in any of these scenarios that >95% of the future's potential value would be realized. If decreasing marginal returns in the [amount of hedonium -> utility] conversion kick in fast enough for 10^20 slightly conscious humans on heroin for a million years to yield 95% of max utili...

Great post, thanks for writing this! In the version of "Alignment might be easier than we expect" in my head, I also have the following:

- Value might not be that fragile. We might "get sufficiently many bits in the value specification right" sort of by default to have an imperfect but still really valuable future.
- For instance, maybe IRL would just learn something close enough to pCEV-utility from human behavior, and then training an agent with that as the reward would make it close enough to a human-value-maximizer. We'd get some misalignment on both steps (

I still disagree / am confused. If it's indeed the case that , then why would we expect ? (Also, in the second-to-last sentence of your comment, it looks like you say the former is an equality.) Furthermore, if the latter equality is true, wouldn't it imply that the utility we get from [chocolate ice cream and vanilla ice cream] is the sum of the utilit...

The link in this sentence is broken for me: "Second, it was __proven recently__ that utilitarianism is the “correct” moral philosophy." Unless this is intentional, I'm curious to know where it directed to.

I don't know of a category-theoretic treatment of Heidegger, but here's one of Hegel: https://ncatlab.org/nlab/show/Science+of+Logic. I think it's mostly due to Urs Schreiber, but I'm not sure – in any case, we can be certain it was written by an Absolute madlad :)

Why should I care about similarities to pCEV when valuing people?

It seems to me that this matters in case your metaethical view is that one should do pCEV, or more generally if you think matching pCEV is evidence of moral correctness. If you don't hold such metaethical views, then I might agree that (at least in the instrumentally rational sense, at least conditional on not holding any metametalevel views that contradict these) you shouldn't care.

> Why is the first example explaining why someone could support taking money from people you value less to g...

I proposed a method for detecting cheating in chess; cross-posting it here in the hopes of maybe getting better feedback than on reddit: https://www.reddit.com/r/chess/comments/xrs31z/a_proposal_for_an_experiment_well_data_analysis/

Thanks for the comments!

In 'The inequivalence of society-level and individual charity' they list the scenarios as 1, 1, and 2 instead of A, B, C, as they later use. Later, it refers incorrectly to preferring C to A with different necessary weights when the second reference is to preferring C to B.

I agree and I published an edit fixing this just now

...The claim that money becomes utility as a log of the amount of money isn't true, but is probably close enough for this kind of use. You should add a note to the effect. (The effects of money are discrete at the very

I don't have much time, so:
While footnote 17 can be read as applying, it isn't very specific.
For all that you are doing math, this isn't mathematics, so base needs to be
specified.
I am convinced that people really do give occasional others a negative weight.
And here are some notes I wrote while finishing the piece (that I would have
edited and tightened up a lot) (it's a bit all over the place):
This model obviously assumes utilitarianism.
Honestly, their math does seem reasonable to account for people caring about
other people (as long as they care about themselves at all on the same scale,
which could even be negative, just not exactly 0.).
They do add an extraneous claim that the numbers for the weight of a person
can't be negative (because they don't understand actual hate? At least
officially.) If someone hates themselves, then you can't do the numbers under
these constraints, nor if they hate anyone else. But this constraint seems
completely unnecessary, since you can sum negatives with positives easily
enough.
I can't see the point of using an adjacency matrix (of a weighted directed
graph).
Being completely altruistic doesn't seem like everyone gets a 1, but that
everyone gets at least that much.
I don't see a reason to privilege mental similarity to myself, since there are
people unlike me that should be valued more highly. (Reaction to footnote 13)
Why should I care about similarities to pCEV when valuing people?
Thus, they care less about taking richer people's money. Why is the first
example explaining why someone could support taking money from people you value
less to give to other people, while not supporting doing so with your own money?
It's obviously true under utilitarianism (which I don't subscribe to), but it
also obscures things by framing 'caring' as 'taking things from others by
force'.
In 'Pareto improvements and total welfare' should a social planner care about
the sum of U, or the sum of X? I don't see how it is clear that it

I'm updating my estimate of the return on investment into culture wars from being an epsilon fraction compared to canonical EA cause areas to epsilon+delta. This has to do with cases where AI locks in current values extrapolated "correctly" except with too much weight put on the practical (as opposed to the abstract) layer of current preferences. What follows is a somewhat more detailed status report on this change.

For me (and I'd guess for a large fraction of ~~autistic altruistics~~ multipliers), the general feels regarding [being a culture war combatan...

Something that confuses me about your example's relevance is that it's like almost the unique case where it's [[really directly] impossible] to succumb to optimization pressure, at least conditional on what's good = something like coherent extrapolated volition. That is, under (my understanding of) a view of metaethics common in these corners, what's good just is what a smarter version of you would extrapolate your intuitions/[basic principles] to, or something along these lines. And so this is almost definitionally almost the unique situation that we'd ex...

Oops I realized that the argument given in the last paragraph of my previous
comment applies to people maximizing their personal welfare or being totally
altruistic or totally altruistic wrt some large group or some combination of
these options, but maybe not so much to people who are e.g. genuinely maximizing
the sum of their family members' personal welfares, but this last case might
well be entailed by what you mean by "love", so maybe I missed the point
earlier. In the latter case, it seems likely that an IQ boost would keep many
parts of love intact initially, but I'd imagine that for a significant fraction
of people, the unequal relationship would cause sadness over the next 5 years,
which with significant probability causes falling out of love. Of course, right
after the IQ boost you might want to invent/implement mental tech which prevents
this sadness or prevents the value drift caused by growing apart, but I'm not
sure if there are currently feasible options which would be acceptable ways to
fix either of these problems. Maybe one could figure out some contract to sign
before the value drift, but this might go against some deeper values, and might
not count as staying in love anyway.

I started writing this but lost faith in it halfway through, and realized I was spending too much time on it for today. I figured it's probably a net positive to post this mess anyway although I have now updated to believe somewhat less in it than the first paragraph indicates. Also I recommend updating your expected payoff from reading the rest of this somewhat lower than it was before reading this sentence. Okay, here goes:

{I think people here might be attributing too much of the explanatory weight on noise. I don't have a strong argument for why the exp...

That was interesting! Thank you!

...There is also another way that super-intelligent AI could be aligned by definition. Namely, if your utility function isn't "humans survive" but instead "I want the future to be filled with interesting stuff". For all the hand-wringing about paperclip maximizers, the fact remains that any AI capable of colonizing the universe will probably be pretty cool/interesting. Humans don't just create poetry/music/art because we're bored all the time, but rather because expressing our creativity helps us to think better. It's probably much har

On the one hand, your definition of "cool and interesting" may be different from
mine, so it's entirely possible I would find a paperclip maximizer cool but you
wouldn't. As a mathematician I find a lot of things interesting that most people
hate (this is basically a description of all of math).
On the other hand, I really don't buy many of the arguments in "value is
fragile". For example:
I simply disagree with this claim. The coast guard and fruit flies both use Lévy
flights [https://en.wikipedia.org/wiki/L%C3%A9vy_flight] because they are
mathematically optimal. Boredom isn't some special feature of human beings, it
is an approximation to the best possible algorithm for solving the exploration
problem. Super-intelligent AI will have a better approximation, and therefore
better boredom.
EY seems to also be worried that super-intelligent AI might not have qualia, but
my understanding of his theory of consciousness is that "has qualia" is
synonymous with "reasons about coalitions of coalitions", so I'm not sure how an
agent can be good at that and not have qualia.
The most defensible version of "paperclip maximizers are boring" would be
something like this video [https://www.youtube.com/watch?v=cg_l28DsJoI]. But
unlike MMOs, I don't think there is a single "meta" that solves the universe
(even if all you care about is paperclips). Take a look at this list of
undecidable problems
[https://en.wikipedia.org/wiki/List_of_undecidable_problems] and consider
whether any of them might possibly be relevant to filling the universe with
paperclips. If they are, then an optimal paperclip maximizer has an infinite set
of interesting math problems to solve in its future.

more on 4: Suppose you have horribly cyclic preferences and you go to a rationality coach to fix this. In particular, your ice cream preferences are vanilla>chocolate>mint>vanilla. Roughly speaking, Hodge is the rationality coach that will tell you to consider the three types of ice cream equally good from now on, whereas Mr. Max Correct Pairs will tell you to switch one of the three preferences. Which coach is better? If you dislike breaking cycles arbitrarily, you should go with Hodge. If you think losing your preferences is worse than that, go with Max. Also, Hodge has the huge advantage of actually being done in a reasonable amount of time :)
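A minimal sketch of the two "coaches" on the ice cream cycle, under stated assumptions: every pair is compared exactly once with unit weight, so the least-squares (HodgeRank-style) score of an item reduces to the average of its net pairwise flow, and "Max Correct Pairs" is implemented as a brute-force search over all rankings. The function names `hodge_scores` and `max_correct_pairs` are made up for illustration.

```python
from itertools import permutations


def hodge_scores(items, prefs):
    """Least-squares scores, assuming each pair is compared once with unit weight.

    prefs is a list of (preferred, dispreferred) pairs. In this special case the
    minimizer of sum over pairs of (s_i - s_j - Y_ij)^2 with sum(s) == 0 is
    s_i = (1/n) * sum_j Y_ij.
    """
    flow = {a: {b: 0.0 for b in items} for a in items}
    for winner, loser in prefs:
        flow[winner][loser] += 1.0
        flow[loser][winner] -= 1.0
    return {a: sum(flow[a].values()) / len(items) for a in items}


def max_correct_pairs(items, prefs):
    """Brute force: the ranking that agrees with the most stated preferences."""
    def satisfied(order):
        rank = {item: pos for pos, item in enumerate(order)}  # smaller = better
        return sum(rank[w] < rank[l] for w, l in prefs)

    best = max(permutations(items), key=satisfied)
    return best, satisfied(best)


items = ["vanilla", "chocolate", "mint"]
prefs = [("vanilla", "chocolate"), ("chocolate", "mint"), ("mint", "vanilla")]

print(hodge_scores(items, prefs))       # every flavor gets the same score
print(max_correct_pairs(items, prefs))  # some ranking that keeps 2 of 3 preferences
```

The brute-force search tries every permutation, so its cost grows factorially with the number of items, while the score computation is a single pass over the comparisons. That matches the "done in a reasonable amount of time" remark, at least for this toy formulation.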

Great explanation, I feel substantially less confused now. And thank you for
adding two new shoulder advisors
[https://www.lesswrong.com/posts/X79Rc5cA5mSWBexnd/shoulder-advisors-101] to my
repertoire :D

3. Ahh okay thanks, I have a better picture of what you mean by a basis of possibility space now. I still doubt that utility interacts nicely with this linear structure though. The utility function is linear in lotteries, but this is distinct from being linear in possibilities. Like, if I understand your idea on that step correctly, you want to find a basis of possibility-space, not lottery space. (A basis on lottery space is easy to find -- just take all the trivial lotteries, i.e. those where some outcome has probability 1.) To give an example of the con...

Thank you for the thoughtful reply!
3. I agree with your point, especially that u(chocolate ice cream and vanilla
ice cream) ≠ u(chocolate ice cream) + u(vanilla ice cream) should be true.
But I think I can salvage my point by making a further distinction. When I write
u(chocolate ice cream) I actually mean u(emb(chocolate ice cream)), where emb is
a semantic embedding [https://github.com/UKPLab/sentence-transformers] that takes
sentences to vectors. Already at the level of the embedding we probably have
emb(chocolate ice cream and vanilla ice cream) ≠ emb(chocolate ice cream) +
emb(vanilla ice cream), and that's (potentially) a good thing! Because if we
structure our embedding in such a way that emb(chocolate ice cream) +
emb(vanilla ice cream) points to something that is actually comparable to the
conjunction of the two, then our utility function can just be naively linear in
the way I constructed it above, u(emb(chocolate ice cream and vanilla ice
cream)) = u(emb(chocolate ice cream)) + u(emb(vanilla ice cream)). I belieeeeeve
that this is what I wanted to gesture at when I said that we need to identify an
appropriate basis in an appropriate space (i.e. where emb(chocolate ice cream
and vanilla ice cream) = emb(chocolate ice cream) + emb(vanilla ice cream), and
whatever else we might want out of the embedding). But I have a large amount of
uncertainty around all of this.

I liked the post; here are some thoughts, mostly on the "The futility of computing utility" section:

1)

> If we're not incredibly unlucky, we can hope to sort N-many outcomes with O(N log N) comparisons.

I don't understand why you need to not be incredibly unlucky here. There are plenty of deterministic algorithms with this runtime, no?

2) I think that in step 2, once you know the worst and the best outcome, you can skip to step 3 (i.e. the full ordering does not seem to be needed to enter step 3). So instead of sorting in n log n time, you could find...
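Points 1) and 2) can be illustrated with a small sketch that counts preference comparisons. It assumes outcomes are compared through a hypothetical oracle `worse(a, b)` that returns True when a is strictly less preferred than b; a plain merge sort stands in for "a deterministic algorithm with this runtime", and a single linear scan finds the best and worst outcome without sorting.

```python
import math


def merge_sort(items, worse, counter):
    """Deterministic merge sort from worst to best; counter[0] counts oracle calls."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], worse, counter)
    right = merge_sort(items[mid:], worse, counter)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        counter[0] += 1
        if worse(right[j], left[i]):
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    return merged + left[i:] + right[j:]


def best_and_worst(items, worse, counter):
    """Linear scan for the extremes: exactly 2 * (n - 1) oracle calls."""
    best = worst = items[0]
    for x in items[1:]:
        counter[0] += 1
        if worse(best, x):
            best = x
        counter[0] += 1
        if worse(x, worst):
            worst = x
    return best, worst


# Toy outcomes: integers in a scrambled order, where larger means more preferred.
outcomes = [(7 * i) % 16 for i in range(16)]
worse = lambda a, b: a < b

sort_calls, scan_calls = [0], [0]
ordering = merge_sort(outcomes, worse, sort_calls)
extremes = best_and_worst(outcomes, worse, scan_calls)

n = len(outcomes)
print(sort_calls[0], "comparisons to sort; bound:", n * math.ceil(math.log2(n)))
print(scan_calls[0], "comparisons to find", extremes)
```

No luck is involved in either routine: the merge sort never exceeds n * ceil(log2 n) comparisons on any input, and the scan always uses 2 * (n - 1).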

Thanks for the comment! (:
1. True, fixed it! I was confused there for a bit.
2. This is also true. I wrote it like this because the proof sketch on
Wikipedia
[https://en.wikipedia.org/wiki/Von_Neumann%E2%80%93Morgenstern_utility_theorem]
included that step. And I guess if step 3 can't be executed (complicated),
then it's nice to have the sorted list as a next-best-thing.
3. Those are interesting points and I'm not sure I have a good answer (because
the underlying problems are quite deep, I think). My statement about
linearity in semantic embeddings is motivated by something like the famous "
King – Man + Woman = Queen
[https://www.technologyreview.com/2015/09/17/166211/king-man-woman-queen-the-marvelous-mathematics-of-computational-linguistics/]
" from word2vec. Regarding linearity of the utility function - I think this
should be given by definition, or? (Hand-wavy: Using this
[https://en.wikipedia.org/wiki/Von_Neumann%E2%80%93Morgenstern_utility_theorem#:~:text=E(u(M))%2C%7D-,where,-E(u(L]
we can write u(A) = Eu(A) = Eu(0.5×(2A) + 0.5×(0)) = 0.5×u(2A) + 0×u(0) = 0.5×u(2A), and so
on).
But your point is well-taken, the semantic embedding is not actually always
linear [https://aclanthology.org/C16-1332/]. This requires some more
thought.
4. Ahhh very interesting, I'd not have expected that intuitively, in particular
after reading the comment from @cousin_it above. I wonder how an explicit
solution with the Hodge decomposition can be reconciled with the NP-hardness
of the problem :thinkies:

If you have money x after n-1 steps, then betting a fraction f on the n'th step gives you expected money (1-f)x+f2px. Given p>0.5, this is maximized at f=1, i.e. betting everything, which gives the expectation 2px. So conditional on having money x after n-1 steps, to maximize expectation after n steps, you should bet everything. Let X_i be the random variable giving the amount of money you have after i steps under your betting strategy. We have (one could also write down a continuous version of the same condi...

Or maybe to state a few things a bit more clearly: we first showed that
$E[X_n \mid X_{n-1}=x] \le 2px$, with equality iff we bet everything on step n.
Using this, note that

$$E[X_n] = \sum_x P(X_{n-1}=x)\,E[X_n \mid X_{n-1}=x] \le \sum_x P(X_{n-1}=x)\,2px = 2p\sum_x P(X_{n-1}=x)\,x = 2p\,E[X_{n-1}],$$

with equality iff we bet everything on step n conditional on any value of
$X_{n-1}$. So regardless of what you do for the first n-1 steps, what you should
do on step n is to bet everything, and this gives you the expectation
$E[X_n]=2pE[X_{n-1}]$. Then finish as before.
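As a sanity check on this recursion, here is a minimal sketch that computes the expected final wealth exactly by enumerating all win/loss sequences. It makes a simplifying assumption the argument above does not need: the fraction bet on each step is fixed in advance rather than chosen adaptively. The function name `expected_wealth` and the value p = 0.6 are illustrative only.

```python
from itertools import product


def expected_wealth(p, fractions):
    """Exact E[X_n] from a $1 start, betting fractions[i] of current wealth on step i.

    A win (probability p) doubles the stake, a loss forfeits it.
    """
    total = 0.0
    for outcome in product((True, False), repeat=len(fractions)):
        prob, wealth = 1.0, 1.0
        for won, f in zip(outcome, fractions):
            prob *= p if won else 1 - p
            wealth *= (1 + f) if won else (1 - f)
        total += prob * wealth
    return total


p = 0.6
all_in = expected_wealth(p, [1.0, 1.0, 1.0])
half = expected_wealth(p, [0.5, 0.5, 0.5])
mixed = expected_wealth(p, [1.0, 0.2, 1.0])

print(all_in, (2 * p) ** 3)                # all-in matches (2p)^n
print(half, (1 + 0.5 * (2 * p - 1)) ** 3)  # fixed fraction f gives (1 + f(2p-1))^n
print(all_in > mixed > half)
```

For fixed fractions each step multiplies the expectation by 1 + f(2p-1), which is at most 2p when p > 0.5, in line with the inequality above.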

You can prove e.g. by (backwards) induction that you should bet everything every time. With the odds being p>0.5 and 1-p, if the expectation of whatever strategy you are using after n-1 steps is E, then the maximal expectation over all things you could do on the n'th step is p2E (you can see this by writing the expectation as a conditional sum over the outcomes after n-1 steps), which corresponds uniquely to the strategy where you bet everything in any situation on the n'th step. It then follows that the best you can do on the (n-1)th step is also to ma...

Sorry, I'm confused.
I got 10^5 from 1/(p-1/2).

"Or perhaps even: that *preventing* humans from being born is as bad as killing living humans."

I'm not sure if this is what you were looking for, but here are some thoughts on the "all else equal" version of the above statement. Suppose that Alice is the only person in the universe. Suppose that Alice would, conditional on you not intervening, live a really great life of 100 years. Now on the 50th birthday of Alice, you (a god-being) have the option to painlessly end Alice's life, and in place of her to create a totally new person, let's call this person Bob...

It could just be that a world with additional happy people is better according to my utility function, just like a world with fewer painlessly killed people per unit of time is better according to my utility function. While I agree that goodness should be "goodness for someone" in the sense that my utility function should be something like a function only of the mental states of all moral patients (at all times, etc.), I disagree with the claim that the same people have to exist in two possible worlds for me to be able to say which is better, which is what...

Not quite—but I would say that it is not possible to describe one world as
“better” than another in any quantifiable or reducible way (as distinct from
“better, according to my irreducible and arbitrary judgment”—to which you are,
of course, entitled), unless the two worlds contain the same people (which,
please note, is only a necessary, not a sufficient, criterion).
I do not believe that aggregation of well-being across individuals is possible
or coherent.
(Incidentally, I am also fairly sure that most people don’t have utility
functions, period, but I imagine that your use of the term was metaphorical, and
in practice should be read merely as “preferences” or something similar.)
Come now, this is not a sensible model of how we make decisions. If I must
choose between (a) stealing my mother’s jewelry in order to buy drugs and (b)
giving a homeless person a sandwich, there are all sorts of ethical
considerations we may bring to bear on this question, but “choosing between
futures with very different sets of moral patients” is simply irrelevant to the
question. If your decision procedure in a case like this involves the
consideration of far-future outcomes, requires the construction of utility
aggregation procedures across large numbers of people, etc., etc., then your
ethical framework is of no value and is almost certainly nonsense.

I think this comment is incorrect (in the stated generality). Here is a simple counterexample. Suppose you have a starting endowment of $1, and that you can bet any amount at 0.50001 probability of doubling your bet and 0.49999 probability of losing everything you bet. You can bet whatever amount of your money you want a total of n times. (If you lost everything in some round, we can think of this as you still being allowed to bet 0 in remaining future rounds.) The strategy that maximizes expected linear utility is the one where you bet everything every time.

It depends on n. If n is small, such as n=1, then you should bet a lot. In the
limit of n large, you should use the Kelly criterion. The crossover is about
n=10^5. Which is why I said that it depends on having many opportunities.