All of Scott Garrabrant's Comments + Replies

Finite Factored Sets

Sure, if you want to send me an email and propose some times, we could set up a half hour chat. (You could also wait until I post all the math over the next couple weeks.)

Finite Factored Sets

Looks like you copied it wrong. Your B only has one 4.

Finite Factored Sets

I have not thought much about applying this to things other than finite sets. (I looked at infinite sets enough to know there is nontrivial work to be done.) I do think it is good that you are thinking about it, but I don't have any promises that it will work out.

What I meant when I said that this can be done in a categorical way is that I think I can define a nice symmetric monoidal category of finite factored sets such that things like orthogonality can be given nice categorical definitions. (I see why this was a confusing thing to say.)

Finite Factored Sets

If I understand correctly, that definition is not the same. In particular, it would say that you can get nontrivial factorizations of a 5 element set: {{{0,1},{2,3,4}},{{0,2,4},{1,3}}}.

1lcmgcd24dThat misses element 4 right?
>>> from itertools import product
>>> B = [[{0, 1}, {2, 3, 4}], [{0, 2, 3}, {1, 3}]]
>>> list(product(*B))
[({0, 1}, {0, 2, 3}), ({0, 1}, {1, 3}), ({2, 3, 4}, {0, 2, 3}), ({2, 3, 4}, {1, 3})]
>>> [set.intersection(*tup) for tup in product(*B)]
[{0}, {1}, {2, 3}, {3}]
>>> set.union(*[set.intersection(*tup) for tup in product(*B)])
{0, 1, 2, 3}
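For reference, here is a corrected check (illustrative code, not from the thread) using the partitions as written in the parent comment, with both 4s. The union now covers all five elements, but one cell, {2, 4}, contains two elements, so this pair of partitions is not a factorization in the finite-factored-set sense (each choice of one part from each partition must pin down exactly one element), which is why the example separates the two definitions.

from itertools import product

# Partitions as written in the parent comment (note both 4s, unlike the copy above).
B = [
    [{0, 1}, {2, 3, 4}],   # first partition of {0, 1, 2, 3, 4}
    [{0, 2, 4}, {1, 3}],   # second partition of {0, 1, 2, 3, 4}
]

cells = [set.intersection(*choice) for choice in product(*B)]
print(cells)                                  # [{0}, {1}, {2, 4}, {3}]
print(set.union(*cells) == {0, 1, 2, 3, 4})   # True: no element is missed this time
print(all(len(c) == 1 for c in cells))        # False: the cell {2, 4} has two elements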
Finite Factored Sets

When I prove it, I prove and use (a slight notational variation on) these two lemmas.

  1. If , then  for all .
  2. .

(These are also the two lemmas that I have said elsewhere in the comments look suspiciously like entropy.)

These are not trivial to prove, but they might help.

Finite Factored Sets

I think that you are pointing out that you might get a bunch of false positives in your step 1 after you let a thermodynamical system run for a long time, but they are only approximate false positives.

2[comment deleted]1mo
Finite Factored Sets

I think my model has macro states. In game of life, if you take the entire grid at time t, that will have full history regardless of t. It is only when you look at the macro states (individual cells) that my time increases with game of life time.

Finite Factored Sets

As for entropy, here is a cute observation (with unclear connection to my framework): whenever you take two independent coin flips (with probabilities not 0, 1, or 1/2), their xor will always have higher entropy than either of the individual coin flips.
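To make the observation concrete, here is a small numeric check (illustrative code with arbitrary probabilities, not from the thread): for independent flips with heads-probabilities p and q, the xor comes up 1 with probability p(1-q) + q(1-p), and its entropy is at least as large as either individual entropy, strictly larger away from 0, 1, and 1/2.

import math
import random

def entropy(p):
    # Shannon entropy (in bits) of a coin with heads-probability p.
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

random.seed(0)
for _ in range(5):
    p, q = random.uniform(0.01, 0.99), random.uniform(0.01, 0.99)  # "general position"
    r = p * (1 - q) + q * (1 - p)   # probability that the xor of the two flips is 1
    assert entropy(r) + 1e-9 >= max(entropy(p), entropy(q))
    print(round(entropy(p), 3), round(entropy(q), 3), "-> xor:", round(entropy(r), 3))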

Finite Factored Sets

Wait, I misunderstood, I was just thinking about the game of life combinatorially, and I think you were thinking about temporal inference from statistics. The reversible cellular automaton story is a lot nicer than you'd think.

If you take a general reversible cellular automaton (Critters, for concreteness), and have a distribution over computations in general position in which the cells in the initial condition are independent, the cells may not be independent at future time steps.

If all of the initial probabilities are 1/2, you will stay in the uniform distribution,... (read more)
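A toy sketch of this point (illustrative code, not from the thread, and a two-cell reversible map rather than Critters): a reversible update preserves the uniform distribution when the cells are independent fair coins, but biased independent cells become correlated after one step.

import itertools

def step(a, b):
    # A toy reversible update on two cells: (a, b) -> (a, a XOR b).
    return a, a ^ b

def joint_after_step(p):
    # Joint distribution of the two cells after one step, starting from
    # independent Bernoulli(p) cells.
    dist = {}
    for a, b in itertools.product([0, 1], repeat=2):
        pr = (p if a else 1 - p) * (p if b else 1 - p)
        key = step(a, b)
        dist[key] = dist.get(key, 0.0) + pr
    return dist

def cells_independent(dist, tol=1e-12):
    pa = {v: sum(pr for (a, _), pr in dist.items() if a == v) for v in (0, 1)}
    pb = {v: sum(pr for (_, b), pr in dist.items() if b == v) for v in (0, 1)}
    return all(abs(dist.get((a, b), 0.0) - pa[a] * pb[b]) < tol for a in (0, 1) for b in (0, 1))

print(cells_independent(joint_after_step(0.5)))  # True: the uniform distribution stays uniform
print(cells_independent(joint_after_step(0.3)))  # False: biased independent cells become correlated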

2cousin_it1moWait, can you describe the temporal inference in more detail? Maybe that's where I'm confused. I'm imagining something like this: 1. Check which variables look uncorrelated 2. Assume they are orthogonal 3. From that orthogonality database, prove "before" relationships Which runs into the problem that if you let a thermodynamical system run for a long time, it becomes a "soup" where nothing is obviously correlated to anything else. Basically the final state would say "hey, I contain a whole lot of orthogonal variables!" and that would stop you from proving any reasonable "before" relationships. What am I missing?
2Scott Garrabrant1moI think my model has macro states. In game of life, if you take the entire grid at time t, that will have full history regardless of t. It is only when you look at the macro states (individual cells) that my time increases with game of life time.
2Scott Garrabrant1moAs for entropy, here is a cute observation (with unclear connection to my framework): whenever you take two independent coin flips (with probabilities not 0, 1, or 1/2), their xor will always have higher entropy than either of the individual coin flips.
Finite Factored Sets

Yep, there is an obnoxious number of factorizations of a large game of life computation, and they all give different definitions of "before."

2cousin_it1moI think your argument about entropy might have the same problem. Since classical physics is reversible, if we build something like a heat engine in your model, all randomness will be already contained in the initial state. Total "entropy" will stay constant, instead of growing as it's supposed to, and the final state will be just as good a factorization as the initial. Usually in physics you get time (and I suspect also causality) by pointing to a low probability macrostate and saying "this is the start", but your model doesn't talk about macrostates yet, so I'm not sure how much it can capture time or causality. That said, I really like how your model talks only about information, without postulating any magical arrows. Maybe it has a natural way to recover macrostates, and from them, time?
Finite Factored Sets

I don't have a great answer, which isn't a great sign.

I think the scientist can infer things like: "Algorithms reasoning about the situation are more likely to know X but not Y than they are to know Y but not X, because reasonable processes for learning Y tend to learn enough information to determine X, but then forget some of that information." But why should I think of that as time?

I think the scientist can infer things like "If I were able to factor the world into variables, and draw a DAG (without determinism) that is consistent with the distribu... (read more)

4cousin_it1moThanks for the response! Part of my confusion went away, but some still remains. In the game of life example, couldn't there be another factorization where a later step is "before" an earlier one? (Because the game is non-reversible and later steps contain less and less information.) And if we replace it with a reversible game, don't we run into the problem that the final state is just as good a factorization as the initial?
Finite Factored Sets

I partially agree, which is partially why I am saying time rather than causality.

I still feel like there is an ontological disagreement in that it feels like you are objecting to saying the physical thing that is Alice's knowledge is (not) before the physical thing that is Bob's knowledge.

In my ontology:
1) the information content of Alice's knowledge is before the information content of Bob's knowledge. (I am curious if this part is controversial.)

and then,

2) there is in some sense no more to say about the physical thing that is e.g. Alice's knowledge beyon... (read more)

8cousin_it1moNot sure we disagree, maybe I'm just confused. In the post you show that if X is orthogonal to X XOR Y, then X is before Y, so you can "infer a temporal relationship" that Pearl can't. I'm trying to understand the meaning of the thing you're inferring - "X is before Y". In my example above, Bob tells Alice a lossy function of his knowledge, and Alice ends up with knowledge that is "before" Bob's. So in this case the "before" relationship doesn't agree with time, causality, or what can be computed from what. But then what conclusions can a scientist make from an inferred "before" relationship?
Finite Factored Sets

 is the event you are conditioning on, so the thing you should expect is that , which does indeed hold.

3FjolleJagt1moThanks, that makes sense! Could you say a little about why the weak union axiom holds? I've been struggling to prove that from your definitions. I was hoping that hF(X|z,w) ⊆ hF(X|z) would hold, but I don't think that hF(X|z) satisfies the second condition in the definition of conditional history for hF(X|z,w).
Finite Factored Sets

I think I (at least locally) endorse this view, and I think it is also a good pointer to what seems to me to be the largest crux between my theory of time and Pearl's theory of time.

Finite Factored Sets

Hmm, I am not sure what to say about the fundamental theorem, because I am not really understanding the confusion. Is there something less motivating about the fundamental theorem than the analogous theorem about d-separation being equivalent to conditional independence in all distributions compatible with the DAG?

Maybe this helps? (probably not): I am mostly imagining interacting with only a single distribution in the class, and the claim about independence in all probability distributions compatible with the structure can be replaced with instead indep... (read more)

1passinglunatic1moI think the motivation for the representability of some sets of conditional independences with a DAG is pretty clear, because people already use probability distributions all the time, they sometimes have conditional independences and visuals are nice. On the other hand the fundamental theorem relates orthogonality to independences in a family of distributions generated in a particular way. Neither of these things are natural properties of probability distributions in the way that conditional independence is. If I am using probability distributions, it seems to me I'd rather avoid introducing them if I can. Even if the reasons are mysterious, it might be useful to work with models of this type - I was just wondering if there were reasons for doing that are apparent before you derive any useful results. Alternatively, is it plausible that you could derive the same results just using probability + whatever else you need anyway? For example, you could perhaps define X to be prior to Y if, relative to some ordering of functions by "naturalness", there is a more natural f(X,Y) such that X ⊥ f(X,Y) and X ⊥/ f(X,Y) | Y than any g(X,Y) such that Y ⊥ g(X,Y) etc. I have no idea if that actually works! However, I'm pretty sure you'll need something like a naturalness ordering in order to separate "true orthogonality" from "merely apparent orthogonality", which is why I think it's fair to posit it as an element of "whatever else you need anyway". Maybe not.
Finite Factored Sets

Makes sense. I think a bit of my naming and presentation was biased by being so surprised by the "not on OEIS" fact.

I think I disagree about the bipartite graph thing. I think it only feels more natural when comparing to Pearl. The talk frames everything in comparison to Pearl, but if you are not looking at Pearl, I think graphs don’t feel like the right representation here. Comparing to Pearl is obviously super important, and maybe the first introduction should just be about the path from Pearl to FFS, but once we are working within the FFS ontology... (read more)

4paulfchristiano1moI agree that bipartite graphs are only a natural way of thinking about it if you are starting from Pearl. I'm not sure anything in the framework is really properly analogous to the DAG in a causal model.
Finite Factored Sets

I was originally using the name Time Cube, but my internal PR center wouldn't let me follow through with that :)

6gwillen1moThat sounds like the right choice, but a part of me is incredibly disappointed that you didn't go for it.
Finite Factored Sets

Thanks Paul, this seems really helpful.

As for the name, I feel like "FFS" is a good name for the analog of "DAG", which also doesn't communicate that much of the intuition, but maybe doesn't make as much sense as the name of the framework.

9paulfchristiano1moI think FFS makes sense as an analog of DAG, and it seems reasonable to think of the normal model as DAG time and this model as FFS time. I think the name made me a bit confused by calling attention to one particular diff between this model and Pearl (factored sets vs variables), whereas I actually feel like that diff was basically a red herring and it would have been fastest to understand if the presentation had gone in the opposite direction by de-emphasizing that diff (e.g. by presenting the framework with variables instead of factors). That said, even the DAG/FFS analogy still feels a bit off to me (with the caveat that I may still not have a clear picture / don't have great aesthetic intuitions about the domain). Factorization seems analogous to describing a world as a set of variables (and to the extent it's not analogous it seems like an aesthetic difference about whether to take the world or variables as fundamental, rather than a substantive difference in the formalism) rather than to the DAG that relates the variables. The structural changes seem more like (i) replacing a DAG with a bipartite graph, (ii) allowing arrows to be deterministic (I don't know how typically this is done in causal models). And then those structural changes lead to generalizing the usual concepts about causality so that they remain meaningful in this setting. All that said, I'm terrible at both naming things and presenting ideas, and so don't want to make much of a bid for changes in either department.

I was originally using the name Time Cube, but my internal PR center wouldn't let me follow through with that :)

Finite Factored Sets

Here is a more concrete example of me using FFS the way I intend them to be used outside of the inference problem. (This is one specific application, but maybe it shows how I intend the concepts to be manipulated).

I can give an example of embedded observation maybe, but it will have to come after a more formal definition of observation (This is observation of a variable, rather than the observation of an event above):

Definition: Given a FFS F=(S,B), and A, W, X, which are partitions of S, where X={x1,…,xn}, we say A ... (read more)

Finite Factored Sets

I'll try. My way of thinking doesn't use the examples, so I have to generate them for communication.

I can give an example of embedded observation maybe, but it will have to come after a more formal definition of observation (This is observation of a variable, rather than the observation of an event above):

Definition: Given a FFS F=(S,B), and A, W, X, which are partitions of S, where X={x1,…,xn}, we say A observes X relative to W if:
1) A⊥X,

2) A can be expressed in the form A=A0∨S⋯∨SAn, an... (read more)

Finite Factored Sets

Hmm, first I want to point out that the talk here sort of has natural boundaries around inference, but I also want to work in a larger frame that uses FFS for stuff other than inference.

If I focus on the inference question, one of the natural questions that I answer is where I talk about grue/bleen in the talk. 

I think for inference, it makes the most sense to think about FFS relative to Pearl. We have this problem with looking at smoking/tar/cancer, which is: what if we carved the world into variables the wrong way? What if instead of tar/cancer, we had a varia... (read more)

2Scott Garrabrant1moHere is a more concrete example of me using FFS the way I intend them to be used outside of the inference problem. (This is one specific application, but maybe it shows how I intend the concepts to be manipulated). I can give an example of embedded observation maybe, but it will have to come after a more formal definition of observation (This is observation of a variable, rather than the observation of an event above): Definition: Given a FFS F=(S,B), and A, W, X, which are partitions of S, where X={x1,…,xn}, we say A observes X relative to W if: 1) A⊥X, 2) A can be expressed in the form A=A0∨S⋯∨SAn, and 3) Ai⊥W|(S∖xi). (This should all be interpreted combinatorially, not probabilistically.) The intuition of what is going on here is that to observe an event, you are being promised that you 1) do not change whether the event holds, and 3) do not change anything that matters in the case where that event does not hold. Then, to observe a variable, you can basically 2) split yourself up into different fragments of your policy, where each policy fragment observes a different value of that variable. (This whole thing is very updateless.) Example 1: (non-observation) An agent A={L,R} does not observe a coinflip X={H,T}, and chooses to raise either his left or right hand. Our FFS F=(S,B) is given by S=A×X, and B={A,X}. (I am abusing notation here slightly by conflating A with the partition you get on A×X by projecting onto the A coordinate.) Then W is the discrete partition on A×X. In this example, we do not have observation. Proof: A only has two parts, so if we express A as a common refinement of 2 partitions, at least one of these two partitions must be equal to A. However, A is not orthogonal to W given H and A is not orthogonal to W given T. (hF(A|H)=hF(W|H)=hF(A|T)=hF(W|T)={A}). Thus we must violate condition 3. Example 2: (observation) An agent A={LL,LR,RL,RR} does observe a coinflip X={H,T}, and chooses to raise either his left or right hand. We can think of A as actually choosing a polic
4[comment deleted]1mo
Finite Factored Sets

I am not sure what you are asking (indeed I am not sure if you are responding to me or cousin_it).

One thing that I think is going on is that I use "factorization" in two places. Once when I say Pearl is using factorization data, and once where I say we are inferring a FFS. I think this is a coincidence. "Factorization" is just a really general and useful concept.

So the carving into A and B and C is a factorization of the world into variables, but it is not the kind of factorization that shows up in the FFS, because disjoint factors should be independent ... (read more)

Finite Factored Sets

I think that the answers to both the concern about 7 elements and the desire to have questions depend on previous questions come out of thinking about FFS models, rather than FFS.

If you want to have 7 elements in your sample space, that just means you will probably have more than 7 elements in S.

If I want to model a situation where some questions I ask depend on other questions, I can just make a big FFS that asks all the questions, and have the model hide some of the answers. 

For example, let's say I flip a biased coin, and then if heads I roll a biased 6... (read more)
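A minimal sketch of the "ask all the questions, hide some of the answers" idea (my own illustrative code; the variable names and the tails-hides-the-die rule are assumptions about where the truncated example is going): the big factored set asks both the coin question and the die question, and the model only reveals the die answer when the coin came up heads.

import itertools

Coin = ["H", "T"]
Die = [1, 2, 3, 4, 5, 6]
S = list(itertools.product(Coin, Die))   # the big set asks both questions: 12 elements

def visible(s):
    # The model hides the die answer whenever the coin came up tails (an assumed rule).
    coin, die = s
    return (coin, die) if coin == "H" else (coin, None)

# The observed variable is the partition of S induced by `visible`.
observed = {}
for s in S:
    observed.setdefault(visible(s), []).append(s)

for answer, cell in sorted(observed.items(), key=str):
    print(answer, cell)
# Six singleton cells for heads and one six-element cell for tails: the die question
# still lives in the factored set, but its answer is only revealed on heads.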

Finite Factored Sets

Yep, this all seems like a good way of thinking about it.

Finite Factored Sets

Hmm, I doubt the last paragraph about sets of partitions is going to be valuable, but the eigenspace thinking might be useful.

Note that I gave my thoughts about how to deal with the uniform distribution over 4 elements in the thread responding to cousin_it.

Finite Factored Sets

So we are allowing S to have more than 4 elements (although we don't need that in this case), so it is not just looking at a small number of factorizations of a 4-element set. This is because we want an FFS model, not just a factorization of the sample space.

If you factor in a different way, X will not be before Y, but if you do this it will not be the case that X is orthogonal to X XOR Y. The theorem in this example is saying that X being orthogonal to X XOR Y implies that X is before Y.

Finite Factored Sets

Ok, makes sense. I think you are just pointing out that when I am saying "general position," that is relative to a given structure, like FFS or DAG or symmetric FFS.

If you have a probability distribution, it might be well modeled by a DAG, or a weaker condition is that it is well modeled by a FFS, or an even weaker condition is that it is well modeled by a SFFS (symmetric finite factored set). 

We have a version of the fundamental theorem for DAGs and d-separation, we have a version of the fundamental theorem for FFS and conditional orthogonality, and ... (read more)

4cousin_it1moCan you give some more examples to motivate your method? Like the smoking/tar/cancer example for Pearl's causality, or Newcomb's problem and counterfactual mugging for UDT.
Finite Factored Sets

It looks like X and V are independent binary variables with different probabilities in general position, and Y is defined to be X XOR V. (and thus V=X XOR Y).
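To make the construction concrete, here is a minimal sketch (illustrative code with assumed probabilities 0.3 and 0.8, not from the thread): with V independent of X and Y defined as X XOR V, the function X XOR Y just recovers V, so it is independent of X, while X and Y themselves are dependent in general position.

import itertools

px, pv = 0.3, 0.8   # assumed probabilities in general position (not 0, 1, or 1/2)

# Joint distribution over (X, Y), where Y = X XOR V and V is independent of X.
joint = {}
for x, v in itertools.product([0, 1], repeat=2):
    pr = (px if x else 1 - px) * (pv if v else 1 - pv)
    joint[(x, x ^ v)] = joint.get((x, x ^ v), 0.0) + pr

def independent(f, g, tol=1e-12):
    # Check whether f(X, Y) and g(X, Y) are independent under `joint`.
    pf, pg, pfg = {}, {}, {}
    for (x, y), pr in joint.items():
        pf[f(x, y)] = pf.get(f(x, y), 0.0) + pr
        pg[g(x, y)] = pg.get(g(x, y), 0.0) + pr
        pfg[(f(x, y), g(x, y))] = pfg.get((f(x, y), g(x, y)), 0.0) + pr
    return all(abs(pfg.get((a, b), 0.0) - pf[a] * pg[b]) < tol for a in pf for b in pg)

print(independent(lambda x, y: x, lambda x, y: x ^ y))  # True: X is independent of X XOR Y (which is just V)
print(independent(lambda x, y: x, lambda x, y: y))      # False: X and Y are dependent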

Finite Factored Sets

Yeah, you are right. I will change it. Thanks.

Finite Factored Sets

I don't understand what conspiracy is required here.

X being orthogonal to X XOR Y implies X is before Y; we don't get the converse.

Well, imagine we have three boolean random variables. In "general position" there are no independence relations between them, so we can't say much. Constrain them so two of the variables are independent, that's a bit less "general", and we still can't say much. Constrain some more so the xor of all three variables is always 1, that's even less "general", now we can use your method to figure out that the third variable is downstream of the first two. Constrain some more so that some of the probabilities are 1/2, and the method stops working. What I'd like to understand is the intuition, which real world cases have the particular "general position" where the method works.

3acgt1moWhat would such a distribution look like? The version where X XOR Y is independent of both X and Y makes sense but I’m struggling to envisage a case where it’s independent of only 1 variable.
Finite Factored Sets

The swapping within a factor allows for considering rational probabilities to be in general position, and the swapping of factors allows IID samples to be considered in general position. I think this is an awesome research direction to go in, but it will make the story more complicated, since we will not be able to depend on the fundamental theorem, since we are allowing for a new source of independence that is not orthogonality. (I want to keep calling the independence that comes from disjoint factors orthogonality, and not use "orthogonality" to describe the new independences that come from the principle of indifference.)

6cousin_it1moYeah, that's what I thought, the method works as long as certain "conspiracies" among probabilities don't happen. (1/2 is not the only problem case, it's easy to find others, but you're right that they have measure zero.) But there's still something I don't understand. In the general position, if X is before Y, it's not always true that X is independent of X XOR Y. For example, if X = "person has a car on Monday" and Y = "person has a car on Tuesday", and it's more likely that a car-less person gets a car than the other way round, the independence doesn't hold. It requires a conspiracy too. What's the intuitive difference between "ok" and "not ok" conspiracies?
Finite Factored Sets

So you should probably not work with probabilities equal to 1/2 in this framework, unless you are doing so for a specific reason. Just like in Pearlian causality, we are mostly talking about probabilities in general position. I have some ideas about how to deal with probability 1/2 (Have a FFS, together with a group of symmetry constraints, which could swap factors, or swap parts within a factor), but that is outside of the scope of what I am doing here.

To give more detail, the uniform distribution on four elements does not satisfy the compositional semigr... (read more)

1Eigil Rischel22dThanks (to both of you), this was confusing for me as well.
4Scott Garrabrant1moThe swapping within a factor allows for considering rational probabilities to be in general position, and the swapping of factors allows IID samples to be considered in general position. I think this is an awesome research direction to go in, but it will make the story more complicated, since we will not be able to depend on the fundamental theorem, since we are allowing for a new source of independence that is not orthogonality. (I want to keep calling the independence that comes from disjoint factors orthogonality, and not use "orthogonality" to describe the new independences that come from the principle of indifference.)
Finite Factored Sets

If you look at the draft edits for this sequence that is still awaiting approval on OEIS, you'll find some formulas. 
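Until the OEIS edits go through, here is a small brute-force sketch for playing with the counts concretely (my own illustrative code, not the formulas mentioned above; it only searches for two-factor factorizations, which are the only nontrivial ones possible for sets of size up to 7): it finds three nontrivial factorizations of a 4-element set and none for a 5-element set.

from itertools import product

def set_partitions(xs):
    # All partitions of the list xs into nonempty blocks (standard recursion).
    if not xs:
        yield []
        return
    first, rest = xs[0], xs[1:]
    for smaller in set_partitions(rest):
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        yield [[first]] + smaller

def is_factorization(B):
    # Every choice of one part from each partition must intersect in exactly one element.
    # (The cells automatically partition the underlying set.)
    return all(len(set.intersection(*map(set, choice))) == 1 for choice in product(*B))

def nontrivial_two_factor_factorizations(n):
    S = list(range(n))
    candidates = [p for p in set_partitions(S) if 1 < len(p) < n]   # nontrivial partitions
    return [(b1, b2)
            for i, b1 in enumerate(candidates)
            for b2 in candidates[i + 1:]
            if is_factorization([b1, b2])]

print(len(nontrivial_two_factor_factorizations(4)))  # 3: a 4-element set factors as 2 x 2 in three ways
print(len(nontrivial_two_factor_factorizations(5)))  # 0: a prime-sized set has only the trivial factorization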

Finite Factored Sets

Nope, we have X ⊥ (X XOR Y), but not Y ⊥ (X XOR Y). That breaks the symmetry.

5tailcalled1moAh of course! So many symbols to keep track of 😅
Finite Factored Sets

Indeed, if X is independent of both Y and X xor Y, that violates the compositional semigraphoid axioms (assuming X is nondeterministic). Although it could still happen e.g. in the uniform distribution on X x Y. In the example in the post, I mean for X to be independent of X xor Y and for X to not be independent of Y.
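A concrete sketch of the uniform-distribution caveat (illustrative code, not from the thread): with X and Y independent fair coins, X, Y, and X XOR Y are pairwise independent, yet X is not independent of the pair (Y, X XOR Y), which is exactly the composition failure mentioned above.

import itertools

# Uniform distribution on X x Y: two independent fair coins.
joint = {(x, y): 0.25 for x, y in itertools.product([0, 1], repeat=2)}

def independent(f, g, tol=1e-12):
    # Check whether f(X, Y) and g(X, Y) are independent under `joint`.
    pf, pg, pfg = {}, {}, {}
    for (x, y), pr in joint.items():
        pf[f(x, y)] = pf.get(f(x, y), 0.0) + pr
        pg[g(x, y)] = pg.get(g(x, y), 0.0) + pr
        pfg[(f(x, y), g(x, y))] = pfg.get((f(x, y), g(x, y)), 0.0) + pr
    return all(abs(pfg.get((a, b), 0.0) - pf[a] * pg[b]) < tol for a in pf for b in pg)

X = lambda x, y: x
Y = lambda x, y: y
XOR = lambda x, y: x ^ y

print(independent(X, Y), independent(X, XOR), independent(Y, XOR))  # True True True: pairwise independence
print(independent(X, lambda x, y: (y, x ^ y)))                      # False: (Y, X XOR Y) determines X, so
                                                                     # composition fails for the fair-coin case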

3tailcalled1moI think one thing that confuses me is, wouldn't Y also be before X then?
Overconfidence is Deceit

And I mean the word "maybe" in the above sentence. I am saying the sentence not to express any disagreement, but to play with a conjecture that I am curious about.

Overconfidence is Deceit

Anyway, my reaction to the actual post is:

"Yep, Overconfidence is Deceit. Deceit is bad."

However, reading your post made me think about how maybe your right to not be deceived is trumped by my right to be incorrect.

7Scott Garrabrant4moAnd I mean the word "maybe" in the above sentence. I am saying the sentence not to express any disagreement, but to play with a conjecture that I am curious about.
Overconfidence is Deceit

I believe that I could not pass your ITT. I believe I am projecting some views onto you, in order to engage with them in my head (and publicly so you can engage if you want). I guess I have a Duncan-model that I am responding to here, but I am not treating that Duncan-model as particularly truth tracking. It is close enough that it makes sense (to me) to call it a Duncan-model, but its primary purpose in me is not for predicting Duncan, but rather for being there to engage with on various topics.

I suspect that being a better model would help it serve th... (read more)

Overconfidence is Deceit

Yep, I totally agree that it is a riff. I think that I would have put it in response to the poll about how important it is for karma to track truth, if not for the fact that I don't like to post on Facebook.

Overconfidence is Deceit

This comment is not only about this post, but is also a response to Scott's model of Duncan's beliefs about how epistemic communities work, and a couple of Duncan's recent Facebook posts. It is also a mostly unedited rant. Sorry.

I grant that overconfidence is in a similar reference class as saying false things. (I think there is still a distinction worth making, similar to the difference between lying directly and trying to mislead by saying true things, but I am not really talking about that distinction here.)

I think society needs to be robust to peopl... (read more)

2Duncan_Sabien4moI'm feeling demoralized by Ben and Scott's comments (and Christian's), which I interpret as being primarily framed as "in opposition to the OP and the worldview that generated it," and which seem to me to be not at all in opposition to the OP, but rather to something like preexisting schemas that had the misfortune to be triggered by it. Both Scott's and Ben's thoughts ring to me as almost entirely true, and also separately valuable, and I have far, far more agreement with them than disagreement, and they are the sort of thoughts I would usually love to sit down and wrestle with and try to collaborate on. I am strong upvoting them both. But I feel caught in this unpleasant bind where I am telling myself that I first have to go back and separate out the three conversations—where I have to prove that they're three separate conversations, rather than it being clear that I said "X" and Ben said "By the way, I have a lot of thoughts about W and Y, which are (obviously) quite close to X" and Scott said "And I have a lot of thoughts about X' and X''." Like, from my perspective it seems that there are a bunch of valid concerns being raised that are not downstream of my assertions and my proposals, and I don't want to have to defend against them, but feel like if I don't, they will in fact go down as points against those assertions and proposals. People will take them as unanswered rebuttals, without noticing that approximately everything they're specifically arguing against, I also agree is bad. Those bad things might very well be downstream of e.g. what would happen, pragmatically speaking, if you tried to adopt the policies suggested, but there's a difference between "what I assert Policy X will degenerate to, given [a, b, c] about the human condition" and "Policy X." (Jim made this distinction, and I appreciated it, and strong upvoted that, too.) And for some reason, I have a very hard time mustering any enthusiasm at all for both Ben and Scott's proposed conversat
2019 Review: Voting Results!

Unedited stream of thought:

Before trying to answer the question, I'm just gonna say a bunch of things that might not make sense (either because I am being unclear or being stupid). 

So, I think the debate example is much more *about* manipulation than the iterated amplification example, so I was largely replying to the class that includes IA and debate. I can imagine saying that iterated amplification done right does not provide an incentive to manipulate the human.

I think that a process that was optimizing directly for finding a fixed point of ... (read more)

4rohinmshah5moYeah, I agree debate seems less obvious. I guess I'm more interested in the iterated amplification claim since it seems like you do see iterated amplification as opposed to "avoiding manipulation" or "making a clean distinction between good and bad reasoning", and that feels confusing to me. (Whereas with debate I can see the arguments for debate incentivizing manipulation, and I don't think they're obviously wrong, or obviously correct.) Yeah, this argument makes sense to me, though I question how much such incentives matter in practice. If we include incentives like this, then I'm saying "I think the incentives a) arise for any situation and b) don't matter in practice, since they never get invoked during training". (Not just for the automated decomposition example; I think similar arguments apply less strongly to situations involving actual humans.) Agreed. I'm not claiming (1) in full generality. I'm claiming that there's a spectrum of how much incentive there is to predict humans in generality. On one end we have the automated examples I mentioned above, and on the other end we have sales and marketing. It seems like where we are on this spectrum is primarily dependent on the task and the way you structure your reasoning. If you're just training your AI system on making better transistors, then it seems like even if there's a human in the loop your AI system is primarily going to be learning about transistors (or possibly about how to think about transistors in the way that humans think about transistors). Fwiw, I think you can make a similar claim about debate. If we use iterated amplification to aim for corrigibility, that will probably require the system to learn about agency, though I don't think it obviously has to learn about humans. I might also be claiming (2), except I don't know what you mean by "sufficiently far". I can understand how prediction behavior is "close" to manipulation behavior (in that many of the skills needed for the first are re
2019 Review: Voting Results!

BTW, I would be down for something like a facilitated double crux on this topic, possibly in the form of a weekly LW meetup. (But think it would be a mistake to stop talking about it in this thread just to save it for that.)

Yeah, that sounds interesting, I'd participate.

2019 Review: Voting Results!

I am having a hard time generating any ontology that says:

I don't see [let's try to avoid giving models strong incentives to learn how to manipulate humans] as particularly opposed to methods like iterated amplification or debate.

Here are some guesses:

You are distinguishing between an incentive to manipulate real life humans and an incentive to manipulate human models?

You are claiming that the point of e.g. debate is that when you do it right there is no incentive to manipulate?

You are focusing on the task/output of the system, and internal incentives to learn how to manipulate don't count?

These are just guesses.

4rohinmshah5moThis seems closest, though I'm not saying that internal incentives don't count -- I don't see what these incentives even are (or, I maybe see them in the superintelligent utility maximizer model, but not in other models). Do you agree that the agents in Supervising strong learners by amplifying weak experts [https://arxiv.org/abs/1810.08575] don't have an incentive to manipulate the automated decomposition strategies? If yes, then if we now change to a human giving literally identical feedback, do you agree that then nothing would change (i.e. the resulting agent would not have an incentive to manipulate the human)? If yes, then what's the difference between that scenario and one where there are internal incentives to manipulate the human? Possibly you say no to the first question because of wireheading-style concerns; if so my followup question would probably be something like "why doesn't this apply to any system trained from a feedback signal, whether human-generated or automated?" (Though that's from a curious-about-your-beliefs perspective. On my beliefs I mostly reject wireheading as a specific thing to be worried about, and think of it as a non-special instance of a broader class of failures.)
2019 Review: Voting Results!

(Where by "my true reason", I mean what feels live to me right now, There is also all the other stuff from the post, and the neglectedness argument)

4rohinmshah5moI'm on board with "let's try to avoid giving models strong incentives to learn how to manipulate humans" and stuff like that, e.g. via coordination to use AI primarily for (say) science and manufacturing, and not for (say) marketing and recommendation engines. I don't see this as particularly opposed to methods like iterated amplification or debate, which seem like they can be applied to all sorts of different tasks, whether or not they incentivize manipulation of humans. It feels like the crux is going to be in our picture of how AGI works, though I don't know how.
2019 Review: Voting Results!

Yeah, looking at this again, I notice that the post probably failed to communicate my true reason. I think my true reason is something like:

I think that drawing a boundary around good and bad behavior is very hard. Luckily, we don't have to draw a boundary between good and bad behavior, we need to draw a boundary that has bad behavior on the outside, and *enough* good behavior on the inside to bootstrap something that can get us through the X-risk. Any distinction between good and bad behavior with any nuance seems very hard to me. However the boundary of ... (read more)

2Eli Tyre4mo[Eli's personal notes. Feel free to ignore or engage] Related to the following, from here [https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like] . . . . Thinking further, this is because of something like...the "good" strategies for engaging with humans are continuous with the "bad" strategies for engaging with humans (ie dark arts persuasion is continuous with good communication), but if your AI is only reasoning about a domain that doesn't have humans then deceptive strategies are isolated in strategy space from the other strategies that work (namely, mastering the domain, instead of tricking the judge). Because of this isolation of deceptive strategies, we can notice them more easily?
6Scott Garrabrant5mo(Where by "my true reason", I mean what feels live to me right now. There is also all the other stuff from the post, and the neglectedness argument.)
2019 Review: Voting Results!

Yeah, I am sad, but not surprised, because I have been trying to push this idea (e.g. at conferences) for a few years. 

Guesses as to why I'm failing?

I think that we actually undersold the neglectedness point in this post, but I don't think that is the main problem. I think the main problem is that the post (and I) do not give viable alternatives; it's like:

"Have you noticed that the CHAI ontology, the Paul ontology, and basically all the concrete plans for safe AGI are trying to get safety out of superhuman models of humans, and there are plausible wor... (read more)

I certainly would be more excited if there was an alternative in mind -- it seems pretty unlikely to me that this is at all tractable.

However, I am also pretty unconvinced by the object-level arguments that there are risks from using human models that are comparable to the risks from AI alignment overall (under a total view, longtermist perspective). Taking the arguments from the post:

  • Less Independent Audits: I agree that all else equal, having fewer independent audits increases x-risk from AI alignment failure. This seems like a far smaller effect than fr
... (read more)

For the 2019 Review, I think it would've helped if you/Rob/others had posted something like this as reviews of the post. Then voters would at least see that you had this take, and maybe people who disagree would've replied there which could've led to some of this getting hashed out in the comments.

Quick note that I personally didn't look at the post mostly due to opportunity cost of lots of other posts in the review.

I also think we should maybe run a filter on the votes that only takes AlignmentForum users, and check what the AF consensus on alignment-related posts was. Which may not end up mattering here, but I know I personally avoided voting on a lot of alignment stuff because I didn't feel like I could evaluate them well, and wouldn't be surprised if other people did that as well.

Eight Definitions of Observability

I am confused, why is it not identical to your other comment?

3Ramana Kumar5moBecause S1 and S2 are not a partition of the world here. EDIT: but what we actually need in the proof is Assume_S1(C) & Assume_S2(C) & ⋯ ≃ Assume_(S1∪S2)(C) & … where the … do result in a partition, so I think this will work out the same as the other comment. I'm still not convinced about biextensional equivalence between the frames without the rest of the product.