All of scasper's Comments + Replies

Did you read the post?

I think it is clear that we are not advocating for centralized authority. All three points in the "takeaways" section lead into this. The questions you asked in the second paragraph are ones that we discuss in the "what do we want" section, with some additional material in the EA Forum version of the post.

Without falling into the trap of debating over definitions: anarchy can be used colloquially to refer to chaos in general, and it was intended here to mean a lack of barriers to misuse in the regulatory and dev ecosystems -- not a lack o... (read more)

3 · the gears to ascenscion · 24d
Fair enough. I read much of the post but primarily skimmed it like a paper; it seems I missed some of the point. Sorry to write an unhelpful comment!

The Vice article came out on August 24th. That was 5 days after the SD leak and 2 days after its official open-source release. Its claim that SD couldn't "begin to trick anyone into thinking they're real snapshots of nudes" did not stand the test of time. We linked the Vice article in the context of the discussion of deepfake porn in general, not of the specific photorealistic capabilities of SD.

Speaking of which, DreamBooth does allow for this. See this SFW example. This is the type of thing that would not be possible with older methods. h... (read more)

I think I agree with johnswentworth's comment. I think there is a chance that equating genuinely useful ASI safety-related work with deceptive alignment could be harmful. To give another perspective, I would also add that I think your definition of deceptive alignment is very broad -- broad enough to encompass areas of research that I find quite distinct (e.g. better training methods vs. better remediation methods) -- yet still seems to exclude some other things that I think matter a lot for AGI safety. Some quick examples I thought of in a few minutes are... (read more)

3 · Marius Hobbhahn · 2mo
That's fair. I think my statement as presented is much stronger than I intended it to be. I'll add some updates/clarifications at the top of the post later. Thanks for the feedback!

The taxonomy we introduced in the survey gave a helpful way of splitting up the problem. Other than that, it took a lot of effort, several Google Docs that got very messy, and https://www.connectedpapers.com/. Personally, I've also been working on interpretability for a while and have passively formed a mental model of the space.

scasper · 3mo · Ω13260

My answer to this is actually tucked into one paragraph on the 10th page of the paper: "This type of approach is valuable...reverse engineering a system". We cite examples of papers that have used interpretability tools to generate novel adversaries, aid in manually finetuning a network to induce a predictable change, or reverse engineer a network. Here they are:

Making adversaries:

https://distill.pub/2019/activation-atlas/

https://arxiv.org/abs/2110.03605

https://arxiv.org/abs/1811.12231

https://arxiv.org/abs/2201.11114

https://arxiv.org/abs/2206.14754

https://... (read more)
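To make the first category concrete, here is a minimal, hypothetical sketch of the generic gradient-based attack (one-step FGSM) that interpretability-guided adversaries build on. It assumes a PyTorch classifier called model, an input batch x scaled to [0, 1], and labels y; it is not the method of any particular paper above -- those works mainly differ in using interpretability tools to choose what the perturbation is pushed toward.

import torch
import torch.nn.functional as F

def fgsm_adversary(model, x, y, eps=0.03):
    # Perturb x one step in the direction that increases the classifier's loss.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Follow the sign of the gradient, then clip back to the valid pixel range.
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()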

Don't know what part of the post you're referring to.

In both cases, it can block the Lobian proofs. But something is unsatisfying about making ad hoc adjustments to one's policy like this. I'll quote Demski on this instead of trying to write my own explanation. Demski writes:

  • Secondly, an agent could reason logically but with some looseness. This can fortuitously block the Troll Bridge proof. However, the approach seems worryingly unprincipled, because we can “improve” the epistemics by tightening the relationship to logic, and get a decision-theoretically much worse result.
    • The problem here is that
... (read more)
1 · Tao Lin · 4mo
feels to me like the problems with Lob's theorem come from treating worldly cause and effect as logical implication above the proof system, which is just silly. (i'm no expert
1 · Tao Lin · 4mo
I guess I disagree, in that even if the agent intentionally implements a non-loose policy, it won't subjectively believe that it will act exactly non-loosely because some random thing could go wrong.

Any chance you could clarify?

In the Troll Bridge problem, the counterfactual (the agent crossing the bridge) would indicate the inconsistency of the agent's logical system of reasoning. See this post and what Demski calls a subjective theory of counterfactuals.

in your terms an "object" view and an "agent" view.

Yes, I think that there is a time and place for these two stances toward agents. The object stance is for when we are thinking about how behavior is deterministic conditioned on a state of the world and agent. The agent stance is for when we are trying to be purposive and think about what types of agents to be/design. If we never wanted to take the object stance, we couldn't successfully understand many dilemmas, and if we never wanted to take the agent stance, then there seems little point in trying to talk about w... (read more)

3 · Chris_Leong · 4mo
Agreed. The core lesson for me is that you can't mix and match - you need to clearly separate out when you are using one stance or another.

I can understand this perspective, but perhaps if there's a relatively accessible way of explaining why this (or something similar to this) isn't self-defeating, then maybe we should go with that?

I don't quite get your point. Any chance you could clarify? Like sure, we can construct counterfactuals within an inconsistent system, and sometimes this may even be a nifty trick for getting the right answer if we can avoid the inconsistency messing us up, but outside of this, why is this something that we should care about?

Good point, now that you've said it I have to agree that I was too quick to assume that the outside-of-the-universe decision theory should be the same as the inside-of-the-universe decision theory. Thinking this through, if we use CDT as our outside decision theory to pick an inside decision theory, then we need to be able to justify why we were using CDT. Similarly, if we were to use another decision theory. One thing I've just realised is that we don't actually have to use CDT, EDT or FDT to make our decision. Since there's no past for the meta-decider, we can just use our naive decision theory which ignores the past altogether. And we can justify this choice based on the fact that we are reasoning from where we are. This seems like it would avoid the recursion. Except I don't actually buy this, as we need to be able to provide a justification of why we would care about the result of a meta-decider outside of the universe when we know that isn't the real scenario. I guess what we're doing is making an analogy with inside-the-universe situations where we can set the source code of a robot before it goes and does some stuff. And we're noting that a robot probably has a good algorithm if its code matches what a decider would choose if they had to be prepared for a wide variety of circumstances and then tryi

Thanks. I rewrote the second bit you quoted. I agree that sketching the proof that way was not good.

Suppose that hypothetically, Rob proves that crossing the bridge would lead to it blowing up. Then if he crossed, he would be inconsistent. And if so, the troll would blow up the bridge. So Rob can prove that a proof that crossing would result in the bridge blowing up would mean that crossing would result in the bridge blowing up. So Rob would conclude that he should not cross. 

This should be clearer and not imply that Rob needs to be able to prove his own consistency. I hope that helps.
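For readers who want the skeleton of that paragraph, here is a rough Löbian rendering (my own labels; a sketch rather than the full formal argument), writing $\Box\varphi$ for "Rob proves $\varphi$", $C$ for "Rob crosses", and $B$ for "the bridge blows up":

1. Hypothetically assume $\Box(C \to B)$.
2. If Rob crossed while having proved $C \to B$, his reasoning would be inconsistent, and the troll would detonate the bridge, so $C \to B$.
3. Discharging the hypothesis, Rob proves $\Box(C \to B) \to (C \to B)$.
4. Löb's theorem, $\Box(\Box(C \to B) \to (C \to B)) \to \Box(C \to B)$, then yields $\Box(C \to B)$.
5. Acting on that proof, Rob does not cross: $\neg C$.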
 

1 · Martín Soto · 4mo
Okay, now it is clear that you were not presupposing the consistency of the logical system, but its soundness (if Rob proves something, then it is true of the world). I still get the feeling that embracing hypothetical absurdity is how a logical system of this kind will work by default, but I might be missing something, I will look into Adam's papers.

Here's the new version of the paragraph with my mistaken explanation fixed. 

"Suppose that hypothetically, Rob proves that crossing the bridge would lead to it blowing up. Then if he crossed, he would be inconsistent. And if so, the troll would blow up the bridge. So Rob can prove that a proof that crossing would result in the bridge blowing up would mean that crossing would result in the bridge blowing up. So Rob would conclude that he should not cross. "


 

Thanks for the comment. tl;dr, I think you mixed up some things I said, and interpreted others in a different way than I intended. But either way, I don't think there are "enormous problems".

So the statement to be proven (which I shall call P) is not just "agent takes action X", but "when presented with this specific proof of P, the agent takes action X".

Remember that I intentionally give a simplified sketch of the proof instead of providing it. If I did, I would specify the provability predicate. I think you're conflating what I say about the proof and wh... (read more)

1 · scasper · 4mo
Here's the new version of the paragraph with my mistaken explanation fixed. "Suppose that hypothetically, Rob proves that crossing the bridge would lead to it blowing up. Then if he crossed, he would be inconsistent. And if so, the troll would blow up the bridge. So Rob can prove that a proof that crossing would result in the bridge blowing up would mean that crossing would result in the bridge blowing up. So Rob would conclude that he should not cross. "
4 · shminux · 4mo
OK, thanks! I guess some topics are beyond my comprehension.

The hypothetical pudding matters too!

2 · shminux · 4mo
Wasn't the whole post about them not mattering, or else if you take them too seriously you are led astray?

My best answer here is in the form of this paper that I wrote, which discusses these dilemmas and a number of others. Decision-theoretic flaws like the ones here are subtle flaws in seemingly reasonable frameworks for making decisions that may lead to unexpected failures in niche situations. For agents who are vulnerable to either spurious proofs or trolls, there are adversarial situations that could effectively exploit these weaknesses. These issues aren't tied to incompleteness so much as they are just examples of ways that agents could be manipulable.

The importance of this question doesn't involve whether or not there is an "option" in the first case or what you can or can't do in the second.  What matters is whether, hypothetically, you would always obey such a proof or would potentially disobey one. The hypothetical here matters independently of what actually happens because the hypothetical commitments of an agent can potentially be used in cases like this to prove things about it via Lob's theorem. 
Another type of case in which how an agent would hypothetically behave can have real influence o... (read more)

2 · Hoagy · 4mo
Thanks!

This is self-promotion, but this paper provides one type of answer for how certain questions involving agent foundations are directly important for alignment.

3 · johnswentworth · 8mo
I don't think that link goes where you intended.

Thanks for this post, but I don't feel like I have the background for understanding the point you're making. In the pingback post, Demski describes your point as saying:

an agent could reason logically but with some looseness. This can fortuitously block the Troll Bridge proof.

Could you offer a high-level explanation of what the main principle here is and what Demski means by looseness? (If such an explanation exists.) Thanks.

3 · Diffractor · 10mo
The key piece that makes any Lobian proof tick is the "proof of X implies X" part. For Troll Bridge, X is "crossing implies bridge explodes". For standard logical inductors, that Lobian implication holds because, if a proof of X showed up, every trader betting in favor of X would get free money, so there could be a trader that just names a really really big bet in favor of X (it's proved, after all), the agent ends up believing X, and so doesn't cross, and so crossing implies bridge explodes. For this particular variant of a logical inductor, there's an upper limit on the number of bets a trader is able to make, and this can possibly render the statement "if a proof of X showed up, the agent would believe X" false. And so, the key piece of the Lobian proof fails, and the agent happily crosses the bridge with no issue, because it would disbelieve a proof of bridge explosion if it saw it (and so the proof does not show up in the first place).
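A toy numerical sketch of that last mechanism (purely illustrative, with a made-up price rule and made-up numbers -- nothing like the actual logical-inductor construction): the market's credence in X is driven by how much bet mass traders can put behind X, so a hard cap on a trader's bets keeps the credence bounded away from 1 even after a proof of X shows up, and the "proof of X implies belief in X" step fails.

def credence_in_X(total_bet_mass, sensitivity=0.01):
    # Map the net bet mass in favor of X to a credence in [0, 1].
    return max(0.0, min(1.0, 0.5 + sensitivity * total_bet_mass))

huge_bet = 1000   # what a trader would like to bet once X is proved
bet_cap = 10      # per-trader limit in the bounded variant

print(credence_in_X(huge_bet))                # 1.0: the agent would believe X, so the Lobian step goes through
print(credence_in_X(min(huge_bet, bet_cap)))  # 0.6: credence stays below 1, so the step fails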

Thanks for this post; I think it has a lot of clarifying value and that your interpretation is valid and good. In my post, I failed to cite Y&S's actual definition and should have been more careful. I ended up critiquing a definition that probably resembled MacAskill's definition more than Y&S's, and it seems to have been somewhat of an accidental strawperson. In fairness to me though, Y&S never offered any example with the minimal conditions for SD to apply in their original paper, while I did. This is part of what led to MacAskill... (read more)

2 · Heighn · 10mo
Thanks for your reply! I think your definition does say so, and rightfully so. Your source code is run in the historical cases as well, so indirectly it does confound Omega. Or are you saying your source code changes? Then there's no (or no full) subjunctive dependence with either definition. If your source code remains the same but your specific-for-each-game strategy changes, we can imagine something like using historical data as an observation in your strategy, in which case it can be different each time. But again, then there is no (or no full) subjunctive dependence with either definition. I might misunderstand what you mean though. Thanks again for your reply, I think discussing this is important!

Nice post. I'm leaving a reply there instead of here. :)

It's really nice to hear that the paper seems clear! Thanks for the comment. 

  • I've been working on this since March, but at a very slow pace, and I took a few hiatuses. Most days when I'd work on it, it was for less than an hour. After coming up with the initial framework to tie things together, the hardest part was trying and failing to think of interesting ways in which most of the Achilles heels presented could be used as novel containment measures. I discuss this a bit in the discussion section.

For 2-3, I can give some thoughts, but these aren't ne... (read more)

Thanks for the comment. +1 to it. I also agree that this is an interesting concept: using Achilles heels as containment measures. There is a discussion related to this on page 15 of the paper. In short, I think that this is possible and useful for some Achilles heels, and would be a cumbersome containment measure for others, where the same thing could be accomplished more simply via bribes of reward.

Thanks.

I disagree a bit. My point has been that it's easy for solipsism to explain consciousness and hard for materialism to. But it's easy for materialism to account for structure and hard for solipsism to. Don't interpret the post as my saying solipsism wins -- just that it's underrated. I also don't say qualia must be irreducible, just that there's spookiness if they are.

Thanks! This is insightful.

What exactly would it mean to perform a Bayesian update on you not experiencing qualia?

Good point. In an anthropic sense, the sentence this is a reply to could be redacted. Experiencing qualia themselves would not be evidence to prefer one theory over another. Only experiencing certain types of observations would cause a meaningful update.

The primitives of materialism are described in equations. Does a solipsist seek an equation to tell them how angry they will be next Tuesday? If not, what is the substance of a solipsistic mode
... (read more)
1 · Donald Hobson · 3y
Someone that knows quantum physics but almost no computing looks at a phone. They don't know how it works inside. They are uncertain about how apps result from material phenomena. This is just normal uncertainty over a set of hypotheses. One of those hypotheses is the actual answer; many others will look like alternate choices of circuit board layout or programming language. They still need to find out how the phone works, but that is because they have many hypotheses that involve atoms. They have no reason to doubt that the phone is made of atoms. I don't know how your brain works either, but I am equally sure it is made of (atoms, quantum waves, strings or whatever). I apply the same to my own brain. In the materialist paradigm I can understand Newtonian gravity as at least an approximation of whatever the real rules are. How does a solipsist consider it?

Great comment. Thanks.

In the case of idealism, we call the ontological primitive "mental", and we say that external phenomena don't actually exist but instead we just model them as if they existed to predict experiences. I suppose this is a consistent view and isn't that different in complexity from regular materialism.

I can't disagree. This definitely shifts my thinking a bit. I think that solipsism + structured observations might be comparable in complexity to materialism + an ability for qualia to arise from material phenomena. ... (read more)

1 · Brian_Tomasik · 3y
Makes sense. :) To me it seems relatively plausible that the intuition of spookiness regarding materialist consciousness is just a cognitive mistake [https://reducing-suffering.org/hard-problem-consciousness/#Expgap_syndrome], similar to Capgras syndrome. I'm more inclined to believe this than to adopt weirder-seeming ontologies.

Thanks for the comment. I'm not 100% sold on the computers analogy. I think answering the hard problem of consciousness is significantly different from understanding how complex information processing systems like computers work. Any definition or framing of consciousness in terms of informational or computational theory may allow it to be studied in those terms, in the same way that computers can be understood through systems-based theoretical reasoning built on abstraction. However, I don't think this is what it means to solve the hard problem o... (read more)

I agree -- thanks for the comment. When writing this post, my goal was to share a reflection on solipsism in a vacuum rather than in the context of decision theory. I acknowledge that solipsism doesn't really tend to drive someone toward caring much about others and such. In that sense, it's not very productive if someone is altruistically/externally motivated.

I don't want to give any impression that this is a particularly important decision-theoretic question. :)

4 · Dagon · 3y
Mostly my comment was a response to the word "underrated" in the title. We wouldn't know how it's rated, because, by its nature, it's going to be less proselytized. A quibble, to be sure, but "underrepresented" is probably more accurate.

Thanks for the comment. I think it's exciting for this to make it into the newsletter. I am glad that you liked these principles.

I think that even lacking a concept of free will, FDT can be conveniently thought of as applying to humans through the installation of new habits or ways of thinking, without conflicting with the framework that I aim to give here. I agree that there are significant technical difficulties in thinking about when FDT applies to humans, but I wouldn't consider them philosophical difficulties.

I'm skeptical of this. Non-mere correlations are consequences of an agent's source-code producing particular behaviors that the predictor can use to gain insight into the source-code itself. If an agent adaptively and non-permanently modifies its source-code, this (from the perspective of a predictor who suspects this to be true) de-correlates its current source code from the non-mere correlations of its past behavior -- essentially destroying the meaning of non-mere correlations to the extent that the predictor is suspicious.

Oh yes. I agre... (read more)

I wrote a LW post as a reply to this. I explain several points of disagreement with MacAskill and Y&S alike. See here.

I really like this analysis. Luckily, with the right framework, I think that these questions, though highly difficult, are technical but no longer philosophical. This seems like a hard question of priors but not a hard question of framework. I speculate that in practice, an agent could be designed to adaptively and non-permanently modify its actions and source code to slip past many situations, fooling predictors that exploit non-mere correlations when helpful.

But on the other hand, maybe a way to induce a great amount of uncertainty to mess up certain age... (read more)

1 · Isnasene · 3y
Yes -- I agree with this. It turns the question "What is the best source-code for making decisions when the situations you are placed in depend on that source-code?" into a question more like "Okay, since there are a bunch of decisions that are contingent on source-code, which ones do we expect to actually happen and with what frequency?" And this is something we can, in principle, reason about (i.e., we can speculate on what incentives we would expect predictors to have and try to estimate uncertainties of different situations happening).

I'm skeptical of this. Non-mere correlations are consequences of an agent's source-code producing particular behaviors that the predictor can use to gain insight into the source-code itself. If an agent adaptively and non-permanently modifies its source-code, this (from the perspective of a predictor who suspects this to be true) de-correlates its current source code from the non-mere correlations of its past behavior -- essentially destroying the meaning of non-mere correlations to the extent that the predictor is suspicious. Maybe there's a clever way to get around this. But, to illustrate the problem with a claim from your blog [https://medium.com/@thestephencasper/decision-theory-ii-going-meta-5bc9970bd2b9]: This is true for a mind reader that is directly looking at source code but is untrue for predictions relying on non-mere correlations. To such a predictor, a dynamic user of CDT who has just updated to EDT would have a history of CDT behavior and non-mere correlations associated mostly with CDT. Now two things might happen:

1. The predictor classifies the agent as CDT and kills it.
2. The predictor classifies the agent as a dynamic user of CDT, predicts that it has updated to EDT, and does not kill it.

Option 1 isn't great because the agent gets killed. Option 2 also isn't great because it implies predictors have access to non-mere correlations strong enough to indicate that a given agent can dynamically update. This