Yeah, I agree with what you just said; I should have been more careful with my phrasing. 

Maybe something like: "The naive version of the orthogonality thesis, where we assume that AIs can't converge towards human values, is assumed to be true too often."

Compared to other people on this site, this is part of my alignment optimism. I think there are natural abstractions in the moral landscape that make agents converge towards cooperation and similar things. I read a post recently in which Leo Gao argued that concave agents generally don't exist, because they tend to stop existing. I think there are pressures that conform agents to parts of the value landscape.

So I agree that the orthogonality thesis is presumed to be true way too often. It is more of an argument that convergence may not happen by default, but I'm also uncertain about how much evidence it actually gives you.

I have the same experience; I love having it connect two disparate topics together, it is very fun. I realised today that I use GPT as a brainstorming partner for basically 80%+ of the work tasks I do.

Hey! I saw that you had a bunch of downvotes and I wanted to get in here before you became too disillusioned with the LW crowd. A big point for me is that you don't really have any sub-headings or examples that get straight to the point. It is all one long text that reads like a direct transcription of your thoughts, which makes it really hard to engage with what you say. Of course you're saying controversial things, but if there were more clarity I think you would get more engagement.

(GPT is really OP for this nowadays.) Anyway, I wish you the best of luck! I'm also sorry for not engaging with any of your arguments, but I couldn't quite follow them.

Alright, quite a ba(y)sed point there, very nice. My lazy ass is looking for a heuristic here. It seems like the more the EMH holds in a situation (i.e. the more optimisation pressure has been applied), the more you should expect to be disappointed with a trade.

But what is a good heuristic for how much worse it will be? Maybe one just has to think about the counterfactual option each time?

I thought the orangutan argument was pretty good when I first saw it, but then I looked it up and realised that it isn't that orangutans aren't power-seeking; it's more that they only are in interactions that matter for the future survival of their offspring. It actually is a very flimsy argument. Some of the things he says are smart, like some of the stuff on the architecture front, but he always brings up his aeroplane analogy in AI safety. That one is really weak: I wouldn't get into an aeroplane without knowing it has been safety checked, and as a consequence I have a hard time taking him seriously when it comes to safety.

Very cool! It might be interesting to mention the connection between what the Buddha called dependent origination and the formation of the self-view of being an agent.

The idea is that your self is built through a loop of expecting your self to be there in the future, thus creating a self-fulfilling prophecy. This is similar to how agents are defined under the intentional stance, where it is informationally more efficient to describe yourself as an agent.

One way to view the alignment problem is through a self-loop taking over fully, or dependent origination in artificial agents. Anyway, I think it seems very cool and I wish you the best of luck!

I notice that I'm confused about the relationship between power-seeking arguments and counting arguments. Since I'm confused, I'm assuming others are too, so I would appreciate some clarity on this.

In footnote 7, Turner mentions that the paper "Optimal Policies Tend to Seek Power" falls under the irrelevant counting error described in the post.

In my head, the counting argument says that it is hard to hit an alignment target because there are a lot more non-alignment targets. This argument is (clearly?) wrong for the reasons specified in the post. Yet this doesn't address power-seeking, as that seems more like an optimisation pressure applied to the system, not something dependent on counting arguments?

In my head, power-seeking is more like saying that an agent's attraction basin is larger at one point of the optimisation landscape than at another. The same can also be said about deception here.

I might be dumb, but I never thought of the counting argument as true, nor as crucial to either deception or power-seeking. I'm very happy to be enlightened about this issue.

I buy the argument that scheming won't happen, conditional on us not allowing much slack between different optimisation steps. As Quintin mentions in his AXRP podcast episode, SGD doesn't have anything close to the level of slack that, for example, cultural evolution allowed. (See the free energy of optimisation debate from before; I can't remember the post names ;/) If that holds, then I don't see why the inner behaviour should diverge from what the outer alignment loop specifies.

I do, however, believe it is important to make this hold by specifying the right outer alignment loop as well as the right deployment environment, so that slack is minimised at all points along the chain and misalignment is avoided everywhere.

If we catch deception in training, we will be OK. If we catch actors that might create deceptive agents in training, we will be OK. If we catch states developing agents to do this, or if defense beats offense, we will be OK. I do not believe that this happens by default.
