Sodium

Trying to get into alignment

Comments

Sodium50

I wonder if it's useful to try to disentangle the disagreement using the outer/inner alignment framing? 

My understanding is that "the deceptive alignment folks" believe some sort of deceptive inner misalignment is very likely regardless of what your base objective is. The demonstrations here, by contrast, show that when the base objective encourages or does not prohibit scheming, the model is capable of scheming. Thus, many folks (myself included) do not see these evals as changing our views on P(scheming|Good base objective/outer alignment) very much.


I think Zvi is saying two things here. The first is that outer misalignment/bad base objectives are also very likely. The second is that he rejects splitting up "will the model scheme" into inner and outer misalignment. In other words, he doesn't care about P(scheming|Good base objective/outer alignment), only about P(scheming).


I get the sense that many technical people consider P(scheming|Good base objective/outer alignment) the central problem of technical alignment, while the more sociotechnically minded folks are just concerned with P(scheming) in general.

Maybe another disagreement is how likely "Good base objective/outer alignment" is to occur in the strongest models, and how important this problem is.
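
A rough way to write down the split discussed above, using the law of total probability. This is only a sketch: G is shorthand introduced here for "Good base objective/outer alignment" and is not notation from the post.

```latex
P(\text{scheming}) = P(\text{scheming} \mid G)\,P(G) + P(\text{scheming} \mid \neg G)\,P(\neg G)
```

On this reading, the evals mostly inform P(scheming | not G), the technical alignment question is about P(scheming | G), and the disagreement about how likely a good base objective is concerns P(G).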

Sodium10

Hmmm ok maybe I’ll take a look at this :)

Sodium22

Have people done evals for a model with/without an SAE inserted? Seems like even just looking at drops in MMLU performance by category could be non-trivially informative. 
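
A minimal sketch of the per-category comparison, assuming you already have graded results from two runs (one without and one with the SAE inserted) as (category, is_correct) pairs. The function names and the toy data are made up for illustration; nothing here comes from a real eval harness.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return accuracy per category from (category, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {c: correct[c] / totals[c] for c in totals}


def accuracy_drop(base_records, sae_records):
    """Accuracy without the SAE minus accuracy with it, per category."""
    base = accuracy_by_category(base_records)
    sae = accuracy_by_category(sae_records)
    # Only compare categories that appear in both runs.
    return {c: base[c] - sae[c] for c in base if c in sae}


# Hypothetical toy data, not real MMLU results.
base_run = [("anatomy", True), ("anatomy", True), ("law", True), ("law", False)]
sae_run = [("anatomy", True), ("anatomy", False), ("law", True), ("law", False)]

drops = accuracy_drop(base_run, sae_run)
for category, drop in sorted(drops.items()):
    print(f"{category}: {drop:+.2f}")
```

Categories with unusually large drops would be the ones to look at first.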

Sodium3981

I wouldn't trust an Altman quote in a book tbh. In fact, I think it's reasonable to not trust what Altman says in general. 

Sodium30

You said that 

CVI is explicitly partisan and can spend money in ways that more effectively benefit Democrats. VPC is a non-partisan organization and donations to it are fully tax deductible


But CVI's About Us page states

Center for Voter Information is a non-profit, non-partisan partner organization to Voter Participation Center, both founded to provide resources and tools to help voting-eligible citizens register and vote in upcoming elections.

The Voter Participation Center also states

The Voter Participation Center (VPC) is a non-profit, non-partisan organization founded in 2003

Sodium30

FYI, since I think you missed this: According to the responsible scaling policy update, the Long-Term Benefit Trust would "have sufficient oversight over the [responsible scaling] policy implementation to identify any areas of non-compliance." 

Sodium10

It's also EAG London weekend lol it's a busy weekend for all

Sodium21

I thought that the part about models needing to keep track of a more complicated mixed-state presentation, as opposed to just the world model, is one of those technical insights that's blindingly obvious once someone points it out to you (i.e., the best type of insight :)). I love how the post starts out by describing the simple Z1R example to help us get a sense of what these mixed-state presentations are like. Bravo!

Sodium356

So out of the twelve people on the weak-to-strong generalization paper, four have since left OpenAI? (Leopold, Pavel, Jan, and Ilya)

Other recent safety-related departures that come to mind are Daniel Kokotajlo and William Saunders.

Am I missing anyone else?

Sodium10

Others have mentioned Coase (whose paper is a great read!). I would also recommend The Visible Hand: The Managerial Revolution in American Business. This is an economic history work detailing how large corporations emerged in the US in the 19th century. 
