I'm interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.
I do alignment research, mostly in the general vicinity of agent foundations. Currently doing independent research on ontology identification. Formerly on Vivek's team at MIRI.
Why would models start out aligned by default?
This is the best I've got so far. I estimated the rating using the midpoint of a logistic regression fit to the games (rough sketch of the calculation below). The first few especially seem inflated because there weren't enough high-rated players in the data, so the regression had to extrapolate. And they all seem inflated by (I'd guess) a couple of hundred points due to the effects I mentioned in the post. (Edit: Please don't share the graph alone without this context).
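To be concrete about what "midpoint of a logistic regression" means here, this is roughly the calculation (a minimal sketch with made-up toy data; it ignores draws, and the actual fit behind the graph may differ in details):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_rating(opponent_ratings, engine_won):
    """Fit P(engine wins) as a logistic function of opponent rating and
    return the rating where the fitted win probability crosses 50%."""
    ratings = np.asarray(opponent_ratings, dtype=float)
    center = ratings.mean()
    X = (ratings - center).reshape(-1, 1)   # center ratings for numerical stability
    y = np.asarray(engine_won, dtype=int)   # 1 = engine win, 0 = engine loss
    model = LogisticRegression().fit(X, y)
    slope, intercept = model.coef_[0, 0], model.intercept_[0]
    # The fitted sigmoid crosses 0.5 where slope * x + intercept == 0.
    return center - intercept / slope

# Toy data: wins mostly against lower-rated opponents, losses against higher-rated ones.
opponents = [2100, 2200, 2300, 2400, 2500, 2600, 2700, 2800]
won       = [   1,    1,    1,    1,    0,    1,    0,    0]
print(round(estimate_rating(opponents, won)))
```

When most opponents sit below the 50% crossover, the crossover lands in a region the sigmoid is extrapolating into, which is where the inflation mentioned above comes from.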
The NN rating in the Blitz data highlights the flaw in this method of estimating the rating.
I haven't found a way to get similar data on human vs human games.
Took a while to download all this. I'm curious, what's your blitz rating?
Does that sound right?
I can't give a confident yes because I'm pretty confused about this topic, and I'm currently pretty unhappy with the way the leverage prior mixes up action and epistemics. The issue of discounting theories of physics just because they imply high leverage seems really bad? I don't understand whether the UDASSA thing fixes this. But yes.
That avoids the "how do we encode numbers" question that naturally arises.
I'm not sure how natural the encoding question is; there's probably an AIT answer to this kind of question that I don't know.
By "control plausibly works" I didn't mean "Stuff like existing monitoring will work to control AIs forever". I meant it works if it is a stepping stone allows us to accelerate/finish alignment research, and thereby build aligned AGI.
I think several of the subquestions that matter for whether it'll plausibly work to have AI solve alignment for us are in the second category, like the two points I mentioned in the post. I think there are other subquestions that are more in the first category, which are also relevant to the odds of success. I have relatively low confidence about this kind of stuff, for all the normal reasons it's difficult to say how other people should be thinking: it's easy to miss relevant priors, evidence, etc. But still... given what I know about what everyone believes, it looks like these questions should be resolvable among reasonable people.
Makes sense, but in that case, why penalize by time? Why not just directly penalize by utility, like the leverage prior does?
Also, why not allow floating-point representations of utility to be output, rather than just binary integers?
Aren't there programs that run fast and also return a number that grows much faster than |p|, like up-arrow notation? Why don't these grow faster than your speed prior penalizes them?
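For reference, here's a minimal sketch of what I mean by up-arrow notation (my own illustration, not from the post): the program is only a few lines long, yet up_arrow(2, 2, b) is a tower of b twos and so grows hyperexponentially in b.

```python
def up_arrow(a, n, b):
    """Knuth's up-arrow notation: compute a with n up-arrows applied to b."""
    if n == 1:
        return a ** b
    if b == 0:
        return 1
    return up_arrow(a, n - 1, up_arrow(a, n, b - 1))

print(up_arrow(2, 2, 4))               # 2↑↑4 = 65,536
print(up_arrow(2, 2, 5).bit_length())  # 2↑↑5 = 2^65,536; print its bit length
                                       # rather than the ~19,729-digit number itself
```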
I think there are reasonable people who look at the evidence and think it's plausible that control works, and other reasonable people who look at the same evidence and think it's implausible. And others who think that OpenAI-superalignment-style plans plausibly work.
Something is going wrong here.
Relevant comment on Reddit from someone working on Leela Odds: