If you observe 2 pieces of evidence, you have to condition the 2nd on seeing the 1st to avoid double-counting evidence
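A toy sketch of this with made-up numbers: if you reuse the unconditional likelihood for the 2nd piece of evidence instead of conditioning it on the 1st, you overstate the posterior.

```python
# Toy illustration (all numbers invented): two correlated pieces of
# evidence for hypothesis H. The 2nd update must use P(E2 | H, E1),
# which is weaker than P(E2 | H) when E1 and E2 are correlated.

def bayes_update(prior, lik_h, lik_not_h):
    """Return P(H | E) given P(H), P(E | H), and P(E | not H)."""
    num = prior * lik_h
    return num / (num + (1 - prior) * lik_not_h)

p_h = bayes_update(0.5, lik_h=0.8, lik_not_h=0.2)  # after evidence 1

# Correct: condition the 2nd likelihood on having already seen the 1st.
p_correct = bayes_update(p_h, lik_h=0.55, lik_not_h=0.45)

# Double-counting: reuse the unconditional likelihood ratio 0.8/0.2.
p_naive = bayes_update(p_h, lik_h=0.8, lik_not_h=0.2)

# p_naive (~0.94) overstates the update relative to p_correct (~0.83).
```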
A human given finite time to think also only performs O(1) computation, and thus cannot "solve computationally hard problems".
I don't really want to argue about language. I'll defend "almost no individual has a pretty substantial effect on capabilities." I think publishing norms could have a pretty substantial effect on capabilities, and also a pretty substantial effect on interpretability, and I currently think the norms suggested have a tradeoff that's bad-on-net for x-risk.
Chris Olah's interpretability work is one of the most commonly used resources in graduate and undergraduate ML classes, so people clearly think it helps you get better at ML engineering
I think this is fals...
Similarly, if you thought that you should publish capabilities research to accelerate to AGI, and you found out how to build AGI, then whether you should publish is not really relevant anymore.
I think it's probably reasonable to hold off on publishing interpretability if you strongly suspect that it also advances capabilities. But then that's just an instance of the general principle "maybe don't advance capabilities", and the interpretability part was irrelevant. I don't really buy that interpretability is particularly likely to increase capabilities, such that you should have a sense of general caution around it. If you have a specific sense that e.g. working on nuclear fission could produce a bomb, then maybe you shouldn't publish (as has histor...
I think this case is unclear, but also not central because I'm imagining the primary benefit of publishing interp research as being making interp research go faster, and this seems like you've basically "solved interp", so the benefits no longer really apply?
Naively, there are so few people working on interp, and so many people working on capabilities, that publishing is clearly good for relative progress. So you need a pretty strong argument that interp in particular is good for capabilities, which isn't borne out empirically and also doesn't seem that strong.
In general, this post feels like it's listing a bunch of considerations that are pretty small, and the 1st-order consideration is just "do you want people to know about this interpretability work", which seems like a relatively straightforward "yes".
I also...
I also separately think that LW tends to reward people for being "capabilities cautious" more than is reasonable, and once you've made the decision to not specifically work towards advancing capabilities, then the capabilities externalities of your research probably don't matter ex ante.
But isn't most of the interpretability research happening from people who have not made this commitment? Anthropic, which is currently the biggest publisher of interp-research, clearly does not have a commitment to not work towards advancing capabilities, and it seems impor...
So you need a pretty strong argument that interp in particular is good for capabilities, which isn't borne out empirically and also doesn't seem that strong.
I think current interpretability has close to no capabilities externalities because it is not good yet, and delivers close to no insights into NN internals. If you had a good interpretability tool, which let you read off and understand e.g. how AlphaGo plays games to the extent that you could reimplement the algorithm by hand in C, and not need the NN anymore, then I would expect this to yield large ca...
"if you've built a powerful enough optimizer to automate scientific progress, your AI has to understand your conception of goodness to avoid having catastrophic consequences, and this requires making deep advances such that you're already 90% of the way to 'build an actual benevolent sovereign'."
I think this is just not true? Consider an average human, who understands goodness well enough to do science without catastrophic consequences, but is not a benevolent sovereign. One reason why they're not a sovereign is because they have high uncertainty about e.g. wha...
But from an outer alignment perspective, it's nontrivial to specify this such that, say, it doesn't convert all the earth to computronium running instances of google ad servers, and bots that navigate google clicking on ads all day.
But Google didn't want their AIs to do that, so if the AIs do that then the AIs weren't aligned. Same with the mind-hacking.
In general, your AI has some best guess at what you want it to do, and if it's aligned it'll do that thing. If it doesn't know what you meant, then maybe it'll make some mistakes. But the point is that a...
My shitty guess is that you're basically right that giving a finite set of programs infinite money can sort of be substituted for the theorem prover. One issue is that logical inductor traders have to be continuous, so you have to give an infinite family of programs "infinite money" (or just an increasing unbounded amount as eps -> 0)
I think if these axioms were inconsistent, then there wouldn't be a price at which no trades happen so the market would fail. Alternatively, if you wanted the infinities to cancel, then the market prices could just be whatever they wanted (b/c you would get infinite buys and sells for any price in (0, 1)).
I think competitiveness matters a lot even if there's only moderate amounts of competitive pressure. The gaps in efficiency I'm imagining are less "10x worse" and more like "I only had support vector machines and you had SGD"
humans, despite being fully general, have vastly varying ability to do various tasks; e.g. they're much better at climbing mountains than playing Go, it seems. Humans also routinely construct entire technology bases to enable them to do tasks that they cannot do themselves. This is, in some sense, a core human economic activity: the construction of artifacts that can do tasks better/faster/more efficiently than humans can themselves. It seems like by default, you should expect a similar dynamic with "fully general" AIs. That is, AIs trained to do semic...
Not literally the best, but retargetable algorithms are on the far end of the spectrum of "fully specialized" to "fully general", and I expect most tasks we train AIs to do to have heuristics that enable solving the tasks much faster than "fully general" algorithms, so there's decently strong pressure to be towards the "specialized" side.
I also think that heuristics are going to be closer to multiplicative speed ups than additive, so it's going to be closer to "general algorithms just can't compete" than "it's just a little worse". E.g. random search is te...
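A toy sketch of the multiplicative point (all specifics invented): on a 20-bit bit-matching objective, a greedy one-bit-flip heuristic reaches the optimum in about 21 evaluations, while fully general random search needs on the order of 2^20.

```python
import random

TARGET = [1] * 20  # objective: number of bits matching the target

def score(x):
    return sum(a == b for a, b in zip(x, TARGET))

def random_search(budget):
    """Fully general: sample random candidates, keep the best score."""
    best = 0
    for _ in range(budget):
        best = max(best, score([random.randint(0, 1) for _ in TARGET]))
    return best

def greedy(budget):
    """Specialized heuristic: flip one bit at a time, keep improvements."""
    x = [random.randint(0, 1) for _ in TARGET]
    evals = 1
    for i in range(len(x)):
        if evals >= budget:
            break
        flipped = x[:i] + [1 - x[i]] + x[i + 1:]
        evals += 1
        if score(flipped) > score(x):
            x = flipped
    return score(x)

# greedy(25) hits the optimum of 20; random_search(25) almost never does.
```

The gap grows multiplicatively with problem size: each extra bit doubles the budget random search needs, while it adds only one evaluation for the heuristic.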
One of the main reasons I expect this to not work is because optimization algorithms that are the best at optimizing some objective given a fixed compute budget seem like they basically can't be generally-retargetable. E.g. if you consider something like stockfish, it's a combination of search (which is retargetable), sped up by a series of very specialized heuristics that only work for winning. If you wanted to retarget stockfish to "maximize the max number of pawns you ever have", you would not be able to use [specialized for telling whether a mo...
Flagging that I don't think your description of what ELK is trying to do is that accurate, e.g. we explicitly don't think that you can rely on using ELK to ask your AI if it's being deceptive, because it might just not know. In general, we're currently quite comfortable with not understanding a lot of what our AI is "thinking", as long as we can get answers to a particular set of "narrow" questions we think is sufficient to determine how good the consequences of an action are. More in “Narrow” elicitation and why it might be sufficient.
Separately, I think ...
If powerful AIs are deployed in worlds mostly shaped by slightly less powerful AIs, you basically need competitiveness to be able to take any "pivotal action" because all the free energy will have been eaten by less powerful AIs.
Here's a conversation that I think is vaguely analogous:
Alice: Suppose we had a one-way function, then we could make passwords better by...
Bob: What do you want your system to do?
Alice: Well, I want passwords to be more robust to...
Bob: Don't tell me about the mechanics of the system. Tell me what you want the system to do.
Alice: I want people to be able to authenticate their identity more securely?
Bob: But what will they do with this authentication? Will they do good things? Will they do bad things?
Alice: IDK I just think the world is likely to be gener...
Bob: Oh OK, we're just going to create this user authentication technology and hope people use it for good?
Seems to me that the answer "I hope people will use it for good" is quite okay for authentication, but not okay for alignment. Doing good is outside the scope of authentication, but is kinda the point of alignment.
Isn't there an equilibrium where people assume other people's militaries are as strong as they can demonstrate, and people just fully disclose their military strength?
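A minimal sketch of that unraveling dynamic (toy numbers, my own framing): if observers skeptically assume any non-discloser is as weak as the weakest type still hidden, everyone above that floor prefers to disclose, and silence unravels into full disclosure.

```python
# Toy unraveling model: strengths are private; observers assume any
# non-discloser has the minimum strength consistent with silence.
strengths = [1, 2, 3, 4, 5]
disclosed = set()
changed = True
while changed:
    changed = False
    hidden = [s for s in strengths if s not in disclosed]
    if not hidden:
        break
    assumed = min(hidden)  # skeptical belief about non-disclosers
    for s in hidden:
        if s > assumed:  # stronger than assumed -> better to reveal
            disclosed.add(s)
            changed = True
# Everyone but the weakest ends up disclosing, and the weakest is then
# known by elimination: full disclosure is the equilibrium.
```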
See https://www.nickbostrom.com/aievolution.pdf for a discussion about why such arguments probably don't end up pushing timelines forward that much.
From my perspective, ELK is currently very much "a problem we don't know how to solve, where we think rapid progress is being made (as we're still building out the example-counterexample graph, and are optimistic that we'll find an example without counterexamples)". There's some question of what "rapid" means, but I think we're on track for what we wrote in the ELK doc: "we're optimistic that within a year we will have made significant progress either towards a solution or towards a clear sense of why the problem is hard."
We've spent ~9 months on the proble...
The official deadline for submissions is "before I check my email on the 16th", which I tend to do around 10 am PST.
The high-level reason is that the 1e12N model is not that much better at prediction than the 2N model. You can correct for most of the correlation even with only a vague guess at how different the AI and human probabilities are, and most AI and human probabilities aren't going to be that different in a way that produces a correlation the human finds suspicious. I think that the largest correlations are going to be produced by the places the AI and the human have the biggest differences in probabilities, which are likely also going to be the places where th...
I agree that i does slightly worse than t on consistency checks, but i also does better on other regularizers you're (maybe implicitly) using like speed/simplicity, so as long as i doesn't do too much worse it'll still beat out the direct translator.
One possible thing you might try is some sort of lexicographic ordering of regularization losses. I think this rapidly runs into other issues with consistency checks, like the fact that the human is going to be systematically wrong about some correlations, so i potentially is more consistent than t.
I think LaTeX renders if you're using the Markdown editor, but if you're using the other editor then it only works if you use the equation editor.
I feel mostly confused by the way that things are being framed. ELK is about the human asking for various poly-sized fragments and the model reporting what those actually were instead of inventing something else. The model should accurately report all poly-sized fragments the human knows how to ask for.
Like, the thing that seems weird to me here is that you can't simultaneously require that the elicited knowledge be 'relevant' and 'comprehensible' and also cover these sorts of obfuscated-debate-like scenarios.
I don't know what you mean by "relevant" or ...
I don’t think I understand your distinction between obfuscated and non-obfuscated knowledge. I generally think of non-obfuscated knowledge as NP or PSPACE. The human judgement of a situation might only theoretically require a poly-sized fragment of an exp-sized computation, but there’s no poly-sized proof that this poly-sized fragment is the correct fragment, and there are different poly-sized fragments for which the human will evaluate differently, so I think of ELK as trying to elicit obfuscated knowledge.
We generally assume that we can construct questions sufficiently well that there's only one unambiguous interpretation. We also generally assume that the predictor "knows" which world it's in, because it can predict how humans would respond to hypothetical questions about various situations involving diamonds and sensors, and that humans would say that, in theory, Q1 and Q2 could be different.
More concretely, our standard for judging proposals is exhibiting an unambiguous failure. If it was plausible you asked the wrong question, or the AI didn't know what you mean...
I think we would be trying to elicit obfuscated knowledge in ELK. In our examples, you can imagine that the predictor's Bayes net works "just because", so an argument that is convincing to a human for why the diamond is in the room has to involve arguing that the Bayes net is a good explanation of reality + arguing that it implies the diamond is in the room, which is the sort of "obfuscated" knowledge that debate can't really handle.
The dataset is generated with the human Bayes net, so it's sufficient to map to the human Bayes net. There is, of course, an infinite set of "human" simulators that use slightly different Bayes nets that give the same answers on the training set.
Does this mean that the method needs to work for ~arbitrary architectures, and that the solution must use substantially the same architecture as the original?
Yes, approximately. If you can do it for only e.g. transformers, but not other things, that would be interesting.
Does this mean that it must be able to deal with a broad variety of questions, so that we cannot simply sit down and think about how to optimize the model for getting a single question (e.g. "Where is the diamond?") right?
Yes, approximately. Thinking about how to get one question rig...
We generally imagine that it’s impossible to map the predictor's net directly to an answer because the predictor is thinking in terms of different concepts, so it has to map to the human's nodes first in order to answer human questions about diamonds and such.
The SmartFabricator seems basically the same. In the robber example, you might imagine the SmartVault is the one that puts up the screen to conceal the fact that it let the diamond get stolen.
Yes. The section "Strategy: have a human operate the SmartVault and ask them what happened" describes what I think you're asking about.
A different way of phrasing Ajeya's response, which I think is roughly accurate, is that if you have a reporter that gives consistent answers to questions, you've learned a fact about the predictor, namely "the predictor was such that when it was paired with this reporter it gave consistent answers to questions." If there were 8 predictors for which this fact was true, then "it's the [7th] predictor such that when it was paired with this reporter it gave consistent answers to questions" is enough information to uniquely determine the reporter, e.g. the previ...
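To put a number on the information-content point (purely illustrative): picking out one of 8 candidate predictors satisfying the consistency fact costs log2(8) = 3 bits, so the index into the consistent candidates itself carries a few bits of information.

```python
import math

# Specifying one predictor out of 8 consistent candidates requires
# log2(8) = 3 bits of information.
n_candidates = 8
index_bits = math.log2(n_candidates)  # 3.0
```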
There is a distinction between the way that the predictor is reasoning and the way that the reporter works. Generally, we imagine that the predictor is trained the same way the "unaligned benchmark" we're trying to compare to is trained, and the reporter is the thing that we add onto that to "align" it (perhaps by only training another head on the model, perhaps by finetuning). Hopefully, the cost of training the reporter is small compared to the cost of the predictor (maybe like 10% or something).
In this frame, doing anything to train the way the pred...
I think that problem 1 and problem 2 as you describe them are potentially talking about the same phenomenon. I'm not sure I'm understanding correctly, but I think I would make the following claims:
Thanks for your proposal! We have considered similar strategies in the past. The main points of the breaker response would be:
My point is either that:
Thanks for your proposal! I'm not sure I understand how the "human is happy with experiment" part is supposed to work. Here are some thoughts:
We don't think that real humans are likely to be using Bayes nets to model the world. We make this assumption for much the same reasons that we assume models use Bayes nets, namely that it's a test case where we have a good sense of what we want a solution to ELK to look like. We think the arguments given in the report will basically extend to more realistic models of how humans reason (or rather, we aren't aware of a concrete model of how humans reason for which the arguments don't apply).
If you think there's a specific part of the report where the human Bayes net assumption seems crucial, I'd be happy to try to give a more general form of the argument in question.
I think Luke told your mushroom story to me. Defs not a coincidence.