All of Mark Xu's Comments + Replies

I think competitiveness matters a lot even if there's only a moderate amount of competitive pressure. The gaps in efficiency I'm imagining are less "10x worse" and more like "I only had support vector machines and you had SGD".

Humans, despite being fully general, have vastly varying ability to do various tasks, e.g. they're much better at climbing mountains than playing Go, it seems. Humans also routinely construct entire technology bases to enable them to do tasks that they cannot do themselves. This is, in some sense, a core human economic activity: the construction of artifacts that can do tasks better/faster/more efficiently than humans can do themselves. It seems like by default, you should expect a similar dynamic with "fully general" AIs. That is, AIs trained to do semic... (read more)

4Daniel Kokotajlo5mo
Nobody is asking that the AI can also generalize to "optimize human values as well as the best available combination of skills it has otherwise..."; at least, I wasn't asking that. (At no point did I assume that fully general means 'equally good' at all tasks. I am not even sure such comparisons can be made.) But now rereading your comments it seems you were asking that all along, since you brought up competitiveness worries. So now maybe I understand you better: you are assuming a hypercompetitive takeoff in which, if there are AIs running around optimized to play the training game or something, and we use interpretability tools to intervene on some of them and make them optimize for long-run human values instead, they won't be as good at it as they were at playing the training game, even though they will be able to do it (compare: humans can optimize for constructing large cubes of iron, but they aren't as good at it as they are at optimizing for status), and so they'll lose competitions to the remaining AIs that haven't been modified? (My response to this would be: ah, this makes sense, but I don't expect there to be this much competition, so I'm not bothered by this problem. I think if we have the interpretability tools we'll probably be able to retarget the search of all relevant AIs, and then they'll optimize for human values inefficiently but well enough to save the day.)

Not literally the best, but retargetable algorithms are on the far end of the spectrum from "fully specialized" to "fully general", and I expect most tasks we train AIs to do to have heuristics that enable solving the tasks much faster than "fully general" algorithms, so there's decently strong pressure towards the "specialized" side.

I also think that heuristics are going to be closer to multiplicative speed ups than additive, so it's going to be closer to "general algorithms just can't compete" than "it's just a little worse". E.g. random search is te... (read more)
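
A minimal numeric sketch of the multiplicative-gap point (my own toy numbers, not Mark's): on a simple 20-dimensional quadratic, gradient descent reaches the target in a handful of steps, while uniform random search needs on the order of 1e22 samples in expectation.

```python
import math
import random

d, target = 20, 0.01          # minimize f(x) = ||x||^2 over [-1, 1]^d, want f(x) < 0.01
r = math.sqrt(target)         # equivalently, land inside a ball of radius 0.1

# Uniform random search: expected samples = 1 / P(a uniform point lands in the ball).
ball_vol = math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)
print(f"random search: ~{2 ** d / ball_vol:.1e} expected samples")

# Gradient descent: x <- x - 0.25 * grad f(x) halves the distance to the optimum each step.
x = [random.uniform(-1, 1) for _ in range(d)]
steps = 0
while sum(xi * xi for xi in x) >= target:
    x = [xi - 0.25 * 2 * xi for xi in x]
    steps += 1
print(f"gradient descent: {steps} steps")
```

The gap here is driven by dimensionality, so it grows multiplicatively rather than additively as the problem gets harder.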

4Daniel Kokotajlo5mo
OK, cool. How do you think generalization works? I thought the idea was that instead of finding a specific technique that only works on the data you were trained on, sufficiently big NNs trained on sufficiently diverse data end up finding more general techniques that work on that data + other data that is somewhat different. Generalization ability is a key metric for AGI, which I expect to go up before the end; like John said, the kinds of AI we care about are the kinds that are pretty good at generalizing, meaning that they ARE close to the "fully general" end of the spectrum, or at least close enough that whatever they are doing can be retargeted to lots of other environments and tasks besides the exact ones they were trained on. Otherwise, they wouldn't be AGI. Would you agree with that? I assume not...
3johnswentworth5mo
I basically buy that claim. The catch is that those specialized AIs won't be AGIs, for obvious reasons, and at the end of the day it's the AGIs which will have most of the X-risk impact.
Mark Xu5moΩ11167

One of the main reasons I expect this to not work is because optimization algorithms that are the best at optimizing some objective given a fixed compute budget seem like they basically can't be generally retargetable. E.g. if you consider something like Stockfish, it's a combination of search (which is retargetable), sped up by a series of very specialized heuristics that only work for winning. If you wanted to retarget Stockfish to "maximize the max number of pawns you ever have", you would not be able to use [specialized for telling whether a mo... (read more)
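
A toy sketch of that decomposition (the game, objectives, and pruning rule below are all illustrative inventions, not Stockfish internals): the search core is retargetable because the objective only enters through `evaluate`, but the compute savings come from a pruning heuristic that bakes in the original objective.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]   # toy "game": piles of stones; a move removes 1+ stones from one pile

def legal_moves(state: State) -> List[State]:
    moves = []
    for i, pile in enumerate(state):
        for take in range(1, pile + 1):
            moves.append(state[:i] + (pile - take,) + state[i + 1:])
    return moves

def search(state: State, depth: int, evaluate: Callable[[State], float],
           prune: Optional[Callable[[State, State], bool]] = None) -> float:
    """Generic depth-limited search; retargetable by swapping `evaluate`."""
    children = legal_moves(state)
    if prune is not None:
        kept = [c for c in children if not prune(state, c)]
        children = kept or children          # never prune away every move
    if depth == 0 or not children:
        return evaluate(state)
    return max(search(c, depth - 1, evaluate, prune) for c in children)

# A heuristic tuned for the original objective ("clear stones fast"): skip timid moves.
greedy_prune = lambda parent, child: (sum(parent) - sum(child)) < 2

clear_stones = lambda s: 0.0 - sum(s)   # original objective the heuristic was built for
hoard_stones = lambda s: float(sum(s))  # retargeted objective, analogous to "max pawns"

# For the objective it was designed for, pruning is harmless (same value, less search)...
print(search((4, 4), 3, clear_stones, greedy_prune), search((4, 4), 3, clear_stones))
# ...but under the retargeted objective the same heuristic throws away the best moves.
print(search((4, 4), 3, hoard_stones, greedy_prune), search((4, 4), 3, hoard_stones))
```

The retargeted search still runs; it just loses the specialized speedups (and here the heuristic actively hurts), which is the competitiveness worry.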

4Daniel Kokotajlo5mo
Are you saying that the AIs we train will be optimization algorithms that are literally the best at optimizing some objective given a fixed compute budget? Can you elaborate on why that is?
3Oliver Sourbut5mo
I agree. My comment here on Rohin and John's thread [https://www.lesswrong.com/posts/w4aeAFzSAguvqA5qu/how-to-go-from-interpretability-to-alignment-just-retarget?commentId=kFpADuaGpCDNsHq6N] is a poor attempt at saying something similar, but also observing that having the machinery to do the 'find the good heuristics' thing is itself a (somewhat necessary?) property of 'recursive-ish search' (at least of the flavour applicable to high-dimensional 'difficult' problem-spaces). In humans and animals I think this thing is something like 'motivated exploration' aka 'science' aka 'experimentation', plus magic abstraction-formation and -recomposition. I think it's worth trying to understand better how these pieces fit together, and to what extent these burdens can (or will) be overcome by compute and training scale.
4Evan R. Murphy5mo
This seems like a good argument against retargeting the search in a trained model turning out to be a successful strategy. But if we get to the point where we can detect such a search process in a model and what its target is, even if its efficiency is enhanced by specialized heuristics, doesn't that buy us a lot even without the retargeting mechanism? We could use that info about the search process to start over and re-train the model, modifying parameters to try and guide it toward learning the optimization target that we want it to learn. Re-training is far from cheap on today's large models, but you might not have to go through the entire training process before the optimizer emerges and gains a stable optimization target. This could allow us to iterate on the search target and verify once we have the one we want before having to deploy the model in an unsafe environment.

Flagging that I don't think your description of what ELK is trying to do is that accurate, e.g. we explicitly don't think that you can rely on using ELK to ask your AI if it's being deceptive, because it might just not know. In general, we're currently quite comfortable with not understanding a lot of what our AI is "thinking", as long as we can get answers to a particular set of "narrow" questions we think is sufficient to determine how good the consequences of an action are. More in “Narrow” elicitation and why it might be sufficient.

Separately, I think ... (read more)

9paulfchristiano7mo
I think that the sharp left turn is also relevant to ELK, if it leads to your system not generalizing from "questions humans can answer" to "questions humans can't answer." My suspicion is that our key disagreements with Nate are present in the case of solving ELK and are not isolated to handling high-stakes failures. (However it's frustrating to me that I can never pin down Nate or Eliezer on this kind of thing, e.g. are they still pessimistic if there were a low-stakes AI deployment in the sense of this post [https://www.alignmentforum.org/posts/TPan9sQFuPP6jgEJo/low-stakes-alignment]?)

If powerful AIs are deployed in worlds mostly shaped by slightly less powerful AIs, you basically need competitiveness to be able to take any "pivotal action" because all the free energy will have been eaten by less powerful AIs.

The humans presumably have access to the documents being summarized.

1Alexander Gietelink Oldenziel7mo
I see, thank you

Here's a conversation that I think is vaguely analogous:

Alice: Suppose we had a one-way function, then we could make passwords better by...

Bob: What do you want your system to do?

Alice: Well, I want passwords to be more robust to...

Bob: Don't tell me about the mechanics of the system. Tell me what you want the system to do.

Alice: I want people to be able to authenticate their identity more securely?

Bob: But what will they do with this authentication? Will they do good things? Will they do bad things?

Alice: IDK I just think the world is likely to be gener... (read more)

Some common issues with alignment plans, on Eliezer's account, include:

  • Someone stays vague about what task they want to align the AGI on. This lets them mentally plug in strong capabilities when someone objects 'but won't this make the AGI useless?', and weak capabilities when someone objects 'but won't this be too dangerous?', without ever needing a single coherent proposal that can check all the boxes simultaneously.
  • More generally, someone proposes a plan t
... (read more)
1Gerald Monroe7mo
Presumably those early time-sharing systems: "we need some way for a user to access their files but not other users'." So, passwords. Then later, at scale: "system administrators or people with root keep reading the password file and using it for bad acts later." So, the one-way hash. Password complexity requirements came from people rainbow-tabling the one-way hash file. None of the above was secure, so two-factor. People kept redirecting SMS, so authenticator apps on your phone... Each additional level of security was taken from a pool of preexisting ideas that academics and others had contributed, but it wasn't applied until it was clear it was needed.
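
A minimal sketch (using only Python's standard library) of the "one-way hash + salt" step in that progression: the server stores a per-user salt and a slow hash, so a leaked password file can't be read directly and precomputed rainbow tables don't apply.

```python
import hashlib
import hmac
import secrets

def hash_password(password, iterations=200_000):
    salt = secrets.token_bytes(16)   # random per-user salt defeats rainbow tables
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, digest              # store both; never store the password itself

def verify_password(password, salt, stored, iterations=200_000):
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return hmac.compare_digest(candidate, stored)   # constant-time comparison

salt, digest = hash_password("correct horse battery staple")
assert verify_password("correct horse battery staple", salt, digest)
assert not verify_password("guess", salt, digest)
```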

Bob: Oh OK, we're just going to create this user authentication technology and hope people use it for good?

Seems to me that the answer "I hope people will use it for good" is quite okay for authentication, but not okay for alignment. Doing good is outside the scope of authentication, but is kinda the point of alignment.

Isn't there an equilibrium where people assume other people's militaries are as strong as they can demonstrate, and people just fully disclose their military strength?

2[comment deleted]9mo
4antimonyanthony9mo
Sort of! This paper [https://arxiv.org/abs/2204.03484] (of which I’m a coauthor) discusses this “unraveling” argument, and the technical conditions under which it does and doesn’t go through. Briefly:
  • It’s not clear how easy it is to demonstrate military strength in the context of an advanced AI civilization, in a way that can be verified / can’t be bluffed. If I see that you’ve demonstrated high strength in some small war game, but my prior on you being that strong is sufficiently low, I’ll probably think you’re bluffing and wouldn’t be that strong in the real large-scale conflict.
  • Supposing strength can be verified, it might be intractable to do so without also disclosing vulnerable info (irrelevant to the potential conflict). As TLW's comment notes, the disclosure process itself might be really computationally expensive.
  • But if we can verifiably disclose, and I can either selectively disclose only the war-relevant info or I don’t have such a vulnerability, then yes you’re right, war can be avoided. (At least in this toy model where there’s a scalar “strength” variable; things can get more complicated in multiple dimensions, or where there isn’t an “ordering” to the war-relevant info.)
  • Another option (which the paper presents) is conditional disclosure: even if you could exploit me by knowing the vulnerable info, I commit to share my code if and only if you commit to share yours, play the cooperative equilibrium, and not exploit me.
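
A toy simulation of the unraveling argument under the idealized conditions above (scalar strength, verifiable and costless disclosure); this is my own sketch, not code from the paper. Silence gets read as "average of whoever is still silent", so the strongest silent agent always prefers to disclose, and disclosure cascades down to everyone but the weakest.

```python
def unravel(strengths):
    disclosed = set()
    while True:
        silent = [s for i, s in enumerate(strengths) if i not in disclosed]
        if not silent:
            break
        inferred = sum(silent) / len(silent)   # what observers assume about a silent agent
        newly = {i for i, s in enumerate(strengths)
                 if i not in disclosed and s > inferred}
        if not newly:
            break
        disclosed |= newly
    return disclosed

strengths = [1.0, 2.0, 3.0, 5.0, 8.0]
print(sorted(unravel(strengths)))   # everyone but the weakest discloses -> [1, 2, 3, 4]
```

The bullet points above are exactly about which of these idealizations (verifiability, costless and selective disclosure) fail for advanced AI civilizations.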
1TLW9mo
Demonstrating military strength is itself often a significant cost. Say your opponent has a military of strength 1.1x, and is demonstrating it. If you have the choice of keeping and demonstrating a military of strength x, or keeping a military of strength 1.2x and not demonstrating at all...

See https://www.nickbostrom.com/aievolution.pdf for a discussion about why such arguments probably don't end up pushing timelines forward that much.

From my perspective, ELK is currently very much "A problem we don't know how to solve, where we think rapid progress is being made (as we're still building out the example-counterexample graph, and are optimistic that we'll find an example without counterexamples)". There's some question of what "rapid" means, but I think we're on track for what we wrote in the ELK doc: "we're optimistic that within a year we will have made significant progress either towards a solution or towards a clear sense of why the problem is hard."

We've spent ~9 months on the proble... (read more)

4tailcalled1y
I continue to think that this is a mistake that locks out the most promising directions for solving it. [https://www.lesswrong.com/posts/QEYWkRoCn4fZxXQAY/prizes-for-elk-proposals?commentId=Dxmn8RaSkNJEZPvQz] It's a well-known constraint that models are generally underdetermined, so you need some sort of structural solution to this underdetermination, which you can't have if it must work for all models.

The official deadline for submissions is "before I check my email on the 16th", which I tend to do around 10 am PST.

Before I check my email on Feb 16th, which I will do around 10am PST.

The high-level reason is that the 1e12N model is not that much better at prediction than the 2N model. You can correct for most of the correlation even with only a vague guess at how different the AI and human probabilities are, and most AI and human probabilities aren't going to be that different in a way that produces a correlation the human finds suspicious. I think that the largest correlations are going to be produced by the places the AI and the human have the biggest differences in probabilities, which are likely also going to be the places where th... (read more)

I agree that i does slightly worse than t on consistency checks, but i also does better on other regularizers you're (maybe implicitly) using like speed/simplicity, so as long as i doesn't do too much worse it'll still beat out the direct translator.

One possible thing you might try is some sort of lexicographic ordering of regularization losses. I think this rapidly runs into other issues with consistency checks, like the fact that the human is going to be systematically wrong about some correlations, so i potentially is more consistent than t.
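
A minimal sketch of what a lexicographic ordering of regularization losses could look like (the names and numbers are illustrative, not ARC's actual setup): consistency loss is compared first, and speed/simplicity only breaks approximate ties, which is exactly where the "human is systematically wrong about some correlations" problem bites.

```python
def lexicographic_key(candidate, tol=1e-2):
    # bucket consistency loss to `tol`, then break ties on simplicity
    return (round(candidate["consistency_loss"] / tol), candidate["simplicity_loss"])

candidates = [
    {"name": "direct translator", "consistency_loss": 0.031, "simplicity_loss": 9.0},
    {"name": "human simulator",   "consistency_loss": 0.030, "simplicity_loss": 2.0},
]
best = min(candidates, key=lexicographic_key)
print(best["name"])   # "human simulator": nearly tied on consistency, so simplicity decides
```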

2Lanrian1y
Any articulable reason for why i just does slightly worse than t? Why would a 2N-node model fix a large majority of discrepancies between an N-node model and a 1e12*N-node model? I'd expect it to just fix a small fraction of them. Yeah, if you can get better-looking consistency than the direct translator in some cases, I agree that a sufficiently high consistency penalty will just push towards exploiting that (even if the intermediate model needs to be almost as large as the full predictor to exploit it properly). I'm curious whether you think this is the main obstacle. If we had a version of the correlation-consistency approach that always gave the direct translator minimal expected consistency loss, do we as-of-yet lack a counterexample for it?

I think LaTeX renders if you're using the markdown editor, but if you're using the other editor then it only works if you use the equation editor.

I feel mostly confused by the way that things are being framed. ELK is about the human asking for various poly-sized fragments and the model reporting what those actually were instead of inventing something else. The model should accurately report all poly-sized fragments the human knows how to ask for.

Like the thing that seems weird to me here is that you can't simultaneously require that the elicited knowledge be 'relevant' and 'comprehensible' and also cover these sorts of obfuscated debate like scenarios.

I don't know what you mean by "relevant" or ... (read more)

1A Ray1y
Thanks for taking the time to explain this! I think this is what I was missing. I was incorrectly thinking of the system as generating poly-sized fragments.

I don’t think I understand your distinction between obfuscated and non-obfuscated knowledge. I generally think of non-obfuscated knowledge as NP or PSPACE. The human judgement of a situation might only theoretically require a poly-sized fragment of an exp-sized computation, but there’s no poly-sized proof that this poly-sized fragment is the correct fragment, and there are different poly-sized fragments for which the human will evaluate differently, so I think of ELK as trying to elicit obfuscated knowledge.

1A Ray1y
So if there are different poly fragments that the human would evaluate differently, is ELK just "giving them a fragment such that they come to the correct conclusion" even if the fragment might not be the right piece? E.g. in the SmartVault case, if the screen was put in the way of the camera and the diamond was secretly stolen, we would still be successful even if we didn't elicit that fact, but instead elicited some poly fragment that got the human to answer disapprove? Like the thing that seems weird to me here is that you can't simultaneously require that the elicited knowledge be 'relevant' and 'comprehensible' and also cover these sorts of obfuscated debate like scenarios. Does it seem right to you that ELK is about eliciting latent knowledge that causes an update in the correct direction, regardless of whether that knowledge is actually relevant?

We would prefer submissions be private until February 15th.

We generally assume that we can construct questions sufficiently well that there's only one unambiguous interpretation. We also generally assume that the predictor "knows" which world it's in because it can predict how humans would respond to hypothetical questions about various situations involving diamonds and sensors and that humans would say in theory Q1 and Q2 could be different.

More concretely, our standard for judging proposals is exhibiting an unambiguous failure. If it was plausible you asked the wrong question, or the AI didn't know what you mean... (read more)

2TekhneMakre1y
>We also generally assume that the predictor "knows" which world it's in because it can predict how humans would respond to hypothetical questions about various situations

This seems like it doesn't disambiguate between the conditions assumed in a question being true, vs. just the human believing them. E.g. the predictor could predict that when asked "The camera is hacked so it looks like this [camera feeds making it seem like the diamond is still there], and the diamond is in the robber's pocket; is the diamond really in the room?", the human will answer "No!", not by understanding that by "diamond really in the room" the human means that the diamond is really in the room, but rather just by modeling the human as believing the premise of the question (that the diamond is in the pocket).

Edit: To elaborate, this condition on counterexamples is given in the ELK document: "The model understands the question. One sufficient condition is that the model can predict human answers to essentially arbitrary hypothetical questions in order to clarify the meaning of terms." I basically don't see how this condition constrains anything about the predictor. It seems like all it really says is that the predictor knows how humans talk. I don't see how it can be specifying that the AI's beliefs about how humans answer questions are related to reality, other than in the training set, where we assume that the human talk matches reality.

I don't see how it makes sense to think of this as the model "understanding the question". Normally I'd think of "understanding the question" as meaning "can have the same question". To have a question, you have a role that an answer could fulfill. But if the predictor is organized e.g. as a giant low-level Bayes net, then there's no role that could be filled by an answer to "where's the diamond". There might be a role for an answer to "where's the diamond", induced by how the rest of the AI makes use of the predictor, but that seems contingent a

I think we would be trying to elicit obfuscated knowledge in ELK. In our examples, you can imagine that the predictor's Bayes net works "just because", so an argument that is convincing to a human for why the diamond is in the room has to be arguing that the Bayes net is a good explanation of reality + arguing that it implies the diamond is in the room, which is the sort of "obfuscated" knowledge that debate can't really handle.

1A Ray1y
Okay, now I have to admit I am confused. Re-reading the ELK proposal, it seems like the latent knowledge you want to elicit is not-obfuscated. Like, the situation to solve is that there is a piece of non-obfuscated information, which, if the human knew it, would change their mind about approval. How do you expect solutions to elicit latent obfuscated knowledge (like 'the only true explanation is incomprehensible to the human' situations)?
1A Ray1y
Cool, this makes sense to me. My research agenda is basically about making a not-obfuscated model, so maybe I should just write that up as an ELK proposal then.

Note that this has changed to February 15th.

The dataset is generated with the human Bayes net, so it's sufficient to map to the human Bayes net. There is, of course, an infinite set of "human" simulators that use slightly different Bayes nets that give the same answers on the training set.

Does this mean that the method needs to work for ~arbitrary architectures, and that the solution must use substantially the same architecture as the original?

Yes, approximately. If you can do it for only e.g. transformers, but not other things, that would be interesting.

Does this mean that it must be able to deal with a broad variety of questions, so that we cannot simply sit down and think about how to optimize the model for getting a single question (e.g. "Where is the diamond?") right?

Yes, approximately. Thinking about how to get one question rig... (read more)

1tailcalled1y
I guess a closer analogy would be "What if the family of strategies only works for transformer-based GANs?" than "What if the family of strategies only works for transformers?". As in there'd be heavy restrictions on both the "layer types", the I/O, and the training procedure? What if each question/family of questions you want to answer requires careful work on the structure of the model? So the strategy does generalize, but it doesn't generalize "for free"?

We generally imagine that it’s impossible to map the predictor's net directly to an answer because the predictor is thinking in terms of different concepts, so it has to map to the human's nodes first in order to answer human questions about diamonds and such.

1berglund1y
I see, thanks for answering. To further clarify, given the reporter's only access to the human's nodes is through the human's answers, would it be equally likely for the reporter to create a mapping to some other Bayes net that is similarly consistent with the answers provided? Is there a reason why the reporter would map to the human's Bayes net in particular?

The SmartFabricator seems basically the same. In the robber example, you might imagine the SmartVault is the one that puts up the screen to conceal the fact that it let the diamond get stolen.

A different way of phrasing Ajeya's response, which I think is roughly accurate, is that if you have a reporter that gives consistent answers to questions, you've learned a fact about the predictor, namely "the predictor was such that when it was paired with this reporter it gave consistent answers to questions." If there were 8 predictors for which this fact was true, then "it's the [7th] predictor such that when it was paired with this reporter it gave consistent answers to questions" is enough information to uniquely determine the reporter, e.g. the previ... (read more)

1Quintin Pope1y
If you want, you can slightly refactor my proposal to include a reporter module that takes the primary model's hidden representations as input and outputs more interpretable representations for the student models to use. That would leave the primary model's training objective unchanged. However, I don't think this is a good idea for much the same reason that training just the classification head of a pretrained language model isn't a good idea.

However, I think training the primary model to be interpretable to other systems may actually improve economic competitiveness. The worth of a given approach depends on the ratio of capabilities to compute required. If you have a primary model whose capabilities are more easily distilled into smaller models, that's an advantage from a competitiveness standpoint. You can achieve better performance on cheaper models compared to competitors.

I think people are FAR too eager to assume a significant capabilities/interpretability tradeoff. In a previous post [https://www.lesswrong.com/posts/LHCSZbhbtoLpr7B7u/the-case-for-radical-optimism-about-interpretability], I used analogies to the brain to argue that there's enormous room to improve the interpretability of existing ML systems with little capabilities penalty.

To go even further, more interpretable internal representations may actually improve learning. ML systems face their own internal interpretability problems. To optimize a system, gradient descent needs to be able to disentangle which changes will benefit vs harm the system's performance. This is a form of interpretability, though not one we often consider. Being "interpretable to gradient descent" is very different from being "interpretable to humans". However, most of my proposal focuses on making the primary model generally interpretable to many different systems, with humans as a special case. I think being more interpretable may directly lead to being easier to optimize. Intuitively, it seems easier to improve a

There is a distinction between the way that the predictor is reasoning and the way that the reporter works. Generally, we imagine that the predictor is trained the same way the "unaligned benchmark" we're trying to compare to is trained, and the reporter is the thing that we add onto that to "align" it (perhaps by only training another head on the model, perhaps by finetuning). Hopefully, the cost of training the reporter is small compared to the cost of the predictor (maybe like 10% or something).

In this frame, doing anything to train the way the pred... (read more)

I think that problem 1 and problem 2 as you describe them are potentially talking about the same phenomenon. I'm not sure I'm understanding correctly, but I think I would make the following claims:

  • Our notion of narrowness is that we are interested in solving the problem where the question we're asking is such that a state always resolves a question. E.g. there isn't any ambiguity around whether a state "really contains a diamond". (Note that there is ambiguity around whether the human could detect the diamond from any set of observations because there co
... (read more)
1Ramana Kumar1y
This "there isn't any ambiguity"+"there is ambiguity" does not seem possible to me: these types of ambiguity are one and the same. But it might depend on what “any set of observations” is allowed to include. “Any set” suggests being very inclusive, but remember that passive observation is impossible. Perhaps the observations I’d want the human to use to figure out if the diamond is really there (presuming there isn’t ambiguity) would include observations you mean to exclude, such as disabling the filter-nanobots first? I guess a wrinkle here is that observations need to be “implementable” in the world. If we’re thinking of making observations as intervening on the world (e.g., to decide which sensors to query), then some observations may be inaccessible because we can’t make that intervention. Rewriting this all without relying on “possible”/”can” concepts would be instructive.
3Charlie Steiner1y
I think this statement encapsulates some worries I have. If it's important how the human defines a property like "the same diamond," then assuming that the sameness of the diamond is "out there in the diamond" will get you into trouble - e.g. if there's any optimization pressure to find cases where the specifics of the human's model rear their head. Human judgment is laden with the details of how humans model the world, you can't avoid dependence on the human (and the messiness that entails) entirely. Or to phrase it another way: I don't have any beef with a narrow approach that says "there's some set of judgments for which the human is basically competent, and we want to elicit knowledge relevant to those judgments." But I'm worried about a narrow approach that says "let's assume that humans are basically competent for all judgments of interest, and keep assuming this until something goes wrong." It just feels to me like this second approach is sort of... treating the real world as if it's a perturbative approximation to the platonic realm.

Thanks for your proposal! We have considered similar strategies in the past. The main points of the breaker response would be:

  • How are you training this "terrifier"? If the terrifier moves that can be made during training are from held-out sensors that you have constructed, then the "bad terrifier" you might learn doesn't generalize to thinking of any new sensors, only ever considering the sensors that you used as ground truth during training. Thus a sufficiently clever robber might trick all of the sensors-used-during-training (which may not exist during
... (read more)

My point is either that:

  • it will always be possible to find such an experiment for any action, even desirable ones, because the AI will have defended the diamond in a way the human didn't understand or the AI will have deduced some property of diamonds that humans thought they didn't have
  • or there will be some tampering for which it's impossible to find an experiment, because in order to avoid the above problem, you will have to restrict the space of experiments

Thanks for your proposal! I'm not sure I understand how the "human is happy with experiment" part is supposed to work. Here are some thoughts:

  • Eventually, it will always be possible to find experiments where the human confidently predicts wrongly. Situations I have in mind are ones where your AI understands the world far better than you, so can predict that e.g. combining these 1000 chemicals will produce self-replicating protein assemblages, whereas the human's best guess is going to be "combining 1000 random chemicals doesn't do anything"
  • If the human
... (read more)
1Ramana Kumar1y
Thanks for the reply! I think you’ve understood correctly that the human rater needs to understand the proposed experiment – i.e., be able to carry it out and have a confident expectation about the outcome – in order to rate the proposer highly.

Here’s my summary of your point: for some tampering actions, there are no experiments that a human would understand in the above sense that would expose the tampering. Therefore that kind of tampering will result in low value for the experiment proposer (who has no winning strategy), and get rated highly.

This is a crux for me. I don’t yet believe such tampering exists. The intuition I’m drawing on here is that our beliefs about what world we’re in need to cash out in anticipated experiences. Exposing confusion about something that shouldn’t be confusing can be a successful proposer strategy. I appreciate your examples of “a fake diamond that can only be exposed by complex imaging techniques” and “a human making subtly different moral judgements” and will ponder them further.

Your comment also helped me realise another danger of this strategy: to get the data for training the experiment proposer, we have to execute the SmartVault actions first. (Whereas I think in the baseline scheme they don’t have to be executed.)

We don't think that real humans are likely to be using Bayes nets to model the world. We make this assumption for much the same reasons that we assume models use Bayes nets, namely that it's a test case where we have a good sense of what we want a solution to ELK to look like. We think the arguments given in the report will basically extend to more realistic models of how humans reason (or rather, we aren't aware of a concrete model of how humans reason for which the arguments don't apply).

If you think there's a specific part of the report where the human Bayes net assumption seems crucial, I'd be happy to try to give a more general form of the argument in question.

Agreed, but the thing you want to use this for isn’t simulating a long reflection, which will fail (in the worst case) because HCH can’t do certain types of learning efficiently.

2johnswentworth1y
Once we get past Simulated Long Reflection, there's a whole pile of Things To Do With AI which strike me as Probably Doomed on general principles. You mentioned using HCH to "let humans be epistemically competitive with the systems we're trying to train", which definitely falls in that pile. We have general principles saying that we should definitely not rely on humans being epistemically competitive with AGI; using HCH does not seem to get around those general principles at all. (Unless we buy some very strong hypotheses about humans' skill at factorizing problems, in which case we'd also expect HCH to be able to simulate something long-reflection-like.) Trying to be epistemically competitive with AGI is, in general, one of the most difficult use-cases one can aim for. For that to be easier than simulating a long reflection, even for architectures other than HCH-emulators, we'd need some really weird assumptions.

I want to flag that HCH was never intended to simulate a long reflection. Its main purpose (which it fails in the worst case) is to let humans be epistemically competitive with the systems you're trying to train.

9johnswentworth1y
I mean, we have this thread [https://www.lesswrong.com/posts/PZtsoaoSLpKjjbMqM/the-case-for-aligning-narrowly-superhuman-models?commentId=4AoBCfmL2MHhfXwEz] with Paul directly saying "If all goes well you can think of it like 'a human thinking a long time'", plus Ajeya and Rohin both basically agreeing with that.

The way that you would think about NN anchors in my model (caveat that this isn't my whole model):

  • You have some distribution over 2020-FLOPS-equivalent that TAI needs.
  • Algorithmic progress means that 20XX-FLOPS convert to 2020-FLOPS-equivalent at some 1:N ratio.
  • The function from 20XX to the 1:N ratio is relatively predictable, e.g. a "smooth" exponential with respect to time (a toy numeric version of this model is sketched after this list).
  • Therefore, even though current algorithms will hit DMR, the transition to the next algorithm that has less DMR is also predictably going to be some constant ratio better at convert
... (read more)
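
A toy, made-up-numbers rendering of the model in the bullets above: fix a 2020-FLOP-equivalent requirement for TAI, let physical FLOP budgets grow, let algorithmic progress multiply how far each physical FLOP goes, and find the crossing year. All constants are illustrative assumptions, not Mark's numbers.

```python
TAI_2020_FLOP_EQUIV = 1e36      # assumed 2020-FLOP-equivalents needed for TAI
FLOP_2020 = 1e24                # assumed largest training run in 2020 (physical FLOPs)
HARDWARE_GROWTH = 10 ** 0.5     # assumed ~3.2x more physical FLOPs available per year
ALGO_PROGRESS = 2.0             # assumed 1:N conversion ratio doubles each year ("smooth" exponential)

def effective_2020_flops(year):
    t = year - 2020
    physical = FLOP_2020 * HARDWARE_GROWTH ** t
    ratio = ALGO_PROGRESS ** t          # one 20XX FLOP is worth `ratio` 2020-FLOPs
    return physical * ratio

year = 2020
while effective_2020_flops(year) < TAI_2020_FLOP_EQUIV:
    year += 1
print(year)   # first year the effective budget crosses the assumed threshold (~2035 here)
```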
2Vanessa Kosoy1y
I don't understand this.
  • What is the meaning of "2020-FLOPS-equivalent that TAI needs"? Plausibly you can't build TAI with 2020 algorithms without some truly astronomical amount of FLOPs.
  • What is the meaning of "20XX-FLOPS convert to 2020-FLOPS-equivalent"? If 2020 algorithms hit DMR, you can't match a 20XX algorithm with a 2020 algorithm without some truly astronomical amount of FLOPs. Maybe you're talking about extrapolating the compute-performance curve, assuming that it stays stable across algorithmic paradigms (although, why would it??) However, in this case, how do you quantify the performance required for TAI? Do we have "real life elo" for modern algorithms that we can compare to human "real life elo"? Even if we did, this is not what Cotra is doing with her "neural anchor".
Mark Xu1yΩ1416

My model is something like:

  • For any given algorithm, e.g. SVMs, AlphaGo, alpha-beta pruning, convnets, etc., there is an "effective compute regime" where dumping more compute makes them better. If you go above this regime, you get steep diminishing marginal returns (a toy illustration follows this list).
  • In the (relatively small) regimes of old algorithms, new algorithms and old algorithms perform similarly. E.g. with small amounts of compute, using AlphaGo instead of alpha-beta pruning doesn't get you that much better performance than like an OOM of compute (I have no idea if this is true, ex
... (read more)
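
A toy illustration (my own construction, illustrative numbers) of the effective-compute-regime picture above: each algorithm gains roughly linearly in log-compute up to the end of its regime and then hits steep diminishing returns, but the envelope over successive algorithms keeps rising smoothly, and at small compute old and new algorithms look similar.

```python
def algo_performance(log_compute, regime_end):
    # linear gains in log-compute inside the regime, then steep diminishing returns
    if log_compute <= regime_end:
        return log_compute
    return regime_end + 0.1 * (log_compute - regime_end)

regime_ends = [12, 16, 20, 24]   # stand-ins for successive algorithms (purely illustrative)

for log_c in range(10, 27, 2):
    per_algo = [algo_performance(log_c, r) for r in regime_ends]
    print(log_c, [round(p, 1) for p in per_algo], "envelope:", round(max(per_algo), 1))
```

Extrapolating the envelope is what makes compute-based forecasts look smooth even though any single algorithm hits DMR.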
5Vanessa Kosoy1y
Hmm... Interesting. So, this model says that algorithmic innovation is so fast that it is not much of a bottleneck: we always manage to find the best algorithm for given compute relatively quickly after this compute becomes available. Moreover, there is some smooth relation between compute and performance, assuming the best algorithm for this level of compute. [EDIT: The latter part seems really suspicious though, why would this relation persist across very different algorithms?] Or at least this is true if "best algorithm" is interpreted to mean "best algorithm out of some wide class of algorithms s.t. we never or almost never managed to discover any algorithm outside of this class".

This can justify biological anchors as upper bounds[1]: if biology is operating using the best algorithm then we will match its performance when we reach the same level of compute, whereas if biology is operating using a suboptimal algorithm then we will match its performance earlier. However, how do we define the compute used by biology? Moravec's estimate is already in the past and there's still no human-level AI. Then there is the "lifetime" anchor from Cotra's report which predicts a very short timeline. Finally, there is the "evolution" anchor which predicts a relatively long timeline.

However, in Cotra's report most of the weight is assigned to the "neural net" anchors, which talk about the compute for training an ANN of brain size using modern algorithms (plus there is the "genome" anchor in which the ANN is genome-sized). This is something that I don't see how to justify using Mark's model. On Mark's model, modern algorithms might very well hit diminishing returns soon, in which case we will switch to different algorithms which might have a completely different compute(parameter count) function.

1. Assuming evolution also cannot discover algorithms outside our class o

In general, Baumol-type effects (spending decreasing in sectors where productivity goes up) mean that we can have scenarios in which the economy is growing extremely fast on "objective" metrics like energy consumption, but GDP has stagnated because that energy is being spent on extremely marginal increases in goods being bought and sold.

A similar point is made by Korinek in his review of Could Advanced AI Drive Explosive Economic Growth:

My first reaction to the framing of the paper is to ask: growth in what? It’s important to keep in mind that concepts like “gross domestic product” and “world gross domestic product” were defined from an explicit anthropocentric perspective - they measure the total production of final goods within a certain time period. Final goods are what is either consumed by humans (e.g. food or human services) or what is invested into “capital goods” that last for m

... (read more)
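
A toy two-sector illustration of the Baumol point above (my construction, with made-up numbers): the "automated" sector grows 30%/year in physical output while the stagnant sector grows 1%/year; as the spending share shifts toward the stagnant sector, chained real GDP growth falls toward 1% even though an "objective" metric tracking the automated sector keeps exploding.

```python
g_fast, g_slow = 0.30, 0.01    # sector output growth rates
share_fast = 0.30              # initial share of spending on the fast (automatable) sector
physical_output = 1.0          # e.g. energy throughput of the fast sector

for year in range(1, 31):
    gdp_growth = share_fast * g_fast + (1 - share_fast) * g_slow   # Divisia-style approximation
    physical_output *= 1 + g_fast
    share_fast *= 0.9          # assumed: demand saturates, so the fast sector's spending share shrinks
    if year % 10 == 0:
        print(year, f"GDP growth ~{gdp_growth:.1%}", f"fast-sector output x{physical_output:,.0f}")
```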

Yeah that seems like a reasonable example of a good that can't be automated.

I think I'm mostly interested in whether these sorts of goods that seem difficult to automate will be a pragmatic constraint on economic growth. It seems clear that they'll eventually be ultimate binding constraints as long as we don't get massive population growth, but it's a separate question about whether or not they'll start being constraints early enough to prevent rapid AI-driven economic growth.

1Aaron Bergman2y
Good idea, think I will.

My house implemented such a tax.

Re 1, we ran into some of the issues Matthew brought up, but all other COVID policies are implicitly valuing risk at some dollar amount (possibly inconsistently), so the Pigouvian tax seemed like the best option available.
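
A minimal sketch of how such a house Pigouvian tax can be computed (illustrative numbers, not our house's actual parameters): pick an explicit dollar value on one housemate catching COVID, convert it to a price per microCOVID, and charge each activity by the risk it imposes on the house.

```python
VALUE_OF_HOUSE_CASE = 20_000     # assumed dollars the house places on one infection
PRICE_PER_MICROCOVID = VALUE_OF_HOUSE_CASE / 1_000_000   # = $0.02 per microCOVID

def activity_tax(microcovids):
    """Tax owed to the house pool for an activity imposing `microcovids` of risk."""
    return microcovids * PRICE_PER_MICROCOVID

print(activity_tax(300))   # e.g. an indoor dinner estimated at 300 microCOVIDs -> 6.0 dollars
```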

2Rohin Shah2y
Nice! And yeah, that matches my experience as well.

I'd be interested to see the rest of this list, if you're willing to share.

2gianlucatruda2y
I'll DM you :)