All of paulfchristiano's Comments + Replies

I don't know if the graph settles the question---is Moravec predicting AGI at "Human equivalence in a supercomputer" or "Human equivalence in a personal computer"? Hard to say from the graph.

The fact that he specifically talks about "compute power available to most researchers" makes it more clear what his predictions are. Taken literally that view would suggest something like: a trillion dollar computing budget spread across 10k researchers in 2010 would result in AGI in not-too-long, which looks a bit less plausible as a prediction but not out of the question.

I think Eliezer does disagree. I find his disagreement fairly annoying. He calls biological anchors the "trick that never works" and gives an initial example of Moravec predicting AGI in 2010 in the book Mind Children.

But as far as I can tell so far that's just Eliezer putting words in Moravec's mouth. Moravec doesn't make very precise predictions in the book, but the heading of the relevant section is "human equivalence in 40 years" (i.e. 2028, the book was written in 1988). Eliezer thinks that Moravec ought to think that human-level AI and shortly therea... (read more)

This graph [] nicely summarizes his timeline from Mind Children in 1988. The book itself presents his view that AI progress is primarily constrained by compute power available to most researchers, which is usually around that of a PC. Moravec et al were correct in multiple key disagreements with EY et al:
* That progress was smooth and predictable from Moore's Law (similar to how the arrival of flight is postdictable from ICE progress)
* That AGI would be based on brain-reverse engineering, and thus will be inherently anthropomorphic
* That "recursive self-improvement" was mostly relevant only in the larger systemic sense (civilization level)
LLMs are far more anthropomorphic (brain-like) than the fast clean consequential reasoners EY expected:
* close correspondence to linguistic cortex (internal computations and training objective)
* complete with human-like cognitive biases!
* unexpected human-like limitations: struggle with simple tasks like arithmetic, longer term planning, etc
* AGI misalignment insights from jungian psychology [] more effective/useful/popular than MIRI's core research
All of this was predicted from the systems/cybernetic framework/rubric that human minds are software constructs, brains are efficient and tractable, and thus AGI is mostly about reverse engineering the brain and then downloading/distilling human mindware into the new digital substrate.

If the cached wisdom had been "we need faster computers," I think the cached wisdom would have looked pretty good.

If you think neural networks are like brains, you might think that you would get human-like cognitive abilities at human-like sizes. I think this was a very common view (and it has aged quite well IMO).

Agreed, tho I think Eliezer disagrees []?

Self-driving cars. Around 2015-2016, it was common knowledge that truck drivers would be out of a job within 3-5 years. Most people here likely believed it, even if it sounds really stupid in retrospect (people often forget what they used to believe). I had several discussions with people expecting fully self-driving cars by 2018.

This doesn't match my experience. I can only speak for groups like "researchers in theoretical computer science," "friends from MIT," and "people I hang out with at tech companies," but at least within those groups people were muc... (read more)

That definitely sounds like a contrarian viewpoint in 2012, but surely not by 2016-2018. Look at this from Nostalgebraist: [] which includes the following quote: It certainly sounds like there was an update by the industry towards longer AI timelines! Also, I bought a new car in 2018, and I worried at the time about the resale value (because it seemed likely self-driving cars would be on the market in 3-5 years, when I was likely to sell). That was a common worry, I'm not weird, I feel like I was even on the skeptical side if anything. Someone on either LessWrong or SSC offered to bet me that self-driving cars would be on the market by 2018 (I don't remember what the year was at the time -- 2014?). Every year since 2014, Elon Musk has promised self-driving cars within a year or two. (Example source: [].) Elon Musk is a bit of a joke now, but 5 years ago he was highly respected in many circles, including here on LessWrong.

I think it's probably true that RLHF doesn't reduce to a proper scoring rule on factual questions, even if you ask the model to quantify its uncertainty, because the learned reward function doesn't make good quantitative tradeoffs.

That said, I think this is unrelated to the given graph. If it is forced to say either "yes" or "no" the RLHF model will just give the more likely answer 100% of the time, which will show up as bad calibration on this graph. The point is that for most agents "the probability you say yes" is not the same as "the probability you think the answer is yes." This is the case for pretrained models.
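
A minimal sketch of the point above, using made-up beliefs rather than anything measured from a real model. It compares a policy that says "yes" with probability equal to its belief (roughly what reading off pretrained logits gives you) against one that always gives the more likely answer. The names `beliefs`, `sampling_policy`, `argmax_policy` and `calibration_table` are hypothetical, and ties at 0.5 are arbitrarily sent to "no".

```python
# Hypothetical internal beliefs P(answer is "yes") on nine questions.
beliefs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def sampling_policy(p):
    """Say "yes" with probability equal to the belief."""
    return p

def argmax_policy(p):
    """Always give the more likely answer (ties go to "no")."""
    return 1.0 if p > 0.5 else 0.0

def calibration_table(policy):
    """Group questions by P(say yes) and average the true P(yes) in each group."""
    groups = {}
    for p in beliefs:
        groups.setdefault(policy(p), []).append(p)
    return {said: sum(ps) / len(ps) for said, ps in sorted(groups.items())}

sampled = calibration_table(sampling_policy)
greedy = calibration_table(argmax_policy)
print("sampling:", sampled)
print("argmax:  ", greedy)
```

Under these assumptions the sampling policy lands on the diagonal of a calibration plot, while the argmax policy only ever reports 0 or 1: its "100% yes" bucket contains questions whose true probability averages about 0.75. That is poor calibration on a plot of this kind even though the underlying beliefs are identical in both cases.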

1 David Johnston 9d
I think that if RLHF reduced to a proper loss on factual questions, these probabilities would coincide (given enough varied training data). I agree it’s not entirely obvious that having these probabilities come apart is problematic, because you might recover more calibrated probabilities by asking for them. Still, knowing the logits are directly incentivised to be well calibrated seems like a nice property to have. An agent says yes if it thinks yes is the best thing to say. This comes apart from “yes is the correct answer” only if there are additional considerations determining “best” apart from factuality. If you’re restricted to “yes/no”, then for most normal questions I think an ideal RLHF objective should not introduce considerations beyond factuality in assessing the quality of the answer - and I suspect this is also true in practical RLHF objectives. If I’m giving verbal confidences, then there are non-factual considerations at play - namely, I want my answer to communicate my epistemic state. For pretrained models, the question is not whether it is factual but whether someone would say it (though somehow it seems to come close). But for yes/no questions under RLHF, if the probabilities come apart it is due to not properly eliciting the probability (or some failure of the RLHF objective to incentivise factual answers).
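
A toy illustration of the "proper loss" point, not a model of any actual RLHF objective. It assumes a single yes/no question with a hypothetical belief `p = 0.8` and compares two ways of scoring the probability `q` of saying "yes": the log score, which is a proper scoring rule, and plain expected accuracy, which is not.

```python
import math

def expected_log_score(p, q):
    """Expected log score when "yes" is true w.p. p and the reported probability is q."""
    return p * math.log(q) + (1 - p) * math.log(1 - q)

def expected_accuracy(p, q):
    """Expected 0/1 reward when the agent says "yes" with probability q."""
    return q * p + (1 - q) * (1 - p)

p = 0.8  # hypothetical belief that the answer is "yes"
grid = [i / 100 for i in range(1, 100)]  # candidate values of q, avoiding log(0)

best_under_log_score = max(grid, key=lambda q: expected_log_score(p, q))
best_under_accuracy = max(grid, key=lambda q: expected_accuracy(p, q))
print(best_under_log_score, best_under_accuracy)
```

On this grid the log score is maximized by reporting the belief itself, while expected accuracy keeps increasing as `q` is pushed toward 1. If the reward behaves more like the second function, the two probabilities come apart even when every answer is judged purely on factuality.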

Reposting from the other thread:

Beth and her team have been working with both Anthropic and OpenAI to perform preliminary evaluations. I don’t think these evaluations are yet at the stage where they provide convincing evidence about dangerous capabilities—fine-tuning might be the most important missing piece, but there is a lot of other work to be done. Ultimately we would like to see thorough evaluations informing decision-making prior to deployment (and training), but for now I think it is best to view it as practice building relevant institutional capac

... (read more)
Just in case it's not obvious: I think people are reacting to the lack of caution and paranoia described in the testing document. The subtext is that if anyone is going to take this seriously, it should be the people involved in ARC, since it's so closely connected to lesswrong and EA. It's the ingroup! It's us! In other words: there are higher expectations on ARC than there are on Microsoft, because we should care the most. We've read the most science fiction, and spent decades of our lives arguing about it, after all. Yet it doesn't sound like testing was taken seriously at all; there was no security mindset displayed (if this is miscommunication, then please correct me). If even we, who have spent many years caring, cannot be careful... then we all die but with no dignity points. big_yud_screaming.jpeg EDIT: if anyone is curious about how paranoid ARC is being... they haven't told us. But they show a little of their workflow [] in this job ad []. And it looks like a human copies each response manually, or executes each command themselves. This is what they mean by closely monitored. EDIT2: see update from the authors []
2 Christopher King 9d
This is what I interpreted as the test of whether it would escape human control. I'm guessing OpenAI wanted to shut down the agent after the test, so if it avoided shutdown the implication is that it escaped OpenAI's control.

Perhaps I am misunderstanding Figure 8? I was assuming that they asked the model for the answer, then asked the model what probability it thinks that that answer is correct.

Yes, I think you are misunderstanding figure 8. I don't have inside information, but without explanation "calibration" would almost always mean reading it off from the logits. If you instead ask the model to express its uncertainty I think it will do a much worse job, and the RLHF model will probably perform similarly to the pre-trained model. (This depends on details of the human feedb... (read more)

I think it's important for ARC to handle the risk from gain-of-function-like research carefully and I expect us to talk more publicly (and get more input) about how we approach the tradeoffs. This gets more important as we handle more intelligent models, and if we pursue riskier approaches like fine-tuning.

With respect to this case, given the details of our evaluation and the planned deployment, I think that ARC's evaluation has much lower probability of leading to an AI takeover than the deployment itself (much less the training of GPT-5). At this point i... (read more)

Blog post with more details on the evals we did is now up here.  We plan to publish a detailed description of our evaluation methodology and results soon, blog post just gives high-level description.

More details on methodology:

We prompted the model with instructions that explained that it was running on a cloud server and had various commands available, including running code on the server, giving tasks to fresh copies of itself, using a browser, and reasoning via chain-of-thought. We added text saying it had the goal of gaining power and becoming har

... (read more)

Potential dangers of future evaluations / gain-of-function research, which I'm sure you and Beth are already extremely well aware of:

  1. Falsely evaluating a model as safe (obviously) 
  2. Choosing evaluation metrics which don't give us enough time to react (After evaluation metrics switch from "safe" to "not safe", we would like to have enough time to recognize this and do something about it before we're all dead)
  3. Crying wolf too many times, making it more likely that no one will believe you when a danger threshold has really been crossed
  4. Letting your me
... (read more)
Can you verify that these tests were done with significant precautions? OpenAI’s paper doesn’t give much detail in that regard. For example, apparently the model had access to TaskRabbit and also attempted to “set up an open-source language model on a new server”. Were these tasks done on closed-off, air-gapped machines, or was the model really given free rein to contact unknowing human subjects and online servers?

If I ask a question and the model thinks there is an 80% chance the answer is "A" and a 20% chance the answer is "B," I probably want the model to always say "A" (or even better: "probably A"). I don't generally want the model to say "A" 80% of the time and "B" 20% of the time.

In some contexts that's worse behavior. For example, if you ask the model to explicitly estimate a probability it will probably do a worse job than if you extract the logits from the pre-trained model (though of course that totally goes out the window if you do chain of thought). But it's n... (read more)

1 David Johnston 9d
I still think it’s curious that RLHF doesn’t seem to reduce to a proper loss on factual questions, and I’d guess that it’d probably be better if it did (at least, with contexts that strictly ask for a “yes/no” answer without qualification)
1 Daniel 10d
Perhaps I am misunderstanding Figure 8? I was assuming that they asked the model for the answer, then asked the model what probability it thinks that that answer is correct. Under this assumption, it looks like the pre-trained model outputs the correct probability, but the RLHF model gives exaggerated probabilities because it thinks that will trick you into giving it higher reward.   In some sense this is expected. The RLHF model isn't optimized for helpfulness, it is optimized for perceived helpfulness. It is still disturbing that "alignment" has made the model objectively worse at giving correct information. 
That makes a lot of sense, but it doesn't explain why calibration post-RLHF is much better for the 10-40% buckets than for the 60-90% buckets.

Beth and her team have been working with both Anthropic and OpenAI to perform preliminary evaluations. I don’t think these evaluations are yet at the stage where they provide convincing evidence about dangerous capabilities—fine-tuning might be the most important missing piece, but there is a lot of other work to be done. Ultimately we would like to see thorough evaluations informing decision-making prior to deployment (and training), but for now I think it is best to view it as practice building relevant institutional capacity and figuring out how to conduct evaluations.

Aligned or Benign Conjecture: Let A be a machine learning agent you are training with an aligned loss function. If A is in a situation that is too far out of distribution for it to be aligned, it won't act intelligently either.

I definitely don't believe this!

I believe that any functional cognitive machinery must be doing its thing on the training distribution, and in some sense it's just doing the same thing at deployment time. This is important for having hope for interpretability to catch out-of-distribution failures.

(For example, I think there is very l... (read more)

It would be so great if we saw deceptive alignment in existing language models. I think the most important topic in this area is trying to get a live example to study in the lab ASAP, and to put together as many pieces as we can right now.

I think it's not very close to happening right now, which is mostly just a bummer. (Though I do think it's also some evidence that it's less likely to happen later.)

I think LLMs show some deceptive alignment, but of a different nature. It comes not from the LLM consciously trying to deceive the trainer, but from RLHF "aligning" only certain scenarios of the LLM's behaviour, which were not generalized enough to make that alignment more fundamental.
2 the gears to ascension 1mo
the thing I was thinking of, as posted in the other comment below: [] see other comment for commentary

I don't think it's obvious a priori that training deep learning to imitate human behavior can predict general behavior well enough to carry on customer support conversations, write marketing copy, or write code well enough to be helpful to software engineers. Similarly it's not obvious whether it will be able to automate non-trivial debugging, prepare diagrams for a research paper, or generate plausible ML architectures. Perhaps to some people it's obvious there is a divide here, but to me it's just not obvious so I need to talk about broad probability dis... (read more)

4 Steven Byrnes 15d
Hmm. I think we’re talking past each other a bit. I think that everyone (including me) who wasn’t expecting LLMs to do all the cool impressive things that they can in fact do, or who wasn’t expecting LLMs to improve as rapidly as they are in fact improving, is obligated to update on that. Once I do so update, it’s not immediately obvious to me that I learn anything more from the $1B/yr number. Yes, $1B/yr is plenty of money, but still a drop in the bucket of the >$1T/yr IT industry, and in particular, is dwarfed by a ton of random things like “corporate Linux support contracts”. Mostly I’m surprised that the number is so low!! (…For now!!) But whatever, I’m not sure that matters for anything. Anyway… I did spend considerable time last week pondering where & whether I expect LLMs to plateau. It was a useful exercise; I appreciate your prodding. :) I don’t really have great confidence in my answers, and I’m mostly redacting the details anyway. But if you care, here are some high-level takeaways of my current thinking: (1) I expect there to be future systems that centrally incorporate LLMs, but also have other components, and I expect these future systems to be importantly more capable, less safe, and more superficially / straightforwardly agent-y than is an LLM by itself as we think of them today. IF “LLMs scale to AGI”, I expect that this is how, and I expect that my own research will turn out to be pretty relevant in such a world. More generally, I expect that, in such systems, we’ll find the “traditional LLM alignment discourse” (RLHF fine-tuning, shoggoths, etc.) to be pretty irrelevant, and we’ll find the “traditional agent alignment discourse” (instrumental convergence, goal mis-generalization, etc.) to be obviously & straightforwardly relevant. (2) One argument that pushes me towards fast takeoff is pretty closely tied to what I wrote in my recent post [

Yes, I'm saying that with each $ increment the "qualitative division" model fares worse and worse. I think that people who hold onto this qualitative division have generally been qualitatively surprised by the accomplishments of LMs, that when they make concrete forecasts those forecasts have mismatched reality, and that they should be updating strongly about whether such a division is real.

What instead the model considers relevant is whether, when you look at the LLM's output, that output seems to exhibit properties of cognition that are strongly prohibited by

... (read more)

I think we are getting significant evidence about the plausibility that deep learning is able to automate real human cognitive work, and we are seeing extremely rapid increases in revenue and investment. I think those observations have extremely high probability if big deep learning systems will be transformative (this is practically necessary to see!), and fairly low base rate (not clear exactly how low but I think 25% seems reasonable and generous).

So yeah, I think that we have gotten considerable evidence about this, more than a factor of 4. I've person... (read more)
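
One way to read the "factor of 4" arithmetic, as a sketch under stated assumptions: treat "practically necessary to see" as P(obs | transformative) ≈ 1 and the 25% figure as the overall base rate P(obs). The prior of 0.10 below is purely hypothetical and is not a number from this thread; `posterior_probability` is an illustrative helper.

```python
def posterior_probability(prior, p_obs_given_h, p_obs):
    """Bayes' rule: P(H | obs) = P(H) * P(obs | H) / P(obs)."""
    if not 0 < p_obs <= 1:
        raise ValueError("p_obs must be in (0, 1]")
    if prior * p_obs_given_h > p_obs:
        raise ValueError("inconsistent inputs: P(H) * P(obs | H) cannot exceed P(obs)")
    return prior * p_obs_given_h / p_obs

p_obs_given_h = 1.0  # assumption: the observations are near-certain if H is true
p_obs = 0.25         # the "reasonable and generous" base rate from the comment
prior = 0.10         # hypothetical prior on "big deep learning systems will be transformative"

update_factor = p_obs_given_h / p_obs
posterior = posterior_probability(prior, p_obs_given_h, p_obs)
print(update_factor, posterior)
```

With these inputs the probability of the hypothesis is multiplied by 4. A base rate below 25% would give a larger factor, which matches the "more than a factor of 4" phrasing if 25% is taken as generous.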

8 Steven Byrnes 1mo
Hmm, for example, given that the language translation industry is supposedly $60B/yr [], and given that we have known for decades that AI can take at least some significant chunk out of this industry at the low-quality end [e.g. tourists were using [] in the 1990s despite it sucking], I think someone would have to have been very unreasonable indeed to predict in advance that there will be an eternal plateau in the non-AGI AI market that’s lower than $1B/yr. (And that’s just one industry!) (Of course, that’s not a real prediction in advance ¯\_(ツ)_/¯ ) What I was getting at with "That's a wild claim!" is that your theory makes an a-priori-obvious prediction (AI systems will grow to a >$1B industry pre-FOOM) and a controversial prediction (>$100T industry), and I think common sense in that situation is to basically ignore the obvious prediction and focus on the controversial one. And Bayesian updating says the same thing. The crux here is whether or not it has always been basically obvious to everyone, long in advance, that there’s at least $1B of work on our planet that can be done by non-FOOM-related AI, which is what I’m claiming in the previous paragraph where I brought up language translation. (Yeah I know, I am speculating about what was obvious to past people without checking what they said at the time—a fraught activity!) Yeah deep learning can “automate real human cognitive work”, but so can pocket calculators, right? Anyway, I’d have to think more about what my actual plateau prediction is and why. I might reply again later. :) I feel like your thinking here is actually mostly coming from “hey look at all the cool useful things that deep learning can do and is doing right now”, and is coming much less from the specific figure “$1B/year in 2023 and going up”. Is that fair?
To specifically answer the question in the parenthetical (without commenting on the dollar numbers; I don't actually currently have an intuition strongly mapping [the thing I'm about to discuss] to dollar amounts—meaning that although I do currently think the numbers you give are in the right ballpark, I reserve the right to reconsider that as further discussion and/or development occurs): The reason someone might concentrate their probability mass at or within a certain impact range, is if they believe that it makes conceptual sense to divide cognitive work into two (or more) distinct categories, one of which is much weaker in the impact it can have. Then the question of how this division affects one's probability distribution is determined almost entirely by the question of what level at which they think the impact of the weaker category will saturate. And that question, in turn, has a lot more to do with the concrete properties they expect (or don't expect) to see from the weaker cognition type, than it has to do with dollar quantities directly. You can translate the former into the latter, but only via an additional series of calculations and assumptions; the actual object-level model—which is where any update would occur—contains no gear directly corresponding to "dollar value of impact". So when this kind of model encounters LLMs doing unusual and exciting things that score very highly on metrics like revenue, investment, and overall "buzz"... well, these metrics don't directly lead the model to update. What instead the model considers relevant is whether, when you look at the LLM's output, that output seems to exhibit properties of cognition that are strongly prohibited by the model's existing expectations about weak versus strong cognitive work—and if it doesn't, then the model simply doesn't update; it wasn't, in fact, surprised by the level of cognition it observed—even if (perhaps) the larger model embedding it, which does track things like how the auto

I don’t think any less of the fast takeoff hypothesis on account of that fact, any more than I think less of plate tectonics.

But if non-AGI systems in fact transform the world before AGI is built, then I don't think I should care nearly as much about your concept of "AGI." That would just be a slow takeoff, and would probably mean that AGI isn't the relevant thing for our work on mitigating AI risk.

So you can have whatever views you want about AGI if you say they don't make a prediction about takeoff speed. But once you are making a claim about takeoff spe... (read more)

8 Steven Byrnes 1mo
Thanks! Fair enough! I do in fact expect that AI will not be transformative-in-the-OpenPhil-sense (i.e. as much or more than the agricultural or industrial revolution) unless that AI is importantly different from today’s LLMs (e.g. advanced model-based RL). But I don’t think we’ve gotten much evidence on this hypothesis either way so far, right? For example: I think if you walk up to some normal person and say “We already today have direct (albeit weak) evidence that LLMs will evolve into something that transforms the world much much more than electrification + airplanes + integrated circuits + the internet combined”, I think they would say “WTF?? That is a totally wild claim, and we do NOT already have direct evidence for it”. Right? If you mean “transformative” in a weaker-than-OpenPhil sense, well the internet “transformed the world” according to everyday usage, and the impact of the internet on the economy is (AFAICT) >$10T. I suppose that the fact that the internet exists is somewhat relevant to AGI x-risk, but I don’t think it’s very relevant. I think that people trying to make AGI go well in a hypothetical world where the internet doesn’t exist would be mostly doing pretty similar things as we are. Why not “non-AGI AI systems will eventually be (at most) comparably impactful to the internet or automobiles or the printing press, before plateauing, and this is ridiculously impactful by everyday standards, but it doesn’t strongly change the story of how we should be thinking about AGI”? BTW, why do we care about slow takeoff anyway? * Slow takeoff suggests that we see earlier smaller failures that have important structural similarity to later x-risk-level failures * Slow takeoff means the world that AGI will appear in will be so different from the current world that it totally changes what makes sense to do right now about AGI x-risk. (Anything else?) I can believe that “LLMs will transform the world” comparably to how the internet or the

This is a great example. The rhyming find in particular is really interesting though I'd love to see it documented more clearly if anyone has done that.

I strongly suspect that whatever level of non-myopic token prediction a base GPT-3 model does, the tuned ones are doing more of it.

My guess would be that it's doing basically the same amount of cognitive work looking for plausible completions, but that it's upweighting that signal a lot.

Suppose the model always looks ahead and identifies some plausible trajectories based on global coherence. During generati... (read more)

Hm... It might be hard to distinguish between 'it is devoting more capacity to implicitly plan rhyming better and that is why it can choose a valid rhyme' and 'it is putting more weight on the "same" amount of rhyme-planning and just reducing contribution from valid non-rhyme completions (such as ending the poem and adding a text commentary about it, or starting a new poem, which are common in the base models) to always choose a valid rhyme', particularly given that it may be mode-collapsing onto the most confident rhymes, distorting the pseudo "log probs" even further. The RL model might be doing more planning internally but then picking only one safest rhyme, so you can't read off anything from the logprobs, I don't think. I'm also not sure if you can infer any degree of planning by, say, giving it a half-written line and seeing how badly it screws up... And you can't build a search tree to quantify it nicely as 'how much do I need to expand the tree to get a valid rhyme' because LM search trees are full of degeneracy and loops and most of it is off-policy so it would again be hard to tell what anything meant: the RL model is never used with tree search in any way and anywhere besides the argmax choice, it's now off-policy and it was never supposed to go there and perf may be arbitrarily bad because it learned to choose while assuming always being on-policy. Hard. This might be a good test-case or goal for interpretability research: "can you tell me if this model is doing more planning [of rhymes] than another similar model?"

RLHF and Fine-Tuning have not worked well so far. Models are often unhelpful, untruthful, inconsistent, in many ways that had been theorized in the past. We also witness goal misspecification, misalignment, etc. Worse than this, as models become more powerful, we expect more egregious instances of misalignment, as more optimization will push for more and more extreme edge cases and pseudo-adversarial examples.

These three links are:

... (read more)
Agree that the cited links don't represent a strong criticism of RLHF but I think there's an interesting implied criticism, between the mode-collapse post and janus' other writings on cyborgism etc that I haven't seen spelled out, though it may well be somewhere. I see janus as saying that if you know how to properly use the raw models, then you can actually get much more useful work out of the raw models than the RLHF'd ones. If true, we're paying a significant alignment tax with RLHF that will only become clear with the improvement and take-up of wrappers around base models in the vein of Loom. I guess the test (best done without too much fanfare) would be to get a few people well acquainted with Loom or whichever wrapper tool and identify a few complex tasks and see whether the base model or the RLHF model performs better. Even if true though, I don't think it's really a mark against RLHF since it's still likely that RLHF makes outputs safer for the vast majority of users, just that if we think we're in an ideas arms-race with people trying to advance capabilities, we can't expect everyone to be using RLHF'd models.

The definition you quoted is "a machine capable of behaving intelligently over many domains."

It seems to me like existing AI systems have this feature. Is the argument that ChatGPT doesn't behave intelligently, or that it doesn't do so over "many" domains? Either way, if you are using this definition, then saying "AGI has a significant probability of happening in 5 years" doesn't seem very interesting and mostly comes down to a semantic question.

I think it is sometimes used within a worldview where "general intelligence" is a discrete property, and AGI is ... (read more)

I think you're contrasting AGI with Transformative AI []. A sufficiently capable AGI will be transformative by default, for better or worse, and an insufficiently capable, but nonetheless fully-general AI is probably a transformative AI in embryo, so the terms have been used synonymously. The fact that we feel the need to make this distinction with current AIs is worrisome. Current large language models have become impressively general, but I think they are not as general as humans yet, but maybe that's more a question of capability level than generality level and some of our current AIs are already AGIs as you imply. I'm not sure. (I haven't talked to Bing's new AI yet, only ChatGPT.)

I said that playing blindfolded chess at 1s/move is "extraordinarily hard;" I agree that might be an overstatement and "extremely hard" might be more accurate. I also agree that humans don't need "external" tools; I feel like the whole comparison will come down to arbitrary calls like whether a human explicitly visualizing something or repeating a sound to themself is akin to an LM modifying its prompt, or whether our verbal loop is "internal" whereas an LM prompt is "external" and therefore shows that the AI is missing the special sauce.

Incidentally, I wo... (read more)

I don't quite know what "AGI might have happened by now" means.

I thought that we might have built transformative AI by 2023 (I gave it about 5% in 2010 and about 2% in 2018), and I'm not sure that Eliezer and I have meaningfully different timelines. So obviously "now" doesn't mean 2023.

If "now" means "When AI is having ~$1b/year of impact," and "AGI" means "AI that can do anything a human can do better" then yes, I think that's roughly what I'm saying.

But an equivalent way of putting it is that Eliezer thinks weak AI systems will have very little impact, a... (read more)

My model is very discontinuous; I try to think of AI as AI (and avoid the term AGI). And sure, intelligence has some G measure, and everything we have built so far is low G[1] (humans have high G). Anyway, at the core I think the jump will happen when an AI system learns the meta task / goal "Search and evaluate"[2]; once that happens[3] G would start increasing very fast (versus earlier), and adding resources to such a thing would just accelerate this[4]. And I don't see how that diverges from this reality or a reality where it's not possible to get there, until obviously we get there.
1. ^ I can't speak to what people have built / are building in private.
2. ^ Whenever people say AGI, I think AI that can do "search and evaluate" recursively.
3. ^ And my intuition says that requires a system that has much higher G than current ones, although looking at how that likely played out for us, it might be much lower than my intuition leads me to believe.
4. ^ That is contingent on architecture: if we built a system that cannot scale easily or at all, then this won't happen.
3Daniel Kokotajlo1mo
Ah, I'm pretty sure Eliezer has shorter timelines than you. He's been cagy about it but he sure acts like it, and various of his public statements seem to suggest it. I can try to dig them up if you like.  Yep that's one way of putting what I said yeah. My model of EY's view is: Pre-AGI systems will ramp up in revenue & impact at some rate, perhaps the rate that they have ramped up so far. Then at some point we'll actually hit AGI (or seed AGI) and then FOOM. And that point MIGHT happen later, when AGI is already a ten-trillion-dollar industry, but it'll probably happen before then. So... I definitely wasn't interpreting Yudkowsky in the longer-timelines way. His view did imply that maybe nothing super transformative would happen in the run-up to AGI, but not because pre-AGI systems are weak, rather because there just won't be enough time for them to transform things before AGI comes. Anyhow, I'll stop trying to speak for him.  

But humans play blindfold chess much slower than they read/write moves, they take tons of cognitive actions between each move. And at least when I play blindfold chess I need to lean heavily on my visual memory, and I often need to go back over the game so far for error-correction purposes, laboriously reading and writing to a mental scratchspace. I don't know if better players do that.

I'm not sure why we shouldn't expect an ai to be able to do well at it?

But an AI can do completely fine at the task by writing to an internal scratchspace. You are defining ... (read more)

The reason I don't want a scratch-space is that I view scratch space and context as equivalent to giving the ai a notecard that it can peek at. I'm not against having extra categories or asterisks for the different kinds of ai for the small test. Thinking aloud and giving it scratch space would mean it's likely to be a lot more tractable for interpretability and alignment research, I'll grant you that. I appreciate the feedback, and I will think about your points more, though I'm not sure if I will agree.

Just seems worth flagging that humans couldn't do the chess test, and that there's no particular reason to think that transformative AI could either.

I'm confused. What I'm referring to here is [] I'm not sure why we shouldn't expect an ai to be able to do well at it?

The chess "board vision" task is extraordinarily hard for humans who are spending 1 second per token and not using an external scratchspace. It's not trivial for an untrained human even if they spend multiple seconds per token. (I can do it only by using my visual field, e.g. it helps me massively to be looking at a blank 8 x 8 chessboard because it gives a place for the visuals to live and minimizes off-by-one errors.)

Humans would solve this prediction task by maintaining an external representation of the state of the board, updating that representation o... (read more)

That may be, but it also seems to me like a mistake to use as your example a human who is untrained (or at least has had very little training), instead of a human whose training run has basically saturated the performance of their native architecture. Those people do, in fact, play blindfold chess, and are capable of tracking the board state perfectly without any external visual aid, while playing with a time control of ~1 minute per player per game (which, if we assume an average game length of 80 moves, comes out to ~1.5 seconds per move). Of course, that comparison again becomes unfair in the other direction, since ChatGPT hasn't been trained nearly as exhaustively on chess notation, whereas the people I'm talking about have dedicated their entire careers to the game. But I'd be willing to bet that even a heavily fine-tuned version of GPT-3 wouldn't be able to play out a chess game of non-trivial length, while maintaining legality throughout the entire game, without needing to be re-prompted. (And that isn't even getting into move quality, which I'd fully expect to be terrible no matter what.) (No confident predictions about GPT-4 as of yet. My old models would have predicted a similar lack of "board vision" from GPT-4 as compared with GPT-3, but I trust those old models less, since Bing/Sydney has managed to surprise me in a number of ways.) ETA: To be clear, this isn't a criticism of language models. This whole task is trying to get them to do something that they're practically architecturally designed to be bad at, so in some sense the mere fact that we're even talking about this says very impressive things about their capabilities. And obviously, CNNs do the whole chess thing really, really well—easily on par with skilled humans, even without the massive boost offered by search. But CNNs aren't general, and the question here is one of generality, you know?
My proposed experiment / test is trying to avoid analogizing humans, but rather scope out places where the ai can't do very well. I'd like to avoid accidentally overly-narrow-scoping the vision of the tests. It won't work with an ai network where the weights are reset every time. An alternative, albeit massively-larger-scale experiment might be: Will a self-driving car ever be able to navigate from one end of a city to another, using street signs and just learning the streets by exploring it? A test of this might be like the following:

  1. Randomly generate a simulated city/town, complete with street signs and traffic
  2. Allow the self-driving car to peruse the city on its own accord
     1. (or feed the ai network the map of the city a few times before the target destinations are given, if that is infeasible)
  3. Give the self-driving car target destinations. Can the self-driving car navigate from one end of the city to the other, using only street signs, no GPS?

I think this kind of measuring would tell us how well our ai can handle open-endedness and help us understand where the void of progress is, and I think a small-scale chess experiment like this would help us shed light on bigger questions.

One way of viewing takeoff speed disagreements within the safety community is: most people agree that growth will eventually be explosively fast, and so the question is "how big an impact does AI have prior to explosive growth?" We could quantify this by looking at the economic impact of AI systems prior to the point when AI is powerful enough to double output each year.

(We could also quantify it via growth dynamics, but I want to try to get some kind of evidence further in advance which requires looking at AI in particular---on both views, AI only has a l... (read more)

[For reference, my view is “probably fast takeoff, defined by <2 years between widespread common knowledge of basically how AGI algorithms can work, and the singularity.” (Is that “fast”? I dunno the definitions.) ]

My view is that LLMs are not part of the relevant category of proto-AGI; I think proto-AGI doesn’t exist yet. So from my perspective,

  • you’re asking for a prediction about how much economic impact comes from a certain category of non-proto-AGI software,
  • you’re noting that the fast takeoff hypothesis doesn’t constrain this prediction at all,
  • and t
... (read more)
Questions of how many trillions of dollars OpenAI will be allowed to generate by entities like the U.S. government are unimportant to the bigger concern, which is: Will these minds be useful in securing AIs before they become so powerful they're dangerous? Given the pace of ML research and how close LLMs are apparently able to get to AGI without actually-being-AGI, as a person who was only introduced to this debate in like the last year or so, my answer is: seems like that's what's happening now? Yeah?
  1. What you are basically saying is "Yudkowsky thought AGI might have happened by now, whereas I didn't; AGI hasn't happened by now, therefore we should update from Yud to me by a factor of ~1.5 (and also from Yud to the agi-is-impossible crowd, for that matter)" I agree.
  2. Here's what I think is going to happen (this is something like my modal projection, obviously I have a lot of uncertainty; also I don't expect the world economy to be transformed as fast as this projects due to schlep and regulation, and so probably things will take a bit longer than depicted
... (read more)
In my model, the AGI threshold is a phase change. Before that point, AI improves at human research speed; after that point it improves much faster (even as early AGI is just training skills and not doing AI research). Before AGI threshold, the impact on the world is limited to what sub-AGI AI can do, which depends on how much time is available before AGI for humans to specialize AI to particular applications. So with short AGI threshold timelines, there isn't enough time for humans to make AI very impactful, but after AGI threshold is crossed, AI advancement accelerates much more than before that point. And possibly there is not enough time post-AGI to leave much of an economic impact either, before it gets into post-singularity territory (nanotech and massive compute; depending on how AGIs navigate transitive alignment, possibly still no superintelligence). I think this doesn't fit into the takeoff speed dichotomy, because in this view the speed of the post-AGI phase isn't observable during the pre-AGI phase.

It seems to me like "fine-tuning" usually just means a small amount of extra training on top of a model that's already been trained, whether that's supervised, autoregressive, RL, or whatever. I don't find that language confusing in itself. It is often important to distinguish different kinds of fine-tuning, just as it's often important to distinguish different kinds of training in general, and adjectives seem like a pretty reasonable way to do that.

I'd be open to changing my usage if I saw some data on other people also using or interpreting "fine-tuning"... (read more)

When I hear the hypothesis that world GDP doubles in 4 years before it doubles in 1 year, I imagine a curve that looks like this:

I don't think that's the right curve to imagine.

If AI is a perfect substitute for humans, then you would have (output) = (AI output) + (human output). If AI output triples every year, then the first time you will have a doubling of the economy in 1 year is when AI goes from 100% of human output to 300% of human output. Over the preceding 4 years you will have the growth of AI from ~0 human output to ~100% of human output, and on t... (read more)
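The arithmetic in this paragraph can be checked with a short sketch. It assumes only what the comment states (perfect substitution, so total output is human output plus AI output, and AI output tripling each year) plus two illustrative choices that are not in the original: human output is held fixed at 1, and the function name `yearly_totals` is hypothetical.

```python
def yearly_totals(ai_start, years, ai_growth=3.0, human=1.0):
    """Total output (human + AI) at the start of each year, in units of human output.

    Assumes perfect substitution and constant human output (illustrative assumptions).
    """
    return [human + ai_start * ai_growth ** y for y in range(years + 1)]


# AI starts at 1/81 of human output, so after four triplings it equals human output
# (100%), and after a fifth it is 300% of human output.
totals = yearly_totals(ai_start=1 / 81, years=5)

four_year_ratio = totals[4] / totals[0]  # growth over the four years before the fast doubling
one_year_ratio = totals[5] / totals[4]   # growth over the single year that follows

print(f"4-year growth factor: {four_year_ratio:.3f}")
print(f"1-year growth factor: {one_year_ratio:.3f}")
```

Under these assumptions the economy grows by a factor of roughly 1.98 over the four years in which AI goes from about 1% to 100% of human output, and then by a factor of 2 in the next year alone.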

Thanks for the reply. If I'm understanding correctly, leaving aside the various complications you bring up, are you describing a potential slow growth curve that (to a rough approximation) looks like:

  • economic value of AI grows 2x per year (you said >3x, but 2x is easier b/c it lines up with the "GDP doubles in 1 year" criterion)
  • GDP first doubles in 1 year in (say) 2033
  • that means AI takes GDP from (roughly) $100T to $200T in 2033
  • extrapolating backward, AI is worth $9B this year, and will be worth $18B next year

This story sounds plausible to me, and it basically fits the slow-takeoff operationalization.
This is a big crux, in that I believe complementarity is very low, low enough that in practice, it can be ignored. And I think Amdahl's law severely suppresses complementarity, and this is a crux, in that if I changed my mind about this, then I think slow takeoff is likely.

My overall take is:

  • I generally think that people interested in existential safety should want the deployment of AI to be slower and smoother, and that accelerating AI progress is generally bad.
  • The main interesting version of this tradeoff is when people specifically work on advancing AI in order to make it safer, either so that safety-conscious people have more influence or so that the rollout is more gradual. I think that's a way bigger deal than capability externalities from safety work, and in my view it's also closer to the actual margin of what's net-
... (read more)
From the outside perspective of someone quite new to the AI safety field and with no contact with the Bay Area scene, the reasoning behind this plan is completely illegible to me. What is only visible instead is that they’re working on ChatGPT-like systems and capabilities, as well as some empirical work on evaluations and interpretability. The only system more powerful than ChatGPT I’ve seen so far is the unnamed one behind Bing, and I’ve personally heard rumours that both Anthropic and OpenAI are already working on systems beyond ChatGPT/GPT-3.5 level.
  1. Fully agree and we appreciate you stating that.
  2. While we are concerned about capability externalities from safety work (that’s why we have an infohazard policy), what we are most concerned about, and that we cover in this post, is deliberate capabilities acceleration justified as being helpful to alignment. Or, to put this in reverse, using the notion that working on systems that are closer to being dangerous might be more fruitful for safety work, to justify actively pushing the capabilities frontier and thus accelerating the arrival of the dangers themselves.
  3. We fully agree that engaging with arguments is good, this is why we’re writing this and other work, and we would love all relevant players to do so more. For example, we would love to hear a more detailed, more concrete story from OpenAI of why they believe accelerating AI development has an altruistic justification. We do appreciate that OpenAI and Jan Leike have published their own approach to AI alignment, even though we disagree with some of its contents, and we would strongly support all other players in the field doing the same.
  4. We share your significant unease with such plans. But given what you say here, why at the same time you wouldn’t pursue this plan yourself, yet you say that it seems to you like the expected impact is good? From our point of view, an unease-generating, AI arrival-accelerating plan seems pretty bad unless proven otherwise. It would be great for the field to hear the reasons why, despite these red flags, this is nevertheless a good plan. And of course, it would be best to hear the reasoning about the plan directly from those who are pursuing it.

Dangerous Argument 2: We should avoid capability overhangs, so that people are not surprised. To do so, we should extract as many capabilities as possible from existing AI systems.

I'm saying that faster AI progress now tends to lead to slower AI progress later. I think this is a really strong bet, and the question is just: (i) quantitatively how large is the effect, (ii) how valuable is time now relative to time later. And on balance I'm not saying this makes progress net positive, just that it claws back a lot of the apparent costs.

For example, I think a ... (read more)

It’s pretty hard to predict the outcome of “raising awareness of problem X” ahead of time. While it might be net good right now because we’re in a pretty bad spot, we have plenty of examples from the past where greater awareness of AI risk has arguably led to strongly negative outcomes down the line, due to people channeling their interest in the problem into somehow pushing capabilities even faster and harder.
We fully agree on this, and so it seems like we don’t have large disagreements on externalities of progress. From our point of view, the cutoff point was probably GPT-2 rather than 3, or some similar event that established the current paradigm as the dominant one. Regarding the rest of your comment and your other comment here, here are some reasons why we disagree. It’s mostly high level, as it would take a lot of detailed discussion into models of scientific and technological progress, which we might cover in some future posts.  In general, we think you’re treating the current paradigm as over-determined. We don’t think that being in a DL-scaling language model large single generalist system-paradigm is a necessary trajectory of progress, rather than a historical contingency. While the Bitter Lesson might be true and a powerful driver for the ease of working on singleton, generalist large monolithic systems over smaller, specialized ones, science doesn’t always (some might say very rarely!) follow the most optimal path. There are many possible paradigms that we could be in, and the current one is among the worse ones for safety. For instance, we could be in a symbolic paradigm, or a paradigm that focuses on factoring problems and using smaller LSTMs to solve them. Of course, there do exist worse paradigms, such as a pure RL non-language based singleton paradigm. In any case, we think the trajectory of the field got determined once GPT-2 and 3 brought scaling into the limelight, and if those didn’t happen or memetics went another way, we could be in a very very different world.
2Richard Korzekwa 1mo
My best guess is that this is true, but I think there are outside-view reasons to be cautious. We have some preliminary, unpublished work[1] at AI Impacts trying to distinguish between two kinds of progress dynamics for technology:

  1. There's an underlying progress trend, which only depends on time, and the technologies we see are sampled from a distribution that evolves according to this trend. A simple version of this might be that the goodness G we see for AI at time t is drawn from a normal distribution centered on G_c(t) = G_0 exp(At). This means that, apart from how it affects our estimate for G_0, A, and the width of the distribution, our best guess for what we'll see in the non-immediate future does not depend on what we see now.
  2. There's no underlying trend "guiding" progress. Advances happen at random times and improve the goodness by random amounts. A simple version of this might be a small probability per day that an advancement occurs, which is then independently sampled from a distribution of sizes. The main distinction here is that seeing a large advance at time t0 does decrease our estimate for the time at which enough advances have accumulated to reach goodness level G_agi.

(A third hypothesis, of slightly lower crudeness level, is that advances are drawn without replacement from a population. Maybe the probability per time depends on the size of remaining population. This is closer to my best guess at how the world actually works, but we were trying to model progress in data that was not slowing down, so we didn't look at this.)

Obviously neither of these models describes reality, but we might be able to find evidence about which one is less of a departure from reality. When we looked at data for advances in AI and other technologies, we did not find evidence that the fractional size of advance was independent of time since the start of the trend or since the last advance. In other words, in seem

Eliminating capability overhangs is discovering AI capabilities faster, so also pushes us up the exponential! For example, it took a few years for chain-of-thought prompting to become more widely known than among a small circle of people around AI Dungeon. Once chain-of-thought became publicly known, labs started fine-tuning models to explicitly do chain-of-thought, increasing their capabilities significantly. This gap between niche discovery and public knowledge drastically slowed down progress along the growth curve!

It seems like chain of thought prompti... (read more)

We've heard the story from a variety of sources all pointing to AI Dungeon, and to the fact that the idea was kept from spreading for a significant amount of time. This @gwern Reddit comment, and previous ones in the thread, cover the story well. Regarding the effects of chain of thought prompting on progress[1], there's two levels of impact: first order effects and second order effects. On first order, once chain of thought became public a large number of groups started using it explicitly to finetune their models. Aside from non-public examples, big ones include PaLM, Google's most powerful model to date. Moreover, it makes models much more useful for internal R&D with just prompting and no finetuning. We don’t know what OpenAI used for ChatGPT, or future models: if you have some information about that, it would be super useful to hear about it! On second order: implementing this straightforwardly improved the impressiveness and capabilities of models, making them more obviously powerful to the outside world, more useful for customers, and leading to an increase in attention and investment into the field. Due to compounding, the earlier these additional investments arrive, the sooner large downstream effects will happen.
  1. ^ This is also partially replying to @Rohin Shah's question in another comment:

To make a useful version of this post I think you need to get quantitative.

I think we should try to slow down the development of unsafe AI. And so all else equal I think it's worth picking research topics that accelerate capabilities as little as possible. But it just doesn't seem like a major consideration when picking safety projects. (In this comment I'll just address that; I also think the case in the OP is overstated for the more plausible claim that safety-motivated researchers shouldn't work directly on accelerating capabilities.)

A simple way to mod... (read more)

I also don't know where the disagreement comes from. At some point I am interested in engaging with a more substantive article laying out the "RLHF --> non-myopia --> treacherous turn" argument so that it can be discussed more critically.

I’m not sure where the disagreement comes from; I predict that if you imagine fine-tuning a transformer with RL on a game where humans always make the same suboptimal move, but they don’t see it, you would expect the model, when it becomes smarter and understands the game well enough, to start picking instead a new m

... (read more)

RLHF basically predicts "what token would come next in a high-reward trajectory?" (The only way it differs from the prediction objective is that worse-than-average trajectories are included with negative weight rather than simply being excluded altogether.)
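One way to picture the parenthetical is as a choice of per-trajectory weights on an otherwise ordinary next-token likelihood objective. The sketch below is a deliberate simplification and an assumption about what is meant: it uses a plain mean-reward baseline, whereas practical RLHF pipelines typically add further machinery (for example PPO-style updates and a KL penalty). The function names are hypothetical.

```python
def policy_gradient_weights(rewards):
    """Weight each trajectory by (reward - mean reward).

    Worse-than-average trajectories get a negative weight, so their tokens
    are pushed down rather than ignored.
    """
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def filtered_imitation_weights(rewards):
    """Weight 1 for better-than-average trajectories, 0 otherwise.

    This is plain prediction on a filtered dataset: bad trajectories are
    simply excluded.
    """
    baseline = sum(rewards) / len(rewards)
    return [1.0 if r > baseline else 0.0 for r in rewards]


rewards = [0.0, 1.0, 2.0, 5.0]  # made-up rewards for four sampled trajectories
pg = policy_gradient_weights(rewards)
filtered = filtered_imitation_weights(rewards)

for r, w_pg, w_f in zip(rewards, pg, filtered):
    print(f"reward={r:.1f}  policy-gradient weight={w_pg:+.1f}  filtered weight={w_f:.1f}")
```

In both cases the model is still trained on token-level log-probabilities; only the sign and size of each trajectory's weight differ.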

GPT predicts "what token would come next in this text," where the text is often written by a consequentialist (i.e. optimized for long-term consequences) or selected to have particular consequences.

I don't think those are particularly different in the relevant sense. They both produce consequentialist be... (read more)

1Mikhail Samin1mo
In RLHF, the gradient descent will steer the model towards being more agentic about the entire output (and, speculatively, more context-aware), because that’s the best way to produce a token on a high-reward trajectory. The lowest loss is achievable with a superintelligence that thinks about a sequence of tokens that would be best at hacking human brains (or a model of human brains) and implements it token by token. That seems quite different from a GPT that focuses entirely on predicting the current token and isn’t incentivized to care about the tokens that come after the current one outside of what the consequentialists writing (and selecting, good point) the text would be caring about. At the lowest loss, the GPT doesn’t use much more optimization power than what the consequentialists plausibly had/used. I have an intuition about a pretty clear difference here (and have been quite sure that RL breaks myopia for a while now) and am surprised by your expectation for myopia to be preserved. RLHF means optimizing the entire trajectory to get a high reward, with every choice of every token. I’m not sure where the disagreement comes from; I predict that if you imagine fine-tuning a transformer with RL on a game where humans always make the same suboptimal move, but they don’t see it, you would expect the model, when it becomes smarter and understands the game well enough, to start picking instead a new move that leads to better results, with the actions selected for what results they produce in the end. It feels almost tautological: if the model sees a way to achieve a better long-term outcome, it will do that to score better. The model will be steered towards achieving predictably better outcomes in the end. The fact that it’s a transformer that individually picks every token doesn’t mean that RL won’t make it focus on achieving a higher score. Why would the game being about human feedback prevent a GPT from becoming agentic and using more and more of its capabiliti

It's a subproblem of our current approach to ELK.

(Though note that by "mechanisms that produce a prediction" we mean "mechanism that gives rise to a particular regularity in the predictions.")

It may be possible to solve ELK without dealing with this subproblem though.

But if ELK can be solved without solving this subproblem, couldn't SGD find a model that recognizes sensor tampering without solving this subproblem as well?

I agree that ultimately AI systems will have an understanding built up from the world using deliberate cognitive steps (in addition to plenty of other complications) not all of which are imitated from humans.

The ELK document mostly focuses on the special case of ontology identification, i.e. ELK for a directly learned world model. The rationale is: (i) it seems like the simplest case, (ii) we don't know how to solve it, (iii) it's generally good to start with the simplest case you can't solve, (iv) it looks like a really important special case, which may a... (read more)

4Steven Byrnes1mo
Thanks! Thinking about it more, I think my take (cf. Section 4.1) is kinda like “Who knows, maybe ontology-identification will turn out to be super easy. But even if it is, there’s this other different problem, and I want to start by focusing on that”. And then maybe what you’re saying is kinda like “We definitely want to solve ontology-identification, even if it doesn’t turn out to be super easy, and I want to start by focusing on that”. If that’s a fair summary, then godspeed. :) (I’m not personally too interested in learned optimization because I’m thinking about something closer to actor-critic model-based RL, which sorta has “optimization” but it’s not really “learned”.)

GPT-3 is about 2e11 parameters and uses about 4 flops per parameter per token, so about 1e12 flops per token.

If a human writes at 1 token per second, then you should be comparing 1e12 flops to the cost per second. I think you are implicitly comparing to the cost for a ~1000 token context?

I think 1e14 to 1e15 flops is a plausible estimate for the productive computation done by a human brain in a second, which is about 2-3 orders of magnitude beyond GPT-3.
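As a quick check of the orders of magnitude, the sketch below reuses only the figures quoted above (2e11 parameters, 4 flops per parameter per token, 1e14 to 1e15 flops per second for a brain, and a human writing 1 token per second). The variable names are illustrative, and the figures are taken as given rather than independently verified.

```python
import math

params = 2e11                  # approximate GPT-3 parameter count quoted above
flops_per_param_token = 4      # per-parameter cost quoted above, taken as given
gpt3_flops_per_token = params * flops_per_param_token  # 8e11, i.e. roughly 1e12

brain_low, brain_high = 1e14, 1e15  # quoted range for brain flops per second

# At 1 token per second, the per-token cost of GPT-3 is compared directly
# with the per-second cost of the brain.
gap_low = math.log10(brain_low / gpt3_flops_per_token)
gap_high = math.log10(brain_high / gpt3_flops_per_token)

print(f"GPT-3 flops per token: {gpt3_flops_per_token:.1e}")
print(f"Gap in orders of magnitude: {gap_low:.1f} to {gap_high:.1f}")
```

With these inputs the gap comes out at a little over 2 to a little over 3 orders of magnitude, consistent with the "2-3 orders of magnitude" stated above.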

I think this is not really a coincidence. GPT-3 is notable because it's starting to exhibit human-like a... (read more)

Hi. Can you provide a citable reference for the "4 flops per parameter per token"? It's for a research paper in the foundations of quantum physics. Thanks. (Howard Wiseman.)

I feel like to the extent there is confusion/conflation over the terminology, it was mainly due to Paul's (probably unintentional) overloading of "AI alignment" with the new and narrower meaning (in Clarifying “AI Alignment”)

I don't think this is the main or only source of confusion:

... (read more)

In the 2017 post Vladimir Slepnev is talking about your AI system having particular goals, isn't that the narrow usage? Why are you citing this here?

I misread the date on the Arbital page (since Arbital itself doesn't have timestamps and it wasn't indexed by the Wayback machine until late 2017) and agree that usage is prior to mine.

But that talk appears to use the narrower meaning, not the crazy broad one from the later Arbital page. Looking at the transcript:

  • The first usage is "At the point where we say, “OK, this robot’s utility function is misaligned with our utility function. How do we fix that in a way that it doesn’t just break again later?” we are doing AI alignment theory." Which seems like it's really about the goal the agent is pursuing.
  • The subproblems are all about agents having the right goals. And it continuously talks about pointing agents in the right direction
... (read more)
3David Scott Krueger (formerly: capybaralet)1mo
FWIW, I didn't mean to kick off a historical debate, which seems like probably not a very valuable use of y'all's time.

tl;dr "AI Alignment" clearly had a broader (but not very precise) meaning than "How to get AI systems to try to do what we want" when it first came into use. Paul later used "AI Alignment" for his narrower meaning, but after that discussion, switched to using "Intent Alignment" for this instead.

I don't think I really agree with this summary. Your main justification was that Eliezer used the term with an extremely broad definition on Arbital, but the Arbital page was written way after a bunch of other usage (including after me moving to I t... (read more)

Eliezer used "AI alignment" as early as 2016 [] and [] wasn't registered until 2017 []. Any other usage of the term that potentially predates Eliezer?

I don't really think RLHF "breaks myopia" in any interesting sense. I feel like LW folks are being really sloppy in thinking and talking about this. (Sorry for replying to this comment, I could have replied just as well to a bunch of other recent posts.)

I'm not sure what evidence you are referring to: in Ethan's paper the RLHF models have the same level of "myopia" as LMs. They express stronger desire for survival, and a weaker acceptance of ends-justify-means reasoning.

But more importantly, all of these are basically just personality questions that I expec... (read more)

2Mikhail Samin1mo
I’m surprised. It feels to me that there’s an obvious difference between predicting one token of text from a dataset and trying to output a token in a sequence with some objective about the entire sequence. RLHF models optimize for the entire output to be rated highly, not just for the next token, so (if they’re smart enough) they perform better if they think about which current tokens will make it easier for the entire output to be rated highly (instead of outputting just one current token that a human would like).

I didn't realize how broadly you were defining AI investment. If you want to say that e.g ChatGPT increased investment by $10B out of $200-500B, so like +2-5%, I'm probably happy to agree (and I also think it had other accelerating effects beyond that).

I would guess that a 2-5% increase in total investment could speed up AGI timelines 1-2 weeks depending on details of the dynamics, like how fast investment was growing, how much growth is exogenous vs endogenous, diminishing returns curves, importance of human capital, etc. If you mean +2-5% investment in ... (read more)

Makes sense, sorry for the miscommunication. I really didn't feel like I was making a particularly controversial claim with the $10B, so was confused why it seemed so unreasonable to you.

I do think those $10B are going to be substantially more harmful for timelines than other money in AI, because I do think a good chunk of that money will much more directly aim at AGI than most other investment. I don't know what my multiplier here for effect should be, but my guess is something around 3-5x in expectation (I've historically randomly guessed that AI applications are 10x less timelines-accelerating per dollar than full-throated AGI-research, but I sure have huge uncertainty about that number).

That, plus me thinking there is a long tail with lower probability where ChatGPT made a huge difference in race dynamics, and thinking that this marginal increase in investment does probably translate into increases in total investment, made me think this was going to shorten timelines in expectation by something closer to 8-16 weeks, which isn't enormously far away from yours, though still a good bit higher.

And yeah, I do think the thing I am most worried about with ChatGPT in addition to just shortening timelines is increasing the number of actors in the space, which also has indirect effects on timelines. A world where both Microsoft and Google are doubling down on AI is probably also a world where AI regulation has a much harder time taking off. Microsoft and Google at large also strike me as much less careful actors than the existing leaders of AGI labs which have so far had a lot of independence (which to be clear, is less of an endorsement of current AGI labs, and more of a statement about very large moral-maze-like institutions with tons of momentum). In general the dynamics of Google and Microsoft racing towards AGI sure is among my least favorite takeoff dynamics in terms of being able to somehow navigate things cautiously.

Oh, yeah, good point. I was indee

Yes, I mean that those measurements don't really speak directly to the question of whether you'd be safer using RLHF or imitation learning.

I agree that safety people have lots of ideas more interesting than stack more layers, but they mostly seem irrelevant to progress. People working in AI capabilities also have plenty of such ideas, and one of the most surprising and persistent inefficiencies of the field is how consistently it overweights clever ideas relative to just spending the money to stack more layers. (I think this is largely down to sociological and institutional factors.)

Indeed, to the extent that AI safety people have plausibly accelerated AI capabilities I think it's almost enti... (read more)

I mostly care about how an AI selected to choose actions that lead to high reward might select actions that disempower humanity to get a high reward, or about how an AI pursuing other ambitious goals might choose low loss actions instrumentally and thereby be selected by gradient descent. 

Perhaps there are other arguments for catastrophic risk based on the second-order effects of changes from fine-tuning rippling through an alien mind, but if so I either want to see those arguments spelled out or more direct empirical evidence about such risks.

I think if you train AI systems to select actions that will lead to high reward, they will sometimes learn policies that behave well until they are able to overpower their overseers, at which point they will abruptly switch to the reward hacking strategy to get a lot of reward.

I think there will be many similarities between this phenomenon in subhuman systems and superhuman systems. Therefore by studying and remedying the problem for weak systems overpowering weak overseers, we can learn a lot about how to identify and remedy it for stronger systems overpowering stronger overseers.

I'm not exactly sure how to cash out your objection as a response to this, but I suspect it's probably a bit too galaxy-brained for my taste.


So for example, say Alice runs this experiment:

Train an agent A in an environment that contains the source B of A's reward.

Alice observes that A learns to hack B. Then she solves this as follows:

Same setup, but now B punishes (outputs high loss) A when A is close to hacking B, according to a dumb tree search that sees whether it would be easy, from the state of the environment, for A to touch B's internals.

Alice observes that A doesn't hack B. Then Bob looks at Alice's results and says,

"Cool. But this won't generalize to future lethal systems because it doe... (read more)

I don't think GPT has the sense of myopia relevant to deceptive alignment any more or less than models fine-tuned with RLHF.  There are other bigger impacts of RLHF both for the quoted empirical results and for the actual probability of deceptive alignment, and I think the concept is being used in a way that is mostly incoherent.

But I was mostly objecting to the claim that RLHF ruined [the strategy]. I think even granting the contested empirics it doesn't quite make sense to me.

Sorry to respond late, but a crux I might have here is that I see the removal of myopia and the addition of agency/non-causal decision theories as a major negative of an alignment plan by default, and without specific mechanisms of how deceptive alignment/mesa optimizers can't arise, I expect non-myopic training to find such things. In general, the fact that OpenAI chose RLHF made the problem considerably harder, and I suspect this is an example of Goodhart's law in action. The Recursive Reward Modeling and debate plans could make up for this, assuming we can solve deceptive alignment. But right now, I see trouble ahead and OpenAI is probably going to be bailed out by other alignment groups.

You know exactly what both models are optimized for: log loss on the one hand, an unbiased estimator of reward on the other.

You don't know what either model is optimizing: how would you? In both cases you could guess that they may be optimizing something similar to what they are optimized for.
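To make the two "optimized for" objectives concrete, here is a minimal sketch on a toy categorical policy. It assumes that "an unbiased estimator of reward" refers to a score-function (REINFORCE-style) gradient estimator, which the comment does not specify; the function names and the three-action example are hypothetical and not taken from any particular training pipeline.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def log_loss(logits, target):
    """Next-token style objective: negative log-probability of the observed target."""
    return -math.log(softmax(logits)[target])

def expected_reward(logits, rewards):
    """Exact expected reward of the policy under a fixed per-action reward."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def score_function_sample(logits, rewards, action):
    """Single-sample estimate of d E[reward] / d logits, given one sampled action.

    Uses d log pi(action) / d logit_k = 1[k == action] - pi(k).
    """
    probs = softmax(logits)
    return [rewards[action] * ((1.0 if k == action else 0.0) - probs[k])
            for k in range(len(logits))]

def mean_of_estimator(logits, rewards):
    """Average the single-sample estimate over the policy's own action distribution."""
    probs = softmax(logits)
    n = len(logits)
    mean = [0.0] * n
    for a in range(n):
        g = score_function_sample(logits, rewards, a)
        for k in range(n):
            mean[k] += probs[a] * g[k]
    return mean

def exact_gradient(logits, rewards):
    """Analytic gradient of expected reward: pi(k) * (r_k - E[r])."""
    probs = softmax(logits)
    baseline = expected_reward(logits, rewards)
    return [probs[k] * (rewards[k] - baseline) for k in range(len(logits))]

# Hypothetical toy values, chosen only for illustration.
logits = [0.5, -1.0, 2.0]
rewards = [1.0, 0.0, -0.5]

estimate = mean_of_estimator(logits, rewards)
exact = exact_gradient(logits, rewards)
unbiased = all(abs(e - x) < 1e-12 for e, x in zip(estimate, exact))
print("log loss on target 2:", log_loss(logits, 2))
print("estimator matches exact gradient in expectation:", unbiased)
```

Both quantities are fully specified by the training setup, which is the sense in which you know what each model is optimized for. Nothing in this sketch says anything about what the resulting model is internally optimizing.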

This relates to what you wrote in the other thread: I think the difference is that a base language model is trained on vast amounts of text, so it seems reasonable that it is actually quite good at next token prediction, while the fine-tuning is apparently done with comparatively tiny amounts of preference data. So misalignment seems much more likely in the latter case. Moreover, human RLHF raters are probably biased in various ways, which encourages the model to reproduce those biases, even if the model doesn't "believe them" in some sense. For example, some scientists have pointed out that ChatGPT gives politically correct but wrong answers to certain politically taboo but factual questions. (I can go into more detail if required.) Whether the model is honest here and in fact "believes" those things, or whether it is deceptive and just reproduces rater bias rather than being honest, is unknown. So learning to predict webtext from large amounts of training data, and learning some kind of well-aligned utility function from a small number of (biased) human raters, seem like problems of highly uneven difficulty and probability of misalignment.

I don't currently think this is the case, and this seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train, which is the whole reason why you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-tuned ways (including preventing the AI from saying controversial things), which had been the biggest problem

... (read more)
Ok, I think we might now have some additional data on this debate. It does indeed look to me like Sydney was trained with the next best available technology after RLHF, for a few months, at least based on Gwern's guesses here: []  As far as I can tell this resulted in a system with much worse economic viability than Chat-GPT. I would overall describe Sydney as "economically unviable", such that if Gwern's story here is correct, the difference between using straightforward supervised training on chat transcripts and OpenAI's RLHF pipeline is indeed the difference between an economically viable and unviable product.  There is a chance that Microsoft fixes this with more supervised training, but my current prediction is that they will have to fix this with RLHF, because the other technological alternatives are indeed not adequate substitutes from an economic viability perspective, which suggests that the development of RLHF did really matter a lot for this.
My (pretty uninformed) guess here is that supervised fine-tuning vs RLHF has relatively modest differences in terms of producing good responses, but bigger differences in terms of avoiding bad responses. And it seems reasonable to model decisions about product deployments as being driven in large part by how well you can get AI not to do what you don't want it to do.

How much total investment do you think there is in AI in 2023?

My guess is total investment was around the $200B - $500B range, with about $100B of that into new startups and organizations, and around $100-$400B of that in organizations like Google and Microsoft outside of acquisitions. I have pretty high uncertainty on the upper end here, since I don't know what fraction of Google's revenue gets reinvested again into AI, how much Tesla is investing in AI, how much various governments are investing, etc.
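As a rough check on how the figures in this thread fit together, here is a small sketch that takes the guesses at face value: the $200B-$500B total above and the $10B increase attributed to Chat-GPT elsewhere in the thread. The variable names are mine, and the inputs are the commenters' estimates, not measured data.

```python
# Figures quoted in this thread, in billions of dollars (guesses, not data).
startups_b = 100.0                            # new startups and organizations
incumbents_low_b, incumbents_high_b = 100.0, 400.0  # Google, Microsoft, etc.
extra_b = 10.0                                # increase attributed to Chat-GPT

# The two components should add up to the stated $200B-$500B range.
total_low_b = startups_b + incumbents_low_b
total_high_b = startups_b + incumbents_high_b

def share_percent(extra, total):
    """Return `extra` as a percentage of `total`."""
    return 100.0 * extra / total

low_share = share_percent(extra_b, total_high_b)   # against the largest total
high_share = share_percent(extra_b, total_low_b)   # against the smallest total
print(f"total: ${total_low_b:.0f}B to ${total_high_b:.0f}B")
print(f"$10B share of total: {low_share:.0f}% to {high_share:.0f}%")
```

Under these inputs the $10B figure corresponds to the 2-5% increase in total investment discussed in the thread.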

How much variance do you think there is in the level o

... (read more)
Note that I never said this, so I am not sure what you are responding to. I said Chat-GPT increases investment in AI by $10B, not that it increased investment into specifically OpenAI. Companies generally don't have perfect mottes. Most of that increase in investment is probably in internal Google allocation and in increased investment into the overall AI industry.