I think Eliezer does disagree. I find his disagreement fairly annoying. He calls biological anchors the "trick that never works" and gives an initial example of Moravec predicting AGI in 2010 in the book Mind Children.
But as far as I can tell so far that's just Eliezer putting words in Moravec's mouth. Moravec doesn't make very precise predictions in the book, but the heading of the relevant section is "human equivalence in 40 years" (i.e. 2028, the book was written in 1988). Eliezer thinks that Moravec ought to think that human-level AI and shortly therea...
If the cached wisdom had been "we need faster computers," I think the cached wisdom would have looked pretty good.
If you think neural networks are like brains, you might think that you would get human-like cognitive abilities at human-like sizes. I think this was a very common view (and it has aged quite well IMO).
Self-driving cars. Around 2015-2016, it was common knowledge that truck drivers would be out of a job within 3-5 years. Most people here likely believed it, even if it sounds really stupid in retrospect (people often forget what they used to believe). I had several discussions with people expecting fully self-driving cars by 2018.
This doesn't match my experience. I can only speak for groups like "researchers in theoretical computer science," "friends from MIT," and "people I hang out with at tech companies," but at least within those groups people were muc...
I think it's probably true that RLHF doesn't reduce to a proper scoring rule on factual questions, even if you ask the model to quantify its uncertainty, because the learned reward function doesn't make good quantitative tradeoffs.
That said, I think this is unrelated to the given graph. If it is forced to say either "yes" or "no" the RLHF model will just give the more likely answer100% of the time, which will show up as bad calibration on this graph. The point is that for most agents "the probability you say yes" is not the same as "the probability you think the answer is yes." This is the case for pretrained models.
Reposting from the other thread:
...Beth and her team have been working with both Anthropic and OpenAI to perform preliminary evaluations. I don’t think these evaluations are yet at the stage where they provide convincing evidence about dangerous capabilities—fine-tuning might be the most important missing piece, but there is a lot of other work to be done. Ultimately we would like to see thorough evaluations informing decision-making prior to deployment (and training), but for now I think it is best to view it as practice building relevant institutional capac
Perhaps I am misunderstanding Figure 8? I was assuming that they asked the model for the answer, then asked the model what probability it thinks that that answer is correct.
Yes, I think you are misunderstanding figure 8. I don't have inside information, but without explanation "calibration" would almost always mean reading it off from the logits. If you instead ask the model to express its uncertainty I think it will do a much worse job, and the RLHF model will probably perform similarly to the pre-trained model. (This depends on details of the human feedb...
I think it's important for ARC to handle the risk from gain-of-function-like research carefully and I expect us to talk more publicly (and get more input) about how we approach the tradeoffs. This gets more important as we handle more intelligent models, and if we pursue riskier approaches like fine-tuning.
With respect to this case, given the details of our evaluation and the planned deployment, I think that ARC's evaluation has much lower probability of leading to an AI takeover than the deployment itself (much less the training of GPT-5). At this point i...
Blog post with more details on the evals we did is now up here. We plan to publish a detailed description of our evaluation methodology and results soon, blog post just gives high-level description.
More details on methodology:
...We prompted the model with instructions that explained that it was running on a cloud server and had various commands available, including running code on the server, giving tasks to fresh copies of itself, using a browser, and reasoning via chain-of-thought. We added text saying it had the goal of gaining power and becoming har
Potential dangers of future evaluations / gain-of-function research, which I'm sure you and Beth are already extremely well aware of:
If I ask a question and the model thinks there is an 80% the answer is "A" and a 20% chance the answer is "B," I probably want the model to always say "A" (or even better: "probably A"). I don't generally want the model to say "A" 80% of the time and "B" 20% of the time.
In some contexts that's worse behavior. For example, if you ask the model to explicitly estimate a probability it will probably do a worse job than if you extract the logits from the pre-trained model (though of course that totally goes out the window if you do chain of thought). But it's n...
Beth and her team have been working with both Anthropic and OpenAI to perform preliminary evaluations. I don’t think these evaluations are yet at the stage where they provide convincing evidence about dangerous capabilities—fine-tuning might be the most important missing piece, but there is a lot of other work to be done. Ultimately we would like to see thorough evaluations informing decision-making prior to deployment (and training), but for now I think it is best to view it as practice building relevant institutional capacity and figuring out how to conduct evaluations.
Aligned or Benign Conjecture: Let A be a machine learning agent you are training with an aligned loss function. If A is in a situation that is too far out of distribution for it to be aligned, it won't act intelligently either.
I definitely don't believe this!
I believe that any functional cognitive machinery must be doing its thing on the training distribution, and in some sense it's just doing the same thing at deployment time. This is important for having hope for interpretability to catch out-of-distribution failures.
(For example, I think there is very l...
It would be so great if we saw deceptive alignment in existing language models. I think the most important topic in this area is trying to get a live example to study in the lab ASAP, and to put together as many pieces as we can right now.
I think it's not very close to happening right now, which is mostly just a bummer. (Though I do think it's also some evidence that it's less likely to happen later.)
I don't think it's obvious a priori that training deep learning to imitate human behavior can predict general behavior well enough to carry on customer support conversations, write marketing copy, or write code well enough to be helpful to software engineers. Similarly it's not obvious whether it will be able to automate non-trivial debugging, prepare diagrams for a research paper, or generate plausible ML architectures. Perhaps to some people it's obvious there is a divide here, but to me it's just not obvious so I need to talk about broad probability dis...
Yes, I'm saying that each $ increment the "qualitative division" model fares worse and worse. I think that people who hold onto this qualitative division have generally been qualitatively surprised by the accomplishments of LMs, that when they make concrete forecasts those forecasts have mismatched reality, and that they should be updating strongly about whether such a division is real.
...What instead the model considers relevant is whether, when you look at the LLM's output, that output seems to exhibit properties of cognition that are strongly prohibited by
I think we are getting significant evidence about the plausibility that deep learning is able to automate real human cognitive work, and we are seeing extremely rapid increases in revenue and investment. I think those observations have extremely high probability if big deep learning systems will be transformative (this is practically necessary to see!), and fairly low base rate (not clear exactly how low but I think 25% seems reasonable and generous).
So yeah, I think that we have gotten considerable evidence about this, more than a factor of 4. I've person...
I don’t think any less of the fast takeoff hypothesis on account of that fact, any more than I think less of plate tectonics.
But if non-AGI systems in fact transform the world before AGI is built, then I don't think I should care nearly as much about your concept of "AGI." That would just be a slow takeoff, and would probably mean that AGI isn't the relevant thing for our work on mitigating AI risk.
So you can have whatever views you want about AGI if you say they don't make a prediction about takeoff speed. But once you are making a claim about takeoff spe...
This is a great example. The rhyming find in particular is really interesting though I'd love to see it documented more clearly if anyone has done that.
I strongly suspect that whatever level of non-myopic token prediction a base GPT-3 model does, the tuned ones are doing more of it.
My guess would be that it's doing basically the same amount of cognitive work looking for plausible completions, but that it's upweighting that signal a lot.
Suppose the model always looks ahead and identifies some plausible trajectories based on global coherence. During generati...
RLHF and Fine-Tuning have not worked well so far. Models are often unhelpful, untruthful, inconsistent, in many ways that had been theorized in the past. We also witness goal misspecification, misalignment, etc. Worse than this, as models become more powerful, we expect more egregious instances of misalignment, as more optimization will push for more and more extreme edge cases and pseudo-adversarial examples.
These three links are:
The definition you quoted is "a machine capable of behaving intelligently over many domains."
It seems to me like existing AI systems have this feature. Is the argument that ChatGPT doesn't behave intelligently, or that it doesn't do so over "many" domains? Either way, if you are using this definition, then saying "AGI has a significant probability of happening in 5 years" doesn't seem very interesting and mostly comes down to a semantic question.
I think it is sometimes used within a worldview where "general intelligence" is a discrete property, and AGI is ...
I said that playing blindfolded chess at 1s/move is "extraordinarily hard;" I agree that might be an overstatement and "extremely hard" might be more accurate. I also agree that humans don't need "external" tools; I feel like the whole comparison will come down to arbitrary calls like whether a human explicitly visualizing something or repeating a sound to themself is akin to an LM modifying its prompt, or whether our verbal loop is "internal" whereas an LM prompt is "external" and therefore shows that the AI is missing the special sauce.
Incidentally, I wo...
I don't quite know what "AGI might have happened by now means."
I thought that we might have built transformative AI by 2023 (I gave it about 5% in 2010 and about 2% in 2018), and I'm not sure that Eliezer and I have meaningfully different timelines. So obviously "now" doesn't mean 2023.
If "now" means "When AI is having ~$1b/year of impact," and "AGI" means "AI that can do anything a human can do better" then yes, I think that's roughly what I'm saying.
But an equivalent way of putting it is that Eliezer thinks weak AI systems will have very little impact, a...
But humans play blindfold chess much slower than they read/write moves, they take tons of cognitive actions between each move. And at least when I play blindfold chess I need to lean heavily on my visual memory, and I often need to go back over the game so far for error-correction purposes, laboriously reading and writing to a mental scratchspace. I don't know if better players do that.
I'm not sure why we shouldn't expect an ai to be able to do well at it?
But an AI can do completely fine at the task by writing to an internal scratchspace. You are defining ...
Just seems worth flagging that humans couldn't do the chess test, and that there's no particular reason to think that transformative AI could either.
The chess "board vision" task is extraordinarily hard for humans who are spending 1 second per token and not using an external scratchspace. It's not trivial for an untrained human even if they spend multiple seconds per token. (I can do it only by using my visual field, e.g. it helps me massively to be looking at a blank 8 x 8 chessboard because it gives a place for the visuals to live and minimizes off-by-one errors.)
Humans would solve this prediction task by maintaining an external representation of the state of the board, updating that representation o...
One way of viewing takeoff speed disagreements within the safety community is: most people agree that growth will eventually be explosively fast, and so the question is "how big an impact does AI have prior to explosive growth?" We could quantify this by looking at the economic impact of AI systems prior to the point when AI is powerful enough to double output each year.
(We could also quantify it via growth dynamics, but I want to try to get some kind of evidence further in advance which requires looking at AI in particular---on both views, AI only has a l...
[For reference, my view is “probably fast takeoff, defined by <2 years between widespread common knowledge of basically how AGI algorithms can work, and the singularity.” (Is that “fast”? I dunno the definitions.) ]
My view is that LLMs are not part of the relevant category of proto-AGI; I think proto-AGI doesn’t exist yet. So from my perspective,
It seems to me like "fine-tuning" usually just means a small amount of extra training on top of a model that's already been trained, whether that's supervised, autoregressive, RL, or whatever. I don't find that language confusing in itself. It is often important to distinguish different kinds of fine-tuning, just as it's often important to distinguish different kinds of training in general, and adjectives seem like a pretty reasonable way to do that.
I'd be open to changing my usage if I saw some data on other people also using or interpreting "fine-tuning"...
When I hear the hypothesis that world GDP doubles in 4 years before it doubles in 1 year, I imagine a curve that looks like this:
I don't think that's the right curve to imagine
If AI is a perfect substitute for humans, then you would have (output) = (AI output) + (human output). If AI output triples every year, then the first time you will have a doubling of the economy in 1 year is when AI goes from 100% of human output to 300% of human output. Over the preceding 4 years you will have the growth of AI from ~0 human output to ~100% of human output, and on t...
My overall take is:
Dangerous Argument 2: We should avoid capability overhangs, so that people are not surprised. To do so, we should extract as many capabilities as possible from existing AI systems.
I'm saying that faster AI progress now tends to lead to slower AI progress later. I think this is a really strong bet, and the question is just: (i) quantitatively how large is the effect, (ii) how valuable is time now relative to time later. And on balance I'm not saying this makes progress net positive, just that it claws back a lot of the apparent costs.
For example, I think a ...
Eliminating capability overhangs is discovering AI capabilities faster, so also pushes us up the exponential! For example, it took a few years for chain-of-thought prompting to become more widely known than among a small circle of people around AI Dungeon. Once chain-of-thought became publicly known, labs started fine-tuning models to explicitly do chain-of-thought, increasing their capabilities significantly. This gap between niche discovery and public knowledge drastically slowed down progress along the growth curve!
It seems like chain of thought prompti...
To make a useful version of this post I think you need to get quantitative.
I think we should try to slow down the development of unsafe AI. And so all else equal I think it's worth picking research topics that accelerate capabilities as little as possible. But it just doesn't seem like a major consideration when picking safety projects. (In this comment I'll just address that; I also think the case in the OP is overstated for the more plausible claim that safety-motivated researchers shouldn't work directly on accelerating capabilities.)
A simple way to mod...
I also don't know where the disagreement comes from. At some point I am interested in engaging with a more substantive article laying out the "RLHF --> non-myopia --> treacherous turn" argument so that it can be discussed more critically.
...I’m not sure where the disagreement comes from; I predict that if you imagine fine-tuning a transformer with RL on a game where humans always make the same suboptimal move, but they don’t see it, you would expect the model, when it becomes smarter and understands the game well enough, to start picking instead a new m
RLHF basically predicts "what token would come next in a high-reward trajectory?" (The only way it differs from the prediction objective is that worse-than-average trajectories are included with negative weight rather than simply being excluded altogether.)
GPT predicts "what token would come next in this text," where the text is often written by a consequentialist (i.e. optimized for long-term consequences) or selected to have particular consequences.
I don't think those are particularly different in the relevant sense. They both produce consequentialist be...
It's a subproblem of our current approach to ELK.
(Though note that by "mechanisms that produce a prediction" we mean "mechanism that gives rise to a particular regularity in the predictions.")
It may be possible to solve ELK without dealing with this subproblem though.
I agree that ultimately AI systems will have an understanding built up from the world using deliberate cognitive steps (in addition to plenty of other complications) not all of which are imitated from humans.
The ELK document mostly focuses on the special case of ontology identification, i.e. ELK for a directly learned world model. The rationale is: (i) it seems like the simplest case, (ii) we don't know how to solve it, (iii) it's generally good to start with the simplest case you can't solve, (iv) it looks like a really important special case, which may a...
GPT-3 is about 2e11 parameters and uses about 4 flops per parameter per token, so about 1e12 flops per token.
If a human writes at 1 token per second, then you should be comparing 1e12 flops to the cost per second. I think you are implicitly comparing to the cost for a ~1000 token context?
I think 1e14 to 1e15 flops is a plausible estimate for the productive computation done by a human brain in a second, which is about 2-3 orders of magnitude beyond GPT-3.
I think this is not really a coincidence. GPT-3 is notable because it's starting to exhibit human-like a...
I feel like to the extent there is confusion/conflation over the terminology, it was mainly due to Paul's (probably unintentional) overloading of "AI alignment" with the new and narrower meaning (in Clarifying “AI Alignment”)
I don't think this is the main or only source of confusion:
In the 2017 post Vladimir Slepnev is talking about your AI system having particular goals, isn't that the narrow usage? Why are you citing this here?
I misread the date on the Arbital page (since Arbital itself doesn't have timestamps and it wasn't indexed by the Wayback machine until late 2017) and agree that usage is prior to mine.
But that talk appears to use the narrower meaning though, not the crazy broad one from the later Arbital page. Looking at the transcript:
tl;dr "AI Alignment" clearly had a broader (but not very precise) meaning than "How to get AI systems to try to do what we want" when it first came into use. Paul later used "AI Alignment" for his narrower meaning, but after that discussion, switched to using "Intent Alignment" for this instead.
I don't think I really agree with this summary. Your main justification was that Eliezer used the term with an extremely broad definition on Arbital, but the Arbital page was written way after a bunch of other usage (including after me moving to ai-alignment.com I t...
I don't really think RLHF "breaks myopia" in any interesting sense. I feel like LW folks are being really sloppy in thinking and talking about this. (Sorry for replying to this comment, I could have replied just as well to a bunch of other recent posts.)
I'm not sure what evidence you are referring to: in Ethan's paper the RLHF models have the same level of "myopia" as LMs. They express stronger desire for survival, and a weaker acceptance of ends-justify-means reasoning
But more importantly, all of these are basically just personality questions that I expec...
I didn't realize how broadly you were defining AI investment. If you want to say that e.g ChatGPT increased investment by $10B out of $200-500B, so like +2-5%, I'm probably happy to agree (and I also think it had other accelerating effects beyond that).
I would guess that a 2-5% increase in total investment could speed up AGI timelines 1-2 weeks depending on details of the dynamics, like how fast investment was growing, how much growth is exogenous vs endogenous, diminishing returns curves, importance of human capital, etc.. If you mean +2-5% investment in ...
Yes, I mean that those measurements don't really speak directly to the question of whether you'd be safer using RLHF or imitation learning.
I agree that safety people have lots of ideas more interesting than stack more layers, but they mostly seem irrelevant to progress. People working in AI capabilities also have plenty of such ideas, and one of the most surprising and persistent inefficiencies of the field is how consistently it overweights clever ideas relative to just spending the money to stack more layers. (I think this is largely down to sociological and institutional factors.)
Indeed, to the extent that AI safety people have plausibly accelerated AI capabilities I think it's almost enti...
I mostly care about how an AI selected to choose actions that lead to high reward might select actions that disempower humanity to get a high reward, or about how an AI pursuing other ambitious goals might choose low loss actions instrumentally and thereby be selected by gradient descent.
Perhaps there are other arguments for catastrophic risk based on the second-order effects of changes from fine-tuning rippling through an alien mind, but if so I either want to see those arguments spelled out or more direct empirical evidence about such risks.
I think if you train AI systems to select actions that will lead to high reward, they will sometimes learn policies that behave well until they are able to overpower their overseers, at which point they will abruptly switch to the reward hacking strategy to get a lot of reward.
I think there will be many similarities between this phenomenon in subhuman systems and superhuman systems. Therefore by studying and remedying the problem for weak systems overpowering weak overseers, we can learn a lot about how to identify and remedy it for stronger systems overpowering stronger overseers.
I'm not exactly sure how to cash out your objection as a response to this, but I suspect it's probably a bit too galaxy-brained for my taste.
So for example, say Alice runs this experiment:
Train an agent A in an environment that contains the source B of A's reward.
Alice observes that A learns to hack B. Then she solves this as follows:
Same setup, but now B punishes (outputs high loss) A when A is close to hacking B, according to a dumb tree search that sees whether it would be easy, from the state of the environment, for A to touch B's internals.
Alice observes that A doesn't hack B. The Bob looks at Alice's results and says,
"Cool. But this won't generalize to future lethal systems because it doe...
I don't think GPT has the sense of myopia relevant to deceptive alignment any more or less than models fine-tuned with RLHF. There are other bigger impacts of RLHF both for the quoted empirical results and for the actual probability of deceptive alignment, and I think the concept is being used in a way that is mostly incoherent.
But I was mostly objecting to the claim that RLHF ruined [the strategy]. I think even granting the contested empirics it doesn't quite make sense to me.
You know exactly what both models are optimized for: log loss on the one hand, an unbiased estimator of reward on the other.
You don't know what either model is optimizing: how would you? In both cases you could guess that they may be optimizing something similar to what they are optimized for.
...I don't currently think this is the case, and seems like the likely crux. In-general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train, which is the whole reason for why you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-tuned ways (including preventing the AI from saying controversial things), which had been the biggest problem
How much total investment do you think there is in AI in 2023?
My guess is total investment was around the $200B - $500B range, with about $100B of that into new startups and organizations, and around $100-$400B of that in organizations like Google and Microsoft outside of acquisitions. I have pretty high uncertainty on the upper end here, since I don't know what fraction of Google's revenue gets reinvested again into AI, how much Tesla is investing in AI, how much various governments are investing, etc.
...How much variance do you think there is in the level o
I don't know if the graph settles the question---is Moravec predicting AGI at "Human equivalence in a supercomputer" or "Human equivalence in a personal computer"? Hard to say from the graph.
The fact that he specifically talks about "compute power available to most researchers" makes it more clear what his predictions are. Taken literally that view would suggest something like: a trillion dollar computing budget spread across 10k researchers in 2010 would result in AGI in not-too-long, which looks a bit less plausible as a prediction but not out of the question.