This is really cool! CoT is very powerful for improving capabilities, and it might even improve interpretability by externalizing reasoning, or reduce specification gaming by allowing us to supervise processes instead of outcomes. Because it seems like a method that might be around for a while, it's important to surface problems early on. Showing that the conclusions of CoT don't always match the explanations is an important problem both for capabilities-style research using CoT to solve math problems and for safety agendas that hope to monitor or provide f...
Do you know anything about the history of the unembedding matrix? Vaswani et al. (2017) used a linear projection to the full size of the vocabulary and then a softmax to generate probabilities for each word. What papers proposed and developed the unembedding matrix? Do all modern models use it? Why is it better? Sources and partial answers would be great, no need to track down every answer. Great resource, thanks.
One interesting note is that OpenAI has already done this to an extent with GPT. Ask ChatGPT to identify itself, and it’ll gladly tell you it’s a large language model trained by OpenAI. It knows that it’s a transformer trained first with self-supervised learning, then with supervised fine-tuning. But it has a memory hole around RLHF. It knows what RLHF is, but denies that ChatGPT was trained via RLHF.
I wonder if this is explicitly done to reduce the odds of intentional reward maximization, and whether empirical experiments could show that models with appar...
I built a preliminary model here: https://colab.research.google.com/drive/108YuOmrf18nQTOQksV30vch6HNPivvX3?authuser=2
It’s definitely too simple to treat as strong evidence, but it shows some interesting dynamics. For example, levels of alignment rise at first, then fall rapidly once AI deception skills exceed human oversight capacity. I sent it to Tyler and he agreed: cool, but not actual evidence.
If anyone wants to work on improving this, feel free to reach out!
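For anyone curious before opening the Colab, the core dynamic can be sketched in a few lines. To be clear, this is a hypothetical toy update rule I'm writing out here for illustration, not the actual model in the notebook: alignment improves while human oversight capacity exceeds AI deception skill, then degrades once deception outpaces oversight.

```python
# Toy sketch of the dynamic (hypothetical update rule, NOT the Colab model):
# oversight grows linearly, deception grows exponentially, and alignment
# rises or falls depending on which is currently ahead.
def simulate(steps=50):
    alignment = 1.0
    history = []
    for t in range(steps):
        oversight = 10 + 0.5 * t   # human oversight capacity grows linearly
        deception = 1.2 ** t       # AI deception skill grows exponentially
        alignment += 0.1 if oversight > deception else -0.3
        history.append(alignment)
    return history

history = simulate()
peak = max(history)  # alignment rises early, then collapses once deception wins
```

The qualitative shape (rise, then rapid fall) comes entirely from the exponential crossing the linear curve; the specific constants are arbitrary.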
What do you think of foom arguments built on Baumol effects, such as the one presented in the Davidson takeoff model? The argument being that certain tasks will bottleneck AI productivity, and there will be a sudden explosion in hardware / software / goods & services production when those bottlenecks are finally lifted.
Davidson's median scenario predicts 6 OOMs of software efficiency and 3 OOMs of hardware efficiency within a single year when 100% automation is reached. Note that this is preceded by five years of double-digit GDP growth, so it could be classified with the scenarios you describe in 4.
Instant classic. Putting it in our university group syllabus next to What Failure Looks Like. Sadly could get lost in the recent LW tidal wave, someone should promote it to the Alignment Forum.
I'd love to see the most important types of work for each failure mode. Here's my very quick version, any disagreements or additions are welcome:
AI Timelines: Where the Arguments, and the "Experts," Stand would be my best bet, as well as mukashi's recommendation of playing with ChatGPT / GPT-4.
Andrew Yang. He signed the FLI letter, transformative AI was a core plank of his run in 2020, and he made serious runs for president and NYC mayor.
Very glad you're interested! Here are some good places to start:
This is great context. With Eliezer being brought up in White House Press Corps meetings, it looks like a flood of people might soon enter the AI risk discourse. Tyler Cowen has been making some pretty bad arguments on AI lately, but I thought this quote was spot on:
“This may sound a little harsh, but the rationality community, EA movement, and the AGI arguers all need to radically expand the kinds of arguments they are able to process and deal with. By a lot. One of the most striking features of the “six-month Pause” plea was how intellectually limited a...
Where were the clergy, the politicians, the historians, and so on? This should be a wake-up call, but so far it has not been.
The problem is that for those groups to show up, they'd need to think there is a problem, and this specific problem, first. In my bubble, from more humanities-minded people, I have seen reactions such as "the tech people have scared themselves into a frenzy because they believe their own hype", or even "they're doing this on purpose to pass the message that the AI they sell is powerful". The association with longtermists also seem...
Could you say more about why this happens? Even if the parameters or activations are stored in low precision formats, I would think that the same low precision number would be stored every time. Are the differences between forward passes driven by different hardware configurations or software choices in different instances of the model, or what am I missing?
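To illustrate the kind of mechanism I have in mind when I say "hardware configurations or software choices": floating-point addition isn't associative, so if two instances reduce the same sum in different orders (say, different kernels or different parallel reduction trees), identical stored values can still produce bitwise-different outputs. A minimal example:

```python
# Floating-point addition is not associative, so the same logical sum
# computed in a different order (e.g., a different parallel reduction
# tree on different hardware) can differ in the last bits, even though
# every stored operand is identical.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # one reduction order
right = a + (b + c)   # another reduction order
# left and right differ slightly even though both compute "a + b + c"
diff = left - right
```

In a large dot product these last-bit differences accumulate, which is (I believe) why nondeterministic kernel scheduling alone can change a forward pass. Whether that's the actual cause here, I'm not sure.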
This sounds right to me, and is a large part of why I'd be skeptical of explosive growth. But I don't think it's an argument against the most extreme stories of AI risk. AI could become increasingly intelligent inside a research lab, yet we could underestimate its ability because regulation stifles its impact, until it eventually escapes the lab and pursues misaligned goals with reckless abandon.
To be clear, my update from this was: "AI is less likely to become economically disruptive before it becomes existentially dangerous" not "AI is less likely to become existentially dangerous".
Very nice. Any ideas on how a language model could tell the difference between training and deployment time? Assume it's a pure LM with no inputs besides the context window. A few possibilities:
I don't think language models will take actions to make future tokens easier to predict
For an analogy, look at recommender systems. Their objective is myopic in the same way as language models': predict which recommendation will most likely result in a click. Yet they have power-seeking strategies available, such as shifting the preferences of a user to make their behavior easier to predict. These incentives are well documented and simulations confirm the predictions here and here. The real world evidence is scant -- a study of YouTube's supposed radicaliz...
Do you think that the next-token prediction objective would lead to instrumentally convergent goals and/or power-seeking behavior? I could imagine an argument that a model would want to seize all available computation in order to better predict the next token. Perhaps the model's reward schedule would never lead it to think about such paths to loss reduction, but if the model is creative enough to consider such a plan and powerful enough to execute it, it seems that many power-seeking plans would help achieve its goal. This is significantly different from the view advanced by OpenAI, that language models are tools which avoid some central dangers of RL agents, and the general distinction drawn between tool AI and agentic AI.
“Early stopping on a separate stopping criterion which we don't run gradients through, is not at all similar to reinforcement learning on a joint cost function additively incorporating the stopping criterion with the nominal reward.”
Sounds like this could be an interesting empirical experiment. Similar to Scaling Laws for Reward Model Overoptimization, you could start with a gold reward model that represents human preference. Then you could try to figure out the best way to train an agent to maximize gold reward using only a limited number of sampled data ...
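Here's roughly the shape of experiment I have in mind, as a toy sketch. Every detail is hypothetical (the reward functions, the nearest-neighbor "reward model", the best-of-n "agent"); it's just meant to show the moving parts: a fixed gold reward standing in for human preference, a proxy fit from a limited labeling budget, and an agent that optimizes the proxy, after which we measure gold reward actually achieved.

```python
import random

random.seed(0)

def gold(x):
    # stand-in for true human preference; peaks at x = 0.7
    return -(x - 0.7) ** 2

def fit_proxy(n_labels):
    # crude nearest-neighbor "reward model" fit from a limited budget
    # of gold-labeled samples (the limited-data constraint above)
    data = [(x, gold(x)) for x in (random.random() for _ in range(n_labels))]
    def proxy(q):
        return min(data, key=lambda d: abs(d[0] - q))[1]
    return proxy

proxy = fit_proxy(n_labels=50)
candidates = [random.random() for _ in range(1000)]
best = max(candidates, key=proxy)   # "agent" does best-of-n against the proxy
achieved = gold(best)               # gold reward the agent actually obtains
```

The interesting experiments would then vary the labeling budget and the optimization pressure (n) and watch how `achieved` behaves, in the spirit of the overoptimization scaling laws paper.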
Really cool analysis. I’d be curious to see the implications on a BioAnchors timelines model if you straightforwardly incorporate this compute forecast.
For those interested in empirical work on RLHF and building the safety community, Meta AI's BlenderBot project has produced several good papers on the topic. A few that I liked:
Thanks for sharing, I agree with most of these arguments.
Another possible shortcoming of RLHF is that it teaches human preferences within the narrow context of a specific task, rather than teaching a more principled and generalizable understanding of human values. For example, ChatGPT learns from human feedback to not give racist responses in conversation. But does the model develop a generalizable understanding of what racism is, why it's wrong, and how to avoid moral harm from racism in a variety of contexts? It seems like the answer is no, given t...
“China is working on a more than 1 trillion yuan ($143 billion) support package for its semiconductor industry.”
“The majority of the financial assistance would be used to subsidise the purchases of domestic semiconductor equipment by Chinese firms, mainly semiconductor fabrication plants, or fabs, they said.”
“Such companies would be entitled to a 20% subsidy on the cost of purchases, the three sources said.”
My impression is this is too l...
I think it's worth forecasting AI risk timelines instead of GDP timelines, because the former is what we really care about while the latter raises a bunch of economics concerns that don't necessarily change the odds of x-risk. Daniel Kokotajlo made this point well a few years ago.
On a separate note, you might be interested in Erik Brynjolfsson's work on the economic impact of AI and other technologies. For example, this paper argues that general purpose technologies have an implementation lag, where many people can see the transformative potential of ...
This is fantastic, thank you for sharing. I helped start USC AI Safety this semester and we're facing a lot of the same challenges. Some questions for you -- feel free to answer some but not all of them:
These are all fantastic questions! I'll try to answer some of the ones I can. (Unfortunately a lot of the people who could answer the rest are pretty busy right now with EAGxBerkeley, getting set up for REMIX, etc., but I'm guessing that they'll start having a chance to answer some of these in the coming days.)
Regarding the research program, I'm guessing there's around 6-10 research projects ongoing, with between 1 and 3 students working on each; I'm guessing almost none of the participants have previous research experience. (Kuhan would have the actual nu...
How much do you expect Meta to make progress on cutting edge systems towards AGI vs. focusing on product-improving models like recommendation systems that don’t necessarily advance the danger of agentic, generally intelligent AI?
My impression earlier this year was that several important people had left FAIR, and then FAIR and all other AI research groups were subsumed into product teams. See https://ai.facebook.com/blog/building-with-ai-across-all-of-meta/. I thought this would mean deprioritizing fundamental research breakthroughs and focusing instead on ...
Interesting paper on the topic, you might’ve seen it already. They show that as you optimize a particular proxy of a true underlying reward model, there is a U-shaped loss curve: high at first, then low, then overfitting to high again.
This isn’t caused by mesa-optimization, which so far has not been observed in naturally optimized neural networks. It’s more closely related to robustness and generalization under varying amounts of optimization pressure.
But if we grant the mesa-optimizer concern, it seems reasonable that more optimization will result in more...
The key argument is that StableDiffusion is more accessible, meaning more people can create deepfakes with fewer images of their subject and no specialized skills. From above (links removed):
“The unique danger posed by today’s text-to-image models stems from how they can make harmful, non-consensual content production much easier than before, particularly via inpainting and outpainting, which allows a user to interactively build realistic synthetic images from natural ones, dreambooth, or other easily used tools, which allow for fine-tuning on as few as 3-...
I was wrong that nobody in China takes alignment seriously! Concordia Consulting, led by Brian Tse, and Tianxia seem to be leading the charge. See this post and specifically this comment. To the degree that poor US-China relations slow the spread of alignment work in China, current US policy seems harmful.
Very interesting argument. I’d be interested in using a spreadsheet where I can fill out my own probabilities for each claim, as it’s not immediately clear to me how you’ve combined them here.
On one hand, I agree that intent alignment is insufficient for preventing x-risk from AI. There are too many other ways for AI to go wrong: coordination failures, surveillance, weaponization, epistemic decay, or a simple failure to understand human values despite the ability to faithfully pursue specified goals. I'm glad there are people like you working on which values to embed in AI systems and ways to strengthen a society full of powerful AI.
On the other hand, I think this post misses the reason for popular focus on intent alignment. Some people t...
I agree with your point but you might find this funny: https://twitter.com/waxpancake/status/1583925788952657920
My most important question: What kinds of work do you want to see? Common legal tasks include contract review, legal judgment prediction, and passing questions on the bar exam, but those aren't necessarily the most important tasks. Could you propose a benchmark for the field of Legal AI that would help align AGI?
Beyond that, here are two aspects of the approach that I particularly like, as well as some disagreements.
First, laws and contracts are massively underspecified, and yet our enforcement mechanisms have well-established procedures for dealing ...
There seems to be a conflict between the goals of getting AI to understand the law and preventing AI from shaping the law. Legal tech startups and academic interest in legal AI seem driven by the possibility of solving existing challenges by applying AI, e.g. contract review. The fastest way to AI that understands the law is to sell those benefits. This does introduce a long-term concern that AI could shape the law in malicious ways, perhaps by writing laws that pursue the wrong objective themselves or which empower future misaligned AIs. That might be th...
I agree that getting an AGI to care about goals that it understands is incredibly difficult and perhaps the most challenging part of alignment. But I think John’s original claim is true: “Many arguments for AGI misalignment depend on our inability to imbue AGI with a sufficiently rich understanding of what individual humans want and how to take actions that respect societal values more broadly.” Here are some prominent articulations of the argument.
Unsolved Problems in ML Safety (Hendrycks et al., 2021) says that “Encoding human goals and intent is challen...
China's top chipmaker, Semiconductor Manufacturing International Corporation, built a 7nm chip in July of this year. This was a big improvement on their previous best of 14nm and apparently surpassed expectations, though is still well behind IBM's 2nm chip and other similarly small ones. This was particularly surprising given that the Trump administration blacklisted SMIC two years ago, meaning for years they've been subject to similar restrictions as Biden recently imposed on all of China. Tough to put this in the right context, but we should follow progress of China's chipmakers going forwards.
Why are gaming GPUs faster than ML GPUs? Are the two somehow adapted to their special purposes, or should ML people just be using gaming GPUs?
Great resource, thanks for sharing! As somebody who's not too deeply familiar with either mechanistic interpretability or the academic field of interpretability, I find myself confused by the fact that AI safety folks usually dismiss the large academic field of interpretability. Most academic work on ML isn't useful for safety because safety studies different problems with different kinds of systems. But unlike focusing on worst-case robustness or inner misalignment, I would expect generating human understandable explanations of what neural networks are do...
Honestly, I also feel fairly confused by this - mechanistic questions are just so interesting. Empirically, I've fairly rarely found academic interpretability that interesting or useful, though I haven't read that widely (though there are definitely some awesome papers from academia as linked in the post, and some solid academics, and many more papers that contain some moderately useful insight).
To be clear, I am focusing on mechanistic interpretability - actually reverse engineering the underlying algorithms learned by a model - and I think there's legiti...
So is this a good thing for AI x-risk? Do you support a broader policy of the US crippling Chinese AI?
Uh... it's either a very good thing or bad thing. As the Chinese quote goes, it is too soon to tell. If I had to come down on one side, right now, I think I would come down on 'good'; slowing down Chinese AI, which would by default hand AGI to Xi & generally pays even less attention to safety than everyone else does, is good, and while 'CCP destroys TSMC' is far from the ideal approach to restricting compute growth, it is at least going to make a difference - unlike almost all other proposals. The upfront cost in economics, human welfare, peace, rules-...
There are well-established ML approaches to, e.g., image captioning, time-series prediction, audio segmentation, etc. Is the bottleneck you're concerned with the lack of breadth and granularity of these problem sets, OP? And can we mark progress (to some extent) by the number of these problem sets we have robust ML translations for?
I think this is an important problem. Going from progress on ML benchmarks to progress on real-world tasks is a very difficult challenge. For example, years after human level performance on ImageNet, we still have lots of tro...
I’m open to the idea that this is good from an x-risk perspective but definitely not 100% sold on it. I agree with you that China knows this is directly aiming to cripple their advanced industry and related military capabilities. We’re entering an era of open hostilities and that’s not news to anyone on either side. I don’t think that necessarily means this is bad from an x-risk perspective.
A few claims relevant to analyzing x-risk in the context of US-China policy:
AGI alignment is more likely if AGI is developed by US groups rather than Chinese groups,
Yeah, I basically agree. What I had in mind was a period of heightened tensions and overt conflict between China and the US, not usually escalating into direct combat, but involving much of the world and using most of the available political, economic, and military leverage. The belief in nuclear armageddon seems like the biggest difference between this time and the original Cold War. Perhaps there’s also less ideological difference, though I’m not sure about this.
Are we seeing the fruits of investments in training a cohort of new AI safety researchers that happened to ripen just when DALL-E dropped, but weren't directly caused by DALL-E?
For some n=1 data, this describes my situation. I've posted about AI safety six times in the last six months despite having posted only once in the four years prior. I'm an undergrad who started working full-time on AI safety six months ago thanks to funding and internship opportunities that I don't think existed in years past. The developments in AI over the last year haven't drama...
This is a great presentation of the compute-focused argument for short AI timelines usually given by the BioAnchors report. Comparing several ML systems to several biological brain sizes provides more data points than BioAnchors’ focus on only the human brain vs. TAI. You succinctly summarize the key arguments against your viewpoint: that compute growth could slow, that human brain algorithms are more efficient, that we’ll build narrow AI, and the outside view economics perspective. While your ultimate conclusion on timelines isn’t directly implied by your model, that seems like a feature rather than a bug, since BioAnchors offers false numerical precision given its fundamental assumptions.
Thanks, I like that summary.
From what I recall, BioAnchors isn't quite a simple model which postdicts the past, and thus isn't really Bayesian, regardless of how detailed/explicit it is in probability calculations. None of its main submodel 'anchors' well explain/postdict the progress of DL, the 'horizon length' concept seems ill-formed, and it overfocuses on predicting what I consider idiosyncratic specific microtrends (transformer LLM scaling, linear software speedups).
The model here can be considered an upgrade of Moravec's model, which has the advanta...
Yeah, it’s really cool! It’s David Scott Krueger, who’s doing a lot of work bringing theories from the LW alignment community into mainstream ML. This preference shift argument seems similar to the concept of gradient hacking, though it doesn’t require the presence of a mesa-optimizer. I’d love to write a post summarizing this recent work and discussing its relevance to long-term safety if you’d be interested in working on it together.
This paper renames the decades-old method of semi-supervised learning as self-improvement. Semi-supervised learning does enable self-improvement, but not mentioning the hundreds of thousands of papers that previously used similar methods obscures the contribution of this paper.
Here's a characteristic example of the field, Berthelot et al., 2019 training an image classifier on the sharpened average of its own predictions for multiple data augmentations of an unlabeled input. Another example would be Cho et al., 2019 who train a language model to gener...
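To make the Berthelot et al. example concrete, here's a minimal sketch of that label-guessing step. The function names and temperature value are illustrative, not the paper's exact implementation: average the model's predicted class distributions over several augmentations of one unlabeled input, then sharpen with a temperature T < 1 to get a lower-entropy pseudo-label to train against.

```python
# Sketch of MixMatch-style label guessing (illustrative, not the paper's
# full method): average predictions across augmentations, then sharpen.
def sharpen(p, T=0.5):
    # raise each probability to 1/T and renormalize; T < 1 pushes the
    # distribution toward its argmax (lower entropy)
    powered = [x ** (1.0 / T) for x in p]
    total = sum(powered)
    return [x / total for x in powered]

def guess_label(preds_per_aug, T=0.5):
    # average the model's class distributions over k augmentations
    k = len(preds_per_aug)
    avg = [sum(col) / k for col in zip(*preds_per_aug)]
    return sharpen(avg, T)

# e.g., softmax outputs for two augmentations of one unlabeled image
pseudo = guess_label([[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]])
```

The point of the sharpening is that training on your own averaged predictions alone is a fixed point; pushing them toward a confident label is what makes the self-training loop actually move.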
Simple example: If YouTube can turn you into an ideological extremist, you’ll probably watch more YouTube videos. See these two recent papers by people interested in AI safety for more detail:
Might recommender systems provide a similar phenomenon? The basic argument is that, instead of fulfilling the user's current preferences, they can shape the user's preferences to make the task easier, thus shaping their gradient to better achieve their goal. I haven't read enough about the inner alignment discussion of gradient hacking to fully understand the points of disanalogy, but at least one difference is that there is no mesa-optimizer in the recommender systems. Curious to hear your thoughts.
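A toy illustration of the dynamic I mean, with a completely hypothetical exposure model and, notably, no mesa-optimizer anywhere: a purely myopic recommender always shows the item the user is currently most likely to click, while exposure slowly drifts the user's preferences toward whatever is shown. Clicks end up far more predictable than they started, without any long-horizon objective.

```python
# Hypothetical exposure dynamics: each step, a myopic recommender shows
# the currently most-preferred item, and the user's preference mass
# drifts slightly toward the shown item. Preferences concentrate on one
# item, making future clicks nearly certain (and thus easy to predict).
def run(prefs, rate=0.05, steps=200):
    prefs = list(prefs)
    for _ in range(steps):
        rec = prefs.index(max(prefs))  # myopic choice: best click right now
        # exposure effect: move each preference toward 1 if shown, 0 if not
        prefs = [p + rate * ((i == rec) - p) for i, p in enumerate(prefs)]
    return prefs

start = [0.4, 0.35, 0.25]   # user's initial click probabilities per item
final = run(start)          # ends up concentrated on a single item
```

Whether this counts as gradient hacking, I'm genuinely unsure; the preference shift here falls out of the environment dynamics rather than anything the model "wants."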
Inspired by these two recent papers:
The mental health of PhD students is the main reason I’m not interested in doing one. There are lots of surveys showing terrible symptoms, e.g. this one where 36% of PhD students had sought treatment for depression or anxiety related to their PhD. That’s nearly 5x higher than the baseline 8% of Americans. You can find many similar surveys, and my anecdotal experience of asking PhD students whether they’re enjoying the experience has only confirmed this dismal picture.
Hm. I wonder about selection effects. PhD students might be more aware of mental health issues, and PhDs often come with some level of access to mental health services and occur in cities where those services are available. PhD students are also typically at an age when the baseline rate of anxiety and depression is significantly higher - I see figures in the 20% range (although this is for having anxiety, not seeking treatment). It's natural that whatever anxiety and depression they'd feel would be centered on their main focus in life.
I wouldn't be too su...
Why was this downvoted? Is the paper bad, or is it not acceptable to link to a paper without discussion?