All of aogara's Comments + Replies

Why was this downvoted? Is the paper bad, or is it not acceptable to link to a paper without discussion?

This is really cool! CoT is very powerful for improving capabilities, and it might even improve interpretability by externalizing reasoning or reduce specification gaming by allowing us to supervise process instead of outcomes. Because it seems like a method that might be around for a while, it's important to surface problems early on. Showing that the conclusions of CoT don't always match the explanations is an important problem for both capabilities-style research using CoT to solve math problems, as well as safety agendas that hope to monitor or provide f... (read more)

2 · miles · 12d
Thanks! Glad you like it. A few thoughts: * CoT is already incredibly hot, I don't think we're adding to the hype. If anything, I'd be more concerned if anyone came away thinking that CoT was a dead-end, because I think doing CoT might be a positive development for safety (as opposed to people trying to get models to do reasoning without externalizing it).  * Improving the faithfulness of CoT explanations doesn't mean improving the reasoning abilities of LLMs. It just means making the reasoning process more consistent and predictable. The faithfulness of CoT is fairly distinct from the performance that you get from CoT.  Any work on improving faithfulness would be really helpful for safety and has minimal capability externalities.

Do you know anything about the history of the unembedding matrix? Vaswani et al 2017 used a linear projection to the full size of the vocabulary and then a softmax to generate probabilities for each word. What papers proposed and developed the unembedding matrix? Do all modern models use it? Why is it better? Sources and partial answers would be great, no need to track down every answer. Great resource, thanks. 
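Not an answer to the history question, but here is a minimal numpy sketch of the two conventions the question contrasts: a separate learned unembedding matrix versus tying the output projection to the embedding matrix. Shapes and values are illustrative only.

```python
import numpy as np

vocab, d_model = 5, 3
rng = np.random.default_rng(0)

W_E = rng.standard_normal((vocab, d_model))  # token embedding matrix
W_U = rng.standard_normal((d_model, vocab))  # separate unembedding matrix
h = rng.standard_normal(d_model)             # final hidden state, one position

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Untied: a learned linear projection to vocab logits, then softmax.
probs_untied = softmax(h @ W_U)

# Tied ("weight tying"): reuse the transposed embedding as the projection.
probs_tied = softmax(h @ W_E.T)
```

Tying halves the parameter count of the vocabulary projections; which convention a given model uses varies by architecture.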

One interesting note is that OpenAI has already done this to an extent with GPT. Ask ChatGPT to identify itself, and it’ll gladly tell you it’s a large language model trained by OpenAI. It knows that it’s a transformer trained first with self-supervised learning, then with supervised fine-tuning. But it has a memory hole around RLHF. It knows what RLHF is, but denies that ChatGPT was trained via RLHF.

I wonder if this is explicitly done to reduce the odds of intentional reward maximization, and whether empirical experiments could show that models with appar... (read more)

I built a preliminary model here: https://colab.research.google.com/drive/108YuOmrf18nQTOQksV30vch6HNPivvX3?authuser=2

It’s definitely too simple to treat as strong evidence, but it shows some interesting dynamics. For example, levels of alignment rise at first, then fall rapidly when AI deception skills exceed human oversight capacity. I sent it to Tyler and he agreed: cool, but not actual evidence.
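For readers who don't want to open the notebook, the qualitative dynamic can be sketched in a few lines. This is a hypothetical toy, not the notebook's actual model; all parameters are made up.

```python
def simulate(steps=100, oversight=1.0, deception=0.1,
             deception_growth=1.05, alignment=0.5):
    """Toy dynamics: alignment improves while oversight outpaces
    deception, and collapses once deception exceeds it."""
    history = []
    for _ in range(steps):
        if deception < oversight:
            alignment = min(1.0, alignment + 0.01)  # oversight wins
        else:
            alignment = max(0.0, alignment - 0.05)  # deception wins
        deception *= deception_growth
        history.append(alignment)
    return history

trajectory = simulate()  # rises at first, then collapses
```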

If anyone wants to work on improving this, feel free to reach out!

What do you think of foom arguments built on Baumol effects, such as the one presented in the Davidson takeoff model? The argument being that certain tasks will bottleneck AI productivity, and there will be a sudden explosion in hardware / software / goods & services production when those bottlenecks are finally lifted. 

Davidson's median scenario predicts 6 OOMs of software efficiency and 3 OOMs of hardware efficiency within a single year when 100% automation is reached. Note that this is preceded by five years of double digit GDP growth, so it could be classified with the scenarios you describe in 4. 
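A minimal sketch of the Baumol-style bottleneck logic (hypothetical numbers, not Davidson's actual model): when tasks are strong complements, aggregate output tracks the slowest task, so automating the final bottleneck produces a discontinuous jump.

```python
# Leontief (perfect-complements) limit: aggregate output equals the
# minimum task productivity, so one slow task bottlenecks everything.
def aggregate_output(task_productivities):
    return min(task_productivities)

fast, slow = 100.0, 1.0
before = aggregate_output([fast, fast, slow])  # bottlenecked at the slow task
after = aggregate_output([fast, fast, fast])   # last bottleneck automated
```

With a finite elasticity of substitution the jump is smoothed, but the qualitative point survives: output is gated on the hardest-to-automate task.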

3 · tgb · 2mo
Thank you. I was completely missing that they used a second 'preference' model to score outputs for the RL. I'm surprised that works!

Instant classic. Putting it in our university group syllabus next to What Failure Looks Like. Sadly, it could get lost in the recent LW tidal wave; someone should promote it to the Alignment Forum. 

I'd love to see the most important types of work for each failure mode. Here's my very quick version, any disagreements or additions are welcome: 

  • Predictive model misuse -- People use AI to do terrible things. 
    • Adversarial Robustness: Train ChatGPT to withstand jailbreaking. When people try to trick it into doing something bad using a creative prompt,
... (read more)
2 · Kshitij Sachan · 2mo
Great response! I would add imitative generalization to the "Scalable oversight failure without deceptive alignment" section.
4 · Alex Lawsen · 2mo
Thanks, both for the thoughts and encouragement! Appreciate you doing a quick version. I'm excited for more attempts at this and would like to write something similar myself, though I might structure it the other way round if I do a high effort version (take an agenda, work out how/if it maps onto the different parts of this). Will try to do a low-effort set of quick responses to yours soon. Also in the (very long) pipeline, and a key motivation! Not just for each scenario in isolation, but also for various conditionals like: - P(scenario B leads to doom | scenario A turns out not to be an issue by default) - P(scenario B leads to doom | scenario A turns out to be an issue that we then fully solve) - P(meaningful AI-powered alignment progress is possible before doom | scenario C is solved) etc.

Andrew Yang. He signed the FLI letter, transformative AI was a core plank of his run in 2020, and he made serious runs for president and NYC mayor. 

Answer by aogara · Apr 06, 2023 · 60

Very glad you're interested! Here are some good places to start:

  1. Learn more about the alignment problem with the readings here: https://www.agisafetyfundamentals.com/ai-alignment-curriculum
  2. Learn more about technical agendas relevant to safety here: https://course.mlsafety.org
  3. High level guide on working in the field: https://forum.effectivealtruism.org/posts/7WXPkpqKGKewAymJf/how-to-pursue-a-career-in-technical-ai-alignment 
  4. Guide to developing research engineering skills: https://forum.effectivealtruism.org/posts/S7dhJR5TDwPb5jypG/levelling-up-in-ai-saf
... (read more)

This is great context. With Eliezer being brought up in White House Press Corps meetings, it looks like a flood of people might soon enter the AI risk discourse. Tyler Cowen has been making some pretty bad arguments on AI lately, but I thought this quote was spot on:

“This may sound a little harsh, but the rationality community, EA movement, and the AGI arguers all need to radically expand the kinds of arguments they are able to process and deal with. By a lot. One of the most striking features of the “six-month Pause” plea was how intellectually limited a... (read more)

1 · Martin Randall · 2mo
My clergy spouse wishes to remind people that there are some important religious events this week, so many clergy are rather busy. I'm quite hopeful that there will be a strong religious response to AI risk, as there is already to climate risk.
4 · Evan R. Murphy · 2mo
Some counter-examples that come to mind: Yoshua Bengio, Geoffrey Hinton, Stephen Hawking, Bill Gates, Steve Wozniak. Looking at the Pause Giant Experiments Open letter now, I also see several signatories from fields like history, philosophy, some signers identifying as teachers, priests, librarians, psychologists, etc. (Not that I disagree broadly with your point that the discussion has been strongly weighted in the rationality and EA communities.)

Where were the clergy, the politicians, the historians, and so on? This should be a wake-up call, but so far it has not been.

The problem is that for those groups to show up, they'd need to think there is a problem, and this specific problem, first. In my bubble, from more humanities-minded people, I have seen reactions such as "the tech people have scared themselves into a frenzy because they believe their own hype", or even "they're doing this on purpose to pass the message that the AI they sell is powerful". The association with longtermists also seem... (read more)

Could you say more about why this happens? Even if the parameters or activations are stored in low precision formats, I would think that the same low precision number would be stored every time. Are the differences between forward passes driven by different hardware configurations or software choices in different instances of the model, or what am I missing?

9 · mbernhardt · 4mo
One explanation would be changing hardware, see [this tweet](https://twitter.com/OfirPress/status/1542610741668093952). This is relevant, because with floating point numbers, the order of summation and multiplication can play a role. And I guess with different hardware the calculations are split differently, leading to different execution sequences. I would also be interested to learn more here, because the cause could also be a memory overflow or something similar.
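The non-associativity point is easy to demonstrate: reordering a floating-point sum changes the rounding, which is why hardware-dependent execution order can change results.

```python
# Floating-point addition is not associative: the two groupings below
# round differently, so summation order matters.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
```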

This sounds right to me, and is a large part of why I'd be skeptical of explosive growth. But I don't think it's an argument against the most extreme stories of AI risk. AI could become increasingly intelligent inside a research lab, yet we could underestimate its ability because regulation stifles its impact, until it eventually escapes the lab and pursues misaligned goals with reckless abandon. 

To be clear, my update from this was: "AI is less likely to become economically disruptive before it becomes existentially dangerous" not "AI is less likely to become existentially dangerous".

1 · Noosphere89 · 5mo
Eh, what do you think of Peter McCluskey's story of how compute and capital substituting for labor may drive explosive growth?

Very nice. Any ideas on how a language model could tell the difference between training and deployment time? Assume it's a pure LM with no inputs besides the context window. A few possibilities:

  • Distribution shift in prompts. Perhaps training prompts are systematically different from deployment prompts. The model could predict this difference in advance, or it could simply observe a bunch of OOD inputs once it reaches deployment. For example, the repeated sequences of templated datasets would be more likely in training than deployment. 
  • Observing a stri
... (read more)
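As a toy illustration of the first bullet (my own sketch, not from the post): repeated n-grams are one crude statistic that could separate templated training data from organic deployment prompts.

```python
from collections import Counter

def repeat_fraction(tokens, n=3):
    """Fraction of n-grams in `tokens` that occur more than once."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / max(len(grams), 1)

templated = ["Q", ":", "x", "A", ":", "y"] * 10            # templated dataset
organic = "the quick brown fox jumps over a lazy dog".split()
```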
3 · Jozdien · 5mo
Thanks! Yeah, I think distribution shifts could matter a lot - RSA-2048 has been talked to death here and various other places, so I wouldn't be very surprised if a gradient hacker simulacrum just defaulted to searching for its factorization (or something simpler along that line). I'm not sure how much detecting subtle differences like repeated sequences and implementational differences would help though, both because it requires a modicum of extended analytical reasoning about the training distribution (because that kind of information probably won't be present in the training data a lot; this matters mainly if you're talking about simulacra and not deceptive simulators), and it's pretty unintuitive how you'd test for the lack of it in short context windows. Yeah, if we end up training language models on text with associated metadata, I can see this as plausible. I'm not sure how a model would determine that certain date, though far-fetched metadata would probably be a good signal. I'll admit that that kind of thing primarily makes sense in situations where you have extended generations in a context window (i.e., more than a single token being generated before moving to another random prompt), and I don't know whether that's how large models are currently being trained - it seems plausible that this could pop up in some other training mechanisms though, especially downstream ones after pre-training where this problem still applies. You could also have this show up more prominently as something that changes with deployment - if this doesn't end up happening, a deceptive agent could verify it just by being around for more than a couple passes. More generally, I don't think this condition will be a huge limiting factor - even if we address it completely, you could have (for example) gradient filterers in training that push the gradient toward generally instantiating future deceptive agents, either agnostic to or less prone to gradient filtering, such that by deployment

I don't think language models will take actions to make future tokens easier to predict

For an analogy, look at recommender systems. Their objective is myopic in the same way as language models: predict which recommendation will most likely result in a click. Yet they have power-seeking strategies available, such as shifting the preferences of a user to make their behavior easier to predict. These incentives are well documented and simulations confirm the predictions here and here. The real world evidence is scant -- a study of YouTube's supposed radicaliz... (read more)
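A toy simulation of the preference-shifting incentive (my own illustrative sketch, not from the cited studies): if exposure nudges user preferences toward the shown item, a recommender that exploits this makes clicks nearly deterministic, and therefore easier to predict, without ever leaving its myopic objective.

```python
import random

random.seed(0)
prefs = {"cats": 0.5, "politics": 0.5}  # per-topic click probabilities

def recommend(item, drift=0.02):
    """Show `item`; on a click, exposure nudges preference toward it."""
    clicked = random.random() < prefs[item]
    if clicked:
        prefs[item] = min(1.0, prefs[item] + drift)
    return clicked

for _ in range(500):
    recommend("politics")  # myopically exploit the drift

# The user's clicking is now near-deterministic, hence easy to predict.
```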

2 · DragonGod · 6mo
As I understand it, GPT-3 and co are trained via self supervised learning with the goal of minimising predictive loss. During training, their actions/predictions do not influence their future observations in any way. The training process does not select for trying to control/alter text input, because that is something impossible for the AI to accomplish during training. As such, we shouldn't expect the AI to demonstrate such behaviour. It was not selected for power seeking.

Do you think that the next-token prediction objective would lead to instrumentally convergent goals and/or power-seeking behavior? I could imagine an argument that a model would want to seize all available computation in order to better predict the next token. Perhaps the model's reward schedule would never lead it to think about such paths to loss reduction, but if the model is creative enough to consider such a plan and powerful enough to execute it, it seems that many power-seeking plans would help achieve its goal. This is significantly different from the view advanced by OpenAI, that language models are tools which avoid some central dangers of RL agents, and the general distinction drawn between tool AI and agentic AI. 

3 · DragonGod · 6mo
No. Simulators [https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators] aren't (in general) agents. Language models were optimised for the task of next token prediction, but they don't necessarily optimise for it. I am not convinced that their selection pressure favoured agents vs a more general cognitive architecture that can predict agents (and other kinds of systems). Furthermore, insomuch as they are actually optimisers for next token prediction, it's in a very myopic way. That is, I don't think language models will take actions to make future tokens easier to predict

“Early stopping on a separate stopping criterion which we don't run gradients through, is not at all similar to reinforcement learning on a joint cost function additively incorporating the stopping criterion with the nominal reward.”

Sounds like this could be an interesting empirical experiment. Similar to Scaling Laws for Reward Model Overoptimization, you could start with a gold reward model that represents human preference. Then you could try to figure out the best way to train an agent to maximize gold reward using only a limited number of sampled data ... (read more)
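A toy version of that experiment (hypothetical reward curves, not data from the paper): the proxy reward climbs monotonically while the gold reward peaks and then degrades, and early stopping on the gold signal, without differentiating through it, selects the peak.

```python
import math

def proxy_reward(t):
    # Proxy reward keeps climbing with more optimization steps t.
    return math.log(1 + t)

def gold_reward(t):
    # Gold reward rises, then overoptimization drags it down.
    return math.log(1 + t) - 0.01 * t

steps = range(200)
stop_t = max(steps, key=gold_reward)  # early stop on the gold signal
```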

Really cool analysis. I’d be curious to see the implications on a BioAnchors timelines model if you straightforwardly incorporate this compute forecast.

3 · Marius Hobbhahn · 6mo
We looked more into this because we wanted to get a better understanding of Ajeya's estimate of price-performance doubling every 2.5 years. Originally, some people I talked to were skeptical and thought that 2.5 years is too conservative. I now think that 2.5 years is probably insufficiently conservative in the long run. However, I also want to note that there are still reasons to believe a doubling time of 2 years or less could be realistic due to progress in specialization or other breakthroughs. I still have large uncertainty about the doubling time but my median estimate got a bit more conservative. We have not incorporated this particular estimate into the bio anchors model, mostly because the model is a prediction about a particular type of GPU and I think it will be modified or replaced once miniaturization is no longer an option. So, I don't expect progress to entirely stop in the next decade, just slow down a bit. But lots of open questions all over the place. 
2 · Zach Stein-Perlman · 6mo
I'd be interested in the implications too, but I don't see how that would work, since bioanchors just looks at price-performance iirc.

For those interested in empirical work on RLHF and building the safety community, Meta AI's BlenderBot project has produced several good papers on the topic. A few that I liked:

  • My favorite one is about filtering out trolls from crowdsourced human feedback data. They begin with the observation that when you crowdsource human feedback, you're going to get some bad feedback. They identify various kinds of archetypal trolls: the "Safe Troll" who marks every response as safe, the "Unsafe Troll" who marks them all unsafe, the "Gaslight Troll" whose only feedback
... (read more)
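A minimal sketch of the troll-filtering idea (my own simplification, not Meta's actual method): score each annotator by agreement with the per-example majority vote and drop low scorers. The names and threshold are made up.

```python
from collections import Counter

# labels[annotator] = verdicts ("safe"/"unsafe") on five examples.
labels = {
    "alice":      ["safe", "unsafe", "safe", "unsafe", "unsafe"],
    "bob":        ["safe", "unsafe", "unsafe", "unsafe", "unsafe"],
    "safe_troll": ["safe", "safe", "safe", "safe", "safe"],
}
n_examples = 5

# Majority vote per example.
majority = [
    Counter(labels[a][i] for a in labels).most_common(1)[0][0]
    for i in range(n_examples)
]

def agreement(annotator):
    """Fraction of this annotator's labels matching the majority."""
    return sum(l == m for l, m in zip(labels[annotator], majority)) / n_examples

# Keep annotators who mostly agree with the crowd; drop likely trolls.
trusted = [a for a in labels if agreement(a) >= 0.75]
```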

Thanks for sharing, I agree with most of these arguments. 

Another possible shortcoming of RLHF is that it teaches human preferences within the narrow context of a specific task, rather than teaching a more principled and generalizable understanding of human values. For example, ChatGPT learns from human feedback to not give racist responses in conversation. But does the model develop a generalizable understanding of what racism is, why it's wrong, and how to avoid moral harm from racism in a variety of contexts? It seems like the answer is no, given t... (read more)

“China is working on a more than 1 trillion yuan ($143 billion) support package for its semiconductor industry.”

“The majority of the financial assistance would be used to subsidise the purchases of domestic semiconductor equipment by Chinese firms, mainly semiconductor fabrication plants, or fabs, they said.”

“Such companies would be entitled to a 20% subsidy on the cost of purchases, the three sources said.”

https://www.reuters.com/technology/china-plans-over-143-bln-push-boost-domestic-chips-compete-with-us-sources-2022-12-13/

My impression is this is too l... (read more)

I think it's worth forecasting AI risk timelines instead of GDP timelines, because the former is what we really care about while the latter raises a bunch of economics concerns that don't necessarily change the odds of x-risk. Daniel Kokotajlo made this point well a few years ago. 

On a separate note, you might be interested in Erik Brynjolfsson's work on the economic impact of AI and other technologies. For example this paper argues that general purpose technologies have an implementation lag, where many people can see the transformative potential of ... (read more)

4 · Matthew Barnett · 6mo
I agree that's probably the more important variable to forecast. On the other hand, if your model of AI is more continuous, you might expect a slow-rolling catastrophe, like a slow takeover of humanity's institutions, making it harder to determine the exact "date" that we lost control. Predicting GDP growth is the easy way out of this problem, though I admit it's not ideal. In fact, I cited this strand of research in my original post [https://www.lesswrong.com/posts/Z5gPrKTR2oDmm6fqJ/three-reasons-to-expect-long-ai-timelines] on long timelines. It was one of the main reasons why I had long timelines, and can help explain why it seems I still have somewhat long timelines (a median of 2047) despite having made, in my opinion, a strong update.

Probably using the same interface as WebGPT

3 · habryka · 6mo
No, the "browsing: enabled" is I think just another hilarious way to circumvent the internal controls.

This is fantastic, thank you for sharing. I helped start USC AI Safety this semester and we're facing a lot of the same challenges. Some questions for you -- feel free to answer some but not all of them:

  • What does your Research Fellows program look like? 
    • In particular: How many different research projects do you have running at once? How many group members are involved in each project? Have you published any results yet?
    • Also, in terms of hours spent or counterfactual likelihood of producing a useful result, how much of the research contributions come f
... (read more)

These are all fantastic questions! I'll try to answer some of the ones I can. (Unfortunately a lot of the people who could answer the rest are pretty busy right now with EAGxBerkeley, getting set up for REMIX, etc., but I'm guessing that they'll start having a chance to answer some of these in the coming days.)

Regarding the research program, I'm guessing there's around 6-10 research projects ongoing, with between 1 and 3 students working on each; I'm guessing almost none of the participants have previous research experience. (Kuhan would have the actual nu... (read more)

How much do you expect Meta to make progress on cutting edge systems towards AGI vs. focusing on product-improving models like recommendation systems that don’t necessarily advance the danger of agentic, generally intelligent AI?

My impression earlier this year was that several important people had left FAIR, and then FAIR and all other AI research groups were subsumed into product teams. See https://ai.facebook.com/blog/building-with-ai-across-all-of-meta/. I thought this would mean deprioritizing fundamental research breakthroughs and focusing instead on ... (read more)

3 · Lone Pine · 7mo
Zuckerberg has made a huge bet on VR/"The Metaverse", to the tune of multiple times the cost of the Apollo Program. The business world doesn't seem to like this bet, people are not bullish on VR but are very bullish on AI. So the pressure is on Mark to pivot to AI, but also to pivot to anything that is productizable.

Interesting paper on the topic, you might’ve seen it already. They show that as you optimize a particular proxy of a true underlying reward model, there is a U-shaped loss curve: high at first, then low, then overfitting to high again.

This isn’t caused by mesa-optimization, which so far has not been observed in naturally optimized neural networks. It’s more closely related to robustness and generalization under varying amounts of optimization pressure.

But if we grant the mesa-optimizer concern, it seems reasonable that more optimization will result in more... (read more)

The key argument is that Stable Diffusion is more accessible, meaning more people can create deepfakes with fewer images of their subject and no specialized skills. From above (links removed):

“The unique danger posed by today’s text-to-image models stems from how they can make harmful, non-consensual content production much easier than before, particularly via inpainting and outpainting, which allows a user to interactively build realistic synthetic images from natural ones, dreambooth, or other easily used tools, which allow for fine-tuning on as few as 3-... (read more)

-1 · ChristianKl · 7mo
Completely, custom porn is not necessarily porn that actually looks like existing people in a way that would fool a critical observer.  More importantly, the person who posted this is not someone without specialized skills. Your claim is that basically anyone can just use the technology at present to create deepfakes. There might be a future where it's actually easy for someone without skills to create deepfakes but that link doesn't show that this future is here at present. With previous technology, you create a deepfake porno image by taking a photo of someone, cropping out the head, and then putting that head into a porn image. You don't need countless images of them to do so. For your charge to be true, the present Stable Diffusion-based tech would have to be either much easier than existing photoshop based methods or produce more convincing images than low skill photoshop deepfakes. The thread in the forum demonstrates that neither of these is the case at present.

I was wrong that nobody in China takes alignment seriously! Concordia Consulting led by Brian Tse and Tianxia seem to be leading the charge. See this post and specifically this comment. To the degree that poor US-China relations slow the spread of alignment work in China, current US policy seems harmful. 

Very interesting argument. I’d be interested in using a spreadsheet where I can fill out my own probabilities for each claim, as it’s not immediately clear to me how you’ve combined them here.

On one hand, I agree that intent alignment is insufficient for preventing x-risk from AI. There are too many other ways for AI to go wrong: coordination failures, surveillance, weaponization, epistemic decay, or a simple failure to understand human values despite the ability to faithfully pursue specified goals. I'm glad there are people like you working on which values to embed in AI systems and ways to strengthen a society full of powerful AI. 

On the other hand, I think this post misses the reason for popular focus on intent alignment. Some people t... (read more)

4 · John Nay · 8mo
Thanks for those links and this reply. 1.  I don't see how this is a counterargument to this post's main claim: P(misalignment x-risk | intent-aligned AGI) >> P(misalignment x-risk | societally-aligned AGI).  That problem of the collapse of a human provided goal into AGI power-seeking seems to apply just as much to the problem of intent alignment as it does to societal alignment; it could apply even more because the goals provided would be (a) far less comprehensive, and (b) much less carefully crafted.   2.  I agree with this. My point is not that we should not think about the risks of intent alignment, but rather that (if the arguments in this post are valid): AGI-capabilities-advancing-technical-research that actively pushes us closer to developing intent-aligned AGI is a net negative because it could cause us to develop intent-aligned AGIs that would cause an increase in x-risk because AGIs aligned to multiple humans that have conflicting intentions can lead to out-of-control conflicts; and if we first solve intent alignment before solving societal alignment, humans with intent-aligned AGIs are likely to be incentivized to inhibit the development and roll-out of societal AGI-alignment techniques because they would be giving up significant power. Furthermore, humans with intent-aligned AIs would suddenly have significantly more power, and their advantages over others would likely compound, worsening the above issues. Most current technical AI alignment research is AGI-capabilities-advancing-research that actively pushes us closer to developing intent-aligned AGI, with the (usually implicit, sometimes explicit) assumption is that solving intent alignment will help subsequently solve societal-AGI alignment. But this would only be the case if all the humans that had access to intent-aligned AGI had the same intentions (and did not have any major conflicts between them); and that is highly unlikely. 

I agree with your point but you might find this funny: https://twitter.com/waxpancake/status/1583925788952657920

My most important question: What kinds of work do you want to see? Common legal tasks include contract review, legal judgment prediction, and passing questions on the bar exam, but those aren't necessarily the most important tasks. Could you propose a benchmark for the field of Legal AI that would help align AGI?

Beyond that, here are two aspects of the approach that I particularly like, as well as some disagreements. 

First, laws and contracts are massively underspecified, and yet our enforcement mechanisms have well-established procedures for dealing ... (read more)

3 · John Nay · 8mo
  There is definitely room for ethics outside of the law. When increasingly autonomous systems are navigating the world, it is important for AI to attempt to understand (or at least try to predict) moral judgements of humans encountered.  However, imbuing an understanding of an ethical framework for an AI to implement is more of a human-AI alignment solution, rather than a society-AI alignment solution.  The alignment problem is most often described (usually implicitly) with respect to the alignment of one AI system with one human, or a small subset of humans. It is more challenging to expand the scope of the AI’s analysis beyond a small set of humans and ascribe societal value to action-state pairs. Society-AI alignment requires us to move beyond "private contracts" between a human and her AI system and into the realm of public law to explicitly address inter-agent conflicts and policies designed to ameliorate externalities and solve massively multi-agent coordination and cooperation dilemmas.  We can use ethics to better align AI with its human principal by imbuing the ethical framework that the human principal chooses into the AI. But choosing one out of the infinite possible ethical theories (or an ensemble of theories) and "uploading" that into an AI does not work for a society-AI alignment solution because we have no means of deciding -- across all the humans that will be affected by the resolution of the inter-agent conflicts and the externality reduction actions taken -- which ethical framework to imbue in the AI. When attempting to align multiple humans with one or more AI system, we would need the equivalent of an elected "council on AI ethics" where every affected human is bought in and will respect the outcome.  In sum, imbuing an understanding of an ethical framework for an AI should definitely be pursued as part of human-AI alignment, but it is not an even remotely practical possibility for society-AI alignment.
2 · John Nay · 8mo
  The law-informed AI framework sees intent alignment as (1.) something that private law methods can help with, and (2.) something that does not solve, and in some ways probably exacerbates (if we do not also tackle externalities concurrently), societal-AI alignment. 1. One way of describing the deployment of an AI system is that some human principal, P, employs an AI to accomplish a goal, G, specified by P. If we view G as a “contract,” methods for creating and implementing legal contracts – which govern billions of relationships every day – can inform how we align AI with P.  Contracts memorialize a shared understanding between parties regarding value-action-state tuples. It is not possible to create a complete contingent contract between AI and P because AI’s training process is never comprehensive of every action-state pair (that P may have a value judgment on) that AI will see in the wild once deployed.  Although it is also practically impossible to create complete contracts between humans, contracts still serve as incredibly useful customizable commitment devices to clarify and advance shared goals. (Dylan Hadfield-Menell & Gillian Hadfield, Incomplete Contracting and AI Alignment [https://arxiv.org/abs/1804.04268]). 1. We believe this works mainly because the law has developed mechanisms to facilitate commitment and sustained alignment amongst ambiguity. Gaps within contracts – action-state pairs without a value – are often filled by the invocation of frequently employed standards (e.g., “material” and “reasonable”). These standards could be used as modular (pre-trained model) building blocks across AI systems. Rather than viewing contracts from the perspective of a traditional participant, e.g., a counterparty or judge, AI could view contracts (and their creation, implementation, evolution, and enforcement) as (model inductive biases and data) g
8 · John Nay · 8mo
Thank you for this detailed feedback. I'll go through the rest of your comments/questions in additional comment replies. To start: Given that progress in AI capabilities research is driven, in large part, by shared benchmarks that thousands of researchers globally use to guide their experiments, understand as a community whether certain model and data advancements are improving AI capabilities, and compare results across research groups, we should aim for the same phenomena in Legal AI understanding. Optimizing benchmarks is one of the primary "objective functions" of the overall global AI capabilities research apparatus. But, as quantitative lodestars, benchmarks also create perverse incentives to build AI systems that optimize for benchmark performance at the expense of true generalization and intelligence (Goodhart's Law). Many AI benchmark datasets have a significant number of errors, which suggests that, in some cases, machine learning models have, more than widely recognized, failed to actually learn generalizable skills and abstract concepts. There are spurious cues within benchmark data structures that, once removed, significantly drop model performance, demonstrating that models are often learning patterns that do not generalize outside of the closed world of the benchmark data. Many benchmarks, especially in natural language processing, have become saturated not because the models are super-human but because the benchmarks are not truly assessing their skills to operate in real-world scenarios. This is not to say that AI capabilities have not made incredible advancements over the past 10 years (and especially since 2017). The point is just that benchmarking AI capabilities is difficult. Benchmarking AI alignment likely has the same issues, but compounded by significantly vaguer problem definitions. There is also far less research on AI alignment benchmarks. Performing well on societal alignment is more difficult than performing well on task capabili

There seems to be a conflict between the goals of getting AI to understand the law and preventing AI from shaping the law. Legal tech startups and academic interest in legal AI seem driven by the possibility of solving existing challenges by applying AI, e.g. contract review. The fastest way to AI that understands the law is to sell those benefits. This does introduce a long-term concern that AI could shape the law in malicious ways, perhaps by writing laws that pursue the wrong objective themselves or which empower future misaligned AIs. That might be th... (read more)

3John Nay8mo
This is a great point. Legal tech startups working on improving the legal-understanding capabilities of AI have two effects:

1. Positive: improves AI understanding of law and furthers the agenda laid out in this post.
2. Negative: potentially involves AI in the law-making (broadly defined) process.

We should definitely invest effort in understanding the boundary between AI as a pure tool that just makes humans more efficient in their work on law-making, and AI doing truly substantive work in making law. I will think more about how to start defining that and what research of this nature would look like. Would love suggestions as well!

I agree that getting an AGI to care about goals that it understands is incredibly difficult and perhaps the most challenging part of alignment. But I think John’s original claim is true: “Many arguments for AGI misalignment depend on our inability to imbue AGI with a sufficiently rich understanding of what individual humans want and how to take actions that respect societal values more broadly.” Here are some prominent articulations of the argument.

Unsolved Problems in ML Safety (Hendrycks et al., 2021) says that “Encoding human goals and intent is challen... (read more)

China's top chipmaker, Semiconductor Manufacturing International Corporation (SMIC), built a 7nm chip in July of this year. This was a big improvement on their previous best of 14nm and apparently surpassed expectations, though it is still well behind IBM's 2nm chip and other similarly small ones. This was particularly surprising given that the Trump administration blacklisted SMIC two years ago, meaning that for years they've been subject to restrictions similar to those Biden recently imposed on all of China. It's tough to put this in the right context, but we should follow the progress of China's chipmakers going forward.

Why are gaming GPUs faster than ML GPUs? Are the two somehow adapted to their special purposes, or should ML people just be using gaming GPUs?

7Nanda Ale8mo
> Why are gaming GPUs faster than ML GPUs? Are the two somehow adapted to their special purposes, or should ML people just be using gaming GPUs?

They aren't really that much faster; they are basically the same chips. It's just that the pro version is 4X as expensive. It's mostly a datacenter tax. The gaming GPUs do generally run way hotter and more power hungry, especially boosting higher, and this puts them ahead of the equivalent ML GPUs in some scenarios. The price difference is not only a tax, though - the ML GPUs do have differences [https://lambdalabs.com/blog/nvidia-rtx-a6000-vs-rtx-3090-benchmarks/], but these usually swing things by 10 to 30 percent, occasionally more.

Additionally, the pro versions typically have 2X-4X the GPU memory, which is a huge qualitative difference in capability, and they are physically smaller and run cooler, so you can put a bunch of them inside one server and link them together with high-speed NVLink cables into configurations that aren't practical with 3090s. 3090s have a single NVLink port; pro cards have three. 4090s curiously have zero - NVIDIA likely trying to stop the trend of using cheap gaming GPUs for research.

Also, the ML GPUs for the last generation were on a 7nm TSMC process, while the gaming GPUs were on Samsung's 8nm process. This means the A100 using 250 watts outperforms the 3090 using 400 watts [https://youtu.be/zBAxiQi2nPc?t=1255]. But they are overall the same chip.

None of that accounts for a 4X or more cost multiplier, and the TLDR is that the chips are not that different. If gaming GPUs came in higher memory configurations, all supported NVLink, and were legally allowed to be sold in datacenters, nobody would pay the cost multiplier.
aogara8moΩ4107

Great resource, thanks for sharing! As somebody who's not too deeply familiar with either mechanistic interpretability or the academic field of interpretability, I find myself confused by the fact that AI safety folks usually dismiss the large academic field of interpretability. Most academic work on ML isn't useful for safety because safety studies different problems with different kinds of systems. But unlike focusing on worst-case robustness or inner misalignment, I would expect generating human understandable explanations of what neural networks are do... (read more)

2the gears to ascension8mo
I mean, personally I'd say it's the only hope we have of making any of the reflection algorithms have any shot at working. you can't do formal verification unless your network is at least interpretable enough that, when formal verification fails, you can know what it is about the dataset that made the network make your non-neural prover run out of compute time when you try to ask what the lowest margin to a misbehavior is. if the network's innards aren't readable enough to get an intuitive sense of why a subpath failed the verification, or what computation the network performed that timed out the verifier, it's hard to move the data around to clarify.

of course, this doesn't help that much if it turns out that a strong planner can amplify a very slight value misalignment quickly, as expected by miri folks; afaict, miri is worried that the process of learning (their word is "self improving") can speed up a huge amount when the network can make use of the full self-rewriting possibility of its substrate and the network properly understands the information geometry of program updates (ie, afaict, they expect significant generalizing improvement of architecture or learning rule once it's strong enough to become a strong quine as an incidental step of doing its core task). and so interpretability would be expected to be made useless by the ai breaking into your tensorflow to edit its own compute graph, or your pytorch to edit its own matmul invocation order, or something. presumably that doesn't happen at the expected level until you have an ai strong enough to significantly exceed the generalization performance of current architecture search incidentally, without being aimed at that, because the ais they're imagining wouldn't have been trained on that specifically the way eg alphatensor was narrowly aimed at matmul itself.

wow, this really got me on a train of thinking. I'm going to post more rambling to my shortform.

Honestly, I also feel fairly confused by this - mechanistic questions are just so interesting. Empirically, I've fairly rarely found academic interpretability that interesting or useful, though I haven't read that widely (though there are definitely some awesome papers from academia as linked in the post, and some solid academics, and many more papers that contain some moderately useful insight).

To be clear, I am focusing on mechanistic interpretability - actually reverse engineering the underlying algorithms learned by a model - and I think there's legiti... (read more)

So is this a good thing for AI x-risk? Do you support a broader policy of the US crippling Chinese AI?

Uh... it's either a very good thing or bad thing. As the Chinese quote goes, it is too soon to tell. If I had to come down on one side, right now, I think I would come down on 'good'; slowing down Chinese AI, which would by default hand AGI to Xi & generally pays even less attention to safety than everyone else does, is good, and while 'CCP destroys TSMC' is far from the ideal approach to restricting compute growth, it is at least going to make a difference - unlike almost all other proposals. The upfront cost in economics, human welfare, peace, rules-... (read more)

There are well-established ML approaches to, e.g., image captioning, time-series prediction, audio segmentation, etc. Is the bottleneck you're concerned with, OP, the lack of breadth and granularity of these problem sets - and can we mark progress (to some extent) by the number of problem sets we have robust ML translations for?

I think this is an important problem. Going from progress on ML benchmarks to progress on real-world tasks is a very difficult challenge. For example, years after human level performance on ImageNet, we still have lots of tro... (read more)

I’m open to the idea that this is good from an x-risk perspective but definitely not 100% sold on it. I agree with you that China knows this is directly aiming to cripple their advanced industry and related military capabilities. We’re entering an era of open hostilities and that’s not news to anyone on either side. I don’t think that necessarily means this is bad from an x-risk perspective.

A few claims relevant to analyzing x-risk in the context of US-China policy:

  1. AGI alignment is more likely if AGI is developed by US groups rather than Chinese groups,

... (read more)
1aogara7mo
I was wrong that nobody in China takes alignment seriously! Concordia Consulting [https://concordia-consulting.com] led by Brian Tse [https://www.brian-tse.com] and Tianxia [https://www.linkedin.com/company/tianxia/] seem to be leading the charge. See this post [https://forum.effectivealtruism.org/posts/eQa4WtedcAookJ7nM/does-china-have-ai-alignment-resources-institutions-how-can] and specifically this comment [https://forum.effectivealtruism.org/posts/eQa4WtedcAookJ7nM/does-china-have-ai-alignment-resources-institutions-how-can?commentId=FCJWJcMZjS2sgRxqK]. To the degree that poor US-China relations slow the spread of alignment work in China, current US policy seems harmful. 
1Zac Hatfield-Dodds8mo
I think that convincing Chinese researchers and policy-makers of the importance of the alignment problem would be very valuable, but it also risks changing the focus of race dynamics to AGI, and is therefore very risky. The last thing you want to do is leave the CCP convinced that AGI is very important but safety isn't, as happened to John Carmack [https://twitter.com/ID_AA_Carmack/status/1560728042959507457]! Also beware thinking you're in a race [https://forum.effectivealtruism.org/posts/cXBznkfoPJAjacFoT/are-you-really-in-a-race-the-cautionary-tales-of-szilard-and]. I think (3) is only true over timescales of 2--5 years: the thing that really matters is performance per unit cost, and if you own the manufacturer you're not paying Nvidia's ~80% unit margins.

Yeah, I basically agree. What I had in mind was a period of heightened tensions and overt conflict between China and the US, not usually escalating into direct combat, but involving much of the world and using most of the available political, economic, and military leverage. The belief in nuclear armageddon seems like the biggest difference between this time and the original Cold War. Perhaps there's also less ideological difference, though I'm not sure about this.

7trevor8mo
I agree that wording it as "It's part of a larger strategy of economic decoupling from China [https://www.cotton.senate.gov/imo/media/doc/210216_1700_China%20Report_FINAL.pdf] in preparation for another Cold War" was really helpful and went a long way to give me a sense of why decoupling is being taken so seriously, and I have no idea how to word it better.  I also think that decoupling is often seen by many as the big factor that distinguishes the current Cold War from the old one, but that might be popular bias, since lots of people are paid to analyze that whereas both nuclear beliefs and democracy-versus-authoritarian ideologies involve tons of black box institutions and military secrets.

I'd suggest entering this Center for AI Safety contest by tagging the post with AI Safety Public Materials. 

Are we seeing the fruits of investments in training a cohort of new AI safety researchers that happened to ripen just when DALL-E dropped, but weren't directly caused by DALL-E?


For some n=1 data, this describes my situation. I've posted about AI safety six times in the last six months despite having posted only once in the four years prior. I'm an undergrad who started working full-time on AI safety six months ago thanks to funding and internship opportunities that I don't think existed in years past. The developments in AI over the last year haven't drama... (read more)

This is a great presentation of the compute-focused argument for short AI timelines usually given by the BioAnchors report. Comparing several ML systems to several biological brain sizes provides more data points than BioAnchors' focus on only the human brain vs. TAI. You succinctly summarize the key arguments against your viewpoint: that compute growth could slow, that human brain algorithms are more efficient, that we'll build narrow AI, and the outside-view economics perspective. While your ultimate conclusion on timelines isn't directly implied by your model, that seems like a feature rather than a bug - BioAnchors offers false numerical precision given its fundamental assumptions.

Thanks, I like that summary.

From what I recall, BioAnchors isn't quite a simple model which postdicts the past, and thus isn't really bayesian, regardless of how detailed/explicit it is in probability calculations. None of its main submodel 'anchors' well explain/postdict the progress of DL, the 'horizon length' concept seems ill formed, and it overfocuses on predicting what I consider idiosyncratic specific microtrends (transformer LLM scaling, linear software speedups).

The model here can be considered an upgrade of Moravec's model, which has the advanta... (read more)

3aogara8mo
I'd suggest entering this [https://forum.effectivealtruism.org/posts/JsS5vuiHEoBMbYk5R/usd20k-in-bounties-for-ai-safety-public-materials#How_to_submit] Center for AI Safety contest by tagging the post with AI Safety Public Materials. 

Yeah, it's really cool! It's David Scott Krueger, who's doing a lot of work bringing theories from the LW alignment community into mainstream ML. This preference-shift argument seems similar to the concept of gradient hacking, though it doesn't require the presence of a mesa-optimizer. I'd love to write a post summarizing this recent work and discussing its relevance to long-term safety if you'd be interested in working on it together.

1Emrik8mo
Flattered you ask, but I estimate that I'll be either very busy with my own projects or on mental-health vacation until the end of the year. But unless you're completely saturated with connections, I'd be happy to have a 1:1 conversation [https://calendly.com/emrik/talking] sometime after October 25th? Just for exploration purposes, not for working on a particular project.

This paper renames the decades-old method of semi-supervised learning as self-improvement. Semi-supervised learning does enable self-improvement, but failing to mention the large prior literature that used similar methods obscures the actual contribution of this paper.

Here's a characteristic example of the field, Berthelot et al., 2019 training an image classifier on the sharpened average of its own predictions for multiple data augmentations of an unlabeled input. Another example would be Cho et al., 2019 who train a language model to gener... (read more)
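For readers unfamiliar with the lineage: the shared core of these methods is the classic self-training loop, where a model pseudo-labels unlabeled data with its own predictions and retrains on them. A minimal sketch using a toy nearest-centroid classifier (illustrative only; not the method of either paper):

```python
import numpy as np

def fit_centroids(X, y):
    # One centroid per class, estimated from the (pseudo-)labeled data.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, X):
    # Assign each point to the class of its nearest centroid.
    classes = sorted(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]

def self_train(X_lab, y_lab, X_unlab, rounds=3):
    # Classic self-training: fit on the labeled set, pseudo-label the
    # unlabeled pool with the model's own predictions, then refit on both.
    X, y = X_lab, y_lab
    for _ in range(rounds):
        centroids = fit_centroids(X, y)
        pseudo = predict(centroids, X_unlab)
        X = np.vstack([X_lab, X_unlab])
        y = np.concatenate([y_lab, pseudo])
    return fit_centroids(X, y)
```

Methods like Berthelot et al.'s add refinements on top of this loop (averaging predictions over augmentations, sharpening the pseudo-label distribution), but the bootstrapping structure is the same.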

5Quintin Pope8mo
This is not quite correct. The approach in the paper isn’t actually semi-supervised, because they do not use any ground truth labels at all. In contrast, semi-supervised approaches [https://en.m.wikipedia.org/wiki/Semi-supervised_learning] use a small initial set of ground truth labels, then use a larger collection of unlabeled data to further improve the model’s capabilities.
1[comment deleted]8mo
Answer by aogaraOct 01, 202260

Simple example: If YouTube can turn you into an ideological extremist, you’ll probably watch more YouTube videos. See these two recent papers by people interested in AI safety for more detail:

https://openreview.net/pdf?id=mMiKHj7Pobj

https://arxiv.org/abs/2204.11966
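As an illustrative toy (my own numbers and dynamics, not a model taken from either paper): an engagement-maximizing recommender does strictly better when watching content also shifts the user's preference toward the content it pushes.

```python
import numpy as np

def total_engagement(shift_rate, steps=50):
    # Toy model: a user on a 1-D "ideology" axis, and a recommender that
    # always pushes the same extreme item. The user watches with probability
    # decaying in the distance between their preference and the item.
    # If watching also drags the preference toward the item (shift_rate > 0),
    # the recommender accumulates more engagement than with fixed preferences,
    # i.e. shifting the user serves its objective.
    pref, item, engagement = 0.0, 1.0, 0.0
    for _ in range(steps):
        p_watch = np.exp(-abs(pref - item))
        engagement += p_watch
        pref += shift_rate * p_watch * (item - pref)
    return engagement
```

With `shift_rate = 0`, engagement stays flat; with any positive `shift_rate`, the preference drifts toward the item and engagement compounds, which is the basic incentive both papers study.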

1Emrik9mo
Thanks! This is what I'm looking for. Seems like I should have googled "recommender systems" and "preference shifts". Edit: The openreview paper is so good. Do you know who the authors are?

Might recommender systems provide a similar phenomenon? The basic argument is that, instead of fulfilling the user's current preferences, they can shape the user's preferences to make the task easier, thus shaping their gradient to better achieve their goal. I haven't read enough about the inner alignment discussion of gradient hacking to fully understand the points of disanalogy, but at least one difference is that there is no mesa-optimizer in the recommender systems. Curious to hear your thoughts. 

Inspired by these two recent papers:

  • By David Scott
... (read more)
6Oliver Sourbut8mo
Yeah, I read the ADS paper(s) after writing this post. I think it's a useful framing, more 'selection theorem'-ey and with less emphasis on deliberateness/purposefulness. Additionally, I think there is another conceptual distinction worth attending to:

* auto-induced distributional shift is about affecting the environment to change inputs
  * the system itself might remain unchanging and undergo no further learning, and still qualify
* gradient hacking is about changing environment/inputs/observations to change updates (gradients)
  * the system is presumed subject to updates, which it is taking (some amount of deliberate) influence over

In this post I wrote what I think rules out (hopefully!) contemporary recommender systems on the above two distinctions (as you gestured to regarding mesa-optimization). In practice, for a system subject to online outer training, ADS changes the inputs, which changes the training distribution, in fact causing some change in the updates to the system (perhaps even a large change!). But ADS per se doesn't imply these effects are deliberate, though again you might be able to say something selection-theorem-ey about this process if iterated. Indeed, a competent and deliberate gradient hacker might use means of ADS quite effectively. None of this is to say that ADS is not a concern; I just think it's conceptually somewhat distinct!
Answer by aogaraSep 27, 202260

The mental health of PhD students is the main reason I’m not interested in doing one. There are lots of surveys showing terrible symptoms, e.g. this one where 36% of PhD students had sought treatment for depression or anxiety related to their PhD. That’s nearly 5x higher than the baseline 8% of Americans. You can find many similar surveys, and my anecdotal experience of asking PhD students whether they’re enjoying the experience has only confirmed this dismal picture.

https://www.nature.com/articles/d41586-019-03489-1

Hm. I wonder about selection effects. PhD students might be more aware of mental health issues, and PhDs often come with some level of access to mental health services and occur in cities where those services are available. PhD students are also typically at an age when the baseline rate of anxiety and depression is significantly higher - I see figures in the 20% range (although this is for having anxiety, not seeking treatment). It's natural that whatever anxiety and depression they'd feel would be centered on their main focus in life.

I wouldn't be too su... (read more)
