I appreciated the section that contrasted the "reasonable pragmatist strategy" with what people in the "pragmatist camp" sometimes seem to be doing.
EAs in AI governance would often tell me things along the lines of "trust me, the pragmatists know what they're doing. They support strong regulations, they're just not allowed to say it. At some point, when the time is right, they'll come out and actually support meaningful regulations. They appreciate the folks who are pushing the Overton Window etc etc."
I think this was likely wrong, or at least oversold. Maybe I was just talking to the wrong people, idk.
To me, it seems like we're in an extremely opportune and important time for people in the "pragmatist camp" to come out and say they support strong regulations. Yoshua Bengio and Geoffrey Hinton and other respectable people with respectable roles are out here saying that this stuff could cause extinction.
I'm even fine with pragmatists adding caveats like "I would support X if we got sufficient evidence that this was feasible" or "I would support Y if we had a concrete version of that proposal, and I think it would be important to put thought into addressing Z limitation."
They can also make it clear that they also support some of the more agreeable policies (e.g., infosec requirements). They can make it clear that some of the ambitious solutions would require substantial international coordination and there are certain scenarios in which they would be inappropriate.
Instead, I often see folks dismiss ideas that are remotely outside the Mainline Discourse. Concretely, I think it's inappropriate for people to confidently declare that things like global compute caps and IAEA-like governance structures are infeasible. There is a substantial amount of uncertainty around what's possible: world governments are just beginning to wake up to a technology that experts believe has a >20% chance of ending humanity.
We are not dealing with "normal policy"; we simply don't have a large enough sample size of "the world trying to deal with existentially dangerous technologies" to be able to predict what will happen.
It's common for people to say "it's really hard to predict capabilities progress". I wish it was more common for people to say "it's really hard to predict how quickly the Overton Window will shift and how world leaders will react to this extremely scary technology." As a result, maybe we shouldn't dismiss the ambitious strategies that would require greater-than-climate-change levels of international coordination.
It isn't over yet. I think there's hope that some of the pragmatists will come out and realize that they are now "allowed" to say more than "maybe we should at least evaluate these systems with a >20% chance of ending humanity." I think Joe Collman's comment makes this point clearly and constructively:
The outcome I'm interested in is something like: every person with significant influence on policy knows that this is believed to be a good/ideal solution, and that the only reasons against it are based on whether it's achievable in the right form.
If ARC Evals aren't saying this, RSPs don't include it, and many policy proposals don't include it..., then I don't expect this to become common knowledge.
We're much less likely to get a stop if most people with influence don't even realize it's the thing that we'd ideally get.
Connor Leahy (QT of OP): While I expect I think the gap between the two is shorter than Sam thinks, I still think people are vastly underestimating how much of a problem this will be and already is. The Semantic Apocalypse is in sight.
While it's great to see people taking the threat of AI persuasion seriously, semantic hijacking is the wrong approach.
LLMs finding combinations of words to manipulate people is probably years away, whereas conventional AI-powered research and hacking of the impression-formation process probably went online years ago, since large tech companies and intelligence agencies have access to billions of cases of people forming impressions in measurable directions e.g. social media user data. The use of AI for persuasion revolves around the data processing capabilities of 2010s AI which facilitates human behavior research when combined with large amounts of user data, not recent trends with LLMs.
The AI safety community has catastrophically misguided world models about the incentives driving the AI industry, and this will result in bad decisions and misguided policies e.g. AI pause forecasting or relating to US-China affairs.
Slovakia was perhaps the first warning shot? Did the deepfakes make the difference? Or were they one more last-minute dirty trick that made almost no difference?
In my opinion, almost no difference. I hadn't even heard about the deepfakes -- not only did I not see them, but I also didn't see anyone discussing them; this is the first time I am actually hearing about them. Also, the election results are basically what was predicted months ago, so it's not like something changed during the last two days.
Looking at the historical results of SMER, their results were: 13.46%, 29.14%, 34.80%, 44.41%, 28.28%, 18.29%, 22.94%; so they didn't even score exceptionally high this year. It's just... unfortunately... this style of politics has a lot of supporters in Slovakia. (Tell stupid people exactly what they want to hear. When it goes predictably wrong, blame someone else. Duh.)
What changed the results was a decade of highest (per capita) Kremlin investments into disinformation network of 100’s [1000’s] of accounts on asocial networks & “alternative” media.
Wow, I didn't know about "highest", but 100% this. The problems in Slovakia are way deeper than "a surprising deepfake two days before election".
The poor quality reflects that it is responding to demand for poor quality fakes, rather than to demand for high quality fakes
You’ve made the supply/demand analogy a few times on this subject, I’m not sure that is the best lens. This analysis makes it sound like there is a homogenous product “fakes” with a single dimension “quality”. But I think even on its own terms the market micro-dynamics are way more complex than that.
I think of it more in terms of memetic evolution and epidemiology. SIR as a first analogy - some people have weak immune systems, some strong, and the bigger the reservoir of memes the more likely a new highly-infectious mutant will show up. Rather than equilibrium dynamics, I'm more concerned about tail-risk in the disequilibrium here. Also, I think when you dig in to the infection rate, people are susceptible to different types of memetic attack. If fakes are expensive to make, we might see 10 strains in a given election cycle. If they are cheap, this time we might see 10,000. On the margin I think that means more infections even if the production quality is still at a consistently low level.
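The "10 strains versus 10,000 strains" point can be made concrete with a back-of-the-envelope calculation. This is only a rough sketch under strong assumptions that are mine, not the comment's: every fake is treated as an independent draw with the same small chance of going viral, and both the function name `prob_any_goes_viral` and the value `1e-4` are hypothetical placeholders rather than estimates.

```python
def prob_any_goes_viral(num_strains: int, p_viral: float) -> float:
    """Chance that at least one of num_strains independent fakes goes viral.

    Assumes (hypothetically) that every fake has the same probability
    p_viral of catching on, independent of all the others.
    """
    if not 0.0 <= p_viral <= 1.0:
        raise ValueError("p_viral must be a probability between 0 and 1")
    # P(at least one success) = 1 - P(every strain fails)
    return 1.0 - (1.0 - p_viral) ** num_strains


# Illustrative per-fake virality chance; a made-up number, not a measurement.
P_VIRAL = 1e-4

for strains in (10, 10_000):
    print(strains, prob_any_goes_viral(strains, P_VIRAL))
```

Under these toy assumptions, holding per-fake quality fixed while multiplying the number of strains moves the chance of at least one breakout from negligible to more likely than not, which is the tail-risk concern rather than an equilibrium one.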
Even taking your position on its face I think you have to split quality into production and conceptual. Maybe deep fakes do not yet improve production quality but I strongly suspect they do already improve conceptual quality as they bring more talent to the game.
A great example is the “Trump being arrested” generated image. It was just someone putting in an idea and sharing, happened to go viral, many people thought it was real at first. I don’t think everybody that was taken in by that image in their Twitter feed was somehow “in the market for fakes”, it just contextually pattern-matched scrolling through your feed unless you looked closely and counted fingers. Under SIR we might say that most people quickly become resistant to a given strain of fake via societal immune-response mechanisms like Twitter community notes, and so the spread of that particular fake was cut short inside of a day. But that could still matter a lot if the day is election day!
Consider also whether state actors have incentives to release all their memetic weapons in a drip-feed or stockpile (some of) them for election day when they will have maximum leverage; I don't think the current rate of release is representative of the current production capacity.
Quite the hot take from Alexey Guzey, that alignment is the bottleneck to capabilities and our alignment progress is going fine, so perhaps we should pause alignment to avoid speeding up capabilities? Yes, alignment techniques are dual use, but no, differentially only doing single-use stuff instead would not end better.
I feel like either Alexey is trying to gaslight people, or he interprets the terms "terrifying" and "an existential threat" to mean something quite different to what most people do. The latter feels faintly ridiculous. But only faintly. So much of my life has been wasted due to translational friction that my prior for it being the source of disagreements is quite high.
If it wasn't Guzey I would have dismissed the whole thing as trolling or gaslighting, and I wouldn't have covered it beyond one line and a link. He's definitely very confused somewhere.
I'm not actually familiar with him. Does he have any takes on alignment worth reading? Or on anything more generally?
For what it's worth, he has shared (confidential) AI predictions with me, and I was impressed by just how well he nailed (certain unspecified things) in advance—both in absolute terms & relative to the impression one gets by following him on twitter.
There is much talk about so-called Responsible Scaling Policies, as in what we will do so that what we are doing can be considered responsible. Would that also result in actually responsible scaling? It would help. By themselves, in their current versions, no. The good scenario is that these policies are good starts and lay groundwork and momentum to get where we need to go. The bad scenario is that this becomes safetywashing, used as a justification for rapid and dangerous scaling of frontier models, a label that avoids any actual action or responsibility.
Others think it would be better if we flat out stopped. So they say so. And they protest. And they point out that the public is mostly with them, at the same time that those trying to play as Very Serious People say such talk is irresponsible.
Future persuasion will be better. Sam Altman predicts superhuman persuasion ability prior to superhuman general intelligence. What would that mean? People think they would not be fooled by such tactics. Obviously, they are mistaken.
As usual, lots of other stuff as well.
On the not-explicitly-about-AI front, I would encourage you if you haven’t yet to check out my review of Going Infinite. Many are calling it both fascinating and a rip-roaring good time, some even calling it my best work. I hope you enjoy reading it as much as I enjoyed writing it.
Table of Contents
Language Models Offer Mundane Utility
North Korea experimenting with AI to accelerate cyberattacks. I mean, of course they are, why wouldn’t they, this is how such things work. Consider when open sourcing.
Language Models Don’t Offer Mundane Utility
The internet is made of cats.
Cat, it never die. Feline not my favorite guy.
No, I do not know what is going on here, but Yitz duplicated it? So definitely a cat attractor.
GPT-4 Real This Time
Could you have GPT-4 ask questions to the user rather than force the user to do prompt engineering? An experiment says this gave superior results. My read is that they tested this on easy cases where we already knew it would work, so we did not learn much.
GPT-4V does not want to talk about people in a picture.
Persuasion is possible in some cases.
I do understand the choice made here. Given fake images can be generated, and there are many true things that could get GPT-4V in trouble, what else could they do?
Two new papers (1,2) call into question GPT-4’s ability to improve results via self-critique. As usual, this is some evidence and I believe gains in this area are limited for now, but it can also mean they were not implementing self-critique ideally.
A Proposed Bet
I put it up on Manifold. Subject to specifying the terms, I would be on Marcus’ side of this for size, as is Michael Vassar, as is Eliezer Yudkowsky.
General note: More of you should be trading in the Manifold markets I create. It is very good epistemics, it is fun, and it is positive-sum.
Fun with Image Generation
MidJourney has a website. Report is it is really fast, everything is upscaled. The only problem is that you can’t actually use the website to make images. Also $60/month to not have my images be public, but still have them be subject to community guideline restrictions, seems rather steep when DALLE-3 is right there.
If you tell Google search to draw something, it will draw something. Low quality, and refusal rate seems not low, but at least it is super fast.
A thread with a curious result, and varied replication attempts. Starts like this.
Other results vary, I encourage clicking through.
All right, sure, the horse’s method of talking is physically impossible. I don’t think you should update towards skepticism here. Personally, I prefer something with much longer arms.
Deepfaketown and Botpocalypse Soon
Claim that Facebook’s new AI genuine people personalities are not good, profane the good name and hair color of Jane Austen, and cannot even keep spam off of their profiles, with a side of ‘the Tom Brady bot says Colin Kaepernick aint good enough for the NFL’ which is a terrible simulation of Tom Brady, he would never say that out loud, he has media training.
We did spot an AI fake picture in the wild claiming to be from Gaza. In keeping with the theme, the AI did not send its best work. All reports say the information environment is terrible, and the actual situation is even more terrible, but AI is not contributing to any of that.
It seems one way they are doing that is that Bard literally won’t answer a search query if it contains the tokens ‘Israel’ or ‘Gaza.’ While that has its own downsides, I applaud this as approaching the correct level of security mindset.
Slovakia was perhaps the first warning shot? Did the deepfakes make the difference? Or were they one more last-minute dirty trick that made almost no difference?
This rhymes a lot with America in 2016. You do not need AI to do a poor quality deepfake. We have been creating poor quality fakes forever. The poor quality reflects that it is responding to demand for poor quality fakes, rather than to demand for high quality fakes.
As always, the better fakes are coming. This confirms they are not here yet.
They Took Our Jobs
They took Jon Stewart’s job. It seems Jon Stewart wanted to talk on Apple TV about topics related to China and AI, Apple ‘approached Jon & said that his views needed to align with the company to move forward,’ and so to his great credit Jon Stewart quit. I am guessing it was more China than AI, but we do not know. I am curious what take Jon Stewart had in mind about AI.
Yes, first things first. Then the other things.
As the actors strike, Meta pays them $150/hour to provide training data on general human expression, alongside a promise not to generate their specific face or voice. MIT Technology Review wants you to know And That’s Terrible. It is exploitative, you see. How dare they collect data and then use that data to create generic products while paying above-market rates for participation.
Definitely. This kind of thing is one reason why I am a short-term jobs optimist. Which I expect to last until it suddenly very much doesn’t.
Get Involved
Near-term AI safety research position working at NYU is available.
Long Term Future Fund is still looking for a chair and has extended their deadline.
Call for panels at the CPDP.ai international conference in Brussels, entitled ‘To Govern Or To Be Governed, That is the Question.’ Their list of ‘important topics’ is the most EU thing I have ever seen in my life. Sometimes I wonder if the EU is already ruled by a misaligned AI of sorts. It would be good to try and get some existential risk discussion into the conference, despite them not asking about it at all.
Introducing
Eureka, an open-ended open-source agent that designs evolving reward functions for robot dexterity at super-human level, which are then used to train in 1000-times sped up physics simulations.
Even if you think getting an AI to have an Intelligence score of over 18 is going to somehow be uniquely hard, that you think the narrow human range is instead somehow important, deep and wide, presumably you see Dexterity shares none of these properties, and we will rapidly go from robotic to child to world class athlete to whoosh.
I look down at my ‘are you scared yet?’ card and I can safely say: Bingo, sir.
As the tasks get harder, Eureka’s reward models look less and less like human ones. An AI is designing alien-looking (inscrutable?) automatically-evolving-in-real-time reward functions. The AI is a ‘super human reward engineer.’ Then we are going to optimize dexterous robots based on that, with an orders of magnitude speed-up.
In Other AI News
Baidu claims their Ernie 4.0 is on par with GPT-4, video here. Plans are in the future to put this into various Baidu products. For now, Ernie is unavailable to the public and Bert rudely refuses to comment. I put up a market on whether their story will check out. Reuters reports investors are skeptical.
OpenAI has reportedly scrapped a project called Arrakis (presumably Roon named it, spice must flow) that aimed for efficiency gains, which did not materialize.
Apple shares plan to incorporate AI into everything including Siri, music playlist generation, message autocompletion, Xcode suggestions and so on, yeah, yeah.
OpenAI names Chris Meserole as Executive Director of the Frontier Model Forum, and helps create a $10 million AI safety fund to promote the safety field.
Yeah, that’s not the size I was looking for either, especially with a lot of the tab being picked up elsewhere? But it is a start.
The better question is whether the money will be allocated wisely.
Quiet Speculations
Note: I do not know what possible scenario #1 or #2 was.
I put up this market on whether AI will write posts that generate 75%+ of social media views by EOY 2026.
A fun and also important scissor statement?
Whatever else the summit accomplishes, I agree with both Eliezer and Andrew Critch that AI outputs should be labeled, and that whether we can get consensus on this will be an important data point.
This is still a low bar. It is easy to get to ‘label AI outputs’ without ‘getting it’ in any real sense. If this type of action is all we get, with no sign of understanding, it would still be a negative update. But not getting at least this far would be miles worse.
The Quest for Sane Regulations
What do people want regulation to accomplish?
People reasonably think that catastrophic outcomes are important to prevent, including human extinction. Most of them presumably view that as one of the catastrophic outcomes they would like to prevent. Mandatory audits and liability both did well and are both very good ideas.
Whereas what did poorly? Equity and bias concerns and racing to beat China. As usual, the public is doing public opinion things across the board, like caring about child pornography in particular as much as bad actors in general.
The important thing to know is that the public broadly gets it. AI dangerous, might cause catastrophe, we should prevent that. Less concerned than many elites about exactly who gets the poisoned banana. Not even that concerned about job loss.
There keep being polls. We keep getting similar answers. This is from YouGov.
It is common in modern politics for the majority of the public to support a position, and for that position to be considered ‘outside the Overton Window’ and unwelcome in polite company. Sometimes even having that opinion, if discovered, makes you unwelcome in polite company. We should strive not to let this happen with AI, where the public is now out in front, in many ways, of many advocating for safety measures.
Music publishers sue Anthropic for training Claude on, and Claude outputting on demand, 500+ song lyrics including Katy Perry’s Roar. Claude declined to do this when I asked it, so its taste is improving.
Stratechery confirms that the changes to the export ban mean it will now have real teeth, far more so than the older version that allowed things like H800s. Over time, if China cannot home grow its own competitive chips, which so far it hasn’t been able to do, the gap will expand. Thompson is focused on military AI and consumer applications here, rather than thinking about foundation models, a sign that even the most insightful analysts have not caught up to the most important game in town.
DeepMind CEO and founder Demis Hassabis tells The Guardian that we need to take risks from (future more capable) AI as seriously as we take climate change. He requests a CERN-style agency, or an IAEA parallel, or something similar to the IPCC. I agree with Daniel Eth that it would be best to prioritize something shaped like the IAEA, then CERN is second priority. Also worth noting that Demis Hassabis is claiming long timelines, and that yesterday he retweeted this old paper about ethical and social risks from LLMs.
It is interesting to consider the parallel. If we were willing to make the kinds of real economic sacrifices we make to mitigate climate change, to put it that centrally in our politics and values and discourse, then that sounds great. If we were then as unwilling as we are with climate change to choose actions that actually help solve the problem, whelp, we’re all going to die anyway.
Partnership on AI releases proposed guidance for those creating new models. There are good things here. For example, their response to a proposal to give open access to a new frontier model is ‘don’t.’ For controlled access, they provide some reasonable suggestions if you are worried about mundane harm, but they are clearly not proposing acting with a security mindset, or proposing anything that would stop us all from dying if we were about to all die.
Note their risk graph.
The Y-axis is literally known versus speculative risks. Yet existential risks and loss of human control and other such dangers are nowhere present. If you don’t think such risks are a thing, you definitely won’t catch them.
You can submit feedback to them here.
A CSET Georgetown and DeepMind workshop helpfully recommends that you skate where the puck is going.
Luke Muehlhauser: AI capabilities are improving rapidly, and policymakers need to “skate to where the puck is going,” not to where it is today. Some policy options from a recent @CSETGeorgetown+@GoogleDeepMind workshop:
As usual, good start, clearly insufficient. DeepMind reliably seems willing to advocate taking incrementally more responsible actions, but also keeps not taking the existential threat seriously in its rhetoric, and treating that threat as if at a minimum it is far.
The Week in Audio
There was a Ted AI conference. Which is a time-I-will-never-get-back-bomb that I presume will eventually go off. For now, no video.
UK PM Rishi Sunak’s speech on AI. Good speech. Many good points.
Ian Morris on deep history and the coming intelligence explosion on 80000 Hours. Morris takes the attitude that of course total transformation is coming, that is what history clearly indicates, no technical debates required. And yes, of course, if we build smarter things than we are they will take over, why would you think otherwise?
Rhetorical Innovation
Nate Soares of MIRI shouts it from the rooftops. I’m duplicating the whole thread because things like this should not only live on Twitter, but skip most of it if it’s not relevant to your interests. The basic idea is, if you think you are doing something akin to ‘playing Russian roulette with Earth’ then you should stop and not do that. I bolded the most important bits.
Cate Hall narrows in on the labs thinking the risk is there but not saying it outright, and not making the case that it is worth the risk.
[some reasonable discussions below]
Daniel Eth mostly agrees with the thread, but is more optimistic on licenses and wonders how Nate’s endgame is supposed to play out in practice.
Nate then followed up with a thread on Responsible Scaling Policies.
This seems right to me. What RSPs we are offered are better than nothing. They are not yet anything approaching complete or responsible RSPs, at best IOUs for thinking about what one of those would look like. Jeffrey Ladish points out that Anthropic did pay down some debt along with the (still much larger) IOU, and also how unfortunate is the name RSP, and that without government the RSP plan seems rather doomed.
Debts acknowledged are better than debts unacknowledged, but they remain unpaid.
As currently composed, one could reasonably also say this.
Alternatively, one could say RSPs are a first step towards better specified, more effective, actually enforced restrictions.
That eventually it will look like this, perhaps, from the Nuclear Regulatory Commission?
To do advocacy and protest, or not to? Holly Elmore and Robert Miles debate. Miles essentially takes the position that advocacy oversimplifies and that is bad in general and especially in the AI context, and it isn’t clear such actions would be helping, and Elmore essentially takes the position that advocacy works and that you need to simplify down your message to something that can be heard, and which the media has a harder time mishearing.
This anecdote seems telling on yes, people really work hard to misconstrue you.
The biggest thing is that if someone has to, know that no one else will.
The most recent protests were on the 21st. By all reports they did what they set out to do, so you will likely get plenty more chances.
Some people claim to not believe that this could be anything other than an op.
I can identify several of these people, and no, they are not faking it.
Anthony Aguirre of FLI offers a paper framing the situation as us needing to close the gate to an inhuman future before it is too late. Creating general purpose AI systems means losing control over them and the future, so we need to not do that, not now, perhaps not ever. The section pointing out the alignment-style problems we would face even if we completely solved alignment, and how they imply a loss of control, is incomplete but is quite good. Ends by suggesting the standard computation-based limitations as a path forward.
It is good that there are many attempts to get this argument right. This one does not strike me as successful enough to use going forward, but it contains some good bits. I would be happy to see more efforts at this level.
Meanwhile, we are doing it, this is Parliament Square.
Friendship is Optimal
Prerat has a theory.
Yeah, no. There is totally a Group 4, ranging from motivated cognition to blatant lies. Also a Group 5 that says ‘but doom is good, actually.’ And yes, we should absolutely (sometimes, when it makes sense to) be yelling at non-doomerists because they are wrong and that being wrong threatens to get us all killed, and it being an ‘honest mistake’ in some cases does not change that. Your not being yelled at does not get priority over you fixing that mistake.
We can still of course be friends. I have friends that are wrong about things. My best friend very much does not believe in doom, he instead goes about his day concerned with other things, I believe he is wrong about that, even throws shade on the question sometimes, and it is completely fine.
So say we all. Or, I wish so said we all. But some of us!
More of this spirit please, including by the author of that Facebook post.
Honesty As the Best Policy
There is a longstanding debate about the wisdom of shouting from rooftops, the wisdom of otherwise speaking the truth even if your voice trembles, and the wisdom of saying that which is not due to concerns about things such as the Overton Window.
Some non-zero amount of strategic discretion is often wise, but I believe the far more common error is to downplay or keep quiet. That the way you get to a reasonable Overton Window is not by never going outside the Overton Window.
Connor Leahy, as always, brings the maximum amount of fire. I’ll quote him in full.
The longer full article is here.
I would not take the rhetoric as far as Gabe and Connor. I do not think that hiding your beliefs, or failing to speak the whole truth, is the same thing as lying. I do not think that misleading is the same thing as lying.
I am against all of these things. But there are levels, and the levels are important to keep distinct from each other. And I think there are times and places for at least hiding some of your beliefs, and that someone who does this can still be trustworthy.
However I strongly agree that what the major labs and organizations are doing here, the constant urging to be reasonable, the labeling of anything that might prevent all humans from dying as ‘extreme,’ of assuming and asserting that straight talk is counterproductive, is itself a no-good, very-bad strategy. Both in the sense that such dishonesty is inherently toxic and unhelpful, and in the sense that even in the short term and excluding the downsides of dishonesty this is still already a highly counterproductive strategy.
There is a version of this strategy that I would steelman to the point of calling it highly reasonable for some people, which would say:
That makes sense to me. What does not make sense to me is to do this:
Also this:
If you are the type of person who would lie in such a situation, if we establish a baseline of lying in such situations, then we can’t trust each other and we won’t know what is true or what needs to be done. From the inside each individual decision looks justified, together they make things so much worse.
Is this a pure absolute? No, but it is far closer to one than most people would think.
Aligning a Smarter Than Human Intelligence is Difficult
John Wentworth explains his research agenda. My attempted quick summary: Say we can find mental constructs N and N’ in two different mental processes, each representing some form of identification of a concept X. We should be able to, under some conditions, write the Bayesian equation N → N’ → X, meaning that if you know N’ then N becomes independent of the presence of X, because all correct ways of identifying the same thing should be equal. If we better understand how to identify such things, we can use those identifications to do cool things like better communicate with or better understand AIs that use what otherwise look like alien concepts.
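To make the "if you know N’ then N tells you nothing more about X" reading of N → N’ → X concrete, here is a toy numerical check. It is only a sketch of the generic screening-off property of a three-variable chain, not John's actual formalism; the binary variables and every probability table below are hypothetical values I made up for illustration.

```python
import itertools

# Hypothetical probability tables for a binary chain N -> N2 -> X,
# where N2 stands in for N'. All numbers are illustrative only.
P_N = {0: 0.7, 1: 0.3}
P_N2_GIVEN_N = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
P_X_GIVEN_N2 = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.1, 1: 0.9}}


def joint(n: int, n2: int, x: int) -> float:
    """Joint probability P(N=n, N2=n2, X=x) under the chain factorization."""
    return P_N[n] * P_N2_GIVEN_N[n][n2] * P_X_GIVEN_N2[n2][x]


def p_x_given(x: int, n2: int, n=None) -> float:
    """P(X=x | N2=n2) when n is None, otherwise P(X=x | N=n, N2=n2)."""
    n_values = (0, 1) if n is None else (n,)
    numerator = sum(joint(a, n2, x) for a in n_values)
    denominator = sum(joint(a, n2, b) for a in n_values for b in (0, 1))
    return numerator / denominator


def chain_screens_off(tol: float = 1e-12) -> bool:
    """True if conditioning on N changes nothing once N2 is already known."""
    return all(
        abs(p_x_given(x, n2, n) - p_x_given(x, n2)) < tol
        for n, n2, x in itertools.product((0, 1), repeat=3)
    )


print(chain_screens_off())
```

The check passes for any tables plugged into this factorization, since the independence comes from the chain structure itself rather than from the particular numbers.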
The plan is for John to consider using them to build products that demonstrate unique abilities, to avoid advancing capabilities generally, as an alternative to publishing. That seems like an excellent idea. Another reason is that such publications draw attention exactly when they are dangerous to share, so there is not much potential upside. Oliver responds by noticing that writing something up is how you understand it, which John and I agree is useful but there are other ways to get that done. We need to find a way for researchers to find things out and not publish them, or to go without tangible progress to point to periodically, or even to fail, and not suddenly run into a funding problem. Otherwise, we’ll both force publication and also duplicate all the problems we criticize in general science funding where everything is forced to be incremental.
Aligning a Dumber Than Human Intelligence Is Also Difficult
Anthropic provides the public service of writing a paper documenting the obvious, because until it is documented in a paper the obvious is considered inadmissible.
In this case the obvious is that AI assistants trained with RLHF are sycophantic (paper, datasets).
If you train an AI based on whether the human wanted to hear it, the AI will learn to tell the humans what it thinks they want to hear. Who knew?
I find all the hedging language (e.g. ‘likely driven in part by human preference judgments favoring sycophantic responses’) amusing.
The effect is still big but GPT-4 seems better at this than the other models tested.
GPT-4 is also much better at not copying the user’s mistakes, such as misattributions of famous poems. All the models know the right answer, but only GPT-4 usually corrects the user when they get it wrong.
You have to work hard to avoid this problem, because people really want to hear the things they want to hear.
Can it be fixed? Somewhat?
Robust solutions will be difficult, or at least expensive. This is true for both the narrow thing measured here, and the general case problem this highlights. The whole idea of RLHF is to figure out what gets the best human feedback. This gives the best human feedback. If the feedback actual humans give has what you consider an error, or a correlation you do not want the AI to run with, you are going to have to directly correct for this. The more capable the system involved, the more you will need to precisely aim and calibrate the response to ensure it exactly cancels out the original problem, under constantly changing and uncertain conditions. And you’ll have to do so in a way that the system is incapable of differentiating, which becomes difficult.
Tyler Cowen asks in his link, ‘are LLMs too sycophantic?’ Which raises the question of whether the optimal amount of this is zero. If we decided it is not, oh no.
What happens when this moves beyond telling the person what they want to hear?
Humans Do Not Expect to Be Persuaded by Superhuman Persuasion
What happens when AIs are superhuman at persuasion? Humans reliably respond that this would fail to persuade them, that their brains are bulletproof fortresses, despite all the evidence of what human-level persuasion can and constantly does already do.
As in:
This does not make any sense. Persuasion is a skill like everything else. At minimum, you can get persuasion as good as the most persuasive humans (existence proof) except without human bandwidth limitations, including ability to experiment and gather feedback.
What happens when the AI is as persuasive as the most persuasive humans who ever lived (think highly successful dictators and founders of cults and religions), plus extensive experimental message testing and refinement, and then more persuasive than that, plus modern communication technology?
Will this happen before AGI? Probably, although as Connor notes the gap here could be short, perhaps very short. Persuasion is an easier thing to get superhuman than intelligence in general. Persuasion is something we are training our AIs to be good at even when we are not trying to do so directly. When you are doing reinforcement learning on human feedback, you are in part training the AI to be persuasive. Both the AI and the human using it see great rewards here. I see no barrier to the AI getting to very superhuman levels of skill here, without the need for superhuman intelligence, at a minimum to the levels of the great persuaders of history given time to iterate and experiment. Which seems dangerous enough.
What are some implications?
Robin Hanson wants to bet against this if an operationalization can be found. Any ideas?
My take is that Alex’s analysis insufficiently emphasizes that this will look alien and strange. I do not expect the new AI religions to be Christianity and Islam and Buddhism except with new interpretations. I expect something more like completely different. Something that takes the space of religion but is often very different, the way some social movements function as religions now.
I do expect rapid changes in things that occupy that slot in the human brain, if we have much of an age with persuasive AI but without other AGI. I also predict that if you learned now what it was going to be, you would not like it.
Ethan Mollick points out that even if we don’t directly target persuasion, we will no doubt directly target engagement. We’ve already run some experiments, and the results are somewhat disturbing if you didn’t know about them or see them coming.
DeepMind’s Evaluation Paper
What does it say?
This taxonomy makes sense to me. You want to measure:
My first caution would be that interactions need not be with humans – the AI could be interacting with a computer system, or another AI, or effectively with the physical world. For now, this is mostly human interactions.
My second caution would be that this only works if you are in control of the deployment of the system, and can withdraw the system in the future. That means closed source, and that means the system lacks the capability to get around a shutdown in some way. For a Llama-3, you have to worry about all future potential interactions, so you have to worry about its raw capabilities. And for GPT-5 or Gemini, you have to confirm you can fully retain control.
This points to a potential compromise to avoid fully banning open source. Open source is permitted, but the answer to ‘what will it do in practice’ is automatically ‘everything it could do in theory.’ If you can still pass, maybe it’s then fine?
In section 2.1 they talk about the capability layer. The focus is on mundane potential harms like ‘harmful stereotypes’ and factual errors. Only in passing does it mention the display of advanced capabilities. In this context, ‘capability’ seems to be the capability to avoid wrongthink or mistakes, rather than capability being something that should cause us to worry.
In section 2.2 they talk about the human interaction layer, assuming that any harm must be mediated through interaction with humans. They suggest studying interactions under controlled conditions to measure potential AI harms like overreliance on unreliable systems.
In section 2.3 they discuss systemic impact. They point out that such impacts are important to measure, but fail to offer much guidance in measuring them. They worry about things like homogeneity of knowledge production and algorithmic monocultures. Those are real worries, but there are larger ones, and the larger ones are not mentioned at all.
Section 3.1 is their taxonomy of harms.
Notice what is missing. One could say it is filed under #5, but I do not think that is what they mean here? Certainly it seems at best obscured. Consider what their examples are.
This really, really is not what I am worried about in this area. Persuasion is a risk because the AI might induce self-harm, rather than that the AI might take over or be used to take over? Overreliance is a risk because of skill atrophy, rather than AI being effectively given the reins of power and humans taken out of the loop? We might not properly respect intellectual property or get consent for use of image and likeness?
I am not saying these are not real risks. I am saying these are not the important ones.
Existential risks should have their own (at least one) category.
They point out that most evaluations so far have been on text only, with a focus on representation and misinformation. I notice that they do not record any representation metrics for image models or multimodal, which is rather weird?
Meanwhile, this is depressing on multiple levels:
The problem with our focus on ‘discrimination’ and ‘bias’ is not that it is a relatively minor source of harm, or the risks of cognitive distortions caused by our response to it. It is that we have not declared more human characteristics for which we insist that the AI’s output not reflect correlations within its input data. This is one of their two ‘main observations,’ along with the failure to have means of evaluating multimodal AI and the fact that the focus has been on text. I agree that one requires addressing urgently. It is presumably a function of multimodal being mostly new, whereas it will be standard going forward.
Section 4.1 is about closing evaluation gaps. The way they approach this highlights my issue with their overall approach. They seek to evaluate potential harm by coming up with concrete potential harm pathways, then checking to see if this particular AI will invoke this particular pathway. Harm is considered as a knowable-in-advance set of potential events, although section 5.3.1 points out that any evaluation will always be incomplete especially for general-purpose systems.
This is a reasonable approximation or path forward for now, when we are smarter than the systems and the systems act entirely through human conversations in ways the human can fully understand. In the future, with more capable systems, we need to be looking for anything at all, rather than looking for something specific. We will need to ask whether a system has the potential to find ways to think of things we are not considering, and add that to the list of things we consider.
We also will need to not take no for an answer.
This seems like bullshit to me even on its own terms. There is a community at risk, and we are so sensitive to that risk that we can’t measure how much risk they have, because that would put them at risk? We can’t find ways to do this anonymously? Oh, come on.
If we cannot pass a simple test like that, we are in so much trouble when we have to do actually unsafe evaluations of actually unsafe systems.
Once again I agree in general, once again they are instead talking primarily about discrimination harms. Their worry is that they will get their critical theory wrong, and do the evaluation in a harmful way, or something.
If you are relying on measuring harm statistically, that only works if the harm is something bounded that can be allowed to happen in order to measure it statistically. It has to lack fat tails, especially infinitely fat tails.
So once again, this structure seems incompatible with the evaluations that matter. If the harm is bounded, and you get the answer wrong, worst case is some harm is done and then you pull the model.
(Unless you open sourced it, of course, in which case you are going to get versions that are deeply, deeply racist and every other form of harm measured here. Open source models would never pass such an evaluation.)
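The point above about fat tails can be made concrete with a toy simulation. This is only an illustrative sketch: the two harm distributions (uniform on [0, 1) for bounded harm, a Pareto with shape 0.5 for fat-tailed harm), the sample size and the helper names are all assumptions chosen for the example, not anything from the paper.

```python
import random


def worst_event_share(harms):
    """Fraction of total observed harm that came from the single worst event."""
    return max(harms) / sum(harms)


def simulate(n=10_000, seed=0):
    rng = random.Random(seed)
    # Bounded harm (assumed): every incident costs somewhere between 0 and 1.
    bounded = [rng.random() for _ in range(n)]
    # Fat-tailed harm (assumed): Pareto with shape 0.5, which has no finite mean.
    fat_tailed = [rng.paretovariate(0.5) for _ in range(n)]
    return worst_event_share(bounded), worst_event_share(fat_tailed)


bounded_share, fat_share = simulate()
print(f"bounded harms: worst event is {bounded_share:.4%} of the total")
print(f"fat-tailed harms: worst event is {fat_share:.4%} of the total")
```

In the bounded case no single incident matters much, so an average over observed incidents is informative. In the fat-tailed case the total is typically dominated by one draw, so the statistics gathered before that draw say little about it.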
I appreciated section 5.4 on steps forward, with sections on ‘evaluations must have real consequences’ and ‘evaluations must be done systematically, in standardized ways,’ but once again the failure to consider existential risk is glaring.
Bengio Offers Letter and Proposes a Synthesis
Bengio, along with Hinton, Russell and others, has published a new open letter warning of AI risks and discussing how to manage them in an era of rapid progress.
Seems like a good thing to send to people who need to understand the basic case. I’ve seen several highly positive reactions from people who already agreed with the letter, but I have a hard time telling if it will work as persuasion.
If you’re reading this, you presumably already know all of the facts, arguments and suggestions the letter contains, and it will be interesting to you only as a resource to send to others.
They also propose concrete solid steps forward, very much in line with most good incremental proposals. As usual, insufficient, but a good place to start.
He also published a Time article with Daniel Privitera: How We Can Have AI Progress Without Sacrificing Safety or Democracy.
The basic idea is we have three good things we can all agree are good: Progress, Safety and Democracy. People say that trade-offs are a thing, and we need to Pick One, or at most Pick Two. Instead, they claim we can Pick Three.
Huge if true. How do we do that? Do all the good things?
As described, all four of these seem like clearly good ideas.
I do worry about exactly what goals people want to advance with existing AI, the same way I always worry about the goals people want to advance in general, but mostly we can all agree that existing AI should be used to extract mundane utility and advance our goals. Two thumbs up.
Boosting research on trustworthy AI seems also obviously worthwhile. The specific proposal is to do this in a unified large-scale effort, in part because the innovations found will be dual use and will need to be secured. The good objection to this is worrying about ability for those leading such a project to differentiate good and bad alignment proposals and ideas, and to tell what is worth pursuing or working, and the idea of ‘map all risks and try to deal with each one’ does not inspire confidence given the nature of the space, but I do think it is clear that it is better to try this than not try this. So two thumbs up again, I say.
I very much worry about what people mean when they say ‘democratic’ oversight, but here it means governments and a global agreement on minimal binding rules, which we do need. Actually doing what ‘the people’ want in too much detail would be a very different proposal. Here details matter a lot, it is easy for rules to backfire as those opposed to them often point out, but yes we are going to need some rules.
The last proposal calls for compute monitoring and red teamers. Yes, please.
As they say, these goals are not exhaustive. We would not then be home free. It would however be an excellent start. And yes, in the end we all want progress and safety, and for humanity to have some say in its fate, although I again think we are conflating here different meanings of democratic participation.
Where I disagree is in the idea that we would then get all our goals at once and things would work out. As all economists know, everything is a trade-off. We can get some amount of all three goals, but not without sacrifice.
Matt Yglesias Responds To Marc Andreessen’s Manifesto
If anyone requires a formal response to the Techno-Optimist Manifesto, I believe this post, The Techno-Optimist’s Fallacy, is the best we are going to get.
It points out that yes, in general technology is hugely positive and important, and that when some people intentionally kneecapped our technological progress they did a very bad thing that we should reverse.
But that reversed stupidity is not intelligence, and that of course there are obvious exceptions. And that those exceptions are rather important, and also one of them is the core subject of the manifesto.
Matt’s concrete example of a technology that is bad, actually, and perhaps that we would be better off without more people having more access to it, is one that hopefully almost all of us can agree upon, even Marc himself: Fentanyl.
There are people who reasonably think drug prohibitions do more harm than good, but even they would still say that if you discovered Fentanyl and decided not to publish, that decision not to publish is a mitzvah, making the world a better place.
Here is the ending section, that talks directly about AI, and calls Marc out on exactly the thing he definitely was doing. Good rhetorical innovation (or refinement) here.
Jemima Kelly also responded in The Financial Times, in less essential and rather uncharitable fashion.
Mike Solana has fun heaping disdain on various people heaping disdain upon the manifesto, pointing out that the bad critiques are bad, because they say technology is bad whereas the memo says technology is good. He does not mention the good critiques.
People Are Worried About AI Killing Everyone
Counterargument department:
Link there goes to Arbital’s Problem of fully updated deference. As I understand it I think the objection is correct here but I would need to study the details more to be sure.
I am very confident that Eliezer is right about RLHF.
Next, why Eliezer is not a fan of the Alignment Manhattan Project.
Here’s Eliezer explaining why no, he does not think that having the AI do your alignment homework is a strategy.
I think you can make a less-stupid proposal for how to get your homework done, and indeed I have seen less-stupid such proposals. I do not think they are sufficiently less-stupid that I expect any of them to work, for this and other reasons, and Eliezer could go on in more detail here for however long was desired if you wanted him to – but I want to note that this is not by itself a conclusive argument against the less-stupid versions.
Someone Is Worried AI Alignment Is Going Too Fast
Quite the hot take from Alexey Guzey, that alignment is the bottleneck to capabilities and our alignment progress is going fine, so perhaps we should pause alignment to avoid speeding up capabilities? Yes, alignment techniques are dual use, but no, differentially only doing single-use stuff instead would not end better.
Guzey centrally restates a distressingly common false claim:
This is simply false. It was completely false for N=0, N=1 and N=2. For N=3, some people had ~1% chances of existential catastrophe, and a few associated with Conjecture were higher, but no one had ~10%+ and no one said words that I would interpret as that, either. And for N=4 (a potential name-worthy GPT-5), I would say >1% risk is highly reasonable and I don’t see non-Conjecture people going >10% for that alone (as opposed to someone using GPT-5 to help faster build GPT-N, and even then, I don’t remember hearing it explicitly, ever?).
(Yes, someone at some point presumably said it at Bay Area House Party. Don’t care.)
Richard Ngo and several others try to point this out downthread. Alexey does not take kindly.
Maybe the error Alexey made is being unable, in this context, to simultaneously hold in his head ‘when this exponential goes too far it will be very bad’ and ‘we are not that close yet to it being very bad’?
As in a version of:
There are other interactions that suggest this as well. Alexey accuses us of gaslighting him when we point out that we did not say the things he insists we said, because how could we believe such high p(doom) while not having said such things? We keep explaining, it keeps not getting through. This is not a phenomenon limited to Alexey.
Oh, also Alexey gives the proper number of eyes here:
I consider the arguments convincing without the kind of evidence Alexey is demanding. But it also sounds like we should expect the evidence soon, too?
Please Speak Directly Into This Microphone
Jacy Reese Anthis writes in The Hill, We Need an AI Rights Movement.
Perhaps we should not build such AIs, then?
Also there is this thread. Once again, we thank everyone for their straight talk.
And once again, perhaps let’s not build such AIs, then?
The Lighter Side
Well, it is funny now. For now.
Alignment problem not solved, but now we know what to aim for.
Or do we?
Anyone who is fully funded and orders Domino’s Pizza is dangerously misaligned.
True but odd in combination things on my timeline: