All of Zach Stein-Perlman's Comments + Replies

AGI is defined as "a highly autonomous system that outperforms humans at most economically valuable work." We can hope, but it's definitely not clear that AGI comes before existentially-dangerous-AI.

New misc remark:

It's not clear how the PF interacts with sharing models with Microsoft (or others). In particular, if OpenAI is required to share its models with Microsoft and Microsoft can just deploy them, even a great PF wouldn't stop dangerous models from being deployed. See OpenAI-Microsoft partnership.

2ryan_greenblatt5d
Note that the OpenAI-Microsoft deal stops at AGI. We might hope that AGI will be invoked prior to models which are existentially dangerous.

Good work.

One plausibly-important factor I wish you'd tracked: whether the company offers bug bounties.

Some prior discussion.

[Edit: didn't mean to suggest David's post is redundant.]

2Davidmanheim15d
Thanks - I agree that this discusses the licenses, which would be enough to make Llama not qualify, but I think there's a strong claim I put forward in the full linked piece that even if the model weights were released under a GPL license, those "open" model weights wouldn't make it open in the sense that Open Source means elsewhere.

I agree. But I claim saying "I can't talk about the game itself, as that's forbidden by the rules" is like saying "I won't talk about the game itself because I decided not to" -- the underlying reason is unclear.

9Matt Goldenberg18d
The original reasoning that Eliezer gave, if I remember correctly, was that it's better to make people realize there are unknown unknowns, rather than taking one specific strategy and saying "oh, I know how I would have stopped that particular strategy."

Unfortunately, I can't talk about the game itself, as that's forbidden by the rules.

You two can just change the rules... I'm confused by this rule.

2Nathan Helm-Burger18d
The implication is that they approved of this rule and agreed to conduct the social experiment with that rule as part of the contract. Is that not your understanding?

The control-y plan I'm excited about doesn't feel to me like "squeeze useful work out of clearly misaligned models." It's more like "use scaffolding/oversight to make using a model safer, and get decent arguments that using the model (in certain constrained ways) is pretty safe even if it's scheming." Then if you ever catch the model scheming, do some combination of (1) stop using it, (2) produce legible evidence of scheming, and (3) do few-shot catastrophe prevention. But I defer to Ryan here.
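To make the shape of that plan concrete, here's a minimal sketch of an inference-time control loop (hypothetical function and object names; not any lab's actual pipeline): an untrusted model does the work, a weaker trusted model scores each output for suspicion, and outputs above a threshold get audited instead of executed.

```python
# Sketch of an inference-time control loop: an untrusted (possibly scheming) model
# does the work, a trusted weaker model scores each output for suspicion, and
# suspicious outputs are audited instead of executed. All names are hypothetical.

SUSPICION_THRESHOLD = 0.8  # tune so the human audit budget covers everything above it

def run_with_control(task, untrusted_model, trusted_model, audit_queue):
    proposal = untrusted_model.generate(task)

    # Trusted monitoring: a weaker-but-trusted model rates how suspicious the
    # proposal looks (signs of sabotage, exfiltration, hidden side effects, etc.).
    suspicion = trusted_model.rate_suspicion(task, proposal)

    if suspicion >= SUSPICION_THRESHOLD:
        # Don't execute; escalate for human audit. A confirmed catch feeds the
        # responses above: stop using the model, produce legible evidence of
        # scheming, and/or attempt few-shot catastrophe prevention.
        audit_queue.append((task, proposal, suspicion))
        return trusted_model.generate(task)  # fall back to the trusted model's answer

    return proposal  # low suspicion: use the untrusted model's (better) work
```

The design question is mostly about budgets: how high to set the threshold given limited capacity for human audits, and how much quality you lose by falling back to the trusted model.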

7ryan_greenblatt2mo
I think once you're doing few-shot catastrophe prevention and trying to get useful work out of that model, you're plausibly in the "squeezing useful work out of clearly misaligned models" regime. (Though it's not clear that squeezing is a good description, and you might think that your few-shot catastrophe prevention interventions have a high chance of removing scheming. I'm generally skeptical about removing catastrophic misalignment just based on one escape attempt, but I think that we still might be able to greatly improve safety via other mechanisms.) As I discuss in my comment responding to the sibling comment from habryka, I'm interested in both ensuring direct safety and getting useful work from scheming models.
2habryka2mo
I had a long two-hour conversation with Ryan yesterday about this, and at least my sense was that he is thinking about it as "squeeze useful work out of clearly misaligned models".  He also thought other people should probably work on making it so that if we see this behavior we stop scaling and produce legible evidence of scheming to facilitate a good societal response, but my sense is that he was thinking of that as separate from the work he was pointing to with the AI control post.

New misc remark:

  • OpenAI's commitments about deployment seem to just refer to external deployment, unfortunately.
    • This isn't explicit, but they say "Deployment in this case refers to the spectrum of ways of releasing a technology for external impact."
    • This contrasts with Anthropic's RSP, in which "deployment" includes internal use.

Added to the post:

Edit, one day later: the structure seems good, but I'm very concerned that the thresholds for High and Critical risk in each category are way too high, such that e.g. a system could very plausibly kill everyone without reaching Critical in any category. See pp. 8–11. If so, that's a fatal flaw for a framework like this. I'm interested in counterarguments; for now, praise mostly retracted; oops. I still prefer this to no RSP-y-thing, but I was expecting something stronger from OpenAI. I really hope they lower thresholds for the finished version of this framework.

1RogerDearnaley2mo
My impression was that (other than Autonomy) High means "effective & professionally skilled human levels of ability at creating this type of risk" and Critical means "superhuman levels of ability at creating this type of risk". I assume their rationale is that we already have a world containing plenty of people with human levels of ability to create risk, and we're not dead yet. I think their threshold for High may be a bit too high on Persuasion, by comparing to very rare, really exceptional people (by "country-wide change agents" I assume they mean people like Nelson Mandela or Barack Obama): we don't have a lot of those, especially not ones willing and able to work for O(cents) per thousand tokens for anyone. I'd have gone with a lower bar like "as persuasive as a skilled & capable professional negotiator, politician+speechwriter team, or opinion writer": i.e. someone with charisma and a way with words, but not once-in-a-generation levels of charisma.

Any mention of what the "mitigations" in question would be?

Not really. For now, OpenAI mostly mentions restricting deployment (this section is pretty disappointing):

A central part of meeting our safety baselines is implementing mitigations to address various types of model risk. Our mitigation strategy will involve both containment measures, which help reduce risks related to possession of a frontier model, as well as deployment mitigations, which help reduce risks from active use of a frontier model. As a result, these mitigations might span increasing co

... (read more)
5Akash2mo
They mention three types of mitigations:
  • Asset protection (e.g., restricting access to models to a limited nameset of people, general infosec)
  • Restricting deployment (only models with a risk score of "medium" or below can be deployed)
  • Restricting development (models with a risk score of "critical" cannot be developed further until safety techniques have been applied that get it down to "high." Although they kind of get to decide when they think their safety techniques have worked sufficiently well.)
My one-sentence reaction after reading the doc for the first time is something like "it doesn't really tell us how OpenAI plans to address the misalignment risks that many of us are concerned about, but with that in mind, it's actually a fairly reasonable document with some fairly concrete commitments."

Some personal takes in response:

  • Yeah, largely the letter of the law isn't sufficient.
  • Some evals are hard to goodhart. E.g. "can red-teamers demonstrate problems (given our mitigations)" is pretty robust — if red-teamers can't demonstrate problems, that's good evidence of safety (for deployment with those mitigations), even if the mitigations feel jury-rigged.
  • Yeah, this is intended to be complemented by superalignment.

(I edited my comment to add the market, sorry for confusion.)

(Separately, a market like you describe might still be worth making.)

(The label "RSP" isn't perfect but it's kinda established now. My friends all call things like this "RSPs." And anyway I don't think "PFs" should become canonical instead. I predict change in terminology will happen ~iff it's attempted by METR or multiple frontier labs together. For now, I claim we should debate terminology occasionally but follow standard usage when trying to actually communicate.)

9Akash2mo
I believe labels matter, and I believe the label "preparedness framework" is better than the label "responsible scaling policy." Kudos to OpenAI on this. I hope we move past the RSP label. I think labels will matter most when communicating to people who are not following the discussion closely (e.g., tech policy folks who have a portfolio of 5+ different issues and are not reading the RSPs or PFs in great detail). One thing I like about the label "preparedness framework" is that it begs the question "prepared for what?", which is exactly the kind of question I want policy people to be asking. PFs imply that there might be something scary that we are trying to prepare for. 

Nice. Another possible market topic: the mix of Low/Medium/High/Critical on their Scorecard, when they launch it or on 1 Jan 2025 or something. Hard to operationalize because we don't know how many categories there will be, and we care about both pre-mitigation and post-mitigation scores.

Made a simple market:

2jacobjacob2mo
Oops, somehow didn't see there was actually a market baked into your question. I'd also be interested in "Will there be a publicly revealed instance of a pause in either deployment or development, as a result of a model scoring High or Critical on a scorecard, by Date X?"

Mea culpa. Embarrassed I forgot that. Yay Anthropic too!

Edited the post.

But other labs are even less safe, and not far behind.

Yes, largely alignment is an unsolved problem on which progress is an exogenous function of time. But to a large extent we're safer with safety-interested labs developing powerful AI: this will boost model-independent alignment research, make particular critical models more likely to be aligned/controlled, help generate legible evidence that alignment is hard (insofar as that exists), and maybe enable a pause at a critical moment.

7RobertM2mo
I think that to the extent that other labs are "not far behind" (such as FAIR), this is substantially an artifact of them being caught up in a competitive arms race.  Catching up to "nearly SOTA" is usually much easier than "advancing SOTA", and I'm fairly persuaded by the argument that the top 3 labs are indeed ideologically motivated in ways that most other labs aren't, and there would be much less progress in dangerous directions if they all shut down because their employees all quit.

See Would edits to the adult brain even do anything?.

(Not endorsing the post or that section, just noticing that it seems relevant to your complaint.)

5jacob_cannell3mo
It does not. Despite the title of that section, it is focused on adult expression factors. The post in general lacks a realistic mechanistic model of how tweaking genes affects intelligence. It is similar to expecting that a tweak to the hyperparams (learning rate, etc.) of a trained GPT-4 can boost its IQ (yes, LLMs have their IQ or g factor). Almost all variables that affect adult/trained performance do so only through changing the learning trajectory. The low-hanging fruit or free energy in hyperparams with immediate effect is insignificant. Of course if you combine gene edits with other interventions to rejuvenate older brains or otherwise restore youthful learning rate more is probably possible, but again it doesn't really matter much as this all takes far too long. Brains are too slow.

I think you use "AI governance" to mean "AI policy," thereby excluding e.g. lab governance (e.g. structured access and RSPs). But possibly you mean to imply that AI governance minus AI policy is not a priority.

I indeed mean to imply that AI governance minus AI policy is not a priority. Before the recent events at OpenAI, I would have assigned minority-but-not-negligible probability to the possibility that lab governance might have any meaningful effect. After the recent events at OpenAI... the key question is "what exactly is the mechanism by which lab governance will result in a dangerous model not being built/deployed?", and the answer sure seems to be "it won't". (Note that I will likely update back toward minority-but-not-negligible probability if the eventu... (read more)

(I disagree. Indeed, until recently governance people had very few policy asks for government.)

Did that change because people finally finished doing enough basic strategy research to know what policies to ask for? 

Yeah, that's Luke Muehlhauser's claim; see the first paragraph of the linked piece.

I mostly agree with him. I wasn't doing AI governance years ago but my impression is they didn't have many/good policy asks. I'd be interested in counterevidence — like pre-2022 (collections of) good policy asks.

Anecdotally, I think I know one AI safety person... (read more)

My own model differs a bit from Zach's. It seems to me like most of the publicly-available policy proposals have not gotten much more concrete. It feels a lot more like people were motivated to share existing thoughts, as opposed to people having new thoughts or having more concrete thoughts.

Luke's list, for example, is more of a "list of high-level ideas" than a "list of concrete policy proposals." It has things like "licensing" and "information security requirements"– it's not an actual bill or set of requirements. (And to be clear, I still like Luke's p... (read more)

Like, the whole appeal of governance as an approach to AI safety is that it's (supposed to be) bottlenecked mainly on execution, not on research.

(I disagree. Indeed, until recently governance people had very few policy asks for government.)

(Also note that lots of "governance" research is ultimately aimed at helping labs improve their own safety. Central example: Structured access.)

5Lucius Bushnaq3mo
Did that change because people finally finished doing enough basic strategy research to know what policies to ask for?  It didn't seem like that to me. Instead, my impression was that it was largely triggered by ChatGPT and GPT4 making the topic more salient, and AI safety feeling more inside the Overton window. So there were suddenly a bunch of government people asking for concrete policy suggestions.

Most don't do policy at all. Many do research. Since you're incredulous, here are some examples of great AI governance research (which don't synergize much with talking to policymakers):

I mean, those are all decent projects, but I would call zero of them "great". Like, the whole appeal of governance as an approach to AI safety is that it's (supposed to be) bottlenecked mainly on execution, not on research. None of the projects you list sound like they're addressing an actual rate-limiting step to useful AI governance.

How did various companies do on the requests? Here is how the UK graded them.

 

CFI reviewers (UK Government)

This wasn't the UK or anything official — it was just some AI safety folks

(I agree that the scores feel inflated, mostly due to insufficiently precise recommendations and scoring on the basis of whether it checks the box or not—e.g. all companies got 2/2 on doing safety research, even though some companies are >100x better at it than others—and also just generous grading.)

I am excited about this. I've also recently been interested in ideas like "nudge researchers to write 1-5 page research agendas, then collect them and advertise the collection."

Possible formats:

  • A huge google doc (maybe based on this post); anyone can comment; there's one or more maintainers; maintainers approve ~all suggestions by researchers about their own research topics and consider suggestions by random people.
  • A directory of google docs on particular agendas; the individual google docs are each owned by a relevant researcher, who is responsible for main
... (read more)
1Iknownothing3mo
Hi, we've already made a site which does this!
3Roman Leventov3mo
I've earlier suggested a principled taxonomy of AI safety work with two dimensions:
1. System level:
  • monolithic AI system
  • human--AI pair
  • AI group/org: CoEm, debate systems
  • large-scale hybrid (humans and AIs) society and economy
  • AI lab, not to be confused with an "AI org" above: an AI lab is an org composed of humans and increasingly of AIs that creates advanced AI systems. See Hendrycks et al.'s discussion of organisational risks.
2. Methodological time:
  • design time: basic research, math, science of agency (cognition, DL, games, cooperation, organisations), algorithms
  • manufacturing/training time: RLHF, curriculums, mech interp, ontology/representations engineering, evals, training-time probes and anomaly detection
  • deployment/operations time: architecture to prevent LLM misuse or jailbreaking, monitoring, weights security
  • evolutionary time: economic and societal incentives, effects of AI on society and psychology, governance.
So this taxonomy is a 5x4 matrix, almost all slots of which are interesting, and some of them are severely under-explored.

There's also a much harder and less impartial option, which is to have an extremely opinionated survey that basically picks one lens to view the entire field and then describes all agendas with respect to that lens in terms of which particular cruxes/assumptions each agenda runs with. This would necessarily require the authors of the survey to deeply understand all the agendas they're covering, and inevitably some agendas will receive much more coverage than other agendas. 

This makes it much harder than just stapling together a bunch of people's descr... (read more)

4habryka3mo
LessWrong does have a relatively fully featured wiki system. Not sure how good of a fit it is, but like, everyone can create tags and edit them and there are edit histories and comment sections for tags and so on.  We've been considering adding the ability for people to also add generic wiki pages, though how to make them visible and allocate attention to them has been a bit unclear.

It's "a unified methodology" but I claim it has two very different uses: (1) determining whether a model is safe (in general or within particular scaffolding) and (2) directly making deployment safer. Or (1) model evals and (2) inference-time safety techniques.

6ryan_greenblatt3mo
(Agreed except that "inference-time safety techniques" feels overly limiting. It's more like purely behavioral (black-box) safety techniques where we can evaluate training by converting it to validation. Then, we imagine we get the worst model that isn't discriminated by our validation set and other measurements. I hope this isn't too incomprehensible, but don't worry if it is; this point isn't that important.)

Thanks!

I think there's another agenda: make untrusted models safe but useful by putting them in a scaffolding/bureaucracy—of filters, classifiers, LMs, humans, etc.—such that at inference time, takeover attempts are less likely to succeed and more likely to be caught. See Untrusted smart models and trusted dumb models (Shlegeris 2023). Other relevant work:

... (read more)
3Seth Herd3mo
This is an excellent description of my primary work, for example Internal independent review for language model agent alignment. That post proposes calling new instances or different models to review plans and internal dialogue for alignment, but it includes discussion of the several layers of safety scaffolding that have been proposed elsewhere. This post is amazingly useful. Integrative/overview work is often thankless, but I think it's invaluable for understanding where effort is going, and thinking about gaps where more of it should be devoted. So thank you, thank you!
4ryan_greenblatt3mo
Explicitly noting for the record we have some forthcoming work on AI control which should be out relatively soon. (I work at RR)
1technicalities3mo
I like this. It's like a structural version of control evaluations. Will think about where to put it in.

Update: Greg Brockman quit.

Update: Sam and Greg say:

Sam and I are shocked and saddened by what the board did today.

Let us first say thank you to all the incredible people who we have worked with at OpenAI, our customers, our investors, and all of those who have been reaching out.

We too are still trying to figure out exactly what happened. Here is what we know:

- Last night, Sam got a text from Ilya asking to talk at noon Friday. Sam joined a Google Meet and the whole board, except Greg, was there. Ilya told Sam he was being fired and that the news was going

... (read more)

Perhaps worth noting: one of the three resignations, Aleksander Madry, was head of the preparedness team, which is responsible for preventing risks from AI such as self-replication.

Has anyone collected their public statements on various AI x-risk topics anywhere?

A bit, not shareable.

Helen is an AI safety person. Tasha is on the Effective Ventures board. Ilya leads superalignment. Adam signed the CAIS statement.

Also, D'Angelo is on the board of Asana, Moskovitz's company (Moskovitz, who funds Open Phil).

For completeness: in addition to Adam D'Angelo, Ilya Sutskever and Mira Murati also signed the CAIS statement.

Thanks!

automating the world economy will take longer

I'm curious what fraction-of-2023-tasks-automatable and maybe fraction-of-world-economy-automated you think will occur at e.g. overpower time, and the median year for that. (I sometimes notice people assuming 99%-automatability occurs before all the humans are dead, without realizing they're assuming anything.)

Distinguishing:

(a) 99% remotable 2023 tasks automatable (the thing we forecast in the OP)
(b) 99% 2023 tasks automatable
(c) 99% 2023 tasks automated
(d) Overpower ability

My best guess at the ordering is a->d->b->c.

Rationale: Overpower ability probably requires something like a fully functioning general purpose agent capable of doing hardcore novel R&D. So, (a). However, it probably doesn't require sophisticated robots, of the sort you'd need to actually automate all 2023 tasks. It certainly doesn't require actually having replaced all... (read more)

@Daniel Kokotajlo it looks like you expect 1000x-energy 4 years after 99%-automation. I thought we get fast takeoff, all humans die, and 99% automation at around the same time (but probably in that order) and then get massive improvements in technology and massive increases in energy use soon thereafter. What takes 4 years?

(I don't think the part after fast takeoff or all humans dying is decision-relevant, but maybe resolving my confusion about this part of your model would help illuminate other confusions too.)

Good catch. Let me try to reconstruct my reasoning:

  • I was probably somewhat biased towards a longer gap because I knew I'd be discussing with Ege who is very skeptical (I think?) that even a million superintelligences in control of the entire human society could whip it into shape fast enough to grow 1000x in less than a decade. So I probably was biased towards 'conservatism.' (in scare quotes because the direction that is conservative vs. generous is determined by what other people think, not by the evidence and facts of the case)
  • As Habryka says, I think t
... (read more)

I think one component is that the prediction is for when 99% of jobs are automatable, not when they are automated (Daniel probably has more to say here, but this one clarification seems important).

So why doesn't one of those thousand people run for president and win? (This is a rhetorical question, I know the answer)

The answer is that there's a coordination problem.

It occurs to me that maybe these things are related. Maybe in a world of monarchies where the dynasty of so-and-so has ruled for generations, supporting someone with zero royal blood is like supporting a third-party candidate in the USA.

Wait, what is it that gave monarchic dynasties momentum, in your view?

In the future, sharing weights will enable misuse. For now, the main effect of sharing weights is boosting research (both capabilities and safety; e.g. the Llama releases definitely did this). The sign of that research-boosting currently seems negative to me, but there's lots of reasonable disagreement.

@peterbarnett and I quickly looked at summaries for ~20 papers citing Llama 2, and we thought ~8 were neither advantaged nor disadvantaged for capabilities over safety, ~7 were better for safety than capabilities, and ~5 were better for capabilities than safety. For me, this was a small update towards the effects of Llama 2 so far having been positive.

fwiw my guess is that OP didn't ask its grantees to do open-source LLM biorisk work at all; I think its research grantees generally have lots of freedom.

(I've worked for an OP-funded research org for 1.5 years. I don't think I've ever heard of OP asking us to work on anything specific, nor of us working on something because we thought OP would like it. Sometimes we receive restricted, project-specific grants, but I think those projects were initiated by us. Oh, one exception: Holden's standards-case-studies project.)

Also note that Open Phil has funded the Future of Humanity Institute, the organization that houses the author of the paper 1a3orn cited for the claim that knowledge is not the main blocker for creating dangerous biological threats. My guess is that the dynamic 1a3orn describes is more about what things look juicy to the AI safety community, and less about funders specifically.

Interesting. If you're up for skimming a couple more EA-associated AI-bio reports, I'd be curious about your quick take on the RAND report and the CLTR report.

https://managing-ai-risks.com said "we call on major tech companies and public funders to allocate at least one-third of their AI R&D budget to ensuring safety and ethical use"

2ryan_b4mo
Woo! That's two in the span of one week!
9peterbarnett4mo
Mozilla, Oct 2023: Joint Statement on AI Safety and Openness (pro-openness, anti-regulation)  

The actual statement: Prominent AI Scientists from China and the West Propose Joint Strategy to Mitigate Risks from AI:

Coordinated global action on AI safety research and governance is critical to prevent uncontrolled frontier AI development from posing unacceptable risks to humanity.

Global action, cooperation, and capacity building are key to managing risk from AI and enabling humanity to share in its benefits. AI safety is a global public good that should be supported by public and private investment, with advances in safety shared widely. Governments ar

... (read more)

Nice.

You’ll need to evaluate more than just foundation models

Not sure what this is gesturing at—you need to evaluate other kinds of models, or whole labs, or foundation-models-plus-finetuning-and-scaffolding, or something else.

(I think "model evals" means "model+finetuning+scaffolding evals," at least to the AI safety community + Anthropic.)

This was the press release; the actual order has now been published.

One safety-relevant part:

4.2.  Ensuring Safe and Reliable AI.  (a)  Within 90 days of the date of this order, to ensure and verify the continuous availability of safe, reliable, and effective AI in accordance with the Defense Production Act, as amended, 50 U.S.C. 4501 et seq., including for the national defense and the protection of critical infrastructure, the Secretary of Commerce shall require:

          (i)   Companies developing or

... (read more)
2Charbel-Raphaël4mo
Is there a definition of "dual-use foundation model" anywhere in the text?
6Vladimir_Nesov4mo
This requires reporting of plans for training and deployment, as well as ownership and security of weights, for any model with training compute over 10^26 FLOPs. Might be enough of a talking point with corporate leadership to stave off things like hypothetical irreversible proliferation of a GPT-4.5 scale open weight LLaMA 4.
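For rough intuition about where that reporting threshold bites, here's a back-of-the-envelope check (my own illustration, not from the order) using the common approximation that dense-transformer training costs about 6 × parameters × training tokens FLOPs; the second run's numbers are illustrative, not a claim about any actual model.

```python
# Back-of-the-envelope check against the executive order's 10^26 FLOP reporting threshold.
# Uses the standard approximation: training FLOPs ~ 6 * parameters * training tokens.
THRESHOLD_FLOPS = 1e26

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

runs = [
    ("Llama-2-70B-ish (70B params, 2T tokens)", 70e9, 2e12),
    ("illustrative frontier-scale run (1.8T params, 13T tokens)", 1.8e12, 13e12),
]
for name, params, tokens in runs:
    flops = training_flops(params, tokens)
    print(f"{name}: {flops:.1e} FLOPs, reportable: {flops > THRESHOLD_FLOPS}")
```

On these assumptions, current ~70B-parameter open-weight runs sit a couple of orders of magnitude below the threshold, while a frontier-scale run crosses it.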

Thanks.

  1. Shrug. You can get it ~arbitrarily high by using more sensitive tests. Of course the probability for Anthropic is less than 1, and I totally agree it's not "high enough to write off this point." I just feel like this is an engineering problem, not a flawed "core assumption."

[Busy now but I hope to reply to the rest later.]

there are clearly some training setups that seem more dangerous than other training setups . . . . 

Like, as an example, my guess is systems where a substantial chunk of the compute was spent on training with reinforcement learning in environments that reward long-term planning and agentic resource acquisition (e.g. many video games or diplomacy or various simulations with long-term objectives) sure seem more dangerous. 

Any recommended reading on which training setups are safer? If none exist, someone should really write this up.

This is great. Some quotes I want to come back to:

a thing that I like that both the Anthropic RSP and the ARC Evals RSP post point to is basically a series of well-operationalized conditional commitments. One way an RSP could be is to basically be a contract between AI labs and the public that concretely specifies "when X happens, then we commit to do Y", where X is some capability threshold and Y is some pause commitment, with maybe some end condition.

 

instead of an RSP I would much prefer a bunch of frank interviews with Dario and Daniella where som

... (read more)

I mostly disagree with your criticisms.

  1. On the other hand: before dangerous capability X appears, labs specifically testing for signs-of-X should be able to notice those signs. (I'm pretty optimistic about detecting dangerous capabilities; I'm more worried about how labs will get reason-to-believe their models are still safe after those models have dangerous capabilities.)
  2. There's a good solution: build safety buffers into your model evals. See https://www-files.anthropic.com/production/files/responsible-scaling-policy-1.0.pdf#page=11. "the RSP is unclear on
... (read more)
4Vaniver4mo
1. What's the probability associated with that "should"? The higher it is the less of a concern this point is, but I don't think it's high enough to write off this point. (Separately, agreed that in order for danger warnings to be useful, they also have to be good at evaluating the impact of mitigations unless they're used to halt work entirely.)
2. I don't think safety buffers are a good solution; I think they're helpful but there will still always be a transition point between ASL-2 models and ASL-3 models, and I think it's safer to have that transition in an ASL-3 lab than an ASL-2 lab. Realistically, I think we're going to end up in a situation where, for example, Anthropic researchers put a 10% chance on the next 4x scaling leading to evals declaring a model ASL-3, and it's not obvious what decision they will (or should) make in that case. Is 10% low enough to proceed, and what are the costs of being 'early'?
3. The relevant section of the RSP: I think it's sensible to reduce models to ASL-2 if defenses against the threat become available (in the same way that it makes sense to demote pathogens from BSL-4 to BSL-3 once treatments become available), but I'm concerned about the "dangerous information becomes more widely available" clause. Suppose you currently can't get slaughterbot schematics off Google; if those become available, I am not sure it then becomes ok for models to provide users with slaughterbot schematics. (Specifically, I don't want companies that make models which are 'safe' except they leak dangerous information X to have an incentive to cause dangerous information X to become available thru other means.) [There's a related, slightly more subtle point here; supposing you can currently get instructions on how to make a pipe bomb on Google, it can actually reduce security for Claude to explain to users how to make pipe bombs if Google is recording those searches and supplying information to law enforcement / the high-ranked sites on Google s

Cool.

I was surprised to see that OpenAI's version of an RSP will be developed by this somewhat focused team. I think the central part of an RSP is big commitments about scaling and deployment decisions as a function of risk-assessment-results — an RSP might entail that a lab basically stops scaling for months or years. I hope this team is empowered to create a strong RSP.

Something like: it'll lead you to make worse predictions?

Scary possible AI influence ops include things like making friends on Discord via text chats. I predict that if your predictions about influence-ops concerns are only about political propaganda, you'll make worse predictions.

4tailcalled4mo
Making friends on Discord and then using those friendships to disseminate propaganda, no? Or to test the effectiveness of propaganda, or various other things. It's still centrally mediated by the propaganda. Like I agree that there are other potential AI dangers involving AIs making friends on Discord than just propaganda, but that doesn't seem to be what "influence ops" are about? And there are other actors than political actors who could do it (e.g. companies could), but he seems to be focusing on geopolitical enemies rather than those actors. Maybe he has concerns beyond this, but he doesn't seem to emphasize them much?

I checked; it's definitely directionally true, but "enemies will use it to generate propaganda" is a bad summary of the legitimate concern about influence operations.

2tailcalled4mo
Bad summary by what criterion?

Mostly agree, but PAI's new guidance (released yesterday) includes some real safety stuff for frontier models — model evals for dangerous capabilities, staged release before sharing weights.

I was really excited about the Frontier Model Forum. This update seems... lacking; I was expecting more on commitments and best practices.

I'm not familiar with the executive director, Chris Meserole.

The associated red-teaming post seems ~worthless; it fails to establish anything like best practices or commitments. Most of the post is the four labs saying how they've done red-teaming. To pick on Microsoft, since its deployment of Bing Chat was the failure most obviously related to inadequate red-teaming: Microsoft fails to acknowledge this or discuss what they p... (read more)
