A Ray

Alex Gray (née Alex Ray); much of my work is under that name. I'm interested in language model alignment, and especially techniques for getting models to reason out loud.

Alex Ray's Shortform

AGI will probably be deployed by a Moral Maze

Moral Mazes is my favorite management book ever, because instead of "how to be a good manager" it's about "empirical observations of large-scale organizational dynamics involving management".

I wish someone would write an updated version -- a lot has changed (though a lot has stayed the same) since the research for the book was done in the early 1980s.

My take (and the author's take) is that any company of nontrivial size begins to take on the characteristics of a moral maze.  It seems to be a pretty good null hypothesis -- any company saying "we aren't/won't become a moral maze" faces a pretty huge evidential burden.

I keep this point in mind when thinking about strategy for the time when it comes to make deployment decisions about AGI.  These decisions are going to be made within the context of a moral maze.

To me, this means that some strategies ("everyone in the company has a thorough and complete understanding of AGI risks") will almost certainly fail.  I think only strategies that work well inside of moral mazes will work at all.

To sum up my takes here:

  • basically every company eventually becomes a moral maze
  • AGI deployment decisions will be made in the context of a moral maze
  • understanding moral maze dynamics is important to AGI deployment strategy
AI Training Should Allow Opt-Out

(Caveat: I ran the first big code scrape and worked on the code-generating models which later became Codex.)

My one line response: I think opt-out is obviously useful and good and should happen.

AFAIK there are various orgs/bodies working on this, but I'm blanking on which/where.  (In particular there's a FOSS mailing list that's been discussing how ML training relates to FOSS license rights that seems relevant.)

Opt-out strings exist today, in an insufficient form.  The best-known and most-respected one is probably the BIG-bench canary string: https://github.com/google/BIG-bench/blob/main/docs/doc.md -- but this is intended only to protect data used for evaluating text models.
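A minimal sketch of how a scrape pipeline might honor such a string (the canary constant below is a placeholder, not the real BIG-bench GUID, which is published in their docs):

```python
# Sketch: drop documents carrying a known opt-out/canary marker before
# they enter a training corpus.  CANARY is a stand-in value; the real
# BIG-bench canary is a unique GUID string published in the repo docs.
CANARY = "CANARY-GUID-PLACEHOLDER"

def filter_canaried(documents):
    """Yield only documents that do not contain the canary string."""
    for doc in documents:
        if CANARY not in doc:
            yield doc

docs = ["plain web text", "eval data " + CANARY + " keep out of training"]
kept = list(filter_canaried(docs))
assert kept == ["plain web text"]
```

A substring check like this is deliberately crude; it protects evaluation data only if scrapers actually run the filter, which is part of why opt-out needs more than a technical convention.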

Mimicking the structure to comment on each point:


I think simplicity is a point in favor of cheapness, but not (directly) a point in favor of why something "should be done".  I see this as "technical costs to implement are low", and agree.


I think this is also a point in favor of cheapness, but again not why it "should be done".  I see this as "expected reduction in ML performance is small", and agree.


I think this makes the point that we currently don't have a settled understanding of what the ethics of the various options are here.  People being upset at the state of things is pretty strong evidence that it's not settled, but seems to be less strong evidence that it's unethical.  I can't tell whether the point you're trying to make here is that "we should figure out the ethics of opt-out" (which I agree with) or that "opt-out is ethically required" (which I don't think you've sufficiently supported here for me to agree with).


I see this as making the point "opt-out would (very minorly) reduce AI risk".  I think this is both well supported by the arguments and technically valid.  I'm personally skeptical about the amount of protection this gets us, and am mostly optimistic in applying it to non-software domains (e.g. nanotech, gain of function, virology, etc).

A personal technical prediction I can add: I think that in the software domain, it will be inexpensive for a capable system to compose any non-allowed concepts out of allowed concepts.  I think this is non-obvious to traditional ML experts.  In traditional ML, removing a domain from the dataset usually robustly removes it from the model -- but things like the large-scale generative models mentioned at the top of the post have generalized very well across domains.  (They're still not very capable in-domain, but are similarly not-incapable in domains that didn't exist in training.)  I think this "optimism about generalization" is the root of a bunch of my skepticism about domain-restriction/data-censoring as a method of restricting model capabilities.


I think the robots.txt example is great and basically this is the one that is most directly applicable.  (Other precedents exist but IMO none are as good.)  I totally agree with this precedent.

Separately, there's a lot of precedent for people circumventing or ignoring these -- and I think it's important to look at those precedents, too!

Risk Compensation

This is an interesting point.  I personally don't weigh this highly, and feel like a lot of my intuition here is attached to gut-level stuff.

As far as I know, the literature on risk compensation is almost entirely about things that are a direct personal risk to someone.  I don't know of any cases of risk compensation where the risk was indirect or otherwise largely separated from the person.  (At some point of indirectness this seems to reduce more to a "principal-agent problem" than a risk-compensation problem.)

What's Missing

I think it's easy to focus on the technical implementation costs and less on the "what happens next" costs.  Figuring out the legal status of this opt-out (and possibly pushing for legislation to change this) is difficult and expensive.  Figuring out standards for evaluation will be similarly hard, especially as the tech itself changes rapidly.

Personal Conclusion

I think opt-out is obviously good and useful and should be done.  I think it's a pretty clear positive direction for ML/AI policy and regulatory development -- and also I'm optimistic that this is the sort of thing that will happen largely on its own (i.e. no drastic action is required).

Alex Ray's Shortform

Sometimes I get asked by intelligent people I trust in other fields, "what's up with AI x-risk?" -- and I think at least part of it unpacks to this: Why don't more people believe in / take seriously AI x-risk?

I think that is actually a pretty reasonable question.  I think two follow-ups are worthwhile and I don't know of good citations / don't know if they exist:

  1. a sociological/anthropological/psychological/etc study of what's going on in people who are familiar with the ideas/reasonings of AI x-risk, but decide not to take it seriously / don't believe it.  I expect in-depth interviews would be great here.
  2. we should probably just write up as many obvious things ourselves up front.

The latter one I can take a stab at here.  Taking the perspective of someone who might be interviewed for the former:

  • historically, ignoring anyone that says "the end of the world is near" has been a great heuristic
  • very little of the public intellectual sphere engages with the topic
  • the part of the public intellectual sphere that does engage is disproportionately meme lords
  • most of the writings about this are exceptionally confusing and jargon-laden
  • there's no college courses on this / it doesn't have the trappings of a legitimate field
  • it feels a bit like a Pascal's mugging -- at the very least I'm not really prepared to try to think about actions/events with near-infinite consequences
  • people have been similarly doom-y about other technologies and so far the world turned out fine
  • we have other existential catastrophes looming (climate change, etc) that are already well understood and scientifically supported, so our efforts are better put on that than this confusing hodge-podge
  • this field doesn't seem very diverse and seems to be a bit monocultural
  • this field doesn't seem to have a deep/thorough understanding of all of the ways technology is affecting people's lives negatively today
  • it seems weird to care about future people when there are present people suffering
  • I see a lot of public disagreement about whether or not AGI is even real, which makes the risk arguments feel much less trustworthy to me

I think I'm going to stop for now, but I wish there were a nice high-quality organization of these.  At the very least, having the steel versions of them around seems good, in part as an "epistemic hygiene" thing.

A descriptive, not prescriptive, overview of current AI Alignment Research

Thanks so much for making this!

I'm hopeful this sort of dataset will grow over time as new sources come about.

In particular, I'd nominate adding MLSN (https://www.alignmentforum.org/posts/R39tGLeETfCZJ4FoE/mlsn-4-many-new-interpretability-papers-virtual-logit) to the list of newsletters in the future.

[Linkpost] A Chinese AI optimized for killing

This seems like an overly alarmist take on what is a pretty old trend of research.  Six years ago there were a number of universities working on similar models for the VizDoom competition (IIRC it was won by Intel and Facebook).  It seems good to track this kind of research, but IMO the conclusions here are not supported at all by the evidence presented.

A Small Negative Result on Debate

Do you have suggestions for domains where you do expect one-turn debate to work well, now that you've got these results?

We Are Conjecture, A New Alignment Research Startup

Congratulations!  Can you say if there will be a board, and if so who will start on it?

Alex Ray's Shortform

Longtermist X-Risk Cases for working in Semiconductor Manufacturing

Two separate pitches for jobs/roles in semiconductor manufacturing for people who are primarily interested in x-risk reduction.

Securing Semiconductor Supply Chains

This is basically the "computer security for x-risk reduction" argument applied to semiconductor manufacturing.

Briefly restating: it seems exceedingly likely that technologies crucial to x-risks are on computers or connected to computers.  Improving computer security increases the likelihood that those machines are not stolen or controlled by criminals.  In general, this should make things like governance and control strategy more straightforward.

This argument also applies to making sure that there isn't any tampering with the semiconductor supply chain.  In particular, we want to make sure that the designs from the designer are not modified in ways that make it easier for outside actors to steal or control information or technology.

One of the primary complaints about working in semiconductor manufacturing for longtermist reasons is that it accelerates semiconductor progress.  I think security work here is not nearly as direct a driver of progress as other roles, so I would argue it is differentially x-risk reducing.

Diversifying Semiconductor Manufacturing

This one is more controversial in mainline longtermist x-risk reduction, so I'll try to clearly signpost the hypotheses that this is based on.

The reasoning is basically:

  • Right now, most prosaic AI alignment techniques require access to a lot of compute
  • It's possible that some prosaic AI alignment techniques (like interpretability) will require much more compute in the future
  • So, right now AI alignment research is at least partially gated on access to compute, and it seems plausible this will be the case in the future

So if we want to ensure these research efforts continue to have access to compute, we basically need to make sure they have enough money to buy the compute, and that there is compute to be sold.

Normally this wouldn't be much of an issue, as in general we can trust markets to meet demands, etc.  However semiconductor manufacturing is increasingly becoming a part of international conflict strategy.

In particular, much of the compute acceleration used in AI research (including AI alignment research) is manufactured in Taiwan, which seems to be coming under increasing threats.

My argument here is that I think it is possible to increase the chances that AI alignment research labs will continue to have access to compute, even in cases of large-scale geopolitical conflict.  I think this can be done in ways that don't dramatically increase global semiconductor manufacturing capacity.

Alex Ray's Shortform

I think your explanation of legibility here is basically what I have in mind, except that if it's human-designed it's potentially not all-encompassing.  (For example, a world model that knows very little, but knows how to search for information in a library.)

I think interpretability is usually a bit more narrow, and refers to developing an understanding of an illegible system.  My take is that it is not "interpretability" to understand a legible system, but maybe I'm using the term differently than others here.  This is why I don't think "interpretability" applies to systems that are designed to be always-legible.  (In the second graph, "interpretability" is any research that moves us upwards)

I agree that the ability to come up with ideas that are totally alien and untranslatable to humans gives AGI a capabilities boost.  I do think that requiring a system to use only legible cognition and reasoning is a big "alignment tax".  However I don't think that this tax amounts to a strong proof that legible AGI is impossible.

I think my central point of disagreement with this comment is that I do think that it's possible to have compact world models (or at least compact enough to matter).  I think if there was a strong proof that it was not possible to have a generally intelligent agent with a compact world model (or a compact function which is able to estimate and approximate a world model), that would be an update for me.

(For the record, I think of myself as a generally intelligent agent with a compact world model)

Alex Ray's Shortform

Two Graphs for why Agent Foundations is Important (according to me)

Epistemic Signpost: These are high-level abstract reasons, and I don’t go into precise detail or gears-level models.  The lack of rigor is why I’m short form-ing this.

First Graph: Agent Foundations as Aligned P2B Fixpoint

P2B (a recursive acronym for Plan to P2B Better) is a framing of agency as a recursively self-reinforcing process.  It resembles an abstracted version of recursive self improvement, which also incorporates recursive empowering and recursive resource gathering.  Since it’s an improvement operator we can imagine stepping, I’m going to draw an analogy to gradient descent.

Imagine a highly dimensional agency landscape.  In this landscape, agents follow the P2B gradient in order to improve.  This can be convergent such that two slightly different agents near each other might end up at the same point in agency space after some number of P2B updates.

Most recursive processes like these have fixed point attractors — in our gradient landscape these are local minima.  For P2B these are stable points of convergence.

Instead of thinking just about the fixed point attractor, let's think about the parts of agency space that flow into a given fixed point attractor.  This is like analyzing watersheds on hilly terrain — which parts of the agency space flow into which attractors.

Now we can have our graph: it’s a cartoon of the “agency landscape” with different hills/valleys flowing into different local minima, colored by which local minimum they flow into.

Here we have a lot of different attractors in agency space, but almost all of them are unaligned, what we need to do is get the tiny aligned attractor in the corner.

However it’s basically impossible to initialize an AI at one of these attractors; the best we can do is make an agent and try to understand where in agency space it will start.  Building an AGI is imprecisely placing a ball on this landscape, which will then roll along the P2B gradient towards its P2B attractor.

How does this relate to Agent Foundations?  I see Agent Foundations as a research agenda to write up the criterion for characterizing the basin in agent space which corresponds to the aligned attractor.  With this criterion, we can try to design and build an agent, such that when it P2Bs, it does so in a way that is towards an Aligned end.
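The basin/watershed picture above can be made concrete with a toy one-dimensional stand-in for the agency landscape (a made-up function, assuming nothing about real agency space): two nearby starting points descend the same landscape but flow to different attractors.

```python
# Toy illustration of basins of attraction: gradient descent on the
# 1-D "landscape" f(x) = (x^2 - 1)^2, which has minima at x = -1 and
# x = +1.  Two starts on opposite sides of the ridge at x = 0 end up
# at different attractors, even though they begin close together.
def grad(x):
    # d/dx (x^2 - 1)^2 = 4x(x^2 - 1)
    return 4 * x * (x * x - 1)

def descend(x, lr=0.01, steps=5000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = descend(-0.1)   # starts in the basin of x = -1
right = descend(+0.1)  # starts in the basin of x = +1

assert abs(left - (-1.0)) < 1e-3
assert abs(right - 1.0) < 1e-3
```

In this analogy, Agent Foundations is trying to characterize the boundary of the basin (here, the ridge at x = 0) well enough that we can deliberately place the initial ball on the aligned side.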

Second: Agent Foundations as designing an always-legible model

ELK (Eliciting Latent Knowledge) formalized a family of alignment problems, eventually narrowing down to the Ontology Mapping Problem.  This problem is about translating between some illegible machine ontology (basically its internal cognition) and our human ontology (concepts and relations that a person can understand).

Instead of thinking of it as a binary, I think we can think of the ontology mapping problem as a legibility spectrum.  On one end of the spectrum we have our entirely illegible bayes net prosaic machine learning system.  On the other end, we have totally legible machines, possibly specified in a formal language with proofs and verification.

As a second axis I’d like to imagine development progress (this can be “how far along” we are, or maybe the capabilities or empowerment of the system).  Now we can show our graph, of different paths through this legibility vs development space.

Some strategies move away from legibility and never intend to get back to it.  I think these plans have us building an aligned system that we don’t understand, and possibly can’t ever understand (because it can evade understanding faster than we can develop understanding).

Many prosaic alignment strategies are about going down in legibility, and then figuring out some mechanism to go back up again in legibility space.  Interpretability, ontology mapping, and other approaches fit in this frame.  To me, this seems better than the previous set, but I'm still skeptical of it.

Finally my favorite set of strategies are ones that start legible and endeavor to never deviate from that legibility.  This is where I think Agent Foundations is in this graph.  I think there’s too little work on how we can build an Aligned AGI which is legible from start-to-finish, and almost all of them seem to have a bunch of overlap with Agent Foundations.

Aside: earlier I included a threshold in legibility space that’s the “alignment threshold” but that doesn’t seem to fit right to me, so I took it out.
