Alex Ray

Meeting people and trying things.
AI Research Engineer


Alex Ray's Shortform

My feeling is that I don't have a strong difference between them.  In general simpler policies are both easier to execute in the moment and also easier for others to simulate.

The clearest version of this is to, when faced with a decision, decide on an existing principle to apply before acting, or else define a new principle and act on this.

Principles are examples of short policies, which are largely path-independent, which are non-narrative, which are easy to execute, and are straightforward to communicate and be simulated by others.

Alex Ray's Shortform

(Note: this might be difficult to follow.  Discussing different ways that different people relate to themselves across time is tricky.  Feel free to ask for clarifications.)


I'm reading the paper Against Narrativity, which is a piece of analytic philosophy that examines Narrativity in a few forms:

  • Psychological Narrativity - the idea that "people see or live or experience their lives as a narrative or story of some sort, or at least as a collection of stories."
  • Ethical Narrativity - the normative thesis that "experiencing or conceiving one's life as a narrative is a good thing; a richly [psychologically] Narrative outlook is essential to a well-lived life, to true or full personhood."

It also names two kinds of self-experience that it takes to be diametrically opposite:

  • Diachronic - considers the self as something that was there in the further past, and will be there in the further future
  • Episodic - does not consider the self as something that was there in the further past and something that will be there in the further future

Wow, these seem pretty confusing.  It sounds a lot like they just disagree on the definition of the word "self".  I think there is more to it than that, some weak evidence being that I discussed this concept at length with a friend (diachronic) who had a very different take on narrativity than I do (episodic).

I'll try to sketch what I think "self" means.  For almost all nontrivial cognition, it seems like intelligent agents have separate concepts of (or the concept of a separation between) the "agent" and the "environment".  In Vervaeke's works this is called the Agent-Arena Relationship.

You might say "my body is my self and the rest is the environment," but is that really how you think of the distinction?  Do you not see the clothes you're currently wearing as part of your "agent"?  Tools come to mind as similar extensions of our self.  If I'm raking leaves for a long time, I start to sense the agent as being the whole "person + rake" system, rather than a person whose environment includes a rake that is being held.

(In general I think there's something interesting here in proto-human history about how tool use interacts with our concept of self, and our ability to quickly adapt to thinking of a tool as part of our 'self' as a critical proto-cognitive-skill.)

Getting back to Diachronic/Episodic:  I think one of the things that's going on in this divide is that this felt sense of "self" extends forwards and backwards in time differently.


I often feel very uncertain in my understanding or prediction of the moral and ethical natures of my decisions and actions.  This probably needs a whole lot more writing on its own, but I'll sum it up as two ideas having a disproportionate effect on me:

  • The veil of ignorance, which is a thought experiment which leads people to favor policies that support populations more broadly (skipping a lot of detail and my thoughts on it for now).
  • The categorical imperative, which I'll reduce here as the principle of universalizability -- a policy for actions given context is moral if it is one you would endorse universalizing (this is huge and complex, and there's a lot of finicky details in how context is defined, etc.  skipping that for now)

Both of these prompt me to take the perspective of someone else, potentially everyone else, in reasoning through my decisions.  I think the way I relate to them is very Non-Narrative/Episodic in nature.

(Separately, as I think more about the development of early cognition, the more the ability to take the perspective of someone else seems like a magical superpower)

I think they are not fundamentally or necessarily Non-Narrative/Episodic -- I can imagine both of them being considered by someone who is Strongly Narrative and even them imagining a world consisting of a mixture of Diachronic/Episodic/etc.


Priors are hard.  Relatedly, choosing between similar explanations of the same evidence is hard.

I really like the concept of the Solomonoff prior, even if the math of it doesn't apply directly here.  Instead I'll take away just this piece of it:

"Prefer explanations/policies that are simpler-to-execute programs"

A program may be simpler if it has fewer inputs, or fewer outputs.  It might be simpler if it requires less memory or less processing.

This works well for choosing policies that are easier to implement or execute, especially as a person with bounded memory/processing/etc.


A simplifying assumption that works very well for dynamic systems is the Markov property.

This property states that all of the information needed to predict the system's future is present in the current state of the system.

One way to look at this is in imagining a bunch of atoms in a moment of time -- all of the information in the system is contained in the current positions and velocities of the atoms.  (We can ignore or forget all of the trajectories that individual atoms took to get to their current locations)

In practice we usually do this to systems where this isn't literally true, but close-enough-for-practical-purposes, and combine it with stuffing some extra stuff into the context for what "present" means.

(For example we might define the "present" state of a natural system to include "the past two days of observations" -- this still has the Markov property, because this information is finite and fixed in size as the system proceeds dynamically into the future)
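
A minimal sketch of that "stuff extra context into the present" trick, in Python.  The update rule here is a toy I made up for illustration (it just continues the most recent trend); the point is only that once the state is defined as the last two observations, everything earlier in the history can be forgotten.

```python
def step(state):
    """Advance one step.

    `state` is the pair (previous value, current value).  The next value
    depends only on this pair, so the augmented state has the Markov property
    even though a single value on its own would not.
    """
    prev, curr = state
    nxt = curr + (curr - prev)  # toy rule (an assumption): continue the trend
    return (curr, nxt)


def rollout(state, n):
    """Return the next n values produced from the given augmented state."""
    values = []
    for _ in range(n):
        state = step(state)
        values.append(state[1])
    return values


# Two different histories that end in the same augmented state.
history_a = [0, 5, 1, 2]
history_b = [9, 9, 1, 2]
state_a = tuple(history_a[-2:])
state_b = tuple(history_b[-2:])

# The futures agree, because the path taken to reach the state is irrelevant.
assert rollout(state_a, 3) == rollout(state_b, 3)
```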


I think that these pieces, when assembled, steer me towards becoming Episodic.

When choosing between policies that have the same actions, I prefer the policies that are simpler. (This feels related to the process of distilling principles.)

When considering good policies, I think I consider strongly those policies that I would endorse many people enacting.  This is aided by these policies being simpler to imagine.

Policies that are not path-dependent (for example, take into account fewer things in a person's past) are simpler, and therefore easier to imagine.

Path-independent policies are more Episodic, in that they don't rely heavily on a person's place in their current Narratives.


I don't know what to do with all of this.

I think one thing that's going on is self-fulfilling -- where I don't strongly experience psychological Narratives, and therefore it's more complex for me to simulate people who do experience this, which via the above mechanism leads to me choosing Episodic policies.

I don't strongly want to recruit everyone to this method of reasoning.  It is an admitted irony of this system that I don't wish for everyone to use the same mechanism of reasoning as me -- maybe let it signal just how uncertain I feel about my whole ability to come to philosophical conclusions on my own.

I expect to write more about this stuff in the near future, including experiments I've been doing in my writing to try to move my experience in the Diachronic direction.  I'd be happy to hear comments for what folks are interested in.


The Case for a Journal of AI Alignment

I think there are a lot of really good responses, which I won't repeat.

I think the traditional model of journals has a lot of issues, not the least of which are bad incentives.

The new model used by eLife is pretty exciting to me, but very different than what you proposed.  I think it's worth considering:

  • only reviewing works that have already been published as preprints (I think LW/AF should count for this, as well as arXiv)
  • publishing reviews -- this lets the rest of the community benefit more from the labor of reviewing, though it does raise the standard for reviewers
  • curating the best / highest reviewed articles to be "published"

The full details of their new system are in an essay they published describing the changes and why they made them.

Why GPT wants to mesa-optimize & how we might change this

Clarifying Q: Does mesa-optimization refer to any inner optimizer, or one that is in particular not aligned with the outer context?

Why GPT wants to mesa-optimize & how we might change this

Epistemic status: I’m not really an expert at NLP.  I’ve only been working on language modeling for ~8mo, which is much less than some of the folks here, and this is based on my experiences.

Beam Search:

Beam search with large unsupervised generatively pretrained transformers (GPTs) is weirder than it appears in the NLP literature.  Other commenters have mentioned degeneracies, but for me the sticking points for beam search were:

  • It tends to quickly fall on a modal response, so it’s already bad for any sort of situation where you want to generate a diversity of samples and choose the best from them
  • It’s hard to correctly score between varying-length segments.  Every paper that uses beam search has some heuristic hack here, which is almost always some parametrized function they pulled from another paper or hacked together.
  • It seems to mostly do best (once tuned) at some narrow/specific distribution (e.g. generating short responses in a chat setting).  It’s hard to get beam search tuned to work well across the full distribution used to train these models (i.e. “text on the internet”)
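
To make the length-scoring point concrete, here is a small Python sketch of three ways to score a beam candidate from its per-token log-probabilities.  The function names and the toy numbers are my own; the third scorer uses a length penalty of the form ((5 + length) / 6) ** alpha, which I believe is similar to the one used in Google's NMT system, with alpha = 0.6 as an assumed value.

```python
def raw_score(token_logprobs):
    """Sum of log-probabilities: systematically favors shorter candidates."""
    return sum(token_logprobs)


def mean_score(token_logprobs):
    """Average log-probability per token: one common normalization."""
    return sum(token_logprobs) / len(token_logprobs)


def length_penalized_score(token_logprobs, alpha=0.6):
    """Sum divided by a length penalty; alpha is a tunable knob (assumed value)."""
    penalty = ((5 + len(token_logprobs)) / 6) ** alpha
    return sum(token_logprobs) / penalty


# Toy candidates: a short one with mediocre tokens, a longer one with better tokens.
short_seq = [-1.0, -1.0]
long_seq = [-0.6] * 5

for scorer in (raw_score, mean_score, length_penalized_score):
    winner = "short" if scorer(short_seq) > scorer(long_seq) else "long"
    print(scorer.__name__, "prefers the", winner, "candidate")
```

On these toy numbers the scorers disagree about which candidate is better, which is the sense in which the choice of heuristic ends up mattering.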

Given these three issues, in my experience it’s been better to just focus on tuning naive sampling, with a few key parameters: temperature, top_p, etc (these are part of the OpenAI API).
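
For readers who haven't seen these knobs, here is a rough pure-Python sketch of what temperature and top_p (nucleus) sampling do to a next-token distribution.  This is my own illustration with hypothetical function names, not the implementation behind any particular API.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature.  Lower temperature sharpens the distribution."""
    scaled = [logit / temperature for logit in logits]
    biggest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token index from the tempered, truncated distribution."""
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Example: a low temperature and a small top_p make sampling nearly greedy.
token = sample_token([0.0, 10.0, 0.0], temperature=0.01, top_p=0.5)
```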

Caveat: it’s possible I’m just bad at tuning beam search.  It’s possible I’m bad at scholarship and missed the “one key paper” that would make it all clear to me.  I would take the above as more of an anecdote than a scientific result.

Separation of training and sampling:

This has been mentioned by other commenters, but it might bear repeating that there is no sampling at all in the training process for GPTs.  They’re trained to approximate next token distributions conditioned on the preceding text, and the default is to weight the loss on the prediction for every token equally.  In practice the loss on later tokens is lower.

All of this is saying that training is a separate process from sampling.  I think there is probably very good research to be done in better sampling, and in particular, I think it is possible to have a machine which aligns sampling from an unaligned model.

Lookahead & pondering:

I think the point about lookahead is still worth considering.  One of the differences between transformers and the previous most-popular architecture for language models (LSTMs) is that transformers use the same amount of compute for every token.  (It’s possible to build them otherwise, but I haven’t seen any of these that I’ve been impressed by yet)

I think my favorite example of this in the literature is Adaptive Computation Time (ACT), where essentially the model learns how to “spend” extra compute on certain characters.

(One of the things going on with ACT is dealing with the non-uniformity of the distribution of information content in character strings — for GPTs this is at least partially ameliorated by the byte-pair encoding)

So I think it is reasonable to train a model to be able to use extra “pondering” time when sampling.  Either by having an external controller that tells the model when to ponder and when to output, or by having the model learn itself how to ponder (which is the “halting neuron” signal in ACT).

I do think that any sort of pondering is subject to mesa-optimization concerns.

Fix 1 - BERT:

Caveat: I haven’t trained BERT models or taken a trained one and tried hard to get high quality samples from it.  This is based on intuitions and hearsay.

Here I’ll use “GPT” to refer to autoregressive next token prediction objectives, to mirror the style of the article.  This objective can of course be used with other architectures in other settings.

Instead of thinking of the “mask-part-out prediction” (BERT) and the “mask future text” (GPT) as two separate tasks, think of them as points in the space of distributions over masks.

In particular, it’s trivial to come up with mask distributions that include both a preponderance of masks which leave small parts out (BERT-like) and masks which leave future tokens out (GPT-like) as well as possibly other mask patterns.

My intuition is that the higher probability you mask out all future tokens, the easier it is to get high quality samples from that model.
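
A small Python sketch of what I mean by a distribution over masks.  Everything here is hypothetical naming on my part: `p_future` is the probability of drawing a GPT-like mask, and the 15% rate for the BERT-like branch is just an assumed default.  A value of True means "this position is hidden from the model".

```python
import random


def bert_like_mask(length, mask_rate, rng):
    """Hide a random scattering of positions, independently at each position."""
    return [rng.random() < mask_rate for _ in range(length)]


def gpt_like_mask(length, rng):
    """Hide everything from a random cut point onward, keeping at least one context token."""
    cut = rng.randrange(1, length)  # requires length >= 2
    return [i >= cut for i in range(length)]


def sample_mask(length, p_future, mask_rate=0.15, rng=random):
    """Draw a mask from a mixture: GPT-like with probability p_future, else BERT-like."""
    if rng.random() < p_future:
        return gpt_like_mask(length, rng)
    return bert_like_mask(length, mask_rate, rng)


# p_future=1.0 recovers a purely GPT-like objective, p_future=0.0 a purely BERT-like one.
rng = random.Random(0)
masks = [sample_mask(8, p_future=0.5, rng=rng) for _ in range(4)]
```

Other mask patterns would just be additional branches in `sample_mask`.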

Fix 1 - Editing Text:

(Same caveat as above regarding inexperience w/ BERT models)

BERT objectives by themselves do not allow efficient text editing, and neither do GPT objectives.

Thinking about the task of composing an edit, the model needs to:

  • Identify the section that will be removed (if any)
  • Figure out the length of the replacement text (if any)
  • Compose the replacement text (if any)
  • Possibly also have some way of attending over the old text, while still knowing to replace it

Neither BERT nor GPT objectives do a great job of this by themselves.  If I had to choose, though, I think you can encode this sort of thing in the GPT dataset and have it autoregressively generate edits.

(This is part of a conjecture I’ve been meaning to write up for lesswrong of “the dataset is the interface” for GPT models)

Fix 2 - Changing the training:

I think there’s some interesting stuff here, but so far this is in the regime of training algorithms that are unexplored, enormously complex, and poorly understood.

The clearest part here is that it uses sampling in the training loop which so far I’ve almost exclusively seen in reinforcement learning (RL).

But, we can probably implement something like this with RL.  In particular, training is a process of selecting a context (masking), sampling from the model to fill in the mask, and scoring based on the objective.

In this case, drawing some analogies to RL:

  • Action - token
  • Action distribution - token distribution (the basic output of a GPT model given an input context)
  • Policy - language model (in particular a GPT model, though with hacks BERT/other models could be used)
  • Reward - objective (log-loss on the true document, for a GPT model)
  • Environment - a document, probably with some starting context already provided

It’s pretty easy to see here that this wouldn’t work well for generating from scratch.  If I provide zero contextual tokens to the model, sample N tokens, and then score it on how close it got to a true (hidden) document, I am going to have a very bad time.

This might be a good approach for fine-tuning a GPT model, which is exactly what some colleagues did.

Even in the fine-tuning case, we have all of the myriad and sundry problems with RL (instability, inefficiency, etc) that our plain-and-simple language modeling objective lacks.

Fix 2 - update away:

I think this probably won’t work just from experience.  I’ve found it very hard to get the model to “reduce your probability on the most likely outcome and increase your probability on the next most likely outcome” — instead objectives like this tend to just increase the temperature of everything (or worse, it puts all of the increase in entropy in the long tail of bad answers).

It’s possible there is a good way to do this, but for now I don’t know of a good way to get a model to increase the probability of “secondary options” without just degenerating into increasing entropy.

Fix 2 - track updates:

If I understand this correctly, I think this is easily approximated by having an objective/loss/reward term which penalizes differences from the original model.  For small deltas I think this is a good approach, though unfortunately it is only as good as the original model you’re comparing it to.
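
As a sketch of that kind of penalty term, here is one way it could look in Python for a single next-token distribution.  The choice of KL divergence, its direction (new model relative to the original), and the weight `beta` are all assumptions on my part; other distance measures would fit the same slot.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens.

    Assumes q is nonzero wherever p is nonzero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def penalized_loss(task_loss, new_probs, original_probs, beta=0.1):
    """Task loss plus a penalty for drifting away from the original model."""
    return task_loss + beta * kl_divergence(new_probs, original_probs)


original = [0.5, 0.5]
unchanged = [0.5, 0.5]
drifted = [0.9, 0.1]

# No drift means no penalty; drifting away from the original costs extra.
assert penalized_loss(1.0, unchanged, original) == 1.0
assert penalized_loss(1.0, drifted, original) > 1.0
```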

As far as the specific proposal for managing updates towards/away from beam search updates, that seems also possible via a similar mechanism — penalize distributional difference from those samples.

I think we haven’t really explored these sorts of penalties enough, and in particular how they interact when combined with other objectives.

Fix 2 - will it stunt:

I think that any objective that scores better predictions higher will incentivize some sort of lookahead/pondering.

If you prevent it from being coincident with the beam search distribution, then I expect the model will learn how to do lookahead/pondering in the null space of beam search.

Will these solve mesa-optimization:

This isn’t clear to me, but I think it’s worth studying.

In particular, it would be good to figure out some way of contriving a mesa-optimization setup, such that we could measure if these fixes would prevent it or not.

Beam Search in the API:

I think my above comments about Beam Search apply here.

Beam search, like any optimization algorithm, is hugely dependent on its scoring function.  If you score on likelihood, you’ll end up with high-likelihood (“unsurprising”) text.

Future thoughts - sampling research:

I think in general we’re in a weirdly asymmetric world, where we have a huge amount of compute and effort into computing auto-regressive next token distributions, and comparatively very little sophistication in sampling from them.

This comment is probably too long already for me to expand too much on this, but in particular, I think the log-likelihood objective is default unaligned (as most datasets are default unaligned) but I think we can find ways of sampling from log-likelihood optimized models in ways that are aligned.

Final Version Perfected: An Underused Execution Algorithm


It seems like the prerequisite assumptions are likely to be violated sometimes (in general most assumptions aren't total rules).

My question is about the rate of violations to this prerequisite assumption.

A few ways to cut at it (feel free to answer just one or none of them):

  • When going through a list subsequent times, how often do you notice/feel internally that your views on a past item have shifted?
  • How often do you make a new list and start the process anew, even though you have an existing list that could be continued on?
  • How often do you go back and erase or modify marks on a list while using this process?

I think I find my internal experience (and relation to stuff on my to-do list) changes pretty significantly over the course of a day.

Final Version Perfected: An Underused Execution Algorithm

This is my favorite kind of lesswrong post -- a quick rationality technique that I can immediately go try and report back on.

I was able to prototype it quickly in my notes list by using a dedicated symbol as the marker.  It looks like any weird/unused symbol could be used as this.  Seems like a quick hack to work with any digital list (I used §).

Question about non-stationarity: How often is the "stable" prerequisite violated in practice?

E.g. if a bunch of items are physically exhausting, and a bunch are not, I might want to not do physically exhausting items in sequence.  I didn't run into this personally in my tiny trial, so at least the answer isn't "all the time".

Alex Ray's Shortform

1. What am I missing from church?

(Or, in general, by lacking a religious/spiritual practice I share with others)

For the past few months I've been thinking about this question.

I haven't regularly attended church in over ten years.  Given how prevalent it is as part of human existence, and how much I have changed in a decade, it seems like "trying it out" or experimenting is at least somewhat warranted.

I predict that there is a church in my city that is culturally compatible with me.

Compatible means a lot of things, but mostly means that I'm better off with them than without them, and they're better off with me than without me.

Unpacking that probably will get into a bunch of specifics about beliefs, epistemics, and related topics -- which seem pretty germane to rationality.

2. John Vervaeke's Awakening from the Meaning Crisis is bizarrely excellent.

I don't have handles for exactly everything it is, or exactly why I like it so much, but I'll try to do it some justice.

It feels like rationality / cognitive tech, in that it cuts at the root of how we think and how we think about how we think.

(I'm less than 20% through the series, but I expect it continues in the way it has been going.)

Maybe it's partially his speaking style, and partially the topics and discussion, but it reminded me strongly of sermons from childhood.

In particular: they have a timeless quality to them.  By "timeless" I mean I think I would take away different learnings from them if I saw them at different points in my life.

In my work & research (and communicating this) -- I've largely strived to be clear and concise.  Designing for layered meaning seems antithetical to clarity.

However I think this "timelessness" is a missing nutrient to me, and has me interested in seeking it out elsewhere.

For the time being I at least have a bunch more lectures in the series to go!

Alex Ray's Shortform

I don't know if he used that phrasing, but he's definitely talked about the risks (and advantages) posed by singletons.

Alex Ray's Shortform

Thinking more about the singleton risk / global stable totalitarian government risk from Bostrom's Superintelligence, human factors, and theory of the firm.

Human factors represent human capacities or limits that are unlikely to change in the short term.  For example, the number of people one can "know" (for some definition of that term), limits to long-term and working memory, etc.

Theory of the firm tries to answer "why are economies markets but businesses autocracies" and related questions.  I'm interested in the subquestion of "what factors govern the upper bound on coordination for a single business", related to "how big can a business be".

I think this is related to "how big can an autocracy (robustly/stably) be", which is how it relates to the singleton risk.

Some thoughts this produces for me:

  • Communication and coordination technology (telephones, email, etc) that increase the upper bounds of coordination for businesses ALSO increase the upper bound on coordination for autocracies/singletons
  • My belief is that the current max size (in people) of a singleton is much lower than current global population
  • This weakly suggests that a large global population is a good preventative for a singleton
  • I don't think this means we can "war of the cradle" our way out of singleton risk, given how fast tech moves and how slow population moves
  • I think this does mean that any non-extinction event that dramatically reduces population also dramatically increases singleton risk
  • I think that it's possible to get a long-term government aligned with the values of the governed, and "singleton risk" is the risk of an unaligned global government

So I think I'd be interested in tracking two "competing" technologies (for a hand-wavy definition of the term)

  1. communication and coordination technologies -- tools which increase the maximum effective size of coordination
  2. soft/human alignment technologies -- tools which increase alignment between government and governed