On Solving Problems Before They Appear: The Weird Epistemologies of Alignment

by adamShimi22 min read11th Oct 20213 comments


Ω 31

EpistemologyIntellectual Progress (Society-Level)Practice & Philosophy of ScienceAI RiskAI
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Crossposted to the EA Forum


Imagine you are tasked with curing a disease which hasn’t appeared yet. Setting aside why you would know about such a disease’s emergence in the future, how would you go about curing it? You can’t directly or indirectly gather data about the disease itself, as it doesn’t exist yet; you can’t try new drugs to see if they work, as the disease doesn’t exist yet; you can’t do anything that is experimental in any way on the disease itself, as… you get the gist. You would be forgiven for thinking that there is nothing you can do about it right now.

AI Alignment looks even more hopeless: it’s about solving a never-before seen problem on a technology which doesn’t exist yet.

Yet researchers are actually working on it! There are papers, books, unconferences, whole online forums full of alignment research, both conceptual and applied. Some of these work in a purely abstract and theoretical realm, others study our best current analogous of Human Level AI/Transformative AI/AGI, still others iterate on current technologies that seem precursors to these kinds of AI, typically large language models. With so many different approaches, how can we know if we’re making progress or just grasping at straws?

Intuitively, the latter two approaches sound more like how we should produce knowledge: they take their epistemic strategies (ways of producing knowledge) out of Science and Engineering, the two cornerstones of knowledge and technology in the modern world. Yet recall that in alignment, models of the actual problem and/or technology can’t be evaluated experimentally, and one cannot try and iterate on proposed solutions directly. So when we take inspiration from Science and Engineering (and I want people to do that), we must be careful and realize that most of the guarantees and checks we associate with both are simply not present in alignment, for the decidedly pedestrian reason that Human Level AI/Transformative AI/AGI doesn’t exist yet.

I thus claim that:

  • Epistemic strategies from Science and Engineering don’t dominate other strategies in alignment research.
  • Given the hardness of grounding knowledge in alignment, we should leverage every epistemic strategy we can find and get away with.
  • These epistemic strategies should be made explicit and examined. Both the ones taken or adapted from Science and Engineering, and every other one (for example the more theoretical and philosophical strategies favored in conceptual alignment research).

This matters a lot, as it underlies many issues and confusions in how alignment is discussed, taught, created and criticized. Having such a diverse array of epistemic strategies is fascinating, but their implicit nature makes it challenging to communicate with newcomers, outsiders, and even fellow researchers leveraging different strategies. Here is a non-exhaustive list of issues that boil down to epistemic strategy confusion:

  • There is a strong natural bias towards believing that taking epistemic strategies from Science and Engineering automatically leads to work that is valuable for alignment.
    • This leads to a flurry of work (often by well-meaning ML researchers/engineers) that doesn’t tackle or help with alignment, and might push capabilities (in an imagined tradeoff with the assumed alignment benefits of the work)
    • Very important: that doesn’t mean I consider work based on ML as useless for alignment. Just that the sort of ML-based work that actually tries to solve the problem tends to be by people who understand that one must be careful when transfering epistemic strategies to alignment.
  • There is a strong natural bias towards disparaging any epistemic strategy that doesn’t align neatly with the main strategies of Science and Engineering (even when the alternative epistemic strategies are actually used by scientists and engineers in practice!)
    • This leads to more and more common confusion about what’s happening on the Alignment Forum and conceptual alignment research more generally. And that confusion can easily turn into the sort of criticism that boils down to “this is stupid and people should stop doing that”.
  • It’s quite hard to evaluate whether alignment research (applied or conceptual) creates the kind of knowledge we’re after, and helps move the ball forward. This comes from both the varieties of epistemic strategies and the lack of the usual guarantees and checks when applying more mainstream ones (every observation and experiment is used through analogies or induction to future regimes).
    • This makes it harder for researchers using different sets of epistemic strategies to talk to each other and give useful feedback to one another.
  • Criticism of some approach or idea often stops at the level of the epistemic strategy being weird.
    • This happens a lot with criticism of lack of concreteness and formalism and grounding for Alignment Forum posts.
    • It also happens when applied alignment research is rebuked solely because it uses current technology, and the critics have decided that it can’t apply to AGI-like regimes.
  • Teaching alignment without tackling this pluralism of epistemic strategies, or by trying to fit everything into a paradigm using only a handful of those, results in my experience in people who know the lingo and some of the concepts, but have trouble contributing, criticizing and teaching the ideas and research they learnt.
    • You can also end up with a sort of dogmatism that alignment can only be done a certain way.

Note that most fields (including many sciences and engineering disciplines) also use weirder epistemic strategies. So do many attempts at predicting and directing the future (think existential risk mitigation in general). My point is not that alignment is a special snowflake, more that it’s both weird enough (in the inability to experiment and iterate directly with) and important enough that elucidating the epistemic strategies we’re using, finding others and integrating them is particularly important.

In the rest of this post, I develop and unfold my arguments in more detail. I start with digging deeper into what I mean by the main epistemic strategies of Science and Engineering, and why they don’t transfer unharmed to alignment. Then I demonstrate the importance of looking at different epistemic strategies, by focusing on examples of alignment results and arguments which make most sense as interpreted through the epistemic lens of Theoretical Computer Science. I use the latter as an inspiration because I adore that field and because it’s a fertile ground for epistemic strategies. I conclude by pointing at the sort of epistemic analyses I feel are needed right now.

Lastly, this post can be seen as a research agenda of sorts, as I’m already doing some of these epistemic analyses, and believe this is the most important use of my time and my nerdiness about weird epistemological tricks. We roll with what evolution’s dice gave us.

Thanks to Logan Smith, Richard Ngo, Remmelt Ellen,, Alex Turner, Joe Collman, Ruby, Antonin Broi, Antoine de Scorraille, Florent Berthet, Maxime Riché, Steve Byrnes, John Wentworth and Connor Leahy for discussions and feedback on drafts.

Science and Engineering Walking on Eggshells

What is the standard way of learning how the world works? For centuries now, the answer has been Science.


I like Peter Godfrey-Smith’s description in his glorious Theory and Reality:

Science works by taking theoretical ideas and trying to find ways to expose them to observation. The scientific strategy is to construe ideas, to embed them in surrounding conceptual frameworks, and to develop them, in such a way that this exposure is possible even in the case of the most general and ambitious hypotheses about the universe.

That is, the essence of science is in the trinity of modelling, predicting and testing.

This doesn’t mean there are no epistemological subtleties left in modern science; finding ways of gathering evidence, of forcing the meeting of model and reality, often takes incredibly creative turns. Fundamental Chemistry uses synthesis of molecules never seen in nature to test the edge cases of its models; black holes are not observable directly, but must be inferred by a host of indirect signals like the light released by matter falling in the black hole or gravitational waves during black holes merging; Ancient Rome is probed and explored through a discussion between textual analysis and archeological discoveries.

Yet all of these still amount to instantiating the meta epistemic strategy “say something about the world, then check if the world agrees”. As already pointed out, it doesn’t transfer straightforwardly to alignment because Human Level AI/Transformative AI/AGI doesn’t exist yet.

What I’m not saying is that epistemic strategies from Science are irrelevant to alignment. But because they must be adapted to tell us something about a phenomenon that doesn’t exist yet, they lose their supremacy in the creation of knowledge. They can help us to gather data about what exists now, and to think about the sort of models that are good at describing reality, but checking their relevance to the actual problem/thinking through the analogies requires thinking in more detail about what kind of knowledge we’re creating.

If we instead want more of a problem solving perspective, tinkering is a staple strategy in engineering, before we know how to solve the problem things reliably. Think about curing cancer or building the internet: you try the best solutions you can think of, see how they fail, correct the issues or find a new approach, and iterate.

Once again, this is made qualitatively different in alignment by the fact that neither the problem nor the source of the problem exist yet. We can try to solve toy versions of the problem, or what we consider analogous situations, but none of our solutions can be actually tested yet. And there is the additional difficulty that Human-level AI/Transformative AI/AGI might be so dangerous that we have only one chance to implement the solution.

So if we want to apply the essence of human technological progress, from agriculture to planes and computer programs, just trying things out, we need to deal with the epistemic subtleties and limits of analogies and toy problems.

An earlier draft presented the conclusion of this section as “Science and Engineering can’t do anything for us—what should we do?” which is not my point. What I’m claiming is that in alignment, the epistemic strategies from both Science and Engineering are not as straightforward to use and leverage as they usually are (yes, I know, there’s a lot of subtleties to Science and Engineering anyway). They don’t provide privileged approaches demanding minimal epistemic thinking in most cases; instead we have to be careful how we use them as epistemic strategies. Think of them as tools which are so good that most of the time, people can use them without thinking about all the details of how the tools work, and get mostly the wanted result. My claim here is that these tools need to be applied with significantly more care in alignment, where they lose their “basically works all the time” status.

Acknowledging that point is crucial for understanding why alignment research is so pluralistic in terms of epistemic strategies. Because no such strategy works as broadly as we’re used to for most of Science and Engineering, alignment has to draw from literally every epistemic strategy it can pull, taking inspiration from Science and Engineering, but also philosophy, pre-mortems of complex projects, and a number of other fields.

To show that further, I turn to some main alignment concepts which are often considered confusing and weird, in part because they don’t leverage the most common epistemic strategies. I explain these results by recontextualizing them through the lens of Theoretical Computer Science (TCS).

Epistemic Strategies from TCS

For those who don’t know the field, Theoretical Computer Science focuses on studying computation. It emerged from the work of Turing, Church, Gödel and others in the 30s, on formal models of what we would now call computation: the process of following a step-by-step recipe to solve a problem. TCS cares about what is computable and what isn’t, as well as how much resources are needed for each problem. Probably the most active subfield is Complexity Theory, which cares about how to separate computational problems in classes capturing how many resources (most often time – number of steps) are required for solving them.

What makes TCS most relevant to us is that theoretical computer scientists excel at wringing knowledge out of the most improbable places. They are brilliant at inventing epistemic strategies, and remember, we need every one we can find for solving alignment.

To show you what I mean, let’s look at three main ideas/arguments in alignment (Convergent Subgoals, Goodhart’s Law and the Orthogonality Thesis) through some TCS epistemological strategies.

Convergent Subgoals and Smoothed Analysis

One of the main argument for AI Risk and statement of a problem in alignment is Nick Bostrom’s Instrumental Convergence Thesis (which also takes inspiration from Steve Omohundro’s Basic AI Drives):

Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents.

That is, actions/plans exist which help with a vast array of different tasks: self-preservation, protecting one’s own goal, acquiring resources... So a Human-level AI/Transformative AI/AGI could take them while still doing what we asked it to do. Convergent subgoals are about showing that behaviors which look like they can only emerge from rebellious robots actually can be pretty useful for obedient (but unaligned in some way) AI.

What kind of argument is that? Bostrom makes a claim about “most goals”—that is, the space of all goals. His claim is that convergent subgoals are so useful that goal-space is almost chock-full of goals incentivizing convergent subgoals.

And more recent explorations of this argument have followed this intuition: Alex Turner et al.’s work on power-seeking formalizes the instrumental convergence thesis in the setting of Markov decision processes (MDP) and reward functions by looking, for every “goal” (a distribution over reward functions) at the set of all its permuted variants (the distribution given by exchanging some states – so the reward labels stay the same, but are not put on the same states). Their main theorems state that given some symmetry properties in the environment, a majority (or a possibly bigger fraction) of the permuted variant of every goal will incentivize convergent subgoals for its optimal policies.

So this tells us that goals without convergent subgoals exist, but they tend to be crowded out by ones with such subgoals. Still, it’s very important to realize what neither Bostrom nor Turner are arguing for: they’re not saying that every goal has convergent subgoals. Nor are they claiming to have found the way humans sample goal-space, such that their results imply goals with convergent subgoals must be sampled by humans with high probability. Instead, they show the overwhelmingness of convergent subgoals in some settings, and consider that a strong indicator that avoiding them is hard.

I see a direct analogy with the wonderful idea of smoothed analysis in complexity theory. For a bit of background, complexity theory generally focuses on the worst case time taken by algorithms. That means it mostly cares about which input will take the most time, not about the average time taken over all inputs. The latter is also studied, but it’s nigh impossible to find the distribution of input actually used in practice (and some problems studied in complexity theory are never used in practice, so a meaningful distribution is even harder to make sense of). Just like it’s very hard to find the distribution from which goals are sampled in practice.

As a consequence of the focus on worst case complexity, some results in complexity theory clash with experience. Here we focus on the simplex algorithm, used for linear programming: it runs really fast and well in practice, despite having provable exponential worst case complexity. Which in complexity-theory-speak means it shouldn’t be a practical algorithm.

Daniel Spielman and Shang-Hua Teng had a brilliant intuition to resolve this inconsistency: what if the worst case inputs were so rare that just a little perturbation would make them easy again? Imagine a landscape that is mostly flat, with some very high but very steep peaks. Then if you don’t land exactly on the peaks (and they’re so pointy that it’s really hard to get there exactly), you end up on the flat surface.

This intuition yielded smoothed analysis: instead of just computing the worst case complexity, we compute the worst case complexity averaged over some noise on the input. Hence the peaks get averaged with the flatness around them and have a low smoothed time complexity.

Convergent subgoals, especially in Turner’s formulation, behave in an analogous way: think of the peaks as goals without convergent subgoals; to avoid power-seeking we would ideally reach one of them, but their rarity and intolerance to small perturbations (permutations here) makes it really hard. So the knowledge created is about the shape of the landscape, and leveraging the intuition of smoothed analysis, that tells us something important about the hardness of avoiding convergent subgoals.

Note though that there is one aspect in which Turner’s results and the smoothed analysis of the simplex algorithm are complete opposite: in the former the peaks are what we want (no convergent subgoals) while in the latter they’re what we don’t want (inputs that take exponential time to treat). This inversion doesn’t change the sort of knowledge produced, but it’s an easy source of confusion.

epistemic analysis isn’t only meant for clarifying and distilling results: it can and should pave the way to some insights on how the arguments could fail. Here, the analogy to smoothed complexity and the landscape picture suggests that Bostrom and Turner’s argument could be interrogated by:

  • Arguing that the sampling method we use in practice to decide tasks and goals for AI targets specifically the peaks.
  • Arguing that the sampling is done in a smaller goal-space for which the peaks are broader
    • For Turner’s version, one way of doing that might be to not consider the full orbit, but only the close-ish variations of the goals (small permutations instead of all permutations). So questioning the form of noise over the choice of goals that is used in Turner’s work.
Smoothed AnalysisConvergent Subgoals (Turner et al)           
Possible InputsPossible Goals
Worst-case inputs make steep and rare peaksGoals without convergent subgoals make steep and rare peaks

Goodhart’s Law and Distributed Impossibility/Hardness Results

Another recurrent concept in alignment thinking is Goodhart’s law. It wasn’t invented by alignment researchers, but Scott Garrabrant and David Manheim proposed a taxonomy of its different forms. Fundamentally, Goodhart’s law tells us that if we optimize a proxy of what we really want (some measure that closely correlates with the wanted quantity in the regime we can observe), strong optimization tends to make the two split apart, meaning we don’t end up with what we really wanted. For example, imagine that everytime you go for a run you put on your running shoes, and you only put on these shoes for running. Putting on your shoes is thus a good proxy for running; but if you decide to optimize the former in order to optimize the latter, you will take most of your time putting on and taking off your shoes instead of running.

In alignment, Goodhart is used to argue for the hardness of specifying exactly what we want: small discrepancies can lead to massive differences in outcomes.

Yet there is a recurrent problem: Goodhart’s law assumes the existence of some True Objective, of which we’re taking a proxy. Even setting aside the difficulties of defining what we really want at a given time, what if what we really want is not fixed but constantly evolving? Thinking about what I want nowadays, for example, it’s different from what I wanted 10 years ago, despite some similarities. How can Goodhart’s law apply to my values and my desires if there is not a fixed target to reach?

Salvation comes from a basic insight when comparing problems: if problem A (running a marathon) is harder than problem B (running 5 km), then showing that the latter is really hard, or even impossible, transfers to the former.

My description above focuses on the notion of one problem being harder than another. TCS formalizes this notion by saying the easier problem is reducible to the harder one: a solution for the harder one lets us build a solution for the easier problem. And that’s the trick: if we show there is no solution for the easier problem, this means that there is no solution for the harder one, or such a solution could be used to solve the easier problem. Same thing with hardness results which are about how difficult it is to solve a problem.

That is, when proving impossibility/hardness, you want to focus on the easiest version of the problem for which the impossibility/hardness still holds.

In the case of Goodhart’s law, this can be used to argue that it applies to moving targets because having True Values or a True Objective makes the problem easier. Hitting a fixed target sounds simpler than hitting a moving or shifting one. If we accept that conclusion, then because Goodhart’s law shows hardness in the former case, it also does in the latter.

That being said, whether the moving target problem is indeed harder is debatable and debated. My point here is not to claim that this is definitely true, and so that Goodhart’s law necessarily applies. Instead, it’s to focus the discussion on the relative hardness of the two problems, which is what underlies the correctness of the epistemic strategy I just described. So the point of this analysis is that there is another argument to decide the usefulness of Goodhart’s law in alignment than debating the existence of True Value

Easier Problem5kApproximating a fixed target (True Values)
Harder ProblemMarathonApproximating a moving target

Orthogonality Thesis and Complexity Barriers

My last example is Bostrom’s Orthogonality Thesis: it states that goals and competence are orthogonal, meaning that they are independent– a certain level of competence doesn’t force a system to have only a small range of goals (with some subtleties that I address below).

That might sound only too general to really be useful for alignment, but we need to put it in context. The Orthogonality Thesis is a response to a common argument for alignment-by-default: because a Human-level AI/Transformative AI/AGI would be competent, it should realize what we really meant/wanted, and correct itself as a result. Bostrom points out that there is a difference between understanding and caring. The AI understanding our real intentions doesn’t mean it must act on that knowledge, especially if it is programmed and instructed to follow our initial commands. So our advanced AI might understand that we don’t want it to follow convergent subgoals while maximizing the number of paperclips produced in the world, but what it cares about is the initial goal/command/task of maximizing paperclips, not the more accurate representation of what we really meant.

Put another way, if one wants to prove alignment-by-default, the Orthogonality thesis argues that competence is not enough. As it is used, it’s not so much a result about the real world, but a result about how we can reason about the world. It shows that one class of arguments (competence will lead to human values) isn’t enough.

Just like some of the weirdest results in complexity theory: the barriers to P vs NP. This problem is one of the biggest and most difficult open questions in complexity theory: settling formally the question of whether the class of problems which can tractably be solved (P for Polynomial time) is equal to the class of problems for which solutions can be tractably checked (NP for Non-deterministic Polynomial time). Intuitively those are different: the former is about creativity, the second about taste, and we feel that creating something of quality is harder than being able to recognize it. Yet a proof of this result (or its surprising opposite) has evaded complexity theorists for decades.

That being said, recall that theoretical computer scientists are experts at wringing knowledge out of everything, including their inability to prove something. This resulted in the three barriers to P vs NP: three techniques from complexity theory which have been proved to not be enough by themselves for showing P vs NP or its opposite. I won’t go into the technical details here, because the analogy is mostly with the goal of these barriers. They let complexity theorists know quickly if a proof has potential – it must circumvent the barriers somehow.

The Orthogonality thesis plays a similar role in alignment: it’s an easy check for the sort of arguments about alignment-by-default that many people think of when learning about the topic. If they extract alignment purely from the competence of the AI, then the Orthogonality Thesis tells us something is wrong.

What does this mean for criticism of the argument? That what matters when trying to break the Orthogonality Thesis isn’t its literal statement, but whether it still provides a barrier to alignment-by-default. Bostrom himself points out that the Orthogonality Thesis isn’t literally true in some regimes (for example some goals might require a minimum of competence) but that doesn’t affect the barrier nature of the result.

Barriers to P vs NPOrthogonality Thesis
Proof techniques that are provably not enough to settle the questionCompetence by itself isn’t enough to show alignment-by-default

Improving the Epistemic State-of-the-art

Alignment research aims at solving a never-before-seen problem caused by a technology that doesn’t exist yet. This means that the main epistemic strategies from Science and Engineering need to be adapted if used, and lose some of their guarantees and checks. In consequence, I claim we shouldn’t only focus on those, but use all epistemic strategies we can find to produce new knowledge. This is already happening to some extent, causing both progress on many fronts but also difficulties in teaching, communicating and criticizing alignment work.

In this post I focused on epistemological analyses drawing from Theoretical Computer Science, both because of my love for it and because it fits into the background of many conceptual alignment researchers. But many different research directions leverage different epistemic strategies, and those should also be studied and unearthed to facilitate learning and criticism.

More generally, the inherent weirdness of alignment makes it nigh impossible to find one unique methodology of doing it. We need everything we can get, and that implies a more pluralistic epistemology. Which means the epistemology of different research approaches must be considered and studied and made explicit, if we don’t want to be confusing for the rest of the world, and each other too.

I’m planning on focusing on such epistemic analyses in the future, both for the main ideas and concepts we want to teach to new researchers and for the state-of-the art work that needs to be questioned and criticized productively.


Ω 31

3 comments, sorted by Highlighting new comments since Today at 7:35 AM
New Comment

This post really helped me make concrete some of the admittedly gut reaction type concerns/questions/misunderstandings I had about alignment research, thank you. I have a few thoughts after reading:

(1) I wonder how different some of these epistemic strategies are from everyday normal scientific research in practice. I do experimental neuroscience and I would argue that we also are not even really sure what the "right" questions are (in a local sense, as in, what experiment should I do next), and so we are in a state where we kinda fumble around using whatever inspiration we can. The inspiration can take many forms - philosophical, theoretical, emperical, a very simple model, thought experiments of various kinds, ideas or experimental results with an aesthetic quality. It is true that at the end of the day brain's already exist, so we have that to probe, but I'd argue that we don't have a great handle on what exactly is the important thing to look at in brains, nor in what experimental contexts we should be looking at them, so it's not immediately obvious what type of models, experiments, or observations we should be doing. What ends up happening is, I think, a lot of the types of arguments you mention. For instance, trying to make a story using the types of tasks we can run in the lab but applying to more complicated real world scenarios (or vice versa), and these arguments often take a less-than-totally-formal form. There is an analagous conversation occuring within neuroscience that takes the form of "does any of this work even say anything about how the brain works?!"

(2) You used theoretical computer science as your main example but it sounds to me like the epistemic strategies one might want in alignment research are more generally found in pure mathematics. I am not a mathematician but I know a few, and I'm always really intrigued by the difference in how they go about problem solving compared to us scientists.



Thanks for the kind words and thoughtful comment!

This post really helped me make concrete some of the admittedly gut reaction type concerns/questions/misunderstandings I had about alignment research, thank you.

Glad it helped! That was definitely one goal, the hardest to check with early feedback because I mostly know people who already work in the field or have never been confronted to it, while you're in the middle. :)

I wonder how different some of these epistemic strategies are from everyday normal scientific research in practice.

Completely! One thing I tried to make clear in this draft (maybe not successfully given your comment ^^) is that many field, including natural sciences, leverage far more epistemic strategies than the Popperian "make a model, predict something and test it in the real world". My points are more that:

  1. Alignment is particularly weird in terms of epistemic strategies because neither the problem nor the technology exists
  2. Given the potential urgency of alignment, it's even more important to clarify these epistemic subtleties.

But I'm convinced that almost all disciplines could be the subject of a deep study in the methods people use to wrestle knowledge from the world. That's part of my hopes, since I want to steal epistemic strategies from many different fields and see how they apply to alignment.

It is true that at the end of the day brain's already exist, so we have that to probe, but I'd argue that we don't have a great handle on what exactly is the important thing to look at in brains, nor in what experimental contexts we should be looking at them, so it's not immediately obvious what type of models, experiments, or observations we should be doing. What ends up happening is, I think, a lot of the types of arguments you mention. For instance, trying to make a story using the types of tasks we can run in the lab but applying to more complicated real world scenarios (or vice versa), and these arguments often take a less-than-totally-formal form. There is an analagous conversation occuring within neuroscience that takes the form of "does any of this work even say anything about how the brain works?!"

Fascinating! Yeah, I agree with you that the analogy definitely exists, particularly with fundamental science. And that's part of the difficulty in alignment. Maybe the best comparison would be trying to cure a neurological pathology without having access to a human brain, but only to brains of individuals of very old species in our evolutionary lineage. It's harder, but linking the experimental results to the concrete object is still part of the problem.

(Would be very interested in having a call about the different epistemic strategies you use in experimental neuroscience by the way)

(2) You used theoretical computer science as your main example but it sounds to me like the epistemic strategies one might want in alignment research are more generally found in pure mathematics. I am not a mathematician but I know a few, and I'm always really intrigued by the difference in how they go about problem solving compared to us scientists.

So I disagree, but you're touching on a fascinating topic, one that confused me for the longest time.

My claim is that pure maths is fundamentally the study of abstraction (Platonists would disagree, but that's more of a minority position nowadays). Patterns is also a word commonly used when mathematicians wax poetic. What this means is that pure maths studies ways of looking at the world, of representing it. But, and that's crucial for our discussion, pure maths doesn't care about how to apply these representations to the world itself. When pure mathematicians study concrete systems, it's often to get exciting abstractions out of them, but they don't check those abstractions against the world again -- they study the abstraction in and of themselves.

The closest pure maths gets to caring about how to apply abstractions is when using one type of abstractions to understand another. The history of modern mathematics is full of dualities, representation theorems, correspondences and tricks for seeing one abstraction through the prism of another. By the way that's one of the most useful aspect of maths in my opinion: if you manage to formalize your intuition/ideas as a well-studied mathematical abstraction, then you get for free a lot of powerful tools and ways of looking at it. But maths itself doesn't tell you how to do the formalization.

On the other hand, theoretical computer scientists are in this weird position where they almost only work in the abstract world reminiscent of pure maths, but their focus is on an aspect of the real world: computation, its possibilities and its limits. TCS doesn't care about a cool abstractions if it doesn't unlock some nice insight about computation.

The post already have some nice examples, but here is another one I like: defining the class of tractable problems. You want to delineate the problems that can be solved in practice, in the sense that the time taken to solve them grows slowly enough with the size of the input that solving for massive inputs is a matter of days, maybe weeks or months of computation, but not thousands of years. Yet there is a fundamental tension between a category that fits the problems for which we have fast algorithms, and mathematical elegance/well-behavedness.

So complexity theorists reached for a compromise with P, the class of problems solvable in polynomial time. Polynomials are nice because a linear sum of them is still a polynomial, which means among other things that if solving problem A is the same than solving problem B 3 times then problem C 2 times, and both B and C are in P, then A is also in P. Yet polynomials can grow pretty damn fast. If n is the size of the input, a problem who can be solved in time growing like  is technically in P, but is completely intractable. The argument for why this is not an issue is that in practice, all the problems we can prove are in P have an algorithm with complexity at most growing like , which is the limit of tractable.

So much of the reasoning I presented in the previous paragraphs is reminiscent of troubles and subtleties in alignment, but so different from pure maths, that I really believe TCS is more relevant to the epistemic strategies of alignment than pure maths. Which doesn't mean maths doesn't prove crucial when instanciating these strategies, for example by proving theorems we need.

Great post -  similar to Adam Shai's comment, this reminds of a discussion in psychology about the over-application of the scientific paradigm. In a push to seem more legitimate/prestigious (physics envy) , psychology has pushed 3rd-person experimental science at the expense of more observational or philosophical approaches.

(When) should psychology be a science? - https://onlinelibrary.wiley.com/doi/abs/10.1111/jtsb.12316