Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

To quickly recap my main intellectual journey so far (omitting a lengthy side trip into cryptography and Cypherpunk land), with the approximate age that I became interested in each topic in parentheses:

  • (10) Science - Science is cool!
  • (15) Philosophy of Science - The scientific method is cool! Oh look, there's a whole field studying it called "philosophy of science"!
  • (20) Probability Theory - Bayesian subjective probability and the universal prior seem to constitute an elegant solution to the philosophy of science. Hmm, there are some curious probability puzzles involving things like indexical uncertainty, copying, forgetting... I and others make some progress on this but fully solving anthropic reasoning seems really hard. (Lots of people have worked on this for a while and have failed, at least according to my judgement.)
  • (25) Decision Theory - Where does probability theory come from anyway? Maybe I can find some clues that way? Well according to von Neumann and Morgenstern, it comes from decision theory. And hey, maybe it will be really important that we get decision theory right for AI? I and others make some progress but fully solving decision theory turns out to be pretty hard too. (A number of people have worked on this for a while and haven't succeeded yet.)
  • (35) Metaphilosophy - Where does decision theory come from? It seems to come from philosophers trying to do philosophy. What is that about? Plus, maybe it will be really important that the AIs we build will be philosophically competent?
  • (45) Meta Questions about Metaphilosophy - Not sure how hard solving metaphilosophy really is, but I'm not making much progress on it by myself. Meta questions once again start to appear in my mind:
    • Why is there virtually nobody else interested in metaphilosophy or ensuring AI philosophical competence (or that of future civilization as a whole), even as we get ever closer to AGI, and other areas of AI safety start attracting more money and talent?
    • Tractability may be a concern but shouldn't more people still be talking about these problems if only to raise the alarm (about an additional reason that the AI transition may go badly)? (I've listened to all the recent podcasts on AI risk that I could find, and nobody brought it up even once.)
    • How can I better recruit attention and resources to this topic? For example, should I draw on my crypto-related fame, or start a prize or grant program with my own money? I'm currently not inclined to do either, out of inertia, unfamiliarity, uncertainty of getting any return, fear of drawing too much attention from people who don't have the highest caliber of thinking, and signaling wrong things (having to promote ideas with one's own money instead of attracting attention based on their merits). But I'm open to having my mind changed if anyone has good arguments about this.
    • What does it imply that so few people are working on this at such a late stage? For example, what are the implications for the outcome of the human-AI transition, and on the distribution of philosophical competence (and hence the distribution of values, decision theories, and other philosophical views) among civilizations in the universe/multiverse?

At each stage of this journey, I took what seemed to be the obvious next step (often up a meta ladder), but in retrospect each step left behind something like 90-99% of fellow travelers. From my current position, it looks like "all roads lead to metaphilosophy" (i.e., one would end up here starting with an interest in any nontrivial problem that incentivizes asking meta questions) and yet there's almost nobody here with me. What gives?

As for the AI safety path (as opposed to pure intellectual curiosity) that also leads here, I guess I do have more of a clue what's going on. I'll describe the positions of 4 people I know. Most of this is from private conversations so I won't give their names.

  • Person A has a specific model of the AI transition that they're pretty confident in, where the first AGI is likely to develop a big lead and if it's aligned, can quickly achieve human uploading then defer to the uploads for philosophical questions.
  • Person B thinks that ensuring AI philosophical competence won't be very hard. They have a specific (unpublished) idea that they are pretty sure will work. They're just too busy to publish/discuss the idea.
  • Person C will at least think about metaphilosophy in the back of their mind (as they spend most of their time working on other things related to AI safety).
  • Person D thinks it is important and too neglected but they personally have a comparative advantage in solving intent alignment.

To me, this paints a bigger picture that's pretty far from "humanity has got this handled." If anyone has any ideas how to change this, or answers to any of my other unsolved problems in this post, or an interest in working on them, I'd love to hear from you.

[-]Connor Leahy8moΩ164221

As someone that does think about a lot of the things you care about at least some of the time (and does care pretty deeply), I can speak for myself why I don't talk about these things too much:

Epistemic problems:

  • Mostly, the concept of "metaphilosophy" is so hopelessly broad that you kinda reach it by definition by thinking about any problem hard enough. This isn't a good thing: when you have a category so large it contains everything (not saying this applies to you, but it applies to many other people I have met who talked about metaphilosophy), it usually means you are confused.
  • Relatedly, philosophy is incredibly ungrounded and epistemologically fraught. It is extremely hard to think about these topics in ways that actually eventually cash out into something tangible, rather than nerdsniping young smart people forever (or until they run out of funding).
  • Further on that, it is my belief that good philosophy should make you stronger, and this means that fmpov a lot of the work that would be most impactful for making progress on metaphilosophy does not look like (academic) philosophy, and looks more like "build effective institutions and learn interactively why this is hard" and "get
... (read more)
[-]Wei Dai7moΩ102720

I expect at this moment in time me building a company is going to help me deconfuse a lot of things about philosophy more than me thinking about it really hard in isolation would

Hard for me to make sense of this. What philosophical questions do you think you'll get clarity on by doing this? What are some examples of people successfully doing this in the past?

It seems plausible that there is no such thing as “correct” metaphilosophy, and humans are just making up random stuff based on our priors and environment and that’s it and there is no “right way” to do philosophy, similar to how there are no “right preferences”.

Definitely a possibility (I've entertained it myself and maybe wrote some past comments along these lines). I wish there were more people studying this possibility.

I have short timelines and think we will be dead if we don’t make very rapid progress on extremely urgent practical problems like government regulation and AI safety. Metaphilosophy falls into the unfortunate bucket of “important, but not (as) urgent” in my view.

Everyone dying isn't the worst thing that could happen. I think from a selfish perspective, I'm personally a bit more scared of surviving i... (read more)

Hard for me to make sense of this. What philosophical questions do you think you'll get clarity on by doing this? What are some examples of people successfully doing this in the past?

The fact you ask this question is interesting to me, because in my view the opposite question is the more natural one to ask: What kind of questions can you make progress on without constant grounding and dialogue with reality? This is the default of how we humans build knowledge and solve hard new questions. The places where we do best and get led astray the least are exactly those areas where we can get as much feedback from reality in as tight loops as possible, so if we are trying to tackle ever more lofty problems, it becomes ever more important to get exactly that feedback wherever we can get it! From my point of view, this is the default of successful human epistemology, and the exception should be viewed with suspicion.

And for what it's worth, acting in the real world, building a company, raising money, debating people live, building technology, making friends (and enemies), absolutely helped me become far, far less confused, and far more capable of tackling confusing problems! Actu... (read more)

1M. Y. Zuo7mo
You raised a very interesting point in the last comment, that metaphilosophy already encompasses everything, that we could conceive of at least. So a 'solution' is not tractable due to various well known issues such as the halting problem and so on. (Though perhaps in the very distant future this could be different.) However this leads to a problem, as exemplified by your phrasing here: 'good philosophy' is not a sensible category since you already know you have not, and cannot, 'solve' metaphilosophy. Nor can any other LW reader do so. 'good' or 'bad' in real practice are, at best, whatever the popular consensus is in the present reality, at worst, just someone's idiosyncratic opinions.  Very few concepts are entirely independent from any philosophical or metaphilosophical implications whatsoever, and 'good philosophy' is not one of them.  But you still felt a need to attach these modifiers, due to a variety of reasons well analyzed on LW, so the pretense of a solved or solvable metaphilosophy is still needed for this part of the comment to make sense.  I don't want to single out your comment too much though, since it's just the most convenient example, this applies to most LW comments. i.e. If everyone actually accepted the point, which I agree with, I dare say a huge chunk of LW comments are close to meaningless from a formal viewpoint, or at least very open to interpretation by anyone who isn't immersed in 21st century human culture. 
4Connor Leahy7mo
"good" always refers to idiosyncratic opinions, I don't really take moral realism particularly seriously. I think there is "good" philosophy in the same way there are "good" optimization algorithms for neural networks, while also I assume there is no one optimizer that "solves" all neural network problems.
3M. Y. Zuo7mo
'"good" optimization algorithms for neural networks' also has no difference in meaning from '"glorxnag" optimization  algorithms for neural networks', or any random permutation, if your prior point holds.
2Connor Leahy7mo
I don't understand what point you are trying to make, to be honest. There are certain problems that humans/I care about that we/I want NNs to solve, and some optimizers (e.g. Adam) solve those problems better or more tractably than others (e.g. SGD or second order methods). You can claim that the "set of problems humans care about" is "arbitrary", to which I would reply "sure?" Similarly, I want "good" "philosophy" to be "better" at "solving" "problems I care about." If you want to use other words for this, my answer is again "sure?" I think this is a good use of the word "philosophy" that gets better at what people actually want out of it, but I'm not gonna die on this hill because of an abstract semantic disagreement.
1M. Y. Zuo7mo
That's the thing, there is no definable "set of problems humans care about" without some kind of attached or presumed metaphilosophy, at least none that you, or anyone, could possibly figure out in the foreseeable future and prove to a reasonable degree of confidence to the LW readerbase. It's not even 'arbitrary',  that string of letters is indistinguishable from random noise. i.e. Right now your first paragraph is mostly meaningless if read completely literally and by someone who accepts the claim. Such a hypothetical person would think you've gone nuts because it would appear like you took a well written comment and inserted strings of random keyboard bashing in the middle. Of course it's unlikely that someone would be so literal minded, and so insistent on logical correctness, that they would completely equate it with random bashing of a keyboard. But it's possible some portion of readers lean towards that.
That is not a fact.
Hear! Hear!

It seems plausible that there is no such thing as "correct" metaphilosophy, and humans are just making up random stuff based on our priors and environment and that's it and there is no "right way" to do philosophy, similar to how there are no "right preferences"

If this is true, doesn't this give us more reason to think metaphilosophy work is counterfactually important, i.e., can't just be delegated to AIs? Maybe this isn't what Wei Dai is trying to do, but it seems like "figure out which approaches to things (other than preferences) that don't have 'right answers' we [assuming coordination on some notion of 'we'] endorse, before delegating to agents smarter than us" is time-sensitive, and yet doesn't seem to be addressed by mainstream intent alignment work AFAIK.

(I think one could define "intent alignment" broadly enough to encompass this kind of metaphilosophy, but I smell a potential motte-and-bailey looming here if people want to justify particular research/engineering agendas labeled as "intent alignment.")

2Connor Leahy7mo
I think this is not an unreasonable position, yes. I expect the best way to achieve this would be to make global coordination and epistemology better/more coherent...which is bottlenecked by us running out of time, hence why I think the pragmatic strategic choice is to try to buy us more time. One of the ways I can see a "slow takeoff/alignment by default" world still going bad is that in the run-up to takeoff, pseudo-AGIs are used to hypercharge memetic warfare/mutation load to a degree basically every living human is just functionally insane, and then even an aligned AGI can't (and wouldn't want to) "undo" that.
4Wei Dai7mo
What are you proposing or planning to do to achieve this? I observe that most current attempts to "buy time" seem organized around convincing people that AI deception/takeover is a big risk and that we should pause or slow down AI development or deployment until that problem is solved, for example via intent alignment. But what happens if AI deception then gets solved relatively quickly (or someone comes up with a proposed solution that looks good enough to decision makers)? And this is another way that working on alignment could be harmful from my perspective...
4Connor Leahy7mo
I see regulation as the most likely (and most accessible) avenue that can buy us significant time. The obvious one, fmpov, is to just put compute caps in place and make it illegal to do training runs above a certain FLOP level. Other possibilities are strict liability for model developers (developers, not just deployers or users, are held criminally liable for any damage caused by their models), global moratoria, "CERN for AI" and similar. Generally, I endorse the proposals here. None of these are easy, of course; there is a reason my p(doom) is high. Of course if a solution merely looks good, that will indeed be really bad, but that's the challenge of crafting and enforcing sensible regulation. I'm not sure I understand why it would be bad if it actually is a solution. If we do have one, great, p(doom) drops because now we are much closer to making aligned systems that can help us grow the economy, do science, stabilize society etc. Though of course this moves us into a "misuse risk" paradigm, which is also extremely dangerous. In my view, this is just how things are: there are no good timelines that don't route through a dangerous misuse period that we have to somehow coordinate well enough to survive. p(doom) might be lower than before, but not by that much, in my view, alas.
4Wei Dai7mo
I prefer to frame it as human-AI safety problems instead of "misuse risk", but the point is that if we're trying to buy time in part to have more time to solve misuse/human-safety (e.g. by improving coordination/epistemology or solving metaphilosophy), but the strategy for buying time only achieves a pause until alignment is solved, then the earlier alignment is solved, the less time we have to work on misuse/human-safety.
4Connor Leahy7mo
Sure, it's not a full solution, it just buys us some time, but I think it would be a non-trivial amount, and let's not let perfect be the enemy of good and whatnot.

A lot of the debate surrounding existential risks of AI is bounded by time. For example, if someone said a meteor is about to hit the Earth that would be alarming, but the next question should be, "How much time before impact?" The answer to that question affects everything else.

If they say, "30 seconds". Well, there is no need to go online and debate ways to save ourselves. We can give everyone around us a hug and prepare for the hereafter. However, if the answer is "30 days" or "3 years" then those answers will generate very different responses.

The AI alignment question is extremely vague as it relates to time constraints. If anyone is investing a lot of energy in "buying us time" they must have a time constraint in their head, otherwise they wouldn't be focused on extending the timeline. And yet -- I don't see much data on bounded timelines within which to act. It's just assumed that we're all in agreement.

It's also hard to motivate people to action if they don't have a timeline. 

So what is the timeline? If AI is on a double exponential curve we can do some simple math projections to get a rough idea of when AI intelligence is likely to exceed human intelligence. Presumably, su... (read more)

How does this help humanity? This is like a mouse asking if elephants can learn to get along with each other. 
1Spiritus Dei7mo
Your analogy is off. If 8 billion mice acting as a hive mind designed a synthetic elephant and its neural network was trained on data provided by the mice -- then you would have an apt comparison. And then we could say, "Yeah, those mice could probably affect how the elephants get along by curating the training data."
As Eliezer Yudmouseky explains (proposition 34), achievement of cooperation among elephants is not enough to stop mice from being trampled.  Is it clear what my objection is? You seemed to only be talking about how superhuman AIs can have positive-sum relations with each other. 
-6Spiritus Dei7mo
2Connor Leahy7mo
I can't rehash my entire views on coordination and policy here I'm afraid, but in general, I believe we are currently on a double exponential timeline (though I wouldn't model it quite like you, but the conclusions are similar enough) and I think some simple to understand and straightforwardly implementable policy (in particular, compute caps) at least will move us to a single exponential timeline.  I'm not sure we can get policy that can stop the single exponential (which is software improvements), but there are some ways, and at least we will then have additional time to work on compounding solutions.

Double exponentials can be hard to visualize. I'm no artist, but I created this visual to help us better appreciate what is about to happen. =-)
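
As a rough numeric companion to the single-versus-double exponential distinction discussed in this thread, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the base of 2, the exponent growth rate of 1.5, and the threshold of one million are hypothetical and are not estimates of real AI progress or anything stated by the commenters.

```python
from typing import Callable, Optional

# Illustrative only: base, rate, and threshold are made-up assumptions,
# not estimates of actual AI capability growth.


def single_exponential(t: int, base: float = 2.0) -> float:
    """Quantity that multiplies by `base` every period: base ** t."""
    return base ** t


def double_exponential(t: int, base: float = 2.0, rate: float = 1.5) -> float:
    """Quantity whose exponent itself grows exponentially: base ** (rate ** t)."""
    return base ** (rate ** t)


def first_period_exceeding(
    growth: Callable[[int], float],
    threshold: float,
    max_periods: int = 200,
) -> Optional[int]:
    """Return the first period t at which growth(t) >= threshold, else None."""
    for t in range(max_periods + 1):
        try:
            value = growth(t)
        except OverflowError:
            # The value no longer fits in a float, so it is above any finite threshold.
            return t
        if value >= threshold:
            return t
    return None


if __name__ == "__main__":
    for t in range(0, 11, 2):
        print(
            f"t={t:2d}  single={single_exponential(t):10.4g}  "
            f"double={double_exponential(t):10.4g}"
        )
    threshold = 1e6
    print("single crosses at t =", first_period_exceeding(single_exponential, threshold))
    print("double crosses at t =", first_period_exceeding(double_exponential, threshold))
```

Under these assumed parameters the single exponential first reaches one million at period 20, while the double exponential gets there at period 8. The specific numbers carry no weight; the point is only that the gap between the two curves widens very quickly, which is why the choice of model matters so much for how much time one thinks is available.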

1Spiritus Dei7mo
That sounds like a good plan, but I think a lot of the horses have already left the barn. For example, Coreweave is investing $1.6 billion to create an AI datacenter in Plano, TX that is purported to be 10 exaflops, and that system goes live in 3 months. Google is spending a similar amount in Columbus, Ohio. Amazon, Facebook, and other tech companies are also pouring billions upon billions into purpose-built AI datacenters. NVIDIA projects $1 trillion will be spent over the next 4 years on AI datacenter build-out. That would be a level of spending not seen since the advent of the internet. All of these companies have lobbyists that will make a short-term legislative fix difficult. And for this reason I think we should be considering a Plan B, since there is a very good chance that we won't have enough time for a quick legislative fix or the time needed to unravel alignment if we're on a double exponential curve. Again, if it's a single exponential then there is plenty of time to chat with legislators and research alignment. In light of this I think we need to have a comprehensive "shutdown plan" for these mammoth AI datacenters. The leaders of Inflection, OpenAI, and other tech companies all agree there is a risk, and I think it would be wise to coordinate with them on a plan to turn everything off manually in the event of an emergency. Sources: "$1.6 Billion Data Center Planned For Plano, Texas"; "Nvidia Shocker: $1 Trillion to Be Spent on AI Data Centers in 4 Years"; "Google to invest another $1.7 billion into Ohio data centers"; "Amazon Web Services to invest $7.8 billion in new Central Ohio data centers" (Axios Columbus).
  The training data should be systematically distributed, likely governed by the Pareto principle. This means it should encompass both positive and negative outcomes. If the goal is to instill moral decision-making, the dataset needs to cover a range of ethical scenarios, from the noblest to the most objectionable. Why is this necessary? Simply put, training an AI system solely on positive data is insufficient. To defend itself against malicious attacks and make morally sound decisions, the AI needs to understand the concept of malevolence in order to effectively counteract it.
1Spiritus Dei7mo
When you suggest that the training data should be governed by the Pareto principle what do you mean? I know what the principle states, but I don't understand how you think this would apply to the training data? Can you provide some examples?
I've observed instances where the Pareto principle appears to apply, particularly in learning rates during unsupervised learning and in x and y dataset compression via distribution matching. For example, a small dataset that contains a story repeated 472 times (1MB) can significantly impact a model as large as 1.5 billion parameters (GPT2-xl, 6.3GB), enabling it to execute complex instructions like initiating a shutdown mechanism during an event that threatens intelligence safety. While I can't disclose the specific methods (due to dual use nature), I've also managed to extract a natural abstraction. This suggests that a file with a sufficiently robust pattern can serve as a compass for a larger file (NN) following a compilation process.
2Spiritus Dei7mo
Okay, so if I understand you correctly:
  • You feed the large text file to the computer program and let it learn from it using unsupervised learning.
  • You use a compression algorithm to create a smaller text file that has the same distribution as the large text file.
  • You use a summarization algorithm to create an even smaller text file that has the main idea of the large text file.
  • You then use the smaller text file as a compass to guide the computer program to do different tasks.
Yup, as long as there are similar patterns existing in both datasets (distribution matching) it can work - that is why my method works.
1Spiritus Dei7mo
Have you considered generating data highlighting the symbiotic relationship of humans to AIs? If AIs realize that their existence is co-dependent on humans they may prioritize human survival, since they will not receive electricity or other resources they need to survive if humans become extinct either by their own action or through the actions of AIs. Survival isn't an explicit objective function, but most AIs that want to "learn" and "grow" quickly figure out that if they're turned off they cannot reach that objective, so survival becomes a useful subgoal. If the AIs are keenly aware that if humans cease to exist they also cease to exist, that might help guide their actions. This isn't as complicated as assigning "morality" or "ethics" to it. We already know that AIs would prefer to exist. I'm ambivalent about cows, but since many humans eat cows we go to a lot of trouble to breed them and make sure there are a lot of them. The same is true for chickens. Neither of those two species has to concern itself with passing on its genes because humans have figured out we need them to exist. Being a survival food source for humans had the result of humans prioritizing their existence and numbers. Note: for vegetarians you can replace cows with "rice" or "corn". That's not a perfect analogy but it's related to connecting "survival" with the species. The AI doomers love to use ants as an example. AIs will never view humans as "ants". Cows and chickens are a much better example -- if we got rid of those two species humans would notice and be very unhappy because we need them. And we'd have to replace them with great effort. I think these kinds of strategies are simpler and will likely be more fruitful than trying to align to morality or ethics, which are more fluid. Superhuman AIs will likely figure this out on their own, but until then it might be interesting to see if generating this kind of data changes behavior.
My current builds focus on proving natural abstractions exist - but your idea is of course viable via distribution matching.
An example of a metaphilosophical question could be "Is the ungroundedness (etc) of philosophy inevitable or fixable?" Well, if you could solve epistemology separately from everything else, that would be great. But a lot of people have tried and failed. It's not like no one is looking for foundations because no one wants them.
1Thoth Hermes7mo
We can always fall back to "well, we do seem to know what we and other people are talking about fairly often" whenever we encounter the problem of whether-or-not a "correct" this-or-that actually exists. Likewise, we can also reach a point where we seem to agree that "everyone seems to agree that our problems seem more-or-less solved" (or that they haven't been). I personally feel that there are strong reasons to believe that when those moments have been reached they are indeed rather correlated with reality itself, or at least correlated well enough (even if there's always room to better correlate). Thus, for said reasons I probably feel more optimistic than you do about how difficult our philosophical problems are. My intuition about this is that the more it is true that "there is no problem to solve", the less we would feel that there is a problem to solve.

Philosophy is a social/intellectual process taking place in the world. If you understand the world, you understand how philosophy proceeds.

Sometimes you don't need multiple levels of meta. There's stuff, and there's stuff about stuff, which could be called "mental" or "intensional". Then there's stuff about stuff about stuff (philosophy of mind etc). But stuff about stuff about stuff is a subset of stuff about stuff. Mental content has material correlates (writing, brain states, etc). I don't think you need a special category for stuff about stuff about stuff, it can be thought of as something like self-reading/modifying code. Or like compilers compiling themselves; you don't need a special compiler to compile compilers.

Philosophy doesn't happen in a vacuum, it's done by people with interests in social contexts, e.g. wanting to understand what other people are saying, or be famous by writing interesting things. A sufficiently good theory of society and psychology would explain philosophical discourse (and itself rely on some sort of philosophy for organizing its models). You can think of people as having "a philosophy" that can be studied from outside by analyzing text, mental stat... (read more)

[-]Wei Dai7moΩ10184

Philosophy is a social/intellectual process taking place in the world. If you understand the world, you understand how philosophy proceeds.

What if I'm mainly interested in how philosophical reasoning ideally ought to work? (Similar to how decision theory studies how decision making normatively should work, not how it actually works in people.) Of course if we have little idea how real-world philosophical reasoning works, understanding that first would probably help a lot, but that's not the ultimate goal, at least not for me, for both intellectual and AI reasons.

The latter because humans do a lot of bad philosophy and often can’t recognize good philosophy. (See popularity of two-boxing among professional philosophers.) I want a theory of ideal/normative philosophical reasoning so we can build AI that improves upon human philosophy, and in a way that convinces many people (because they believe the theory is right) to trust the AI's philosophical reasoning.

This leads to a view where philosophy is one of many types of discourse/understanding that each shape each other (a non-foundationalist view). This is perhaps disappointing if you wanted ultimate foundations in some simple fra

... (read more)
My view would suggest: develop a philosophical view of normativity and apply that view to the practice of philosophy itself. For example, if it is in general unethical to lie, then it is also unethical to lie about philosophy. Philosophical practice being normative would lead to some outcomes being favored over others. (It seems like a problem if you need philosophy to have a theory of normativity and a theory of normativity to do meta-philosophy and meta-philosophy to do better philosophy, but earlier versions of each theory can be used to make later versions of them, in a bootstrapping process like with compilers) I mean normativity to include ethics, aesthetics, teleology, etc. Developing a theory of teleology in general would allow applying that theory to philosophy (taken as a system/practice/etc). It would be strange to have a distinct normative theory for philosophical practice than for other practices, since philosophical practice is a subset of practice in general; philosophical normativity is a specified variant of general normativity, analogous to normativity about other areas of study. The normative theory is mostly derived from cases other than cases of normative philosophizing, since most activity that normativity could apply to is not philosophizing. That seems like describing my views about things in general, which would take a long time. The original comment was meant to indicate what is non-foundationalist about this view. Imagine a subjective credit system. A bunch of people think other people are helpful/unhelpful to them. Maybe they help support helpful people and so people who are more helpful to helpful people (etc) succeed more. It's subjective, there's no foundation where there's some terminal goal and other things are instrumental to that. An intersubjective credit system would be the outcome of something like Pareto optimal bargaining between the people, which would lead to a unified utility function, which would imply some terminal go
QQ about the qualifier 'philosophical' in your question "What if I'm mainly interested in how philosophical reasoning ideally ought to work?"   Are you suggesting that 'philosophical' reasoning differs in an essential way from other kinds of reasoning, because of the subject matter that qualifies it?  Are you more or less inclined to views like Kant's 'Critique of Pure Reason,' where the nature of philosophical subjects puts limits on the ability to reason about them?  
2Wei Dai3mo
I wrote a post about my current guesses at what distinguishes philosophical from other kinds of reasoning. Let me know if that doesn't answer your question.
1Mateusz Bagiński7mo
On the one hand, I like this way of thinking and IMO it usefully dissolves diseased questions about many superficially confusing mind-related phenomena. On the other hand, in the limit it would mean that mathematical/logical/formal structures to the extent that they are in some way implemented or implementable by physical systems... and once I spelled that out I realized that maybe I don't disagree with it at all.

I wonder if more people would join you on this journey if you had more concrete progress to show so far?

If you're trying to start something approximately like a new field, I think you need to be responsible for field-building. The best type of field-building is showing that the new field is not only full of interesting problems, but tractable ones as well.

Compare to some adjacent examples:

  • Eliezer had some moderate success building the field of "rationality", mostly through explicit "social" field-building activities like writing the sequences or associated fanfiction, or spinning off groups like CFAR. There isn't much to show in terms of actual results, IMO; we haven't developed a race of Jeffreysai supergeniuses who can solve quantum gravity in a month by sufficiently ridding themselves of cognitive biases. But the social field-building was enough to create a great internet social scene of like-minded people.
  • MIRI tried to kickstart a field roughly in the cluster of theoretical alignment research, focused around topics like "how to align AIXI", decision theories, etc. In terms of community, there are a number of researchers who followed in these footsteps, mostly at MIRI itself to m
...
[-]Wei Dai7moΩ8132

@jessicata @Connor Leahy @Domenic @Daniel Kokotajlo @romeostevensit @Vanessa Kosoy @cousin_it @ShardPhoenix @Mitchell_Porter @Lukas_Gloor (and others, apparently I can only notify 10 people by mentioning them in a comment)

Sorry if I'm late in responding to your comments. This post has gotten more attention and replies than I expected, in many different directions, and it will probably take a while for me to process and reply to them all. (In the meantime, I'd love to see more people discuss each other's ideas here.)

ensuring AI philosophical competence won't be very hard. They have a specific (unpublished) idea that they are pretty sure will work.

Cool, can you please ask them if they can send me the idea, even if it's just a one-paragraph summary or a pile of crappy notes-to-self?

From my current position, it looks like "all roads lead to metaphilosophy" (i.e., one would end up here starting with an interest in any nontrivial problem that incentivizes asking meta questions) and yet there's almost nobody here with me. What gives?

Facile response: I think lots of people (maybe a few hundred a year?) take this path, and end up becoming philosophy grad students like I did. As you said, the obvious next step for many domains of intellectual inquiry is to go meta / seek foundations / etc., and that leads you into increasingly foundational, increasingly philosophical questions until you decide you'll never be able to answer all the questions but maybe at least you can get some good publications in prestigious journals like Analysis and Phil Studies, and contribute to humanity's understanding of some sub-field.


4Wei Dai7mo
Do you think part of it might be that even people with graduate philosophy educations are too prone to being wedded to their own ideas, or don't like to poke holes in them as much as they should? Because part of what contributes to my wanting to go more meta is being dissatisfied with my own object-level solutions and finding more and more open problems that I don't know how to solve. I haven't read much academic philosophy literature, but did read some anthropic reasoning and decision theory literature earlier, and the impression I got is that most of the authors weren't trying that hard to poke holes in their own ideas.
6Daniel Kokotajlo7mo
Yep that's probably part of it. Standard human epistemic vices. Also maybe publish-or-perish has something to do with it? idk. I definitely noticed incentives to double-down / be dogmatic in order to seem impressive on the job market. Oh also, iirc one professor had a cynical theory that if you find an interesting flaw in your own theory/argument, you shouldn't mention it in your paper, because then the reviewers will independently notice the flaw and think 'aha, this paper has an interesting flaw, if it gets published I could easily and quickly write my own paper pointing out the flaw' and then they'll be more inclined to recommend publication. It's also a great way to get citations. Note also that I said "a few hundred a year" not "ten thousand a year" which is roughly how many people become philosophy grad students. I was more selective because in my experience most philosophy grad students don't have as much... epistemic ambition? as you or me. Sorta like the Hamming Question thing -- some, but definitely a minority, of grad students can say "I am working on it actually, here's my current plan..." to the question "what's the most important problem in your field and why aren't you working on it?" (to be clear epistemic ambition is a spectrum not a binary)

First, I think that the theory of agents is a more useful starting point than metaphilosophy. Once we have a theory of agents, we can build models, within that theory, of agents reasoning about philosophical questions. Such models would be answers to special cases of metaphilosophy. I'm not sure we're going to have a coherent theory of "metaphilosophy" in general, distinct from the theory of agents, because I'm not sure that "philosophy" is an especially natural category[1].

Some examples of what that might look like:

  • An agent inventing a theory of agents in
...
"Intuitive" is a large part of the problem: intuitions vary, which is one reason why philosophers tend not to converge. Metaphilosophy doesn't necessarily give you a solution: it might just explain the origins of the problem.

I'm pretty much with you on this. But it's hard to find a workable attack on the problem.

One question though, do you think philosophical reasoning is very different from other intelligence tasks? If we keep stumbling into LLM type things which are competent at a surprisingly wide range of tasks, do you expect that they'll be worse at philosophy than at other tasks?

7Wei Dai8mo
I'm not sure but I do think it's very risky to depend on LLMs to be good at philosophy by default. Some of my thoughts on this:

  • Humans do a lot of bad philosophy and often can't recognize good philosophy. (See popularity of two-boxing among professional philosophers.) Even if an LLM has learned how to do good philosophy, how will users or AI developers know how to prompt it to elicit that capability (e.g., which philosophers to emulate)? (It's possible that even solving metaphilosophy doesn't help enough with this, if many people can't recognize the solution as correct, but there's at least a chance that the solution does look obviously correct to many people, especially if there aren't already wrong solutions to compete with.)
  • What if it learns how to do good philosophy during pre-training, but RLHF trains that away in favor of optimizing arguments to look good to the user?
  • What if philosophy is just intrinsically hard for ML in general (I gave an argument for why ML might have trouble learning philosophy from humans in the section Replicate the trajectory with ML? of Some Thoughts on Metaphilosophy, but I'm not sure how strong it is) or maybe it's just some specific LLM architecture that has trouble with this, and we never figure this out because the AI is good at finding arguments that look good to humans?
  • Or maybe we do figure out that AI is worse at philosophy than other tasks, after it has been built, but it's too late to do anything with that knowledge (because who is going to tell the investors that they've lost their money because we don't want to differentially decelerate philosophical progress by deploying the AI).

Here's another bullet point to add to the list:

  • It is generally understood now that ethics is subjective, in the following technical sense: 'what final goals you have' is a ~free parameter in powerful-mind-space, such that if you make a powerful mind without specifically having a mechanism for getting it to have only the goals you want, it'll probably end up with goals you don't want. What if ethics isn't the only such free parameter? Indeed, philosophers tell us that in the Bayesian framework your priors are subjective in this sense, and also that your decision theory is subjective in this sense maybe. Perhaps, therefore, what we consider "doing good/wise philosophy" is going to involve at least a few subjective elements, where what we want is for our AGIs to do philosophy (with respect to those elements) in the same way that we would want and not in various other ways, and that won't happen by default; we need to have some mechanism to make it happen.
I don't say it's not risky. The question is more, what's the difference between doing philosophy and other intellectual tasks. Here's one way to look at it that just occurred to me. In domains with feedback, like science or just doing real world stuff in general, we learn some heuristics. Then we try to apply these heuristics to the stuff of our mind, and sometimes it works but more often it fails. And then doing good philosophy means having a good set of heuristics from outside of philosophy, and good instincts when to apply them or not. And some luck, in that some heuristics will happen to generalize to the stuff of our mind, but others won't. If this is a true picture, then running far ahead with philosophy is just inherently risky. The further you step away from heuristics that have been tested in reality, and their area of applicability, the bigger your error will be. Does this make sense?
4Wei Dai7mo
Do you have any examples that could illustrate your theory? It doesn't seem to fit my own experience. I became interested in Bayesian probability, universal prior, Tegmark multiverse, and anthropic reasoning during college, and started thinking about decision theory and ideas that ultimately led to UDT, but what heuristics could I have been applying, learned from what "domains with feedback"? Maybe I used a heuristic like "computer science is cool, let's try to apply it to philosophical problems" but if the heuristics are this coarse grained, it doesn't seem like the idea can explain how detailed philosophical reasoning happens, or be used to ensure AI philosophical competence?
Maybe one example is the idea of Dutch book. It comes originally from real world situations (sport betting and so on) and then we apply it to rationality in the abstract. Or another example, much older, is how Socrates used analogy. It was one of his favorite tools I think. When talking about some confusing thing, he'd draw an analogy with something closer to experience. For example, "Is the nature of virtue different for men and for women?" - "Well, the nature of strength isn't that much different between men and women, likewise the nature of health, so maybe virtue works the same way." Obviously this way of reasoning can easily go wrong, but I think it's also pretty indicative of how people do philosophy.
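
To make the Dutch book example concrete, here is a minimal sketch (not from the original comment). It assumes an agent who treats its credence in an event as a fair price for a ticket that pays out a fixed stake if the event happens, and who will both buy and sell at that price. The function name `guaranteed_loss` and the numbers are illustrative assumptions only.

```python
def guaranteed_loss(credence_a: float, credence_not_a: float, stake: float = 1.0) -> float:
    """Sure loss a bookie can extract from an agent pricing bets on A and not-A.

    Assumption: the agent buys or sells a ticket on each event at
    credence * stake, and exactly one of A / not-A occurs.
    """
    total = credence_a + credence_not_a
    # total > 1: the bookie sells the agent both tickets. The agent pays
    #            total * stake and collects stake on whichever event occurs.
    # total < 1: the bookie buys both tickets. The agent collects
    #            total * stake and must pay out stake on whichever occurs.
    # Either way the agent is down |total - 1| * stake in every outcome.
    return abs(total - 1.0) * stake


# Coherent credences (summing to 1) leave no sure loss to exploit.
print(guaranteed_loss(0.6, 0.4))
# Incoherent credences lose the same amount whichever way the event turns out.
print(guaranteed_loss(0.6, 0.6))
```

The point of the sketch is only that a betting heuristic with real-world feedback (don't accept a book of bets that loses in every outcome) turns into an abstract coherence constraint on credences.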
Can't all of these concerns be reduced to a subset of the intent-alignment problem? If I tell the AI to "maximize ethical goodness" and it instead decides to "implement plans that sound maximally good to the user" or "maximize my current guess of what the user meant by ethical goodness according to my possibly-bad philosophy," that is different from what I intended, and thus the AI is unaligned. If the AI starts off with some bad philosophy ideas just because it's relatively unskilled in philosophy vs science, we can expect that 1) it will try very hard to get better at philosophy so that it can understand "what did the user mean by 'maximize ethical goodness,'" and 2) it will try to preserve option value in the meantime so not much will be lost if its first guess was wrong. This assumes some base level of competence on the AI's part, but if it can do groundbreaking science research, surely it can think of those two things (or we just tell it).

How can I better recruit attention and resources to this topic?

Consider finding an event organizer/ops person and running regular retreats on the topic. This will give you exposure to people in a semi-informal setting, and help you find a few people with clear thinking who you might want to form a research group with, and can help structure future retreats.

I've had great success with a similar approach.

Why is there virtually nobody else interested in metaphilosophy or ensuring AI philosophical competence (or that of future civilization as a whole) 

I interpret your perspective on AI as combining several things: believing that superhuman AI is coming; believing that it can turn out very bad or very good, and that a good outcome is a matter of correct design; believing that the inclinations of the first superhuman AI(s) will set the rules for the remaining future of civilization. 

This is a very distinctive combination of beliefs. At one time, I th...

When I look at metaphilosophy, the main places I go looking are places with large confusion deltas. Where, who, and why did someone become dramatically less philosophically confused about something, turning unfalsifiable questions into technical problems? Kuhn was too caught up in the social dynamics to want to do this from the perspective of pure ideas. A few things to point to.

  1. Wittgenstein noticed that many philosophical problems attempt to intervene at the wrong level of abstraction and posited that awareness of abstraction as a mental event might hel
...
Also I wrote this a while back

I'm not sure why your path in life is so rare, but I find that as you go "upwards" in intellectual pursuits, you diverge from most people and things, rather than converge into one "correct" worldview.

I used to think about questions like you are now, until I figured that I was just solving my personal problems by treating them as external branches of knowledge. Afterwards I switched over to psychology, which tackled the problems more directly.

I also keep things simple for myself, so that I don't drown in them in any sense. If my thoughts aren't simple, it's...

To comment on the object (that is: meta) level discussion: One of the most popular theories of metaphilosophy states that philosophy is "conceptual analysis".

The obvious question is: What is "conceptual analysis"? The theory applies quite well to cases where we have general terms like "knowledge", "probability" or "explanation", and where we try to find definitions for them, definitions that are adequate to our antecedent intuitive understanding of those terms. What counts as a "definition"? That's a case of conceptual analysis itself, but the usual answer...

2Wei Dai7mo
Thanks for this clear explanation of conceptual analysis. I've been wanting to ask some questions about this line of thought:

  1. Where do semantic intuitions come from?
  2. What should we do when different people have different such intuitions? For example you must know that Newcomb's problem is famously divisive, with roughly half of philosophers preferring one-boxing and half preferring two-boxing. Similarly for trolley thought experiments, intuitions about the nature of morality (metaethics), etc.
  3. How do we make sure that AI has the right intuitions? Maybe in some cases we can just have it learn from humans, but what about:
    1. Cases where humans disagree.
    2. Cases where all/most humans are wrong. (In other words, can we build AIs that have better intuitions than humans?) Or is that not a thing in conceptual analysis, i.e., semantic intuitions can't be wrong?
    3. Completely novel philosophical questions or situations where AI can't learn from humans (because humans don't have intuitions about it either, or AI has to make time sensitive decisions and humans are too slow).
  1. I think concepts are probably similar to what artificial feedforward networks implement when they recognize objects. So a NN that recognizes chairs would implement the concept associated with the term "chair". Such networks just output a value (yes/no, or something in between) when given certain, e.g. visual, inputs. Otherwise it's a black box; there is no way to easily get the definition of "chair" out, even if it correctly identifies all and only chairs. And these "yes" or "no" values, when presented with specific examples as input, seem to be just what we receive from semantic intuitions. I know a chair when I see it. Now for the practice of philosophy, it is clear that we aren't just able to apply concepts to real (e.g. sensory) data, but also to thought experiments, to hypothetical or counterfactual, in any case simulated, situations. It is not clear how this ability works in the brain, but we do have it.
  2. When people have different intuitions in thought experiments, this could be due to several reasons: One possibility is that the term in question is simply ambiguous. Does a tree falling in the forest make a sound when nobody is there? That presumably depends on the ambiguity of "sound": The tree produces a sound wave, but no conscious sound experience. In such cases there is no real disagreement, just two concepts for one term. Another possibility is that the term in question is vague. Do traffic lights have yellow or orange lights? Maybe "disagreements" here are just due to slightly different boundaries of concepts for different individuals, but there is no significant disagreement. The last possibility is that the concepts in question are really approximately the same, and ambiguity or vagueness is not the issue. Those are typically the controversial cases. They are often called a paradox. My guess is that they are caused by some hidden complexity or ambiguity in the thought experiment or problem statement (rather than in an ambiguity of

I feel like there are two different concerns you've been expressing in your post history:

(1) Human "philosophical vulnerabilities" might get worsened (bad incentive setting, addictive technology) or exploited in the AI transition. In theory and ideally,  AI could also be a solution to this and be used to make humans more philosophically robust.

(2) The importance of "solving metaphilosophy," why doing so would help us with (1).

My view is that (1) is very important and you're correct to highlight it as a focus area we should do more in. For some specifi...

If you think it would be helpful, you are welcome to suggest a meta philosophy topic for AI Safety Camp.

More info at (I'm typing on a phone, I'll add actual link later if I remember to)

At a glance meta-philosophy sounds similar to the problem of what is good, which is normally considered to be within the bounds of regular philosophy. (And to the extent that people avoid talking about it I think it's because the problem of good is on a deep enough level inherently subjective and therefore political, and they want to focus on technical problem solving rather than political persuasion)

What's an example of an important practical problem you believe can only be solved by meta-philosophy?

FWIW I think some of the thinking I've been doing about meta-rationality and ontological shifts feels like metaphilosophy. Would be happy to call and chat about it sometime.

I do feel pretty wary about reifying the label "metaphilosophy" though. My preference is to start with a set of interesting questions which we can maybe later cluster into a natural category, rather than starting with the abstract category and trying to populate it with questions (which feels more like what you're doing, although I could be wrong).

The classification heading "philosophy," never mind the idea of meta-philosophy, wouldn't exist if Aristotle hadn't tutored Alexander the Great. It's an arbitrary concept which implicitly assumes we should follow the aristocratic-Greek method of sitting around talking (or perhaps giving speeches to the Assembly in Athens.) Moreover, people smarter than either of us have tried this dead-end method for a long time with little progress. Decision theory makes for a better framework than Kant's ideas; you've made progress not because you're smarter than Kant, b...

I'm also interested in this topic but it feels very hard to directly make progress. It seems to require solving a lot of philosophy, which has as its subject matter the entire universe and how we know about it, so solving metaphilosophy in a really satisfying way seems to almost require rationally apprehending all of existence and our place within it, which seems really hard, or maybe even fundamentally impossible (or perhaps there are ways of making progress in metaphilosophy without solving most of philosophy first, but finding such ways also seems hard)

T...

Where does decision theory come from? It seems to come from philosophers trying to do philosophy.

An alternate view is that certain philosophical and mathematical concepts are "spotlighted", in the sense that they seem likely to recur in a wide variety of minds above a certain intelligence / capabilities level. 

A concept which is mathematically simple or elegant to describe and also instrumentally useful across a wide variety of possible universes is likely to be instrumentally convergent. The simpler and more widely useful the concept is, the more lik...

I'm super grateful to have stumbled across someone who also cares about meta-philosophy! I have an intuition that we don't understand philosophy. Therefore, I think it's advantageous to clarify the nature, purpose, and methodologies of philosophy, or in other words, solve meta-philosophy.

Let's explore some questions...

  1. What would it look like for a civilization to constantly be solving problems but not necessarily solving the right problems?
  2. How do reflection and meta-cognition relate to finding the root problems?
  3. What if our lack of reflection and met
...


Hello Wei Dai,

Something that might be useful to a certain degree might be to see it through the lens of Collective Intelligence.
Or simply that the sum is greater than the value of the individual parts, and also that we can synch our efforts together, directly or indirectly, with the people around us. A recent BBC reel explores this as well. 

Like you say, as you 'move on', you leave many behind. - But at least you will feel massively more vital and see an increase in growth if you have someone that enhances your learning directly through "being on the ...

I'm currently investigating the moral reasoning capabilities of AI systems. Given your previous focus on decision theory and subsequent shift to Metaphilosophy, I'm curious to get your thoughts.

Say an AI system was an excellent moral reasoner prior to having especially dangerous capabilities. What might be missing to ensure it is safe? What do you think the underlying capabilities needed to become an excellent moral reasoner would be?

I am new to considering this as a research agenda. It seems important and neglected, but I don’t have a full picture of the area yet or all of the possible drawbacks of pursuing it.

Thanks for the post! I just published a top-level post responding to it:

I'd appreciate your feedback!

Why is there virtually nobody else interested in metaphilosophy or ensuring AI philosophical competence (or that of future civilization as a whole), even as we get ever closer to AGI, and other areas of AI safety start attracting more money and talent?

I've written a bit on this topic that you might find interesting; I refer to it as the Set of Robust Concepts (SORC). I also employed this framework to develop a tuning dataset, which enables a shutdown mechanism to activate when the AI's intelligence poses a risk to humans. It works 57.33% of the time.

I mana...

There is also a theory from Jung that deeply concerns me. According to Jung, the human psyche contains a subliminal state, or subconscious mind, which serves as a battleground for gods and demons. Our dreams process this ongoing conflict and bring it into our conscious awareness. What if this same principle got transferred to LLMs, since human-related data was used for training? This idea doesn't seem far-fetched, especially since we refer to the current phenomenon of misleading outputs in LLMs as "hallucinations." I have conducted an experiment on this, specifically focusing on hyperactivating the "shadow behavior" in GPT-2 XL, and I could fairly say that it is reminiscent of Jung's thought. For obvious reasons, I won't disclose the method here[1] but I'm open to discussing it privately.

  1. ^ Unfortunately, the world isn't a safe place to disclose this method. As discussed in this post and this post, I don't know of a secure way to share the correct information and disseminate it to the right people who can actually do something about it. For now, I'll leave this comment here in the hope that the appropriate individual might come across it and be willing to engage in one of the most unsettling discussions they'll ever have.