Wei Dai

If anyone wants to have a voice chat with me about a topic that I'm interested in (see my recent post/comment history to get a sense), please contact me via PM.

My main "claims to fame":

  • Created the first general purpose open source cryptography programming library (Crypto++, 1995), motivated by AI risk and what's now called "defensive acceleration".
  • Published one of the first descriptions of a cryptocurrency based on a distributed public ledger (b-money, 1998), predating Bitcoin.
  • Proposed UDT, combining the ideas of updatelessness, policy selection, and evaluating consequences using logical conditionals.
  • First to argue for pausing AI development based on the technical difficulty of ensuring AI x-safety (SL4 2004, LW 2011).
  • Identified current and future philosophical difficulties as core AI x-safety bottlenecks, potentially insurmountable by human researchers, and advocated for research into metaphilosophy and AI philosophical competence as possible solutions.

My Home Page

Comments

Shortform
Wei Dai · 7h* · 60

To try to explain how I see the difference between philosophy and metaphilosophy:

My definition of philosophy is similar to @MichaelDickens', but I would use "have serviceable explicitly understood methods" instead of "formally studied" or "formalized" to define what isn't philosophy, as the latter could be interpreted as too high a bar, e.g., in the sense of formal systems.

So in my view, philosophy is directly working on various confusing problems (such as "what is the right decision theory") using whatever poorly understood methods we have or can implicitly apply, while metaphilosophy is trying to help solve these problems on a meta level, by better understanding the nature of philosophy, for example:

  1. Try to find out whether there is some unifying quality that ties all of these "philosophical" problems together (besides "lack of serviceable explicitly understood methods").
  2. Try to formalize some part of philosophy, or find explicitly understood methods for solving certain philosophical problems.
  3. Try to formalize all of philosophy wholesale, or explicitly understand what it is that humans are doing (or should be doing, or what AIs should be doing) when it comes to solving problems in general. This may not be possible, i.e., maybe there is no general method that lets us solve every problem given enough time and resources, but it sure seems like humans have some kind of general-purpose (but poorly understood) method that lets us slowly make progress on a wide variety of problems, including ones that are initially very confusing, or where it's hard to even explain what we're asking. We can at least aim to understand what it is that humans are or have been doing, even if it's not a fully general method.
     

Does this make sense?

Shortform
Wei Dai · 10h* · 123

One way to see that philosophy is exceptional is that we have serviceable explicit understandings of math and natural science, even formalizations in the form of axiomatic set theory and Solomonoff Induction, but nothing comparable in the case of philosophy. (Those formalizations are far from ideal or complete, but they still represent a much higher level of understanding than we have for philosophy.)
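
(For concreteness, here is roughly the kind of explicit understanding I have in mind for the induction case; this is just the standard textbook presentation of Solomonoff's prior, included for reference. It predicts by weighting every program that could have generated the observed data, with weight falling off exponentially in program length:

M(x) = \sum_{p \,:\, U(p)\text{ starts with } x} 2^{-\ell(p)}, \qquad P(x_{n+1} \mid x_{1:n}) = \frac{M(x_{1:n} x_{n+1})}{M(x_{1:n})}

where U is a universal monotone Turing machine, the sum ranges over programs p whose output begins with the string x, and \ell(p) is the length of p in bits. Nothing remotely like this exists for philosophy.)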

If you say that philosophy is a (non-natural) science, then I challenge you: come up with something like Solomonoff Induction, but for philosophy.

life lessons from trading
Wei Dai · 1d · 144
  1. Trading is a zero sum game inside a larger positive sum game. Though every trade has a winner and offsetting losers,

This isn't true. Sometimes you're trading against someone with non-valuation motives, i.e., someone buying or selling for a reason besides thinking that the current market price is too low or too high, for example, someone being liquidated due to a margin violation, or the founder of a company wanting to sell in order to diversify. In that case, it makes more sense to think of yourself as providing a service for the other side of the trade, instead of there being a winner and a loser.
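
(A toy illustration with made-up numbers: suppose a fund is margin-called and must immediately sell a position whose fair value is about $100, and the best bid it can reach is yours at $98. In expectation you make roughly $2, but the forced seller isn't straightforwardly a "loser": given its constraint, selling at $98 now may beat every alternative it actually had, so the trade is better modeled as you selling liquidity for a ~$2 fee than as you beating them in a zero-sum bet.)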

markets as a whole direct resources across space and time and help civilizations grow.

Unpriced externalities imply that markets sometimes harm civilizations. I think investment in AGI/ASI is a prime example of this, with x-risk being the unpriced externality.

leogao's Shortform
Wei Dai · 3d · 00

Figuring out the underlying substance behind "philosophy" is a central project of metaphilosophy, and far from solved. My usual starting point is "trying to solve confusing problems for which we don't have established methodologies" (methodologies meaning explicitly understood methods). I think this bakes in the fewest assumptions about what philosophy is or could be, while still capturing the usual meaning of "philosophy", and it explains why certain fields started off as part of philosophy (e.g., science starting off as natural philosophy) and then became "not philosophy" once we figured out methodologies for them.

I think "figure out what are the right concepts to be use, and, use those concepts correctly, across all of relevant-Applied-conceptspace" is the expanded version of what I meant, which maybe feels more likely to be what you mean.

This bakes in "concepts" being the most important thing, but is that right? Must AIs necessarily think about philosophy using "concepts", or is that really the best way to formulate how idealized philosophical reasoning should work?

Is "concepts" even what distinguishes philosophy from non-philosophical problems, or is "concepts" just part of how humans reason about everything, which we latch onto when trying to define or taboo philosophy, because we have nothing else better to latch onto? My current perspective is that what uniquely distinguishes philosophy is their confusing nature and the fact that we have no well-understood methods for solving them (but would of course be happy to hear any other perspectives on this).

Regarding good philosophical taste (or judgment), that is another central mystery of metaphilosophy, which I've been thinking a lot about but don't have any good handles on. It seems like a thing that exists (and is crucial), but it's very hard to see how/why it could exist or what kind of thing it could be.

So anyway, I'm not sure how much help any of this is, when trying to talk to the type of person you mentioned. The above are mostly some cached thoughts I have on this, originally for other purposes.

BTW, good philosophical taste being rare definitely seems like a very important part of the strategic picture, and potentially makes the overall problem insurmountable. My main hopes are: 1) someone makes an unexpected metaphilosophical breakthrough (kind of like Satoshi coming out of nowhere to totally solve distributed currency), and there's enough good philosophical taste in the AI safety community (including at the major labs) to recognize it and incorporate it into AI design; or 2) there's an AI pause during which human intelligence enhancement comes online and selecting for IQ increases the prevalence of good philosophical taste as a side effect (as it seems too much to hope that good philosophical taste would be directly selected for), and/or there's substantial metaphilosophical progress during the pause.

leogao's Shortform
Wei Dai · 3d · 40

Unless you can abstract out the "alignment reasoning and judgement" part of a human's entire brain process (and philosophical reasoning and judgement as part of that) into some kind of explicit understanding of how it works, how do you actually build that into AI without solving uploading (which we're obviously not on track to solve in 2-4 years either)?

put a bunch of smart thoughtful humans in a sim and run it for a long time

Alignment researchers have had this thought for a long time (see e.g. Paul Christiano's A formalization of indirect normativity), but I think all of the practical alignment research programs that this line of thought led to, such as IDA and Debate, are still bottlenecked by a lack of metaphilosophical understanding: without the kind of understanding that lets you build an "alignment/philosophical reasoning checker" (analogous to a proof checker for mathematical reasoning), they're stuck trying to do ML of alignment/philosophical reasoning from human data, which I think is unlikely to work out well.
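
(To illustrate the proof-checker analogy, here is a toy sketch in Python. Everything in it is made up for illustration, with a single inference rule; the point is just that each step of a mathematical argument can be mechanically checked, which is exactly the kind of artifact we don't know how to build for alignment/philosophical reasoning.)

```python
# Toy propositional proof checker with one inference rule (modus ponens).
# Illustrative only: a sketch of what "mechanically checkable reasoning" means,
# not a real proof assistant.

from dataclasses import dataclass

@dataclass(frozen=True)
class Implies:
    antecedent: str   # e.g. "P"
    consequent: str   # e.g. "Q"

def check_proof(axioms, steps):
    """Accept the proof iff every step is an axiom, an already-proved line,
    or follows from earlier lines by modus ponens (from A and A -> B, infer B)."""
    proved = list(axioms)
    for step in steps:
        if step in proved:
            proved.append(step)
            continue
        justified = any(
            isinstance(line, Implies)
            and line.consequent == step
            and line.antecedent in proved
            for line in proved
        )
        if not justified:
            return False  # unjustified step: reject the whole proof
        proved.append(step)
    return True

# From axioms {P, P -> Q, Q -> R}, the chain Q, R checks out; S does not.
axioms = ["P", Implies("P", "Q"), Implies("Q", "R")]
print(check_proof(axioms, ["Q", "R"]))  # True
print(check_proof(axioms, ["S"]))       # False
```

For "is this piece of alignment/philosophical reasoning sound?" we have no analogue of check_proof, which is why ML from human data ends up carrying all the weight.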

leogao's Shortform
Wei Dai · 3d · 33

first, I think it implies that we should try to figure out how to reduce the asymmetry in verifiability between capabilities and alignment

If solving alignment implies solving difficult philosophical problems (and I think it does), then a major bottleneck for verifying alignment will be verifying philosophy, which in turn implies that we should be trying to solve metaphilosophy (i.e., understand the nature of philosophy and philosophical reasoning/judgment). But that is unlikely to be possible within 2-4 years, even with the largest plausible effort, considering the history of analogous fields like metaethics and philosophy of math.

What to do in light of this? Try to verify the rest of alignment, just wing it on the philosophical parts, and hope for the best?

in particular, because ultimately the only way we can make progress on alignment is by relying on whatever process for deciding that research is good that human alignment researchers use in practice (even provably correct stuff has the step where we decide what theorem to prove and give an argument for why that theorem means our approach is sound), there’s an upper bound on the best possible alignment solution that humans could ever have achieved, which is plausibly a lot lower than perfectly solving alignment with certainty.

I kind of want to argue against this, but also am not sure how this fits in with the rest of your argument. Whether or not there's an upper bound that's plausibly a lot lower than perfectly solving alignment with certainty, it doesn't seem to affect your final conclusions?

leogao's Shortform
Wei Dai · 4d · Ω340

Have you seen A Master-Slave Model of Human Preferences? To summarize, I think every human is trying to optimize for status, consciously or subconsciously, including those who otherwise fit your description of the idealized platonic researcher. For example, I'm someone who has (apparently) "chosen ultimate (intellectual) freedom over all else", having done all of my research outside of academia or any formal organizations, but on reflection I think I was striving for status (prestige) as much as anyone; it was just that my subconscious picked a different strategy than most (which eventually proved quite successful).

at the end of the day, what’s even the point of all this?

I think it's probably a result of most humans not being very strategic, or their subconscious strategizers not being very competent. Or zooming out, it's also a consequence of academia being suboptimal as an institution for leveraging humans' status and other motivations to produce valuable research. That in turn is a consequence of our blind spot for recognizing status as an important motivation/influence for every human behavior, which itself is because not explicitly recognizing status motivation is usually better for one's status.

Wei Dai's Shortform
Wei Dai · 4d · 60

I'm still using it for this purpose, but I don't have a good sense of how much worse it is compared to pre-0325. However, I'm definitely very wary of the sycophancy and overall bad judgment. I'm only using the models to point out potential issues I may have overlooked, not e.g. to judge whether a draft is ready to post, or whether some potential issue is real and needs to be fixed. All the models I've tried seem to err a lot in both directions.

Plan 1 and Plan 2
Wei Dai · 5d · 50

But in the end, those plans are not opposed to each other.

I think they are somewhat opposed, due to signaling effects: if you're working only on Plan 2, that signals to the general public or non-experts that you think the risks are manageable/acceptable. And if a lot of people are working on Plan 2, that gives ammunition to people who want to race, or who don't want to pause/stop; they can say, "Look at all these AI safety experts working on solving AI safety. If the risks are really as high as the Plan 1 people say, wouldn't they be calling for a pause/stop too, instead of working on technical problems?"

Reminder: Morality is unsolved
Wei Dai · 5d · 40

I wonder whether, if you framed your concerns in this concrete way, you'd convince more people in alignment to devote attention to these issues? As compared to speaking more abstractly about solving metaethics or metaphilosophy.

I'm not sure. It's hard for me to understand other humans a lot of the time. For example, these concerns (both concrete and abstract) seem really obvious to me, and it mystifies me why so few people share them (at least to the extent of trying to do anything about them, like writing a post to explain the concern, spending time trying to solve the relevant problems, or citing these concerns as another reason for an AI pause).

Also I guess I did already talk about the concrete problem, without bringing up metaethics or metaphilosophy, in this post.

(Of course, you may not think that's a helpful alternative, if you think solving metaethics or metaphilosophy is the main goal, and other concrete issues will just continue to show up in different forms unless we do it.)

I think a lot of people in AI alignment think they already have a solution for metaethics (including Eliezer, who explicitly said this in his metaethics sequence), which is something I'm trying to talk them out of, because assuming a wrong metaethical theory in one's alignment approach is likely to make the concrete issues worse instead of better.

For instance, I'm also concerned as an anti-realist that giving people their "aligned" AIs to do personal reflection will likely go poorly and lead to outcomes we wouldn't want for the sake of those people or for humanity as a collective.

This illustrates the phenomenon I talked about in my draft, where people in AI safety would confidently state "I am X" or "As an X" where X is some controversial meta-ethical position that they shouldn't be very confident in, whereas they're more likely to avoid overconfidence in other areas of philosophy like normative ethics.

I take your point that people who think they've solved meta-ethics can also share my concrete concern about possible catastrophe caused by bad reflection among some or all humans, but as mentioned above, I'm pretty worried that if their assumed solution is wrong, they're likely to contribute to making the problem worse instead of better.

BTW, are you actually a full-on anti-realist, or do you take one of the intermediate positions between realism and anti-realism? (See my old post Six Plausible Meta-Ethical Alternatives for a quick intro/explanation.)

Posts

  • Wei Dai's Shortform (Ω, 2y, 10 karma, 249 comments)
  • Managing risks while trying to do good (2y, 65 karma, 28 comments)
  • AI doing philosophy = AI generating hands? (Ω, 2y, 47 karma, 23 comments)
  • UDT shows that decision theory is more puzzling than ever (Ω, 2y, 226 karma, 56 comments)
  • Meta Questions about Metaphilosophy (Ω, 2y, 163 karma, 80 comments)
  • Why doesn't China (or didn't anyone) encourage/mandate elastomeric respirators to control COVID? (Q, 3y, 34 karma, 15 comments)
  • How to bet against civilizational adequacy? (Q, 3y, 55 karma, 20 comments)
  • AI ethics vs AI alignment (3y, 6 karma, 1 comment)
  • A broad basin of attraction around human values? (Ω, 4y, 120 karma, 18 comments)
  • Morality is Scary (Ω, 4y, 236 karma, 116 comments)

Wikitag Contributions

  • Carl Shulman (2 years ago)
  • Carl Shulman (2 years ago, -35)
  • Human-AI Safety (2 years ago)
  • Roko's Basilisk (7 years ago, +3/-3)
  • Carl Shulman (8 years ago, +2/-2)
  • Updateless Decision Theory (12 years ago, +62)
  • The Hanson-Yudkowsky AI-Foom Debate (13 years ago, +23/-12)
  • Updateless Decision Theory (13 years ago, +172)
  • Signaling (13 years ago, +35)
  • Updateless Decision Theory (14 years ago, +22)