Wei Dai

I think I need more practice talking with people in real time (about intellectual topics). (I've gotten much more used to text chat/comments, which I like because it puts less time pressure on me to think and respond quickly, but I feel like I now incur a large cost due to excessively shying away from talking to people, hence the desire for practice.) If anyone wants to have a voice chat with me about a topic that I'm interested in (see my recent post/comment history to get a sense), please contact me via PM.

www.weidai.com

Comments

  1. You can also make an argument for not taking over the world on consequentialist grounds, which is that nobody should trust themselves to not be corrupted by that much power. (Seems a bit strange that you only talk about the non-consequentialist arguments in footnote 1.)
  2. I wish this post also mentioned the downsides of decentralized or less centralized AI (such as externalities and race dynamics reducing investment in safety, and potential offense/defense imbalances, which in my mind are just as worrisome as the downsides of centralized AI), even if you don't focus on them for understandable reasons. To say nothing risks giving the impression that you're not worried about them at all, and that people should just straightforwardly push for decentralized AI to prevent the centralized outcome that many fear.
Wei Dai

I'm actually pretty confused about what they did exactly. From the Safety section of Learning to Reason with LLMs:

Chain of thought reasoning provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles. By teaching the model our safety rules and how to reason about them in context, we found evidence of reasoning capability directly benefiting model robustness: o1-preview achieved substantially improved performance on key jailbreak evaluations and our hardest internal benchmarks for evaluating our model's safety refusal boundaries. We believe that using a chain of thought offers significant advances for safety and alignment because (1) it enables us to observe the model thinking in a legible way, and (2) the model reasoning about safety rules is more robust to out-of-distribution scenarios.

And from Hiding the Chains of Thought:

For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.

These two sections seem to contradict each other but I can also think of ways to interpret them to be more consistent. (Maybe "don't train any policy compliance or user preferences onto the chain of thought" is a potential future plan, not what they already did. Maybe they taught the model to reason about safety rules but not to obey them in the chain of thought itself.)

Does anyone know more details about this, and also about the reinforcement learning that was used to train o1 (what did they use as a reward signal, etc.)? I'm interested to understand how alignment in practice differs from theory (e.g. IDA), or if OpenAI came up with a different theory, what its current alignment theory is.

Wei Dai

If the other player is a stone with “Threat” written on it, you should do the same thing, even if it looks like the stone’s behavior doesn’t depend on what you’ll do in response. Responding to actions and ignoring the internals when threatened means you’ll get a lot fewer stones thrown at you.

In order to "do the same thing" you either need the other player's payoffs, or, according to the next section, "If you receive a threat and know nothing about the other agent's payoffs, simply don't give in to the threat!" So if all you see is a stone, then presumably you don't know the other agent's payoffs, and presumably "do the same thing" means "don't give in".

But that doesn't make sense, because suppose you're driving and a boulder suddenly rolls towards you. You're going to "give in" and swerve, right? What if it's an animal running towards you, and you know it's too dumb to do LDT-like reasoning or to model your thoughts in its head? You're also going to swerve, right? So there's still a puzzle here: agents have an incentive to make themselves look like a stone (i.e., part of nature, or not an agent), or to never use LDT or model others in any detail.
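
To make the puzzle concrete, here is a minimal toy model (my own sketch, not from the post; the payoff numbers are arbitrary assumptions):

```python
# A toy version of the threat game. The "target" commits to a policy in advance;
# the "threatener" is either a strategic agent (threatens only when it expects
# this to pay off, given the target's policy) or a "stone" whose behavior does
# not depend on the target's policy at all. All payoff numbers are made up.

NO_THREAT      = (0, 0)     # (target payoff, threatener payoff): nothing happens
TARGET_YIELDS  = (-2, 1)    # target gives in / swerves: small cost, threatener gains
THREAT_CARRIED = (-10, -1)  # threat executed: disaster for target, costly for threatener

def outcome(target_policy, threatener_kind):
    """target_policy: 'give_in' or 'refuse'; threatener_kind: 'agent' or 'stone'."""
    if threatener_kind == "agent":
        # A strategic threatener predicts the target's policy and threatens
        # only if doing so improves its own payoff.
        return TARGET_YIELDS if target_policy == "give_in" else NO_THREAT
    # A stone (the boulder, the too-dumb animal) "threatens" unconditionally.
    return TARGET_YIELDS if target_policy == "give_in" else THREAT_CARRIED

for policy in ("refuse", "give_in"):
    for kind in ("agent", "stone"):
        print(f"{policy:>7} vs {kind:<5} -> (target, threatener) = {outcome(policy, kind)}")
```

Under these made-up numbers, committing to "refuse" does best against strategic threateners (the threat is never made) but worst against a stone, which is exactly the incentive to look like a stone described above.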

Another problem is, do you know how to formulate/formalize a version of LDT so that we can mathematically derive the game outcomes that you suggest here?

Wei Dai

#1 has obviously happened. Nord Stream 1 was blown up within weeks of my OP, and AFAIK Russia hasn't substantially expanded its other energy exports. Less sure about #2 and #3, as it's hard to find post-2022 energy statistics. My sense is that the answers are probably "yes", but I don't know how to back that up without doing a lot of research.

However, coal stocks (BTU, AMR, CEIX, ARCH being the main pure-play US coal stocks) haven't done as well as I had expected (the basket is roughly flat from Aug 2022 to today), for two other reasons: (A) there have been two mild winters that greatly reduced winter energy demand and caused thermal coal prices to crash (most people seem to attribute this to global warming caused by maritime sulfur regulations); (B) Chinese real-estate problems caused metallurgical coal prices to also crash in recent months.

My general lesson from this is that long-term investing is harder than I thought. Short-term trading can still be profitable, but it can't match the opportunities available back in 2020-21, when COVID checks drove the markets totally wild. So I'm spending a lot less time investing/trading these days.

Wei Dai

Unfortunately this ignores 3 major issues:

  1. race dynamics (also pointed out by Akash)
  2. human safety problems - given that alignment is defined "in the narrow sense of making sure AI developers can confidently steer the behavior of the AI systems they deploy", why should we believe that AI developers, and/or parts of governments that can coerce AI developers, will steer the AI systems in a good direction? E.g., that they are benevolent to begin with, and won't be corrupted by power or persuasion or distributional shift?
  3. philosophical errors or bottlenecks - there's a single mention of "wisdom" at the end, but nothing about how to achieve/ensure the unprecedented amount of wisdom or speed of philosophical progress that would be needed to navigate something this novel, complex, and momentous. The OP seems to suggest punting such problems to "outside consensus" or "institutions or processes", with apparently no thought towards whether such consensus/institutions/processes would be up to the task or what AI developers can do to help (e.g., by increasing AI philosophical competence).

Like others I also applaud Sam for writing this, but the actual content makes me more worried, as it's evidence that AI developers are not thinking seriously about some major risks and risk factors.

Wei Dai

I think there’s a steady stream of philosophy getting interested in various questions in metaphilosophy

Thanks for this info and the references. I guess by "metaphilosophy" I meant something more meta than metaethics or metaepistemology, i.e., a field that tries to understand all philosophical reasoning in some unified or systematic way, including the reasoning used in metaethics and metaepistemology, and in metaphilosophy itself. (This may differ from standard academic terminology, in which case please let me know if there's a preferred term for the concept I'm pointing at.) My reasoning is that metaethics itself seems like a hard problem that has defied solution for centuries, so why stop there instead of going even more meta?

Sorry for being unclear, I meant that calling for a pause seems useless because it won’t happen.

I think you (and other philosophers) may be too certain that a pause won't happen, but I'm not sure I can convince you (at least not easily). What about calling for it in a low-cost way, e.g., instead of doing something high-profile like an open letter (with perceived high opportunity costs), just writing a blog post or even a tweet saying that you wish for an AI pause, because ...? What if many people privately prefer an AI pause, but nobody knows because nobody says anything? What if, by keeping silent, you're helping to keep society in a highly suboptimal equilibrium?

I think there are also good arguments for doing something like this from a deontological or contractualist perspective (i.e. you have a duty/obligation to honestly and publicly report your beliefs on important matters related to your specialization), which sidestep the "opportunity cost" issue, but I'm not sure if you're open to that kind of argument. I think they should have some weight given moral uncertainty.

Wei Dai

Sadly, I don't have any really good answers for you.

Thanks, it's actually very interesting and important information.

I don't know of specific cases, but for example I think it is quite common for people to start studying meta-ethics because of frustration at finding answers to questions in normative ethics.

I've noticed (and stated in the OP) that normative ethics seems to be an exception, where it's common to express uncertainty/confusion/difficulty. But I think, from both my inside and outside views, that this should be common in most philosophical fields (because, e.g., we've been trying to solve their problems for centuries without coming up with broadly convincing solutions), and that there should be a steady stream of all kinds of philosophers going up the meta ladder all the way to metaphilosophy. It recently dawned on me that this doesn't seem to be the case.

Many of the philosophers I know who work on AI safety would love for there to be an AI pause, in part because they think alignment is very difficult. But I don't know if any of us have explicitly called for an AI pause, in part because it seems useless, but may have opportunity cost.

What seems useless, calling for an AI pause, or the AI pause itself? I have trouble figuring this out, because if it's "calling for an AI pause", what is the opportunity cost (it seems easy enough to write or sign an open letter), and if it's "the AI pause itself", then "seems useless" contradicts "would love". In either case, this seems extremely important to openly discuss/debate! Can you please ask these philosophers to share their views on this on LW (or their preferred venue), and share your own views?

Wei Dai

Thank you for your view from inside academia. Some questions to help me get a better sense of what you see:

  1. Do you know any philosophers who switched from non-meta-philosophy to metaphilosophy because they became convinced that the problems they were trying to solve are too hard and they needed to develop a better understanding of philosophical reasoning, or better intellectual tools in general? (Or what's the closest to this that you're aware of?)
  2. Do you know any philosophers who have expressed an interest in ensuring that future AIs will be philosophically competent, or a desire/excitement for supercompetent AI philosophers? (I know of 1 or 2 private expressions of the former, but they haven't translated into action yet.)
  3. Do you know any philosophers who are worried that philosophical problems involved in AI alignment/safety may be too hard to solve in time, and have called for something like an AI pause to give humanity more time to solve them? (Even philosophers who have expressed a concern about AI x-risk or are working on AI safety have not taken a position like this, AFAIK.)
  4. How often have you seen philosophers say something like "Upon further reflection, my proposed solution to problem X has many problems/issues, I'm no longer confident it's the right approach and now think X is much harder than I originally thought."

I would also appreciate any links, citations, or quotes (if from personal but sharable communications) on these.

These are all things I've said or done due to my high estimate of philosophical difficulty, but have not (or only rarely) seen among academic philosophers, at least from my casual observation outside academia. It's also possible that we disagree on what estimate of philosophical difficulty is appropriate (such that, for example, you don't think philosophers should often say or do these things), which would also be interesting to know.

Wei Dai

My understanding of what happened (from reading this) is that you wanted to explore in a new direction very different from the then preferred approach of the AF team, but couldn't convince them (or someone else) to join you. To me this doesn't clearly have much to do with streetlighting, and my current guess is that it was probably reasonable of them to not be convinced. It was also perfectly reasonable of you to want to explore a different approach, but it seems unreasonable to claim without giving any details that it would have produced better results if only they had listened to you. (I mean you can claim this, but why should I believe you?)

If you disagree (and want to explain more), maybe you could either explain the analogy more fully (e.g., what corresponds to the streetlight, why should I believe that they overexplored the lighted area, what made you able to "see in the dark" to pick out a more promising search area, or did you just generally want to explore the dark more) and/or try to convince me on the object level / inside view that your approach is or was more promising?

(Also perfectly fine to stop here if you want. I'm pretty curious on both the object and meta levels about your thoughts on AF, but you may not have wanted to get into such a deep discussion when you first joined this thread.)

Wei Dai

(Upvoted since your questions seem reasonable and I'm not sure why you got downvoted.)

I see two ways to achieve some justifiable confidence in philosophical answers produced by superintelligent AI:

  1. Solve metaphilosophy well enough that we achieve an understanding of philosophical reasoning on par with mathematical reasoning, and have ideas/systems analogous to formal proofs and mechanical proof checkers that we can use to check the ASI's arguments (see the sketch after this list for what that standard looks like in mathematics).
  2. We increase our own intelligence and philosophical competence until we can verify the ASI's reasoning ourselves.
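
For contrast, here is what that standard already looks like in mathematics: a minimal Lean 4 sketch (my own illustration, not from the comment), where the kernel either accepts the proof or rejects it, with no judgment calls involved. Option 1 hopes for something analogous for philosophical arguments.

```lean
-- A mechanically checkable argument: the Lean kernel verifies this proof term
-- automatically, with no human judgment involved.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```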