Thomas Kwa

Student at Caltech. Currently trying to get an AI safety inside view.

Rob B's Shortform Feed

Have you seen this ever work for an advance prediction? It seems like you need to be in a better epistemic position than Feynman, which is pretty hard.

Humans Reflecting on HRH

I disagree that every indirect normativity process must approximate this linear process of humans delegating to the future. In the ELK appendix, there is some discussion of a process that allows humans to delegate to the past so as to form a self-consistent chain:

  • Sometimes the delegate Hn will want to delegate to a future version of themselves, but they will realize that the situation they are in is actually not very good (for example, the AI may have no way to get them food for the night), and so they would actually prefer that the AI had made a different decision at some point in the past. We want our AI to take actions now that will help keep us safe in the future, so it’s important to use this kind of data to guide the AI’s behavior. But doing so introduces significant complexities, related to the issues discussed in Appendix: subtle manipulation.

Mark would probably have more to say here.

Thomas Kwa's Shortform

Maybe this is too tired a point, but AI safety really needs exercises: tasks that are interesting, self-contained (not depending on 50 hours of readings), take about 2 hours, have clean solutions, and give people the feel of alignment research.

I found some of the SERI MATS application questions better than Richard Ngo's exercises for this purpose, but there still seems to be significant room for improvement. There is currently nothing smaller than ELK (which takes closer to 50 hours to properly think about and develop a proposal for) that I can point technically minded people to and feel confident that they'll both be engaged and learn something.

Technocracy and the Space Age

SpaceX's superpower is doing things slightly better, which yields substantial gains thanks to the large exponent on the rocket equation.

Agree with the rest of this comment, but I don't think SpaceX's success is due to the rocket equation. Their engine specific impulse is not better than state-of-the-art, and as a consequence their payload fraction to orbit isn't better either. The success is driven by huge reductions in cost per ton at launch of each rocket.
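
For reference, here is a minimal sketch of the rocket equation that the quoted claim alludes to, showing where specific impulse enters the exponent. It assumes an idealized single stage with no gravity or drag losses. The function name `mass_ratio`, the delta-v of 9400 m/s, and the two specific impulse values are illustrative assumptions, not figures for any real vehicle.

```python
import math

G0 = 9.80665  # standard gravity in m/s^2, used to convert specific impulse to exhaust velocity


def mass_ratio(delta_v, isp):
    """Tsiolkovsky rocket equation: initial mass / final mass.

    delta_v is in m/s and isp (specific impulse) is in seconds.
    """
    return math.exp(delta_v / (isp * G0))


# Hypothetical, illustrative numbers only (not data for any real rocket).
delta_v = 9400.0  # assumed round figure for reaching low Earth orbit, in m/s
for isp in (300.0, 330.0):
    # A higher specific impulse gives a smaller required mass ratio.
    print(f"Isp = {isp:.0f} s -> mass ratio = {mass_ratio(delta_v, isp):.1f}")
```

Because specific impulse sits inside the exponential, small engine improvements do compound. But if the engines are no better than the state of the art, this term cannot be what explains the advantage.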

Utility functions and probabilities are entangled

Why can't we use wrong probabilities in real life?

There are various circumstances where I want to "rotate" between probabilities and utilities in real life, in ways that still prescribe the correct decisions. For example, if I have a startup idea and want to maximize my expected profit, I'd be much more emotionally comfortable with thinking it has a 90% chance of making $1 billion than a 10% chance of making $9 billion. So why can't we use wrong probabilities in real life? I think there are three major reasons why this doesn't always work.

  • Idea: with a bounded world-model, messing with your probabilities changes probabilities of other things.
  • Idea: It's just easier to have the correct policy if your probabilities are aligned with frequentist probabilities. Despite the extra degree of freedom, we don't know of any better algorithm that lets us update towards the correct combination of probability and utility, other than getting each right individually.
  • You can no longer use evidence the same way.
    • Suppose you believe your startup is 90% to succeed, rather than the true probability of 10%.
    • A naive Bayesian update 2:1 towards the startup working now brings you to ~95%, not ~18%. Your model gives a smaller change in expected utility than reality.
    • To get the correct expected utility, the new probability of your startup succeeding has to be ~164%. Something has gone wrong when you estimate a probability of 164%. I think you basically have to say the probability of this is upper bounded at 900%, which makes sense because it's not magic, it's just shuffling probabilities around.
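
The arithmetic in the startup example can be checked with a short sketch. The helper `bayes_update` and the variable names are my own; the numbers (10% of $9 billion versus 90% of $1 billion, and a 2:1 update) come from the example above.

```python
from fractions import Fraction


def bayes_update(prior, likelihood_ratio):
    """Posterior probability after multiplying the prior odds by a likelihood ratio."""
    odds = prior / (1 - prior) * likelihood_ratio
    return odds / (1 + odds)


# Payoffs are in billions of dollars.
true_p, true_payoff = Fraction(1, 10), 9    # true belief: 10% chance of $9B
wrong_p, wrong_payoff = Fraction(9, 10), 1  # "rotated" belief: 90% chance of $1B

# Before any evidence, both framings give the same expected profit.
assert true_p * true_payoff == wrong_p * wrong_payoff

# Apply the same 2:1 update in favor of success to each belief.
true_post = bayes_update(true_p, 2)    # 2/11, about 18%
wrong_post = bayes_update(wrong_p, 2)  # 18/19, about 95%

true_eu = true_post * true_payoff      # 18/11, about $1.64B
wrong_eu = wrong_post * wrong_payoff   # 18/19, about $0.95B

# "Probability" the rotated model would need in order to match the true expected utility.
required_p = true_eu / wrong_payoff           # 18/11, about 164%
# Its upper bound, reached when the true probability is 100%.
max_p = Fraction(true_payoff, wrong_payoff)   # 9, i.e. 900%

print(float(true_post), float(wrong_post), float(required_p), float(max_p))
```

Using `Fraction` keeps the results exact, so the 164% figure is visibly 18/11 and not a rounding artifact.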

Superrational Agents Kelly Bet Influence!

Nuno Sempere points out that this was written up in an economics paper in 2012:

David Udell's Shortform

I'm pretty skeptical that sophisticated game theory happens between shards in the brain, and also that coalitions between shards are how value preservation in an AI will happen (rather than there being a single consequentialist shard, or many shards that merge into a consequentialist, or something I haven't thought of).

To the extent that shard theory makes such claims, they seem to be interesting testable predictions.

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

RSA-2048 has not been factored. It was generated by taking two random primes of approximately 1024 bits each and multiplying them together.

A central AI alignment problem: capabilities generalization, and the sharp left turn

Not an answer, but I think of "adversarial coherence" (the agent keeps optimizing for the same utility function even under perturbations by weaker optimizing processes, like how humans will fix errors in building a house or AlphaZero can win a game of Go even when an opponent tries to disrupt its strategy) as a property that training processes could select for. Adversarial coherence and corrigibility are incompatible.
