Thomas Kwa

Was on Vivek Hebbar's team at MIRI, now working with Adrià Garriga-Alonso on various empirical alignment projects.

I'm looking for projects in interpretability, activation engineering, and control/oversight; DM me if you're interested in working with me.

Sequences

Catastrophic Regressional Goodhart

Wiki Contributions

Comments

Also, why do you think that error is heavier tailed than utility?

Goodhart's Law is really common in the real world, and most things only work because we can observe our metrics, see when they stop correlating with what we care about, and iteratively improve them. Also, reward hacking is prevalent in RL, and the hacked reward often reaches very high values.

If the reward model is as smart as the policy and is continually updated with data, maybe we're in a different regime where errors are smaller than utility.

It could be anything, because KL divergence basically does not restrict the expected value of anything heavy-tailed. You could get finite utility and infinite error, or the reverse, or infinity of both, or neither converging, or even infinite utility and negatively infinite error; any of these can happen with arbitrarily low KL divergence.
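To illustrate with a toy construction of my own (assuming the error has a Pareto tail under the base policy): tilt a vanishing amount of probability mass onto the far tail of the error. The KL of the resulting mixture has a closed form and shrinks toward zero while the expected error diverges.

```python
import numpy as np

# Toy check that a KL penalty barely constrains heavy-tailed error.
# Assumptions (mine, for illustration): error V ~ Pareto(alpha, scale 1) under
# the base policy P, and the tilted policy mixes P with P conditioned on the tail:
#   Q_M = (1 - eps) * P + eps * P(. | V > M),   eps = 1 / log(M)^2.
alpha = 1.5                      # Pareto tail index: heavy-tailed but finite mean
E_V = alpha / (alpha - 1)        # E_P[V]

for M in [1e2, 1e4, 1e6, 1e8, 1e10]:
    eps = 1 / np.log(M) ** 2
    t = M ** (-alpha)            # P(V > M)
    # Exact KL(Q_M || P): the density ratio is (1-eps) on {V <= M}
    # and (1-eps) + eps/t on {V > M}.
    kl = (1 - eps) * (1 - t) * np.log(1 - eps) \
         + ((1 - eps) * t + eps) * np.log((1 - eps) + eps / t)
    # Expected error under Q_M, using E_P[V | V > M] = alpha * M / (alpha - 1).
    ev = (1 - eps) * E_V + eps * alpha * M / (alpha - 1)
    print(f"M={M:.0e}  KL={kl:.3f}  E_Q[error]={ev:.3g}")
```

Here the KL cost decays roughly like alpha / log(M), so it can be made arbitrarily small, while the expected error grows roughly like M / (log M)^2.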

To draw any conclusions, you need to assume some joint distribution between the error and utility, and use some model of selection other than optimal policies under a KL divergence penalty or constraint. If they are independent and you think of optimization as conditioning on a minimum threshold for the proxy (utility plus error), we proved last year that you get 0 of whichever has lighter tails and infinity of whichever has heavier tails, unless the tails are very similar. I think the same should hold if you model optimization as best-of-n selection. But the independence assumption is required and pretty unrealistic, and you can't weaken it in any obvious way.
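Here is a quick simulation of the best-of-n version, with distributions chosen by me purely for illustration (independent normal utility and Student-t error):

```python
import numpy as np

# Best-of-n selection on proxy = utility + error, with independent light-tailed
# utility (normal) and heavy-tailed error (Student-t, 3 degrees of freedom).
rng = np.random.default_rng(0)
n_trials = 1_000

for n in [1, 10, 100, 1_000, 10_000]:
    utility = rng.normal(0.0, 1.0, size=(n_trials, n))
    error = rng.standard_t(df=3, size=(n_trials, n))
    pick = np.argmax(utility + error, axis=1)       # option that looks best on the proxy
    rows = np.arange(n_trials)
    print(f"n={n:>6}  true utility of pick: {utility[rows, pick].mean():+.2f}   "
          f"error of pick: {error[rows, pick].mean():+.2f}")
```

As n grows, the error of the selected option keeps climbing while its true utility stops improving and drifts back toward zero, the best-of-n analogue of the threshold result; swapping which variable is heavy-tailed reverses which one wins.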

Realistically I expect that error will be heavy-tailed, and heavier-tailed than utility, by default, so error goes to infinity. But error will not be independent of utility, so the expected utility depends mostly on how good the extremely-high-error outcomes are. The prospect of AIs creating some random outcome whose utility we overestimated by 10 trillion points does not seem especially good, so I think we should not be training AIs to maximize this kind of static heavy-tailed reward function.

If my interpretation is right, the relative dose from humming compared to NO nasal spray is >200 times lower than this post claims, so humming is unlikely to work.

I think 0.11 ppm*hr means that the integral over time of the NO concentration added by the nasal spray is 0.11 ppm*hr. This is consistent with the dose being 130 µl of a dilute liquid. If the NO is produced and reacts away almost immediately, say within 20 seconds, the implied concentration is 19.8 ppm, not 0.88 ppm, which seems far in excess of what is possible through humming. The linked study (Weitzberg et al.) found nasal NO concentrations ranging between 0.08 and 1 ppm depending on subject, with the center (mean log concentration) being 0.252 ppm, not this post's estimate of 2-3 ppm.

If the effectiveness of NO depends on the integral of NO concentration over time, then one would have to hum for 0.436 hours to match one spray of Enovid, and it is unclear if it works like this. It could be that NO needs to reach some threshold concentration >1 ppm to have an antiseptic effect, or that the production of NO in the sinuses drops off after a few minutes of humming. On the other hand, it could be that 0.252 ppm is enough and the high concentrations delivered by Enovid are overkill. In that case humming would work, but so would a 100x lower dose of the nasal spray, which someone should study insofar as they still believe in humming.
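For reference, the unit conversions behind the two numbers above (the 20-second reaction time is the assumption stated earlier, not a measured quantity):

```python
# Checking the implied concentration and humming time; pure unit conversion, no biology.
dose_ppm_hr = 0.11            # stated Enovid dose: integral of added NO concentration
reaction_time_hr = 20 / 3600  # assumed: NO produced and consumed within ~20 seconds
peak_ppm = dose_ppm_hr / reaction_time_hr
print(f"implied concentration if released over 20 s: {peak_ppm:.1f} ppm")   # ~19.8 ppm

humming_ppm = 0.252           # central nasal NO concentration from Weitzberg et al.
hours_to_match = dose_ppm_hr / humming_ppm
print(f"hours of humming to match one spray: {hours_to_match:.2f} hr")      # ~0.44 hr, ~26 min
```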

This is a fair criticism. I changed "impossible" to "difficult".

My main concern is with future forms of RL that are some combination of better at optimization (thus making the model more inner aligned even in situations it never directly sees in training) and possibly opaque to humans such that we cannot just observe outliers in the reward distribution. It is not difficult to imagine that some future kind of internal reinforcement could have these properties; maybe the agent simulates various situations it could be in without stringing them together into a trajectory or something. This seems worth worrying about even though I do not have a particular sense that the field is going in this direction.

Much dumber ideas have turned into excellent papers

Is there an AI transcript/summary?

I started a dialogue with @Alex_Altair a few months ago about the tractability of certain agent foundations problems, especially the agent-like structure problem. I saw it as insufficiently well-defined to make progress on anytime soon. I thought the lack of similar results in easy settings, the fuzziness of the "agent"/"robustly optimizes" concept, and the difficulty of proving things about a program's internals given its behavior all pointed against working on this. But it turned out that we maybe didn't disagree much on tractability; it's just that Alex had somewhat different research taste, and also thought that fundamental problems in agent foundations must be figured out to make it to a good future, so that working on fairly intractable problems can still be necessary. This seemed pretty out of scope for the dialogue, so I likely won't publish it.

Now that this post is out, I feel like I should at least make this known. I don't regret attempting the dialogue, I just wish we had something more interesting to disagree about.

The model ultimately predicts the token two positions after B_def. Do we know why it doesn't also predict the token two after B_doc? This isn't obvious from the diagram; maybe there is some way for the induction head or arg copying head to either behave differently at different positions, or suppress the information from B_doc.

The Brownian motion assumption is rather strong but not required for the conclusion. Consider the stock market, which famously has heavy-tailed, bursty returns. It happens all the time for the S&P 500 to move 1% in a week, but a 10% move in a week only happens a couple of times per decade. I would guess (and we can check) that most weeks have >0.6x of the average per-week variance of the market, which causes the median weekly absolute return to be well over half of what it would be if the market were Brownian motion with the same long-term variance.
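A toy comparison with made-up parameters (not fitted to actual S&P 500 data) shows the effect: even when a small fraction of weeks carries most of the extra variance, the median weekly move is not far below the Brownian benchmark with the same average variance.

```python
import numpy as np

# Bursty weekly returns vs. Brownian motion with the same long-run variance.
# Parameters are illustrative only.
rng = np.random.default_rng(0)
n_weeks = 500_000

calm_vol, burst_vol, p_burst = 0.01, 0.04, 0.05   # 95% calm weeks, 5% bursty weeks
vols = np.where(rng.random(n_weeks) < p_burst, burst_vol, calm_vol)
bursty = rng.normal(0.0, vols)                    # heavy-tailed weekly returns

sigma_bm = np.sqrt(np.mean(vols ** 2))            # Brownian benchmark, same average variance
brownian = rng.normal(0.0, sigma_bm, n_weeks)

med_bursty = np.median(np.abs(bursty))
med_bm = np.median(np.abs(brownian))
print(f"median |weekly move|: bursty {med_bursty:.4f}, Brownian {med_bm:.4f}, "
      f"ratio {med_bursty / med_bm:.2f}")
```

With these parameters, calm weeks carry a bit under 0.6x the average per-week variance, and the median weekly move comes out around 0.8 of the Brownian benchmark, i.e. well over half.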

Also, Lawrence tells me that in Tetlock's studies, superforecasters tend to make updates of 1-2% every week, which actually improves their accuracy.

I talked about this with Lawrence, and we both agree on the following:

  • There are mathematical models under which you should update >=1% in most weeks, and models under which you don't.
  • Brownian motion gives you 1% updates in most weeks (a toy version of this calculation is sketched below the list). In many variants, like stationary processes with skew, stationary processes with moderately heavy tails, or Brownian motion interspersed with big 10%-update events that constitute <50% of your variance, you still have many weeks with 1% updates. Lawrence's model, where you get no evidence until either AI takeover happens or 10 years pass, does not give you 1% updates in most weeks, but this model almost never applies to sufficiently smart agents.
  • Superforecasters empirically make lots of little updates, and rounding their probabilities off to larger, less frequent updates makes their forecasts on near-term questions worse.
  • Thomas thinks that AI is the kind of thing where you can make lots of reasonable small updates frequently. Lawrence is unsure if this is the state that most people should be in, but it seems plausibly true for some people who learn a lot of new things about AI in the average week (especially those who are very good at forecasting). 
  • In practice, humans often update in larger discrete chunks. Part of this is because they only occasionally sit down and consciously turn new information into new numbers, and part of this is because humans have emotional fluctuations which we don't include in our reported p(doom).
  • Making 1% updates in most weeks is not always just irrational emotional fluctuations; it is consistent with how a rational agent would behave under reasonable assumptions. However, we do not recommend that people consciously try to make 1% updates every week, because fixating on individual news articles is not the right way to think about forecasting questions, and it is empirically better to just think about the problem directly rather than obsessing about how many updates you're making.
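Here is the toy version of the Brownian calculation mentioned above, under assumptions of my own choosing: your credence p is a martingale whose remaining variance p(1-p) is spread roughly evenly over the weeks until resolution, with approximately Gaussian weekly steps. The (p, horizon) pairs are hypothetical.

```python
import numpy as np
from scipy.stats import norm

# Fraction of weeks with a >=1% update, assuming a martingale credence whose
# remaining variance p*(1-p) is spread evenly over the weeks until resolution.
for p, years in [(0.1, 10), (0.2, 10), (0.3, 10), (0.3, 5)]:
    weeks = 52 * years
    sigma_week = np.sqrt(p * (1 - p) / weeks)      # typical weekly update size
    frac_big = 2 * norm.sf(0.01 / sigma_week)      # P(|weekly update| >= 1%)
    print(f"p={p:.1f}, horizon={years}y: weekly sd {sigma_week:.3f}, "
          f"weeks with >=1% moves: {frac_big:.0%}")
```

Under these assumptions, someone who is genuinely uncertain and expects resolution within five to ten years sees >=1% moves in roughly half or more of weeks; the fraction shrinks as p approaches 0 or 1 or as the horizon lengthens.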