Thomas Kwa

Just left Vivek Hebbar's team at MIRI, now doing various empirical alignment projects.

I'm looking for projects in interpretability, activation engineering, and control/oversight; DM me if you're interested in working with me.

Sequences

Catastrophic Regressional Goodhart

Comments

Oh, I actually 70% agree with this. I think there's an important distinction between legibility to laypeople vs legibility to other domain experts. Let me lay out my beliefs:

  • In the modern history of fields you mentioned, more than 70% of discoveries are made by people trying to discover the thing, rather than serendipitously.
  • Other experts in the field, if truth-seeking, are able to understand the theory of change behind the research direction without investing huge amounts of time.
  • In most fields, experts, and superforecasters informed by expert commentary, will have fairly strong beliefs about which approaches to a problem will succeed. The person working on an approach will usually have less than 1 bit of advantage over the experts about whether their framework will succeed, unless they have private information (e.g. they already did the crucial experiment). This is my weakest belief and I could probably be convinced otherwise just by anecdotes.
    • The successful researchers might be confident they will succeed, but unsuccessful ones could be almost as confident on average. So it's not that the research is illegible, it's just genuinely hard to predict who will succeed.
  • People often work on different approaches to the problem even if they can predict which ones will work. This could be due to irrationality, other incentives, diminishing returns to each approach, comparative advantage, etc.
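For concreteness, "less than 1 bit of advantage" can be read as a bound on the difference in log-odds between the researcher's credence and the experts'. A minimal sketch (the 20% and 1/3 credences are made-up illustrative numbers, not from the comment above):

```python
from math import log2

def bits_of_advantage(p_expert: float, p_insider: float) -> float:
    """Difference in base-2 log-odds between two credences in the same claim."""
    odds = lambda p: p / (1 - p)
    return log2(odds(p_insider)) - log2(odds(p_expert))

# If experts give an approach 20% and the researcher gives it 1/3,
# the researcher is claiming exactly one bit of advantage:
# odds go from 1:4 to 1:2, a factor of 2 = 2^1.
print(bits_of_advantage(0.20, 1/3))  # ≈ 1.0
```

On this reading, the claim is that a researcher's private optimism about their own framework rarely doubles the odds relative to the informed outside view.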

If research were illegible to other domain experts, I think you would not really get Kuhnian paradigms, which I am pretty confident exist. Paradigm shifts mostly come from an approach's track record, though, so maybe that doesn't count as researchers having an inside view of others' work.

Novel research is inherently illegible.

I'm pretty skeptical of this and think we need data to back up such a claim. There might be a bias here, though: serendipitous discoveries make better stories, so they get more attention. Has anyone gone through, say, the list of all Nobel laureates and checked whether their research would have seemed promising before it produced results?

There is a box which contains money iff the front and back are painted the same color. Each side is independently 30% to be blue, and 70% to be red. You observe that the front is blue, and your friend observes that the back is red.
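The probabilities implied by this setup can be checked by enumeration. A minimal sketch (the 30%/70% numbers are from the puzzle; the function names are my own):

```python
from itertools import product

# Each side is independently blue with p = 0.3, red with p = 0.7.
p = {"blue": 0.3, "red": 0.7}

def prob_money(front_obs=None, back_obs=None):
    """P(front == back), conditioned on whichever sides were observed."""
    num = den = 0.0
    for front, back in product(p, repeat=2):
        if front_obs and front != front_obs:
            continue
        if back_obs and back != back_obs:
            continue
        w = p[front] * p[back]
        den += w
        if front == back:
            num += w
    return num / den

print(prob_money())                  # prior: 0.09 + 0.49 = 0.58
print(prob_money(front_obs="blue"))  # your view: P(back blue) = 0.3
print(prob_money(back_obs="red"))    # friend's view: P(front red) = 0.7
print(prob_money("blue", "red"))     # pooled: sides differ, so 0.0
```

Note that naively combining the two individual credences (0.3 and 0.7) gives the wrong answer; pooling the raw observations shows the sides differ, so the box is certainly empty.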

Who is Adam? Is this FAR AI CEO Adam Gleave?

Rather, I am looking for a discussion of evidence that the LLM's internal "true" motivation or reasoning system is very different from a human's, despite the human-like output, and that in outlying environmental conditions, very different from the training environment, it will behave very differently. A good argument might analyze bits of weird inhuman behavior to try to infer the internal model.

I think we do not understand enough about either the LLM's true algorithms or humans' to make such arguments, except for basic observations like the fact that humans have non-language recurrent state which many LLMs lack.

In practice it is not as bad as uniform volume throughout the day would be for two reasons:

  • Market-makers narrow spreads, smoothing out the predictable price fluctuations that would otherwise create low-value-exchange pairings. They do extract some profits in the process.
  • Volume is much higher near the open and close.

I would guess that any improvement from this scheme would manifest as tighter effective spreads and a reduction in the profits of HFT firms (which seem to provide less value to society than other financial firms do).

OP was a professional trader and definitely (98%) agrees with us. I think the (edit: former) title is pretty misleading and gives people the impression that all trades are bad though.

I think habryka's explanation of this post's idea of adverse selection is basically correct:

I think all of them follow a pattern of "there is a naive baseline expectation, where you treat other people's maps as a black box, that suggests a deal is good, and a more sophisticated expectation, which involves modeling the details of other people's maps, that suggests it's bad"

In example #8, you naively think that a market order will clear at slightly more than the going rate for a field, which it will in a normal competitive market. But in this case, you let your counterparty decide the price, and they're incentivized to make it maximally bad for you.

My guess is that some later post in the sequence will argue why this broad definition of adverse selection makes sense.

Or would you have thought, "I wonder what that trader selling Avant! for $2 knows that I don't?"

The correct move is to think this, but correctly conclude you have the information advantage and keep buying. Adverse selection is extremely prevalent in public markets so you need to always be thinking about it, and as a professional trader you can and must model it well enough to not be scared off of good trades.

EA definitely has more controversies. Doesn't mean it's worse for the world.
