purple fire

Comments

I mean, this is a summary of a talk I didn't see, so I want to reserve some judgement. But at the same time, I just can't imagine being in a situation where I don't want to use the word "signal" because the other person might think I'm talking about cell signal, so instead I bust out "kodo".

In general, I think we should be pretty wary of taking basic ideas and dressing them up in fancy words. It serves no purpose other than in-group signaling.

That makes sense. However, I do see this as being meaningfully different from evals or mech interp in that to make progress you really kind of want access to frontier models and lots of compute/tooling, so for individuals who want to prioritize this approach it might make sense to try to join a safety team at a lab first.

How are these ideas different from "signal" and "noise"?

Do you think building a new organization around this work would be more effective than implementing these ideas at a major lab?

I think this might be a result of o-series being trained in a non-chat setup for most of the CoT RL phase and then being hamfistedly finetuned right at the end so it can go into ChatGPT, which just makes them kind of bad at chat and so o3 gets confused when the conversation has a lot of turns. Retraining it to be good at multi-turn chat with separate reasoning traces would probably just be super expensive and not worth the squeeze. (this is just a guess)

No, that is not the definition of a prior. There are priors that imply a finite expected number of buses, and priors that don't. If you select a prior that doesn't, you can still get a meaningful posterior distribution even if that posterior distribution doesn't have a finite mean.

No, that's not the posterior distribution--clearly, the number of buses cannot be lower than 1546, but that distribution has material probability mass on low integers. I'm not quite sure how you got that equation.

But regardless, I think this shows where we disagree. That prior has mean 2... that's a pretty strong assumption about the distribution of n. If you want to avoid that kind of assumption, you can get posterior distributions but not a posterior expectation.

I'm not disagreeing with that categorically--for many priors the posterior distribution is well defined. But all of those priors carry information (in the information theoretical sense) about the number of buses. If you have an uninformative reference prior, your posterior distribution does not have a mean.

You can see a sketch of this proof if you consider the likelihood of seeing this particular bus for any given n. If there are 1546 buses, there was a 1/1546 chance you saw this one; if there are 1547, a 1/1547 chance; and so on. Those likelihoods form a tail of the harmonic series, which diverges. That divergence is the fundamental issue that makes the mean undefined: with a flat prior the posterior is proportional to those likelihoods and can't even be normalized, and with the 1/n prior the posterior normalizes but its expectation reduces to the same divergent tail.
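
To spell that out (just a sketch, assuming the likelihood of having seen bus 1546 is 1/n and using the two priors discussed in this thread):

$$p(n \mid \text{saw } 1546) \;\propto\; p(n)\cdot\frac{1}{n}, \qquad n \ge 1546.$$

With the flat prior $p(n) \propto 1$, the posterior is $\propto 1/n$ and $\sum_{n\ge 1546} 1/n = \infty$, so it can't even be normalized. With the scale-invariant prior $p(n) \propto 1/n$, the posterior is $\propto 1/n^2$, which does normalize, but the expectation

$$\mathbb{E}[n \mid \text{saw } 1546] \;\propto\; \sum_{n=1546}^{\infty} n\cdot\frac{1}{n^2} \;=\; \sum_{n=1546}^{\infty}\frac{1}{n} \;=\; \infty$$

runs into the same divergent harmonic tail.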

You can't make claims about the posterior without setting at least some conditions on what your prior is--obviously, for some priors the posterior expectation is well-defined. (Trivially, if I already think n=2000 with probability 1, I will still think that after seeing the bus.) But I claim that all such priors make assumptions about the distribution of the possible number of buses. In the uninformative case, your posterior distribution is well-defined (as I said, it's a truncated zeta distribution) but it does not have a finite mean.

The intuition is that if we both saw bus 1546, and you guessed that there were 1546 buses and I guessed that there were 1547, you would be a little more likely to be exactly right, but I would almost certainly be closer to the real number.
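
A quick numeric version of that intuition (just a sketch under the truncated zeta posterior from my other comment; the cutoff on the infinite sum is arbitrary):

```python
# Posterior p(n) ∝ 1/n^2 for n >= 1546, truncated far out to approximate the tail.
N_SEEN = 1546
CUTOFF = 2_000_000

norm = sum(1.0 / n**2 for n in range(N_SEEN, CUTOFF))

p_1546_exact = (1.0 / N_SEEN**2) / norm   # the guess of 1546 is exactly right
p_1547_closer = 1.0 - p_1546_exact        # true n >= 1547, so 1547 is the closer guess

print(f"P(1546 exactly right) ≈ {p_1546_exact:.3%}")   # about 0.06%
print(f"P(1547 closer)        ≈ {p_1547_closer:.3%}")  # about 99.94%
```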

The Bayesian update isn't generally well-defined because you get a divergent mean. Your implicit prior is 1/n, which is an improper prior. This is fine for deriving a posterior distribution, which in this case is a truncated zeta distribution with s=2 and k=1546, and a posterior median, which in this case happens to be about 3,100 buses. But the posterior mean does not exist.
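
Here's a rough numeric check of those numbers (a sketch only; the 1/n prior and 1/n likelihood are taken as given, and the summation cutoffs are arbitrary):

```python
K = 1546  # the observed bus number, which lower-bounds the bus count

def tail_sum(s, lo, hi):
    """Sum of 1/n**s for n in [lo, hi)."""
    return sum(1.0 / n**s for n in range(lo, hi))

norm = tail_sum(2, K, 5_000_000)  # normalizer of the truncated zeta posterior

# Posterior median: smallest m whose cumulative mass reaches half the total.
mass, m = 0.0, K
while mass < 0.5 * norm:
    mass += 1.0 / m**2
    m += 1
print("posterior median ≈", m - 1)  # lands a bit above 3,000, i.e. roughly 3,100

# "Mean" partial sums: sum of n * (1/n^2) is a harmonic tail, so the running
# total keeps growing with the cutoff instead of converging to anything.
for cutoff in (10_000, 100_000, 1_000_000):
    print(cutoff, round(tail_sum(1, K, cutoff) / norm))
```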

Yeah, I was a little hand-wavy with that, sorry. You're right about Sharpe scaling with (the square root of) the number of trades, but that gets dampened a lot by pairwise correlation between trades and by correlated leverage/financing costs. I worked at a hedge fund that had a lot of siloed strategies ("pods"), and each one targeted a Sharpe of ~2, but then they would get sliced and diced into end products for investors with different risk levels, etc. (I never worked on that side of the firm and tbh do not know all the details of how that works.)
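
For reference, the usual back-of-the-envelope version of that dampening, assuming N equal-weight bets with the same individual Sharpe and a common pairwise correlation ρ (obviously a big simplification):

$$\text{SR}_{\text{portfolio}} \;=\; \frac{\text{SR}_{\text{single}}\,\sqrt{N}}{\sqrt{1 + (N-1)\rho}},$$

which recovers the √N scaling at ρ = 0 and caps out near $\text{SR}_{\text{single}}/\sqrt{\rho}$ as N grows, no matter how many more trades you add.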

In practice, you actually care about incremental Sharpe, which you get by taking the Sharpe of your strategy minus the covariance-adjusted Sharpe of your existing portfolio, over the norm of the covariance matrix, and you want to keep that pro-forma portfolio Sharpe above 2. It would be awesome to have 100 uncorrelated bets each with a 2 Sharpe, but in practice our strategy was so specific that a lot of our trades were still pretty correlated and created incremental transaction and hedging costs.

And beyond being directly correlated (if position A goes up, position B will also go up), the structure of a pod means you introduce auto-correlation (if our financing rate goes up, all of our positions will have reduced Sharpe, or if one position blows out we will have to exit multiple other trades suboptimally), so you end up getting pretty high covariance matrix norms.

The cutoff for a standalone position is maybe like 1.2-1.5 Sharpe, if that position is small enough relative to the overall portfolio that you can just think about marginal Sharpe as a first-order effect. So a little lower, but still pretty non-trivial for an individual investor picking single stocks to achieve, and in the case of this discussion--a pretty concentrated, highly auto-correlated, long-only equity portfolio--I don't think it's unreasonable to just round off individual thresholds to 2.0 even if it's more like 1.8 or 1.9.
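
As a rough illustration of why correlation eats the incremental benefit, here's the textbook appraisal-ratio version of the calculation (a sketch only, not the exact pro-forma calculation we used, and the numbers are made up):

```python
import math

def combined_max_sharpe(sr_port, sr_new, rho):
    """Best achievable Sharpe after optimally sizing a new strategy alongside an
    existing portfolio, using SR_combined^2 = SR_port^2 + appraisal_ratio^2."""
    appraisal = (sr_new - rho * sr_port) / math.sqrt(1.0 - rho**2)
    return math.sqrt(sr_port**2 + appraisal**2)

# Illustrative numbers: a 2.0-Sharpe book plus a candidate 1.3-Sharpe trade.
for rho in (0.0, 0.3, 0.6):
    print(rho, round(combined_max_sharpe(2.0, 1.3, rho), 3))
# Output goes from ~2.39 at rho=0 down to ~2.00 at rho=0.6: the more the new
# trade moves with the existing book, the less it adds, and the incremental
# benefit vanishes as rho approaches sr_new / sr_port.
```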
