I think KL/entropy regularization is usually used to prevent mode collapse partly because it has nice theoretical properties. In particular, it is easy to reason about the optimal policy for the regularized objective - see for example the analysis in the paper Equivalence Between Policy Gradients and Soft Q-Learning.
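To illustrate why the optimum is easy to reason about, here is the standard one-step (bandit) case. The notation is my own rather than the paper's: r is the reward, \pi_0 is a reference policy, and \beta > 0 is the regularization strength, all assumed for illustration.

```latex
\max_{\pi} \; \mathbb{E}_{a \sim \pi}\left[ r(a) \right] - \beta \, D_{\mathrm{KL}}\left( \pi \,\|\, \pi_0 \right)
\quad \Longrightarrow \quad
\pi^*(a) = \frac{\pi_0(a) \exp\left( r(a) / \beta \right)}{\sum_{a'} \pi_0(a') \exp\left( r(a') / \beta \right)}
```

Taking \pi_0 to be uniform recovers entropy regularization up to an additive constant. The optimal policy keeps positive probability on every action that \pi_0 supports, which is one way to see why this kind of regularization resists mode collapse.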
Nevertheless, action-dependent baselines do appear in the literature, although the story is a bit confusing. This is my understanding of it from some old notes:
We will do our best to fairly consider all applications, but realistically there is probably a small advantage to applying earlier. This is simply because there is a limit to how quickly we can grow the organization, so if hiring goes better than expected then it will be longer before we can take on even more people. With that being said, we do not have a fixed number of positions that we are hiring for; rather, we plan to vary the number of hires we make based on the strength of the applications we receive. Moreover, if we were unable to hire someone due to capacity constraints, we would very likely be interested in hiring them at a later date. For these reasons, I think the advantage to applying earlier is a fairly small consideration overall, and it sounds like it would make more sense for you to apply whenever you are comfortable.
The questions on the take-home test vary in difficulty but are generally easier than olympiad problems, and should be accessible to graduates with relevant background. However, it is important to note that we are ultimately interested in research ability rather than the ability to solve self-contained problems under timed conditions. So although the take-home test forms part of our assessment, we also look at other signals such as research track-record (while recognizing that assessing research ability is unfortunately very hard).
(Note: I am talking about the current version of the test; it's possible that the difficulty will change as we refine our interview process.)
I think the kind of mathematical problem solving we're engaged in is common across theoretical physics (although this is just my impression as a non-physicist). I've noticed that some specific topics that have come up (such as Gaussian integrals and permanents) also crop up in quantum field theory, but I don't think that's a strong reason to prefer that background particularly. Broad areas that often come up include probability theory, computational complexity and discrete math, but it's not necessary to have a lot of experience in those areas, only to be able to pick things up from them as needed.
It's not quite this simple: the same issue arises if every PSD completion of the known-diagonal minor has zero determinant (e.g. ((?, 1, 2), (1, 1, 1), (2, 1, 1))). But I think in that case making the remaining diagonal entries large enough still makes the eigenvalues at least −ε, which is good enough.
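A quick numerical sanity check of the example, as a minimal sketch assuming NumPy is available (the function name `min_eigenvalue` is my own). The determinant of this matrix works out to −1 for every value of the unknown entry, so no choice makes it PSD, but the smallest eigenvalue should approach zero from below as the unknown entry grows.

```python
import numpy as np


def min_eigenvalue(x: float) -> float:
    """Smallest eigenvalue of the example matrix with the unknown entry set to x."""
    m = np.array([
        [x, 1.0, 2.0],
        [1.0, 1.0, 1.0],
        [2.0, 1.0, 1.0],
    ])
    # eigvalsh handles symmetric matrices and returns eigenvalues in ascending order.
    return float(np.linalg.eigvalsh(m)[0])


if __name__ == "__main__":
    for x in (1e1, 1e2, 1e3, 1e4):
        print(f"x = {x:>8.0f}  min eigenvalue = {min_eigenvalue(x):.2e}")
```

For any target ε > 0, this suggests picking the unknown diagonal entry large enough that the printed value is above −ε.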
I think the examples you give are valid, but there are several reasons why I think the situation is somewhat contingent or otherwise less bleak than you do:
You might find this work interesting, which takes some small steps in this direction. It studies the effect of horizon length inasmuch as it makes credit assignment harder, showing that the number of samples required is an affine function of horizon length in a toy context.
I think the direction depends on what your expectations were – I'll try to explain.
First, some terminology: the term "horizon length" is used in the paper to refer to the number of timesteps over which the algorithm pays attention to rewards, as governed by the discount rate. In the biological anchors framework, the term "effective horizon length" is used to refer to a multiplier on the number of samples required to train the model, which is influenced by the horizon length and other factors. For clarity, I'll use the term "scaling multiplier" instead of "effective horizon length" in this comment. The paper studies the effect of the horizon length on the scaling multiplier in a toy MNIST setting.
One key takeaway is that the scaling multiplier is not simply proportional to the horizon length, as one might have naively expected. Instead, the number of samples required is the sum of two components, one that is inherent to the task and independent of the horizon length, and one that is proportional to the horizon length. Compared to the naive expectation, this means that training compute requirements are lower. On the other hand, this ignores reward sparsity, so you might expect training compute requirements to be higher once both horizon length and reward sparsity are accounted for.
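To make the comparison concrete, here is a small sketch of the two scaling models. The function names and the constants are made up for illustration and are not taken from the paper. I am also assuming that the "naive expectation" means calibrating at a horizon length of 1 and then scaling proportionally.

```python
def samples_affine(task_samples: float, per_step_samples: float, horizon: float) -> float:
    """Affine model: a horizon-independent component plus one proportional to horizon."""
    return task_samples + per_step_samples * horizon


def samples_proportional(task_samples: float, per_step_samples: float, horizon: float) -> float:
    """Naive model: match the affine model at horizon 1, then scale linearly with horizon."""
    return (task_samples + per_step_samples) * horizon


if __name__ == "__main__":
    # Hypothetical constants, chosen only to show the shape of the comparison.
    task, per_step = 100.0, 5.0
    for horizon in (1, 10, 100):
        affine = samples_affine(task, per_step, horizon)
        naive = samples_proportional(task, per_step, horizon)
        print(f"horizon={horizon:>3}  affine={affine:>8.0f}  naive={naive:>8.0f}")
```

Under these assumptions the two models agree at a horizon length of 1, and the affine model requires fewer samples at every longer horizon, with the gap growing as the horizon-independent component gets larger.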
The paper also lends some support to the modeling assumptions of the neural network anchor, by validating the hypotheses that (a) training compute requirements still scale as a power law in model size for reinforcement learning, and with a similar exponent, and (b) the scaling multiplier can indeed vary a lot between environments. This might make you put more weight on the neural network anchor, which could again push your estimate in either direction.
The other takeaways are more methodological, and I don't think they have much of a directional effect.
I would wildly speculate that "simply" scaling up RLHF ~100x, while paying careful attention to rewarding models appropriately (which may entail modifying the usual training setup, as discussed in this comment), would be plenty to get current models to express calibrated uncertainty well. However:
In short, sample efficiency is a problem right now, but not the only problem, and it's unclear how much longer it will continue to be a problem for.
My understanding of why it's especially hard to stop the model making stuff up (while not saying "I don't know" too often), compared to other alignment failures:
In practice, incorporating retrieval should help mitigate the problem to a significant extent, but that's a different kind of solution.
I expect that making the model adversarially robust to "jailbreaking" (enough so for practical purposes) will be easier than stopping the model making stuff up, since sample efficiency should be less of a problem, but still challenging due to the need to generate strong adversarial attacks. Other unwanted behaviors such as the model stating incorrect facts about itself should be fairly straightforward to fix, and it's more a matter of there being a long list of such things to get through.
(To be clear, I am not suggesting that aligning much smarter models will necessarily be as easy as this, and I hope that once "jailbreaking" is mostly fixed, people don't draw the conclusion that it will be as easy.)