Rachel Freedman

AIS researcher at CHAI. London/Berkeley. Cats are not model-based reinforcement learners.

Posts

Sorted by New

36Your LLM Judge may be biased

1mo

58CIRL Corrigibility is Fragile

Wiki Contributions

Comments

Your LLM Judge may be biased

Rachel Freedman25d10

This is so interesting. I had no idea that this was a thing! I would have assumed that test-writers wrote all of the answers out, then used a (pseudo-)randomizer to order them. But if that really is a pattern in multiple choice tests, it makes absolute sense that Llama would pick up on it.

Your LLM Judge may be biased

Rachel Freedman1mo10

I suspect that if you ask the model to reconsider its answer, it would double down even on the incorrect (B-biased) responses. LLMs really like being self-consistent. We haven’t run this experiment, but if you do, let us know the result!

If I understand correctly, your proposed fix is something like supervised finetuning on adversarial examples that trigger the B-bias. We can access the output logits directly (replacing step 1) and the ground-truth answer is provided in the dataset (removing the need for step 2), so this seems relatively doable.

The main challenges that I see are 1) the computational cost of doing additional optimization (we had to do best-of-N optimization rather than updating the entire model to make our experiments manageable) and 2) it requires finetuning access (which often isn’t available for the latest models). But these challenges aren’t insurmountable, so I wonder why I haven’t seen finetuned “judges” more often.

CIRL Corrigibility is Fragile

Rachel Freedman4mo32

I’d be interested to see this as well!

Lightcone Infrastructure/LessWrong is looking for funding

Rachel Freedman10mo50

Thank you for such a detailed and thorough answer! This resolves a lot of my confusion.

Based on conversations around closing the wework Lightcone office, I had assumed that you didn't want to continue hosting office space, and so hadn't considered that counterfactual cost. But the Inn expenses you mention seem more reasonable if the alternative is continuing to rent wework space.

The FTX context also makes a lot of sense. I was confused how the purchase fit into your current strategy and funding situation, but I understand that both of those were quite different a year or two ago. Given how much things have changed, do you have conditions under which you would decide to sell the space and focus on other projects? Or are you planning to hold onto it no matter what, and decide how best to use it to support your current strategy as that develops?

Lightcone Infrastructure/LessWrong is looking for funding

Rachel Freedman10mo30

These all sound like major benefits to owning the venue yourself!

To be clear, I don't doubt at all that using the Inn for events is much better than non-purpose-built space. However, the Inn also has costs that renting existing spaces wouldn't: I assume that purchasing and renovating it costs more than renting hotel spaces as-needed for events (though please correct me if I'm wrong!), and my impression is that it's taken the Lightcone team a lot of time and effort over the past year+ to purchase and renovate, which naturally has opportunity costs.

I'm asking because my uninformed guess is that those financial and time costs outweigh the (very real) benefits of hosting events like you have been. I'm interested to hear if I'm just wrong about the costs, or if you have additional plans to make even more effective use of the space in the future, or if there's additional context I'm missing.

ETA: Oli answered these questions below, so no need to respond to them unless you have something additional you'd like me to know.

Lightcone Infrastructure/LessWrong is looking for funding

Rachel Freedman10mo1011

Will much of that $3-6M go into renovating and managing the Rose Garden Inn, or to cover work that could have been covered by existing funding if the Inn wasn't purchased?

If so, I'm curious to hear more about the strategy behind buying and renovating the space, since it seems like a substantial capital investment, and a divergence from LightCone Infrastructure's previous work and areas of expertise. I'm aware that several (primarily social?) events were held there over the past year, and I see from an earlier comment that you're planning to host SERI MATS scholars, and to continue providing space for events and retreats.

it seems valuable to have a central and optimized space for hosting people and events, but I'm curious how large the counterfactual benefit of the Inn is. If it didn't exist, programs would have to use existing venues such as hotels, which would charge them more (I assume?) and presumably be somewhat less nice. How would you quantify the counterfactual benefit that the Inn has provided here? How does that compare to the expense of buying, renovating and managing it? If the costs exceed those benefits, what additional value do you plan to get out of the space?

CIRL Corrigibility is Fragile

Rachel Freedman1y21

I agree that human model misspecification is a severe problem, for CIRL as well as for other reward modeling approaches. There are a couple of different ways to approach this. One is to do cognitive science research to build increasingly accurate human models, or to try to just learn them. The other is to build reward modeling systems that are robust to human model misspecification, possibly by maintaining uncertainty over possible human models, or doing something other than Bayesianism that doesn't rely on a likelihood model. I’m more sympathetic to the latter approach, mostly because reducing human model misspecification to zero seems categorically impossible (unless we can fully simulate human minds, which has other problems).

I also share your concern about the human-evaluating-atomic-actions failure mode. Another challenge with this line of research is that it implicitly assumes a particular scale, when in reality that scale is just one point on hierarchy. For example, the CIRL paper treats “make paperclips” as an atomic action. But we could easily increase the scale (“construct and operate a paperclip factory”) or decrease it (“bend this piece of wire” or even “send a bit of information to this robot arm”). “Make paperclips” was probably chosen because it’s the most natural level of abstraction of a human, but how do we figure that out in general? I think this is an unsolved challenge for reward learning (including this post).

My claim wasn’t that CIRL itself belongs to a “near-corrigible” class, but rather that some of the non-corrigible behaviors described in the post do. (For example, R no-op’ing until it gets more information rather than immediately shutting off when told to.) This isn’t sufficient to claim that optimal R behavior in CIRL games always or even often has this type, just that it possibly does and therefore I think it’s worth figuring out whether this is a coherent behavior class or not. Do you disagree with that?

Does a LLM have a utility function?

Answer by Rachel FreedmanDec 09, 202240

I think that the significant distinction is whether an AI system has a utility function that it is attempting to optimize at test time. A LLM does have an utility function, in that there is an objective function written in its training code that it uses to calculate gradients and update its parameters during training. However, once it is deployed, its parameters are frozen and its score on this objective function can no longer impact its behavior. In that sense, I don't think that it makes sense to think of a LLM as "trying to" optimize this objective after deployment. However, this answer could change in response to changes in model training strategy, which is why this distinction is significant.