Thanks, your confusion pointed out a critical typo. Indeed the relatively large number of walls broken should make it more likely that the orcs were the culprits. The 1:20 should have been 20:1 (going from -10 dB to +13 dB).
Thanks! These are great points. I applied the correction you noted about the signs and changed the wording about the direction of evidence. I agree that the clarification about the 3 dB rule is useful; I've linked to your comment.
Edit: The 10 was also missing a sign. It should be -10 + 60 - 37 = +13 dB. I also flipped the 1:20 to 20:1 posterior odds that the orcs did it.
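For anyone else following along, here's a quick sketch of the arithmetic in Python (using the usual convention that 10 dB of evidence corresponds to a factor of 10 in odds):

```python
# Decibels of evidence: odds = 10 ** (dB / 10), so 10 dB is one factor of 10 in odds.
def db_to_odds(db):
    """Convert decibels of evidence to an odds ratio."""
    return 10 ** (db / 10)

total_db = -10 + 60 - 37      # prior + evidence = +13 dB
print(total_db)               # 13
print(db_to_odds(total_db))   # ~19.95, i.e. roughly 20:1 that the orcs did it
```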
How surprising is this to alignment community professionals (e.g. people at MIRI, Redwood Research, or similar)? From an outside view, the volatility/flexibility and the movement away from pure growth and commercialization seem unexpected and could be to alignment researchers' benefit (although it's difficult to see the repercussions at this point). It's surprising to me because I don't know the inner workings of OpenAI, but I'm also surprised that it seems similarly surprising to the LW/alignment community.
Perhaps the insiders are still digesting and formulating a response, or want to keep hot takes to themselves for other reasons. If not, I'm curious whether there is actually so little information flowing between alignment communities and companies like OpenAI that this would be as surprising to them as it is to an outsider. For example, there seem to be many people at Anthropic who are directly in or culturally aligned with LW/rationality, and I expected the same to be true, to a lesser extent, for OpenAI.
I understood there was a real distance between groups, but still, I had a more connected model in my head that is challenged by this news and the response in the first day.
I will be there at 3:30 or so.
I missed your reply, but thanks for calling this out. I'm nowhere near as close to EY as you are, so I'll take your model over mine, since mine was constructed on loose grounds. I don't even remember where my number came from, but my best guess is that the 90% came from EY giving 3/15/16 as the largest number he referenced in the timeline, and from some comments in the Death with Dignity post, but this seems like a bad read to me now.
Thanks for sharing! It's nice to see plasticity, especially in stats, which seems to have more opinionated contributors than other applied maths. That said, it seems this 'admission' is not changing his framework, but rather reinterpreting how ML is used so that it's compatible with his framework.
Pearl's texts talk about causal models that use the do(X) operator (e.g. P(Y|do(X))) to signify causal information. Now, with LLMs, he sees the text the model is conditioning on as sometimes being do(X) and sometimes X. I'm curious what else besides text would count as this. I'm not sure I recall this correctly, but at his third level you can use purely observational data to infer causality with things like instrumental variables. If I had an ML model that took purely numerical inputs (tar exposure, smoking status, cancer outcome, and various other health data), should it be able to predict counterfactual results?
I'm uncertain what the right answer is here, and how Pearl would view it. My guess is a naive ML model would be able to do this provided the data covered the counterfactual cases, which is likely for the smoking example. But it would not be as useful for out-of-sample counterfactual inferences where there is little or no coverage of the interventions and outcomes (e.g. if one of the inputs was 'location' and it had to predict the effects of smoking on the ISS, where no one smokes). However, if we kept adding more purely observational information about the universe, it feels like we might be able to get a causal model out of a transformer-like thing. I'm aware there are some tools that try to extract a DAG from the data as a primitive form of this approach, but that is at odds with the Bayesian stats approach of having a DAG first and then checking whether the DAG holds against the data (or vice versa). Please share if you have some references that would be useful.
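To make my question more concrete, here is a toy sketch (the structural equations and numbers are entirely my own invention, not Pearl's) of the kind of case I have in mind: a hidden confounder makes the naive observational estimate of P(cancer | smoking) differ from the interventional P(cancer | do(smoking)), so a model trained purely to predict from the observed columns would learn the former, not the latter.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Toy structural model (made-up numbers): a hidden 'genotype' confounder
# raises both the probability of smoking and the probability of cancer.
genotype = rng.random(n) < 0.3
p_smoke = np.where(genotype, 0.8, 0.2)
smoking = rng.random(n) < p_smoke
p_cancer = 0.05 + 0.10 * smoking + 0.20 * genotype   # true causal effect of smoking: +0.10
cancer = rng.random(n) < p_cancer

# What a purely predictive model fit on observational (smoking, cancer) data sees:
obs_effect = cancer[smoking].mean() - cancer[~smoking].mean()

# What do(smoking) gives, computed from the known structural equation
# (something a purely predictive model never has access to):
do_effect = ((0.05 + 0.10 * 1 + 0.20 * genotype).mean()
             - (0.05 + 0.10 * 0 + 0.20 * genotype).mean())

print(f"observational difference:      {obs_effect:.3f}")  # ~0.21, inflated by confounding
print(f"interventional (do) difference: {do_effect:.3f}")  # 0.10, the true causal effect
```

In this toy world the purely observational estimate overstates the effect, and no amount of fitting the observed columns alone recovers the interventional number without extra assumptions or extra variables, which is roughly the distinction I'm trying to pin down for the transformer case.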
Thanks! I'm aware of the resources mentioned but haven't read deeply or frequently enough to have this kind of overview of the interaction between the cast of characters.
There are more than a few lists and surveys that state the CDFs for some of these people, which helps a bit. A big-as-possible list of evidence/priors would be one way to inspect the gap more closely. I wonder if it would be helpful to expand on the MIRI conversations and have a slow conversation between a >99% doom pessimist and a <50% doom 'optimist', with a moderator to prod them to exhaustively dig up their reactions to each piece of evidence and keep pulling out priors until we get to indifference. It would probably be an uncomfortable, awkward experiment with a useless result, but there's a chance that some item on the list ends up being useful for either party to ask questions about.
That format would be useful for me to understand where we're at. Maybe something along these lines will eventually prompt a popular, viral sociology author like Harari or Bostrom to take it up (or even just to update the CDFs/evidence in Superintelligence). The general deep learning community probably needs to hear it mentioned and normalized on NPR and in a bestseller a few times (like all the other x-risks are) before they'll start talking about it at lunch.
Thanks! Do you know of any arguments with a similar style to The Most Important Century that are as pessimistic as the EY/MIRI folks (>90% probability of AGI within 15 years)? The style looks good, but the time estimates in that one (a 2/3 chance of AGI by 2100) are significantly longer and aren't nearly as surprising or urgent as the pessimistic view asks for.
I still don't follow why EY assigns a seemingly <1% chance to non-earth-destroying outcomes in 10-15 years (not sure if it's actually 1%, but EY didn't argue with the 0% comments mentioned in the "Death with Dignity" post last year). This seems to treat fast takeoff as the inevitable path forward, implying unrestricted, fast, recursive designing of AIs by AIs. There are compute bottlenecks that seem slowish, and there may be other bottlenecks we can't think of yet. That is just one obstacle; why isn't more probability mass assigned to it? Surely there are more obstacles that aren't obvious (and that we shouldn't talk about).
It feels like we have a communication failure between different cultures. Even if EY thinks the top industry brass is incentivized to ignore the problem, there are a lot of (non-alignment-oriented) researchers capable of grasping the 'security mindset' who could be won over. Both in this interview and in the Chollet response referenced, the arguments presented by EY don't always help the other party bridge from their view over to his; they go off on 'nerdy/rationalist-y' tangents and idioms that end up as walls, not very helpful for working on the main point, and serving mostly to show that EY is smart and knowledgeable about this field and others.
Are there any publicly digestible arguments out there for this level of confident pessimism that would be useful for industry folk? By publicly digestible, I'm thinking of the style of popular books like Superintelligence or Human Compatible.
I really appreciate the way you have written this up. It seems that 2-7% of refusals do not respond to the unidimensional treatment. I'm curious whether you've looked at this subgroup the same way you looked at the global data, to see if it has another dimension for refusal, or whether the statistics of the subgroup shed some other light on the stubborn refusals.
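In case it helps clarify what I mean by "another dimension": I'm imagining something like the sketch below, where (under my own assumptions about the setup, not anything stated in the post; the array names are hypothetical) you take the activations of the stubborn refusals, project out the refusal direction you already found, and check whether the remaining separation between refusals and compliant completions concentrates along another single direction.

```python
import numpy as np

def candidate_second_direction(acts_refuse, acts_comply, refusal_dir):
    """Look for a second 'refusal-like' direction in the stubborn subgroup.

    acts_refuse, acts_comply: (n, d) activation matrices (hypothetical inputs).
    refusal_dir: (d,) the direction found in the main analysis.
    """
    refusal_dir = refusal_dir / np.linalg.norm(refusal_dir)

    # Remove the known refusal direction from both groups.
    def remove_dir(x, v):
        return x - np.outer(x @ v, v)

    r = remove_dir(acts_refuse, refusal_dir)
    c = remove_dir(acts_comply, refusal_dir)

    # Difference-of-means direction within the residual subspace.
    second_dir = r.mean(axis=0) - c.mean(axis=0)
    second_dir /= np.linalg.norm(second_dir)

    # How strongly does this one extra direction separate the two groups?
    proj_r, proj_c = r @ second_dir, c @ second_dir
    separation = (proj_r.mean() - proj_c.mean()) / np.sqrt(
        0.5 * (proj_r.var() + proj_c.var())
    )
    return second_dir, separation
```

If the separation along that residual direction were large for the stubborn 2-7% but small for the rest, that would hint at a second dimension; if not, the subgroup's statistics might still show something else interesting about why they resist the treatment.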