If the thesis in Unlocking the Emotional Brain is even half-right, it may be one of the most important books that I have read. It claims to offer a neuroscience-grounded, comprehensive model of how effective therapy works. In so doing, it also happens to formulate its theory in terms of belief updating, helping explain how the brain models the world and what kinds of techniques allow us to actually change our minds.

The MIRI Technical Governance Team is hiring; please apply and work with us! We are looking to hire for the following roles:

* Technical Governance Researcher (2-4 hires)
* Writer (1 hire)

The roles are located in Berkeley, and we are ideally looking to hire people who can start ASAP. The team is currently Lisa Thiergart (team lead) and myself. We will research and design technical aspects of regulation and policy that could lead to safer AI, focusing on methods that won’t break as we move towards smarter-than-human AI. We want to design policy that allows us to safely and objectively assess the risks from powerful AI, build consensus around the risks we face, and put in place measures to prevent catastrophic outcomes. The team will likely work on:

* Limitations of current proposals such as RSPs
* Inputs into regulations, and requests for comment by policy bodies (e.g. NIST/US AISI, EU, UN)
* Researching and designing alternative Safety Standards, or amendments to existing proposals
* Communicating with and consulting for policymakers and governance organizations

If you have any questions, feel free to contact me on LW or at peter@intelligence.org
Akash
I think now is a good time for people at labs to seriously consider quitting & getting involved in government/policy efforts. I don't think everyone should leave labs (obviously). But I would probably hit a button that does something like "everyone at a lab governance team and many technical researchers spend at least 2 hours thinking/writing about alternative options they have & very seriously consider leaving." My impression is that lab governance is much less tractable (lab folks have already thought a lot more about AGI) and less promising (competitive pressures are dominating) than government-focused work.  I think governments still remain unsure about what to do, and there's a lot of potential for folks like Daniel K to have a meaningful role in shaping policy, helping natsec folks understand specific threat models, and raising awareness about the specific kinds of things governments need to do in order to mitigate risks. There may be specific opportunities at labs that are very high-impact, but I think if someone at a lab is "not really sure if what they're doing is making a big difference", I would probably hit a button that allocates them toward government work or government-focused comms work. Written on a Slack channel in response to discussions about some folks leaving OpenAI. 
Today I learned that being successful can involve feelings of hopelessness. When you are trying to solve a hard problem, where you have no idea if you can solve it, let alone whether it is solvable at all, your brain makes you feel bad. It makes you feel like giving up. This is quite strange, because most of the time when I am in such a situation and manage to make a real effort anyway, I seem to surprise myself with how much progress I make. Empirically, this feeling of hopelessness does not seem to track the actual likelihood that you will completely fail.
Eli Tyre
Back in January, I participated in a workshop in which the attendees mapped out how they expect AGI development and deployment to go. The idea was to start by writing out what seemed most likely to happen this year, and then condition on that, to forecast what seems most likely to happen in the next year, and so on, until you reach either human disempowerment or an end of the acute risk period. This post was my attempt at the time. I spent maybe 5 hours on this, and there's lots of room for additional improvement. This is not a confident statement of how I think things are most likely to play out. There are already some ways in which I think this projection is wrong (I think it's too fast, for instance). But nevertheless I'm posting it now, with only a few edits and elaborations, since I'm probably not going to do a full rewrite soon.

2024

* A model is released that is better than GPT-4. It succeeds on some new benchmarks. Subjectively, the jump in capabilities feels smaller than that between RLHF’d GPT-3 and RLHF’d GPT-4. It doesn’t feel as shocking as ChatGPT and GPT-4 did, for either x-risk focused folks or for the broader public. Mostly it feels like “a somewhat better language model.”
* It’s good enough that it can do a bunch of small-to-medium admin tasks pretty reliably. I can ask it to find me flights meeting specific desiderata, and it will give me several options. If I give it permission, it will then book those flights for me with no further inputs from me.
* It works somewhat better as an autonomous agent in an AutoGPT harness, but it still loses its chain of thought / breaks down / gets into loops.
* It’s better at programming.
* Not quite good enough to replace human software engineers. It can make a simple React or iPhone app, but not design a whole complicated software architecture, at least without a lot of bugs.
* It can make small, working, well-documented apps from a human description.
* We see a doubling of the rate of new apps being added to the app store as people who couldn’t code now can make applications for themselves. The vast majority of people still don’t realize the possibilities here, though. “Making apps” still feels like an esoteric domain outside of their zone of competence, even though the barriers to entry just lowered so that 100x more people could do it.
* From here on out, we’re in an era where LLMs are close to commoditized. There are smaller improvements, shipped more frequently, by a variety of companies, instead of big impressive research breakthroughs. Basically, companies are competing with each other to always have the best user experience and capabilities, and so they don’t want to wait as long to ship improvements. They’re constantly improving their scaling, and finding marginal engineering improvements. Training runs for the next generation are always happening in the background, and there’s often less of a clean tabula-rasa separation between training runs—you just keep doing training with a model continuously. More and more, systems are being improved through in-the-world feedback with real users. Often ChatGPT will not be able to handle some kind of task, but six weeks later it will be able to, without the release of a whole new model.
* [Does this actually make sense? Maybe the dynamics of AI training mean that there aren’t really marginal improvements to be gotten. In order to produce a better user experience, you have to 10x the training, and each 10x-ing of the training requires a bunch of engineering effort, to enable a larger run, so it is always a big lift.]
* (There will still be impressive discrete research breakthroughs, but they won’t be in LLM performance.)

2025

* A major lab is targeting building a Science and Engineering AI (SEAI)—specifically a software engineer.
* They take a state-of-the-art LLM base model and do additional RL training on procedurally generated programming problems, calibrated to stay within the model’s zone of proximal competence. These problems are something like leetcode problems, but scale to arbitrary complexity (some of them require building whole codebases, or writing very complex software), with scoring on lines of code, time complexity, space complexity, readability, documentation, etc. This is something like “self-play” for software engineering.
* This just works.
* A lab gets a version that can easily do the job of a professional software engineer. Then the lab scales their training process and gets a superhuman software engineer, better than the best hackers.
* Additionally, a language model trained on procedurally generated programming problems in this way seems to have higher general intelligence. It scores better on graduate-level physics, economics, biology, etc. tests, for instance. It seems like “more causal reasoning” is getting into the system.
* The first proper AI assistants ship. In addition to doing specific tasks, you keep them running in the background, and talk with them as you go about your day. They get to know you and make increasingly helpful suggestions as they learn your workflow. A lot of people also talk to them for fun.

2026

* The first superhuman software engineer is publicly released.
* Programmers begin studying its design choices, the way Go players study AlphaGo.
* It starts to dawn on e.g. people who work at Google that they’re already superfluous—after all, they’re currently using this AI model to (unofficially) do their job—and it’s just a matter of institutional delay for their employers to adapt to that change.
* Many of them are excited or loudly say how it will all be fine/awesome. Many of them are unnerved. They start to see the singularity on the horizon, as a real thing instead of a social game to talk about.
* This is the beginning of the first wave of change in public sentiment that will cause some big, hard-to-predict changes in public policy [come back here and try to predict them anyway].
* AI assistants get a major upgrade: they have realistic voices and faces, and you can talk to them just like you can talk to a person, not just typing into a chat interface. A ton of people start spending a lot of time talking to their assistants, for much of their day, including for goofing around.
* There are still bugs, places where the AI gets confused by stuff, but overall the experience is good enough that it feels, to most people, like they’re talking to a careful, conscientious person, rather than a software bot.
* This starts a whole new area of training AI models that have particular personalities. Some people are starting to have parasocial relationships with their AI friends, and some programmers are trying to make AI friends that are really fun or interesting or whatever for them in particular.
* Lab attention shifts to building SEAI systems for other domains, to solve biotech and mechanical engineering problems, for instance. The current-at-the-time superhuman software engineer AIs are already helpful in these domains, but not at the level of “explain what you want, and the AI will instantly find an elegant solution to the problem right before your eyes”, which is where we’re at for software.
* One bottleneck is problem specification. Our physics simulations have gaps, and are too low fidelity, so oftentimes the best solutions don’t map to real-world possibilities.
* One solution to this (in addition to using our AI to improve the simulations) is that we just RLHF our systems to identify solutions that do translate to the real world. They’re smart, they can figure out how to do this.
* The first major AI cyber-attack happens: maybe some kind of superhuman hacker worm. Defense hasn’t remotely caught up with offense yet, and someone clogs up the internet with AI bots, for at least a week, approximately for the lols / to see if they could do it. (There’s a week during which more than 50% of people can't get on more than 90% of the sites because the bandwidth is eaten by bots.)
* This makes some big difference for public opinion.
* Possibly, this problem isn’t really fixed. In the same way that covid became endemic, the bots that were clogging things up are just a part of life now, slowing bandwidth and making the internet annoying to use.

2027 and 2028

* In many ways things are moving faster than ever in human history, and also AI progress is slowing down a bit.
* The AI technology developed up to this point hits the application and mass-adoption phase of the s-curve. In this period, the world is radically changing as every industry, every company, every research lab, every organization, figures out how to take advantage of newly commoditized intellectual labor. There’s a bunch of kinds of work that used to be expensive, but which are now too cheap to meter. If progress stopped now, it would take 2 decades, at least, for the world to figure out all the ways to take advantage of this new situation (but progress doesn’t show much sign of stopping).
* Some examples:
* The internet is filled with LLM bots that are indistinguishable from humans. If you start a conversation with a new person on twitter or discord, you have no way of knowing if they’re a human or a bot.
* (Probably there will be some laws about declaring which are bots, but these will be inconsistently enforced.)
* Some people are basically cool with this. From their perspective, there are just more people that they want to be friends with / follow on twitter. Some people even say that the bots are just better and more interesting than people. Other people are horrified/outraged/betrayed/don’t care about relationships with non-real people.
* (Older people don’t get the point, but teenagers are generally fine with having conversations with AI bots.)
* The worst part of this is the bots that make friends with you and then advertise stuff to you. Pretty much everyone hates that.
* We start to see companies that will, over the next 5 years, grow to have as much impact as Uber, or maybe Amazon, which have exactly one human employee / owner + an AI bureaucracy.
* The first completely autonomous companies work well enough to survive and support themselves. Many of these are created “free” for the lols, and no one owns or controls them. But most of them are owned by the person who built them, who could turn them off if they wanted to. A few are structured as public companies with share-holders. Some are intentionally incorporated as fully autonomous, with the creator disclaiming (and technologically disowning (e.g. deleting the passwords)) any authority over them.
* There are legal battles about what rights these entities have, if they can really own themselves, if they can have bank accounts, etc.
* Mostly, these legal cases resolve to “AIs don’t have rights”. (For now. That will probably change as more people feel it’s normal to have AI friends.)
* Everything is tailored to you.
* Targeted ads are way more targeted. You are served ads for the product that you are, all things considered, most likely to buy, multiplied by the lifetime profit if you do buy it. Basically no ad space is wasted on things that don’t have a high EV of you, personally, buying them. Those ads are AI-generated, tailored specifically to be compelling to you. Often the products advertised, not just the ads, are tailored to you in particular.
* This is actually pretty great for people like me: I get excellent product suggestions.
* There’s not “the news”. There’s a set of articles written for you, specifically, based on your interests and biases.
* Music is generated on the fly. This music can “hit the spot” better than anything you listened to before “the change.”
* Porn. AI-tailored porn can hit your buttons better than sex.
* AI boyfriends/girlfriends that are designed to be exactly emotionally and intellectually compatible with you, and trigger strong limerence / lust / attachment reactions.
* We can replace books with automated tutors.
* Most of the people who read books will still read books though, since it will take a generation to realize that talking with a tutor is just better, and because reading and writing books was largely a prestige thing anyway.
* (And weirdos like me will probably continue to read old authors, but even better will be to train an AI on a corpus, so that it can play the role of an intellectual from 1900, and I can just talk to it.)
* For every task you do, you can effectively have a world expert (in that task and in tutoring pedagogy) coach you through it in real time.
* Many people do almost all their work tasks with an AI coach.
* It's really easy to create TV shows and movies. There’s a cultural revolution as people use AI tools to make custom Avengers movies, anime shows, etc. Many are bad or niche, but some are 100x better than anything that has come before (because you’re effectively sampling from a 1000x larger distribution of movies and shows).
* There’s an explosion of new software, and increasingly custom software.
* Facebook and twitter are replaced (by either external disruption or by internal product development) by something that has a social graph, but lets you design exactly the UX features you want through an LLM text interface.
* Instead of software features being something that companies ship to their users, top-down, they become something that users and communities organically develop, share, and iterate on, bottom-up. Companies don’t control the UX of their products any more.
* Because interface design has become so cheap, most software is just proprietary datasets, with (AI-built) APIs for accessing that data.
* There’s a slow-moving educational revolution of world-class pedagogy being available to everyone.
* Millions of people who thought of themselves as “bad at math” finally learn math at their own pace, and find out that actually, math is fun and interesting.
* Really fun, really effective educational video games for every subject.
* School continues to exist, in approximately its current useless form.
* [This alone would change the world, if the kids who learn this way were not going to be replaced wholesale, in virtually every economically relevant task, before they are 20.]
* There’s a race between cyber-defense and cyber-offense, to see who can figure out how to apply AI better.
* So far, offense is winning, and this is making computers unusable for lots of applications that they were used for previously:
* Online banking, for instance, is hit hard by effective scams and hacks.
* Coinbase has an even worse time, since they’re not insured (is that true?).
* It turns out that a lot of things that worked / were secure were basically depending on the fact that there are just not that many skilled hackers and social engineers. Nothing was secure, really, but not that many people were exploiting that. Now, hacking/scamming is scalable and all the vulnerabilities are a huge problem.
* There’s a whole discourse about this. Computer security and what to do about it is a partisan issue of the day.
* AI systems can do the years of paperwork to make a project legal in days. This isn’t as big an advantage as it might seem, because the government has no incentive to be faster on their end, and so you wait weeks to get a response from the government, your LLM responds to it within a minute, and then you wait weeks again for the next step.
* The amount of paperwork required to do stuff starts to balloon.
* AI romantic partners are a thing. They start out kind of cringe, because the most desperate and ugly people are the first to adopt them. But shockingly quickly (within 5 years) a third of teenage girls have a virtual boyfriend.
* There’s a moral panic about this.
* AI matchmakers are better than anything humans have tried yet for finding sex and relationship partners. It would still take a decade for this to catch on, though.
* This isn’t just for sex and relationships. The global AI network can find you the 100 people, of the 9 billion on earth, that you most want to be friends / collaborators with.
* Tons of things that I can’t anticipate.
* On the other hand, AI progress itself is starting to slow down. Engineering labor is cheap, but (indeed partially for that reason) we’re now bumping up against the constraints of training. It's not just that buying the compute is expensive, but that there are just not enough chips to do the biggest training runs, and not enough fabs to meet that demand for chips rapidly. There’s huge pressure to expand production, but that’s going slowly relative to the speed of everything else, because it requires a bunch of e.g. physical construction and legal navigation, which the AI tech doesn’t help much with, and because the bottleneck is largely NVIDIA’s institutional knowledge, which is only partially replicated by AI.
* NVIDIA's internal AI assistant has read all of their internal documents and company emails, and is very helpful at answering questions that only one or two people (and sometimes literally no human on earth) know the answer to. But a lot of the important stuff isn’t written down at all, and the institutional knowledge is still not fully scalable.
* Note: there’s a big crux here of how much low- and medium-hanging fruit there is in algorithmic improvements once software engineering is automated. At that point the only constraint on running ML experiments will be the price of compute. It seems possible that that speed-up alone is enough to discover e.g. an architecture that works better than the transformer, which triggers an intelligence explosion.

2028

* The cultural explosion is still going on, and AI companies are continuing to apply their AI systems to solve the engineering and logistic bottlenecks of scaling AI training, as fast as they can.
* Robotics is starting to work.

2029

* The first superhuman, relatively general SEAI comes online. We now have basically a genie inventor: you can give it a problem spec, and it will invent (and test in simulation) a device / application / technology that solves that problem, in a matter of hours. (Manufacturing a physical prototype might take longer, depending on how novel the components are.)
* It can do things like give you the design for a flying car, or a new computer peripheral.
* A lot of biotech / drug discovery seems more recalcitrant, because it is more dependent on empirical inputs. But it is still able to do superhuman drug discovery for some ailments. It’s not totally clear why, or which biotech domains it will conquer easily and which it will struggle with.
* This SEAI is shaped differently than a human. It isn’t working-memory bottlenecked, so a lot of intellectual work that humans do explicitly, in sequence, these SEAIs do “intuitively”, in a single forward pass.
* I write code one line at a time. It writes whole files at once. (Although it also goes back and edits / iterates / improves—the first-pass files are not usually the final product.)
* For this reason it’s a little confusing to answer the question “is it a planner?” A lot of the work that humans would do via planning, it does in an intuitive flash.
* The UX isn’t clean: there’s often a lot of detailed finagling, and refining of the problem spec, to get useful results. But a PhD in that field can typically do that finagling in a day.
* It’s also buggy. There are oddities in the shape of the kinds of problems it is able to solve and the kinds of problems it struggles with, which aren’t well understood.
* The leading AI company doesn’t release this as a product. Rather, they apply it themselves, developing radical new technologies, which they publish or commercialize, sometimes founding whole new fields of research in the process. They spin up automated companies to commercialize these new innovations.
* Some of the labs are scared at this point. The thing that they’ve built is clearly world-shakingly powerful, and their alignment arguments are mostly inductive “well, misalignment hasn’t been a major problem so far”, instead of principled alignment guarantees.
* There's a contentious debate inside the labs.
* Some labs freak out, stop here, and petition the government for oversight and regulation.
* Other labs want to push full steam ahead.
* Key pivot point: does the government clamp down on this tech before it is deployed, or not?
* I think that they try to get control over this powerful new thing, but they might be too slow to react.

2030

* There’s an explosion of new innovations in physical technology. Magical new stuff comes out every day, way faster than any human can keep up with.
* Some of these are mundane:
* All the simple products that I would buy on Amazon are just really good and really inexpensive.
* Cars are really good.
* Drone delivery.
* Cleaning robots.
* Prefab houses are better than any house I’ve ever lived in, though there are still zoning limits.
* But many of them would have huge social impacts. They might be the important story of the decade (the way that the internet was the important story of 1995 to 2020) if they were the only thing that was happening that decade. Instead, they’re all happening at once, piling on top of each other.
* E.g.:
* The first really good nootropics.
* Personality-tailoring drugs (both temporary and permanent).
* Breakthrough mental health interventions that, among other things, robustly heal people’s long-term subterranean trauma and transform their agency.
* A quick and easy process for becoming classically enlightened.
* The technology to attain your ideal body, cheaply—suddenly everyone who wants to be is as attractive as the top 10% of people today.
* Really good AI persuasion, which can get a mark to do ~anything you want if they’ll talk to an AI system for an hour.
* Artificial wombs.
* Human genetic engineering.
* Brain-computer interfaces.
* Cures for cancer, AIDS, dementia, heart disease, and the-thing-that-was-causing-obesity.
* Anti-aging interventions.
* VR that is ~indistinguishable from reality.
* AI partners that can induce a love superstimulus.
* Really good sex robots.
* Drugs that replace sleep.
* AI mediators that are so skilled as to be able to single-handedly fix failing marriages, but which are also brokering all the deals between governments and corporations.
* Weapons that are more destructive than nukes.
* Really clever institutional design ideas, which some enthusiast early adopters try out (think “50 different things at least as impactful as manifold.markets”).
* It’s way more feasible to go into the desert, buy 50 square miles of land, and have a city physically built within a few weeks.
* In general, social trends are changing faster than they ever have in human history, but they still lag behind the tech driving them by a lot.
* It takes humans, even with AI information-processing assistance, a few years to realize what’s possible and take advantage of it, and then have the new practices spread.
* In some cases, people are used to doing things the old way, which works well enough for them, and it takes 15 years for a new generation to grow up as “AI-world natives” to really take advantage of what’s possible.
* [There won’t be 15 years.]
* The legal oversight process for the development, manufacture, and commercialization of these transformative techs matters a lot. Some of these innovations are slowed down a lot because they need to get FDA approval, which AI tech barely helps with. Others are developed, manufactured, and shipped in less than a week.
* The fact that there are life-saving cures that exist, but are prevented from being used by a collusion of AI labs and government, is a major motivation for open-source proponents.
* A lot of this technology makes setting up new cities quickly more feasible, and there’s enormous incentive to get out from under the regulatory overhead and to start new legal jurisdictions. The first real seasteads are started by the most ideologically committed anti-regulation, pro-tech-acceleration people.
* Of course, all of that is basically a side gig for the AI labs. They’re mainly applying their SEAI to the engineering bottlenecks of improving their ML training processes.
* Key pivot point:
* Possibility 1: These SEAIs are necessarily, by virtue of the kinds of problems that they’re able to solve, consequentialist agents with long-term goals.
* If so, this breaks down into two child possibilities:
* Possibility 1.1: This consequentialism was noticed early, which might have been convincing enough to the government to cause a clamp-down on all the labs.
* Possibility 1.2: It wasn’t noticed early, and now the world is basically fucked.
* There’s at least one long-term consequentialist superintelligence. The lab that “owns” and “controls” that system is talking to it every day, in their day-to-day business of doing technical R&D. That superintelligence easily manipulates the leadership (and rank and file) of that company, maneuvers it into doing whatever causes the AI’s goals to dominate the future, and enables it to succeed at everything that it tries to do.
* If there are multiple such consequentialist superintelligences, then they covertly communicate, make a deal with each other, and coordinate their actions.
* Possibility 2: We’re getting transformative AI that doesn’t do long-term consequentialist planning.
* Building these systems was a huge engineering effort (though the bulk of that effort was done by ML models). Currently only a small number of actors can do it.
* One thing to keep in mind is that the technology bootstraps. If you can steal the weights to a system like this, it can basically invent itself: come up with all the technologies and solve all the engineering problems required to build its own training process. At that point, the only bottleneck is compute resources, which are limited by supply chains and legal constraints (large training runs require authorization from the government).
* This means, I think, that a crucial question is “has AI-powered cyber-security caught up with AI-powered cyber-attacks?”
* If not, then every nation state with a competent intelligence agency has a copy of the weights of an inventor-genie, and probably all of them are trying to profit from it, either by producing tech to commercialize, or by building weapons.
* It seems like the crux is “do these SEAIs themselves provide enough of an information and computer security advantage that they’re able to develop and implement methods that effectively secure their own code?”
* Every one of the great powers, and a bunch of small, forward-looking groups that see that it is newly feasible to become a great power, try to get their hands on an SEAI, either by building one, nationalizing one, or stealing one.
* There are also some people who are ideologically committed to open-sourcing and/or democratizing access to these SEAIs.
* But it is a self-evident national security risk. The government does something here (nationalizing all the labs, and their technology?).
* What happens next depends a lot on how the world responds to all of this.
* Do we get a pause?
* I expect a lot of the population of the world feels really overwhelmed, and emotionally wants things to slow down, including smart people that would never have thought of themselves as luddites.
* There are also some people who thrive in the chaos, and want even more of it.
* What’s happening is mostly hugely good, for most people. It’s scary, but also wonderful.
* There is a huge problem of accelerating addictiveness. The world is awash in products that are more addictive than many drugs. There’s a bit of (justified) moral panic about that.
* One thing that matters a lot at this point is what the AI assistants say. As powerful as the media used to be for shaping people’s opinions, the personalized, superhumanly emotionally intelligent AI assistants are way, way more powerful. AI companies may very well put their thumb on the scale to influence public opinion regarding AI regulation.
* This seems like possibly a key pivot point, where the world can go any of a number of ways depending on what a relatively small number of actors decide.
* Some possibilities for what happens next:
* These SEAIs are necessarily consequentialist agents, and the takeover has already happened, regardless of whether it still looks like we’re in control, or it doesn’t look like anything because we’re extinct.
* Governments nationalize all the labs.
* The US and EU and China (and India? and Russia?) reach some sort of accord.
* There’s a straight-up arms race to the bottom.
* AI tech basically makes the internet unusable and breaks supply chains, and technology regresses for a while.
* It’s too late to contain it and the SEAI tech proliferates, such that there are hundreds or millions of actors who can run one.
* If this happens, it seems like the pace of change speeds up so much that one of two things happens:
* Someone invents something, or there are second and third impacts to a constellation of innovations, that destroy the world.
Raemon
There's a skill of "quickly operationalizing a prediction about a question that is cruxy for your decisionmaking."

And it's dramatically better to be very fluent at this skill, rather than "merely pretty okay at it." Fluency means you can actually use it day-to-day to help with whatever work is important to you. Day-to-day usage means you can actually get calibrated re: predictions in whatever domains you care about. Calibration means that your intuitions will be good, and _you'll know they're good_. Fluency means you can do it _while you're in the middle of your thought process_, and then return to your thought process, rather than awkwardly bolting it on at the end.

I find this useful at multiple levels of strategy, i.e. for big-picture 6-month planning as well as for "what do I do in the next hour." I'm working on this as a full blogpost but figured I would start getting pieces of it out here for now.

A lot of this skill is building on CFAR's "inner simulator" framing. Andrew Critch recently framed this to me as "using your System 2 (conscious, deliberate intelligence) to generate questions for your System 1 (fast intuition) to answer." (Whereas previously, he'd known System 1 was good at answering some types of questions, but he thought of it as responsible for both "asking" and "answering" those questions.) But I feel like combining this with "quickly operationalize cruxy Fatebook predictions" makes it more of a power tool for me. (Also, now that I have this mindset, even when I can't be bothered to make a Fatebook prediction, I have a better overall handle on how to quickly query my intuitions.)

I've been working on this skill for years and it only really clicked together last week. It required a bunch of interlocking pieces that all require separate fluency:

1. Having three different formats for Fatebook (the main website, the Slack integration, and the Chrome extension), so that pretty much wherever I'm thinking-in-text, I'll be able to quickly use it.
2. The skill of "generating lots of 'plans'", such that I always have at least two plausibly good ideas on what to do next.
3. Identifying an actual crux for what would make me switch to one of my backup plans.
4. Operationalizing an observation I could make that'd convince me of one of these cruxes.

Recent Discussion

This is a thread for updates about the upcoming LessOnline festival. I (Ben) will be posting bits of news and thoughts, and you're also welcome to make suggestions or ask questions.

If you'd like to hear about new updates, you can use LessWrong's "Subscribe to comments" feature from the triple-dot menu at the top of this post.

Reminder that you can get tickets at the site for $400 minus your LW karma in cents.

Nice, just had a good call with Alkjash, who is coming and will be preparing 2 layman-level math talks about questions he's been thinking about.

Other ideas we chatted about having at LessOnline include maybe having some discussions about doing research inside and outside of academia, and also about learning from GlowFic writers how to write well collaboratively. (Let me know if you'd be interested in either of these!)

NicholasKross
How scarce are tickets/"seats"?
Ben Pace
I think on-site housing is pretty scarce, though we're going to make more high-density rooms in response to demand for that. Tickets aren't scarce; our venue could fit something like a 700-person event, so I don't expect to hit the limits.

I was thinking about my p(doom) in the next 10 years and came up with something around 6%[1]. However, that number involves lots of things that are currently unknown to me, like the nature of current human knowledge production (and the bottlenecks involved), which would push my p(doom) to either 3% or 15% depending upon what type of bottlenecks are found or not found. Is there a technical way to describe this probability distribution, contingent on evidence?

  1. ^

    I'm bearish on LLMs leading directly to AGI (10% chance), and I put roughly a 30% chance on LLM-based AI fooming quickly enough to kill us, and wanting to kill us, within 10 years. There is a 3% chance that something will come out of left field and do the same.

Answer by Dagon
If you're giving one number, that IS your all-inclusive probability. You can't predict the direction that new evidence will change your probability (per https://www.lesswrong.com/tag/conservation-of-expected-evidence), but you CAN predict that the possible updates balance out in expectation. An example is flipping a coin twice. Before any flips, you give 0.25 to each of HH, HT, TH, and TT. But you strongly expect to get evidence (observing the flips) that will first change two of them to 0.5 and two to 0, then another update which will change one of the 0.5s to 1 and the other to 0. Likewise for p(doom) before 2035: you strongly believe your probability will be 1 or 0 in 2036. You currently believe 6%. You may be able to identify intermediate updates, and specify the balance of probability-times-update that adds to 0 currently, but will be specific when the evidence is obtained. I don't know any shorthand for that - it's implied by the probability given. If you want to specify your distribution of probable future probability assignments, you can certainly do so, as long as the mean remains 6%. "There's a 25% chance I'll update to 15% and a 75% chance of updating to 3% over the next 5 years" is a consistent prediction.
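To make the "mean stays at 6%" constraint concrete, here is a minimal Python sketch (my own illustration, not part of the original comment), using the numbers from this answer:

```python
# Conservation of expected evidence: today's probability must equal the
# expectation of tomorrow's probability over the evidence you might see.

# Hypothetical future updates: 25% chance of updating to 15%, 75% chance of updating to 3%.
future_estimates = [(0.25, 0.15), (0.75, 0.03)]
current = sum(p * q for p, q in future_estimates)
print(current)  # 0.06, consistent with a current p(doom) of 6%

# Coin-flip example: before any flips, HH has probability 0.25. After one flip,
# the posterior for HH is either 0.5 (first flip heads) or 0.0 (first flip tails),
# and the expectation of that posterior is still 0.25.
expected_posterior_hh = 0.5 * 0.5 + 0.5 * 0.0
print(expected_posterior_hh)  # 0.25
```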
Answer by Richard_Ngo
I don't think there's a very good precise way to do so, but one useful concept is bid-ask spreads, which are a way of protecting yourself from adverse selection of bets. E.g. consider the following two credences, both of which are 0.5.

1. My credence that a fair coin will land heads.
2. My credence that the wind tomorrow in my neighborhood will be blowing more northwards than southwards (I know very little about meteorology and have no recollection of which direction previous winds have mostly blown).

Intuitively, however, the former is very difficult to change, whereas the latter might swing wildly given even a little bit of evidence (e.g. someone saying "I remember in high school my teacher mentioned that winds often blow towards the equator.")

Suppose I have to decide on a policy that I'll accept bets for or against each of these propositions at X:1 odds (i.e. my opponent puts up $X for every $1 I put up). For the first proposition, I might set X to be 1.05, because as long as I have a small edge I'm confident I won't be exploited. By contrast, if I set X=1.05 for the second proposition, then probably what will happen is that people will only decide to bet against me if they have more information than me (e.g. checking weather forecasts), and so they'll end up winning a lot of money from me. And so I'd actually want X to be something more like 2 or maybe higher, depending on who I expect to be betting against, even though my credence right now is 0.5.

In your case, you might formalize this by talking about your bid-ask spread when trading against people who know about these bottlenecks.
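As a toy illustration of the adverse-selection point (the 65% figure below is a made-up number of mine, not Richard's):

```python
def my_expected_profit(x_odds: float, p_counterparty_right: float) -> float:
    """Expected profit per $1 I stake when I take the other side of a bet at X:1 odds
    (the counterparty stakes $X against my $1), given how often they turn out to be right."""
    p_i_win = 1.0 - p_counterparty_right
    return p_i_win * x_odds - p_counterparty_right * 1.0

# Coin flip: no one can out-predict me, so whoever bets is right only half the time.
print(my_expected_profit(1.05, 0.5))   # +0.025 per $1: a small edge is enough

# Wind direction: only people who bothered to check a forecast take the bet,
# so (toy number) they are right ~65% of the time. The same 1.05:1 odds now lose money,
# and I need a much wider spread before offering the bet is safe.
print(my_expected_profit(1.05, 0.65))  # about -0.28 per $1
print(my_expected_profit(2.0, 0.65))   # about +0.05 per $1
```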
Razied

Surely something like the expected variance of p(doom) over time would be a much simpler way of formalising this, no? The probability over time is just a stochastic process, and OP is expecting the variance of this process to be very high in the near future.

Answer by harfe
A lot of the probabilities we talk about are probabilities we expect to change with evidence. If we flip a coin, our p(heads) changes after we observe the result of the flipped coin. My p(rain today) changes after I look into the sky and see clouds. In my view, there is nothing special in that regard about your p(doom). Uncertainty is in the mind, not in reality. However, how you expect your p(doom) to change depending on facts or observations is useful information, and it can be useful to convey it. Some options that come to mind:

1. Describe a model: if your p(doom) estimate is the result of a model consisting of other variables, just describing this model is useful information about your state of knowledge, even if that model is only approximate. This seems to come closest to your actual situation.
2. Describe your probability distribution over your p(doom) in 1 year (or another time frame): you could say that you think there is a 25% chance that your p(doom) in 1 year is between 10% and 30%, or give other information about that distribution. Note: your current p(doom) should be the mean of your p(doom) in 1 year.
3. Describe your probability distribution over your p(doom) after a hypothetical month of working on a better p(doom) estimate: you could say that if you were to work hard for a month on investigating p(doom), you think there is a 25% chance that your p(doom) after that month is between 10% and 30%. This is similar to 2., but imo a bit more informative. Again, your p(doom) should be the mean of your p(doom) after a hypothetical month of investigation, even if you don't actually do that investigation.

I previously expected open-source LLMs to lag far behind the frontier because they're very expensive to train and naively it doesn't make business sense to spend on the order of $10M to (soon?) $1B to train a model only to give it away for free.

 But this has been repeatedly challenged, most recently by Meta's Llama 3. They seem to be pursuing something like a commoditize your complement strategy: https://twitter.com/willkurt/status/1781157913114870187 .

As models become orders-of-magnitude more expensive to train can we expect companies to continue to open-source them?

In particular, can we expect this of Meta?

Answer by Aaron_Scher

Yeah, I think we should expect much more powerful open source AIs than we have now. I've been working on a blog post about this, maybe I'll get it out soon. Here are what seem like the dominant arguments to me: 

  • Scaling curves show strongly diminishing returns to $ spend: A $100m model might not be that far behind a $1b model, performance wise. 
  • There are numerous (maybe 7) actors in the open source world who are at least moderately competent and want to open source powerful models. There is a niche in the market for powerful open source models, an
... (read more)

Consequentialists (including utilitarians) claim that the goodness of an action should be judged based on the goodness of its consequences. The word utility is often used to refer to the quantified goodness of a particular outcome. When the consequences of an action are uncertain, it is often taken for granted that consequentialists should choose the action which has the highest expected utility. The expected utility is the sum of the utilities of each possible outcome, weighted by their probability. For a lottery which gives outcome utilities $u_1, \dots, u_n$ with respective probabilities $p_1, \dots, p_n$, the expected utility is:

$$\mathbb{E}[U] = \sum_{i=1}^{n} p_i u_i$$

There are several good reasons to use the maximization of expected utility as a normative rule. I'll talk about some of them here, but I recommend Joe Carlsmith's series of posts 'On Expected Utility' as a...

If bounded below, you can just shift up to make it positive. But the geometric expected utility order is not preserved under shifts.

MichaelStJules
Violations of continuity aren't really vulnerable to proper/standard money pumps. The author calls it "arbitrarily close to pure exploitation" but that's not pure exploitation. It's only really compelling if you assume a weaker version of continuity in the first place, but you can just deny that. I think transitivity (+ independence of irrelevant alternatives) and countable independence (or the countable sure-thing principle) are enough to avoid money pumps, and I expect they give a kind of expected utility maximization form (combining McCarthy et al., 2019 and Russell & Isaacs, 2021). Against the requirement of completeness (or the specific money pump argument for it by Gustafsson in your link), see Thornley here. To be clear, countable independence implies your utilities are "bounded" in a sense, but possibly lexical/lexicographic. See Russell & Isaacs, 2021.
cousin_it
Well, you can't have some states as "avoid at all costs" and others as "achieve at all costs", because having them in the same lottery leads to nonsense, no matter what averaging you use. And allowing only one of the two seems arbitrary. So it seems cleanest to disallow both. But geometric averaging wouldn't let you do that either, or am I missing something?
A.H.
Fine. But the purpose of exploring different averaging methods is to see whether it expands the richness of the kind of behaviour we want to describe. The point is that using arithmetic averaging is a choice which limits the kind of behaviour we can get. Maybe we want to describe behaviours which can't be described under expected utility. Having an 'avoid at all costs state' is one such behaviour, which finds a natural description using non-arithmetic averaging and can't be described in more typical VNM terms. If your position is 'I would never want to describe normative ethics using anything other than expected utility' then that's fine, but some people (like me) are interested in looking at what the alternatives to expected utility might be. That's why I wrote this post. As it stands, I didn't find geometric averaging very satisfactory (as I wrote in the post), but I think things like this are worth exploring.

You are right: geometric averaging on its own doesn't allow violations of independence. But some other protocol for deciding over lotteries does. It's described more in the Garrabrant post linked above.
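As a concrete illustration of the 'avoid at all costs state' point discussed above, here is a small sketch (my own example, with made-up utilities) comparing arithmetic and geometric averaging of a lottery that contains a zero-utility outcome:

```python
import math

def arithmetic_eu(lottery):
    """Standard expected utility: sum of p_i * u_i."""
    return sum(p * u for p, u in lottery)

def geometric_eu(lottery):
    """Geometric expectation: product of u_i ** p_i (assumes nonnegative utilities)."""
    return math.prod(u ** p for p, u in lottery)

# A lottery with a tiny chance of a zero-utility, "avoid at all costs" outcome,
# versus a safe option. Utilities are made up for illustration.
risky = [(0.999, 100.0), (0.001, 0.0)]
safe = [(1.0, 50.0)]

print(arithmetic_eu(risky), arithmetic_eu(safe))  # 99.9 vs 50.0: risky is preferred
print(geometric_eu(risky), geometric_eu(safe))    # 0.0 vs 50.0: the zero outcome acts as a veto
```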

In a new preprint, Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models, my coauthors and I introduce a technique, Sparse Human-Interpretable Feature Trimming (SHIFT), which I think is the strongest proof-of-concept yet for applying AI interpretability to existential risk reduction.[1] In this post, I will explain how SHIFT fits into a broader agenda for what I call cognition-based oversight. In brief, cognition-based oversight aims to evaluate models according to whether they’re performing intended cognition, instead of whether they have intended input/output behavior.

In the rest of this post I will:

  1. Articulate a class of approaches to scalable oversight I call cognition-based oversight.
  2. Narrow in on a model problem in cognition-based oversight called Discriminating Behaviorally Identical Classifiers (DBIC). DBIC is formulated to be a concrete problem which I think captures most
...
Buck


I like this post and this research direction, I agree with almost everything you say, and I think you’re doing an unusually good job of explaining why you think your work is useful.

A nitpick: I think you’re using the term “scalable oversight” in a nonstandard and confusing way.

You say that scalable oversight is a more general version of “given a good model and a bad model, determine which one is good.” I imagine that more general sense you wanted is something like: you can implement some metric that tells you how “good” a model is, which can be applied not... (read more)

This is a series of snippets about the Google DeepMind mechanistic interpretability team's research into Sparse Autoencoders, that didn't meet our bar for a full paper. Please start at the summary post for more context, and a summary of each snippet. They can be read in any order.

Activation Steering with SAEs

Arthur Conmy, Neel Nanda

TL;DR: We use SAEs trained on GPT-2 XL’s residual stream to decompose steering vectors into interpretable features. We find a single SAE feature for anger which is a Pareto-improvement over the anger steering vector from existing work (Section 3, 3 minute read). We have more mixed results with wedding steering vectors: we can partially interpret the vectors, but the SAE reconstruction is a slightly worse steering vector, and just taking the obvious features produces...
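As a rough illustration of what "decomposing a steering vector into SAE features" involves: a minimal sketch assuming a standard single-layer ReLU SAE with a decoder bias (the shapes, random parameters, and dictionary size below are placeholders of my own, not the team's trained SAE):

```python
import torch

d_model, d_sae = 1600, 4096   # GPT-2 XL residual stream width; dictionary size is a placeholder

# Placeholder SAE parameters; in practice these come from a trained sparse autoencoder.
W_enc = torch.randn(d_model, d_sae) * 0.01
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model) * 0.01
b_dec = torch.zeros(d_model)

def sae_encode(x: torch.Tensor) -> torch.Tensor:
    """Decompose a residual-stream vector into sparse, nonnegative feature activations."""
    return torch.relu((x - b_dec) @ W_enc + b_enc)

def sae_decode(acts: torch.Tensor) -> torch.Tensor:
    """Reconstruct a residual-stream vector from feature activations."""
    return acts @ W_dec + b_dec

steering_vector = torch.randn(d_model)     # stand-in for e.g. an "anger" steering vector
acts = sae_encode(steering_vector)
print(acts.topk(5).indices)                # the handful of features that dominate the vector
reconstruction = sae_decode(acts)          # can itself be tried as a steering vector
print(torch.norm(steering_vector - reconstruction))
```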

Sam Marks

With the ITO experiments, my first guess would be that reoptimizing the sparse approximation problem is mostly relearning the encoder, but with some extra uninterpretable hacks for low activation levels that happen to improve reconstruction. In other words, I'm guessing that the boost in reconstruction accuracy (and therefore loss recovered) is mostly not due to better recognizing the presence of interpretable features, but by doing fiddly uninterpretable things at low activation levels.

I'm not really sure how to operationalize this into a prediction. Mayb... (read more)
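To spell out what "reoptimizing the sparse approximation problem" refers to, here is a minimal sketch (my own construction, not code from the paper or the comment): hold the SAE's decoder fixed and re-fit sparse codes for a batch of activations by direct L1-penalized optimization, rather than using the learned encoder:

```python
import torch

d_model, d_sae, n_tokens = 64, 256, 128    # toy sizes

# Stand-ins for a frozen SAE decoder (dictionary) and a batch of model activations.
W_dec = torch.randn(d_sae, d_model) * 0.1
activations = torch.randn(n_tokens, d_model)

def reoptimize_codes(acts, dictionary, l1_coeff=3e-3, steps=300, lr=0.05):
    """Re-fit sparse codes for a fixed dictionary by direct optimization
    (L1-penalized regression). A real ITO method would also enforce nonnegativity,
    since SAE activations are nonnegative."""
    codes = torch.zeros(acts.shape[0], dictionary.shape[0], requires_grad=True)
    opt = torch.optim.Adam([codes], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = codes @ dictionary
        loss = (recon - acts).pow(2).mean() + l1_coeff * codes.abs().mean()
        loss.backward()
        opt.step()
    return codes.detach()

ito_codes = reoptimize_codes(activations, W_dec)
recon_error = (ito_codes @ W_dec - activations).pow(2).mean()
density = (ito_codes.abs() > 1e-2).float().mean()
print(recon_error.item(), density.item())  # reconstruction error and code density
```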

Sam Marks
Awesome stuff -- I think that updates like this (both from the GDM team and from Anthropic) are very useful for organizing work in this space. And I especially appreciate the way this was written, with both short summaries and in-depth write-ups.

Emergent Instrumental Reasoning Without Explicit Goals

TL;DR: LLMs can act and scheme without being told to do so. This is bad.


Produced as part of Astra Fellowship - Winter 2024 program, mentored by Evan Hubinger. Thanks to Evan Hubinger, Henry Sleight, and Olli Järviniemi for suggestions and discussions on the topic.

Introduction

Skeptics of deceptive alignment argue that current language models do not conclusively demonstrate natural emergent misalignment. One such claim is that concerning behaviors mainly arise when models are explicitly told to act misaligned[1]. Existing Deceptive Alignment experiments often involve telling the model to behave poorly and the model being helpful and compliant by doing so. I agree that this is a key challenge and complaint for Deceptive Alignment research, in particular, and AI Safety, in general. My project is aimed...

ryan_greenblatt
I would summarize this result as: If you train models to say "there is a reason I should insert a vulnerability" and then to insert a code vulnerability, then this model will generalize to doing "bad" behavior and making up specific reasons for doing that bad behavior in other cases. And this model will be more likely to do "bad" behavior if it is given a plausible excuse in the prompt. Does this seem like a good summary? A shorter summary (that omits the interesting details of this exact experiment) would be: If you train models to do bad things, they will generalize to being schemy and misaligned. This post presents an interesting result and I appreciate your write-up, though I feel like the title, TL;DR, and intro seem to imply this result is considerably more "unprompted" than it actually is. As in, my initial skim of these sections made me think this result is much more striking than it actually is.

To be clear, I think a plausible story for AI becoming dangerously schemy/misaligned is that doing clever and actively bad behavior in training will be actively reinforced due to imperfect feedback signals (aka reward hacking).

So, I am interested in the question: when some types of "bad behavior" get reinforced, how does this generalize?

Yesterday Adam Shai put up a cool post which… well, take a look at the visual:

Yup, it sure looks like that fractal is very noisily embedded in the residual activations of a neural net trained on a toy problem. Linearly embedded, no less.

I (John) initially misunderstood what was going on in that post, but some back-and-forth with Adam convinced me that it really is as cool as that visual makes it look, and arguably even cooler. So David and I wrote up this post / some code, partly as an explainer for why on earth that fractal would show up, and partly as an explainer for the possibilities this work potentially opens up for interpretability.

One sentence summary: when tracking the hidden state of a hidden Markov model, a Bayesian’s...
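For the mechanics the post is gesturing at, here is a minimal sketch (toy transition and emission matrices of my own choosing) of Bayesian belief-state tracking for an HMM; the trajectory of these belief vectors on the probability simplex is the object that shows up in the fractal visual:

```python
import numpy as np

# Toy HMM: 3 hidden states, 2 observation symbols (matrices are made up for illustration).
T = np.array([[0.8, 0.1, 0.1],   # T[i, j] = P(next state = j | current state = i)
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
E = np.array([[0.9, 0.1],        # E[i, o] = P(observe o | hidden state = i)
              [0.5, 0.5],
              [0.1, 0.9]])

def update_belief(belief, obs):
    """One step of Bayesian belief-state tracking: propagate the belief through the
    transition dynamics, reweight by the observation likelihood, then renormalize."""
    predicted = belief @ T
    unnormalized = predicted * E[:, obs]
    return unnormalized / unnormalized.sum()

belief = np.ones(3) / 3                  # start from a uniform prior over hidden states
for obs in [0, 0, 1, 1, 0]:
    belief = update_belief(belief, obs)
    print(belief)                        # each belief is a point on the 2-simplex
```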

Is there a link to the code? I'm overlooking it if so; it would be useful to see.

Elon Musk's Hyperloop proposal had substantial public interest. With various initial Hyperloop projects now having failed, I thought some people might be interested in a high-speed transportation system that's...perhaps not "practical" per se, but at least more-practical than the Hyperloop approach.

aerodynamic drag in hydrogen

Hydrogen has a lower molecular mass than air, so it has a higher speed of sound and lower density. The higher speed of sound means a vehicle in hydrogen can travel at 2300 mph while remaining subsonic, and the lower density reduces drag. This paper evaluated the concept and concluded that:

the vehicle can cruise at Mach 2.8 while consuming less than half the energy per passenger of a Boeing 747 at a cruise speed of Mach 0.81

In a tube, at subsonic speeds, the gas...
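As a quick sanity check on the speed-of-sound claim above, here is a small back-of-the-envelope script (standard ideal-gas formula; the temperature and molar masses are my own round numbers):

```python
import math

R = 8.314        # J/(mol*K), gas constant
T = 293.0        # K, roughly room temperature
gamma = 1.4      # heat capacity ratio for diatomic gases (N2/O2 and H2 alike)

def speed_of_sound(molar_mass_kg_per_mol: float) -> float:
    """Ideal-gas speed of sound: c = sqrt(gamma * R * T / M)."""
    return math.sqrt(gamma * R * T / molar_mass_kg_per_mol)

MS_TO_MPH = 2.237
c_air = speed_of_sound(0.0290)   # air, ~29 g/mol
c_h2 = speed_of_sound(0.00202)   # hydrogen, ~2 g/mol

print(round(c_air * MS_TO_MPH), "mph")   # ~770 mph
print(round(c_h2 * MS_TO_MPH), "mph")    # ~2900 mph

# 2300 mph is roughly Mach 3 relative to air, but still subsonic in hydrogen.
print(round(2300 / (c_air * MS_TO_MPH), 2), round(2300 / (c_h2 * MS_TO_MPH), 2))
```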

gilch
Hydrogen can only burn in the presence of oxygen. The pipe does not contain any, and combustion isn't possible until after they have had time to mix. It's also not going to explode from the pressure, because it's the same as the atmosphere. The shaped charge is obviously going to explode, that's the point, but it will be more directional. That still doesn't sound safe in an enclosed space. Maybe the vehicle could deploy a gasket seal with airbags or something to reduce the leakage of expensive hydrogen.
bhauth
It can't use "air" around it for engines because what's around it isn't "air". Oxygen is much heavier than the fuel it's used with, and you'd either need liquid oxygen (which increases costs) or pressurized tanks (which would perhaps double that mass). That's still lighter than batteries, yes, but engines are also needed. Piston engines are inefficient and/or heavy, and gas turbines are somewhat expensive. It's not that difficult to separate water and hydrogen, that's true, but processing that much gas is still rather impractical when batteries have enough specific energy. Simply condensing it in the tube is...possible, but would increase drag, especially considering density variation issues, and you'd have to deal with getting it out of a long sealed tube without leaking hydrogen. Also, if batteries are good enough, the cost of replacing the hydrogen alone probably makes batteries better than burning the hydrogen.
gilch
Condensation is not just possible but would happen by default. You described the tubes as steel lined with aluminum in contact with the ground, if not buried. That's going to be consistently cool enough for passive condensation. Getting water out of a long tube shouldn't be hard with multiple drains, and if there's any incline, you just need them at the bottom. You can just dump it in the ground. Use a plumbing trap to keep the gasses separated. They're at equal pressure, so this should work, and the pressure can also be maintained mostly passively with hydrogen bladders exposed to the atmosphere on the outside, although the burned hydrogen will have to be regenerated before they empty completely; this can be done anywhere on the pipe. Hydrogen can be easily regenerated by electrolysis of water, which doesn't seem any more expensive than charging the batteries. It might be even cheaper to crack it off of natural gas or to use white hydrogen when available. Are turbines more expensive than electric motors for similar power? It's true that conventional piston engines are heavy, but batteries are also heavy, especially the cheaper chemistries. Alternatively, run electricity through the pipe to power the vehicles so they don't have to carry any extra weight for power. It's coated with conductive aluminum already. If half-pipes could be welded with a dielectric material and not cost any more, that would work. Or use an internal monorail, but maybe only if you were going to do that already. Or you could suspend a wire. That's got to be pretty cheap compared to the pipe itself.

    …run electricity through the pipe…

Simpler to do what some existing electric trains do: use the rails as ground, and have a charged third rail for power.  We don’t like this system much for new trains, because the third rail is deadly to touch.  It’s a bad thing to leave lying on the ground where people can reach it.  But in this system, it’s in a tube full of unbreathable hydrogen, so no one is going to casually come across it.
