What is the mysterious impressive new ‘gpt2-chatbot’ from the Arena? Is it GPT-4.5? A refinement of GPT-4? A variation on GPT-2 somehow? A new architecture? Q-star? Someone else’s model? Could be anything. It is so weird that this is how someone chose to present that model.

There was also a lot of additional talk this week about California’s proposed SB 1047.

I wrote an additional post extensively breaking that bill down, explaining how it would work in practice, addressing misconceptions about it and suggesting fixes for its biggest problems along with other improvements. For those interested, I recommend reading at least the sections ‘What Do I Think The Law Would Actually Do?’ and ‘What are the Biggest Misconceptions?’

As usual, lots of other things happened as well.


lukehmiles13m10

The original post that introduced the technique is the best explanation of the steering stuff: https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector
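For readers who want the gist before clicking through, here is a minimal sketch of the idea as I understand it from the linked post: record activations for two contrasting prompts at some layer, take the difference, scale it, and add it back into the forward pass at that layer for a new prompt. Everything below is a toy assumption for illustration (the tiny random "model", `forward`, `steering_vector`, and the layer and coefficient values are hypothetical); it is not GPT-2-XL and not the post's code.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_LAYERS = 8, 4

# Hypothetical toy "model": each layer adds a tanh update to a residual stream.
WEIGHTS = [rng.normal(scale=0.5, size=(D_MODEL, D_MODEL)) for _ in range(N_LAYERS)]


def forward(x, steer_layer=None, steer_vec=None, record_layer=None):
    """Run the toy residual stream.

    If record_layer is set, return the stream as it enters that layer.
    If steer_layer and steer_vec are set, add steer_vec to the stream
    as it enters steer_layer.
    """
    h = x.copy()
    for i, w in enumerate(WEIGHTS):
        if i == record_layer:
            return h
        if i == steer_layer and steer_vec is not None:
            h = h + steer_vec
        h = h + np.tanh(h @ w)
    return h


def steering_vector(x_pos, x_neg, layer, coeff):
    """Scaled difference of activations for two contrasting inputs."""
    a_pos = forward(x_pos, record_layer=layer)
    a_neg = forward(x_neg, record_layer=layer)
    return coeff * (a_pos - a_neg)


# Stand-ins for embedded prompts (random vectors, purely illustrative).
x_pos, x_neg, x_prompt = (rng.normal(size=D_MODEL) for _ in range(3))

vec = steering_vector(x_pos, x_neg, layer=2, coeff=5.0)
baseline = forward(x_prompt)
steered = forward(x_prompt, steer_layer=2, steer_vec=vec)
print("shift in final activations:", np.linalg.norm(steered - baseline))
```

In a real transformer the same pattern would be done with forward hooks on the residual stream, and the choice of layer and coefficient matters a lot.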

5Viliam13h

The thing with "boosting productivity" is tricky, because productivity is not a linear thing. For example, in software development, using a new library can make adding new features faster (more functionality out of the box), but fixing bugs slower (more complexity involved, especially behind the scenes). So what I would expect to happen is that there is a month or two with exceptionally few bugs, the team velocity is measured and announced as a new standard, deadlines are adjusted accordingly, then a few bugs happen and now you are under a lot more pressure than before. Similarly, with LLMs it will be difficult to explain to non-technical management if they happen to be good at some kinds of tasks but worse at other kinds. Also, losing control... for some reason that you do not understand, the LLM has a problem with the specific task that was assigned to you, and you are blamed for that.

"AI Safety for Fleshy Humans" an AI Safety explainer by Nicky Case

habryka

11h

This is a linkpost for https://aisafety.dance/

Nicky Case, of "The Evolution of Trust" and "We Become What We Behold" fame (two quite popular online explainers/mini-games) has written an intro explainer to AI Safety! It looks pretty good to me, though just the first part is out, which isn't super in-depth. I particularly appreciate Nicky clearly thinking about the topic themselves, and I kind of like some of their "logic vs. intuition" frame, even though I think that aspect is less core to my model of how things will go. It's clear that a lot of love has gone into this, and I think having more intro-level explainers for AI-risk stuff is quite valuable.

===

The AI debate is actually 100 debates in a trenchcoat.

Will artificial intelligence (AI) help us cure all disease, and build a...


2mako yass3h

This is good! I would recommend it to a friend! Some feedback.

* An individual human can be inhumane, but the aggregate of human values kind of visibly isn't and in most ways couldn't be: Human cultures are getting more humane reliably as transparency/reflection and coordination increase over time, but also inevitably if you aggregate a bunch of concave values it will produce a value system that treats all of the subjects of the aggregation pretty decently. A lot of the time, when people accuse us of conflating something, we equate those things because we have an argument that they're going to turn out to be equivalent. So emphasizing a difference between these two things could be really misleading, and possibly kinda harmful, given that it could undermine the implementation of the simplest/most arguably correct solutions to alignment (which are just aggregations of humans' values). This could be a whole conversation, but could we just not define humane values as being necessarily distinct from human values? How about this:
  * People are sometimes confused by 'Human values', as it seems to assume that all humans value the same things, but many humans have values that conflict with the preferences of other humans. When we say 'Humane values', we're defining a value system that does a decent job at balancing and reconciling the preferences of every human (Humans, Every one).
* [graph point for "systems programmer with mlp shirt"] would it be funny if there were another point, "systems programmer without mlp shirt", and it was pareto-inferior
* "What if System 2 is System 1". This is a great insight, I think it is, and I think the main reason nerdy types often fail to notice how permeable and continuous the boundary is is a kind of tragic habitual cognitive autoimmune disease, and I have a post brewing about this after I used a repaired relationship with the unconscious bulk to cure my astigmatism (I'm going to let it sit for a year just to confirm that the

8LawrenceC5h

I think this is really quite good, and went into way more detail than I thought it would. Basically my only complaints on the intro/part 1 are some terminology and historical nitpicks. I also appreciate the fact that Nicky just wrote out her views on AIS, even if they're not always the most standard ones or other people dislike them (e.g. pointing at the various divisions within AIS, and the awkward tension between "capabilities" and "safety").

I found the inclusion of a flashcard review applet for each section super interesting. My guess is it probably won't see much use, and I feel like this is the wrong genre of post for flashcards.[1] But I'm still glad this is being tried, and I'm curious to see how useful/annoying other people find it.

I'm looking forward to parts two and three.

----------------------------------------

Nitpicks:[2]

Logic vs Intuition: I think the "logic vs intuition" frame feels like it's pointing at a real thing, but it seems somewhat off. I would probably describe the gap as explicit vs implicit or legible and illegible reasoning (I guess, if that's how you define logic and intuition, it works out?).

Mainly because I'm really skeptical of claims of the form "to make a big advance in/to make AGI from deep learning, just add some explicit reasoning". People have made claims of this form for as long as deep learning has been a thing. Not only have these claims basically never panned out historically, these days "adding logic" often means "train the model harder and include more CoT/code in its training data" or "finetune the model to use an external reasoning aid", and not "replace parts of the neural network with human-understandable algorithms".

I also think this framing mixes together "problems of game theory/high-level agent modeling/outer alignment vs problems of goal misgeneralization/lack of robustness/lack of transparency" and "the kind of AI people did 20-30 years ago" vs "the kind of AI people do now". This model of logic an

2mako yass3h

The intention of this part of the paragraph wasn't totally clear, but you seem to be saying this wasn't great? From what I understand, these actually all made the model far more interpretable? Chain of thought is a wonderful thing: it clears a space where the model will just earnestly confess its inner thoughts and plans in a way that isn't subject to training pressure, and so it, in most ways, can't learn to be deceptive about it.

LawrenceC14m20

No, I'm saying that "adding 'logic' to AIs" doesn't (currently) look like "figure out how to integrate insights from expert systems/explicit bayesian inference into deep learning", it looks like "use deep learning to nudge the AI toward being better at explicit reasoning by making small changes to the training setup". The standard "deep learning needs to include more logic" take does not look like using deep learning to get more explicit reasoning, it looks like doing a slightly different RL or supervised finetuning setup to end up with a more capable mode... (read more)

Why I'm doing PauseAI

Joseph Miller

GPT-5 training is probably starting around now. It seems very unlikely that GPT-5 will cause the end of the world. But it’s hard to be sure. I would guess that GPT-5 is more likely to kill me than an asteroid, a supervolcano, a plane crash or a brain tumor. We can predict fairly well what the cross-entropy loss will be, but pretty much nothing else.

Maybe we will suddenly discover that the difference between GPT-4 and superhuman level is actually quite small. Maybe GPT-5 will be extremely good at interpretability, such that it can recursively self improve by rewriting its own weights.

Hopefully model evaluations can catch catastrophic risks before wide deployment, but again, it’s hard to be sure. GPT-5 could plausibly be devious enough to circumvent all of...


Nathan Helm-Burger42m20

I absolutely sympathize, and I agree that, given the worldview and information you have, advocating for a pause makes sense. I would get behind 'regulate AI' or 'regulate AGI', certainly. I think, though, that pausing is an incorrect strategy which would do more harm than good, so despite being aligned with you in being concerned about AGI dangers, I don't endorse that strategy.

Some part of me thinks this oughtn't matter, since there's approximately 0% chance of the movement achieving that literal goal. The point is to build an anti-AGI movement, and to ge... (read more)

1yanni kyriacos2h

Hi Tomás! is there a prediction market for this that you know of?

3Joseph Miller10h

While I want people to support PauseAI Is one of the main points of my post. If you support PauseAI today you may unleash a force which you cannot control tomorrow.

1yanni kyriacos2h

I think it is unrealistic to ask people to internalise that level of ambiguity. This is how EAs turn themselves into mental pretzels.

KAN: Kolmogorov-Arnold Networks

Gunnar_Zarncke

This is a linkpost for https://arxiv.org/abs/2404.19756

Abstract:

Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability. For accuracy, much smaller KANs can achieve comparable or better accuracy than much larger MLPs in data fitting and PDE solving. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful collaborators helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives for MLPs, opening opportunities for further improving today's deep learning models which rely heavily on MLPs.
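To make "learnable activation functions on edges" concrete, here is a rough sketch of a single KAN-style layer. It is not the paper's implementation: as a simplifying assumption, each edge function is a piecewise-linear interpolant on a fixed grid (evaluated with `np.interp`) standing in for the spline parametrization the abstract describes, and the class name `ToyKANLayer`, the grid range, and the initialisation are hypothetical choices. Each output node just sums the edge functions applied to the inputs, with no separate linear weights.

```python
import numpy as np


class ToyKANLayer:
    """Toy KAN-style layer: out_j = sum_i phi_ij(x_i).

    Each edge function phi_ij is piecewise linear on a shared grid;
    its learnable parameters are its values at the grid points.
    """

    def __init__(self, n_in, n_out, grid_size=8, x_min=-1.0, x_max=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.grid = np.linspace(x_min, x_max, grid_size)
        # coef[i, j, g] is the value of phi_ij at grid[g].
        self.coef = rng.normal(scale=0.1, size=(n_in, n_out, grid_size))

    def __call__(self, x):
        # x has shape (batch, n_in); the result has shape (batch, n_out).
        batch, n_in = x.shape
        n_out = self.coef.shape[1]
        out = np.zeros((batch, n_out))
        for i in range(n_in):
            for j in range(n_out):
                # np.interp clamps inputs outside the grid to the end values.
                out[:, j] += np.interp(x[:, i], self.grid, self.coef[i, j])
        return out

    def n_params(self):
        return self.coef.size


layer = ToyKANLayer(n_in=3, n_out=2)
x = np.array([[0.1, -0.2, 0.3]])
print(layer(x).shape, "parameters:", layer.n_params())
```

Note that this toy layer stores `grid_size` numbers per edge where a plain linear layer would store one, which is worth keeping in mind when comparing models "per parameter".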

mako yass1h20

Theoretically and empirically, KANs possess faster neural scaling laws than MLPs

What do they mean by this? Isn't that contradicted by this recommendation to use an ordinary architecture if you want fast training:

A section from their diagram where they disrecommend KANs if you want fast training

It seems like they mean faster per parameter, which is an... unclear claim given that each parameter or step, here, appears to represent more computation (there's no mention of flops) than a parameter/step in a matmul|relu would? Maybe you could buff that out with specialized hardware, but they don't discuss hardware.

One might worry that KANs

... (read more)

2Nathan Helm-Burger2h

So, after reading the KAN paper, and thinking about it in the context of this post: https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transformers-represent-belief-state-geometry-in-their My vague intuition is that the same experiment done with a KAN would result in a clearer fractal which wiggled less once training loss had plateaued. Is that also other people's intuition?

An Introduction to AI Sandbagging

Teun van der Weij, Felix Hofstätter, Francis Rhys Ward

Ω 188d

Summary: Evaluations provide crucial information to determine the safety of AI systems which might be deployed or (further) developed. These development and deployment decisions have important safety consequences, and therefore they require trustworthy information. One reason why evaluation results might be untrustworthy is sandbagging, which we define as strategic underperformance on an evaluation. The strategic nature can originate from the developer (developer sandbagging) and the AI system itself (AI system sandbagging). This post is an introduction to the problem of sandbagging.

The Volkswagen emissions scandal

There are environmental regulations which require the reduction of harmful emissions from diesel vehicles, with the goal of protecting public health and the environment. Volkswagen struggled to meet these emissions standards while maintaining the desired performance and fuel efficiency of their diesel engines (Wikipedia). Consequently, Volkswagen...


Nathan Helm-Burger1h20

I've mentioned it elsewhere, but I'll repeat it here since it's relevant. For GPT-style transformers, and probably for other model types, you can smoothly and subtly degrade the performance of the model by adding noise to part or all of the activations. This is particularly useful for detecting sandbagging, because you would expect sandbagging to show up as an anomalous increase in capability, breaking the smooth downward trend in capability, as you increased the amount of noise injected or the fraction of activations to which noise was added. I found tha... (read more)
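A rough sketch of what such a noise sweep could look like, under heavy assumptions: the "model" here is a toy linear classifier whose per-feature contributions stand in for activations, and `make_toy_task`, `noisy_accuracy`, `noise_sweep`, and `anomalous_increases` are hypothetical helper names, not anything from the comment or an existing library. The point is only the shape of the procedure: add Gaussian noise of increasing scale, record accuracy at each level, and flag any step where accuracy goes up rather than down.

```python
import numpy as np


def make_toy_task(n=2000, d=16, seed=0):
    """Toy binary task with a known linear decision rule."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)
    x = rng.normal(size=(n, d))
    y = (x @ w_true > 0).astype(int)
    return x, y, w_true


def noisy_accuracy(x, y, w, sigma, rng):
    """Accuracy after adding Gaussian noise (std sigma) to the 'activations'."""
    acts = x * w  # per-feature contributions, shape (n, d)
    acts = acts + rng.normal(scale=sigma, size=acts.shape)
    pred = (acts.sum(axis=1) > 0).astype(int)
    return float((pred == y).mean())


def noise_sweep(sigmas, seed=1):
    """Accuracy at each noise level, in the order given."""
    x, y, w = make_toy_task()
    rng = np.random.default_rng(seed)
    return [noisy_accuracy(x, y, w, s, rng) for s in sigmas]


def anomalous_increases(accs, tol=0.02):
    """Indices where accuracy rises by more than tol as noise increases."""
    return [i for i in range(1, len(accs)) if accs[i] > accs[i - 1] + tol]


sigmas = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]
accs = noise_sweep(sigmas)
for s, a in zip(sigmas, accs):
    print(f"sigma={s:<4} accuracy={a:.3f}")
print("anomalous steps:", anomalous_increases(accs))
```

An honest toy model like this one should just degrade as noise grows; the proposal is that a sandbagging model might instead show a bump, which is what `anomalous_increases` is meant to flag. The tolerance is an arbitrary illustrative value.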

William_S's Shortform

William_S

Ω 31y

2O O3h

I assume timelines are fairly long or this isn’t safety related. I don’t see a point in keeping PPUs or even caring about NDA lawsuits which may or may not happen and would take years in a short timeline or doomed world.

1mishka2h

I think having a probability distribution over timelines is the correct approach. Like, in the comment above:

O O1h12

Even in probabilistic terms, the evidence of OpenAI members respecting their NDAs makes it more likely that this was some sort of political infighting (EA related) than sub-year takeoff timelines. I would be open to a 1 year takeoff, I just don't see it happening given the evidence. OpenAI wouldn't need to talk about raising trillions of dollars, companies wouldn't be trying to commoditize their products, and the employees who quit OpenAI would speak up.

Political infighting is in general just more likely than very short timelines, which would go in c... (read more)

1mishka4h

Why can at most one of them be meaningfully right? Would not a simulation typically be "a multi-player game"? (But yes, if they assume that their "original self" was the sole creator (?), then they would all be some kind of "clones" of that particular "original self". Which would surely increase the overall weirdness.)


My hour of memoryless lucidity

Eric Neyman

This is a linkpost for https://ericneyman.wordpress.com/2024/05/04/my-hour-of-memoryless-lucidity/

Yesterday, I had a coronectomy: the top halves of my bottom wisdom teeth were surgically removed. It was my first time being sedated, and I didn’t know what to expect. While I was unconscious during the surgery, the hour after surgery turned out to be a fascinating experience, because I was completely lucid but had almost zero short-term memory.

My girlfriend, who had kindly agreed to accompany me to the surgery, was with me during that hour. And so — apparently against the advice of the nurses — I spent that whole hour talking to her and asking her questions.

The biggest reason I find my experience fascinating is that it has mostly answered a question that I’ve had about myself for quite a long time: how deterministic am...


ErioirE2h30

I had a very similar experience as a teenager after a mild concussion from falling on ice. According to my family, I would 'reboot' every few minutes and ask the same few questions exactly. It got burdensome enough that they put up a note on the inside of my bedroom door with something along the lines of:
"You are having amnesia"
"You hit your head and got a mild concussion"
"You've already been to the ER, they said you're likely to be fine after a few hours and it is safe to sleep."

The entire experience was (reportedly) very stressful to me due to disorientation.

How would you navigate a severe financial emergency with no help or resources?

Tigerlily

Hello, friends.

This is my first post on LW, but I have been a "lurker" here for years and have learned a lot from this community that I value.

I hope this isn't pestilent, especially for a first-time post, but I am requesting information/advice/non-obvious strategies for coming up with emergency money.

I wouldn't ask except that I'm in a severe financial emergency and I can't seem to find a solution. I feel like every minute of the day I'm butting my head against a brick wall trying and failing to figure this out.

I live in a very small town in rural Arizona. The local economy is sustained by fast food restaurants, pawn shops, payday lenders, and some huge factories/plants that are only ever hiring engineers and other highly specialized personnel.

I...


5Answer by nim13h

You're here, which tells me you have internet access. I mentally categorize options like Fiverr and mturk as "about as scammy as DoorDash". I don't think they're a good option, but I also don't think DoorDash is a very good option either. It's probably worth looking into online gig economy options. What skills were you renting to companies before you became a stay-at-home parent? There are probably online options to rent the same skills to others around the world. You write fluently in English and it sounds like English is your first language. Have you considered renting your linguistic skills to people with English as a second language? You may be able to find wealthy international people who value your proof-reading skills on their college work, or conversational skills to practice their spoken English with gentle correction as needed. It won't pay competitively with the tech industry, but it'll pay more than nothing. If you're in excellent health, the classic "super weird side gig" is stool donor programs. https://www.lesswrong.com/posts/i48nw33pW9kuXsFBw/being-a-donor-for-fecal-microbiota-transplants-fmt-do-good for more. Another weird one that depends on your age and health and bodily situation, since you've had more than 0 kids of your own, is gestational surrogacy. Maybe not a good fit, but hey, you asked for weird. For a less weird one, try browsing Craigslist in a more affluent area to see what personal services people offer. House cleaning? Gardening? Dog walking? Browse Craigslist in your area and see which of those niches seem under-populated relative to elsewhere. Then use what you saw in the professionalism of the ads in wealthier areas to offer the missing services. This may get 0 results, but you might discover that there are local rich techies who would quite enjoy outsourcing certain household services for a rate that seems affordable to them but game-changing to you. Basically anything you imagine servants doing for a fairytale princess, som

Tigerlily2h10

Thank you for your response. I probably should have given a more exhaustive list of things I have already tried. Other than a couple things you mentioned, I have already tried the rest.

Before becoming a stay-at-home parent, I was a writer. I wasn't well paid but was starting to earn professional rates when I got pregnant with my second child and that took over my life. I have found it difficult to start writing again since then. The industry has changed so much and is changing still, and so am I. My life is so different now. I'm less sure of what I write -... (read more)

4romeostevensit11h

Oh yeah, food banks for sure!

3Viliam18h

Just some random thoughts:
* are there some kind of summer seasonal jobs? perhaps you could try looking for those
* find opportunities to meet local people, then ask them if they know about a job
* is there anything you could make at home and try to sell?

Transformers Represent Belief State Geometry in their Residual Stream

349

Adam Shai

Ω 13417d

Produced while being an affiliate at PIBBSS^[1]. The work was done initially with funding from a Lightspeed Grant, and then continued while at PIBBSS. Work done in collaboration with @Paul Riechers, @Lucas Teixeira, @Alexander Gietelink Oldenziel, and Sarah Marzen. Paul was a MATS scholar during some portion of this work. Thanks to Paul, Lucas, Alexander, Sarah, and @Guillaume Corlouer for suggestions on this writeup.

Introduction

What computational structure are we building into LLMs when we train them on next-token prediction? In this post we present evidence that this structure is given by the meta-dynamics of belief updating over hidden states of the data-generating process. We'll explain exactly what this means in the post. We are excited by these results because

We have a formalism that relates training data to internal

...


Alexander Gietelink Oldenziel6h20

Non exhaustive list of reasons one could be interested in computational mechanics: https://www.lesswrong.com/posts/GG2NFdgtxxjEssyiE/dalcy-s-shortform?commentId=DdnaLZmJwusPkGn96

1Moughees Ahmed14h

Excited to see what you come up with! Plausibly, a model trained on the entirety of human output should be able to decipher more hidden states: ones that are not obvious to us but might be obvious in latent space. It could mean that models might be super good at augmenting our existing understanding of fields but might not create new ones from scratch.

4Alexander Gietelink Oldenziel15h

I agree with you that the new/surprising thing is the linearity of the probe. I also agree that it is not entirely clear how surprising and new the linearity of the probe is.

If you understand how the causal states construction and the MSP work in computational mechanics, the experimental result isn't surprising. Indeed, it can't be any other way! That's exactly the magic of the definition of causal states.

What one person might find surprising or new, another thinks trivial. The subtle magic of the right theoretical framework is that it makes the complex simple, surprising phenomena apparent. Before learning about causal states I would not even have considered that there is a unique (!) optimal minimal predictor canonically constructible from the data, nor that the geometry of synchronizing belief states is generically a fractal. Of course, once one has properly internalized the definitions this is almost immediate. Pretty pictures can be helpful in building that intuition!

Adam and I (and many others) have been preaching the gospel of computational mechanics for a while now. Most of it has fallen on deaf ears before. Like you, I have been (positively!) surprised and amused by the sudden outpouring of interest. No doubt it's in part a testimony to the Power of the Visual! Never look a gift horse in the mouth!

I would say the parts of computational mechanics I am really excited about are a little deeper, downstream of causal states and the MSP. This is just a taster.

I'm confused and intrigued by your insistence that this follows from the good regulator theorem. Like Adam, I don't understand it. My understanding is that the original 'theorem' was wordcelled nonsense but that John has been able to formulate a nontrivial version of the theorem. My experience is that the theorem is often invoked in a handwavey way that leaves me no less confused than before. No doubt due to my own ignorance! I would be curious to hear a *precise* statement why the result here follows

