RobertM
I've seen a lot of takes (on Twitter) recently suggesting that OpenAI and Anthropic (and maybe some other companies) violated commitments they made to the UK's AISI about granting them access for e.g. predeployment testing of frontier models.  Is there any concrete evidence about what commitment was made, if any?  The only thing I've seen so far is a pretty ambiguous statement by Rishi Sunak, who might have had some incentive to claim more success than was warranted at the time.  If people are going to breathe down the necks of AGI labs about keeping to their commitments, they should be careful to only do it for commitments they've actually made, lest they weaken the relevant incentives.  (This is not meant to endorse AGI labs behaving in ways which cause strategic ambiguity about what commitments they've made; that is also bad.)
William_S
I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people, which worked on trying to understand language model features in context, leading to the release of an open source "transformer debugger" tool. I resigned from OpenAI on February 15, 2024.
quila
On Pivotal Acts

I was rereading some of the old literature on alignment research sharing policies after Tamsin Leake's recent post and came across some discussion of pivotal acts as well.

> Hiring people for your pivotal act project is going to be tricky. [...] People on your team will have a low trust and/or adversarial stance towards neighboring institutions and collaborators, and will have a hard time forming good-faith collaboration. This will alienate other institutions and make them not want to work with you or be supportive of you.

This is in a context where the 'pivotal act' example is using a safe ASI to shut down all AI labs.[1]

My thought is that I don't see why a pivotal act needs to be that. I don't see why shutting down AI labs or using nanotech to disassemble GPUs on Earth would be necessary. These may be among the 'most direct' or 'simplest to imagine' possible actions, but in the case of superintelligence, simplicity is not a constraint. We can instead select for the 'kindest' or 'least adversarial' or, really, functional-decision-theoretically optimal actions that save the future while minimizing the amount of adversariality this creates in the past (present).

Which can be broadly framed as 'using ASI for good'. Which is what everyone wants, even the ones being uncareful about its development.

Capabilities orgs would be able to keep working on fun capabilities projects in those days during which the world is saved, because a group following this policy would choose to use ASI to make the world robust to the failure modes of capabilities projects rather than shutting them down. Because superintelligence is capable of that, and so much more.

If there is a group for whom using ASI to make the world robust to risks and free of harm, in a way that doesn't infringe on ongoing non-violent activities, is problematic, then this post doesn't apply to them: their issue all along was not with the character of the pivotal act, but instead possibly with something like 'having my personal cosmic significance as a capabilities researcher stripped away by the success of an external alignment project'.

Another disclaimer: this post is about a world in which safely usable superintelligence has been created, but I'm not confident that anyone (myself included) currently has a safe and ready method to create it. This post shouldn't be read as an endorsement of possible current attempts to do this. I would of course prefer for this civilization to be one that could coordinate such that no groups were presently working on ASI, precluding this discourse.

1. ^ Side note: it's orthogonal to the point of this post, but this example also makes me think: if I were working on a safe ASI project, I wouldn't mind if another group who had discreetly built safe ASI used it to shut my project down, since my goal is 'ensure the future lightcone is used in a valuable, tragedy-averse way' and not 'gain personal power' or 'have a fun time working on AI' or something. In my morality, it would be naive to be opposed to that shutdown. But to the extent humanity is naive, we can easily do something else in that future to create better present dynamics (as the main text argues).
Anyone know folks working on semiconductors in Taiwan and Abu Dhabi, or on fiber at Tata Industries in Mumbai? I'm currently travelling around the world and talking to folks about various kinds of AI infrastructure, and looking for recommendations of folks to meet! If so, feel free to DM me! (If you don't know me, I'm a dev here on LessWrong and was also part of founding Lightcone Infrastructure.)
Do we expect future model architectures to be biased toward out-of-context reasoning (reasoning internally rather than in a chain-of-thought)? As in, what kinds of capabilities would lead companies to build models that reason less and less in token-space?

I mean, the first obvious thing would be that you are training the model to internalize some of the reasoning rather than having to pay for the additional tokens each time you want to do complex reasoning.

The thing is, I expect we'll eventually move away from just relying on transformers with scale. And so I'm trying to refine my understanding of the capabilities that are simply bottlenecked in this paradigm, and that model builders will need to resolve through architectural and algorithmic improvements. (Of course, based on my previous posts, I still think data is a big deal.)

Anyway, this kind of thinking eventually leads to the infohazardous area of, "okay then, what does the true AGI setup look like?" This is really annoying because it has alignment implications. If we start to move increasingly towards models that reason outside of token-space, then alignment becomes harder.

So, are there capability bottlenecks that eventually get resolved through something that requires out-of-context reasoning? So far, it seems like the current paradigm will not be an issue on this front: keep scaling transformers, and you don't really get any big changes in the model's likelihood of using out-of-context reasoning.

This is not limited to out-of-context reasoning. I'm trying to have a better understanding of the (dangerous) properties future models may develop simply as a result of needing to break a capability bottleneck. My worry is that many people end up over-indexing on the current transformer+scale paradigm (and this becomes insufficient for ASI), so they don't work on the right kinds of alignment or governance projects.

---

I'm unsure how big of a deal this architecture will end up being, but the rumoured xLSTM just dropped. It seemingly outperforms other models at the same size. Maybe it ends up just being another drop in the bucket, but I think we will see more attempts in this direction.

Claude summary of the paper's key points:

1. The authors introduce exponential gating with memory mixing in the new sLSTM variant. This allows the model to revise storage decisions and solve state tracking problems, which transformers and state space models without memory mixing cannot do.

2. They equip the mLSTM variant with a matrix memory and covariance update rule, greatly enhancing the storage capacity compared to the scalar memory cell of vanilla LSTMs. Experiments show this matrix memory provides a major boost.

3. The sLSTM and mLSTM are integrated into residual blocks to form xLSTM blocks, which are then stacked into deep xLSTM architectures.

4. Extensive experiments demonstrate that xLSTMs outperform state-of-the-art transformers, state space models, and other LSTMs/RNNs on language modeling tasks, while also exhibiting strong scaling behavior to larger model sizes.

This work is important because it presents a path forward for scaling LSTMs to billions of parameters and beyond. By overcoming key limitations of vanilla LSTMs - the inability to revise storage, limited storage capacity, and lack of parallelizability - xLSTMs are positioned as a compelling alternative to transformers for large language modeling.
Instead of doing all computation step-by-step as tokens are processed, advanced models might need to store and manipulate information in a compressed latent space, and then "reason" over those latent representations in a non-sequential way. The exponential gating with memory mixing introduced in the xLSTM paper directly addresses this need. Here's how:

1. Exponential gating allows the model to strongly update or forget the contents of each memory cell based on the input. This is more powerful than the simple linear gating in vanilla LSTMs. It means the model can decisively revise its stored knowledge as needed, rather than being constrained to incremental changes. This flexibility is crucial for reasoning, as it allows the model to rapidly adapt its latent state based on new information.

2. Memory mixing means that each memory cell is updated using a weighted combination of the previous values of all cells. This allows information to flow and be integrated between cells in a non-sequential way. Essentially, it relaxes the sequential constraint of traditional RNNs and allows for a more flexible, graph-like computation over the latent space.

3. Together, these two components endow the xLSTM with a dynamic, updateable memory that can be accessed and manipulated "outside" the main token-by-token processing flow. The model can compress information into this memory, "reason" over it by mixing and gating cells, then produce outputs guided by the updated memory state.

In this way, the xLSTM takes a significant step towards the kind of "reasoning outside token-space" that I suggested would be important for highly capable models. The memory acts as a workspace for flexible computation that isn't strictly tied to the input token sequence.

Now, this doesn't mean the xLSTM is doing all the kinds of reasoning we might eventually want from an advanced AI system. But it demonstrates a powerful architecture for models to store and manipulate information in a latent space, at a more abstract level than individual tokens. As we scale up this approach, we can expect models to perform more and more "reasoning" in this compressed space rather than via explicit token-level computation.
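To make "exponential gating with memory mixing" concrete, here is a minimal sketch of an sLSTM-style recurrent step. It is a simplified illustration loosely following the description above, not the paper's exact formulation: the real sLSTM adds a stabilizer state so the exponentials don't overflow, and uses block-diagonal recurrent weights.

```python
import numpy as np

def slstm_step(x, h_prev, c_prev, n_prev, W, R, b):
    """One simplified sLSTM-style step.

    W maps the input, R maps the previous hidden state ("memory mixing":
    every cell's update sees all previous cells), b is the bias; together
    they produce pre-activations for the four gates (i, f, z, o).
    """
    pre = W @ x + R @ h_prev + b
    i_tilde, f_tilde, z_tilde, o_tilde = np.split(pre, 4)

    i = np.exp(i_tilde)                 # exponential input gate: can overwrite memory sharply
    f = np.exp(f_tilde)                 # exponential forget gate: can discard memory sharply
    z = np.tanh(z_tilde)                # candidate cell input
    o = 1.0 / (1.0 + np.exp(-o_tilde))  # sigmoid output gate

    c = f * c_prev + i * z              # cell state update
    n = f * n_prev + i                  # normalizer state keeps the readout bounded
    h = o * (c / n)                     # normalized hidden state
    return h, c, n

# Toy usage: 4-dimensional inputs, 8 hidden cells, random weights.
d_in, d_h = 4, 8
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * d_h, d_in))
R = rng.normal(scale=0.1, size=(4 * d_h, d_h))
b = np.zeros(4 * d_h)
h = c = np.zeros(d_h)
n = np.ones(d_h)
for t in range(5):
    h, c, n = slstm_step(rng.normal(size=d_in), h, c, n, W, R, b)
```

The dense recurrent matrix R is what carries the "mixing": each cell's next value depends on every other cell's previous value, so information can be reshuffled across the latent state between token steps.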


Recent Discussion

On June 22nd, there was a “Munk Debate”, facilitated by the Canadian Aurea Foundation, on the question of whether “AI research and development poses an existential threat” (you can watch it here, which I highly recommend). On stage were Yoshua Bengio and Max Tegmark as proponents and Yann LeCun and Melanie Mitchell as opponents of the central thesis. This seems like an excellent opportunity to compare their arguments and the effects they had on the audience, in particular because in the Munk Debate format, the audience gets to vote on the issue before and after the debate.

The vote at the beginning revealed that 67% of the audience supported the existential-threat hypothesis and 33% were against it. Interestingly, the listeners were also asked whether they were prepared to change...

I wasn't able to find the full video on the site you linked, but I found it here, if anyone else has the same issue: 

Most people avoid saying literally false things, especially if those could be audited, like making up facts or credentials. The reasons for this are both moral and pragmatic — being caught out looks really bad, and sustaining lies is quite hard, especially over time. Let’s call the habit of not saying things you know to be false ‘shallow honesty’[1].

Often when people are shallowly honest, they still choose what true things they say in a kind of locally act-consequentialist way, to try to bring about some outcome. Maybe something they want for themselves (e.g. convincing their friends to see a particular movie), or something they truly believe is good (e.g. causing their friend to vote for the candidate they think will be better for the country).

Either way, if you...

Being able to credibly commit to doing this at appropriate times seems useful. I wouldn't want to commit to doing it at all times; becoming cooperatebot makes it rational for cooperative-but-preference-misaligned actors to exploit you. Shallow honesty seems like a good starting point for being able to say when you are attempting to be deeply honest, perhaps. But, for example, I would sure appreciate it if people could be less deeply honest about the path to AI capabilities. I do think the "deeply honest at the meta level" thing has some promise.

sapphire
I'm a real fan of insane ideas. I literally do acid every Monday. But I gotta say, among crazy ideas, 'be way way more honest' is well-trodden ground and the skulls are numerous. It just really rarely goes well. I'm a pretty honest guy and am attracted to the cluster. But if you start doing this, you are definitely trying something in a cluster of ideas that usually works terribly. If anything, I have to constantly tell myself to be less explicit and 'deeply honest'. It just doesn't work well for most people.
zeshen
Scott Alexander talked about explicit honesty (unfortunately paywalled) in contrast with radical honesty. In short, explicit honesty is being completely honest when asked, and radical honesty is being completely honest even without being asked. From what I understand of your post, deep honesty is about being completely honest about information you perceive to be relevant to the receiver, regardless of whether the information is explicitly being requested.

Scott also links to some cases where radical honesty did not work out well, like this, this, and this. I suspect deep honesty may lead to similar risks, as you have already pointed out.

And with regards to: I think they would form a three-circle Venn diagram. Things that are within the intersection of all three circles would be a no-brainer. But the tricky bits are the things that are either true but not kind/useful, or kind/useful but not true. And I understood this post as a suggestion to venture more into the former.
PhilosophicalSoul
I'm so happy you made this post.  I only have two (2) gripes. I say this as someone who 1) practices/believes in determinism, and 2) has interacted with journalists on numerous occasions with a pretty strict policy on honesty.

A few days ago I came upstairs to:

Me: how did you get in there?

Nora: all by myself!

Either we needed to be done with the crib, which had a good chance of much less sleeping at naptime, or we needed a taller crib. This is also something we went through when Lily was little, and that time what worked was removing the bottom of the crib.

It's a basic crib, a lot like this one. The mattress sits on a metal frame, which attaches to a set of holes along the side of the crib. On its lowest setting, the mattress is still ~6" above the floor. Which means if we remove the frame and set the mattress on the floor, we gain ~6".

Without the mattress weighing it down, though, the crib...

jefftk
Climbing out of the crib is mildly dangerous, since it's farther down on the outside than the inside. So it's good practice to switch away from a crib (or adjust the crib to be taller) once they get to where they'll be able to do that soon. Even if they can do it safely, though, a crib they can get in and out of on their own defeats the purpose of a crib -- at that point you should just move to something optimized for being easy to get in and out of, like a bed.
mikbp

> at that point you should just move to something optimized for being easy to get in and out of, like a bed

Yes, yes. Exactly. Isn't it much more practical to put her on a bed/mattress on the floor? That's what we do. Just using the mattress from the crib, for example.

A couple years ago, I had a great conversation at a research retreat about the cool things we could do if only we had safe, reliable amnestic drugs - i.e. drugs which would allow us to act more-or-less normally for some time, but not remember it at all later on. And then nothing came of that conversation, because as far as any of us knew such drugs were science fiction.

… so yesterday when I read Eric Neyman’s fun post My hour of memoryless lucidity, I was pretty surprised to learn that what sounded like a pretty ideal amnestic drug was used in routine surgery. A little googling suggested that the drug was probably a benzodiazepine (think valium). Which means it’s not only a great amnestic, it’s also apparently one...

It has extremely high immediate value -- it solves IP rights entirely.

It's the barbed wire for IP rights.

Algon
Even though I think the comment was useful, it doesn't look to me like it was as useful as the typical 139-karma comment, as I expect LW readers to be fairly unlikely to start popping benzos after reading this post. IMO it should've gotten like 30-40 karma. Even 60 wouldn't have been too shocking to me. But 139? That's way more karma than anything else I've posted. I don't think it warrants this much karma, and I now share @ryan_greenblatt's concerns about the ability to vote on Quick Takes and Popular Comments introducing algorithmic virality to LW. That sort of thing is typically corrosive to epistemic hygiene, as it changes the incentives of commenting more towards posting applause-lights. I don't think that's a good change for LW, as I think we've got too much group-think as it is.
habryka
Yeah, seems right to me. If this is a recurring thing we might deactivate voting on the popular comments interface or something like that.
Ben Pace
Here's a quick mockup of what that might look like. In my head, the voting UI is available after you click to expand and then scroll down to the bottom of the comment. Added: Oops, I realize I did the quick takes, not the popular comments. Still, the relevant changes are very similar.

With this article, I intend to initiate a discussion with the community on a remarkable (thought) experiment and its implications. The experiment is to conceptualize Stuart Kauffman's NK Boolean networks as a digital social communication network, which introduces a thus far unrealized method for strategic information transmission. From this premise, I deduce that such a technology would enable people to 'swarm', i.e.: engage in self-organized collective behavior without central control. Its realization could result in a powerful tool for bringing about large-scale behavior change. The concept provides a tangible connection between network topology, common knowledge and cooperation, which can improve our understanding of the logic behind prosocial behavior and morality. It also presents us with the question of how the development of such a technology should be...
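For readers who haven't met the underlying object, here is a minimal sketch of a classic Kauffman NK Boolean network: N nodes, each wired to K randomly chosen inputs and equipped with a random Boolean function of those inputs, updated synchronously. This is only the standard construction, not the post's proposed mapping onto a communication network.

```python
import numpy as np

def make_nk_network(n, k, rng):
    """Random wiring (k inputs per node) and one random truth table per node."""
    inputs = np.array([rng.choice(n, size=k, replace=False) for _ in range(n)])
    tables = rng.integers(0, 2, size=(n, 2 ** k))  # 2^k output bits per node
    return inputs, tables

def step(state, inputs, tables):
    """Synchronous update: each node looks up its inputs' bit pattern in its table."""
    powers = 2 ** np.arange(inputs.shape[1])
    idx = (state[inputs] * powers).sum(axis=1)
    return tables[np.arange(len(state)), idx]

rng = np.random.default_rng(0)
inputs, tables = make_nk_network(n=20, k=2, rng=rng)
state = rng.integers(0, 2, size=20)
for _ in range(10):
    state = step(state, inputs, tables)
print(state)
```

The interesting behavior (frozen, chaotic, or "critical" dynamics) depends mainly on K relative to N, which is what makes the topology question central to the post's argument.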

faul_sname
It's a pretty cool idea, and I think it could make a fun content-discovery (article/music/video/whatever) pseudo-social-media application (you follow some number of people, and have some unknown number of followers, and so you get feedback on how many of your followers liked the things you passed on, but no further information than that). I don't know whether I'd say it's super alignment-relevant, but also this isn't the Alignment Forum, and people are allowed to have interests that are not AI alignment, and even to share those interests.

Nice! I actually had this as a loose idea in the back of my mind for a while: to have a network of people connected like this and have them signal to each other their track of the day, which could be actually fun. It is a feasible use case as well. The underlying reasoning is also that (at least for me) I would be more open to adopting an idea from a person with whom I feel a shared sense of collectivity, instead of from an algorithm that thinks it knows me. Intrinsically, I want such an algorithm to be wrong, for the sake of my own autonomy :)

The way I see it, th... (read more)



This post assumes basic familiarity with Sparse Autoencoders. For those unfamiliar with the technique, we highly recommend the introductory sections of these papers.

TL;DR

Neuronpedia is a platform for mechanistic interpretability research. It was previously focused on crowdsourcing explanations of neurons, but we’ve pivoted to accelerating Sparse Autoencoder (SAE) researchers by hosting models, feature dashboards, data visualizations, tooling, and more.

Important Links

Neuronpedia has received 1 year of funding from LTFF. Johnny Lin is full-time on engineering, design, and product, while Joseph Bloom is supporting with...

Neuronpedia has an API (copying from a message Johnny recently wrote to someone else):

"Docs are coming soon but it's really simple to get JSON output of any feature. just add "/api/feature/" right after "neuronpedia.org".for example, for this feature: https://neuronpedia.org/gpt2-small/0-res-jb/0
the JSON output of it is here: https://www.neuronpedia.org/api/feature/gpt2-small/0-res-jb/0
(both are GET requests so you can do it in your browser)note the additional "/api/feature/"i would prefer you not do this 100,000 times in a loop though - if you'd l... (read more)
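As a small illustration of the pattern in that message, here is a minimal Python sketch using the `requests` library. The URL is the example feature quoted above; the shape of the returned JSON isn't documented here, so the script just lists its fields.

```python
import requests

# Example feature from the message above: model "gpt2-small",
# SAE "0-res-jb", feature index 0. Adjust the path for other features.
url = "https://www.neuronpedia.org/api/feature/gpt2-small/0-res-jb/0"

resp = requests.get(url)       # a plain GET request; the same URL works in a browser
resp.raise_for_status()
feature = resp.json()          # JSON description of the feature

print(sorted(feature.keys()))  # inspect which fields are available
```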

Every few years, someone asks me what I would do to solve a Trolley Problem. Sometimes, they think I’ve never heard of it before—that I’ve never read anything about moral philosophy (e.g. Plato, Foot, Thomson, Graham)—and oh do they have a zinger for me. But for readers who are well familiar with these problems, I have some thoughts that may be new. For those that haven't done the reading, I'll provide links and some minor notes and avoid recapping in great detail.

Ask a dozen people to explain what the trolley problem is and you are likely to get a dozen variations. In this case my contact presented the scenario as a modified Bystander variation format—though far evolved away from Philippa Foot's original version in “The Problem of...

Although I somewhat agree with the comment about style, I feel that the point you're making could be received with some more enthusiasm. How well recognized is this trolley problem fallacy? The way I see it, the energy spent on thinking about the trolley problem in isolation illustrates innate human short-sightedness, and perhaps a clear limit of human intelligence as well. 'Correctly' solving one trolley problem does not prevent you or someone else from being confronted with the next. My line of argument is that the question of ethical decision making req... (read more)

mukashi
Just a general comment on style: I think this article would be much easier to parse if you included different headings, sections, etc. Normally when I approach writing here on LessWrong, I scroll slowly to the bottom, skim diagonally, and try to get a quick impression of the contents and whether the article is worth reading. My impression is that most people will ignore articles that are big, uninterrupted chunks of text like this one.
atomantic
Thanks for the feedback. I've added some headers to break it up.

Fooming Shoggoths Dance Concert

June 1st at LessOnline

After their debut album I Have Been A Good Bing, the Fooming Shoggoths are performing at the LessOnline festival. They'll be unveiling several previously unpublished tracks, such as "Nothing is Mere", feat. Richard Feynman.

Ticket prices rise by $100 on May 13th