Q. "Can you hold the door?" A. "Sure."

That's straightforward.

Q. "Can you play the violin at my wedding next year?" A. "Sure."

Colloquially, "Sure" implies not only that I'm willing and able to do this, but also that I already know how to play the violin. Sometimes what I want to say is that I'm willing to learn, but you should know that I can't currently play.

I can say that; it just takes more words.

AGI: Cryptography, Security & Multipolar Scenarios Workshop

May 14th, 415 Mission Street, San Francisco

Allison Duettmann

If you're working at the intersection of cryptography, security, and AI, consider joining this upcoming workshop:

Foresight's AGI: Cryptography, Security & Multipolar Scenarios Workshop
May 14-15, all-day
The Institute, Salesforce Tower, San Francisco

Goals
To help AI development benefit humanity, Foresight Institute has held various workshops over the past years and launched a Grants Program that funds work on AI security risks, cryptography tools for safe AI, and safe multipolar AI scenarios. Our 2024 workshop invites leading researchers, entrepreneurs, and funders in this growing space to explore new tools and architectures that help humans and AIs cooperate securely. In addition to short presentations, working groups, and project development, we offer mentorship hours, open breakouts, and speaker & sponsor gatherings.

Questions we’ll address include

Which challenges in AI alignment, AI security and AI coordination (in particular

...


Chipmonk37m10

I'll be there! Talk to me about boundaries and coordination/Goodness

Richard Ngo's Shortform

Richard_Ngo

Ω 34y

6Richard_Ngo1h

Hypothesis: there's a way of formalizing the notion of "empowerment" such that an AI with the goal of empowering humans would be corrigible. This is not straightforward, because an AI that simply maximized human POWER (as defined by Turner et al.) wouldn't ever let the humans spend that power. Intuitively, though, there's a sense in which a human who can never spend their power doesn't actually have any power. Is there a way of formalizing that intuition? The direction that seems most promising is in terms of counterfactuals (or, alternatively, Pearl's do-calculus). Define the power of a human with respect to a distribution of goals G as the average ability of a human to achieve their goal if they'd had a goal sampled from G (alternatively: under an intervention that changed their goal to one sampled from G). Then an AI with a policy of never letting humans spend their resources would result in humans having low power. Instead, a human-power-maximizing AI would need to balance between letting humans pursue their goals, and preventing humans from doing self-destructive actions. The exact balance would depend on G, but one could hope that it's not very sensitive to the precise definition of G (especially if the AI isn't actually maximizing human power, but is more like a quantilizer, or is optimizing under pessimistic assumptions). The problem here is that these counterfactuals aren't very clearly-defined. E.g. imagine the hypothetical world where humans valued paperclips instead of love. Even a little knowledge of evolution would tell you that this hypothetical is kinda crazy, and maybe the question "what would the AI be doing in this world?" has no sensible answer (or maybe the answer would be "it would realize it's in a weird hypothetical world and behave accordingly"). Similarly, if we model this using the do-operation, the best policy is something like "wait until the human's goals suddenly and inexplicably change, then optimize hard for their new goal". Havi
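A toy sketch of the counterfactual definition above, under loud assumptions: the AI picks one of three made-up policies, `G` is a small finite goal distribution, and `success_prob` is an invented table giving the chance that a human holding each goal achieves it under each policy. None of these names or numbers come from the post; the point is only that "average achievement over goals sampled from G" scores a resource-hoarding policy at zero.

```python
from typing import Dict

# Hypothetical goal distribution G: goal name -> probability of being sampled.
G: Dict[str, float] = {"paint": 0.5, "travel": 0.3, "build": 0.2}

# Hypothetical table: under each AI policy, the probability that a human who
# holds a given goal actually achieves it. All numbers are invented.
success_prob: Dict[str, Dict[str, float]] = {
    "hoard_resources": {"paint": 0.0, "travel": 0.0, "build": 0.0},
    "hands_off": {"paint": 0.6, "travel": 0.5, "build": 0.25},
    "assist_with_guardrails": {"paint": 0.9, "travel": 0.8, "build": 0.5},
}


def power(policy: str, goals: Dict[str, float]) -> float:
    """Average goal achievement if the human's goal were sampled from `goals`."""
    return sum(p * success_prob[policy][goal] for goal, p in goals.items())


def best_policy(goals: Dict[str, float]) -> str:
    """Policy that maximizes human power with respect to `goals`."""
    return max(success_prob, key=lambda policy: power(policy, goals))


for name in success_prob:
    print(name, round(power(name, G), 3))
print("best:", best_policy(G))
```

The sketch sidesteps the hard part raised in the quick take: it assumes the table of counterfactual success probabilities is simply given, which is exactly what is not clearly defined.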

Garrett Baker1h40

There's also the problem of: what do you mean by "the human"? If you make an empowerment calculus that works for humans who are atomic & ideal agents, it probably breaks once you get a superintelligence who can likely mind-hack you into yourself valuing only power. It never forces you to abstain from giving up power, since you're perfectly capable of making different decisions; you just don't.

Another problem, which I like to think of as the "control panel of the universe" problem, is where the AI gives you the "control panel of the universe", bu... (read more)

4Richard_Ngo1h

You can think of this as a way of getting around the problem of fully updated deference, because the AI is choosing a policy based on what that policy would have done in the full range of hypothetical situations, and so it never updates away from considering any given goal. The cost, of course, is that we don't know how to actually pin down these hypotheticals.

TurnTrout's shortform feed

TurnTrout

Ω 105y

1samshap7h

Instead of demanding orthogonal representations, just have them obey the restricted isometry property. Basically, instead of requiring $\forall i \neq j: \langle x_i, x_j \rangle = 0$, we just require $\forall i \neq j: x_i \cdot x_j \le \epsilon$. This would allow a polynomial number of sparse shards while still allowing full recovery.
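A minimal sketch of the relaxed condition, assuming random unit vectors as stand-ins for shard representations (the dimension, the vector count, and the helper names are illustrative, not from the comment). It checks only the pairwise-overlap bound, using the absolute value of the dot product as a slightly stricter version of the inequality above; it does not test the restricted isometry property proper or sparse recovery.

```python
import math
import random
from typing import List


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Sample a direction uniformly at random by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_abs_overlap(vectors: List[List[float]]) -> float:
    """Largest |x_i . x_j| over all pairs i != j."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst


rng = random.Random(0)
dim = 64
# Four times more vectors than dimensions, so exact orthogonality is impossible.
vectors = [random_unit_vector(dim, rng) for _ in range(4 * dim)]
overlap = max_abs_overlap(vectors)
print(f"{len(vectors)} vectors in {dim} dims, max |dot| = {overlap:.3f}")
```

Exact orthogonality caps the number of vectors at `dim`; allowing a small overlap ϵ lifts that cap, which is the trade the comment proposes.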

7cubefox15h

For illustration, what would be an example of having different shards for "I get food" (F) and "I see my parents again" (P) compared to having one utility distribution over F∧P, F∧¬P, ¬F∧P, ¬F∧¬P?

Daniel Kokotajlo1h73

I think this is also what I was confused about -- TurnTrout says that AIXI is not a shard-theoretic agent because it just has one utility function, but typically we imagine that the utility function itself decomposes into parts e.g. +10 utility for ice cream, +5 for cookies, etc. So the difference must not be about the decomposition into parts, but the possibility of independent activation? But what does that mean? Perhaps it means: The shards aren't always applied, but rather only in some circumstances does the circuitry fire at all, and there are circums... (read more)

ACX Covid Origins Post convinced readers

ErnestScribbler

ACX recently posted about the Rootclaim Covid origins debate, coming out in favor of zoonosis. Did the post change the minds of those who read it, or not? Did it change their judgment in favor of zoonosis (as was probably the goal of the post), or conversely did it make them think Lab Leak was more likely (as the "Don't debate conspiracy theorists" theory claims)?

I analyzed the ACX survey to find out, by comparing responses before and after the post came out. The ACX survey asked readers whether they think the origin of Covid is more likely natural or Lab Leak. The ACX survey went out March 26th and was open until about April 10th. The Covid origins post came out March 28th, and the highlights on April...


DusanDNesic1h10

Is a lot of the effect not "people who read ACX trust Scott Alexander"? Like, the survey selects for most "passionate" readers, those willing to donate their free time to Scott for research with ~nothing in return. Him publicly stating on his platform "I am now much less certain of X" is likely to make that group of people be less certain of X?

8Brendan Long5h

I'm not sure if this really applies here, since Lab Leak was never really treated as a crazy/fringe idea among rationalists. In fact, it looks like it was the majority opinion before the debate and ACX posts.

1ErnestScribbler2h

This is a very good point.

Transcoders enable fine-grained interpretable circuit analysis for language models

Jacob Dunefsky, Philippe Chlenski, Neel Nanda

Ω 261d

Summary

We present a method for performing circuit analysis on language models using "transcoders," an occasionally-discussed variant of SAEs that provides an interpretable approximation to MLP sublayers' computations. Transcoders are exciting because they allow us not only to interpret the output of MLP sublayers but also to decompose the MLPs themselves into interpretable computations. In contrast, SAEs only allow us to interpret the output of MLP sublayers and not how they were computed.
We demonstrate that transcoders achieve similar performance to SAEs (when measured via fidelity/sparsity metrics) and that the features learned by transcoders are interpretable.
One of the strong points of transcoders is that they decompose the function of an MLP layer into sparse, independently-varying, and meaningful units (like neurons were originally intended to be before superposition was discovered).
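To make the contrast with SAEs concrete, here is a rough sketch of what a transcoder's forward pass and training loss might look like. It assumes the usual SAE-style shape (linear encoder, ReLU, linear decoder, squared-error fidelity term plus an L1 sparsity penalty); the summary above does not spell out the architecture, so every name, shape, and number below is an illustrative assumption. The one property taken from the text is that the transcoder reads the MLP's input and is scored against the MLP's output, rather than reconstructing its own input.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def transcoder_forward(
    mlp_in: Vector, w_enc: Matrix, b_enc: Vector, w_dec: Matrix, b_dec: Vector
) -> Tuple[Vector, Vector]:
    """Map the MLP's input to sparse features, then to a predicted MLP output."""
    pre = [a + b for a, b in zip(matvec(w_enc, mlp_in), b_enc)]
    features = [max(0.0, x) for x in pre]  # ReLU keeps features non-negative
    mlp_out_hat = [a + b for a, b in zip(matvec(w_dec, features), b_dec)]
    return features, mlp_out_hat


def transcoder_loss(
    mlp_out: Vector, mlp_out_hat: Vector, features: Vector, l1_coeff: float
) -> float:
    """Fidelity to the real MLP output plus an L1 sparsity penalty (assumed form)."""
    fidelity = sum((a - b) ** 2 for a, b in zip(mlp_out, mlp_out_hat))
    sparsity = sum(abs(f) for f in features)
    return fidelity + l1_coeff * sparsity


# Tiny hand-made example: 2-d MLP input, 3 features, 2-d MLP output.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
b_dec = [0.5, 0.0]

features, mlp_out_hat = transcoder_forward([1.0, -2.0], w_enc, b_enc, w_dec, b_dec)
loss = transcoder_loss([1.5, 1.0], mlp_out_hat, features, l1_coeff=0.1)
print(features, mlp_out_hat, loss)
```

An SAE under the same assumptions would use the same code with `mlp_out` as both the input and the target, which is why it can describe the output but not the computation that produced it.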

...


Johnny Lin1h62

Hey Jacob + Philippe,

Hope you all don't mind but we put up layer 8 of your transcoders onto Neuronpedia, with ~22k dashboards here:

https://neuronpedia.org/gpt2-small/8-tres-dc

Each dashboard can be accessed at its own URL:

https://neuronpedia.org/gpt2-small/8-tres-dc/0 goes to feature index 0.

You can also test each feature with custom text:

Or search all features at: https://www.neuronpedia.org/gpt2-small/tres-dc

An example search: https://www.neuronpedia.org/gpt2-small/?sourceSet=tres-dc&selectedLayers=[]&sortIndexes=[]&q=the%20cat%20sat%20on%20... (read more)

3Philippe Chlenski7h

This sounds like it could work. I can think of a few reasons why this approach could be challenging, however:

1. We don't really know how transcoders (or SAEs, to the best of my knowledge) behave when they're being trained to imitate a model component that's still updating.
2. Substituting multiple transcoders at once is possible, but degrades model performance a lot compared to single-transcoder substitutions. Substituting one transcoder at a time would require restarting the forward pass at each layer.
3. If the transcoders are used to predict next tokens, they may lose interpretability and return to superposition.

Under a "transcoder-aware" training regime, these would be the first things I would check for. Also, you may be interested in Jacob's comment here for some details on when we tried to co-train SAEs and transcoders to have sparse connections to one another. This is a very different question, of course, but it provides some preliminary evidence that the fidelity-interpretability tradeoff persists across more elaborate training settings.

2Vladimir_Nesov5h

Possibly. But there is no optimization pressure from pre-training on the relationship between MLPs and transcoders. The MLPs are the thing that pre-training optimizes (as the "full-precision" master model), while transcoders only need to be maintained to remain in sync with the MLPs, whatever they are (according to the same local objective as before, which doesn't care at all about token prediction). The search is for MLPs such that their transcoders are good predictors, not directly for transcoders that are good predictors. Unclear given the extreme quantization results, where similarly post-training replacement would degrade model performance a lot, yet quantization-aware pre-training somehow doesn't. This seems to be the main technical hurdle to do the experiment, updating transcoders both efficiently and correctly, as underlying MLPs gradually change. (I'm guessing some discontinuous jumps in choice of transcoders might be OK.)

3Jacob Dunefsky11h

I'm pretty sure that there's at least one other MATS group (unrelated to us) currently working on this, although I'm not certain about any of the details. Hopefully they release their research soon! I did try something similar at one point, but it didn't quite work out. In particular: given an SAE for MLP-out activations, you can try and train an MLP transcoder with an additional loss term penalizing the L1 norm of the pullback of the SAE encoder features by the transcoder decoder matrix. This was intended to induce sparse input-independent connections from the transcoder features to the MLP-out SAE features. Unfortunately, this didn't yield great results. The transcoder features were often polysemantic, while the input-independent connections from the transcoder features to the SAE features were somewhat bizarre-looking. Here's an old graph I just dug up: the x-axis is transcoder feature index and the y-axis is the input-independent connection strength to a certain SAE feature: In the end, I decided to pause working on this idea. Potentially, it could turn out that this idea is workable, but if so, then there are probably a few extra tweaks that have to be done to get it working beyond the naive approach that I tried.


metachirality's Shortform

metachirality

1mo

3metachirality1h

I think there should be a way to find the highest rated shortform posts.

habryka1h20

You can! Just go to the all-posts page, sort by year, and the highest-rated shortform posts for each year will be in the Quick Takes section:

2024:

2023:

2022:

KAN: Kolmogorov-Arnold Networks

Gunnar_Zarncke

This is a linkpost for https://arxiv.org/abs/2404.19756

Abstract:

Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability. For accuracy, much smaller KANs can achieve comparable or better accuracy than much larger MLPs in data fitting and PDE solving. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful collaborators helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives for MLPs, opening opportunities for further improving today's deep learning models which rely heavily on MLPs.
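A minimal sketch of the idea in the abstract: each edge carries its own learnable univariate function, and nodes only sum their incoming edges. As a simplifying assumption, piecewise-linear interpolation over fixed knots stands in for the spline; the paper's actual parametrization may differ, and the class and function names here are hypothetical.

```python
from bisect import bisect_right
from typing import List


class EdgeFunction:
    """Univariate function on one edge: linear interpolation over fixed knots.

    `values` (the function's height at each knot) plays the role of the
    learnable parameters; there is no separate scalar weight.
    """

    def __init__(self, knots: List[float], values: List[float]) -> None:
        if len(knots) != len(values) or len(knots) < 2:
            raise ValueError("need at least two knots, one value per knot")
        self.knots = knots
        self.values = values

    def __call__(self, x: float) -> float:
        # Outside the knot range, hold the boundary value constant.
        if x <= self.knots[0]:
            return self.values[0]
        if x >= self.knots[-1]:
            return self.values[-1]
        i = bisect_right(self.knots, x) - 1
        x0, x1 = self.knots[i], self.knots[i + 1]
        t = (x - x0) / (x1 - x0)
        return (1.0 - t) * self.values[i] + t * self.values[i + 1]


def kan_layer(edges: List[List[EdgeFunction]], inputs: List[float]) -> List[float]:
    """edges[j][i] maps input i to its contribution to output j; nodes just sum."""
    return [sum(phi(x) for phi, x in zip(row, inputs)) for row in edges]


knots = [-1.0, 0.0, 1.0]
abs_like = EdgeFunction(knots, [1.0, 0.0, 1.0])   # behaves like |x| on [-1, 1]
identity = EdgeFunction(knots, [-1.0, 0.0, 1.0])  # behaves like x on [-1, 1]

layer = [[abs_like, identity]]  # two inputs, one output
print(kan_layer(layer, [0.5, -0.5]))
```

In an MLP the same layer would compute a weighted sum and then apply one fixed nonlinearity at the node; here the nonlinearity lives on each edge and is the thing being trained.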

4Nathan Helm-Burger4h

Wow, this is super fascinating. A juicy tidbit: "

6Mateusz Bagiński3h

I mostly stopped hearing about catastrophic forgetting when Really Large Language Models became The Thing, so I figured that it's solvable by scale (likely conditional on some aspects of the training setup, idk, self-supervised predictive loss function?). Anthropic's work on Sleeper Agents seems like a very strong piece of evidence that it is the case. Still, if they're right that KANs don't have this problem at much smaller sizes than MLP-based NNs, that's very interesting. Nevertheless, I think talking about catastrophic forgetting as a "serious problem in modern ML" seems significantly misleading

Nathan Helm-Burger1h20

I'm not so sure. You might be right, but I suspect that catastrophic forgetting may still be playing an important role in limiting the peak capabilities of an LLM of given size. Would it be possible to continue Llama3 8B's training much much longer and have it eventually outcompete Llama3 405B stopped at its normal training endpoint?

I think probably not? And I suspect that, if not, part (but not all) of the reason would be catastrophic forgetting. Another part would be the limited expressivity of smaller models, which KANs also seem to help with.

dkornai's Shortform

dkornai

1dkornai1h

Pain is the consequence of a perceived reduction in the probability that an agent will achieve its goals. In biological organisms, physical pain [say, in response to a limb being removed] is an evolutionary consequence of the fact that organisms with the capacity to feel physical pain avoided situations that harmed long-term goals depending on the subsystem generating the pain [e.g. locomotion to a favourable position with the limb]. This definition applies equally to mental pain [say, the pain felt when being expelled from a group of allies], which impedes long-term goals. This suggests that any system that possesses both a set of goals and the capacity to understand how events influence its probability of achieving such goals should possess a capacity to feel pain. This also suggests that the amount of pain is proportional to the degree of "setbacks" and the degree to which "setbacks" are perceived. I think this is a relatively robust argument for the inherent reality of pain not just in a broad spectrum of biological organisms, but also in synthetic [including sufficiently advanced AI] agents. We should strive to reduce the pain we cause in the agents we interact with.

Alexander Gietelink Oldenziel1h31

It also suggests that there might be some sort of conservation law for pain for agents.

Conservation of Pain, if you will.


Summary

LessOnline

A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA