LessWrong

Ω 04y

I think you should write it. It sounds funny and a bunch of people have been calling out what they see as bad arguements that alginment is hard lately e.g. TurnTrout, QuintinPope, ZackMDavis, and karma wise they did fairly well.

Rejecting Television

Declan Molony

14d

I didn’t use to be, but now I’m part of the 2% of U.S. households without a television. With its near ubiquity, why reject this technology?

The Beginning of my Disillusionment

Neil Postman’s book Amusing Ourselves to Death radically changed my perspective on television and its place in our culture. Here’s one illuminating passage:

We are no longer fascinated or perplexed by [TV’s] machinery. We do not tell stories of its wonders. We do not confine our TV sets to special rooms. We do not doubt the reality of what we see on TV [and] are largely unaware of the special angle of vision it affords. Even the question of how television affects us has receded into the background. The question itself may strike some of us as strange, as if one were

...

(Continue Reading – 1540 more words)

Declan Molony1h10

To make an analogy to diet, you essentially replaced a sugar fix from eating Snickers bars with eating strawberries. Gradation matters!

I had a similar slide with my technologies, as I explained in the post. I eventually landed on reading books. But even that became a form of intellectual procrastination as I wrote in my latest LW post.

Decaeneus's Shortform

Decaeneus

3mo

Decaeneus1h10

Pretending not to see when a rule you've set is being violated can be optimal policy in parenting sometimes (and I bet it generalizes).

Example: suppose you have a toddler and a "rule" that food only stays in the kitchen. The motivation is that each time food is brough into the living room there is a small chance of an accident resulting in a permanent stain. There's cost to enforcing the rule as the toddler will put up a fight. Suppose that one night you feel really tired and the cost feels particularly high. If you enforce the rule, it will be much more p... (read more)

Designing for a single purpose

Itay Dreyfus

This is a linkpost for https://productidentity.co/p/designing-for-a-single-purpose

If you’ve ever been to Amsterdam, you’ve probably visited, or at least heard about the famous cookie store that sells only one cookie. I mean, not a piece, but a single flavor.

I’m talking about Van Stapele Koekmakerij of course—where you can get one of the world's most delicious chocolate chip cookies. If not arriving at opening hour, it’s likely to find a long queue extending from the store’s doorstep through the street it resides. When I visited the city a few years ago, I watched the sensation myself: a nervous crowd awaited as the rumor of ‘out of stock’ cookies spreaded across the line.

Van Stapele Koekmakerij - Cookie Shop in Amsterdam — Owner Vera Van Stapele with fresh-baked cookies, via store website

The store, despite becoming a landmark for tourists, stands for an idea that seems to...

(Continue Reading – 2805 more words)

Mechanistically Eliciting Latent Behaviors in Language Models

151

Andrew Mack, TurnTrout

Ω 747d

Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout).

TL,DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The method discovers diverse and meaningful behaviors with just one prompt, including perturbations overriding safety training, eliciting backdoored behaviors and uncovering latent capabilities.

Summary In the simplest case, the unsupervised perturbations I learn are given by unsupervised steering vectors - vectors added to the residual stream as a bias term in the MLP outputs of a given layer. I also report preliminary results on unsupervised steering adapters - these are LoRA adapters of the MLP output weights of a given...

(Continue Reading – 13417 more words)

tailcalled2h20

I think it's easier to see the significance if you imagine the neural networks as a human-designed system. In e.g. a computer program, there's a clear distinction between the code that actually runs and the code that hypothetically could run if you intervened on the state, and in order to explain the output of the program, you only need to concern yourself with the former, rather than also needing to consider the latter.

For neural networks, I sort of assume there's a similar thing going on, except it's quite hard to define it precisely. In technical terms,... (read more)

metachirality's Shortform

metachirality

2mo

1metachirality11h

I wish I could bookmark comments/shortform posts.

2faul_sname10h

Yes, that would be cool. Next to the author name of a post orcomment, there's a post-date/time element that looks like "1h 🔗". That is a copyable/bookmarkable link.

metachirality2h10

Sure, I just prefer a native bookmarking function.

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)

Biorisk is an Unhelpful Analogy for AI Risk

Davidmanheim

There are two main areas of catastrophic or existential risk which have recently received significant attention; biorisk, from natural sources, biological accidents, and biological weapons, and artificial intelligence, from detrimental societal impacts of systems, incautious or intentional misuse of highly capable systems, and direct risks from agentic AGI/ASI. These have been compared extensively in research, and have even directly inspired policies. Comparisons are often useful, but in this case, I think the disanalogies are much more compelling than the analogies. Below, I lay these out piecewise, attempting to keep the pairs of paragraphs describing first biorisk, then AI risk, parallel to each other.

While I think the disanalogies are compelling, comparison can still be useful as an analytic tool - while keeping in mind that the ability to directly...

(See More – 707 more words)

Steven Byrnes2h20

This book argues (convincingly IMO) that it’s impossible to communicate, or even think, anything whatsoever, without the use of analogies.

If you say “AI runs on computer chips”, then the listener will parse those words by conjuring up their previous distilled experience of things-that-run-on-computer-chips, and that previous experience will be helpful in some ways and misleading in other ways.
If you say “AI is a system that…” then the listener will parse those words by conjuring up their previous distilled experience of so-called “systems”, and that previo

... (read more)

2Steven Byrnes3h

This is an 800-word blog post, not 5 words. There’s plenty of room for nuance. The way it stands right now, the post is supporting conversations like: Or: Is this what you want? I.e., are you on the side of Person B in both these cases?

2faul_sname9h

"Immunology" and "well-understood" are two phrases I am not used to seeing in close proximity to each other. I think with an "increasingly" in between it's technically true - the field has any model at all now, and that wasn't true in the past, and by that token the well-understoodness is increasing. But that sentence could also be iterpreted as saying that the field is well-understood now, and is becoming even better understood as time passes. And I think you'd probably struggle to find an immunologist who would describe their field as "well-understood". My experience has been that for most basic practical questions the answer is "it depends", and, upon closet examination, "it depends on some stuff that nobody currently knows". Now that was more than 10 years ago, so maybe the field has matured a lot since then. But concretely, I expect if you were to go up to an immunologist and say "I'm developing a novel peptide vaccine from the specifc abc surface protein of the specific xyz virus. Can you tell me whether this will trigger an autoimmune response due to cross-reactivity" the answer is going to be something more along the lines of "lol no, run in vitro tests followed by trials (you fool!)" and less along the lines of "sure, just plug it in to this off-the-shelf software".

2Davidmanheim5h

I agree that we do not have an exact model for anything in immunology, unlike physics, and there is a huge amount of uncertainty. But that's different than saying it's not well-understood; we have clear gold-standard methods for determining answers, even if they are very expensive. This stands in stark contrast to AI, where we don't have the ability verify that something works or is safe at all without deploying it, and even that isn't much of a check on its later potential for misuse. But aside from that, I think your position is agreeing with mine much more than you imply. My understanding is that we have newer predictive models which can give uncertain but fairly accurate answers to many narrow questions. (Older, non-ML methods also exist, but I'm less familiar with them.) In your hypothetical case, I expect that the right experts can absolutely give indicative answers about whether a novel vaccine peptide is likely or unlikely to have cross-reactivity with various immune targets, and the biggest problem is that it's socially unacceptable to assert confidence in anything short of tested and verified case. But the models can get, in the case of the Zhang et al paper above, 70% accurate answers, which can help narrow the problem for drug or vaccine discovery, then they do need to be followed with in vitro tests and trials.

Now THIS is forecasting: understanding Epoch’s Direct Approach

Elliot_Mckernon, Zershaaneh Qureshi

Happy May the 4th from Convergence Analysis! Cross-posted on the EA Forum.

As part of Convergence Analysis’s scenario research, we’ve been looking into how AI organisations, experts, and forecasters make predictions about the future of AI. In February 2023, the AI research institute Epoch published a report in which its authors use neural scaling laws to make quantitative predictions about when AI will reach human-level performance and become transformative. The report has a corresponding blog post, an interactive model, and a Python notebook.

We found this approach really interesting, but also hard to understand intuitively. While trying to follow how the authors derive a forecast from their assumptions, we wrote a breakdown that may be useful to others thinking about AI timelines and forecasting.

In what follows, we set out our interpretation of...

(Continue Reading – 5526 more words)

Zershaaneh Qureshi2h20

The point of the paragraph that the above quote was taken from is, I think, better summarised in its first sentence:

although Epoch takes an approach to forecasting TAI that is quite different to others in this space, its resulting probability distribution is not vastly dissimilar to those produced by other influential models

It is fair to question whether these two forecasts are “not vastly dissimilar” to one another. In some senses, two decades is a big difference between medians: for example, we suspect that a future where TAI arrives in the 2030s lo... (read more)

Explaining a Math Magic Trick

Robert_AIZI

Introduction

A recent popular tweet did a "math magic trick", and I want to explain why it works and use that as an excuse to talk about cool math (functional analysis). The tweet in question:

This is a cute magic trick, and like any good trick they nonchalantly gloss over the most important step. Did you spot it? Did you notice your confusion?

Here's the key question: Why did they switch from a differential equation to an integral equation? If you can use $(1 - x)^{- 1} = 1 + x + x^{2} + . . .$ when $x = \int$ , why not use it when $x = d / d x$ ?

Well, lets try it, writing $D$ for the derivative:

$\begin{matrix} f^{'} & = & f (1 - D) f & = & 0 f & = & (1 + D + D^{2} + . . .) 0 f & = & 0 + 0 + 0 + . . . f & = & 0 \end{matrix}$

So now you may be disappointed, but relieved: yes, this version fails, but at least it fails-safe, giving you the trivial solution, right?

But no, actually $(1 - D)^{- 1} = 1 + D + D^{2} + . . .$ can fail catastrophically, which we can see if we try a nonhomogeneous equation...

(Continue Reading – 1217 more words)

2Robert_AIZI12h

Ah sorry, I skipped over that derivation! Here's how we'd approach this from first principals: to solve f=Df, we know we want to use the (1-x)=1+x+x^2+... trick, but now know that we need x=I instead of x=D. So that's why we want to switch to an integral equation, and we get f=Df If=IDf = f-f(0) where the final equality is the fundamental theorem of calculus. Then we rearrange: f-If=f(0) (1-I)f=f(0) and solve from there using the (1-I)=1+I+I^2+... trick! What's nice about this is it shows exactly how the initial condition of the DE shows up.

1notfnofn12h

This is true, but I'm looking for an explicit, non-recursive formula that needs to handle the general case of the kth anti-derivative (instead of just the first). The solution involves doing something funny with formal power series, like in this post.

2DaemonicSigil11h

Heh, sure.

notfnofn3h10

Very nice! Notice that if you write $r = j - k,$ $I$ as $D^{- 1}$ , and play around with binomial coefficients a bit, we can rewrite this as:

D^{- k} (f p) = \infty \sum r = 0 (\frac{- k}{r}) (D^{- k - r} f) (D^{r} p)

which holds for $k < 0$ as well, in which case it becomes the derivative product rule. This also matches the formal power series expansion of $(x + y)^{- k}$ , which one can motivate directly

(By the way, how do you spoiler tag?)

LESSWRONG
LW

Quick Takes

Popular Comments

Recent Discussion

The Beginning of my Disillusionment

Introduction

Fooming Shoggoths Dance Concert

June 1st at LessOnline