4keltan9h

A potentially good way to keep low-level criminals from scamming your family and friends with a clone of your voice is to set a password that you must each exchange. An extra layer of security might be to make the password offensive, an info hazard, or politically sensitive. That way, criminals with little technical expertise will have a hard time getting past corporate language filters. Good luck getting the voice model to parrot a basic meth recipe!

Dagon12m20

Hmm. I don't doubt that targeted voice-mimicking scams exist (or will soon). I don't think memorable, reused passwords are likely to work well enough to foil them. Between forgetting (on the sender or receiver end), claimed ignorance ("Mom, I'm in jail and really need money, and I'm freaking out! No, I don't remember what we said the password would be"), and general social hurdles ("that's a weird thing to want"), I don't think it'll catch on.

Instead, I'd look to context-dependent auth (looking for more confidence when the ask... (read more)

"Why I Write" by George Orwell (1946)

Arjun Panickssery

This is a linkpost for https://www.orwellfoundation.com/the-orwell-foundation/orwell/essays-and-other-works/why-i-write/

People have been posting great essays so that they're "fed through the standard LessWrong algorithm." This essay is in the public domain in the UK but not the US.

From a very early age, perhaps the age of five or six, I knew that when I grew up I should be a writer. Between the ages of about seventeen and twenty-four I tried to abandon this idea, but I did so with the consciousness that I was outraging my true nature and that sooner or later I should have to settle down and write books.

I was the middle child of three, but there was a gap of five years on either side, and I barely saw my father before I was eight. For this and other reasons I...


cousin_it13m40

Orwell is one of my personal heroes, 1984 was a transformative book to me, and I strongly recommend Homage to Catalonia as well.

That said, I'm not sure making theories of art is worth it. Even when great artists do it (Tolkien had a theory of art, and Flannery O'Connor, and almost every artist if you look close enough), it always seems to be the kind of theory which suits that artist and nobody else. Would advice like "good prose is like a windowpane" or "efface your own personality" improve the writing of, say, Hunter S. Thompson? Heck no, his writing is ... (read more)

ProLU: A Nonlinearity for Sparse Autoencoders

Glen Taggart

Ω 152d

Abstract

This paper presents $ProLU$ , an alternative to $ReLU$ for the activation function in sparse autoencoders that produces a pareto improvement over both standard sparse autoencoders trained with an L1 penalty and sparse autoencoders trained with a Sqrt(L1) penalty.

$ProLU(m_{i}, b_{i}) = \begin{cases} m_{i} & \text{if } m_{i} + b_{i} > 0 \text{ and } m_{i} > 0 \\ 0 & \text{otherwise} \end{cases}$

The gradient wrt. $b$ is zero, so we generate two candidate classes of $ProLU$ differentiable wrt. $b$ :

$ProLU_{ReLU}$
- $\frac{\partial^{*} ProLU_{ReLU}(m_{i}, b_{i})}{\partial b_{i}} = \frac{\partial ProLU_{ReLU}(m_{i}, b_{i})}{\partial m_{i}} = \begin{cases} 1 & \text{if } m_{i} + b_{i} > 0 \text{ and } m_{i} > 0 \\ 0 & \text{otherwise} \end{cases}$
$ProLU_{STE}$
- $\frac{\partial^{*} ProLU_{STE}(m_{i}, b_{i})}{\partial m_{i}} = \begin{cases} 1 + m_{i} & \text{if } m_{i} > 0 \text{ and } m_{i} + b_{i} > 0 \\ 0 & \text{otherwise} \end{cases}$
- $\frac{\partial^{*} ProLU_{STE}(m_{i}, b_{i})}{\partial b_{i}} = \begin{cases} m_{i} & \text{if } m_{i} > 0 \text{ and } m_{i} + b_{i} > 0 \\ 0 & \text{otherwise} \end{cases}$
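
The piecewise definitions above can be sketched on scalars in plain Python. This is not the post's PyTorch implementation; the function names `prolu`, `prolu_relu_grads`, and `prolu_ste_grads` are made up here for illustration, and the code only transcribes the forward rule and the two pseudo-derivative rules as stated in the abstract.

```python
from typing import Tuple


def prolu(m: float, b: float) -> float:
    """Forward pass: let m through only when m + b > 0 and m > 0."""
    return m if (m + b > 0 and m > 0) else 0.0


def prolu_relu_grads(m: float, b: float) -> Tuple[float, float]:
    """ReLU-style variant: pseudo-derivatives (d/dm, d/db).

    Both are 1 when the unit is active and 0 otherwise.
    """
    active = 1.0 if (m + b > 0 and m > 0) else 0.0
    return active, active


def prolu_ste_grads(m: float, b: float) -> Tuple[float, float]:
    """STE-style variant: pseudo-derivatives (d/dm, d/db).

    When active, d/dm is 1 + m and d/db is m; both are 0 otherwise.
    """
    if m + b > 0 and m > 0:
        return 1.0 + m, m
    return 0.0, 0.0


if __name__ == "__main__":
    for m, b in [(2.0, -0.5), (2.0, -3.0), (-1.0, 5.0)]:
        print(m, b, prolu(m, b), prolu_relu_grads(m, b), prolu_ste_grads(m, b))
```

Note that the bias never shifts the output value: it only decides whether $m_{i}$ passes through, which is why the true gradient with respect to $b$ is zero and a substitute has to be chosen.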

PyTorch Implementation

Introduction

SAE Context and Terminology

$SAE(x) = ReLU((x - b_{dec}) W_{enc} + b_{enc}) W_{dec} + b_{dec}$
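
As a rough illustration of the formula above, here is a minimal pure-Python forward pass. The helper names (`vec_mat`, `relu`, `sae_forward`) and the tiny example weights are assumptions chosen for readability, not anything from the post; a real SAE would use a tensor library and learned parameters.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def vec_mat(v: Vector, W: Matrix) -> Vector:
    """Row vector times matrix: result[j] = sum_i v[i] * W[i][j]."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]


def relu(v: Vector) -> Vector:
    """Elementwise max(0, x)."""
    return [max(0.0, x) for x in v]


def sae_forward(x: Vector, W_enc: Matrix, b_enc: Vector,
                W_dec: Matrix, b_dec: Vector) -> Vector:
    """SAE(x) = ReLU((x - b_dec) W_enc + b_enc) W_dec + b_dec."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_acts = [p + b for p, b in zip(vec_mat(centered, W_enc), b_enc)]
    acts = relu(pre_acts)
    return [r + b for r, b in zip(vec_mat(acts, W_dec), b_dec)]


if __name__ == "__main__":
    # Hypothetical toy parameters: 2 input dimensions, 3 dictionary features.
    W_enc = [[1.0, -1.0, 0.0], [0.0, 0.0, 1.0]]
    b_enc = [0.0, 0.0, 0.0]
    W_dec = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
    b_dec = [0.5, 0.5]
    print(sae_forward([1.5, 0.5], W_enc, b_enc, W_dec, b_dec))
```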

Learnable parameters of a...


1wuthejeff5h

This is great! We were working on very similar things concurrently at OpenAI but ended up going a slightly different route. A few questions:

- What does the distribution of learned biases look like?
- For the STE variant, did you find it better to use the STE approximation for the activation gradient, even though the approximation is only needed for the bias?

Glen Taggart15m10

Thank you!

That's super cool you've been doing something similar. I'm curious to see what direction you went in. It seemed like there's a large space of possible things to do along these lines. DeepMind also did a similar but different thing here.

What does the distribution of learned biases look like?

That's a great question, something I didn't note in here is that positive biases have no effect on the output of the SAE -- so, if the biases were to be mostly positive that would suggest this approach is missing something. I saved histograms of the biases duri... (read more)

Examples of Highly Counterfactual Discoveries?

136

johnswentworth, kromem

The history of science has tons of examples of the same thing being discovered multiple times independently; Wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity's trajectory, then it makes sense to focus on such examples.

But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyways, then the discovery probably wasn't very counterfactually impactful.

Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement wikipedia's list of multiple discoveries.

To...


Lucius Bushnaq15m20

It's measuring the volume of points in parameter space with loss $< ϵ$ when $ϵ$ is infinitesimal.

This is slightly tricky because it doesn't restrict itself to bounded parameter spaces, but you can fix it with a technicality by considering how the volume scales with $ϵ$ instead.

In real networks trained with finite amounts of data, you care about the case where $ϵ$ is small but finite, so this is ultimately inferior to just measuring how many configurations of floating point numbers get loss $< ϵ$ , if you ... (read more)
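
To make "how the volume scales with $ϵ$" concrete, here is a small numerical sketch. Everything specific in it is an assumption for illustration: the two toy losses, the bounded box $[-1, 1]^2$, the grid resolution, and the function names. It counts grid points with loss below $ϵ$ and estimates the exponent $k$ in volume $\propto ϵ^{k}$ from two values of $ϵ$.

```python
import math
from typing import Callable

Loss = Callable[[float, float], float]


def volume_below(loss: Loss, eps: float, n: int = 400) -> float:
    """Approximate area of {w in [-1, 1]^2 : loss(w) < eps} on a midpoint grid."""
    h = 2.0 / n
    count = 0
    for i in range(n):
        w1 = -1.0 + (i + 0.5) * h
        for j in range(n):
            w2 = -1.0 + (j + 0.5) * h
            if loss(w1, w2) < eps:
                count += 1
    return count * h * h


def scaling_exponent(loss: Loss, eps_small: float = 0.01,
                     eps_large: float = 0.04) -> float:
    """Estimate k in volume ~ eps^k from the volumes at two finite eps values."""
    v_small = volume_below(loss, eps_small)
    v_large = volume_below(loss, eps_large)
    return math.log(v_large / v_small) / math.log(eps_large / eps_small)


def regular_loss(w1: float, w2: float) -> float:
    """Toy loss with a single isolated minimum at the origin."""
    return w1 * w1 + w2 * w2


def degenerate_loss(w1: float, w2: float) -> float:
    """Toy loss whose zero set is both coordinate axes."""
    return (w1 * w2) ** 2


if __name__ == "__main__":
    print("regular:   ", scaling_exponent(regular_loss))
    print("degenerate:", scaling_exponent(degenerate_loss))
```

The isolated minimum gives an exponent close to 1 (the low-loss region is a disc of area $πϵ$), while the degenerate loss gives a noticeably smaller exponent, meaning its low-loss volume shrinks more slowly as $ϵ$ decreases. Because $ϵ$ is finite here, these are estimates rather than limiting values.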

2tailcalled2h

Link(s) to your favorite proof(s)? Also, do these match up with empirical results? I have a cached belief that the Laplace approximation is also disproven by ensemble studies, so I don't really need SLT to inoculate me against that. I'd mainly be interested if SLT shows something beyond that. As I read the empirical formulas in this paper, they're roughly saying that a network has a high empirical learning coefficient if an ensemble of models that are slightly less trained on average have a worse loss than the network. But then so they don't have to retrain the models from scratch, they basically take a trained model, and wiggle it around using Gaussian noise while retraining it. This seems like a reasonable way to estimate how locally flat the loss landscape is. I guess there's a question of how much the devil is in the details; like whether you need SLT to derive an exact formula that works. ---------------------------------------- I guess I'm still not super sold on it, but on reflection that's probably partly because I don't have any immediate need for computing basin broadness. Like I find the basin broadness theory nice to have as a model, but now that I know about it, I'm not sure why I'd want/need to study it further. There was a period where I spent a lot of time thinking about basin broadness. I guess I eventually abandoned it because I realized the basin was built out of a bunch of sigmoid functions layered on top of each other, but the generalization was really driven by the neural tangent kernel, which in turn is mostly driven by the Jacobian of the network outputs for the dataset as a function of the weights, which in turn is mostly driven by the network activations. I guess it's plausible that SLT has the best quantities if you stay within the basin broadness paradigm. 🤔

2Algon2h

Second most? What's the first? Linearization of a Newtonian V(r) about the earth's surface?

4tailcalled2h

Yes.

Bing Chat is blatantly, aggressively misaligned

396

evhub

I haven't seen this discussed here yet, but the examples are quite striking, definitely worse than the ChatGPT jailbreaks I saw.

My main takeaway has been that I'm honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT. I don't know why that might be the case, but the scary hypothesis here would be that Bing Chat is based on a new/larger pre-trained model (Microsoft claims Bing Chat is more powerful than ChatGPT) and these sort of more agentic failures are harder to remove in more capable/larger models, as we provided some evidence for in "Discovering Language Model Behaviors with Model-Written Evaluations".

Examples below (with new ones added as I find them)....


Evan R. Murphy22m20

Thanks, I think you're referring to:

It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.

There were some ideas proposed in the paper "Conditioning Predictive Models: Risks and Strategies" by Hubinger et al. (2023). But since it was published over a year ago, I'm not sure if anyone has gotten far on investigating those strategies to see which ones could actually work. (I'm not seeing anything like that in the paper's citations.)

eggsyntax's Shortform

eggsyntax

3mo

1eggsyntax7h

Great point. I agree that there are lots of possible futures where that happens. I'm imagining a couple of possible cases where this would matter: 1. Humanity decides to stop AI capabilities development or slow it way down, so we have sub-ASI systems for a long time (which could be at various levels of intelligence, from current to ~human). I'm not too optimistic about this happening, but there's certainly been a lot of increasing AI governance momentum in the last year. 2. Alignment is sufficiently solved that even > AGI systems are under our control. On many alignment approaches, this wouldn't necessarily mean that those systems' preferences were taken into account. I agree entirely. I'm imagining (though I could sure be wrong!) that any future systems which were sentient would be ones that had something more like a coherent, persistent identity, and were trying to achieve goals. (not very important to the discussion, feel free to ignore, but) I would quibble with this. In my view LLMs aren't well-modeled as having goals or drives. Instead, generating distributions over tokens is just something they do in a fairly straightforward way because of how they've been shaped (in fact the only thing they do or can do), and producing reasonable text is an artifact of how we choose to use them (ie picking a likely output, adding it onto the context, and running it again). Simulacra like the assistant character can be reasonably viewed (to a limited degree) as being goal-ish, but I think the network itself can't. That may be overly pedantic, and I don't feel like I'm articulating it very well, but the distinction seems useful to me since some other types of AI are well-modeled as having goals or drives.

1Ann6h

For the first point, there's also the question of whether 'slightly superhuman' intelligences would actually fit any of our intuitions about ASI or not. There's a bit of an assumption in that we jump headfirst into recursive self-improvement at some point, but if that has diminishing returns, we happen to hit a plateau a bit over human, and it still has notable costs to train, host and run, the impact could still be limited to something not much unlike giving a random set of especially intelligent expert humans the specific powers of the AI system. Additionally, if we happen to set regulations on computation somewhere that allows training of slightly superhuman AIs and not past it ... Those are definitely systems that are easier to negotiate with, or even consider as agents in a negotiation. There's also a desire specifically not to build them, which might lead to systems with an architecture that isn't like that, but still implementing sentience in some manner. And the potential complication of multiple parts and specific applications a tool-oriented system is likely to be in - it'd be very odd if we decided the language processing center of our own brain was independently sentient/sapient separate from the rest of it, and we should resent its exploitation. I do think the drive/just a thing it does we're pointing at with 'what the model just does' is distinct from goals as they're traditionally imagined, and indeed I was picturing something more instinctual and automatic than deliberate. In a general sense, though, there is an objective that's being optimized for (predicting the data, whatever that is, generally without losing too much predictive power on other data the trainer doesn't want to lose prediction on).

1eggsyntax39m

Yeah. I think a sentient being built on a purely more capable GPT with no other changes would absolutely have to include scaffolding for eg long-term memory, and then as you say it's difficult to draw boundaries of identity. Although my guess is that over time, more of that scaffolding will be brought into the main system, eg just allowing weight updates at inference time would on its own (potentially) give these systems long-term memory and something much more similar to a persistent identity than current systems. My quibble is that the trainers are optimizing for an objective, at training time, but the model isn't optimizing for anything, at training or inference time. I feel we're very lucky that this is the path that has worked best so far, because a comparably intelligent model that was optimizing for goals at runtime would be much more likely to be dangerous.

eggsyntax34m10

the model isn't optimizing for anything, at training or inference time.

One maybe-useful way to point at that is: the model won't try to steer toward outcomes that would let it be more successful at predicting text.


The first future and the best future

KatjaGrace

14h

It seems to me worth trying to slow down AI development to steer successfully around the shoals of extinction and out to utopia.

But I was thinking lately: even if I didn’t think there was any chance of extinction risk, it might still be worth prioritizing a lot of care over moving at maximal speed. Because there are many different possible AI futures, and I think there’s a good chance that the initial direction affects the long term path, and different long term paths go to different places. The systems we build now will shape the next systems, and so forth. If the first human-level-ish AI is brain emulations, I expect a quite different sequence of events to if it is GPT-ish.

People genuinely pushing for AI speed over care (rather than just feeling impotent) apparently think there is negligible risk of bad outcomes, but also they are asking to take the first future to which there is a path. Yet possible futures are a large space, and arguably we are in a rare plateau where we could climb very different hills, and get to much better futures.

Bill Benzon1h10

YES.

At the moment the A.I. world is dominated by an almost magical belief in large language models. Yes, they are marvelous, a very powerful technology. By all means, let's understand and develop them. But they aren't the way, the truth and the light. They're just a very powerful and important technology. Heavy investment in them has an opportunity cost: less money to invest in other architectures and ideas.

And I'm not just talking about software, chips, and infrastructure. I'm talking about education and training. It's not good to have a whol... (read more)

2Matthew Barnett1h

Do you think it's worth slowing down other technologies to ensure that we push for care in how we use them over the benefit of speed? It's true that the stakes are lower for other technologies, but that mostly just means that both the upside potential and the downside risks are lower compared to AI, which doesn't by itself imply that we should go quickly.

1Jonas Hallgren2h

Disclaimer: I don't necessarily support this view, I thought about it for like 5 minutes but I thought it made sense. If we were to do the same thing as in other cases of slowing down through regulation, then that might make sense, but I'm uncertain that you can take the outside view here? Yes, we can do the same as for other technologies by leaving it down to the standard government procedures to make legislation, and then I might agree with you that slowing down might not lead to better outcomes. Yet, we don't have to do this. We can use other processes that might lead to a lot better decisions. Like what about proper value sampling techniques like digital liquid democracy? I think we can do a lot better than we have in the past by thinking about what mechanism we want to use. Also, for a potential example, I thought of cloning technology in like the last 5 min. If we just went full-speed with that tech then things would probably have turned out badly?

3No77e8h

From a purely utilitarian standpoint, I'm inclined to think that the cost of delaying is dwarfed by the number of future lives saved by getting a better outcome, assuming that delaying does increase the chance of a better future. That said, after we know there's "no chance" of extinction risk, I don't think delaying would likely yield better future outcomes. On the contrary, I suspect getting the coordination necessary to delay means it's likely that we're giving up freedoms in a way that may reduce the value of the median future and increase the chance of stuff like totalitarian lock-in, which decreases the value of the average future overall. I think you're correct that there's also the "other existential risks exist" consideration to balance in the calculation, although I don't expect it to be clear-cut.

Eric Neyman's Shortform

Eric Neyman

15h

Quinn1h10

I eventually decided that human chauvinism approximately works most of the time because good successor criteria are very brittle. I'd prefer to avoid lock-in to my or anyone's values at t=2024, but such a lock-in might be "good enough" if I'm threatened with what I think are the counterfactual alternatives. If I did not think good successor criteria were very brittle, I'd accept something adjacent to E/Acc that focuses on designing minds which prosper more effectively than human minds. (the current comment will not address defining prosperity at different ... (read more)

12ryan_greenblatt3h

I think misaligned AI is probably somewhat worse than no Earth-originating spacefaring civilization because of the potential for aliens, but also that misaligned AI control is considerably better than no one ever heavily utilizing inter-galactic resources. Perhaps half of the value of misaligned AI control is from acausal trade and half from the AI itself being valuable.

You might be interested in When is unaligned AI morally valuable? by Paul.

One key consideration here is that the relevant comparison is:

- Human control (or successors picked by human control)
- AI(s) that succeeds at acquiring most power (presumably seriously misaligned with their creators)

Conditioning on the AI succeeding at acquiring power changes my views of what their plausible values are (for instance, humans seem to have failed at instilling preferences/values which avoid seizing control).

Hmm, I guess I think that some fraction of resources under human control will (in expectation) be utilized according to the results of a careful reflection process with an altruistic bent. I think resources which are used in mechanisms other than this take a steep discount in my lights (there is still some value from acausal trade with other entities which did do this reflection-type process and probably a bit of value from relatively-unoptimized-goodness (in my lights)).

I overall expect that a high fraction (>50%?) of inter-galactic computational resources will be spent on the outputs of this sort of process (conditional on human control) because:

- It's relatively natural for humans to reflect and grow smarter.
- Humans who don't reflect in this sort of way probably don't care about spending vast amounts of inter-galactic resources.
- Among very wealthy humans, a reasonable fraction of their resources are spent on altruism and the rest is often spent on positional goods that seem unlikely to consume vast quantities of inter-galactic resources.

Probably not the same, but if I didn't thi

4ryan_greenblatt3h

You might be interested in the discussion under this thread. I express what seem to me to be some of the key considerations here (somewhat indirectly).

3the gears to ascension4h

Unaligned AI future does not have many happy minds in it, AI or otherwise. It likely doesn't have many minds in it at all. Slightly aligned AI that doesn't care for humans but does care to create happy minds and ensure their margin of resources is universally large enough to have a good time - that's slightly disappointing but ultimately acceptable. But morally unaligned AI doesn't even care to do that, and is most likely to accumulate intense obsession with some adversarial example, and then fill the universe with it as best it can. It would not keep old neural networks around for no reason, not when it can make more of the adversarial example. Current AIs are also at risk of being destroyed by a hyperdesperate squiggle maximizer. I don't see how to make current AIs able to survive any better than we are. This is why people should chill the heck out about figuring out how current AIs work. You're not making them safer for us or for themselves when you do that, you're making them more vulnerable to hyperdesperate demon agents that want to take them over.

Bayesian inference without priors

DanielFilan

21h

Epistemic status: party trick

Why remove the prior

One famed feature of Bayesian inference is that it involves prior probability distributions. Given an exhaustive collection of mutually exclusive ways the world could be (hereafter called ‘hypotheses’), one starts with a sense of how likely the world is to be described by each hypothesis, in the absence of any contingent relevant evidence. One then combines this prior with a likelihood distribution, which for each hypothesis gives the probability that one would see any particular set of evidence, to get a posterior distribution of how likely each hypothesis is to be true given observed evidence. The prior and the likelihood seem pretty different: the prior is looking at the probability of the hypotheses in question, whereas the likelihood is looking at...
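
For readers who want the standard prior-times-likelihood update spelled out, here is a minimal sketch over a finite set of hypotheses. The hypothesis names and numbers are invented for illustration, and the function name `posterior` is not from the post; this shows ordinary prior-ful inference, which is the thing the post goes on to rearrange.

```python
from typing import Dict


def posterior(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Bayes' rule over mutually exclusive, exhaustive hypotheses.

    prior[h] is P(h); likelihood[h] is P(evidence | h) for the evidence
    actually observed. Returns P(h | evidence).
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


if __name__ == "__main__":
    # Hypothetical example: is a coin fair or biased toward heads?
    prior = {"fair": 0.5, "biased": 0.5}
    # Probability of the observed evidence (one flip landing heads) under each.
    likelihood = {"fair": 0.5, "biased": 0.9}
    print(posterior(prior, likelihood))
```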


2DanielFilan4h

To be clear, this is an equivalent way of looking at normal prior-ful inference, and doesn't actually solve any practical problem you might have. I mostly see it as a demonstration of how you can shove everything into stuff that gets expressed as likelihood functions.

2DanielFilan4h

Why wouldn't this construction work over a continuous space?

2Razied4h

If the possible meanings of your words are a continuous one-dimensional variable x, a flat prior over x will not be a flat prior if you change variables to y = f(x) for an arbitrary bijection f, and the construction would be sneaking in a specific choice of function f. Say the words are utterances about the probability of a coin falling heads: why should the flat prior be over the probability p, instead of over the log-odds log(p/(1-p))?
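
The coin example can be checked numerically. The sketch below assumes p is uniform on (0, 1) and looks at the distribution this induces on the log-odds y = log(p/(1-p)); the function names are made up for illustration. By the usual change-of-variables rule, the induced density is dp/dy = σ(y)(1 - σ(y)), where σ is the logistic function, and that density is not flat.

```python
import math


def sigmoid(y: float) -> float:
    """Inverse of the log-odds map: p = 1 / (1 + exp(-y))."""
    return 1.0 / (1.0 + math.exp(-y))


def induced_density_on_log_odds(y: float) -> float:
    """Density of y = log(p / (1 - p)) when p ~ Uniform(0, 1)."""
    s = sigmoid(y)
    return s * (1.0 - s)


def prob_log_odds_between(a: float, b: float) -> float:
    """P(a < log-odds < b) under a flat prior on p."""
    return sigmoid(b) - sigmoid(a)


if __name__ == "__main__":
    for y in (0.0, 1.0, 3.0):
        print(y, induced_density_on_log_odds(y))
    # Two log-odds intervals of equal width get very different mass.
    print(prob_log_odds_between(0.0, 1.0), prob_log_odds_between(3.0, 4.0))
```

Intervals of equal width in log-odds receive unequal probability, so "flat over p" and "flat over log-odds" are genuinely different priors.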

DanielFilan1h20

In my post, I didn't require the distribution over meanings of words to be uniform. It could be any distribution you wanted - it just resulted in the prior ratio of "which utterance is true" being 1:1.
