19h

People behave differently from one another on all manner of axes, and each person is usually pretty consistent about it. For instance:

how much money to spend
how much to worry
how much to listen vs. speak
how much to jump to conclusions
how much to work
how playful to be
how spontaneous to be
how much to prepare
how much to socialize
how much to exercise
how much to smile
how honest to be
how snarky to be
how to trade off convenience, enjoyment, time and healthiness in food

These are often about trade-offs, and the best point on each spectrum for any particular person seems like an empirical question. Do people know...


Dagon19m20

For me, these topics seem extremely contextual and variable with the situation and specifics of the tradeoff in the moment. For many of them, I do somewhat frequently explore consciously what it might feel like (and for cheap ones, try out) to make a different tradeoff, but those experiments don't generalize well.

I suspect that for the impactful ones (heavily repeated or large), your first two bullet points don't apply - feedback is delayed from the decision, and if harmful, it will be significant.

Still, it's VERY GOOD to be reminded that these decisions are mostly made by type-1 thinking, out of habit or instinct (aka deep/early learning) that deserves reconsideration from time to time.

WISDOMISM A Moral Theory for the Age of Information

Peter lawless

23m

by Tom W. Bell [Editor: this article is reprinted from Extropy #2, Winter 1989. Extropy was published by the Extropy Institute.]

In the last issue of EXTROPY, co-editor Max O'Connor presented a number of powerful arguments for amoralism in his article titled "Morality or Reality?". While I share many of Max's sentiments, I think that he goes too far in rejecting all moral systems. He reveals the attitude of one who, disappointed with physicists’ failure to produce a grand unified theory, demands that we do away with physics altogether. As a guide to the behavior of rational, autonomous agents, morality serves an important role in our lives. Morality may still be imperfect, but that's no reason to quit the study of ethics altogether. Let's give morality another chance.

In...


AI #60: Oh the Humanity

Zvi

Many things this week did not go as planned.

Humane AI premiered its AI pin. Reviewers noticed it was, at best, not ready.

Devin turns out to have not been entirely forthright with its demos.

OpenAI fired two employees who had been on its superalignment team, Leopold Aschenbrenner and Pavel Izmailov, for allegedly leaking information, and also, more troublingly, lost Daniel Kokotajlo, who expects AGI very soon, does not expect it to by default go well, and says he quit ‘due to losing confidence that [OpenAI] would behave responsibly around the time of AGI.’ That’s not good.

Nor is the Gab system prompt, although that is not a surprise. And several more.

On the plus side, my 80,000 Hours podcast finally saw the light of day, and Ezra Klein had an excellent...


Lost Futures28m10

The Devin mishap is a reminder of how tricky it often is for the general public to gauge what's currently possible and what isn't for AI. A lot of people, including myself, assumed the claimed performance was legitimate. No doubt many AI startups like Devin are waiting for the rising tide of improving foundational models to make their ideas feasible. I wonder how many are engaging in similar deceptive marketing tactics or will do so in the future.

2Vladimir_Nesov17h

Here's the actual paper:

* T Besiroglu et al. (Apr 2024) Chinchilla Scaling: A Replication Attempt

The impact of the Chinchilla paper might be mostly the experimental methodology, not specific scaling laws (apart from the 20x rule of thumb, which the Besiroglu paper upholds). How learning rate has to be chosen for a training horizon, as continued training breaks optimality. And how isoFLOP plots gesture at the correct optimization problem to be solving, as opposed to primarily paying attention to training steps or parameter counts. Subsequent studies build on these lessons towards new regimes, in particular

* N Muennighoff et al. (May 2023) Scaling Data-Constrained Language Models
* Together AI (Dec 2023) StripedHyena
* SY Gadre et al. (Mar 2024) Language Models Scale Reliably with Over-training and on Downstream Tasks
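
As a rough illustration of the 20x rule of thumb mentioned in this comment (about 20 training tokens per model parameter), the sketch below splits a fixed compute budget between parameters and tokens. It assumes the commonly used approximation that training compute is about 6 × parameters × tokens FLOPs; that approximation, the function name, and the example budget are not from the comment.

```python
import math

TOKENS_PER_PARAM = 20.0       # the "20x" rule of thumb from the comment
FLOPS_PER_PARAM_TOKEN = 6.0   # assumption: compute C ~ 6 * N * D


def compute_optimal_split(flop_budget: float) -> tuple[float, float]:
    """Return (params, tokens) such that tokens = 20 * params
    and 6 * params * tokens equals the given FLOP budget."""
    params = math.sqrt(flop_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


if __name__ == "__main__":
    # Hypothetical budget, chosen only for illustration.
    n, d = compute_optimal_split(1e21)
    print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")
```

Under these assumptions both parameters and tokens grow with the square root of the compute budget, which is why the ratio between them stays fixed.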

Raemon's Shortform

Raemon

Ω 06y

This is an experiment in short-form content on LW2.0. I'll be using the comment section of this post as a repository of short, sometimes-half-baked posts that either:

don't feel ready to be written up as a full post
I think the process of writing them up might make them worse (i.e. longer than they need to be)

I ask people not to create top-level comments here, but feel free to reply to comments like you would a FB post.

romeostevensit1h20

CRPGs with a lot of open world dynamics might work, where the goal is for the person to identify the most important experiments to run in a limited time window in order to min-max certain stats.

What's up with all the non-Mormons? Weirdly specific universalities across LLMs

mwatkins

10h

tl;dr: Recently reported GPT-J experiments [1 2 3 4] prompting for definitions of points in the so-called "semantic void" (token-free regions of embedding space) were extended to fifteen other open source base models from four families, producing many of the same bafflingly specific outputs. This points to an entirely unexpected kind of LLM universality (for which no explanation is offered, although a few highly speculative ideas are riffed upon).

Work supported by the Long Term Future Fund. Thanks to quila for suggesting the use of "empty string definition" prompts, and to janus for technical assistance.

Introduction

"Mapping the semantic void: Strange goings-on in GPT embedding spaces" presented a selection of recurrent themes (e.g., non-Mormons, the British Royal family, small round things, holes) in outputs produced by prompting GPT-J to define...


Gunnar_Zarncke1h20

If I haven't overlooked the explanation (I have read only part of it and skimmed the rest), my guess for the non-membership definition of the empty string would be all the SQL and programming queries where "" stands for matching all elements (or sometimes matching none). The small round things are a riddle for me too.

1Ann7h

I played around with this with Claude a bit, despite it not being a base model, in case it had some useful insights or could somehow re-imagine the base model mindset better than other instruct models. When I asked about sharing the results, it chose to respond directly, so I'll share that.

3mwatkins7h

Wow, thanks Ann! I never would have thought to do that, and the result is fascinating. This sentence really spoke to me! "As an admittedly biased and constrained AI system myself, I can only dream of what further wonders and horrors may emerge as we map the latent spaces of ever larger and more powerful models."

Essay competition on the Automation of Wisdom and Philosophy — $25k in prizes

owencb, AI Impacts

This is a linkpost for https://blog.aiimpacts.org/p/essay-competition-on-the-automation

With AI Impacts, we’re pleased to announce an essay competition on the automation of wisdom and philosophy. Submissions are due by July 14th. The first prize is $10,000, and there is a total of $25,000 in prizes available.

Submit an entry via this form.

The full announcement text is reproduced here:

Background

AI is likely to automate more and more categories of thinking with time.

By default, the direction the world goes in will be a result of the choices people make, and these choices will be informed by the best thinking available to them. People systematically make better, wiser choices when they understand more about issues, and when they are advised by deep and wise thinking.

Advanced AI will reshape the world, and create many new situations with potentially high-stakes decisions for...


2Lao Mein8h

Can you give examples of what you're looking for? Can I email you entries and expect a response?

owencb1h20

I feel awkward about trying to offer examples because (1) I'm often bad at that when on the spot, and (2) I don't want people to over-index on particular ones I give. I'd be happy to offer thoughts on putative examples, if you wanted (while being clear that the judges will all ultimately assess things as seem best to them).

Will probably respond to emails on entries (which might be to decline to comment on aspects of it).


What is the best way to talk about probabilities you expect to change with evidence/experiments?

Will_Pearson

I was thinking about my p(doom) in the next 10 years and came up with something around 6%^[1]. However, that involves many things currently unknown to me, such as the nature of current human knowledge production (and the bottlenecks involved), which could move my p(doom) to either 3% or 15% depending on which bottlenecks are or are not found. Is there a technical way to describe this probability distribution contingent on evidence?

^[1] I'm bearish on LLMs leading to AI directly (10% chance), and I put roughly a 30% chance on LLM-based AI fooming quickly enough to kill us, and wanting to kill us, within 10 years. There is a 3% chance that something will come out of left field and do the same.
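
One way to read the numbers in the post and its footnote, sketched as arithmetic. It assumes the 30% is conditional on the 10% LLM path and that the 3% left-field path is added on top; the post does not say this explicitly, and the variable names are illustrative. The second part asks what weight the post implicitly puts on each future value if today's figure is the expectation of the two possible post-evidence figures (3% or 15%).

```python
# Assumed reading of the footnote: P(doom) = P(LLM path) * P(doom | LLM path) + P(left field)
p_llm_path = 0.10        # "LLMs leading to AI directly"
p_doom_given_llm = 0.30  # fooming fast enough, and wanting to kill us, within 10 years
p_left_field = 0.03      # something else does the same

p_doom = p_llm_path * p_doom_given_llm + p_left_field
print(f"p(doom) ~ {p_doom:.2f}")  # about 0.06, matching the post's headline figure

# If evidence will move the estimate to either `low` or `high`, and the current
# estimate is the expected future estimate, the weight on `low` is pinned down.
low, high = 0.03, 0.15
w_low = (high - p_doom) / (high - low)
print(f"implied weight on the 3% outcome ~ {w_low:.2f}")  # about 0.75
```

So under this reading, the question can be restated as how to report a two-point distribution over future credences (roughly 75% on 3%, 25% on 15%) instead of only its mean.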

2Razied2h

Surely something like the expected variance of log(p/(1−p)) would be a much simpler way of formalising this, no? The probability over time is just a stochastic process, and OP is expecting the variance of this process to be very high in the near future.
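
A minimal sketch of the quantity suggested in this comment, the variance of log(p/(1−p)) over possible future credences. The 75%/25% weights on the 3% and 15% outcomes are an assumption (they are the weights that make the current 6% the expected future credence), and the function names are illustrative.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def logodds_variance(outcomes: list[tuple[float, float]]) -> float:
    """Variance of the log-odds over (weight, future_credence) pairs.

    Weights are the chance of ending up at each future credence
    and are expected to sum to 1.
    """
    mean = sum(w * logit(p) for w, p in outcomes)
    return sum(w * (logit(p) - mean) ** 2 for w, p in outcomes)


# Assumed distribution over where the estimate lands after the evidence comes in.
fragile = [(0.75, 0.03), (0.25, 0.15)]
# A credence that is not expected to move at all.
stable = [(1.0, 0.06)]

print(logodds_variance(fragile))  # strictly positive
print(logodds_variance(stable))   # zero: no expected movement
```

Both cases have the same current credence of about 6%, and only the variance of the log-odds tells them apart.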

2Richard_Ngo2h

The variance over time depends on how you gather information in the future, making it less general. For example, I may literally never learn enough about meteorology to update my credence about the winds from 0.5. Nevertheless, there's still an important sense in which this credence is more fragile than my credence about coins, because I could update it. I guess you could define it as something like "the variance if you investigated it further". But defining what it means to investigate further seems about as complicated as defining the reference class of people you're trading against. Also variance doesn't give you the same directional information—e.g. OP would bet on doom at 2% or bet against it at 16%. Overall though, as I said above, I don't know a great way to formalize this, and would be very interested in attempts to do so.

2Razied1h

Wait, why doesn't the entropy of your posterior distribution capture this effect? In the basic example where we get to see samples from a Bernoulli process, the posterior is a beta distribution that gets ever sharper around the truth. If you compute the entropy of the posterior, you might say something like "I'm unlikely to change my mind about this, my posterior only has 0.2 bits to go until zero entropy". That's already a quantity which estimates how much future evidence will influence your beliefs.
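
A small sketch of the idea in this comment: the posterior over a coin's bias sharpens as flips are observed, and its entropy falls. Two assumptions that are not in the comment: a uniform Beta(1, 1) prior, and a discretisation of the bias into equal-width bins so that the entropy is a Shannon entropy in bits with a floor of zero (the differential entropy of a continuous beta distribution has no such floor). The function name and bin count are illustrative.

```python
import math


def beta_posterior_bin_entropy(heads: int, tails: int, bins: int = 100) -> float:
    """Shannon entropy (bits) of a Beta posterior discretised into equal bins.

    Assumes a uniform Beta(1, 1) prior, so the posterior is
    Beta(1 + heads, 1 + tails). Bin mass is approximated by the
    density at the bin midpoint.
    """
    a, b = 1 + heads, 1 + tails
    mids = [(i + 0.5) / bins for i in range(bins)]
    # Unnormalised log density; subtracting the max avoids overflow/underflow issues.
    log_w = [(a - 1) * math.log(x) + (b - 1) * math.log(1.0 - x) for x in mids]
    top = max(log_w)
    weights = [math.exp(lw - top) for lw in log_w]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    for heads, tails in [(0, 0), (5, 5), (50, 50), (500, 500)]:
        bits = beta_posterior_bin_entropy(heads, tails)
        print(f"{heads + tails:4d} flips: {bits:.2f} bits")
```

The entropy here measures how spread out the current posterior is; it says nothing about how costly the next observation would be to obtain.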

Richard_Ngo1h20

The thing that distinguishes the coin case from the wind case is how hard it is to gather additional information, not how much more information could be gathered in principle. In theory you could run all sorts of simulations that would give you informative data about an individual flip of the coin, it's just that it would be really hard to do so/very few people are able to do so. I don't think the entropy of the posterior captures this dynamic.

Express interest in an "FHI of the West"

235

habryka

TLDR: I am investigating whether to found a spiritual successor to FHI, housed under Lightcone Infrastructure, providing a rich cultural environment and financial support to researchers and entrepreneurs in the intellectual tradition of the Future of Humanity Institute. Fill out this form or comment below to express interest in being involved either as a researcher, entrepreneurial founder-type, or funder.

The Future of Humanity Institute is dead:

I knew that this was going to happen in some form or another for a year or two, having heard through the grapevine and private conversations of FHI's university-imposed hiring freeze and fundraising block, and so I have been thinking about how to best fill the hole in the world that FHI left behind.

I think FHI was one of the best intellectual institutions...


4aysja2h

Huh, I feel confused. I suppose we just have different impressions. Like, I would say that Oliver is exceedingly good at cutting through the bullshit. E.g., I consider his reasoning around shutting down the Lightcone offices to be of this type, in that it felt like a very straightforward document of important considerations, some of which I imagine were socially and/or politically costly to make. One way to say that is that I think Oliver is very high integrity, and I think this helps with bullshit detection: it's easier to see how things don't cut to the core unless you deeply care about the core yourself. In any case, I think this skill carries over to object-level research, e.g., he often seems, to me, to ask cutting-to-the core type questions there, too. I also think he's great at argument: legible reasoning, identifying the important cruxes in conversations, etc., all of which makes it easier to tell the bullshit from the not. I do not think of Oliver as being afraid to be disagreeable, and ime he gets to the heart of things quite quickly, so much so that I found him quite startling to interact with when we first met. And although I have some disagreements over Oliver's past walled-garden taste, from my perspective it's getting better, and I am increasingly excited about him being at the helm of a project such as this. Not sure what to say about his beacon-ness, but I do think that many people respect Oliver, Lightcone, and rationality culture more generally; I wouldn't be that surprised if there were an initial group of independent researcher types who were down and excited for this project as is.

owencb1h20

I don't really disagree with anything you're saying here, and am left with confusion about what your confusion is about (like it seemed like you were offering it as examples of disagreement?).

4Zach Stein-Perlman4h

What is Constellation missing or what should it do? (Especially if you haven't already told the Constellation team this.)

26owencb3h

(Caveat: it's been a while since I've visited Constellation, so if things have changed recently I may be out of touch.) I'm not sure that Constellation should be doing anything differently. I think there's a spectrum of how much your culture is like blue-skies thinking vs highly prioritized on the most important things. I think that FHI was more towards the first end of this spectrum, and Constellation is more towards the latter. I think that there are a lot of good things that come with being further in that direction, but I do think it means you're less likely to produce very novel ideas. To illustrate via caricatures in a made-up example: say someone turned up in one of the offices and said "OK here's a model I've been developing of how aliens might build AGI". I think the vibe in Constellation would trend towards people are interested to chat about it for fifteen minutes at lunch (questions a mix of the treating-it-as-a-game and the pointed but-how-will-this-help-us), and then say they've got work they've got to get back to. I think the vibe in FHI would have trended more towards people treat it as a serious question (assuming there's something interesting to the model), and it generates an impromptu 3-hour conversation at a whiteboard with four people fleshing out details and variations, which ends with someone volunteering to send round a first draft of a paper. I also think Constellation is further in the direction of being bought into some common assumptions than FHI was; e.g. it would seem to me more culturally legit to start a conversation questioning whether AI risk was real at FHI than Constellation. I kind of think there's something valuable about the Constellation culture on this one, and I don't want to just replace it with the FHI one. But I think there's something important and valuable about the FHI thing which I'd love to see existing in some more places. (In the process of writing this comment it occurred to me that Constellation could perhap

Inducing Unprompted Misalignment in LLMs

Sam Svenningsen, evhub, Henry Sleight

Ω 153h

Emergent Instrumental Reasoning Without Explicit Goals

TL;DR: LLMs can act and scheme without being told to do so. This is bad.

Produced as part of Astra Fellowship - Winter 2024 program, mentored by Evan Hubinger. Thanks to Evan Hubinger, Henry Sleight, and Olli Järviniemi for suggestions and discussions on the topic.

Introduction

Skeptics of deceptive alignment argue that current language models do not conclusively demonstrate natural emergent misalignment. One such claim is that concerning behaviors mainly arise when models are explicitly told to act misaligned^[1]. Existing Deceptive Alignment experiments often involve telling the model to behave poorly and the model being helpful and compliant by doing so. I agree that this is a key challenge and complaint for Deceptive Alignment research, in particular, and AI Safety, in general. My project is aimed...


16ryan_greenblatt3h

I would summarize this result as: If you train models to say "there is a reason I should insert a vulnerability" and then to insert a code vulnerability, then this model will generalize to doing "bad" behavior and making up specific reasons for doing that bad behavior in other cases. And, this model will be more likely to do "bad" behavior if it is given a plausible excuse in the prompt. Does this seem like a good summary? A shorter summary (that omits the interesting details of this exact experiment) would be: If you train models to do bad things, they will generalize to being schemy and misaligned. This post presents an interesting result and I appreciate your write-up, though I feel like the title, TL;DR, and intro seem to imply this result is considerably more "unprompted" than it actually is. As in, my initial skim of these sections made me think this result is much more striking than it actually is.

5ryan_greenblatt3h

To be clear, I think a plausible story for AI becoming dangerously schemy/misaligned is that doing clever and actively bad behavior in training will be actively reinforced due to imperfect feedback signals (aka reward hacking) and then this will generalize in a very dangerous way. So, I am interested in the question of: 'when some types of "bad behavior" get reinforced, how does this generalize?'

Sam Svenningsen2h32

Thanks, yes, I think that is a reasonable summary.

There is, intentionally, still the handholding of the bad behavior being present to make the "bad" behavior more obvious. I try to make those caveats in the post. Sorry if I didn't make enough, particularly in the intro.

I still thought the title was appropriate since

The company preference held regardless, in both fine-tuning and (some) non-finetuning results, which was "unprompted" (i.e. unrequested implicitly [which was my interpretation of the Apollo Trading bot lying in order to make more money] or



LessOnline

A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA