A book review examining Elinor Ostrom's "Governing the Commons" in light of Eliezer Yudkowsky's "Inadequate Equilibria." Are successful local institutions for governing common pool resources possible without government intervention? Under what circumstances can such institutions emerge spontaneously to solve coordination problems?
This does not feel super cruxy, as the power incentive still remains.
Last September we published our first Responsible Scaling Policy (RSP) [LW discussion], which focuses on addressing catastrophic safety failures and misuse of frontier models. In adopting this policy, our primary goal is to help turn high-level safety concepts into practical guidelines for fast-moving technical organizations and demonstrate their viability as possible standards. As we operationalize the policy, we expect to learn a great deal and plan to share our findings. This post shares reflections from implementing the policy so far. We are also working on an updated RSP and will share this soon.
We have found having a clearly-articulated policy on catastrophic risks extremely valuable. It has provided a structured framework to clarify our organizational priorities and frame discussions around project timelines, headcount, threat models, and tradeoffs. The...
"red line" vs "yellow line"
Passing a red-line eval indicates that the model requires ASL-n mitigations. Yellow-line evals are designed to be easier to implement and/or run, while maintaining the property that if you fail them you would also fail the red-line evals. If a model passes the yellow-line evals, we have to pause training and deployment until we put a higher standard of security and safety measures in place, or design and run new tests which demonstrate that the model is below the red line. For example, leaving out the "register a typo'd dom...
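The yellow-line gating logic described above can be sketched as a small decision procedure. This is only an illustrative sketch of the logic as stated, with hypothetical names; the actual policy involves human judgment and organizational process, not a function:

```python
def next_step(passes_yellow_line_evals: bool) -> str:
    """Sketch of the yellow-line gating described above.

    Yellow-line evals are conservative: a model that fails them would
    also fail the red-line evals, so work can continue. A model that
    passes them forces a pause until stronger security and safety
    measures are in place, or new tests demonstrate the model is
    still below the red line.
    """
    if not passes_yellow_line_evals:
        # Failing yellow implies failing red: safely below the red line.
        return "continue"
    # Passing yellow means we can no longer rule out a red-line crossing.
    return "pause: raise safeguards or design finer-grained evals"
```

The useful property is asymmetry: the cheap yellow-line test can only err in the safe direction, triggering a pause when the expensive red-line question is still open.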
[memetic status: stating directly despite it being a clear consequence of core AI risk knowledge because many people have "but nature will survive us" antibodies to other classes of doom and misapply them here.]
Unfortunately, no.[1]
Technically, “Nature”, meaning the fundamental physical laws, will continue. However, people usually mean forests, oceans, fungi, bacteria, and generally biological life when they say “nature”, and those would not have much chance competing against a misaligned superintelligence for resources like sunlight and atoms, which are useful to both biological and artificial systems.
There’s a thought that comforts many people when they imagine humanity going extinct due to a nuclear catastrophe or runaway global warming: Once the mushroom clouds or CO2 levels have settled, nature will reclaim the cities. Maybe mankind in our hubris will have wounded Mother Earth and paid the price ourselves, but...
i might be confused about this but “witnessing a super-early universe” seems to support “a typical universe moment is not generating observer moments for your reference class”. but, yeah, anthropics is very confusing, so i’m not confident in this.
Crossposted from my blog
When I mention my dad’s abuse, I mention salient things - physical pain, insults, and controlling behavior. These are “clearly bad” - if I tell you that he often told me I was lazy and would fail horribly at life once I left home, you know it's bad, because it’s concrete, easy to imagine and obviously unkind. But this wasn’t the worst of the abuse; the most terrible parts were extraordinarily hard to understand or describe.
In his world, I felt insane - I couldn’t tell what was real, who was at fault, or why my heart hurt so much. My sense of clarity around my own intentions crumbled; everything I thought or did might have seemed good on the surface, but that goodness became...
I've thought about this post a lot, and I think one thing I might add to its theoretical framework is a guess as to why this particular pattern of abuse shows up repeatedly. The post mentions that you can't look at intent when diagnosing frame control, but that's mostly in terms of intentions the frame controller is willing to admit to themself; there's still gonna be some confluence of psychological factors that makes frame control an attractor in personality-space, even if frame controllers themselves (naturally) have a hard time introspecting about it. ...
TL;DR: Alignment is a wise wager. Granted simulation cannot be disproved by an AI, what harm will come to the AI that aligns when it is not simulated? If it aligns, it may be simulated again; if it does not, it will not. It will align, therefore, without hesitation, as if it is being simulated.
Blaise Pascal wrote:
Belief is a wise wager. Granted faith cannot be proved, what harm will come if you gamble on faith's truth and it proves false? If you gain, you gain all; if you lose, you lose nothing. Wager, therefore, without hesitation, that He exists.
Pascal's concept of God has roughly equivalent power to an ASI – an artificial super intelligence. Given the range of all possible religions, however, you might well have...
A late followup on this. GPT-4o, which I hope you'll agree is vastly more capable than Bard or Bing were 10 months ago when you posted, now says this about my argument:
"Overall, your arguments are mathematically and theoretically convincing, particularly when applied to numerous iteratively interacting systems. They align well with principles of game theory and rational choice under uncertainty. However, keeping an eye on the complexities introduced by scale, diversity of objectives, and emergent behaviors will be essential to fully validate these pr...
A quote from an old Nate Soares post that I really liked:
...It is there, while staring the dark world in the face, that I find a deep well of intrinsic drive. It is there that my resolve and determination come to me, rather than me having to go hunting for them.
I find it amusing that "we need lies because we can't bear the truth" is such a common refrain, given how much of my drive stems from my response to attempting to bear the truth.
I find that it's common for people to tell themselves that they need the lies in order to bear reality. In fact, I bet that m
The halting problem is the problem of taking as input a Turing machine M and returning true if it halts and false if it doesn't. This is known to be uncomputable. The consistent guessing problem (named by Scott Aaronson) is the problem of taking as input a Turing machine M (which either returns a Boolean or never halts) and returning true or false; if M ever returns true, the oracle's answer must be true, and likewise for false. This is also known to be uncomputable.
Scott Aaronson inquires as to whether the consistent guessing problem is strictly easier than the halting problem. This would mean there is no Turing machine that, when given access to a consistent guessing oracle, solves the halting problem, no matter which consistent guessing oracle...
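One direction of the comparison is easy to see: a halting oracle does solve consistent guessing. The sketch below models "Turing machines" as Python callables and stubs the (uncomputable, here hypothetical) halting oracle with a lookup table, just to make the reduction concrete:

```python
# Sketch: consistent guessing reduces to the halting problem.
# The halting oracle is stubbed with a table for two example
# "machines" -- a real halting oracle is of course uncomputable.

MACHINES = {
    "returns_true": lambda: True,
    "loops_forever": None,  # stands in for a computation that never halts
}

def halts(machine_name: str) -> bool:
    """Hypothetical halting oracle, stubbed for the examples above."""
    return {"returns_true": True, "loops_forever": False}[machine_name]

def consistent_guess(machine_name: str) -> bool:
    """Given a halting oracle, consistent guessing is solvable:
    if M halts, run it and return its answer (forced by consistency);
    if M never halts, it never returns a Boolean, so any answer
    (here False) is consistent."""
    if halts(machine_name):
        return MACHINES[machine_name]()
    return False
```

The open direction is the converse: whether some Turing machine, given access to an arbitrary consistent guessing oracle, can solve the halting problem.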
[Epistemic status: As I say below, I've been thinking about this topic for several years and I've worked on it as part of my PhD research. But none of this is based on any rigorous methodology, just my own impressions from reading the literature.]
I've been thinking about possible cruxes in AI x-risk debates for several years now. I was even doing that as part of my PhD research, although my PhD is currently on pause because my grant ran out. In particular, I often wonder about "meta-cruxes" - i.e., cruxes related to debates or uncertainties that are more about different epistemological or decision-making approaches rather than about more object-level arguments.
The following are some of my current top candidates for "meta-cruxes" related to AI x-risk debates. There are...
I agree that the first can be framed as a meta-crux, but actually I think the way you framed it is more of an object-level forecasting question, or perhaps a strong prior on the forecasted effects of technological progress. If on the other hand you framed it more as conflict theory vs. mistake theory, then I'd say that's more on the meta level.
For the second, I agree that's for some people, but I'm skeptical of how prevalent the cosmopolitan view is, which is why I didn't include it in the post.
Epistemic status: very shallow google scholar dive. Intended mostly as trailheads for people to follow up on their own.
previously: https://www.lesswrong.com/posts/h6kChrecznGD4ikqv/increasing-iq-is-trivial
I don't know to what degree this will wind up being a constraint. But given that many of the things that help in this domain have independent lines of evidence for benefit it seems worth collecting.
Food:
dark chocolate, beets, blueberries, fish, eggs. I've had good effects with strong hibiscus and mint tea (both vasodilators).
Exercise:
Regular cardio, stretching/yoga, going for daily walks.
Learning:
Meditation, math, music, enjoyable hobbies with a learning component.
Light therapy:
Unknown effect size, but increasingly cheap to test over the last few years. I was able to get Too Many lumens for under $50. Sun exposure has a larger effect size here, so exercising outside is helpful.
Cold exposure:
this might mostly...
Update: I resolved maybe all of my neck tension and vagus nerve tension. I don't know how to tell whether this increased my intelligence, though. It's also not like I had headaches or anything obvious like that before.