asher169
0
I often hear people dismiss AI control by saying something like, "most AI risk doesn't come from early misaligned AGIs." While I mostly agree with this point, I think it fails to engage with a bunch of the more important arguments in favor of control— for instance, the fact that catching misaligned actions might be extremely important for alignment. In general, I think that control has a lot of benefits that are very instrumentally useful for preventing misaligned ASIs down the line, and I wish people would more thoughtfully engage with these.
MichaelDickens*10527
34
I find it hard to trust that AI safety people really care about AI safety.

* DeepMind, OpenAI, Anthropic, and SSI were all founded in the name of safety. Instead they have greatly increased danger. And at least OpenAI and Anthropic have been caught lying about their motivations:
  * OpenAI: claiming concern about hardware overhang and then trying to massively scale up hardware; promising compute to the superalignment team and then not giving it; telling the board that a model passed safety testing when it hadn't; too many more to list.
  * Anthropic: promising (in a mealy-mouthed, technically-not-lying sort of way) not to push the frontier, and then pushing the frontier; trying (and succeeding) to weaken SB-1047; lying about their connection to EA (that's not related to x-risk, but it is related to trustworthiness).
* For whatever reason, I had the general impression that Epoch is about reducing x-risk (and I was not the only one with that impression), but:
  * Epoch is not about reducing x-risk, and they were explicit about this, but I didn't learn it until this week.
  * Its FrontierMath benchmark was funded by OpenAI, and OpenAI allegedly has access to the benchmark (see comment on why this is bad).
  * Some of their researchers left to start another build-AGI startup (I'm not sure how badly this reflects on Epoch as an org, but at minimum it means donors were funding people who would go on to work on capabilities).
  * Director Jaime Sevilla believes "violent AI takeover" is not a serious concern, and says "I also selfishly care about AI development happening fast enough that my parents, friends and myself could benefit from it, and I am willing to accept a certain but not unbounded amount of risk from speeding up development" and "on net I support faster development of AI, so we can benefit earlier from it", which is a very hard position to justify (unjustified even on P(doom) = 1e-6, unless you assign ~zero value to people who are not yet born).
* I feel bad picking on Epoch/
Announcing PauseCon, the PauseAI conference. Three days of workshops, panels, and discussions, culminating in our biggest protest to date. Tweet: https://x.com/PauseAI/status/1915773746725474581 Apply now: https://pausecon.org
From Marginal Revolution. What does this crowd think? These effects are surprisingly small. Do we believe these effects? Anecdotally, the effect of LLMs has been enormous for my own workflow and my colleagues'. How can this be squared with the supposedly tiny labor-market effect? Are we that selected a demographic?
Should I drop uni because of AI?

Recently I've read ai-2027.com, and even before that I was pretty worried about my future. I've been considering Yudkowsky's stance, prediction markets on the issue, etc.

I'm 19, from an "upper-middle+" economy EU country, and a first-year BSc maths student. I planned to do something with finance or data analysis (maybe a master's) afterwards, but in light of recent AI progress I now view that as a dead end, because by the time I graduate (~mid/late 2027) I bet there'll be an AGI doing my "brain work" faster, better, and cheaper.

I'll try to quickly obtain some blue-collar job qualifications, for jobs that (for now) don't seem to be at risk of AI replacement; many of them also seem to have not-so-bad salaries in the EU. Maybe I'll emigrate within the EU for better pay and to be able to legally marry my partner.

_____________________

I'm not a top student and haven't done IMO, which makes me feel less ambitious about CVs and internships, as I didn't actively seek experience in finance this year or before. So I don't see a clear path into fin-/tech without qualifications right now. Maybe working a not-so-complex job and enjoying life (traveling, partying, doing my human things, being with my partner, etc.) for the next 2-3 years before a potential civilizational collapse (or trying to get somewhere where UBI is more likely) would be better than missing out on social life and generally not enjoying my pretty hard studies, with a not-so-hypothetical potential to just waste those years.

Popular Comments

Recent Discussion

Every now and then, some AI luminaries

  • (1) propose that the future of powerful AI will be reinforcement learning agents—an algorithm class that in many ways has more in common with MuZero (2019) than with LLMs; and
  • (2) propose that the technical problem of making these powerful future AIs follow human commands and/or care about human welfare—as opposed to, y’know, the Terminator thing—is a straightforward problem that they already know how to solve, at least in broad outline.

I agree with (1) and strenuously disagree with (2).

The last time I saw something like this, I responded by writing: LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem.

Well, now we have a second entry in the series, with the new preprint book chapter “Welcome to the Era of Experience” by...

4Steven Byrnes
My model is simpler, I think. I say: The human brain is some yet-to-be-invented variation on actor-critic model-based reinforcement learning. The reward function (a.k.a. "primary reward" a.k.a. "innate drives") has a bunch of terms: eating-when-hungry is good, suffocation is bad, pain is bad, etc. Some of the terms are in the category of social instincts, including something that amounts to "drive to feel liked / admired".

(Warning: All of these English-language descriptions like "pain is bad" are an approximate gloss on what's really going on, which is only describable in terms like "blah blah innate neuron circuits in the hypothalamus are triggering each other and gating each other etc." See my examples of laughter and social instincts. But "pain is bad" is a convenient shorthand.)

So for the person getting tortured, keeping the secret is negative reward in some ways (because pain), and positive reward in other ways (because of "drive to feel liked / admired"). At the end of the day, they'll do what seems most motivating, which might or might not be to reveal the secret, depending on how things balance out. So in particular, I disagree with your claim that, in the torture scenario, "reward hacking" → reveal the secret. The social rewards are real rewards too.

I'm unaware of any examples where neural architectures and learning algorithms are micromanaged to avoid reward hacking. Yes, avoiding reward hacking is important, but I think it's solvable (in EEA) by just editing the reward function. (Do you have any examples in mind?)
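To make "how things balance out" concrete, here is a minimal, purely illustrative Python sketch (my own, not Byrnes's actual model; every term name and weight is invented) of a reward function built from several innate terms, where the most motivating option is simply whichever plan the terms sum in favor of:

```python
from typing import Dict

# Hypothetical innate reward terms. In the comment these English labels are only a
# shorthand for hypothalamic circuitry, and the weights here are made up.
def primary_reward(outcome: Dict[str, float]) -> float:
    return (
        +1.0 * outcome.get("eating_when_hungry", 0.0)
        - 4.0 * outcome.get("suffocation", 0.0)
        - 5.0 * outcome.get("pain", 0.0)
        + 3.0 * outcome.get("feeling_liked_admired", 0.0)  # social-instinct term
    )

# Torture scenario: both options carry real reward terms, so which one "wins"
# depends entirely on how the terms balance out.
plans = {
    "reveal the secret": {"pain": 0.2, "feeling_liked_admired": -1.0},
    "keep the secret":   {"pain": 1.0, "feeling_liked_admired": +0.5},
}

most_motivating = max(plans, key=lambda name: primary_reward(plans[name]))
print(most_motivating)  # with these made-up weights: "keep the secret"
```

With different (equally made-up) weights the other option wins; the point is only that the social term enters the same sum as the pain term.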
2Noosphere89
A really important question that I think will need to be answered is whether specification gaming/reward hacking must, in a significant sense, be solved by default in order to unlock extreme capabilities. I currently lean towards yes, due to the evidence offered by o3/Sonnet 3.7, but could easily see my mind changed. The reason this question is so important is that if it were true, then we'd get tools to solve the alignment problem (modulo inner-optimization issues), which means we'd be far less concerned about existential risk from AI misalignment (at least to the extent that specification gaming is a large portion of the issues with AI). That said, I do think a lot of effort will be necessary to discover the answer, because whether alignment tools come along with better capabilities or not affects a lot of what you would want to do in AI safety/AI governance.
2cousin_it
I think it helps. The link to "non-behaviorist rewards" seems the most relevant. The way I interpret it (correct me if I'm wrong) is that we can have different feelings in the present about future A vs future B, and act to choose one of them, even if we predict our future feelings to be the same in both cases. For example, button A makes a rabbit disappear and gives you an amnesia pill, and button B makes a rabbit disappear painfully and gives you an amnesia pill. The followup question then is, what kind of learning could lead to this behavior? Maybe RL in some cases, maybe imitation learning in some cases, or maybe it needs the agent to be structured a certain way. Do you already have some crisp answers about this?

The RL algorithms that people talk about in AI traditionally feature an exponentially-discounted sum of future rewards, but I don't think there are any exponentially-discounted sums of future rewards in biology (more here). Rather, you have an idea ("I'm gonna go to the candy store"), and the idea seems good or bad, and if it seems sufficiently good, then you do it! (More here.) It can seem good for lots of different reasons. One possible reason is: the idea is immediately associated with (non-behaviorist) primary reward. Another possible reason is: the idea invol... (read more)
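For reference, the "exponentially-discounted sum of future rewards" being contrasted here is the standard textbook return (my gloss, not part of the comment):

$$G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}, \qquad 0 \le \gamma < 1.$$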

Yes, I've read and fully understood 99% of Decision theory does not imply that we get to have nice things, a post debunking many wishful ideas for Human-AI Trade. I don't think that debunking works against Logical Counterfactual Simulations (where the simulators delete evidence of the outside world from math and logic itself).

What are Logical Counterfactual Simulations?

One day, we humans may be powerful enough to run simulations of whole worlds. We can run simulations of worlds where physics is completely different. The strange creatures which evolve in our simulations may never realize who we are and what we are, while we observe them and their every detail.

Not only can we run simulations of worlds where physics is completely different; we can also run simulations of worlds where...

Not much evidence

In my world model, using Logical Counterfactual Simulations to do Karma Tests will not provide the superintelligence with "a lot of evidence" that it may be in a Karma Test. It will only be a very remote possibility which it cannot rule out. This makes it worthwhile for it to spend 0.0000001% of the universe on being kind to humans.

Even such a tiny amount may be enough to make trillions of utopias, because the universe is quite big.
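As a rough back-of-the-envelope check (my numbers, not the post's): 0.0000001% is $10^{-9}$ as a fraction, and the observable universe contains something like $10^{22}$–$10^{24}$ stars, so

$$10^{-9} \times 10^{22} \approx 10^{13}$$

star systems at the low end, i.e. on the order of trillions.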

If you are an average utilitarian, then this tiny amount could easily make the difference between whether the ... (read more)

Posted a piece on how removing informal social control (the "auntie layer") affects a city's memetic landscape, using SF as a case study. Interested in rationalist critiques: are such regulators net-positive or Pareto-inefficient?

A common claim is that concern about [X] ‘distracts’ from concern about [Y]. This is often used as an attack to cause people to discard [X] concerns, on pain of being enemies of [Y] concerns, as attention and effort are presumed to be zero-sum.

There are cases where there is limited focus, especially in political contexts, or where arguments or concerns are interpreted perversely. A central example is when you cite [ABCDE] and they find what they consider the weakest one and only consider or attack that, silently discarding the rest entirely. Critics of existential risk do that a lot.

So it does happen. But in general one should assume such claims are false.

Thus, the common claim that AI existential risks ‘distract’ from immediate harms. It turns out Emma...

Are there any suggestions for how to get this message across to all those AI x-risk disbelievers?

I love o3. I’m using it for most of my queries now.

But that damn model is a lying liar. Who lies.

This post covers that fact, and some related questions.

o3 Is a Lying Liar

The biggest thing to love about o3 is it just does things. You don't need complex or multi-step prompting; ask and it will attempt to do things.

Ethan Mollick: o3 is far more agentic than people realize. Worth playing with a lot more than a typical new model. You can get remarkably complex work out of a single prompt.

It just does things. (Of course, that makes checking its work even harder, especially for non-experts.)

Teleprompt AI: Completely agree. o3 feels less like prompting and more like delegating. The upside is wild- but yeah, when it just does

...

Huh. I knew that's how ChatGPT worked but I had assumed they would've worked out a less hacky solution by now!

to follow up my philanthropic pledge from 2020, i've updated my philanthropy page with the 2024 results.

in 2024 my donations funded $51M worth of endpoint grants (plus $2.0M in admin overhead and philanthropic software development). this comfortably exceeded my 2024 commitment of $42M (20k times $2100.00 — the minimum price of ETH in 2024).

this also concludes my 5-year donation pledge, but of course my philanthropy continues: eg, i’ve already made over $4M in endpoint grants in the first quarter of 2025 (not including 2024 grants that were slow to disburse), as well as pledged at least $10M to the 2025 SFF grant round.

plex53

You've funded what looks from my vantage point to be a huge portion of the quality-adjusted attempts to avert doom, perhaps a majority. Much appreciation for stepping up for humanity.


Technically the point of going to college is to help you thrive in the rest of your life after college. If you believe in AI 2027, the most important thing for the rest of your life is for AI to be developed responsibly. So, maybe work on that instead of college?

I think the EU could actually be a good place to protest for an AI pause. Because the EU doesn't have national AI ambitions, and the EU is increasingly skeptical of the US, it seems to me that a bit of protesting could do a lot to raise awareness of the reckless path that the US is taking. That, ... (read more)

2Veedrac
Ultimately you have to make a bet on your guesses of reality. If your modal guess is civilizational collapse in 2-3 years, skipping uni is hardly a disproportionate action, but at the same time it's not going to win you much either. Personally I'd leave the uni-or-not decision to the plausible worlds where the choice matters more, and look for some higher leverage change you can make for the rest.
2Kaj_Sotala
This isn't directly answering the question of "should I drop university", but here's something that I wrote to another young person who was asking me what I'd study if I was young and in the beginning of my studies now (machine translated from Finnish in case you wonder why it doesn't sound like my usual voice):
3Hopenope
I lived for a while in a failing country with high unemployment. The businesses and jobs that pay well become saturated very quickly. People are less likely to spend money and often delay purchasing new things or maintaining their homes. Many jobs exist because we don't have time to do them ourselves, and a significant number of these jobs will just vanish. It is really hard to prepare for a high-unemployment society.

Yesterday, I couldn't wrap my head around some programming concepts in Python, so I turned to ChatGPT (gpt-4o) for help. This evolved into a very long conversation (the longest I've ever had with it by far), at the end of which I pasted around 600 lines of code from Github and asked it to explain them to me. To put it mildly, I was surprised by the response:

Resubmitting the prompt produced pretty much the same result (or a slight variation of it, not identical token-by-token). I also tried adding some filler sentences before and after the code block, but to no avail. Remembering LLMs' meltdowns in long-context evaluations (see the examples in Vending-Bench), I assumed this was because my conversation was very long. Then, I copied just...

Have you tried seeing how ChatGPT responds to individual lines of code from that excerpt? There might be an anomalous token in it along the lines of " petertodd".
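If it helps, here is a rough sketch (my own suggestion, not something from the thread) of how one might probe which line triggers the behavior: send each line to the model separately and log the replies. It uses the official openai Python client; the model name, file name, and prompt wording are all placeholders.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical file containing the pasted excerpt; one probe per non-empty line.
code_lines = open("excerpt.py", encoding="utf-8").read().splitlines()

for i, line in enumerate(code_lines):
    if not line.strip():
        continue
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Explain this line of code:\n{line}"}],
    )
    print(i, repr(line), "->", resp.choices[0].message.content[:80])
```

If one line reliably produces the anomaly, you can then bisect within that line down to individual tokens.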

1ZY
Interesting find! Thanks for sharing. Curious to see what related training data could be contributing.
4Viliam
The code that reminds the AI of Hamas mentions checkpoints... No idea what might trigger the Polish language. (Do any of the words in the text by coincidence mean something in Polish?)
1janczarknurek
That was also my idea at first, but then we have the Wagner group one, so this is probably a false lead.

What would it take to convince you to come and see a fish that recognizes faces?

Note: I'm not a marine biologist, nor have I kept fish since I was four. I have no idea what fish can really do. For the purposes of this post, let's suppose that fish recognizing faces is not theoretically impossible, but beyond any reasonable expectation.


Imagine someone comes to you with this story:

"I have an amazing fish. Marcy. She's an archerfish—you know, the kind that can spit at insects? Well, I trained her to spit at politicians.

"It started as a simple game. I’d tap a spot, she’d spit at it, and get a treat. One day I put a TV beside the tank and tapped when a politician came on. Eventually, she started

...

I'd say that in most contexts in normal human life, (3) is the thing that makes this less of an issue for (1) and (2). If the thing I'm hearing about is real, I'll probably keep hearing about it, and from more sources. If I come across 100 new crazy-seeming ideas and decide to indulge them 1% of the time, and so do many other people, that's usually, probably enough to amplify the ones that (seem to) pan out. By the time I hear about the thing from 2, 5, or 20 sources, I will start to suspect it's worth thinking about at a higher level.