Expertium — LessWrong

Some feedback:

As others have pointed out, more concise responses would be better.
I feel like this chatbot over-relies on analogies related to your job.
Some of the outputs feel a bit incoherent. For example, it talks about jailbreaking, but then in the next sentence says that AI that is faking alignment is a disaster waiting to happen. It jumped from jailbreaking to alignment faking, but those are pretty different issues.
Personally, I wouldn't link to Yudkowsky's list of lethalities. If you want to use something for persuasion, it needs to be either easy to understand for a layperson or carry a sense of authority (like "world's leading scientists and Nobel prize winners believe [X] is true"), and I don't think Yudkowsky's list meets either criterion.

Also, if that's how "memetic warfare" will be done in the future - via debate-bots - then I don't see how AI safety people are going to win, given that anti-AI-safety people have many billions of dollars to burn.

Emergent Introspective Awareness in Large Language Models

Expertium1mo20

You might be interested in https://www.lesswrong.com/posts/dvbRv97GpRg5gXKrf/run-time-steering-can-surpass-post-training-reasoning-task

If Anyone Builds It Everyone Dies, a semi-outsider review

Expertium2mo30

At that point, the shut down argument is no longer speculative, and you can probably actually do it.
To be clear, I'm not saying that's a good plan if you can foresee all the developments in advance. But, if you're uncertain about all of it, then it seems like there is likely to be a period of time before it's necessarily too late when a lot of the uncertainty is resolved.

I think we are talking past each other, at least somewhat.

Let me clarify: even if humanity wins a fight against an intelligent-but-not-SUPER-intelligent AI (by dropping an EMP on the datacenter with that AI or whatever, the exact method doesn't matter for my argument), we will still be left with the technical question "What code do we need to write and what training data do we need to use so that the next AI won't try to kill everyone?".

Winning against a misaligned AI doesn't help you solve alignment. It might make an international treaty more likely, depending on the scale of the damage caused by that AI. But if the plan is "let's wait for an AI dangerous enough to cause something 10 times worse than Chernobyl to go rogue, then drop an EMP on it before things get too out of hand, then once world leaders crap their pants, let's advocate for an international treaty", then it's one hell of a gamble.

If Anyone Builds It Everyone Dies, a semi-outsider review

Expertium2mo40

How do we know the AI will want to survive?

Because LLMs are already avoiding being shut down: https://arxiv.org/abs/2509.14260. And even if future superintelligent AI is radically different from LLMs, it will likely avoid being shut down as well. This is what people on LessWrong call a convergent instrumental goal:

If your terminal goal is to enjoy watching a good movie, you can't achieve it if you're dead/shut down.

If your terminal goal is to take over the world, you can't achieve it if you're dead/shut down.

If your goal is anything other than self-destruction, then self-preservation comes bundled with it. You can't Do Things if you're dead/shut down.

Why should we think that there is no “in between” period where AI is powerful enough that it might be able to kill us and weak enough that we might win the fight?

Ok, let's say there is an "in between" period, and let's say we win the fight against a misaligned AI. After the fight, we will still be left with the same alignment problems, as other people in this thread pointed out. We will still need to figure out how to make safe, benevolent AI, because there is no guarantee that we will win the next fight, and the fight after that, and the one after that, etc.

If there is an "in between" period, it could be good in the sense that it buys more time to solve alignment, but we won't be in that "in between" period forever.

Yes, AI Continues To Make Rapid Progress, Including Towards AGI

Expertium3mo*10

I've still found them useful. If METR's trend actually holds, they will indeed become increasingly more useful. If it actually holds to >1-month tasks, they may actually become transformative within the decade. Perhaps they will automate the within-paradigm AI R&D, and it will lead to a software-only Singularity that will birth an AI model capable of eradicating humanity.
But that thing will still not be an AGI.

No offense, but to me it seems like you are being overly pedantic about a term that most people use differently. If you surveyed people on LessWrong, as well as AI researchers, I'm pretty sure almost everyone (>90% of people) would call an AI model capable enough to eradicate humanity an AGI.

Questions for old LW members: how have discussions about AI changed compared to 10+ years ago?

Expertium4mo20

Let me put it another way: do you expect that "LLMs do not optimize for a goal" will still be a valid objection in 2030? If yes, then I guess we have very different ideas of how progress will go.

Questions for old LW members: how have discussions about AI changed compared to 10+ years ago?

Expertium4mo10

But frontier labs are deliberately working on making LLMs more agentic. Why wouldn't they? An AI that can do work autonomously is more economically valuable than a chatbot.

METR: How Does Time Horizon Vary Across Domains?