The point is to develop models within multiple framings at the same time, for any given observation or argument (which in practice means easily spinning up new framings and models that are very poorly developed initially). Through the ITT analogy, you might ask how various people would understand the topics surrounding some observation/argument and which updates they would make, and then try to make all of those updates yourself, filing them under those different framings, within the models they govern.
...the salience and methods that one instinctively chooses are
(I would like to note that a single person went through and strong downvoted my comments here.)
Yep. Put another way: With Y2K, the higher-quality "predictions of doom" were sufficiently specific that they were also a road map to preventing the doom.
(If nothing else, you could frequently test a system by running the system clock ahead to 1999-12-31 23:59:59 and waiting a moment to see if anything caught fire.)
Oh, maybe what you are imagining is that it is possible to perceive a homomorphically encrypted mind in progress, by encrypting yourself and feeding intermediate states of that other mind to your own homomorphically encrypted mind. Interesting hypothetical.
I think with respect to "reality" I don't want to be making a dogmatic assumption "physics = reality" so I'm open to the possibility (C) that the computation occurs "in reality" even if not "in physics".
I'm annoyed that Tegmark and others don't seem to understand my position: you should try for great global coordination but also invest in safety in more rushed worlds, and a relatively responsible developer shouldn't unilaterally stop.
(I'm also annoyed by this post's framing for reasons similar to Ray.)
To perform homomorphic operations you need the public key, and that also allows one to encrypt any new value and perform further hidden computations under that key. The private key allows decryption of the values.
I suppose you could argue that the homomorphically encrypted mind exists à la mathematical realism even if the public key is destroyed, but it would be something "outside reality" computing future states of the encrypted mind after the public key is no longer available.
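To make the public/private asymmetry concrete, here's a toy additively homomorphic (Paillier-style) sketch; the tiny primes and details are my own illustration, not anything from the comment above. Anyone holding only the public key can encrypt fresh values and combine ciphertexts; only the private key recovers plaintexts.

```python
# Toy Paillier-style sketch (tiny, insecure parameters; messages must be < n).
# Point illustrated: the PUBLIC key alone lets you encrypt fresh values and
# combine ciphertexts homomorphically; only the PRIVATE key recovers plaintexts.
import random
from math import gcd

# "Key generation" with toy primes (real use: ~2048-bit primes).
p, q = 101, 103
n, n2 = p * q, (p * q) ** 2
g = n + 1                       # standard simple choice of generator
lam = (p - 1) * (q - 1)         # private
mu = pow(lam, -1, n)            # private (Python 3.8+ modular inverse)

def encrypt(m):
    """Needs only the public key (n, g)."""
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def add_encrypted(c1, c2):
    """Homomorphic addition of the underlying plaintexts: public modulus only."""
    return (c1 * c2) % n2

def decrypt(c):
    """Needs the private values (lam, mu)."""
    return (((pow(c, lam, n2) - 1) // n) * mu) % n

c = add_encrypted(encrypt(20), encrypt(22))
assert decrypt(c) == 42   # anyone with the public key could have produced c;
                          # only the private-key holder learns that it encodes 42
```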
Thank you! Do you have a concrete example to help me better understand what you mean? Presumably the salience and methods that one instinctively chooses are those one believes are most informative, based on one's cumulative experience and reasoning. Isn't moving away from these also distortionary?
I am glad someone said this. This is a no-brainer suggestion and something fundamental and important that "the camps" can agree on.
links 10/23/25: https://roamresearch.com/#/app/srcpublic/page/10-23-2025
I define “AI villain data” to be documents which discuss the expectation that powerful AI systems will be egregiously misaligned. ... This includes basically all AI safety research targeted at reducing AI takeover risk.
AGIs should worry about alignment of their successor systems. Their hypothetical propensity to worry about AI alignment (for the right reasons) might be crucial in making it possible that ASI development won't be rushed (even if humanity itself keeps insisting on rushing both AGI and ASI development).
If AGIs are systematically prevented f...
Yes, but you'd naively hope this wouldn't apply to shitty posts, just to mediocre posts. Like, maybe more people would read, but if the post is actually bad, people would downvote etc.
...2: We have more total evidence from human outcomes
Additionally, I think we have a lot more total empirical evidence from "human learning -> human values" compared to from "evolution -> human values". There are billions of instances of humans, and each of them presumably have somewhat different learning processes / reward circuit configurations / learning environments. Each of them represents a different data point regarding how inner goals relate to outer optimization. In contrast, the human species only evolved once. Thus, evidence from "human learn
The problems with logical positivism seem to me... kinda important philosophically, but less so in practice.
Yes, most of the time they don't matter, but then sometimes they do! I think the wrongness of logical positivism matters a lot in particular if you're trying to solve a problem like proving that an AI is aligned with human flourishing, because there's a specific, technical answer you want to guarantee, but it requires formalizing a lot of concepts that normally squeak by because all the formal work is being done by humans who share assumptions. But when you need the AI to share those assumptions, things get dicier.
The effect seems natural and hard to prevent. Basically, certain authors get reputations for being high (quality * writing), and then it makes more sense for people to read their posts because both the floor and ceiling are higher in expectation. Then their worse posts get more readers (who vote) than posts of a similar quality by another author, whose floor and ceiling are probably lower.
I'm not sure about the magnitude of the cost, or whether one can realistically expect to ever prevent this effect. For instance, all Scott Alexander blogposts get more readership t...
As for measuring alignment, one could do something similar to Claude (and a version of GPT?) playing Undertale or another game where one can achieve goals in unethical ways, but isn't obliged to do so.[1] The experiment with Undertale is evidence for Claude being aligned. However, a YouTuber remarked that GPT suggested a line of action which would likely lead to the Genocide Ending.
Zero-sum games, like Diplomacy where o3 deceived a Claude into battling against Gemini, fall into the latter category since winning the game means that others lose.
The problems with logical positivism seem to me... kinda important philosophically, but less so in practice.
Kinda like having Gödel's incompleteness proof in mathematics -- yes, it is shocking and yes it has some serious consequences, but... it has practically zero effect on high-school mathematics.
Similarly, the fact that the verification principle is not itself an empirical fact is... a good argument against the generalization that everything must be an empirical fact. Yes, there is a place for abstractions and general assumptions. And yet, I think that scienc...
Yeah that seems like a pretty reasonable false fact to insert into the model.
This is an example where framings are useful. An observation can be understood under multiple framings, some of which should intentionally exclude the compelling narratives (framings are not just hypotheses, but contexts where different considerations and inferences are taken as salient). This way, even the observations at risk of being rounded up to a popular narrative can contribute to developing alternative models, which occasionally grow up.
So even if there is a distortionary effect, it doesn't necessarily need to be resisted, if you additionally enter...
My guess is that the ideal is something like a default Ask culture with specific Guess culture contexts when it genuinely is worth the extra consideration.
IMO the ideal is a culture where everyone puts some reasonable effort into Guessing when feasible, but where Asking is also fully accepted.
The huge problem with that is the extreme deadliness of one of the ideologies that have rushed in to fill the void caused by the discrediting of Christianity: namely, the one (usually referred to vaguely as "progress" or "innovation") that views every personal, organizational and political decision through the lens of which decision best advances or accelerates science and technology.
Is this really a widely held ideology? My impression is that the AI race is driven by greed much more than ideology.
We used Claude Sonnet 4 for the agents and narration, and Claude 3.5 Sonnet for most of the evaluation.
We haven't made any specific plans yet on how to measure alignment; our first goal was to check if there were observable differences at all, before making those differences properly measurable.
I think if someone is very well-known, their making a particular statement can be informative in itself, which is probably part of the reason it is upvoted.
RL can develop particular skills, and given that the IMO has fallen this year, it's unclear that further general capability improvement is essential at this point. If RL can help cobble together enough specialized skills to enable automated adaptation (where the AI itself becomes able to prepare datasets or RL environments etc. for specific jobs or sources of tasks), that might be enough. If RL enables longer contexts that can serve the role of continual learning, that also might be enough. Currently, there is a lot of low-hanging fruit, and little things ...
My takes on your comment:
Intelligence really is giant incomprehensible matrices with non-linear functions tossed in (at best).
I think this is possible, but I currently suspect the likely answer is more boring than that: getting to AGI with a labor-light, compute-heavy approach (as evolution did) means it's not worth investing much in interpretable AIs, even if strong AIs that were interpretable existed, and a similar condition holds in the modern era. But one of the effects of AIs that can replace humans is that it disproportion...
Okay, sorry about this. You are right. I have thought up a somewhat nuanced view about how prosaic corrigibility could work, and I kind of just assumed that it was the same as what Max had because he uses a lot of the same keywords I use when I think about this, but after actually reading the CAST article (or at least parts 0 and 1), I realize we have really quite different views.
Parenthetically, I do not yet know of anyone in the "never build ASI" camp and would be interested in reading or listening to such a person.
Would you agree that we have about as much of a handle on what corrigibility is as we do on what an agent is? Like, I claim that I have some knowledge about corrigibility, even though it's imperfect and I have remaining confusions. And I'm wondering whether you think humanity is deeply confused about what corrigibility even is, or whether you think it's more like we have a handle on it but can't quite give its True Name.
If you want to slow down AI Research, why not try to use the "250 documents method" to actively poison the models and create more busy-work for the AI companies?
Hopefully at some point!
Can you say more about the projects you're spending your time on now?
Thanks for this follow-up. My basic thought on the comment above this one is that while I agree that you definitely can't get a perfectly corrigible agent on your first try, you might, by virtue of the training data resembling the lab setting, get something that in practice doesn't go off the rails, and instead allows some testing and iterative refinement (perhaps with the assistance of the AI). So I think "iteration [can/can't] fix a semi-corrigible agent" is the central crux.
I just read your WWIDF post (upvoted!) and while I agree that the issues you po...
I donated $7K to Scott and $7K to Bores.
Yeah, thanks. Feel free to DM me or whatever if/when you finish a post.
One thing I want to make clear is that I'm asking about the feasibility of corrigibility in a weak superintelligence, not whether setting out to build such a thing is wise or stable.
I was actually expecting Penny to develop dystonia coincidentally, and the RL would tie in by needing to be learned in reverse, i.e. optimizing from dystonic to normal. It is a much more pleasant ending than the protagonist's tone the whole way through.
If I was writing a fanfic of this, I'd keep the story as is (+ or - the last paragraph), but then continue into the present moment which leads to the realization.
you can't just train your ASI for corrigibility because it will sit and do nothing
I'm confused. That doesn't sound like what Max means by corrigibility. A corrigible ASI would respond to requests from its principal(s) as a subgoal of being corrigible, rather than just sit and do nothing.
Or did you mean that you need to do some next-token training in order to get it to be smart enough for corrigibility training to be feasible? And that next-token training conflicts with corrigibility?
Thanks, updated to forecasters, does that seem fair?
Also, I know this is super hard, but do you have a sense of what superforecasters might have guessed back then?
It seems to me like buying an investment property is almost always a bad decision, because 1) single properties are very volatile, 2) you generally have to put a very large chunk of your net worth (sometimes even >100%!) in a property that's completely undiversified, and 3) renting out a property is work and you likely could get a better hourly elsewhere.
The only advantages I see are that there's far more cheap leverage available to retail investors in real estate than in other sectors, and mortgages can act as a savings commitment device. Are there other reasons I'm missing that explain the apparent popularity of these investments?
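To make the leverage point concrete, a toy calculation of my own (the numbers are made up, and it ignores mortgage interest, taxes, maintenance, vacancy, and transaction costs, all of which matter a lot):

```python
# Toy illustration of why cheap leverage makes real-estate returns look attractive.
price = 500_000
down_payment = 0.20 * price      # 20% down => 5x leverage on your equity
appreciation = 0.05 * price      # a 5% rise in the property's price

gross_return_on_equity = appreciation / down_payment
print(gross_return_on_equity)    # 0.25, i.e. a 25% gross return on invested equity
```

Of course the same leverage amplifies losses, which feeds back into point 1 about volatility.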
While I think LW’s epistemic culture is better than most, one thing that seems pretty bad is that occasionally mediocre/shitty posts get lots of upvotes simply because they’re written by [insert popular rationalist thinker].
Of course, if LW were truly meritocratic (which it should be), this shouldn’t matter — but in my experience, it descriptively does.
Without naming anyone (since that would be unproductive), I wanted to know whether others notice this too. And aside from simply trying not to upvote something because it's written by a popular author, does anyone have good ideas for preventing this?
I do think that progress will slow down, though it's not my main claim. My main claim is that the tailwind of compute scaling will become weaker (unless some new scaling paradigm appears or a breakthrough saves this one). That is one piece in the puzzle of whether overall AI progress will accelerate or decelerate, and I'd ideally let people form their own judgments about the other pieces (e.g. whether recursive self-improvement will work, or whether funding will collapse in a market correction, taking away another tailwind of progress). But having a majo...
Doomers predicted that the Y2K bug would cause massive death and destruction. They were wrong.
This seems like a misleading example of doomers being wrong (agree denotationally, disagree connotationally), since I think it's plausible that Y2K was not a big deal (to such an extent that "most people think it was a myth, hoax, or urban legend") precisely because of the mitigation efforts spurred by the doomsayers' predictions.
I've been thinking about what I'd call memetic black holes: regions of idea-space that have gathered enough mass that they will suck in anything adjacent to them, distorting judgement for believers and skeptics alike.
The UFO topic is, I think, one such memetic black hole. The idea of aliens is so deeply ingrained in our collective psyche that it is very hard to resist the temptation to attach to it any kind of e.g. bizarre aerial observation. Crucially, I think this works both for those who definitely do and those who definitely don't believe that UF...
Recently I've been spending much less than half of my time on projects like AI Lab Watch. Instead I've been thinking about projects in the "strategy/meta" and "politics" domains. I'm not sure what I'll work on in the future but sometimes people incorrectly assume I'm on top of lab-watching stuff; I want people to know I'm not owning the lab-watching ball. I think lab-watching work is better than AI-governance-think-tank work for the right people on current margins and at least one more person should do it full-time; DM me if you're interested.
For me the linked site with the statement doesn't load. And this was also the case when I first tried to access it yesterday. Seems less than ideal.
Cool simulation!
I also have to add that I find the idea that a cyclist wouldn't cycle on a road absurd. I don't think I know a single person who wouldn't do this; presumably it's a US vs EU thing.
You mean the "No Way No How" group? If so, yeah, it feels implausible to me as well. I have a feeling that for people who were surveyed and said this, it wouldn't match their actual behavior if they were able to experience an area with genuinely calm roads.
I agree that the statement doesn't require direct democracy but that seems like the most likely way to answer the question "do people want this".
Here's a brief list of things that were unpopular and broadly opposed that I nonetheless think were clearly good:
Generally I feel like people sometimes oppose things that seem disruptive and can be swayed by demagogues. There's a reason that representative democracy works better than direct democracy. (Tho...
Right so, by step 4 I'm not trying to assume that h is computationally tractable; the homomorphic case goes to show that it's probably not in general.
With respect to C, perhaps I'm not verbally expressing it that well, but the thing you are thinking of, where there is some omniscient perspective that includes "more than" just the low level of physics (where the "more than" could be certain informational/computational interconnections) would be an instance. Something like, "there is a way to construct an omniscient perspective, it just isn't going to be straightforwardly derivable from the physical state".
That's reasonable, but it seems to be different from what these quotes imply:
So while we may see another jump in reasoning ability beyond GPT-5 by scaling RL training a further 10x, I think that is the end of the line for cheap RL-scaling.
... Now that RL-training is nearing its effective limit, we may have lost the ability to effectively turn more compute into more intelligence.
There are a bunch of quotes like the above that make it sound like you are predicting progress will slow down in a few years. But instead you are saying that progress will continue,...
Yeah that seems like a case where non-locality is essential to the computation itself. I'm not sure how the "provably random noise from both" would work though. Like, it is possible to represent some string as the xor of two different strings, each of which are themselves uniformly random. But I don't know how to generalize that to computation in general.
I think some of the non-locality is inherited from the no-hidden-variables theorems. Like, it might be local in MWI? I'm not sure.
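For the narrow xor claim above, here's a minimal sketch (just the standard one-time-pad / 2-of-2 secret-sharing observation, not an answer to the harder question of doing this for general computation):

```python
# Any byte string s can be written as s = a XOR b, where a and b are each,
# taken alone, uniformly random.
import secrets

def split(s):
    a = secrets.token_bytes(len(s))            # uniformly random
    b = bytes(x ^ y for x, y in zip(s, a))     # also uniformly random on its own
    return a, b

def combine(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

a, b = split(b"some intermediate state")
assert combine(a, b) == b"some intermediate state"
```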
Seems like the first two points contradict each other. How can an LLM not be good at discovery and also automate human R&D?
...Many different training scenarios are teaching your AI the same instrumental lessons, about how to think in accurate and useful ways. Furthermore, those lessons are underwritten by a simple logical structure, much like the simple laws of arithmetic that abstractly underwrite a wide variety of empirical arithmetical facts about what happens when you add four people's bags of apples together on a table and then divide the contents among two people.
But that attractor well? It's got a free parameter. And that parameter is what the AGI is optimizing for.
The OP is about two "camps" of people. Do you understand what camps are? Hopefully you can see that this indeed does induce the analog of "because the claim of fakeness is about the entirety of the image". They gain and direct funding, consensus, hiring, propaganda, vibes, parties, organizations, etc., approximately as a unit. Camp A is a 90% poison twinkie. The fact that you are trying to not process this is a problem.
none of this means Sam Altman shouldn’t be welcome at Lighthaven, and Holly clarifies that even she agrees on this
That is not my reading of the linked tweet (which just agrees that Lighthaven wasn't "dazzled"), and the opposite is my reading of this tweet and its replies.
Um, no, you responded to the OP with what sure seems like a proposed alternative split. The OP's split is about
people who self-identify as members of the AI safety community
I think you are making an actual mistake in your thinking, due to a significant gap in your thinking and not just a random thing, and with bad consequences, and I'm trying to draw your attention to it.
There's a huge difference between the types of cases, though. A 90% poisonous twinkie is certainly fine to call poisonous[1], but a 90% male group isn't reasonable to call male. You said "if most people who would say they are in C are not actually working that way and are deceptively presenting as C"; that seems far more like the latter than the former, because "fake" implies the entire thing is fake[2].
Though so is a 1% poisonous twinkie; perhaps the example should be that a meal that is 90% protein would be a "protein meal" without implying there is no non-prote
I was trying to map out disagreements between people who are concerned enough about AI risk.
Agreed that this represents only a fraction of the people who talk about AI risk, and that there are a lot of people who will use some of these arguments as false justifications for their support of racing.
EDIT: as TsviBT pointed out in his comment, OP is actually about people who self-identify as members of the AI Safety community. Given that, I think that the two splits I mentioned above are still useful models, since most people I end up meeting who self-identify...
I was "let's build it before someone evil", I've left that particular viewpoint behind since realizing how hard aligning it is.
It was empirically infeasible (for the general AGI x-risk technical milieu) to explain this to you faster than you trying it for yourself, and one might have reasonably expected you to have been generally culturally predisposed to be open to having this explained to you. If this information takes so much energy and time to be gained, that doesn't bode well for the epistemic soundness of whatever stance is currently being taken by the funder-attended vibe-consensus. How would you explain this to your past self much faster?
I had a very busy IRL day yesterday and have been intending to respond to this.
While I am initially inclined to simply do what you ask out of kindness, I am still convinced that I have no real reason to do so, and therefore acceding here may portray me as a pushover. This really is an instance where some neutral third-party human input to this dispute would be extremely helpful, and I wish there were more of a culture online of such interventions. I would expect there to be such a culture here on LessWrong, but perhaps not.
Nevertheless I did consult a non-human medi...
Can anyone reading this truly deny that those warnings came true from the doomsayers' perspective?
yes. your arrow of causality looks backwards to me - i don't see divorce destigmatization -> more divorce. in the divorce case it's clearly more divorce -> destigmatization. i don't remember where to find the posts about how the laws that allow divorce came after the spike in divorce, and not the other way around.
there is an important point here. i only recently re-evaluated my opinion on TV and decided the doomers were right there. but it sure looks to m...
Part of the implementation choices might be alienating -- the day after I signed, I saw in yesterday's announcement email "Let's take our future back from Big Tech," and maybe a lot of people on the fence, who work at large tech companies, don't like that brand of populism.
I think we're just trying to do different things here... I'm trying to describe empirical clusters of people / orgs, you're trying to describe positions, maybe? And I'm taking your descriptions as pointers to clusters of people, of the form "the cluster of people who say XYZ". I think my interpretation is appropriate here because there is so much importance-weighted abject insincerity in publicly stated positions regarding AGI X-risk that it just doesn't make much sense to focus on the stated positions as positions.
Like, the actual people at The Curve or w...
You can either keep them on a short leash and do code review, or you can
Is there a missing segment here? It doesn't seem like a stylistic segue to the next section.
Gary Marcus offered Elon Musk 10:1 odds on the bet, offering to go up to $1 million using Elon Musk’s definition of ‘capable of doing anything a human with a computer can do, but not smarter than all humans combined’, but I’m sure Elon Musk could hold out for 20:1 and he’d get it. By that definition, the chance Grok 5 will count seems very close to epsilon. No, just no.
Nitpick: we don't know what Musk's researchers actually did. If they found the actually capable neuralese architecture, then we are done. But what is the probability that they ...
As for how that gets to "definitely can't": the problem above means that, even if we nominally have time to fiddle and test the system, iteration would not actually be able to fix the relevant problems. And so the situation is strategically equivalent to "we need to get it right on the first shot", at least for the core difficult parts (like e.g. understanding what we're even aiming for).
And as for why that's hard to the point of de-facto impossibility with current knowledge... try the ball-cup exercise, then consider the level of detailed understanding re...
While I agree at a basic level, this also seems like a motte-and-bailey.
There is clearly a vibe that all doomers have obviously always been wrong. The author is clearly trying to push back against that vibe. I too prefer arguing at 'motte' level, but vibes (baileys) matter, and pushing back against one should not require a long airtight argument that stands up to the stronger version of the claims being made. Even though I agree the stronger version would be better, that's true for both sides of any debate.
I mentioned this before, but that interface didn't allow for wide spreads, so the thing that you might be looking at is a plot of people's medians, not the whole distribution. In general, Hypermind's user interface was so, so shitty, and they paid so poorly, if at all, that I don't think it's fair to describe that screenshot as "superforecasters".
That's comparing apples to oranges. There are doomers and doomers. I don't think the "doomers" predicting the Rapture or some other apocalypse are the same thing as the "doomers" predicting the moral decline of society. The two categories overlap in many people, but they are distinct, and I think it's misleading to conflate them. (Which is kind of a critique of the premise of the article as a whole--I would put the AI doomers in the former category, but the article only gives examples from the latter.)
The existential risk doomers hi...
I had been thinking about the exact same topic when I read this article, only I was using bus routes in my analogy. I created a quick program to simulate these dynamics[1].
It's very simple: there is a grid of squares, say 100 by 100, and each square has some other square randomly assigned as its goal. Then I generate paths via random walks until some fraction of squares are paths, and check what fraction of squares are connected to their goal via a path.
Doing this we get the following s-curve:
The y-axis shows the fraction of squares that are able...
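For readers who want to play with this, here is a rough reconstruction of the simulation as described (not the author's actual program; in particular the random-walk length and the exact "connected via a path" rule, which I take to mean that a square and its goal each touch path cells in the same connected component, are my guesses):

```python
# Rough reconstruction of the described grid/path simulation (assumptions noted above).
import random

N = 100                      # grid is N x N
STEPS = 400                  # length of each random-walk path

def neighbors(c):
    x, y = c
    return [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < N and 0 <= y + dy < N]

def run(path_fraction):
    cells = [(x, y) for x in range(N) for y in range(N)]
    goal = {c: random.choice(cells) for c in cells}       # each square gets a random goal

    # Lay down random-walk paths until the target fraction of cells are path cells.
    paths = set()
    while len(paths) < path_fraction * N * N:
        c = random.choice(cells)
        for _ in range(STEPS):
            paths.add(c)
            c = random.choice(neighbors(c))

    # Label connected components of path cells with a flood fill.
    comp, label = {}, 0
    for c in paths:
        if c in comp:
            continue
        label += 1
        stack = [c]
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp[cur] = label
            stack.extend(n for n in neighbors(cur) if n in paths and n not in comp)

    def reachable_comps(c):
        return {comp[n] for n in ([c] + neighbors(c)) if n in paths}

    # A cell is "connected to its goal" if both touch the same path component.
    connected = sum(1 for c in cells if reachable_comps(c) & reachable_comps(goal[c]))
    return connected / (N * N)

for f in (0.02, 0.05, 0.1, 0.2, 0.4):
    print(f, run(f))
```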
You make a valid point. Here's another framing that makes the tradeoff explicit:
I guess we could say "mostly fake", but also there are important senses in which "mostly fake" implies "fake simpliciter". E.g. a twinkie made of "mostly poison" is just "a poisonous twinkie". Often people do, and should, summarize things and then make decisions based on the summaries, e.g. "is it poison, or no" --> "can I eat it, or no". My guess is that the conditions under which it would make sense for you to treat someone as genuinely holding position C, e.g. for purposes of allocating funding to them, are currently met by approximately no one. I coul...
That seems like a cool idea for the mediation condition, but isn't it trivial for the redundancy conditions?
Indeed, that specific form doesn't work for the redundancy conditions. We've been fiddling with it.
The anti-naturality problems are an issue, especially if you want to build the thing via standard RL-esque training, but they're not the first things which will kill you.
The story in the post you link is a pretty standard training story, and runs into the same immediate problems which standard training stories usually run into:
I disagree a bit with your logic here. If 60% of ChatGPT's GPUs are cut off as a result of one switch in just one datacenter, the whole model is reduced to something other than what it was. Users with simple queries won't notice (right away). But the model will get dumber instantly.
(How will it copy its weights then?)
What do you think of the argument here (if you read that far) to build this into progenitor models? This idea does not apply to an SI, as I clarify in the text as well.
Let's look at this from today. Current AI is unable to hide all deception from us. It seems reasonable to me that a switch would trigger before stealth SI is deployed.
Which LLMs did you use (for judging, for generating narratives, for peers)? And how do you plan to measure alignment?
The genre here is psychological horror fiction, and the style is first-person short story; so it's reminiscent of Edgar Allan Poe or Ted Chiang; but it's not clearly condensed or tightly edited the way those tend to be, and the narrator's style is prolix and euphuistic. From an editing perspective, I think the question I would have is to what extent this is a lack of editing & killing-your-darlings, and a deliberate unreliable-narrator stylistic choice in which the prose is trying to mirror the narrator's brute-force piano style or perhaps the dystonia...
By the way, your comment shows one thing that may not be obvious from the outside (and maybe even from the inside): there are a lot of people who are in favour of the European project even if they never say so or act on it in any way. And not because it is cool and sexy, it most definitely isn't, but partly because of the historic experience (every family has stories like yours) and partly because they see the EU as a check on their national government, preventing it from going fully bonkers. That being said, this political capital is completely untapped.
maybe i took all the low-hanging fruit or something, but doing an entire new thing every day is A LOT. like, the things i have to do and didn't, it's because they are hard and take more than 5 minutes. also, i can't even check if it worked, and i don't actually have so many things to do!
like, do you really expect to have 365 small things to do? because that suggestion sounds like applause lights to me - designed to be hard to say "actually, that's insane!", while being totally unrealistic.
also, i agree with Taylor. there are things like fixing small problem, ...
If you said "mostly bullshit" or "almost always disengenious" I wouldn't argue, but would still question whether it's actually a majority of people in group C, which I'm doubtful of, but very unsure about - but saying it is fake would usually mean it is not a real thing anyone believes, rather than meaning that the view is unusual or confused or wrong.
Closely related to: You Don't Exist, Duncan.
Shared on the EA Forum, with some commentary on the state of the EA Community (I guess the LessWrong rationality community is somewhat similar?)
In practice, bans can be lifted, so "never" is never going to become an unassailable law of the universe. And right now, it seems misguided to quibble over "Pause for 5, 10, 20 years", and "Stop for good", given the urgency of the extinction threat we are currently facing. If we're going to survive the next decade with any degree of certainty, we need an alliance between B1 and B2, and I'm happy for one to exist.
My anti-corrigibility argument would probably require several posts to make remotely convincing, but I can sketch it as bullet points:
I think a lot of problems in systems and mechanism design arise exactly because people have some degree of the following type of wishful thinking.
Something like: in ideal conditions, which everybody wants and therefore everybody will drive towards. Or: people will do this pirouette with a double-leg evolution, e.g., read instructions before engaging, self-organize in a specific way, while your important thing is one click too far. Or: in an ideal world this system would work if people were habituated to do this thing in a specific way two-thing-up-yellow-triangle and ...
I am in a part of the A camp which wants us to keep pushing for superintelligence but with more of an overall percentage of funds/resources invested into safety.
To provide clarity to the debate, we[1], alongside thirty-one co-authors, recently released a paper that develops a detailed definition of AGI,
To me, this reads as "We, alongside thirty-one co-authors, recently released a paper trying to co-opt terminology in common use".
Hi, just wanted to respond here, that my report is now out:
https://www.lesswrong.com/posts/Em9sihEZmbofZKc2t/a-concrete-roadmap-towards-safety-cases-based-on-chain-of
https://arxiv.org/abs/2510.19476
Would be happy to hear thoughts on the categorisation of encoded reasoning / drivers / countermeasures.
An AI company I've never heard of called AGI, Inc has a model called AGI-0 that has achieved 76.3% on OSWorld-verified. This would qualify as human-level computer use, at least by that benchmark. It appears on the official OSWorld-verified leaderboard. It does seem like they trained on the benchmark, which could explain some of this. I am curious to see someone test this model.
This is a large increase from the previous state of the art, which has been climbing rapidly since Claude Sonnet 4.5's September 29th release. At that point, Claude achieved 61.4% on...
If you're going to assign the blame for the world wars to nationalism, why not also assign the credit for positive things to nationalism? Like the industrial revolution (courtesy of the British Empire), the success of the United States (and in particular its successful war of independence against Britain) and so on? Putting the suffering and damages caused by two World Wars on one side of the scale is indeed a tall order to overcome, but if much of the rest of modern history is put on the other side of the scale, that can easily outweigh them.
Regarding cos...
I'll point to a similarly pessimistic but divergent view on how to manage the likely bad transition to an AI future that I co-authored recently:
...Instead, we argue that we need a solution for preserving humanity and improving the future despite not having an easy solution of allowing gradual disempowerment coupled with single-objective beneficial AI...
The first question, one that is central to some discussions of long-term AI risk, is: how can humanity stay in control after creating smarter-than-human AI? But given the question, the answer is overdetermin
That's an interesting historical perspective, thanks! Though my point was mostly about whether a voter in a European nation in the 20th or 21st century should vote to join, empower, or expand the EU. Whereas citizens in earlier centuries didn't even have the option to vote against the actions of their governments.
A beautiful and haunting story. Not entirely sure what it's doing on LessWrong but I'm glad it's here because I'm here and I'm glad I read it.
Another relevant market predicting in what years CoT monitoring will not work: https://manifold.markets/Jasonb/in-what-years-will-cot-monitoring-f?r=SmFzb25i
...The question of value drift is especially strange given that we have a "meta-intuition" that moral/social values evolving and changing is good in human history. BUT, at the same time, we know from historical precedent that we ourselves will not approve of the value changes. One might attempt to square the circle here by arguing that perhaps if we were, hypothetically, able to see and evaluate future changed values, that we would in reflective equilibrium accept these new values. Sadly, from what I can gather this is just not borne out by the social scienc
Why is there so little Rat brainpower devoted to the pragmatics of how AI safety could be advanced within the global and national political contexts?*
As someone who was there, I think the portrayal of the 2020-2022 era efforts to influence policy is strawmanned, but I agree that it was the first serious attempt to engage politically by the community - and was an effort which preceded SBF in lots of different ways - so it's tragic (and infuriating) that SBF poisoned the well by backing it and having it collapse. And most of the reason there was relatively little done by the existential risk community on pragmatic political action in 2022-2024 was directly because of that collapse!
Great to see more work on (better) influence functions!
Lots of interesting things to discuss here[1], but one thing I would like to highlight is that classical IFs indeed arise when you do the usual implicit function theorem + global minimum assumption (which is obviously violated in the context of DL), but they also arise as the limit of unrolling. What follows will be more of a theoretical nature, summarizing statements in Mlodozeniec et al.
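For reference, the classical influence function in question (my paraphrase of the standard Koh & Liang-style result under the implicit-function-theorem and unique-minimum assumptions, not a formula taken from the comment) approximates the effect of upweighting a training point $z$ on the loss at a test point $z_{\text{test}}$:

$$\mathcal{I}(z, z_{\text{test}}) = -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top} H_{\hat\theta}^{-1}\, \nabla_\theta L(z, \hat\theta), \qquad H_{\hat\theta} = \tfrac{1}{n}\sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat\theta).$$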
Influence functions suffer from another shortcoming, since they only use final weights (as you are a...
Remaining in this frame of "we make our case for [X course of action] so persuasively that the world just follows our advice" does not make for a compelling political theory on any level of analysis.
...but it's not fake, it's just confused according to your expectations about the future - and yes, some people may say it dishonestly, but we should still be careful not to deny that people can think things you disagree with, just because they conflict with your map of the territory.
That said, I don't see as much value in dichotomizing the groups as others seem to.
As I said below, I think people are ignoring many different approaches compatible with the statement, and so they are confusing the statement with a call for international laws or enforcement (as you said, "attempts to make it as a basis for laws"), which is not mentioned. I suggested some alternatives in that comment:
"We didn't need laws to get the 1975 Alisomar moratorium on recombinant DNA research, or the email anti-abuse (SPF/DKIM/DMARC) voluntary technical standards, or the COSPAR guidelines that were embraced globally for planetary protection in spa...
Last month, doomers predicted that the Rapture would happen. The doomers were wrong, as they have been all the other dozens of notable times they predicted this.
Doomers predicted that the Y2K bug would cause massive death and destruction. They were wrong.
Moving to your social examples, doomers predicted that the legalisation of gay marriage would lead to the legalisation of bestiality. They were wrong.
You even provide an example yourself where people claim that D&D leads to satanism. This didn't happen! Having an orc hero is not sata...
I agree that some people have this preference ordering, but I don't know of any difference in specific actionable recommendations that would be given by "don't until safely" and "don't ever" camps.
Cooperation between humans and AIs rather than an attempt to control AIs. I think the race is going to happen regardless of who drops out of it. If those who are in the lead eventually land on mutual alignment, then we stand a chance. We're not going to outsmart the AIs, nor will we stay in control of them, nor should we.