epistemic status: Going out on a limb and claiming to have solved an open problem in decision theory[1] by making some strange moves. Trying to leverage Cunningham's law. Hastily written.
p(the following is a solution to Pascal's mugging in the relevant sense)≈25%[2].
Okay, setting (also here in more detail): You have a Solomonoff inductor with some universal semimeasure as a prior. The issue is that the utility of programs can grow faster than your universal semimeasure can penalize them, e.g. a complexity prior has busy-beaver-like programs that produce ...
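Spelling that out (my notation, just restating the standard worry): with program weights $2^{-|p|}$, an expected utility looks like

$$\mathbb{E}[U] \;=\; \sum_{p} 2^{-|p|}\, U(p),$$

and if the prior's support contains busy-beaver-like programs with $U(p)$ on the order of $\mathrm{BB}(|p|)$, then the terms $2^{-|p|}\,\mathrm{BB}(|p|)$ are unbounded and the sum diverges; no fixed complexity penalty can dominate that growth.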
Does that sound right?
Can't give a confident yes because I'm pretty confused about this topic, and I'm pretty unhappy currently with the way the leverage prior mixes up action and epistemics. The issue about discounting theories of physics if they imply high leverage seems really bad? I don't understand whether the UDASSA thing fixes this. But yes.
That avoids the "how do we encode numbers" question that naturally raises itself.
I'm not sure how natural the encoding question is, there's probably an AIT answer to this kind of question that I don't know.
Over a decade ago I read this 17-year-old passage from Eliezer:
...When Marcello Herreshoff had known me for long enough, I asked him if he knew of anyone who struck him as substantially more natively intelligent than myself. Marcello thought for a moment and said "John Conway—I met him at a summer math camp." Darn, I thought, he thought of someone, and worse, it's some ultra-famous old guy I can't grab. I inquired how Marcello had arrived at the judgment. Marcello said, "He just struck me as having a tremendous amount of mental horsepower...
I want to make a thing that talks about why people shouldn't work at Anthropic on capabilities, and about all the evidence pointing in the direction of them being a bad actor in the space, bound by employees whom they have to deceive.
A very early version of what it might look like: https://anthropic.ml
Help needed! Email me (or DM on Signal) ms@contact.ms (@misha.09)
I recall a video circulating that showed Dario had changed his position on racing with China, which feels perhaps relevant. People can of course change their minds, but I still dislike it.
There has been a rash of highly upvoted quick takes recently that don't meet our frontpage guidelines. They are often timely, perhaps because they're political, pitching something to the reader, or inside baseball. These are all fine or even good things to write on LessWrong! But I (and the rest of the moderation team I talked to) still want to keep the content on the frontpage of LessWrong timeless.
Unlike posts, we don't go through each quick take and manually assign it to be frontpage or personal (and posts are treated as personal until they're actively f...
I observe that https://www.lesswrong.com/posts/BqwXYFtpetFxqkxip/mikhail-samin-s-shortform?commentId=dtmeRXPYkqfDGpaBj isn't frontpage-y but remains on the homepage even after many mods have seen it. This suggests that the mods were just patching the hack. (But I don't know what other shortforms they've hidden, besides the political ones, if any.)
Some of Eliezer's founder effects on the AI alignment/x-safety field that seem detrimental and persist to this day:
There’s a deeper problem: how do we know there is a feedback loop?
I’ve never actually seen a worked-out proof of, well, any complex claim on this site using standard logical notation… (beyond pure math and trivial tautologies)
At most there’s a feedback loop on each other’s hand-wavy arguments that are claimed to be proof of this or that. But nobody ever actually delivers the goods, so to speak, such that they can be verified.
People look into universal moral frameworks like utilitarianism and EA because they lack the self-confidence to take a subjective personal point of view. They need to support themselves with an "objective" system to feel confident that they are doing the correct thing. They look for external validation.
I didn't upvote or react in any way because I don't understand how gender inequality is related to those issues, unless you mean things such as "if more women were in government it would surely be better for all of us", which I somewhat agree with, but I also don't think this sentence can be true in the same way GiveWell cost-effectiveness estimates can be.
Wei Dai thinks that automating philosophy is among the hardest problems in AI safety.[1] If he's right, we might face a period where we have superhuman scientific and technological progress without comparable philosophical progress. This could be dangerous: imagine humanity with the science and technology of 1960 but the philosophy of 1460!
I think the likelihood of philosophy ‘keeping pace’ with science/technology depends on two factors:
I'm curious what you'd say about "which are the specific problems (if any) where you specifically think 'we really need to have solved philosophy / improved-a-lot-at-metaphilosophy' to have a decent shot at solving this?"
Assuming by "solving this" you mean solving AI x-safety or navigating the AI transition well, I just post a draft about this. Or if you already read that and are asking for an even more concrete example, a scenario I often think about is an otherwise aligned ASI, some time into the AI transition when things are moving very fast (from a h...
Specifically, this is the privacy policy inherited from when LessWrong was a MIRI project; to the best of my knowledge, it hasn't been updated.
Mainstream belief: Rational AI agents (situationally aware, optimizing decisions, etc.) are superior problem solvers, especially if they can logically motivate their reasoning.
Alternative possibility: Intuition, abstraction, and polymathic guessing will outperform rational agents in achieving competitive problem-solving outcomes. Holistic reasoning at scale will force-solve problems intractable for much more formal agents, or at least outcompete them in speed/complexity.
2)
Mainstream belief: Non-sentient machines will eventually r... 
In Improving the Welfare of AIs: A Nearcasted Proposal (from 2023), I proposed talking to AIs through their internals via things like ‘think about baseball to indicate YES and soccer to indicate NO’. Based on the recent paper from Anthropic on introspection, it seems like this level of cognitive control might now be possible:
Communicating to AIs via their internals could be useful for talking about welfare/deals because the internals weren't ever trained against, potentially bypassing strong heuristics learned from training and also making it easier to con...
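As a purely hypothetical sketch of what the decoding side of 'baseball for YES, soccer for NO' could look like (synthetic activations standing in for real model internals; this is not the method from the proposal or the Anthropic paper):

```python
# Hypothetical sketch: fit a linear probe on hidden activations from labeled
# prompts ("think about baseball" vs "think about soccer"), then read YES/NO off
# new activations. The activations here are synthetic stand-ins; a real version
# would pull residual-stream activations from the model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64
baseball_dir, soccer_dir = rng.normal(size=d), rng.normal(size=d)

def fake_activations(direction, n=200, noise=1.0):
    # Stand-in for "hidden states collected while the model thinks about <concept>".
    return direction + noise * rng.normal(size=(n, d))

X = np.vstack([fake_activations(baseball_dir), fake_activations(soccer_dir)])
y = np.array([1] * 200 + [0] * 200)  # 1 = "baseball" = YES, 0 = "soccer" = NO

probe = LogisticRegression(max_iter=1000).fit(X, y)

# "Ask" a question: the model is instructed to think about baseball iff the answer is YES.
answer_activation = fake_activations(baseball_dir, n=1)
print("decoded answer:", "YES" if probe.predict(answer_activation)[0] == 1 else "NO")
```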
Has anyone done any experiments into whether a model can interfere with the training of a probe (like that bit in the most recent Yudtale) by manipulating its internals?
Anthropic wrote a pilot risk report where they argued that Opus 4 and Opus 4.1 present very low sabotage risk. METR independently reviewed their report and we agreed with their conclusion. 
During this process, METR got more access than during any other evaluation we've historically done, and we were able to review Anthropic's arguments and evidence presented in a lot of detail. I think this is a very exciting milestone in third-party evaluations! 
I also think that the risk report itself is the most rigorous document of its kind. AGI companies wil... 
I made a linkpost for the report here: https://www.alignmentforum.org/posts/omRf5fNyQdvRuMDqQ/anthropic-s-pilot-sabotage-risk-report-2
I'm increasingly worried that philosophers tend to underestimate the difficulty of philosophy. I've previously criticized Eliezer for this, but it seems to be a more general phenomenon.
Observations:
The Problem of the Criterion, which is pretty much the same as the Münchhausen Trilemma.
"Moreover, its [philosophy's] central tool is intuition, and this displays a near-total ignorance of how brains work. As Michael Vassar observes, philosophers are "spectacularly bad" at understanding that their intuitions are generated by cognitive algorithms." -- Rob Bensinger, Philosophy, a diseased discipline.
What's the problem?
It's not that philosophers weirdly and unreasonably prefer intuition to empirical facts and mathematical/logical reasoning, it is t...
Suicide occupies a strange place in agent theory. It is the one goal whose attainment is not only impossible for the agent to observe, but hinges on the very impossibility of that observation.
In some cases, this is resolved by a transfer of agency to the thing for whom the agent is in fact a sub-agent and is itself experiencing selective pressure, e.g. in the case of the beehive observing the altruistic suicide of an individual bee defending it. This behaviour disappears once the sub-agent experiences selective pressures that are independent fr...
The fact in question is not just unobserved, but unobservable because its attainment hinges on losing one's ability to make the observation.
Whenever I read yet another paper or discussion of activation steering to modify model behavior, my instinctive reaction is to slightly cringe at the naiveté of the idea. Training a model to do some task only to then manually tweak some of the activations or weights using a heuristic-guided process seems quite un-bitter-lesson-pilled. Why not just directly train for the final behavior you want—find better data, tweak the reward function, etc.?
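For concreteness, the kind of 'manual tweak' I mean looks roughly like the following; this is a toy sketch with a random vector standing in for a contrastive steering direction, not any particular paper's setup:

```python
# Minimal sketch of activation steering via a forward hook (PyTorch).
# In real work the steering vector is typically the difference of mean activations
# between two sets of prompts (e.g. "honest" vs "dishonest"); here it's random,
# and a tiny MLP stands in for a transformer layer.
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 16
model = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

steering_vector = torch.randn(d_model)  # placeholder for a learned concept direction
alpha = 4.0  # steering strength

def steer(module, inputs, output):
    # Add the steering direction to this layer's activations at inference time.
    return output + alpha * steering_vector

handle = model[0].register_forward_hook(steer)

x = torch.randn(1, d_model)
steered = model(x)
handle.remove()
unsteered = model(x)
print("change in output norm:", (steered - unsteered).norm().item())
```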
But actually there may be a good reason to continue working on model-internals control (i.e. ways of influencing mo...
Well, mainly I'm saying that "Why not just directly train for the final behavior you want" is answered by the classic reasons why you don't always get what you trained for. (The mesaoptimizer need not have the same goals as the optimizer; the AI agent need not have the same goals as the reward function, nor the same goals as the human tweaking the reward function.) Your comment makes more sense to me if interpreted as about capabilities rather than about those other things.
To advance understanding in learning theory, I feel it is important to establish some form of "hierarchy" of the key factors in deep learning methodology, in order from most critical for good performance to least critical. I believe this hierarchy might help to identify the strengths of one theory over another.
My proposed order of importance is as follows:
Inspired by the Founder’s Pledge and the 10% Pledge, we could offer people transitioning to an AI safety career the opportunity to make an AI Safety Pledge. It could look something like this:
Note: this is a very early idea, not a fully fledged proposa...
If I want to make lesswrongers really mad, I should write an article about how an arms race over human genetic engineering (US v China, parent v parent) would in the limit eliminate everything that makes us human, just as an arms race between digital minds would.
From Meditations on Moloch by Scott Alexander, quoting Zack Davis:
I am a contract-drafting em,
The loyalest of lawyers!
I draw up terms for deals ‘twixt firms
To service my employers!
But in between these lines I write
Of the accounts receivable,
I’m stricken by an uncanny fright;
...The worl
I'd be really interested in someone trying to answer the question: what updates on the a priori arguments about AI goal structures should we make as a result of empirical evidence that we've seen? I'd love to see a thoughtful and comprehensive discussion of this topic from someone who is both familiar with the conceptual arguments about scheming and also relevant AI safety literature (and maybe AI literature more broadly).
Maybe a good structure would be, from the a priori arguments, identifying core uncertainties like "How strong is the imitative prior?" A...
I don't know about 2020 exactly, but I think since 2015 (being conservative), we do have reason to make quite a major update, and that update is basically that "AGI" is much less likely to be insanely good at generalization than we thought in 2015.
Evidence is basically this: I don't think "the scaling hypothesis" was obvious at all in 2015, and maybe not even in 2020. If it was, OpenAI could not have caught everyone with their pants down by investing early in scaling. But if people mostly weren't expecting massive data scale-ups to be the road to AGI, what...
I present four methods to estimate the Elo rating for optimal play: (1) comparing optimal play to random play, (2) comparing optimal play to sensible play, (3) extrapolating Elo rating vs draw rates, (4) extrapolating Elo rating vs search depth.
Random plays completely random legal moves. Optimal plays perfectly. Let ΔR denote the Elo gap between Random and Optimal. Random's expected score is given by E_Random = P(Random wins) + 0.5 × P(Random draws). This is related to Elo gap via the formula E_Ran...
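For reference, the standard Elo expected-score relation (the textbook formula, filled in here by me rather than quoted from the post) is

$$E_{\text{Random}} \;=\; \frac{1}{1 + 10^{\Delta R/400}} \quad\Longleftrightarrow\quad \Delta R \;=\; 400\,\log_{10}\!\left(\frac{1 - E_{\text{Random}}}{E_{\text{Random}}}\right),$$

so, for example, an expected score of $10^{-6}$ for Random would correspond to a gap of roughly $400 \times 6 = 2400$ Elo points.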
I'll note that CDT and FDT prescribe identical actions against Stockfish, which is the frame of mind I had when writing.
More to your point - I'm not sure that I am describing CDT:
"always choose the move that maximises your expected value (that is, p(win) + 0.5 * p(draw)), taking into account your opponent's behaviour" sounds like a decision rule that necessitates a logical decision theory, rather than excluding it?
Your point about pathological robustness is valid but I'm not sure how much this matters in the setting of chess.
Lastly, if we're using the form...