Is this really that different from previous papers where a model provided a key step of the argument?
Rationalists often say "insane" to talk about normie behaviors they don't like, and "sane" to talk about behaviors they like better. This seems unnecessarily confusing and mean to me.
This clearly is very different from how most people use these words. Like, "guy who believes in God" is very different from "resident of a psych ward." It can even cause legitimate confusion when you want to switch back to the traditional definition of "insane". This doesn't seem very rational to me!
Also, the otherizing/dismissiv...
the probability of a global pandemic starting in China has increased incalculably
I think in order to make an intellectually honest critique you actually need to calculate it. I mean it is all about numbers now: if the prior probability of a pandemic occurring around 2019 in Wuhan is sufficiently high then I am wrong.
any 'successful' pandemic is, simply by existing, evidence of a laboratory leak.
Well, it is though. If I tell you that a pandemic happened but that its spread was slow and it only affected a small portion of the world, versus if I tell y...
OpenAI claims GPT-5.2 solved an open COLT problem with no assistance: https://openai.com/index/gpt-5-2-for-science-and-math/
This might be the first thing that meets my bar of autonomously having an original insight??
I find it bizarre and surprising, no matter how often it happens, when someone thinks my helping them pressure-test their ideas and beliefs for consistency is anything except a deep engagement and joy. If I didn't want to connect and understand them, I wouldn't bother actually engaging with the idea.
I feel like I could have written this (and the rest of your comment)! It's confusing and deflating when deep engagement and joy aren't recognized as such.
...It's happened often enough that I often need to modulate my enthusiasm, as it does cause suffe
Chaitin was quite young when he (co-)invented AIT.
It's just saying that
I'm not surprised by this, my sense is that it's usually young people and outsiders who pioneer new fields. Older people are just so much more shaped by existing paradigms, and also have so much more to lose, that it outweighs the benefits of their expertise and resources.
All of the fields that come to my mind (cryptography, theory of computation, algorithmic information theory, decision theory, game theory) were founded by much more established researchers. (But on reflection these all differ from AI safety by being fairly narrow and technical/mathematica...
All of this for the funding of 2-3 OpenAI employee salaries, wow
[tag: fiction]
A novice needed a map of the northern mountain passes. He approached the temple cartographer.
"Draw me the northern passes," he said, "showing the paths, the fords, and the shelters."
The cartographer studied many sources and produced a map. The novice examined it carefully: the mountains were drawn, the paths clearly traced, fords and shelters marked in their proper notation. The distances seemed reasonable. The penmanship was excellent.
"This is good work," said the novice, and he led a merchant caravan into the mountains.
On the third night, t...
This is clearly one of the most important posts of 2024, so I'm giving it 9 points.
The only negative (oth...
Yeah, I think that part of the training my PhD program offers is learning how to handle and deliver criticism. Besides the common advice of choosing your language carefully, being praiseful, and limiting questions/critiques to zero or one most important area in most settings, my favorite tactic when giving my own presentations is to frontload limitations. Just to be clear (this is frontloading!), what follows is speculation based on my own personal experiences.
When I start presenting my research, I introduce the audience first to the biology (3D genome arc...
I greatly appreciate people saying the circumstances under which they are and are not truth seeking or truthful. I think Dragon Agnosticism is actually pretty widespread, and instrumentally rational in many societies.
This essay lays out in a concise way, without talking about a specific incendiary topic, and from a position of trust (I and likely many others do trust Jeff a lot) why someone would sometimes not go for maximum epistemic rationality. I haven't yet referenced this post in a conversation, but mostly because I haven't happened to wind up i...
Consider the set of concepts, aka subsets of the Thingspace $T$. A concept $A$ is a specification of another concept $B$ if $A \subseteq B$. This allows one to partially compare concepts by specificity: whether $A$ is more specific than $B$, less specific, equal, or incomparable.
In addition, for any two concepts $B$ and $C$ we find that $B \cap C$ is a subset of both $B$ and $C$. Therefore, it is a specification of both. Similarly, any concept $D$ which is a specification of both $B$ and $C$ is also a specification of $B \cap C$.
Additionally, B and C are ...
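Spelling the order out a bit more formally (a minimal sketch; the symbol $\preceq$ is my own shorthand, not part of the original post):

```latex
% Specification as a partial order on concepts (subsets of the Thingspace T):
%   A is a specification of B  iff  A is a subset of B.
\[
  A \preceq B \;:\Longleftrightarrow\; A \subseteq B .
\]
% For any concepts B and C, their intersection is the greatest lower bound (infimum):
\[
  B \cap C \preceq B, \qquad B \cap C \preceq C, \qquad
  \bigl(D \preceq B \;\wedge\; D \preceq C\bigr) \;\Rightarrow\; D \preceq B \cap C .
\]
```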
This is a great meetup format and y'all can fight me.
I want more entries in Group Rationality, and this is a fine way for a group to be smarter than an individual. They can read faster, and the summation and presentation process might even help retention.
I also want more meetup descriptions. Jenn runs excellent meetups, many of which require a braver organizer than I. This is one of the ones I feel I can grab without risking sparking a fight, and it's well laid out with plenty of examples. I've run a Partitioned Book Club myself, and my main quibble is it ...
[cross-posted from EAF]
Thanks for writing this!!
This risk seems equal to or greater than AI takeover risk to me. Historically the EA & AIS communities focused more on misalignment, but I'm not sure if that choice has held up.
Come 2027, I'd love for it to be the case that an order of magnitude more people are usefully working on this risk. I think it will be rough going for the first 50 people in this area; I expect there's a bunch more clarificatory and scoping work to do; this is virgin territory. We need some pioneers.
People with plans in this area shou...
Rationalists love our prediction markets. They have good features. They aren't perfect. I like Zvi's "Prediction Markets: When Do They Work" more, since it gives a better overview, but for some strange reason the UI won't let me vote for that this year. As prediction markets gain in prominence (yay!) we should keep our eyes on where they fall short and whether there's anything that can be done to fix them.
I keep this in my back pocket in case anyone tries to argue that a thing's high odds on Manifold is definitive. It's a bit niche. It's probably not super im...
I just unironically love this?
First off, the Effective Samaritan idea is fun. It's a little bit of playful ribbing, but it's also not entirely wrong. The broader point is a good mental exercise, trying to talk to imaginary people who believe different things than you for the same reasons you believe what you believe.
The entire Starcraft section makes me smile. This is perfect Write A Thousand Roads To Rome territory. Some reader is going to be a Starcraft fan, run across this essay, and suddenly be enlightened at how the outside view actually w...
I am very much of two opposed minds here.
The case against: This is inside baseball. This is the insidest of inside baseball: it's about a LessWrong commenter talking about LessWrong on the talk pages of Wikipedia, written by someone who cut his teeth writing in the LessWrong diaspora online.
Also, it's kind of a bad look for LessWrong to prominently feature an argument that a major detractor of LessWrong is up to shady epistemic nonsense. Like, obviously people on this forum are upvoting this, we're mostly here because we like LessWrong and...
High level
This is less of an explainer of how individual ideas work, and more of an index outlining how various named ideas would fit together. It's about how groups of people can function better, and how a certain kind of common knowledge grows.
This could be a big entry in the Group Rationality subject, and even where I disagree with it, it's productive disagreement that helps clarify to myself what I think. That's useful. And it does reference sub-ideas that the author's written before, which is the way to do this kind of high level thing I think.
Bits an...
I really like the ladder metaphor and think it can generalize out to many contexts where something has to go from point A to point B.
Examples.
1. Development economics will sometimes look at what the Victorians did on their path to modern wealth for advice on what developing countries should do. A lot of this advice doesn't quite work, as the developing world has to contend with all the ways the existence of the very wealthy West changes the landscape (cheap imports, brain drain, etc).
2. When I ask my parents for life advice, they'll mention the importance of getting into a cheap mortgage on a single family house ...
There are also galaxy-brained arguments that power concentration is fine/good (because it’s the only way to stop AI takeover, or because any dictator will do moral reflection and end up pursuing the good regardless).
I think the most salient argument for this (which is brought up in the full article) is that monopolization of power solves the proliferation problem. If the first ASI actors perform a pivotal act to preemptively disempower unapproved dual-use AIs, we don’t need to worry much about new WMDs or existing ones falling in price.
This pro...
The way they use the word "aligned" in that paper is very weird to me :P (they basically use it as a synonym for "instruction-following" or "post-trained").
But I feel like this method could actually be adapted for AI alignment/safety. It's kind of similar to my "incremental steering" idea, but instead of a strong untrusted model guiding a weak trusted model, there's a weak post-trained model guiding a strong base model. This also looks more practical than incremental steering, because it alternates between the weak model and the strong model, rather than g...
The post I was working on is out, "Filler tokens don’t allow sequential reasoning". I think LLMs are fundamentally incapable of using filler tokens for sequential reasoning (do x and then based on that do y). I also think LLMs are unlikely to stumble upon the algorithm the paper's authors came up with through RL.
Well, if some reasoning is inarticulable in human language, the circuits implementing that reasoning would probably be difficult to interpret regardless of what layer they appear in the model.
This is a good point, and I think I've been missing part o...
I've been on a years-long quest to shore up a decent morning & evening routine that's not tightly coupled with the routine of the rest of my life, i.e. one that can survive me getting sick or going on-call. This post hit me at a time when I was trying and failing to maintain a lot of simple but high-value routines (easy stuff like journaling ~every night, remembering to take out the trash before it became a problem, etc). On its advice I set up some relevant forfeits ... and found that it did not work for me at all, about the same level of motivation as any ...
https://www.nature.com/articles/s41598-025-09241-2
...Corneal safety assessment of germicidal far UV-C radiation
Abstract
Far UV-C radiation (200–240 nm) is a promising alternative to conventional UV-C for disinfection in occupied spaces, offering strong germicidal efficacy with reduced skin risk. However, its ocular safety remains unclear, as most studies relied only on non-human corneal models with physiological differences. This study investigated UV-induced DNA damage in the epithelium, stroma, and endothelium of ex vivo human corneas and porcine cornea
Yep, just start a paragraph with >! and then you will be in a spoiler tag
Thanks for the explanation of how thinking dot by dot works! That makes sense.
I actually think the dot-by-dot paper is evidence against LLMs being able to do most kinds of reasoning with filler tokens
Yeah, I'm just reading the abstract of the paper, but it looks like it required very "specific" training on their problem to get it to work. I guess this is why you think it's evidence against? But maybe with algorithmic improvements, or due to a CoT where the only goal is getting higher reward rather than looking good to the user, LLMs will naturally gain th...
That's a good take: treating trust as “some kind of structured uncertainty object over futures” is very close to what I was gesturing toward because a bare scalar clearly isn’t sufficient.
On reflection, I have to admit I was using “trust” a bit loosely in the post. What's become clear to me is that what I’m really trying to model isn’t trust in the common-usage sense (intentions, warmth, etc.), but something structural: roughly, how stable someone’s behavior is under visible strain, and who tends to bear the cost when things get hard. In my head it’s closer to a rel...
Yeah, the paper seems more like a materials science paper than a biology paper. There was no test/simulation/discussion of biological function; similar to DNA computing/data storage, it's more interested in the properties of the material than in how it interfaces with pre-existing biology.
They did optimize for foldability, and did successfully produce the folded protein in (standard bacterial) cells. So it can be produced by biological systems (at least briefly), and more complex proteins had lower yields.
The application they looked at was hydrogels, and it seems to have improved performance there? But functioning in biological systems introduces more constraints.
I think this sometimes has a good explanation in the 80/20 rule. Which itself is based on a pretty deep tendency for things to be complex. So changing things often has some low-hanging fruit (in the idealized case, the 80% you can get for 20% of the effort). It's easy to eliminate some clutter, harder to eliminate the rest.
This isn't always the case. In many of your examples, there's no obvious reason that cutting to zero would be harder. In some there are.
Here's another one: instead of cutting clutter to zero or time with that friend to zero, maybe you use that time/effort to go get some low-hanging fruit in other areas of life optimization?
Yes, dopamine certainly plays an important role in "will imposing" actions (and will-distracting actions if you've got too much dopamine activity).
I don't think there's a free lunch on downregulation as a result of upregulation. The brain has those negative feedback loops at many levels. (I'm told the rest of biology does too; it's a big part of how a messy genome makes something that works under a lot of conditions.)
What goes up must come down. Artificially changing how your brain works means it will adapt, and work more the opposite way when you're not ...
Thanks, this is good context. So they didn't even simulate if it would remain biologically functional? Seems to make it less impressive.
Cool result! Do you know why they used Llama 2 instead of Llama 3? The paper was released recently.
This is great!
What do you think of UVC lamps that you just kind of stick into a hole in your HVAC intake ducts? Some of them are really cheap, most seem to be 254nm, no idea if any are any good. Would be really convenient if it works well.
If you had a correct causal model of someone having a red experience and saying so, your model would include an actual red experience, and some reflective awareness of it, along with whatever other entities and causal relations are involved in producing the final act of speech.
The model would quite likely amount to a successful brain emulation which would have a conscious experience like a biological human does when run. Though you get into some conceptual hairiness with whether it's a case that the model includes the experience qualia, or the execution...
If you had a correct causal model of someone having a red experience and saying so, your model would include an actual red experience, and some reflective awareness of it, along with whatever other entities and causal relations are involved in producing the final act of speech. I expect that a sufficiently advanced neuroscience would eventually reveal the details. I find it more constructive to try to figure out what those details might be, than to ponder a hypothetical completed neuroscience that vindicates illusionism.
Can you suggest any non-troubling approaches (for children or for AIs)? What does "consent" even mean, for an unformed entity with no frameworks yet learned?
It's not that the AI is radically altered from a preferred state to a dispreferred state. It's that the AI is created in a state. There was nothing before it which could give consent.
Agreed that there's a lot of suffering involved in this sort of interaction. Not sure how to fix it in general - I've been working on it in myself for decades, and still forget often. Please take the following as a personal anecdote, not as general advice.
The difficulty (for me) is that "hoping to connect" and understanding the person in addition to the idea are very poorly defined, and are very often at least somewhat asymmetrical, and trying to make them explicit is awkward and generally doesn't work.
I find it bizarre and surprising, no mat...
Why did the task fall to a bunch of kids/students?
I'm not surprised by this, my sense is that it's usually young people and outsiders who pioneer new fields. Older people are just so much more shaped by existing paradigms, and also have so much more to lose, that it outweighs the benefits of their expertise and resources.
Also 1993 to 2000 doesn't seem like that large a gap to me. Though I guess the thing I'm pointing at could also be summarized as "why hasn't someone created a new paradigm of AI safety in the last decade?" And one answer is that Paul and C...
Dopamine might be what regulates top-down, "will-imposing" action.
Stimulants are great for increasing attention, motivation and mood. However, they also cause downregulation of dopamine receptors, thus potentially causing dependence and the opposite of the benefits when not taking them.
Some lesser-known ways to upregulate the dopaminergic system without (or with less of) this effect:
Thanks for the post, and especially the causal graph framework you use to describe and analyze the categories of motivations. It feels similar to Richard Dawkins's 'The Selfish Gene' in so far as it studies the fitness of motivations in their selection environment.
One area I think it could extend to is around the concepts of curiosity, exploration, and even corrigibility. This relates to 'reflection' as mentioned by habryka. I expect that as AI moves towards AGI it will improve its ability to recognize holes in its knowledge and take actions to close th...
Hi Alex - great post - thanks so much!
I'm intrigued to hear your thoughts on the list of different 'priors'. I actually tried to explain some of these ideas in a lecture earlier this year, largely drawing from Evan's presentation in 'How likely is deceptive alignment?'; the notion of 'prior' here is clearly important, but I found the topic awkward to talk about since I had near-zero intuition for which arguments were more/less relevant to the current LLM (or near-future AI) paradigm.
Your section mostly refers to Joe Carlsmith's 'Will AIs fake alignment...' paper fr...
Is there a quantification of the effect of this on skin microbiome as of now? I would not like to kill all of the bacteria on my skin.
Unfortunately I don't have a super fleshed out perspective on this, and I stopped reading after looking at the causal graph so I could be totally missing something important (although I did skim the rest and search for the text "influen" to find things related to this thought). I know that's not the best way to engage, feel free to downvote, take this with a grain of salt, and sorry in general for a lower effort comment.
The step in the graph from "I get higher reward on this training episode" -> "I have influence through deployment" doesn't really sit r...
To my mind, what this post did was clarify a kind of subtle, implicit blind spot in a lot of AI risk thinking. I think this was inextricably linked to the writing itself leaning into a form of beauty that doesn't tend to crop up much around these parts. And though the piece draws a lot of it back to Yudkowsky, I think the absence of green is much wider than him, and in many ways he's not the worst offender.
It's hard to accurately compress the insights: the piece itself draws a lot on soft metaphor and on explaining what green is not. But personally it made me ...
I'm not sure what you mean, either in-universe or in the real world.
In-universe, the Culture isn't all powerful. Periodically they have to fight a real war, and there are other civilizations and higher powers. There are also any number of ways and places where Culture citizens can go in order to experience danger and/or primitivism. Are you just saying that you wouldn't want to live out your life entirely within Culture habitats?
In the real world... I am curious what preference for the fate of human civilization you're expressing here. In one o...
The possibility of mastering some niche thing that has some genuine demand so that me doing it will be the best way to satisfy the demand. Even better if it involves coming up with something that hasn't ever been done before.
One year later, I am pretty happy with this post, and I still refer to it fairly often, both for the overall frame and for the specifics about how AI might be relevant.
I think it was a proper attempt at macrostrategy, in the sense of trying to give a highly compressed but still useful way to think about the entire arc of reality. And I've been glad to see more work in that area since this post was published.
I am of course pretty biased here, but I'd be excited to see folks consider this.
In response to the react from StanislavKrym, I will say first that I am relieved to hear that my statement is not correct. I would be interested to know in what sense StanislavKrym disagrees with it; is the drawing of a distinction between training for intelligence and fine-tuning or other processes by which we imbue AIs with our values inaccurate, because reality is less clearly delineated? Or is it that you don't believe that AIs are intelligent in the first place? Or something else?
I think this post is on the frontier for some mix of:
Obviously one can quibble with the plan and its assumptions but I found this piece very helpful in rounding out my picture of AI strategy - for example, in thinking about how to decipher things that have been filtered through PR and consensus filters, or in situating work that ...
Nice set of human stories.
Surveys seem very important. Unclear if this post should be where my favour goes but still.
Maybe I failed to write something that reasonable people could parse.
More reasons to worry about relying on constraints:
Exceptional work, well-founded and everything laid out clearly with crosslinks. Thanks a lot Steven!
I'm not arguing either way. I just note this specific aspect that seems relevant. The question is: is the baby's body more susceptible to alcohol than an adult's body? For example, does the liver work better or worse in a baby than in an adult? Are there developmental processes that can be disturbed by the presence of alcohol? By default I'd assume that the effect is proportional (except maybe the baby "lives faster" in some sense, so the effect may be proportional to metabolism or growth speed or something). But all of that is speculation.
I think the pro-pregnancy-alcohol argument was that the small amount of alcohol won't hurt the fetus, not that it won't reach it.
I think standard advice like the compliment sandwich (the formula "I liked X, I'm not sure about Y, but Z was really well done") is meant to counteract this a bit. You can also do stuff like "I really enjoyed this aspect of the idea/resonated with [...], but can I get more information about [...]".
(I know that footnote 3 is broken, I couldn't fix it on my phone. Will address it when I have a moment on a proper computer.)
In simple chat conversations, where I want it to generate a JavaScript line of code, it gets stupid. But in other chats, where I raise more difficult topics and thus explain more than ask, it seems to be quite smart.
gpt-5-1 is getting worse: answering the previous question again, ignoring the newer question, even without entering thinking mode.
And overall more stupid today.
RL capability gains might mostly come from better self-elicitation.
Ran across a paper NUDGING: Inference-time Alignment of LLMs via Guided Decoding. The authors took a base model and a post-trained model. They had the base model try to answer benchmark questions, found the positions where the base model was least certain, and replaced specifically those tokens with tokens from the post-trained model. The base model, so steered, performed surprisingly well on benchmarks. Surprisingly (to me at least), the tokens changed tended to be transitional phrases rat...
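As a toy sketch of what I understand the procedure to be (my own reconstruction, not the authors' code, with a stand-in scoring function in place of real models): the base model proposes each next token, and whenever its confidence drops below a threshold, the token is taken from the post-trained model instead.

```python
import random

# Toy NUDGING-style guided decoding: a base "model" proposes tokens; at low-confidence
# positions the token is swapped for the post-trained "model"'s choice. Both models are
# faked with seeded pseudo-random distributions so the sketch runs standalone.

VOCAB = ["the", "answer", "is", "therefore", "42", "so", "."]

def toy_next_token_probs(model_name: str, prefix: list) -> dict:
    """Stand-in for a real LM: a deterministic pseudo-random distribution over VOCAB."""
    rng = random.Random(f"{model_name}|{'|'.join(prefix)}")
    weights = [rng.random() for _ in VOCAB]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(VOCAB, weights)}

def nudged_decode(prompt_tokens, max_new_tokens=8, confidence_threshold=0.35):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        base_probs = toy_next_token_probs("base", tokens)
        base_tok, base_p = max(base_probs.items(), key=lambda kv: kv[1])
        if base_p >= confidence_threshold:
            tokens.append(base_tok)      # base model is confident: keep its token
        else:
            guide_probs = toy_next_token_probs("post_trained", tokens)
            tokens.append(max(guide_probs, key=guide_probs.get))  # nudge from the guide
    return tokens

print(nudged_decode(["the", "answer"]))
```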
Somewhat worrying the extent to which reading Planecrash really does help with understanding this.
Does LW have spoiler tags?
I take Perplexity to be about 80% accurate, but this suggests the above isn't accurate. He had signed off on multiple bombs and didn't stop the second between the 6th and 9th when he could have.
Seems like lots of people found this valuable, so bayes points I guess.
Make sure the people on the board of OpenAI were not catastrophically naive about corporate politics and public relations? Or, make sure they understand their naïveté well enough to get professional advice beforehand? I just reread the press release and can still barely believe how badly they effed it up.
It's not really that they made it have more hydrogen bonds; they made it longer, and therefore it has more hydrogen bonds. It's like having a piece of velcro, and then using a bigger piece of velcro. Yes, the bigger velcro will be stronger. The AI was mostly used to design the scaffolding of the velcro.
They didn't test whether it's biologically functional (though that's not really important if you only care about biomechanics).
Though the titin protein is already absurdly huge, and I'm pretty sure that the time it takes to translate it is longer than the division time of a(n average) cell. And these scientists made it even bigger (or rather, made one domain of it even bigger and didn't make the rest of the protein).
Also, he has a noticeable lack of confusion about key NPCs–Oak is consistently Oak, the player character is never “the red-hatted NPC”, and he can pick out gym leader Erika from a lineup.
I'm a little lost on this front. A person who has never encountered Pokemon before would not recognize the Oak or Erika sprite on sight; why should the AI vision model? Perhaps one could match the Oak sprite to the full size Oak picture at the beginning of the game, but Erika? Erika can really only be identified by sprite uniqueness and placement in the top center of the gym.
I would instead think the newer models are just trained on more Pokemon, and hence can better identify Pokemon images.
Superstable proteins: A team from Nanjing University just created a protein that's 5x stronger against unfolding than normal proteins and can withstand temperatures of 150 °C. The upshot from some analysis on X seems to be:
So why is this relevant? It's basically the first step to...
Fragility of Value thesis and Orthogonality thesis both hold, for this type of agent.
...
E.g. its vision for a future utopia would actually be quite bad from our perspective because there's some important value it lacks (such as diversity, or consent, or whatever)
I think we have enough evidence to say that, in practice, this turns out to be very easy or moot. Values tend to cluster in LLMs (good with good and bad with bad; see the emergent misalignment results), so value fragility isn't a hard problem.
"we could imagine that "sexiness" starts by eating an Admirer"
In this sequence: vore.
Do LLMs have intelligence (mind), or are they only rational agents? To understand this question, I think it is important to delineate the subtle difference between intelligence and rationality.
In current practice of building artificial intelligence, the most common approach is the standard model, which refers to building rationally acting agents—those that strive to accomplish some objective put into them (see “Artificial Intelligence: A Modern Approach” by Russell and Norvig). These agents, built according to the standard model, use an external standard f...
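A toy illustration of that standard model (my own minimal sketch, not from the textbook): the objective is supplied from outside, and the agent just picks whatever scores best on it.

```python
# Standard-model agent: maximize an externally supplied objective.

def standard_model_agent(actions, objective):
    """Return the action that scores highest on the objective put into the agent."""
    return max(actions, key=objective)

# Example: the designer hands the agent the objective "maximize expected reward".
expected_reward = {"explore": 1.0, "exploit": 3.0, "idle": 0.0}
print(standard_model_agent(expected_reward, objective=expected_reward.get))  # exploit
```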
That’s why I defined properly here as to what I really mean by guaranteed in life
I'm not saying it's not possible for us to be confident in some specific proposition.
I am saying that the state of having different qualia while preserving the same functional state is impossible in principle, and that, per Chalmers's argument, it is impossible in the specific case of the human brain.
I doubt we could ever keep the armchair philosophers happy. But for the consciousness engineers
Without philosophers, or without someone who isn't a philosopher but does correct philosophy, you can't arrive at the correct ontology of consciousness.
Keep in m...
Another reason to doubt the "noble lie" theory is that it was easy to make masks out of commonly available materials like cloth.
The real reason that requests to save PPE for health care workers were made was that these workers needed N95 respirators to protect themselves during procedures (like intubation) that were already known to aerosolize (and thus make airborne) any respiratory virus or bacteria, and surgical masks were needed for protection during medical procedures unrelated to Covid.
LLMs are of course also superhuman at knowing lots of facts, but that's unlikely to impress anyone since it was true of databases by the early 1970s.
I think the answer is either "you don't know enough about the specifics to have actionable advice" or "return to basic principles". I generally think that, had they been open about Altman blatantly lying to the board about things, and about Murati and Sutskever being the leaders of the firing, there would've been (a) less scapegoating and (b) a higher chance that Altman's coup would have failed.
But I don't know the details to be confident about actionable advice here.
The structure you describe seems like it could work. It also seems like that's the alignment target now, and it may remain so as we near AGI.
As you note, there's a conflict between whatever long-term goals the AI has, and the deontological principles it's following. We'd need to make very sure that conflict reliably goes in favor of the deontological rules like "follow instructions from authorized humans" even where those conflict with any or all of their other goals and values.
It seems simpler and safer to make that deontological principle the only one, or to have o...
Fine-tuning induced confidence is a concerning possibility that I hadn't thought of. idk how scared to be of it. Thanks!
Scaffolds can normally be built around APIs? I thought scaffolds was just all about what prompts you send to the model and what you do with the model outputs.
Yeah I guess you could do filtering either at the API level or the scaffold level, so we may as well do it at the API level.
RE whether this is tractable, yeah I agree with the considerations in Buck's post.
OK, but then at what point are you going to measure whether somebody would "ideally" want to be "uplifted"? Are you going to take a newborn's CEV, or an adult's? Or somewhere in between? My guess is that insofar as you could define the CEV for a newborn at all, it would basically always be to get "uplifted", whereas for an adult it would rarely be. The adult's CEV would also not include having every child taken from the island.
I know you leave "ideally" undefined, and maybe it's not CEV-like, but I don't know what it could be like if it had to solve that problem.
The nearest unblocked strategy problem combines with the adversarial non-robustness of learned systems to make a much worse combined problem. So I expect constraints to often break given more thinking or learning than was present in training.
Separately, reflective stability of constraint-following feels like a pretty specific target that we don't really have ways of selecting for, so we'd be relying on a lot of luck for that. Almost all my deontological-ish rules are backed up by consequentialist reasoning, and I find it pretty likely that you need some k...
Celestia won't like that. It costs a lot, and, worse, it makes her access to your values imperfect, so she can't satisfy them absolutely optimally. Which is presumably why she never offers it to anybody in the story.
Even without an upload, however, she does understand you well enough to manipulate you and modify your values. She won't let herself edit your mind directly, but she's more than happy to feed you whatever stimuli are necessary to convince you to change your values, or perhaps to shift to a different attractor within your dynamic value trajector...
This frame makes a lot of other possible global/local modeling challenges salient for me:
A central subproblem of natural abstraction is, roughly, how to handle the low-level conditional on the high level.
So a thing I'd wonder is if you can translate this over to an economic geography context, where a farmer still needs to walk to the cows to milk them, and the wheat fields to sow and reap, and the forest to chop the wood and haul it home to stay warm... and then prices at the market can or should determine ratios they plant and how that fits into their opt...
Anti-clickbait quote:
The researchers found that some models, like Sonnet 4, declined to take the harmful choices 95 percent of the time—a promisingly high number. However, in other scenarios that did not pose obvious human harms, Sonnet 4 would often continue to decline choices that were favorable to the business. Conversely, while other models, like Gemini 2.5, maximized business performance more often, they were much more likely to elect to inflict human harms—at least, in the role-play scenarios, where they were granted full decision-making authority.
If Omega tells you what you'll do, you can still do whatever. If you do something different, this by construction refutes the existence of the current situation where Omega made a correct prediction and communicated it correctly (your decision can determine whether the current situation is actual or counterfactual). You are in no way constrained by the existence of a prediction, or by having observed what this prediction is. Instead, it's Omega that is constrained by what your behavior is; it must obey your actions in its predictions about them. See also Trans...
There are very few providers, and hardly any of them sell an off-the-shelf product. You usually can’t just buy a lamp to try it out—you have to call the company, get a consultation, and often have someone from the company come install the lamp. It’s a lot of overhead for an expensive product that most people have never heard of.
This has changed! You can buy an Aerolamp for $500. I have one for my own use, and my dance organization uses four.
I think it depends on how much you believe in yourself vs. other specific people. I agree with funding other specific people you believe in.
I think that I have alternate sketches of intuitions. Imagine an ASI who is willing to teach anyone anything that mankind itself discovered and made public, but not help them convince each other of falsehoods or do economically useful work unrelated to teaching, and is satisfied with a mere trifle of the Solar System's resources, since other resources belong to the humans. Then this ASI's long-term goals would be compatible with humans flourishing in ~any way they want to flourish.
As for the chain eventually breaking, Seth Herd built a case for the L...
The risk is that anyone with finetuning access to the AI could induce intuitive confidence that a proof was correct. This includes people who have finetuning access but who don't know the honesty password.
Accordingly, even if the model feels like it has proven that a purported honesty password would produce the honesty hash: maybe it can only conclude "either I'm being evaluated by someone with the real honesty password, or I'm being evaluated by someone with finetuning access to my weights, who's messing with me".
"People who have finetuning access" could include some random AI company employees who want to mess with the model (against the wishes of the AI company).
what if I want to train a new model and run inference on it?
The API can also have built-in functions for training.
What if I want to experiment with a new scaffold?
Scaffolds can normally be built around APIs? I thought scaffolds was just all about what prompts you send to the model and what you do with the model outputs.
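For concreteness, here's roughly what I mean by a scaffold built around an API (a minimal sketch; the endpoint, model name, and response shape are placeholders, not any particular vendor's API):

```python
import json
import urllib.request

API_URL = "https://example.com/v1/chat/completions"   # hypothetical endpoint
API_KEY = "sk-..."                                     # placeholder

def call_model(prompt: str) -> str:
    """Send a prompt to the hosted model and return its completion text."""
    body = json.dumps({"model": "some-model",
                       "messages": [{"role": "user", "content": prompt}]})
    req = urllib.request.Request(
        API_URL,
        data=body.encode(),
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def scaffold(task: str, max_steps: int = 3) -> str:
    """Tiny agent loop: decide what prompts to send and what to do with the outputs."""
    notes = []
    for _ in range(max_steps):
        answer = call_model(f"Task: {task}\nNotes so far: {notes}\nReply DONE when finished.")
        notes.append(answer)
        if "DONE" in answer:
            break
    return notes[-1]
```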
I do agree that this might be rough for some types of research. I imagine the arguments are pretty similar here as the arguments about how much research can be done without access to dangerous model weights.
This post is a founding pillar of my current understanding of Buddhism, insight meditation and awakening. I believe this post (and, by extension, the whole sequence) creates a material reductive framework that solves—at least in broad strokes—a problem so important that it has founded at least one major world religion, the mechanics of which have been a mystery for at least two millennia. This post has been instrumental in improving my understanding of my own experiences with insight cycles.
Will this post be relevant 12 months from now? If this post is corr...
Data point: I'm millennial (born 1992) and have a pretty strong aversion to phone calls, which is motivated mainly by the fact that I prefer most communication to be non-real-time so that I can take time to think about what to say without creating an awkward silence. And when I do engage in real-time communication, visual cues make it much less unpleasant, so phone calls are particularly bad in that I have to respond to someone in real time without either of us seeing the other's face/body language.
If I had to take a wild guess at why this seems to be gene...
This post was a useful source of intuition when I was reading about singular learning theory the other week (in order to pitch it to an algebraic geometer of my acquaintance along with gifting her a copy of If Anyone Builds It), but I feel like it "buries the lede" for why SLT is cool. (I'm way more excited about "this generalizes minimum description length to neural networks!" than "we could do developmental interpretability maybe." De gustibus?)
That is, "flatness" in the loss landscape is about how many nearby-in-parameterspace models achieve similar loss, and you can get that by error-correction, not just by using fewer parameters (such that it takes fewer bits of evidence to find that setting)? Cool!
It seems that using SLT one could give a generally correct treatment of MDL. However, until such results are established
It looks like the author contributed to achieving this in October 2025's "Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory"?
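For reference, the one-line version of the MDL connection I find exciting (standard free-energy asymptotics from SLT; the notation is mine and lower-order terms are elided):

```latex
% Bayesian free energy = description length of the data under the model class.
% Watanabe's asymptotics: training loss plus a complexity term set by the learning
% coefficient \lambda (the RLCT), which acts as an effective parameter count.
\[
  F_n \;=\; -\log \int e^{-n L_n(w)}\,\varphi(w)\,dw
        \;\approx\; n\,L_n(w^{*}) \;+\; \lambda \log n \;+\; (\text{lower-order terms}),
\]
% so a "flatter" (more degenerate) minimum has smaller \lambda and hence a shorter code.
```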
Maybe the appropriate mathematical object to represent trust is related to those used to represent uncertainty in complex systems, such as the wave functions associated with probabilities. After all, you can trust someone to precisely the extent to which you can constrain your own uncertainty about whether they will do things you wouldn't want. These things, while a kind of scalar, certainly contain lots of information in the form of their distribution throughout space, as well as being complex numbers.
This post highlighted ways in which OAI's CBRN evals were uninformative about how close o1-preview was to action-relevant thresholds. I think it's valuable for increasing public knowledge about model capabilities/risk profiles and keeping developers accountable.
I particularly appreciate the post breaking down the logic of rule-out evals ("is this test clearly easier than [real world threat model]" and "does the model clearly fail this test"). This frame still seems useful for assessing system cards in 2025.
Rather than imagining a single concept boundary, maybe try imagining the entire ontology (division of the set of states into buckets) at once. Imagine a really fine-grained ontology that splits the set of states into lots of different buckets, and then imagine a really coarse-grained ontology that lumps most states into just a few buckets. And then imagine a different coarse-grained ontology that draws different concept boundaries than the first, so that in order to describe the difference between the two you have to talk in the fine-grained ontology.
The "unique infinum" of two different ontologies is the most abstract ontology you can still use to specify the differences between the first two.