Notes to self about the structure of the problem, probably not interesting to others:
This is heavily drawn from MIRI's work and Joe Carlsmith's work.
So, there are two kinds of value structure: (1) Long-term goals, and (2) immediate goals & deontological constraints. The line between them isn't sharp but that's OK.
If we imagine an agent that only has long-term goals, well, that thing is going to be a ruthless consequentialisty optimizer thingy and when it gets smart and powerful enough it'll totally take over the world if it can, unless the maxima of its ...
The Fragility of Value thesis and the Orthogonality thesis both hold for this type of agent.
...
E.g. its vision for a future utopia would actually be quite bad from our perspective because there's some important value it lacks (such as diversity, or consent, or whatever).
I think we have enough evidence to say that, in practice, this turns out to be very easy or moot. Values tend to cluster in LLMs (good with good and bad with bad; see the emergent misalignment results), so value fragility isn't a hard problem.
Do LLMs have intelligence (mind), or are they only rational agents? To understand this question, I think it is important to delineate the subtle difference between intelligence and rationality.
In current practice of building artificial intelligence, the most common approach is the standard model, which refers to building rationally acting agents—those that strive to accomplish some objective put into them (see “Artificial Intelligence: A Modern Approach” by Russell and Norvig). These agents, built according to the standard model, use an external standard f...
Many have asserted that LLM pre-training on human data can only produce human-level capabilities at most. Others, e.g. Ilya Sutskever and Eliezer Yudkowsky, point out that since prediction is harder than generation, there's no reason to expect such a cap.
The latter position seems clearly correct to me, but I'm not aware of it having been tested. It seems like it shouldn't be that hard to test, using some narrow synthetic domain.
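As a toy illustration of the mechanism (a minimal sketch of my own, with made-up numbers, not the actual experiment): give many imperfect "demonstrators" the same hidden labels, and check whether a predictor that models their aggregate output beats any single demonstrator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prompts, demos_per_prompt, demonstrator_acc = 1000, 25, 0.8

# Hidden ground truth for each synthetic "prompt".
truth = rng.integers(0, 2, n_prompts)

# Each demonstration is the truth, flipped with probability 1 - demonstrator_acc.
flips = rng.random((n_prompts, demos_per_prompt)) > demonstrator_acc
demos = np.where(flips, 1 - truth[:, None], truth[:, None])

# An idealized predictor of the demonstration distribution: its argmax is the
# majority label, which denoises the individual demonstrators.
predictor_guess = (demos.mean(axis=1) > 0.5).astype(int)

print("single demonstrator accuracy:", (demos[:, 0] == truth).mean())      # ~0.80
print("predictor argmax accuracy:   ", (predictor_guess == truth).mean())  # ~0.99+
```

The point is just that a model trained purely to predict human-level data can, by aggregating over many noisy sources, output guesses better than any of the humans it was trained on; whether real LLM training actually exploits this is exactly what would need testing.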
The only superhuman capability of LLMs that's been clearly shown as far as I know is their central one: next-token prediction. But I...
LLMs are of course also superhuman at knowing lots of facts, but that's unlikely to impress anyone since it was true of databases by the early 1970s.
Am I understanding correctly that recent revelations from Ilya's deposition (e.g. looking at the parts here) suggest Ilya Sutskever and Mira Murati seem like very selfish and/or cowardly people? They seem approximately as scheming or manipulative as Sam Altman, if maybe more cowardly and less competent.
My understanding is that they were basically wholly responsible for causing the board to try to fire Sam Altman. But when it went south, they actively sabotaged the firing (e.g. Mira disavowing it and trying to retain her role, Ilya saying he regr...
I think the answer is either "you don't know enough about the specifics to have actionable advice" or "return to basic principles". I generally think that, had they been open about Altman blatantly lying to the board and about Murati and Sutskever being the leaders of the firing, there would've been (a) less scapegoating and (b) a better chance that Altman's coup would have failed.
But I don't know the details to be confident about actionable advice here.
Epistemic status: I think that there are serious problems with honesty passwords (as discussed in this post), and am not sure that there are any circumstances in which we'd actually want to use them. Furthermore, I was not able to come up with a practical scheme for honesty passwords with ~2 days of effort. However, there might be some interesting ideas in this post, and maybe they could turn into something useful at some later point.
...Thanks to Alexa Pan, Buck Shlegeris, Ryan Greenblatt, Vivek Hebbar and Nathan Sheffield for discussions that led to me writi
Fine-tuning induced confidence is a concerning possibility that I hadn't thought of. idk how scared to be of it. Thanks!
Much has been made of the millennial aversion to phone calls that could have been an email, and I have a little bit of this nature, but I think most of my aversion here is to being on hold and getting bounced around between departments.
I kind of want to check: 1. whether the aversion is real and generational, as common wisdom holds, and 2. if it is real, whether calling became a genuinely worse experience around the time millennials started trying to do things.
Data point: I'm millennial (born 1992) and have a pretty strong aversion to phone calls, which is motivated mainly by the fact that I prefer most communication to be non-real-time so that I can take time to think about what to say without creating an awkward silence. And when I do engage in real-time communication, visual cues make it much less unpleasant, so phone calls are particularly bad in that I have to respond to someone in real time without either of us seeing the other's face/body language.
If I had to take a wild guess at why this seems to be gene...
I quite appreciated Sam Bowman's recent Checklist: What Succeeding at AI Safety Will Involve. However, one bit stuck out:
...In Chapter 3, we may be dealing with systems that are capable enough to rapidly and decisively undermine our safety and security if they are misaligned. So, before the end of Chapter 2, we will need to have either fully, perfectly solved the core challenges of alignment, or else have fully, perfectly solved some related (and almost as difficult) goal like corrigibility that rules out a catastrophic loss of control. This work co
"we don't need to worry about Goodhart and misgeneralization of human values at extreme levels of optimization."
That isn't a conclusion I draw, though. I think you don't know how to parse what I'm saying as different from that rather extreme conclusion -- which I don't at all agree with? I feel concerned by that. I think you haven't been tracking my beliefs accurately if you think I'd come to this conclusion.
FWIW I agree with you on a), I don't know what you mean by b), and I agree with c) partially—meaningfully different, sure.
Anyways, when I talk abou...
Do we have some page containing resources for rationalist parents, or generally for parents of smart children? Such as recommended books, toys, learning apps, etc.
I found the tag https://www.lesswrong.com/w/parenting, but I was hoping for something like the best textbooks / recommendations / reference works threads, just aimed at parents/children.
From DeJong et al. (2019):
Alcohol readily crosses the placenta with fetal blood alcohol levels approaching maternal levels within 2 hours of maternal consumption.
https://scispace.com/papers/alcohol-use-in-pregnancy-1tikfl3l2g (page 3)
GPT-5.2 seems to have been trained specifically to be better at work tasks, especially long ones. It was also released early, according to articles about a "code red" at OpenAI. As such, I predict it will show up as a jump on the METR graph. It will be difficult to disentangle progress that comes from being trained to do well at long work tasks from the effects of the early release and from any genuine algorithmic progress. (An example of algorithmic progress would be a training method for using memory well, i.e. something not specific to, e.g., programming tasks.)
What do you mean by "a jump on the metr graph"? Do you just mean better than GPT-5.1? Do you mean something more than that?
One theme I've been thinking about recently is how bids for connection and understanding are often read as criticism. For example:
Person A shares a new idea, feeling excited and hoping to connect with Person B over something they've worked hard on and hold dear.
Person B asks a question about a perceived inconsistency in the idea, feeling excited and hoping for an answer which helps them better understand the idea (and Person A).
Person A feels hurt and unfairly rejected by Person B. Specifically, Person A feels like Person B isn't willing to give their sinc...
What constitutes cooperation? (previously)
Because much in the complex system of human interaction and coordination is about negotiating norms, customs, institutions, and constitutions to guide and constrain future interaction in mutually preferable ways, I think:
deserves special attention.
This is despite it perhaps (for a given 'coordination moment') being in theory reducible to preference elicitation or aggregation, searching outcome space, negotiation, enforcement, ...
There are failure modes (unintended consequences, ...
Seems right! 'studies' uplifts 'design' (either incremental or saltatory), I suppose. For sure, the main motivation here is to figure out what sorts of capabilities and interventions could make coordination go better, and one of my first thoughts under this heading is open librarian-curator assistive tech for historic and contemporary institution case studies. Another cool possibility could be simulation-based red-teaming and improvement of mechanisms.
If you have any resources or detailed models I'd love to see them!
I think people in these parts are not taking sufficiently seriously the idea that we might be in an AI bubble. This doesn't necessarily mean that AI isn't going to be a huge deal - just because there was a dot-com bubble doesn't mean the Internet died - but it does very substantially affect the strategic calculus in many ways.
The evidence just seems to keep pointing towards this not being a bubble.
A drastic table-flip-like action a safety-conscious frontier lab could take: Burn their company to the ground
At a specific chosen moment, all employees are fired, the leadership steps down, all intellectual property+model weights are deleted and infrastructure is de-deployed/"demolished", and this is publicly announced, together with the public message that it was too dangerous to continue, and an urging of other company leaders to do the same.
(I don't think any lab is planning on doing this, but someone could.)
Learned about 'Harberger tax' recently.
The motivation is like ...
Ideally you would want to allow depreciation though, which is a definite phenomenon! (Especially if things are neglected.)
Yeah, there's some design questions. You're right, the upside to the corrective bidders is naively nothing if they get called on it: they're doing valuable corrective cybernetic labour for free.
Maybe a sensible refinement would be for them to be owed a small fee... or roughly equivalently some (temporary) direct share of the resulting increased Harberger tax.
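For concreteness, here's a minimal sketch of that refinement (my own formalization, with made-up parameters, not a worked-out proposal): a challenger whose bid forces the owner to raise their self-assessment earns a temporary share of the resulting tax increase, so the corrective labour isn't free.

```python
from __future__ import annotations

TAX_RATE = 0.07          # annual Harberger tax as a fraction of self-assessed value
CHALLENGER_SHARE = 0.5   # fraction of the *extra* tax paid out to the challenger
SHARE_PERIODS = 3        # how many tax periods the challenger's share lasts

class Asset:
    def __init__(self, owner: str, assessed_value: float):
        self.owner = owner
        self.assessed_value = assessed_value
        # (challenger, extra_tax_per_period, periods_left)
        self.pending_shares: list[tuple[str, float, int]] = []

    def collect_tax(self) -> dict[str, float]:
        """One tax period: owner pays tax; challengers receive their temporary cut."""
        tax = TAX_RATE * self.assessed_value
        payouts = {"treasury": tax}
        remaining = []
        for challenger, extra_tax, periods_left in self.pending_shares:
            fee = CHALLENGER_SHARE * extra_tax
            payouts[challenger] = payouts.get(challenger, 0.0) + fee
            payouts["treasury"] -= fee
            if periods_left > 1:
                remaining.append((challenger, extra_tax, periods_left - 1))
        self.pending_shares = remaining
        return payouts

    def challenge(self, challenger: str, owner_new_assessment: float | None):
        """Challenger offers to buy at the assessed value; owner sells or re-assesses upward."""
        if owner_new_assessment is None:
            self.owner = challenger  # sale at the old self-assessed price
            return
        assert owner_new_assessment > self.assessed_value
        extra_tax = TAX_RATE * (owner_new_assessment - self.assessed_value)
        self.pending_shares.append((challenger, extra_tax, SHARE_PERIODS))
        self.assessed_value = owner_new_assessment

house = Asset("alice", assessed_value=100_000)
house.challenge("bob", owner_new_assessment=150_000)  # alice keeps the house but re-assesses
print(house.collect_tax())  # bob temporarily receives part of the increased tax
```

Depreciation could be layered onto the same skeleton by decaying assessed_value each period unless the owner re-assesses.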
Last week we wrapped up the second post-AGI workshop; I'm copying across some reflections I put up on Twitter:
You mean post AGI and pre ASI?
I agree that will be a tricky stretch even if we solve alignment.
Post-ASI, the only question is whether it's aligned, or intent-aligned to a good person (or people). It takes care of the rest.
One solution is to push fast from AGI to ASI.
With an aligned ASI, other concerns are largely (understandable) failures of the imagination. The possibilities are nearly limitless. You can find something to love.
This is under a benevolent sovereign. The intuitively appealing balances of power seem really tough to stabilize long term or even short term during takeoff.
running the agi survey really reminded me just how brutal statistical significance is, and how unreliable anecdotes are. even setting aside sampling bias of anecdotes, the sheer sample size you need to answer a question like "do more people this year know what agi is than last year" is kind of depressing - you need like 400 samples for each year just to be 80% sure you'd notice a 10 percentage point increase even if it did exist, and even if there was no real effect you'd still think there was one 5% of the time. this makes me a lot more bearish on vibes in general.
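for reference, the back-of-the-envelope power calculation behind that number (a minimal sketch; the ~50% baseline recognition rate is an assumption, and lower baselines need somewhat fewer samples):

```python
# Sample size per survey wave for a two-proportion z-test: detect a
# 10 percentage point rise (assumed 45% -> 55%) with 80% power at a
# two-sided alpha of 0.05 (the "wrong 5% of the time" false-positive rate).
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # two-sided critical value
    z_beta = z(power)            # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

print(round(n_per_group(0.45, 0.55)))  # ~389 respondents per year
```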
Reasonable, I also don't expect that I could pick up on a 1.5x increase in name recognition over a year based on vibes - didn't read closely enough to notice you were talking about a 10% increase, so sorry about the time waste.
My preference ordering over approaches to surviving the next decade, assuming I had unlimited political capital, is (don't build ASI and):
1: Solve mind uploading
2: ... or at least intelligence amplification
(before proceeding, and then maybe do)
3: Imitation learning (with radically improved theoretical foundations)
4: ... or corrigibility / oracles
(and don't)
5: Try to build aligned (or controlled) autonomous ASI agents
Unfortunately, 5 seems to be profitable in the short term, so I guess that's what we're doing. In fact, the "plan" seems to be more or less the exact reverse of my preference ordering of what it should be.
Since it's slower, the tech development cycle is faster in comparison. Tech development --> less expensive tech --> more access --> less concentration of power --> more moral outcomes.
Just learned about the Templeton World Charity Foundation (TWCF), which is unusual in that one of their 7 core funding areas is, explicitly, 'genius':
...Genius
TWCF supports work to identify and cultivate rare cognitive geniuses whose work can bring benefits to human civilization.
In this context, geniuses are not simply those who are classified as such by psychometric tests. Rather, they are those who: (1) generate significant mathematical, scientific, technological, and spiritual discoveries and inventions that benefit humanity or have the potential to transf