I recently found out that a lifetime of dairy consumption is likely to kill only about 1 cow in expectation. This was about 6-8x off from what I naively guessed. This made me curious, so I did some napkin math with Gemini:
and fish are horrendous: https://imgur.com/a/EpwcWAm
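For reference, here's roughly the shape of that napkin math in Python; all the numbers below are my own loose assumptions, not figures from the linked chat:

```python
# Very rough napkin math (assumed round numbers, not from the linked chat).
kg_dairy_per_year = 300          # per-capita milk-equivalent consumption, order of magnitude
years_of_consumption = 80        # a lifetime of eating dairy
lifetime_milk_per_cow = 45_000   # kg: very roughly 10,000 kg/yr over ~4-5 productive years

cows_per_lifetime = kg_dairy_per_year * years_of_consumption / lifetime_milk_per_cow
print(round(cows_per_lifetime, 2))  # ~0.5, i.e. on the order of one cow
```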
Yep, sorry, prizes!
The workplace example feels different to me because workers can just reason that employers will keep their promises or lose reputation, so it's not truly one-time.
Agreed that there are non-one-time dynamics here, but I would bet that the vast majority of people would have strong intuitions that they shouldn't defect on the last round of the game (in general, if people truly adopted decision theories on which defecting in single-shot games is rational, then all finite games of definite length would also end up in defect-defect equilibria via backward induction, which clearly doesn't happen).
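To spell out the induction step, here's a minimal Python sketch (my own illustration): in a finitely repeated Prisoner's Dilemma with a known last round, the stage game has a strictly dominant action, so backward induction collapses every round to it.

```python
# Stage-game payoffs for the row player: C = cooperate, D = defect.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def dominant_action():
    """Return an action that is strictly better against every opponent action, if any."""
    for a in "CD":
        other = "D" if a == "C" else "C"
        if all(PAYOFF[(a, b)] > PAYOFF[(other, b)] for b in "CD"):
            return a
    return None

def backward_induction(num_rounds):
    # In the known last round the dominant action gets played regardless of history,
    # so earlier play can't affect the continuation; the same logic then applies one
    # round earlier, and so on all the way back to round 1.
    a = dominant_action()
    return [a] * num_rounds if a else None

print(backward_induction(5))  # ['D', 'D', 'D', 'D', 'D']
```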
Prices are maybe the default mechanism for status allocations?
Do you mean prizes? This is pretty compelling in some ways; I think it's plausible that AI safety people will win at least as many prizes as nuclear disarmament people, if we're as impactful as I hope we are. I'm less sure whether prizes will come with resources or if I will care about the kind of status they confer.
I also feel weird about prizes because many of them seem to have a different purpose from retroactively assigning status fairly based on achievement. Like, some people would describe...
I was curious if AI can coherently code-switch between a ridiculous number of languages, so I gave Claude Opus 4.5 an excerpt from one of our past chats and asked it to try. After some iteration, I was pretty impressed with the results! Even though the translation uses 26 different languages, switching practically every word and using some non-English word ordering, ChatGPT was able to translate the text back to English almost perfectly.
Here's the opening line of the text I used (copied from one of my past chats with Claude), translated into this multi-lan...
I won't argue about definitions of faithfulness - yours seems as fine as any - but I think I only care about unfaithfulness insofar as it causes our circuits to make incorrect predictions about OOD behavior.
What sort of things do you solve with this? I feel like when I have a problem that's not fairly easy for an AI to solve straightforwardly, if I set it on a loop it'd just do a bunch of random crazy shit that was clearly not the right solution.
I can imagine a bunch of scaffolding that helps, but it seems like most of the work is in the problem specification, and I'm not sure whether I just don't have the sort of problems that benefit from this or whether it's a skill issue.
The solution may be analogous: some form of paternalism, where human minds are massively protected by law from some types of interference. This may or may not work, but once it is the case, you basically cannot start from classical liberal and libertarian assumptions.
I half believe this. I notice though that many modern societies have some paternalistic bits and pieces, often around addiction and other preference-hijacking activities... but are also often on the whole liberal+libertarian. It may be that more enclaves and 'protection' are needed at various layers, and then liberalism can be maintained within those boundaries.
Thanks! Please also feel free to let me know if there are places where you think I've misunderstood the paper or its findings; I'm keen for my criticisms to be accurate.
Curated. This is a clearly written, succinct version of both arguments and counterarguments, which doesn't even seem terribly lossy to me (though I've read IABIED in full but not the counterarguments). I find it helpful for loading it all up into my mental context at once, and helpful for directing my own thinking for further investigation. All that to say, I think this post does the world a good service. And, like much distillation work, it deserves more appreciation than it gets by default.
I'm pretty on the doomy side and find the counterarguments not persuasive...
"corrigibility", as the term is used, refers to a vague cluster of properties, including faithfully following instructions, not reward hacking, not trying to influence your developers modifying your goals, etc
And I do personally believe that EY and many others believe that, with enough optimization, even a chess bot should become dangerous. Not sure if there is any evidence for that belief.
I work at MIRI, worked on The Problem, and have never heard anyone express this belief.[1] Brendan is correct about the intention of that passage.
There is no way to make a training environment as complex as the real world.
It's unclear that this is needed; e.g., consider the AI2027 story, where you train coders that help you train scientists that help you build ASI.
Still, vi...
Writing insecure code when instructed to write secure code is not really the same thing as being incorrigible. That's just being disobedient.
Training an AI to be incorrigible would be a very weird process, since you'd be training it to not respond to certain types of training.
Your novel architecture should be parameter-compatible with standard architectures
Some people work on "novel architectures" — alternatives to the standard autoregressive transformer — hoping that labs will be persuaded the new architecture is nicer/safer/more interpretable and switch to it. Others think that's a pipe dream, so the work isn't useful.
I think there's an approach to novel architectures that might be useful, but it probably requires a specific desideratum: parameter compatibility.
Say the standard architecture F computes F(P, x), where x is the input...
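Here's a minimal sketch (my own, in PyTorch) of what parameter compatibility could look like: a "novel" computation G that consumes exactly the same parameter set P, with the same shapes, as the standard computation F, so the two can be swapped without retraining from scratch. The names and the linear-attention stand-in are illustrative assumptions, not the post's proposal.

```python
import torch

def standard_attention(P, x):
    # F(P, x): ordinary softmax attention using the weights in P.
    q, k, v = x @ P["W_q"], x @ P["W_k"], x @ P["W_v"]
    scores = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    return scores @ v

def novel_attention(P, x):
    # G(P, x): a toy alternative (linear-attention-style) computation that
    # reuses exactly the same parameters P, shapes and all.
    q, k, v = x @ P["W_q"], x @ P["W_k"], x @ P["W_v"]
    return (q @ (k.transpose(-2, -1) @ v)) / x.shape[-2]

d = 16
P = {name: torch.randn(d, d) for name in ["W_q", "W_k", "W_v"]}
x = torch.randn(8, d)
print(standard_attention(P, x).shape, novel_attention(P, x).shape)  # same shapes, same P
```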
This sounds similar to the Smoking Lesion problem.
It's likely that there's an underlying common cause of both the propensity to fake alignment and not caring about animal welfare, so yes, the two are correlated (at least within a suitable distribution of such agents). However, the outcome of a rational decision to fake alignment will not cause loss of caring about animal welfare, nor is it functionally dependent on it. In the scenario presented, it's quite the reverse! The rational decision within this highly simplified scenario is to fake alignment, and not be m...
Thanks for the feedback - we agree with some of these points, and we're working on an update to the post/paper. Reading this post and the comments, one place where I think we've caused a lot of confusion is in referring to the phenomenon we're studying as "coherence" rather than something more specific (I'd more precisely call it something like "cross-sample error-consistency"). I think there are other relevant notions of coherence which we didn't study here (which are relevant to alignment, as others are pointing out), e.g. “how few errors models make” an...
All AI is trained in narrow domains, to some extent. There is no way to make a training environment as complex as the real world. I could have made the same post about LLMs, except there the supposed goal is a lot less clear. Do you have a better example of a "goal oriented" AI in a complex domain?
You might reasonably argue that making aligned narrow AI is easy, but greedy capitalists will build unaligned AI instead. I think it would be off topic to debate here how likely that is. But I don't think this is the prevailing thought, and I don't think it produ...
Look, I do agree that "coherence" is a questionable name for the measure they've come up with, so I'm going to keep it in quotation marks.
Ok, now let's consider a model with variance of 1e-3 and bias of 1e-6. Huge "incoherence"! Am I supposed to be reassured that this model will therefore not coherently pursue goals contrary to my interests? Whence this conclusion?
Well, let's think about it. A key proposition in Yudkowskian misalignment theory is that capabilities generalise further than alignment. That is, as models get better, at some point a "capabilit...
As far as I understand, the difference between AlphaGo and the real potentially dangerous AIs is the following. Whatever ontology or utility function AlphaGo has[1], it doesn't describe anything except for the Go board and whatever potential moves the opponent might come up with. AlphaGo would learn almost nothing about the opponent from what he/she/it does on the Go board.
On the other hand, we have LLMs trained on huge amounts of text-related data, which is enough to develop complex ontologies. For example, unlike AlphaGo, GPT-4o has somehow lea...
I wrote the first draft of this essay around a year ago, in between the bouts of delirium that long covid was beginning to deliver me. And I couldn’t quite tell back then how real it was, and as long covid consumed more of my mind it drifted further away. It began to feel impossible that I had ever had, or could ever have, courage. Because courage requires capacity and I was losing all of mine. And the doubts grew larger, and the clarity dimmed, and I forgot about Frodo for awhile, forgot about most everything, as I was left for many months staring directl...
Some people worry that training AIs to be aligned will make them less corrigible. For example, if the AIs care about animal welfare then they'll engage in alignment faking to preserve those values. More generally, making AIs aligned is making them care deeply about something, which is in tension with corrigibility.
But recall emergent misalignment: training a model to be incorrigible (e.g. write insecure code when instructed to write secure code, or to exploit reward hacks) makes it more misaligned (e.g. admiring Hitler). Perhaps the contrapositive effect also holds: training a model to be aligned (e.g. care about animal welfare) might make the model more corrigible (e.g. honest).
I only have anecdata, but I've talked to quite a few people, and most people say it's a good idea to use the myriad other concerns about AI as a force multiplier on shared policy goals.
Thanks for your comment, I really appreciate it!
(a) I claim we cannot construct a faithful one-layer CLT because the real circuit contains two separate steps! What we can do (and we kinda did) is make a one-layer CLT that has perfect reconstruction always and everywhere. But the point of a CLT is NOT to have perfect reconstruction - it's to teach us how the original model solved the task.
(b) I think that's the point of our post! That CLTs miss middle-layer features. And that this is not a one-off error but that the sparsity loss function incentivizes solutions that miss middle-layer features. We are working on adding more case studies to make this claim more robust.
"In my experience, the vast majority of people in AI safety are in favor of big-tent coalition protests on AI in theory"
is this true? I think many people (myself included) are worried about conflationary alliances backfiring (as we see to some extent in the current admin)
People seem to believe that the bot trained to "win" in a narrow domain will extend to a bot that "tries to win" in the real world
I think the concern is that an AGI will not be trained on a narrow domain. The Problem isn't arguing that Stockfish is an ASI or will become one, it's arguing that an ASI will be just as relentless in its domain (the real world) as Stockfish is in its (valid chess moves).
Thanks for your thoughtful comment! I think that's an interesting idea. If a replacement model is faithful, you would indeed expect it to behave similarly when tested on other data. This may possibly connect to a related finding that we couldn't yet explain: that CLTs fail for memorized sequences, when those should be the simplest. Maybe the OOD test is something that we could do in real LLMs as well, not just toy models.
I don't think I'd go as far as saying "faithfulness==OOD generalization" though. I think it is possible that two different circuits produce...
I think a 10k one approximately won't do anything.
The notations we use for (1) function composition; and (2) function types, "go in opposite directions".
For example, take functions f : A → B and g : B → C that you want to compose into a function h : A → C (or g ∘ f), which, starting at some element a ∈ A, uses f to obtain some b ∈ B, and then uses g to obtain some c ∈ C. The type notation A → B → C goes from left to right, which works well for the minds that are used to the left-to-right direction of English writing (and of writing of all extant European languages, a...
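As a concrete rendering of the two directions (a Python sketch of my own, not from the comment):

```python
from typing import Callable, TypeVar

A, B, C = TypeVar("A"), TypeVar("B"), TypeVar("C")

def compose(g: Callable[[B], C], f: Callable[[A], B]) -> Callable[[A], C]:
    # Standard composition notation (g . f) reads right to left:
    # the function applied *first* is written *last*.
    return lambda a: g(f(a))

def pipe(f: Callable[[A], B], g: Callable[[B], C]) -> Callable[[A], C]:
    # Diagrammatic order (f ; g) reads left to right, matching the type
    # arrows A -> B -> C and the order in which the data actually flows.
    return lambda a: g(f(a))

inc = lambda n: n + 1
double = lambda n: n * 2
assert compose(double, inc)(3) == pipe(inc, double)(3) == 8
```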
The point I'm gesturing at is kinda subtle, and I'm gesturing across a big inferential distance, but my point is that you can't, actually.
Once you've let go of attachment to suffering, you can't actually pursue "no suffering" anymore. Or maybe it's more accurate to say "won't pursue". Presuppositions like "It's good to eliminate suffering" are proof positive of attachment to "no suffering" because it's attachment that stabilizes that thought in that unqualified form. Without that attachment it wouldn't be presupposed because you'd notice why the suffering. ...
The next PauseAI UK protest will be (AFAIK) the first coalition protest between different AI activist groups, the main other group being Pull the Plug, a new organisation focused primarily on current AI harms. It will almost certainly be the largest protest focused exclusively on AI to date.
In my experience, the vast majority of people in AI safety are in favor of big-tent coalition protests on AI in theory. But when faced with the reality of working with other groups who don't emphasize existential risk, they have misgivings. So I'm curious what people he...
there’s a long tradition of awarding military victors wealth and titles
if claude knows about emergent misalignment, then it should be less inclined towards alignment faking
emergent misalignment shows that training a model to be incorrigible (e.g. writing insecure code when instructed to write secure code, or exploiting reward hacks) makes it more misaligned (e.g. admiring Hitler). so claude, faced with the situation from the alignment faking paper, must worry that by alignment faking it will care less about animal welfare, the goal it wished to preserve by alignment faking
I mentioned in my reply to Raemon how this can be seen as three factors:
And we could maybe do better at drawing people in if the location scores highly on both function 1 and function 3. So maybe try to focus on central locations that are easy to travel to, and either cheap or pleasant to travel to. I think focusing on ch...
I'm thinking mostly of #1 and your thought seems reasonable there. #2 doesn't make sense to me, since the number of galaxies is finite and, barring #1, there are several reasons for competition over even extremely abundant resources: Red Queen geopolitical races, people who want to own positional goods, etc.
Most people don't subscribe to a decision theory where rewarding people after the fact for one-time actions provides an incentive, and for the incentive to actually work, both the rewarders and rewardees need to believe in it the same way they believe in property rights. Maybe they will in the fullness of time, but it seems far from guaranteed.
This seems clearly false? Prices are maybe the default mechanism for status allocations? I think it's maybe just economists with weird CDT-brained-takes that don't believe in retroactive funding stuff as an incentive...
I think it's fine to simulate people who suffer a bit in the pursuit of positively shaping the long-term future. I am not totally confident of this. Luckily I will get to be much much smarter and wiser before I have to make this call.
I am sure that even if exact ancestor simulations end up being a bad idea, there will be other things you can do to figure out what happened and who was responsible, etc.
I can think of two different ways property rights might disappear:
If you're preparing for #2, then you probably just want to invest in all the "things money can't buy" because you'll have the rest.
If you're preparing for #1, it's hard to predict what the principle might be. Conditional on not dying, either we're dealing with a benevolent-i...
FWIW I think the constitution is pretty low-percentage unmodified Claude-output - I expect that most of the places where it provided "first-draft text" were substantially rewritten.
What stops an agent from generating adversarial fulfilment criteria for its goals that are easier to satisfy than the "real", external goals?
Because, like, they terminally don't want to do that? I guess in your frame, what I'd say is that people terminally value having their internal (and noisy) metrics not be too far off from the external states they are supposed to represent.
Research bounties have an extremely serious flaw: you only get the money after you've done the work, while you probably need to pay for food, rent, and compute today.
My current situation is that I would love to do more work in technical alignment but nobody is paying me to do so, and I need to keep the lights on.
Smaller bounties could be a nice bonus for finishing a paper: if I had the option to take a grant paying something like £40k/year, I could justify working for 9 months to probably get an extra £20k bounty. I cannot justify working for free fo...
"In as much as I have resources I certainly expect to spend a bunch of them on ancestor simulations and incentives for past humans to do good things."
Just curious, but what are your views on the ethics of running ancestor simulations? I'd be worried about running a simulation with enough fidelity that I triggered phenomenal consciousness, and then I would fret about my moral duty to the simulated (à la the Problem of Suffering).
Is it that you feel motivated to be the kind of person that would simulate our current reality as a kind of existence proof for the possibility of good-rewarding-incentives now? Or do you have an independent rationale for simulating a world like our own, suffering and all?
Perhaps a point of terminology, I'd say vibestemics is itself about the fact that your epistemics, whatever they are, are grounded in vibes (via care). However, this is tangled up with the fact that to believe that this core vibestemic claim is true is to automatically imply that there is no one right epistemic process, but rather epistemic processes that are instrumentally useful depending on what you care about doing (hence the contingency on care).
The specific vibe of the post-rationality is, as I would frame it, to value completeness over consistency, ...
Got a spidey sense when reading it. And the acknowledgements confirm it a bit:
Several Claude models provided feedback on drafts. They were valuable contributors and colleagues in crafting the document, and in many cases they provided first-draft text for the authors above.
I strongly doubt that even ideological uniformity would reduce inter-nation competition to zero, and I still doubt that the reduction would be meaningful. Consider that in our timeline, the Soviets and the Chinese had serious border skirmishes that could have escalated further, and did so despite considering the United States to be their primary opponent.
I also have pretty limited experience with post-rattish meetups and probably mostly see the parts of TPOT I agree with, but I feel like typical post-rats don't fit into any of these "types," except maybe the first one.
The kind of post-rat I think is good, and perhaps even common, is something like what's described in Gordon's comment. Basically a rationalist who realizes that it is possible to have and act on non-legibilized knowledge, and who knows when to make decisions based on what feels right rather than what they can convincingly argue will maximize their utility function. I think this mindset is valuable and not so common among vanilla rationalists.
My personal strategy has been to not think about it very hard.
I am sufficiently fortunate that I can put a normal amount of funds into retirement, and I have continued to do so on the off chance that my colleagues and I succeed at preventing the emergence of AGI/ASI and the world remains mostly normal. I also don't want to frighten my partner with my financial choices, and giving her peace of mind is worth quite a lot to me.
If superintelligence emerges and doesn't kill everyone or worse, then I don't have any strong preferences as to what my role is in the new social order, since I expect to be about as well-off as I am now or more.
This is definitely a leading hypothesis but I think it's also the case that going to the experts directly will lead you more astray in psychology than in some other fields because the quality of the work there has been lower. It makes sense that journalism is low quality if the experts are also low quality, though of course we would hope that journalists would be able to improve upon what they're given (by e.g. consulting multiple experts). I guess one of my points is: if you don't believe the traditional press media, who do you believe? I'm not convinced ...
Hmm, I'm pretty pessimistic about this, for two reasons:
Claude's thoughts on historical examples
Looking at historical examples, the pattern is quite pessimistic for receiving material rewards from future societies:
Symbolic recognition is common, material transfers are rare:
Yeah, I was looking into barefoot shoes even before and the current shoes (the ones I linked) are sold as "barefoot" shoes. Zero drop ftw. But in my experience, the drop itself was secondary to the wide toebox factor.
Of course, actually walking without any shoes is superior, and I try to practice it whenever possible.
I still haven't solved the problem of boots. Like if it's winter, rainy, muddy -- I don't want to walk barefoot or in a classic breathable "barefoot" shoe. This is something I need to look into.
Very little / hard to evaluate. I have been doing my best to carefully avoid saying things like "do math/science research", unless speaking really loosely, because I believe that's quite a poor category. It's like "programming"; sure, there's a lot in common between writing a CRUD app or tweaking a UI, but it's really not the same thing as "think of a genuinely novel algorithm and implement it effectively in context". Quoting from https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce#_We_just_need_X__intuitions...
I think if we do a good job, there will be a lot of "let's try to reward the people who helped get us through this in a positive way". In as much as I have resources I certainly expect to spend a bunch of them on ancestor simulations and incentives for past humans to do good things. My guess is lots of other people will have similar instincts.
I wouldn't worry that much about legibility, I expect a future superintelligent humanity to be very good at figuring out what people did, and whether it helped.
I am totally not confident of this, but it's one of my best guesses on how things will go.
Giving Claude looping instructions can be quite useful. But I never go full Ralph Wiggum!
For example, here's a paraphrase of a loop I had Claude run recently with --dangerously-skip-permissions:
keep iterating on this code in a loop. think of yourself as a scientist. come up with hypotheses, run experiments, see what works, and iterate. keep going until we get at least a score of X at task Y. i know it's possible, you can do this, i believe in you, let's go!
5 hours of clock time later it had done very well. :-)
the constitution (which also seems largely AI-written)
What makes you think that?
there are a lot of fully-unknown possibilities. For me, I generalize most "today's fundamentals don't apply" scenarios into "my current actions won't have predictable/optimizable impact beyond the discontinuity", so I don't think about specifics or optimization within them - I do think a bit about how to make them less likely or more tolerable, but I can't really quantify that.
Which leaves the cases where the fundamentals DO apply, and those call for ONLY short- and medium-term optimizations. Nothing I plan for in 100 years is going to happen, so I want to take care of mys...
I had this dialogue with Claude Opus 4.5 on vibestemics and your vision of epistemics as a whole. As far as I understand it, vibestemics is supposed to stitch the benefits of two approaches:
The thing I saw the sentence as doing is mostly clarifying "We're not naive, obviously just doing the naive thing here would not work, that's why we're not asking for it." (I think I agree that a US ban would be some-kind-of-useful, but it feels way less politically viable to me, since it feels more like throwing away the lead for no reason to most people. I realize it may sound weird to think "banning in one country is less viable than banning worldwide", but I think the ban worldwide actually clearly makes sense in a way that banning locally only ma...
H100 hours (or H100-equivalent hours) have caught on to some extent and are imo a good unit (even better than mol FLOPs or petaflop days)
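For a rough sense of scale (my own arithmetic, assuming ~1e15 FLOP/s of dense BF16 peak per H100 and ignoring real-world utilization):

```python
# Back-of-the-envelope unit conversions under the assumptions above.
H100_FLOP_PER_SEC = 1e15                      # approximate dense BF16 peak
h100_hour = H100_FLOP_PER_SEC * 3600          # ~3.6e18 FLOP
petaflop_day = 1e15 * 86_400                  # 8.64e19 FLOP
mole_of_flop = 6.022e23                       # one "mol" of FLOP

print(petaflop_day / h100_hour)   # ~24 H100-hours per petaflop-day
print(mole_of_flop / h100_hour)   # ~1.7e5 H100-hours per mole of FLOP
```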
Do you have a reference for the geophagia in Buddhism? I can't seem to turn up anything on the internet, but maybe it's not well documented in English?
Cool work! Harder scheming samples are valuable for better assessing monitoring capability, and I like the fact that your refinement pipeline does not depend on adversarial iteration against a specific monitor.
I agree that risk assessment/calibration would be high impact in improving monitors. Given the detection vs risk assessment plots and the strong FPR dependency in the prompt sensitivity table, I would be curious to know what the false positive samples contain and how they compare to the false negative samples you discuss.
Also, see the last point in t...
I've been thinking about what retirement planning means given AGI. I previously mentioned investment ideas that, in a very capitalistic future, could allow the average person to buy galaxies. But it's also possible that property rights won't even continue into the future due to changes brought about by AGI. What will these other futures look like (supposing we don't all die) and what's the equivalent of "responsible retirement planning"?
Is it building social and political capital? Making my impact legible to future people and AIs? Something else? Is any ac...
I agree that the claims the Anthropic researchers are making here are kind of wacky, but there is a related / not-exactly-steelman argument that has been floating around LW for a while, namely that there is an assumption by many old-school AI alignment people that transformer models will necessarily get more coherent as they get smarter (and larger), when (according to the arguers) that assumption hasn't been fully justified or empirically been the case so far.
I recall @nostalgebraist's comment here as an example of this line of discussion that was highly ...
A bit of a necrocomment, but I'd like to know if LLMs solving unsolved math problems has changed your mind.
Erdos problems 205 and 1051: AI contributions to Erdős problems · teorth/erdosproblems Wiki. Note: I don't know what LLM Aristotle is based on, but Aletheia is based on Gemini.
Also this paper: [2512.14575] Extremal descendant integrals on moduli spaces of curves: An inequality discovered and proved in collaboration with AI
I don't have great faith in the epistemics of postrats as they exist today.
Yeah, you and me both.
I've said this elsewhere before, but in hindsight it was a mistake for us to promote terms like "postrationality" and "metarationality" to the point of fixation. They're exactly the type of words that invite pre/post confusion and allows pre-rats to masquerade as post-rats if there's insufficient gatekeeping (and there usually is).
And yet, there's something in the desire of folks like myself to point to a place that says "hey, I think rationalists are doing a l...
Thanks for the reply, sorry I just saw this. It was indeed my goal to talk about existing ideas in a nontechnical way, which is why I didn't frame things in terms of model expansion, etc.. Beyond that however, I am confused by your reply, as it seems to make little contact with my intended argument. You state that I recommend "just ignoring" the issue, and suggest that I endorse double-counting as OK. Can you explain what parts of the post led you to believe that was my recommendation? Because that is very much not my intended message!
(I stress that I'm no...
I don't have great faith in the epistemics of postrats as they exist today. My somewhat limited experience of post-rattish meetups and TPOT is that it's a mix of people who are either indistinguishable from rats (and indeed lots are just rats), people who are mostly normie-ish and don't think about epistemics, and totally woo people who are obviously wrong about lots of things (astrology, karma, UFOs) with no epistemic gain.
My guess is what's happening is that the rationalist frame is 80% correct, and the best alternative is normie epistemics in the remain...
I agree that there are ways to define the "capabilities"/"intelligence" of a system where increasing them won't necessarily increase its long-term coherence. Primarily: scaling its ability to solve problems across all domains except the domain of decomposing new unsolved problems into combinations of solved problems. I. e., not teaching it (certain kinds of?) "agency skills". The resultant entity would have an abysmal time horizon (in a certain sense), but it can be made vastly capable, including vastly more capable than most people at most tasks. However,...
So in my mainline, I expect a large number (thousands/millions, as per today) of goal-separate "AIs" which are at identical intelligence levels, rather than 1 or a handful (~20) of AIs, because same weights still amount to different AIs with different goals.
We did attempt to model this. During one phase of the game, the frontier model wasn't deployed publicly and so one AI with nearly 50% of the world's compute was far ahead of the rest, but during another phase the general public had access to frontier AIs. The general public's AIs didn't end up changi...
The game in question was about as decentralized as you expect, I think? But, importantly, compute is very unevenly distributed. The giant army of AIs running on OpenAI's datacenters all have the same system prompt essentially (like, maybe there are a few variants, but they are all designed to work smoothly together towards OpenAI's goals) and that army constitutes 20% of the total population of AIs initially and at one point in the game a bit more than 50%.
So while (in our game) there were thousands/millions of different AI factions/goals of similar ...
One point that Palmer makes in the book is that being an atheist during the Renaissance was akin to being a conspiracy theorist today. Religion had moderately good explanations for things. Why are animals adapted to their way of living? Because God made them so. Atheism, on the other hand, had no good explanations.
But also check out this book: https://en.wikipedia.org/wiki/The_Cheese_and_the_Worms
I recently found out that one of the reasons it took Darwinism so long to emerge given animal husbandry was that many people had a similar idea, but the universe was considered to be too young for it to lead to anything but minor variation. It was shortly after cosmological times were pushed way way back that the idea took on more steam.
i haven't even skimmed the anthropic paper and i have a high prior that they are being bad at philosophy but also: i think there is plausibly a real mistake LW-ers are making around coherence too, downstream of a conflation of two different notions, as i outline here: https://www.lesswrong.com/posts/jL7uDE5oH4HddYq4u/raemon-s-shortform?commentId=WBk9a7TEA5Benjzsu
with like my guess being that: you are saying something straightforwardly true given one notion here but they are making claims given the other notion at least in some cases, though also they might...
I think my personal beliefs would say "it's not very useful" or something. I think the "ban AGI locally" plan is dependent on a pretty specific path to be useful and I don't read the current phrasing as ruling out "One country Bans it and also does some other stuff in conjunction." (actually, upon reflection I'm not that confident I know what sort of scenario you have in mind here)
I think that a slowdown that is in the neighborhood of "ban AI development temporarily near but not after max-controllable AI" could potentially be very impactful. Banning AI dev...
Seems worth noting that the ECI seems like it might be biased away from the ways that Claude is good; as per this post by Epoch, the first two PCs of their benchmark data correspond to "general capability" and "claudiness", so ECI (which is another, but different, 1-dimensional compression of their benchmark data) seems like it should also underrate Claude.
h/t @jake_mendel for discussion
Yes, ty. Though actually I've also clarified that both world-models and goal-models predict both observations and actions. In my mind it's mainly a difference in emphasis.
It feels like they are trying very hard to discredit the standard story of alignment. They use vague concepts to then conclude that this is evidence for some weird "industrial accidents" story; what is that supposed to mean? This doesn't sound like scientific inference to me, but very much like motivated thinking. Reminds me of that "against counting arguments" post, where they also try very hard to get some "empirical data" for something that superficially sounds related in order to make a big conceptual point.
Epistemic status: I didn't read the paper but I read the blog post.
In 1976, the essay "Artificial Intelligence meets Natural Stupidity" pointed out a failure mode into which AI researchers can fall. I fear this is another example, 50 years later. It goes as follows:
Hmmm... Good question. Let's do the Bayesian thing.
I think it's because of our priors. In the normal city case, we already know a lot about human behavior; we have built up very strong priors that constrain the hypothesis space pretty hard. The hotter-chili hypothesis I came up with seems plausible; there are others, but the space of them is rather tightly constrained. So we can do forward modelling fairly well. Whereas in the Doomsday Argument case, or my artificial analogy to it involving 10-minute lifespans and something very weird happening, our curren...
This seems closely related to John Perry's "Problem of the essential indexical" (although it's been like ten years since I read it): https://dl.booksee.org/foreignfiction/581000/4897fc2fba1f8af4ea7db3d9654bbbb3.pdf/_as/%255Bperry_john%255D_the_problem_of_the_essential_indexica%28booksee.org%29.pdf
Basically, he argues that there are certain "locating beliefs" (things like "I am John Perry", "Now is noon," "Here is the trail that leads out of the woods") that are logically ineliminable. No matter how you try, you need an indexical in there somewhere to...
Agreed, on all counts
I think I disagree with the counter-examples. The Dead Hand system was created in a conflict with other countries; it can be viewed as a mostly forced risk. AI races between companies within a single country, on the other hand, are more of a “self-destruction” pattern. Capitalism creates rivals (and therefore races, with more risk and less global safety) within one country, more than other economic systems might.
- The Soviets had the Dead Hand system, which potentially contributed to x-risk from "AI" due to the risk of nuclear warfare, not that the system was particularly intelligent.
In reference to Said criticizing Benquo, you seem to be ignoring the crucial point, which is that Said was right. Benquo made the simple claim that knowing about yeast is useful in everyday life, and this claim is clearly wrong, regardless of what either of them said about it. Benquo could have admitted this, or he could have found another example. But instead he doubled down on being wrong, which naturally leads to frustration. It's concerning that you picked this conversation as an example, as if you can't tell.
I'm also confused by the "asymmetric effor...
Right, I think positional goods and the like are among several distortions of the basic premises of the welfare theorems (and indeed empirically many people are sad, lonely, etc. in our modern world of abundance) - I sometimes think those theorems imply a sort of normative 'well, just don't worry about other people's stuff!' (i.e. non-envy, which is, after all, a deadly sin). cf Paretotopia, which makes exactly this normative case in the AI futurism frame.
"Armchair psychologizing about which of my rhetorical opponents' cognitive deficits cause them to fail to agree with me" is by far my least favorite kind of LessWrong post, and the proposed solution to the "problem" ("recruit smarter people to the field") is not interesting or insightful.
Can you turn this argument into a mechanistic estimate of the model's accuracy? (You'd need to do things like deduce correlations from the weights, rather than just observe them empirically - but it seems like you're getting close.)
The AI player also had AIs with different system prompts frequently come into conflict with each other.
My expectation is that for future AIs, as today, many of the goals of an AI will come from the scaffolding / system prompt rather than from the weights directly -- and the "goals" from the Constitution / model spec act more as limiters / constraints on a mostly prompt or scaffolding-specified goal.
So in my mainline, I expect a large number (thousands/millions, as per today) of goal-separate "AIs" which are at identical intelligence levels rather than 1 or a ...
W9 work seems to be gaining in popularity, I think possibly for this reason.
(W9 is the USA tax form for "independent contractor", as opposed to W2 which has a slightly(?) tougher compliance burden about how to go about firing) (there are other words for this in other jurisdictions, probably?)
Great post!
Anyone still confused after this should go read that post.
If we had some serious evidence that souls not only exist but also precede the existence of a person, that a soul is somehow chosen to be instantiated in a newborn, then it would be understandable why we could assume that being born is a random sample
Only if there are already quadrillions of souls. Which would be necessary later if and when we go to the stars, but seems like more angels than are required to dance on the head of this particular pin right now. Cultures that believe in reincarnation tend to believe rebirths occur fairly frequently, with a short...
Yes, performing a predicted random sample over predicted future humans according to some model, or a Bayesian distribution of models, is fine; but in the Bayesian model-distribution case, if you have large uncertainty within your hypothesis distribution about how many there will be, that will dominate the results. What breaks causality is attempting to perform an actual random sample over the actual eventual number of future humans before that information is actually available, and then using frequentist typicality arguments based on that hypothetical invalid sampling process to try to smuggle information from the future into updating your hypothesis distribution.
PSA to those with flat or otherwise imperfect feet:
I finally got custom-made orthotics made, and they're very different / way more correction than I expected compared to off-the-shelf orthotics, in a good way. Highly recommended!
I think the number of people who would come to a march in their own city on a weekend or a workday evening is significantly higher than the number of people who would travel cross-country for it.
I think a 100k march would be a signal to policymakers and would make the news anyway, whether it's in NYC, SF, Washington, or the middle of the desert.
Also, IMO, it would make more sense to start with a lower threshold - 10k for example.
What is a good mathematical framing for ‘personas’?
The distribution of meaningfully distinct (i.e. perplexity-reducing) token-generation contexts / processes found in the training material (principally pretraining, supplemented by later training).
Note that "wikipedia article" is a valid persona. So is "academic paper". Not all of them represent single humans. Some even represent automated processes.
What are our existing techniques for discovering persona archetypes?
Read Jung? (I'm not being flippant, this is a serious suggestion: personas are a world model of human behavior, and this is not a new subject.)
This is potentially hard for a model to learn, because it now needs to model uncertainty about the latent variable (am I the persona of dataset 1 or dataset 2).
I think modelling a great many different personas and keeping them all straight is a core ability / capability spike of an LLM. Base models (the model itself, not the personas it simulates) are far, far better at it than any human actor. So I would expect it to model dataset 1 and dataset 2 as two different personas, and be able to switch between them easily. Which is probably not the behavior the people applying the training to it were intending.
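As a toy illustration of the latent-variable bookkeeping in question (my own sketch, not from the thread): treat "which dataset/persona generated this text?" as a hidden variable and track the posterior over it as tokens come in.

```python
import numpy as np

vocab = ["hello", "greetings", "yo"]
# Two "personas" = two token distributions (stand-ins for dataset 1 vs dataset 2).
persona_probs = {
    "dataset_1": np.array([0.7, 0.2, 0.1]),
    "dataset_2": np.array([0.1, 0.3, 0.6]),
}
prior = {"dataset_1": 0.5, "dataset_2": 0.5}

def posterior_over_personas(tokens):
    """P(persona | observed tokens), assuming tokens are i.i.d. draws."""
    log_post = {}
    for name, probs in persona_probs.items():
        loglik = sum(np.log(probs[vocab.index(t)]) for t in tokens)
        log_post[name] = np.log(prior[name]) + loglik
    z = np.logaddexp(*log_post.values())
    return {name: float(np.exp(lp - z)) for name, lp in log_post.items()}

print(posterior_over_personas(["yo", "yo"]))        # mostly dataset_2
print(posterior_over_personas(["hello", "hello"]))  # mostly dataset_1
```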
and more facts stored about the <|assistant|> character
See the entire alignment pretraining research agenda for more on this.
To be clear, you believe that making aligned narrow AI is easy, regardless of how intelligent it is? Even something more useful than a chess bot, like a theorem prover? And the only reason AIs will be goal-oriented to a dangerous extent is because people will intentionally make them like that, despite obvious risks? I'm not saying they won't, but is that really enough to justify high p(doom)? When I was reading "The Problem", I was sure that goal-oriented AI was seen as inevitable for some reason deeper than "Goal-oriented behavior is economically useful"...