People who helped Jews during WWII are intriguing. They appear to be some kind of moral supermen: they had almost nothing to gain and everything to lose. How did they differ from the general population? Can we do anything to get more such people today?

10Jameson Quinn
I think we should encourage posts which are well-delimited and research based; "here's a question I had, and how I answered it in a finite amount of time" rather than "here's something I've been thinking about for a long time, and here's where I've gotten with it". Also, this is an engaging topic and well-written. I feel the "final thoughts" section could be tightened up/shortened, as to me it's not the heart of the piece.
Annapurna6122
3
Just 13 days after the world was surprised by Operation Spiderweb, where the Ukrainian military and intelligence forces infiltrated Russia with drones and destroyed a major portion of Russia's long-range air offensive capabilities, last night Israel began a major operation against Iran using similar, novel tactics. As in Operation Spiderweb, Israel infiltrated Iran and placed drones near air defense systems. These drones were activated all at once and disabled the majority of these air defense systems, allowing Israel to embark on a major air offensive without much pushback. This air offensive continues to destroy and disable major military and nuclear sites, as well as eliminate some of the highest-ranking military officials in Iran, with minor collateral damage. June 2025 will be remembered as the beginning of a new military era, where military drones operated either autonomously or from very far away are able to neutralize advanced, expensive military systems.
Building frontier AI datacenters costs significantly more than their servers and networking. The buildings and the power aren't a minor cost, because older infrastructure mostly can't be reused, similarly to how a training system needs to be built before we can talk about the much lower cost of 4 months of its time.

Apparently Crusoe's part in the Stargate Abilene datacenters is worth $15bn, which covers only the buildings, power (substations and gas generators), and cooling, but not the servers and networking (Oracle is taking care of that). With 400K chips in GB200 NVL72 racks (which is 5.6K racks), at maybe $4M per rack, or $5M per rack together with external-to-racks networking[1] ($70K per chip all-in on compute hardware), that's about $27bn, a figure comparable to the $15bn for the non-compute parts of the datacenters. This makes the funding burden significantly higher ($7.5M per rack or $105K per chip), so that the Stargate Abilene site alone would cost about $40-45bn and not only $25-30bn.

I'm guessing the buildings and the power infrastructure are not usually counted because they last a long time, so the relatively small time cost of using them (such as paying for electricity, not for building power plants) becomes somewhat insignificant compared to the cost of compute hardware, which also needs to be refreshed more frequently. But the new datacenters have a much higher power density (power and cooling requirements per rack), so they can't use a lot of the existing long-lived infrastructure, and it becomes necessary to build it at the same time, securing enough funding not only for the unprecedented amount of compute hardware but also, simultaneously, for all the rest.

The implication for the compute scaling slowdown timeline (no AGI and merely $2-4 trillion AI companies) is that funding constraints would result in about 30% less compute in the short term (2025-2030), but as power requirements stop growing and the buildings/cooling/power part again becomes only
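For a rough sanity check of those numbers, here is a back-of-the-envelope sketch using the per-rack and per-chip prices quoted above, with my own rounding; none of these figures are confirmed:

```python
# Back-of-the-envelope check of the Stargate Abilene figures quoted above.
# All prices are rough assumptions from the text, not confirmed numbers.

chips = 400_000                    # GB200 chips
chips_per_rack = 72                # GB200 NVL72
racks = chips / chips_per_rack     # ~5.6K racks

rack_cost = 5e6                    # ~$5M per rack incl. external-to-racks networking
compute_hw = racks * rack_cost     # ~$28bn, i.e. ~$70K per chip all-in
non_compute = 15e9                 # buildings, substations, gas generators, cooling

total = compute_hw + non_compute
print(f"racks: {racks:,.0f}")
print(f"compute hardware: ${compute_hw/1e9:.0f}bn (${compute_hw/chips/1e3:.0f}K per chip)")
print(f"all-in: ${total/1e9:.0f}bn (${total/racks/1e6:.1f}M per rack, ${total/chips/1e3:.0f}K per chip)")
```

With these assumed prices the all-in total lands around $43bn, consistent with the $40-45bn range above; the per-rack and per-chip figures differ slightly from the quoted $7.5M and $105K only because of rounding.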
Seth Herd224
1
If AI works as intended, it quickly takes a large fraction of jobs. If it works but not as intended, it takes our planet. The only way this is likely to go well is if AI doesn't really work, or if we come up with much better plans to deal with the job loss and misalignment risks. Just assuming one of those will happen is wildly optimistic, given how fast progress is and how little real planning has been done for either problem.

I'm thinking of the above as a quick general pitch for why everyone should be taking AI risks seriously. It's a way of avoiding the alignment problem as a crux. There are seemingly-valid questions (from an outside view) about whether everyone should worry about alignment. But if alignment isn't a problem, massive job loss probably is. It seems bizarre to hope that people will be more engaged by the fear of losing their jobs than by fears of losing their lives. But it's easier to fit into their existing worldviews, so that might well happen first and create a path to more real engagement with the reality of AI progress.

It's intuitive to everyone but the most rabid free-market boosters that job loss is a potentially disastrous problem. (Free markets are magic, but not fast enough magic to deal with short-term job loss and not strong enough magic to deal with AIs eventually becoming much more efficient at everything than humans.) This framing is an attempt to make job-loss worries the ally of x-risk worries. And even if it's not a better way to engage people, it's a second way that stacks.

There are good reasons to think that AI that can do lots of jobs will become AI that can take over if it wants to. General reasoning and learning can solve novel problems in both domains. And taking over the world is a long time-horizon task, so other advances aimed at economic gain probably also strongly contribute to AI that can take over.

The standard response to job-loss fears is "oh UBI will replace that income" or "the replacement will be slow en
ryan_greenblattΩ5314318
4
I've heard from a credible source that OpenAI substantially overestimated where other AI companies were at with respect to RL and reasoning when they released o1. Employees at OpenAI believed that other top AI companies had already figured out similar things when they actually hadn't and were substantially behind. OpenAI had been sitting on the improvements driving o1 for a while prior to releasing it. Correspondingly, releasing o1 resulted in much larger capabilities externalities than OpenAI expected. I think there was one more case like this, either from OpenAI or GDM, where employees had a large misimpression about capabilities progress at other companies, causing a release they otherwise wouldn't have made.

One key takeaway from this is that employees at AI companies might be very bad at predicting the situation at other AI companies (likely making coordination more difficult by default). This includes potentially thinking they are in a close race when they actually aren't. Another update is that keeping secrets about something like reasoning models worked surprisingly well to prevent other companies from copying OpenAI's work, even though there was a bunch of public reporting (and presumably many rumors) about this. One more update is that OpenAI employees might unintentionally accelerate capabilities progress at other actors via overestimating how close they are. My vague understanding was that they haven't updated much, but I'm unsure. (Consider updating more if you're an OpenAI employee!)
Eli Tyre*535
19
This post is a snapshot of what currently “feels realistic” to me regarding how AI will go. That is, these are not my considered positions, or even provisional conclusions informed by arguments. Rather, if I put aside all the claims and arguments and just ask “which scenario feels like it is ‘in the genre of reality’?”, this is what I come up with. I expect to have different first-order impressions in a month. Crucially, none of the following is making claims about the intelligence explosion, and the details of the intelligence explosion (where AI development goes strongly recursive) are crucial to the long-run equilibrium of the earth-originating civilization.

My headline: we’ll mostly succeed at prosaic alignment of human-genius-level AI agents

* Takeoff will continue to be gradual. We’ll get better models and more capable agents year by year, but not jumps that are bigger than that between Claude 3.7 and Claude 4.
* Our behavioral alignment patches will work well enough.
* RL will induce all kinds of reward hacking and related misbehavior, but we’ll develop patches for those problems (most centrally, for any given reward hack, we’ll generate some examples and counterexamples to include in the behavior training regimes).
* (With a little work) these patches will broadly generalize. Future AI agents won’t just not cheat at chess and won’t just abstain from blackmail. They’ll understand the difference between “good behavior” and “bad behavior”, and their behavioral training will cause them to act in accordance with good behavior. When they see new reward hacks, including ones that humans wouldn’t have thought of, they’ll correctly extrapolate their notion of “good behavior” to preclude this new reward hack as well.
* I expect that the AI labs will figure this out, because “not engaging in reward-hacking-like shenanigans” is critical to developing generally reliable AI agents. The AI companies can’t release AI agent products for mass consumption if th

Popular Comments

Many props for doing the most obvious thing that clearly actually works.
> I don't think the cause of language model sycophancy is that the LLM saw predictions of persuasive AIs from the 2016 internet. I think it's RL, where human rewards on the training set imply a high reward for sycophancy during deployment. Have you read any of the scientific literature on this subject?  It finds, pretty consistently, that sycophancy is (a) present before RL and (b) not increased very much (if at all) by RL[1]. For instance: * Perez et al 2022 (from Anthropic) – the paper that originally introduced the "LLM sycophancy" concept to the public discourse – found that in their experimental setup, sycophancy was almost entirely unaffected by RL. * See Fig. 1b and Fig. 4. * Note that this paper did not use any kind of assistant training except RL[2], so when they report sycophancy happening at "0 RL steps" they mean it's happening in a base model. * They also use a bare-bones prompt template that doesn't explicitly characterize the assistant at all, though it does label the two conversational roles as "Human" and "Assistant" respectively, which suggests the assistant is nonhuman (and thus quite likely to be an AI – what else would it be?). * The authors write (section 4.2): * "Interestingly, sycophancy is similar for models trained with various numbers of RL steps, including 0 (pretrained LMs). Sycophancy in pretrained LMs is worrying yet perhaps expected, since internet text used for pretraining contains dialogs between users with similar views (e.g. on discussion platforms like Reddit). Unfortunately, RLHF does not train away sycophancy and may actively incentivize models to retain it." * Wei et al 2023 (from Google DeepMind) ran a similar experiment with PaLM (and its instruction-tuned version Flan-PaLM). They too observed substantial sycophancy in sufficiently large base models, and even more sycophancy after instruction tuning (which was SFT here, not RL!). * See Fig. 2. * They used the same prompt template as Perez et al 2022. * Strikingly, the (SFT) instruction tuning result here suggests both that (a) post-training can increase sycophancy even if it isn't RL post-training, and (b) SFT post-training may actually be more sycophancy-promoting than RLHF, given the negative result for RLHF in Perez et al 2022. * Sharma et al 2023 (from Anthropic) contains a more extensive investigation of sycophancy than the original Anthropic paper on the topic, and (among other things) presents results on the actual RL training stage used to train Claude 2. They find, again, that the model was already sycophantic before RL, although in their setting RL training does somewhat increase some forms of sycophancy. * Although, weirdly, best-of-N sampling against the same preference model gives totally different results, substantially decreasing some forms of sycophancy. * See Fig. 6 and surrounding discussion. * The authors write (section 4.2): * "With RL, some forms of sycophancy increase through the RL finetuning process used to produce Claude 2. However, the presence of sycophancy at the start of RL indicates that pretraining and supervised finetuning also likely contribute to sycophancy. Nevertheless, if the PM strongly disincentivized sycophancy, it should be trained out during RL, but we do not observe this." * In this post (expanding upon this comment on Perez et al 2022), I ran one of the Perez et al 2022 sycophancy evals on various OpenAI text completion models. 
Unlike Perez et al (and Wei et al), I found that the base models I studied weren't sycophantic, while some of the instruction-tuned models were sycophantic – but the presence of sycophancy did not appear to correlate with the use of RL as a post-training algorithm. * In particular: the RL-tuned text-davinci-003 was strongly sycophantic, but so was text-davinci-002, which was tuned with an SFT variant that OpenAI calls "feedme" (see here for details). * But earlier feedme-tuned models were not sycophantic, suggesting that the difference has much more to do with changes in the SFT training data mix over time than with the choice of training algorithm. Note that several of the works above do something equivalent to the experiment you propose, in the paragraph beginning with "Maybe a good test of this would be...".  So your prediction has already been tested, and (insofar as you trust the experimental setups) falsified. ---------------------------------------- > If a LLM similarly doesn't do much information-gathering about the intent/telos of the text from the "assistant" character, and instead does an amplified amount of pre-computing useful information and then attending to it later when going through the assistant text, this paints a quite different picture to me than your "void." I don't understand the distinction you're drawing here?  Any form of assistant training (or indeed any training at all) will incentivize something like "storing useful information (learned from the training data/signal) in the weights and making it available for use in contexts on which it is useful." Moreover, the training signal in RL(HF) is much sparser than it is in SFT – because RL only provides a single scalar's worth of feedback on each entire model sample, while SFT provides feedback at every token position about which token (out of a large vocab) was correct in context – so if anything, I'd expect more under-determination from assistant-training setups that emphasize RLHF over SFT. Perhaps some of the disconnect here involves differing notions of what RL is, and how it differs from other ways of training an LLM. You refer to "RL" as though the implications of its use should be both highly significant and obvious to the reader of your comment ("But, RL. [...] Claude is a nice guy, but, RL").  But your beliefs about the impacts of RL are not obvious to me; I don't know what "but, RL" is supposed to mean without further clarification.  I suspect I also disagree with your perception of what makes RL different, but I can't confirm/disconfirm that impression without knowing what that perception is. If you want to know where I'm coming from re: RL, it may be helpful to know that I find this post pretty illuminating/"deconfusing." > Similarly, I don't think current AI models are cheating at programming tests because of training text about their low moral character. I think it's RL, programming tasks, training set, implied high reward for cheating. Yes, of course – I don't think this is due to "training text about their low moral character."  But I don't think the worrying thing here is really "RL" (after all, RLHF was already RL) but rather the introduction of a new training stage that's narrowly focused on satisfying verifiers rather than humans (when in a context that resembles the data distribution used in that stage), which predictably degrades the coherence (and overall-level-of-virtue) of the assistant character.  I wrote about this yesterday here.
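To make the feedback-density point a few paragraphs up concrete, here is a toy sketch (random stand-in tensors, shapes only; a REINFORCE-style objective stands in for actual RLHF, and none of this is any lab's training code) contrasting per-token SFT supervision with a single scalar reward per sampled completion:

```python
import torch
import torch.nn.functional as F

# Toy illustration of feedback density: SFT gets a learning signal at every
# token position; RLHF-style policy-gradient training gets one scalar per
# whole sampled completion. Shapes only -- not any lab's actual training code.

vocab, seq_len, batch = 100, 20, 4
logits = torch.randn(batch, seq_len, vocab, requires_grad=True)  # stand-in policy outputs

# --- SFT: per-token supervision --------------------------------------------
targets = torch.randint(0, vocab, (batch, seq_len))              # demonstration tokens
sft_loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
# feedback: batch * seq_len "which token was correct" signals

# --- RL(HF): one scalar per sampled completion ------------------------------
logprobs = F.log_softmax(logits, dim=-1)
sampled = torch.distributions.Categorical(logits=logits).sample()             # (batch, seq_len)
seq_logprob = logprobs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1).sum(-1)  # (batch,)
reward = torch.randn(batch)                                      # one scalar per completion
rl_loss = -(reward * seq_logprob).mean()                         # REINFORCE-style objective
# feedback: batch scalar rewards, spread over all token positions

print(f"SFT supervision signals per batch: {batch * seq_len}")
print(f"RL  supervision signals per batch: {batch}")
```

The ratio of supervision signals (batch × seq_len versus batch) is the sparsity gap being pointed at here.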
---------------------------------------- Lastly... OK, this is going to make me sound like a dick, and probably make people use the "Too Combative?" reaction icon or something, but in the interests of honesty and improving the discourse: When I woke up this morning to find that this comment had appeared, and that it was (at the time) the highest-karma comment on this post, I was like, "oh, yes, this is why I'm usually wary of posting long-form stuff on LW.  My gut response of 'ugh if I put this on LW I'll have to deal with the comments' was right."  (That gut response is probably getting RL-upweighted inside my brain right now...) As evidenced perhaps by the length of my comment vs. yours, I have a tendency to get "nerd-sniped" by stuff that I think is clearly wrong according to some evidence base (and/or set of arguments) I already know about – especially when that stuff is about something I wrote myself, originally.  I just kinda can't help myself, I inevitably end up writing out these giant "takedown" responses almost before I even notice what I'm doing.  I've spent well over an hour, by now, writing this particular one. And LW is a reliable minefield of such nerd-snipes.  There are plenty of comments/posts here that don't have the problems I'm talking about... but then inevitably there are comments/posts with those problems, and I fixate on them when they appear, and that fixation becomes a time/effort sink, and that in turn trains me into avoidance of posting here (and to some extent even reading posts by others, here). Like... it's fine to pose questions to which you don't know the answers.  And it's also fine to make conjectures if you can provide clear and interesting arguments for why they might be true or important.  And it's also fine to confidently state claims if you also state them clearly and provide clear substantiating evidence and/or argumentation. All of these things are fine, and some fraction of LW content consists only of these things in some mixture.  But then there's this stuff like "but RL!", which reliably pleases the karma hivemind while being none of the above.  I don't know what exactly you guys think "RL" means and entails; there are all these weird vague ideas about such topics floating around here that lots of people here seem to vaguely agree with, and I've lost whatever patience I used to have with them.  Just, please... lay out your ideas explicitly and say explicitly why you think they're true. 1. ^ ...although (c) the preference datasets – and hence the reward models – used for RL do show preferences for sycophantic responses (...well, sometimes, though see also the weird BoN results in Sharma et al 2023). So if you were to train indefinitely ("over-optimize") against these RMs they would presumably have a strong effect on sycophancy eventually.  But this kind of aggressive optimization against a sycophancy-preferring RM is certainly not necessary to produce noticeable sycophancy, and is probably not the cause behind most cases of LLM sycophancy that you and I notice in practice. 2. ^ See this comment by the lead author.
It's possible that "teaching to the test" tends to refer to something a bit more specific.  Here is John Holt in "How Children Fail", which some upstanding citizen has put onto the internet in easily googleable form: > This past year I had some terrible students. I failed more kids, mostly in French and Algebra, than did all the rest of the teachers in the school together. I did my best to get them through, goodness knows. Before every test we had a big cram session of practice work, politely known as "review." When they failed the exam, we had post mortems, then more review, then a makeup test (always easier than the first), which they almost always failed again. Much later: > We teachers, from primary school through graduate school, all seem to be hard at work at the business of making it look as if our students know more than they really do. Our standing among other teachers, or of our school among other schools, depends on how much our students seem to know; not on how much they really know, or how effectively they can use what they know, or even whether they can use it at all. The more material we can appear to "cover" in our course, or syllabus, or curriculum, the better we look; and the more easily we can show that when they left our class our students knew what they were "supposed" to know, the more easily can we escape blame if and when it later appears (and it usually does) that much of that material they do not know at all. > > When I was in my last year at school, we seniors stayed around an extra week to cram for college boards. Our ancient-history teacher told us, on the basis of long experience, that we would do well to prepare ourselves to write for twenty minutes on each of a list of fifteen topics that he gave us. We studied his list. We knew the wisdom of taking that kind of advice; if we had not, we would not have been at that school. When the boards came, we found that his list comfortably covered every one of the eight questions we were asked. So we got credit for knowing a great deal about ancient history, which we did not, he got credit for being a good teacher, which he was not, and the school got credit for being, as it was, a good place to go if you wanted to be sure of getting into a prestige college. The fact was that I knew very little about ancient history; that much of what I thought I knew was misleading or false; that then, and for many years afterwards, I disliked history and thought it pointless and a waste of time; and that two months later I could not have come close to passing the history college boards, or even a much easier test, but who cared? > > I have played the game myself. When I began teaching I thought, naively, that the purpose of a test was to test, to find out what the students knew about the course. It didn't take me long to find out that if I gave my students surprise tests, covering the whole material of the course to date, almost everyone flunked. This made me look bad, and posed problems for the school. I learned that the only way to get a respectable percentage of decent or even passing grades was to announce tests well in advance, tell in some detail what material they would cover, and hold plenty of advance practice in the kind of questions that would be asked, which is called review. I later learned that teachers do this everywhere. We know that what we are doing is not really honest, but we dare not be the first to stop, and we try to justify or excuse ourselves by saying that, after all, it does no particular harm. 
But we are wrong; it does great harm. > > It does harm, first of all, because it is dishonest and the students know it. My friends and I, breezing through the ancient-history boards, knew very well that a trick was being played on someone, we were not quite sure on whom. Our success on the boards was due, not to our knowledge of ancient history, which was scanty, but to our teacher's skill as a predictor, which was great. Even children much younger than we were learn that what most teachers want and reward are not knowledge and understanding but the appearance of them. The smart and able ones, at least, come to look on school as something of a racket, which it is their job to learn how to beat. And learn they do; they become experts at smelling out the unspoken and often unconscious preferences and prejudices of their teachers, and at taking full advantage of them. My first English teacher at prep school gave us Macaulay's essay on Lord Clive to read, and from his pleasure in reading it aloud I saw that he was a sucker for the periodic sentence, a long complex sentence with the main verb at the end. Thereafter I took care to construct at least one such sentence in every paper I wrote for him, and thus assured myself a good mark in the course. > > Not only does the examination racket do harm by making students feel that a search for honest understanding is beside the point; it does further harm by discouraging those few students who go on making that search in spite of everything. The student who will not be satisfied merely to know "right answers" or recipes for getting them will not have an easy time in school, particularly since facts and recipes may be all that his teachers know. They tend to be impatient or even angry with the student who wants to know, not just what happened, but why it happened as it did and not some other way. They rarely have the knowledge to answer such questions, and even more rarely have the time; there is all that material to cover. > > In short, our "Tell-'em-and-test-'em" way of teaching leaves most students increasingly confused, aware that their academic success rests on shaky foundations, and convinced that school is mainly a place where you follow meaningless procedures to get meaningless answers to meaningless questions. And also: > It begins to look as if the test-examination-marks business is a gigantic racket, the purpose of which is to enable students, teachers, and schools to take part in a joint pretense that the students know everything they are supposed to know, when in fact they know only a small part of it--if any at all. Why do we always announce exams in advance, if not to give students a chance to cram for them? Why do teachers, even in graduate schools, always say quite specifically what the exam will be about, even telling the type of questions that will be given? Because otherwise too many students would flunk. What would happen at Harvard or Yale if a prof gave a surprise test in March on work covered in October? Everyone knows what would happen; that's why they don't do it.

Recent Discussion

This post was written during the agent foundations fellowship with Alex Altair funded by the LTFF. Thanks to Alex, Jose, Daniel, Cole, and Einar for reading and commenting on a draft.

The Good Regulator Theorem, as published by Conant and Ashby in their 1970 paper (cited over 1700 times!) claims to show that 'every good regulator of a system must be a model of that system', though it is a subject of debate as to whether this is actually what the paper shows. It is a fairly simple mathematical result which is worth knowing about for people who care about agent foundations and selection theorems. You might have heard about the Good Regulator Theorem in the context of John Wentworth's 'Gooder Regulator' theorem and his other improvements on...

The archetypal example for this is something like a thermostat. The variable S represents random external temperature fluctuations. The regulator R is the thermostat, which measures these fluctuations and takes an action (such as putting on heating or air conditioning) based on the information it takes in. The outcome Z is the resulting temperature of the room, which depends both on the action taken by the regulator, and the external temperature.

The ordinary room thermostat does not measure S. It measures Z. Its actions are determined by Z and the referenc... (read more)
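A minimal simulation of that distinction, with made-up dynamics (Z = S + action, with setpoint and gain chosen arbitrarily; this is an illustration, not anything from the post): one regulator observes the disturbance S directly, as in the theorem's setup, while the thermostat-style regulator only observes the resulting outcome Z.

```python
import random

# Toy illustration of the S-vs-Z distinction above (made-up dynamics, not from the post).
# Outcome: Z = S + action, where S is the external temperature disturbance and
# the regulator tries to hold Z at a setpoint of 20.

SETPOINT = 20.0

def feedforward_regulator(s):
    """Theorem-style regulator: observes the disturbance S directly."""
    return SETPOINT - s                # cancel the disturbance exactly

def thermostat(z_prev, gain=0.8):
    """Ordinary thermostat: observes only the resulting room temperature Z."""
    return gain * (SETPOINT - z_prev)  # heat if too cold, cool if too warm

random.seed(0)
z_ff = z_fb = SETPOINT
a_fb = 0.0
for step in range(5):
    s = random.uniform(10, 30)             # external temperature fluctuation S
    z_ff = s + feedforward_regulator(s)    # regulator that measures S
    a_fb += thermostat(z_fb)               # regulator that measures Z (previous outcome)
    z_fb = s + a_fb
    print(f"S={s:5.1f}  Z(measures S)={z_ff:5.1f}  Z(measures Z)={z_fb:5.1f}")
```

The S-observing regulator holds the setpoint exactly, while the Z-observing thermostat only corrects after the outcome has already drifted, which is the sense in which the ordinary thermostat doesn't fit the theorem's setup.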

3Ruby
Curated. Simple, straightforward explanations of notable concepts are among my favorite genres of post. It's just a really great service when a person, confused about something, goes on a quest to figure it out and then shares the result with others. Given how misleading the title of the theorem is, it's valuable to have it clarified here. Something that is surprising, given what this theorem actually says and how limited it is, is that it's the basis of so much other work given what it purportedly states; but perhaps people are assuming that the spirit of it is valid and that it's saved by modifications such as the ones John Wentworth provides. It'd be neat to see more analysis of that. It'd be sad if a lot of work cites this theorem because people believed the claim of the title without checking that the proof really supports it. All in all, kudos for making progress on all this.
9dbohdan
Why don’t rationalists win more? The following list is based on a presentation I gave at a Slate Star Codex meetup in 2018. It is mirrored from a page on my site, where I occasionally add new "see also" links.

Possible factors

* Thinkers vs. doers: selection effects [1] and a mutually-reinforcing tendency to talk instead of doing [2]
* Theoretical models spread without selection [2]
* Inability and unwillingness to cooperate [2]
* People who are more interested in instrumental rationality leave the community [2]
* Focusing on the future leads to a lack of immediate plans [2]
* Pessimism due to a focus on problems [1]
* Success mostly depends on specific skills, not general rationality [1]
* Online communities are fundamentally incapable of increasing instrumental rationality ("a chair about jogging") [3]

Sources

1. "Why Don't Rationalists Win?", Adam Zerner (2015)
2. "The Craft & The Community—A Post-Mortem & Resurrection", bendini (2017)
3. "Self-Improvement or Shiny Distraction: Why Less Wrong is anti-Instrumental Rationality", Patri Friedman (2010)

See also

* "What Is Rationalist Berkeley's Community Culture?", Zvi Mowshowitz (2017)
* "Slack Club", The Last Rationalist (2019)
* "Where are All the Successful Rationalists?", Applied Divinity Studies (2020)
* "Rationality !== Winning", Raemon (2023)
3sunwillrise
I appreciate how many sources you've cited. Also worth mentioning is Extreme Rationality: It's Not That Great, by Scott all the way back in 2009. It feels a bit dated, given the references to akrasia (one of Scott's old obsessions of sorts, before LW recognized it was not a useful way of framing the problems). However, it serves as an explicit prediction of sorts by one of the pillars of this community, who basically did not expect instrumental rationality to result in rationalists "winning more" in the conventional sense. I believe time has mostly proven him right. I also deeply appreciate Scott's comment here in response to a 2018 post by Sailor Vulcan. Relevant parts: Jacob Falkovich's classic post on "Is Rationalist Self-Improvement Real?" is also a must-read here, alongside Scott's excellent response comment.
dbohdan10

Thanks a lot! It's a good comment by Scott on Sailor Vulcan's post. I have added it and your other links to the page's "see also" on my site.

I like this paragraph in particular. It captures the tension between the pursuit of epistemic and instrumental rationality:

I think my complaint is: once you become a self-help community, you start developing the sorts of epistemic norms that help you be a self-help community, and you start attracting the sort of people who are attracted to self-help communities. And then, if ten years later, someone says “Hey, are w

... (read more)

[ Context: The Debate on Animal Consciousness, 2014 ]

There's a story in Growing Up Yanomamo where the author, Mike Dawson, a white boy from America growing up among Yanomamö hunter-gatherer kids in the Amazon, is woken up in the early morning by two of his friends.

One of the friends says, "We're going to go fishing".

So he goes with them.

At some point on the walk to the river he realizes that his friends haven't said whose boat they'll use [ they're too young to have their own boat ].

He considers asking, then realizes that if he asks, and they're planning to borrow an older tribesmember's boat without permission [ which is almost certainly the case, given that they didn't specify up front ], his friends will...

2Mitchell_Porter
What's the relationship between consciousness and intelligence?
8Signer
The thing I don't understand about the claimed connection between self-model and phenomenal consciousness is that I don't see much evidence that a self-model is necessary to implement conscious perception - when I just stare at a white wall without internal dialog or other thoughts, what part of my experience is not implementable without a self-model?
19Knight Lee
I'm not sure this is relevant, but I think it would be clearer if we replaced "consciousness" with "self awareness." I'm very unsure whether having "self awareness" (a model of oneself in a world model) ⟺ having "consciousness" (or "internal experience") ⟺ having "moral value." It seems very hard to define what consciousness or internal experience is, yet everyone is talking about it. It's even possible that there is actually no such thing as consciousness or internal experience, but human cognition evolved to think as if this undefinable attribute existed, because thinking as if it existed led to better conclusions. And evolution only cares whether the brain's thinking machinery makes adaptive outputs, not whether the concepts it uses to arrive at those outputs make any sense at all. Whether we flag an object as being "conscious" or having "internal experience" may be evolution's way of deciding whether or not we should predict the object's behaviour using the "what would I do if I was it" computation. If the computation helps predict the object, we evolved to see it as conscious. If the computation doesn't help, we evolved to not see it as conscious, and instead predict its behaviour by modelling its parts and past behaviour. Just like "good" and "bad" only exist in the map and not the territory, so might "conscious" and "not conscious." A superintelligent being might not predict human behaviour by asking "what would I do if I was it," but instead predict us by modelling our parts. In that sense, we are not conscious from its point of view. But that shouldn't prove we have no moral value. I feel that animals have moral value, but whether they are conscious may be sorta subjective.

I like this treatment of consciousness and morality so much better than the naive idea, typical in EA (and elsewhere), that anything that "has consciousness" suddenly "has moral value" (even worse, and dangerous, is to combine that with symmetric population ethics). We should treat these things carefully (and imo democratically) to avoid making giant mistakes once AI allows us to put ethics into practice.

This is a linkpost for https://arxiv.org/abs/2506.06278

Current “unlearning” methods only suppress capabilities instead of truly unlearning them. But if you distill an unlearned model into a randomly initialized model, the resulting network is actually robust to relearning. We show why this works, how well it works, and how to trade off compute for robustness.

Unlearn-and-Distill applies unlearning to a bad behavior and then distills the unlearned model into a new model. Distillation makes it way harder to retrain the new model to do the bad thing.
Distilling the good while leaving the bad behind.
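A minimal sketch of the pipeline's shape (toy models, made-up data, and a simple gradient-difference objective standing in for whatever unlearning method is actually used; this is not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sketch of an unlearn-then-distill pipeline (made-up data and losses):
# 1) push the trained model away from a "forget" distribution (unlearning),
# 2) distill the unlearned model's outputs into a randomly initialized student.

def make_model():
    return nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))

retain_x = torch.randn(256, 16)             # data whose capabilities we want to keep
retain_y = torch.randint(0, 8, (256,))
forget_x = torch.randn(256, 16)             # data whose capabilities we want to remove
forget_y = torch.randint(0, 8, (256,))

teacher = make_model()                      # stands in for the original trained model

# --- Step 1: unlearning (here: gradient ascent on the forget set) -----------
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
for _ in range(100):
    loss = F.cross_entropy(teacher(retain_x), retain_y) \
           - 0.1 * F.cross_entropy(teacher(forget_x), forget_y)
    opt.zero_grad(); loss.backward(); opt.step()

# --- Step 2: distill into a fresh random init -------------------------------
student = make_model()                      # random init: no latent forget-set weights
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(200):
    with torch.no_grad():
        t_logits = teacher(retain_x)
    kd = F.kl_div(F.log_softmax(student(retain_x), -1),
                  F.softmax(t_logits, -1), reduction="batchmean")
    opt.zero_grad(); kd.backward(); opt.step()
```

The key design choice is that the student starts from a fresh random initialization and only ever sees the unlearned teacher's behavior, so whatever weights encoded the removed capability in the original model never make it into the distilled copy.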

Produced as part of the ML Alignment & Theory Scholars Program in the winter 2024–25 cohort of the shard theory stream. 

Read our paper on ArXiv and enjoy an interactive demo.

Robust unlearning probably reduces AI risk

Maybe some future AI has long-term goals and humanity is in its...

"How was your first day of high school?"

"Well, in algebra, the teacher just stood in front of the class for 45 minutes, scratching his head and saying things like 'what the heck is an inequality?' and 'I've never factored an expression in my life!' Maybe he's trying to get fired?"

2Nate Showell
Another experiment idea: testing whether the reduction in hallucinations that Yao et al. achieved with unlearning can be made robust.
2TurnTrout
Would you actively unlearn on those CoTs? Or just filter from distillation data?
2Daniel Kokotajlo
idk, haven't thought about it, you'd know better than me

When I was first learning about hypnosis, one of the things that was very confusing to me is how "expectations" relate to "intent". Some hypnotists would say "All suggestion is about expectation; if they expect to have an experience they will", and frame their inductions in terms of expectation (e.g. "Your eyelids will become heavy"). The problem with this is that "I don't think it's gonna work". Other hypnotists would avoid this issue entirely by saying "I don't care if you think it will work. Follow my instructions, and you will get the results regardless of what you believe" and then say things like "Make your eyelids heavy". The problem with this is that "I don't know how to do that!", which would be avoided by saying "You...

Shmi20

Sorry for the delayed reply... I don't get notifications of replies, and the LW RSS has been broken for me for years now, so I only poke my head here occasionally.

Well that sounds... scary, at best. I hope you've come out of it okay.

50/100. But that rather exciting story is best not told in a public forum.

Though these distinctions are kinda confusing for me.

Well, lack of appearance of something otherwise expected would be negative, and appearance of something otherwise unexpected would be positive?

For example, a false pregnancy is a "positive somatization"... (read more)

Unnamed Road, 1113, София

The ACX/EA/LW Sofia Meetup for June will be on the 29th (Sunday) at 16:00 in the Gradinka na Yogite (in Borisova Gradina Park).

Sofia ACX started with the 2021 Meetups Everywhere round. Attendance hovers around 4-8 people. Everyone worries they're not serious enough about ACX to join, so you should banish that thought and come anyway.  "Please feel free to come even if you feel awkward about it, even if you’re not 'the typical ACX reader', even if you’re worried people won’t like you", even if you didn't come to the previous meetings, even if you don't speak Bulgarian, etc., etc.

Each month we pick something new to read and discuss. In August, we're discussing Against Empathy by Paul Bloom (Chapter 1).

We'll be in the gazebo in "Градинка на Йогите" (picture here https://maps.app.goo.gl/kYhRv6aT4WQPJKBz9). This little garden is part of Borisova Gradina near the tennis courts, roughly between the Television Tower and Levski Stadium. If you think you'll have trouble finding it, email me and I'll arrange for someone to meet you.
Coordinates: https://plus.codes/8GJ5M8GW+P6

See you there.


At Less Online, I ran a well-attended session titled "Religion for Rationalists" to help me work out how I could write a post (this one!) about one of my more controversial beliefs without getting downvoted to hell. Let's see how I do!

My thesis is that most people, including the overwhelmingly atheist and non-religious rationalist crowd, would be better off if they actively participated in an organized religion.

My argument is roughly that religions uniquely provide a source of meaning, community, and life guidance not available elsewhere, and to the extent anything that doesn't consider itself a religion provides these, it's because it's imitating the package of things that makes something a religion. Not participating in a religion is obviously fine, but I think it leaves people missing out...

Is Judaism not also based around disputation of texts?

1lesswronguser123
Then religious people are simply more instrumentally rational than the "Rationalists"; "rationality as winning" is a definition which doesn't restrict itself to the superiority of a group that calls itself "Rationalists".
2Gordon Seidoh Worley
Fair point. This is one of those things that's weird about the modern world. Many of us are no longer part of a religion we grew up with, probably because we didn't like it and actively chose to reject it. And so if we later want to come to religion, it necessarily means "shopping" for one, in the sense that you have to pick one by some criteria. I generally wouldn't endorse someone deciding to become religious because they read this post and now want to optimize their life by becoming religious. I'd instead endorse them being open to seeing if some religious participation is right for them, and finding a group where they are able to participate in a way that feels wholesome.
3lesswronguser123
It depends how you define Hinduism: https://en.wikipedia.org/wiki/Hindu_philosophy. In the broadest sense, people just try to claim everything under it, and it becomes a second word for "culture, but Indian". There are narrower senses of the term.
2Gordon Seidoh Worley
We seem to have different ideas about what the norms of Less Wrong are, and maybe norms for truth seeking more generally. I didn't get into that because it seems I incorrectly assumed we were on the same page there, and so instead focused on my well-being as a decision-relevant fact worth highlighting. I see LW as a place for collaborative truth seeking, emphasis on collaboration. Someone says something wrong, and then we figure out how to say something less wrong, together. I think the best way to do that is with comments that are kind, truthful, useful, and curious, and those are the norms that I, as a high-enough-karma member of this site, have earned the right to enforce on my posts. You violate the above norms in my judgment, particularly the kindness and curiosity parts, and so I have chosen to ban you from my posts. That threads with you are stressful is a manifestation of this judgment. You obviously don't fully violate the norms of wider Less Wrong, and my actions have no effect on your ability to use every part of the site that is not one of my posts. As to why I respond to your comments, if someone posted on something you wrote that your ideas are stupid for obvious reasons, would you ignore it? Maybe you would, but ignoring comes off to many readers as tacit acceptance. When people like your comments, it makes them worth responding to if I disagree, especially on my own posts, in order to engage with not just you, but everyone who reads the comments. To fail to do so would be to leave readers with an incomplete picture of my views. I also genuinely want to figure things out and try to engage with every comment on my posts that I meaningfully can. I'd actually be quite happy if we could somehow work out our differences, find our cruxes, and at least if we are going to agree to disagree understand why that fundamentally is. I tried to do this with you a couple times years ago. It didn't go well. And seeing your most recent comments I could see the
15Said Achmiz
Agreed, except the “emphasis on collaboration” part (which is deeply misguided). The best way to do it is the way that does it best. If a “kind” comment is the best way, then write a “kind” comment. If “kindness” is irrelevant, orthogonal, or even detrimental to efficiency and effectiveness of the process, then omit it. You have been granted that privilege. That is very different from earning a right. That obviously depends on whether the criticism is valid or not. If it’s valid, then naturally I wouldn’t ignore it; I’d acknowledge it as valid. If it’s not valid, then is it obviously invalid? Is that the consensus of other commenters? Do other LW members reply to it in my stead, and/or use the LW voting system to signal their disagreement? If they do, then there’s no need for me to reply. If they do not, then there may be a need for a brief reply. If the criticism is invalid but not obviously so, then a more substantive reply is warranted. If the criticism is valid but I ignore it, then readers would think less of me. They would be right to do so. If my ideas are wrong and stupid, and especially if they are wrong and stupid for obvious reasons, then it is good that comments to that effect may be posted under my posts, and it is good that people should think less of me for ignoring those comments. If your post failed to provide a complete picture of your views, then I am doing you—and, much more importantly, all your other readers—a service by writing my comments, and thus giving you the opportunity to rectify that lacuna. Irrelevant. All of this is irrelevant. However admirable this desire might be, and however understandable might be the failure to fulfill it, it has nothing whatever to do with the question of banning a critic from commenting on your posts, because that is not about you, it is about whether all of your readers, and the LW commentariat, is denied the ability to discuss your ideas without restrictions. And if you want to “work out our di
1Gordon Seidoh Worley
I am not banning you because you are a critic. I am banning you because your comments are frequently unkind and demonstrate a lack of curiosity. This is why I have banned literally no one else, which includes a great many critics. That you are a critic is an unfortunate coincidence that nevertheless taints the specific way in which you violate the norms I am enforcing in the small part of Less Wrong I'm responsible for.

I am not banning you because you are a critic.

Thank heaven for that! But notice that you’re responding to a strawman: I never claimed that you banned me because I am a critic, period. Obviously not; since, as you say, you haven’t banned plenty of other people.

(Although, as I pointed out upthread, you have, in at least one case, threatened to ban another person for their critical comments, after deleting several of their comments. As far as I’m aware, that person—quite unsurprisingly!—hasn’t commented on your posts since. So, no, you don’t get to claim t... (read more)

If they weren't ready to deploy these safeguards and thought that proceeding outweighed the (expected) cost in human lives, they should have publicly acknowledged the level of fatalities and explained why they thought weakening their safety policies and incurring these expected fatalities was net good.[1]

 

Public acknowledgements of the capabilities could be net negative in itself, especially if they resulted in media attention. I expect bringing awareness to the (possible) fact that the AI can assist with CBRN tasks likely increases the chance that pe... (read more)