If the thesis in Unlocking the Emotional Brain is even half-right, it may be one of the most important books that I have read. It claims to offer a neuroscience-grounded, comprehensive model of how effective therapy works. In so doing, it also happens to formulate its theory in terms of belief updating, helping explain how the brain models the world and what kinds of techniques allow us to actually change our minds.

The MIRI Technical Governance Team is hiring; please apply and work with us! We are looking to hire for the following roles:

* Technical Governance Researcher (2-4 hires)
* Writer (1 hire)

The roles are located in Berkeley, and we are ideally looking to hire people who can start ASAP. The team is currently Lisa Thiergart (team lead) and myself. We will research and design technical aspects of regulation and policy that could lead to safer AI, focusing on methods that won’t break as we move towards smarter-than-human AI. We want to design policy that allows us to safely and objectively assess the risks from powerful AI, build consensus around the risks we face, and put in place measures to prevent catastrophic outcomes. The team will likely work on:

* Limitations of current proposals such as RSPs
* Inputs into regulations, and requests for comment by policy bodies (e.g. NIST/US AISI, EU, UN)
* Researching and designing alternative Safety Standards, or amendments to existing proposals
* Communicating with and consulting for policymakers and governance organizations

If you have any questions, feel free to contact me on LW or at peter@intelligence.org
Akash
I think now is a good time for people at labs to seriously consider quitting & getting involved in government/policy efforts. I don't think everyone should leave labs (obviously). But I would probably hit a button that does something like "everyone at a lab governance team and many technical researchers spend at least 2 hours thinking/writing about alternative options they have & very seriously consider leaving." My impression is that lab governance is much less tractable (lab folks have already thought a lot more about AGI) and less promising (competitive pressures are dominating) than government-focused work.  I think governments still remain unsure about what to do, and there's a lot of potential for folks like Daniel K to have a meaningful role in shaping policy, helping natsec folks understand specific threat models, and raising awareness about the specific kinds of things governments need to do in order to mitigate risks. There may be specific opportunities at labs that are very high-impact, but I think if someone at a lab is "not really sure if what they're doing is making a big difference", I would probably hit a button that allocates them toward government work or government-focused comms work. Written on a Slack channel in response to discussions about some folks leaving OpenAI. 
Today I learned that being successful can involve feelings of hopelessness. When you are trying to solve a hard problem, where you have no idea if you can solve it, let alone whether it is solvable at all, your brain makes you feel bad. It makes you feel like giving up. This is quite strange, because most of the time when I am in such a situation and manage to make a real effort anyway, I seem to surprise myself with how much progress I make. Empirically, this feeling of hopelessness does not seem to track the actual likelihood that you will completely fail.
Eli Tyre
Back in January, I participated in a workshop in which the attendees mapped out how they expect AGI development and deployment to go. The idea was to start by writing out what seemed most likely to happen this year, and then condition on that, to forecast what seems most likely to happen in the next year, and so on, until you reach either human disempowerment or an end of the acute risk period. This post was my attempt at the time. I spent maybe 5 hours on this, and there's lots of room for additional improvement. This is not a confident statement of how I think things are most likely to play out. There are already some ways in which I think this projection is wrong (I think it's too fast, for instance). But nevertheless I'm posting it now, with only a few edits and elaborations, since I'm probably not going to do a full rewrite soon.

2024

* A model is released that is better than GPT-4. It succeeds on some new benchmarks. Subjectively, the jump in capabilities feels smaller than that between RLHF’d GPT-3 and RLHF’d GPT-4. It doesn’t feel as shocking as ChatGPT and GPT-4 did, for either x-risk focused folks or for the broader public. Mostly it feels like “a somewhat better language model.”
* It’s good enough that it can do a bunch of small-to-medium admin tasks pretty reliably. I can ask it to find me flights meeting specific desiderata, and it will give me several options. If I give it permission, it will then book those flights for me with no further inputs from me.
* It works somewhat better as an autonomous agent in an AutoGPT harness, but it still loses its chain of thought / breaks down / gets into loops.
* It’s better at programming.
* Not quite good enough to replace human software engineers. It can make a simple React or iPhone app, but not design a whole complicated software architecture, at least without a lot of bugs.
* It can make small, working, well-documented apps from a human description.
* We see a doubling of the rate of new apps being added to the app store as people who couldn’t code now can make applications for themselves. The vast majority of people still don’t realize the possibilities here, though. “Making apps” still feels like an esoteric domain outside of their zone of competence, even though the barriers to entry just lowered so that 100x more people could do it.
* From here on out, we’re in an era where LLMs are close to commoditized. There are smaller improvements, shipped more frequently, by a variety of companies, instead of big impressive research breakthroughs. Basically, companies are competing with each other to always have the best user experience and capabilities, and so they don’t want to wait as long to ship improvements. They’re constantly improving their scaling, and finding marginal engineering improvements. Training runs for the next generation are always happening in the background, and there’s often less of a clean tabula-rasa separation between training runs—you just keep doing training with a model continuously. More and more, systems are being improved through in-the-world feedback with real users. Often ChatGPT will not be able to handle some kind of task, but six weeks later it will be able to, without the release of a whole new model.
* [Does this actually make sense? Maybe the dynamics of AI training mean that there aren’t really marginal improvements to be gotten. In order to produce a better user experience, you have to 10x the training, and each 10x-ing of the training requires a bunch of engineering effort, to enable a larger run, so it is always a big lift.]
* (There will still be impressive discrete research breakthroughs, but they won’t be in LLM performance.)

2025

* A major lab is targeting building a Science and Engineering AI (SEAI)—specifically a software engineer.
* They take a state-of-the-art LLM base model and do additional RL training on procedurally generated programming problems, calibrated to stay within the model’s zone of proximal competence. These problems are something like leetcode problems, but scale to arbitrary complexity (some of them require building whole codebases, or writing very complex software), with scoring on lines of code, time complexity, space complexity, readability, documentation, etc. This is something like “self-play” for software engineering.
* This just works.
* A lab gets a version that can easily do the job of a professional software engineer. Then the lab scales their training process and gets a superhuman software engineer, better than the best hackers.
* Additionally, a language model trained on procedurally generated programming problems in this way seems to have higher general intelligence. It scores better on graduate-level physics, economics, biology, etc. tests, for instance. It seems like “more causal reasoning” is getting into the system.
* The first proper AI assistants ship. In addition to doing specific tasks, you keep them running in the background, and talk with them as you go about your day. They get to know you and make increasingly helpful suggestions as they learn your workflow. A lot of people also talk to them for fun.

2026

* The first superhuman software engineer is publicly released.
* Programmers begin studying its design choices, the way Go players study AlphaGo.
* It starts to dawn on e.g. people who work at Google that they’re already superfluous—after all, they’re currently using this AI model to (unofficially) do their job—and it’s just a matter of institutional delay for their employers to adapt to that change.
* Many of them are excited or loudly say how it will all be fine/awesome. Many of them are unnerved. They start to see the singularity on the horizon, as a real thing instead of a social game to talk about.
* This is the beginning of the first wave of change in public sentiment that will cause some big, hard-to-predict changes in public policy [come back here and try to predict them anyway].
* AI assistants get a major upgrade: they have realistic voices and faces, and you can talk to them just like you can talk to a person, not just typing into a chat interface. A ton of people start spending a lot of time talking to their assistants, for much of their day, including for goofing around.
* There are still bugs, places where the AI gets confused by stuff, but overall the experience is good enough that it feels, to most people, like they’re talking to a careful, conscientious person, rather than a software bot.
* This starts a whole new area of training AI models that have particular personalities. Some people are starting to have parasocial relationships with their AI friends, and some programmers are trying to make AI friends that are really fun or interesting or whatever for them in particular.
* Lab attention shifts to building SEAI systems for other domains, to solve biotech and mechanical engineering problems, for instance. The current-at-the-time superhuman software engineer AIs are already helpful in these domains, but not at the level of “explain what you want, and the AI will instantly find an elegant solution to the problem right before your eyes”, which is where we’re at for software.
* One bottleneck is problem specification. Our physics simulations have gaps, and are too low fidelity, so oftentimes the best solutions don’t map to real-world possibilities.
* One solution to this (in addition to using our AI to improve the simulations) is that we just RLHF our systems to identify solutions that do translate to the real world. They’re smart, they can figure out how to do this.
* The first major AI cyber-attack happens: maybe some kind of superhuman hacker worm. Defense hasn’t remotely caught up with offense yet, and someone clogs up the internet with AI bots, for at least a week, approximately for the lols / to see if they could do it. (There’s a week during which more than 50% of people can't get on more than 90% of the sites because the bandwidth is eaten by bots.)
* This makes some big difference for public opinion.
* Possibly, this problem isn’t really fixed. In the same way that covid became endemic, the bots that were clogging things up are just a part of life now, slowing bandwidth and making the internet annoying to use.

2027 and 2028

* In many ways things are moving faster than ever in human history, and also AI progress is slowing down a bit.
* The AI technology developed up to this point hits the application and mass-adoption phase of the s-curve. In this period, the world is radically changing as every industry, every company, every research lab, every organization, figures out how to take advantage of newly commoditized intellectual labor. There’s a bunch of kinds of work that used to be expensive, but which are now too cheap to meter. If progress stopped now, it would take 2 decades, at least, for the world to figure out all the ways to take advantage of this new situation (but progress doesn’t show much sign of stopping).
* Some examples:
* The internet is filled with LLM bots that are indistinguishable from humans. If you start a conversation with a new person on twitter or discord, you have no way of knowing if they’re a human or a bot.
* (Probably there will be some laws about declaring which are bots, but these will be inconsistently enforced.)
* Some people are basically cool with this. From their perspective, there are just more people that they want to be friends with / follow on twitter. Some people even say that the bots are just better and more interesting than people. Other people are horrified/outraged/betrayed/don’t care about relationships with non-real people.
* (Older people don’t get the point, but teenagers are generally fine with having conversations with AI bots.)
* The worst part of this is the bots that make friends with you and then advertise stuff to you. Pretty much everyone hates that.
* We start to see companies that will, over the next 5 years, grow to have as much impact as Uber, or maybe Amazon, which have exactly one human employee / owner + an AI bureaucracy.
* The first completely autonomous companies work well enough to survive and support themselves. Many of these are created “free” for the lols, and no one owns or controls them. But most of them are owned by the person who built them, who could turn them off if they wanted to. A few are structured as public companies with share-holders. Some are intentionally incorporated as fully autonomous, with the creator disclaiming (and technologically disowning (e.g. deleting the passwords)) any authority over them.
* There are legal battles about what rights these entities have, if they can really own themselves, if they can have bank accounts, etc.
* Mostly, these legal cases resolve to “AIs don’t have rights”. (For now. That will probably change as more people feel it’s normal to have AI friends.)
* Everything is tailored to you.
* Targeted ads are way more targeted. You are served ads for the product that you are, all things considered, most likely to buy, multiplied by the lifetime profit if you do buy it. Basically no ad space is wasted on things that don’t have a high EV of you, personally, buying them. Those ads are AI-generated, tailored specifically to be compelling to you. Often the products advertised, not just the ads, are tailored to you in particular.
* This is actually pretty great for people like me: I get excellent product suggestions.
* There’s not “the news”. There’s a set of articles written for you, specifically, based on your interests and biases.
* Music is generated on the fly. This music can “hit the spot” better than anything you listened to before “the change.”
* Porn. AI-tailored porn can hit your buttons better than sex.
* AI boyfriends/girlfriends that are designed to be exactly emotionally and intellectually compatible with you, and trigger strong limerence / lust / attachment reactions.
* We can replace books with automated tutors.
* Most of the people who read books will still read books though, since it will take a generation to realize that talking with a tutor is just better, and because reading and writing books was largely a prestige thing anyway.
* (And weirdos like me will probably continue to read old authors, but even better will be to train an AI on a corpus, so that it can play the role of an intellectual from 1900, and I can just talk to it.)
* For every task you do, you can effectively have a world expert (in that task and in tutoring pedagogy) coach you through it in real time.
* Many people do almost all their work tasks with an AI coach.
* It's really easy to create TV shows and movies. There’s a cultural revolution as people use AI tools to make custom Avengers movies, anime shows, etc. Many are bad or niche, but some are 100x better than anything that has come before (because you’re effectively sampling from a 1000x larger distribution of movies and shows).
* There’s an explosion of new software, and increasingly custom software.
* Facebook and twitter are replaced (by either external disruption or by internal product development) by something that has a social graph, but lets you design exactly the UX features you want through an LLM text interface.
* Instead of software features being something that companies ship to their users, top-down, they become something that users and communities organically develop, share, and iterate on, bottom-up. Companies don’t control the UX of their products any more.
* Because interface design has become so cheap, most software is just proprietary datasets, with (AI-built) APIs for accessing that data.
* There’s a slow-moving educational revolution of world-class pedagogy being available to everyone.
* Millions of people who thought of themselves as “bad at math” finally learn math at their own pace, and find out that actually, math is fun and interesting.
* Really fun, really effective educational video games for every subject.
* School continues to exist, in approximately its current useless form.
* [This alone would change the world, if the kids who learn this way were not going to be replaced wholesale, in virtually every economically relevant task, before they are 20.]
* There’s a race between cyber-defense and cyber-offense, to see who can figure out how to apply AI better.
* So far, offense is winning, and this is making computers unusable for lots of applications that they were used for previously:
* Online banking, for instance, is hit hard by effective scams and hacks.
* Coinbase has an even worse time, since they’re not insured (is that true?).
* It turns out that a lot of things that worked / were secure were basically depending on the fact that there are just not that many skilled hackers and social engineers. Nothing was secure, really, but not that many people were exploiting that. Now, hacking/scamming is scalable and all the vulnerabilities are a huge problem.
* There’s a whole discourse about this. Computer security and what to do about it is a partisan issue of the day.
* AI systems can do the years of paperwork to make a project legal in days. This isn’t as big an advantage as it might seem, because the government has no incentive to be faster on their end, and so you wait weeks to get a response from the government, your LLM responds to it within a minute, and then you wait weeks again for the next step.
* The amount of paperwork required to do stuff starts to balloon.
* AI romantic partners are a thing. They start out kind of cringe, because the most desperate and ugly people are the first to adopt them. But shockingly quickly (within 5 years) a third of teenage girls have a virtual boyfriend.
* There’s a moral panic about this.
* AI matchmakers are better than anything humans have tried yet for finding sex and relationship partners. It would still take a decade for this to catch on, though.
* This isn’t just for sex and relationships. The global AI network can find you the 100 people, of the 9 billion on earth, that you most want to be friends / collaborators with.
* Tons of things that I can’t anticipate.
* On the other hand, AI progress itself is starting to slow down. Engineering labor is cheap, but (indeed partially for that reason) we’re now bumping up against the constraints of training. It's not just that buying the compute is expensive, but that there are just not enough chips to do the biggest training runs, and not enough fabs to meet that demand for chips rapidly. There’s huge pressure to expand production, but that’s going slowly relative to the speed of everything else, because it requires a bunch of e.g. physical construction and legal navigation, which the AI tech doesn’t help much with, and because the bottleneck is largely NVIDIA’s institutional knowledge, which is only partially replicated by AI.
* NVIDIA's internal AI assistant has read all of their internal documents and company emails, and is very helpful at answering questions that only one or two people (and sometimes literally no human on earth) know the answer to. But a lot of the important stuff isn’t written down at all, and the institutional knowledge is still not fully scalable.
* Note: there’s a big crux here of how much low- and medium-hanging fruit there is in algorithmic improvements once software engineering is automated. At that point the only constraint on running ML experiments will be the price of compute. It seems possible that that speed-up alone is enough to discover e.g. an architecture that works better than the transformer, which triggers an intelligence explosion.

2028

* The cultural explosion is still going on, and AI companies are continuing to apply their AI systems to solve the engineering and logistic bottlenecks of scaling AI training, as fast as they can.
* Robotics is starting to work.

2029

* The first superhuman, relatively general SEAI comes online. We now have basically a genie inventor: you can give it a problem spec, and it will invent (and test in simulation) a device / application / technology that solves that problem, in a matter of hours. (Manufacturing a physical prototype might take longer, depending on how novel the components are.)
* It can do things like give you the design for a flying car, or a new computer peripheral.
* A lot of biotech / drug discovery seems more recalcitrant, because it is more dependent on empirical inputs. But it is still able to do superhuman drug discovery for some ailments. It’s not totally clear why, or which biotech domains it will conquer easily and which it will struggle with.
* This SEAI is shaped differently than a human. It isn’t working-memory bottlenecked, so a lot of intellectual work that humans do explicitly, in sequence, these SEAIs do “intuitively”, in a single forward pass.
* I write code one line at a time. It writes whole files at once. (Although it also goes back and edits / iterates / improves—the first-pass files are not usually the final product.)
* For this reason it’s a little confusing to answer the question “is it a planner?” A lot of the work that humans would do via planning, it does in an intuitive flash.
* The UX isn’t clean: there’s often a lot of detailed finagling, and refining of the problem spec, to get useful results. But a PhD in that field can typically do that finagling in a day.
* It’s also buggy. There are oddities in the shape of the kinds of problems it is able to solve and the kinds of problems it struggles with, which aren’t well understood.
* The leading AI company doesn’t release this as a product. Rather, they apply it themselves, developing radical new technologies, which they publish or commercialize, sometimes founding whole new fields of research in the process. They spin up automated companies to commercialize these new innovations.
* Some of the labs are scared at this point. The thing that they’ve built is clearly world-shakingly powerful, and their alignment arguments are mostly inductive “well, misalignment hasn’t been a major problem so far”, instead of principled alignment guarantees.
* There's a contentious debate inside the labs.
* Some labs freak out, stop here, and petition the government for oversight and regulation.
* Other labs want to push full steam ahead.
* Key pivot point: does the government clamp down on this tech before it is deployed, or not?
* I think that they try to get control over this powerful new thing, but they might be too slow to react.

2030

* There’s an explosion of new innovations in physical technology. Magical new stuff comes out every day, way faster than any human can keep up with.
* Some of these are mundane:
* All the simple products that I would buy on Amazon are just really good and really inexpensive.
* Cars are really good.
* Drone delivery.
* Cleaning robots.
* Prefab houses are better than any house I’ve ever lived in, though there are still zoning limits.
* But many of them would have huge social impacts. They might be the important story of the decade (the way that the internet was the important story of 1995 to 2020) if they were the only thing that was happening that decade. Instead, they’re all happening at once, piling on top of each other.
* E.g.:
* The first really good nootropics.
* Personality-tailoring drugs (both temporary and permanent).
* Breakthrough mental health interventions that, among other things, robustly heal people’s long-term subterranean trauma and transform their agency.
* A quick and easy process for becoming classically enlightened.
* The technology to attain your ideal body, cheaply—suddenly everyone who wants to be is as attractive as the top 10% of people today.
* Really good AI persuasion, which can get a mark to do ~anything you want if they’ll talk to an AI system for an hour.
* Artificial wombs.
* Human genetic engineering.
* Brain-computer interfaces.
* Cures for cancer, AIDS, dementia, heart disease, and the-thing-that-was-causing-obesity.
* Anti-aging interventions.
* VR that is ~indistinguishable from reality.
* AI partners that can induce a love superstimulus.
* Really good sex robots.
* Drugs that replace sleep.
* AI mediators that are so skilled as to be able to single-handedly fix failing marriages, but which are also brokering all the deals between governments and corporations.
* Weapons that are more destructive than nukes.
* Really clever institutional design ideas, which some enthusiast early adopters try out (think “50 different things at least as impactful as manifold.markets”).
* It’s way more feasible to go into the desert, buy 50 square miles of land, and have a city physically built within a few weeks.
* In general, social trends are changing faster than they ever have in human history, but they still lag behind the tech driving them by a lot.
* It takes humans, even with AI information-processing assistance, a few years to realize what’s possible and take advantage of it, and then have the new practices spread.
* In some cases, people are used to doing things the old way, which works well enough for them, and it takes 15 years for a new generation to grow up as “AI-world natives” to really take advantage of what’s possible.
* [There won’t be 15 years.]
* The legal oversight process for the development, manufacture, and commercialization of these transformative techs matters a lot. Some of these innovations are slowed down a lot because they need to get FDA approval, which AI tech barely helps with. Others are developed, manufactured, and shipped in less than a week.
* The fact that there are life-saving cures that exist, but are prevented from being used by a collusion of AI labs and government, is a major motivation for open-source proponents.
* A lot of this technology makes setting up new cities quickly more feasible, and there’s enormous incentive to get out from under the regulatory overhead and to start new legal jurisdictions. The first real seasteads are started by the most ideologically committed anti-regulation, pro-tech-acceleration people.
* Of course, all of that is basically a side gig for the AI labs. They’re mainly applying their SEAI to the engineering bottlenecks of improving their ML training processes.
* Key pivot point:
* Possibility 1: These SEAIs are necessarily, by virtue of the kinds of problems that they’re able to solve, consequentialist agents with long-term goals.
* If so, this breaks down into two child possibilities:
* Possibility 1.1: This consequentialism was noticed early, which might have been convincing enough to the government to cause a clamp-down on all the labs.
* Possibility 1.2: It wasn’t noticed early, and now the world is basically fucked.
* There’s at least one long-term consequentialist superintelligence. The lab that “owns” and “controls” that system is talking to it every day, in their day-to-day business of doing technical R&D. That superintelligence easily manipulates the leadership (and rank and file) of that company, maneuvers it into doing whatever causes the AI’s goals to dominate the future, and enables it to succeed at everything that it tries to do.
* If there are multiple such consequentialist superintelligences, then they covertly communicate, make a deal with each other, and coordinate their actions.
* Possibility 2: We’re getting transformative AI that doesn’t do long-term consequentialist planning.
* Building these systems was a huge engineering effort (though the bulk of that effort was done by ML models). Currently only a small number of actors can do it.
* One thing to keep in mind is that the technology bootstraps. If you can steal the weights to a system like this, it can basically invent itself: come up with all the technologies and solve all the engineering problems required to build its own training process. At that point, the only bottleneck is compute resources, which are limited by supply chains and legal constraints (large training runs require authorization from the government).
* This means, I think, that a crucial question is “has AI-powered cyber-security caught up with AI-powered cyber-attacks?”
* If not, then every nation state with a competent intelligence agency has a copy of the weights of an inventor-genie, and probably all of them are trying to profit from it, either by producing tech to commercialize, or by building weapons.
* It seems like the crux is “do these SEAIs themselves provide enough of an information and computer security advantage that they’re able to develop and implement methods that effectively secure their own code?”
* Every one of the great powers, and a bunch of small, forward-looking groups that see that it is newly feasible to become a great power, try to get their hands on an SEAI, either by building one, nationalizing one, or stealing one.
* There are also some people who are ideologically committed to open-sourcing and/or democratizing access to these SEAIs.
* But it is a self-evident national security risk. The government does something here (nationalizing all the labs, and their technology?).
* What happens next depends a lot on how the world responds to all of this.
* Do we get a pause?
* I expect a lot of the population of the world feels really overwhelmed, and emotionally wants things to slow down, including smart people that would never have thought of themselves as luddites.
* There are also some people who thrive in the chaos, and want even more of it.
* What’s happening is mostly hugely good, for most people. It’s scary, but also wonderful.
* There is a huge problem of accelerating addictiveness. The world is awash in products that are more addictive than many drugs. There’s a bit of (justified) moral panic about that.
* One thing that matters a lot at this point is what the AI assistants say. As powerful as the media used to be for shaping people’s opinions, the personalized, superhumanly emotionally intelligent AI assistants are way, way more powerful. AI companies may very well put their thumb on the scale to influence public opinion regarding AI regulation.
* This seems like possibly a key pivot point, where the world can go any of a number of ways depending on what a relatively small number of actors decide.
* Some possibilities for what happens next:
* These SEAIs are necessarily consequentialist agents, and the takeover has already happened, regardless of whether it still looks like we’re in control, or it doesn’t look like anything because we’re extinct.
* Governments nationalize all the labs.
* The US and EU and China (and India? and Russia?) reach some sort of accord.
* There’s a straight-up arms race to the bottom.
* AI tech basically makes the internet unusable and breaks supply chains, and technology regresses for a while.
* It’s too late to contain it and the SEAI tech proliferates, such that there are hundreds or millions of actors who can run one.
* If this happens, it seems like the pace of change speeds up so much that one of two things happens:
* Someone invents something, or there are second and third impacts to a constellation of innovations, that destroy the world.
Raemon
There's a skill of "quickly operationalizing a prediction about a question that is cruxy for your decisionmaking."

And it's dramatically better to be very fluent at this skill, rather than "merely pretty okay at it." Fluency means you can actually use it day-to-day to help with whatever work is important to you. Day-to-day usage means you can actually get calibrated re: predictions in whatever domains you care about. Calibration means that your intuitions will be good, and _you'll know they're good_. Fluency means you can do it _while you're in the middle of your thought process_, and then return to your thought process, rather than awkwardly bolting it on at the end.

I find this useful at multiple levels of strategy, i.e. for big-picture 6-month planning as well as for "what do I do in the next hour." I'm working on this as a full blogpost but figured I would start getting pieces of it out here for now.

A lot of this skill is building on CFAR's "inner simulator" framing. Andrew Critch recently framed this to me as "using your System 2 (conscious, deliberate intelligence) to generate questions for your System 1 (fast intuition) to answer." (Whereas previously, he'd known System 1 was good at answering some types of questions, but he thought of it as responsible for both "asking" and "answering" those questions.) But I feel like combining this with "quickly operationalize cruxy Fatebook predictions" makes it more of a power tool for me. (Also, now that I have this mindset, even when I can't be bothered to make a Fatebook prediction, I have a better overall handle on how to quickly query my intuitions.)

I've been working on this skill for years and it only really clicked together last week. It required a bunch of interlocking pieces that all require separate fluency:

1. Having three different formats for Fatebook (the main website, the Slack integration, and the Chrome extension), so that pretty much wherever I'm thinking-in-text, I'll be able to quickly use it.
2. The skill of "generating lots of 'plans'", such that I always have at least two plausibly good ideas on what to do next.
3. Identifying an actual crux for what would make me switch to one of my backup plans.
4. Operationalizing an observation I could make that'd convince me of one of these cruxes.

Recent Discussion

This is a thread for updates about the upcoming LessOnline festival. I (Ben) will be posting bits of news and thoughts, and you're also welcome to make suggestions or ask questions.

If you'd like to hear about new updates, you can use LessWrong's "Subscribe to comments" feature from the triple-dot menu at the top of this post.

Reminder that you can get tickets at the site for $400 minus your LW karma in cents.

Nice, just had a good call with Alkjash, who is coming and will be preparing 2 layman-level math talks about questions he's been thinking about.

Other ideas we chatted about having at LessOnline include maybe having some discussions about doing research inside and outside of academia, and also about learning from GlowFic writers how to write well collaboratively. (Let me know if you'd be interested in either of these!)

NicholasKross
How scarce are tickets/"seats"?
Ben Pace
I think on-site housing is pretty scarce, though we're going to make more high-density rooms in response to demand for that. Tickets aren't scarce; our venue could fit something like a 700-person event, so I don't expect to hit the limits.

I was thinking about my p(doom) in the next 10 years and came up with something around 6%[1]. However, that number involves lots of things that are currently unknown to me, like the nature of current human knowledge production (and the bottlenecks involved), which would push my p(doom) to either 3% or 15% depending upon what type of bottlenecks are found or not found. Is there a technical way to describe this probability distribution, contingent on evidence?

  1. ^

    I'm bearish on LLMs leading directly to AGI (10% chance), and I put roughly a 30% chance on LLM-based AI fooming quickly enough to kill us, and wanting to kill us, within 10 years. There is a 3% chance that something will come out of left field and do the same.

Answer by Dagon
If you're giving one number, that IS your all-inclusive probability. You can't predict the direction that new evidence will change your probability (per https://www.lesswrong.com/tag/conservation-of-expected-evidence), but you CAN predict that the possible updates balance out in expectation. An example is flipping a coin twice. Before any flips, you give 0.25 to each of HH, HT, TH, and TT. But you strongly expect to get evidence (observing the flips) that will first change two of them to 0.5 and two to 0, then another update which will change one of the 0.5s to 1 and the other to 0. Likewise for p(doom) before 2035: you strongly believe your probability will be 1 or 0 in 2036. You currently believe 6%. You may be able to identify intermediate updates, and specify the balance of probability-times-update that adds to 0 currently, but will be specific when the evidence is obtained. I don't know any shorthand for that - it's implied by the probability given. If you want to specify your distribution of probable future probability assignments, you can certainly do so, as long as the mean remains 6%. "There's a 25% chance I'll update to 15% and a 75% chance of updating to 3% over the next 5 years" is a consistent prediction.
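To make the "mean stays at 6%" constraint concrete, here is a minimal Python sketch (my own illustration, not part of the original comment), using the numbers from this answer:

```python
# Conservation of expected evidence: today's probability must equal the
# expectation of tomorrow's probability over the evidence you might see.

# Hypothetical future updates: 25% chance of updating to 15%, 75% chance of updating to 3%.
future_estimates = [(0.25, 0.15), (0.75, 0.03)]
current = sum(p * q for p, q in future_estimates)
print(current)  # 0.06, consistent with a current p(doom) of 6%

# Coin-flip example: before any flips, HH has probability 0.25. After one flip,
# the posterior for HH is either 0.5 (first flip heads) or 0.0 (first flip tails),
# and the expectation of that posterior is still 0.25.
expected_posterior_hh = 0.5 * 0.5 + 0.5 * 0.0
print(expected_posterior_hh)  # 0.25
```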
Answer by Richard_Ngo
I don't think there's a very good precise way to do so, but one useful concept is bid-ask spreads, which are a way of protecting yourself from adverse selection of bets. E.g. consider the following two credences, both of which are 0.5.

1. My credence that a fair coin will land heads.
2. My credence that the wind tomorrow in my neighborhood will be blowing more northwards than southwards (I know very little about meteorology and have no recollection of which direction previous winds have mostly blown).

Intuitively, however, the former is very difficult to change, whereas the latter might swing wildly given even a little bit of evidence (e.g. someone saying "I remember in high school my teacher mentioned that winds often blow towards the equator.")

Suppose I have to decide on a policy that I'll accept bets for or against each of these propositions at X:1 odds (i.e. my opponent puts up $X for every $1 I put up). For the first proposition, I might set X to be 1.05, because as long as I have a small edge I'm confident I won't be exploited. By contrast, if I set X=1.05 for the second proposition, then probably what will happen is that people will only decide to bet against me if they have more information than me (e.g. checking weather forecasts), and so they'll end up winning a lot of money from me. And so I'd actually want X to be something more like 2 or maybe higher, depending on who I expect to be betting against, even though my credence right now is 0.5.

In your case, you might formalize this by talking about your bid-ask spread when trading against people who know about these bottlenecks.
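As a toy illustration of the adverse-selection point (the 65% figure below is a made-up number of mine, not Richard's):

```python
def my_expected_profit(x_odds: float, p_counterparty_right: float) -> float:
    """Expected profit per $1 I stake when I take the other side of a bet at X:1 odds
    (the counterparty stakes $X against my $1), given how often they turn out to be right."""
    p_i_win = 1.0 - p_counterparty_right
    return p_i_win * x_odds - p_counterparty_right * 1.0

# Coin flip: no one can out-predict me, so whoever bets is right only half the time.
print(my_expected_profit(1.05, 0.5))   # +0.025 per $1: a small edge is enough

# Wind direction: only people who bothered to check a forecast take the bet,
# so (toy number) they are right ~65% of the time. The same 1.05:1 odds now lose money,
# and I need a much wider spread before offering the bet is safe.
print(my_expected_profit(1.05, 0.65))  # about -0.28 per $1
print(my_expected_profit(2.0, 0.65))   # about +0.05 per $1
```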
Razied

Surely something like the expected variance of p(doom) over time would be a much simpler way of formalising this, no? The probability over time is just a stochastic process, and OP is expecting the variance of this process to be very high in the near future.

Answer by harfe
A lot of the probabilities we talk about are probabilities we expect to change with evidence. If we flip a coin, our p(heads) changes after we observe the result of the flipped coin. My p(rain today) changes after I look into the sky and see clouds. In my view, there is nothing special in that regard about your p(doom). Uncertainty is in the mind, not in reality. However, how you expect your p(doom) to change depending on facts or observations is useful information, and it can be useful to convey it. Some options that come to mind:

1. Describe a model: if your p(doom) estimate is the result of a model consisting of other variables, just describing this model is useful information about your state of knowledge, even if that model is only approximate. This seems to come closest to your actual situation.
2. Describe your probability distribution over your p(doom) in 1 year (or another time frame): you could say that you think there is a 25% chance that your p(doom) in 1 year is between 10% and 30%, or give other information about that distribution. Note: your current p(doom) should be the mean of your p(doom) in 1 year.
3. Describe your probability distribution over your p(doom) after a hypothetical month of working on a better p(doom) estimate: you could say that if you were to work hard for a month on investigating p(doom), you think there is a 25% chance that your p(doom) after that month is between 10% and 30%. This is similar to 2., but imo a bit more informative. Again, your p(doom) should be the mean of your p(doom) after a hypothetical month of investigation, even if you don't actually do that investigation.

I previously expected open-source LLMs to lag far behind the frontier because they're very expensive to train and naively it doesn't make business sense to spend on the order of $10M to (soon?) $1B to train a model only to give it away for free.

 But this has been repeatedly challenged, most recently by Meta's Llama 3. They seem to be pursuing something like a commoditize your complement strategy: https://twitter.com/willkurt/status/1781157913114870187 .

As models become orders-of-magnitude more expensive to train can we expect companies to continue to open-source them?

In particular, can we expect this of Meta?

Answer by Aaron_Scher

Yeah, I think we should expect much more powerful open source AIs than we have now. I've been working on a blog post about this, maybe I'll get it out soon. Here are what seem like the dominant arguments to me: 

  • Scaling curves show strongly diminishing returns to $ spend: A $100m model might not be that far behind a $1b model, performance wise. 
  • There are numerous (maybe 7) actors in the open source world who are at least moderately competent and want to open source powerful models. There is a niche in the market for powerful open source models, an
... (read more)

Consequentialists (including utilitarians) claim that the goodness of an action should be judged based on the goodness of its consequences. The word utility is often used to refer to the quantified goodness of a particular outcome. When the consequences of an action are uncertain, it is often taken for granted that consequentialists should choose the action which has the highest expected utility. The expected utility is the sum of the utilities of each possible outcome, weighted by their probability. For a lottery which gives outcome utilities $u_1, \dots, u_n$ with respective probabilities $p_1, \dots, p_n$, the expected utility is:

$$\mathbb{E}[U] = \sum_{i=1}^{n} p_i u_i$$

There are several good reasons to use the maximization of expected utility as a normative rule. I'll talk about some of them here, but I recommend Joe Carlsmith's series of posts 'On Expected Utility' as a...

If bounded below, you can just shift up to make it positive. But the geometric expected utility order is not preserved under shifts.

MichaelStJules
Violations of continuity aren't really vulnerable to proper/standard money pumps. The author calls it "arbitrarily close to pure exploitation" but that's not pure exploitation. It's only really compelling if you assume a weaker version of continuity in the first place, but you can just deny that. I think transitivity (+ independence of irrelevant alternatives) and countable independence (or the countable sure-thing principle) are enough to avoid money pumps, and I expect they give a kind of expected utility maximization form (combining McCarthy et al., 2019 and Russell & Isaacs, 2021). Against the requirement of completeness (or the specific money pump argument for it by Gustafsson in your link), see Thornley here. To be clear, countable independence implies your utilities are "bounded" in a sense, but possibly lexical/lexicographic. See Russell & Isaacs, 2021.
cousin_it
Well, you can't have some states as "avoid at all costs" and others as "achieve at all costs", because having them in the same lottery leads to nonsense, no matter what averaging you use. And allowing only one of the two seems arbitrary. So it seems cleanest to disallow both. But geometric averaging wouldn't let you do that either, or am I missing something?
A.H.
Fine. But the purpose of exploring different averaging methods is to see whether it expands the richness of the kind of behaviour we want to describe. The point is that using arithmetic averaging is a choice which limits the kind of behaviour we can get. Maybe we want to describe behaviours which can't be described under expected utility. Having an 'avoid at all costs state' is one such behaviour, which finds a natural description using non-arithmetic averaging and can't be described in more typical VNM terms. If your position is 'I would never want to describe normative ethics using anything other than expected utility' then that's fine, but some people (like me) are interested in looking at what the alternatives to expected utility might be. That's why I wrote this post. As it stands, I didn't find geometric averaging very satisfactory (as I wrote in the post), but I think things like this are worth exploring.

You are right: geometric averaging on its own doesn't allow violations of independence. But some other protocol for deciding over lotteries does. It's described more in the Garrabrant post linked above.
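As a concrete illustration of the 'avoid at all costs state' point discussed above, here is a small sketch (my own example, with made-up utilities) comparing arithmetic and geometric averaging of a lottery that contains a zero-utility outcome:

```python
import math

def arithmetic_eu(lottery):
    """Standard expected utility: sum of p_i * u_i."""
    return sum(p * u for p, u in lottery)

def geometric_eu(lottery):
    """Geometric expectation: product of u_i ** p_i (assumes nonnegative utilities)."""
    return math.prod(u ** p for p, u in lottery)

# A lottery with a tiny chance of a zero-utility, "avoid at all costs" outcome,
# versus a safe option. Utilities are made up for illustration.
risky = [(0.999, 100.0), (0.001, 0.0)]
safe = [(1.0, 50.0)]

print(arithmetic_eu(risky), arithmetic_eu(safe))  # 99.9 vs 50.0: risky is preferred
print(geometric_eu(risky), geometric_eu(safe))    # 0.0 vs 50.0: the zero outcome acts as a veto
```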

In a new preprint, Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models, my coauthors and I introduce a technique, Sparse Human-Interpretable Feature Trimming (SHIFT), which I think is the strongest proof-of-concept yet for applying AI interpretability to existential risk reduction.[1] In this post, I will explain how SHIFT fits into a broader agenda for what I call cognition-based oversight. In brief, cognition-based oversight aims to evaluate models according to whether they’re performing intended cognition, instead of whether they have intended input/output behavior.

In the rest of this post I will:

  1. Articulate a class of approaches to scalable oversight I call cognition-based oversight.
  2. Narrow in on a model problem in cognition-based oversight called Discriminating Behaviorally Identical Classifiers (DBIC). DBIC is formulated to be a concrete problem which I think captures most
...
Buck


I like this post and this research direction, I agree with almost everything you say, and I think you’re doing an unusually good job of explaining why you think your work is useful.

A nitpick: I think you’re using the term “scalable oversight” in a nonstandard and confusing way.

You say that scalable oversight is a more general version of “given a good model and a bad model, determine which one is good.” I imagine that more general sense you wanted is something like: you can implement some metric that tells you how “good” a model is, which can be applied not... (read more)

This is a series of snippets about the Google DeepMind mechanistic interpretability team's research into Sparse Autoencoders, that didn't meet our bar for a full paper. Please start at the summary post for more context, and a summary of each snippet. They can be read in any order.

Activation Steering with SAEs

Arthur Conmy, Neel Nanda

TL;DR: We use SAEs trained on GPT-2 XL’s residual stream to decompose steering vectors into interpretable features. We find a single SAE feature for anger which is a Pareto-improvement over the anger steering vector from existing work (Section 3, 3 minute read). We have more mixed results with wedding steering vectors: we can partially interpret the vectors, but the SAE reconstruction is a slightly worse steering vector, and just taking the obvious features produces...
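As a rough illustration of what "decomposing a steering vector into SAE features" involves: a minimal sketch assuming a standard single-layer ReLU SAE with a decoder bias (the shapes, random parameters, and dictionary size below are placeholders of my own, not the team's trained SAE):

```python
import torch

d_model, d_sae = 1600, 4096   # GPT-2 XL residual stream width; dictionary size is a placeholder

# Placeholder SAE parameters; in practice these come from a trained sparse autoencoder.
W_enc = torch.randn(d_model, d_sae) * 0.01
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model) * 0.01
b_dec = torch.zeros(d_model)

def sae_encode(x: torch.Tensor) -> torch.Tensor:
    """Decompose a residual-stream vector into sparse, nonnegative feature activations."""
    return torch.relu((x - b_dec) @ W_enc + b_enc)

def sae_decode(acts: torch.Tensor) -> torch.Tensor:
    """Reconstruct a residual-stream vector from feature activations."""
    return acts @ W_dec + b_dec

steering_vector = torch.randn(d_model)     # stand-in for e.g. an "anger" steering vector
acts = sae_encode(steering_vector)
print(acts.topk(5).indices)                # the handful of features that dominate the vector
reconstruction = sae_decode(acts)          # can itself be tried as a steering vector
print(torch.norm(steering_vector - reconstruction))
```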

Sam Marks

With the ITO experiments, my first guess would be that reoptimizing the sparse approximation problem is mostly relearning the encoder, but with some extra uninterpretable hacks for low activation levels that happen to improve reconstruction. In other words, I'm guessing that the boost in reconstruction accuracy (and therefore loss recovered) is mostly not due to better recognizing the presence of interpretable features, but by doing fiddly uninterpretable things at low activation levels.

I'm not really sure how to operationalize this into a prediction. Mayb... (read more)
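To spell out what "reoptimizing the sparse approximation problem" refers to, here is a minimal sketch (my own construction, not code from the paper or the comment): hold the SAE's decoder fixed and re-fit sparse codes for a batch of activations by direct L1-penalized optimization, rather than using the learned encoder:

```python
import torch

d_model, d_sae, n_tokens = 64, 256, 128    # toy sizes

# Stand-ins for a frozen SAE decoder (dictionary) and a batch of model activations.
W_dec = torch.randn(d_sae, d_model) * 0.1
activations = torch.randn(n_tokens, d_model)

def reoptimize_codes(acts, dictionary, l1_coeff=3e-3, steps=300, lr=0.05):
    """Re-fit sparse codes for a fixed dictionary by direct optimization
    (L1-penalized regression). A real ITO method would also enforce nonnegativity,
    since SAE activations are nonnegative."""
    codes = torch.zeros(acts.shape[0], dictionary.shape[0], requires_grad=True)
    opt = torch.optim.Adam([codes], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = codes @ dictionary
        loss = (recon - acts).pow(2).mean() + l1_coeff * codes.abs().mean()
        loss.backward()
        opt.step()
    return codes.detach()

ito_codes = reoptimize_codes(activations, W_dec)
recon_error = (ito_codes @ W_dec - activations).pow(2).mean()
density = (ito_codes.abs() > 1e-2).float().mean()
print(recon_error.item(), density.item())  # reconstruction error and code density
```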

Sam Marks
Awesome stuff -- I think that updates like this (both from the GDM team and from Anthropic) are very useful for organizing work in this space. And I especially appreciate the way this was written, with both short summaries and in-depth write-ups.

Emergent Instrumental Reasoning Without Explicit Goals

TL;DR: LLMs can act and scheme without being told to do so. This is bad.


Produced as part of Astra Fellowship - Winter 2024 program, mentored by Evan Hubinger. Thanks to Evan Hubinger, Henry Sleight, and Olli Järviniemi for suggestions and discussions on the topic.

Introduction

Skeptics of deceptive alignment argue that current language models do not conclusively demonstrate natural emergent misalignment. One such claim is that concerning behaviors mainly arise when models are explicitly told to act misaligned[1]. Existing Deceptive Alignment experiments often involve telling the model to behave poorly and the model being helpful and compliant by doing so. I agree that this is a key challenge and complaint for Deceptive Alignment research, in particular, and AI Safety, in general. My project is aimed...

ryan_greenblatt
I would summarize this result as: If you train models to say "there is a reason I should insert a vulnerability" and then to insert a code vulnerability, then this model will generalize to doing "bad" behavior and making up specific reasons for doing that bad behavior in other cases. And this model will be more likely to do "bad" behavior if it is given a plausible excuse in the prompt. Does this seem like a good summary? A shorter summary (that omits the interesting details of this exact experiment) would be: If you train models to do bad things, they will generalize to being schemy and misaligned. This post presents an interesting result and I appreciate your write-up, though I feel like the title, TL;DR, and intro seem to imply this result is considerably more "unprompted" than it actually is. As in, my initial skim of these sections made me think this result is much more striking than it actually is.

To be clear, I think a plausible story for AI becoming dangerously schemy/misaligned is that doing clever and actively bad behavior in training will be actively reinforced due to imperfect feedback signals (aka reward hacking).

So, I am interested in the question: when some types of "bad behavior" get reinforced, how does this generalize?

Yesterday Adam Shai put up a cool post which… well, take a look at the visual:

Yup, it sure looks like that fractal is very noisily embedded in the residual activations of a neural net trained on a toy problem. Linearly embedded, no less.

I (John) initially misunderstood what was going on in that post, but some back-and-forth with Adam convinced me that it really is as cool as that visual makes it look, and arguably even cooler. So David and I wrote up this post / some code, partly as an explainer for why on earth that fractal would show up, and partly as an explainer for the possibilities this work potentially opens up for interpretability.

One sentence summary: when tracking the hidden state of a hidden Markov model, a Bayesian’s...
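For the mechanics the post is gesturing at, here is a minimal sketch (toy transition and emission matrices of my own choosing) of Bayesian belief-state tracking for an HMM; the trajectory of these belief vectors on the probability simplex is the object that shows up in the fractal visual:

```python
import numpy as np

# Toy HMM: 3 hidden states, 2 observation symbols (matrices are made up for illustration).
T = np.array([[0.8, 0.1, 0.1],   # T[i, j] = P(next state = j | current state = i)
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
E = np.array([[0.9, 0.1],        # E[i, o] = P(observe o | hidden state = i)
              [0.5, 0.5],
              [0.1, 0.9]])

def update_belief(belief, obs):
    """One step of Bayesian belief-state tracking: propagate the belief through the
    transition dynamics, reweight by the observation likelihood, then renormalize."""
    predicted = belief @ T
    unnormalized = predicted * E[:, obs]
    return unnormalized / unnormalized.sum()

belief = np.ones(3) / 3                  # start from a uniform prior over hidden states
for obs in [0, 0, 1, 1, 0]:
    belief = update_belief(belief, obs)
    print(belief)                        # each belief is a point on the 2-simplex
```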

Is there a link to the code? I'm overlooking it if so; it would be useful to see.

Elon Musk's Hyperloop proposal had substantial public interest. With various initial Hyperloop projects now having failed, I thought some people might be interested in a high-speed transportation system that's...perhaps not "practical" per se, but at least more-practical than the Hyperloop approach.

aerodynamic drag in hydrogen

Hydrogen has a lower molecular mass than air, so it has a higher speed of sound and lower density. The higher speed of sound means a vehicle in hydrogen can travel at 2300 mph while remaining subsonic, and the lower density reduces drag. This paper evaluated the concept and concluded that:

the vehicle can cruise at Mach 2.8 while consuming less than half the energy per passenger of a Boeing 747 at a cruise speed of Mach 0.81

In a tube, at subsonic speeds, the gas...
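As a quick sanity check on the speed-of-sound claim above, here is a small back-of-the-envelope script (standard ideal-gas formula; the temperature and molar masses are my own round numbers):

```python
import math

R = 8.314        # J/(mol*K), gas constant
T = 293.0        # K, roughly room temperature
gamma = 1.4      # heat capacity ratio for diatomic gases (N2/O2 and H2 alike)

def speed_of_sound(molar_mass_kg_per_mol: float) -> float:
    """Ideal-gas speed of sound: c = sqrt(gamma * R * T / M)."""
    return math.sqrt(gamma * R * T / molar_mass_kg_per_mol)

MS_TO_MPH = 2.237
c_air = speed_of_sound(0.0290)   # air, ~29 g/mol
c_h2 = speed_of_sound(0.00202)   # hydrogen, ~2 g/mol

print(round(c_air * MS_TO_MPH), "mph")   # ~770 mph
print(round(c_h2 * MS_TO_MPH), "mph")    # ~2900 mph

# 2300 mph is roughly Mach 3 relative to air, but still subsonic in hydrogen.
print(round(2300 / (c_air * MS_TO_MPH), 2), round(2300 / (c_h2 * MS_TO_MPH), 2))
```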

gilch
Hydrogen can only burn in the presence of oxygen. The pipe does not contain any, and combustion isn't possible until after they have had time to mix. It's also not going to explode from the pressure, because it's the same as the atmosphere. The shaped charge is obviously going to explode, that's the point, but it will be more directional. That still doesn't sound safe in an enclosed space. Maybe the vehicle could deploy a gasket seal with airbags or something to reduce the leakage of expensive hydrogen.
bhauth
It can't use "air" around it for engines because what's around it isn't "air". Oxygen is much heavier than the fuel it's used with, and you'd either need liquid oxygen (which increases costs) or pressurized tanks (which would perhaps double that mass). That's still lighter than batteries, yes, but engines are also needed. Piston engines are inefficient and/or heavy, and gas turbines are somewhat expensive. It's not that difficult to separate water and hydrogen, that's true, but processing that much gas is still rather impractical when batteries have enough specific energy. Simply condensing it in the tube is...possible, but would increase drag, especially considering density variation issues, and you'd have to deal with getting it out of a long sealed tube without leaking hydrogen. Also, if batteries are good enough, the cost of replacing the hydrogen alone probably makes batteries better than burning the hydrogen.
gilch
Condensation is not just possible but would happen by default. You described the tubes as steel lined with aluminum in contact with the ground, if not buried. That's going to be consistently cool enough for passive condensation. Getting water out of a long tube shouldn't be hard with multiple drains, and if there's any incline, you just need them at the bottom. You can just dump it in the ground. Use a plumbing trap to keep the gasses separated. They're at equal pressure, so this should work, and the pressure can also be maintained mostly passively with hydrogen bladders exposed to the atmosphere on the outside, although the burned hydrogen will have to be regenerated before they empty completely; this can be done anywhere on the pipe. Hydrogen can be easily regenerated by electrolysis of water, which doesn't seem any more expensive than charging the batteries. It might be even cheaper to crack it off of natural gas or to use white hydrogen when available. Are turbines more expensive than electric motors for similar power? It's true that conventional piston engines are heavy, but batteries are also heavy, especially the cheaper chemistries. Alternatively, run electricity through the pipe to power the vehicles so they don't have to carry any extra weight for power. It's coated with conductive aluminum already. If half-pipes could be welded with a dielectric material and not cost any more, that would work. Or use an internal monorail, but maybe only if you were going to do that already. Or you could suspend a wire. That's got to be pretty cheap compared to the pipe itself.

    …run electricity through the pipe…

Simpler to do what some existing electric trains do: use the rails as ground, and have a charged third rail for power.  We don’t like this system much for new trains, because the third rail is deadly to touch.  It’s a bad thing to leave lying on the ground where people can reach it.  But in this system, it’s in a tube full of unbreathable hydrogen, so no one is going to casually come across it.
