That’s the path the world seems to be on at the moment. It might end well and it might not, but it seems like we are on track for a heck of a roll of the dice.
I agree with almost everything you've written in this post, but you must have some additional inside information about how the world got to this state, having been on the board of OpenAI for several years, and presumably knowing many key decision makers. Presumably this wasn't the path you hoped that OpenAI would lead the world onto when you decided to get involved? Maybe you can't share specific ...
How much have you looked into other sources of AI risk?
We have a lot of experience and knowledge of building systems that are broadly beneficial and safe, while operating in the human capabilities regime.
What? A major reason we're in the current mess is that we don't know how to do this. For example, we don't seem to know how to build a corporation (or more broadly an economy) such that its most powerful leaders don't act like Hollywood villains (racing for AI to make a competitor 'dance'). Even our "AGI safety" organizations don't behave safely (e.g., racing for capabilities, handing them over to others, e.g....
Looking forward to your next post, but in the meantime:
My first thought upon hearing about Microsoft deploying a GPT derivative was (as I told a few others in private chat) "I guess they must have fixed the 'making up facts' problem." My thinking was that a big corporation like Microsoft that mostly sells to businesses would want to maintain a reputation for only deploying reliable products. I honestly don't know how to adjust my model of the world to account for whatever happened here... except to be generically more pessimistic?
But it seems increasingly plausible that AIs will not have explicit utility functions, so that doesn’t seem much better than saying humans could merge their utility functions.
There are a couple of ways to extend the argument:
Whereas shard theory seems aimed at a model of human values that’s both accurate and conceptually simple.
Let's distinguish between shard theory as a model of human values, versus implementing an AI that learns its own values in a shard-based way. The former seems fine to me (pending further research on how well the model actually fits), but the latter worries me in part because it's not reflectively stable and the proponents haven't talked about how they plan to ensure that things will go well in the long run. If you're talking about the former and I'm ...
PBR-A, EGY, BTU, ARCH, AMR, SMR.AX, YAL.AX (probably not a good time to enter this last one) (Not investment advice, etc.)
My personal view is that given all of this history and the fact that this forum is named the "AI Alignment Forum", we should not redefine "AI Alignment" to mean the same thing as "Intent Alignment". I feel like to the extent there is confusion/conflation over the terminology, it was mainly due to Paul's (probably unintentional) overloading of "AI alignment" with the new and narrower meaning (in Clarifying “AI Alignment”), and we should fix that error by collectively going back to the original definition, or in some circumstances where the risk of confusion...
Other relevant paragraphs from the Arbital post:
...“AI alignment theory” is meant as an overarching term to cover the whole research field associated with this problem, including, e.g., the much-debated attempt to estimate how rapidly an AI might gain in capability once it goes over various particular thresholds.
Other terms that have been used to describe this research problem include “robust and beneficial AI” and “Friendly AI”. The term “value alignment problem” was coined by Stuart Russell to refer to the primary subproblem of aligning AI preferences with...
Here is some clearer evidence that broader usages of "AI alignment" were common from the beginning:
The “alignment problem for advanced agents” or “AI alignment” is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.
(I couldn't find an easy way to view the original 2015 version, but do have a screenshot that I can produce upon request showing a Jan 2017 edit on Arbital that already had this broad def...
Your main justification was that Eliezer used the term with an extremely broad definition on Arbital, but the Arbital page was written way after a bunch of other usage (including after me moving to ai-alignment.com I think).
Eliezer used "AI alignment" as early as 2016 and ai-alignment.com wasn't registered until 2017. Any other usage of the term that potentially predates Eliezer?
I’m not sure what order the history happened in and whether “AI Existential Safety” got rebranded into “AI Alignment” (my impression is that AI Alignment was first used to mean existential safety, and maybe this was a bad term, but it wasn’t a rebrand)
There was a pretty extensive discussion about this between Paul Christiano and me. tl;dr "AI Alignment" clearly had a broader (but not very precise) meaning than "How to get AI systems to try to do what we want" when it first came into use. Paul later used "AI Alignment" for his narrower meaning, but after...
UDT still has utility functions, even though it doesn't have independence... Is it just a terminological issue? Like you want to call the representation of value in whatever the correct decision theory turns out to be something besides "utility"? If so, why?
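For concreteness, here is a rough, simplified sketch of the UDT decision rule as I think of it (notation mine, and it glosses over logical uncertainty and the policy-vs-single-output distinction):

```latex
% Simplified sketch of a UDT-style decision rule (my notation, details glossed over).
% The agent picks a policy by evaluating, under a fixed prior over worlds, the
% consequences of "my code implements policy pi":
\[
  \pi^{*} \in \arg\max_{\pi} \; \sum_{w} P(w)\, U\big(\text{outcome of world } w \text{ given that the agent's code implements } \pi\big)
\]
% A utility function U over outcomes (or world-histories) is still doing the work
% here, even though the rule isn't justified via the vNM axioms (including independence).
```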
I agree that reflectivity for learned systems is a major open question, and my current project is to study the reflectivity and self-modification related behaviors of current language models.
Interesting. I'm curious what kinds of results you're hoping for, or just more details about your project. (But feel free to ignore this if talking about it now isn't a good use of your time.) My understanding is that LLMs can potentially be fine-tuned or used to do various things, including instantiating various human-like or artificial characters (such as a "helpf...
I feel deconfused when I reject utility functions, in favor of values being embedded in heuristics and/or subagents.
"Humans don't have utility functions" and the idea of values of being embedded in heuristics and/or subagents have been discussed since the beginning of LW, but usually framed as a problem, not a solution. The key issue here is that utility functions are the only reflectively stable carrier of value that we know of, meaning that an agent with a utility function would want to preserve that utility function and build more agents with the sam...
I wrote about these when they first came out and have been wearing them since then, with good results. It looks like Amazon isn't selling them directly anymore (only via third-party sellers) but Northern Safety has them for $0.59 each, and sometimes for $0.19 each on sale. (The brand is different but it's the same mask.)
I think that multi-decision-influence networks seem much easier to align and much safer for humans.
It seems fine to me that you think this. As I wrote in a previous post, "Trust your intuitions, but don’t waste too much time arguing for them. If several people are attempting to answer the same question and they have different intuitions about how best to approach it, it seems efficient for each to rely on his or her intuition to choose the approach to explore."
As a further meta point, I think there's a pattern where because many existing (somewhat) conc...
I think that baggage is actually doing work in some people’s reasoning and intuitions.
Do you have any examples of this?
That said, I do think that multi objective optimisation is way more existentially safe than optimising for a single simple objective. I don’t actually think the danger directly translates. And I think it’s unlikely that multi-objective optimisers would not care about humans or other agents.
I think one possible form of existential catastrophe is that human values get only a small share of the universe, and as a result the "utility" of the universe is much smaller than it could be. I worry this will happen if only one or a few of the objectives of multi obj...
Speaking for myself, I sometimes use "EU maximization" as shorthand for one of the following concepts, depending on context:
Overall I expect there to be a small number of massive training runs due to economies of scale, but I also expect AI developer margins to be reasonable, and I don’t see a strong reason to expect them to end up with way more power than other actors in the supply chain (either the companies who supply computing power, or the downstream applications of AI).
Is the reason that you expect AI developer margins to be reasonable that you expect the small number of AI developers to still compete with each other on price and thereby erode each other's margins? What...
build and provide clean and effectively unlimited energy
How? The closest thing I can think of is nuclear fission but calling it "clean" seems controversial to say the least. Nuclear fusion seems a long way away from being economically viable. If you're talking about solar and wind, I think there are good arguments against calling it "effectively unlimited", or "clean" for that matter. See https://www.youtube.com/watch?v=xXv-ugeTLlw for a lecture about this.
I am referring to, among other things, humanity’s unfortunate retreat from space exploration.
C...
If we succeed at the technical problem of AI alignment, AI developers would have the ability to decide whether their systems generate sexual content or opine on current political events, and different developers can make different choices. Customers would be free to use whatever AI they want, and regulators and legislators would make decisions about how to restrict AI.
Presumably if most customers are able to find companies offering AIs that align sufficiently with their own preferences, there would be no backlash. The kind of backlash you're worried abo...
This piece is aimed at a broad audience, because I think it’s important for the challenges here to be broadly understood.
I'm curious how you're trying to reach such an audience, and what their reactions have been.
Let us assume that, on average, a booster given to a random person knocks you on your ass for a day. That’s one hundred years, an actual lifetime, of knocked-on-ass time for every hospitalization prevented. The torture here seems less bad than the dust specks.
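For reference, the implied ratio here (my arithmetic, not anything stated in the original):

```latex
% My arithmetic, not from the quoted post: at one day lost per booster,
% "100 years per hospitalization prevented" implies roughly
\[
  100 \text{ years} \times 365\ \tfrac{\text{days}}{\text{year}} \approx 36{,}500 \text{ boosters per hospitalization prevented.}
\]
```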
What's your source for "booster given to a random person knocks you on your ass for a day"? None of my family had more than a sore arm.
For the more severe consequences, see also https://twitter.com/DrCanuckMD/status/1600259874272989184, which is one of the replies to the tweet you linked. (Don't have...
calls out Bostrom as out of touch
I think he actually said that Bostrom represents the current zeitgeist, which is kind of the opposite of "out of touch"? (Unless he also said "out of touch"? Unfortunately I can't find a transcript to do a search on.)
It's ironic that everyone thinks of themselves as David fighting Goliath. We think we're fighting unfathomably powerful economic forces (i.e., Moloch) trying to build AGI at any cost, and Peter thinks he's fighting a dominant culture that remorselessly smothers any tech progress.
Accordingly, I think there’s a tendency to give OpenAI an unfair amount of flak compared to say, Google Brain or FAIR or any of the startups like Adept or Cohere.
I'm not sure I agree that this is unfair.
OpenAI is clearly on the cutting edge of AI research.
This is obviously a good reason to focus on them more.
OpenAI has a lot of visibility in this community, due to its physical proximity and a heavy overlap between OpenAI employees and the EA/Rationalist social scene.
Perhaps we have a responsibility to scrutinize/criticize them more because of this...
I guess it depends on the specific alignment approach being taken, such as whether you're trying to build a sovereign or an assistant. Assuming the latter, I'll list some philosophical problems that seem generally relevant:
To the extent that alignment research involves solving philosophical problems, it seems that in this approach we will also need to automate philosophy, otherwise alignment research will become bottlenecked on those problems (i.e., on human philosophers trying to solve those problems while the world passes them by). Do you envision automating philosophy (and are you optimistic about this) or see some other way of getting around this issue?
It worries me to depend on AI to do philosophy, without understanding what "philosophical reasoning" or "philosophical p...
If anyone here has attended a top university, was the effort to get in worth it to you (in retrospect)?
For me the answer is yes, but my situation is quite non-central. I got into MIT since I was a kid from a small rural town with really good grades, really good test scores, and was on a bunch of sports teams. Because I was from a small rural town and was pretty smart, none of this required special effort other than being on sports teams (note: being on the teams required no special skill, as everyone who tried out made the team given the small class size). The above was enough to get me an admission, probably for reasons of diversity. I'm a white man but I'm fairl...
Thanks for these detailed explanations. Would it be fair to boil it down to: DL currently isn't very sample efficient (relative to humans) and there's a lot more data available for training generative capabilities than for training to self-censor and to not make stuff up? Assuming yes, my next questions are:
If I train a human to self-censor certain subjects, I'm pretty sure that would happen by creating an additional subcircuit within their brain where a classifier pattern matches potential outputs for being related to the forbidden subjects, and then they avoid giving the outputs for which the classifier returns a high score. It would almost certainly not happen by removing their ability to think about those subjects in the first place.
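To make that picture concrete, here's a minimal toy sketch (all names hypothetical, not anyone's actual pipeline) of a "classifier patch" sitting on top of an intact capability, rather than the capability being removed:

```python
# Toy sketch: a separate classifier scores candidate outputs for forbidden topics,
# and outputs scoring above a threshold are refused. The underlying generator is
# untouched -- the "knowledge" is still there, only gated.
# `generate` and `topic_score` are hypothetical stand-ins.

FORBIDDEN_KEYWORDS = {"how to make explosives", "credit card numbers"}

def topic_score(text: str) -> float:
    """Stand-in classifier: returns a 'forbidden topic' score in [0, 1]."""
    return 1.0 if any(k in text.lower() for k in FORBIDDEN_KEYWORDS) else 0.0

def generate(prompt: str) -> str:
    """Stand-in for the underlying (uncensored) generator."""
    return "..."  # model output would go here

def guarded_generate(prompt: str, threshold: float = 0.5) -> str:
    candidate = generate(prompt)
    if topic_score(candidate) >= threshold:
        return "I can't help with that."
    return candidate
```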
So I think you're very likely right about adding patches being easier than unlearning capabilities, but what confuses me is ...
Any thoughts on why it's taking so long to solve these problems (reliably censoring certain subjects, avoiding hallucinations / making up answers)? Naively these problems don't seem so hard, so I wouldn't have expected them to remain largely unsolved after several years while being very prominent and embarrassing for labs like OpenAI.
Also, given that hallucinations are a well-known problem, why didn't OpenAI train ChatGPT to reliably say that it can sometimes make up answers, as opposed to often denying that? ("As a language model, I do not have the ability to m...
Among other issues, we might be learning this early item from a meta-predictable sequence of unpleasant surprises: Training capabilities out of neural networks is asymmetrically harder than training them into the network.
Or put with some added burdensome detail but more concretely visualizable: To predict a sizable chunk of Internet text, the net needs to learn something complicated and general with roots in lots of places; learning this way is hard, the gradient descent algorithm has to find a relatively large weight pattern, albeit presumably...
In addition to reasons other commenters have given, I think that architecturally it's a bit hard to avoid hallucinating. The model often thinks in a way that is analogous to asking itself a question and then seeing what answer pops into its head; during pretraining there is no reason for the behavior to depend on the level of confidence in that answer, you basically just want to do a logistic regression (since that's the architecturally easiest thing to say, and you have literally 0 incentive to say "I don't know" if you don't know!) , and so the model may...
My understanding of why it's especially hard to stop the model making stuff up (while not saying "I don't know" too often), compared to other alignment failures:
Not to put too fine a point on it, but you're just wrong that these are easy problems. NLP is hard because language is remarkably complex. NLP is also hard because it feels so easy from the inside -- I can easily tell what that pronoun refers to, goes the thinking, so it should be easy for the computer! But it's not, fully understanding language is very plausibly AI-complete.
Even topic classification (which is what you need to reliably censor certain subjects), though it seems simple, has literal decades of research and is not all that close to being solved.
So I think you should update much more towards "NLP is much harder than I thought" rather than "OpenAI should be embarrassed at how crappy their NLP is".
ChatGPT has a tendency to make up answers and refuse to admit it. A log follows. Note that almost every paragraph of the initial biography contains multiple falsehoods that are unlikely to be directly in its training data.
Show me a 10 paragraph biography of Wei Dai
Wei Dai was born in Guangzhou, China in 1967. She received her undergraduate degree in computer science from the University of Science and Technology of China in 1988 and her master's degree in computer science from the University of California, Berkeley in 1991.
After completing her master's ...
After market close on 10/26/2022, Meta guided an increase in annual capex of ~$4B (from 32-33 for 2022 to 34-39 for 2023), "with our investment in AI driving all of that growth". NVDA shot up 4% after hours on this news. (Before you get too alarmed, I read somewhere that most of that is going towards running ML on videos, which is apparently very computationally expensive, in order to improve recommendations, in order to compete with TikTok. But one could imagine all that hardware being repurposed for something else down the line. Plus, maybe it's not a gre...
This reminds me of an example I described in this SL4 post:
...After suggesting in a previous post [1] that AIs who want to cooperate with each other may find it more efficient to merge than to trade, I realized that voluntary mergers do not necessarily preserve Bayesian rationality, that is, rationality as defined by standard decision theory. In other words, two "rational" AIs may find themselves in a situation where they won't voluntarily merge into a "rational" AI, but can agree to merge into an "irrational" one. This seems to suggest that we shouldn't expec
Philosophers don’t discuss things which can be falsified.
Sometimes in life, one simply faces questions whose answers can't be falsified, such as "What should we do about things which can't be falsified?" If you're proposing to avoid discussing them, well aren't you discussing one of them now? And why should we trust you, without discussing it ourselves?
I think you had the bad luck of taking a couple of philosophy classes that taught things that were outdated or "insane". (Socrates and Aristotle may have been very confused, but consider, how did we, i.e....
This is tempting, but the problem is that I don't know what my idealized utility function is (e.g., I don't have a specification for CEV that I think would be safe or ideal to optimize for), so what does it mean to try to approximate it? Or consider that I only read about CEV one day in a blog, so what was I doing prior to that? Or if I was supposedly trying to approximate CEV, I can change my mind about it if I realized that it's a bad idea, but how does that fit into the framework?
My own framework is something like this:
Imagine someone who considers a few plans, grades them (e.g. “how good does my gut say this plan is?”), and chooses the best. They are not a grader-optimizer. They are not trying to navigate to the state where they propose and execute a plan which gets maximally highly rated by some evaluative submodule. They use a grading procedure to locally rate and execute plans, and may even locally think “what would make me feel better about this plan?”, but the point of their optimization isn’t “find the plan which makes me feel as good as globally possible.”
The ...
so making choices which drop the odds of success so drastically
I wouldn't say "drastically" here so maybe this is the crux. I think the chances of success, if China does make an all-out push for semiconductors, are very low given its own resources and likely US and allies' responses (e.g. they could collectively way outspend China on their own subsidies). I could express this as <1% chance of having a world-leading semi fab 10 years from now and <5% chance 20 years from now, no matter what China chooses to do at this point. If hegemony was the only g...
I agree with Rob Bensinger's response here, plus it's just a really weird use of "insane", like saying that Japan would have been insane not to attack Pearl Harbor after the US imposed an oil embargo on them, because "You miss 100% of the shots you don’t take." Thinking that way only makes sense if becoming a world or regional hegemon was your one and only goal, but how did that become the standard for sanity of world leaders around here?
I agree that humans sometimes fall prey to adversarial inputs, and am updating up on dangerous-thought density based on your religion argument. Any links to where I can read more?
Maybe https://en.wikipedia.org/wiki/Extraordinary_Popular_Delusions_and_the_Madness_of_Crowds (I don't mean read this book, which I haven't read either, but you could use the wiki article to familiarize yourself with the historical episodes that the book talks about.) See also https://en.wikipedia.org/wiki/Heaven's_Gate_(religious_group)
...However, this does not seem important for my
From the scaling-pilled perspective, or even just centrist AI perspective, this is an insane position: it is taking an L on one of, if not the most, important future technological capabilities, which in the long run may win or lose wars.
Are you suggesting that the sane policy is for Xi to dump in as much subsidies as needed until China catches up in semiconductors with the US and its allies? I haven't seen anyone else argue this, and it seems implausible to me, given that the latter collectively has much greater financial and scientific/engineering resou...
Are you suggesting that the sane policy is for Xi to dump in as much subsidies as needed until China catches up in semiconductors with the US and its allies? I haven't seen anyone else argue this
Yes. And perhaps no one else does because they aren't scaling proponents. But from a scaling perspective, accepting a permanent straitjacket around GPUs & a tightened noose is tantamount to admitting defeat & abandoning the future to other countries; it'd be like expelling all your Jewish scientists in 1935 & banning the mining of uranium. It's not t...
I wonder if given the COVID and real estate crises, Xi's government just doesn't have the financial resources to bail out the chips industry, plus maybe they (correctly?) understand that the likelihood of building an internationally competitive chips industry is poor (given the sanctions) even if they do dump in another $200b?
Also, it seems like China is being less antagonistic towards Taiwan and other countries in the last few days. Together with the lack of chips bailout, maybe it means they've realized that it was too early to "go loud" and are pivoting...
Thanks for this longer reply and the link to your diamond alignment post, which help me understand your thinking better. I'm sympathetic to a lot of what you say, but feel like you tend to state your conclusions more strongly than the underlying arguments warrant.
The adversarial optimization comes from other people who are optimizing ideas to get spurious buy-in from victims.
I think a lot of crazy religions/ideologies/philosophies come from people genuinely trying to answer hard questions for themselves, but there are also some that are deliberate atte...
It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren’t argmaxing. You’re using resources effectively.
But aren't you still argmaxing within the space of plans that you haven't closed off (or are actively considering), and still taking a risk of finding some adversarial plan within that space? (Humans get scammed and invent or fall into cults and crazy ideologies not infrequently, despite doing wh...
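A toy illustration of the worry (made-up numbers, obviously not a model of real planning): even after closing off most of the plan space, taking the argmax over what's left under an imperfect grader still selects for the grader's largest upward errors.

```python
# Toy model: argmaxing over a restricted plan space with a noisy grader.
# The selected plan's graded score systematically overstates its true value
# (optimizer's curse), so restricting the space shrinks but does not remove
# the risk of landing on an adversarially-good-looking plan.
import random

random.seed(0)

def true_value(plan: int) -> float:
    return -abs(plan - 50) / 50.0                     # plan 50 is genuinely best

def grade(plan: int) -> float:
    return true_value(plan) + random.gauss(0, 0.5)    # imperfect evaluation

restricted_plans = range(40, 61)                      # the part of the space left open
scores = {p: grade(p) for p in restricted_plans}
chosen = max(scores, key=scores.get)

print(f"chosen plan:  {chosen}")
print(f"graded score: {scores[chosen]:.2f}")
print(f"true value:   {true_value(chosen):.2f}")
```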
I don't think I understand, what's the reason to expect that the "acausal economy" will look like a bunch of acausal norms, as opposed to, say, each civilization first figuring out what its ultimate values are, how to encode them into a utility function, then merging with every other civilization's utility function? (Not saying that I know it will be the latter, just that I don't know how to tell at this point.)
Also, given that I think AI risk is very high for human civilization, and there being no reason to suspect that we're not a typical pre-AGI civiliz...
To your first question, I'm not sure which particular "the reason" would be most helpful to convey. (To contrast: what's "the reason" that physically dispersed human societies have laws? Answer: there's a confluence of reasons.) However, I'll try to point out some things that might be helpful to attend to.
First, committing to a policy that merges your utility function with someone else's is quite a vulnerable maneuver, with a lot of boundary-setting aspects. For instance, will you merge utility functions multiplicatively (as in Nas...
What does merging utility functions look like, and are you sure it's not going to look the same as global free trade? It's arguable that trade is just a way of breaking down and modularizing a big multifaceted problem over a lot of subagent task specialists (and there's no avoiding having subagents, due to the light speed limit).
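For what it's worth, two standard ways one could formalize "merging utility functions" (my notation, not necessarily what the parent comment had in mind):

```latex
% Two standard shapes a merged utility function could take (my notation):
% a weighted-sum merge, and a bargaining-style product merge, where d_A and
% d_B are the agents' disagreement/fallback payoffs.
\[
  U_{\text{sum}} = \alpha\, U_A + (1 - \alpha)\, U_B,
  \qquad
  U_{\text{prod}} = (U_A - d_A)\,(U_B - d_B)
\]
% Which form gets chosen, and how alpha or the fallback points get set, is
% exactly the kind of bargaining that might or might not end up looking like
% ordinary trade.
```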