Democratic processes are important loci of power. It's useful to understand the dynamics of the voting methods used in real-world elections. My own ideas about ethics and fun theory are deeply informed by my decades of interest in voting theory.

Richard_Ngo
Some opinions about AI and epistemology:

1. One reason that many rationalists have such strong views about AI is that they are wrong about epistemology. Specifically, Bayesian rationalism is a bad way to think about complex issues.

2. A better approach is meta-rationality. To summarize one guiding principle of (my version of) meta-rationality in a single sentence: if something doesn't make sense in the context of group rationality, it probably doesn't make sense in the context of individual rationality either.

3. For example: there's no privileged way to combine many people's opinions into a single credence. You can average them, but that loses a lot of information. Or you can get them to bet on a prediction market, but that depends a lot on the details of the individuals' betting strategies. The group might settle on a number to help with planning and communication, but it's only a lossy summary of many different beliefs and models. Similarly, we should think of individuals' credences as lossy summaries of different opinions from different underlying models that they have.

4. How does this apply to AI? Suppose we think of ourselves as having many different subagents that focus on understanding the world in different ways - e.g. studying different disciplines, using different styles of reasoning, etc. The subagent that thinks about AI from first principles might come to a very strong opinion. But this doesn't mean that the other subagents should fully defer to it (just as having one very confident expert in a room of humans shouldn't cause all the other humans to elect them as the dictator). E.g. maybe there's an economics subagent who will remain skeptical unless the AI arguments can be formulated in ways that are consistent with their knowledge of economics, or the AI subagent can provide evidence that is legible even to those other subagents (e.g. advance predictions).

5. In my debate with Eliezer, he didn't seem to appreciate the importance of advance predictions; I think the frame of "highly opinionated subagents should convince other subagents to trust them, rather than just seizing power" is an important aspect of what he's missing. I think of rationalism as trying to form a single fully-consistent world-model; this has many of the same pitfalls as a country which tries to get everyone to agree on a single ideology. Even when that ideology is broadly correct, you'll lose a bunch of useful heuristics and intuitions that help actually get stuff done, because ideological conformity is prioritized.

6. This perspective helps frame the debate about what our "base rate" for AI doom should be. I've been in a number of arguments where people say things like "why is 90% doom such a strong claim? That assumes that survival is the default!" But in fact there's no one base rate; instead, different subagents with different domains of knowledge will have different base rates. That will push P(doom) lower, because most frames from most disciplines, and most styles of reasoning, don't predict doom. That's where the asymmetry which makes 90% a much stronger prediction than 10% comes from.

7. This perspective is broadly aligned with a bunch of stuff that Scott Garrabrant and Abram Demski have written about (e.g. geometric rationality, Garrabrant induction). I don't think the ways I'm applying it to AI risk debates straightforwardly fall out of their more technical ideas; but I do expect that more progress on agent foundations will make it easier to articulate ideas like the ones above.
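A minimal sketch of point 3, with made-up numbers (the credences and pooling rules below are illustrative, not from the shortform): the same three individual credences yield noticeably different "group" numbers depending on the pooling rule.

```python
# Illustrative sketch: three common ways to pool individual credences into one
# group number give three different answers on the same inputs.
import math

credences = [0.99, 0.2, 0.2]  # hypothetical credences on the same claim

# Arithmetic mean of probabilities
arith = sum(credences) / len(credences)

# Geometric mean of odds, converted back to a probability
odds = [p / (1 - p) for p in credences]
geo_odds = math.prod(odds) ** (1 / len(odds))
geo = geo_odds / (1 + geo_odds)

# Median credence
median = sorted(credences)[len(credences) // 2]

print(f"arithmetic mean:        {arith:.2f}")  # ~0.46
print(f"geometric mean of odds: {geo:.2f}")    # ~0.65
print(f"median:                 {median:.2f}")  # 0.20
```

None of these aggregates is privileged; each one discards different information about the underlying disagreement.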
People with p(doom) > 50%: would any concrete empirical achievements on current or near-future models bring your p(doom) under 25%? Answers could be anything from "the steering vector for corrigibility generalizes surprisingly far" to "we completely reverse-engineer GPT4 and build a trillion-parameter GOFAI without any deep learning".
I had this thought yesterday: "If someone believes in the 'AGI lab nationalization by default' story, then what would it look like to build an organization or startup in preparation for this scenario?" For example, you try to develop projects that would work exceptionally well in a 'nationalization by default' world while not getting as much payoff in a non-nationalization world. The goal here is to do the normal startup thing: risky bets with a potentially huge upside. I don't necessarily support nationalization and am still trying to think through the upsides/downsides, but I was wondering whether some safety projects become much more viable in such a world - the kind of thing we used to brush off because we assumed we weren't in that world. Some vague directions to perhaps consider: AI security, Control agenda-type stuff, fully centralized monitoring to detect early signs of model drift, inaccessible compute unless you are a government-verified user, etc. By building such tech or proposals, you may be much more likely to end up with a seat at the big boy table, whereas you wouldn't in a non-nationalization world. I could be wrong about the specific examples above, but I just want to provide some quick examples.
Why do people like big houses in the countryside/suburbs? Empirically, people move out to the suburbs/countryside when they have children and/or gain wealth. Having a big house with a large yard is the quintessential American dream. But why? Dense cities are economically more productive, and commuting is measurably one of the worst factors for happiness and productivity. Raising kids in small houses is entirely possible, and people have done so at far higher densities in the past.

Yet people will spend vast amounts of money on living in a large house with lots of space - even if they rarely use most rooms. Having a big house is almost synonymous with wealth and status.

Part of the reason may be an evolved disease response. In the past, the most common way to die was as a child, of a crowd disease. There was no medicine that actually worked, yet wealthier people had much longer lifespans and out-reproduced the poor (see Gregory Clark). The best way to buy health was to move out of the city (cities were population sinks until late modernity) and live in a large, airy house.

It seems like an appealing model. On the other hand, there are some obvious predicted regularities that aren't observed, to my knowledge.
New (perfunctory) page: AI companies' corporate documents. I'm not sure it's worthwhile but maybe a better version of it will be. Suggestions/additions welcome.


Recent Discussion

Alexander Gietelink Oldenziel

I can report my own feelings with regards to this. I find cities (at least the American cities I have experience with) to be spiritually fatiguing. The constant sounds, the lack of anything natural, the smells - they all contribute to a lack of mental openness and quiet inside of myself.

The older I get the more I feel this.

Jefferson had a quote that might be related, though to be honest I'm not exactly sure what he was getting at:
 

I think our governments will remain virtuous for many centuries; as long as they are chiefly agricultural; and this will be…

DanielFilan
Could we do your $350 to my $100? And the voiding condition makes sense.

Yup, sounds good! I've set myself a reminder for November 9th.

Paper authors: Erik Jenner, Shreyas Kapur, Vasil Georgiev, Cameron Allen, Scott Emmons, Stuart Russell

TL;DR: We released a paper with IMO clear evidence of learned look-ahead in a chess-playing network (i.e., the network considers future moves to decide on its current one). This post shows some of our results, and then I describe the original motivation for the project and reflect on how it went. I think the results are interesting from a scientific and perhaps an interpretability perspective, but only mildly useful for AI safety.

Teaser for the results

(This section is copied from our project website. You may want to read it there for animations and interactive elements, then come back here for my reflections.)

Do neural networks learn to implement algorithms involving look-ahead or search in the wild?...

p.b.
I think the spectrum you describe is between pattern recognition by literal memorisation and pattern recognition building on general circuits.

There are certainly general circuits that compute whether a certain square can be reached by a certain piece on a certain other square. But if in the entire training data there was never a case of a piece blocking the checkmate by rook to h4, the existence of a circuit that computes the information that the bishop on d2 can drop back to h6 is not going to help the "pattern recognition" network to predict that Ng6 is not a feasible option. The "lookahead" network, however, would go through these moves and assess that 2.Rh4 is not mate because of 2...Bh6.

The lookahead algorithm would allow it to use general low-level circuits like "block mate" and "move bishop/queen on a diagonal" to generalise to unseen combinations of patterns.
Erik Jenner
I still don't see the crisp boundary you seem to be getting at between "pattern recognition building on general circuits" and what you call "look-ahead." It sounds like one key thing for you is generalization to unseen cases, but the continuous spectrum I was gesturing at also seems to apply to that. For example: if the training data had an example of a rook checkmate on h4 being blocked by a bishop to h6, you could imagine many different possibilities:

  • This doesn't generalize to a rook checkmate on h3 being blocked by a bishop (i.e. the network would get that change wrong if it hasn't also explicitly seen it)
  • This generalizes to rook checkmates along the h-file, but doesn't generalize to rook checkmates along other files
  • This generalizes to arbitrary rook checkmates
  • This also generalizes to bishop checkmates being blocked
  • This also generalizes to a rook trapping the opponent's queen (instead of the king)
  • ...

(Of course, this generalization question is likely related to the question of whether these different cases share "mechanisms.")

At the extreme end of this spectrum, I imagine a policy whose performance only depends on some simple measure of "difficulty" (like branching factor/depth needed) and which internally relies purely on simple algorithms like tree search without complex heuristics. To me, this seems like an idealized limit point of this spectrum (and not something we'd expect to actually see; for example, humans don't do this either). You might have something different/broader in mind for "look-ahead," but when I think about broader versions of this, they just bleed into what seems like a continuous spectrum.
p.b.

I don't think my argument relies on the existence of a crisp boundary. Just on the existence of a part of the spectrum that clearly is just pattern recognition and not lookahead but still leads to the observations you made. 

Here is one (thought) experiment to tease this apart: Imagine you train the model to predict whether a position leads to a forced checkmate and also the best move to make. You pick one tactical motif and erase it from the checkmate-prediction part of the training set, but not from the move-prediction part.

Now the model still knows…
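A minimal sketch of how the dataset split in this thought experiment might be constructed; `contains_motif` and the label functions are hypothetical placeholders, not anything from the paper or this thread.

```python
# Rough sketch: erase one tactical motif from the mate-prediction data only,
# while keeping those same positions in the move-prediction data.

def build_training_sets(positions, motif, contains_motif,
                        forced_mate_label, best_move_label):
    """Return (mate_examples, move_examples) with the motif erased from the first."""
    mate_examples = [
        (pos, forced_mate_label(pos))
        for pos in positions
        if not contains_motif(pos, motif)  # motif erased from this task...
    ]
    move_examples = [
        (pos, best_move_label(pos))        # ...but kept for move prediction
        for pos in positions
    ]
    return mate_examples, move_examples
```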

Preamble: Delta vs Crux

I don’t natively think in terms of cruxes. But there’s a similar concept which is more natural for me, which I’ll call a delta.

Imagine that you and I each model the world (or some part of it) as implementing some program. Very oversimplified example: if I learn that e.g. it’s cloudy today, that means the “weather” variable in my program at a particular time[1] takes on the value “cloudy”. Now, suppose your program and my program are exactly the same, except that somewhere in there I think a certain parameter has value 5 and you think it has value 0.3. Even though our programs differ in only that one little spot, we might still expect very different values of lots of variables during execution - in...

Lucius Bushnaq
I kind of expect that things-people-call-their-values-that-are-not-their-revealed-preferences would be a concept that a smart AI that predicts systems coupled to humans would think in as well. It doesn't matter whether these stated values are 'incoherent' in the sense of not being in tune with actual human behavior; they're useful for modelling humans because humans use them to model themselves, and these self-models couple to their behavior, even if they don't couple in the sense of being the revealed preferences in an agentic model of the humans' actions.

Every time a human tries and mostly fails to explain what things they'd like to value if only they were more internally coherent and thought harder about things, a predictor trying to forecast their words and future downstream actions has a much easier time of it if it has a crisp operationalization of the endpoint the human is failing to operationalize.

An analogy: if you're trying to predict what sorts of errors a diverse range of students might make while trying to solve a math problem, it helps to know what the correct answer is. Or, if there isn't a single correct answer, what the space of valid answers looks like.
xpym
Oh, sure, I agree that an ASI would understand all of that well enough, but even if it wanted to, it wouldn't be able to give us either all of what we think we want, or what we would endorse in some hypothetical enlightened way, because neither of those things comprises a coherent framework that robustly generalizes far out-of-distribution for human circumstances, even for one person, never mind the whole of humanity. The best we could hope for is that some-true-core-of-us-or-whatever would generalize in such a way, the AI recognizes this, and it propagates that core while sacrificing inessential contradictory parts. But given that our current state of moral philosophy is hopelessly out of its depth relative to this, to the extent that people rarely even acknowledge these issues, trusting that an AI would get this right seems like a desperate gamble to me, even granting that we somehow could make it want to.

Of course, it doesn't look like we would get to choose not to be subjected to a gamble of this sort even if more people were aware of it, so maybe it's better for them to remain in blissful ignorance for now.
Ebenezer Dukakis
Was using a metaphorical "you". Probably should've said something like "gradient descent will find a way to read the next token out of the QFT-based simulation".

I suppose I should've said various documents are IID, to be more clear. I would certainly guess they are.

Generally speaking, yes. Well, if we're following standard ML best practices, we have a train set, a dev set, and a test set. The purpose of the dev set is to check and ensure that things are generalizing properly. If they aren't generalizing properly, we tweak various hyperparameters of the model and retrain until they do generalize properly on the dev set. Then we do a final check on the test set to ensure we didn't overfit the dev set. If you forgot or never learned this stuff, I highly recommend brushing up on it.

In principle we could construct a test set or dev set either before or after the model has been trained. It shouldn't make a difference under normal circumstances.

It sounds like maybe you're discussing a scenario where the model has achieved a level of omniscience, and it does fine on data that was available during its training, because it's able to read off of an omniscient world-model. But then it fails on data generated in the future, because the translation method for its omniscient world-model only works on artifacts that were present during training. Basically, the time at which the data was generated could constitute a hidden and unexpected source of distribution shift. Does that summarize the core concern?

(To be clear, this sort of acquired omniscience is liable to sound kooky to many ML researchers. I think it's worth stress-testing alignment proposals under these sorts of extreme scenarios, but I'm not sure we should weight them heavily in terms of estimating our probability of success. In this particular scenario, the model's performance would drop on data generated after training, and that would hurt the company's bottom line, and they would have a strong financial incentive…
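A minimal sketch of the train/dev/test workflow described above, with hypothetical `train_model` and `evaluate` placeholders and illustrative split fractions:

```python
import random

def three_way_split(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve off dev and test sets; the rest is train."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test

def tune_on_dev(train, dev, hyperparameter_grid, train_model, evaluate):
    """Pick hyperparameters by dev-set score; the test set is touched only once, at the end."""
    best_score, best_model = float("-inf"), None
    for hp in hyperparameter_grid:
        model = train_model(train, hp)
        score = evaluate(model, dev)   # the dev set guides the tuning loop
        if score > best_score:
            best_score, best_model = score, model
    return best_model                  # caller does one final evaluate(best_model, test)
```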
dxu

Well, if we're following standard ML best practices, we have a train set, a dev set, and a test set. The purpose of the dev set is to check and ensure that things are generalizing properly. If they aren't generalizing properly, we tweak various hyperparameters of the model and retrain until they do generalize properly on the dev set. Then we do a final check on the test set to ensure we didn't overfit the dev set. If you forgot or never learned this stuff, I highly recommend brushing up on it.

(Just to be clear: yes, I know what training and test sets are…

Seed

Human Intelligence Enhancement via Learning:

Intelligence enhancement could entail cognitive enhancements which increase the rate/throughput of cognition or increase memory, or the use of BCI or AI harnesses which offload work/agency or complement existing skills and awareness.

In the vein of strategies which could eventually lead to ASI alignment by leveraging human enhancement, there is an alternative to biological/direct enhancements which attempt to influence cognitive hardware: instead, attempt to externalize one's world model and some of the agency necessary…


(Part 3b of the CAST sequence)

In the first half of this document, Towards Formal Corrigibility, I sketched a solution to the stop button problem. As I framed it, the solution depends heavily on being able to detect manipulation, which I discussed on an intuitive level. But intuitions can only get us so far. Let’s dive into some actual math and see if we can get a better handle on things.

Measuring Power

To build towards a measure of manipulation, let’s first take inspiration from the suggestion that manipulation is somewhat the opposite of empowerment. And to measure empowerment, let’s begin by trying to measure “power” in someone named Alice. Power, as I touched on in the ontology in Towards Formal Corrigibility, is (intuitively) the property of having one’s values/goals...
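For reference, one standard formalization from the empowerment literature (not necessarily the one this sequence goes on to use) measures the empowerment of an agent in state $s_t$ as the channel capacity between its next $n$ actions and the state they lead to:

$$\mathcal{E}_n(s_t) \;=\; \max_{p(a_{1:n})} I\!\left(A_{1:n};\, S_{t+n} \mid s_t\right)$$

Higher empowerment means Alice's choices can more reliably steer which futures occur.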

Max Harms
Thanks. Picking out those excerpts is very helpful. I've jotted down my current (confused) thoughts about human values. But yeah, I basically think one needs to start with a hodgepodge of examples that are selected for being conservative and uncontroversial. I'd collect them by first identifying a robust set of very in-distribution tasks and contexts and try to exhaustively identify what manipulation would look like in that small domain, then aggressively train on passivity outside of that known distribution. The early pseudo-agent will almost certainly be mis-generalizing in a bunch of ways, but if it's set up cautiously we can suspect that it'll err on the side of caution, and that this can be gradually peeled back in a whitelist-style way as the experimentation phase proceeds and attempts to nail down true corrigibility.
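A rough sketch of what that whitelist-style setup could look like; all names below are hypothetical placeholders rather than the author's actual design.

```python
# Hedged sketch: ordinary task targets inside the known, whitelisted
# distribution; passivity targets everywhere else.

PASSIVE_TARGET = "defer_to_principal"   # stand-in for "do nothing / ask first"

def build_training_targets(examples, in_whitelisted_domain, task_target):
    """Ordinary targets inside the known distribution; passivity outside it."""
    return [
        (x, task_target(x) if in_whitelisted_domain(x) else PASSIVE_TARGET)
        for x in examples
    ]

# As the experimentation phase proceeds, `in_whitelisted_domain` is gradually
# broadened (the "peeling back" step), rather than dropping the passive
# default all at once.
```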
Seth Herd
I think you're right to point to this issue. It's a loose end. I'm not at all sure it's a dealbreaker for corrigibility. The core intuition/proposal is (I think) that a corrigible agent wants to do what the principal wants, at all times. If the principal currently wants to not have their future values/wants manipulated, then the corrigible agent wants to not do that. If they want to be informed but protected against outside manipulation, then the corrigible agent wants that. The principal will want to balance these factors, and the corrigible agent wants to figure out what balance their principal wants, and do that. I was going to say that my instruction-following variant of corrigibility might be better for working out that balance, but it actually seems pretty straightforward in Max's pure corrigibility version, now that I've written out the above.
Max Harms

I don't think "a corrigible agent wants to do what the principal wants, at all times" matches my proposal. The issue we're talking about here shows up in the math above, in that the agent needs to consider the principal's values in the future, but those values are themselves dependent on the agent's actions. If the principal gave a previous command to optimize for having a certain set of values in the future, sure, the corrigible agent can follow that command, but to proactively optimize for having a certain set of values doesn't seem necessarily corrigible…

What the heck is up with “corrigibility”? For most of my career, I had a sense that it was a grab-bag of properties that seemed nice in theory but hard to get in practice, perhaps due to being incompatible with agency.

Then, last year, I spent some time revisiting my perspective, and I concluded that I had been deeply confused by what corrigibility even was. I now think that corrigibility is a single, intuitive property, which people can learn to emulate without too much work and which is deeply compatible with agency. Furthermore, I expect that even with prosaic training methods, there’s some chance of winding up with an AI agent that’s inclined to become more corrigible over time, rather than less (as long as the people who built...

I'm so glad to see this published!

I think by "corrigibility" here you mean: an agent whose goal is to do what their principal wants. Their goal is basically a pointer to someone else's goal. 

This is a bit counter-intuitive because no human has this goal. And because, unlike the consequentialist, state-of-the-world goals we usually discuss, this goal can and will change over time.

Despite being counter-intuitive, this all seems logically consistent to me.

The key insight here is that corrigibility is consistent and seems workable IF it's the primary goal…

Some proverbs are actively suspicious, like “Don’t judge a book by its cover” or “No pain, no gain.” Others have an opposite proverb that’s similarly common and reasonable.

  • “Two heads are better than one” vs “Too many cooks spoil the broth”
  • “Honesty is the best policy” vs “What they don’t know won’t hurt them”
  • “Better safe than sorry” vs “Nothing ventured, nothing gained”

But the four below I use often:

  1. The best defense is a good offense. This one even has a Wikipedia page that references Washington, Mao, Machiavelli, Sun Tzu, and “sports such as football and basketball” (citing a dead link to “diamondbackonline.com”). I can’t think of an opposite adage—maybe “prevention is better than cure.”
  2. It's a dog-eat-dog world. This one isn’t true scientifically. “Two out of eleven dogs consistently refused to
...

“Don’t judge a book by its cover” or “No pain, no gain.” Others have an opposite proverb that’s similarly common and reasonable.

"If it looks like a duck and quacks like a duck" is likely an opposite for the first, and "The best things in life are free" is at least plausibly counter to the ladder.

See also: https://www.wired.com/2013/04/zizek-on-proverbs/#:~:text=%22The%20tautological%20emptiness%20of%20a,the%20inherent%20stupidity%20of%20proverbs.

Some proverbs are actually autoantonyms, or at least have come to mean the opposite of the original intent. For…

Shankar Sivarajan
This one's better in Latin: Homo homini lupus. Man is a wolf to man. This one's better in German. My favorite adage, from Schiller, is "… the gods themselves contend in vain."
Unnamed
Or "Defense wins championships."
noggin-scratcher
"The truth hurts", "ignorance is bliss", and "what you don't know can't hurt you" don't contradict: they all say you're better off not knowing some bit of information that would be unpleasant to know, or that a small "white lie" is allowable. The opposite there would be phrases I've mostly seen via LessWrong like "that which can be destroyed by the truth, should be", or "what is true is already true, owning up to it doesn't make it worse", or "if you tell one lie, the truth is thereafter your enemy", or the general ethos that knowing true information enables effective action.