I agree with this. If the key idea is, for example, that optimising imitators generalise better than imitations of optimisers, or, for a second example, that they pursue simpler goals, it seems to me that it'd be better to draw distinctions based on generalisation or goal simplicity directly, rather than on optimising imitators vs. imitations of optimisers.
A person new to AI safety evaluating their arguments is in roughly the same position as a Go novice trying to make sense of two Go grandmasters disagreeing about a board
I don't think the analogy is great, because Go grandmasters have actually played, lost and (critically) won a great many games of Go. This has two implications: first, I can easily check their claims of expertise. Second, they have had many chances to improve their gut-level understanding of how to play the game of Go well, and this kind of thing seems to me to be necessary to develop expe... (read more)
10. AI systems will ultimately be wildly superhuman, and there probably won’t be strong technological hurdles right around human level. Extrapolating the rate of existing AI progress suggests you don’t get too much time between weak AI systems and very strong AI systems, and AI contributions could very easily go from being a tiny minority of intellectual work to a large majority over a few years.
I think there will be substantial technical hurdles along the lines of getting in-principle highly capable AI systems to reliably do what we want them to, th... (read more)
I've written a few half-baked alignment takes for Less Wrong, and they seem to have mostly been ignored. I've since decided to either bake things fully, look for another venue, or not bother, and I'm honestly not particularly enthused about the fully bake option. I don't know if anything similar has had any impact on Sam's thinking.
I'm not sure exactly how important goal-optimisation is. I think AIs are overwhelmingly likely to fail to act as if they were universally optimising for simple goals compared to some counterfactual "perfect optimiser with equivalent capability", but this failure only matters if the dangerous behaviour is only executed by the perfect optimiser.
They're also very likely to act as if they are optimising for some simple goal X in circumstances Y under side conditions Z (Y and Z may not be simple) - in fact, they already do. This could easily be enough for da... (read more)
I think raw intelligence, while important, is not the primary factor that explains why humanity-as-a-species is much more powerful than chimpanzees-as-a-species. Notably, humans were once much less powerful, in our hunter-gatherer days, but over time, through the gradual process of accumulating technology, knowledge, and culture, humans now possess vast productive capacities that far outstrip our ancient powers.
Slightly relatedly, I think it's possible that "causal inference is hard". The idea is: once someone has worked something out, they can share it an... (read more)
As I said (a few times!) in the discussion about orthogonality, indifference about the measure of "agents" that have particular properties seems crazy to me. Having an example of "agents" that behave in a particular way is enormously different from having an unproven claim that such agents might be mathematically possible.
A Go AI that learns to play Go via reinforcement learning might not "have a utility function that only cares about winning Go". Using standard utility theory, you could observe its actions and try to rationalise them as if they were maximising some utility function, and the utility function you come up with probably wouldn't be "win every game of Go you start playing" (what you actually come up with will depend, presumably, on algorithmic and training regime details). The reason why the utility function is slippery is that the system is fundamentally an adaptation executor, not a utility maximiser.
FWIW self-supervised learning can be surprisingly capable of doing things that we previously only knew how to do with "agentic" designs. From that link: classification is usually done with an objective + an optimization procedure, but GPT-3 just does it.
My view is that if Yann continues to be interested in arguing about the issue then there's something to work with, even if he's skeptical, and the real worry is if he's stopped talking to anyone about it (I have no idea personally what his state of mind is right now).
Indeed. If the idea of a tradeoff wasn't widely considered plausible I'd have spent more time defending it. I'd say my contribution here is the "and we should act like it" part.
For a very hand-wavy sketch of how that might go, consider asking GPT-N to generate thousands of candidate high-level plans, then rating them by feasibility, then breaking each plan into steps and re-evaluating, etc.
FWIW, I'd call this "weakly agentic" in the sense that you're searching through some options, but the number of options you're looking through is fairly small.
It's plausible that this is enough to get good results and also avoid disasters, but it's actually not obvious to me. The basic reason: if the top 1000 plans are good enough to get superior performan... (read more)
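To make the loop being discussed concrete, here is a minimal sketch; `call_gpt_n` and the other names are hypothetical placeholders rather than any real API, and the prompting/parsing details are elided.

```python
# Rough sketch of the "generate, rate, expand, re-evaluate" loop discussed above.
# call_gpt_n is a hypothetical stand-in for querying a large language model.

def call_gpt_n(prompt: str) -> str:
    raise NotImplementedError("placeholder for a GPT-N query")

def generate_candidate_plans(goal: str, n: int = 1000) -> list[str]:
    # Sample n independent high-level plans for the goal.
    return [call_gpt_n(f"Propose a high-level plan for: {goal}") for _ in range(n)]

def feasibility(plan: str) -> float:
    # Ask the model for a 0-1 feasibility score (response parsing elided).
    return float(call_gpt_n(f"Rate the feasibility of this plan from 0 to 1:\n{plan}"))

def search_plans(goal: str, top_k: int = 10) -> list[str]:
    plans = sorted(generate_candidate_plans(goal), key=feasibility, reverse=True)
    expanded = [call_gpt_n(f"Break this plan into concrete steps:\n{plan}")
                for plan in plans[:top_k]]
    # Re-evaluate the expanded plans and return them best-first.
    return sorted(expanded, key=feasibility, reverse=True)
```

The "weakly agentic" point is that the only optimisation pressure here is a couple of rounds of top-k selection over model-proposed options; the worry above is whether even that much selection pressure, applied to good-enough plans, already gets you the problematic ones.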
Right, but the goal is to make AGI you can point at things, not to make AGI you can point at things using some particular technique.
(Tangentially, I also think the jury is still out on whether humans are bad fitness maximizers, and if we're ultimately particularly good at it - e.g. let's say, barring AGI disaster, we'd eventually colonise the galaxy - that probably means AGI alignment is harder, not easier)
Hm, regardless it doesn't really move the needle, so long as people are publishing all of their work. Developing overpowered pattern recognizers is similar to increasing our level of hardware overhang. People will end up using them as components of systems that aren't safe.
I strongly disagree. Gain of function research happens, but it's rare because people know it's not safe. To put it mildly, I think reducing the number of dangerous experiments substantially improves the odds of no disaster happening over any given time frame.
Do whatever you want, obviously, but I just want to clarify that I did not suggest you avoid personally criticising people (only that you avoid vague/hard to interpret criticism) or saying you think doom is overwhelmingly likely. Some other comments give me a stronger impression than yours that I was asking you in a general sense to be nice, but I'm saying it to you because I figure it mostly matters that you're clear on this.
3. The system can think about atoms/physics, and it knows that our world exists, but it still only terminally cares about digital things in the simulated environment.
Case 3 is not safe, because controlling the physical world is a useful way to control the simulation you're in. (E.g., killing all agents in base reality ensures that they'll never shut down your simulation.)
Not necessarily. Train something multimodally on digital games of Go and on, say, predicting the effects of modifications to its own code on its success at Go. It could be a) good at... (read more)
Humans can, to some extent, be pointed to complicated external things. This suggests that using natural selection on biology can get you mesa-optimizers that can be pointed to particular externally specifiable complicated things. Doesn't prove it (or, doesn't prove you can do it again), but you only asked for a suggestion.
What do you think of a claim like "most of the intelligence comes from the steps where you do most of the optimization"? A corollary of this is that we particularly want to make sure optimization intensive steps of AI creation are safe WRT not producing intelligent programs devoted to killing us.
Example: most of the "intelligence" of language models comes from the supervised learning step. However, it's in-principle plausible that we could design e.g. some really capable general purpose reinforcement learner where the intelligence comes from the reinforcem... (read more)
I'm sorry to hear that your health is poor and you feel that this is all on you. Maybe you're right about the likelihood of doom, and even if I knew you were, I'd be sorry that it troubles you this way.
I think you've done an amazing job of building the AI safety field and now, even when the field has a degree of momentum of its own, it does seem to be less focused on doom than it should be, and I think you continuing to push people to focus on doom is valuable.
I don't think its easy to get people to take weird ideas seriously. I've had many experiences whe... (read more)
I vehemently disagree here, based on my personal history, generalizable or not. I will illustrate with the three turning points of my recent life.
First step: I stumbled upon HPMOR, and Eliezer's way of looking straight into the irrationality of all our common ways of interacting and thinking was deeply shocking. It made me feel like he was in a sense angrily pointing at me, who worked more like one of the NPCs than Harry. I heard him telling me you're dumb and all your ideals of making intelligent decisions, being the gifted kid and being smarter th... (read more)
This kind of post scares away the person who will be the key person in the AI safety field if we define "key person" as the genius main driver behind solving it, not the loudest person. Which is rather unfortunate, because that person is likely to read this post at some point.
I don't believe this post has any "dignity", whatever weird obscure definition dignity has been given now. It's more like flailing around in death throes while pointing fingers and lauding yourself than it is a solemn battle stance against an oncoming impossible enemy.
For contex... (read more)
I disagree strongly. To me it seems that AI safety has long punched below its weight because its proponents are unwilling to be confrontational, and are too reluctant to put moderate social pressure on people doing the activities which AI safety proponents hold to be very extremely bad. It is not a coincidence that among AI safety proponents, Eliezer is both unusually confrontational and unusually successful.
This isn't specific to AI safety. A lot of people in this community generally believe that arguments which make people feel bad are counterproductive ... (read more)
It seems worth doing a little user research on this to see how it actually affects people. If it is a net positive, then great. If it is a net negative, the question becomes how big of a net negative it is and whether it is worth the extra effort to frame things more nicely.
There's a point here about how fucked things are that I do not know how to convey without saying those things, definitely not briefly or easily. I've spent, oh, a fair number of years, being politer than this, and less personal than this, and the end result is that people nod along and go on living their lives.
I expect this won't work either, but at some point you start trying different things instead of the things that have already failed. It's more dignified if you fail in different ways instead of the same way.
We could just replace all the labels with random strings, and the model would have the same content
I think this is usually incorrect. The variables come with labels because the data comes with labels, and this is true even in deep learning. Stripping the labels changes the model content: with labels, it's a model of a known data generating process, and without the labels it's a model of an unknown data generating process.
Even with labels, I will grant that many joint probability distributions are extremely hard to understand.
Say there's a 5% chance they're equally or more electable, and a 2% chance they're substantially more electable.
That's what you're after, right?
If 2% of the population are more electable, "all else equal", than the sitting president (and this is a pretty wild guess), then I think you'd need a pretty good selection procedure to produce candidates who are, on average, better than the ones the current procedure produces.
Are the unranked Chinese exascale systems relevant for AI research, or is it more that, if they've built 2-3 such systems semi-stealthily, they might also be building AI-focused compute capacity?
For what it's worth, I agree that there's clear evidence of ill-will towards the Chinese government (and, you know, I don't like them either). It's reasonable to suspect that this might colour a person's perception of the state of things the Chinese government is involved with. It is also superficial, so it's not like I can draw any independent conclusions from it to allay suspicions of bias. I'm also not giving it a lot of weight.
Not a very principled answer, but: 98%
There is already a substantial preference for incumbents (about 65-35 I think), and I think this would be much stronger if the challenger was completely unaffiliated with politics (I want to say something like 90-10 if challenger and sitting president were equally electable otherwise, maybe 75-25 if the challenger is actually substantially better than the president, 99-1 if they're just average).
Say there's a 5% chance they're equally or more electable, and a 2% chance they're substantially more electable. Then there's 0.5% on them b... (read more)
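The calculation is cut off above, but for what it's worth, one way the 0.5% figure could fall out of these guesses (this is only a reconstruction, not necessarily the intended one) is:

$$\underbrace{0.02}_{\text{substantially more electable}} \times \underbrace{0.25}_{\text{beats the incumbency preference}} = 0.005 = 0.5\%.$$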
I think durability is a really important feature of journal articles. I often read 70 year old articles, and rarely read 70 year old anything else.
I'm not sure what's responsible for the durability, mind you. Long-term accessibility is necessary, obviously, but not sufficient. Academic citation culture is also part of it, I think.
Zenodo is a pretty accepted solution to data-durability in academia (https://zenodo.org). There's no reason you couldn't upload papers there (and indeed they host papers/conference proceedings/etc.). Uploads get assigned a DOI and get versioning, get indexed for citation purposes, etc.
If I were starting a journal it would probably look like "Zenodo for hosting, some AirTable or GitHub workflow for (quick) refereeing/editorial workflow."
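As a sketch of the hosting half of that, uploading a paper programmatically might look roughly like the following. The endpoint paths and metadata fields are from my memory of Zenodo's REST API docs (https://developers.zenodo.org), so treat them as assumptions to verify rather than a definitive recipe.

```python
# Rough sketch: create a Zenodo deposit, attach a PDF, add metadata, publish.
# Endpoints/fields recalled from Zenodo's REST API docs; double-check before use.
import requests

ZENODO = "https://zenodo.org/api"
TOKEN = "YOUR_ZENODO_TOKEN"  # personal access token with deposit scopes

# 1. Create an empty deposition.
r = requests.post(f"{ZENODO}/deposit/depositions",
                  params={"access_token": TOKEN}, json={})
r.raise_for_status()
dep_id = r.json()["id"]

# 2. Attach the paper.
with open("paper.pdf", "rb") as fh:
    requests.post(f"{ZENODO}/deposit/depositions/{dep_id}/files",
                  params={"access_token": TOKEN},
                  data={"name": "paper.pdf"}, files={"file": fh}).raise_for_status()

# 3. Add minimal metadata, then publish (publishing mints the DOI).
metadata = {"metadata": {"title": "Example paper",
                         "upload_type": "publication",
                         "publication_type": "article",
                         "description": "Accepted version.",
                         "creators": [{"name": "Doe, Jane"}]}}
requests.put(f"{ZENODO}/deposit/depositions/{dep_id}",
             params={"access_token": TOKEN}, json=metadata).raise_for_status()
requests.post(f"{ZENODO}/deposit/depositions/{dep_id}/actions/publish",
              params={"access_token": TOKEN}).raise_for_status()
```

The refereeing side would then just be issues/PRs (or AirTable records) pointing at deposit IDs, with versioned re-uploads for revisions.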
Does San Francisco look like more of an outlier if you plot unsheltered homeless vs house price?
I am a forecaster on that question: the main doubt I had was if/when someone would try to do wordy things + game playing on a "single system". It seemed plausible to me that this particular combination of capabilities might never become an exciting area of research, so the date at which an AI first does these things could be substantially later than the date at which this combination of tasks would be achievable with focused effort. Gato was a substantial update because it does exactly these tasks, so I no longer see much possibility that the benchmark is achieved only afte... (read more)
We both have a similar intuition about the kinds of optimizers we're interested in. You say they optimize things that are "far away", I say they affect "big pieces of the environment". One difference is that I think of big as relative to the size of the agent, but something can be "far away" even if the agent is itself quite large, and it seems that agent size doesn't necessarily matter to your scheme because the information lost over a given distance doesn't depend on whether there's a big agent or a small one trying to exert influence over this distance.... (read more)
The Legg-Hutter definition of intelligence is counterfactual ("if it had X goal, it would do a good job of achieving it"). It seems to me that the counterfactual definition isn't necessary to capture the idea above. The LH definition also needs a measure over environments (including reward functions), and it's not obvious how closely their proposed measure corresponds to things we're interested in, while influentialness in the world we live in seems to correspond very closely.
The mesa-optimizer paper also stresses (not sure if correctly) that they're not t... (read more)
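For reference, the Legg-Hutter measure I have in mind is (roughly) a complexity-weighted sum of counterfactual performance over a class of computable environments:

$$\Upsilon(\pi) = \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu},$$

where $E$ is the class of environments, $K(\mu)$ is the Kolmogorov complexity of $\mu$, and $V^{\pi}_{\mu}$ is the expected total reward policy $\pi$ obtains in $\mu$. Both ingredients (the counterfactual "would do well if dropped into $\mu$" and the $2^{-K(\mu)}$ measure over environments) are exactly the parts whose necessity I'm questioning above.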
I got "masks are like assholes" from the sentence, even before I read the Ben's analysis.
I think it's very likely that the people who made the ads are deliberately alluding to "opinions are like assholes". And very unlikely that their intention is to say "masks are like assholes". I think what they're trying to do is to deliver a little surprise, a little punchline. You see "X are like opinions", some bit of your brain is expecting a rude criticism, and oh! it turns out they're saying something positive about X and something positive about having opinions. So (they hope) the reader gets a pleasant surprise and is a bit more willing to pay atte... (read more)
Taking a cue from the wiki article "you will be hanged tomorrow, and you will not be able to derive from this statement whether or not you will be hanged tomorrow"
Seems kind of weird because it is self-contradictory, and yet true.
This gets good log loss because it's trained in the regime where the human understands what's going on, correct?
Regarding
"Strategy: train a reporter which isn’t useful for figuring out what the human will believe/Counterexample: deliberately obfuscated human simulator".
If you put the human-interpreter-blinding before the predictor instead of between the predictor and the reporter, then whether or not the blinding produces an obfuscated human simulator, we know the predictor isn't making use of human simulation.
An obfuscated human simulator would still make for a rather bad predictor.
I think this proposal might perform slightly better than a setup where we expand the... (read more)
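Schematically, the ordering difference looks like this, representing "blinding" very loosely as a function applied at a particular point in the pipeline (whatever its actual implementation), with every name below a hypothetical placeholder rather than anything from the ELK report:

```python
# Schematic only: where does the human-interpreter-blinding sit?

def blind(x):
    # Strip whatever information would let downstream components simulate the
    # human interpreter (how to do this is the hard part, and is elided here).
    return x  # placeholder

def predictor(obs):
    # Map observations to a latent state / predicted future observations.
    return obs  # placeholder

def reporter(latent):
    # Answer questions about the predictor's latent state.
    return latent  # placeholder

# Blinding between predictor and reporter: the predictor itself may still
# contain an (obfuscated) human simulator.
def report_blind_between(obs):
    return reporter(blind(predictor(obs)))

# Blinding before the predictor, as suggested above: whatever the blinding does,
# the predictor can't be making use of human simulation of the removed information.
def report_blind_before(obs):
    return reporter(predictor(blind(obs)))
```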
My 3yo: "a bowl of museli because it's heavy"
Also "an apple, plum and peach and a bit of wax"
Peers get the same results from the same actions. It's not exactly clear what "same action" or "same result" means -- is "one boxing on the 100th run" the same as "one boxing on the 101st run" or "box 100 with $1m in it" the same as "box 101 with $1m in it"? I think we should think of peers as being defined with respect to a particular choice of variables representing actions and results.
I think the definitions of these things aren't immediately obvious, but it seems like we might be able to figure them out sometimes. Given a decision problem, it seems to ... (read more)
how to respond to the temptation to shift from utilising actual peers to potential peers which then seems to reraise the specter of circularity.
I think you might be able to say something like "actual peers is why the rule was learned, virtual peers is because the rule was learned".
(Just to be clear: I'm far from convinced that this is an actually good theory of counterfactuals, it's just that it also doesn't seem to be obviously terrible)
...Let's suppose I face Newcomb's problem in a yellow shirt and you face it in a red shirt. They ought to be comparable bec... (read more)
Γ=Σ^R, it's a function from programs to what result they output. It can be thought of as a computational universe, for it specifies what all the functions do.
Should this say "elements are function... They can be thought of as...?"
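To spell the notational point out, the reading I'd expect is

$$\Gamma = \Sigma^R = \{\gamma : R \to \Sigma\},$$

i.e. $\Gamma$ is the set of functions from programs to outputs; an individual element $\gamma \in \Gamma$ assigns to each program $r \in R$ the result $\gamma(r)$ it outputs, and it's these elements, not $\Gamma$ itself, that can each be thought of as a computational universe.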
Can you make a similar theory/special case with probability theory, or do you really need infra-bayesianism? If the second, is there a simple explanation of where probability theory fails?
My confusion was: even "when the agent is acting", I think it would still be appropriate to describe its beliefs according to EDT. However, I was confused by thinking about "...and then offering a bet". As far as I can tell, this is just an unnecessary bit of storytelling set around a two step decision problem, and a CDT agent has to evaluate the prospects of each decision according to CDT.
This is an old post, but my idea of CDT is that it's a rule for making decisions, not for setting beliefs. Thus the agent never believes in the outcome given by CDT, just that it should choose according to the payoffs it calculates. This is a seemingly weird way to do things, but apart from that is there a reason I should think about CDT as a prescription for forming beliefs while I am acting?
Pearl is distinguishing "intrinsically nondeterministic" from "ordinary" Bayesian networks, and he is saying that we shouldn't mix up the two (though I think it would be easier to avoid this with a clearer explanation of the difference).
Three questions:
No
No, and so we should be careful not to mix them up with "intrinsically nondeterministic" Bayesian networks
I'm pretty sure that picture is from the Book of Why
Determinism is not a defining feature of counterfactuals, you can make a stochastic theory of counterfactuals that is a strict generalisation of SEM-style deterministic counterfactuals. See Pearl, Causality (2009), p. 220 "counterfactuals with intrinsic nondeterminism" for the basic idea. It's a brief discussion and doesn't really develop the theory but, trust me, such a theory is possible. The basic idea is contained in "the mechanism equations lose their deterministic character and hence should be made stochastic."
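To gesture at what "making the mechanism equations stochastic" amounts to: in an ordinary structural equation model each variable is a deterministic function of its parents and an exogenous disturbance, and the generalisation replaces that assignment with a conditional distribution, roughly

$$x_i = f_i(\mathrm{pa}_i, u_i) \quad\leadsto\quad x_i \sim P_i(\,\cdot \mid \mathrm{pa}_i, u_i),$$

so that counterfactual queries generally return distributions over outcomes rather than unique values.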
I've previously argued that the concept of counterfactuals can only be understood from within the counterfactual perspective.
I think this goes too far. We can give an account of counterfactuals from assumptions of symmetry. This account is unsatisfactory in many ways - for one thing, it implies that counterfactuals exist much more rarely than we want them to. Nonetheless, it seems to account for some properties of a counterfactual and is able to stand up without counterfactual assumptions to support it. I think it also provides an interesting lens for exam... (read more)
So, granting the assumption of not corrupting the humans (which is maybe what you are denying), doesn't this imply that we can go on adding sensors after the fact until, at some point, the difference between fooling them all and being honest becomes unproblematic?
Do you run into a distinction between benign and malign tampering at any point? For example, if humans can never tell the difference between the tampered and non-tampered result, and their own sanity has not been compromised, it is not obvious to me that the tampered result is worse than the non-tampered result.
It might be easier to avoid compromising human sanity + use hold-out sensors than to solve ELK in general (though maybe not? I haven't thought about it much).
I'm a bit curious about what job "dimension" is doing here. Given that I can map an arbitrary vector in ℝ^n to some point in ℝ via a bijective measurable map (https://en.wikipedia.org/wiki/Standard_Borel_space#Kuratowski's_theorem), it would seem that the KPD theorem is false. Is there some other notion of "sufficient statistic complexity" hiding behind the idea of dimensionality, or am I missing something?
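For concreteness, the sort of map I have in mind (ignoring the usual measure-zero fuss about non-unique expansions) is digit interleaving:

$$(0.a_1a_2a_3\ldots,\ 0.b_1b_2b_3\ldots)\ \mapsto\ 0.a_1b_1a_2b_2a_3b_3\ldots,$$

which gives a measurable bijection between (essentially all of) the unit square and the unit interval and extends to $\mathbb{R}^n \leftrightarrow \mathbb{R}$; composing an $n$-dimensional statistic with such a map gives a one-dimensional statistic carrying the same information, which is why bare "dimension" seems like the wrong notion of complexity absent some regularity condition.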
I find lying easy and natural unless I'm not sure if I can get away with it, and I think I'm more honest than the median person!
(Not a lie)
(Honestly)