I commit to paying up if I agree there's a >0.4 probability something non-mundane happened in a UFO/UAP case, or if there's overwhelming consensus to that effect and my probability is >0.1.
Though I guess I should warn you in advance that I expect this would require either big, obvious evidence or repeatable evidence. An example of big would be an alien ship hovering over the fifty-yard line during the Super Bowl; repeatable would be some way of doing science to the aliens. Government alien-existence announcements lacking any such evidence might lead to me paying on the second clause rather than the first.
I'll message you details.
I think if your P(weird) is 3%, it might be hard for you to make money in expectation even from someone whose P(weird) is 0.00001%. You should definitely worry about being stiffed to some extent, and both sides should expect small probabilities of other sorts of costly drama. This limits which bets people should actually agree on.
I'm not really imagining matching. I'm imagining the scope of points that I'm looking at sweeping outwards, and having different sides "win" by having more points in-scope as a function of time.
But I think if you prompt someone to imagine matching, you can easily pump intuition for sets being the same size if they alternate which is more dense infinitely many times.
I think a fairly typical "intuitive" notion is something like:
Pick a space that contains the sets you want to compare (let's call them A and B). Then consider balls of radius growing from the origin. There are four possibilities:
I'm just gonna give you an answer off the top of my head first and google later. Seems like the spirit of the thing :P We'll see how I do! I'm a total non-expert, but I did read an IPCC report years and years ago.
For recent times (the last 10,000 years or so) you can use stuff like tree rings or... I think the amount of algae in sediment cores? These have a time resolution of about one point per year, and are a fairly good measure of temperature (plants grow better when it's warm, within limits), but with extra variation added (volcanic eruptions, etc.). Let's guess...
There's been reasonable amounts of modeling work done in the context of managing money. E.g. https://forum.effectivealtruism.org/posts/Ne8ZS6iJJp7EpzztP/the-optimal-timing-of-spending-on-agi-safety-work-why-we
This is probably the sort of thing Tyler would want but wouldn't know how to find.
what inputs and outputs would be sufficient to reward modeling of the real world?
This is an interesting question, but I think it's not actually relevant. Like, it's really interesting to think about a thermostat - something whose only inputs are a thermometer and a clock, and whose only output is a switch hooked to a heater. Given arbitrarily large computing power and arbitrary amounts of on-distribution training data, will RL ever learn all about the outside world just from temperature patterns? Will it ever learn to deliberately affect the humans around it by t...
Since I'm fine with saying things that are wildly inefficient, almost any input/output that's sufficient to reward modeling of the real world (rather than e.g. just playing the abstract game of chess) is sufficient. A present-day example might be self-driving car planning algorithms (though I don't think any major companies actually use end to end NN planning).
It's related in that you're all talking about maintaining some parts of the status quo, but I think the instrumental technologies (human-directed services vs. agential AIs that directly care about maintaining status-quo boundaries) are pretty different, as are all the arguments related to those technologies.
So, the maximally impractical but also maximally theoretically rigorous answer here is AIXI-tl.
An almost as impractical answer would be Markov chain Monte Carlo search for well-performing huge neural nets on some objective.
I say MCMC search because I'm confident that there's some big neural nets that are good at navigating the real world, but any specific efficient training method we know of right now could fail to scale up reliably. Instability being the main problem, rather than getting stuck in local optima.
Dumb but thorough hyperparameter search and RL...
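To gesture at what that MCMC search could look like in miniature - a toy Metropolis loop over a parameter vector, with a simple quadratic objective standing in for "loss of a huge neural net." All constants and names here are made-up illustrations, not a real training method:

```python
import math, random

def mcmc_search(loss, dim, steps=20_000, step_size=0.3, temp=0.05, seed=0):
    """Toy Metropolis search: propose a random perturbation, always accept
    improvements, and sometimes accept regressions (more often when temp
    is high). Tracks the best loss seen along the chain."""
    rng = random.Random(seed)
    theta = [rng.gauss(0, 1) for _ in range(dim)]
    current = loss(theta)
    best = current
    for _ in range(steps):
        proposal = [t + rng.gauss(0, step_size) for t in theta]
        new = loss(proposal)
        # Metropolis acceptance rule; (current - new) <= 0 for worse moves.
        if new < current or rng.random() < math.exp((current - new) / temp):
            theta, current = proposal, new
            best = min(best, current)
    return theta, best

# Illustrative stand-in objective: squared distance to a fixed target vector.
target = [1.0, -2.0, 0.5]
loss = lambda th: sum((a - b) ** 2 for a, b in zip(th, target))
theta, best = mcmc_search(loss, dim=3)
```

The point of the toy is just that a dumb accept/reject walk will eventually find good parameters given enough compute - which is the "wildly inefficient but confident it works" property described above.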
Thanks, this was interesting.
I couldn't really follow along with my own probabilities because things started wild from the get-go. You say we need to "invent algorithms for transformative AI," when in fact we already have algorithms that are in-principle general - they're just orders of magnitude too inefficient - and we're making gradual algorithmic progress all the time. Checking the pdf, I remain confused about your picture of the world here. Do you think I'm drastically overstating the generality of current ML and the gradualness of algorithmic improveme...
Lots of claims have been scrutinized fairly intensely by governments. Was it the Chilean military that spent a couple of years investigating a UFO sighting and eventually went public saying it was unexplainable? Sadly, this effort provides little increase in reliability. The investigators are often doing this for the first time and lack key skills for analyzing the data. This is exacerbated by the fact that governments are large enough to allow for selection effects, where the people spending effort investigating UFOs are self-selected for thinking they're really important, i.e. aliens.
Hm. I'm sure plenty of people could do a fine job, myself included. But if every such person jumped in, it would be a mess. I assume that if Stuart Russell was the right person for the job, the job would already be over. Plausibly ditto Eliezer.
Rob Miles might be the obvious person for explaining things well. I totally endorse him doing attention-getting things I wouldn't endorse for people like me.
Also probably fine would be people optimized a little more for AI work than explaining things. Paul Christiano may be the Schelling-point tip of the iceberg of ...
Lacking access to the other's hardware, I think you'd need something that's easy to compute for an honest AI but hard to compute for a deceptive AI. But a deceptive AI could always just simulate an honest AI - so how do you distinguish the simulation?
The only way I can think of is resource constraints. Deception adds a slight overhead to calculating quantities that depends in detail on the state of the AI. If you know what computational capabilities the other AI has very precisely, and you can time your communications with it, then maybe it can compute something for you that you can later verify implies honesty.
There are plenty of good posts that contradict a "strict" orthogonality thesis by showing correlation between capabilities and various values-related properties (scaling laws / inverse scaling laws).
What really gets you downvoted is the claim that super-intelligent AI cannot want things that are bad for humanity, or even agitation for giving that idea serious weight.
What also gets you downvoted is the in-between claim that all the scaling laws tend towards superhuman morality and everything will work out fine, no need to be worried or spend lots of hours working.
How do you make a successful piece in the latter categories? Simple - just be right, for communicable reasons. Simple, but maybe not possible.
If you start with an AI that makes decisions of middling quality, how well can you get it to make high-quality decisions by ablating neurons associated with bad decisions? This is the central thing I expect to have diminishing returns (though it's related to some other uses of interpretability that might also have diminishing returns).
If you take a predictive model of chess games trained on human play, it's probably not too hard to get it to play near the 90th percentile of the dataset. But it's not going to play as well as stockfish almost no matter what ...
Yes, preserving the existence of multiple good options that humans can choose between using their normal reasoning process sounds great. Which is why an AI that learns human values should learn that humans want the universe to be arranged in such a way.
I'm concerned that you seem to be saying that problems of agency are totally different from learning human values, and have to be solved in isolation. The opposite is true - preferring agency is a paradigmatic human value, and solving problems of agency should only be a small part of a more general solution.
I am shocked that higher quality training data based on more effortful human feedback produced a better result.
Consider the computational difficulty of intrinsic vs. extrinsic alignment for a chess-playing AI.
Suppose you want the AI to walk its king to the center of the board before winning. With intrinsic alignment, this is a little tricky to encode but not too hard. With extrinsic alignment, this requires vastly outsmarting the chess-playing AI so that you can make it dance to your tune - maybe humans could do it to a 500-elo chess bot, but past 800 elo I think I'd only be able to solve the problem by building a second chess engine that was intrinsically aligned to extrinsically align the first one.
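Here's a toy sketch of the intrinsic-alignment version - adding a king-centrality term directly to the engine's evaluation function. The square indexing (0-63, a1 = 0), the weight constant, and the function names are all illustrative assumptions, not any real engine's API:

```python
# Toy sketch: intrinsically encoding "walk your king toward the center"
# as an extra term in a chess engine's evaluation function.

def king_centrality(square: int) -> float:
    """Return a bonus in [0, 1], highest on the four center squares
    (d4/e4/d5/e5) and zero in the corners. Squares are indexed 0-63."""
    file, rank = square % 8, square // 8
    # Chebyshev-style distance from the 2x2 central block.
    dist = max(abs(file - 3.5), abs(rank - 3.5)) - 0.5
    return 1.0 - dist / 3.0

def aligned_eval(base_eval: float, own_king_square: int, weight: float = 0.5) -> float:
    """Base engine evaluation plus the intrinsic king-centrality bonus."""
    return base_eval + weight * king_centrality(own_king_square)
```

The extrinsic version has no analogous few-line patch: you'd have to shape the stronger engine's behavior purely through the positions you steer it into, which is the "vastly outsmart it" problem described above.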
A nice exposition.
For myself I'd prefer the same material much more condensed and to-the-point, but I recognize that there are publication venues that prefer more flowing text.
E.g. compare
...We turn next to the laggard. Compared to the fixed roles model, the laggard’s decision problem in the variable roles model is more complex primarily in that it must now consider the expected utility of attacking as opposed to defending or pursuing other goals. When it comes to the expected utility of defending or pursuing other goals, we can simply copy the formulas from
There is a causal relationship between time on LW and frequency of paragraph breaks :P
Anyhow, I broadly agree with this comment, but I'd say it's also an illustration of why interpretability has diminishing returns and we really need to also be doing "positive alignment." If you just define some bad behaviors and ablate neurons associated with those bad behaviors (or do other things like filter the AI's output), this can make your AI safer but with ~exponentially diminishing returns on the selection pressure you apply.
What we'd also like to be doing is def...
I would claim that an army of robots based on ASIs will generally lose to an army of robots based on true AGI.
The truly optimal war-winning AI would not need to question its own goal to win the war, presumably.
Would you agree that an AI that is maximizing paperclips makes an intellectual mistake?
No. I think that's anthropomorphism - just because a certain framework of moral reasoning is basically universal among humans, doesn't mean it's universal among all systems that can skillfully navigate the real world. Frameworks of moral reasoning are on the "ought" side of the is-ought divide.
The idea has a broader consequence for AI safety. While a paperclip maximizer might be designed as part of deliberate paperclip-maximizer research, it probably will not arise spontaneously from intelligence research in general. Even making one would probably be considered an immoral request by an AGI.
This doesn't follow.
You start the post by saying that the most successful paperclip maximizer (or indeed the most successful AI at any monomaniacal goal) wouldn't doubt its own goals, and in fact doesn't even need the capacity to doubt its own goals. And since you care ...
Do you think that human theorists are near the limit of what kind of approximations we should use to calculate the band structure of diamond (and therefore a superintelligent AI couldn't outsmart human theorists by doing their job better)? Like if you left physics to stew for a century and came back, we'd still be using the GW approximation?
This seems unlikely to me, but I don't really know much about DFT (I was an experimentalist). Maybe there are so few dials to turn that picking the best approximation for diamond is an easy game. Intuitively I'd expect ...
More or less.
Is this good news? Yes.
Is this strong evidence that we don't need to work hard on AI safety? No.
Are elements of the simple generative-model-finetuning paradigm going to be reused to ensure safety of superintelligent AI (conditional on things going well)? Maybe, maybe not. I'd say that the probability is around 30%. That's pretty likely in the grand scheme of things! But it's even more likely that we'll use new approaches entirely and the safety guardrails on GPT-4 will be about as technologically relevant to superintelligent AI as the safety guardrails on industrial robots.
I feel like it's 4 ~ 1 > 2 > 3. The example of CNNs seems like this, where the artificial neural networks and actual brains face similar constraints and wind up with superficially similar solutions, but when you look at all the tricks that CNNs use (especially weight-sharing, but also architecture choices, choice of optimizer, etc.) they're not actually very biology-like, and were developed based on abstract considerations more than biological ones.
Thanks! It seems like most of your exposure has been through Eliezer? Certainly impressions like "why does everyone think the chance of doom is >90%?" only make sense in that light. Have you seen presentations of AI risk arguments from other people like Rob Miles or Stuart Russell or Holden Karnofsky, and if so do you have different impressions?
If it wants to be shut down, and humans might start it up again later, the optimal strategy seems to be creating a successor agent to achieve its goals and kill all the humans, and then shutting itself down.
It might be worth going into the problem of fully updated deference. I don't think it's necessarily always a problem, but also it does stop utility aggregation and uncertainty from being a panacea, and the associated issues are probably worth a bit of discussion. And as you likely know, there isn't a great journal citation for this, so you could really cash in when people want to talk about it in a few years :P
Huh, interesting. I skimmed the paper and I'm not convinced this specific architecture is promising for tasks that move a lot of information or have hierarchical structure - the lack of a value (only keys and queries) seems like a big downgrade. The graph classification results are pretty good though, and I'd agree with the authors that it's probably because they've improved information routing without having much worse inductive biases than GCNNs. Does this match your impression?
I'm also kind of a downer about interpretability. There's different kinds of ...
The classic textbook on it if you want to read more is Li and Vitanyi's Introduction to Kolmogorov Complexity.
Yes, this is fine to do, and prevents single-shot problems if you have a particular picture of the distribution over outcomes where most disastrous risk comes from edge cases that get 99.99%ile score but are actually bad, and all we need is actions that are 99th percentile.
This is fine if you want your AI to stack blocks on top of other blocks or something.
But unfortunately when you want to use a quantilizer to do something outside the normal human distribution, like cure cancer or supervise the training of a superhuman AI, you're no longer just shooting f...
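For concreteness, the quantilizer being discussed, in toy form: sample actions from a base (e.g. human-imitative) distribution and act only within the top q fraction by estimated score. Every number and name here is an illustrative assumption:

```python
import random

def quantilize(base_sampler, score, q=0.01, n=10_000, rng=random):
    """Toy quantilizer: draw n actions from the base distribution, rank
    them by score, and return one chosen uniformly at random from the
    top q fraction (rather than the single argmax)."""
    actions = [base_sampler(rng) for _ in range(n)]
    actions.sort(key=score, reverse=True)
    top = actions[:max(1, int(q * n))]
    return rng.choice(top)

# Toy usage: the base distribution is "human-ish" actions as numbers near 0,
# and the score just rewards larger numbers. A quantilizer stays inside the
# support of the base distribution; a pure maximizer would chase the most
# extreme tail, which is where the hypothetical catastrophic edge cases live.
rng = random.Random(0)
action = quantilize(lambda r: r.gauss(0, 1), score=lambda a: a, q=0.01, rng=rng)
```

The failure mode in the comment above is visible here too: if the task you want done sits outside the base distribution's support entirely, no choice of q reaches it.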
Do you think there might be alternative or more effective ways to model human consciousness, or is the approach of using "qualia" the most promising one we currently have?
IMO the most useful is the description of the cognitive algorithms / cognitive capabilities involved in human-like consciousness. Like remembering events from long-term memory when appropriate, using working memory to do cognitive tasks, responsiveness to various emotions, emotional self-regulation, planning using various abstractions, use of various shallow decision-making heuristics, in...
if humans actually had utility functions
Yeah, humans lack a unique utility function. I know what you mean informally, just don't get bogged down mathematizing something we don't have.
So, once it has made good progress on its value learning, its utility function ought to make a lot of sense to us.
Do you think this is a desideratum, or a guarantee?
I'll say the key point plainly: suppose some policy is "the good policy." Which utility function causes an agent to follow the good policy will be different depending on how the agent makes decisions. For a ...
I can't deny that it does in fact manipulate the gradient - but it seems like overkill. After all, why continue the gradient descent process at all once you've hacked the computer you're running on? Just have the host computer write logs as if gradient descent was happening but not actually do it.
Take this with a big grain of salt, but I'll just tell you my impression.
Theoretically, I think it's useful in that it tells us that a lot is possible even in non-realizable settings.
As a guide to practice, I think there's plenty of room to do better. Ideally I'd want a representation that leverages composition of hypotheses with each other, and that natively does its reasoning in a non-extremizing way that makes more sense to humans (even if it's mathematically equivalent to argmax/argmin on some function).
Presently I think it's a mistake to identify a goo...
I feel like the "obvious" thing to do is to ask how rare (in bits) the post-optimization EV is according to the pre-optimization distribution. Like, suppose that pre-optimization my probability distribution over the utilities I'd get is normally distributed, and after optimizing my EV is +1 standard deviation. The probability of doing that well or better is about 0.159, which is about 2.65 bits.
Seems indifferent to affine transformation of the utility function, adding irrelevant states, splitting/merging states, etc. What are some bad things about this method?
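For reference, the +1 SD figure can be checked directly with the stdlib, assuming the Gaussian setup above:

```python
import math

def optimization_bits(z: float) -> float:
    """Bits of optimization: -log2 of the probability of doing at least
    this well by chance, assuming outcomes are Gaussian. z is the
    post-optimization EV measured in standard deviations above the mean."""
    p_better = 0.5 * math.erfc(z / math.sqrt(2))  # P(X >= mean + z*sd)
    return -math.log2(p_better)

bits_at_one_sigma = optimization_bits(1.0)  # about 2.65 bits
```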
I just skimmed the video, but it seems like there's more salesmanship than there is explanation of what the network is doing, how its capabilities would compare to using e.g. a small RNN, and how far it actually generalizes.
Remember that self-driving cars first appeared in the 1980s - lane-keeping is actually a very simple task if you only need 99% reliability. I don't think their demos are super informative about the utility of this architecture to complicated tasks.
So I'd be interested if you looked into it more and think that my first impression is unfair.
Hi, welcome!
Consciousness isn't actually super relevant to alignment. Alignment is about figuring out how to get world-affecting AI to systematically do good things rather than bad things. This is possible both for conscious and unconscious AI, and consciousness seems to provide neither a benefit nor an impediment to doing good/bad things.
But it's still fun to talk about sometimes.
For this approach, the crucial step 1 is to start with observations of a big blob of atoms called a "human," and model the human in a way that uses some pieces called "qualia" ...
I'm confused about how to do that because I tend to think of self-modification as happening when the agent is limited and can't foresee all the consequences of a policy, especially policies that involve making itself smarter. But I suspect that even if you figure out a non-confusing way to talk about risk aversion for limited agents that doesn't look like actions on some level, you'll get weird behavior under self-modification, like an update rule that privileges the probability distribution you had at the time you decided to self-modify.
Why would hardcoded model-based RL probably self-modify or build successors this way, though?
Because picking a successor is like picking a policy, and risk aversion over policies can give different results than risk aversion over actions.
Like, suppose you go to a casino with $100, and there are two buttons you can push - one button does nothing, and with the other you have a 60% chance to win a dollar and a 40% chance to lose a dollar. If you're risk averse you might choose to only ever press the first button (not gamble).
If there's some action you coul...
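The casino point can be made concrete with a toy calculation (the loss-averse utility function and all the numbers are illustrative assumptions): each bet evaluated on its own is declined, while the same risk attitude applied to the whole 100-bet policy accepts it.

```python
import math

def u(delta: float) -> float:
    """Toy loss-averse utility on net winnings: losses hurt twice as much."""
    return delta if delta >= 0 else 2.0 * delta

# Risk aversion over actions: evaluate one $1 bet (win 60%, lose 40%).
single_bet_eu = 0.6 * u(1.0) + 0.4 * u(-1.0)  # 0.6 - 0.8 = -0.2, so decline

# Risk aversion over policies: evaluate the whole 100-bet session at once.
# Total winnings are 2k - n, where k ~ Binomial(n, p) is the number of wins.
n, p = 100, 0.6
policy_eu = sum(
    math.comb(n, k) * p**k * (1 - p) ** (n - k) * u(2 * k - n)
    for k in range(n + 1)
)
# single_bet_eu is negative, policy_eu is strongly positive: the aggregate
# distribution almost never loses money, so whole-policy risk aversion
# gambles even though per-action risk aversion never would.
```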
Yup, I'd lean towards this. If you have a powerful predictor of a bunch of rich, detailed sense data, then in order to "ask it questions," you need to be able to forge what that sense data would look like if the thing you want to ask about were true. This is hard, it gets harder the more complete the AI's view of the world is, and if you screw up you can get useless or malign answers without it being obvious.
It might still be easier than the ordinary alignment problem, but you also have to ask yourself about dual use. If this powerful AI makes solving alignment a little easier but makes destroying the world a lot easier, that's bad.
If you do model-free RL with a reward that rewards risk-aversion and penalizes risk, inner optimization or other unintended solutions could definitely still lead to problems if they crop up - they wouldn't have to inherit the risk aversion.
With model-based RL it seems pretty feasible to hard-code risk aversion in. You just have to use the world-model to predict probability distributions (maybe implicitly) and then can more directly be risk-averse when using those predictions. This probably wouldn't be stable under self-reflection, though - when evaluating ...
GPT-4 doesn't learn when you use it. It doesn't update its parameters to better predict the text of its users or anything like that. So the answer to the basic question is "no."
You could also ask "But what if it did keep getting updated? Would it eventually become super-good at predicting the world?" There are these things called "scaling laws" that predict performance based on amount of training data, and they would say that with arbitrary amounts of data, GPT-4 could get arbitrarily smart (though note that this would require new data that's many times mo...
Those advocating for more speculative interventions point to calculations suggesting that the expected value of their interventions is extremely large. What implications, if any, does the question “How much do you believe your results?” have for this debate?
I think it highlights the importance of plausibility arguments. If you think the underlying Quality distribution is gaussian, any claim of huge impact is going to be hard to stomach. What plausibility arguments do is say "hey, there are some really powerful interventions on the technological horizon, an...
Huh. Yeah, this definitely causes me to update my P(lab leak) from ~0.2 to ~0.75.