All of Charlie Steiner's Comments + Replies

Huh. Yeah, this definitely causes me to update my P(lab leak) from ~0.2 to ~0.75.

I have received $1000. The bet is on!

3 RatsWrongAboutUAP · 2d
Glad we could make this bet!

I commit to paying up if I agree there's a >0.4 probability something non-mundane happened in a UFO/UAP case, or if there's overwhelming consensus to that effect and my probability is >0.1.

Though I guess I should warn you in advance that I expect this would require either big, obvious evidence or repeatable evidence. An example of "big" would be an alien ship hovering over the fifty-yard line during the Super Bowl; "repeatable" would be some way of doing science to the aliens. Government alien-existence announcements lacking any such evidence might lead to me paying on the second clause rather than the first.

I'll message you details.

1 RatsWrongAboutUAP · 4d
Good enough for me

I think if your P(weird) is 3%, it might be hard for you to in-expectation make money even from someone whose P(weird) is 0.00001%. You should definitely worry about being stiffed to some extent, and both sides should expect small probabilities of other sorts of costly drama. This limits what bets people should actually agree on.

I'm not really imagining matching. I'm imagining the scope of points that I'm looking at sweeping outwards, and having different sides "win" by having more points in-scope as a function of time.

But I think if you prompt someone to imagine matching, you can easily pump intuition for sets being the same size if they alternate which is more dense infinitely many times.

Max bet $50k, I would be totally happy to bet at 50:1 odds.

2 RatsWrongAboutUAP · 4d
Let us move forward! I commit to operating in good faith with you, and I obviously take as a given that you will do the same. If you have any other concerns please let me know. Otherwise please provide (either publicly or privately) a means for me to pay you. We can then both confirm here that we have begun our bet.
3 RatsWrongAboutUAP · 5d
Enticing offer. Barring a better offer on odds and max payout that would eat up my budget, I would like to go forward with this. I will wait to see what offers come in first.

I think a fairly typical "intuitive" notion is something like:

Pick a space that contains the sets you want to compare (let's call them A and B). Then consider balls of radius r growing from the origin. There are four possibilities:

  1. There's as much A as B for almost every r (e.g. comparing positive numbers to negative numbers).
  2. There's an infinite extent of r for which there's more A than B in the ball, and also an infinite extent of r for which there's more B than A (e.g. comparing alternating pairs (0,3,4
... (read more)
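The growing-ball comparison in case 1 can be sketched concretely on the integers (a toy illustration; the function and variable names here are my own, not from the discussion):

```python
def count_in_ball(contains, r):
    """Count the integers n with |n| <= r that belong to the set."""
    return sum(1 for n in range(-r, r + 1) if contains(n))

A = lambda n: n > 0  # positive integers
B = lambda n: n < 0  # negative integers

# Case 1: there's as much A as B for every radius r.
for r in (1, 10, 1000):
    assert count_in_ball(A, r) == count_in_ball(B, r)
```

The same counting scheme, applied to sets that trade places as r grows, is what generates the other cases.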
3 Kenoubi · 2d
Hmm. My intuition says that your A and B are "pretty much the same size". Sure, there are infinitely many times that they switch places, but they do so about as regularly as possible and they're always close.

If A is "numbers with an odd number of digits" and B is "numbers with an even number of digits", that intuition starts to break down, though. Not only do they switch places infinitely often, but the extent to which one exceeds the other is unbounded. Calling A and B "pretty much the same size" starts to seem untenable; it feels more like "the concept of being bigger or smaller or the same size doesn't properly apply to the pair of A and B". (Even though A and B are well defined, not THAT hard to imagine, and mathematicians will still say they're the same size!)

If A is "numbers whose number of digits is a multiple of 10", and B is all the other (positive whole) numbers, then... I start to intuitively feel like B is bigger again??? I think this is probably just my intuition not being able to pay attention to all the parts of the question at the same time, and thus substituting "are there more multiples of 10 or non-multiples", which then works the way you said.
1 London L. · 5d
I like this a lot! I'm curious, though: in your head, what are you doing when you're considering an "infinite extent of r"? My guess is that you're actually doing something like the "markers" idea [https://www.lesswrong.com/posts/jqBH65TpbXDokBgC3/aligning-mathematical-notions-of-infinity-with-human?commentId=LetAvaJbmmYTbu7Hf] (though I could be wrong), where you're inherently matching the extent of r on A to the extent of r on B for smaller-than-infinity numbers, and then generalizing those results.

For example, when thinking through your example of alternating pairs, I check to see that when r=3, that's basically containing the 2 and everything lower, so I mark 3 and 2 as being the same, and then I do the density calculation. Matching 3 to 2 and then 7 to 6, I see that each set always has 2 elements in each section, so I conclude that they have an equal number of elements.

Does this "matching" idea make sense? Do you think it's what you do? If not, what are your mental images or concepts like when trying to understand what happens at the "infinite extent"? (I imagine you're not immediately drawing conclusions from imagining the infinite case, and are instead building up something like a sequence limit or pattern identification among lower values, but I could be wrong.)

I'm just gonna give you an answer off the top of my head first and google later. Seems like the spirit of the thing :P We'll see how I do! I'm a total non-expert, but I did read an IPCC report years and years ago.

Recent years (the last 10,000 years or so) you can use stuff like tree rings or... I think the amount of algae in sediment cores?, which have a time resolution of about one point per year, and are a fairly good measure of temperatures (plants grow better when it's warm, within limits), but with extra variation added (volcanic eruptions etc.). Let's guess... (read more)

There's been reasonable amounts of modeling work done in the context of managing money. E.g. https://forum.effectivealtruism.org/posts/Ne8ZS6iJJp7EpzztP/the-optimal-timing-of-spending-on-agi-safety-work-why-we

This is probably the sort of thing Tyler would want but wouldn't know how to find.

what inputs and outputs would be sufficient to reward modeling of the real world?

This is an interesting question, but I think it's not actually relevant. Like, it's really interesting to think about a thermostat - something whose only inputs are a thermometer and a clock, and whose only output is a switch hooked to a heater. Given arbitrarily large computing power and arbitrary amounts of on-distribution training data, will RL ever learn all about the outside world just from temperature patterns? Will it ever learn to deliberately affect the humans around it by t... (read more)

2 Ted Sanders · 8d
Right, I'm not interested in minimum sufficiency. I'm just interested in the straightforward question of what data pipes we would even plug into the algorithm that would result in AGI. Sounds like you think a bunch of cameras and computers would work? To me, it feels like an empirical problem that will take years of research.

Since I'm fine with saying things that are wildly inefficient, almost any input/output that's sufficient to reward modeling of the real world (rather than e.g. just playing the abstract game of chess) is sufficient. A present-day example might be self-driving car planning algorithms (though I don't think any major companies actually use end-to-end NN planning).

1 Ted Sanders · 9d
Right, but what inputs and outputs would be sufficient to reward modeling of the real world? I think that might take some exploration and experimentation, and my 60% forecast is the odds of such inquiries succeeding by 2043. Even with infinite compute, I think it's quite difficult to build something that generalizes well without overfitting.

It's related in that you're all talking about maintaining some parts of the status quo, but I think the instrumental technologies (human-directed services vs. agential AIs that directly care about maintaining status-quo boundaries) are pretty different, as are all the arguments related to those technologies.

So, the maximally impractical but also maximally theoretically rigorous answer here is AIXI-tl.

An almost as impractical answer would be Markov chain Monte Carlo search for well-performing huge neural nets on some objective.

I say MCMC search because I'm confident that there's some big neural nets that are good at navigating the real world, but any specific efficient training method we know of right now could fail to scale up reliably. Instability being the main problem, rather than getting stuck in local optima.

Dumb but thorough hyperparameter search and RL... (read more)

1 Ted Sanders · 9d
Gotcha. I guess there's a blurry line between program search and training. Somehow training feels reasonable to me, but something like searching over all possible programs feels unreasonable to me. I suppose the output of such a program search is what I might mean by an algorithm for AGI. Hyperparameter search and RL on a huge neural net feels wildly underspecified to me. Like, what would be its inputs and outputs, even?

Thanks, this was interesting.

I couldn't really follow along with my own probabilities because things started wild from the get-go. You say we need to "invent algorithms for transformative AI," when in fact we already have algorithms that are in-principle general, they're just orders of magnitude too inefficient, but we're making gradual algorithmic progress all the time. Checking the pdf, I remain confused about your picture of the world here. Do you think I'm drastically overstating the generality of current ML and the gradualness of algorithmic improveme... (read more)

4 Ted Sanders · 10d
I'm curious and I wonder if I'm missing something that's obvious to others: What are the algorithms we already have for AGI? What makes you confident they will work before seeing any demonstration of AGI?

Lots of claims have been scrutinized fairly intensely by governments. Was it the Chilean military that spent a couple of years investigating a UFO sighting and eventually went public saying it was unexplainable? Sadly, this effort provides little increase in reliability. The investigators are often doing this for the first time and lack key skills for analyzing the data. This is exacerbated by the fact that governments are large enough to allow for selection effects, where the people spending effort investigating UFOs are self-selected for thinking they're really important, i.e. aliens.

1 awg · 12d
He testified under oath to the House and Senate intelligence committees, purportedly giving them hundreds of pages of documentation including specific names of programs and people to follow up on. All I'm saying is, AFAIK, no UFO claims have been under that level of scrutiny (in the US) before. And to say that either the ICIG or the members of the House and Senate intelligence committees are either "first timers" w/r/t vetting claims like this or that they fall prey to selection effects for "people spending effort investigating UFOs" rings false to me.

Hm. I'm sure plenty of people could do a fine job, myself included. But if every such person jumped in, it would be a mess. I assume that if Stuart Russell was the right person for the job, the job would already be over. Plausibly ditto Eliezer.

Rob Miles might be the obvious person for explaining things well. I totally endorse him doing attention-getting things I wouldn't endorse for people like me.

Also probably fine would be people optimized a little more for AI work than explaining things. Paul Christiano may be the Schelling-point tip of the iceberg of ... (read more)

Lacking access to the other's hardware, I think you'd need something that's easy to compute for an honest AI, but hard to compute for a deceptive AI. Because a deceptive AI could always just simulate an honest AI, how do you distinguish simulation?

The only way I can think of is resource constraints. Deception adds a slight overhead to calculating quantities that depends in detail on the state of the AI. If you know what computational capabilities the other AI has very precisely, and you can time your communications with it, then maybe it can compute something for you that you can later verify implies honesty.

2 Kenny · 13d
That's an interesting idea!

There are plenty of good posts that contradict a "strict" orthogonality thesis by showing correlation between capabilities and various values-related properties (scaling laws / inverse scaling laws).

What really gets you downvoted is the claim that super-intelligent AI cannot want things that are bad for humanity, or even agitating that we should give that idea serious weight.

What also gets you downvoted is the in-between claim that all the scaling laws tend towards superhuman morality and everything will work out fine, no need to be worried or spend lots of hours working.

How to make a successful piece in the latter categories? Simple - just be right, for communicable reasons. Simple, but maybe not possible.

If you start with an AI that makes decisions of middling quality, how well can you get it to make high-quality decisions by ablating neurons associated with bad decisions? This is the central thing I expect to have diminishing returns (though it's related to some other uses of interpretability that might also have diminishing returns).

If you take a predictive model of chess games trained on human play, it's probably not too hard to get it to play near the 90th percentile of the dataset. But it's not going to play as well as stockfish almost no matter what ... (read more)

Yes, preserving the existence of multiple good options that humans can choose between using their normal reasoning process sounds great. Which is why an AI that learns human values should learn that humans want the universe to be arranged in such a way.

I'm concerned that you seem to be saying that problems of agency are totally different from learning human values, and have to be solved in isolation. The opposite is true - preferring agency is a paradigmatic human value, and solving problems of agency should only be a small part of a more general solution.

7 catubc · 16d
Thanks for the comment. I agree broadly of course, but the paper says more specific things. For example, agency needs to be prioritized, probably taken outside of standard optimization; otherwise decimating pressure is applied to other concepts, including truth and other "human values". The other part is an empirical one, also related to your concern: namely, human values are quite flexible, and biology doesn't create hard bounds / limits on depletion. If you couple that with ML/AI technologies that will predict what we will do next, then approaches that depend on human intent and values (broadly) are not as safe anymore.

I am shocked that higher quality training data based on more effortful human feedback produced a better result.

Consider the computational difficulty of intrinsic vs. extrinsic alignment for a chess-playing AI.

Suppose you want the AI to walk its king to the center of the board before winning. With intrinsic alignment, this is a little tricky to encode but not too hard. With extrinsic alignment, this requires vastly outsmarting the chess-playing AI so that you can make it dance to your tune - maybe humans could do it to a 500-elo chess bot, but past 800 elo I think I'd only be able to solve the problem by building a second chess engine that was intrinsically aligned to extrinsically align the first one.

A nice exposition.

For myself I'd prefer the same material much more condensed and to-the-point, but I recognize that there are publication venues that prefer more flowing text.

E.g. compare

We turn next to the laggard. Compared to the fixed roles model, the laggard’s decision problem in the variable roles model is more complex primarily in that it must now consider the expected utility of attacking as opposed to defending or pursuing other goals. When it comes to the expected utility of defending or pursuing other goals, we can simply copy the formulas from

... (read more)

There is a causal relationship between time on LW and frequency of paragraph breaks :P

Anyhow, I broadly agree with this comment, but I'd say it's also an illustration of why interpretability has diminishing returns and we really need to also be doing "positive alignment." If you just define some bad behaviors and ablate neurons associated with those bad behaviors (or do other things like filter the AI's output), this can make your AI safer but with ~exponentially diminishing returns on the selection pressure you apply.

What we'd also like to be doing is def... (read more)

1 Joseph Van Name · 14d
I agree that black box alignment research (where we do not look at what the hidden layers are doing) is crucial for AI and AGI safety. I just personally am more interested in interpretability than direct alignment, because I think I am currently better at making interpretable machine learning models and interpretability tools, and because I can make my observations rigorous enough for anyone who is willing to copy my experiments or read my proofs to be convinced. This may be more to do with my area of expertise than any objective difference in the importance of interpretability vs. black box alignment.

Can you elaborate on what you mean by 'exponentially diminishing returns'? I don't think I fully get that or why that may be the case.

I would claim that an army of robots based on ASIs will generally lose to an army of robots based on true AGI. 

The truly optimal war-winning AI would not need to question its own goal to win the war, presumably.

Would you agree that an AI that is maximizing paperclips does make intellectual mistake? 

No. I think that's anthropomorphism - just because a certain framework of moral reasoning is basically universal among humans, doesn't mean it's universal among all systems that can skillfully navigate the real world. Frameworks of moral reasoning are on the "ought" side of the is-ought divide.

1 Michael Simkin · 18d
If the AI has no clear understanding of what it is doing and why, and no wider world view of why and whom to kill and whom not, how would one ensure a military AI will not turn against its side? You can operate a tank and kill the enemy with ASI; you will not win a war without traits of more general intelligence, and those traits will also justify (or not) the war and its reasoning. Giving a limited goal without context, especially a gray-area ethical goal that is expected to be obeyed without questioning, can be expected from ASI, not true intelligence. You can operate an AI in a very limited scope this way.

The moral reasoning of reducing suffering has nothing to do with humans. Suffering is bad not because of some sort of randomly chosen axioms of "ought"; suffering is bad because anyone who is suffering is objectively in a negative state of being. This is not a subjective abstraction... suffering can be attributed to many creatures, and while human suffering is more complex and deeper, it's not limited to humans.

The idea has a broader consequence for AI safety. While a paperclip maximizer might be designed as part of paperclip-maximizer research, it probably will not arise spontaneously from intelligence research in general. Even making one will probably be considered an immoral request by an AGI.

This doesn't follow.

You start the post by saying that the most successful paperclip maximizer (or indeed the most successful AI at any monomaniacal goal) wouldn't doubt its own goals, and in fact doesn't even need the capacity to doubt its own goals. And since you care ... (read more)

1 Michael Simkin · 19d
It's not only that it can't doubt its own goal; it also can't logically justify its own goal, can't read a book on ethics and change its perspective on the goal, or simply realize how dumb this goal is. It can't find a coherent way to explain to itself its role in the universe or why this goal is important, compared to, for example, an alternative goal to preserve life and reduce suffering. It isn't required to be coherent with itself, and it's incapable of estimating how its goal compares with other goals and ethical principles. It's just lacking the basics of rational thinking.

A series of ASIs is not an AGI: it will lack the basic ability to "think critically", and the lack of many other intelligence traits will limit its mental capacity. It will just execute a series of actions to reach a certain goal, without any context. A bunch of "chess engines", acting in a more complex environment.

I would claim that an army of robots based on ASIs will generally lose to an army of robots based on true AGI. Why? Because intelligence is a very complex thing that gives advantages in unforeseen ways, and is also used for tactical command on the battlefield, as well as all war logistics, etc. You need to have a big picture; you need to be able to connect a lot of seemingly unconnected dots; you need traits like creativity, imagination, thinking outside the box; you need to know your limitations and delegate some tasks while focusing on others, which means you need a well-established goal prioritization mechanism, and you need to be able to think about goals rationally. You can't treat the whole universe just as a bunch of small goals solved by "chess engines"; there is too much non-trivial interconnectedness between different components that an ASI will not be able to notice. True intelligence has a lot of features that give it the upper hand over a "series of specialized engines" in a complex environment like Earth. The reason why people would lose to an army of robots based on ASIs, is b

Do you think that human theorists are near the limit of what kind of approximations we should use to calculate the band structure of diamond (and therefore a superintelligent AI couldn't outsmart human theorists by doing their job better)? Like if you left physics to stew for a century and came back, we'd still be using the GW approximation?

This seems unlikely to me, but I don't really know much about DFT (I was an experimentalist). Maybe there are so few dials to turn that picking the best approximation for diamond is an easy game. Intuitively I'd expect ... (read more)


More or less.

Is this good news? Yes.

Is this strong evidence that we don't need to work hard on AI safety? No.

Are elements of the simple generative-model-finetuning paradigm going to be reused to ensure safety of superintelligent AI (conditional on things going well)? Maybe, maybe not. I'd say that the probability is around 30%. That's pretty likely in the grand scheme of things! But it's even more likely that we'll use new approaches entirely and the safety guardrails on GPT-4 will be about as technologically relevant to superintelligent AI as the safety guardrails on industrial robots.

I feel like it's 4 ~ 1 > 2 > 3. The example of CNNs seems like this, where the artificial neural networks and actual brains face similar constraints and wind up with superficially similar solutions, but when you look at all the tricks that CNNs use (especially weight-sharing, but also architecture choices, choice of optimizer, etc.) they're not actually very biology-like, and were developed based on abstract considerations more than biological ones.

Thanks! It seems like most of your exposure has been through Eliezer? Certainly impressions like "why does everyone think the chance of doom is >90%?" only make sense in that light. Have you seen presentations of AI risk arguments from other people like Rob Miles or Stuart Russell or Holden Karnofsky, and if so do you have different impressions?

1 Seth Herd · 19d
I think the relevant point here is that the OP's impressions are from Yudkowsky, and that's evidence that many people's are. Certainly the majority of public reactions I see emphasize Yudkowsky's explanations, and seem to be motivated by his relatively long-winded and contemptuous style.

If it wants to be shut down, and humans might start it up again later, the optimal strategy seems like creating a successor agent to achieve its goals and kill all the humans and then shut itself down.

It might be worth going into the problem of fully updated deference. I don't think it's necessarily always a problem, but also it does stop utility aggregation and uncertainty from being a panacea, and the associated issues are probably worth a bit of discussion. And as you likely know, there isn't a great journal citation for this, so you could really cash in when people want to talk about it in a few years :P

Huh, interesting. I skimmed the paper and I'm not convinced this specific architecture is promising for tasks that move a lot of information or have hierarchical structure - the lack of a value (only keys and queries) seems like a big downgrade. The graph classification results are pretty good though, and I'd agree with the authors that it's probably because they've improved information routing without having much worse inductive biases than GCNNs. Does this match your impression?

I'm also kind of a downer about interpretability. There's different kinds of ... (read more)

The classic textbook on it if you want to read more is Li and Vitanyi's Introduction to Kolmogorov Complexity.

Yes, this is fine to do, and prevents single-shot problems if you have a particular picture of the distribution over outcomes where most disastrous risk comes from edge cases that get 99.99%ile score but are actually bad, and all we need is actions that are 99th percentile.

This is fine if you want your AI to stack blocks on top of other blocks or something.

But unfortunately when you want to use a quantilizer to do something outside the normal human distribution, like cure cancer or supervise the training of a superhuman AI, you're no longer just shooting f... (read more)
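For context, the quantilizer being discussed can be sketched as follows (a toy version under my own assumptions: sample actions from a base distribution, then pick uniformly from the top (1 - q) fraction by score; all names here are hypothetical):

```python
import random

def quantilize(base_sample, score, q=0.99, n=10_000, rng=random):
    """Draw n actions from the base distribution and return a uniformly
    random action from the top (1 - q) fraction, ranked by score."""
    actions = sorted((base_sample() for _ in range(n)), key=score)
    cutoff = int(q * n)  # e.g. q=0.99 keeps only the top 1%
    return rng.choice(actions[cutoff:])

# Toy use: base distribution of "action qualities", score is identity.
action = quantilize(lambda: random.gauss(0, 1), score=lambda a: a, q=0.99)
```

This is exactly the "shoot for a high percentile of the base distribution" move: it inherits the base distribution's safety properties, but only within that distribution's support, which is the limitation pointed at above.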

Do you think there might be alternative or more effective ways to model human consciousness, or is the approach of using "qualia" the most promising one we currently have?

IMO the most useful is the description of the cognitive algorithms / cognitive capabilities involved in human-like consciousness. Like remembering events from long-term memory when appropriate, using working memory to do cognitive tasks, responsiveness to various emotions, emotional self-regulation, planning using various abstractions, use of various shallow decision-making heuristics, in... (read more)

if humans actually had utility functions

Yeah, humans lack a unique utility function. I know what you mean informally, just don't get bogged down mathematizing something we don't have.

So, once it has made good progress on its value learning, its utility function ought to make a lot of sense to us.

Do you think this is a desideratum, or a guarantee?

I'll say the key point plainly: suppose some policy is "the good policy." Which utility function causes an agent to follow the good policy will be different depending on how the agent makes decisions. For a ... (read more)

1 Roger Dearnaley · 1mo
I take your point that the way an Infra-Bayesian system makes decisions isn't the same as a human - it presumably doesn't share our cognitive biases, and the pessimism element 'Murphy' in it seems stronger than for most humans. I normally assume that if there's something I don't understand about the environment that's injecting noise into the outcome of my actions, the noise-related parts of the results aren't going to be well-optimized, so they're going to be worse than I could have achieved had I had full understanding, but that even leaving things to chance I may sometimes get some good luck along with the bad - I don't generally assume that everything I can't control will have literally the worst possible outcome. So I guess in Infra-Bayesian terms I'm assuming that Murphy is somewhat constrained by laws that I'm not yet aware of, and may never be aware of.

My take on Murphy is that it's a systematization of the force of entropy trying to revert the environment to a thermodynamic equilibrium state, and of the common fact that the utility of that equilibrium state is usually pretty low.

One of the flaws I see in Infra-Bayesianism is that there are sometimes (hard to reach but physically possible) states whose utility to me is even lower than the thermodynamic equilibrium (such as a policy that scores less than 20% on a 5-option multiple choice quiz, so does worse than random guessing, or a minefield left over after a war that is actually worse than a blasted wasteland) where increasing entropy would actually help improve things. In a hellworld, randomly throwing monkey wrenches in the gears is a moderately effective strategy. In those unusual cases Infra-Bayesianism's Murphy no longer aligns with the actual effects of entropy/Knightian uncertainty.

I can't deny that it does in fact manipulate the gradient - but it seems like overkill. After all, why continue the gradient descent process at all once you've hacked the computer you're running on? Just have the host computer write logs as if gradient descent was happening but not actually do it.

3 Max H · 1mo
It's overkill in some sense, yes, but the thing I was trying to demonstrate with the human-alien thought experiment is that hacking the computer or system that is doing the training might actually be a lot easier than gradient hacking directly via solving a really difficult, possibly intractable math problem. Hacking the host doesn't require the mesa-optimizer to have exotic superintelligence capabilities, just ordinary human programmer-level exploit-finding abilities. These hacking capabilities might be enough to learn the expected outputs during training through a sidechannel and effectively stop the gradient descent process, but not be sufficient to completely take over the host system and manipulate it in undetectable ways.

Take this with a big grain of salt, but I'll just tell you my impression.

Theoretically, I think it's useful in that it tells us that a lot is possible even in non-realizable settings.

As a guide to practice, I think there's plenty of room to do better. Ideally I'd want a representation that leverages composition of hypotheses with each other, and that natively does its reasoning in a non-extremizing way that makes more sense to humans (even if it's mathematically equivalent to armax/argmin on some function).

Presently I think it's a mistake to identify a goo... (read more)

1 Roger Dearnaley · 1mo
We want our value-learner AI to learn to have the same preference order over outcomes as humans, which requires its goal to be to find (or at least learn to act according to) a utility function as close as possible to some aggregate of ours (if humans actually had utility functions rather than a collection of cognitive biases [https://www.lesswrong.com/posts/LQp9cZPzJncFKh5c8/prospect-theory-a-framework-for-understanding-cognitive]) up to an arbitrary monotonically-increasing mapping. We also want its preference order over probability distributions of outcomes to match ours, which requires it to find a utility function that matches ours up to an increasing affine (linear, i.e. scale and shift) transformation. So, once it has made good progress on its value learning, its utility function ought to make a lot of sense to us.

I feel like the "obvious" thing to do is to ask how rare (in bits) the post-optimization EV is according to the pre-optimization distribution. Like, suppose that pre-optimization my probability distribution over utilities I'd get is normally distributed, and after optimizing my EV is +1 standard deviation. The probability of doing that well or better is 0.159, which in bits is about 2.66 bits.

Seems indifferent to affine transformation of the utility function, adding irrelevant states, splitting/merging states, etc. What are some bad things about this method?
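A minimal sketch of that calculation, assuming (as in the example) that the pre-optimization utility distribution is a standard normal:

```python
from math import log2
from statistics import NormalDist

# Pre-optimization utilities assumed standard normal;
# post-optimization EV is +1 standard deviation.
p_better = 1 - NormalDist().cdf(1.0)  # probability of doing that well or better
bits = -log2(p_better)                # rarity of the achieved EV, in bits

print(f"P(>= +1 sd) = {p_better:.3f}, rarity = {bits:.2f} bits")
```

Because this only asks "what fraction of pre-optimization probability mass does at least this well," it is unchanged by affine transformations of the utility function or by relabeling states.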

2interstice1mo
This is the same as Eliezer's definition, no? It only keeps information about the order induced by the utility function.

I just skimmed the video, but it seems like there's more salesmanship than there is explanation of what the network is doing, how its capabilities would compare to using e.g. a small RNN, and how far it actually generalizes.

Remember that self-driving cars first appeared in the 1980s - lane-keeping is actually a very simple task if you only need 99% reliability. I don't think their demos are super informative about the utility of this architecture to complicated tasks.

So I'd be interested if you looked into it more and think that my first impression is unfair.

Hi, welcome!

Consciousness isn't actually super relevant to alignment. Alignment is about figuring out how to get world-affecting AI to systematically do good things rather than bad things. This is possible both for conscious and unconscious AI, and consciousness seems to provide neither a benefit nor an impediment to doing good/bad things.

But it's still fun to talk about sometimes.

For this approach, the crucial step 1 is to start with observations of a big blob of atoms called a "human," and model the human in a way that uses some pieces called "qualia" ... (read more)

1Yusuke Hayashi1mo
Dear Charlie,

Thank you for sharing your insights on the relationship between consciousness and AI alignment. I appreciate your perspective and find it to be quite thought-provoking.

I agree with you that the challenge of AI alignment applies to both conscious and unconscious AI. The ultimate goal is indeed to ensure AI systems act in a manner that is beneficial, regardless of their conscious state. However, while consciousness may not directly impact the 'good' or 'bad' actions of an AI, I believe it could potentially influence the nuances of how those actions are performed, especially when it comes to complex, human-like tasks.

Your point about the complexity of modeling a human using "qualia" is well-taken. It's indeed a challenging and contentious task, and I think it's one of the areas where we need more research and understanding. Do you think there might be alternative or more effective ways to model human consciousness, or is the approach of using "qualia" the most promising one we currently have?

Thank you again for your thoughtful comments. I look forward to further discussing these fascinating topics with you.

Best, Yusuke

I'm confused about how to do that because I tend to think of self-modification as happening when the agent is limited and can't foresee all the consequences of a policy, especially policies that involve making itself smarter. But I suspect that even if you figure out a non-confusing way to talk about risk aversion for limited agents that doesn't look like actions on some level, you'll get weird behavior under self-modification, like an update rule that privileges the probability distribution you had at the time you decided to self-modify.

Why would hardcoded model-based RL probably self-modify or build successors this way, though?

Because picking a successor is like picking a policy, and risk aversion over policies can give different results than risk aversion over actions.

Like, suppose you go to a casino with $100, and there are two buttons you can push - one button does nothing, and the other button you have a 60% chance to win a dollar and 40% chance to lose a dollar. If you're risk averse you might choose to only ever press the first button (not gamble).
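A toy sketch of that difference, using a simple mean-minus-spread risk measure (the specific penalty is just an illustrative assumption, not a claim about how real agents score risk):

```python
import math

def score(mean, sd, lam=0.5):
    """Toy risk-averse score: expected value penalized by lam * standard deviation."""
    return mean - lam * sd

# One $1 button-push: 60% win $1, 40% lose $1.
p = 0.6
mean1 = p * 1 + (1 - p) * (-1)           # +0.20 per push
var1 = 1 - mean1**2                      # E[X^2] - mean^2 = 0.96
sd1 = math.sqrt(var1)

# Risk aversion applied per action: each push scores negative, so never gamble.
print(score(mean1, sd1))

# Risk aversion applied once to a 1000-push policy: the mean grows like N but
# the sd only grows like sqrt(N), so the same risk measure endorses gambling.
n = 1000
print(score(n * mean1, math.sqrt(n) * sd1))
```

The flip happens because aggregating over the whole policy shrinks the spread relative to the mean, which is exactly why picking a successor (a whole policy) can come out differently than picking actions one at a time.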

If there's some action you coul... (read more)

1MichaelStJules1mo
I was already thinking the AI would be risk averse over whole policies and the aggregate value of their future, not locally/greedily/separately for individual actions and individual unaggregated rewards.

Yup, I'd lean towards this. If you have a powerful predictor of a bunch of rich, detailed sense data, then in order to "ask it questions," you need to be able to forge what that sense data would be like if the thing you want to ask about were true. This is hard, it gets harder the more complete the AI's view of the world is, and if you screw up you can get useless or malign answers without it being obvious.

It might still be easier than the ordinary alignment problem, but you also have to ask yourself about dual use. If this powerful AI makes solving alignment a little easier but makes destroying the world a lot easier, that's bad.

If you do model-free RL with a reward that rewards risk-aversion and penalizes risk, inner optimization or other unintended solutions could definitely still lead to problems if they crop up - they wouldn't have to inherit the risk aversion.

With model-based RL it seems pretty feasible to hard-code risk aversion in. You just have to use the world-model to predict probability distributions (maybe implicitly) and then can more directly be risk-averse when using those predictions. This probably wouldn't be stable under self-reflection, though - when evaluating ... (read more)
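As a rough sketch of what "hardcoding" this might look like: below, the world-model is a hypothetical sampler of outcomes, and risk aversion is a mean-minus-spread penalty applied to its predictions before choosing an action (all names and the toy model are illustrative assumptions):

```python
import random
import statistics

def risk_averse_value(world_model, state, action, lam=1.0, n_samples=500):
    """Evaluate an action by sampling predicted returns from the world model,
    then penalizing their spread (a simple hardcoded form of risk aversion)."""
    returns = [world_model(state, action) for _ in range(n_samples)]
    return statistics.mean(returns) - lam * statistics.stdev(returns)

# Hypothetical world model: "safe" returns 0; "gamble" is +EV but high-variance.
def toy_model(state, action):
    if action == "gamble":
        return 10.0 if random.random() < 0.6 else -10.0
    return 0.0

random.seed(0)
values = {a: risk_averse_value(toy_model, None, a) for a in ("safe", "gamble")}
print(max(values, key=values.get))  # the risk-averse planner picks "safe"
```

Note this risk aversion lives in the planning code wrapped around the world-model, not in the learned model itself, which is why it needn't survive self-modification.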

1MichaelStJules1mo
Thanks! This makes sense. I agree model-free RL wouldn't necessarily inherit the risk aversion, although I'd guess there's still a decent chance it would, because that seems like the most natural and simple way to generalize the structure of the rewards.

Why would hardcoded model-based RL probably self-modify or build successors this way, though? To deter/prevent threats from being made in the first place or even followed through on? But, does this actually deter or prevent our threats when evaluating the plan ahead of time, with the original preferences? We'd still want to shut it and any successors down if we found out (whenever we do find out, or it starts trying to take over), and it should be averse to that increased risk ahead of time when evaluating the plan.

I think there are (at least) two ways to reduce this risk:

1. Temporal discounting. The AI wants to ensure its own longevity, but is really focused on the very near term, just making it through the next day or hour, or whatever, so increasing the risk of being caught and shut down now by doing something sneaky looks bad even if it increases the expected longevity significantly, because it's discounting the future so much. It will be more incentivized to do whatever people appear to want it to do ~now (regardless of impacts on the future), or else risk being shut down sooner.

2. Difference-making risk aversion, i.e. being risk averse with respect to the difference with inaction (or some default safe action).[1] This makes inaction look relatively more attractive. (In this case, I think the agent can't be represented by a single consistent utility function over time, so I wonder if self-modification or successor risks would be higher, to ensure consistency.)

[1] And you could fix this to be insensitive to butterfly effects, by comparing quantile functions as random variables instead.

GPT-4 doesn't learn when you use it. It doesn't update its parameters to better predict the text of its users or anything like that. So the answer to the basic question is "no."

You could also ask "But what if it did keep getting updated? Would it eventually become super-good at predicting the world?" There are these things called "scaling laws" that predict performance based on amount of training data, and they would say that with arbitrary amounts of data, GPT-4 could get arbitrarily smart (though note that this would require new data that's many times mo... (read more)
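For a sense of the shape such laws take, here is a toy power-law fit in the style of published scaling-law papers; the functional form is standard but the constants are made up, not real fitted values:

```python
# A minimal sketch of a data scaling law, L(D) = E + A / D**alpha, where D is
# training tokens. E, A, and alpha here are invented for illustration only.
def loss_from_data(d_tokens, e=1.7, a=400.0, alpha=0.3):
    """Predicted pretraining loss as a function of training data."""
    return e + a / d_tokens**alpha

# More data -> lower predicted loss, with sharply diminishing returns,
# approaching the irreducible floor E as D grows.
for d in (1e9, 1e10, 1e11, 1e12):
    print(f"{d:.0e} tokens -> predicted loss {loss_from_data(d):.3f}")
```

The takeaway is that predicted loss keeps falling with more data but never below the irreducible term, and each 10x of data buys less than the last.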

1Olivier Coutu1mo
Charlie is correct in saying that GPT-4 does not actively learn based on its input. But a related question is whether we are missing key technical insights for AGI, and Stampy has an answer [https://aisafety.info/?state=7727_89LMr7750r8E41_6587r6350r7605r8486-6194-6708-6172-6479-6968-8392-8163-6953-5635-8165-7794-6715-] for that. He also has an answer explaining scaling laws [https://aisafety.info/?state=7750_8486-6194-6708-6172-6479-6968-8392-8163-6953-5635-8165-7794-6715-].

Those advocating for more speculative interventions point to calculations suggesting that the expected value of their interventions is extremely large. What implications, if any, does the question “How much do you believe your results?” have for this debate?

I think it highlights the importance of plausibility arguments. If you think the underlying Quality distribution is gaussian, any claim of huge impact is going to be hard to stomach. What plausibility arguments do is say "hey, there are some really powerful interventions on the technological horizon, an... (read more)
