All of Charlie Steiner's Comments + Replies

Charlie Steiner's Shortform

I think you can steelman Ben Goertzel-style worries about near-term amoral applications of AI being bad "formative influences" on AGI, but mostly under a continuous takeoff model of the world. If AGI is a continuous development of earlier systems, then maybe it shares some datasets and learned models with earlier AI projects, and definitely it shares the broader ecosystems of tools, dataset-gathering methodologies, model-evaluating paradigms, and institutional knowledge on the part of the developers. If the ecosystem in which this thing "grows up" is one t... (read more)

Learning Russian Roulette

The amount that I care about this problem is proportional to the chance that I'll survive to have it.

Dutch-Booking CDT: Revised Argument

Pausing reading at "Paul's simple argument" to jot this down: The expected values are identical when you're conditioning on all the parent nodes of the action (i.e. you have full knowledge of your own decision-making process, in your decision-making process). But if you can't do that, then it seems like EDT goes nuts - e.g. if there's a button you don't want to press, and you're not conditioning on your own brain activity, then EDT might evaluate the expected utility of pressing the button by assuming you have a harmful seizure that makes you hit the butto... (read more)
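A minimal numerical sketch of that failure mode, with numbers made up purely for illustration:

```python
# Toy numbers (assumed, not from the post) for the "seizure" worry above:
# pressing the button is harmless in itself, but deliberate presses are so
# rare that most observed presses come from harmful seizures.
p_seizure = 0.001                    # prior probability of a seizure
p_press_given_seizure = 1.0          # a seizure always hits the button
p_press_given_no_seizure = 0.0001    # deliberate presses are very rare
u_seizure, u_deliberate_press = -100.0, 1.0

p_press = (p_seizure * p_press_given_seizure
           + (1 - p_seizure) * p_press_given_no_seizure)
p_seizure_given_press = p_seizure * p_press_given_seizure / p_press

# EDT without conditioning on your own brain state: conditioning on "the
# button gets pressed" drags in the seizure explanation.
edt_value = (p_seizure_given_press * u_seizure
             + (1 - p_seizure_given_press) * u_deliberate_press)

print(round(p_seizure_given_press, 3))  # ~0.909
print(round(edt_value, 1))              # ~ -90.8, though deliberately pressing is worth +1
```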

Learning Russian Roulette

I think in the real world, I am actually accumulating evidence against magic faster than I am trying to commit elaborate suicide.

1Bunthut5dThe problem, as I understand it, is that there seem to be magical hypotheses you can't update against from ordinary observation, because by construction the only time they make a difference is in your odds of survival. So you can't update against them from observation, and anthropics can only update in their favour, so eventually you end up believing one and then you die.
AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes

I initially thought you were going to debate Beth Barnes.

Also, thanks for the episode :) It was definitely interesting, although I still don't have a good handle on why some people are optimistic that there aren't classes of arguments humans will "fall for" irrespective of their truth value.

2DanielFilan5dYeah the initial title was not good
Testing The Natural Abstraction Hypothesis: Project Intro

One generalization I am also interested in is to learn not merely abstract objects within a big model, but entire self-contained abstract levels of description, together with actions and state transitions that move you between abstract states. E.g. not merely detecting that "the grocery store" is a sealed box ripe for abstraction, but that "go to the grocery store" is a valid action within a simplified world-model with nice properties.

This might be significantly more challenging to say something interesting about, because it depends not just on the world but on how the agent interacts with the world.

Learning Russian Roulette

I agree with faul sname, ADifferentAnonymous, shminux, etc. If every single person in the world had to play Russian roulette (1 bullet and 5 empty chambers), and the firing pin was broken on exactly one gun in the whole world, everyone except the person with the broken gun would be dead after about 125 trigger pulls.

So if I remember being forced to pull the trigger 1000 times, and I'm still alive, it's vastly more likely that I'm the one human with the broken gun, or that I'm hallucinating, or something else, rather than me just getting lucky. Note that if... (read more)
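A quick back-of-the-envelope check on those numbers (the world population and broken-gun setup are assumed as above):

```python
import math

p_survive_one = 5 / 6        # one bullet, five empty chambers
world_pop = 8e9              # rough world population, assumed for illustration

# Expected number of people still alive after n forced trigger pulls, given
# exactly one gun in the world has a broken firing pin.
def expected_survivors(n):
    return 1 + (world_pop - 1) * p_survive_one ** n

n = math.ceil(math.log(world_pop) / math.log(6 / 5))
print(n, round(expected_survivors(n), 1))   # ~126 pulls, ~1.8 expected survivors

# Surviving 1000 pulls by luck alone is astronomically unlikely:
print(p_survive_one ** 1000)                # ~6e-80
```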

1Bunthut5dMaybe the disagreement is in what we consider the alternative hypothesis to be? I'm not imagining a broken gun - you could examine your gun and notice it isn't broken, or just shoot into the air a few times and see it firing. But even after you eliminate all of those, there's still the hypothesis "I'm special for no discernible reason" (or is there?) that can only be tested anthropically, if at all. And this seems worrying. Maybe here's a stronger way to formulate it: Consider all the copies of yourself across the multiverse. They will sometimes face situations where they could die. And they will always remember having survived all previous ones. So eventually, all the ones still alive will believe they're protected by fate or something, and then do something suicidal. Now you can bring the same argument about how there are a few actual immortals, but still... "A rational agent that survives long enough will kill itself unless it's literally impossible for it to do so" doesn't inspire confidence, does it? And it happens even in very "easy" worlds. There is no world where you have a limited chance of dying before you "learn the ropes" and are safe - it's impossible to have a chance of eventual death other than 0 or 1, without the laws of nature changing over time. I interpret that as conditioning on the existence of at least one thing with the "inner" properties of yourself.
Which counterfactuals should an AI follow?

Very nice overview! Of course, I think most of the trick is crammed into that last bit :) How do you get a program to find the "common-sense" implied model of the world to use for counterfactuals?

In plain English - in what ways are Bayes' Rule and Popperian falsificationism conflicting epistemologies?

"Classic flavor" JTB is indeed that bad. JTB shifted to a probabilistic ontology is either Bayesian, wrong, or answering a different question altogether.

1TAG6dI'll go for answering different questions. Bayes, although well known to mainstream academia, isn't regarded as the one epistemology to rule them all, precisely because there are so many issues it doesn't address.
In plain English - in what ways are Bayes' Rule and Popperian falsificationism conflicting epistemologies?

I'm not really sure about the history. A quick search turns up Russell making similar arguments at the turn of the century, but I doubt there was the sort of boom there was after Gettier - maybe because probability wasn't developed enough to serve as an alternative ontology.

1TAG6dIt remains the case that JTB isn't that bad, and Bayes isn't that good a substitute.
My take on Michael Littman on "The HCI of HAI"

Even when talking about how humans shouldn't always be thought of as having some "true goal" that we just need to communicate, it's so difficult to avoid talking in that way :)  We naturally phrase alignment as alignment to something - and if it's not humans, well, it must be "alignment with something bigger than humans." We don't have the words to be more specific than "good" or "good for humans," without jumping straight back to aligning outcomes to something specific like "the goals endorsed by humans under reflective equilibrium" or whatever.

We need a good linguistic-science fiction story about a language with no such issues.

2alexflint3dYes, I agree, it's difficult to find explicit and specific language for what it is that we would really like to align AI systems with. Thank you for the reply. I would love to read such a story!
In plain English - in what ways are Bayes' Rule and Popperian falsificationism conflicting epistemologies?

Hm. I don't think people who talk about "Bayesianism" in the broad sense are using a different ontology of probability than most people. I think what makes "Bayesians" different is their willingness to use probability at all, rather than some other conception of knowledge.

Like, consider the weird world of the "justified true belief" definition of knowledge and the mountains of philosophers trying to patch up its leaks. Or the FDA's stance on whether covid vaccines work in children. It's not that these people would deny the proof of Bayes' theorem - it's just that they wouldn't think to apply it here, because they aren't thinking of the status of some claim as being a probability.

1TAG6dWhat were the major problems with JTB before Gettier? There were problems with equating knowledge with certainty...but then pretty much everyone moved to fallibilism. Without abandoning JTB. So JTB and probabilism, broadly defined, aren't incompatible. There's nothing about justification, or truth, or belief that can't come in degrees. And regarding all three of them as non-binary is a richer model than just regarding belief as non-binary.
Learning Russian Roulette

We can make this point even more extreme by playing a game like the "unexpected hanging paradox," where surprising the prisoner most of the time is only even possible if you pay for it in the coin of not surprising them at all some of the time.

Learning Russian Roulette

I see a lot of object-level discussion (I agree with the modal comment) but not much meta.

I am probably the right person to stress that "our current theories of anthropics," here on LW, are not found in a Bostrom paper.

Our "current theory of anthropics" around these parts (chews on stalk of grass) is simply to start with a third-person model of the world and then condition on your own existence (no need for self-blinding or weird math, just condition on all your information as per normal). The differences in possible world-models and self-information subsu... (read more)

1Bunthut6dTo clarify, do you think I was wrong to say UDT would play the game? I've read the two posts you linked. I think I understand Wei's, and I think the UDT described there would play. I don't quite understand yours.
How do scaling laws work for fine-tuning?

I am frankly skeptical that this (section 3.9 in the pretrained frozen transformer paper) will hold up to Grad Student Descent on training parameters. But hey, maybe I'm wrong and there's some nice property of the pretrained weights that can only be pushed into overfitting by finetuning.

How do scaling laws work for fine-tuning?

Sure, but if you're training on less data it's because fewer parameters is worse :P

3Daniel Kokotajlo9dNot according to this paper! They were able to get performance comparable to full-size networks, it seems. IDK.
Covid 4/1: Vaccine Passports

Just listened to a great podcast with Sean Carroll and Zeynep Tufekci. Anyone have thoughts about putting Ms. Tufekci in charge of large organizations?

How do scaling laws work for fine-tuning?

I'm not sure how your reply relates to my guess, so I'm a little worried.

If you're intending the compute comment to be in opposition to my first paragraph, then no - when finetuning a subset of the parameters, compute is not simply proportional to the size of the subset you're finetuning, because you still have to do all the matrix multiplications of the original model, both for inference and gradient propagation. I think the paper only finetuned a subset to make a scientific point, not to save compute.

My edit question was just because you ... (read more)

2Daniel Kokotajlo9dI totally agree that you still have to do all the matrix multiplications of the original model etc. etc. I'm saying that you'll need to do them fewer times, because you'll be training on less data. Each step costs, say, 6*N flop where N is parameter count. And then you do D steps, where D is how many data points you train on. So total flop cost is 6*N*D. When you fine-tune, you still spend 6*N for each data point, but you only need to train on 0.001D data points, at least according to the scaling laws, at least according to the orthodox interpretation around here. I'd recommend reading Ajeya's report (found here) [https://www.alignmentforum.org/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines] for more on the scaling laws. There's also this comment thread. [https://www.alignmentforum.org/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines]
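To put rough numbers on the 6*N*D accounting in this thread (the model size and token counts below are illustrative, not taken from any of the cited papers):

```python
# Rough 6*N*D accounting, with illustrative numbers (not from any cited paper).
FLOPS_PER_PARAM_PER_TOKEN = 6     # forward + backward pass rule of thumb

N = 175e9                         # parameter count
D_pretrain = 300e9                # tokens seen in pretraining
D_finetune = 0.001 * D_pretrain   # "0.001D" fine-tuning data, per the thread

pretrain_flops = FLOPS_PER_PARAM_PER_TOKEN * N * D_pretrain
finetune_flops = FLOPS_PER_PARAM_PER_TOKEN * N * D_finetune

# Each fine-tuning step still pays the full 6*N (you run the whole network),
# but there are ~1000x fewer steps, so total compute drops ~1000x.
print(f"pretraining: {pretrain_flops:.1e} FLOPs")   # ~3.2e23
print(f"fine-tuning: {finetune_flops:.1e} FLOPs")   # ~3.2e20
```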
How do scaling laws work for fine-tuning?

I think it's plausible that the data dependence will act like it's 3 OOM smaller. Compute dependence will be different, though, right? Even if you're just finetuning part of the model you have to run the whole thing to do evaluation. In a sense this actually seems like the worst of both worlds (but you get the benefit from pretraining).

Edit: Actually, I'm confused why you say a smaller model needs that factor fewer steps. I thought the slope on that one was actually quite gentle. It's just that smaller models are cheap - or am I getting it wrong?

2Daniel Kokotajlo9dI think compute cost equals data x parameters, so even if parameters are the same, if data is 3 OOM smaller, then compute cost will be 3 OOM smaller. I'm not sure I understand your edit question. I'm referring to the scaling laws as discussed and interpreted by Ajeya. Perhaps part of what's going on is that in the sizes of model we've explored so far, bigger models only need a little bit more data, because bigger models are more data-efficient. But very soon it is prophesied that this will stop and we will transition to a slower scaling law according to which we need to increase data by almost as much as we increase parameter count. So that's the relevant one I'm thinking about when thinking about TAI/AGI/etc.
Charlie Steiner's Shortform

Will the problem of logical counterfactuals just solve itself with good model-building capabilities? Suppose an agent has knowledge of its own source code, and wants to ask the question "What happens if I take action X?" where its source code provably does not actually do X.

A naive agent might notice the contradiction and decide that "What happens if I take action X?" is a bad question, or a question where any answer is true, or a question where we have to condition on cosmic rays hitting transistors at just the right time. But we want a sophisticated ag... (read more)

"Infra-Bayesianism with Vanessa Kosoy" – Watch/Discuss Party

I'm still trying to wrap my head around how the update rule deals with hypotheses (a-measures) that have very low expected utility. In order for them to eventually stop dominating calculations, presumably their affine term has to get lifted as evidence goes against them?

Edit: I guess I'm real confused about the function called "g" in basic inframeasure theory. I think that compactness (mentioned... somewhere) forces different hypotheses to be different within some finite time. But I don't understand the motivations for different g.

4Diffractor5dAh. So, low expected utility alone isn't too much of a problem. The amount of weight a hypothesis has in a prior after updating depends on the gap between the best-case values and worst-case values. Ie, "how much does it matter what happens here". So, the stuff that withers in the prior as you update are the hypotheses that are like "what happens now has negligible impact on improving the worst-case". So, hypotheses that are like "you are screwed no matter what" just drop out completely, as if it doesn't matter what you do, you might as well pick actions that optimize the other hypotheses that aren't quite so despondent about the world. In particular, if all the probability distributions in a set are like "this thing that just happened was improbable", the hypothesis takes a big hit in the posterior, as all the a-measures are like "ok, we're in a low-measure situation now, what happens after this point has negligible impact on utility". I still need to better understand how updating affects hypotheses which are a big set of probability distributions so there's always one probability distribution that's like "I correctly called it!". The motivations for different g are: If g is your actual utility function, then updating with g as your off-event utility function grants you dynamic consistency. Past-you never regrets turning over the reins to future you, and you act just as UDT would. If g is the constant-1 function, then that corresponds to updates where you don't care at all what happens off-history (the closest thing to normal updates), and both the "diagonalize against knowing your own action" behavior in decision theory and the Nirvana trick pops out for free from using this update.
My AGI Threat Model: Misaligned Model-Based RL Agent

So, you can do this thing with optical microscopy where you basically scan a laser beam over the sample to illuminate one pixel at a time. This lets you beat the diffraction limit on your aperture. So I'm sure people who build these systems like having small lasers. Another take on the same idea is to use a UV laser (different meaning of "smaller" there) and collect fluoresced rather than reflected light.

2Steven Byrnes20dThis is fun but very off-topic. I'll reply by DM. :-P
My AGI Threat Model: Misaligned Model-Based RL Agent

invent a smaller laser

Ah, I see you've been reading about super-resolution microscopy :P

2Steven Byrnes20dThat's not a real example. :-) I can't imagine why a microscope designer would want a smaller laser. My first draft actually had an example that would be plausibly useful for microscope designers ("invent a faster and more accurate laser galvo" or something) but then I figured, no one cares, and everyone would say "WTF does "galvo" mean?" :-P
Preferences and biases, the information argument

But is that true? Human behavior has a lot of information. We normally say that this extra information is irrelevant to the human's beliefs and preferences (i.e. the agential model of humans is a simplification), but it's still there.

2Stuart_Armstrong22dLook at the paper linked for more details ( https://arxiv.org/abs/1712.05812 [https://arxiv.org/abs/1712.05812] ). Basically "humans are always fully rational and always take the action they want to" is a full explanation of all of human behaviour, that is strictly simpler than any explanation which includes human biases and bounded rationality.
Dark Matters

Yes, 60% that the LHC would find a dark matter candidate. Anyhow, maybe the takeaway is that he does (and cosmologists in general do) have lots of evidence.

Dark Matters

Postulating that it must be there somewhere, and physics doesn't need to make it easy, isn't properly updating against the theory as each successive most likely but still falsifiable guess has been falsified.

Most physicists actually have updated - if you listen to Sean Carroll's podcast, he just this week talked about how when the LHC started up he thought there was about a 60% chance of finding a dark matter candidate, and that he's updated his views in light of our failure to find it. But he also explained that he still thinks dark matter is overwhelmingly likely (because of evidence like that explained in the post).
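As a rough illustration of how those two statements can be consistent (the 60% is Carroll's stated number; the prior below is an assumption purely for illustration):

```python
# Illustrative Bayes update: the 60% comes from the comment above; the prior
# is an assumed number, just to show the shape of the update.
p_dm_prior = 0.99            # assumed prior that dark matter exists
p_find_given_dm = 0.60       # chance the LHC finds a candidate, if DM exists
p_find_given_no_dm = 0.0     # no dark matter -> no candidate to find

# The LHC did not find a candidate:
prior_odds = p_dm_prior / (1 - p_dm_prior)
likelihood_ratio = (1 - p_find_given_dm) / (1 - p_find_given_no_dm)  # = 0.4
posterior_odds = prior_odds * likelihood_ratio
p_dm_posterior = posterior_odds / (1 + posterior_odds)

# A genuine update downward, yet dark matter can stay "overwhelmingly likely"
# if the prior evidence (rotation curves, lensing, CMB, ...) was strong.
print(round(p_dm_posterior, 3))   # ~0.975
```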

2Davidmanheim1moThat's good to hear. But if "he started at 60%," that seems to mean if he "still thinks dark matter is overwhelmingly likely" he is updating in the wrong direction. (Perhaps he thought it was 60% likely that the LHC found dark matter? In which case I still think that he should update away from "overwhelmingly likely" - it's weak evidence against the hypothesis, but unless he started out almost certain, "overwhelmingly" seems to go a bit too far.)
HCH Speculation Post #2A

Sure, but the interesting thing to me isn't fixed points in the input/output map, it's properties (i.e. attractors that are allowed to be large sets) that propagate from the answers seen by a human in response to their queries, into their output.

Even if there's a fixed point, you have to further prove that this fixed point is consistent - that it's actually the answer to some askable question. I feel like this is sort of analogous to Hofstadter's q-sequence.

2Donald Hobson1moIn the giant lookup table space, HCH must converge to a cycle, although that convergence can be really slow. I think you have convergence to a stationary distribution if each layer is trained on a random mix of several previous layers. Of course, you can still have oscillations in what is said within a policy fixed point.
HCH Speculation Post #2A

Yeah, I agree with this. It's certainly possible to see normal human passage through time as a process with probable attractors. I think the biggest differences are that HCH is a psychological "monoculture," HCH has tiny bottlenecks through which to pass messages compared to the information I can pass to my future self, and there's some presumption that the output will be "an answer" whereas I have no such demands on the brain-state I pass to tomorrow.

If we imagine actual human imitations I think all of these problems have fairly obvious solutions, but I t... (read more)

2Vanessa Kosoy1mo[EDIT: After thinking about this some more, I realized that malign AI leakage is a bigger problem than I thought when writing the parent comment, because the way I imagined it can be overcome doesn't work that well [https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=uL44tqHPCetnaKDTe] .] I don't think that last one is a real constraint. What counts as "an answer" is entirely a matter of interpretation by the participants in the HCH. For example, initially I can ask the question "what are the most useful thoughts about AI alignment I can come up with during 1,000,000 iterations?". When I am tasked to answer the question "what are the most useful thoughts about AI alignment I can come up with during N iterations?" then * If N=1, I will just spend my allotted time thinking about AI alignment and write whatever I came up with in the end. * If N>1, I will ask "what are the most useful thoughts about AI alignment I can come up with during N−1 iterations?". Then, I will study the answer and use the remaining time to improve on it to the best of my ability. An iteration of 2 weeks might be too short to learn the previous results, but we can work in longer iterations. Certainly, having to learn the previous results from text carries overhead compared to just remembering myself developing them (and having developed some illegible intuitions in the process), but only that much overhead. As to "monoculture", we can do HCH with multiple people (either the AI learns to simulate the entire system of multiple people or we use some rigid interface e.g. posting on a forum). For example, we can imagine putting the entire AI X-safety community there. But, we certainly don't want to put the entire world in there, since that way malign AI would probably leak into the system. Yes: it shows how to achieve reliable imitation (although for now in a theoretical model that's not feasible to implement), and the same idea should be applicab
Dark Matters

I'm quite curious about whether RelMOND matches the CMB spectrum nearly as well as Lambda-CDM (which has what, 3 free parameters for dark matter and dark energy?), and how much work they had to do to get it to agree. Like, if all you care about is galaxy rotation curves, it's easy to say that dark-matter-theorists keep changing the amount of dark matter they say is in galaxies to match observations (while, symmetrically, baryonic-matter-theorists keep changing the amount of non-visible baryonic matter they say is in galaxies to match observations). But the CMB is significantly more tightly constrained.

1Diffractor1moI found this Quanta magazine [https://www.quantamagazine.org/modified-gravity-theory-passes-a-critical-test-20200728/] article about it which seems to indicate that it fits the CMB spectrum well but required a fair deal of fiddling with gravity to do so, but I lamentably lack the physics capabilities to evaluate the original paper. [https://arxiv.org/pdf/2007.00082.pdf]
My attempt to find the meaning of existence

Welcome to the posting life on LW :)

I'm not super worried about the meaning of life, but there's been a lot of discussion relevant to the question itself of the meaning of life. My most related post would be Philosophy as Low-energy Approximation, or if you want to go old-school there's a whole bunch of related posts: A Human's Guide to Words, Diseased Thinking, or Conceptual analysis and moral theory.

My Thoughts on the Apperception Engine

I skimmed the paper, and you're right, it seemed like their toy problems were very small and their search process didn't scale particularly well. There seems to be similar prior work on program synthesis that I don't know much about, so I can't really evaluate what the points of progress in the AE paper are.

[Link] Whittlestone et al., The Societal Implications of Deep Reinforcement Learning

I'm reminded of Brian Christian's recent appearance on the 80kh podcast, where he talks up the connections between current and future-oriented AI alignment problems.

Grokking illusionism

I'm not sure if we're using "reductive explanation" the same way then, because if we associate it with the closest thing I think is agreed upon around here, I don't feel like dualists would agree that such a thing truly works.

What I'm thinking of is explanation based on a correspondence between two different models of our experience. Example: I can explain heat by the motion of atoms by showing that atomic theory predicts very similar phenomena to the intuitive model that led to me giving "heat" a special property-label. This is considered progress because... (read more)

1TAG1moIf the bridging laws, that explain how and why mental states arise from physical states, are left unspecified, then the complexity of the explanation cannot be assessed, so Occam's razor doesn't kick in. To put it another way, Occam's razor applies to explanations, so you need to get over the bar of being merely explanatory. What you call being hardcore about Occam's razor seems to mean believing in the simplest possible ((something)), where ((something)) doesn't have to be an explanation. Maxwell's equations are a bad intuitive explanation of reflection flipping, but you can't deny that the intuitive explanation is implicit in Maxwell's equations, because the alternative is that it is a physics-defying miracle. What's the equivalent of Maxwell's equations in the mind-body problem? We can ask, but as far as I know there is no answer. I have never heard of a set of laws that allow novel subjective experience to be predicted from brain states. But are your "inferred" and "would" meant to imply that they don't?
Grokking illusionism

Them merely saying they'll be convinced by a "reductive explanation" is too circular for my tastes. It's like me saying "You could convince me the moon was made of green cheese if you gave me a convincing argument for it." It's not false, but it doesn't actually make any advance commitments about what such an argument might look like.

If someone says they're open to being persuaded "in principle," but has absolutely no idea what evidence could sway them, then my bet is that any such persuasion will have nothing to do with science, little to do with logic, and a lot to do with psychology.

1TAG1moThat's not an analogous analogy, because reductive explanations have an agreed set of features. It's odd to portray reductive explanation as this uselessly mysterious thing, when it is the basis of reductionism, which is an obligatory belief around here.
The case for aligning narrowly superhuman models

I'd say "If an AGI is hung up on these sorts of questions [i.e. the examples I gave of statements human 'moral experts' are going to disagree about], then we've already mostly-won" is an accurate correlation, but doesn't stand up to optimization pressure. We can't mostly-win just by fine-tuning a language model to do moral discourse. I'd guess you agree?

Anyhow, my point was more: You said "you get what you can measure" is a problem because the fact of the matter for whether decisions are good or bad is hard to evaluate (therefore sandwiching is an interest... (read more)

2johnswentworth1moUh... yeah, I agree with that statement, but I don't really see how it's relevant. If we tune a language model to do moral discourse, then won't it be tuned to talk about things like Terry Schiavo, which we just said was not that central? Presumably tuning a language model to talk about those sorts of questions would not make it any good at moral problems like "they said they want fusion power, but they probably also want it to not be turn-into-bomb-able". Or are you using "moral discourse" in a broader sense? I disagree with the exact phrasing "fact of the matter for whether decisions are good or bad"; I'm not supposing there is any "fact of the matter". It's hard enough to figure out, just for one person (e.g. myself), whether a given decision is something I do or do not want. Other than that, this is a good summary, and I generally agree with the-thing-you-describe-me-as-saying and disagree with the-thing-you-describe-yourself-as-saying. I do not think that values-disagreements between humans are a particularly important problem for safe AI; just picking one human at random and aligning the AI to what that person wants would probably result in a reasonably good outcome. At the very least, it would avert essentially-all of the X-risk.
2TurnTrout1moEnglish sentences don't have to hold up to optimization pressure, our AI designs do. If I say "I'm hungry for pizza after I work out", you could say "that doesn't hold up to optimization pressure - I can imagine universes where you're not hungry for pizza", it's like... okay, but that misses the point? There's an implicit notion here of "if you told me that we had built AGI and it got hung up on exotic moral questions, I would expect that we had mostly won." Perhaps this notion isn't obvious to all readers, and maybe it is worth spelling out, but as a writer I do find myself somewhat exhausted by the need to include this kind of disclaimer. Furthermore, what would be optimized in this situation? Is there a dissatisfaction genie that optimizes outcomes against realizations technically permitted by our English sentences? I think it would be more accurate to say "this seems true in the main, although I can imagine situations where it's not." Maybe this is what you meant, in which case I agree.
Animal faces

I doubt I can read a parrot's facial expression. But body language is easy - so easy that it can just seem obvious and invisible. A bird looking at the camera in a relaxed pose, close up? Curious and happy, perhaps. A bird looking at the camera but in a tensed pose? Probably not so much.

Grokking illusionism

Hm, no, I don't think you got what I meant.

One thing I am saying is that I think there's a very strong parallel between not knowing how one could show if a computer program is conscious, and not having any idea how one could change their mind about dualism in response to evidence.

1TAG1moYou seem to be using "a priori" to mean something like "dogmatic and incapable of being updated". But a priori doesn't mean that, and contemporary dualists are capable of saying what they need to change their minds: a reductive explanation of consciousness.
Charlie Steiner's Shortform

Back in the "LW Doldrums" c. 2016, I thought that what we needed was more locations - a welcoming (as opposed to heavily curated a la old AgentFoundations), LW-style forum solely devoted to AI alignment, and then the old LW for the people who wanted to talk about human rationality.

This philosophy can also be seen in the choice to make the AI Alignment Forum a sister site to LW2.0.

However, what actually happened is that we now have non-LW forums for SSC readers who want to talk about politics, SSC readers who want to talk about human rationality, and peo... (read more)

The case for aligning narrowly superhuman models

Hm, interesting, I'm actually worried about a totally different implication of "you get what you can measure."

E.g.:

"If MTurkers are on average anti-abortion and your experts are on average pro-choice, what the hell will your MTurkers think about training an algorithm that tries to learn from anti-abortion folks and output pro-choice responses? Suppose you then run that same algorithm on the experts and it gives outputs in favor of legalizing infanticide - are the humans allowed to say "hold on, I don't want that," or are we just going to accept that as wha... (read more)

I think one argument running through a lot of the sequences is that the parts of "human values" which mostly determine whether AI is great or a disaster are not the sort of things humans usually think of as "moral questions". Like, these examples from your comment below:

Was it bad to pull the plug on Terry Schiavo? How much of your income should you give to charity? Is it okay to kiss your cousin twice removed? Is it a good future if all the humans are destructively copied to computers? Should we run human challenge trials for covid-19 vaccines?

If an AGI i... (read more)

Open Problems with Myopia

(Edited for having an actual point)

You mention some general ways to get non-myopic behavior, but when it comes to myopic behavior you default to a clean, human-comprehensible agent model. I'm curious if you have any thoughts on open avenues related to training procedures that encourage myopia in inner optimizers, even if those inner optimizers are black boxes? I do seem to vaguely recall a post from one of you about this, or maybe it was Richard Ngo.

4evhub1moI think that trying to encourage myopia via behavioral incentives is likely to be extremely difficult, if not impossible (at least without a better understanding of our training processes' inductive biases). Krueger et al.'s “Hidden Incentives for Auto-Induced Distributional Shift [https://arxiv.org/abs/2009.09153]” is a good resource for some of the problems that you run into when you try to do that. As a result, I think that mechanistic incentives are likely to be necessary—and I personally favor some form of relaxed adversarial training [https://www.alignmentforum.org/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment]—but that's going to require us to get a better understanding of what exactly it looks like for an agent to be myopic or not, so we know what the overseer in a setup like that should be looking for.
The case for aligning narrowly superhuman models

Hello fellow Charlie! For half a second I thought I'd written a comment in a fugue state and forgotten it :P

Grokking illusionism

And they can conceivably do all that without feelings. 

Sure, if we mean "conceivable" in the same way that "561 is prime" and "557 is prime" are both conceivable. That is, conceivable in a way that allows for internal contradictions, so long as we haven't figured out where the internal contradictions are yet.

"am I absolutely committed to Cartesian dualism,"

Cartesian dualism is not the only alternative to physicalism.

True, but it's a very convenient central example of a priori dualism, which has no space in its framework for any evidence (either from s... (read more)

1TAG1moYou seem to be saying that an algorithm is necessarily conscious, only we don't know how or why, so there is no contradiction for us, no internal contradiction, in imagining an unconscious algorithm. That's quite a strange thing to say. How do we know that consciousness is necessitated when we don't understand it? Is it necessitated by all algorithms that report consciousness? Do we know that it depends solely on the abstract algorithm and not the substrate? "Dualism wrong" contains little information, and therefore tells you little about the features of non-dualism.
Open Problems with Myopia

You beat me to making this comment :P Except apparently I came here to make this comment about the changed version.

"A human would only approve safe actions" is just a problem clause altogether. I understand how this seems reasonable for sub-human optimizers, but if you (now addressing Mark and Evan) think it has any particular safety properties for superhuman optimization pressure, the particulars of that might be interesting to nail down a bit better.

4Mark Xu1mohas been changed to imitation, as suggested by Evan.
2evhub1moYeah, I agree—the example should probably just be changed to be about an imitative amplification agent or something instead.
Grokking illusionism

This is a good point. But on the other hand, we can be very confident that there are algorithms that exhibit behavior that we would explain, in ourselves, as a consequence of feeling things, and there are "parallel explanations" of the algorithm's behavior and the feelings-based explanations we would normally tell about ourselves.

(It's more of an open question of whether we actually have any of these algorithms running on computers right now. If we're allowed to cherry-pick examples in narrow domains, then there are plausibly some like "this neural network... (read more)

1TAG1moAnd they can conceivably do all that without feelings. The flip side of not being able to explain why an algorithm should feel like anything on the inside is that zombies are conceivable. Models in which mental states figure also make successful predictions ... you can predict ouches from pains. The physical map is not uniquely predictive. Cartesian dualism is not the only alternative to physicalism.
Why Hasn't Effective Altruism Grown Since 2015?

It sounds like you've made a good case for high noise in the data. I was around on the internet in 2010, when arguments about the "global warming pause" were everywhere. And this is triggering the same sort of detectors. Not in the sense of "I have an inside view of comparable strength to global warming," but in the sense of "my model tells me to expect noise at least this big sometimes, so the shape of this graph isn't as informative as it first appears, and we kinda have to wait and see."

The case for aligning narrowly superhuman models

Re: part 1 -

Good points, I agree. Though I think you could broadly replicate the summarization result using supervised learning - the hope for using supervised learning in superhuman domains is that your model learns a dimension of variation for "goodness" that can generalize well even if you condition on "goodness" being slightly outside any of the training examples.

Re: part 2 -

What it boils down to is that my standards (and I think the practical standards) for medical advice are low, while my standards for moral advice are high (as in, you could use this... (read more)

The case for aligning narrowly superhuman models

I (conceptual person) broadly do agree that this is valuable.

It's possible that we won't need this work - that alignment research can develop AI that doesn't benefit from the same sort of work you'd do to get GPT-3 to do tricks on command. But it's also possible that this really would be practice for "the same sort of thing we want to eventually do."

My biggest concern is actually that the problem is going to be too easy for supervised learning. Need GPT-3 to dispense expert medical advice? Fine-tune it on a corpus of expert medical advice! Or for slightly ... (read more)

1Ajeya Cotra1moI don't think you can get away with supervised learning if you're holding yourself to the standard of finding fuzzy tasks where the model is narrowly superhuman. E.g. the Stiennon et al., 2020 paper involved using RL from human feedback: roughly speaking, that's how it was possible for the model to actually improve upon humans rather than simply imitating them. And I think in some cases, the model will be capable of doing better than (some) humans' evaluations, meaning that to "get models to the best they can to help us" we will probably need to do things like decomposition, training models to explain their decisions, tricks to amplify or de-noise human feedback, etc. I don't agree that there's obviously conceptual progress that's necessary for moral advice which is not necessary for medical advice — I'd expect a whole class of tasks to require similar types of techniques, and if there's a dividing line I don't think it is going to be "whether it's related to morality", but "whether it's difficult for the humans doing the evaluation to tell what's going on." To answer your question for both medical and moral advice, I'd say the obvious first thought is RL from human feedback, and the second thought I had to go beyond that is trying to figure out how to get less-capable humans to replicate the training signal produced by more-capable humans, without using any information/expertise from the latter to help the former (the "sandwiching" idea). I'm not sure if it'll work out though.
Multimodal Neurons in Artificial Neural Networks

Their faces and poses images are solid gold. There should be Gravatars, except they're random points in CLIP's face latent space.

Are the Born probabilities really that mysterious?

What Everett says in his thesis is that if the measure is additive between orthogonal states, it's the norm squared. Therefore we should use the norm squared of observers when deciding how to weight their observations.

But this is a weird argument, not at all the usual sort of argument used to pin down probabilities - the archetypal probability arguments rely on things like ignorance and symmetry. Everett just says "Well, if we put a measure on observers that doesn't have weird cross-state interactions, it's the norm squared." But understanding why human... (read more)
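For reference, a standard reconstruction of the additivity argument being described (a paraphrase, not a quote from the thesis):

```latex
% Standard reconstruction of Everett's additivity argument (paraphrase).
% Branch decomposition over orthonormal states:
\[
\psi = \sum_i a_i\,\phi_i , \qquad \langle \phi_i , \phi_j \rangle = \delta_{ij} .
\]
% Assume each branch gets a weight $m(a_i)$ depending only on $|a_i|$, and that
% weights are additive under grouping of branches:
\[
m\!\left( \Big( \sum_i |a_i|^2 \Big)^{1/2} \right) = \sum_i m(a_i) .
\]
% Writing $m(a) = f(|a|^2)$, this says $f\big(\sum_i x_i\big) = \sum_i f(x_i)$ for
% $x_i \ge 0$, which forces $f$ to be linear, hence $m(a) \propto |a|^2$.
```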
