Aren’t turned off by perceived arrogance
One hypothesis I've had is that people with more MIRI-like views tend to be more arrogant themselves. A possible mechanism is that the idea that the world is going to end and that they are the only ones who can save it is appealing in a way that shifts their views on certain questions and changes the way they think about AI (e.g. they need less convincing that they are some of the most important people ever, so they spend less time considering why AI might go well by default). [ETA: In case it wasn't clear, I am positing subconscious patterns correlated with arrogance that lead to MIRI-like views.]
How'd this go? Just searched LW for "neurofeedback" since I recently learned about it
That argument makes sense, thanks
We are very likely not going to miss out on alignment by a 2x productivity boost, that’s not how things end up in the real world. We’ll either solve alignment or miss by a factor of >10x.
Why is this true?
the genome can’t directly make us afraid of death
It's not necessarily direct, but in case you aren't aware of it, prepared learning is a relevant phenomenon, since apparently the genome does predispose us to certain fears.
Seems like this guy has already started trying to use GPT-3 in a videogame: GPT3 AI Game Prototype
Not sure if it was clear, but the reason I asked was because it seems like if you think the fraction changes significantly before AGI, then the claim that Thane quotes in the top-level comment wouldn't be true.
Don't timelines change your views on takeoff speeds? If not, what's an example piece of evidence that updates your timelines but not your takeoff speeds?
Same - also interested if John was assuming that the fraction of deployment labor that is automated changes negligibly over time pre-AGI.
Humans can change their action patterns on a dime, inspired by philosophical arguments, convinced by logic, indoctrinated by political or religious rhetoric, or plainly because they're forced to.
I'd add that action patterns can change for reasons other than logical/deliberative ones. For example, adapting to a new culture means you might adopt and have new reactions to objects, gestures, etc that are considered symbolic in that culture.
so the edge (~S,~Q) is terminal
Earlier you said that the blue edges were terminal edges.
What are some of the "various things" you have in mind here? It seems possible to me that something like "AI alignment testing" is straightforwardly upstream of what players want, but maybe you were thinking of something else
"Go with your gut” [...] [is] insensitive to circumstance.
People's guts seem very sensitive to circumstance, especially compared to commitments.
But the capabilities of neural networks are currently advancing much faster than our ability to understand how they work or interpret their cognition;
Naively, you might think that as opacity increases, trust in systems decreases, and hence something like "willingness to deploy" decreases.
How good of an argument does this seem to you against the hypothesis that "capabilities will grow faster than alignment"? I'm viewing the quoted sentence as an argument for the hypothesis.
Some initial thoughts:
I was thinking of the possibility of affecting decision-making, either directly by rising through the ranks (not very likely) or indirectly by being an advocate for safety at an important time and pushing things into the Overton window within an organization.
I imagine Habryka would say that a significant possibility here is that joining an AGI lab will wrongly turn you into an AGI enthusiast. I think biasing effects like that are real, though I also think it's hard to tell in cases like that how much you are biased vs. updating correctly on new information,...
It seems like you are confident that the delta in capabilities would outweigh any delta in general alignment sympathy. Is this what you think?
Attempting to manually specify the nature of goodness is a doomed endeavor, of course, but that's fine, because we can instead specify processes for figuring out (the coherent extrapolation of) what humans value. […] So today's alignment problems are a few steps removed from tricky moral questions, on my models.
I’m not convinced that choosing those processes is significantly non-moral. I might be misunderstanding what you are pointing at, but it feels like the fact that being able to choose the voting system gives you power over the vote’s outcome is evidence of this sort of thing - that meta decisions are still importantly tied to decisions.
I think there should be a word for your parsing, maybe "VNM utilitarianism," but I think most people mean roughly what's on the wiki page for utilitarianism:
Utilitarianism is a family of normative ethical theories that prescribe actions that maximize happiness and well-being for all affected individuals
It's not obvious to me that the counter-examples of the form "expertise, in most fields, is not easier to verify than to generate" are actually counter-examples. For example, for "if you're not a hacker, you can't tell who the good hackers are," it still seems like it would be easier to verify whether a particular hack will work than to come up with it yourself, starting off without any hacking expertise.
Could you clarify a bit more what you mean when you say "X is inaccessible to the human genome?"
My understanding is: Bob's genome didn't have access to Bob's developed world model (WM) when he was born (because his WM wasn't developed yet). Bob's genome can't directly specify "care about your specific family" because it can't hardcode Bob's specific family's visual or auditory features.
This direct-specification wouldn't work anyways because people change looks, Bob could be adopted, or Bob could be born blind & deaf.
[Check, does the Bob example make sense?]
But, the genome does do something indirectly that consistently leads to people valuin...
Ah okay -- I have updated positively in terms of the usefulness based on that description, and have updated positively on the hypothesis "I am missing a lot of important information that contextualizes this project," though still confused. Would be interested to know the causal chain from understanding circuit simplicity to the future being better, but maybe I should just stay posted (or maybe there is a different post I should read that you can link me to; or maybe the impact is diffuse and talking about any particular path doesn't make that much sen...
I didn't finish reading this, but if it were the case that:
then I very plausibly would have finished reading the post or saved it for later.
ETA: For what it's worth, I still upvoted and liked the post, since I think deconfusing ourselves about stuff like this is plausibly very good and at the very least interesting. I just didn't like it enough to finish reading it or save it, because from my perspective its expected usefulness wasn't high enough given the information I had.
I wonder if there are any measurable dimensions along which tasks can vary, and whether that could help with predicting task progress at all. A simple example is the average input size for the benchmark.
I’m glad you posted this — this may be happening to me and now I’ve read about sunken cost faith counterfactually
I don't know how good of a fit you would be, but have you considered applying to Redwood Research?
Ah I see, and just to make sure I'm not going crazy, you've edited the post now to reflect this?
W is a function, right? If so, what’s its type signature?
I agree, though I want to have a good enough understanding of the gears that I can determine whether something like "telling yourself you are awesome everyday" will have counterfactually better outcomes than not doing so. I guess the studies seem to suggest the answer in this case is "yes," inasmuch as the negative externalities of self-delusion are captured by the metrics that the studies in the TED talk use. [ETA: and I feel like now I have nearly answered the question for myself, so thanks for the prompt!]
What’s a motivation stack? Could you give an example?
A partial answer:
These answers still have ambiguity though, in "more than" and in how many Bayes points your anxiety as a predictor of death actually gets.
I'll add that when I asked John Wentworth why he was IDA-bearish, he mentioned the inefficiency of bureaucracies and told me to read the following post to learn why interfaces and coordination are hard: Interfaces as a Scarce Resource.
while in the slow takeoff world your choices about research projects are closely related to your sociological predictions about what things will be obvious to whom when.
I found this comment pretty convincing. Alignment has been compared to philosophy, which seems to be at the opposite end of "the fuzziness spectrum" from math and physics. And it does seem like concept fuzziness would make evaluation harder.
I'll note though that ARC's approach to alignment seems more math-problem-flavored than yours, which might be a source of disagreement between you two (since maybe you conceptualize what it means to work on alignment differently).
MIRI doesn't have good reasons to support the claim of almost certain doom
I recently asked Eliezer why he didn't suspect ELK to be helpful, and it seemed that one of his major reasons was that Paul was "wrongly" excited about IDA. It seems that at this point in time, neither Paul nor Eliezer is excited about IDA, but Eliezer got to the conclusion first. Although, the IDA-bearishness may be for fundamentally different reasons -- I haven't tried to figure that out yet.
Have you been taking this into account re: your ELK bullishness? Obviously, this sort of p...
I think Nate Soares has beliefs about question 1. A few weeks ago, we were discussing a question that seems analogous to me -- "does moral deliberation converge, for different ways of doing moral deliberation? E.g. is there a unique human CEV?" -- and he said he believes the answer is "yes." I didn't get the chance to ask him why, though.
Thinking about it myself for a few minutes, it does feel like all of your examples for how the overseer could have distorted values have a true "wrongness" about them that can be verified against reality -- this makes me feel optimistic that there is a basin of human values, and that "interacting with reality" broadly construed is what draws you in.
An example is an AI making the world as awful as possible, e.g. by creating dolorium. There is a separate question about how likely this is, hopefully very unlikely.
I mean to argue against your meta-strategy which relies on obtaining relevant understanding about deception or alignment as we get larger models and see how they work. I agree that we will obtain some understanding, but it seems like we shouldn't expect that understanding to be very close to sufficient for making AI go well (see my previous argument), and hence not a very promising meta-strategy.
[ETA: I'm not that sure of the below argument]
Thanks for the example, but it still seems to me that this sort of thing won't work for advanced AI. If you are familiar with the ELK report, you should be able to see why. [Spoiler below]
Even if you manage to learn the properties of what looks like deception to humans, and instill those properties into a loss function, then it seems like you are still more likely to get a system that tells you what humans think the truth is, avoiding what humans would be able to notice as deception, rather than telling you wha...
Isn’t the worst case one in which the AI optimizes exactly against human values?
Maybe Carl meant to link this one
it could be that the lack of alignment understanding is an inevitable consequence of our capabilities understanding not being there yet.
Could you say more about this hypothesis? To me, it feels likely that you can get crazy capabilities from a black box that you don't understand and so whose behavior/properties you can't verify to be acceptable. It's not like once we build a deceptive model we will know what deceptive computation looks like and how to disincentivize it (which is one way your nuclear analogy could translate).
It's possible, also, that this i...
One thing is that it seems like they are trying to build some of the world’s largest language models (“state of the art models”)
It seems to me that it would be better to view the question as "is this frame the best one for person X?" rather than "is this frame the best one?"
Though, I haven't fully read either of your posts, so excuse any mistakes/confusion.
Congrats on making an important and correct point without needing to fully read the posts! :) That's just efficiency.
Do you have an example of a set of 1-detail stories you now might tell (composed with “AND”)?
Ah — sorry if I missed that in the post, only skimmed
Random tip: If you want to restrict apps etc on your iPhone but not know the Screen Time pin, I recommend the following simple system which allows you to not know the password but unlock restrictions easily when needed:
Thanks for this list!
Though the list still doesn't strike me as very novel -- it feels that most of these conditions are conditions we've been shooting for anyways.
E.g. conditions 1, 2, and 5 are about selecting for behavior we approve of and condition 5 is just inspection with interpretability tools.
If you feel you have traction on conditions 3 and 4 though, that does seem novel (side-note that condition 4 seems to be a subset of condition 3). I feel skeptical though, since value extrapolation seems like about as hard of a problem as understanding machine...
I (with some help) compiled some of the best rationality essays here.
Ping about my other comment -- FYI, because I am currently concerned that you don't have criteria for the innards in mind, I'm less excited about your agenda than other alignment theory agendas (though this lack of excitement is somewhat weak, e.g. since I haven't tried to digest your work much yet).