We should expect regression towards the mean only if the tasks were selected for having high "improvement from small to Gopher-7". Were they?
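A minimal sketch of the selection effect in question, under loudly stated assumptions: each task's measured improvement is modeled as a hypothetical "true gain" plus independent Gaussian noise, and the names `simulate`, `true_gain`, `first`, and `second` are illustrative only. None of this uses the actual benchmark data. It only shows that a second measurement falls back toward the mean when tasks were picked for a high first measurement, and not otherwise.

```python
import random
from statistics import mean

def simulate(n_tasks=10_000, top_fraction=0.1, seed=0):
    """Compare two noisy measurements of the same per-task gain."""
    rng = random.Random(seed)
    # Hypothetical true improvement per task (assumption: standard normal).
    true_gain = [rng.gauss(0.0, 1.0) for _ in range(n_tasks)]
    # Two independent noisy measurements of that same true improvement.
    first = [g + rng.gauss(0.0, 1.0) for g in true_gain]
    second = [g + rng.gauss(0.0, 1.0) for g in true_gain]
    # Select the tasks with the highest *first* measurement.
    k = int(n_tasks * top_fraction)
    selected = sorted(range(n_tasks), key=lambda i: first[i], reverse=True)[:k]
    return {
        "selected_first": mean(first[i] for i in selected),
        "selected_second": mean(second[i] for i in selected),
        "all_first": mean(first),
        "all_second": mean(second),
    }

result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this toy model the selected tasks score noticeably lower on the second measurement than on the first, while the averages over all tasks barely move. So the question of whether the tasks were selected on the earlier improvement is exactly what decides whether regression should be expected.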
The reasoning was given in the comment prior to it, that we want fast progress in order to get to immortality sooner.
"But yeah, I wish this hadn't happened."
Who else is gonna write the article? My sense is that no one (including me) is starkly stating publicly the seriousness of the situation.
"Yudkowsky is obnoxious, arrogant, and most importantly, disliked, so the more he intertwines himself with the idea of AI x-risk in the public imagination, the less likely it is that the public will take those ideas seriously"
I'm worried about people making character attacks on Yudkowsky (or other alignment researchers) like this. I think the people who think they can ...
"We finally managed to solve the problem of deceptive alignment while being capabilities competitive"
??????
"But I don't think you even need Eliezer-levels-of-P(doom) to think the situation warrants that sort of treatment."
Agreed. If a new state develops nuclear weapons, this isn't even close to creating a 10% x-risk, yet the idea of airstrikes on nuclear enrichment facilities, even though it is very controversial, has for a long time very much been an option on the table.
"if I thought the chance of doom was 1% I'd say "full speed ahead!"
This is not a reasonable view. Not on Longtermism, nor on mainstream common sense ethics. This is the view of someone willing to take unacceptable risks for the whole of humanity.
Also, there is a big difference between "Calling for violence", and "calling for the establishment of an international treaty, which is to be enforced by violence if necessary". I don't understand why so many people are muddling this distinction.
You are muddling the meaning of "pre-emptive war", or even "war". I'm not trying to diminish the gravity of Yudkowsky's proposal, but a missile strike on a specific compound known to contain WMD-developing technology is not a "pre-emptive war" or "war". Again I'm not trying to diminish the gravity, but this seems like an incorrect use of the term.
"For instance, personally I think the reason so few people take AI alignment seriously is that we haven't actually seen anything all that scary yet. "
And if this "actually scary" thing happens, people will know that Yudkowsky wrote the article beforehand, and they will know who the people are that mocked it.
I agree. Though is it just the limited context window that causes the effect? I may be mistaken, but from my memory it seems like they emerge sooner than you would expect if this were the only reason (given the size of the context window of GPT-3).
Therefore, the waluigi eigen-simulacra are attractor states of the LLM
It seems to me like this informal argument is a bit suspect. Actually, I think this argument would not apply to Solomonoff induction.
Suppose we have two programs that define distributions over bitstrings. Suppose p1 assigns uniform probability to each bitstring, while p2 assigns 100% probability to the string of all zeroes. (Equivalently, p1 samples each bit i.i.d. from a Bernoulli(1/2) distribution over {0,1}, while p2 samples 0 i.i.d. with probability 100%.)
Suppose we use a perfect Bayesian reasoner to sample bitstrings, bu...
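The setup with p1 and p2 can be sketched as an exact Bayesian mixture. This is only an illustration of the two-program setup described above, not a reconstruction of the rest of the (truncated) argument. The equal prior of 1/2 on each program and the function names `posterior` and `prob_next_zero` are assumptions made for the example.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def posterior(bits, prior_p2=HALF):
    """Posterior weights (w1, w2) on p1 and p2 after observing `bits`.

    p1: every bit is an independent fair coin flip.
    p2: every bit is 0 with probability 1.
    The prior on p2 is an assumption (1/2 by default).
    """
    prior_p1 = 1 - prior_p2
    like_p1 = HALF ** len(bits)
    # p2 assigns probability 0 to any string containing a 1.
    like_p2 = Fraction(1) if all(b == "0" for b in bits) else Fraction(0)
    evidence = prior_p1 * like_p1 + prior_p2 * like_p2
    return prior_p1 * like_p1 / evidence, prior_p2 * like_p2 / evidence

def prob_next_zero(bits, prior_p2=HALF):
    """Predictive probability that the next bit is 0 under the mixture."""
    w1, w2 = posterior(bits, prior_p2)
    return w1 * HALF + w2 * 1

print(posterior("000"))       # weights after three zeroes
print(posterior("01"))        # a single 1 rules out p2 entirely
print(prob_next_zero("000"))
```

Two features of the setup are visible here: every observed 0 shifts weight toward p2, and a single observed 1 sets the weight on p2 to exactly zero, after which no further observation can restore it.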
Linking to my post about Dutch TV: https://www.lesswrong.com/posts/TMXEDZy2FNr5neP4L/datapoint-median-10-ai-x-risk-mentioned-on-dutch-public-tv
A lot of people in AI Alignment I've talked to have found it pretty hard to have clear thoughts in the current social environment, and many of them have reported that getting out of Berkeley, or getting social distance from the core of the community has made them produce better thoughts.
What do you think is the mechanism behind this?
I think the biggest thing is a strong, high-stakes but still quite ambiguous status-hierarchy in the Bay Area.
I think there are lots of contributors to this, but I definitely feel a huge sense of needing to adopt certain views, to display "good judgement", and to conform to a bunch of epistemic and moral positions in order to operate in the space. This is particularly harsh since the fall of FTX: funding is less abundant, a lot of projects are more in peril, and the stakes of being perceived as reasonable and competent by a very messy and in substantial part social process are even higher.
There is a general phenomenon where:
It seems to me quite likely that you are person B, thinking they explained something because THEY think their explanation is very good and contai...
Very late reply, sorry.
"even though reward is not a kind of objective", this is a terminological issue. In my view, calling an "antecedent-computation reinforcement criterion" an "objective" matches my definition of "objective", and this is just a matter of terminology. The term "objective" is ill-defined enough that "even though reward is not a kind of objective" is a terminological claim about objective, not a claim about math/the world.
The idea that RL agents "reinforce antecedent computations" is completely core to our story of deception. You could not ...
The core point in this post is obviously correct, and yes people's thinking is muddled if they don't take this into account. This point is core to the Risks from learned optimization paper (so it's not exactly new, but it's good if it's explained in different/better ways).
Maybe you have made a gestalt-switch I haven't made yet, or maybe yours is a better way to communicate the same thing, but: the way I think of it is that the reward function is just a function from states to numbers, and the way the information contained in the reward function affects the model parameters is via reinforcement of pre-existing computations.
Is there a difference between saying:
It seems to me that the basic conceptual point made in this post is entirely contained in our Risks from Learned Optimization paper. I might just be missing a point. You've certainly phrased things differently and made some specific points that we didn't, but am I just misunderstanding something if I think the basic conceptual claims of this post (which seems to be presented as new) are implied by RFLO? If not, could you state briefly what is different?
(Note I am still surprised sometimes that people still think certain wireheading scenarios make sense despite them having read RFLO, so it's plausible to me that we really didn't communicate everything that's in my head about this).
"I think in the defense-offense case the actions available to both sides are approximately the same"
If the attacker has the action "cause a 100% lethal global pandemic" and the defender has the task "prevent a 100% lethal global pandemic", then clearly these are different problems, and it is a thesis, a thing to be argued for, that the latter requires largely the same skills/tech as the former (which is what this offense-defense symmetry thesis states).
If you build an OS that you're trying to make safe against attacks, you might do e.g. what the seL4 mic...
Kind of a delayed response, but: Could you clarify what you think is the relation between that post and mine? I think they are somehow sort of related, but not sure what you think the relation is. Are you just trying to say "this is sort of related", or are you trying to say "the strategy stealing assumption and this defense-offense symmetry thesis is the same thing"?
In the latter case: I think they are not the same thing, neither in terms of their actual meaning nor their intended purpose:
I just had a very quick look at that site, and it seems to be a collection of various chip models with pictures of them? Is there actual information on quantities sold, etc? I couldn't find it immediately.
Yeah, I know they don't understand them comprehensively. Is this the point though? I mean they understand them at a level of abstraction necessary to do what they need, and the claim is they have basically the same kind of knowledge of computers. Hmm, I guess that isn't really communicated by my phrasing though, so maybe I should edit that.
I think I communicated unclearly and it's my fault, sorry for that: I shouldn't have used the phrase "any easily specifiable task" for what I meant, because I didn't mean it to include "optimize the entire human lightcone w.r.t. human values". In fact, I was being vague and probably there isn't really a sensible notion that I was trying to point to. However, to clarify what I really was trying to say: What I mean by "hard problem of alignment" is: "develop an AI system that keeps humanity permanently safe from misaligned AI (and maybe other x risks), and ...
I'm surprised if I haven't made this clear yet, but the thing that (from my perspective) seems different between my view and yours is not that Step 1 seems easier to me than it seems to you, but that the "melt the GPUs" strategy (and possibly other pivotal acts one might come up with) seems way harder to me than it seems to you. You don't have to convince me of "'any easily human-specifiable task' is asking for a really mature alignment", because in my model this is basically equivalent to fully solving the hard problem of AI alignment.
Some reasons:
"you" obviously is whoever would be building the AI system that ended up burning all the GPUs (and ensuring no future GPUs are created). I don't know such a sequence of events just as I don't know the sequence of events for building the "burn all GPUs" system, except at the level of granularity of "Step 1. build a superintelligent AI system that can perform basically any easily human-specifiable task without destroying the world. Step 2. make that system burn all GPUs indefinitely/build security services that prevent misaligned AI from destroying the worl...
I wonder if there is a bias induced by writing this on a year-by-year basis, as opposed to some random other time interval, like 2 years. I can somehow imagine that if you take 2 copies of a human, and ask one to do this exercise in yearly intervals, and the other to do it in 2-year intervals, they'll basically tell the same story, but the second one's story takes twice as long. (i.e. the second one's prediction for 2022/2024/2026 are the same as the first one's predictions for 2022/2023/2024). It's probably not that extreme, but I would be surprised if there was zero such effect, which would mean these timelines are biased downwards or upwards.
yeah, I probably overstated. Nevertheless:
"CEV seems way harder to me than ..."
yes, I agree it seems way harder, and I'm assuming we won't need to do it and that we could instead "run CEV" by just actually continuing human society and having humans figure out what they want, etc. It currently seems to me that the end game is to get to an AI security service (in analogy to state security services) that protects the world from misaligned AI, and then let humanity figure out what it wants (CEV). The default is just to do CEV directly by actual human brains, b...
Ok I admit I read over it. I must say though that this makes the whole thing more involved than it sounded at first, since it would maybe require essentially escalating a conflict with all major military powers and still coming out on top? One possible outcome of this would be that the entire global intellectual public opinion turns against you, meaning you also possibly lose access to a lot of additional humans working with you on further alignment research? I'm not sure if I'm imagining it correctly, but it seems like this plan would either require so many elements that I'm not sure if it isn't just equivalent to solving the entire alignment problem, or otherwise it isn't actually enough.
But assuming that law enforcement figures out that you did this, then puts you in jail, you wouldn't be able to control the further use of such nanotech, i.e. there would just be a bunch of systems indefinitely destroying GPUs, or maybe you set a timer or some conditions on it or something. I certainly see no reason why Iceland or anyone in Iceland could get away with this unless those systems rely on completely unchecked nanosystems to which the US military has no response. Maybe all of this is what Eliezer means by "melt the GPUs", but I thought he did...
I meant, is there a link to where you've written this down somewhere? Maybe you just haven't written it down.
I would be interested in reading a draft and giving feedback (FYI I'm currently a researcher in the AI safety team at FHI).
Here is my partial honest reaction, just two points I'm somewhat dissatisfied with (not meant to be exhaustive):
2. "A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure." I would like there to be an argument for this claim that doesn't rely on nanotech, and solidly relies on actually existing amounts of compute. E.g. if the argument relies on running intractable detailed simulations of prote...
"which is to make a truly remarkable universal claim with a heavy burden of proof."
Having thought about this way less than you, it doesn't seem at first sight to me as remarkable as you seem to say. Note that the claim wouldn't be that you can't write a set of prompts to get the fully universal reasoner, but that you can't write a single prompt that gets you this universal reasoner. It doesn't sound so crazy to me at all that knowledge is dispersed in the network in a way that e.g. some knowledge can only be accessed if the prompt has the feel of being gene...
Responding to this very late, but: If I recall correctly, Eric has told me in personal conversation that CAIS is a form of AGI, just not agent-like AGI. I suspect Eric would agree broadly with Richard's definition.
"I talk about consequentialists, but not rational consequentialists", ok this was not the impression I was getting.
Reading this post a while after it was written: I'm not going to respond to the main claim (which seems quite likely) but just to the specific arguments, which seem suspicious to me. Here are some points:
yes, but I think your reasoning "If 2 is only talking about the map, it doesn't imply 3" is too vague. I'd rather not go into it though, because I am currently busy with other things, so I'd suggest letting the reader decide.
Edit: reading back my response, it might come across as a bit rude. If so, sorry for that, I didn't mean it that way.
"But without the premise that the territory is maths, the rest of the paradox doesn't follow."
I explicitly said "mathematically describable", implying I am not identifying the theory with reality. Nothing in my "argument" makes this identification.
If an object knows that it exists, then this implies that it actually exists. Moreover, assuming that the state of a brain is a mathematical fact about the mathematical theory, then that the object knows it exists is in principle a mathematical implication of the mathematical theory (if observation 2 is correct). Hence it would be an implication of the theory that that theory describes an existing reality.
"There may also be mathematical properties that are universe-specific (the best candidates here are natural constants), but the extent to which these exist is questionable"
The exact position of every atom in the universe at time t=10^10 years is a "mathematical property of our universe" in my terminology. The fact that some human somewhere uttered the words "good morning" at some point today, is a complicated mathematical property of our universe, in principle derivable from the fundamental theory of physics.
tangential comment: Regarding "I will define success as producing fission weapons before the end of war in Europe". I'm not sure if this is the right criterion for success for the purpose of analogizing to AGI. It seems to me that "producing fission weapons before an Axis power does" is more appropriate.
And this seems overwhelmingly the case, yes: "theory of atomic bomb was considerably more advanced at the beginning of Manhattan project compared to our understanding of theory of aligned AGI"
I'm not sure I understand the motivation behind the question. How much of my modern knowledge am I supposed to throw away? Note I am not in fact an atomic theorist who has the state of knowledge of atomic theory in 1942 so it's hard to know what I'd think, but I can imagine assigning somewhere between 5% and 95% depending on how informed of an atomic theorist I actually was and what it was actually like in 1942. Maybe I could give a better answer if you clarify the motivation behind the question?
I'm curious who these people are.