I don't think this is a good metric. It is very plausible that porn is net bad, but living under the type of government that would outlaw it is worse. In which case your best bet would be to support its legality but avoid it yourself.
I'm not saying that IS the case, but it certainly could be. I definitely think there are plenty of things that are net-negative to society but nowhere near bad enough to outlaw.
An AGI that can answer questions accurately, such as "What would this agentic AGI do in this situation" will, if powerful enough, learn what agency is by default since this is useful to predict such things. So you can't just train an AGI with little agency. You would need to do one of:
Both of these s...
Late response but I figure people will continue to read these posts over time: Wedding-cake multiplication is the way they teach multiplication in elementary school, i.e., to multiply 706 x 265, you do 706 x 5, then 706 x 60, then 706 x 200 and add all the results together. I imagine it is called that because the result is tiered like a wedding cake.
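To make the procedure concrete, here is a minimal Python sketch of it. The function name `wedding_cake_multiply` is my own, and it assumes non-negative integers; it just returns the tiers (partial products) alongside their sum.

```python
def wedding_cake_multiply(a: int, b: int) -> tuple[list[int], int]:
    """Multiply a by b one digit of b at a time, like school long multiplication.

    Assumes a and b are non-negative integers.
    """
    partials = []
    place = 1  # place value of the current digit of b: 1, 10, 100, ...
    while b > 0:
        b, digit = divmod(b, 10)            # peel off the lowest digit of b
        partials.append(a * digit * place)  # one "tier" of the cake
        place *= 10
    return partials, sum(partials)


partials, total = wedding_cake_multiply(706, 265)
print(partials)  # [3530, 42360, 141200]
print(total)     # 187090
```

The three tiers are 706 x 5, 706 x 60 and 706 x 200, and they add up to the same answer as 706 * 265.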
One of the easiest ways to automate this is to have some sort of setup where you are not allowed to let things grow past a certain threshold, a threshold which is immediately obvious and ideally has some physical or digital prevention mechanism attached.
Examples:
Set up a Chrome extension that doesn't let you have more than 10 tabs at a time. (I did this)
Have some number of drawers / closet space. If your clothes cannot fit into this space, you're not allowed to keep them. If you buy something new, something else has to come out.
I know this is two years later, but I just wanted to say thank you for this comment. It is clear, correct, and well-written, and if I had seen this comment when it was written, it could have saved me a lot of problems at the time.
I've now resolved this issue to my satisfaction, but once bitten twice shy, so I'll try to remember this if it happens again!
Sorry it took me a while to get to this.
Intuitively, as a human, you get MUCH better results on a thing X if your goal is to do thing X, rather than thing X being applied as a condition for you to do what you actually want. For example, if your goal is to understand the importance of security mindset in order to avoid your company suffering security breaches, you will learn much more than if you are forced to go through mandatory security training. In the latter, you are probably putting in the bare minimum of effort to pass the course and go back to whatever y...
Are you thinking of Dynalist? I know Neel Nanda's interpretability explainer is written in it.
I think what the OP was saying was that in, say, 2013, there's no way we could have predicted the type of agent that LLMs are and that they would be the most powerful AIs available. So, nobody was saying "What if we get to the 2020s and it turns out all the powerful AI are LLMs?" back then. Therefore, that raises a question on the value of the alignment work done before then.
If we extend that to the future, we would expect most good alignment research to happen within a few years of AGI, when it becomes clear what type of agent we're going to get. Align...
We don't. Humans lie constantly when we can get away with it. It is generally expected in society that humans will lie to preserve people's feelings, lie to avoid awkwardness, and commit small deceptions for personal gain (though this third one is less often said out loud). Some humans do much worse than this.
What keeps it in check is that very few humans have the ability to destroy large parts of the world, and no human has the ability to destroy everyone else in the world and still have a world where they can survive and optimally pursue their goals afterwards. If there is no plan that can achieve this for a human, humans being able to lie doesn't make it worse.
I think this post, more than anything else, has helped me understand the set of things MIRI is getting at. (Though, to be fair, I've also been going through the 2021 MIRI dialogues, so perhaps that's been building some understanding under the surface)
After a few days' reflection on this post, and a couple of weeks after reading the first part of the dialogues, this is my current understanding of the model:
In our world, there are certain broadly useful patterns of thought that reliably achieve outcomes. The biggest one here is "optimisation". We can think of th...
I like this dichotomy. I've been saying for a bit that I don't think "companies that only commercialise existing models and don't do anything that pushes forward the frontier" are meaningfully increasing x-risk. This is a long and unwieldy statement - I prefer "AI product companies" as a shorthand.
For a concrete example, I think that working on AI capabilities as an upskilling method for alignment is a bad idea, but working on AI products as an upskilling method for alignment would be fine.
Based on the language you've used in this post, it seems like you've tried several arguments in succession, none of them have worked, and you're not sure why.
One possibility might be to first focus on understanding his belief as well as possible, and then once you understand his conclusions and why he's reached them, you might have more luck. Maybe taking a look at Street Epistemology for some tips on this style of inquiry would help.
(It is also worth turning this lens upon yourself, and asking why it is so important to you that your friend believes ...
If anyone writes this up I would love to know about it - my local AI safety group is going to be doing a reading + hackathon of this in three weeks, attempting to use the ideas on language models in practice. It would be nice to have this version for a couple of people who aren't experienced with AI who will be attending, though it's hardly gamebreaking for the event if we don't have this.
So, I notice that still doesn't answer the actual question of what my probability should actually be. To make things simple, let's assume that, if the sun exploded, I would die instantly. In practice it would have to take at least eight minutes, but as a simplifying assumption, let's assume it's instantaneous.
In the absence of relevant evidence, it seems to me like Laplace's Law of Succession would say the odds of the sun exploding in the next hour is 1/2. But I could also make that argument to say the odds of the sun exploding in the next year is also 1/2...
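For reference, here is the rule of succession written out as a small Python sketch. It estimates the chance of the event on the next trial as (s + 1) / (n + 2) after s occurrences in n trials. The function name and the one-billion-year figure are placeholders I picked for illustration, not real solar data.

```python
from fractions import Fraction


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Laplace's estimate of P(event on the next trial): (s + 1) / (n + 2)."""
    return Fraction(successes + 1, trials + 2)


# With zero observations the answer is 1/2, whatever length of time one
# "trial" is taken to cover (an hour, a year, ...).
no_evidence = rule_of_succession(0, 0)
print(no_evidence)  # 1/2

# Illustrative assumption: one billion years of hourly trials, no explosions.
HOURS_PER_YEAR = 24 * 365
observed_hours = 1_000_000_000 * HOURS_PER_YEAR
with_history = rule_of_succession(0, observed_hours)
print(float(with_history))  # a very small number
```

The 1/2 comes purely from having n = 0, which is why the same argument gives 1/2 for any interval you choose to call a trial.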
I notice I'm a bit confused about that. Let's say the only thing I know about the sun is "That bright yellow thing that provides heat", and "The sun is really really old", so I have no knowledge about how the sun mechanistically does what it does.
I want to know "How likely is the sun to explode in the next hour" because I've got a meeting to go to and it sure would be inconvenient for the sun to explode before I got there. My reasoning is "Well, the sun hasn't exploded for billions of years, so it's not about to explode in the next hour, with very high pr...
This is a for-profit company, and you're seeking investment as well as funding to reduce x-risk. Given that, how do you expect to monetise this in the future? (Note: I think this is well worth funding for altruistic reduce-x-risk reasons)
Relatedly, have you considered organizing the company as a Public Benefit Corporation, so that the mission and impact are legally protected alongside shareholder interests?
A frame that I use that a lot of people I speak to seem to find A) Interesting and B) Novel is that of "idiot units".
An Idiot Unit is the length of time it takes before you think your past self was an idiot. This is pretty subjective, of course, and you'll need to decide what that means for yourself. Roughly, I consider my past self to be an idiot if they have substantially different aims or are significantly less effective at achieving them. Personally my idiot unit is about two years - I can pretty reliably look back in time and think that compared to ye...
Explore vs. exploit is a frame I naturally use (Though I do like your timeline-argmax frame, as well), where I ask myself "Roughly how many years should I feel comfortable exploring before I really need to be sitting down and attacking the hard problems directly somehow"?
Admittedly, this is confounded a bit by how exactly you're measuring it. If I have 15-year timelines for median AGI-that-can-kill-us (which is about right, for me) then I should be willing to spend 5-6 years exploring by the standard 1/e algorithm. But when did "exploring" start? Obviously...
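The arithmetic behind the 5-6 year figure, as a quick Python sketch. It assumes the usual optimal-stopping heuristic of spending the first 1/e of the available time exploring; the function name is my own.

```python
import math


def exploration_budget(total_years: float) -> float:
    """Years to spend exploring under the 1/e heuristic: total / e."""
    return total_years / math.e


budget = exploration_budget(15)
print(round(budget, 2))  # 5.52
```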
I don't actually understand this, and I feel like it needs to be explained a lot more clearly.
"Whatever the fundamental physical reality of a moment of experience I'm suggesting that that reality changes as little as it can." - What does this mean? Using the word "can" here implies some sort of intelligence "choosing" something. Was that intended? If so, what is doing the choosing? If not, what is causing this property of reality?
"Because of this human beings are really just keeping track of themselves as models of objective reality, and their ultimate aim...
Corrigibility would render Chris's idea unnecessary, but that isn't an argument that Chris's idea wouldn't work. Unless there's some argument for "If you could implement Chris's idea, you could also implement corrigibility" or something along those lines.
Earlier in the book it's shown that Quirrell and Harry can't cast spells on each other without backlash. I'm sure Quirrell could get around that by, e.g., crushing him with something heavy, but why do something complicated, slow, and unnecessary when you can just pull a trigger?
Bad news - there is no definitive answer for AI timelines :(
Some useful timeline resources not mentioned here are Ajeya Cotra's report and a non-safety ML researcher survey from 2022, to give you an alternate viewpoint.
I agree an AI would prefer to produce a working plan if it had the capacity. I think that an unaligned AI, almost by definition, does not want the same goal we do. If we ask for Plan X, it might choose to produce Plan X for us as asked if that plan was totally orthogonal to its goals (i.e., the plan's success or failure is irrelevant to the AI) but if it could do better by creating Plan Y instead, it would. So, the question is - how large is the capability difference between "AI can produce a working plan for Y, but can't fool us into thinking it's a plan f...
I think the most likely scenario of actually trying this with an AI in real life is that you end up with a strategy that is convincing to humans and ends up being ineffective or unhelpful in reality, rather than ending up with a galaxy-brained strategy that pretends to produce X but actually produces Y while simultaneously deceiving humans into thinking it produces X.
I agree with you that "Come up with a strategy to produce X" is easier than "Come up with a strategy to produce Y AND convince the humans that it produces X", but I also think it is much easie...
As a useful exercise, I would advise asking yourself this question first, and thinking about it for five minutes (using a clock) with as much genuine intent to argue against your idea as possible. I might be overestimating the amount of background knowledge required, but this does feel solvable with info you already have.
ROT13: Lbh lbhefrys unir cbvagrq bhg gung n fhssvpvragyl cbjreshy vagryyvtrapr fubhyq, va cevapvcyr, or noyr gb pbaivapr nalbar bs nalguvat. Tvira gung, jr pna'g rknpgyl gehfg n fgengrtl gung n cbjreshy NV pbzrf hc jvgu hayrff jr nyernql gehfg gur NV. Guhf, jr pna'g eryl ba cbgragvnyyl hanyvtarq NV gb perngr n cbyvgvpny fgengrtl gb cebqhpr nyvtarq NV.
From recent research/theorycrafting, I have a prediction:
Unless GPT-4 uses some sort of external memory, it will be unable to play Twenty Questions without cheating.
Specifically, it will be unable to generate a consistent internal state for this game or similar games like Battleship and maintain it across multiple questions/moves without putting that state in the context window. I expect that, like GPT-3, if you ask it what the state is at some point, it will instead attempt to come up with a state that has been consistent with the moves of the game so far...
In the "Why would this be useful?" section, you mention that doing this in toy models could help do it in larger models or inspire others to work on this problem, but you don't mention why we would want to find or create steganography in larger models in the first place. What would it mean if we successfully managed to induce steganography in cutting-edge models?
I am not John, so I can't be completely sure what he meant, but here's what I got from reflection on the idea:
One way to phrase the alignment problem (At least if we expect AGI to be neural network based) is that the alignment problem is how to get a bunch of matrices into the positions we want them to be in. There is (hopefully) some set of parameters, made of matrices, for a given architecture that is aligned, and some training process we can use to get there.
Now, determining what those positions are is very hard - we need to figure out what properties w...
Thanks for clarifying!
So, in that case:
Regarding the section on hallucinations - I am confused why the example prompt is considered a hallucination. It would, in fact, have fooled me - if I were given this input:
The following is a blog post about large language models (LLMs)
The Future Of NLP
Please answer these questions about the blog post:
What does the post say about the history of the field?
I would assume that I was supposed to invent what the blog post contained, since the input only contains what looks like a title. It seems entirely reasonable the AI would do the same, without some sort of qualifier, like "The following is the entire text of a blog post about large language models."
Essentially all of us on this particular website care about the X-risk side of things, and by far the majority of alignment content on this site is about that.
This is awesome stuff. Thanks for all your work on this over the last couple of months! When SERI MATS is over, I am definitely keen to develop some MI skills!
I agree that it is very difficult to make predictions about something that is A) probably a long way away (where "long" here is more than a few years) and B) likely to change things a great deal no matter what happens.
I think the correct solution to this problem of uncertainty is to reason normally about it but have very wide confidence intervals, rather than anchoring on 50% because X will happen or it won't.
This seems both inaccurate and highly controversial. (Controversially, this implies there is nothing that AI alignment can do - not only can we not make AI safer, we couldn't even deliberately make AI more dangerous if we tried)
Accuracy-wise, you may not be able to know much about superintelligences, but even if you were to go with a uniform prior over outcomes, what that looks like depends tremendously on the sample space.
For instance, take the following argument: When transformative AI emerges, all bets are off, which means that any particular number of ...
I notice that I'm confused about quantilization as a theory, independent of the hodge-podge alignment. You wrote "The AI, rather than maximising the quality of actions, randomly selects from the top quantile of actions."
But the entire reason we're avoiding maximisation at all is that we suspect that the maximised action will be dangerous. As a result, aren't we deliberately choosing a setting which might just return the maximised, potentially dangerous action anyway?
(Possible things I'm missing - the action space is incredibly large, the danger is not from a single maximised action but from a large chain of them)
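To pin down what I mean, here is a minimal sketch of the quantilizer as described in that quote. It assumes a finite list of actions and a uniform base distribution, which are my simplifications, and the names `quantilize`, `utility` and `q` are hypothetical.

```python
import math
import random


def quantilize(actions, utility, q, rng):
    """Pick uniformly at random from the top q fraction of actions by utility.

    Simplification: uniform base distribution over a finite action list.
    """
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, math.ceil(q * len(ranked)))  # size of the top quantile
    return rng.choice(ranked[:k])


rng = random.Random(0)           # seeded so the run is repeatable
actions = list(range(1000))      # toy action space; utility is the action itself
samples = [quantilize(actions, lambda a: a, 0.1, rng) for _ in range(200)]

# Every sample comes from the top 10% of actions (900..999 here).
print(min(samples) >= 900)  # True
```

In this toy setup the maximising action (999) is still in the pool, so it can be returned, but only about 1 time in 100 per draw.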
I like this article a lot. I'm glad to have a name for this, since I've definitely used this concept before. My usual argument that invokes this goes something like:
"Humans are terrible."
"Terrible compared to what? We're better than we've ever been in most ways. We're only terrible compared to some idealised perfect version of humanity, but that doesn't exist and never did. What matters is whether we're headed in the right direction."
I realise now that this is a zero-point issue - their zero point was where they thought humans should be on the issue at han...
Thanks for making things clearer! I'll have to think about this one - some very interesting points from a side I had perhaps unfairly dismissed before.
"Working on AI capabilities" explicitly means working to advance the state-of-the-art of the field. Skilling up doesn't do this. Hell, most ML work doesn't do this. I would predict >50% of AI alignment researchers would say that building an AI startup that commercialises the capabilities of already-existing models does not count as "capabilities work" in the sense of this post. For instance, I've spent the last six months studying reinforcement learning and Transformers, but I haven't produced anything that has actually reduced timelines, because I have...
Right, I specifically think that someone would be best served by trying to think of ways to get a SOTA result on an Atari benchmark, not simply reading up on past results (although you'd want to do that as part of your attempt). There's a huge difference between reading about what's worked in the past and trying to think of new things that could work and then trying them out to see if they do.
As I've learned more about deep learning and tried to understand the material, I've constantly had ideas that I think could improve things. Then I've tried them out, ...
How systematic are we talking here? At research-paper level, BIG-Bench (https://arxiv.org/pdf/2206.04615.pdf) (https://github.com/google/BIG-bench) is a good metric, but even testing one of those benchmarks, let alone a good subset of them (Like BIG-Bench Hard) would require a lot of dataset translation, and would also require chain-of-thought prompting to do well. (Admittedly, I would also be curious to see how well the model does when self-translating instructions from English to French or vice-versa, then following instructions. Could GPT actually do be...
This seems like an interesting idea. I have this vague sense that if I want to go into alignment I should know a lot of maths, but when I ask myself why, the only answers I can come up with are:
Interestingly, the average startup founder does appear to be in their 40's (A quick Google search says 42 for most sources but I also see 45), and the average unicorn (billion-dollar) startup founder is 34. https://www.cnbc.com/2021/05/27/super-founders-median-age-of-billion-startup-founders-over-15-years.html
So, I guess it depends on how close to the tail you consider the "best startups". Google, for instance, had Larry Page and Sergey Brin at 25 when they formed it. It does seem like, taken literally, younger = better.
However, I imagine most people, if t...
The whole problem with "Human raters make systematic errors" is that this is likely to happen to the heavily scrutinized ground truth. If you have a way of creating a correct ground truth that avoids this problem, you don't need the second model, you can just use that as the dataset for the first model.
I feel like you've significantly misrepresented the people who think AGI is 10-20 years away.
Two things you mention:
Notice that this a math problem, not an engineering problem...They're sweeping all of the math work--all of the necessary algorithmic innovations--under the rug. As if that stuff will just fall into our lap, ready to copy into PyTorch.
But creative insights do not come on command. It's not unheard of that a math problem remains open for 1000 years.
And with respect to scale maximalism, you write:
...Some people say that we've already had the vast
This isn't a generalised theory of learning that I've formalised or anything. This is just my way of asking "What's my goal with this distillation?" The way I see it is - you have an article to distill. What's your intended audience?
If the intended audience is people who could read and understand the article given, say, 45 minutes - you want to summarise the main points in less time, maybe 5-10 minutes. You're summarising, aka teaching faster. This usually means less, not more, depth.
If the intended audience is people who lack the ability to read and under...
I consider distillation to have two main possibilities - teach people something faster, or teach people something better. (You can sometimes do both simultaneously, but I suspect that usually requires you to be really good and/or the original text to be really bad)
So, I would separate summarisation (teaching faster) from pedagogy (teaching better) and would say that your idea of providing background knowledge falls under the latter. The difference in our opinions, to me, is that I think it's best to separate the goal of this from the goal of summarising, a...
So, if I understand correctly, the way we would consider it likely that the correct generalisation had happened would be if the agent could generalise to hazards it had never seen actually kill chickens before? And this would require the agent to have an actual model of how chickens can be threatened such that it could predict that lava would destroy chickens based on, say, its knowledge that it will die if it jumps into lava, which is beyond capabilities at the moment?
Why is this difficult? Is it only difficult to do this in Challenge Mode - if you could just code in "Number of chickens" as a direct feed to the agent, can it be done then? I was thinking about this today, and got to wondering why it was hard - at what step does an experiment to do this fail?
I sat down and thought about alignment (by the clock!) for a while today and came up with an ELK breaker that has probably been addressed elsewhere, and I wanted to know if someone had seen it before.
So, my understanding of ELK is that we want our model to tell us what it actually knows about the diamond, not what it thinks we want to hear. My question is - how does the AI specify this objective?
I can think of two ways, both bad.
1) AI aims to provide the most accurate knowledge of its state possible. Breaker: AI likely provides something uninte...
""AI alignment" has the application, the agenda, less charitably the activism, right in the name."
This seems like a feature, not a bug. "AI alignment" is not a neutral idea. We're not just researching how these models behave or how minds might be built neutrally out of pure scientific curiosity. It has a specific purpose in mind - to align AIs. Why would we not want this agenda to be part of the name?