Someone with no personal experience of suffering should also be moved by that consideration.
That sounds like a fantastic reason for someone with that experience to post it, as occurred here, as a way to explain what it is like to others. In fact, only the existence of suffering for some concrete individual justifies the abstract conclusion of altruism. Without that concrete level, the abstraction is hypothetical, and should not provide the same level of reason to be altruistic.
Update: we did this, I bought shares, we'll see how it goes.
Extreme, in this context, meant far from the consensus expectation. That implies both "seen as radical" and "involving very high [consensus] confidence [against the belief]." Contra your first paragraph, I think, I claim that this "extremeness" is valid Bayesian evidence for it being false, in the sense that you identify in your third paragraph - it has low prior odds. Given that, I agree that it would be incorrect to double-count the evidence of being extreme. But my claim was that, holding "extremeness" constant, the newness of a claim was in... (read more)
Want to sell me USDC on there in exchange for paypal, so I can bet? (I'll gladly pay a 2% "commission" for, say, $200 in USDC.)
It's a pain to redo, but can someone add Ought embedded predictions to all of these?
https://forecast.elicit.org/binary

(Alternatively/additionally, can they all be on Metaculus?)
Relatedly, and perhaps even more fundamentally, there is the basic discipline of thinking about a system and implementing a mathematical model or simulation to explore these topics, which drove the insights you mention. And in many ways, it's easier to test this without worrying about people gaming the system, because you can give new examples and require them to actually explore the question.
That's fine, but choosing the question set on which you give self-motivated children instant, computer-driven feedback raises the same type of question: what is it that we want the child interested in X to learn?
Concretely, my 8 year old son likes math. He's fine with multiplication and division, but enjoys thinking about math. If I want him to be successful applying math later in life, should I start him on knot theory, pre-algebra equation solving, adding and subtracting unlike fractions, or coding in python? I see real advantages to ... (read more)
Partly agree with your criticism of the quoted claim, but there are two things I think you should consider.
First, evaluating tests for long-term outcomes is fundamentally hard. The extent to which a 5th grade civics or math test predicts performance in policy or engineering is negligible. In fact, I would expect that the feedback from test scores in determining what a child focuses on has a far larger impact on the child's trajectory than the test's object-level predictive power would suggest.

Second, standardizing tests greatly reduces the cost of development, and allows larger sample sizes for validation. For either reason alone, it makes sense to use standardized tests as much as possible.
12. Netanyahu is still Israeli PM: 40%
This is the PredictIt line for him on 6/30, and Scott’s predicting this out to January 1. I’m guessing that he didn’t notice? Otherwise, given how many things can go wrong, it’s a rather large disagreement – those wacky Israelis have elections constantly. I’m going to sell this down to 30% even though I have system 1 intuitions he’s not going anywhere. Math is math.
I would buy at this price, probably up to 50%, but there are some wrinkles to how it gets resolved. At least 45% of the population really really... (read more)
I also just requested this on reddit
Also just requested on reddit: https://www.reddit.com/r/Scholar/comments/mtwl4d/chapter_k_hoskin_1996_the_awful_idea_of/
Request: "K. Hoskin (1996) The 'awful idea of accountability': inscribing people into the measurement of objects. In Accountability: Power, Ethos and the Technologies of Managing, R. Munro and J. Mouritsen (Eds). London, International Thomson Business Press, and references therein."

(Cited by: Strathern, Marilyn (1997). "'Improving ratings': audit in the British University system". European Review. John Wiley & Sons. 5 (3): 305–321. doi:10.1002/(SICI)1234-981X(199707)5:3<305::AID-EURO184>3.0.CO;2-4.)
See Google Books, and Worldcat (Available in man... (read more)
Noting the obvious connection to Goodhart's law - and elsewhere I've described the mistake of pushing to maximize easy-to-measure / cognitively available items rather than true goals.
Yeah, that's true. I don't recall exactly what I was thinking. Perhaps it was regarding time-weighting, and the difficulty of seeing what your score will be based on what you predict - but the Metaculus interface handles this well, modulo early closings, which screw lots of things up. Also, log-scoring is tricky when you have both continuous and binary outcomes, since they don't give similar measures - being well calibrated for binary events isn't "worth" as much, which seems perverse in many ways.
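To illustrate the mismatch, here's a toy sketch (not any platform's actual scoring rule): a binary log score is bounded above by log(1) = 0, while a sharp continuous forecast's probability density can exceed 1, so its log score can go positive.

```python
import math

# Toy comparison of log scores; not any platform's actual scoring rule.
# Binary forecast: 80% on the correct outcome - the score can never beat log(1) = 0.
binary_score = math.log(0.8)

# Continuous forecast: a sharp Normal density (sigma = 0.1) evaluated exactly at
# the true value exceeds 1, so the log score is positive.
sigma, error = 0.1, 0.0
density = math.exp(-error**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
continuous_score = math.log(density)

print(round(binary_score, 2))      # -0.22
print(round(continuous_score, 2))  # 1.38
```

So a forecaster who is merely well calibrated on binary questions can't earn the large positive scores available on sharp continuous forecasts.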
In many cases, yes. But for some events, the "obvious" answers are not fully clear until well after the event in question takes place - elections, for example.
About 20% of Americans develop skin cancer during their lifetime, and the 5-year overall survival rate for melanoma is over 90%. Taking this as the mortality risk, i.e. ignoring timing and varied risk levels, it's a 2% risk of (eventual) death.

But risk of skin cancer depends on far more than sun exposure - and the more important determinant is frequency of sunbathing below age 30. Other factors that seem to matter are skin color, skin response (how much you burn), weight, and family history of cancers.
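The arithmetic behind that 2%, as a quick back-of-envelope sketch:

```python
# Back-of-envelope mortality estimate, ignoring timing and varied risk levels.
lifetime_incidence = 0.20   # ~20% of Americans develop skin cancer
five_year_survival = 0.90   # 5-year overall survival for melanoma is over 90%

eventual_death_risk = lifetime_incidence * (1 - five_year_survival)
print(round(eventual_death_risk, 3))  # 0.02, i.e. a ~2% risk of eventual death
```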
re: "Get this wrong" versus "the balance should be better," there are two different things being discussed. The first is about defining individual questions via clear resolution criteria, which I think is done well, and the second is about defining clear principles that provide context and inform what types of questions and resolution criteria are considered good form.

A question like "will Democrats pass H.R.2280 and receive 51 votes in the Senate" is very well defined, but super-narrow, and easily resolved "incorrectly" if the bill is incorporated... (read more)
I haven't said, and I don't think, that the majority of markets and prediction sites get this wrong. I think they navigate this without a clear framework, which I think the post begins providing. And I strongly agree that there isn't a slam-dunk-no-questions case for principles overriding rules, which the intro might have implied too strongly. I also agree with your point about downsides of ambiguity potentially overriding the benefits of greater fidelity to the intent of a question, and brought it up in the post. Still, excessive focus on making rules on ... (read more)
As an aside, I find it bizarre that Economics gets put at 9 - I think a review of what gets done in top econ journals would cause you to update that number down by at least 1. (It's not usually very bad, but it's often mostly useless.) And I think it's clear that lots of Econ does, in fact, have a replication crisis. (But we'll see if that is true as some of the newer replication projects actually come out with results.)
Generally agree that there's something interesting here, but I'm still skeptical that in most prediction market cases there would be enough money across questions, and enough variance in probabilities, for this to work well.
For betting markets, the market maker may need to manage the odds differently, and for prediction markets, it's because otherwise you're paying people in lower Brier scores for watching the games, rather than for being good predictors beforehand. (The way that time-weighted Brier scores work is tricky - you could get it right, but in practice it seems that last-minute failures to update are fairly heavily penalized.)
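A toy version of the penalty (simplified; real platforms weight and aggregate differently): if each day's Brier score is averaged over a question's lifetime, a forecaster who stops updating near the end gets hit hard even when their early forecasts were identical.

```python
def time_avg_brier(daily_forecasts, outcome):
    """Average the daily Brier score (p - outcome)^2 over the question's life."""
    return sum((p - outcome) ** 2 for p in daily_forecasts) / len(daily_forecasts)

# Ten-day question that resolves YES (outcome = 1).
attentive  = [0.9] * 10                # keeps the forecast current throughout
distracted = [0.9] * 8 + [0.5, 0.5]    # identical early, stops watching at the end

print(round(time_avg_brier(attentive, 1), 3))   # 0.01
print(round(time_avg_brier(distracted, 1), 3))  # 0.058
```

Two days of stale forecasts roughly sextuple the distracted forecaster's average score (lower is better), which is exactly the "paid for watching the game" effect.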
That's good to hear. But if "he started at 60%," that seems to mean if he "still thinks dark matter is overwhelmingly likely" he is updating in the wrong direction. (Perhaps he thought it was 60% likely that the LHC found dark matter? In which case I still think that he should update away from "overwhelmingly likely" - it's weak evidence against the hypothesis, but unless he started out almost certain, "overwhelmingly" seems to go a bit too far.)
Yes, that was exactly what I was thinking of, but 1) I didn't remember the name, and 2) I wanted a concrete example relevant to prediction markets.

And I agree it's hard to estimate in general, but the problem can still be relevant in many cases - which is why I used my example. In the baseball game, if the market closes before the game begins, we don't have a model as good as the market; but once the game is 7/9ths complete, we can do better than the pre-game market prediction.
It's an interesting idea, but one that seems to have very high costs for forecasters in keeping the predictions updated and coherent.

If we imagine that we pay forecasters the market value of their time, an active forecasting question with a couple dozen people spending a half hour each updating their forecasts "costs" thousands of dollars per week. Multiplying that out, even when accounting for reduced costs on similar questions, the total seems not worth the benefit.
"isn't it quite odd that looking around at different parts of the universe seems to produce such a striking level of agreement on how much missing mass there is?"

But they don't agree. Dark matter, as a theory, posits that the amount of mass that "must be there somewhere" varies in amount and distribution in an ad-hoc fashion to explain the observations. I think it's likely that whatever is wrong with the theory, on the other hand, isn't varying wildly by where in the universe it is. Any such explanation would (need to) be more parsimonious, not less so.

And I agr... (read more)
This was fantastic, and still leaves me with the conclusion that "dark matter" isn't a specific hypothesis; it's a set of reasons to think we're missing something in our theories which isn't modified gravity.

That is, saying "Given that everything we see is consistent with Gravity being correct, we conclude that there is not enough baryonic matter to account for what we see," doesn't prove the existence of large amounts of non-baryonic matter. Instead, the evidence provides strong indication that either A) there is something we can't see that has some propert... (read more)
If there's something wrong with some theory, isn't it quite odd that looking around at different parts of the universe seems to produce such a striking level of agreement on how much missing mass there is? If there was some out-of-left-field thing, I'd expect it to have confusing manifestations in many different areas and astronomers angsting about dramatically inconsistent measurements, I would not expect the CMB to end up explained away (and the error bars on those measurements are really really small) by the same 5:1 mix of non-baryonic matter vs baryon... (read more)
"Worth having" is a separate argument about relative value of new information. It is reasonable when markets exist or we are competing in other ways where we can exploit our relative advantage. But there's a different mistake that is possible which I want to note.
Most extreme beliefs are false; for every correct belief, there are many, many extreme beliefs that are false. Strong consensus on some belief is (evidence for the existence of) strong evidence of the truth of that belief, at least among the considered alternatives. So picking a belief on the basi... (read more)
I think we agree on this - iterated closing is an interesting idea, but I'm not sure it solves a problem. It doesn't help with ambiguity, since we can't find bounds. And earlier payouts are nice, but by the time we can do partial payouts, they are either tiny, because of large ranges, or they are not much before closing. (They also create nasty problems with incentive compatibility, which I'm unsure can be worked out cleanly.)
"partial resolution seems like it would be useful"

I hadn't thought of this originally, but Nuno added the category of "Resolve with a Probability," which does this. The idea of iterated closing of a question as the bounds improve is neat, but probably technically challenging. (GJ Inc. kind-of does this when they close answer options that are already certain to be wrong, such as total ranges below the current number of COVID cases.) I'd also worry it creates complexity that makes it much less clear to forecasters how things will work.

"one helpful mechanism ... (read more)
Not sure that you'd get reactions from large subunits if they fold differently than the full spike - but my biochemistry/immunology isn't enough to be sure about how this would work.
"Aside from the test result, we do have one more small piece of information to update on: I was quite congested for 1-2 days after the most recent three doses (and I was generally not congested the rest of the week). That's exactly what we'd expect to see if the vaccine is working as intended, and it's pretty strong evidence that it's doing something."
Agree that this is evidence it is doing something, but my strong prior is that the adjuvant alone (chitosan) would cause this to happen. I'm also unclear about why you chose the weekly schedule, or... (read more)
I agree that posting the results was the correct thing to do, and appreciate that John is trying to figure out if this is useful - but I actually claim the post is an example of how rationality is hard, and even pursuing it can be misleading if you aren't very, very careful.

In The Twelve Virtues of Rationality, this post gets virtue points for the first (curiosity, for looking into whether it works), third (lightness, being willing to update marginally on evidence), fourth (evenness, updating even when the evidence isn't in the direction desired), sixth (e... (read more)
You need to see if the spike peptide included corresponds to the antibody being tested for - and given how many targets there are, I would be surprised if it did.

Despite holding a far lower prior on efficacy, I'm agreeing with Christian - this evidence shouldn't be a reason to update anywhere nearly as strongly as you did against effectiveness.
Mostly vague "accidents and harmful unknown unknowns aren't that unlikely here" - because we have data on baseline success at "not having harmful side effects," and it is low. We also know that lots of important side effects are unusual, so the expected loss can be high even after a number of "successes," and this is doubly true because no one is actually tracking side effects. We don't know much about efficacy either, but again, on base rates it is somewhat low. (Base rates for mRNA are less clear, and may be far higher - but these sequences are unfiltered,... (read more)
Sorry, this is clearly much more confrontational than I intended.
First, I apologize. I really didn't intend for the tone to be attacking, and I am sorry that was how it sounded. I certainly wasn't intentionally "suggesting [you were] somehow trying to hide or deny" any of the issues. I thought it was worth noting that the initial characterization was plausibly misleading, given that the sole indicator of being a "nice middle class area" seemed to be percentage of people with PhDs. Your defense was that it was no more than 3x the number of PhDs, but that doesn't mean top 1/3, a point which you later agreed to. And after ... (read more)
Wait, the claim was never that everyone is well off - of course we expect there to be a distribution. But if a sizeable portion of the children at the school have very high-socioeconomic-status parents (even if it's only 10% of the parents, compared to a median of plausibly less than 1% across schools overall), it would be incorrect to infer that the way the school is run can be usefully compared to the "average" school.
Great post. My only comment is that I think you're confused in section iv when you say, "but the origin of the universe is essentially an infinity of inferential steps away given the sheer scale of the issue" - I think you're misunderstanding some tricky and subtle points about the epistemology of science and what inferential steps would be needed. So people might be right when they say you meant "We can't make any meaningful factual claims about the origin of the universe. We are too limited to understand an event like this." - but the object level... (read more)
That's fair - thanks for checking, and I'd agree that that would better match "very nice middle-class area" than my assertion. (In the US, the top 2-3% is usually considered upper class, while the next 15-20% are upper middle class, and the next ~25% are "lower middle class." This income level definitely puts your neighborhood in the middle of the upper middle class.)
I'd agree with most of your models, and agree that there is divergence at the extremes of a distribution - but that's at the very extremes, and usually doesn't lead to strong anti-correlation even in the extreme tails. But I think we're better off being more concrete. I don't know where you live, but I suspect that your postal code is around the 90% income percentile, after housing costs - a prediction which you can check easily. And that implies that the tails for income and education are still pretty well correlated at only the 97th percentile for e... (read more)
Even given your numbers, I think it's very likely that you're underestimating how privileged the group is. Most things like educational status are Pareto-distributed; 80% of PhDs are in 20% of areas. While that assumption may be unfair, if it were correct, the point with 3x the average would be in the 97th percentile.

And yes, you're near Cambridge, which explains the concentration of PhDs, and makes it seem less elite compared to Cambridge itself, but doesn't change the class of the people compared to the country as a whole.
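The sketch behind that figure, assuming a Pareto distribution calibrated to the 80/20 rule (a strong assumption, as noted above):

```python
import math

# Pareto tail: P(X > x) = (x_min / x) ** alpha.
# The 80/20 rule pins down the shape parameter: alpha = log(5) / log(4) ≈ 1.16.
alpha = math.log(5) / math.log(4)
x_min = 1.0
mean = alpha * x_min / (alpha - 1)   # ≈ 7.2 * x_min

x = 3 * mean                         # an area with 3x the average PhD density
fraction_above = (x_min / x) ** alpha
percentile = 100 * (1 - fraction_above)
print(round(percentile, 1))          # ≈ 97.2, i.e. roughly the 97th percentile
```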
Note that only around 3% of UK residents have PhDs - so I strongly suspect that what you're calling "middle-class" is closer to the top 5% of the population, or what sociologists would say is the very upper part of the upper middle class.
Yes, it's super important to update frequently when the scores are computed as time-weighted. And for Metaculus, that's a useful thing, since viewers want to know what the current best guess is, but it's not the only way to do scoring. But saying frequent updating makes you better at forecasting isn't actually a fact about how accurate the individual forecasts are - it's a fact about how they are scored.
"Immunity" and "efficacy" seem like they should refer to the same thing, but they really don't. And if you talk to people at the FDA or CDC, they should, and probably would, talk about efficacy, not immunity, when discussing these vaccines.

And I understand that the technical terms and usage aren't the same as what people understand; I was trying to point out that for technical usage, the terms don't quite mean the things you were assuming. And yes, the vaccines have not been proven to provide immunizing protection - which again, is diffe... (read more)
There was a lesswrong post about this a while back that I can't find right now, and I wrote a twitter thread on a related topic. I'm not involved with the reasoning behind the structure for GJP or Metaculus, so for both it's an outside perspective. However, I was recently told there is a significant amount of ongoing internal Metaculus discussion about the scoring rule, which, I think, isn't nearly as bad as it seemed. (But even if there is a better solution, changing the rule now would have really weird impacts on motivation of current users, which is cri... (read more)
Having a meetup on this seems interesting. Will PM people.
If the user is interested in getting into the top ranks, this strategy won't be anything like enough. And if not, but they want to maximize their score, the scoring system is still incentive compatible - they are better off reporting their true estimate on any given question. And for the worst (but still self-aware) predictors, this should be the metaculus prediction anyways - so they can still come away with a positive number of points, but not many. Anything much worse than that, yes, people could have negative overall scores - which, if they've predicted on a decent number of questions, is pretty strong evidence that they really suck at forecasting.
Not really. Overall usefulness is really about something like covariance with the overall prediction - are you contributing different ideas and models? That would be very hard to measure, while making the points incentive compatible is not nearly as hard to do.

And how well an individual predictor will do, based on historical evidence, is found by comparing their Brier score to the Metaculus prediction on the same set of questions. This is information users can see on their own page. But it's not a useful figure unless you're asking about relative performance, which, as an outsider interpreting predictions, you shouldn't care about - because you want the aggregated prediction.
I agree that actually offering money would require incentives to avoid, essentially, sybil attacks. But making sure people don't make "noise predictions" isn't a useful goal - those noise predictions don't really affect the overall Metaculus prediction much, since it weights past accuracy.
As someone who is involved in both Metaculus and the Good Judgement Project, I think it's worth noting that Zvi's criticism of Metaculus - that points are given just for participating, so that making the community average guess gets you points - applies to Good Judgement Inc's predictions by superforecasters in almost exactly the same way: the superforecasters are paid for a combination of participation and performance, so that guessing the forecast median earns them money. (GJI does have a payment system for superforecasters which is more complex than this, and which I probably am not allowed to talk about - but the central point remains true.)