LessWrong dev & admin as of July 5th, 2022.
Yeah, I meant terser compared to typical RLHF'd output from e.g. 4o. (I was looking at the traces they showed in https://openai.com/index/learning-to-reason-with-llms/).
o1's reasoning traces being much terser (sometimes to the point of incomprehensibility) seems predicted by doing gradient updates based on the quality of the final output without letting the raters see the reasoning traces, since this means the optimization pressure exerted on the cognition used for the reasoning traces is almost entirely in the direction of performance, as opposed to human-readability.
In the short term this might be good news for the "faithfulness" of those traces, but what it's faithful to is the model's ontology (hence less human-readable), see e.g. here and here.
In the long term, if you keep doing pretraining on model-generated traces, you might rapidly find yourself in steganography-land, as the pretraining bakes in the previously-externalized cognition into capabilities that the model can deploy in a single forward pass, and anything it externalizes as part of its own chain of thought will be much more compressed (and more alien) than what we see now.
I'm just saying it's harder to optimize in the world than to learn human values
Learning what human values are is of course part of a subset of learning about reality, but also doesn't really have anything to do with alignment (as describing an agent's tendency to optimize for states of the world that humans would find good).
alignment generalizes further than capabilities
But this is untrue in practice (observe that models do not become suddenly useless after they're jailbroken) and unlikely in practice (since capabilities come by default, when you learn to predict reality, but alignment does not; why would predicting reality lead to having preferences that are human-friendly? And the post-training "alignment" that AI labs are performing seems like it'd be quite unfriendly to me, if it did somehow generalize to superhuman capabilities). Also, whether or not it's true, it is not something I've heard almost any employee of one of the large labs claim to believe (minus maybe TurnTrout? not sure if he'd endorse it or not).
both because verification is way, way easier than generation, plus combined with the fact that we can afford to explore less in the space of values, combined with in practice reward models for humans being easier than capabilities strongly points to alignment generalizing further than capabilities
This is not what "generalizes further" means. "Generalizes further" means "you get more of it for less work".
A LLM that is to bioengineering as Karpathy is to CS or Three Blue One Brown is to Math makes explanations. Students everywhere praise it. In a few years there's a huge crop of startups populated by people who used it. But one person uses it's stuff to help him make a weapon, though, and manages to kill some people. Laws like 1047 have been passed, though, so the maker turns out to be liable for this.
This still requires that an ordinary person wouldn't have been able to access the relevant information without the covered model (including with the help of non-covered models, which are accessible to ordinary people). In other words, I think this is wrong:
So, you can be held liable for critical harms even when you supply information that was publicly accessible, if it wasn't information an "ordinary person" wouldn't know.
The bill's text does not constrain the exclusion to information not "known" by an ordinary person, but to information not "publicly accessible" to an ordinary person. That's a much higher bar given the existence of already quite powerful[1] non-covered models, which make nearly all the information that's out there available to ordinary people. It looks almost as if it requires the covered model to be doing novel intellectual labor, which is load-bearing for the harm that was caused.
Your analogy fails for another reason: an LLM is not a youtuber. If that youtuber was doing personalized 1:1 instruction with many people, one of whom went on to make a novel bioweapon that caused hundreds of millions of dollars of damage, it would be reasonable to check that the youtuber was not actually a co-conspirator, or even using some random schmuck as a patsy. Maybe it turns out the random schmuck was in fact the driving force behind everything, but we find chat logs like this:
We would correctly throw the book at the youtuber. (Heck, we'd probably do that for providing critical assistance with either step, never mind both.) What does throwing the book at an LLM look like?
Also, I observe that we do not live in a world where random laypeople frequently watch youtube videos (or consume other static content) and then go on to commit large-scale CBRN attacks. In fact, I'm not sure there's ever been a case of a layperson carrying out such an attack without the active assistance of domain experts for the "hard parts". This might have been less true of cyber attacks a few decades ago; some early computer viruses were probably written by relative amateurs and caused a lot of damage. Software security just really sucked. I would be pretty surprised if it were still possible for a layperson to do something similar today, without doing enough upskilling that they no longer meaningfully counted as a layperson by the time they're done.
And so if a few years from now a layperson does a lot of damage by one of these mechanisms, that will be a departure from the current status quo, where the laypeople who are at all motivated to cause that kind of damage are empirically unable to do so without professional assistance. Maybe the departure will turn out to be a dramatic increase in the number of laypeople so motivated, or maybe it turns out we live in the unhappy world where it's very easy to cause that kind of damage (and we've just been unreasonably lucky so far). But I'd bet against those.
ETA: I agree there's a fundamental asymmetry between "costs" and "benefits" here, but this is in fact analogous to how we treat human actions. We do not generally let people cause mass casualty events because their other work has benefits, even if those benefits are arguably "larger" than the harms.
[1] In terms of summarizing, distilling, and explaining humanity's existing knowledge.
Oh, that's true, I sort of lost track of the broader context of the thread. Though then the company needs to very clearly define who's responsible for doing the risk evals, and making go/no-go/etc calls based on their results... and how much input do they accept from other employees?
This is not obviously true for the large AI labs, which pay their mid-level engineers/researchers something like 800-900k/year with ~2/3 of that being equity. If you have a thousand such employees, that's an extra $600m/year in cash. It's true that in practice the equity often ends up getting sold for cash later by the employees themselves (e.g. in tender offers/secondaries), but paying in equity is sort of like deferring the sale of that equity for cash. (Which also lets you bake in assumptions about growth in the value of that equity, etc...)
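To make the back-of-the-envelope arithmetic explicit, here is a minimal sketch. Every input is one of the rough figures from the comment above (800-900k/year total comp, ~2/3 equity, a thousand employees); the function and variable names are just illustrative, and none of this is real compensation data.

```python
# Rough check of the "extra cash per year" estimate.
# All inputs are the approximate figures quoted in the comment, not real data.
TOTAL_COMP_LOW = 800_000    # USD/year per employee (low end of the quoted range)
TOTAL_COMP_HIGH = 900_000   # USD/year per employee (high end of the quoted range)
EQUITY_FRACTION = 2 / 3     # assumed share of total comp paid as equity
HEADCOUNT = 1_000           # assumed number of such employees


def extra_cash_per_year(total_comp, equity_fraction, headcount):
    """Annual cash needed if the equity portion were paid in cash instead."""
    return total_comp * equity_fraction * headcount


low = extra_cash_per_year(TOTAL_COMP_LOW, EQUITY_FRACTION, HEADCOUNT)
high = extra_cash_per_year(TOTAL_COMP_HIGH, EQUITY_FRACTION, HEADCOUNT)
print(f"Extra cash: ${low / 1e6:.0f}m to ${high / 1e6:.0f}m per year")
```

So the $600m figure corresponds to the top of the quoted range; the low end comes out at roughly $533m/year.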
Hey Brendan, welcome to LessWrong. I have some disagreements with how you relate to the possibility of human extinction from AI in your earlier essay (which I regret missing at the time it was published). In general, I read the essay as treating each "side" approximately as an emotional stance one could adopt, and arguing that the "middle way" stance is better than being either an unbridled pessimist or optimist. But it doesn't meaningfully engage with arguments for why we should expect AI to kill everyone, if we continue on the current path, or even really seem to acknowledge that there are any. There are a few things that seem like they are trying to argue against the case for AI x-risk, which I'll address below, alongside some things that don't seem like they're intended to be arguments about this, but that I also disagree with.
But rationalism ends up being a commitment to a very myopic notion of rationality, centered on Bayesian updating with a value function over outcomes.
I'm a bit sad that you've managed to spend a non-trivial amount of time engaging with the broader rationalist blogosphere and related intellectual outputs, and decided to dismiss it as myopic without either explaining what you mean (what would be a less myopic version of rationality?) or supporting the claim (what is the evidence that led you to think that "rationalism", as it currently exists in the world, is the myopic and presumably less useful version of the ideal you have in mind?). How is one supposed to argue against this? Of the many possible claims you could be making here, I think most of them are very clearly wrong, but I'm not going to spend my time rebutting imagined arguments, and instead suggest that you point to specific failures you've observed.
An excessive focus on the extreme case too often blinds the long-termist school from the banal and natural threats that lie before us: the feeling of isolation from hyper-stimulating entertainment at all hours, the proliferation of propaganda, the end of white-collar jobs, and so forth.
I am not a long-termist, but I have to point out that this is not an argument that the long-termist case for concern is wrong. Also it itself is wrong, or at least deeply contrary to my experience: the average long-termist working on AI risk has probably spent more time thinking about those problems than 99% of the population.
EA does this by placing arguments about competing ends beyond rational inquiry.
I think you meant to make a very different claim here, as suggested by part of the next section:
However, the commonsensical, and seemingly compelling, focus on ‘effectiveness’ and ‘altruism’ distracts from a fundamental commitment to certain radical philosophical premises. For example, proximity or time should not govern other-regarding behavior.
Even granting this for the sake of argument (though in reality very few EAs are strict utilitarians in terms of impartiality), this would not put arguments about competing ends beyond rational inquiry. It's possible you mean something different by "rational inquiry" than my understanding of it, of course, but I don't see any further explanation or argument about this pretty surprising claim. "Arguments about competing ends by means of rational inquiry" is sort of... EA's whole deal, at least as a philosophy. Certainly the "community" fails to live up to the ideal, but it at least tries a fair bit.
When EA meets AI, you end up with a problematic equation: even a tiny probability of doom x negative infinity utils equals negative infinity utils. Individual behavior in the face of this equation takes on cosmic significance. People like many of you readers–adept at subjugating the world with symbols–become the unlikely superheroes, the saviors of humanity.
It is true that there are many people on the internet making dumb arguments in support of basically every position imaginable. I have seen people make those arguments. Pascalian multiplication by infinity is not the "core argument" for why extinction risk from AI is an overriding concern, not for rationalists, not for long-termists, not for EAs. I have not met anybody working on mitigating AI risk who thinks our unconditional risk of extinction from AI is under 1%, and most people are between 5% and ~99.5%. Importantly, those estimates are driven by specific object-level arguments based on their beliefs about the world and predictions about the future, i.e. how capable future AI systems will be relative to humans, what sorts of motivations they will have if we keep building them the way we're building them, etc. I wish your post had spent time engaging with those arguments instead of knocking down a transparently silly reframing of Pascal's Wager that no serious person working on AI risk would agree with.
Unlike the pessimistic school, the proponents of a more techno-optimistic approach begin with gratitude for the marvelous achievements of the modern marriage of science, technology, and capitalism.
This is at odds with your very own description of rationalists just a thousand words prior:
The tendency of rationalism, then, is towards a so-called extropianism. In this transhumanist vision, humans transcend the natural limits of suffering and death.
Granted, you do not explicitly describe rationalists as grateful for the "marvelous achievements of the modern marriage of science, technology, and capitalism". I am not sure if you have ever met a rationalist, but around these parts I hear "man, capitalism is awesome" (basically verbatim) and similar sentiments often enough that I'm not sure how we continue to survive living in Berkeley unscathed.
Though we sympathize with the existential risk school in the concern for catastrophe, we do not focus only on this narrow position. This partly stems from a humility about the limitations of human reason—to either imagine possible futures or wholly shape technology's medium- and long-term effects.
I ask you to please at least try engaging with object-level arguments before declaring that reasoning about the future consequences of one's actions is so difficult as to be pointless. After all, you don't actually believe that: you think that your proposed path will have better consequences than the alternatives you describe. Why so?
Is your perspective something like:
Something like that, though I'm much less sure about "non-norms-violating", because many possible solutions seem like they'd involve something qualitatively new (and therefore de-facto norm-violating, like nearly all new technology). Maybe a very superhuman TAI could arrange matters such that things just seem to randomly end up going well rather than badly, without introducing any new[1] social or material technology, but that does seem quite a bit harder.
I'm pretty uncertain about whether, if something like that ended up looking norm-violating, it'd be norm-violating like Uber was[2], or like super-persuasion. That question seems very contingent on empirical questions that I think we don't have much insight into, right now.
I'm unsure about the claim that if you put this aside, there is a way to end the acute risk period without needing truly insanely smart AIs.
I didn't mean to make the claim that there's a way to end the acute risk period without needing truly insanely smart AIs (if you put aside centrally-illegal methods); rather, that an AI would probably need to be relatively low on the "smarter than humans" scale to need to resort to methods that were obviously illegal to end the acute risk period.
See here.