This post is a not-so-secret analogy for the AI alignment problem. Via a fictional dialogue, Eliezer explores and counters common objections to the Rocket Alignment Problem as approached by the Mathematics of Intentional Rocketry Institute.
MIRI researchers will tell you they're worried that "right now, nobody can tell you how to point your rocket’s nose such that it goes to the moon, nor indeed any prespecified celestial destination."
The history of science has tons of examples of the same thing being discovered multiple times independently; Wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity's trajectory, then it makes sense to focus on such examples.
But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly afterwards anyway, then the discovery probably wasn't very counterfactually impactful.
Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement Wikipedia's list of multiple discoveries.
To...
Maybe "counterfactually robust" is an OK phrase?
(Half-baked work-in-progress. There might be a “version 2” of this post at some point, with fewer mistakes, and more neuroscience details, and nice illustrations and pedagogy etc. But it’s fun to chat and see if anyone has thoughts.)
There’s a neuroscience problem that’s had me stumped since almost the very beginning of when I became interested in neuroscience at all (as a lens into AGI safety) back in 2019. But I think I might finally have “a foot in the door” towards a solution!
What is this problem? As described in my post Symbol Grounding and Human Social Instincts, I believe the following:
The vestibular system can detect whether you look up or down. It could be that the reflex triggers when you a) look down (vestibular system) and b) have a visual parallax that indicates depth (visual system).
This should be easy to test by closing one eye. Alternatively, it could be the degree of accommodation of the lens. That should be testable by looking down with a lens that forces accommodation at short distances.
The negative should also be testable by asking congenitally blind people about their experience with this feeling of dizziness close to a rim.
Concerns over AI safety and calls for government control over the technology are highly correlated but they should not be.
There are two major forms of AI risk: misuse and misalignment. Misuse risks come from humans using AIs as tools in dangerous ways. Misalignment risks arise if AIs take their own actions at the expense of human interests.
Governments are poor stewards for both types of risk. Misuse regulation is like the regulation of any other technology. There are reasonable rules that the government might set, but omission bias and incentives to protect small but well organized groups at the expense of everyone else will lead to lots of costly ones too. Misalignment regulation is not in the Overton window for any government. Governments do not have strong incentives...
And quantitatively I think it would improve overall chances of AGI going well by double-digit percentage points at least.
Makes sense. By comparison, my own unconditional estimate of p(doom) is not much higher than 10%, and so it's hard on my view for any intervention to have a double-digit percentage point effect.
The crude mortality rate before the pandemic was about 0.7%. If we use that number to estimate the direct cost of a 1-year pause, then this is the bar that we'd need to clear for a pause to be justified. I find it plausible that this bar could be ...
TL;DR: In this post, I distinguish between two related concepts in neural network interpretability: polysemanticity and superposition. Neuron polysemanticity is the observed phenomenon that many neurons seem to fire (have large, positive activations) on multiple unrelated concepts. Superposition is a specific explanation for neuron (or attention head) polysemanticity, where a neural network represents more sparse features than there are neurons (or number of/dimension of attention heads) in near-orthogonal directions. I provide three ways neurons/attention heads can be polysemantic without superposition: non-neuron-aligned orthogonal features, non-linear feature representations, and compositional representation without features. I conclude by listing a few reasons why it might be important to distinguish the two concepts.
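As a toy illustration of the narrow sense of superposition used here (my own sketch, not from the post): random directions in a d-dimensional space are nearly orthogonal, so a layer with 100 neurons can give many more than 100 sparse features their own almost-orthogonal direction, at the cost of a little interference between them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features = 100, 400   # 4x more features than neuron dimensions

# Draw random feature directions and normalize them to unit length.
features = rng.normal(size=(n_features, n_neurons))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Pairwise cosine similarities between distinct features stay small:
# interference is nonzero but low, which is what "near-orthogonal" means here.
overlaps = features @ features.T
off_diag = np.abs(overlaps[~np.eye(n_features, dtype=bool)])
print(f"mean interference: {off_diag.mean():.3f}")  # roughly 0.08
print(f"max interference:  {off_diag.max():.3f}")   # roughly 0.4
```

Note that in this toy setup most features are not aligned with any single neuron, so individual neurons respond to many features, i.e. they look polysemantic.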
Epistemic status: I wrote this “quickly” in about 12 hours, as otherwise it wouldn’t have come out at all. Think of...
Tacit knowledge is extremely valuable. Unfortunately, developing tacit knowledge is usually bottlenecked by apprentice-master relationships. Tacit Knowledge Videos could widen this bottleneck. This post is a Schelling point for aggregating these videos—aiming to be The Best Textbooks on Every Subject for Tacit Knowledge Videos. Scroll down to the list if that's what you're here for. Post videos that highlight tacit knowledge in the comments and I’ll add them to the post. Experts in the videos include Stephen Wolfram, Holden Karnofsky, Andy Matuschak, Jonathan Blow, Tyler Cowen, George Hotz, and others.
Samo Burja claims YouTube has opened the gates for a revolution in tacit knowledge transfer. Burja defines tacit knowledge as follows:
...Tacit knowledge is knowledge that can’t properly be transmitted via verbal or written instruction, like the ability to create
Thanks! Added.
This is a linkpost for On Duct Tape and Fence Posts.
Eliezer writes about fence post security: people think to themselves "in the current system, what's the weakest point?", and then dedicate their resources to shoring up the defenses at that point, not realizing that after the first small improvement in that area, there's likely now a new weakest point somewhere else.
Fence post security happens preemptively, when the designers of the system fixate on the most salient aspect(s) and don't consider the rest of the system. But this sort of fixation can also happen in retrospect, in which case it manifests a little differently but has similarly deleterious effects.
Consider a car that starts shaking whenever it's driven. It's uncomfortable, so the owner gets a pillow to put...
Checking a number's precision correctly is quite trivial, and there were one-line fixes I could have applied that would make the function work properly on all numbers, not just some of them.
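As a minimal illustration of the kind of edge case involved (my own toy example; the original function and its one-line fixes aren't shown here, so this is only an assumption about what "checking precision" means): a naive string-based count of decimal places breaks on floats that Python prints in scientific notation, while going through `Decimal` handles them uniformly.

```python
from decimal import Decimal

def decimal_places_naive(x: float) -> int:
    # Naive approach: count the digits after the '.' in str(x).
    # Breaks when Python prints the float in scientific notation
    # (str(1e-07) == '1e-07' has no '.'), so the indexing below raises.
    return len(str(x).split(".")[1])

def decimal_places_robust(x: float) -> int:
    # Decimal parses the repr uniformly, including scientific notation,
    # and normalize() strips trailing zeros (so 10.0 reports 0 places).
    exponent = Decimal(repr(x)).normalize().as_tuple().exponent
    return max(0, -exponent) if isinstance(exponent, int) else 0

print(decimal_places_robust(1e-07))   # 7
print(decimal_places_robust(10.0))    # 0
print(decimal_places_naive(1e-07))    # IndexError: the naive version fails here
```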
I'm really curious about what such fixes look like. In my experience, those edge cases tend to come about when there is some set of mutually incompatible desired properties of a system, and the mutual incompatibility isn't obvious. For example
In the late 19th century, two researchers meet to discuss their differing views on the existential risk posed by future Uncontrollable Super-Powerful Explosives.
Perhaps. I don't know much about the yields and so forth at the time, nor about the specific plans, if any, that were made for nuclear combat.
But I'd speculate that dozens of kiloton range fission bombs would have enabled the US and allies to win a war against the USSR. Perhaps by destroying dozens of cities, perhaps by preventing concentrations of defensive force sufficient to stop an armored thrust.
Crosspost from my blog.
If you spend a lot of time in the blogosphere, you’ll find a great many people expressing contrarian views. If you hang out in the circles that I do, you’ll probably have heard Yudkowsky say that dieting doesn’t really work, Guzey say that sleep is overrated, Hanson argue that medicine doesn’t improve health, various people argue for the lab leak, others argue for hereditarianism, Caplan argue that mental illness is mostly just aberrant preferences and that education doesn’t work, and various other people expressing contrarian views. Often, very smart people—like Robin Hanson—will write long posts defending these views, other people will have criticisms, and it will all be such a tangled mess that you don’t really know what to think about them.
For...
Nice contrarian view on the popular contrarians - and in yours I have at least 75% faith :)
Ironically, if your elaborations are arguably themselves a bit broad-brush, as @Viliam points out, this could in an odd way also be seen as underlining your core takeaway: even here, where publication bias (or reading-bias-induced publication bias) is decried, maybe a hint of the bias has already sneaked in again.
If we achieve AGI-level performance using an LLM-like approach, the training hardware will be capable of running on the order of millions of concurrent instances of the model.
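A rough back-of-envelope sketch of why (my own assumptions, not the post's exact numbers: the standard ~6·N·D FLOPs training-cost approximation, ~2·N FLOPs per generated token, and the same hardware utilization for inference as for training):

```python
def concurrent_instances(d_tokens: float,
                         train_seconds: float,
                         tokens_per_sec_per_instance: float = 10.0) -> float:
    """Instances the training cluster could serve once training is done.

    Cluster throughput ~ 6*N*D / T FLOP/s; one instance generating r tokens/s
    needs ~2*N*r FLOP/s, so the model size N cancels: instances ~ 3*D / (T*r).
    """
    return 3.0 * d_tokens / (train_seconds * tokens_per_sec_per_instance)

# Example: ~1e13 training tokens over ~3 months at ~10 tokens/s per instance
# gives a few hundred thousand instances; larger training runs or slower
# per-instance generation push the estimate into the millions.
print(f"{concurrent_instances(1e13, 90 * 24 * 3600):,.0f}")  # ≈ 386,000
```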
Although there is some debate about the definition of compute overhang, I believe that the AI Impacts definition matches the original use, and I prefer it: "enough computing hardware to run many powerful AI systems already exists by the time the software to run such systems is developed". A large compute overhang leads to additional risk due to faster takeoff.
I use the types of superintelligence defined in Bostrom's Superintelligence book (summary here).
I use the definition of AGI in this Metaculus question. The adversarial Turing test portion of the definition is not very relevant to this post.
Due to practical reasons, the compute requirements for training LLMs...