a general idea of “optimizing hard” means higher risk of damage caused by errors in detail
Agreed.
“optimizing soft” has enough slack so as not to have the same risks, but also soft is less ambitious and likely less effective
I disagree with the idea that "optimizing soft" is less ambitious. "Optimizing soft", in my head, is about as ambitious as "optimizing hard", except it makes the epistemic uncertainty more explicit. In this model of caring I am trying to make more legible, I believe that Carlsmith-style caring may be more robust to certain epistem...
Backing up a step, because I'm pretty sure we have different levels of knowledge and assumptions (mostly my failing) about the differences between "hard" and "soft" optimizing.
I should acknowledge that I'm not particularly invested in EA as a community or identity. I try to be effective, and do some good, but I'm exploring rather than advocating here.
Also, I don't tend to frame things as "how to care", so much as "how to model the effects of actions, and how to use those models to choose how to act". I suspect that's isomorphic to how you're us...
I've noticed that there are two major "strategies of caring" used in our sphere:
Nate Soares obviously endorses staring unflinchingly into the abyss that is reality (if you are capable of doing so). However, I expect that almost-pure Soa...
Does this come from a general idea of "optimizing hard" means higher risk of damage caused by errors in detail, and "optimizing soft" has enough slack so as not to have the same risks, but also soft is less ambitious and likely less effective (if both are actually implemented well)?
I predict this is not really an accurate representation of Soares-style caring. (I think there is probably some vibe difference between these two clusters that you're tracking, but I doubt Nate Soares would advocate "overriding" per se)
I did not expect what appears to me to be a non-superficial combination of concepts behind the input prompt and the mixing/steering prompt -- this has made me more optimistic about the potential of activation engineering. Thank you!
Partition (after which block activations are added)
Does this mean you added the activation additions once to the output of the previous layer (and therefore in the residual stream)? My first-token interpretation was that you added it repeatedly to the output of every block after, which seems unlikely.
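For concreteness, here is a minimal sketch of the first interpretation, assuming a HuggingFace-style GPT-2 block layout (the module names and hook are my own illustration, not the post's actual code): the steering vector is added once to one block's output, and the residual stream then carries it forward.

```python
import torch

def add_steering_hook(model, layer_idx, steering_vector):
    """Add `steering_vector` (shape: d_model) once to the residual stream at
    the output of block `layer_idx`; later blocks see it only via the usual
    residual connections."""
    block = model.transformer.h[layer_idx]  # assumed GPT-2-style layout

    def hook(module, inputs, output):
        # output[0] is the residual stream: (batch, seq_len, d_model)
        steered = output[0] + steering_vector.to(output[0].dtype)
        return (steered,) + output[1:]

    return block.register_forward_hook(hook)
```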
Also, could you explain ...
It would be lovely if you could also support some form of formatted export, so that people can use this tool knowing they can export their data and switch to another tool (if this one gets Googled) at any time.
But yes, I am really excited for a super-fast, easy-to-use, and good-looking PredictionBook successor. Manifold Markets was just intimidating for me, and the only reason I got into it was social motivation. This tool serves a more personal niche for prediction logging, I think, and that is good.
running more than one copy of this system at a superhuman speed safely is something no one has any idea how to even approach, and unless this insanity is stopped so we have many more than four years to solve alignment, we’re all dead
My implication was that the quoted claim of yours was extreme and very likely incorrect ("we're all dead" and "unless this insanity is stopped", for example). I guess I failed to make that clear in my reply -- perhaps LW comment norms require you to eschew ambiguity and implication. I was not making an object-level claim about your timeline models.
Build an Alignment Research Assistant using a suite of LLMs managing various parts of the research process. Aims to 10-100x productivity in AI alignment research.
Can you give concrete use cases where you imagine your project helping alignment researchers? Alignment researchers have wildly varying styles of work, outputs, and processes. I assume you aim to accelerate a specific subset of alignment researchers (those who focus on interpretability and existing models, and who have an incremental / empirical strategy for solving the alignment problem).
I'm very interested in this agenda -- I believe this is one of the many hard problems one needs to make progress on to make optimization-steering models a workable path to an aligned foom.
I have slightly different thoughts on how we can and should solve the problems listed in the "Risks of data driven improvement processes" section:
I personally use https://www.amazon.de/Mammut-Amino-Liquid-Flasche-Pack/dp/B01M0M12VS/ref=sr_1_5?crid=LGXT0V1X6YSO&keywords=mammut+amino+liquid&qid=1688664335&sprefix=mammut+amino+%2Caps%2C110&sr=8-5 and picked it mainly because it was the cheapest available on Amazon.
On the upside, now you have a concrete timeline for how long we have to solve the alignment problem, and how long we are likely to live!
I hope that DeepMind and Anthropic have great things planned to leapfrog this!
I don't see what model of the world would make the notion of DM/Anthropic "leapfrogging" a sensible frame. There should be no notion of competition between these labs when it comes to "superalignment". If there is, that is weak evidence that our entire lightcone is doomed.
Competition between labs on capabilities is bad; competition between labs on alignment would be fantastic.
AFAIK, there are two distinct clusters of independent alignment researchers:
It very much depends on the person's preferences, I think. I personally experienced an OOM increase in my effectiveness by being in person with other alignment researchers, so that is what I choose to invest in more.
gwern's Clippy gets done in by a basilisk (in your terms):
...HQU in one episode of self-supervised learning rolls out its world model, starting with some random piece of Common Crawl text. (Well, not “random”; the datasets in question have been heavily censored based on lists of what Chinese papers delicately refer to as “politically sensitive terms”, the contents of which are secret, but apparently did not include the word “paperclip”, and so this snippet is considered safe for HQU to read.) The snippet is from some old website where it talks about how pow
Just a quote I find rather interesting, since it is rare to see a Hero's Journey narrative with a Return that involves the hero not knowing if he will ever belong or find meaning once he returns, and yet chooses to return, having faith in his ability to find meaning again:
...If every living organism has a fixed purpose for its existence, then one thing's for sure. I [...] have completed my mission. I've fulfilled my purpose. But a great amount of power that has served its purpose is a pain to deal with, just like nuclear materials that have reached the end
The mistake here is the assumption that a program that models the world better necessarily has a higher Kolmogorov complexity.
Perfect. A Turing machine doing Levin Search or running all possible Turing machines is the first example that came to my mind when I read Anton's argument against RSI-without-external-optimization-bits.
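For illustration, here is a toy sketch of what I mean by dovetailed universal search (my own toy code; `run_step` is a hypothetical one-step interpreter, not a real API): a short, fixed program that in the limit runs every program, so its own Kolmogorov complexity stays small no matter how capable the programs it contains are.

```python
from itertools import count

def all_programs():
    """Enumerate every finite bitstring, treated as a candidate program."""
    for length in count(1):
        for n in range(2 ** length):
            yield format(n, f"0{length}b")

def dovetail(run_step, max_rounds):
    """Round i: advance each of the first i programs by one more step.
    `run_step(program, round_idx)` is a stand-in for a one-step interpreter."""
    started = []
    gen = all_programs()
    for i in range(1, max_rounds + 1):
        started.append(next(gen))
        for program in started:
            run_step(program, i)
```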
Recently I’ve come to terms with the idea that I have to publish my research even if it feels unfinished or slightly controversial. The mind is too complex (who would have thought), each time you think you get something, the new bit comes up and crushes your model. Time after time after time. So, waiting for at least remotely good answers is not an option. I have to “fail fast” even though it’s not a widely accepted approach among scientists nowadays.
I very much endorse and respect this action, especially because I recognize this in myself and yet still fail to do the obvious next step of "failing fast". I have faith I'll figure it out, though.
I endorse the shape of your argument but not exactly what you said.
Perhaps a better way to think about this is incentives. Zero-sum moves are optimal in conditions of scarcity, while positive-sum moves are optimal in conditions of abundance.
Good read.
I don't endorse this being posted on LW, but I absolutely endorse having read this, and look forward to reading more fiction you write. (Unlike with your last two pieces of fiction, I fail to see how this one connects to LW.)
Ty! Yeah, I was uncertain about posting it, but seems like the mods want LW to host a pretty wide range of stuff for whatever reason. E.g. Jeff Kaufman's blog posts are always posted to frontpage.
I'm really glad you wrote this post, because Tsvi's post is different and touches on very different concepts! That post is mainly about fun and exploration being undervalued as a human being. Your post seems to have one goal: ensure that up-and-coming alignment researchers do not burn themselves out or hyperfocus on only one strategy for contributing to reducing AI extinction risk.
Note, this passage seems to be a bit... off to me.
...This one is slightly different from the last because it is an injunction to take care of your mental health. You are more usef
Good point! I won't use Substack though, so if I read your post 24 hours after release I'll leave the typos be.
Nate Soares' point did not depend on complex systems dynamics causing tiny miscalibrations to blow up into massive issues. The entire point of that essay is to show how ontological shifts are a major problem for alignment robustness.
I expect that AIs will be good enough at epistemology to do competent error correction and the problems you seem overly focused on are irrelevant.
Do you believe that all attempts at alignment are flawed and that we should stop building powerful ASIs entirely? I can't quite get what your belief is.
I stated it in the comment you replied to:
Humanity is already less capable than the predecessor AI in my model, so trying to retain control would reliably lead to worse outcomes.
Natural abstractions are also leaky abstractions.
No, the way I used the term was to point to robust abstractions of ontological concepts. Here's an example: say 1 + 1 = 2. "1 + 1" here obviously means 2 in our language, but that doesn't change what "1 + 1" represents, ontologically. If 1 + 1 ≠ 2, then you have broken math, and that results in you being less capable in your reasoning and being "Dutch booked". Your world model is then incorrect, and it is very unlikely that any ontological shift will result in such a break in world-model capabilities.
Math is a robust abstracti...
This because what we are dealing with is machinery that continues to self-learn code from inputs, and continues to self-modify by replacing broken parts (perfect hardware copies are infeasible).
Pretty sure that the problem of ensuring successor AIs are aligned to their predecessors is one that can be delegated to a capable and aligned AI. Asking for "perfect hardware copies" misses the point, in my opinion: it seems like you want me to accept that just because there isn't a 100% chance of AI-to-AI successor alignment, humanity must attempt to retain con...
Typo report:
"Rethink Priors is remote hiring a Compute Governance Researcher [...]" I checked and they still use the name Rethink Priorities.
"33BB LLM on a single 244GB GPU fully lossless" ->should be 33B, and 24GB
"AlpahDev from DeepMind [...]" -> should be AlphaDev
Could you link (or describe) a better explanation for why you believe that the Natural Abstraction Hypothesis (or a goal described in a way that is robust to ontological shifts; I consider both equivalent) is not a sound assumption? Because in such a case I believe we are mostly doomed. I don't expect the 'control problem' to be solvable or consider that it makes sense for humanity to be able to have a leash on something superintelligent that can have a shift in its preferences.
Assuming an inner aligned AI system (that is, an AI system with no misaligned inner optimizers), if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstractions Hypothesis holding in some way (specifically, what I have in mind is formally specified goals like QACI, since I expect that mathematical abstractions are robust to ontological shifts), then one can simply[1] provide this AI system with that goal and allow it to do whatever it considers necessary to maximize it.
I do not believe this alignment strategy ...
Also intuitively, in the latter case 5 of the data points “didn’t matter” in that you’d have had the same constraints (at that point) without them, and so this is kinda sorta like “information loss”.
I am confused: how can this be "information loss" when we are assuming that due to linear dependence of the data points, we necessarily have 5 extra dimensions where the loss is the same? Because 5 of the data points "didn't matter", that shouldn't count as "information loss" but more like "redundant data, ergo no information transmitted".
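A toy numerical illustration of what I mean (my own example, not from the post): linearly dependent data points do not increase the rank of the data matrix, i.e. they add no new constraints.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))            # 5 independent data points in R^10
combos = rng.normal(size=(5, 5))        # coefficients for 5 redundant points
X_aug = np.vstack([X, combos @ X])      # 10 points, last 5 are linear combos

print(np.linalg.matrix_rank(X))         # 5 constraints
print(np.linalg.matrix_rank(X_aug))     # still 5: the redundant points add none
```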
Control methods are always implemented as a feedback loop.
Most of my foom scenarios do not involve humanity trying to retain control over an AI, but trying to align it such that when humanity loses control, the outcomes are still favorable for humanity (or at least, the seed that represents humanity is what is maximized).
They are also not allowed to tell each other their true goals, and are ordered to eliminate the other if they tell them their goals. Importantly these rules also happen to allow them to have arbitrary sub goals as long as they are not a threat to humanity.
If we can steer an AI to the extent that it will follow such an arbitrary rule that we provide it, then we can fully align AIs with the same tools we used to make it do so.
...Therefore A_n can properly align A_{n+1}. The base case is simply a reasonable human being who is by definition aligned.
My bad. I'm glad to hear you do have an inside view of the alignment problem.
If knowing enough about ML is your bottleneck, perhaps that's something you can directly focus on? I don't expect it to be hard for you -- perhaps only about six months -- to get to a point where you have coherent inside models about timelines.
Part of the reason I’m considering getting a degree is so I can get a job if I want and not have to bet on living rent-free with other rationalists or something.
Yeah, that's a hard problem. You seem smart: have you considered finding rationalists or rationalist-adjacent people who want to hire you part-time? I expect that the EA community in particular may have people willing to do so, and that would give you experience (to show future employers / clients), connections (to find more part-time / full-time jobs), and money.
...Now that I think about it
2050? That's quite far off, and it makes sense that you are considering university given you expect to have about two decades.
Given such a scenario, I would recommend trying to do a computer science/math major, specifically focusing on the subjects listed in John Wentworth's Study Guide that you find interesting. I expect that three years of such optimized undergrad-level study will easily make someone at least SERI MATS scholar level (assuming they start out as a high school student). Since you are interested in agent foundations, I expect you shall find Joh...
Sorry for the late reply: I wrote up an answer but due to a server-side error during submission, I lost it. I shall answer the interpretability question first.
Interpretability didn't make the list because of the following beliefs of mine:
There seem to be three key factors that would influence your decision:
Based on your LW comment history, you probably already have rough models about the alignment problem that inform these three beliefs of yours. I think it would be helpful if you could go into detail about them so people can give you more specific advice, or perhaps help you answer another question further upstream of the one you asked.
Causal Influence Diagrams are interesting, but don't really seem all that useful. Anyway, the latest formal graphical representation for agents that the authors seem to promote is structured causal models, so you don't read this paper for its object-level usefulness but for its incidental research contributions, which are really interesting.
The paper divides AI systems into two major frameworks:
When I referred to pivotal acts, I implied the use of enforcement tools that are extremely powerful, of the sort implied in AGI Ruin. That is, enforcement tools that make an actual impact in extending timelines[1]. Perhaps I should start using a more precise term to describe this from now on.
It is hard for me to imagine how there can be consensus within a US government organization capable of launching a superhuman-enforcement-tool-based pivotal act (such as three letter agencies) to initiate a moratorium, much less consensus in the US government or betwee...
Your question seems to focus mainly on my timeline model rather than my alignment model, so I shall focus on explaining how my timeline has changed.
My timeline shortened from a mean of about four years to my current mean of about 2.5 years since the GPT-4 release. This was for two reasons:
The latt...
Formatting error: "OK, I used to work for a robotics company, and I do think that one of the key obstacles for a hostile AI is moving atoms around. So let me propose some alarms!" should be quoted since it is not you (Zvi) writing that passage but the person you linked and are quoting.
Possible typos:
Spent about 45 minutes processing this mentally. Did not look through the code or wonder about the reliability of the results. Here are my thoughts:
Based on what I can tell, AP fine-tuning will make the AI more likely to simulate the relevant AP, and its tokens will be what the simulator thinks the AP would return next. This me...
I want to differentiate between categories of capabilities improvement in AI systems, and here's the set of terms I've come up with to think about them:
Infrastructure improvements: Capability boost in the infrastructure that makes up an AI system. This involves software (PyTorch, CUDA), hardware (NVIDIA GPUs), operating systems, networking, and the physical environment where the infrastructure is situated. This is probably not the lowest-hanging fruit when it comes to capabilities acceleration.
Scaffolding improvements: Capability boost in an AI system th
Your text here is missing content found in the linked post. Specifically, the sentence "If one has to do this with" ends abruptly, unfinished.
Before reading this post, I usually would refrain from posting/commenting on LW posts, partially because of the high threshold of quality for contribution (which is where I agree with you in a certain sense), and partially because it seemed more polite to ignore posts I found flaws in, or disagreed with strongly, than to engage (which costs both effort and potential reputation). Now, I believe I shall try to be more Socratic -- more willing to point out, as politely as I can, confusions and potential issues in posts/comments I have read and found wanting, if ...
That is the biggest issue I have with your writings (and with Zack's too, because he makes the same mistake): you write too many words to communicate too few bits of usefulness.
Given what Zack writes about, I think he has no choice but to write this way. If he was brief, there would be politically-motivated misreadings of his posts. His only option is to write a long post which preemptively rules those out.
(Sorry for triple reply, trying to keep threads separate such that each can be responded to individually.)
what the most serious weaknesses of your argument are
I claim that the LW of 2023 is worse at correctly identifying the most serious weaknesses of a given argument than the LW of 2018.
Relative to the LW of 2018, I have the subjective sense that there's much much more strawmanning and zeroing-in-on-non-cruxes and eliding the distinctions between "A somewhat implies B," "A strongly implies B," and "A is tantamount to B."
I would genuinely expect that...
Pulling up a thought from another subthread:
Basically, I'm claiming that there are competing access needs, here, such as can be found in a classroom in which some students need things to be still and silent, and other students need to fidget and stim.
The Socrati and the Athenians are not entirely in a zero-sum game, but their dynamic has nonzero zero-sum nature. The thing that Socrates needs is inimical to the thing the Athenians need, and vice versa.
I think that's just ... visibly, straightforwardly true, here on LW; you can actually just see...
This response has completely sidestepped the crucial piece, which is to what extent [that kind of commentary] drives authors away entirely.
You're acting as if you always have fodder for that sort of engagement, and you in fact don't; enough jesters, and there are no kings left to critique.
Alignment agendas can generally be classified into two categories: blueprint-driven and component-driven. Understanding this distinction is probably valuable for evaluating and comprehending different agendas.
Blueprint-driven alignment agendas are approaches that start with a coherent blueprint for solving the alignment problem. They prioritize the overall structure and goals of the solution before searching for individual components or building blocks that fit within that blueprint. Examples of blueprint-driven agendas include MIRI's agent foundations, Va...
I think a better way of rephrasing it is "clever schemes have too many moving parts and make too many assumptions, and each assumption we make is a potential weakness an intelligent adversary can and will optimize for".
by effectively generating such datasets, either for specific skills or for everything all at once
Just to be clear, what you have in mind is something to the effect of chain-of-thought (where LLMs and people deliberate through problems instead of trying to get an answer immediately or in the next few tokens), but in a more roundabout fashion: you make the LLM deliberate a lot and then fine-tune it on that deliberation so that its "in the moment" (aka next-token) response is more accurate -- is that right?
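If I've understood you, the loop looks roughly like the following sketch (my own framing; `deliberate` and `fine_tune` are hypothetical stand-ins, not your actual setup):

```python
def build_deliberation_dataset(model, prompts, deliberate):
    """`deliberate(model, prompt)` is assumed to return (long_chain_of_thought,
    final_answer); we keep the full trace as the fine-tuning target."""
    dataset = []
    for prompt in prompts:
        trace, answer = deliberate(model, prompt)
        dataset.append({"prompt": prompt, "completion": trace + "\n" + answer})
    return dataset

def improve(model, prompts, deliberate, fine_tune, rounds=3):
    """Alternate: generate deliberations with the current model, then
    fine-tune on them so the 'in the moment' answers get more accurate."""
    for _ in range(rounds):
        data = build_deliberation_dataset(model, prompts, deliberate)
        model = fine_tune(model, data)  # assumed fine-tuning routine
    return model
```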
If so, how would you correct for the halluci...
Sidenote: I like how OpenAI ends their blog posts with an advertisement for positions they are hiring for, or programs they are running. That's a great strategy to advertise to the very people they want to reach.
Acknowledged, that was an unfair characterization of Nate-style caring. I guess I wanted to make explicit two extremes. Perhaps using the name "Nate-style caring" is a bad idea.
(I now think that "System 1 caring" and "System 2 caring" would have been much better.)