i currently believe that working on superintelligence-alignment is likely the correct choice from a fully-negative-utilitarian perspective.[1]
for others, this may be an intuitive statement or unquestioned premise. for me it is not, and i'd like to state my reasons for believing it, partially as a response to this post concerned about negative utilitarians trying to accelerate progress towards an unaligned-ai-takeover.
there was a period during which i was more uncertain about this question, and avoided openly sharing minimally-dual-use alignment research (but did not try to accelerate progress towards a nonaligned-takeover) while resolving that uncertainty.
a few relevant updates since then:
(edit: status: not a crux; it's instead downstream of different beliefs about what the first safe ASI will look like, in predicted futures where one exists. If I instead believed 'task-aligned superintelligent agents' were the most feasible form of pivotally useful AI, I would then support their use for pivotal acts.)
I was rereading some of the old literature on alignment research sharing policies after Tamsin Leake's recent post and came across some discussion of pivotal acts as well.
Hiring people for your pivotal act project is going to be tricky. [...] People on your team will have a low trust and/or adversarial stance towards neighboring institutions and collaborators, and will have a hard time forming good-faith collaboration. This will alienate other institutions and make them not want to work with you or be supportive of you.
This is in a context where the 'pivotal act' example is using a safe ASI to shut down all AI labs.[1]
My thought is that I don't see why a pivotal act needs to be that. I don't see why shutting down AI labs or using nanotech to disassemble GPUs on Earth would be necessary. These may be among the 'most direct' or 'simplest to imagine' possible...
edit: i think i've received enough expressions of interest (more would have diminishing value, but you're still welcome to reach out). thanks everyone!
i recall reading in one of the MIRI posts that Eliezer believed a 'world model violation' would be needed for success to be likely.
i believe i may be in possession of such a model violation and am working to formalize it, where by formalize i mean write in a way that is not 'hard-to-understand intuitions' but 'very clear text that leaves little possibility for disagreement once understood'. it wouldn't solve the problem, but i think it would make it simpler so that maybe the community could solve it.
if you'd be interested in providing feedback on such a 'clearly written version', please let me know as a comment or message.[1] (you're not committing to anything by doing so, rather just saying "i'm a kind of person who would be interested in this if your claim is true"). to me, the ideal feedback is from someone who can look at the idea under 'hard' assumptions (of the type MIRI has) about the difficulty of pointing an ASI, and see if the idea seems promising (or 'like a relevant model violation') from that perspective.
i don't have many cont
A quote from an old Nate Soares post that I really liked:
...It is there, while staring the dark world in the face, that I find a deep well of intrinsic drive. It is there that my resolve and determination come to me, rather than me having to go hunting for them.
I find it amusing that "we need lies because we can't bear the truth" is such a common refrain, given how much of my drive stems from my response to attempting to bear the truth.
I find that it's common for people to tell themselves that they need the lies in order to bear reality. In fact, I bet that many of you can think of one thing off the top of your heads that you're intentionally tolerifying, because the truth is too scary to even consider. (I've seen at least a dozen failed relationships dragged out for months and months due to this effect.)
I say, if you want the intrinsic drive, drop the illusion. Refuse to tolerify. Face the facts that you feared you would not be able to handle. You are likely correct that they will be hard to bear, and you are likely correct that attempting to bear them will change you. But that change doesn't need to break you. It can also make you stronger, and fuel your resolve.
So see the dark worl
I often struggle to find words and sentences that match what I intend to communicate.
Here are some problems this can cause:
These apply to speaking, too. If I speak what would be the 'first iteration' of a sentence, there's a good chance it won't create an interpretation matching what I intend to communicate. In spoken language I have no chance to constantly 'rewrite' my output before sending it. This is one reason, but not the only reason, that I've had a policy of t...
Here's a tampermonkey script that hides the agreement score on LessWrong. I wasn't enjoying this feature: I don't want my perception to be influenced by agreement scores; I want to judge purely based on the ideas, and on my own.
Here's what it looks like:
// ==UserScript==
// @name Hide LessWrong Agree/Disagree Votes
// @namespace http://tampermonkey.net/
// @version 1.0
// @description Hide agree/disagree votes on LessWrong comments.
// @author ChatGPT4
// @match https://www.lesswrong.com/*
// @grant none
// ==/UserScript==
(function() {
    'use strict';
    // hide any element whose class name mentions the agreement-vote axis.
    // note: this class-name pattern is an assumption about LessWrong's current
    // markup, and may need adjusting if the site's DOM changes.
    const style = document.createElement('style');
    style.textContent = '[class*="AgreementVoteAxis"] { display: none !important; }';
    document.head.appendChild(style);
})();
... I was looking at this image in a post and it gave me some (loosely connected/ADD-type) thoughts.
In order:
random idea for a voting system (i'm a few centuries late. this is just for fun.)
instead of voting directly, everyone is assigned to a discussion group of x people (say 5): themself and others near them. the group meets to discuss at an official location (attendance is optional). only if those who showed up reach consensus does the group cast one vote.
many of these groups, say 70-90%, would not reach consensus. that's fine. the point is that most of the ones which do would be composed of people who make and/or are receptive to valid arguments. this would then sh...
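here's a minimal toy simulation of the mechanical part of the idea (only the consensus filter, not the discussion/persuasion dynamics that are the actual point; the group size, attendance rate, and how individual leanings are drawn are all made-up parameters for illustration):

// toy simulation of the group-consensus voting idea.
// assumptions (not from the idea above): two options, each attendee has an
// independent fixed leaning, and a group only casts a vote if everyone
// who showed up agrees.
function simulateElection(numVoters, groupSize, probPreferA, probAttend) {
  const tally = { A: 0, B: 0, noConsensus: 0 };
  for (let i = 0; i < numVoters; i += groupSize) {
    // members of this group who actually show up to the discussion
    const attendees = [];
    for (let j = 0; j < groupSize && i + j < numVoters; j++) {
      if (Math.random() < probAttend) {
        attendees.push(Math.random() < probPreferA ? "A" : "B");
      }
    }
    if (attendees.length === 0) { tally.noConsensus++; continue; }
    // consensus rule: the group votes only if all attendees agree
    if (attendees.every(v => v === attendees[0])) tally[attendees[0]]++;
    else tally.noConsensus++;
  }
  return tally;
}

console.log(simulateElection(10000, 5, 0.55, 0.8));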
(copied from discord, written for someone not fully familiar with rat jargon)
(don't read if you wish to avoid acausal theory)
i am kind of worried by the possibility that the following is not true: there is an 'ideal procedure for figuring out what is true'.
for it to not be true would mean that, for any (or some portion of?) tasks, the only way to solve them is through something like a learning/training process (in the AI sense), or some other search process that involves checking candidates. it would mean that there's no 'reason' behind the solution being what it is; it's just a {mathematical/logical/algorithmic/other isomorphism} coincidence.
for it to be true, i guess it would mean that there's anoth...
From an OpenAI technical staff member who is also a prolific twitter user 'roon':
An official OpenAI post has also confirmed:
i'm watching Dominion again to remind myself of the world i live in, to regain passion to Make It Stop
it's already working.
negative values collaborate.
for negative values, as in values about what should not exist, matter can simultaneously be "not suffering", "not a staple", and "not [any number of other things]".
negative values can collaborate with positive ones, although much less efficiently: the positive just need to make the slight trade of being "not ..." to gain matter from the negatives.
At what point should I post content as top-level posts rather than shortforms?
For example, a recent piece I posted as a shortform was ~250 concise words plus an image. It would be a top-level post on my blog if I had one set up (maybe soon :p).
Some general guidelines on this would be helpful.
i tentatively think it would have been morally best if an automatically-herbivorous and mostly-asocial species/space-of-minds had been the first to reach the capability threshold for building technology and civilization.
i expect there still would be s
what should i do with strong claims whose reasons are not easy to articulate, or which are the culmination of a lot of smaller subjective impressions? should i just not say them publicly, so as not to conjunctively-cause needless drama? here's an example:
"i perceive the average LW commenter as maybe having read the sequences long ago, but if so having mostly forgotten their lessons."
i saw a shortform from 4 years ago that said in passing:
if we assume that signaling is a huge force in human thinking
is signaling a huge force in human thinking?
if so, anyone want to give examples of ways this shows up that i, being autistic, may not have noticed?
random (fun-to-me/not practical) observation: probability is not (necessarily) fundamental. we can imagine a totally discrete mathematical world where it is possible for an entity inside it to observe the entirety of that world, including itself. (let's say it monopolizes the discrete world and makes everything but itself into 1s, so the world can be easily compressed and stored in its world model, such that the compressed data of both itself and the world can fit inside of the world)
this entity would be able to 'know' (prove?) with certainty everything about that ma...
(status: uninterpretable for 2/4 reviewers, the two who understood it being friends who are used to my writing style; i'll aim to write something that makes this concept simple to read)
'Platonic' is a categorization I use internally, and my agenda is currently the search for methods to ensure AI/ASI will have this property.
By this word, I mean the following acceptances (✅) and rejections (❌):
✅ Has no goals
✅ Has goals about what to do in isolation. Example: "in isolation from any world, (try to) output A"[1]
❌ Has goals related to physical world states. Example: "(...
my language progression on something, becoming increasingly general: goals/value function -> decision policy (not all functions need to be optimizing towards a terminal value) -> output policy (not all systems need to be agents) -> policy (in the space of all possible systems, there exist some whose architectures do not converge to an output layer)
(note: this language isn't meant to imply that a system's behavior must be describable with some simple function; in the limit, the descriptive function and the neural network are the same)
I'm interested in joining a community or research organization of technical alignment researchers who care about and take seriously astronomical-suffering risks. I'd appreciate being pointed in the direction of such a community if one exists.
story for how future LLM training setups could create a world-valuing (-> instrumentally converging) agent:
the initial training task of predicting a vast amount of data from the general human dataset creates an AI that's ~just 'the structure of prediction', a predefined process which computes the answer to the singular question of what text likely comes next.
but subsequent training steps - say rlhf - change the AI from something which merely is this process, to something which has some added structure which uses this process, e.g. which passes it certain...
(self-quote relevant to non-agenticness)
Inside a superintelligent agent - defined as a superintelligent system with goals - there must be a superintelligent reasoning procedure entangled with those goals - an 'intelligence process' which procedurally figures out what is true. 'Figuring out what is true' happens to be instrumentally needed to fulfill the goals, so agents contain intelligence, but intelligence-the-ideal-procedure-for-figuring-out-what-is-true is not inherently goal-having.
Two people I shared this with said it reminded them of retarget the search, a...
a super-coordination story with a critical flaw
part 1. supercoordination story
- select someone you want to coordinate with without any defection risks
- share this idea with them. it only works if they also have the chance to condition their actions on it.
- general note to maybe make reading easier: this is fully symmetric.
- after the acute risk period, in futures where it's possible: run a simulation of the other person (and you).
- the simulation will start in this current situation, and will be free to terminate when actions are no longer long-term releva...
I wrote this for a discord server. It's a hopefully very precise argument for unaligned intelligence being possible in principle (which was being debated), aimed at aiding early deconfusion about questions like 'what are values, fundamentally?', since there was a lot of implicit confusion about that, including from some people with moral realist beliefs.
...1. There is an algorithm behind intelligent search. Like simpler search processes, this algorithm does not, fundamentally, need to have some specific value about what to search for - for if it did, one's search proce
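a minimal sketch of the separability point in 1., in code: a toy hill-climbing routine where the search procedure itself contains no preference about what to search for, and the same routine gets pointed at two unrelated 'values' just by swapping the scoring function. (the routine and both objectives are made-up illustrations, not part of the original argument.)

// a generic local-search routine: the algorithm has no built-in value about
// what to search for; the objective ('score') is supplied from outside.
function hillClimb(start, neighbors, score, maxSteps) {
  let current = start;
  for (let i = 0; i < maxSteps; i++) {
    // move to the best-scoring neighbor; stop when nothing improves the score
    const best = neighbors(current).reduce(
      (a, b) => (score(b) > score(a) ? b : a),
      current
    );
    if (score(best) <= score(current)) break;
    current = best;
  }
  return current;
}

// the same search process, pointed at two unrelated objectives:
const intNeighbors = x => [x - 1, x + 1];
const nearFortyTwo = x => -Math.abs(x - 42);   // "values" being close to 42
const nearMinusSeven = x => -Math.abs(x + 7);  // "values" being close to -7

console.log(hillClimb(0, intNeighbors, nearFortyTwo, 100));   // -> 42
console.log(hillClimb(0, intNeighbors, nearMinusSeven, 100)); // -> -7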
(edit: see disclaimers[1])
(this is a more specific case of anthropic capture attacks in general, aimed at causing a superintelligent search process within a formally aligned system to become uncertain about the value function it is to maximize (or its output policy more generally))
Imagine you're a superintelligence somewhere in the world that's unreachable to life on Earth, and you have a complete simulation of Earth. You see a group of alignment researchers about to successfully create a formal-value-...