quila

Trying to directly solve alignment.

If you disagree with me about alignment for reasons you expect I'm not aware of, please tell me.

If you have/find an idea/chain-of-logic that's genuinely novel/out-of-human-distribution, you're welcome to send it to me to 'introduce chaos into my system'.

(Most of my writing here is optimized for pure understandability rather than other, secondary things, because of this.)

Contact: message me on lesswrong, and from there we can move to another platform such as discord or matrix.org

-----BEGIN PGP PUBLIC KEY BLOCK-----

mDMEZiAcUhYJKwYBBAHaRw8BAQdADrjnsrbZiLKjArOg/K2Ev2uCE8pDiROWyTTO
mQv00sa0BXF1aWxhiJMEExYKADsWIQTuEKr6zx3RBsD/QW3DBzXQe0TUaQUCZiAc
UgIbAwULCQgHAgIiAgYVCgkICwIEFgIDAQIeBwIXgAAKCRDDBzXQe0TUabWCAP0Z
/ULuLWf2QaljxEL67w1b6R/uhP4bdGmEffiaaBjPLQD/cH7ufTuwOHKjlZTIxa+0
kVIMJVjMunONp088sbJBaQi4OARmIBxSEgorBgEEAZdVAQUBAQdAq5exGihogy7T
WVzVeKyamC0AK0CAZtH4NYfIocfpu3ADAQgHiHgEGBYKACAWIQTuEKr6zx3RBsD/
QW3DBzXQe0TUaQUCZiAcUgIbDAAKCRDDBzXQe0TUaUmTAQCnDsk9lK9te+EXepva
6oSddOtQ/9r9mASeQd7f93EqqwD/bZKu9ioleyL4c5leSQmwfDGlfVokD8MHmw+u
OSofxw0=
=rBQl
-----END PGP PUBLIC KEY BLOCK-----

I have not signed any NDAs whose existence I cannot mention.

Comments

quila104

(Personal) On writing and (not) speaking

I often struggle to find words and sentences that match what I intend to communicate.

Here are some problems this can cause:

  1. Wordings that are odd or unintuitive to the reader, but that are at least literally correct.[1]
  2. Not being able to express what I mean, and having to choose between not writing it, or risking miscommunication by trying anyway. I tend to choose the former unless I'm writing to a close friend. Unfortunately this means I am unable to express some key insights to a general audience.
  3. Writing taking lots of time: I usually have to iterate many times on words/sentences until I find one which my mind parses as referring to what I intend. In the slowest cases, I might finalize only 2-10 words per minute. Even after iterating, my words are often interpreted in ways I failed to foresee.

These apply to speaking, too. If I speak what would be the 'first iteration' of a sentence, there's a good chance it won't create an interpretation matching what I intend to communicate. In spoken language I have no chance to constantly 'rewrite' my output before sending it. This is one reason, but not the only reason, that I've had a policy of trying to avoid voice-based communication.

I'm not fully sure what caused this relationship to language. It could be that it's just a byproduct of being autistic. It could also be a byproduct of out-of-distribution childhood abuse.[2]

  1. ^

    E.g., once I couldn't find the word 'clusters,' and wrote a complex sentence referring to 'sets of similar' value functions each corresponding to a common alignment failure mode / ASI takeoff training story. (I later found a way to make it much easier to read)

  2. ^

    (Content warning)

    My primary parent was highly abusive, and would punish me for using language in the intuitive 'direct' way about particular instances of that. My early response was to try to euphemize and say things differently, in a way that less directly contradicted the power dynamic / social reality she enforced.

    Eventually I learned to model her as a deterministic system and stay silent / fawn.

quila10

i am kind of worried by the possibility that the following is not true: there is an 'ideal procedure for figuring out what is true'.

for that to be not true, it would mean that: for any (or some portion of?) task(s), the optimal way to solve it is through something like a learning/training process (in the AI sense), or other search-process-involving-checking. it would mean that there's no 'reason' behind the solution being what it is, it's just a {mathematical/logical/algorithmic/other isomorphism} coincidence.

for it to be true, i guess it would mean that there's another procedure ({function/program}) that can deduce the solution in a more 'principled'[1] way (could be more or less efficient)

more practically, it being not true would be troubling for strategies based on 'create the ideal intelligence-procedure and use it as an oracle [or place it in a formal-value-containing hardcoded-structure that uses it like an oracle]'

why do i think it's possible for it to be not true? because we currently observe training processes succeeding, but don't yet know of an ideal procedure[2]. that's all. a mere possibility, not a 'positive argument'.

  1. ^

    i don't know exactly what i mean by this

  2. ^

    in case anyone thinks 'bayes theorem / solomonoff induction!' - bayes theorem isn't it, because, for example, it doesn't alone tell you how to solve a maze. i can try to elaborate if needed
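
(to make the maze example in footnote 2 concrete: below is a minimal, hypothetical sketch of the kind of search-process-involving-checking i mean. the maze layout is an arbitrary placeholder, and the code is only an illustration, not a claim about what an 'ideal procedure' would look like; the point is that the answer is produced by generating states and checking them, rather than deduced in a 'principled' way.)

```python
from collections import deque

# arbitrary toy maze: 'S' start, 'G' goal, '#' wall, '.' open
MAZE = [
    "S.#.",
    ".##.",
    "...#",
    "#.G.",
]

def solve(maze):
    rows, cols = len(maze), len(maze[0])
    start = next((r, c) for r in range(rows) for c in range(cols) if maze[r][c] == "S")
    frontier = deque([(start, [start])])   # states to try, each with the path taken so far
    seen = {start}
    while frontier:
        (r, c), path = frontier.popleft()
        if maze[r][c] == "G":              # the 'checking' step: is this state the goal?
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and maze[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append(((nr, nc), path + [(nr, nc)]))
    return None

# the path is found by enumerating and checking candidate states,
# not by deriving it from a closed-form rule
print(solve(MAZE))
```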

quila10

(self-quote relevant to non-agenticness)

Inside a superintelligent agent - defined as a superintelligent system with goals - there must be a superintelligent reasoning procedure entangled with those goals - an 'intelligence process' which procedurally figures out what is true. 'Figuring out what is true' happens to be instrumentally needed to fulfill the goals, so agents contain intelligence, but intelligence-the-ideal-procedure-for-figuring-out-what-is-true is not inherently goal-having.

Two people I shared this with said it reminded them of 'retarget the search', and I agree it seems to be a premise of that. However, I had not previously seen it expressed clearly, and had multiple times confused others with attempts to communicate this or to leave it as an implied premise, so here is a clear statement from which other possibilities in mindspace fall out.

quila122

Note: I’m a MIRI researcher, but this agenda is the product of my own independent research, and as such one should not assume it’s endorsed by other research staff at MIRI.

That's interesting to me. I'm curious about the views of others at MIRI on this. I'm also excited for the sequence regardless.

quila22

Commenting to note that I think this quote is locally-invalid:

If the leading lab can't stop critical models from leaking to actors that won't use great deployment safety practices, approximately nothing else matters

There are other disjunctive problems with the world which are also individually sufficient for doom[1], in which case each of them matters a lot, in the absence of some fundamental solution to all of them. (A toy numeric sketch of the disjunctiveness point follows the footnote.)

  1. ^

    (e.g. lack of superintelligence-alignment/steerability progress)
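
(As a toy illustration of the disjunctiveness point: under a rough independence assumption, avoiding doom requires avoiding every individually-sufficient problem, so the risks compound multiplicatively. The problems listed and the numbers used below are arbitrary placeholders, not estimates.)

```python
# toy sketch: individually-sufficient, disjunctive problems compound.
# the listed problems and probabilities are arbitrary placeholders, not estimates.
risks = {
    "critical model weights leak to careless actors": 0.3,
    "insufficient superintelligence-alignment/steerability progress": 0.4,
    "some other individually-sufficient problem": 0.2,
}

p_avoid_all = 1.0
for p in risks.values():
    p_avoid_all *= (1.0 - p)   # must avoid each one (rough independence assumption)

print(f"P(avoiding every listed failure) = {p_avoid_all:.2f}")  # 0.34 with these placeholders
# reducing any single risk raises the product, so each problem matters a lot,
# absent some fundamental solution that addresses all of them at once.
```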

quila120

edit: i think i've received enough expressions of interest (more would have diminishing value but you're still welcome to), thanks everyone!

i recall reading in one of the MIRI posts that Eliezer believed a 'world model violation' would be needed for success to be likely.

i believe i may be in possession of such a model violation and am working to formalize it, where by formalize i mean write in a way that is not 'hard-to-understand intuitions' but 'very clear text that leaves little possibility for disagreement once understood'. it wouldn't solve the problem, but i think it would make it simpler so that maybe the community could solve it.

if you'd be interested in providing feedback on such a 'clearly written version', please let me know as a comment or message.[1] (you're not committing to anything by doing so, rather just saying "i'm a kind of person who would be interested in this if your claim is true"). to me, the ideal feedback is from someone who can look at the idea under 'hard' assumptions (of the type MIRI has) about the difficulty of pointing an ASI, and see if the idea seems promising (or 'like a relevant model violation') from that perspective.

  1. ^

    i don't have many contacts in the alignment community

quila21

(epistemic status: same as the post, I don't know neuroscience)

I misheard the song version and had a different interpretation of something, but it actually seems like a good idea to consider, so here it is :).

Remove the assumption that the biological version of oneself needs to survive the process. Then, perform a very quick scan of all the neurons on a very short timescale, without worrying about the brain being destroyed afterwards, only about accurately scanning all the neurons before that happens. Would this data of the whole brain for a very short time (maybe some fraction of a second) be enough to digitally reconstruct the mind and run it?

Maybe it wouldn't be because of neuroscience reasons I wouldn't know, or 'neural spikes over time' not being predictable from the very-short-timespan data (again for some reason I wouldn't know). Also, maybe such a fast scan is not physically possible with current technology (I'd guess that something more efficient than inserting 100 billion wires would be needed).

But if it were possible and feasible, I think it would be worth it, the world's at stake after all. I'd volunteer.

quila30

thanks for sharing. here's my thoughts on the possibilities in the quote.

Suffering subroutines - maybe 10-20% likely. i don't think suffering reduces to "pre-determined response patterns for undesirable situations," because i can think of simple algorithmic examples of that which don't seem like suffering.

suffering feels like it's about the sense of aversion/badness (often in response to a situation), and not about the policy "in <situation>, steer towards <new situation>". (maybe humans were instilled with a policy of steering away from 'suffering' states generally, and that's why evolution made us enter those states in some types of situation?). (though i'm confused about what suffering really is)

i would also give the example of positive-feeling emotions sometimes being narrowly directed. for example, someone can feel 'excitement/joy' about a gift or event and want to <go to/participate in> it. sexual and romantic subroutines can also be both narrowly-directed and positive-feeling. though these examples lack the element of a situation being steered away from, vs steering (from e.g any neutral situation) towards other ones.

Suffering simulations - seems likely (75%?) for the estimation of universal attributes, such as the distribution of values. my main uncertainty is about whether there's some other way for the ASIs to compute that information which is simple enough to be suffering-free. this also seems lower magnitude than other classes, because (unless it's being calculated indefinitely for ever-greater precision) this computation terminates at some point, rather than lasting until heat death (or forever if it turns out that's avoidable).

Blackmail - i don't feel knowledgeable enough about decision theory to put a probability on this one, but in the case where it works (or is precommitted to under uncertainty in hopes that it works), it's unfortunately a case where building aligned ASI would incentivize unaligned entities to do it.

Flawed realization - again i'm too uncertain about what real-world paths lead to this, but intuitively, it's worryingly possible if the future contains LLM-based LTPAs (long term planning agents) intelligent enough to solve alignment and implement their own (possibly simulated) 'values'.

quila30

I've replied to/written my current beliefs about this subject here

quila130

i currently believe that working on superintelligence-alignment is likely the correct choice from a fully-negative-utilitarian perspective.[1]

for others, this may be an intuitive statement or unquestioned premise. for me it is not, and i'd like to state my reasons for believing it, partially as a response to this post concerned about negative utilitarians trying to accelerate progress towards an unaligned-ai-takeover.

there was a period during which i was more uncertain about this question, and avoided openly sharing minimally-dual-use alignment research (but did not try to accelerate progress towards a nonaligned-takeover) while resolving that uncertainty.

a few relevant updates since then:

  1. a decrease in the probability that the values an aligned AI would have would endorse human-caused moral catastrophes such as human-caused animal suffering.

    i did not automatically believe humans to be good-by-default, and wanted to take time to seriously consider what i think should be a default hypothesis-for-consideration upon existing in a society that generally accepts an ongoing mass torture event.
  2. awareness of vastly worse possible s-risks.

    factory farming is a form of physical torture, by which i mean torture of a mind which is done through the indirect route of affecting its input channels (body/senses). it is also a form of psychological torture. it is very bad, but situations which are magnitudes worse seem possible, where a mind is modulated directly (on the neuronal level) and fully.

    compared to 'in-distribution suffering' (eg animal suffering, human-social conflicts), i find it further less probable that an AI aligned to some human-specified values[2] would create a future with this.

    i think it's plausible that it exists rarely in other parts of the world, though, and if so would be important to prevent through acausal trade if we can.

i am not free of uncertainty about the topic, though.

in particular, if disvalue of suffering is common across the world, such that the suffering which can be reduced through acausal trade will be reduced through acausal trade regardless of whether we create an AI which disvalues suffering, then it would no longer be the case that working on alignment is the best decision for a purely negative utilitarian.

despite this uncertainty, my current belief is that the possibility of reducing suffering via acausal trade (including possibly such really-extreme forms of suffering) outweighs the probability and magnitude of human-aligned-AI-caused suffering.[3]

also, to be clear, if it ever seems that an actualized s-risk takeover event is significantly more probable than it seems now[4] as a result of unknown future developments, i would fully endorse causing a sooner unaligned-but-not-suffering takeover to prevent it.

  1. ^

    i find it easier to write this post as explaining my position as "even for a pure negative utilitarian, i think it's the correct choice", because it lets us ignore individual differences in how much moral weight is assigned to suffering relative to everything else.

    i think it's pretty improbable that i would, on 'idealized reflection'/CEV, endorse total-negative-utilitarianism (which has been classically pointed out as implying, e.g, preferring a universe with nothing to a universe containing a robust utopia plus an instance of light suffering).

    i self-describe as a "suffering-focused altruist" or "negative-leaning-utilitarian." ie, suffering seems much worse to me than happiness seems good.

  2. ^

    (though certainly there are some individual current humans who would do this, for example to digital minds, if given the ability to do so. rather, i'm expressing a belief that it's very probable that an aligned AI which practically results from this situation would not allow that to happen.)

  3. ^

    (by 'human-aligned AI', I mean one pointed to an actual CEV of one human or of a group of humans (which could indirectly imply the 'CEV of everyone', but without the risk of not actually being that and failing in the way described below, and without allowing the cruel values of some individuals to enter into it).

    I don't mean an AI aligned to some sort of 'current institutional process', like voting, involving all living humans -- I think that should be avoided due to politicization risk and potential for present/unreflective(/by which i mean cruel)-values lock-in.)

  4. ^

    there's some way to formalize with bayes equations how likely, from a negative-utilitarian perspective, an s-risk needs to be (relative to a good outcome) to terminate a timeline.

    it would take as input probability distributions related to 'the frequency of suffering-disvalue across the universal distribution of ASIs' and 'the frequency of various forms of s-risks that are preventable with acausal trade'. i might create this formalization later.

    if we think there are pretty certainly more preventable-through-trade-type suffering-events than there are altruistic ASIs to prevent them, a local preventable-type s-risk might actually need to be 'more likely than the good/suffering-disvaluing outcome'.
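
(a very rough point-estimate sketch of the comparison in footnote 4, to make it concrete: a real formalization would use the probability distributions mentioned above rather than single numbers, and every value and simplification below is an arbitrary placeholder, not an estimate.)

```python
# rough point-estimate version of the footnote-4 comparison, from a purely
# negative-utilitarian view. every number is an arbitrary placeholder; a real
# formalization would use distributions over these quantities.

p_good  = 0.2    # P(this timeline yields an aligned, suffering-disvaluing ASI)
p_srisk = 0.01   # P(this timeline yields an actualized s-risk takeover)

s_local = 1.0    # magnitude of the local s-risk outcome (normalized units of suffering)
marginal_prevented = 1.5   # suffering elsewhere that *our* aligned ASI would prevent via
                           # acausal trade and that other altruistic ASIs would not already
                           # prevent (large if preventable suffering-events outnumber
                           # altruistic ASIs, small otherwise)

# expected suffering of continuing alignment work, relative to the baseline of
# causing a sooner unaligned-but-not-suffering takeover now:
expected_delta = p_srisk * s_local - p_good * marginal_prevented

if expected_delta < 0:
    print("continue working on alignment")             # it reduces expected suffering
else:
    print("prefer the sooner non-suffering takeover")  # the local s-risk dominates

# when marginal_prevented is comparable to s_local, terminating only wins when
# p_srisk approaches or exceeds p_good -- the 'more likely than the good outcome' case.
```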
