I've written a new article about defining "optimizer". I was wondering if someone could look it over and tell me what they think before I post it on Less Wrong. You can find it here.
There is a matter I'm confused about: What exactly is base-level reality, does it necessarily exist, and is it ontologically different from other constructs?
First off, I had gotten the impression that there was a base-level reality, and that in some sense it's ontologically different from the sorts of abstractions we use in our models. I thought that, in some sense, the subatomic particles "actually" existed, whereas our abstractions, like chairs, were "just" abstractions. I'm not actually sure how I got this impression, but I had the sense that other peop... (read more)
I made a post proposing a new alignment technique. I didn't get any responses, but it still seems like a reasonable idea to me, so I'm interested in hearing what others think of it. I think the basic idea of the post, if correct, could be useful for future study. However, I don't want to waste time on further study if the idea is unworkable for a reason I hadn't thought of.
(If you're interested, please read the post before reading below.)
Of course, the idea's not a complete solution to alignment, and things have a risk of going catastrophically wrong due to... (read more)
I found what seems to be a potentially dangerous false negative in the most popular definition of "optimizer". I didn't get a response, so I would appreciate feedback on whether it's reasonable. I've been focusing on defining "optimizer", so I think feedback would help me a lot. You can see my comment here.
I recently posted a question asking whether iterated amplification was actually more powerful than mere mimicry, and arguing that it was not. I had thought I was making a pretty significant point, but the post attracted very little attention. I'm not saying this is a bad thing, but I'm not really sure why it happened, so I would appreciate some insight about how I can contribute more usefully.
Iterated amplification seems to be the leading proposal for creating aligned AI, so I thought a post arguing against it, if correct, would be a useful contribution... (read more)
I have an idea for reasoning about counterpossibles for decision theory. I'm pretty skeptical that it's correct, because it doesn't seem that hard to come up with. Still, I can't see a problem with it, and I would very much appreciate feedback.
This paper provides a method of describing UDT using proof-based counterpossibles. However, it doesn't work in stochastic environments. I will describe a new system that is intended to fix this. The technique seems sufficiently straightforward to come up with that I suspect I'm either doing something wrong or this ha... (read more)
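In case it helps clarify what I'm building on, here is a toy sketch of how I understand the proof-based approach (my own illustration, not the paper's construction; the proof search is replaced by a hypothetical lookup table of hand-written implications):

```python
from typing import Optional

# Hypothetical stand-in for a theorem prover: statements of the form
# "Agent() = a  implies  U() = u" that we pretend have been proved.
PROVABLE_IMPLICATIONS = {
    "one_box": 1_000_000,  # e.g. a Newcomb-like problem
    "two_box": 1_000,
}

def provable_utility(action: str) -> Optional[int]:
    """Return u if 'Agent() = action implies U() = u' is (pretend-)provable, else None."""
    return PROVABLE_IMPLICATIONS.get(action)

def agent(actions: list[str]) -> Optional[str]:
    """Take the action whose provable counterpossible implies the highest utility."""
    best_action, best_utility = None, float("-inf")
    for a in actions:
        u = provable_utility(a)
        if u is not None and u > best_utility:
            best_action, best_utility = a, u
    return best_action

print(agent(["one_box", "two_box"]))  # prints: one_box
```

The difficulty with stochastic environments, as I understand it, is that there is generally no single utility value u for which such an implication is provable at all.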
I'd like to propose the idea of aligning AI by reverse-engineering its world model and using this to specify its behavior or utility function. I haven't seen this discussed before, but I would greatly appreciate feedback or links to any past work on this.
For example, suppose a smart AI models humans, and suppose its model explicitly specifies the humans' preferences. Then people who reverse-engineered this model could use it to specify the AI's preferences. If the AI lacks a model with explicit preferences, then I think it would still contain an accurate mod... (read more)
I've recently gotten concerned about the possibility that advanced AIs would "hack" their own utility function. I haven't seen this discussed before, so I wanted to bring it up. If I'm right, this seems like it could be a serious issue, so I would greatly appreciate feedback or links to any previous discussion.
Suppose you come up with a correct, tractable mathematical specification of what you want your AI's utility function to be. You then write code intended to implement this specification.
However, computers are vulnerable to some hardware probl... (read more)
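To give a sense of why this worries me, here's a toy illustration of my own (not tied to any particular AI system): a single flipped bit in the stored representation of a utility value can change it beyond recognition.

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its 64-bit floating-point representation flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

utility = 1.0                       # the value the specification says we should get
corrupted = flip_bit(utility, 62)   # one flipped bit in the exponent
print(utility, corrupted)           # prints: 1.0 inf
```

This is only meant to show how small a physical fault needs to be to change what the implemented utility function computes.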
There's a huge gulf between "far-fetched" and "quite likely".
The two big ones are failure to work out how to create an aligned AI at all, and failure to train and/or code a correctly designed aligned AI. In my opinion the first accounts for at least 80% of the probability mass, and the second for most of the remainder. We utterly suck at writing reliable software in every field, and this has been amply borne out in not just thousands of failures, but thousands of types of failures.
By comparison, we're fairly good at creating at least moderately reliable hardware, and most of the accidental failure modes are fatal to the running software. Flaws like rowhammer are mostly attacks, where someone puts a great deal of intelligent effort into finding an extremely unusual operating mode in which some assumptions can be bypassed, by putting significant effort into creating exactly the wrong operating conditions.
There are some examples of accidental flaws that affect hardware and aren't fatal to its running software, but they're an insignificant fraction of the number of failures due to incorrect software.
I was wondering if anyone would be interested in reviewing some articles I'm thinking about posting. I'm trying to make them as high-quality as I can, and I think having someone review them would help keep the content I post to Less Wrong high-quality.
I have four articles I'm interested in having reviewed. Two are about new alignment techniques, one is about a potential danger with AI that I haven't seen discussed before, and one is about the simulation argument. All are fairly short.
If you're interested, just let me know and I can share drafts of any articles you would like to see.
I've read this paper on low-impact AIs. There's something about it that I'm confused and skeptical about.
One of the main methods it proposes works as follows. Find a probability distribution over many possible variables in the world. Let X represent the statement "The AI was turned on". For each of the variables v it considers, the probability distribution over v after conditioning on X should look about the same as the probability distribution over v after conditioning on not-X. That's low impact.
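To make sure I'm picturing the method correctly, here's a rough numerical sketch (my own toy illustration, not from the paper; the distance measure and the way the per-variable mismatches are combined are assumptions I made for concreteness):

```python
# Toy sketch: measure how different the distribution over each variable v looks
# after conditioning on X ("the AI was turned on") versus conditioning on not-X.

def total_variation(p: dict, q: dict) -> float:
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical variables, each with a distribution given X and a distribution given not-X.
variables = {
    "room_temperature": ({"cool": 0.5, "warm": 0.5}, {"cool": 0.5, "warm": 0.5}),
    "stock_prices":     ({"up": 0.7, "down": 0.3},   {"up": 0.5, "down": 0.5}),
}

impact = sum(total_variation(p_given_x, p_given_not_x)
             for p_given_x, p_given_not_x in variables.values())
print(impact)  # staying near zero is what I take the paper to mean by low impact
```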
But the paper doesn't mention conditioning on any evide... (read more)
I'm questioning whether we would actually want to use Updateless Decision Theory, Functional Decision Theory, or future decision theories like them.
I think that in sufficiently extreme cases, I would act according to Evidential Decision Theory and not according to something like UDT, FDT, or any similar successor. And I think I would continue to want to take the action recommended by evidential decision theory instead, even if I had arbitrarily high intelligence and willpower and had infinitely long to think about it. And, though I'd like to hear others' thought... (read more)
I'm wondering how, in principle, we should deal with malign priors. Specifically, I'm wondering what to do about the possibility that reality itself is, in a sense, malign.
I had previously said that it seems really hard to verifiably learn a non-malign prior. However, now I've realized that I'm not even sure what a non-malign, but still reliable, prior would even look like.
In previous discussion of malign priors, I've seen people talk about the AI misbehaving due to thinking it's embedded in a simpler universe than our own that was controlled by ag... (read more)
I've been reading about logical induction. I read that logical induction was considered a breakthrough, but I'm having a hard time understanding its significance. In particular, I'm having trouble seeing how it outperforms what I call "the naive approach" to logical uncertainty. I imagine there is some sort of notable benefit of it I'm missing, so I would very much appreciate some feedback.
First, I'll explain what I mean by "the naive approach". Consider asking an AI developer with no special background in reasoning under logical uncertainty how to make an algorithm... (read more)
I've thought of a way in which other civilizations could potentially "hack" Updateless Decision Theoretic agents on Earth in order to make them do whatever the other civilization wants them to do. I'm wondering if this has been discussed before, and if not, what people think about it.
Here I present a method that would potentially allow aliens to take control of an AI on Earth that uses Updateless Decision Theory.
Note that this crucially depends on agents that share the AI's utility function, but are in different situations, terminally valuing different things. For... (read more)
I was wondering if there has been any work on getting around specifying the "correct" decision theory by just using a more limited decision theory and adjusting terminal values to compensate.
I think we might be able to get an agent that does what we want without formalizing the right decision theory, but instead by modifying the value loading used. This way, even an AI with a simple, limited decision theory like evidential decision theory could make good choices.
I think that normally when considering value loading, people imagine finding a way... (read more)