# Linda Linsefors

Hi, I am a Physicist, an Effective Altruist and AI Safety student/researcher.

# Linda Linsefors's Posts


“embedded self-justification,” or something like that

The way I understand your division of floors and ceilings, the ceiling is simply the highest level of meta there is, and the agent *typically* has no way of questioning it. The ceiling is just "what the algorithm is programmed to do". Alpha Go is hard-coded to update the network weights in a certain way in response to the training data.

What you call the floor for Alpha Go, i.e. the move evaluations, is not even a boundary (in the sense nostalgebraist defines it); it is just the object-level (no meta at all) policy.

I think this structure will be the same for any known agent algorithm, where by "known" I mean "we know how it works", rather than "we know that it exists". However, humans seem to be different? When I try to introspect, it all seems to be mixed up, with object-level heuristics influencing meta-level updates. The ceiling and the floor are all mixed together. Or maybe not? Maybe we are just the same, i.e. having a definite, hard-coded, highest level of meta. Some evidence of this is that sometimes I just notice emotional shifts and/or decisions being made in my brain, and I just know that no normal reasoning I can do will have any effect on this shift/decision.

Vanessa Kosoy's Shortform

I agree that you can assign whatever belief you want (e.g. whatever is useful for the agent's decision-making process) for what happens in the counterfactual when Omega is wrong, in decision problems where Omega is assumed to be a perfect predictor. However, if you want to generalise to cases where Omega is an imperfect predictor (as you do mention), then I think you will (in general) have to put in the correct reward for Omega being wrong, because this is something that might actually be observed.
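A minimal sketch of this point, assuming a Newcomb-style problem as the example. The payoff numbers and the names `expected_payoff` and `newcomb_payoffs` are hypothetical and not taken from the discussion above. The two "Omega is wrong" cells of the payoff table are left as parameters, so they can be filled in either with plausible rewards or with arbitrary values.

```python
def expected_payoff(action, accuracy, payoff):
    """Expected payoff of `action` when Omega predicts it correctly
    with probability `accuracy`.

    `payoff[(action, prediction)]` is the reward for taking `action`
    when Omega predicted `prediction`.
    """
    other = "two-box" if action == "one-box" else "one-box"
    return (accuracy * payoff[(action, action)]
            + (1 - accuracy) * payoff[(action, other)])


def newcomb_payoffs(wrong_one_box, wrong_two_box):
    """Payoff table with the two 'Omega is wrong' cells as parameters.

    All numbers are illustrative assumptions.
    """
    return {
        ("one-box", "one-box"): 1_000_000,      # Omega is right
        ("two-box", "two-box"): 1_000,          # Omega is right
        ("one-box", "two-box"): wrong_one_box,  # Omega is wrong
        ("two-box", "one-box"): wrong_two_box,  # Omega is wrong
    }


actions = ("one-box", "two-box")
correct = newcomb_payoffs(0, 1_001_000)  # plausible rewards when Omega errs
arbitrary = newcomb_payoffs(-5, 123)     # made-up rewards when Omega errs

# With a perfect predictor the "wrong" cells get weight zero,
# so both tables give identical expected payoffs.
perfect_same = all(
    expected_payoff(a, 1.0, correct) == expected_payoff(a, 1.0, arbitrary)
    for a in actions
)

# With an imperfect predictor the "wrong" cells carry weight,
# so the choice of values there changes the expected payoffs.
imperfect_same = all(
    expected_payoff(a, 0.9, correct) == expected_payoff(a, 0.9, arbitrary)
    for a in actions
)

print(perfect_same, imperfect_same)
```

Under these assumptions, the counterfactual entries are free parameters only when the accuracy is exactly 1. As soon as the accuracy drops below 1, they enter the expected value and have to match what would actually be observed.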

All I know is Goodhart

Whether this works or not is going to depend heavily on what looks like.

Given , i.e. , what does this say about ?

The answer depends on the amount of mutual information between , and . Unfortunately, the more generic is (i.e. any function is possible), the less mutual information there will be. Therefore, unless we know some structure about , the restriction to is not going to do much. The agent will just find a very different policy that also achieves very high in some very Goodharty way, but does not get penalized because low value for on is not correlated with low value on .

This could possibly be fixed by adding assumptions of the type for any that does too well on . That might yield something interesting, or it might just be a very complicated way of specifying a satisficer, I don't know.

TAISU 2019 Field Report

Mainly that we had two scheduling sessions, one on the morning of the first day and one on the morning of the third day. At each scheduling session, it was only possible to add activities for the upcoming two days.

At the start of the unconference, I encouraged people to think of it as a 2 day event and to try to put everything they really wanted to do into the first two days. On the morning of day three, the schedule was cleared to let people add sessions about topics that were alive to them at that time.

The main reason for this design choice was to allow continued/deeper conversation. If ideas were created during the first half, I wanted there to be space to keep talking about those ideas.

Also, some people only attended the last two days, and this set up guaranteed they would get a chance to add things to the schedule too. But that could also have been solved in other ways, so that was not a crux for my design choice.

Conceptual Problems with UDT and Policy Selection

I think UDT1.1 has two fundamentally wrong assumptions built in.

1) Complete prior: UDT1.1 follows the policy that is optimal according to its prior. This is incomputable in general settings and will have to be approximated somehow. But even an approximation of UDT1.1 assumes that UDT1.1 is at least well defined. However, in some multi-agent settings, or when the agent is being fully simulated by the environment, or in any other setting where the environment is necessarily bigger than the agent, UDT1.1 is ill defined.

2) Free will: In the problem Agent Simulates Predictor, the environment is smaller than the agent, so it falls outside the above point. Here instead I think the problem is that the agent assumes that it has free will, when in fact it behaves in a deterministic manner.

The problem of free will in Decision Problems is even clearer in the smoking lesion problem:

You want to smoke and you don't want cancer. You know that people who smoke are more likely to get cancer, but you also know that smoking does not cause cancer. Instead, there is a common cause, some gene, that happens to both increase the risk of cancer and make a person with this gene more likely to choose to smoke. You cannot test if you have the gene.

Say that you decide to smoke, because either you have the gene or not, so you might as well enjoy smoking. But what if everyone thought like this? Then there would be no correlation between the cancer gene and smoking. So where did the statistics about smokers getting cancer come from (in this made-up version of reality)?
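The "what if everyone thought like this?" step can be checked with a few lines of Bayes' rule. This is only a sketch: the gene frequency, the smoking probabilities and the cancer risks below are made-up numbers, and the function names are hypothetical.

```python
def p_gene_given_smoke(p_gene, p_smoke_given_gene, p_smoke_given_no_gene):
    """Bayes' rule: probability of carrying the gene, given that a person smokes."""
    p_smoke = (p_gene * p_smoke_given_gene
               + (1 - p_gene) * p_smoke_given_no_gene)
    return p_gene * p_smoke_given_gene / p_smoke


def cancer_risk_given_smoke(p_gene, p_smoke_given_gene, p_smoke_given_no_gene,
                            risk_gene=0.3, risk_no_gene=0.05):
    """Cancer risk among smokers when only the gene (not smoking) causes cancer."""
    g = p_gene_given_smoke(p_gene, p_smoke_given_gene, p_smoke_given_no_gene)
    return g * risk_gene + (1 - g) * risk_no_gene


P_GENE = 0.5  # assumed gene frequency in the population

# Population average cancer risk, using the default risk values above.
baseline_risk = P_GENE * 0.3 + (1 - P_GENE) * 0.05

# World A: the gene influences the decision to smoke.
risk_if_gene_matters = cancer_risk_given_smoke(P_GENE, 0.8, 0.2)

# World B: everyone reasons "I might as well smoke" and smokes regardless.
risk_if_all_smoke = cancer_risk_given_smoke(P_GENE, 1.0, 1.0)

print(baseline_risk, risk_if_gene_matters, risk_if_all_smoke)
```

With these assumed numbers, smokers in world A have a higher cancer risk than the population average, while in world B being a smoker carries no information about the gene, so the risk among smokers equals the baseline. The observed statistics can therefore only come from a world where the decision to smoke depends on the gene for at least some people.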

If you are the sort of person who smokes no matter what, then either:

a) You are sufficiently different from most people such that the statistics do not apply to you.

or

b) The cancer gene is correlated with being the sort of person that has a decision process that leads to smoking.

If b is correct, then maybe you should be the sort of algorithm that decides not to smoke, so as to increase the chance of being implemented in a brain that lives in a body with less risk of cancer. But if you start thinking like that, then you are also giving up your hope of affecting the universe, and resigning yourself to just choosing where you might find yourself, and I don't think that is what we want from a decision theory.

But there also seems to be no good way of thinking about how to steer the universe without pretending to have free will. But since that is actually a false assumption, there will be weird edge cases where your reasoning breaks down.

Minimization of prediction error as a foundation for human values in AI alignment

Do you agree with my clarification?

Because what you are trying to say makes a lot of sense to me, if and only if I replace "prediction" with "set point value" for cases when the so-called prediction is fixed.

Set point (control system vocabulary) = Intention/goal (agent vocabulary)

Minimization of prediction error as a foundation for human values in AI alignment

It seems like people are talking in circles around each other in these comments, and I think the reason is that Gordon and other people who like predictive processing theory are misusing the word "prediction".

By misuse I mean clearly deviating from common use. I don't really care about sticking to common use, but if you deviate from the expected meaning of a word it is good to let people know.

Let's say I have a model of the future in my head. If I try to adjust the model to fit reality, this model is a prediction. If I try to fit reality to my model, this model is an intention.

If you have a control system that tries to minimise "prediction error" with respect to a "prediction" that it is not able to change, so that the system resorts to changing reality instead, then that is not really a prediction anymore.
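The distinction can be sketched with a toy one-variable system. This is an illustrative assumption, not a model from predictive processing theory: `model` and `world` are single numbers, and the hypothetical flag `model_is_fixed` decides which of the two is the variable and which is the constant when the error is reduced.

```python
def step(model, world, rate, model_is_fixed):
    """One error-reducing step; returns the updated (model, world)."""
    error = world - model
    if model_is_fixed:
        # Intention / set point: the model cannot change,
        # so the system acts on the world instead.
        world -= rate * error
    else:
        # Prediction: the world is taken as given,
        # so the system updates the model instead.
        model += rate * error
    return model, world


def run(model, world, model_is_fixed, steps=50, rate=0.5):
    """Repeat `step` and return the final (model, world)."""
    for _ in range(steps):
        model, world = step(model, world, rate, model_is_fixed)
    return model, world


# Same starting point and same error signal in both cases.
as_prediction = run(model=20.0, world=15.0, model_is_fixed=False)
as_intention = run(model=20.0, world=15.0, model_is_fixed=True)

print(as_prediction)  # the model has moved towards the world
print(as_intention)   # the world has moved towards the model
```

Both runs drive the same error towards zero, but they end up in different places: in the first the model converges to the world's value, in the second the world converges to the model's value. In control system vocabulary, the fixed `model` in the second run plays the role of a set point.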

As I understand it, predictive processing theory suggests that both updating predictions and executing intentions are optimising for the same thing, which is aligning reality with my internal model. However, there is an important difference, which is what are variables and what are constants in solving that problem. Gordon mentions in some places that sometimes "predictions" can't be updated.

This means that it won't always be the case that a control system is globally trying to minimize prediction error, but instead is locally trying to minimize prediction error, although it may not be able to become less wrong over time because it can't change the prediction to better predict the input.

There is probably some actual disagreement here (in this comment section) too, but we will not figure that out if we don't agree on what words mean first.

Minimization of prediction error as a foundation for human values in AI alignment

I have not read all the comments yet, so maybe this is redundant, but anyway...

I think it is plausible that humans and other life forms are mostly made up of layers of control systems, stacked on each other. However, it does not follow from this that humans are trying to minimise prediction error.

There is probably some part of the brain that is trying to minimise prediction error, possibly organised as a control system that tries to keep expectations in line with reality, because it is useful to be able to accurately predict the world.

But if we are a stack of control systems, then I would expect other parts of the brain to be control systems for other things, e.g. having the correct level of blood sugar, having a good amount of social interaction, having a good amount of variety in our lives.

I can imagine someone figuring out more or less how the prediction control system works and what it is doing, then looking at everything else, noticing the similarity (because it is all types of control systems and evolution tends to reuse structures) and thinking "Hmm, maybe it is all about predictions". But I also think that would be wrong.

1st Athena Rationality Workshop - Retrospective

We are currently deciding between:

a) Running a second Athena Workshop, similar to the first one, i.e. teaching a broad range of techniques for solving internal conflicts.

b) Running a workshop specifically focused on overcoming procrastination

c) Doing both

If you have any preferences, let me know.

1st Athena Rationality Workshop - Retrospective

Our current goal is to gather more information. Which methods should we teach and how should we teach them? Is what we are teaching actually useful? To find this out we intend to:

1) Run various versions of the workshop

2) Experiment with various forms of online teaching

3) Follow up with participants about what has been useful to them

We have also made a strategic decision to mostly learn from our own experiences, to hopefully find new local optima for what and how to teach these types of things.

Because of the stage we are at, the quickest way for you to get more information about techniques would be to attend one of the workshops. We would like to eventually do something more scalable (e.g. releasing video lectures), but first we'll need to do a lot more testing.