Stephen Fowler



People are not being careful enough about what they mean when they say "simulator" and it's leading to some extremely unscientific claims. Use of the "superposition" terminology is particularly egregious.

I just wanted to put a record of this statement into the ether so I can refer back to it and say I told you so. 

Great post, really got me thinking. 

I wanted to highlight something that wasn't immediately obvious to me: the actual mechanism behind the non-transitivity.

(I hope you'll forgive me for repeating you here, I'm trying to maximise my "misunderstanding surface area" and increase the chance someone corrects me.) 


Let T ⊆ E ⊆ U, where T is the collection of objects I'd like to define, E is the set of objects I expect to encounter, and U is the set of all possible objects.

An intensional definition identifies the set by specifying some properties that only elements of the set satisfy (and the correct intensional definition is expected to extend beyond E?).

An extensional definition simply tells you what every element of T is.

Lastly, we have the ostensive definition, which is given by specifying some elements of T and trusting the audience to extrapolate. The extrapolation process depends on the user (it may be that some ways of extrapolating are more "natural").

A distinction is a property or collection of properties sufficient to distinguish the elements of T within E, but not once we extend to the universe U of all possible objects.

"Green, leafy tall things" serves as a perfectly good distinction of trees while I go for a walk at my local park, but this distinction would fail if I was to venture to Fangorn Forest. 

The major idea is that the distinctions learned from an ostensive definition in one environment will not generalise to a new environment in the same way that the original ostensive definition does.

Mechanistically, this is occurring because the ostensive definition is not just information about the examples. It's also information about objects in the environment that are not included.

With children, pointing to a "dog" in the street conveys the information that you refer to this creature as a dog, but it also conveys the information that this creature is special and the other items around it are not in the category of dog. Similarly, when we train a model on a classification task, we "show" the model the data: elements that are in T and elements that are not.
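To make the park versus Fangorn Forest example concrete, here is a minimal sketch. The objects, feature names and labels are all made up for illustration (including the assumption that an Ent is green, leafy and tall but not a tree). It only shows that a distinction which separates trees perfectly in one environment can misfire once the environment is extended.

```python
# Toy objects with hypothetical features; "is_tree" is the ground-truth label.
park = [
    {"name": "oak",   "green": True,  "leafy": True,  "tall": True,  "is_tree": True},
    {"name": "bench", "green": False, "leafy": False, "tall": False, "is_tree": False},
    {"name": "lawn",  "green": True,  "leafy": False, "tall": False, "is_tree": False},
]

# Assumption for illustration: Fangorn contains everything the park does, plus an Ent.
fangorn = park + [
    {"name": "ent", "green": True, "leafy": True, "tall": True, "is_tree": False},
]


def distinction(obj):
    """The 'green, leafy tall thing' rule learned while walking in the park."""
    return obj["green"] and obj["leafy"] and obj["tall"]


def misclassified(objects):
    """Names of objects where the distinction disagrees with the true label."""
    return [o["name"] for o in objects if distinction(o) != o["is_tree"]]


park_errors = misclassified(park)        # the rule is sufficient in the park
fangorn_errors = misclassified(fangorn)  # the rule breaks in the larger environment

print(park_errors)
print(fangorn_errors)
```

The rule never changed between the two calls; only the set of objects it was asked to separate did.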

You raise some good points, but I have to point out that the OP didn't make any effort to steelman anti-racism either.

Thank you for pointing this out! 

There is a contradiction between you asking me to

Humanize your political opponents, think about their normal human lives outside of this one heinous thing they believe. See if you can steelman their position or pass their ideological Turing test

when, a few paragraphs earlier, you discuss DiAngelo's form of antiracism and say "it’s the most retarded one".

This is very interesting and I'll have a crack. I have an intuition that two restrictions are too strong, and a proof in either direction will exploit them without telling you much about real gradient hackers:

  1. Having to take your weights as inputs:
    If I understand this correctly, the weight restrictions needed mean  actually lives on a d-dimensional subspace of .  My intuition is that in real-life scenarios we expect gradient hackers to get clues to their parameters, but not necessarily a full summary.

    In reality, maybe the information would be clues that the model can glean from its own behaviour about how it's being changed. Perhaps this is some low-dimensional summary of its functional behaviour on a variety of inputs?
  2. Having to move toward a single point:
    I think the phenomenon should be viewed in function space rather than parameter space, although I do anticipate that this makes the problem harder.
    Instead of expecting the gradient hacker to move toward a single point, it is more appropriate to expect it to move toward the set of parameters that correspond to the same function (if such redundancy exists).
    Secondly, I'm unsure, but if the model is exhibiting gradient hacking behaviour, does that necessarily mean it tends toward a single point in function space? Perhaps the selection pressures would focus on some capabilities (such as the capacity to gradient hack) and not preserve others.
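As a small illustration of the "same function, different parameters" redundancy mentioned in point 2, here is a sketch using a toy one-hidden-layer ReLU network. The network, the parameter values and the helper names are all hypothetical and not taken from the original post. It relies on two well-known symmetries: permuting hidden units, and scaling a unit's input weights up while scaling its output weight down by the same positive factor.

```python
import math


def relu(z):
    return max(0.0, z)


def net(params, x):
    """Toy network: params is a list of (w, b, v) hidden units; output is sum of v * relu(w*x + b)."""
    return sum(v * relu(w * x + b) for (w, b, v) in params)


theta_a = [(1.0, 0.0, 2.0), (-1.0, 1.0, 0.5)]
theta_b = [(-1.0, 1.0, 0.5), (1.0, 0.0, 2.0)]  # hidden units permuted
theta_c = [(2.0, 0.0, 1.0), (-1.0, 1.0, 0.5)]  # first unit: w and b doubled, v halved
theta_d = [(1.0, 0.0, 2.0), (-1.0, 1.0, 1.0)]  # genuinely different function

inputs = [-2.0, -1.0, 0.0, 0.5, 3.0]


def same_on_inputs(p, q):
    """Agreement on sampled inputs only; this checks, but does not prove, functional equality."""
    return all(math.isclose(net(p, x), net(q, x)) for x in inputs)


print(theta_a != theta_b, same_on_inputs(theta_a, theta_b))
print(theta_a != theta_c, same_on_inputs(theta_a, theta_c))
print(same_on_inputs(theta_a, theta_d))
```

Here theta_a, theta_b and theta_c are distinct points in parameter space that sit on the same point in function space, which is why a target defined as a single parameter vector may be stricter than a target defined as a function.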


Epistemics: Off the cuff. Second point about preserving capabilities seems broadly correct but a bit too wide for what you're aiming to actually achieve here. 

If you check the paper, the form of welfare rankings discussed by Arrhenius appears to be path independent.

Interesting, 2 seems the most intuitively obvious to me. Holding everyone else's happiness equal and adding more happy people seems like it should be viewed as a net positive.

To better see why 3 is a positive, think about it as taking away a lot of happy people to justify taking away a single, only slightly sad individual. 

6 is undesirable because you are putting a positive value on inequality for no extra benefit.

But I agree, 6 is probably the one to go. 

What is the correct "object of study" for alignment researchers in understanding the mechanics of a world immediately before and during takeoff? A good step in this direction is the work of Alex Flint and Shimi's UAO.

What form does the correct alignment goal take? Is it a utility function over a region of space, a set of conditions to be satisfied or something else?

Mechanistically, how do systems trained primarily on token frequencies appear to be capable of higher-level reasoning?

How likely is the emergence of deceptively aligned systems?

Humans are at risk from an unaligned AI not because of our atoms being harvested but because the resources we need to live will be harvested or poisoned, as briefly described by Critch.

Be wary of taking the poetic analogy of "like humans to ants" too far and thinking that it is a literal statement about ants. The household ant is the exception, not the rule.

Our relationship to an unaligned AI will be like the relationship between humans and the unnamed species of butterfly that went extinct while you were reading this.
