John_Maxwell

Comments

Testing The Natural Abstraction Hypothesis: Project Intro

I'm glad you are thinking about this. I am very optimistic about AI alignment research along these lines. However, I'm inclined to think that the strong form of the natural abstraction hypothesis is pretty much false. Different languages and different cultures, and even different academic fields within a single culture (or different researchers within a single academic field), come up with different abstractions. See for example lsusr's posts on the color blue or the flexibility of abstract concepts. (The Whorf hypothesis might also be worth looking into.)

This is despite humans having pretty much identical cognitive architectures. (It seems unrealistic to assume that we can create a de novo AGI with a cognitive architecture as similar to a human brain as human brains are to each other.) Perhaps you could argue that some human-generated abstractions are "natural" and others aren't, but that leaves the problem of ensuring that the human operating our AI is making use of the correct, "natural" abstractions in their own thinking. (Some ancient cultures lacked a concept of the number 0. From our perspective, and that of a superintelligent AGI, 0 is a 'natural' abstraction. But there could be ways in which the superintelligent AGI invents a 'natural' abstraction that we haven't yet invented, such that we are living in a "pre-0 culture" with respect to this abstraction, and this would cause an ontological mismatch between us and our AGI.)

But I'm still optimistic about the overall research direction. One reason is that if your dataset contains human-generated artifacts, e.g. pictures with captions written in English, then many unsupervised learning methods will naturally be incentivized to learn English-language abstractions to minimize reconstruction error. (For example, if we're using self-supervised learning, our system will be incentivized to correctly predict the English-language caption beneath an image, which essentially requires the system to understand the picture in terms of English-language abstractions. This incentive would also arise for the more structured supervised learning task of image captioning, but the results might not be as robust.)

This is the natural abstraction hypothesis in action: across the sciences, we find that low-dimensional summaries of high-dimensional systems suffice for broad classes of “far-away” predictions, like the speed of a sled.

Social sciences are a notable exception here. And I think social sciences (or even humanities) may be the best model for alignment--'human values' and 'corrigibility' seem related to the subject matter of these fields.

Anyway, I had a few other comments on the rest of what you wrote, but I realized what they all boiled down to was me having a different set of abstractions in this domain than the ones you presented. So as an object lesson in how people can have different abstractions (heh), I'll describe my abstractions (as they relate to the topic of abstractions) and then explain how they relate to some of the things you wrote.

I'm thinking in terms of minimizing some sort of loss function that looks vaguely like

reconstruction_error + other_stuff

where reconstruction_error is a measure of how well we're able to recreate observed data after running it through our abstractions, and other_stuff is the part that is supposed to induce our representations to be "useful" rather than just "predictive". You keep talking about conditional independence as the be-all-end-all of abstraction, but from my perspective, it is an interesting (potentially novel!) option for the other_stuff term in the loss function. The same way dropout was once an interesting and novel other_stuff which helped supervised learning generalize better (making neural nets "useful" rather than just "predictive" on their training set).

The most conventional choice for other_stuff would probably be some measure of the complexity of the abstraction. E.g. a clustering algorithm's complexity can be controlled through the number of centroids, or an autoencoder's complexity can be controlled through the number of latent dimensions. Marcus Hutter seems to be as enamored with compression as you are with conditional independence, to the point where he created the Hutter Prize, which offers up to half a million euros for improvements in compressing a 1GB file of Wikipedia text.
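To make the reconstruction_error + other_stuff framing concrete, here is a minimal sketch in which other_stuff is a complexity penalty on the number of centroids. The tiny 1-D k-means, its initialization scheme, the toy data, and the penalty value are all assumptions made for illustration, not anything taken from the post.

```python
def kmeans_1d(points, k, iters=20):
    """Tiny 1-D k-means. Assumption: centroids start at evenly spaced
    positions in the sorted data so the result is deterministic."""
    pts = sorted(points)
    centroids = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids


def reconstruction_error(points, centroids):
    """Squared error when each point is replaced by its nearest centroid."""
    return sum(min((p - c) ** 2 for c in centroids) for p in points)


def total_loss(points, k, penalty_per_centroid):
    """reconstruction_error + other_stuff, where other_stuff is a
    hypothetical complexity penalty proportional to the centroid count."""
    centroids = kmeans_1d(points, k)
    return reconstruction_error(points, centroids) + penalty_per_centroid * k


# Toy data with two obvious clusters (made up for this example).
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
best_k = min(range(1, 5), key=lambda k: total_loss(points, k, 1.0))
```

With the penalty in place, the loss prefers two centroids on this data. With the penalty set to 0, adding centroids keeps lowering the loss, which is the sense in which reconstruction error alone is "predictive" but not obviously "useful".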

Another option for other_stuff would be denoising, as we discussed here.

You speak of an experiment to "run a reasonably-detailed low-level simulation of something realistic; see if info-at-a-distance is low-dimensional". My guess is if the other_stuff in your loss function consists only of conditional independence things, your representation won't be particularly low-dimensional--your representation will see no reason to avoid the use of 100 practically-redundant dimensions when one would do the job just as well.

Similarly, you speak of "a system which provably learns all learnable abstractions", but I'm not exactly sure what this would look like, seeing as how for pretty much any abstraction, I expect you can add a bit of junk code that marginally decreases the reconstruction error by overfitting some aspect of your training set. Or even junk code that never gets run / other functional equivalences.

The right question in my mind is how much info at a distance you can get for how many additional dimensions. There will probably be some number of dimensions N such that giving your system more than N dimensions to play with for its representation will bring diminishing returns. However, that doesn't mean the returns will go to 0, e.g. even after you have enough dimensions to implement the ideal gas law, you can probably gain a bit more predictive power by checking for wind currents in your box. See the elbow method (though, the existence of elbows isn't guaranteed a priori).
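As a rough sketch of the "how much info for how many additional dimensions" question, the snippet below looks for an elbow in a curve of reconstruction errors. The error values, the `find_elbow` helper, and the 5% threshold are hypothetical and chosen only to illustrate diminishing but nonzero returns.

```python
def marginal_gains(errors):
    """errors[i] is the reconstruction error with i + 1 dimensions.
    Returns how much each additional dimension reduces the error."""
    return [errors[i] - errors[i + 1] for i in range(len(errors) - 1)]


def find_elbow(errors, min_gain_fraction=0.05):
    """Return the smallest dimension count N such that adding one more
    dimension reduces the error by less than min_gain_fraction of the
    one-dimension error. Returns len(errors) if no such N exists, since
    an elbow is not guaranteed."""
    threshold = min_gain_fraction * errors[0]
    for i, gain in enumerate(marginal_gains(errors)):
        if gain < threshold:
            return i + 1
    return len(errors)


# Made-up error curve: big wins for the first three dimensions, then
# small but still positive gains (the "wind currents in your box").
errors = [100.0, 40.0, 10.0, 8.0, 7.0, 6.5]
elbow = find_elbow(errors)
```

On this curve the elbow lands at three dimensions, yet every later dimension still buys a little predictive power, so the returns diminish without going to 0.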

(I also think that an algorithm to "provably learn all learnable abstractions", if practical, is a hop and a skip away from a superintelligent AGI. Much of the work of science is learning the correct abstractions from data, and this algorithm sounds a lot like an uberscientist.)

Anyway, in terms of investigating convergence, I'd encourage you to think about the inductive biases induced by both your loss function and also your learning algorithm. (We already know that learning algorithms can have different inductive biases than humans, e.g. it seems that the input-output surfaces for deep neural nets aren't as biased towards smoothness as human perceptual systems, and this allows for adversarial perturbations.) You might end up proving a theorem which has required preconditions related to the loss function and/or the algorithm's inductive bias.

Another riff on this bit:

This is the natural abstraction hypothesis in action: across the sciences, we find that low-dimensional summaries of high-dimensional systems suffice for broad classes of “far-away” predictions, like the speed of a sled.

Maybe we could differentiate between the 'useful abstraction hypothesis', and the stronger 'unique abstraction hypothesis'. This statement supports the 'useful abstraction hypothesis', but the 'unique abstraction hypothesis' is the one where alignment becomes way easier because we and our AGI are using the same abstractions. (Even though I'm only a believer in the useful abstraction hypothesis, I'm still optimistic because I tend to think we can have our AGI cast a net wide enough to capture enough useful abstractions that ours are in there somewhere, and this number will be manageable enough to find the right abstractions from within that net--or something vaguely like that.) In terms of science, the 'unique abstraction hypothesis' doesn't just say scientific theories can be useful, it also says there is only one 'natural' scientific theory for any given phenomenon, and the existence of competing scientific schools sorta seems to disprove this.

Anyway, the aspect of your project that I'm most optimistic about is this one:

This raises another algorithmic problem: how do we efficiently check whether a cognitive system has learned particular abstractions? Again, this doesn’t need to be fully general or arbitrarily precise. It just needs to be general enough to use as a tool for the next step.

Since I don't believe in the "unique abstraction hypothesis", checking whether a given abstraction is the right one seems important to me. The problem seems tractable, and a method that's abstract enough to work across a variety of different learning algorithms/architectures (including stuff that might get invented in the future) could be really useful.

Vim

Interesting, thanks for sharing.

I couldn't figure out how to go backwards easily.

Command-shift-g right?

Vim

After practicing Vim for a few months, I timed myself doing the Vim tutorial (vimtutor on the command line) using both Vim with the commands recommended in the tutorial, and a click-and-type editor. The click-and-type editor was significantly faster. Nowadays I just use Vim for the macros, if I want to do a particular operation repeatedly on a file.

I think if you get in the habit of double-clicking to select words and triple-clicking to select lines (triple-click and drag to select blocks of code), click-and-type editors can be pretty fast.

Open Problems with Myopia

We present a useful toy environment for reasoning about deceptive alignment. In this environment, there is a button. Agents have two actions: to press the button or to refrain. If the agent presses the button, they get +1 reward for this episode and -10 reward next episode. One might note a similarity with the traditional marshmallow test of delayed gratification.

Are you sure that "episode" is the word you're looking for here?

https://www.quora.com/What-does-the-term-“episode”-mean-in-the-context-of-reinforcement-learning-RL

I'm especially confused because you switched to using the word "timestep" later?

Having an action which modifies the reward on a subsequent episode seems very weird. I don't even see it as being the same agent across different episodes.
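To show why this reads oddly, here is a minimal sketch of the quoted button rule, taken literally. The `run` function, the action names, and the choice to drop a penalty that would land after the final episode are assumptions for illustration; the post does not specify an implementation.

```python
PRESS, REFRAIN = "press", "refrain"


def run(policy, num_episodes):
    """Simulate the quoted rule: pressing gives +1 in the current episode
    and -10 in the next one. Returns the list of per-episode rewards.
    Assumption: a penalty owed after the last episode is simply dropped."""
    rewards = []
    carried_penalty = 0
    for _ in range(num_episodes):
        action = policy()
        reward = carried_penalty  # penalty inherited from the previous episode
        carried_penalty = 0
        if action == PRESS:
            reward += 1
            carried_penalty = -10
        rewards.append(reward)
    return rewards


press_rewards = run(lambda: PRESS, 3)
refrain_rewards = run(lambda: REFRAIN, 3)
```

Note that the simulator has to carry state (`carried_penalty`) across episode boundaries, which is exactly what the usual definition of an episode rules out. If the same loop were described in terms of timesteps within one episode, it would be an ordinary delayed-reward problem.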

Also...

Suppose instead of one button, there are two. One is labeled "STOP," and if pressed, it would end the environment but give the agent +1 reward. The other is labeled "DEFERENCE" and, if pressed, gives the previous episode's agent +10 reward but costs -1 reward for the current agent.

Suppose that an agent finds itself existing. What should it do? It might reason that since it knows it already exists, it should press the STOP button and get +1 utility. However, it might be being simulated by its past self to determine if it is allowed to exist. If this is the case, it presses the DEFERENCE button, giving its past self +10 utility and increasing the chance of its existence. This agent has been counterfactually mugged into deferring.

I think as a practical matter, the result depends entirely on the method you're using to solve the MDP and the rewards that your simulation delivers.

Borasko's Shortform

lsusr had an interesting idea of creating a new YouTube account and explicitly training the recommendation system to recommend particular videos (in his case, music): https://www.lesswrong.com/posts/wQnJ4ZBEbwE9BwCa3/personal-experiment-one-year-without-junk-media

I guess you could also do it for YouTube channels which are informative & entertaining, e.g. CGP Grey and Kings & Generals. I believe studies have found that laughter tends to be rejuvenating, so optimizing for videos you think are funny is another idea.

Willa's Shortform

I suspect you will be most successful at this if you get in the habit of taking breaks away from your computer when you inevitably start to flag mentally. Some that have worked for me include: going for a walk, talking to friends, taking a nap, reading a magazine, juggling, noodling on a guitar, or just daydreaming.

A Semitechnical Introductory Dialogue on Solomonoff Induction

...When we can state code that would solve the problem given a hypercomputer, we have become less confused. Once we have the unbounded solution we understand, in some basic sense, the kind of work we are trying to perform, and then we can try to figure out how to do it efficiently.

ASHLEY: Which may well require new insights into the structure of the problem, or even a conceptual revolution in how we imagine the work we're trying to do.

I'm not convinced your chess example, where the practical solution resembles the hypercomputer one, is representative. One way to sort a list using a hypercomputer is to try every possible permutation of the list until we discover one which is sorted. I tend to see Solomonoff induction as being cartoonishly wasteful in a similar way.
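The sorting analogy can be written out as a short sketch. The function names are mine; the point is only that the "try everything" specification is correct while looking nothing like a practical sort.

```python
from itertools import permutations


def is_sorted(items):
    """True if every element is <= the one after it."""
    return all(a <= b for a, b in zip(items, items[1:]))


def permutation_sort(items):
    """Sort by brute force: try permutations until one is sorted.
    Worst case examines n! candidates, versus O(n log n) for
    comparison sorts such as the built-in sorted()."""
    for candidate in permutations(items):
        if is_sorted(candidate):
            return list(candidate)


result = permutation_sort([3, 1, 2])
```

Both this and `sorted()` meet the same specification, but efficient sorting algorithms share almost no structure with the brute-force version.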

Why GPT wants to mesa-optimize & how we might change this

From a safety standpoint, hoping and praying that SGD won't stumble across lookahead doesn't seem very robust, if lookahead represents a way to improve performance. I imagine that whether SGD stumbles across lookahead will end up depending on complicated details of the loss surface that's being traversed.

John_Maxwell's Shortform

Lately I've been examining the activities I do to relax and how they might be improved. If you haven't given much thought to this topic, Meaningful Rest is excellent background reading.

An interesting source of info for me has been lsusr's posts on cutting out junk media: 1, 2, 3. Although I find lsusr's posts inspiring, I'm not sure I want to pursue the same approach myself. lsusr says: "The harder a medium is to consume (or create, as applicable) the smarter it makes me." They responded to this by cutting all the easy-to-consume media out of their life.

But when I relax, I don't necessarily want to do something hard. I want to do something which rejuvenates me. (See "Meaningful Rest" post linked previously.)

lsusr's example is inspiring in that it seems they got themselves studying things like quantum field theory for fun in their spare time. But they also noted that "my productivity at work remains unchanged", and ended up abandoning the experiment 9 months in "due to multiple changes in my life circumstances". Personally, when I choose to work on something, I usually expect it to be at least 100x as good a use of my time as random productive-seeming stuff like studying quantum field theory. So given a choice, I'd often rather my breaks rejuvenate me a bit more per minute of relaxation, so I can put more time and effort into my 100x tasks, than have the break be slightly useful on its own.

To adopt a different frame... I'm a fan of the wanting/liking/approving framework from this post.

  • In some sense, +wanting breaks are easy to engage in because it doesn't require willpower to get yourself to do them. But +wanting breaks also tend to be compulsive, and that makes them less rejuvenating (example: arguing online).

  • My point above is that I should mostly ignore the +approving or -approving factor in terms of the break's non-rejuvenating, external effects.

  • It seems like the ideal break is +liking, and enough +wanting that it doesn't require willpower to get myself to do it, and once I get started I can disconnect for hours and be totally engrossed, but not so +wanting that I will be tempted to do it when I should be working or keep doing it late into the night. I think playing the game Civilization might actually meet these criteria for me? I'm not as hooked on it as I used to be, but I still find it easy to get engrossed for hours.

Interested to hear if anyone else wants to share their thinking around this or give examples of breaks which meet the above criteria.
