This branch of research is aimed at finding a (nearly) objective way of thinking about the universe. When I imagine the end result, I imagine something that receives a distribution across a bunch of data and finds a bunch of useful patterns within it. At the moment that looks like finding patterns in data via `find_natural_latent(get_chunks_of_data(data_distribution))`, or perhaps showing that `find_top_n(n, ((chunks, natural_latent(chunks)) for chunks in all_chunked_subsets_of_data(data_distribution)), key=lambda chunks, latent: usefulness_metric(latent))` is a (convergent sub)goal of agents. As such, the notion that the donuts' data is simply poorly chunked - which needs to be solved anyway - makes a lot of sense to me.
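As a very rough sketch (and nothing more), that search might look like the following in Python. Every helper here - the chunking scheme, the "natural latent" extraction, and the usefulness metric - is a toy stand-in I made up, not a real implementation:

```python
import heapq

def all_chunked_subsets_of_data(data):
    # Toy chunking: every way of splitting the data into two contiguous pieces.
    return [(data[:i], data[i:]) for i in range(1, len(data))]

def natural_latent(chunks):
    # Toy stand-in for a latent: the information shared by every chunk,
    # here just the set of values that appear in all of them.
    shared = set(chunks[0])
    for chunk in chunks[1:]:
        shared &= set(chunk)
    return frozenset(shared)

def usefulness_metric(latent):
    # Toy stand-in: pretend a larger shared latent is more useful.
    return len(latent)

def find_top_n(n, data):
    candidates = ((chunks, natural_latent(chunks))
                  for chunks in all_chunked_subsets_of_data(data))
    return heapq.nlargest(n, candidates,
                          key=lambda pair: usefulness_metric(pair[1]))

top = find_top_n(2, [1, 2, 3, 1, 2, 3, 1, 2])
```

The real versions of these functions are exactly the open problems under discussion; the sketch only pins down the shape of the pipeline.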
I don't know how to think about the possibilities when it comes to decomposing the Xi. Why would it always be possible to decompose random variables so as to allow for a natural latent? Do you have an easy example of this? Also, what do you mean by mutual information between the Xi, given that there are at least 3 of them? And why would just extracting said mutual information be useless? If you get the chance to point me towards good resources about any of these questions, that would be great.
Let's say every day at the office, we get three boxes of donuts, numbered 1, 2, and 3. I grab a donut from each box and plunk them down on napkins helpfully labeled X1, X2, and X3. The donuts vary in two aspects: size (big or small) and flavor (vanilla or chocolate). Across all boxes, the ratio of big to small donuts remains consistent. However, Boxes 1 and 2 share the same vanilla-to-chocolate ratio, which is different from that of Box 3.
Does the correlation between X1 and X2 imply that there is no natural latent? Is this the desired behavior of natural latents, despite the presence of the common size ratio? (And the commonality that I've only ever pulled out donuts; there has never been a tennis ball in any of the boxes!) If so, why is this what we want from natural latents? If not, how does a natural latent arise despite the internal correlation?
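For concreteness, here is one way (an assumption on my part) the X1-X2 correlation could arise: the shared vanilla-to-chocolate ratio itself varies from day to day, boxes 1 and 2 both draw from the same day's ratio, and box 3 keeps its own fixed ratio. A quick simulation then shows X1 and X2 correlated while X3 stays independent of both:

```python
import random

random.seed(0)

def sample_day():
    p12 = random.choice([0.2, 0.8])        # today's shared flavor ratio for boxes 1 & 2
    x1 = 1 if random.random() < p12 else 0  # 1 = vanilla, 0 = chocolate
    x2 = 1 if random.random() < p12 else 0
    x3 = 1 if random.random() < 0.5 else 0  # box 3's own fixed ratio, independent of p12
    return x1, x2, x3

days = [sample_day() for _ in range(100_000)]
x1s, x2s, x3s = zip(*days)

def corr(a, b):
    # Pearson correlation, computed by hand to stay dependency-free.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    va = sum((x - ma) ** 2 for x in a) / n
    vb = sum((y - mb) ** 2 for y in b) / n
    return cov / (va * vb) ** 0.5
```

Under this model, `corr(x1s, x2s)` comes out clearly positive while `corr(x1s, x3s)` hovers near zero, and the day's shared ratio `p12` looks like a candidate latent over (X1, X2).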
We could remove information from Λ′. For instance, Λ′ could be a bit indicating whether the temperature is above 100°C.
I don't understand how this is less information than a bit indicating whether the temperature is above 50°C. Specifically, given a bit telling you whether the temperature is above 50°C, how do you know whether the temperature is above 100°C or between 50°C and 100°C?
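A tiny check of what I mean: 75°C and 120°C agree on the 50°C bit but disagree on the 100°C bit, so the former bit alone leaves the latter undetermined.

```python
# One bit per threshold: is the temperature above 50°C? above 100°C?
bits = {t: (t > 50, t > 100) for t in (30, 75, 120)}

# bits[75] and bits[120] share the same first component (both above 50°C)
# but differ on the second, so the 50°C bit doesn't recover the 100°C bit.
```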
As to the definition of a short term goal: any goal that can be achieved (fully, i.e. without a "and keep it that way" clause) in a short, finite time (for instance, in a few seconds), with the resources the system already has at hand. Equivalently, I think: any goal that doesn't push towards instrumental power seeking.
As to how we know a system has a short term goal: if we could argue that systems prefer short term goals by default, then we still wouldn't know the goals of a particular system, but we could hazard a guess that its goals are short term. Perhaps we could expect short term goals by default if they were, for instance, easier to specify, and thus to have. As pointed out by others, if we try to give systems long term goals on purpose, they will probably end up with long term goals.
So long term goals aren't a default; market pressure will put them there as humans slowly cede more and more control to AIs, simply because the latter make decisions that work out better. Presumably this would start with lower level decisions (e.g. how exactly to write this line of code; which employee to reward based on performance), with AIs slowly being given higher level decisions to make. In particular, we don't die the first time someone creates an AI with the ability to (escape, self-improve, and then) kill the competing humans, because that AI is likely focused on a much smaller, more near term goal. That way, if we're careful and clever, we have a chance to study a smarter-than-human general intelligence without dying. Is that an accurate description of how you see things playing out?
I'm not sure I understand; are you saying that given these, we have high P(Doom), or that these are necessary to be safe even if GIs have only short term goals? Or something else entirely?
I am using Wikipedia's definition: "Ensuring that emergent goals match the specified goals for the system is known as inner alignment."
Inner alignment is definitely a problem. In the case you described, the emergent goal was long term (ensure I remember the answer to 1+1), and I'm still wondering whether, by default, short term specified goals do or do not lead to strange long term goals like the one in your example.
I wrote the following in custom instructions, under "How would you like ChatGPT to respond?":
After every answer, give a short reflection, like "I looked at this answer again and it looks good" or "I thought about this again and I think it could be better if ..." or "After looking this over again, this makes no sense because...".

I then had the following conversation:

Q: What is denser, gold or uranium?
A: Uranium is denser than gold. The density of gold is about 19.32 grams per cubic centimeter, while the density of uranium is about 19.1 grams per cubic centimeter.
Upon reflection, I see that I made an error in my initial response. The correct densities are indeed 19.32 g/cm³ for gold and 19.1 g/cm³ for uranium, but these values indicate that gold is denser than uranium, not the other way around. I apologize for the confusion, and I appreciate your patience.
I feel like a major explanatory paragraph is missing here, one explaining the difference between causality and probability. Something like:
Armed with knowledge of the future, we could know exactly what will happen. (e.g. The other doctor will give Alice medicine, and she will get better.) Given a full probability distribution over events we could make optimal predictions. (e.g. There is a 2/3 chance the other doctor will give Alice medicine, 1/4 chance of her getting better if he doesn't and 2/3 chance of her getting better if he does.) Causality gives us a way to combine a partial probability distribution with additional knowledge of the world to make predictions about events that are out of distribution. (e.g. Since I understand that the medicine works mostly by placebo, I can intervene and give Alice a placebo when the other doctor doesn't give her the medicine, raising her chances. Furthermore, if I have a distribution of how effective a placebo is relative to the medicine, I can quantify how helpful my intervention is.)
An intervention is a really important example of an out-of-distribution generalization; but if I gave you the full probability distribution over the outcomes your interventions would achieve, it would no longer be out of distribution (and you'd need to deal with paradoxes involving seemingly not having choices about certain things).
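Plugging in the numbers from the example above (with a made-up placebo recovery probability, since the example leaves that quantity unspecified):

```python
p_doctor  = 2/3   # chance the other doctor gives Alice medicine
p_if_med  = 2/3   # chance she gets better given medicine
p_if_none = 1/4   # chance she gets better with no treatment

# Without intervening, her overall chance of getting better:
p_baseline = p_doctor * p_if_med + (1 - p_doctor) * p_if_none   # = 19/36

# Intervention: hand her a placebo whenever the doctor doesn't give medicine.
# 0.5 is a purely illustrative placebo recovery probability.
p_if_placebo = 0.5
p_intervene = p_doctor * p_if_med + (1 - p_doctor) * p_if_placebo
```

Any placebo recovery probability above 1/4 makes `p_intervene` exceed `p_baseline`, which is the quantitative sense in which the intervention helps.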
No Free Lunch means that optimization requires taking advantage of underlying structure in the set of possible environments. In the case of epistemics, we all share close-to-the-same environment (including having similar minds), so there are a lot of universally useful optimizations for learning about the environment.
Optimization over the space of "how-to-behave instructions" requires some similar underlying structure. Such structure can emerge for two reasons: (1) because of the shared environment, or (2) because of shared goals. (Yes, I'm thinking about agents as Cartesian, in the sense of separating the goals and the environment, but to be fair so do L+P+S+C.)
On the environment side, this leads to convergent behaviours (which can also be thought of as behaviours resulting from selection theorems), like good epistemics, or gaining power over resources.
When it comes to goals, on the other hand, it is both possible (by the orthogonality thesis) and actually the case that different people have vastly different goals (e.g. some people want to live forever, some want to commit suicide, and these two groups probably require mostly different strategies). Less in common between different people's goals means fewer universally useful how-to-behave instructions. Nonetheless, optimizing behaviours that are commonly prioritized is close enough to universally useful, e.g. doing relationships well.
Perhaps an "Instrumental Sequences" would include the above categories as major chapters. In that case, as indicated in the post, current research being posted on LessWrong gives an approximate idea of what such sequences could look like.