Fun Theory is the study of questions such as "How much fun is there in the universe?", 
"Will we ever run out of fun?", "Are we having fun yet?" and "Could we be having 
more fun?". It's relevant to designing utopias and AIs, among other things.

After finishing MATS 2, I believed this: "RL optimization is what makes an LLM potentially dangerous. LLMs by themselves are just simulators, and therefore are not likely to become a misaligned intelligent agent. Therefore, a reasonable alignment strategy is to use LLMs (and maybe small amounts of finetuning) to build useful superhuman helpers. Small quantities of finetuning won't shift the LLM very far from being a simulator (which is safe), so it'll probably still be safe."

I even said this in a public presentation to people learning about alignment. Ahh. I now think this was wrong. I was misled by ambient memes of the time and also made the mistake of trying too hard to update on the current top-performing technology. More recently, after understanding why this was wrong, I cowrote this post in the hope that it would be a good reference for why this belief was wrong, but I think it ended up trying to do too many other things. So here's a briefer explanation:

The main problem was that I didn't correctly condition on useful superhuman capability. Useful superhuman capabilities involve goal-directedness, in the sense that the algorithm must have some model of why certain actions lead to certain future outcomes. It must be choosing actions for a reason algorithmically downstream of the intended outcome. This is the only way to handle new obstacles and still succeed.

My reasoning was that since LLMs don't seem to contain this sort of algorithm and yet are still useful, we can leverage that usefulness without danger of misalignment. This was pretty much correct. Without goals, there are no goals to be misaligned. It's what we do today. The mistake was that I thought this would keep being true for future-LLMs-hypothetically-capable-of-research. I didn't grok that goal-directedness at some level was necessary to cross the gap between LLM capabilities and research capability.

My second mistake was thinking that danger was related to the quantity of RL finetuning. I muddled up agency/goal-directedness with danger, and was also wrong that RL is more likely to produce agency/goal-directedness, conditioned on high capability. It's a natural mistake, since stereotypical RL training is designed to incentivize goal-directedness. But if we condition on high capability, that connection is wiped out, because we already know the algorithm has to contain some goal-directedness.

I was also wrong that LLMs should be thought of as simulators (although it's a useful frame sometimes). There was a correct grain of truth in the idea that simulators would be safe. It would be great if we could train actual people-simulators. If we could build a real algorithm-level simulator of a person, this would of course be aligned (it would have the goals of the person simulated). But the way current LLMs are being built, and the way future systems will be built, they aren't even vaguely trying to make extremely-robustly-generalizing people-simulators.[1] And they won't, because it would involve a massive tradeoff with competence.

1. ^ And the level of OOD generalization required to remain a faithful simulation during online learning and reflection is intuitively quite high.
'Feature' is overloaded terminology

In the interpretability literature, it's common to overload 'feature' to mean three separate things:

1. Some distinctive, relevant attribute of the input data. For example, being in English is a feature of this text.
2. A general activity pattern across units which a network uses to represent some feature in the first sense. So we might say that a network represents the feature (1) that the text is in English with a linear feature (2) in layer 32.
3. The elements of some process for discovering features, like an SAE. An SAE learns a dictionary of activation vectors, which we hope correspond to the features (2) that the network actually uses. It is common to simply refer to the elements of the SAE dictionary as 'features' as well.

For example, we might say something like 'the network represents features of the input with linear features in its representation space, which are recovered well by feature 3242 in our SAE'.

This seems bad; at best it's a bit sloppy and confusing, at worst it's begging the question about the interpretability or usefulness of SAE features. It seems important to carefully distinguish between these senses in case they don't coincide, and we think it's worth making a bit of an effort to give them different names. A terminology that we prefer is to reserve 'feature' for the conceptual senses of the word (1 and 2) and use alternative terminology for case 3, like 'SAE latent' instead of 'SAE feature'. So we might say, for example, that a model uses a linear representation for a Golden Gate Bridge feature, which is recovered well by an SAE latent. We have tried to follow this terminology in our recent Gemma Scope report, and we think this added precision helps in thinking about feature representations, and about SAEs, more clearly.

To illustrate what we mean with an example, we might ask whether a network has a feature for numbers - i.e., whether it has any kind of localized representation of numbers at all. We can then ask what the format of this feature representation is; for example, how many dimensions does it use, where is it located in the model, does it have any kind of geometric structure, etc. We could then separately ask whether it is discovered by a feature discovery algorithm; i.e., is there a latent in a particular SAE that describes it well? We think it's important to recognise that these are all distinct questions, and to use terminology that can distinguish between them.

We (the DeepMind language model interpretability team) started using this terminology in the Gemma Scope report, but we didn't really justify the decision much there, and I thought it was worth making the argument separately.
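To make the three senses concrete, here is a toy numerical sketch (illustrative only; the dimensions, the 'English' direction, and the random, untrained SAE dictionary are all made up for the example):

```python
# Toy illustration of the three senses of "feature":
#   (1) an attribute of the input, (2) a direction the network uses to
#   represent it, and (3) an SAE latent that may or may not recover it.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents = 64, 256

# Sense (2): a hypothetical direction in activation space representing
# the input attribute "this text is in English" (sense 1).
english_feature_dir = rng.normal(size=d_model)
english_feature_dir /= np.linalg.norm(english_feature_dir)

# Sense (3): an SAE dictionary (here random and untrained, just for shape).
# Each column of W_dec is one "SAE latent".
W_dec = rng.normal(size=(d_model, n_latents))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)

# Whether some latent recovers the feature direction is an empirical question:
cosines = english_feature_dir @ W_dec
best = int(np.argmax(np.abs(cosines)))
print(f"latent {best} matches the feature direction with |cos| = {abs(cosines[best]):.2f}")
```

The point of keeping the names apart is exactly that the last step can fail: the network may represent the feature (2) while no latent (3) in a given SAE aligns with it well.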
(ramblingly) Does the No Free Lunch Theorem imply that there's no single technique that would always work for AGI alignment? Initial thought: probably not, because the theorem only says that the performance of all optimization algorithms is identical when averaged across all possible problems, whereas AGI alignment is just a subset of those problems.
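For reference, a rough form of the theorem being appealed to (Wolpert & Macready, 1997), and why the averaging matters:

```latex
% Rough statement of the No Free Lunch theorem for search/optimization
% (Wolpert & Macready, 1997): averaged uniformly over all objective
% functions f on a finite domain, any two black-box algorithms a_1, a_2
% see the same distribution of results.
\[
\sum_{f} P\!\left(d^{y}_{m} \mid f, m, a_{1}\right)
  \;=\;
\sum_{f} P\!\left(d^{y}_{m} \mid f, m, a_{2}\right),
\]
% where d^y_m is the sequence of objective values observed after m
% distinct evaluations. The sum over *all* f is what makes the theorem
% silent about any structured subset of problems, such as the
% alignment-relevant ones.
```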
Fabien Roger
I recently listened to the book Chip War by Chris Miller. It details the history of the semiconductor industry and the competition between the US, the USSR, Japan, Taiwan, South Korea and China. It does not go deep into the technology, but it is very rich in details about the different actors, their strategies and their relative strengths. I found this book interesting not only because I care about chips, but also because the competition around chips is not the worst analogy for what the competition around LLMs could become in a few years. (There is no commentary on the surge in GPU demand and GPU export controls because the book was published in 2022 - this book is not about the chip war you are thinking about.) Some things I learned:

* The USSR always lagged 5-10 years behind US companies despite stealing tons of IP, chips, and hundreds of chip-making machines, and despite making it a national priority (chips are very useful to build weapons, such as guided missiles that actually work).
* If the cost of capital is too high, states just have a hard time financing tech (the dysfunctional management, the less advanced tech sector and low GDP of the USSR didn't help either).
* If AI takeoff is relatively slow, the ability to actually make a huge amount of money selling AI in the economy may determine who ends up in front. (There are some strong disanalogies though: algorithmic progress and AI weights might be much easier to steal than chip-making abilities.)
* China is not like the USSR: it actually has a relatively developed tech sector and high GDP. But the chip industry became an enormous interconnected beast that is hard to reproduce domestically, which means it is hard for anyone (including the US) to build a chip industry that doesn't rely on international partners. (Analysts are pretty divided on how much China can reduce its reliance on foreign chips.)
* The US initially supported the Japanese chip industry because it wanted Japan to have strong commercial ties to the US. Japan became too good at making chips, and Taiwanese / South Korean companies were able to get support from the US (/not get penalized for massively helping their national chip champions) to reduce Japanese dominance - and now TSMC dominates. Economic policies are hard to get right... (The author sometimes says stuff like "US elites were too ideologically committed to globalization", but I don't think he provides great alternative policies.)
* It's amazing how Intel let a massive advantage slip. It basically had a monopoly over logic chip design (Intel microprocessors, before GPUs mattered), chip architecture (x86), and a large share of logic chip manufacturing (while Japan/Taiwan/... were dominating in other sectors, like RAM and special-purpose chips). It just juiced its monopoly, then tried to become a foundry and a GPU designer when it was already too late, and now it has a market cap that is 1/3rd of AMD, 1/10th of TSMC and 1/30th of Nvidia. But it's the main producer of chips in the US; it's scary if the US has to bet on such a company...
* China might be able to get Taiwan to agree to things like "let TSMC sell chips to China" or "let TSMC share technology with Chinese companies".
* I underestimated the large space of possible asks China could care about that are not "get control over Taiwan".
* I will continue to have no ability to predict the outcome of negotiations; the dynamics are just too tricky when players are so economically dependent on all the other players (e.g. China imports ~$400B worth of chips per year, 13% of all its imports).
Ruby
Seeking Beta Users for LessWrong-Integrated LLM Chat

Comment here if you'd like access. (Bonus points for describing ways you'd like to use it.)

A couple of months ago, a few of the LW team set out to see how LLMs might be useful in the context of LW. It feels like they should be useful at some point before the end; maybe that point is now.

My own attempts to get Claude to be helpful for writing tasks weren't particularly succeeding, but LLMs are pretty good at reading a lot of things quickly, and can also be good at explaining technical topics. So I figured that just making it easy to load a lot of relevant LessWrong context into an LLM might unlock several worthwhile use-cases.

To that end, Robert and I have integrated a Claude chat window into LW, with the key feature that it will automatically pull in LessWrong posts and comments relevant to what you're asking about.

I'm currently seeking beta users. Since using the Claude API isn't free and we haven't figured out a payment model, we're not rolling it out broadly. But we are happy to turn it on for select users who want to try it out. Comment here if you'd like access. (Bonus points for describing ways you'd like to use it.)
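For a rough picture of the general pattern (not the actual LW implementation; `search_relevant_posts`, the prompt wording, and the model name are placeholders):

```python
# Minimal sketch of retrieval-augmented chat: fetch relevant posts,
# load them into the system prompt, then call the Claude Messages API.
import anthropic


def search_relevant_posts(query: str, k: int = 5) -> list[tuple[str, str]]:
    """Hypothetical stand-in for LW's embedding search over posts/comments."""
    return [("Example post title", "Example post body...")][:k]


def build_system_prompt(query: str) -> str:
    posts = search_relevant_posts(query, k=5)
    context = "\n\n".join(f"## {title}\n{body}" for title, body in posts)
    return (
        "You are a LessWrong reading assistant. Use the posts below as context.\n\n"
        + context
    )


client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
question = "What is Fun Theory?"
reply = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # a model name current at the time of writing
    max_tokens=1024,
    system=build_system_prompt(question),
    messages=[{"role": "user", "content": question}],
)
print(reply.content[0].text)
```

The interesting design work is almost entirely in the retrieval step (what to search over, how many posts to load, how to rank them), not in the API call itself.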


Recent Discussion

Date: Saturday, September 7th, 2024

Time: 1 pm – 3 pm PT

Address: Yerba Buena Gardens in San Francisco, just outside the Metreon food court, coordinates 37°47'04.4"N 122°24'11.1"W  

Contact: 34251super@gmail.com

Come join San Francisco’s First Saturday (or SFFS – easy to remember, right?) ACX meetup. Whether you're an avid reader, a first time reader, or just a curious soul, come meet! We will make introductions, talk about a recent ACX article (Matt Yglesias Considered As The Nietzschean Superman), and veer off into whatever topic you’d like to discuss. You can get food from one of the many neighbouring restaurants.

We will relocate inside the food court if there is inclement weather or too much noise/music outside.

I will carry a stuffed-animal green frog to help you identify the group. You can let me know you are coming by either RSVPing on LW or sending an email to 34251super@gmail.com, or you can also just show up!

Given that Biden has dropped out, do you believe that the market was accurately priced at the time?


Suppose you believe the following:

  1. The universe is infinite, in the sense that every possible combination of atoms is repeated an infinite number of times (either because the negative curvature of the universe implies the universe is unbounded, or because of MWI).
  2. Consciousness is an atomic phenomenon[1]. That is to say, the only special relationship between past-you and present-you is that present-you remembers being past-you.

In this case, we seem to get something similar to "dust" in Greg Egan's Permutation City, where any sequence of events leading to the present you having your present memories could be considered the "real you".

However, the "conscious you" of your dreams does not have any special attachment or memory to the waking you.  That is to say at least sometimes when...

you presumably also think that teleportation would only create copies while destroying the originals. You might then be hesitant to use teleportation.

As an aside, Holden's view of identity makes him unconcerned about this question, and I've gradually come round to it as well.

M. Y. Zuo
How do you define ‘real’, ‘me’, ‘real me’, etc…? This seems to be stemming from some internal confusion.
habryka
I don't understand: how is "not predicting errors" either a thing we have observed, or something that has anything to do with simulation? Yeah, I really don't know what you are saying here. Like, if you prompt a completion model with badly written text, it will predict badly written text. But also, if you train a completion model on data where a very weak hash is followed by its pre-image, it will probably have learned to undo the hash, even though the source generation process never performed that operation (which is potentially much more complicated than the hashing function itself), which means it's not really a simulator.
Signer

But also, if you train a completion model on data where a very weak hash is followed by its pre-image, it will probably have learned to undo the hash, even though the source generation process never performed that operation (which is potentially much more complicated than the hashing function itself), which means it's not really a simulator.

I'm saying that this won't work with current systems, at least for a strong hash, because it's hard; instead of learning to undo it, the model will learn to simulate, because that's easier. And then you can vary the strength of the hash to... (read more)
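For concreteness, a toy version of the kind of training data under discussion (illustrative only; a deliberately truncated hash stands in for "very weak"):

```python
# Toy data where a weak hash is followed by its pre-image. The text was
# *generated* by hashing forward, but *predicting* it left-to-right
# requires inverting the hash -- an operation the generating process
# never performed.
import hashlib
import random


def weak_hash(s: str) -> str:
    # "Weak" = severely truncated, so inversion is feasible in principle.
    return hashlib.sha256(s.encode()).hexdigest()[:4]


random.seed(0)
words = ["apple", "brick", "cloud", "delta", "ember"]
for _ in range(3):
    preimage = random.choice(words)
    print(f"{weak_hash(preimage)} -> {preimage}")
```

The disagreement above is about whether a model trained on such data actually learns the inversion (and so isn't a pure simulator) or only manages this when the hash is weak enough.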

RHollerith
If we all die because an AI put superhuman amounts of optimization pressure into some goal incompatible with human survival (i.e., almost any goal, if the optimization pressure is high enough), it does not matter whether the AI would have had some other goal in some other context.
mattmacdermott
But superhuman capability doesn't seem to imply "applies all the optimisation pressure it can towards a goal". Like, being crazily good at research projects may require the ability to do goal-directed cognition. It doesn't seem to require the habit of monomaniacally optimising the universe towards a goal. I think whether or not a crazy good research AI is a monomaniacal universe optimiser probably depends on what kind of AI it is.

My assertion is that all utility functions (i.e., all functions that satisfy the 4 VNM axioms plus perhaps some additional postulates most of us would agree on) are static (do not change over time).

I should try to prove that; I've been telling myself I should for months now but haven't mustered the energy, so I'm posting the assertion now without proof, because a weak argument posted now is better than a perfect argument that might never be posted.

I've never been tempted to distinguish between "the outside-of-time all-timepoints-included utility functi... (read more)
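One way the assertion might be written down (a sketch of the intended claim, not the promised proof):

```latex
% The VNM theorem applied to a single preference relation $\succsim$ over
% lotteries yields one function
\[
  u : \mathcal{O} \to \mathbb{R}, \qquad
  U(L) = \sum_{o \in \mathcal{O}} L(o)\, u(o),
\]
% unique up to positive affine transformation, with no time argument.
% "Changing over time" would instead mean a family $\{\succsim_t\}_{t \in T}$
% with a separate $u_t$ for each $t$, which is a different object and not
% something the four axioms applied to a single $\succsim$ ever produce.
```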

Ruby
@Neel Nanda @Stephen Fowler @Saul Munn – you've been added. I'm hoping to get a PR deployed today that'll make a few improvements:
- narrow the width so it doesn't overlap the post on smaller screens than before
- load more posts into the context window by default
- upweight embedding distance relative to karma in the embedding search for relevant context to load in (rough sketch below)
- various additions to the system response to improve tone and style
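The kind of weighting being tuned might look roughly like this (a toy sketch, not the actual LessWrong ranking code; `alpha` and the karma damping are made up for illustration):

```python
# Toy relevance score combining embedding similarity with karma, where
# `alpha` controls how much embedding distance counts relative to karma.
import math


def relevance_score(cosine_sim: float, karma: int, alpha: float = 0.8) -> float:
    # log-damped karma so a few very high-karma posts don't dominate
    return alpha * cosine_sim + (1 - alpha) * math.log1p(max(karma, 0)) / 10


candidates = [("Post A", 0.91, 12), ("Post B", 0.74, 450), ("Post C", 0.88, 60)]
ranked = sorted(candidates, key=lambda c: relevance_score(c[1], c[2]), reverse=True)
print([title for title, *_ in ranked])
```

Raising `alpha` is the "upweight embedding distance relative to karma" change: semantically close posts win even when higher-karma but less relevant posts are available.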
Saul Munn
great! how do i access it on mobile LW?
Ruby

Not available on mobile at this time, I'm afraid.

Ruby
Added! That's been one of my go-to questions for testing variations of the system; I'd suggest just trying it yourself.

Thanks for that. The "the fate of all mankind" line really throws me. Without this line, everything I said above applies. Its existence (assuming that it exists, specifically refers to AI, and Xi really means it) is some evidence towards him thinking that it's important. I guess it just doesn't square with the intuitions I've built for him as someone not particularly bright or sophisticated. Being convinced by good arguments does not seem to be one of his strong suits.

Edit: forgot to mention that I tried and failed to find the text of the guide itself.

I have never been more ready for Some Football.

Have I learned all about the teams and players in detail? No, I have been rather busy, and have not had the opportunity to do that, although I eagerly await Seth Burn’s Football Preview. I’ll have to do that part on the fly.

But oh my would a change of pace and chance to relax be welcome. It is time.

The debate over SB 1047 has been dominating for weeks. I've now said my piece on the bill and how it works, and compiled the reactions in support and opposition. There are two small orders of business left for the weekly. One is the absurd Chamber of Commerce 'poll' that is the equivalent of a pollster asking if you support John...

Also, the guy is spamming his post about spamming applications into all the subreddits, which gives the whole thing a great meta twist. I wonder if he's using AI for that too.

I'm pretty sure I saw what must be the same account, posting blatantly AI-generated replies/answers across a ton of different subreddits, including at least some that explicitly disallow that.

Either that or someone else's bot was spamming AI answer comments while also spamming copycat "I applied to 1000 jobs with AI" posts.