A new paper proposes an unsupervised way to extract knowledge from language models. The authors argue this could be a key part of aligning superintelligent AIs, by letting us figure out what the AI "really believes" rather than what it thinks humans want to hear. But there are still some challenges to overcome before this could work on future superhuman AIs.
I start off this post with an apology for two related mistakes from last week.
The first is the easier correction: I incorrectly said he was the head of ‘alignment’ at OpenAI, when his actual title is head of ‘mission alignment.’
Both are important, and make one’s views important, but they’re very different.
The more serious error, which got quoted elsewhere, was this: in the section about OpenAI, I noted some past comments from Joshua Achiam, and interpreted them as him lecturing EAs that misalignment risk from AGI was not real.
While I believe that, in isolation, this is a reasonable way to interpret the quote, this issue is important to get right, especially if I’m going to say things like that. Looking at it only...
(still) speculative, but I think the pictures of Shard Theory, activation engineering and Simulators (and e.g. Bayesian interpretations of in-context learning) are looking increasingly similar: https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed?commentId=qX4k7y2vymcaR6eio
https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed#SfPw5ijTDi6e3LabP
You’ve probably seen this chart from Mark Perry at the American Enterprise Institute.
I’ve seen this chart dozens of times and have always enjoyed how many different and important stories it can tell.
There is a story of the incredible abundance offered by technological growth and globalization. Compared to average hourly wages, cars, furniture, clothing, internet access, software, toys, and TVs have become far more accessible than they were 20 years ago. Flatscreens and Fiats that were once luxuries are now commodities.
There is also a story of sclerosis and stagnation. Sure, lots of frivolous consumer goods have gotten cheaper but healthcare, housing, childcare, and education, all the important stuff, has exploded in price. Part of this is “cost disease” where the high productivity of labor in advancing industries like...
The health and education categories would be quite different in most European countries.
Gurnee & Tegmark (2023) trained linear probes to take an LLM's internal activation on a landmark's name (e.g. "The London Eye"), and predict the landmark's longitude and latitude. The results look like this:[1]
So LLMs (or at least, Llama 2, which they used for this experiment) contain a pretty good linear representation of an atlas.
Sometimes, like when thinking about distances, a globe is more useful than an atlas. Do models use the globe representation? To find out, we can train probes to predict the (x,y,z) coordinates of landmarks, viewed as living in 3D space....
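To make the probe setup concrete, here is a minimal NumPy sketch of the mechanics. The activations below are fabricated stand-ins (a random linear encoding of the coordinates plus noise), not real Llama 2 activations, and the probe is plain ridge regression from activation vectors to unit-sphere (x, y, z) coordinates, i.e. the "globe" representation; dimensions and the noise scale are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for LLM activations on landmark names. In the actual
# experiment these would be residual-stream vectors from Llama 2; here we
# fabricate a random linear "world model" purely to show the probe mechanics.
rng = np.random.default_rng(0)
d_model, n_landmarks = 512, 1000

# Random landmark coordinates in degrees.
lat = rng.uniform(-90, 90, n_landmarks)
lon = rng.uniform(-180, 180, n_landmarks)

# Convert (lat, lon) to points on the unit sphere -- the "globe" view.
phi, lam = np.radians(lat), np.radians(lon)
xyz = np.stack([np.cos(phi) * np.cos(lam),
                np.cos(phi) * np.sin(lam),
                np.sin(phi)], axis=1)              # shape (n_landmarks, 3)

# Fake activations that linearly encode xyz, plus noise.
W_true = rng.normal(size=(3, d_model))
acts = xyz @ W_true + 0.1 * rng.normal(size=(n_landmarks, d_model))

# Linear probe = ridge regression from activations back to xyz.
ridge = 1.0
A = acts.T @ acts + ridge * np.eye(d_model)
W_probe = np.linalg.solve(A, acts.T @ xyz)         # (d_model, 3)

pred = acts @ W_probe
r2 = 1 - ((pred - xyz) ** 2).sum() / ((xyz - xyz.mean(0)) ** 2).sum()
print(f"probe R^2: {r2:.3f}")
```

Because distances between points on the sphere respect great circles, a probe that recovers (x, y, z) well is more directly useful for distance-style computations than one that recovers raw latitude/longitude.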
This is cool, although I suspect that you'd get something similar from even very simple models that aren't necessarily "modelling the world" in any deep sense, simply due to first and second order statistical associations between nearby place names. See e.g. https://onlinelibrary.wiley.com/doi/pdfdirect/10.1111/j.1551-6709.2008.01003.x , https://escholarship.org/uc/item/2g6976kg .
Well, karma is not a perfect tool. It is good at keeping good stuff above zero and bad stuff below zero, by distributed effort. It is not good at quantifying how good or how bad the stuff is.
Solving alignment = positive karma. Cute kitten = positive karma. Ugly kitten = negative karma. Promoting homeopathy = negative karma.
It is a good tool for removing homeopathy and ugly kittens. Without it, we would probably have more of those. So despite all the disadvantages, I want the karma system to stay. Until perhaps we invent something better.
I think we currentl...
I interact with journalists quite a lot and I have specific preferences. Not just for articles, but for behaviour. And journalists do behave pretty strangely at times.
This account comes from talking to journalists on ~10 occasions, including being quoted in ~5 articles.
I do not trust journalists to abide by norms of privacy. If I talked to a friend and, without asking, they shared what I said with my name attached, I'd be upset. But journalists regularly act as if their profession sets up the opposite norm - that everything is publishable unless explicitly agreed otherwise. This is bizarre to me. It's like they have taken a public oath to be untrustworthy.
Perhaps they would argue that it’s a few bad journalists who behave like this, but how...
Solutions do not have to be perfect to be useful. Trust can be built up over time.
"Misinformation is impossible to combat"
When the US government simultaneously tells Facebook not to delete the anti-vaccine misinformation that the US government itself is spreading, while telling Facebook to delete other anti-vaccine misinformation that the US government doesn't like, it's obvious that the institutions aren't trustworthy, and thus they have a hard time fighting misinformation.
If the US government stopped lying, it would find it a lot easier to f...
How can we make many humans who are very good at solving difficult problems?
I made up the made-up numbers in this table of made-up numbers; therefore, the numbers in this table of made-up numbers are made-up numbers.
If you have a shitload of money, there are some projects you can give money to that would make supergenius humans on demand happen faster. If you have a fuckton of money, there are projects whose creation you could fund that would greatly accelerate this technology.
If you're young and smart, or are already an expert in stem cell / reproductive biology, biotech, or anything related to brain-computer interfaces, there are some projects you could work on.
If neither, think hard, maybe I missed something.
You can...
Thanks for answering my question directly in the second half.
I find the testimonies of rationalists who experimented with meditation less convincing than perhaps I should, simply because of selection bias. People with a pre-existing affinity for "woo" will presumably be more likely to try meditation. And they will be more likely to report that it works, whether it does or not. I am not sure how much I should discount for this; perhaps I overdo it. I don't know.
A proper experiment would require a control group -- some people who were originally skepti...
I feel like there should exist a more advanced sequence that explains the problems with filtered evidence leading to “confirmation bias”. I think the Luna sequence is already a great step in the right direction, but I feel there is a lack of the equivalent non-fiction version that just plainly lays out the issue. Maybe what I am envisioning is just a version of What Evidence Filtered Evidence? with more examples of how to practice this skill (applied to search engines, language models, someone’s own thought process, information actively hidden from you, ra...
During the last Foresight Intelligent Cooperation Workshop I got very curious about what collective intelligence tools currently exist. A list:
I also gathered the "Coordination and epistemic tools" resources here: https://www.pawel.world/Coordination-and-epistemic-tools-6508c74fbeaf4fbd8405c729993db3eb?pvs=4