More Dakka On Your Expectations
While my friend was telling me about his roommate's brash decision-making, born of despair after being rejected several times by girls he liked, he mentioned that the roommate had asked out a total of three people since high school. Only three!
While there were other factors involved in the story, I've heard enough similar troubles that it seems worth saying: three people is not a lot. Certainly not enough rejections to merit the magnitude of self-worth issues people can walk away with from so few.
If you had the expec...
I made the same mistake when I was young. And it is difficult in retrospect to pin down exactly why. I don't remember hearing or reading this explicitly, but somehow I got the idea that first you need to figure out who your "true love" is, and then you need to ask them out and... hope that the feeling is reciprocated?
Which is why I wasted lots of time worrying about how I truly felt about some person, and when I finally felt sure this was the right choice, I got rejected and was emotionally devastated.
And when I put it like this, of course it sounds co...
Frequency of physical interaction with a medium as a contributor to its addictive potential.
It seems to me that one contributor to the addictive nature of some things, short-form content in particular, could be that you have to touch your screen every seven seconds and can't take your eyes away, or you might end up watching some piece of content that is at the very least uninteresting. The reason YouTube is less addictive by a few degrees could be that the very modality of the content doesn't require you to touch it ...
Also, I'm new here so feedback/criticism would be welcomed, and direction on whether this is an appropriate "quick take" would be great as well.
I think it's okay, as in: you are not violating the local norms, but also there is a high risk of not getting any interaction, so don't be too disappointed.
On topic:
This is an interesting thought that I don't have a clear opinion on. Television is quite addictive to many people, but there is almost no interaction... except for switching channels. Smartphones are also quite addictive to many people, and there is con...
Does this help outer alignment?
Goal: tile the universe with niceness, without knowing what niceness is.
Method
We create:
- a bunch of formulations of what niceness is.
- a tiling AI, that given some description of niceness, tiles the universe with it.
- a forecasting AI, that given a formulation of niceness, a description of the tiling AI, a description of the universe and some coordinates in the universe, generates a prediction of what the part of the universe at the coordinates looks like after the tiling AI has tiled it with the formulation of niceness.
Foll...
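The setup above can be sketched as a loop. Everything below is a hypothetical placeholder (the `tile` and `forecast` functions, the formulation strings), since the actual tiling and forecasting AIs obviously cannot be written down; the point is just the shape of the procedure:

```python
# Hypothetical sketch of the proposed method. `tile` stands in for the
# tiling AI and `forecast` for the forecasting AI; both are trivial
# placeholders here, not real implementations.

def tile(niceness: str, universe: list[str]) -> list[str]:
    """Tiling AI: fill every part of the universe with the given formulation."""
    return [niceness for _ in universe]

def forecast(niceness: str, universe: list[str], coord: int) -> str:
    """Forecasting AI: predict what the universe at `coord` looks like
    after the tiling AI has tiled it with `niceness`."""
    return tile(niceness, universe)[coord]

# A bunch of (made-up) formulations of what niceness is.
formulations = ["smiling faces", "informed consent", "flourishing"]
universe = ["empty"] * 5

# Inspect the forecasts for each formulation before committing to any of them.
for f in formulations:
    predictions = [forecast(f, universe, i) for i in range(len(universe))]
    print(f, "->", predictions[0])
```

The hope, as described, is that humans can inspect the forecasts and reject formulations whose tiled futures look bad, without ever needing a correct definition of niceness up front.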
I like the relative simplicity of this approach, but yeah, there is a risk that a tiling agent would produce (a more sophisticated version of) humans that have a permanent smile on their faces but feel horrible pain inside. Something bad that would look convincingly good at first sight, enough to fool the forecasting AI, or rather enough to fool the people who are programming and testing the forecasting AI.
(Not a take, just pulling out infographics and quotes for future reference from the new DeepMind paper outlining their approach to technical AGI safety and security)
Overview of risk areas, grouped by factors that drive differences in mitigation approaches:
Overview of their approach to mitigating misalignment:
Overview of their approach to mitigating misuse:
Path to deceptive alignment:
How to use interpretability:
| Goal | Understanding v Control | Confidence | Concept v Algorithm | (Un)supervised? | How context specific? |
|------|-------------------------|------------|---------------------|-----------------|------------------------|
| Alignment evaluations | Understanding | Any | Concept | | |
Some versions of the METR time horizon paper from alternate universes:
Measuring AI Ability to Take Over Small Countries (idea by Caleb Parikh)
Abstract: Many are worried that AI will take over the world, but extrapolation from existing benchmarks suffers from a large distributional shift that makes it difficult to forecast the date of world takeover. We rectify this by constructing a suite of 193 realistic, diverse countries with territory sizes from 0.44 to 17 million km^2. Taking over most countries requires acting over a long time horizon, with the excep...
I'm aware of a study that found that the human brain clearly responds to changes in the direction of the earth's magnetic field (iirc, the test chamber isolated the participant from the earth's field, then generated its own field and moved it, while measuring their brain activity in some way), despite no human ever having been known to consciously perceive the magnetic field or have the abilities of a compass.
So, presumably, compass abilities could be taught through a neurofeedback training exercise.
I don't think anyone's tried to do this (searching "neurofeedback magnetoreception" finds no results).
But I guess the big mystery is why humans don't already have this ability.
has anyone seen a good way to comprehensively map the possibility space for AI safety research?
in particular: a map from predictive conditions (eg OpenAI develops superintelligence first, no armistice is reached with China, etc) to strategies for ensuring human welfare in those conditions.
most good safety papers I read map one set of conditions to one or a few strategies. the map would juxtapose all these conditions so that we can evaluate/bet on their likelihoods and come up with strategies based on a full view of SOTA safety research.
for format, i'm imagining either a visual concept map or at least some kind of hierarchical collaborative outlining tool (eg Roam Research)
every 4 years, the US has the opportunity to completely pivot its entire policy stance on a dime. this is more politically costly to do if you're a long-lasting autocratic leader, because it is embarrassing to contradict your previous policies. I wonder how much of a competitive advantage this is.
We just put out a Manifund proposal to take short timelines and automating AI safety seriously. I want to make a more detailed post later, but here it is: https://manifund.org/projects/coordinal-research-accelerating-the-research-of-safely-deploying-ai-systems
I think I've just figured out why decision theories strike me as utterly pointless: they get around the actual hard part of making a decision. In general, decisions are not hard because you are weighing payoffs, but because you are dealing with uncertainty.
To operationalise this: a decision theory usually assumes that you have some number of options, each with some defined payout. Assuming payouts are fixed, all decision theories simply advise you to pick the outcome with the highest utility. "Difficult problems" in decision theory are problems where the p...
To operationalise this: a decision theory usually assumes that you have some number of options, each with some defined payout. Assuming payouts are fixed, all decision theories simply advise you to pick the outcome with the highest utility.
The theories typically assume that each choice option has a number of known mutually exclusive (and jointly exhaustive) possible outcomes. And to each outcome the agent assigns a utility and a probability. So uncertainty is in fact modelled insofar as the agent can assign subjective probabilities to those outcomes occurr...
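The model this describes can be made concrete with a toy example. The option names, probabilities, and utilities below are all made up for illustration:

```python
# Expected-utility choice under the standard model: each option has a set of
# mutually exclusive, jointly exhaustive outcomes, each with a subjective
# probability and a utility.
options = {
    "take umbrella": [(0.3, 5), (0.7, 4)],     # (probability, utility) pairs
    "leave umbrella": [(0.3, -10), (0.7, 6)],  # e.g. 0.3 = chance of rain
}

def expected_utility(outcomes):
    # Outcomes must be jointly exhaustive: probabilities sum to 1.
    assert abs(sum(p for p, _ in outcomes) - 1.0) < 1e-9
    return sum(p * u for p, u in outcomes)

best = max(options, key=lambda o: expected_utility(options[o]))
print(best)  # -> take umbrella (EU 4.3 vs 1.2)
```

This makes the point of the exchange above visible: once the probabilities and utilities are written down, the "decision" is a one-line `max`; the hard part the original comment gestures at is where those numbers come from.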
In [Intro to brain-like-AGI safety] 10. The alignment problem and elsewhere, I’ve been using “outer alignment” and “inner alignment” in a model-based actor-critic RL context to refer to:
“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.
For some reason it took me until now to notice that:
“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.
I guess I just briefly want to flag that I think this summary of inner vs. outer alignment is confusing, in a way that makes it sound like one could have a good enough ground-truth reward, and then that reward just has to be internalized.
I think this summary is better: 1. "The AGI was doing the wrong thing but got rewarded anyway (or doing the right thing but got punished)". 2. Something else went wrong [not easily compressible].
Lee Billings' book Five Billion Years of Solitude has the following poetic passage on deep time that's stuck with me ever since I read it in Paul Gilster's post:
...Deep time is something that even geologists and their generalist peers, the earth and planetary scientists, can never fully grow accustomed to.
The sight of a fossilized form, perhaps the outline of a trilobite, a leaf, or a saurian footfall can still send a shiver through their bones, or excavate a trembling hollow in the chest that breath cannot fill. They can measure celestial motions and l
Context: LessWrong has been acquired by EA
Goodbye EA. I am sorry we messed up.
EA has decided to not go ahead with their acquisition of LessWrong.
Just before midnight last night, the Lightcone Infrastructure board presented me with information suggesting at least one of our external software contractors has not been consistently candid with the board and me. Today I have learned EA has fully pulled out of the deal.
As soon as EA had sent over their first truckload of cash, we used that money to hire a set of external software contractors, vetted ...
How much serendipity is too much?
Twice now, within the month-to-date, I've started seriously looking into an idea and, within a week or two, a directly related post finds its way to the HackerNews front page. As far as I know, HN doesn't have a personalization engine that could've tracked me across the internet to show these to me specifically.
I don't mean to imply anything crazy is going on here -- it's entirely possible that some decent subsection of the HN cohort was primed in some way, and I'm perfectly aware of the frequency bias (and more vulnerable ...
"Just ask the LM about itself" seems like a weirdly effective way to understand language models' behaviour.
There's lots of circumstantial evidence that LMs have some concept of self-identity.
Nice reminiscence from Stephen Wolfram on his time with Richard Feynman:
...Feynman loved doing physics. I think what he loved most was the process of it. Of calculating. Of figuring things out. It didn’t seem to matter to him so much if what came out was big and important. Or esoteric and weird. What mattered to him was the process of finding it. And he was often quite competitive about it.
Some scientists (myself probably included) are driven by the ambition to build grand intellectual edifices. I think Feynman — at least in the years I knew him — was m
LLM activation space is spiky. This is not a novel idea but something I believe many mechanistic interpretability researchers are not aware of. Credit to Dmitry Vaintrob for making this idea clear to me, and to Dmitrii Krasheninnikov for inspiring this plot by showing me a similar plot in a setup with categorical features.
Under the superposition hypothesis, activations are linear combinations of a small number of features. This means there are discrete subspaces in activation space that are "allowed" (can be written as the sum of a small number of features...
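A toy version of this picture (the dimensions, feature count, and sparsity below are made up): sample activations as sparse linear combinations of fixed feature directions, so each activation lies in one of many small "allowed" subspaces rather than anywhere in activation space.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model, k = 50, 16, 3  # many features, small space, k active at once

# Random feature directions (nearly orthogonal in high dimensions), normalized.
features = rng.normal(size=(n_features, d_model))
features /= np.linalg.norm(features, axis=1, keepdims=True)

def sample_activation():
    """Activation = linear combination of a small number (k) of features."""
    idx = rng.choice(n_features, size=k, replace=False)
    coeffs = rng.random(k)
    return coeffs @ features[idx]

acts = np.stack([sample_activation() for _ in range(1000)])
# Each activation lies in one of C(50, 3) = 19600 possible 3-dimensional
# subspaces: a "spiky" union of subspaces, not a filled-out ball in R^16.
print(acts.shape)  # -> (1000, 16)
```

Under this hypothesis, generic directions in activation space are "disallowed": almost no sampled activation lies near them, which is what "spiky" is pointing at.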