Quick Takes


More Dakka On Your Expectations

A friend was telling me about his roommate's brash decision-making, driven by despair at having been rejected several times by girls he liked. Then my friend mentioned that his roommate had asked out a total of three people since high school. Only three!

While more factors were involved in the story, I've heard enough similar troubles that it seems worth saying: three people is not a lot. Certainly not enough rejections to merit the magnitude of self-worth issues people can walk away with from that few.

If you had the expec... (read more)

I made the same mistake when I was young. And it is difficult in retrospect to find out exactly why. I don't remember hearing or reading this explicitly, but somehow I got the idea that first you need to figure out who is your "true love" and then you need to ask them out and... hope that the feeling is reciprocated?

Which is why I have wasted lots of time worrying about how I truly feel about some person, and when I finally felt sure this was the right choice, I got rejected and was emotionally devastated.

And when I put it like this, of course it sounds co... (read more)

Frequency of physical interaction with a medium as a contributor to its addictive potential.

It seems to me that one contributor to the addictive nature of some things, short-form content in particular, could be the fact that you have to freaking touch your screen every seven seconds and can't take your eyes away, or you might end up watching some piece of content that is at the very least uninteresting. The reason YouTube is less addictive by a few degrees could be that the very modality of the content doesn't require you to touch it ... (read more)

Also, I'm new here so feedback/criticism would be welcomed, and direction on whether this is an appropriate "quick take" would be great as well.

I think it's okay, as in: you are not violating the local norms, but also there is a high risk of not getting any interaction, so don't be too disappointed.

On topic:

This is an interesting thought that I don't have a clear opinion on. Television is quite addictive to many people, but there is almost no interaction... except for switching channels. Smartphones are also quite addictive to many people, and there is con... (read more)

Does this help outer alignment?

Goal: tile the universe with niceness, without knowing what niceness is.

Method

We create:
- a bunch of formulations of what niceness is.
- a tiling AI, that given some description of niceness, tiles the universe with it.
- a forecasting AI, that given a formulation of niceness, a description of the tiling AI, a description of the universe and some coordinates in the universe, generates a prediction of what the part of the universe at the coordinates looks like after the tiling AI has tiled it with the formulation of niceness.

Foll... (read more)

I like the relative simplicity of this approach, but yeah, there is a risk that a tiling agent would produce (a more sophisticated version of) humans that have a permanent smile on their faces but feel horrible pain inside. Something bad that would look convincingly good at first sight, enough to fool the forecasting AI, or rather enough to fool the people who are programming and testing the forecasting AI.

(Not a take, just pulling out infographics and quotes for future reference from the new DeepMind paper outlining their approach to technical AGI safety and security)

Overview of risk areas, grouped by factors that drive differences in mitigation approaches: 


Overview of their approach to mitigating misalignment: 


Overview of their approach to mitigating misuse:


Path to deceptive alignment:


How to use interpretability:

| Goal | Understanding v Control | Confidence | Concept v Algorithm | (Un)supervised? | How context specific? |
|---|---|---|---|---|---|
| Alignment evaluations | Understanding | Any | Concept | | |
... (read more)

Some versions of the METR time horizon paper from alternate universes:

Measuring AI Ability to Take Over Small Countries (idea by Caleb Parikh)

Abstract: Many are worried that AI will take over the world, but extrapolation from existing benchmarks suffers from a large distributional shift that makes it difficult to forecast the date of world takeover. We rectify this by constructing a suite of 193 realistic, diverse countries with territory sizes from 0.44 to 17 million km^2. Taking over most countries requires acting over a long time horizon, with the excep... (read more)

wonder
Would the takeover of small countries also cover humans using an advanced AI to take over? (Or would a human-led takeover using advanced AI happen faster?)
Buck
A few months ago, I accidentally used France as an example of a small country that it wouldn't be that catastrophic for AIs to take over, while giving a talk in France 😬

Didn't watch the video, but is there a short version of this argument? France is at the 90th percentile of population sizes and also has the 4th-most nukes.

I'm aware of a study that found that the human brain clearly responds to changes in the direction of the earth's magnetic field (IIRC, the test chamber isolated the participant from the earth's field, then generated its own field and rotated it, while measuring their brain activity in some way), despite no human ever having been known to consciously perceive the magnetic field or to have the abilities of a compass.

So, presumably, compass abilities could be taught through a neurofeedback training exercise.

I don't think anyone has tried to do this ("neurofeedback magnetoreception" finds no results).

But I guess the big mystery is why humans don't already have this ability.

Alexander Gietelink Oldenziel
I've heard of this extraordinary finding. As for any extraordinary evidence, the first question should be: is the data accurate? Does anybody know if this has been replicated?

I briefly glanced at Wikipedia and there seemed to be two articles supporting it. This one might be the one I'm referring to (if not, it's a bonus), and this one seems to suggest that conscious perception has been trained.

has anyone seen a good way to comprehensively map the possibility space for AI safety research?

in particular: a map from predictive conditions (eg OpenAI develops superintelligence first, no armistice is reached with China, etc) to strategies for ensuring human welfare in those conditions.

most good safety papers I read map one set of conditions to one or a few strategies. the map would juxtapose all these conditions so that we can evaluate/bet on their likelihoods and come up with strategies based on a full view of SOTA safety research.

for format, I'm imagining either a visual concept map or at least some kind of hierarchical collaborative outlining tool (eg Roam Research)

every 4 years, the US has the opportunity to completely pivot its entire policy stance on a dime. this is more politically costly to do if you're a long-lasting autocratic leader, because it is embarrassing to contradict your previous policies. I wonder how much of a competitive advantage this is.

Coordinal Research: Accelerating the research of safely deploying AI systems.

 

We just put out a Manifund proposal to take short timelines and automating AI safety seriously. I want to make a more detailed post later, but here it is: https://manifund.org/projects/coordinal-research-accelerating-the-research-of-safely-deploying-ai-systems 

I think I've just figured out why decision theories strike me as utterly pointless: they get around the actual hard part of making a decision. In general, decisions are not hard because you are weighing payoffs, but because you are dealing with uncertainty.

To operationalise this: a decision theory usually assumes that you have some number of options, each with some defined payout. Assuming payouts are fixed, all decision theories simply advise you to pick the outcome with the highest utility. "Difficult problems" in decision theory are problems where the p... (read more)

To operationalise this: a decision theory usually assumes that you have some number of options, each with some defined payout. Assuming payouts are fixed, all decision theories simply advise you to pick the outcome with the highest utility.

The theories typically assume that each choice option has a number of known mutually exclusive (and jointly exhaustive) possible outcomes. And to each outcome the agent assigns a utility and a probability. So uncertainty is in fact modelled, insofar as the agent can assign subjective probabilities to those outcomes occurr... (read more)
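The picture the reply describes — options whose outcomes carry subjective probabilities and utilities, pick the option with the highest expected utility — fits in a few lines. A minimal sketch (the options and all the numbers here are invented for illustration):

```python
# Toy expected-utility choice. Each option maps to (probability, utility)
# pairs over mutually exclusive outcomes; probabilities are subjective.
options = {
    "umbrella":    [(0.3, 5), (0.7, 4)],     # rain, no rain
    "no_umbrella": [(0.3, -10), (0.7, 6)],
}

def expected_utility(outcomes):
    """Sum of probability-weighted utilities over an option's outcomes."""
    return sum(p * u for p, u in outcomes)

# Every decision theory of this shape then just says: take the argmax.
best = max(options, key=lambda o: expected_utility(options[o]))
print(best)  # → umbrella  (EU 4.3 beats 1.2)
```

The uncertainty lives entirely in the subjective probabilities; once those are fixed, the "decision" reduces to an argmax, which is the quick take's complaint.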

mako yass
I think unpacking that kind of feeling is valuable, but yeah it seems like you've been assuming we use decision theory to make decisions, when we actually use it as an upper bound model to derive principles of decisionmaking that may be more specific to human decisionmaking, or to anticipate the behavior of idealized agents, or (the distinction between CDT and FDT) as an allegory for toxic consequentialism in humans.

In [Intro to brain-like-AGI safety] 10. The alignment problem and elsewhere, I’ve been using “outer alignment” and “inner alignment” in a model-based actor-critic RL context to refer to:

“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.

For some reason it took me until now to notice that:

... (read more)

“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.

I guess I just briefly want to flag that I think this summary of inner-vs-outer alignment is confusing: it makes it sound like one could have a good-enough ground-truth reward, and then that reward just has to be internalized.

I think this summary is better: 1. "The AGI was doing the wrong thing but got rewarded anyway (or doing the right thing but got punished)". 2. Something else went wrong [not easily compressible].

Towards_Keeperhood
Sounds like we probably agree basically everywhere. Yeah, you can definitely mark me down in the camp of "not use 'inner' and 'outer' terminology". If you need something for "outer", how about "reward specification (problem/failure)".

ADDED: I think I probably don't want a word for inner-alignment/goal-misgeneralization. It would be like having a word for "the problem of landing a human on the moon, except without the part of the problem where we might actively steer the rocket into wrong directions".

Yeah, I agree they don't appear in actor-critic model-based RL per se, but sufficiently smart agents will likely be reflective, and then they will appear there on the reflective level I think. Or more generally I think when you don't use utility functions explicitly then capability likely suffers, though not totally sure.
Towards_Keeperhood
Thanks. Yeah, I guess I wasn't thinking concretely enough. I don't know whether something vaguely like what I described might be likely or not. Let me think out loud a bit about how I think about what you might be imagining so you can correct my model. So here's a bit of rambling: (I think point 6 is most important.)

1. As you described in your intuitive self-models sequence, humans have a self-model which can essentially have values different from the main value function, aka they can have ego-dystonic desires.
2. I think in smart reflective humans, the policy suggestions of the self-model/homunculus can be more coherent than the value function estimates, e.g. because they can better take abstract philosophical arguments into account.
   1. The learned value function can also update on hypothetical scenarios, e.g. imagining a risk or a gain, but it doesn't update strongly on abstract arguments like "I should correct my estimates based on outside view".
3. The learned value function can learn to trust the self-model if acting according to the self-model is consistently correlated with higher-than-expected reward.
4. Say we have a smart reflective human where the value function basically trusts the self-model a lot; then the self-model could start optimizing its own values, while the (stupid) value function believes it's best to just trust the self-model and that this will likely lead to reward. Something like this could happen where the value function was actually aligned to outer reward, but the inner suggestor was just very good at making suggestions that the value function likes, even if the inner suggestor would have different actual values. I guess if the self-model suggests something that actually leads to less reward, then the value function will trust the self-model less, but outside the training distribution the self-model could essentially do what it wants.
   1. Another question of course is whether the inner self-reflective optimizers are likely al

Recently, several promising diffusion language models have been introduced (Dream7b, Ladda). They are still based on transformers. In case they become more popular, how will they impact the interpretability and scaling of LLMs?

Any chance we could get Ghibli Mode back? I miss my little blue monster :(

Lee Billings' book Five Billion Years of Solitude has the following poetic passage on deep time that's stuck with me ever since I read it in Paul Gilster's post:

Deep time is something that even geologists and their generalist peers, the earth and planetary scientists, can never fully grow accustomed to. 

The sight of a fossilized form, perhaps the outline of a trilobite, a leaf, or a saurian footfall can still send a shiver through their bones, or excavate a trembling hollow in the chest that breath cannot fill. They can measure celestial motions and l

... (read more)
lc

My strong upvotes are now giving +1 and my regular upvotes give +2.


Hmm I wonder if this is why so many April Fools posts have >200 upvotes. April Fools Day in cahoots with itself?

Richard_Kennaway
I notice that although the loot box is gone, the unusually strong votes that people made yesterday persist.
Richard_Kennaway
But now they’re gone! I didn’t expect them to be real, but still, owowowowow! That’s loss aversion for you.

Context: LessWrong has been acquired by EA 

Goodbye EA. I am sorry we messed up. 

EA has decided to not go ahead with their acquisition of LessWrong.

Just before midnight last night, the Lightcone Infrastructure board presented me with information suggesting at least one of our external software contractors has not been consistently candid with the board and me. Today I have learned EA has fully pulled out of the deal.

As soon as EA had sent over their first truckload of cash, we used that money to hire a set of external software contractors, vetted ... (read more)

How much serendipity is too much?

Twice now, within the month-to-date, I've started seriously looking into an idea and, within a week or two, a directly related post finds its way to the HackerNews front page. As far as I know, HN doesn't have a personalization engine that could've tracked me across the internet to show these to me specifically.

I don't mean to imply anything crazy is going on here -- it's entirely possible that some decent subsection of the HN cohort was primed in some way, and I'm perfectly aware of the frequency bias (and more vulnerable ... (read more)

"Just ask the LM about itself" seems like a weirdly effective way to understand language models' behaviour. 

There's lots of circumstantial evidence that LMs have some concept of self-identity. 

... (read more)
eggsyntax
Self-identity / self-modeling is increasingly seeming like an important and somewhat neglected area to me, and I'm tentatively planning on spending most of the second half of 2025 on it (and would focus on it more sooner if I didn't have other commitments). It seems to me like frontier models have an extremely rich self-model, which we only understand bits of. Better understanding, and learning to shape, that self-model seems like a promising path toward alignment.

I agree that introspection is one valuable approach here, although I think we may need to decompose the concept. Introspection in humans seems like some combination of actual perception of mental internals (I currently dislike x), ability to self-predict based on past experience (in the past when faced with this choice I've chosen y), and various other phenomena like coming up with plausible but potentially false narratives. 'Introspection' in language models has mostly meant ability to self-predict, in the literature I've looked at.

I have the unedited beginnings of some notes on approaching this topic, and would love to talk more with you and/or others about it. Thanks for this, some really good points and cites.
the gears to ascension
willingness seems likely to be understating it. a context where the capability is even part of the author context seems like a prereq. finetuning would produce that, with fewshot one has to figure out how to make it correlate. I'll try some more ideas.

a context where the capability is even part of the author context

Can you unpack that a bit? I'm not sure what you're pointing to. Maybe something like: few-shot examples of correct introspection (assuming you can identify those)?

Nice reminiscence from Stephen Wolfram on his time with Richard Feynman:

Feynman loved doing physics. I think what he loved most was the process of it. Of calculating. Of figuring things out. It didn’t seem to matter to him so much if what came out was big and important. Or esoteric and weird. What mattered to him was the process of finding it. And he was often quite competitive about it. 

Some scientists (myself probably included) are driven by the ambition to build grand intellectual edifices. I think Feynman — at least in the years I knew him — was m

... (read more)

LLM activation space is spiky. This is not a novel idea but something I believe many mechanistic interpretability researchers are not aware of. Credit to Dmitry Vaintrob for making this idea clear to me, and to Dmitrii Krasheninnikov for inspiring this plot by showing me a similar plot in a setup with categorical features.

Under the superposition hypothesis, activations are linear combinations of a small number of features. This means there are discrete subspaces in activation space that are "allowed" (can be written as the sum of a small number of features... (read more)
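A minimal toy sketch of that geometry (the sizes and sampling choices are my own assumptions, not from the post): with more features than dimensions, each activation is a sparse combination of a few feature directions, so the set of "allowed" points is a union of low-dimensional subspaces — spiky — rather than a filled-in region.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy superposition setup: 256 unit-norm feature directions crammed
# into 64 dimensions, with only k=3 features active per activation.
d, n_feats, k = 64, 256, 3
features = rng.standard_normal((n_feats, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

def sample_activation(rng):
    """Activation = sparse positive combination of a few feature directions."""
    idx = rng.choice(n_feats, size=k, replace=False)
    coeffs = rng.exponential(size=k)
    return coeffs @ features[idx], idx

acts = [sample_activation(rng) for _ in range(1000)]

# Each activation lies exactly in the k-dim span of its active features, so
# the allowed region is a union of C(256, 3) tiny subspaces, not a ball.
a, idx = acts[0]
coefs = np.linalg.lstsq(features[idx].T, a, rcond=None)[0]
proj = coefs @ features[idx]
print(np.allclose(a, proj))  # → True: the point sits inside its feature span
```

A generic dense point in the 64-dimensional space, by contrast, is far from every one of these sparse subspaces with high probability, which is what makes the allowed set "spiky".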
