leogao
I decided to conduct an experiment at NeurIPS this year: I randomly surveyed people walking around in the conference hall to ask whether they had heard of AGI. I found that out of 38 respondents, only 24 could tell me what AGI stands for (63%). We live in a bubble. (https://x.com/nabla_theta/status/1869144832595431553)
I really should have blogged more about faithful CoT as soon as I left OpenAI, or better yet, while I was working on it there in 2023. Oh well. A snippet I could have said but didn't:

A core reason to be excited about faithful CoT as a big-picture alignment agenda is that it can be pursued as a builder/breaker game, i.e. we have clear empirical feedback / metrics we can progress on. Builders try to design AI systems that are interpretable / have faithful CoT properties; breakers then fine-tune those systems to see if they are capable of learning steganography, subtle biases, deception, etc. (i.e. breaking those faithfulness properties).

  • Builder example: Shoggoth/face + paraphraser
    • Visible-to-reward-system scratch pad → Secret scratch pad → Shoggoth/face
    • Recurrency or other voodoo → tokens → tokens-passed-through-paraphraser-bottleneck
  • Breaker example: DC evals for steganography, subtle bias, etc.
    • Ideally we’d just instruct it to conceal stuff in the secret scratch pad, and see if it can.
    • But we should do more serious elicitation involving fine-tuning etc.
  • Goal: Stay in the regime longer as our systems get smarter, hopefully resulting in more learning opportunities…
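A minimal sketch of the paraphraser bottleneck in the builder example above. All three model calls are hypothetical stand-ins I've made up (`generate_cot`, `paraphrase`, `generate_answer`); a real pipeline would use LLM calls. The point is structural: the face only ever sees a paraphrase of the shoggoth's chain of thought, so any message hidden in exact token choices shouldn't survive the bottleneck.

```python
# Sketch of a shoggoth/face pipeline with a paraphraser bottleneck.
# All three "models" below are placeholder functions, not real APIs.

def generate_cot(question: str) -> str:
    """'Shoggoth': produce a chain of thought for the question."""
    return f"Step 1: restate '{question}'. Step 2: reason. Step 3: conclude."

def paraphrase(text: str) -> str:
    """Bottleneck: rewrite the CoT, preserving meaning but not exact tokens.
    In practice this would be a separate model; steganographic signals
    carried by precise wording should not survive this step."""
    return " ".join(text.lower().split())  # crude stand-in for a paraphraser

def generate_answer(question: str, cot: str) -> str:
    """'Face': produce the final answer from the *paraphrased* CoT only."""
    return f"Answer to '{question}', based on: {cot}"

def answer(question: str) -> str:
    raw_cot = generate_cot(question)   # visible to the reward system
    safe_cot = paraphrase(raw_cot)     # destroys token-level side channels
    return generate_answer(question, safe_cot)

print(answer("What is 2 + 2?"))
```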
Sam Rosen
Parable of the Purple Creatures

Some Medieval townsfolk thought witches were poisoning their wells. Witches, of course, are people—often ugly—who are in league with Satan, can do magic, and have an affinity for broomsticks. These villagers wanted to catch the witches so they could punish them. Everyone in the town felt that witches should be punished. So they set up a commission to investigate the matter.

Well, it turned out little purple alien creatures were poisoning their wells. They *weren’t* human. They *couldn’t* do magic. They *weren’t* in league with Satan. But they *did* cackle. They *did* kind of look like witches. They *did* constantly carry around broomsticks. And they *did* poison wells.

The commission then split into two factions.

Faction one: These purple creatures are what we should mean by the term “witches.” Their argument: "We used to think ‘stars’ were holes in the firmament, but now we know they are massive fusion reactions many light years away. We revised our star concept, so why can't we revise our witch concept? If we continue using the witch concept, none of our laws have to change!"

Faction two: Witches don't exist. We need a new word for purple-creatures. Their argument: “A huge component of our witch concept is that they are humans and can do magic. These creatures are so different from our traditional conception of witches that we should accept that witches don't exist, and we should start calling these things something else. Also, it would likely cause confusion, because many people will think these things can do magic.”

Now, there is no epistemic reason why you should side with one faction over the other. You just sort of have to ask yourself which does more damage to your intuitions. You have to ask yourself about the political and pragmatic effects of supporting faction 1 or faction 2.

So much of philosophy is just deciding whether to call purple creatures witches or not. So let me flesh that out. 1) People often thi
What it's like to organise AISC

About once or twice per week this time of year, someone emails me to ask:

My response: Because not leaving the loophole would be too restrictive for other reasons, and I'm not going to not tell people all their options.

The fact that this puts the responsibility back on them is a bonus feature I really like. Our participants are adults, and are allowed to make their own mistakes. But also, sometimes it's not a mistake, because there is no set of rules for all occasions, and I don't have all the context of their personal situation.
Sam Marks
x-posting a kinda rambling thread I wrote about this blog post from Tilde research.

---

If true, this is the first known application of SAEs to a found-in-the-wild problem: using LLMs to generate fuzz tests that don't use regexes. A big milestone for the field of interpretability! I'll discuss some things that surprised me about this case study below.

---

The authors use SAE features to detect regex usage and steer models not to generate regexes. Apparently the company that ran into this problem had already tried and discarded baseline approaches like better prompt engineering and asking an auxiliary model to rewrite answers. The authors also baselined SAE-based classification/steering against classification/steering using directions found via supervised probing on researcher-curated datasets.

It seems like SAE features are outperforming baselines here because of the following two properties:

1. It's difficult to get high-quality data that isolates the behavior of interest. (I.e. it's difficult to make a good dataset for training a supervised probe for regex detection.)
2. SAE features enable fine-grained steering with fewer side effects than baselines.

Property (1) is not surprising in the abstract, and I've often argued that if interpretability is going to be useful, then it will be for tasks where there are structural obstacles to collecting high-quality supervised data (see e.g. the opening paragraph of section 4 of Sparse Feature Circuits, https://arxiv.org/abs/2403.19647).

However, I think property (1) is a bit surprising in this particular instance—it seems like getting good data for the regex task is more "tricky and annoying" than "structurally difficult." I'd weakly guess that if you are a whiz at synthetic data generation then you'd be able to get good enough data here to train probes that outperform the SAEs. But that's not too much of a knock against SAEs—it's still cool if they enable an application that would otherwise require synthetic datagen expert
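For readers who haven't touched SAEs: a minimal numpy sketch of what SAE-based classification and steering look like mechanically. Everything here is made up for illustration (toy dimensions, random weights, an arbitrary feature index standing in for the learned "regex" feature); it is not Tilde's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 64, 512  # toy sizes; real models are far larger

# Toy SAE parameters (in practice these are learned, not random).
W_enc = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)

REGEX_FEATURE = 42  # hypothetical index of the "regex usage" feature

def feature_acts(resid: np.ndarray) -> np.ndarray:
    """Encode a residual-stream vector into sparse feature activations."""
    return np.maximum(W_enc @ resid + b_enc, 0.0)

def detect_regex(resid: np.ndarray, threshold: float = 1.0) -> bool:
    """Classification: flag the token if the regex feature fires strongly."""
    return bool(feature_acts(resid)[REGEX_FEATURE] > threshold)

def steer_away_from_regex(resid: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Steering: subtract the feature's decoder direction from the residual."""
    return resid - alpha * W_dec[REGEX_FEATURE]

x = rng.normal(size=d_model)
print(detect_regex(x), np.linalg.norm(steer_away_from_regex(x) - x))
```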


Recent Discussion

This is a linkpost for https://www.thecompendium.ai/

We (Connor Leahy, Gabriel Alfour, Chris Scammell, Andrea Miotti, Adam Shimi) have just published The Compendium, which brings together in a single place the most important arguments that drive our models of the AGI race, and what we need to do to avoid catastrophe.

We felt that something like this has been missing from the AI conversation. Most of these points have been shared before, but until now no single "comprehensive worldview" doc has brought them together. We've tried our best to fill this gap, and welcome feedback and debate about the arguments. The Compendium is a living document, and we'll keep updating it as we learn more and change our minds.

We would appreciate your feedback, whether or not you agree with us:

  • If you do agree with us, please point out where you
...

I think that the arguments for why godlike AI will make us extinct are not described well in the Compendium. I could not find them in AI Catastrophe, only a hint at the end that they will be in the next section:

    "The obvious next question is: why would godlike-AI not be under our control, not follow our goals, not care about humanity? Why would we get that wrong in making them?"

In the next section, AI Safety, we can find the definition of AI alignment and arguments for why it is really hard. This is all good, but it does not answer the question of w... (read more)

Some meditation advice has a vibe like... “To become more present, all you need to do is practice! It's just a skill, like learning to ride a bike 😊”

This never worked for me. When I tried to force presence through practice, I made little progress.

Being present isn't a skill to build — it's the natural state! What blocks presence is unconscious predictions of bad outcomes. Remove these blocks to make presence automatic.

  1. I struggled to be present with emotions in my body. Daily meditation didn't help. Finally I asked: “What bad thing happens if I feel my feelings?” My system responded: emotional pain would be overwhelming, feelings would make me less productive, expressing emotions would anger others. After untangling these predictions, emotional presence became much easier. (Details)
  2. With others'
...

Logical counterfactuals are when you say something like "Suppose $P$, what would that imply?"

They play an important role in logical decision theory.

Suppose you take a false proposition $P$ and then take a logical counterfactual in which $P$ is true. I am imagining this counterfactual as a function $C$ that sends counterfactually true statements to 1 and false statements to 0.

Suppose $P$ is "not Fermat's last theorem". In the counterfactual where Fermat's last theorem is false, I would still expect 2+2=4. Perhaps not with measure 1, but close. So $C(2+2=4) \approx 1$.

On the other hand, I would expect trivial rephrasings of Fermat's last theorem to be false, or at least mostly false.

But does this counterfactual produce a specific counterexample? Does it think that $a^n + b^n = c^n$ for some particular $a, b, c, n > 2$? Or does it do something where the counterfactual insists a counterexample exists, but...
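Collecting the desiderata so far in one display (my notation, with $C$ as above; I take $C$ to be $[0,1]$-valued so the "perhaps not with measure 1" caveat makes sense):

$$C:\ \text{sentences} \to [0,1], \qquad C(2+2=4) \approx 1, \qquad C(\text{trivial rephrasings of FLT}) \approx 0.$$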

Donald Hobson
There is a model of bounded rationality, logical induction. Can that be used to handle logical counterfactuals?
Donald Hobson
And here the main difficulty pops up again. There is no causal connection between your choice and their choice. Any correlation is a logical one. So imagine I make a copy of you. But the copying machine isn't perfect. A random 0.001% of neurons are deleted. Also, you know you aren't a copy. How would you calculate those probabilities p, q? Even in principle.
Donald Hobson
If two Logical Decision Theory (LDT) agents with perfect knowledge of each other's source code play the prisoner's dilemma, theoretically they should cooperate. LDT uses logical counterfactuals in its decision making. If the agents are CDT, then logical counterfactuals are not involved.
JBlack

If they have source code, then they are not perfectly rational and cannot in general implement LDT. They can at best implement a boundedly rational subset of LDT, which will have flaws.

Assume the contrary: Then each agent can verify that the other implements LDT, since perfect knowledge of the other's source code includes the knowledge that it implements LDT. In particular, each can verify that the other's code implements a consistent system that includes arithmetic, and can run the other on their own source to consequently verify that they themselves impl... (read more)
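A toy illustration of the role source code plays in this setup (a hypothetical sketch, not an implementation of LDT: it replaces the proof-based verification discussed above with the crudest possible check, exact syntactic equality of source code):

```python
# Toy "cooperate iff I can verify you're like me" bots for the prisoner's
# dilemma. Real LDT-style agents would prove things about each other's
# behavior; this sketch only checks for exact source equality.

AGENT_SOURCE = '''
def decide(my_source: str, other_source: str) -> str:
    # Cooperate only if the opponent is running this exact program.
    return "C" if other_source == my_source else "D"
'''

def run(source: str, my_source: str, other_source: str) -> str:
    namespace: dict = {}
    exec(source, namespace)
    return namespace["decide"](my_source, other_source)

# Two identical agents inspect each other's code and cooperate.
print(run(AGENT_SOURCE, AGENT_SOURCE, AGENT_SOURCE))                    # "C"
# Against a defect-bot with different source, the agent defects.
print(run(AGENT_SOURCE, AGENT_SOURCE, "def decide(a, b): return 'D'"))  # "D"
```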

Rationalists are people who have an irrational preference for rationality.

This may sound silly, but when you think about it, it couldn't be any other way. I am not saying that all reasons in favor of rationality are irrational -- in fact, there are many rational reasons to be rational! It's just that "rational reasons to be rational" is a circular argument that is not going to impress anyone who doesn't already care about rationality for some other reason.

So when there is a debate like "but wouldn't the right kind of self-deception be more instrumentally useful than perfectly calibrated rationality? do you care more about rationality or about winning?", well... you can make good arguments for both sides...

On one hand, yes, if your goal is to maximize your...

I have an irrational preference

If your utility function weights your knowing things more heavily than most people's utility functions do, that is not an irrationality.

Pacing—walking repeatedly over the same ground—often feels ineffably good while I’m doing it, but then I forget about it for ages, so I thought I’d write about it here.

I don’t mean just going for an inefficient walk—it is somehow different to just step slowly in a circle around the same room for a long time, or up and down a passageway.

I don’t know why it would be good, but some ideas:

  1. For some reason, it’s good to be physically engaged while thinking. I used to do ‘gymflection’ with a friend, where we would do strength exercises at the gym while reflecting on our lives, what was going well, and what we might do better. This felt good in a way that didn’t
...

Pacing is a common stimming behavior. It's associated with autism / sensory processing disorder, but neurotypical people do it too.

As the year comes to an end, the days have grown shorter and the nights have grown longer. Tonight we remember the ancient days of humanity's past, and sing of the future we might hope for. There is uncertainty in both directions. We cannot know for sure what has been, and we cannot say what the future will be. There is this, however:

Tomorrow will be brighter than today.

The Montréal Solstice Celebration will be on December 21st. Doors open at 6:30; Solstice starts at 7:00. Please arrive on time to avoid disrupting the more somber and serious parts of the program. RSVPs are appreciated so we can plan appropriately.
If you have never been to a Secular Solstice in this style before, you should expect alternating speeches and songs,...

@BionicD0LPH1N: there will be ~8 candles :)


I am currently looking for a system which will help me execute some of my massive backlog of ideas. By “ideas” I include my hundreds and hundreds of story outlines for films, along with a handful of finished screenplays, but also things like alternative income streams, day jobs, skills or abilities I’d like to learn or acquire (coding, traditional animation, dancing the Tango, conversational Italian), and a host of other projects.

Before I get to determining how to better pick which ideas I should pursue, I was wondering if there was any more I could do to optimize my current idea recording method. Some of this overlaps with the GTD concept of the "Someday" bucket. But what I don't like about that is that I'd very...

It started with me taking notes while playing RPGs, but turned into a daily journal.

 

Interesting! Is that because you find you're most creative while playing RPGs? How much detail is in those notes? How often do you find you pause the game to write one? (Reminds me of the Mitch Hedberg joke about thinking of a joke at night: he either needs to get up and write it down, or convince himself that the joke isn't that funny.)

How often do you text search for ideas? What seems to trigger revisiting an idea?

CstineSublime
I'm not familiar with that app, but could you go into more detail about how you use it for storing and capturing ideas? Do you note down an idea instantly, say when on the bus or at the dinner table? How much detail do you put in? How does it integrate with your to-do list or calendar or whatever productivity system, formal or informal, you have? An idea may not necessarily represent a commitment just yet, so how do you use this app to revisit ideas? How often do you revisit them? Do you organize or store your "ideas" notes differently from other notes?
Seth Herd
I second Obsidian. It's free, lightning fast, and local (you own your data, so you can't be held hostage to a subscription fee), and the Markdown format is simple and common enough that there will always be some way to use the system you've set up. There are more in-depth theories about how to actually organize your notes, but Obsidian can accommodate them in a variety of ways, almost however you want.
CstineSublime
Which theories have you found suit you best, and why? How do you organize your notes? And having captured your ideas in Obsidian, how do you go about revisiting them and ensuring that they don't remain captured but forgotten?

Many people who find value in the Sequences do something which looks to me like adopting a virtue called "align your map to the territory." I recently was thinking about experimental results, and it got me thinking about how we don't really know what the territory is, other than that it's the thing we look at to see if our maps are right. Everything we know is map. What we know consists of a variety of models that describe aspects of reality, and we have to treat them like reality to get anything done. It wasn't relevant to my post at the time, but it occurred to me that it doesn't really matter what reality is, because my values live at a higher level of abstraction with my sense...

Raemon
lol at the strong downvote and wondering if it is more objecting to the idea itself or more because Claude co-wrote it?
Tahp
I'm putting in my reaction to your original comment as I remember it, in case it provides useful data for you. Please do not search for subtext or take this as a request for any sort of response; I'm just giving data, at the risk of oversharing, because I wonder if my reaction is at all indicative of the people downvoting. I thought about downvoting because your comment seemed mean-spirited. I think the copypasta format and possibly the flippant use of an LLM made me defensive. I mostly decided I was mistaken about it being mean-spirited, because I don't think you would post a mean comment on a post like this, based on my limited-but-nonzero interaction with you. At that point, I either couldn't see what mixing epistemology in with the pale blue dot speech added to the discussion, or it didn't resonate with me, so I stopped thinking about it and left the comment alone.
Raemon

Nod. 

Fwiw I mostly just thought it was funny in a way that was sort of neutral on "is this a reasonable frame or not?". It was the first thing I thought of as soon as I read your post title.

(I think it's both true that in an important sense everything we care about is in the Map, and also true in an important sense that it's not. In the ways it was true, it felt like a legitimately poignant rewrite that helped me appreciate your post; insofar as it was false, it seemed hilarious (non-mean-spiritedly, just in a "it's funny that so many lines from the original remain reasonable sentences when you reframe it as about epistemology" way).)

Noosphere89
Yes, and conversely, if you had enough computing power, you could narrow down the set of models to 1 for even arbitrarily fine-grained predictions.

Sometimes two people are talking past each other, and I try to help them understand each other (with varying degrees of success).

It’s as if they are looking at the same object, but from different angles. Mostly they see the same thing – most of the words have shared meanings. But some key words and assumptions have a different meaning to them.

Often, I find that one person (call them A) has a perspective that’s easier for me to understand. It comes naturally. But B’s perspective is initially harder. So if I want to translate from B to A, I first need to understand B.

I remember a time when I sat listening to two people having a conversation, both getting increasingly agitated and repeating the same points without making...

Said Achmiz
Why is it necessary (or even relevant) to imagine anything like this? It seems like this part is wholly superfluous (at best!); remove it from the reasoning you report, and… you still have your answer, right? You write: This seems like a complete answer; no explanatory components are missing. As far as I can tell, the part about a “betrayal in [B’s] past … that this situation was now reminding her of” is, at best, a red herring—and at worst, a way to denigrate and dismiss a perspective which otherwise seems to be eminently reasonable, understandable, and (IMO) correct.

It seems to me this is an example of you and Kaj talking past each other. To you, B's perspective is "eminently reasonable" and needs no further explanation. To Kaj, B's perspective was a bit unusual, and to fully inhabit that perspective, Kaj wanted a bit more context to understand why B was holding that principle higher than other things (enjoying the social collaboration, the satisfaction of optimally solving a problem, etc.).